Tuesday, November 5, 2024

What is SelfQueryRetriever in Langchain

In Langchain, SelfQueryRetriever is a specialized retriever designed to make the process of retrieving relevant documents more dynamic and context-aware. Unlike traditional retrievers that rely solely on similarity search (e.g., vector search), the SelfQueryRetriever supports more sophisticated natural-language queries by combining natural language understanding with structured search capabilities.

Key Features of SelfQueryRetriever:

Natural Language Queries: It allows users to input complex, free-form questions or queries in natural language.

Dynamic Query Modification: It uses a language model (LLM) to modify or enhance the query dynamically based on the user input. This ensures that the query is refined to retrieve the most relevant results.

Structured Filters: It can also convert a user's question into structured filters that help narrow down the search more effectively. For example, it can apply specific criteria like filtering by date, category, or other metadata fields that are relevant to the search.

How SelfQueryRetriever Works:

Self-Querying: The retriever can automatically generate additional filters or modify the query to help retrieve more accurate or relevant results. It does this by analyzing the user query and applying specific transformations based on the context of the search.

LLM-Powered Refinement: A language model is used to understand the query and extract essential parameters that can guide the retrieval process. These parameters can be key-value pairs or specific instructions, enhancing the retrieval operation by filtering or adjusting the search criteria.

Difference from Other Retrievers:

Standard Retriever:


Relies on similarity search techniques (like vector search or keyword matching).

Simply matches the user's query to the stored documents and retrieves the most similar ones based on embeddings.

No dynamic query modification or structured filtering is involved.

SelfQueryRetriever:


More intelligent because it uses an LLM to interpret and enhance the user query.

It can apply structured filters based on the query (e.g., filter documents by date or category).

It dynamically refines the query using the LLM to ensure that the retrieval is both accurate and relevant.

Example Use Case:

Suppose you have a database of documents with metadata such as "author," "date," "category," etc. A user asks:


“Can you show me all network security articles written after 2020?”


A Standard Retriever would search for documents based on the similarity between the query and the document content (probably looking for the keywords “network security”).

A SelfQueryRetriever would use an LLM to break down the query into actionable parts:

Retrieve documents about network security.

Filter documents where the date is after 2020.

Return only articles matching both criteria.

This makes SelfQueryRetriever far more powerful in scenarios where specific, structured information needs to be extracted from large corpora of documents.
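To make this concrete, here is a rough sketch of the kind of structured query the LLM-backed query constructor produces internally, using Langchain's query-constructor intermediate representation. The attribute names ("category", "date") and the exact filter values are assumptions for this hypothetical document set; the real output depends on your metadata descriptions:

from langchain.chains.query_constructor.ir import (
    Comparator,
    Comparison,
    Operation,
    Operator,
    StructuredQuery,
)

# "network security articles written after 2020" roughly becomes:
structured_query = StructuredQuery(
    query="network security",  # free-text part, used for the similarity search
    filter=Operation(
        operator=Operator.AND,
        arguments=[
            Comparison(comparator=Comparator.EQ, attribute="category", value="network security"),
            Comparison(comparator=Comparator.GT, attribute="date", value="2020-12-31"),
        ],
    ),
)
print(structured_query)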


Sample Code:

Here’s a simple example of using SelfQueryRetriever in Langchain:



from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Define the attributes (metadata) of your documents
metadata_field_info = [
    AttributeInfo(name="author", description="The author of the document", type="string"),
    AttributeInfo(name="date", description="The publication date of the document", type="date"),
    AttributeInfo(name="category", description="The category of the document", type="string"),
]

# A short description of what the documents contain (used by the query constructor)
document_content_description = "Technical articles and security reports"

# Sample documents with metadata (replace with your own corpus)
docs = [
    Document(
        page_content="An overview of zero-trust network security architectures.",
        metadata={"author": "Alice", "date": "2022-03-10", "category": "network security"},
    ),
    Document(
        page_content="A beginner's guide to container orchestration.",
        metadata={"author": "Bob", "date": "2019-07-01", "category": "devops"},
    ),
]

# Initialize your vector store and LLM
embedding_model = OpenAIEmbeddings()
vector_store = FAISS.from_documents(docs, embedding_model)
llm = OpenAI()

# Create SelfQueryRetriever (the query constructor also requires the `lark` package)
self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,
)

# Use the retriever to answer a query
query = "Show me all security reports written by Alice after 2021."
retrieved_docs = self_query_retriever.get_relevant_documents(query)

for doc in retrieved_docs:
    print(doc)

When to Use SelfQueryRetriever:

When your data has a lot of structured information (like metadata) and you need to refine queries based on that structure.

For advanced retrieval scenarios where the user queries require dynamic, intelligent modification or filtering.

In scenarios where similarity search alone might not retrieve the most relevant documents, and you need additional filtering or query modifications.



Friday, November 1, 2024

Does mark down syntax help LLMs?

Markdown syntax like ###, **, and --- does not directly help GPT-4 or other LLMs understand the content better in terms of meaning or context. The models interpret the underlying plain text, so formatting elements such as bold text, headings, or dividers are not processed in a way that changes the model's actual understanding of the text.


However, Markdown can still be useful in prompts for several reasons:


Improved clarity for human readability: Markdown can make it easier for humans to read and structure their prompts or responses, especially in cases like multi-step instructions, lists, or key points. This improved readability might indirectly lead to better prompts, helping the user or developers focus on clarity when communicating with the model.


Separating sections: For complex inputs, Markdown can visually organize the information, making it clear which parts belong to certain instructions or queries. In a multi-part conversation with a model, this can help both the human and the AI keep track of different sections logically.


Implicit structure hints: While the LLM doesn't interpret ### as a heading per se, the repetition of certain patterns (like labeled sections) might help it pick up on the structure of the text, such as treating a section starting with ### Inputs as listing relevant inputs.


In summary, Markdown won't improve the model’s inherent understanding, but it can help make your prompts clearer, well-structured, and easier to follow, which can lead to more accurate outputs by guiding how you formulate your instructions.

Tuesday, October 29, 2024

What is Ray Framework?

Ray is an open-source framework designed to enable the development of scalable and distributed applications in Python. It provides a simple and flexible programming model for building distributed systems, making it easier to leverage the power of parallel and distributed computing. Some key features and capabilities of the Ray framework include:

Ray allows you to easily parallelize your Python code by executing tasks concurrently across multiple CPU cores or even across a cluster of machines. This enables faster execution and improved performance for computationally intensive tasks.

Ray provides a distributed execution model, allowing you to scale your applications beyond a single machine. It offers tools for distributed scheduling, fault tolerance, and resource management, making it easier to handle large-scale computations.

With Ray, you can define Python functions that can be executed remotely. This enables you to offload computation to different nodes in a cluster, distributing the workload and improving overall efficiency (a minimal example follows this list of features).

Ray provides high-level abstractions for distributed data processing, such as distributed data frames and distributed object stores. These features make it easier to work with large datasets and perform operations like filtering, aggregation, and transformation in a distributed manner.

Ray includes built-in support for reinforcement learning algorithms and distributed training. It provides a scalable execution environment for training and evaluating machine learning models, enabling efficient experimentation and faster training times.
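As a minimal sketch of the core task API (a toy function on a local Ray instance, not tied to any particular application):

import ray

ray.init()  # start Ray locally; on a cluster, pass the cluster address instead

@ray.remote
def square(x):
    # an ordinary Python function, executed remotely as a Ray task
    return x * x

# each call returns a future (ObjectRef) immediately, so tasks run in parallel
futures = [square.remote(i) for i in range(8)]

# block until all results are ready and collect them
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]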

1. Ray AI Runtime (AIR)

This open-source collection of Python libraries is designed specifically for ML engineers, data scientists, and researchers. It equips them with a unified and scalable toolkit for developing ML applications. The Ray AI Runtime consists of 5 core libraries:

Ray Data

Achieve scalability and flexibility in data loading and transformation across various stages, such as training, tuning, and prediction, regardless of the underlying framework.

Ray Train

Enables distributed model training across multiple nodes and cores, incorporating fault tolerance mechanisms that seamlessly integrate with widely used training libraries.

Ray Tune

Scale your hyperparameter tuning process to enhance model performance, ensuring optimal configurations are discovered (a small Tune example follows this list of libraries).

Ray Serve

Effortlessly deploy models for online inference with Ray's scalable and programmable serving capabilities. Optionally, leverage micro batching to further enhance performance.

Ray RLlib

Seamlessly integrate scalable distributed reinforcement learning workloads with other Ray AIR libraries, enabling efficient execution of reinforcement learning tasks.
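As one small illustration of the AIR libraries, here is a minimal Ray Tune sketch using the classic tune.run API (newer Ray versions expose the same idea through tune.Tuner); the objective function and search space are made up for the example:

from ray import tune

def objective(config):
    # toy objective: the closer "lr" is to 0.1, the smaller the score
    score = (config["lr"] - 0.1) ** 2
    tune.report(score=score)

analysis = tune.run(
    objective,
    config={"lr": tune.grid_search([0.01, 0.05, 0.1, 0.2])},
)

print(analysis.get_best_config(metric="score", mode="min"))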

references:

https://www.datacamp.com/tutorial/distributed-processing-using-ray-framework-in-python

PyMuPDF - read page by page and extract images

import pymupdf

def navigate_page_by_page_pympdf():
    # navigate the document page by page
    doc = pymupdf.open("deployment_guide.pdf")  # open a document
    out = open("output.txt", "wb")  # create a text output
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        print("Text read is ", text)
        # out.write(text)  # write text of page
        # out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()

def extract_images_pympdf():
    doc = pymupdf.open("deployment_guide.pdf")  # open a document

    for page_index in range(len(doc)):  # iterate over pdf pages
        page = doc[page_index]  # get the page
        image_list = page.get_images()

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)

        for image_index, img in enumerate(image_list, start=1):  # enumerate the image list
            xref = img[0]  # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref)  # create a Pixmap

            if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            pix.save("page_%s-image_%s.png" % (page_index, image_index))  # save the image as PNG
            pix = None  # free Pixmap resources

Monday, October 28, 2024

PyMuPDF - How to extract tables

import pymupdf

def extract_text_from_pdf(pdf_path):
    """Read table content only of all pages in the document.

    Chatbots typically have limitations on the amount of data that
    can be passed in (number of tokens).

    We therefore only extract information on the PDF's pages that are
    contained in tables.
    As we even know that the PDF actually contains ONE logical table
    that has been segmented for reporting purposes, our approach
    is the following:
    * The cell contents of each table row are joined into one string
      separated by ";".
    * If the table segment on the first page also has an external header row,
      join the column names separated by ";". Also ignore any subsequent
      table row that equals the header string. This deals with table
      header repeat situations.
    """
    # open document
    doc = pymupdf.open(pdf_path)

    text = ""  # we will return this string
    row_count = 0  # counts table rows
    header = ""  # overall table header: output this only once!

    # iterate over the pages
    for page in doc:
        # only read the table rows on each page, ignore other content
        tables = page.find_tables()  # a "TableFinder" object
        for table in tables:

            # on first page extract external column names if present
            if page.number == 0 and table.header.external:
                # build the overall table header string
                # technical note: incomplete / complex tables may have
                # "None" in some header cells. Just use empty string then.
                header = (
                    ";".join(
                        [
                            name if name is not None else ""
                            for name in table.header.names
                        ]
                    )
                    + "\n"
                )
                text += header
                row_count += 1  # increase row counter

            # output the table body
            for row in table.extract():  # iterate over the table rows

                # again replace any "None" in cells by an empty string
                row_text = (
                    ";".join([cell if cell is not None else "" for cell in row]) + "\n"
                )
                if row_text != header:  # omit duplicates of header row
                    text += row_text
                    row_count += 1  # increase row counter

    print(f"Loaded {row_count} table rows from file '{doc.name}'.\n")
    doc.close()  # close document
    return text
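A quick usage sketch (the file name is only an example):

table_text = extract_text_from_pdf("report.pdf")
print(table_text[:500])  # first rows of the ";"-separated table text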

 references:

https://python.plainenglish.io/why-pymupdf4llm-is-the-best-tool-for-extracting-data-from-pdfs-even-if-you-didnt-know-you-needed-7bff75313691

Saturday, October 26, 2024

What is PyMuPDF4LLM

PyMuPDF4LLM is built on top of the tried and tested PyMuPDF and uses that library behind the scenes to achieve the following:


Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents


Multi-Column Pages

The text extraction can handle document layouts with multiple columns, meaning that "newspaper"-type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.


Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The result is Markdown text with references to any images found in the document. The images are saved to the directory from which you run the Python script, and the Markdown references them with the correct Markdown image syntax.

Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:


import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.
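A small sketch of how these page chunks might be consumed, using only the fields listed above:

import pymupdf4llm

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

for chunk in chunks:
    # each chunk is a dictionary following the schema described above
    print(len(chunk["tables"]), "tables,", len(chunk["images"]), "images on this page")
    print(chunk["text"][:200])  # beginning of the page's Markdown text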

PyMuPDF4LLM also has direct support for LlamaIndex.

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")
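The resulting documents can be fed straight into a LlamaIndex pipeline, for example (a sketch assuming a recent LlamaIndex version, where VectorStoreIndex lives in llama_index.core):

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(llama_docs)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the document."))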

Wednesday, October 23, 2024

What is Meta AI interface

Like ChatGPT or Google Gemini, the Meta AI chat interface is powered by Llama 3. However, Meta AI is only available in the US, so you must use a VPN to connect to meta.ai if you are in Europe like me.

On meta.ai, you can either generate an image or have a chat. Both generative models are free to use.

What is incredible about meta.ai:

Token generation is extremely fast compared to Gemini or ChatGPT; it feels like massive infrastructure is running behind the scenes.

Image generation is included in the free plan: while I type the prompt, the image is generated based on what I am typing. It's real-time generation.

References

https://levelup.gitconnected.com/llama-3-metas-latest-ai-breakthrough-offers-power-and-open-access-166ebb94f794