Thursday, February 27, 2025

What are various types of Indexes in LlamaIndex

An Index is a data structure that allows us to quickly retrieve relevant context for a user query. For LlamaIndex, it's the core foundation for retrieval-augmented generation (RAG) use-cases.


At a high level, Indexes are built from Documents. They are used to build Query Engines and Chat Engines, which enable question answering and chat over your data.


Under the hood, Indexes store data in Node objects (which represent chunks of the original documents), and expose a Retriever interface that supports additional configuration and automation.


The most common index by far is the VectorStoreIndex.


What are the various indexes in LlamaIndex?


This guide describes how each index works.


Some terminology:


Node: Corresponds to a chunk of text from a Document. LlamaIndex takes in Document objects and internally parses/chunks them into Node objects.

Response Synthesis: Our module which synthesizes a response given the retrieved Nodes.


Summary Index (formerly List Index)#

The summary index simply stores Nodes as a sequential chain.


Querying#

During query time, if no other query parameters are specified, LlamaIndex simply loads all Nodes in the list into our Response Synthesis module.


The summary index offers numerous ways of querying, from an embedding-based query that fetches the top-k neighbors, to adding a keyword filter on top.
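A minimal sketch of building and querying a summary index (the data path and query are placeholders, and an LLM/embedding model is assumed to be configured):

from llama_index.core import SimpleDirectoryReader, SummaryIndex

documents = SimpleDirectoryReader("./data").load_data()
index = SummaryIndex.from_documents(documents)

# default querying loads all Nodes into response synthesis
query_engine = index.as_query_engine()
print(query_engine.query("Summarize this collection."))

# embedding-based querying fetches only the top-k neighbors instead
retriever = index.as_retriever(retriever_mode="embedding", similarity_top_k=3)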


Vector Store Index#


The vector store index stores each Node and a corresponding embedding in a Vector Store.


Querying#

Querying a vector store index involves fetching the top-k most similar Nodes, and passing those into our Response Synthesis module.
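A minimal sketch of the same flow for a vector store index (placeholder path and query):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)          # embeds each Node into the vector store

query_engine = index.as_query_engine(similarity_top_k=2)    # top-k similar Nodes -> response synthesis
print(query_engine.query("What topics does the author cover?"))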



Tree Index#

The tree index builds a hierarchical tree from a set of Nodes (which become leaf nodes in this tree).


Querying#

Querying a tree index involves traversing from root nodes down to leaf nodes. By default, (child_branch_factor=1), a query chooses one child node given a parent node. If child_branch_factor=2, a query chooses two child nodes per level.
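A minimal tree index sketch (placeholder path and query; treat the child_branch_factor kwarg as the query-time setting described above):

from llama_index.core import SimpleDirectoryReader, TreeIndex

documents = SimpleDirectoryReader("./data").load_data()
index = TreeIndex.from_documents(documents)                  # builds the hierarchical tree with an LLM

query_engine = index.as_query_engine(child_branch_factor=2)  # follow two children per level
print(query_engine.query("What is the document about?"))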


Keyword Table Index#

The keyword table index extracts keywords from each Node and builds a mapping from each keyword to the corresponding Nodes of that keyword.



Querying#

During query time, we extract relevant keywords from the query, and match those with pre-extracted Node keywords to fetch the corresponding Nodes. The extracted Nodes are passed to our Response Synthesis module.
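A minimal keyword table index sketch (placeholder path and query; keyword extraction uses the configured LLM):

from llama_index.core import KeywordTableIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = KeywordTableIndex.from_documents(documents)   # extracts keywords per Node

query_engine = index.as_query_engine()                # query keywords are matched against the table
print(query_engine.query("What is said about pricing?"))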




Property Graph Index#

The Property Graph Index works by first building a knowledge graph containing labelled nodes and relations. The construction of this graph is extremely customizable, ranging from letting the LLM extract whatever it wants, to extracting using a strict schema, to even implementing your own extraction modules.


Optionally, nodes can also be embedded for retrieval later.


You can also skip creation, and connect to an existing knowledge graph using an integration like Neo4j.


Querying#

Querying a Property Graph Index is also highly flexible. Retrieval works by using several sub-retrievers and combining results. By default, keyword + synonym expansion is used, as well as vector retrieval (if your graph was embedded), to retrieve relevant triples.


You can also choose to include the source text in addition to the retrieved triples (unavailable for graphs created outside of LlamaIndex).
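A minimal property graph sketch (placeholder path and query; extraction and retrieval settings are left at their defaults):

from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = PropertyGraphIndex.from_documents(documents, embed_kg_nodes=True)  # optionally embed graph nodes

retriever = index.as_retriever(include_text=True)     # return source text alongside the triples
nodes = retriever.retrieve("Who founded the company?")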





references:

https://docs.llamaindex.ai/en/stable/module_guides/indexing/

What is Notion and Notion Reader

Notion is an all-in-one workspace that blends note-taking, project management, and database functionality into a single tool. It's designed to help individuals and teams organize their work, ideas, and information.   

Here's a breakdown of Notion's key features:

Note-Taking:

Notion allows you to create rich text documents, including headings, lists, images, and embedded content.   

It's great for taking meeting notes, writing documentation, or brainstorming ideas.   

Project Management:

You can create task lists, Kanban boards, and calendars to manage projects and track progress.   

Notion allows you to assign tasks, set due dates, and collaborate with team members.   

Databases:

Notion's database functionality lets you create structured tables, lists, and galleries.   

You can use databases to organize information, such as customer lists, product catalogs, or research data.   

Wikis:

Notion is very good at creating internal wikis, and knowledge bases.   

Customization:

Notion is highly customizable, allowing you to create workspaces that fit your specific needs.   

You can use templates, create custom blocks, and design your own layouts.   

Collaboration:

Notion is designed for collaboration, allowing multiple users to work on the same pages and databases.   

It includes features like comments, mentions, and real-time editing.   

Notion Reader


"Notion Reader" isn't an official, standalone product or feature offered directly by Notion. However, the term can refer to a few different concepts:


Reading Content in Notion:

Essentially, any user viewing content within the Notion platform is acting as a "Notion reader." This includes viewing notes, documents, database entries, or any other type of content created within a Notion workspace.

Third-Party Apps or Extensions:

There might be third-party browser extensions or applications that enhance the reading experience within Notion. These could provide features like:

Improved formatting for reading long documents.

Text-to-speech functionality.

The ability to save Notion pages for offline reading.

  

Publicly Shared Notion Pages:

Notion allows users to publish pages to the web. When someone views these publicly shared pages, they are "reading" Notion content without necessarily being a Notion user themselves.   

API Usage:

Developers using the Notion API to extract and display content in external applications could also be considered creating a form of "Notion reader" experience.

In summary, Notion is a powerful workspace application, and "Notion reader" generally refers to the act of viewing content within the Notion platform or through related tools.

 


Wednesday, February 26, 2025

Difference Between Agentic RAG & Intelligent Chunking

Agentic RAG:

Involves autonomous agents that iteratively refine queries, retrievals, and responses.

Agents can re-query, chain multiple retrievals, or generate additional context before answering.

Example: If a document chunk is insufficient, an agent may decide to fetch related sections, summarize, or ask follow-up queries.

Intelligent Chunking (Hierarchical RAG):

Focuses on better preprocessing of documents by identifying logically linked sections before embedding.

Helps improve retrieval quality by maintaining document structure and relationships.

Example: Instead of blindly chunking by fixed tokens, the system understands sections like "Introduction" and "Methodology" belong together.

Can They Be Combined?

Yes! A hybrid approach would:

Use Intelligent Chunking to pre-process documents efficiently.

Employ Agentic RAG to refine retrieval dynamically during query time.

An example of how this hybrid could look in LlamaIndex is sketched below.
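A rough sketch (hypothetical paths; assumes the MarkdownNodeParser, QueryEngineTool and ReActAgent APIs and a configured LLM), where structure-aware chunking feeds an agent that can re-query the index as needed:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import ReActAgent

# Intelligent chunking: split along the document's own structure (headings/sections)
documents = SimpleDirectoryReader("./docs").load_data()
nodes = MarkdownNodeParser.from_defaults().get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# Agentic RAG: wrap the query engine as a tool the agent can call repeatedly
tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(similarity_top_k=4),
    name="docs_search",
    description="Searches the structured document chunks.",
)
agent = ReActAgent.from_tools([tool], verbose=True)
response = agent.chat("Compare the Introduction and Methodology sections.")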


references:

OpenAI

Monday, February 24, 2025

ML Speciality: Transformers - Part 1

It begins with RNNs and LSTMs.

They introduced a feedback loop for propagating information forward, which makes them useful for modelling sequential data.

When doing language translation, we use an encoder-decoder architecture. Below is the architecture for this.






Below is an explanation of the diagram, which covers the use case of translating an English sentence to Spanish.

English Input:

The process begins with an English sentence, such as "Hello, world."

Embedding Layer:

The words in the English sentence are converted into numerical representations called word embeddings. Each word is mapped to a vector that captures its semantic meaning.
This layer turns the text into a format the RNN can process.

RNN Layers (Recurrent Neural Network):

The embedded words are fed into one or more RNN layers.
Each RNN layer consists of RNN cells.

RNN Cell:

Each cell takes the current word embedding and the previous hidden state as input.
It processes this information and produces a new hidden state and an output.
The hidden state carries information from previous time steps, allowing the RNN to remember context.
The hidden state is passed along to the next RNN cell, and the output is passed to the next layer. This is why it is called a recurrent neural network.

Dense Layer (Fully Connected Layer):

The output from the final RNN layer is passed through a dense layer.
This layer transforms the RNN's output into a probability distribution over the Spanish vocabulary, showing the likelihood of each Spanish word being the correct translation.

Spanish Output:

The word with the highest probability is selected as the translated word.
The process is repeated until a complete Spanish sentence is generated, such as "Hola, mundo."

Key Concepts:

Word Embeddings: Numerical representations of words that capture their semantic meaning.
Hidden State: A vector that carries information from previous time steps, allowing the RNN to remember context.
RNN Cell: The basic building block of an RNN, which processes the current input and the previous hidden state.
Dense Layer: A fully connected layer that transforms the RNN's output into a probability distribution.

Important Notes:

This diagram represents a basic RNN architecture. More advanced architectures, such as LSTMs (Long Short-Term Memory) or GRUs (Gated Recurrent Units), are often used in practice to improve performance.
Attention mechanisms are also commonly used in modern translation systems to allow the model to focus on the most relevant parts of the input sentence.
Sequence-to-sequence models are also very common. They use an encoder and a decoder: the encoder encodes the English sentence, and the decoder decodes the Spanish sentence.
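Below is a minimal PyTorch sketch of the embedding -> RNN -> dense pipeline described above (toy vocabulary sizes and token ids are placeholders, and the model is untrained):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)       # word id -> embedding vector
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):
        embedded = self.embedding(src_ids)         # (batch, src_len, emb_dim)
        _, hidden = self.rnn(embedded)             # final hidden state summarises the sentence
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)               # dense layer over the Spanish vocab

    def forward(self, tgt_ids, hidden):
        output, hidden = self.rnn(self.embedding(tgt_ids), hidden)
        return self.fc(output), hidden             # softmax over these logits gives word probabilities

encoder, decoder = Encoder(vocab_size=1000), Decoder(vocab_size=1200)
context = encoder(torch.tensor([[1, 2, 3]]))       # e.g. "Hello , world"
logits, _ = decoder(torch.tensor([[0]]), context)  # start token -> scores for the first Spanish word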





Sunday, February 23, 2025

A quick comparison of PDF parsers

1. PyMuPDF (fitz):

Focus:

PyMuPDF is a Python binding for the MuPDF library, a lightweight PDF, XPS, and eBook viewer.

It offers low-level access to PDF content and structure, providing extensive control over text extraction, image extraction, and document manipulation.

Strengths:

High Performance: MuPDF is known for its speed and efficiency, making PyMuPDF suitable for processing large volumes of PDFs.

Low level Access: Great for fine grained control of PDF data.

Comprehensive Functionality: Provides a wide range of functions for manipulating PDF documents, including text extraction, image extraction, and document rendering.

Good for many layouts: Can handle many different PDF layouts.

Weaknesses:

Raw Text Output: By default, PyMuPDF extracts raw text, which may not preserve the document's logical structure.

Layout Reconstruction: Requires significant post-processing to reconstruct complex layouts, making it less ideal for LLM-ready output.

Requires coding: It can require a lot of code to extract the needed data, and to format that data.

Use Cases:

Suitable for applications that require high-performance PDF processing and low-level access to document content.

Useful for tasks such as batch PDF processing, image extraction, and document conversion.
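A minimal PyMuPDF sketch of this low-level access (assuming a local sample.pdf):

import fitz  # PyMuPDF

doc = fitz.open("sample.pdf")
for page in doc:
    raw_text = page.get_text("text")       # raw text; little layout preservation
    blocks = page.get_text("blocks")       # (x0, y0, x1, y1, text, block_no, block_type) tuples
    images = page.get_images(full=True)    # references to embedded images on the page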

2. PyMuPDF4LLM:


Focus:

PyMuPDF4LLM builds upon PyMuPDF, specifically designed to enhance PDF parsing for LLM applications.

It focuses on generating structured Markdown output that preserves the document's logical structure.

Strengths:

LLM-Optimized Output: Generates Markdown-formatted text, which is highly structured and easily digestible by LLMs.

Improved Layout Understanding: Offers enhanced layout understanding compared to raw PyMuPDF, leading to more accurate text extraction and structure preservation.

Ease of use: Makes it much easier to have LLM ready data than standard PyMuPDF.

Weaknesses:

Relatively newer compared to PyMuPDF, so its ecosystem may be evolving.

Still depends on the underlying MuPDF library, so limitations of MuPDF may apply.

Use Cases:

Ideal for RAG applications that require structured context for LLMs.

Suitable for tasks such as document summarization, question answering, and information retrieval.
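A minimal PyMuPDF4LLM sketch (assuming the pymupdf4llm package is installed and a local sample.pdf):

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("sample.pdf")   # Markdown with headings, lists and tables preserved
# The Markdown string can then be chunked and embedded directly in a RAG pipeline.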

3. LlamaParse:


Focus:

LlamaParse is designed with a strong emphasis on structured document parsing, particularly for LLM consumption. It prioritizes layout understanding and generating output that preserves the document's inherent structure.

It is designed to give clean markdown output.

Strengths:

Superior Layout Understanding: Excels at recognizing and preserving the logical structure of PDFs, including headings, lists, tables, and paragraphs.

Markdown output: The markdown output is very useful for LLMs.

LLM-Optimized Output: Generates output that is highly structured and easily digestible by LLMs, leading to improved downstream performance.

Robustness: Designed to handle complex layouts and challenging PDF structures.

Weaknesses:

May have a slightly higher processing overhead compared to simpler parsers.

Relatively newer, so its ecosystem and community support may be evolving.

Use Cases:

Ideal for applications where preserving document structure is crucial, such as legal documents, research papers, and technical manuals.

Excellent for RAG pipelines that require accurate and structured context for LLMs.
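A minimal LlamaParse sketch (assuming the llama-parse package and a LlamaCloud API key in the environment; the file path is a placeholder):

from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")     # "markdown" or "text"
documents = parser.load_data("sample.pdf")      # returns LlamaIndex Document objects
print(documents[0].text[:500])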

4. Unstructured:


Focus:

Unstructured is a versatile library that aims to extract text and metadata from various document types, including PDFs.

It offers a wide range of "elements" (text, tables, images) that can be extracted.

Strengths:

Broad Document Support: Handles a wide variety of file formats, not just PDFs.

Element Extraction: Provides detailed information about the extracted elements, including their type and position.

Flexibility: Offers various extraction strategies and configuration options.

Community and Ecosystem: Has a well-established community and a mature ecosystem.

Weaknesses:

Layout understanding may not be as robust as LlamaParse or PyMuPDF4LLM, especially for complex PDFs.

Output can require significant post-processing to create LLM ready outputs.

May require more configuration and fine-tuning for optimal performance.

Use Cases:

Suitable for applications that require extracting text and metadata from a diverse range of document types.

Useful for general-purpose document processing and data extraction.
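A minimal Unstructured sketch (assuming the unstructured package with its PDF extras installed):

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("sample.pdf")          # Title, NarrativeText, Table, ... elements
for el in elements[:5]:
    print(type(el).__name__, el.text[:80])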

5. Vectorize (in RAG Context):


Focus:

"Vectorize" in this context refers to the process of converting extracted text into vector embeddings for use in vector databases. It's a step in the RAG pipeline, not a standalone PDF parser.

The quality of the vector embeddings is highly dependent on the quality of the parsed text.

Strengths (When Combined with a Good Parser):

Enables semantic search and retrieval of relevant document chunks.

Allows LLMs to access information beyond simple keyword matching.

When used with a good parser, very precise results can be achieved.

Weaknesses (Dependence on Parser):

The quality of vector embeddings and RAG performance is highly dependent on the accuracy and structure of the parsed text.

If the PDF parser produces inaccurate or unstructured output, the vector embeddings will be less effective.

Vectorize is not a PDF parser; it is a method of converting data to vectors.

Use Cases:

Essential for RAG applications that require semantic search and retrieval.

Works best when combined with a robust PDF parser that preserves document structure.

Key Takeaways:


For raw, fast PDF processing, PyMuPDF is excellent.

For LLM-focused, structured Markdown output, PyMuPDF4LLM and LlamaParse are the top contenders, with LlamaParse often providing superior layout understanding.

Unstructured is great for general purpose document parsing.

Vectorize is not a parser, but a critical step in RAG pipelines, and is reliant on the quality of the parser used.

What are the main factors to check when evaluating PDF parser effectiveness?

Scanned PDFs (Image-Based PDFs):

Test: Include a page or a full PDF that is a scanned image, not actual text. This will evaluate the parser's ability to handle OCR (Optical Character Recognition) or its integration with OCR libraries.

Purpose: Many PDFs are created from scans, and this is a critical test for real-world scenarios.

PDFs with Different Font Types and Sizes:

Test: Use a PDF with a mix of fonts, font sizes, and styles (e.g., bold, italic, underlined).

Purpose: Assess how well the parser handles font variations, which can affect text extraction accuracy and layout reconstruction.

PDFs with Embedded Images and Graphics

Test: Include PDFs with complex embedded images, vector graphics, and annotations.

Purpose: Evaluate the parser's ability to extract image data, preserve image quality, and handle annotations.

PDFs with Complex Tables:

Test: Include tables with merged cells, nested tables, tables spanning multiple pages, and tables with complex formatting.

Purpose: Test the parser's robustness in handling challenging table structures.

PDFs with Form Fields:

Test: Include a PDF with fillable form fields (e.g., text boxes, checkboxes, radio buttons).

Purpose: Evaluate the parser's ability to extract form field data and preserve field structure.

PDFs with Bookmarks and Outlines:

Test: Include PDFs with well-defined bookmarks and outlines.

Purpose: Assess the parser's ability to extract and preserve the document's logical structure.

PDFs with Metadata:

Test: Include PDFs with embedded metadata (e.g., author, title, keywords).

Purpose: Evaluate the parser's ability to extract and preserve document metadata.

PDFs with Different Compression Techniques:

Test: Include PDFs with different image compression techniques (e.g., JPEG, JPEG2000).

Purpose: Evaluate how well the parser handles various compression methods.

PDFs with Security Restrictions:

Test: Include PDFs with password protection or other security restrictions.

Purpose: Assess the parser's ability to handle encrypted or restricted PDFs.

Handling of Special Characters and Unicode:

Test: Include a PDF with a wide range of special characters and Unicode symbols.

Purpose: Evaluate the parser's ability to handle international characters and special symbols accurately.

Testing for correct reading order of the text:

Test: Create a PDF with a deliberately jumbled text order.

Purpose: Verify that the parser can correctly reconstruct the intended reading order.

Testing for correct identification of headers and footers:

Test: create a PDF with headers and footers.

Purpose: Verify that the parser can correctly identify and extract header and footer information.

Testing for correct identification of page numbers:

Test: Create a PDF with different page number styles.

Purpose: Verify that the parser can correctly identify and extract page numbers.

Testing for correct identification of lists:

Test: Create a PDF with numbered and bulleted lists.

Purpose: verify that the parser can correctly identify and extract lists.

Evaluation Metrics:

Accuracy: How accurately the parser extracts text and data.

Layout Preservation: How well the parser preserves the original document's layout.

Speed: The time it takes for the parser to process a PDF.

Memory Usage: The amount of memory the parser consumes.

Robustness: The parser's ability to handle various PDF formats and complexities.

Error Handling: How well the parser handles errors and exceptions.

Completeness: How much of the information that is present in the PDF is actually extracted.
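A rough harness sketch for measuring a few of these metrics (speed and completeness) on a test PDF; the parser function, file name and expected snippets are hypothetical:

import time

def evaluate_parser(parse_fn, pdf_path, expected_snippets):
    start = time.perf_counter()
    text = parse_fn(pdf_path)                   # parse_fn returns the extracted text
    elapsed = time.perf_counter() - start
    found = sum(1 for s in expected_snippets if s in text)
    return {
        "seconds": round(elapsed, 3),
        "completeness": found / len(expected_snippets),
        "chars_extracted": len(text),
    }

# Example (hypothetical): compare parsers on the same complex-table test file.
# evaluate_parser(lambda p: pymupdf4llm.to_markdown(p), "tests/complex_tables.pdf", ["Q3 revenue", "Total"])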

By incorporating these tests and scenarios into your evaluation, you'll gain a more comprehensive understanding of the strengths and weaknesses of different PDF parsers.


Saturday, February 22, 2025

How PyMuPDF analyses various PDF formats

PyMuPDF, at its core, leverages the MuPDF library, which is a lightweight PDF, XPS, and eBook viewer. Therefore the way PyMuPDF reads various PDF layouts is deeply tied to the MuPDF rendering engine. Here's a general overview of the approach:   

Fundamental Principles:

PDF Structure Understanding:

PDFs are structured documents, and PyMuPDF/MuPDF excels at parsing this underlying structure. It analyzes the PDF's internal objects, which define the placement of text, images, and other elements.   

This involves navigating the PDF's object hierarchy, including pages, content streams, and other elements.

Text Extraction:

PyMuPDF can extract text in various ways, ranging from raw text extraction to more sophisticated methods that attempt to preserve layout.   

It analyzes the text positioning information within the PDF to determine the flow of text.

The page.get_text() method is very important, and it has various parameters to control the output.

Layout Analysis:

To handle different layouts, PyMuPDF analyzes the spatial relationships between text elements. This includes:

Identifying the coordinates of text blocks.   

Detecting columns and other layout structures.   

Understanding the flow of text across different regions of the page.

Rendering Engine:

MuPDF's rendering engine plays a crucial role in accurately interpreting the PDF's layout.

It handles the complexities of PDF rendering, including font handling, graphics rendering, and color management.

Key Aspects of Layout Handling:

Coordinate-Based Analysis:

PyMuPDF relies heavily on the coordinate information within the PDF to understand the layout.

It uses this information to determine the relative positions of text elements and to reconstruct the reading order.

Text Extraction Modes:

PyMuPDF provides different text extraction modes that allow you to control the level of layout preservation.   

This allows you to choose the appropriate mode for your specific needs, depending on the complexity of the PDF layout.

Handling Complex Layouts:

For complex layouts, such as those with multiple columns or tables, PyMuPDF's ability to analyze the spatial relationships between text elements is crucial.

It can identify the boundaries of columns and tables and extract the text in the correct order.   
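A small sketch of this coordinate-based approach: extracting text blocks with their bounding boxes and sorting them into reading order (assumes a local sample.pdf):

import fitz  # PyMuPDF

page = fitz.open("sample.pdf")[0]
blocks = page.get_text("blocks")                               # (x0, y0, x1, y1, text, block_no, block_type)
ordered = sorted(blocks, key=lambda b: (round(b[1]), b[0]))    # top-to-bottom, then left-to-right
for x0, y0, x1, y1, text, *_ in ordered:
    print(f"({x0:.0f}, {y0:.0f})", text.strip()[:60])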

PyMuPDF4LLM:

It is very important to note the existence of PyMuPDF4LLM. This library builds on top of PyMuPDF and is designed to make PDF parsing even better for use within LLM workflows. It has features that are designed to produce Markdown format, which helps LLMs process the data better.

Important Notes:

PDFs can vary significantly in their structure and complexity, which can make it challenging to accurately parse all layouts.

Scanned PDFs, where the text is embedded as images, require Optical Character Recognition (OCR) to extract the text. PyMuPDF has OCR capabilities that can be used.

While PyMuPDF is very capable, for particularly complex tables it might be necessary to augment the parsing with tools like Camelot or pdfplumber, which specialize in table extraction.

In essence, PyMuPDF's layout handling combines PDF structure understanding, coordinate-based analysis, and the capabilities of the MuPDF rendering engine. PyMuPDF4LLM is a tool built on top of the original library that is very powerful for LLM usage.   


Friday, February 21, 2025

What are the TeX and LaTeX document formatting systems?

 TeX and LaTeX are powerful typesetting systems widely used for creating high-quality documents, particularly in scientific and technical fields. Here's a breakdown of each:   


TeX (pronounced "tekh" or "tek"):


Core Engine: TeX is the underlying typesetting engine created by Donald Knuth. It's a programming language specifically designed for precise control over the layout and formatting of text and mathematical formulas.   

Low-Level Control: TeX provides very fine-grained control over every aspect of the document, from character spacing to page layout.

Focus on Typography: Knuth's primary goal was to create a system that produced beautiful and consistent typography.

Complexity: TeX can be quite complex to learn and use directly, as it involves writing low-level commands.

Usage: While TeX itself is still used, it's more common to use LaTeX, which provides a higher-level interface.

LaTeX (pronounced "LAY-tekh" or "LAH-tekh"):


Macro Package: LaTeX is a macro package built on top of TeX. It provides a set of higher-level commands and environments that simplify document creation.   

Simplified Syntax: LaTeX makes it easier to write documents by abstracting away many of the low-level details of TeX.

Document Structure: LaTeX encourages a structured approach to document creation, allowing you to define sections, subsections, figures, tables, and other elements using logical commands.   

Mathematical Typesetting: LaTeX is renowned for its excellent mathematical typesetting capabilities. It makes it easy to create complex mathematical formulas and equations.

Packages: LaTeX has a vast library of packages that extend its functionality, providing support for various document types, languages, and formatting options.   

Usage: LaTeX is widely used for:

Writing academic papers and theses

Creating scientific and technical reports   

Typesetting books and articles   

Generating presentations (using packages like Beamer)   

Creating complex mathematical documents
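For illustration, a minimal LaTeX document showing the structured commands and mathematical typesetting described above:

\documentclass{article}
\usepackage{amsmath}

\begin{document}

\section{Introduction}
LaTeX describes structure (sections, lists, figures) with logical commands
and typesets mathematics cleanly, for example:
\begin{equation}
  E = mc^2
\end{equation}

\end{document}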

Key Differences and Similarities:


TeX is the engine, LaTeX is the interface: Think of TeX as the engine of a car and LaTeX as the dashboard and controls.

Both are markup languages: They use plain text files with markup commands to describe the document's structure and formatting.

Both produce high-quality output: TeX and LaTeX documents are known for their professional appearance and excellent typography.   

LaTeX is easier to use: LaTeX simplifies the process of creating documents compared to using TeX directly.

Both are cross-platform: TeX and LaTeX can be used on various operating systems (Windows, macOS, Linux).   

In essence:


LaTeX provides a user-friendly way to harness the power of TeX. It's the preferred choice for most users who need to create professional-looking documents, especially those with complex mathematical or technical content.

references:

Gemini

What is Recursive Table Retriever and Recursive Node Retriever?

 Both Recursive Table Retrieval and Recursive Node Retrieval in LlamaIndex are techniques designed to efficiently retrieve information from structured or hierarchical data. They leverage the structural information to guide the retrieval process, making it more targeted and efficient.


1. Recursive Table Retrieval:


This technique is specifically designed for data stored in a tabular format, often within documents or web pages.  It's particularly useful when you have tables nested within other content or when you want to retrieve information that spans multiple tables.


Core Idea:  Recursive Table Retrieval recognizes that tables often have relationships to the surrounding text or other tables. It uses this information to guide the retrieval process.


How it Works:


Table Indexing: LlamaIndex creates an index of the tables in your data. This index can include information about the table structure, column headers, and cell content.

Initial Retrieval: When you issue a query, the retriever first searches for relevant tables based on the query. This might involve matching keywords in the query to table headers or cell content.

Contextual Retrieval: Once a relevant table is found, the retriever can then retrieve additional context related to that table. This might include:

The text surrounding the table.

Other tables that are related to the initial table (e.g., tables on the same page or in the same document).

Information from a higher level in the document hierarchy (e.g., section headings).

Recursive Exploration: The retriever can recursively explore related tables and context until it has gathered enough information to answer the query.

Example: Imagine you have a document with multiple tables about different aspects of a product. You ask a question that requires information from several of these tables. Recursive Table Retrieval would identify the relevant tables and then gather the necessary data from each to provide a comprehensive answer.


2. Recursive Node Retrieval:


This technique is more general and can be applied to any data that can be represented as a hierarchical structure (e.g., a tree, a nested list, or a document with sections and subsections).  It's a generalization of the Recursive Retriever concept.


Core Idea:  Recursive Node Retrieval uses the hierarchical structure of your data to guide the search. It starts at a higher level of the hierarchy and recursively drills down to more specific content only when necessary.


How it Works:


Hierarchical Indexing: LlamaIndex creates an index that reflects the hierarchical structure of your data. This index can include information about the parent-child relationships between nodes (chunks of text or data).

Top-Level Retrieval: When you issue a query, the retriever starts at the top level of the hierarchy (e.g., a summary document or a top-level node in a tree).

Relevance Check: It determines if the content at the current level is relevant to the query.

Recursive Drill-Down: If the content is relevant, the retriever recursively descends to the next level of the hierarchy, exploring the children of the current node.

Context Aggregation: It gathers all the relevant content it finds during the recursive search and returns it as the context for your LLM query.

Example: Imagine you have a book with chapters, sections, and paragraphs. You ask a question about a specific detail in one of the paragraphs. Recursive Node Retrieval would start by looking at the chapter titles, then the section headings within the relevant chapter, and finally retrieve the specific paragraph you need.

Key Differences and Similarities:

Data Type: Recursive Table Retrieval is specialized for tabular data, while Recursive Node Retrieval is more general and can be used with any hierarchical data.

Contextual Awareness: Both techniques are contextually aware. They use the structural information in the data to guide the retrieval process and gather related information.

Efficiency: Both aim to improve retrieval efficiency by avoiding exhaustive searches of the entire dataset. They focus on the most promising parts of the data based on the hierarchy.

In summary: Recursive Table Retrieval and Recursive Node Retrieval are powerful techniques in LlamaIndex for efficiently retrieving information from structured data.  They leverage the hierarchical relationships within the data to guide the search, making it more targeted and efficient.  The choice between them depends on whether your data is primarily tabular or has a more general hierarchical structure.

References:

Gemini 

What are Composable Objects in LlamaIndex Retrievers

Composable objects in LlamaIndex retrievers refer to the ability to combine and chain different retriever components together to create more complex and powerful retrieval pipelines.  It's a way to build custom retrieval strategies by composing simpler building blocks.

Here's a breakdown of the concept:

Core Idea: LlamaIndex allows you to treat retrievers and other related components (like node parsers, query engines, etc.) as composable objects. This means you can combine them in a flexible way to create custom retrieval workflows that are tailored to your specific data and needs.

How it Works:

Retriever Components: LlamaIndex provides various retriever components, including:

Retrievers: These are the core components that fetch relevant nodes (text chunks) based on a query (e.g., BM25Retriever, VectorStoreRetriever, KeywordRetriever, etc.).

Node Parsers: These components process and structure the retrieved nodes (e.g., splitting them into smaller chunks, adding metadata, etc.).   

Query Engines: These are higher-level components that combine retrievers with LLMs to perform question answering or other tasks.   

Composition: You can combine these components using various techniques:

Chaining: You can chain retrievers together, so the output of one retriever becomes the input for the next. For example, you might first use a KeywordRetriever to filter down the documents and then use a VectorStoreRetriever to find the most semantically similar nodes within those documents.

Fusion: You can combine the results of multiple retrievers using a fusion strategy (as with the SimpleFusionRetriever).

Custom Logic: You can define your own custom logic to combine or filter the retrieved nodes.   

Flexibility: This composability gives you great flexibility in designing your retrieval pipeline. You can create complex workflows that are optimized for your specific data and retrieval task.

Example (Conceptual):

Let's say you have a large collection of documents, and you want to retrieve information based on both keywords and semantic similarity.

You create a BM25Retriever for keyword search and a VectorStoreRetriever for semantic search.

You create a SimpleFusionRetriever to combine the results of the two retrievers.

You can further process the merged results using a NodeParser to split the retrieved nodes into smaller chunks or add metadata.

Finally, you can use a QueryEngine to combine the processed nodes with an LLM to answer a question based on the retrieved information.
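A small sketch of this kind of composition: a retriever, a node post-processor, and a query engine wired together (class names are from llama_index.core; the data path, query and cutoff value are assumptions):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

retriever = index.as_retriever(similarity_top_k=10)             # could also be a fusion retriever
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],  # drop weak matches
)
response = query_engine.query("What does the report say about costs?")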

Benefits:

Customization: You can create highly customized retrieval pipelines tailored to your specific needs.

Modularity: You can reuse and combine different retriever components to create complex workflows.   

Flexibility: You have great flexibility in how you combine and process retrieved information.

Improved Performance: By combining different retrieval strategies, you can often improve the overall retrieval performance.

In summary: Composable objects in LlamaIndex retrievers allow you to build complex retrieval pipelines by combining simpler building blocks. This modularity and flexibility enable you to create highly customized retrieval strategies that are optimized for your specific data and retrieval tasks. It's a key feature of LlamaIndex that allows for advanced retrieval workflows.


References:

Gemini 


What is AutoMerging Retriever?

The Auto Merging Retriever in LlamaIndex is designed to intelligently merge and manage retrieved context from different sources, particularly useful when dealing with hierarchical or interconnected data.  It aims to provide the most relevant and concise context to the LLM by automatically determining what information to include and how to combine it.   

Here's a breakdown of its functionality and core idea:

Core Idea:  The Auto Merging Retriever recognizes that simply concatenating all retrieved information might not be optimal for the LLM.  It might lead to redundant information, overly long prompts, or a loss of focus on the most important details.  The Auto Merging Retriever addresses this by intelligently merging and filtering retrieved context, aiming for conciseness and relevance.   

How it Works:

Retrieval from Multiple Sources: The Auto Merging Retriever typically works with multiple retrievers or data sources. This could involve retrieving from different parts of a document (e.g., sections, subsections), different documents in a collection, or even different types of data (e.g., text, tables).

Node Evaluation and Merging:  The retriever evaluates the relevance of individual retrieved "nodes" (chunks of text or data).  It might use a scoring mechanism (e.g., based on similarity to the query) to determine the importance of each node.

Automatic Merging Logic: The core of the Auto Merging Retriever is its logic for automatically merging and combining the retrieved nodes.  This can involve:   

Deduplication: Removing redundant or overlapping information.

Summarization: Condensing information from multiple nodes into a shorter summary.

Contextualization: Adding context to nodes to make them more understandable. This might involve including surrounding sentences or headings.   

Filtering: Excluding less relevant or unimportant nodes.

Hierarchical Merging: If the data has a hierarchical structure, the retriever can intelligently merge information from different levels of the hierarchy.   

Context Construction: The retriever constructs the final context for the LLM by combining the merged and filtered nodes.  It might use a combination of techniques to ensure that the context is coherent, concise, and focused on the query.

Example (Conceptual):

Imagine you have a long document with multiple sections about different aspects of a topic.

You ask a question about a specific detail within one of the sections.

The Auto Merging Retriever might retrieve nodes (text chunks) from that specific section and also retrieve relevant context from other sections that provide background information or related details.

It then merges these nodes, potentially summarizing some of the background information and focusing on the specific details related to your question.   

The final context provided to the LLM is a concise and focused summary of the relevant information, including the specific details you asked about and the necessary background context.

Benefits:

Concise Context: Avoids overwhelming the LLM with too much information.

Improved Relevance: Focuses the context on the most important details.

Reduced Redundancy: Eliminates overlapping or duplicate information.   

Better Performance: Can lead to more accurate and focused LLM responses.

Handles Complex Data: Works well with hierarchical or interconnected data.

When to use it:

Long documents: When you're working with long documents and want to provide the LLM with only the most relevant sections.

Complex data structures: When you have data organized in a hierarchical or interconnected manner.

Summarization tasks: When you want to provide the LLM with summaries of relevant information.

Multi-source retrieval: When you're retrieving from multiple sources and need to combine the results intelligently.   

In LlamaIndex, you would use the AutoMergingRetriever class and configure it with your retrievers and data sources.  The specific merging logic can be customized depending on your needs.  This retriever is particularly powerful when combined with other LlamaIndex features like SummaryIndex or tree-structured data.
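A condensed sketch following the pattern in the linked example (chunk sizes, paths and top-k values are assumptions):

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

documents = SimpleDirectoryReader("./data").load_data()

# build a hierarchy of chunks and keep all of them (parents and leaves) in the docstore
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# index only the leaves; the retriever merges retrieved leaves back into their parents
base_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
retriever = AutoMergingRetriever(base_index.as_retriever(similarity_top_k=6), storage_context)
merged_nodes = retriever.retrieve("What are the key risks discussed?")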


References:

https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/

Thursday, February 20, 2025

How does Simple Fusion Retriever in LlamaIndex work?

 The Simple Fusion Retriever in LlamaIndex combines the results of multiple retrievers to improve the overall retrieval performance. Its core idea is that different retrieval methods might capture different aspects of relevance, and by fusing their results, you can get a more comprehensive and accurate set of retrieved documents or nodes.   

Here's a breakdown of how it works and its core idea:

Core Idea: The Simple Fusion Retriever leverages the strengths of different retrieval methods by combining their outputs. It assumes that each retriever might find a subset of relevant documents, and by merging these subsets (and potentially re-ranking them), you can increase the chances of retrieving all the truly relevant information.   

How it Works:

Multiple Retrievers: You provide the SimpleFusionRetriever with a list of other retrievers. These can be any type of retriever available in LlamaIndex, such as BM25Retriever, VectorStoreRetriever, KeywordRetriever, etc.

Independent Retrieval: When you issue a query, each of the underlying retrievers independently retrieves the top-k documents or nodes according to its own criteria.

Fusion (Merging and Ranking): The SimpleFusionRetriever then combines the results from all the individual retrievers.  There are a couple of ways this fusion can happen:

Simple Union: The simplest approach is just to take the union of all the retrieved documents. This means all unique documents returned by at least one retriever become part of the combined set.

Ranked Fusion (More Common): A more sophisticated approach is to combine the results and then re-rank them based on some criteria. This might involve:

Score Aggregation: Each document gets a new score based on the scores it received from the individual retrievers. This can be a simple sum, a weighted sum, or a more complex function.

Re-ranking: The combined set of documents is then re-ranked based on these aggregated scores. This allows documents that were considered relevant by multiple retrievers to be ranked higher.   

Return Results: The SimpleFusionRetriever returns the top-k documents from the re-ranked set as the final retrieved context.

Example (Conceptual):

Let's say you have a query about "artificial intelligence in healthcare."

BM25 Retriever: Might find documents that contain the keywords "artificial intelligence," "healthcare," and related terms.

Vector Store Retriever: Might find documents that are semantically similar to the query, even if they don't contain the exact keywords.   

The Simple Fusion Retriever would combine the results from both retrievers. Documents that were highly ranked by both retrievers would likely be ranked even higher in the fused results, because they are relevant from both a keyword and semantic perspective.

Benefits:

Improved Recall: By combining results from multiple retrievers, you can increase the chances of retrieving all the relevant documents.

Better Relevance: Re-ranking based on aggregated scores can improve the overall relevance of the retrieved results.   

Flexibility: You can easily combine different types of retrievers to leverage their complementary strengths.

When to use it:

When you have multiple retrieval methods available: If you're using different types of retrievers (BM25, vector search, etc.), the Simple Fusion Retriever can be a good way to combine their results.

When you want to improve recall: If you're concerned about missing relevant documents, fusing results can help.

When you want to balance different aspects of relevance: Different retrieval methods might capture different aspects of relevance (e.g., keyword match vs. semantic similarity). Fusion allows you to combine these different perspectives.

In LlamaIndex, you would use the SimpleFusionRetriever class and pass it a list of your other retrievers. You can then use the SimpleFusionRetriever like any other retriever to fetch context for your LLM queries.
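A sketch of simple fusion in code; note that in current LlamaIndex releases this behaviour is exposed through the QueryFusionRetriever class, and the BM25 retriever lives in the separate llama-index-retrievers-bm25 package (paths, chunk size and query are placeholders):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

documents = SimpleDirectoryReader("./data").load_data()
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

vector_retriever = VectorStoreIndex(nodes).as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,                 # no extra query generation; just fuse the two result sets
    mode="reciprocal_rerank",      # rank-based score aggregation and re-ranking
)
results = fusion_retriever.retrieve("artificial intelligence in healthcare")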


What is BM25 Retriever in LlamaIndex ?

 The BM25 Retriever in LlamaIndex uses the BM25 (Best Matching 25) algorithm to retrieve relevant documents or text chunks (called "nodes" in LlamaIndex) in response to a query. It's a classic information retrieval algorithm that's effective for keyword-based search and often serves as a strong baseline. Here's how it works:   


Core Idea: BM25 goes beyond simple keyword matching. It considers not only whether a term appears in a document, but also how frequently it appears and how rare it is across the entire collection of documents. This helps to give higher scores to documents that contain important query terms more often, but also to down-weight documents that contain common words that don't contribute much to relevance.   


How it Works (Step-by-Step):


Indexing: The BM25 Retriever first creates an index of your documents or nodes. This index stores information about the terms that appear in each document, their frequencies, and some statistics about the overall corpus.   


Query Processing: When you provide a query, the retriever tokenizes it (breaks it down into individual words) and may perform some additional preprocessing (like stemming or stop word removal).   


Scoring: For each document in the collection, BM25 calculates a score that represents how relevant that document is to the query. This score is based on:   


Term Frequency (TF): How often each query term appears in the document. BM25 considers term frequency saturation, meaning that the score doesn't increase linearly with term frequency. After a certain point, additional occurrences of a term have diminishing returns.   

Inverse Document Frequency (IDF): How rare each query term is across the entire collection of documents. Rare terms are given more weight because they are more discriminative.

Document Length: BM25 normalizes for document length. Longer documents are more likely to contain more query terms simply because they are longer, not necessarily because they are more relevant.   

Ranking: The retriever ranks the documents based on their BM25 scores. Documents with higher scores are considered more relevant.   


Retrieval: The retriever returns the top-k documents (or nodes) with the highest BM25 scores as the retrieved context for your LLM query.


Key Parameters in LlamaIndex:


k1 and b: These are two important parameters in the BM25 algorithm that control term frequency saturation and length normalization, respectively. You can tune these parameters to optimize retrieval performance for your specific data.   

similarity_top_k: This parameter determines how many top-ranked documents or nodes are returned by the retriever.   

Benefits of BM25 Retriever:


Effective Keyword Search: BM25 is very good at finding documents that contain the query terms, even if the phrasing is slightly different.

Considers Term Importance: It takes into account both term frequency and inverse document frequency, giving higher scores to documents with important and rare terms.   

Fast Retrieval: BM25 is relatively fast, making it suitable for retrieving from large collections of documents.   

Strong Baseline: It often serves as a good baseline for comparison with more complex retrieval methods.

When to use it:


Keyword-based search: When you want to retrieve documents based on the presence of specific keywords.

Large document collections: When you need efficient retrieval from a large number of documents.

Hybrid retrieval: BM25 can be combined with other retrieval methods (like vector search) to improve overall retrieval performance.   

In LlamaIndex, you would use the BM25Retriever class to create a retriever that uses the BM25 algorithm. You can then use this retriever to fetch relevant context for your LLM queries.
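A minimal sketch (the retriever ships in the llama-index-retrievers-bm25 package; the data path and query are placeholders):

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

documents = SimpleDirectoryReader("./data").load_data()
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)
for result in retriever.retrieve("climate change policy"):
    print(result.score, result.node.text[:80])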


How does Recursive Retriever works in LlamaIndex?

The Recursive Retriever in LlamaIndex is designed to efficiently retrieve relevant context from a hierarchical data structure, especially useful for long documents or collections of documents organized in a tree-like manner.  It's particularly helpful when dealing with summaries or nested information.  Here's how it works:   


Core Idea: The Recursive Retriever leverages the hierarchical structure of your data to perform targeted searches. Instead of naively searching the entire dataset, it starts at a higher level (e.g., a summary document or a top-level node in a tree) and recursively drills down to more specific content only when necessary.   


How it Works (Step-by-Step):


Hierarchical Data Structure: You provide LlamaIndex with data organized hierarchically. This could be:


A document with sections, subsections, and paragraphs.

A collection of documents with summaries and sub-documents.

Any data that can be represented as a tree or nested structure.

Top-Level Retrieval: When you issue a query, the Recursive Retriever first searches at the highest level of the hierarchy.  For example, it might search the summaries of all documents or the top-level sections of a long document.   


Relevance Check: The retriever determines if the top-level content it retrieved is relevant to the query.  This could be done using similarity search (comparing query embeddings to summary embeddings), keyword matching, or other methods.


Recursive Drill-Down: If the top-level content is deemed relevant, the retriever recursively descends to the next level of the hierarchy.  For instance, if a document summary is relevant, it will then search the sub-documents associated with that summary.   


Repeat: Steps 3 and 4 are repeated until the retriever reaches the desired level of granularity or until it finds enough relevant context.  It keeps going down the tree as long as the content at the current level is relevant.


Context Aggregation: Finally, the retriever gathers all the relevant content it has found during the recursive search and returns it as the context for your LLM query.


Example (Conceptual):


Imagine you have a book with chapters, sections, and paragraphs.


You ask a question about a specific topic.

The Recursive Retriever first searches the chapter titles (top level).

It finds a chapter title that seems relevant.

It then searches the section headings within that chapter (next level).

It finds a section heading that's even more relevant.

It finally retrieves the paragraphs within that section (lowest level).

It returns those paragraphs as the context for your question.

Benefits of Recursive Retrieval:


Efficiency: Avoids searching the entire dataset, significantly speeding up retrieval, especially for large hierarchical data.

Relevance: Focuses the search on the most promising parts of the data, leading to more relevant context.   

Scalability: Works well with large datasets because the search is targeted and doesn't involve exhaustive scanning.   

Handles Hierarchical Data: Specifically designed for data with a tree-like structure, which is common in many real-world scenarios.

When to use it:


Long documents: When you have a single document with internal structure (chapters, sections, etc.).

Document collections: When you have multiple documents organized hierarchically (e.g., by topic, category, etc.).

Summarization tasks: When you want to use summaries at different levels of granularity.

Knowledge graphs: For traversing and retrieving information from knowledge graphs.

In LlamaIndex, you would use the RecursiveRetriever class and configure it with your hierarchical data structure.  You can then use it like any other retriever to fetch context for your LLM queries.
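A condensed sketch of the summary -> chunk pattern using IndexNode and RecursiveRetriever (the texts and ids are toy placeholders):

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# one top-level summary node per document, pointing at its detailed chunk
chunk = TextNode(text="...detailed paragraph about ocean circulation...", id_="chunk-1")
summary = IndexNode(text="Chapter 3 covers ocean circulation.", index_id="chunk-1")

top_index = VectorStoreIndex([summary])
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": top_index.as_retriever(similarity_top_k=1)},
    node_dict={"chunk-1": chunk},          # followed when the summary node is retrieved
)
nodes = retriever.retrieve("What drives ocean circulation?")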

Wednesday, February 19, 2025

What are some of the basic RAG techniques with LlamaIndex - Part 1 - Prompt Engineering

Prompt Engineering

If you're encountering failures related to the LLM, like hallucinations or poorly formatted outputs, then this should be one of the first things you try.


Customizing Prompts => Most of the prebuilt modules have prompts inside; these can be queried and viewed, and can also be updated as required.


from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

from IPython.display import Markdown, display

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(response_mode="tree_summarize")

# define prompt viewing function

def display_prompt_dict(prompts_dict):

    for k, p in prompts_dict.items():

        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"

        display(Markdown(text_md))

        print(p.get_template())

        display(Markdown("<br><br>"))


prompts_dict = query_engine.get_prompts()


# from response synthesiser 

prompts_dict = query_engine.response_synthesizer.get_prompts()

display_prompt_dict(prompts_dict)



query_engine = index.as_query_engine(response_mode="compact")

prompts_dict = query_engine.get_prompts()

display_prompt_dict(prompts_dict)


response = query_engine.query("What did the author do growing up?")

print(str(response))


Customizing the prompt can be done like below:

from llama_index.core import PromptTemplate


# reset

query_engine = index.as_query_engine(response_mode="tree_summarize")


# shakespeare!

new_summary_tmpl_str = (

    "Context information is below.\n"

    "---------------------\n"

    "{context_str}\n"

    "---------------------\n"

    "Given the context information and not prior knowledge, "

    "answer the query in the style of a Shakespeare play.\n"

    "Query: {query_str}\n"

    "Answer: "

)

new_summary_tmpl = PromptTemplate(new_summary_tmpl_str)


query_engine.update_prompts(

    {"response_synthesizer:summary_template": new_summary_tmpl}

)



Advanced Prompts => 


Partial formatting

Prompt template variable mappings

Prompt function mappings



Partial formatting (partial_format) allows you to partially format a prompt, filling in some variables while leaving others to be filled in later.


This is a nice convenience function so you don't have to maintain all the required prompt variables all the way down to format, you can partially format as they come in.


This will create a copy of the prompt template.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Please write the answer in the style of {tone_name}

Query: {query_str}

Answer: \

"""


prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)



2. Prompt Template Variable Mappings

Template var mappings allow you to specify a mapping from the "expected" prompt keys (e.g. context_str and query_str for response synthesis), with the keys actually in your template.


This allows you re-use your existing string templates without having to annoyingly change out the template variables.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{my_context}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {my_query}

Answer: \

"""


template_var_mappings = {"context_str": "my_context", "query_str": "my_query"}


prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, template_var_mappings=template_var_mappings

)


Prompt Function Mappings


You can also pass in functions as template variables instead of fixed values.


This allows you to dynamically inject certain values, dependent on other values, during query-time.



qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {query_str}

Answer: \

"""

def format_context_fn(**kwargs):

    # format context with bullet points

    context_list = kwargs["context_str"].split("\n\n")

    fmtted_context = "\n\n".join([f"- {c}" for c in context_list])

    return fmtted_context



prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, function_mappings={"context_str": format_context_fn}

)


references:

https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/

How to Perform structured retrieval for large number of documents?



A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn't do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, you can use structured information to ensure those two PDFs get returned, rather than relying on raw embedding similarity over chunks.


Key Techniques#

There’s a few ways of performing more structured tagging/retrieval for production-quality RAG systems, each with their own pros/cons.


1. Metadata Filters + Auto Retrieval: Tag each document with metadata and then store it in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string.


Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also, tags represent keyword search at the document level and don't allow for semantic lookups.
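A small sketch of technique 1 at query time, filtering by metadata directly (assumes an existing VectorStoreIndex named index built from tagged documents; the auto-retrieval variant would have the LLM infer these filters, and the field names here are hypothetical):

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[ExactMatchFilter(key="doc_type", value="10-K")])
retriever = index.as_retriever(similarity_top_k=3, filters=filters)
nodes = retriever.retrieve("What were the main risk factors?")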


2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval: Embed document summaries and map them to chunks per document. Fetch at the document level first, before the chunk level.


Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive. 

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/


Metadata Replacement + Node Sentence Window based retrieval for RAG

We use the SentenceWindowNodeParser to parse documents into single sentences per node. Each node also contains a "window" with the sentences on either side of the node sentence.


Then, after retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences, using the MetadataReplacementPostProcessor.


This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.


By default, the sentence window is 5 sentences on either side of the original sentence.



The whole process can be split into the following steps.



Step 1: Load Data, Build the Index

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader(

    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]

).load_data()



Extract Nodes


We extract the set of nodes that will be stored in the VectorStoreIndex. This includes both the nodes produced by the sentence window parser, as well as the "base" nodes extracted using the standard parser.
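The node_parser and text_splitter used below are not defined in this excerpt. A minimal setup consistent with the sentence-window approach (the window size and metadata keys are illustrative choices) might look like:

from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser

# one sentence per node, with surrounding sentences stored in the node's metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences of context on either side; adjust as needed
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# "base" parser for the comparison index, using default chunk sizes
text_splitter = SentenceSplitter()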



nodes = node_parser.get_nodes_from_documents(documents)

base_nodes = text_splitter.get_nodes_from_documents(documents)


Build the Indexes

We build both the sentence index, as well as the "base" index (with default chunk sizes).

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

base_index = VectorStoreIndex(base_nodes)


Step 2: Querying

With MetadataReplacementPostProcessor


We now use the MetadataReplacementPostProcessor to replace the sentence in each node with its surrounding context.


from llama_index.core.postprocessor import MetadataReplacementPostProcessor


query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)

window_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(window_response)


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.


window = window_response.source_nodes[0].node.metadata["window"]

sentence = window_response.source_nodes[0].node.metadata["original_text"]


print(f"Window: {window}")

print("------------------")

print(f"Original Sentence: {sentence}")




Contrast with normal VectorStoreIndex


query_engine = base_index.as_query_engine(similarity_top_k=2)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index.


query_engine = base_index.as_query_engine(similarity_top_k=5)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Step 3: Analysis


So the SentenceWindowNodeParser + MetadataReplacementNodePostProcessor combo is the clear winner here.


Embeddings at a sentence level seem to capture more fine-grained details, like the word AMOC.


We can also compare the retrieved chunks for each index!


for source_node in window_response.source_nodes:

    print(source_node.node.metadata["original_text"])

    print("--------")


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!


Let's try to dissect why the naive vector index failed.


for node in vector_response.source_nodes:

    print("AMOC mentioned?", "AMOC" in node.node.text)

    print("--------")


So source node at index [2] mentions AMOC, but what did this text actually look like?

print(vector_response.source_nodes[2].node.text)




Step 4: Evaluation 

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.


We define/load an eval benchmark dataset and then run different evaluations over it.


WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.



from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset


from llama_index.llms.openai import OpenAI

import nest_asyncio

import random


nest_asyncio.apply()


num_nodes_eval = 30

# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)

sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)

# NOTE: run this if the dataset isn't already saved

# generate questions from the largest chunks (1024)

dataset_generator = DatasetGenerator(

    sample_eval_nodes,

    llm=OpenAI(model="gpt-4"),

    show_progress=True,

    num_questions_per_chunk=2,

)


eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()


eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")


eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")


import asyncio

import nest_asyncio


nest_asyncio.apply()


from llama_index.core.evaluation import (

    CorrectnessEvaluator,

    SemanticSimilarityEvaluator,

    RelevancyEvaluator,

    FaithfulnessEvaluator,

    PairwiseComparisonEvaluator,

)



from collections import defaultdict

import pandas as pd


# NOTE: can uncomment other evaluators

evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_s = SemanticSimilarityEvaluator()

evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))

# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))


from llama_index.core.evaluation.eval_utils import (

    get_responses,

    get_results_df,

)

from llama_index.core.evaluation import BatchEvalRunner


max_samples = 30


eval_qs = eval_dataset.questions

ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]


# resetup base query engine and sentence window query engine

# base query engine

base_query_engine = base_index.as_query_engine(similarity_top_k=2)

# sentence window query engine

query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)


import numpy as np


base_pred_responses = get_responses(

    eval_qs[:max_samples], base_query_engine, show_progress=True

)

pred_responses = get_responses(

    eval_qs[:max_samples], query_engine, show_progress=True

)


pred_response_strs = [str(p) for p in pred_responses]

base_pred_response_strs = [str(p) for p in base_pred_responses]


evaluator_dict = {

    "correctness": evaluator_c,

    "faithfulness": evaluator_f,

    "relevancy": evaluator_r,

    "semantic_similarity": evaluator_s,

}

batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)


eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


base_eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=base_pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


results_df = get_results_df(

    [eval_results, base_eval_results],

    ["Sentence Window Retriever", "Base Retriever"],

    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],

)

display(results_df)




References:

https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/

What is Document Summary Index in LlamaIndex?

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.


Retrieval can be performed through the LLM or embeddings (which is a TODO). We first select the relevant documents to the query based on their summaries; all nodes corresponding to the selected documents are then retrieved.


The steps involved are as follows.


Step 1: Load Datasets

Load Wikipedia pages on different cities
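The wiki_titles list and the data download are not shown in this excerpt. A minimal setup (the city names and the Wikipedia API call are illustrative assumptions) could be:

from pathlib import Path

import requests

from llama_index.core import SimpleDirectoryReader

# illustrative set of cities; any Wikipedia titles will do
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

Path("data").mkdir(exist_ok=True)
for title in wiki_titles:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    Path(f"data/{title}.txt").write_text(page["extract"])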


city_docs = []

for wiki_title in wiki_titles:

    docs = SimpleDirectoryReader(

        input_files=[f"data/{wiki_title}.txt"]

    ).load_data()

    docs[0].doc_id = wiki_title

    city_docs.extend(docs)


Step 2: Build Document Summary Index 

There are two ways of building the index:


a. default mode of building the document summary index

b. customizing the summary query


from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
splitter = SentenceSplitter(chunk_size=1024)

# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    llm=chatgpt,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)


doc_summary_index.get_document_summary("Boston")

doc_summary_index.storage_context.persist("index")



from llama_index.core import load_index_from_storage

from llama_index.core import StorageContext


# rebuild storage context

storage_context = StorageContext.from_defaults(persist_dir="index")

doc_summary_index = load_index_from_storage(storage_context)


Step 3: Performing retrieval from the Document Summary Index
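The retrieval code itself is not included in this excerpt. A minimal sketch using the high-level query engine (the question is just an example; LlamaIndex also exposes lower-level LLM-based and embedding-based retrievers over the summaries):

query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)
response = query_engine.query("What are the sports teams in Toronto?")
print(response)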


References:

https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/

Tuesday, February 18, 2025

What is PandasQueryEngine?

PandasQueryEngine converts natural language into Pandas Python code using LLMs.

The input to the PandasQueryEngine is a Pandas dataframe, and the output is a response. The LLM infers dataframe operations to perform in order to retrieve the result.

Let's start on a Toy DataFrame

Here let's load a very simple dataframe containing city and population pairs, and run the PandasQueryEngine on it.

By setting verbose=True we can see the intermediate generated instructions.

import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

# Test on some sample data
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)
query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query(

    "What is the city with the highest population?",

)

We can also take the step of using an LLM to synthesize a response.

query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)

response = query_engine.query(

    "What is the city with the highest population? Give both the city and population",

)

print(str(response))

Analyzing the Titanic DataSet 

from IPython.display import Markdown, display

df = pd.read_csv("./titanic_train.csv")
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query(
    "What is the correlation between survival and age?",
)
display(Markdown(f"<b>{response}</b>"))

print(response.metadata["pandas_instruction_str"])

References:

https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/


What is the Camelot library for PDF extraction?

Camelot is a Python library that makes it easy to extract tables from PDF files. It's particularly useful for PDFs where the tables are not easily selectable or copyable (e.g., PDFs with complex layouts); note that it works on text-based PDFs rather than scanned images. Camelot uses a combination of image processing and text analysis to identify and extract table data.

Here's a breakdown of what Camelot does and why it's helpful:

Key Features and Benefits:

Table Detection: Camelot can automatically detect tables within a PDF, even if they aren't marked up as tables in the PDF's internal structure.   

Table Extraction: Once tables are detected, Camelot extracts the data from them and provides it in a structured format (like a Pandas DataFrame).   

Handles Different Table Types: It can handle various table formats, including tables with borders, tables without borders, and tables with complex layouts.   

Output to Pandas DataFrames: The extracted table data is typically returned as a Pandas DataFrame, making it easy to further process and analyze the data in Python.   

Command-Line Interface: Camelot also comes with a command-line interface, which can be useful for quick table extraction tasks.   

How it Works (Simplified):


Image Processing: Camelot often uses image processing techniques to identify the boundaries of tables within the PDF. This is especially helpful for PDFs where the tables aren't readily discernible from the underlying PDF structure.   

Text Analysis: It analyzes the text content within the identified table regions to reconstruct the table structure and extract the data.   

When to Use Camelot:


PDFs with Non-Selectable Tables: If you're working with PDFs where you can't easily select or copy the table data, Camelot is likely the right tool.

Complex Table Layouts: When tables have complex formatting, borders, or spanning cells that make standard PDF text extraction difficult, Camelot can help.   

Automating Table Extraction: If you need to extract tables from many PDFs programmatically, Camelot provides a convenient way to do this.
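Putting this together, a minimal usage sketch (assuming the camelot-py package is installed and "report.pdf" is a text-based PDF):

import camelot

# detect and extract tables from the first two pages
tables = camelot.read_pdf("report.pdf", pages="1-2", flavor="lattice")

print(tables.n)                         # number of tables found
df = tables[0].df                       # first table as a pandas DataFrame
tables.export("tables.csv", f="csv")    # or write all detected tables to disk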

Limitations:


Scanned PDFs: Camelot primarily works with text-based PDFs. It does not have built-in OCR (Optical Character Recognition) capabilities. If your PDF is a scanned image, you'll need to use an OCR library (like Tesseract) first to convert the image to text before you can use Camelot.

Accuracy: While Camelot is good at table detection and extraction, its accuracy can vary depending on the complexity of the PDF and the tables. You might need to adjust some parameters or do some manual cleanup in some cases.



In summary: Camelot is a valuable library for extracting table data from PDFs, particularly when the tables are difficult to extract using other methods.  It combines image processing and text analysis to identify and extract table data, providing it in a structured format that can be easily used in Python.  Keep in mind its limitations with scanned PDFs and the potential for some inaccuracies.


References:

Gemini

What are some of the advanced techniques for building production-grade RAG?

 Decoupling chunks used for retrieval vs. chunks used for synthesis

Structured Retrieval for Larger Document Sets

Dynamically Retrieve Chunks Depending on your Task

Optimize context embeddings


Decoupling chunks used for retrieval vs. chunks used for synthesis

Key Techniques#

There are two main ways to take advantage of this idea:

1. Embed a document summary, which links to chunks associated with the document.

This can help retrieve relevant documents at a high-level before retrieving chunks vs. retrieving chunks directly (that might be in irrelevant documents).



2. Embed a sentence, which then links to a window around the sentence.

This allows for finer-grained retrieval of relevant context (embedding giant chunks leads to “lost in the middle” problems), but also ensures enough context for LLM synthesis.


Structured Retrieval for Larger Document Sets

A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn’t do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query, you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, structured information can ensure those two PDFs are returned rather than relying on raw embedding similarity over chunks.

Key Techniques#

1. Metadata Filters + Auto Retrieval: Tag each document with metadata and then store in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string.

Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also tags represent keyword search at the document-level, doesn’t allow for semantic lookups.

2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval: Embed document summaries and map them to chunks per document. Fetch at the document level first, before the chunk level.

Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive.

Dynamically Retrieve Chunks Depending on your Task

RAG isn't just about question-answering about specific facts, which top-k similarity is optimized for. There can be a broad range of queries that a user might ask. Queries that are handled by naive RAG stacks include ones that ask about specific facts e.g. "Tell me about the D&I initiatives for this company in 2023" or "What did the narrator do during his time at Google". But queries can also include summarization e.g. "Can you give me a high-level overview of this document", or comparisons "Can you compare/contrast X and Y". All of these use cases may require different retrieval techniques.

LlamaIndex provides some core abstractions to help you do task-specific retrieval. This includes our router module and data agent module, as well as advanced query engine modules and modules that join structured and unstructured data.

You can use these modules to do joint question-answering and summarization, or even combine structured queries with unstructured queries.
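For example, a router can pick between a summarization engine and a top-k fact-lookup engine per query. A minimal sketch (summary_index and vector_index are assumed to exist already):

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for high-level summaries of the documents.",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=2),
    description="Useful for answering specific factual questions.",
)

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)
response = router_engine.query("Give me a high-level overview of this document.")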

Optimize Context Embeddings

This is related to the motivation described above in "decoupling chunks used for retrieval vs. synthesis". We want to make sure that the embeddings are optimized for better retrieval over your specific data corpus. Pre-trained models may not capture the salient properties of the data relevant to your use case.

Key Techniques#

Beyond some of the techniques listed above, we can also try finetuning the embedding model. We can actually do this over an unstructured text corpus, in a label-free way.
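One way to do this in LlamaIndex is the finetuning module, which can synthesize question/chunk pairs from your own nodes and finetune a local embedding model on them. A rough sketch (module paths, the model ID, and the train_nodes variable are assumptions based on the llama-index-finetuning package; check the current docs before running):

from llama_index.finetuning import (
    SentenceTransformersFinetuneEngine,
    generate_qa_embedding_pairs,
)
from llama_index.llms.openai import OpenAI

# synthesize (question, chunk) pairs from your own nodes -- label-free supervision
train_dataset = generate_qa_embedding_pairs(
    nodes=train_nodes,  # assumed: nodes parsed from your corpus
    llm=OpenAI(model="gpt-3.5-turbo"),
)

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",        # base embedding model (assumption)
    model_output_path="finetuned_model",
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()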

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/

Monday, February 17, 2025

When using PyMuPDF4LLM, LlamaIndex is one of the output options; what are the advantages of this?

When parsing a PDF and getting the result as a LlamaIndex Document, the primary advantage is the ability to seamlessly integrate the extracted information with other data sources and readily query it using a large language model (LLM) within the LlamaIndex framework. This allows for richer, more contextual responses and analysis compared to simply extracting raw text from a PDF alone; essentially, it enables you to build sophisticated knowledge-based applications by combining data from various sources, including complex PDFs, in a unified way.

Key benefits:

Contextual Understanding:

LlamaIndex can interpret the extracted PDF data within the broader context of other related information, leading to more accurate and relevant responses when querying. 

Multi-Source Querying:

You can easily query across multiple documents, including the parsed PDF, without needing separate data processing pipelines for each source. 

Advanced Parsing with LlamaParse:

LlamaIndex provides a dedicated "LlamaParse" tool specifically designed for complex PDF parsing, including tables and figures, which can be directly integrated into your workflow. 

RAG Applications:

By representing PDF data as LlamaIndex documents, you can readily build "Retrieval Augmented Generation" (RAG) applications that can retrieve relevant information from your PDF collection based on user queries. 

references:

Gemini 



Sunday, February 16, 2025

What are the main features of PyMuPDF4LLM?

PyMuPDF4LLM is built on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents

Multi-Column Pages

The text extraction can handle document layouts with multiple columns, meaning that “newspaper” type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will be Markdown text with references to any images found in the document. The images will be saved to the location from where you run the Python script, and the Markdown will reference them with the correct Markdown syntax for images.


Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.
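For instance, each chunk can be inspected like this (a small sketch, reusing the same input.pdf):

import pymupdf4llm

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

for chunk in chunks:
    # keys follow the schema listed above
    print(len(chunk["tables"]), "tables,", len(chunk["images"]), "images on this page")
    print(chunk["text"][:200])  # first 200 characters of the page's Markdown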


LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")


With these three lines of code you will receive LlamaIndex Document objects from the PDF file input, ready for use with your LLM application!



What is the Test-Time Scaling technique?

Test-Time Scaling (TTS) is a technique used to improve the performance of Large Language Models (LLMs) during inference (i.e., when the model is used to generate text or make predictions, not during training).  It works by adjusting the model's output probabilities based on the observed distribution of tokens in the generated text.   

Here's a breakdown of how it works:

Standard LLM Inference:  Typically, LLMs generate text by sampling from the probability distribution over the vocabulary at each step.  The model predicts the probability of each possible next token, and then a sampling strategy (e.g., greedy decoding, beam search, temperature sampling) is used to select the next token.   

The Problem:  LLMs can sometimes produce outputs that are repetitive, generic, or lack diversity.  This is partly because the model's probability distribution might be overconfident, assigning high probabilities to a small set of tokens and low probabilities to many others.   

Test-Time Scaling: TTS addresses this issue by introducing a scaling factor to the model's output probabilities.  This scaling factor is typically applied to the logits (the pre-softmax outputs of the model).

How Scaling Works: The scaling factor is usually less than 1.  When the logits are scaled down, the probability distribution becomes "flatter" or less peaked. This has the effect of:

Increasing the probability of less frequent tokens: This helps to reduce repetition and encourages the model to explore a wider range of vocabulary.

Reducing the probability of highly frequent tokens: This can help to prevent the model from getting stuck in repetitive loops or generating overly generic text.   

Adaptive Scaling (Often Used):  In many implementations, the scaling factor is adaptive.  It's adjusted based on the characteristics of the generated text so far.  For example, if the generated text is becoming repetitive, the scaling factor might be decreased further to increase diversity.  Conversely, if the text is becoming too random or incoherent, the scaling factor might be increased to make the distribution more peaked.
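As a toy illustration of the scaling mechanics described above (NumPy only, purely illustrative; real implementations apply this inside the decoding loop):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])  # pre-softmax scores for four candidate tokens

standard = softmax(logits)       # peaked distribution: the top token dominates
scaled = softmax(logits * 0.5)   # scaling factor < 1 flattens the distribution

print(standard.round(3))
print(scaled.round(3))           # less frequent tokens now receive noticeably more probability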

Benefits of TTS:

Improved Text Quality: TTS can lead to more diverse, creative, and less repetitive text generation.

Better Performance on Downstream Tasks: For tasks like machine translation or text summarization, TTS can improve the accuracy and fluency of the generated output.

In summary: TTS is a post-processing technique applied during inference. It adjusts the LLM's output probabilities to encourage more diverse and less repetitive text generation.  By scaling down the logits, the probability distribution is flattened, making it more likely for the model to choose less frequent tokens and avoid getting stuck in repetitive loops.  Adaptive scaling makes the process even more effective by dynamically adjusting the scaling factor based on the generated text.

references:

https://www.marktechpost.com/2025/02/13/can-1b-llm-surpass-405b-llm-optimizing-computation-for-small-llms-to-outperform-larger-models/


 


Saturday, February 15, 2025

What is LLM-as-a-judge and how does it compare to RAGAS?

The idea behind LLM-is-a-judge is simple – provide an LLM with the output of your system and the ground truth answer, and ask it to score the output based on some criteria.
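A minimal sketch of such a judge (the prompt wording, criteria, and 1-5 scale are illustrative assumptions, not the prompt from the referenced post; complete_fn stands in for whatever LLM call you use):

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Ground-truth answer: {reference}
System answer: {answer}

Score the system answer for correctness on a 1-5 scale, where 5 means it fully
matches the ground truth and 1 means it is wrong or unrelated.
Reply with the score followed by a one-sentence justification."""

def judge(question, reference, answer, complete_fn):
    # complete_fn: any callable that sends a prompt to an LLM and returns its text
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    return complete_fn(prompt)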

The challenge is to get the judge to score according to domain-specific and problem-specific standards.

in other words, we needed to evaluate the evaluators!

First, we ran a sanity test – we used our system to generate answers based on ground truth context, and scored them using the judges.

This test confirmed that both judges behaved as expected: the answers, which were based on the actual ground truth context, scored high – both in absolute terms and in relation to the scores of running the system including the retrieval phase on the same questions. 

Next, we performed an evaluation of the correctness score by comparing it to the correctness score generated by human domain experts.

Our main focus was investigating the correlation between our various LLM-as-a-judge tools to the human-labeled examples, looking at trends rather than the absolute score values.

This method helped us deal with another risk – human experts can have a subjective perception of absolute score numbers. So instead of looking at the exact score they assigned, we focused on the relative ranking of examples.

Both RAGAS and our own judge correlated reasonably well to the human scores, with our own judge being better correlated, especially in the higher score bands.

The results convinced us that our LLM-as-a-Judge offers a sufficiently reliable mechanism for assessing our system’s quality – both for comparing the quality of system versions to each other in order to make decisions about release candidates, and for finding examples which can indicate systematic quality issues we need to address.

references:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/

What are a couple of issues with RAGAS?

RAGAS covers a number of key metrics useful in LLM evaluation, including answer correctness (later renamed to “factual correctness”) and context accuracy via precision and recall.

RAGAS implements correctness tests by converting both the generated answer and the ground truth (reference) into a series of simplified statements.

The score is essentially a grade for the level of overlap between statements from reference vs. the generated answer, combined with some weight for overall similarity between the answers.
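For reference, running this correctness metric looks roughly like the following (a sketch based on ragas' documented usage; the data is made up and the API may differ across versions):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data = Dataset.from_dict({
    "question": ["What does the service return on timeout?"],
    "answer": ["It returns a 504 error after 30 seconds."],
    "ground_truth": ["The service responds with HTTP 504 once the 30-second timeout elapses."],
})

result = evaluate(data, metrics=[answer_correctness])
print(result)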

When eyeballing the scores RAGAS generated, we noticed two recurring issues:

For relatively short answers, every small “missed fact” results in significant penalties.

When one of the answers was more detailed than the other, the correctness score suffered greatly, despite both answers being valid and even useful

The latter issue was common enough, and didn’t align with our intention for the correctness metric, so we needed to find a way to evaluate the “essence” of the answers as well as the details.

references:

https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/