Wednesday, February 19, 2025

What are some of the basic RAG techniques with LlamaIndex? - Part 1 - Prompt Engineering

Prompt Engineering

If you're encountering failures related to the LLM, like hallucinations or poorly formatted outputs, then this should be one of the first things you try.


Customizing Prompts => Most of the prebuilt modules have prompts inside; these can be queried and viewed, and they can also be updated as required.


from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(response_mode="tree_summarize")

from IPython.display import Markdown, display

# define prompt viewing function
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))


prompts_dict = query_engine.get_prompts()


# from response synthesiser 

prompts_dict = query_engine.response_synthesizer.get_prompts()

display_prompt_dict(prompts_dict)



query_engine = index.as_query_engine(response_mode="compact")

prompts_dict = query_engine.get_prompts()

display_prompt_dict(prompts_dict)


response = query_engine.query("What did the author do growing up?")

print(str(response))


The prompt can be customized as shown below:

from llama_index.core import PromptTemplate


# reset

query_engine = index.as_query_engine(response_mode="tree_summarize")


# shakespeare!

new_summary_tmpl_str = (

    "Context information is below.\n"

    "---------------------\n"

    "{context_str}\n"

    "---------------------\n"

    "Given the context information and not prior knowledge, "

    "answer the query in the style of a Shakespeare play.\n"

    "Query: {query_str}\n"

    "Answer: "

)

new_summary_tmpl = PromptTemplate(new_summary_tmpl_str)


query_engine.update_prompts(

    {"response_synthesizer:summary_template": new_summary_tmpl}

)



Advanced Prompts => 


Partial formatting

Prompt template variable mappings

Prompt function mappings



1. Partial Formatting

Partial formatting (partial_format) allows you to partially format a prompt, filling in some variables while leaving others to be filled in later.


This is a nice convenience so you don't have to carry every required prompt variable down to the final format call; you can partially format variables as they become available.


This will create a copy of the prompt template.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Please write the answer in the style of {tone_name}

Query: {query_str}

Answer: \

"""


prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
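
With the template above, partial_format can fill in tone_name now and leave the rest for later. A minimal sketch (the context and query strings below are placeholders of my own):

partial_prompt_tmpl = prompt_tmpl.partial_format(tone_name="a Shakespeare play")

# the remaining variables are supplied at format time
fmt_prompt = partial_prompt_tmpl.format(
    context_str="The author wrote short stories and programmed on an IBM 1401.",
    query_str="What did the author do growing up?",
)
print(fmt_prompt)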



2. Prompt Template Variable Mappings

Template var mappings allow you to specify a mapping from the "expected" prompt keys (e.g. context_str and query_str for response synthesis) to the keys actually used in your template.


This allows you to re-use your existing string templates without having to annoyingly change out the template variables.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{my_context}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {my_query}

Answer: \

"""


template_var_mappings = {"context_str": "my_context", "query_str": "my_query"}


prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, template_var_mappings=template_var_mappings

)
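
Even though the template uses my_context and my_query, you still format it with the "expected" keys; the mapping handles the translation. A quick sketch with placeholder strings:

fmt_prompt = prompt_tmpl.format(
    context_str="Some retrieved context goes here.",
    query_str="What does the context say?",
)
print(fmt_prompt)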


3. Prompt Function Mappings


You can also pass in functions as template variables instead of fixed values.


This allows you to dynamically inject certain values, dependent on other values, at query time.



qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {query_str}

Answer: \

"""

def format_context_fn(**kwargs):

    # format context with bullet points

    context_list = kwargs["context_str"].split("\n\n")

    fmtted_context = "\n\n".join([f"- {c}" for c in context_list])

    return fmtted_context



prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, function_mappings={"context_str": format_context_fn}

)
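
When the template is formatted, context_str is routed through format_context_fn, so the context arrives as bullet points. A small usage sketch with placeholder text:

fmt_prompt = prompt_tmpl.format(
    context_str="First retrieved paragraph.\n\nSecond retrieved paragraph.",
    query_str="What do the paragraphs discuss?",
)
print(fmt_prompt)  # the two paragraphs are rendered as "- ..." bullets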


references:

https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/

How to perform structured retrieval for a large number of documents?



A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn’t do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query you may want to use structured information for more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, structured information can ensure those two PDFs get returned rather than relying on raw embedding similarity with chunks.


Key Techniques#

There are a few ways of performing more structured tagging/retrieval for production-quality RAG systems, each with their own pros/cons.


1. Metadata Filters + Auto Retrieval Tag each document with metadata and then store in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string.


Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also tags represent keyword search at the document-level, doesn’t allow for semantic lookups.


 2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval Embed document summaries and map to chunks per document. Fetch at the document-level first before chunk level.


Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive. 
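
For technique 1, here is a minimal sketch of both a manual metadata filter and LLM-inferred filters (auto-retrieval). It assumes index is an existing VectorStoreIndex over documents tagged with a doc_id metadata key, and the class names are taken from the LlamaIndex docs rather than from this post, so treat them as assumptions:

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo

# manual filter: restrict retrieval to chunks from one tagged document
filters = MetadataFilters(filters=[ExactMatchFilter(key="doc_id", value="report_a.pdf")])
retriever = index.as_retriever(similarity_top_k=2, filters=filters)

# auto-retrieval: let the LLM infer the filters from the query itself
vector_store_info = VectorStoreInfo(
    content_info="chunks from a collection of PDF reports",
    metadata_info=[
        MetadataInfo(name="doc_id", type="str", description="source file name"),
    ],
)
auto_retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = auto_retriever.retrieve("What does report_a.pdf say about revenue?")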

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/


Metadata Replacement + Node Sentence Window based retrieval for RAG

We use the SentenceWindowNodeParser to parse documents into single sentences per node. Each node also contains a "window" with the sentences on either side of the node sentence.


Then, after retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences using the MetadataReplacementNodePostProcessor.


This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.


By default, the sentence window is 3 sentences on either side of the original sentence.



The whole process can be split into the steps below.



Step 1: Load Data, Build the Index

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader(

    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]

).load_data()



Extract Nodes


We extract out the set of nodes that will be stored in the VectorIndex. This includes both the nodes with the sentence window parser, as well as the "base" nodes extracted using the standard parser.
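
The snippets below assume node_parser and text_splitter have already been constructed. A minimal setup sketch, using the window size and metadata keys from the LlamaIndex demo (treat the exact values as assumptions):

from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser

# sentence window parser: one sentence per node, surrounding window stored in metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# "base" parser with default chunk sizes
text_splitter = SentenceSplitter()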



nodes = node_parser.get_nodes_from_documents(documents)

base_nodes = text_splitter.get_nodes_from_documents(documents)


Build the Indexes

We build both the sentence index, as well as the "base" index (with default chunk sizes).

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

base_index = VectorStoreIndex(base_nodes)


Step 2: Querying

With MetadataReplacementPostProcessor


We now use the MetadataReplacementPostProcessor to replace the sentence in each node with its surrounding context.


from llama_index.core.postprocessor import MetadataReplacementPostProcessor


query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)

window_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(window_response)


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.


window = window_response.source_nodes[0].node.metadata["window"]

sentence = window_response.source_nodes[0].node.metadata["original_text"]


print(f"Window: {window}")

print("------------------")

print(f"Original Sentence: {sentence}")




Contrast with normal VectorStoreIndex


query_engine = base_index.as_query_engine(similarity_top_k=2)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index.


query_engine = base_index.as_query_engine(similarity_top_k=5)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Step 3: Analysis


So the SentenceWindowNodeParser + MetadataReplacementNodePostProcessor combo is the clear winner here.


Embeddings at a sentence level seem to capture more fine-grained details, like the word AMOC.


We can also compare the retrieved chunks for each index!


for source_node in window_response.source_nodes:

    print(source_node.node.metadata["original_text"])

    print("--------")


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!


Let's try to dissect why the naive vector index failed.


for node in vector_response.source_nodes:

    print("AMOC mentioned?", "AMOC" in node.node.text)

    print("--------")


So source node at index [2] mentions AMOC, but what did this text actually look like?

print(vector_response.source_nodes[2].node.text)




Step 4: Evaluation 

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.


We define/load an eval benchmark dataset and then run different evaluations over it.


WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.



from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset


from llama_index.llms.openai import OpenAI

import nest_asyncio

import random


nest_asyncio.apply()


num_nodes_eval = 30

# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)

sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)

# NOTE: run this if the dataset isn't already saved

# generate questions from the largest chunks (1024)

dataset_generator = DatasetGenerator(

    sample_eval_nodes,

    llm=OpenAI(model="gpt-4"),

    show_progress=True,

    num_questions_per_chunk=2,

)


eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()


eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")


eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")


import asyncio

import nest_asyncio


nest_asyncio.apply()


from llama_index.core.evaluation import (

    CorrectnessEvaluator,

    SemanticSimilarityEvaluator,

    RelevancyEvaluator,

    FaithfulnessEvaluator,

    PairwiseComparisonEvaluator,

)



from collections import defaultdict

import pandas as pd


# NOTE: can uncomment other evaluators

evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_s = SemanticSimilarityEvaluator()

evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))

# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))


from llama_index.core.evaluation.eval_utils import (

    get_responses,

    get_results_df,

)

from llama_index.core.evaluation import BatchEvalRunner


max_samples = 30


eval_qs = eval_dataset.questions

ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]


# resetup base query engine and sentence window query engine

# base query engine

base_query_engine = base_index.as_query_engine(similarity_top_k=2)

# sentence window query engine

query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)


import numpy as np


base_pred_responses = get_responses(

    eval_qs[:max_samples], base_query_engine, show_progress=True

)

pred_responses = get_responses(

    eval_qs[:max_samples], query_engine, show_progress=True

)


pred_response_strs = [str(p) for p in pred_responses]

base_pred_response_strs = [str(p) for p in base_pred_responses]


evaluator_dict = {

    "correctness": evaluator_c,

    "faithfulness": evaluator_f,

    "relevancy": evaluator_r,

    "semantic_similarity": evaluator_s,

}

batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)


eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


base_eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=base_pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


results_df = get_results_df(

    [eval_results, base_eval_results],

    ["Sentence Window Retriever", "Base Retriever"],

    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],

)

display(results_df)




References:

https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/

What is Document Summary Index in LlamaIndex?

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.


Retrieval can be performed through the LLM or embeddings (which is a TODO). We first select the relevant documents to the query based on their summaries. All retrieved nodes corresponding to the selected documents are retrieved.


The steps involved are as follows:


Step 1: Load Datasets

Load Wikipedia pages on different cities


from llama_index.core import SimpleDirectoryReader

# example Wikipedia city pages, assumed to be saved as data/<title>.txt
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)


Step 2: Build Document Summary Index 

There are two ways of building the index:


a. default mode of building the document summary index

b. customizing the summary query


from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
splitter = SentenceSplitter(chunk_size=1024)

# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)
doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    llm=chatgpt,
    transformations=[splitter],
    response_synthesizer=response_synthesizer,
    show_progress=True,
)


doc_summary_index.get_document_summary("Boston")

doc_summary_index.storage_context.persist("index")



from llama_index.core import load_index_from_storage

from llama_index.core import StorageContext


# rebuild storage context

storage_context = StorageContext.from_defaults(persist_dir="index")

doc_summary_index = load_index_from_storage(storage_context)


Step 3: Perform Retrieval from the Document Summary Index
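
The post stops before showing the retrieval step; a minimal sketch of both the high-level query engine and the lower-level LLM-based retriever (class and argument names follow the LlamaIndex docs and should be treated as assumptions):

# high-level querying over the document summary index
query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)
response = query_engine.query("What are the sports teams in Toronto?")
print(response)

# lower-level: LLM-based retrieval over the stored document summaries
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
)

retriever = DocumentSummaryIndexLLMRetriever(doc_summary_index, choice_top_k=1)
retrieved_nodes = retriever.retrieve("What are the sports teams in Toronto?")
print(len(retrieved_nodes))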


References:

https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/

Tuesday, February 18, 2025

What is PandasQueryEngine ?

PandasQueryEngine: convert natural language to Pandas python code using LLMs.

The input to the PandasQueryEngine is a Pandas dataframe, and the output is a response. The LLM infers dataframe operations to perform in order to retrieve the result.

Let's start on a Toy DataFrame

Here let's load a very simple dataframe containing city and population pairs, and run the PandasQueryEngine on it.

By setting verbose=True we can see the intermediate generated instructions.

import pandas as pd

from llama_index.experimental.query_engine import PandasQueryEngine

# Test on some sample data
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)
query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query(

    "What is the city with the highest population?",

)

We can also take the step of using an LLM to synthesize a response.

query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)

response = query_engine.query(

    "What is the city with the highest population? Give both the city and population",

)

print(str(response))

Analyzing the Titanic Dataset

df = pd.read_csv("./titanic_train.csv")

query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query(

    "What is the correlation between survival and age?",

)

from IPython.display import Markdown, display

display(Markdown(f"<b>{response}</b>"))

print(response.metadata["pandas_instruction_str"])

References:

https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/


What is Camelot Library for PDF extraction

Camelot is a Python library that makes it easy to extract tables from PDF files. It's particularly useful for PDFs where the tables are not easily selectable or copyable (e.g., PDFs with complex layouts whose tables aren't marked up in the document structure). Camelot works by using a combination of image processing and text analysis to identify and extract table data.

Here's a breakdown of what Camelot does and why it's helpful:

Key Features and Benefits:

Table Detection: Camelot can automatically detect tables within a PDF, even if they aren't marked up as tables in the PDF's internal structure.   

Table Extraction: Once tables are detected, Camelot extracts the data from them and provides it in a structured format (like a Pandas DataFrame).   

Handles Different Table Types: It can handle various table formats, including tables with borders, tables without borders, and tables with complex layouts.   

Output to Pandas DataFrames: The extracted table data is typically returned as a Pandas DataFrame, making it easy to further process and analyze the data in Python.   

Command-Line Interface: Camelot also comes with a command-line interface, which can be useful for quick table extraction tasks.   
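
A minimal sketch of typical Camelot usage (the file name and page range are placeholders of my own):

import camelot

# "lattice" works for tables with ruled lines; "stream" for whitespace-separated tables
tables = camelot.read_pdf("report.pdf", pages="1-3", flavor="lattice")

print(len(tables))          # number of tables found
print(tables[0].df.head())  # each table is exposed as a pandas DataFrame
tables.export("tables.csv", f="csv")  # write all extracted tables to disk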

How it Works (Simplified):


Image Processing: Camelot often uses image processing techniques to identify the boundaries of tables within the PDF. This is especially helpful for PDFs where the tables aren't readily discernible from the underlying PDF structure.   

Text Analysis: It analyzes the text content within the identified table regions to reconstruct the table structure and extract the data.   

When to Use Camelot:


PDFs with Non-Selectable Tables: If you're working with PDFs where you can't easily select or copy the table data, Camelot is likely the right tool.

Complex Table Layouts: When tables have complex formatting, borders, or spanning cells that make standard PDF text extraction difficult, Camelot can help.   

Automating Table Extraction: If you need to extract tables from many PDFs programmatically, Camelot provides a convenient way to do this.

Limitations:


Scanned PDFs: Camelot primarily works with text-based PDFs. It does not have built-in OCR (Optical Character Recognition) capabilities. If your PDF is a scanned image, you'll need to use an OCR library (like Tesseract) first to convert the image to text before you can use Camelot.

Accuracy: While Camelot is good at table detection and extraction, its accuracy can vary depending on the complexity of the PDF and the tables. You might need to adjust some parameters or do some manual cleanup in some cases.



In summary: Camelot is a valuable library for extracting table data from PDFs, particularly when the tables are difficult to extract using other methods.  It combines image processing and text analysis to identify and extract table data, providing it in a structured format that can be easily used in Python.  Keep in mind its limitations with scanned PDFs and the potential for some inaccuracies.


References:

Gemini

What are some of advanced techniques for building production grade RAG?

 Decoupling chunks used for retrieval vs. chunks used for synthesis

Structured Retrieval for Larger Document Sets

Dynamically Retrieve Chunks Depending on your Task

Optimize context embeddings


Key Techniques#

There’s two main ways to take advantage of this idea:

1. Embed a document summary, which links to chunks associated with the document.

This can help retrieve relevant documents at a high-level before retrieving chunks vs. retrieving chunks directly (that might be in irrelevant documents).


2. Embed a sentence, which then links to a window around the sentence.

This allows for finer-grained retrieval of relevant context (embedding giant chunks leads to “lost in the middle” problems), but also ensures enough context for LLM synthesis.


Structured Retrieval for Larger Document Sets

A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn’t do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query you may want to use structured information for more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, structured information can ensure those two PDFs get returned rather than relying on raw embedding similarity with chunks.

Key Techniques#

1. Metadata Filters + Auto Retrieval Tag each document with metadata and then store in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string.

Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also tags represent keyword search at the document-level, doesn’t allow for semantic lookups.

2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval Embed document summaries and map to chunks per document. Fetch at the document-level first before chunk level.

Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive.

Dynamically Retrieve Chunks Depending on your Task

RAG isn't just about question-answering about specific facts, which top-k similarity is optimized for. There can be a broad range of queries that a user might ask. Queries that are handled by naive RAG stacks include ones that ask about specific facts e.g. "Tell me about the D&I initiatives for this company in 2023" or "What did the narrator do during his time at Google". But queries can also include summarization e.g. "Can you give me a high-level overview of this document", or comparisons "Can you compare/contrast X and Y". All of these use cases may require different retrieval techniques.

LlamaIndex provides some core abstractions to help you do task-specific retrieval. This includes the router module as well as the data agent module, some advanced query engine modules, and other modules that join structured and unstructured data.

You can use these modules to do joint question-answering and summarization, or even combine structured queries with unstructured queries.

Optimize Context Embeddings

This is related to the motivation described above in "decoupling chunks used for retrieval vs. synthesis". We want to make sure that the embeddings are optimized for better retrieval over your specific data corpus. Pre-trained models may not capture the salient properties of the data relevant to your use case.

Key Techniques#

Beyond some of the techniques listed above, we can also try finetuning the embedding model. We can actually do this over an unstructured text corpus, in a label-free way.
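
A rough sketch of label-free embedding finetuning, assuming train_nodes is a list of parsed text nodes from your corpus; the class and function names follow the LlamaIndex finetuning guide and should be treated as assumptions, as should the base model id:

from llama_index.finetuning import (
    SentenceTransformersFinetuneEngine,
    generate_qa_embedding_pairs,
)
from llama_index.llms.openai import OpenAI

# synthesize (question, chunk) pairs from an unlabeled corpus of nodes
train_dataset = generate_qa_embedding_pairs(nodes=train_nodes, llm=OpenAI(model="gpt-3.5-turbo"))

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",        # base embedding model (assumed)
    model_output_path="finetuned_model",
)
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()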

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/

Monday, February 17, 2025

When using PyMuPDF4LLM, LlamaIndex is one of the option as output what are the advantages of these?

When parsing a PDF and getting the result as LlamaIndex Documents, the primary advantage is the ability to seamlessly integrate the extracted information with other data sources and readily query it using a large language model (LLM) within the LlamaIndex framework. This allows for richer, more contextual responses and analysis compared to simply extracting raw text from a PDF alone; essentially, it enables you to build sophisticated knowledge-based applications by combining data from various sources, including complex PDFs, in a unified way.

Key benefits:

Contextual Understanding:

LlamaIndex can interpret the extracted PDF data within the broader context of other related information, leading to more accurate and relevant responses when querying. 

Multi-Source Querying:

You can easily query across multiple documents, including the parsed PDF, without needing separate data processing pipelines for each source. 

Advanced Parsing with LlamaParse:

LlamaIndex provides a dedicated "LlamaParse" tool specifically designed for complex PDF parsing, including tables and figures, which can be directly integrated into your workflow. 

RAG Applications:

By representing PDF data as LlamaIndex documents, you can readily build "Retrieval Augmented Generation" (RAG) applications that can retrieve relevant information from your PDF collection based on user queries. 
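
A minimal sketch of that RAG path, going from a PDF to a queryable index (the file name and query are placeholders; an embedding model and LLM are assumed to be configured):

import pymupdf4llm
from llama_index.core import VectorStoreIndex

# parse the PDF directly into LlamaIndex Document objects
llama_docs = pymupdf4llm.LlamaMarkdownReader().load_data("input.pdf")

index = VectorStoreIndex.from_documents(llama_docs)
response = index.as_query_engine().query("Summarize the key findings of this report.")
print(response)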

references:

Gemini 



Sunday, February 16, 2025

What are the main features for PyMuPDF4LLM?

PyMuPDF4LLM is built on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents

Multi-Column Pages

The text extraction can handle document layouts with multiple columns, meaning that “newspaper” type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will be Markdown text with references to any images found in the document. The images are saved to the location from where you run the Python script, and the Markdown references them with the correct Markdown image syntax.


Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.
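
A small sketch of iterating over those page chunks, using only the keys listed in the schema above (the file name is a placeholder):

import pymupdf4llm

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

for page in chunks:
    # each item is a dict with metadata, toc_items, tables, images, graphics, text
    print(len(page["tables"]), "tables,", len(page["images"]), "images")
    print(page["text"][:100])  # first 100 characters of the page's Markdown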


LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")


With these three simple lines of code you will receive LlamaIndex document objects from the PDF file input for use with your LLM application!



What is Test-Time Scaling technique?

Test-Time Scaling (TTS) is a technique used to improve the performance of Large Language Models (LLMs) during inference (i.e., when the model is used to generate text or make predictions, not during training).  It works by adjusting the model's output probabilities based on the observed distribution of tokens in the generated text.   

Here's a breakdown of how it works:

Standard LLM Inference:  Typically, LLMs generate text by sampling from the probability distribution over the vocabulary at each step.  The model predicts the probability of each possible next token, and then a sampling strategy (e.g., greedy decoding, beam search, temperature sampling) is used to select the next token.   

The Problem:  LLMs can sometimes produce outputs that are repetitive, generic, or lack diversity.  This is partly because the model's probability distribution might be overconfident, assigning high probabilities to a small set of tokens and low probabilities to many others.   

Test-Time Scaling: TTS addresses this issue by introducing a scaling factor to the model's output probabilities.  This scaling factor is typically applied to the logits (the pre-softmax outputs of the model).

How Scaling Works: The scaling factor is usually less than 1.  When the logits are scaled down, the probability distribution becomes "flatter" or less peaked. This has the effect of:

Increasing the probability of less frequent tokens: This helps to reduce repetition and encourages the model to explore a wider range of vocabulary.

Reducing the probability of highly frequent tokens: This can help to prevent the model from getting stuck in repetitive loops or generating overly generic text.   

Adaptive Scaling (Often Used):  In many implementations, the scaling factor is adaptive.  It's adjusted based on the characteristics of the generated text so far.  For example, if the generated text is becoming repetitive, the scaling factor might be decreased further to increase diversity.  Conversely, if the text is becoming too random or incoherent, the scaling factor might be increased to make the distribution more peaked.
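
To make the flattening effect concrete, here is a tiny NumPy sketch of the logit-scaling idea described above (illustrative only; a real decoder would apply this to the model's logits before sampling):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])

print(softmax(logits))        # peaked distribution
print(softmax(logits * 0.5))  # scale < 1 flattens it, boosting less frequent tokens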

Benefits of TTS:

Improved Text Quality: TTS can lead to more diverse, creative, and less repetitive text generation.

Better Performance on Downstream Tasks: For tasks like machine translation or text summarization, TTS can improve the accuracy and fluency of the generated output.

In summary: TTS is a post-processing technique applied during inference. It adjusts the LLM's output probabilities to encourage more diverse and less repetitive text generation.  By scaling down the logits, the probability distribution is flattened, making it more likely for the model to choose less frequent tokens and avoid getting stuck in repetitive loops.  Adaptive scaling makes the process even more effective by dynamically adjusting the scaling factor based on the generated text.

references:

https://www.marktechpost.com/2025/02/13/can-1b-llm-surpass-405b-llm-optimizing-computation-for-small-llms-to-outperform-larger-models/


 


Saturday, February 15, 2025

What is LLM-as-a-judge and how does it compare to RAGAS?

The idea behind LLM-as-a-judge is simple – provide an LLM with the output of your system and the ground truth answer, and ask it to score the output based on some criteria.

The challenge is to get the judge to score according to domain-specific and problem-specific standards.

in other words, we needed to evaluate the evaluators!

First, we ran a sanity test – we used our system to generate answers based on ground truth context, and scored them using the judges.

This test confirmed that both judges behaved as expected: the answers, which were based on the actual ground truth context, scored high – both in absolute terms and in relation to the scores of running the system including the retrieval phase on the same questions. 

Next, we performed an evaluation of the correctness score by comparing it to the correctness score generated by human domain experts.

Our main focus was investigating the correlation between our various LLM-as-a-judge tools to the human-labeled examples, looking at trends rather than the absolute score values.

This method helped us deal with another risk – human experts can have a subjective perception of absolute score numbers. So instead of looking at the exact score they assigned, we focused on the relative ranking of examples.

Both RAGAS and our own judge correlated reasonably well to the human scores, with our own judge being better correlated, especially in the higher score bands

The results convinced us that our LLM-as-a-Judge offers a sufficiently reliable mechanism for assessing our system’s quality – both for comparing the quality of system versions to each other in order to make decisions about release candidates, and for finding examples which can indicate systematic quality issues we need to address.

references:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/

What are couple of issues with RAGAS?

RAGAS covers a number of key metrics useful in LLM evaluation, including answer correctness (later renamed to “factual correctness”) and context accuracy via precision and recall.

RAGAS implements correctness tests by converting both the generated answer and the ground truth (reference) into a series of simplified statements.

The score is essentially a grade for the level of overlap between statements from reference vs. the generated answer, combined with some weight for overall similarity between the answers.

When eyeballing the scores RAGAS generated, we noticed two recurring issues:

For relatively short answers, every small “missed fact” results in significant penalties.

When one of the answers was more detailed than the other, the correctness score suffered greatly, despite both answers being valid and even useful.

The latter issue was common enough, and didn’t align with our intention for the correctness metric, so we needed to find a way to evaluate the “essence” of the answers as well as the details.

references:

https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/


Monday, February 10, 2025

What is n8n?

n8n is a distributed node-based workflow automation tool (https://n8n.io/). n8n workflows can be executed in standalone mode or in queue mode. Queue mode allows setting up a main/worker architecture with one main process and multiple worker processes, so that multiple workflows can be executed in parallel. In queue mode, the main process submits execution requests to a message broker, and worker processes pick up those requests and execute them. Standalone mode doesn’t separate the main process and worker process (they are bundled together) and can be used if parallelism is not required.

Components and their descriptions:

n8n Main Process

This provides the Editor UI, Internal APIs and workflow triggers.


Editor UI — This is the n8n user interface where we can manually configure the workflows, execute them, monitor the execution status, set up credentials, etc

Internal APIs — These are a set of APIs that allow CRUD operations on the workflows. n8n exposes REST APIs (https://docs.n8n.io/api/) for this purpose.

Triggers — There are several trigger nodes available in n8n to trigger workflow execution (e.g., cron, interval, webhook, etc)

Redis Message Broker

This is where queue is set up in high availability mode. Main process submits the workflow execution requests to the queue and worker nodes pick and execute those requests


n8n Worker Processes

One or more worker processes can be set-up to execute the workflows. These components do the heavy-lifting and must be configured appropriately since they have direct impact on performance and scaling. Some of the key aspects are listed below.


Memory and CPU allocation

Workflow execution concurrency

Number of worker processes

PostgreSQL Database

Workflows, executions and credentials are stored in the database. A PostgreSQL database can be used for this.


Sunday, February 9, 2025

What is Approximate Nearest Neighbour Search?

Approximate Nearest Neighbors (ANN) algorithms are a class of algorithms designed to find the approximate nearest neighbors to a query point in a high-dimensional space.  They don't guarantee finding the absolute nearest neighbors (hence "approximate"), but they do so much faster than exact nearest neighbor search, which becomes computationally very expensive in high dimensions.  This speedup is crucial for many applications, including:   

Information Retrieval: Finding similar documents, images, or other data points.   

Recommendation Systems: Recommending items similar to those a user has interacted with.   

Clustering: Preprocessing step for some clustering algorithms.

Anomaly Detection: Identifying data points that are significantly different from others.

Why Approximate?

The curse of dimensionality makes exact nearest neighbor search computationally prohibitive in high-dimensional spaces.  The number of possible neighbors grows exponentially with the number of dimensions, making exhaustive search impractical. ANN algorithms trade off a small amount of accuracy (by finding approximate neighbors) for a massive gain in speed.   

HNSW (Hierarchical Navigable Small World):

HNSW is a graph-based ANN algorithm. It constructs a hierarchical graph structure where each node represents a data point.  The graph is designed so that you can efficiently navigate from any node to its nearest neighbors. The hierarchy allows for faster searching by starting at a higher level and progressively moving down to the lower levels.   

How it works (simplified):

Graph Construction: Data points are connected to their nearest neighbors at multiple levels of the hierarchy. Connections at higher levels are longer-range, while connections at lower levels are shorter-range.   

Search: Given a query point, the search starts at the top level of the graph. It navigates to the nearest neighbor at that level. Then, it proceeds to the next level and repeats the process until it reaches the lowest level. The nodes visited at the lowest level are considered the approximate nearest neighbors.   

Advantages:  Excellent performance, relatively low memory usage, good for various distance metrics.

Used in: Milvus (a vector database), and other vector search libraries.   
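
A minimal HNSW sketch using the hnswlib library (my choice for illustration; the post itself doesn't name it, and the data here is random):

import hnswlib
import numpy as np

dim = 128
data = np.random.random((10_000, dim)).astype("float32")

# build the hierarchical graph index
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data)

index.set_ef(50)  # search-time trade-off between speed and recall
labels, distances = index.knn_query(data[:5], k=5)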

FAISS (Facebook AI Similarity Search):

FAISS is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It provides implementations of many ANN algorithms, including several variations of Locality Sensitive Hashing (LSH), Product Quantization (PQ), and IVF (Inverted File Index).  It is highly optimized for performance and can handle very large datasets.   

Key Techniques used in FAISS:

Quantization: Reduces the size of vectors by mapping them to a smaller set of representative vectors.

Locality Sensitive Hashing (LSH): Uses hash functions to map similar vectors to the same buckets, increasing the probability of finding neighbors.   

Inverted File Index (IVF): Partitions the data into clusters and builds an index on these clusters, enabling faster search by only considering relevant clusters.   

Advantages:  Highly optimized for performance, supports various ANN algorithms, handles large datasets.

Used in: Many large-scale similarity search applications.
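
A small FAISS sketch of the IVF approach mentioned above (random vectors; install via the faiss-cpu package):

import faiss
import numpy as np

d = 128
xb = np.random.random((10_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

quantizer = faiss.IndexFlatL2(d)                # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, 100)   # 100 inverted lists (clusters)
index.train(xb)
index.add(xb)

index.nprobe = 10                               # clusters to visit per query
distances, ids = index.search(xq, 5)            # approximate 5 nearest neighbours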



Wednesday, February 5, 2025

What would be a fair comparison of various PDF parsers available in the market now?


Key Considerations and Recommendations:

For general-purpose, high-performance PDF processing (including complex PDFs): PyMuPDF (fitz) is an excellent choice. Its speed and ability to handle complex layouts make it a strong contender.

For modifying or manipulating PDFs (merging, splitting, etc.): pikepdf and pdfrw are your best bets. pikepdf is generally preferred for its ease of use. pdfrw is more low-level and powerful.

For simple text and metadata extraction from relatively straightforward PDFs: PyPDF2 is a decent option. It's pure Python, so no external dependencies. However, it may not be as robust as PyMuPDF or PDFMiner.six for complex PDFs.

For extracting data from tables in PDFs: pdfplumber is specifically designed for this and does an excellent job.

For robust text and data extraction, particularly when you need more control: PDFMiner.six is a solid choice.

Which one to choose?


Most common use case (text extraction from various PDFs): PyMuPDF (fitz)

PDF Manipulation: pikepdf

Table Extraction: pdfplumber

Simple PDF text extraction (pure Python): PyPDF2 (but be aware of its limitations)

Remember to install the library you choose using pip install <library_name>.  For PyMuPDF, you'll likely need to install the pre-built wheels for your platform to avoid compilation issues.  Refer to the library's documentation for installation instructions.
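
For the most common case (plain text extraction), a minimal PyMuPDF sketch (the file name is a placeholder):

import fitz  # PyMuPDF

with fitz.open("sample.pdf") as doc:
    text = "\n".join(page.get_text() for page in doc)

print(text[:500])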


What is MSR-VTT (Microsoft Research Video to Text)

 MSR-VTT (Microsoft Research Video to Text) is a benchmark dataset developed by Microsoft Research for video captioning and retrieval tasks. It is widely used in computer vision and natural language processing (NLP) to evaluate models that generate textual descriptions from videos or retrieve videos based on textual queries.


Key Features of MSR-VTT

Large-Scale Dataset:


Contains 10,000 video clips.

Covers 257 video categories (e.g., music, news, sports, gaming).

Rich Annotations:


Each video is annotated with 20 natural language descriptions.

A total of 200,000 captions describing video content.

Diverse Video Content:


Extracted from real-world sources like YouTube.

Covers a wide range of topics (e.g., entertainment, education, sports, music).

Benchmark for Video Captioning & Retrieval:


Used for training and evaluating models in:

Video-to-Text Generation (automatic captioning).

Text-to-Video Retrieval (finding relevant videos from text queries).

Use Cases

Training AI models for automated video captioning.

Video search and retrieval using textual queries.

Improving multimodal AI systems that process both visual and textual data.

Benchmarking video understanding models in NLP and computer vision research.

Challenges in MSR-VTT

Complex Video Semantics: Understanding actions, objects, and scene context in videos.

Natural Language Variability: Different ways of describing the same video.

Multimodal Learning: Combining visual, audio, and textual information effectively.


What are the main parts of a Diffusion Model?

Diffusion Model Overview

Take a look at our diffusion architecture's main components.


The main part is a 3D U-Net, which is good at working with videos because they have frames that change over time. This U-Net isn’t just simple layers; it also uses attention. Attention helps the model focus on important parts of the video. Temporal attention looks at how frames relate to each other in time, and spatial attention focuses on the different parts of each frame. These attention layers, along with special blocks, help the model learn from the video data. Also, to make the model generate videos that match the text prompt we provide, we use text embeddings.

The model works using something called diffusion. Think of it like this: first, we add noise to the training videos until they’re just random. Then, the model learns to undo this noise. To generate a video, it starts with pure noise and gradually removes the noise using the U-Net, with the text embeddings we provided as a guide. The important steps here are adding noise and then removing it. The text prompt is converted to embeddings using BERT and passed to the U-Net, enabling it to generate videos from text. By doing this again and again, we get a video that matches the text we gave, which lets us make videos from words.

Instead of looking at the original complex diagram, let’s visualize a simpler and easier architecture diagram that we will be coding.




Let’s read through the flow of our architecture that we will be coding:


Input Video: The process begins with a video that we want to learn from or use as a starting point.

UNet3D Encoding: The video goes through the UNet3D’s encoder, which is like a special part of the system that shrinks the video and extracts important features from it.

UNet3D Bottleneck Processing: The shrunken video features are then processed in the UNet3D’s bottleneck, the part where the feature maps have the smallest spatial dimensions.

UNet3D Decoding: Next, the processed features are sent through the UNet3D’s decoder, which enlarges the features back to a video, adding the learned structure.

Text Conditioning: The provided text prompt, used to guide the generation, is converted into a text embedding, which provides input to the UNet3D at various points, allowing the video to be generated accordingly.

Diffusion Process: During training, noise is added to the video, and the model learns to remove it. During video generation, we start with noise, and the model uses the UNet3D to gradually remove the noise, generating the video.

Output Video: Finally, we get the output video that is generated by the model and is based on the input video or noise and the text provided as a prompt.
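
As a concrete illustration of the "add noise, then learn to remove it" step in the flow above, here is a minimal PyTorch sketch of the standard DDPM forward-noising equation (my own sketch; the linked repo's implementation may differ in details):

import torch

def add_noise(x0, t, alphas_cumprod):
    # forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)  # broadcast over (B, C, T, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # the U-Net is trained to predict `noise` from x_t, t and the text embedding

# usage: a batch of 2 videos, 3 channels, 8 frames, 32x32 pixels
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
x0 = torch.randn(2, 3, 8, 32, 32)
t = torch.randint(0, 1000, (2,))
x_t, noise = add_noise(x0, t, alphas_cumprod)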

references:

https://levelup.gitconnected.com/building-a-tiny-text-to-video-model-from-scratch-using-python-f31bdab12fbb

https://github.com/FareedKhan-dev/text2video-from-scratch?source=post_page-----f31bdab12fbb--------------------------------



Sunday, February 2, 2025

What is Playwright and PlayWright MCP Server ?

 Playwright is a powerful open-source framework for automating web browsers (Chromium, Firefox, and WebKit). It allows you to control browsers programmatically, which is incredibly useful for tasks like:   


End-to-end testing: Testing the full user flow of a web application, simulating real user interactions.   

Web scraping: Extracting data from websites.   

Automation of web tasks: Automating repetitive actions on websites.   

Playwright supports multiple programming languages, including JavaScript/TypeScript, Python, Java, and C#.  It provides a consistent API across these languages, making it easy to switch between them if needed.   


Key Features of Playwright:


Cross-browser support: Works with Chromium, Firefox, and WebKit.   

Auto-waiting: Playwright intelligently waits for elements to be ready before interacting with them, reducing the need for explicit waits.   

Resilient execution: Playwright handles flaky tests and automatically retries actions.   

Powerful selectors: Supports CSS and XPath selectors for precise element targeting.   

Debugging tools: Provides excellent debugging capabilities, including browser contexts, tracing, and video recording.   

Headless and headed modes: Can run browsers in headless mode (without a visible window) for faster execution or in headed mode for visual debugging.   
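
A minimal sketch using Playwright's sync Python API (assumes pip install playwright followed by playwright install):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # headless run; set False to watch it
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()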

Playwright MCP Server (Microsoft Playwright Server):


The Playwright MCP Server, often referred to as the browser server, is a service that manages the browsers that Playwright controls. It acts as a central point for communication between your Playwright scripts and the actual browser instances.   

Why is it used?

Remote Execution: The MCP Server enables you to run your Playwright tests or scripts on a different machine than where your code is running. This is useful for distributed testing or running tests in a cloud environment.   

Browser Management: The server handles the launching and closing of browsers, ensuring that they are properly managed. This is especially helpful when running many tests concurrently.

Scalability: The server can be scaled to handle a large number of concurrent browser sessions, making it suitable for large projects or continuous integration/continuous deployment (CI/CD) pipelines.

In summary:

Playwright: The framework that provides the API for controlling browsers.   

Playwright MCP Server: A service that manages the browsers and facilitates remote execution and scaling.   

You don't always need the MCP Server. If you're running simple Playwright scripts locally and don't need remote execution or advanced browser management, you can run Playwright directly without the server. However, for more complex scenarios, especially in testing environments, the MCP Server becomes very valuable.