Wednesday, February 19, 2025

What are some of the basic RAG techniques With LlamaIndex - Part 1 - Prompt Engineering

Prompt Engineering

If you're encountering failures related to the LLM, like hallucinations or poorly formatted outputs, then this should be one of the first things you try.


Customizing Prompts => Most of the prebuilt modules have prompts inside; these can be queried and viewed, and also updated as required.


from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from IPython.display import Markdown, display

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(response_mode="tree_summarize")

# define prompt viewing function

def display_prompt_dict(prompts_dict):

    for k, p in prompts_dict.items():

        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"

        display(Markdown(text_md))

        print(p.get_template())

        display(Markdown("<br><br>"))


prompts_dict = query_engine.get_prompts()


# from response synthesiser 

prompts_dict = query_engine.response_synthesizer.get_prompts()

display_prompt_dict(prompts_dict)



query_engine = index.as_query_engine(response_mode="compact")

prompts_dict = query_engine.get_prompts()

display_prompt_dict(prompts_dict)


response = query_engine.query("What did the author do growing up?")

print(str(response))


Customizing a prompt can be done as shown below:

from llama_index.core import PromptTemplate


# reset

query_engine = index.as_query_engine(response_mode="tree_summarize")


# shakespeare!

new_summary_tmpl_str = (

    "Context information is below.\n"

    "---------------------\n"

    "{context_str}\n"

    "---------------------\n"

    "Given the context information and not prior knowledge, "

    "answer the query in the style of a Shakespeare play.\n"

    "Query: {query_str}\n"

    "Answer: "

)

new_summary_tmpl = PromptTemplate(new_summary_tmpl_str)


query_engine.update_prompts(

    {"response_synthesizer:summary_template": new_summary_tmpl}

)



Advanced Prompts => 


Partial formatting

Prompt template variable mappings

Prompt function mappings



Partial formatting (partial_format) allows you to partially format a prompt, filling in some variables while leaving others to be filled in later.


This is a nice convenience so you don't have to maintain all the required prompt variables all the way down to format; you can partially format them as they come in.


This will create a copy of the prompt template.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Please write the answer in the style of {tone_name}

Query: {query_str}

Answer: \

"""


prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)
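

A minimal sketch of partially formatting this template: fill in tone_name now and leave context_str / query_str for later (the example strings are made up):

# fill in `tone_name` now, leave the other variables for later
partial_prompt_tmpl = prompt_tmpl.partial_format(tone_name="a Shakespeare play")

# supply the remaining variables at format time
fmt_prompt = partial_prompt_tmpl.format(
    context_str="The author worked on short stories and programming before college.",
    query_str="What did the author do growing up?",
)
print(fmt_prompt)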



Prompt Template Variable Mappings

Template var mappings allow you to specify a mapping from the "expected" prompt keys (e.g. context_str and query_str for response synthesis) to the keys actually in your template.


This allows you to re-use your existing string templates without having to change out the template variables.


qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{my_context}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {my_query}

Answer: \

"""


template_var_mappings = {"context_str": "my_context", "query_str": "my_query"}


prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, template_var_mappings=template_var_mappings

)
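

When formatting, you still pass the standard keys (context_str, query_str); the mappings route them to my_context and my_query under the hood. A quick sketch with made-up strings:

fmt_prompt = prompt_tmpl.format(
    context_str="In this work, we develop and release Llama 2.",
    query_str="How many parameters does Llama 2 have?",
)
print(fmt_prompt)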


Prompt Function Mappings


You can also pass in functions as template variables instead of fixed values.


This allows you to dynamically inject certain values, dependent on other values, during query-time.



qa_prompt_tmpl_str = """\

Context information is below.

---------------------

{context_str}

---------------------

Given the context information and not prior knowledge, answer the query.

Query: {query_str}

Answer: \

"""

def format_context_fn(**kwargs):

    # format context with bullet points

    context_list = kwargs["context_str"].split("\n\n")

    fmtted_context = "\n\n".join([f"- {c}" for c in context_list])

    return fmtted_context



prompt_tmpl = PromptTemplate(

    qa_prompt_tmpl_str, function_mappings={"context_str": format_context_fn}

)


references:

https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/

How to Perform structured retrieval for large number of documents?



A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn't do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query, you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, you can use structured information to ensure those two PDFs get returned, beyond raw embedding similarity with chunks.


Key Techniques#

There’s a few ways of performing more structured tagging/retrieval for production-quality RAG systems, each with their own pros/cons.


1. Metadata Filters + Auto Retrieval Tag each document with metadata and then store in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string (a minimal sketch follows this list).


Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also tags represent keyword search at the document-level, doesn’t allow for semantic lookups.


 2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval Embed document summaries and map to chunks per document. Fetch at the document-level first before chunk level.


Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive. 
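

A minimal sketch of technique 1 (metadata filters + auto-retrieval) using LlamaIndex's VectorIndexAutoRetriever; the metadata fields, descriptions and query are hypothetical, and index is assumed to be a VectorStoreIndex built over metadata-tagged documents:

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# describe the metadata the documents were tagged with, so the LLM can infer filters
vector_store_info = VectorStoreInfo(
    content_info="10-K filings from different companies",
    metadata_info=[
        MetadataInfo(name="company", type="str", description="Company ticker, e.g. AAPL"),
        MetadataInfo(name="year", type="int", description="Fiscal year of the filing"),
    ],
)

retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
nodes = retriever.retrieve("What were Apple's R&D expenses in 2022?")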

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/


Metadata Replacement + Node Sentence Window based retrieval for RAG

We use the SentenceWindowNodeParser to parse documents into single sentences per node. Each node also contains a "window" with the sentences on either side of the node sentence.


Then, after retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences using the MetadataReplacementPostProcessor.


This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.


By default, the sentence window is 5 sentences on either side of the original sentence.



The whole process can be split into the steps below.



Step 1: Load Data, Build the Index

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader(

    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]

).load_data()



Extract Nodes


We extract out the set of nodes that will be stored in the VectorIndex. This includes both the nodes with the sentence window parser, as well as the "base" nodes extracted using the standard parser.
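

The node_parser and text_splitter used below aren't shown in the snippet; a typical setup (window size and metadata keys assumed to follow the LlamaIndex example) would be:

from llama_index.core.node_parser import SentenceWindowNodeParser, SentenceSplitter

# one sentence per node, with 3 sentences of context stored on either side
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# standard parser for the "base" nodes (default chunk sizes)
text_splitter = SentenceSplitter()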



nodes = node_parser.get_nodes_from_documents(documents)

base_nodes = text_splitter.get_nodes_from_documents(documents)


Build the Indexes

We build both the sentence index, as well as the "base" index (with default chunk sizes).

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

base_index = VectorStoreIndex(base_nodes)


Step 2: Querying

With MetadataReplacementPostProcessor


We now use the MetadataReplacementPostProcessor to replace the sentence in each node with its surrounding context.


from llama_index.core.postprocessor import MetadataReplacementPostProcessor


query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)

window_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(window_response)


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.


window = window_response.source_nodes[0].node.metadata["window"]

sentence = window_response.source_nodes[0].node.metadata["original_text"]


print(f"Window: {window}")

print("------------------")

print(f"Original Sentence: {sentence}")




Contrast with normal VectorStoreIndex


query_engine = base_index.as_query_engine(similarity_top_k=2)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index.


query_engine = base_index.as_query_engine(similarity_top_k=5)

vector_response = query_engine.query(

    "What are the concerns surrounding the AMOC?"

)

print(vector_response)



Step 3: Analysis


So the SentenceWindowNodeParser + MetadataReplacementPostProcessor combo is the clear winner here.


Embeddings at a sentence level seem to capture more fine-grained details, like the word AMOC.


We can also compare the retrieved chunks for each index!


for source_node in window_response.source_nodes:

    print(source_node.node.metadata["original_text"])

    print("--------")


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!


Let's try and dissect why the naive vector index failed.


for node in vector_response.source_nodes:

    print("AMOC mentioned?", "AMOC" in node.node.text)

    print("--------")


So source node at index [2] mentions AMOC, but what did this text actually look like?

print(vector_response.source_nodes[2].node.text)




Step 4: Evaluation 

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.


We define/load an eval benchmark dataset and then run different evaluations over it.


WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.



from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset


from llama_index.llms.openai import OpenAI

import nest_asyncio

import random


nest_asyncio.apply()


num_nodes_eval = 30

# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)

sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)

# NOTE: run this if the dataset isn't already saved

# generate questions from the largest chunks (1024)

dataset_generator = DatasetGenerator(

    sample_eval_nodes,

    llm=OpenAI(model="gpt-4"),

    show_progress=True,

    num_questions_per_chunk=2,

)


eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()


eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")


eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")


import asyncio

import nest_asyncio


nest_asyncio.apply()


from llama_index.core.evaluation import (

    CorrectnessEvaluator,

    SemanticSimilarityEvaluator,

    RelevancyEvaluator,

    FaithfulnessEvaluator,

    PairwiseComparisonEvaluator,

)



from collections import defaultdict

import pandas as pd


# NOTE: can uncomment other evaluators

evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_s = SemanticSimilarityEvaluator()

evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))

evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))

# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))


from llama_index.core.evaluation.eval_utils import (

    get_responses,

    get_results_df,

)

from llama_index.core.evaluation import BatchEvalRunner


max_samples = 30


eval_qs = eval_dataset.questions

ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]


# resetup base query engine and sentence window query engine

# base query engine

base_query_engine = base_index.as_query_engine(similarity_top_k=2)

# sentence window query engine

query_engine = sentence_index.as_query_engine(

    similarity_top_k=2,

    # the target key defaults to `window` to match the node_parser's default

    node_postprocessors=[

        MetadataReplacementPostProcessor(target_metadata_key="window")

    ],

)


import numpy as np


base_pred_responses = get_responses(

    eval_qs[:max_samples], base_query_engine, show_progress=True

)

pred_responses = get_responses(

    eval_qs[:max_samples], query_engine, show_progress=True

)


pred_response_strs = [str(p) for p in pred_responses]

base_pred_response_strs = [str(p) for p in base_pred_responses]


evaluator_dict = {

    "correctness": evaluator_c,

    "faithfulness": evaluator_f,

    "relevancy": evaluator_r,

    "semantic_similarity": evaluator_s,

}

batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)


eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


base_eval_results = await batch_runner.aevaluate_responses(

    queries=eval_qs[:max_samples],

    responses=base_pred_responses[:max_samples],

    reference=ref_response_strs[:max_samples],

)


results_df = get_results_df(

    [eval_results, base_eval_results],

    ["Sentence Window Retriever", "Base Retriever"],

    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],

)

display(results_df)




References:

https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/

What is Document Summary Index in LlamaIndex?

The document summary index will extract a summary from each document and store that summary, as well as all nodes corresponding to the document.


Retrieval can be performed through the LLM or embeddings (which is a TODO). We first select the relevant documents to the query based on their summaries. All retrieved nodes corresponding to the selected documents are retrieved.


The steps involved are as below.


Step 1: Load Datasets

Load Wikipedia pages on different cities


city_docs = []

for wiki_title in wiki_titles:

    docs = SimpleDirectoryReader(

        input_files=[f"data/{wiki_title}.txt"]

    ).load_data()

    docs[0].doc_id = wiki_title

    city_docs.extend(docs)


Step 2: Build Document Summary Index 

There are two ways of building the index:


a. default mode of building the document summary index

b. customizing the summary query


from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")

splitter = SentenceSplitter(chunk_size=1024)



# default mode of building the index

response_synthesizer = get_response_synthesizer(

    response_mode="tree_summarize", use_async=True

)

doc_summary_index = DocumentSummaryIndex.from_documents(

    city_docs,

    llm=chatgpt,

    transformations=[splitter],

    response_synthesizer=response_synthesizer,

    show_progress=True,

)


doc_summary_index.get_document_summary("Boston")

doc_summary_index.storage_context.persist("index")



from llama_index.core import load_index_from_storage

from llama_index.core import StorageContext


# rebuild storage context

storage_context = StorageContext.from_defaults(persist_dir="index")

doc_summary_index = load_index_from_storage(storage_context)


Step 3: Performing retrieval from the Summary Index
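

A short sketch of querying the index (following the referenced example): the document summaries are used to pick the relevant documents first, and their nodes are then synthesized into an answer.

query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize", use_async=True
)
response = query_engine.query("What are the sports teams in Toronto?")
print(response)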


References:

https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/

Tuesday, February 18, 2025

What is PandasQueryEngine ?

PandasQueryEngine: convert natural language to Pandas python code using LLMs.

The input to the PandasQueryEngine is a Pandas dataframe, and the output is a response. The LLM infers dataframe operations to perform in order to retrieve the result.

Let's start on a Toy DataFrame

Here let's load a very simple dataframe containing city and population pairs, and run the PandasQueryEngine on it.

By setting verbose=True we can see the intermediate generated instructions.

import pandas as pd
from IPython.display import Markdown, display
from llama_index.experimental.query_engine import PandasQueryEngine

# Test on some sample data
df = pd.DataFrame(

    {

        "city": ["Toronto", "Tokyo", "Berlin"],

        "population": [2930000, 13960000, 3645000],

    }

)

query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query(

    "What is the city with the highest population?",

)

We can also take the step of using an LLM to synthesize a response.

query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)

response = query_engine.query(

    "What is the city with the highest population? Give both the city and population",

)

print(str(response))

Analyzing the Titanic DataSet 

df = pd.read_csv("./titanic_train.csv")

query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query(

    "What is the correlation between survival and age?",

)

display(Markdown(f"<b>{response}</b>"))

print(response.metadata["pandas_instruction_str"])

References:

https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/


What is Camelot Library for PDF extraction

Camelot is a Python library that makes it easy to extract tables from PDF files. It's particularly useful for PDFs where the tables are not easily selectable or copyable (e.g., PDFs with complex layouts; scanned PDFs need OCR first, see the limitations below). Camelot works by using a combination of image processing and text analysis to identify and extract table data.

Here's a breakdown of what Camelot does and why it's helpful:

Key Features and Benefits:

Table Detection: Camelot can automatically detect tables within a PDF, even if they aren't marked up as tables in the PDF's internal structure.   

Table Extraction: Once tables are detected, Camelot extracts the data from them and provides it in a structured format (like a Pandas DataFrame).   

Handles Different Table Types: It can handle various table formats, including tables with borders, tables without borders, and tables with complex layouts.   

Output to Pandas DataFrames: The extracted table data is typically returned as a Pandas DataFrame, making it easy to further process and analyze the data in Python.   

Command-Line Interface: Camelot also comes with a command-line interface, which can be useful for quick table extraction tasks.   
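

A minimal usage sketch (the file name is hypothetical):

import camelot

# extract tables from page 1 of a text-based PDF
tables = camelot.read_pdf("report.pdf", pages="1")

print(tables)                    # TableList with the detected tables
print(tables[0].parsing_report)  # accuracy / whitespace stats for the first table
df = tables[0].df                # the table as a Pandas DataFrame
tables.export("tables.csv", f="csv")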

How it Works (Simplified):


Image Processing: Camelot often uses image processing techniques to identify the boundaries of tables within the PDF. This is especially helpful for PDFs where the tables aren't readily discernible from the underlying PDF structure.   

Text Analysis: It analyzes the text content within the identified table regions to reconstruct the table structure and extract the data.   

When to Use Camelot:


PDFs with Non-Selectable Tables: If you're working with PDFs where you can't easily select or copy the table data, Camelot is likely the right tool.

Complex Table Layouts: When tables have complex formatting, borders, or spanning cells that make standard PDF text extraction difficult, Camelot can help.   

Automating Table Extraction: If you need to extract tables from many PDFs programmatically, Camelot provides a convenient way to do this.

Limitations:


Scanned PDFs: Camelot primarily works with text-based PDFs. It does not have built-in OCR (Optical Character Recognition) capabilities. If your PDF is a scanned image, you'll need to use an OCR library (like Tesseract) first to convert the image to text before you can use Camelot.

Accuracy: While Camelot is good at table detection and extraction, its accuracy can vary depending on the complexity of the PDF and the tables. You might need to adjust some parameters or do some manual cleanup in some cases.



In summary: Camelot is a valuable library for extracting table data from PDFs, particularly when the tables are difficult to extract using other methods.  It combines image processing and text analysis to identify and extract table data, providing it in a structured format that can be easily used in Python.  Keep in mind its limitations with scanned PDFs and the potential for some inaccuracies.


References:

Gemini

What are some of the advanced techniques for building production-grade RAG?

 Decoupling chunks used for retrieval vs. chunks used for synthesis

Structured Retrieval for Larger Document Sets

Dynamically Retrieve Chunks Depending on your Task

Optimize context embeddings



Decoupling Chunks Used for Retrieval vs. Chunks Used for Synthesis

Key Techniques#

There are two main ways to take advantage of this idea:

1. Embed a document summary, which links to chunks associated with the document.

This can help retrieve relevant documents at a high-level before retrieving chunks vs. retrieving chunks directly (that might be in irrelevant documents).

2. Embed a sentence, which then links to a window around the sentence.

This allows for finer-grained retrieval of relevant context (embedding giant chunks leads to “lost in the middle” problems), but also ensures enough context for LLM synthesis.


Structured Retrieval for Larger Document Sets

A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn't do well as the number of documents scales up - e.g. if you have 100 different PDFs. In this setting, given a query, you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, you can use structured information to ensure those two PDFs get returned, beyond raw embedding similarity with chunks.

Key Techniques#

1. Metadata Filters + Auto Retrieval Tag each document with metadata and then store in a vector database. During inference time, use the LLM to infer the right metadata filters to query the vector db in addition to the semantic query string.

Pros ✅: Supported in major vector dbs. Can filter document via multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also tags represent keyword search at the document-level, doesn’t allow for semantic lookups.

2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval Embed document summaries and map to chunks per document. Fetch at the document-level first before chunk level.

Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn’t allow for keyword lookups by structured tags (can be more precise than semantic search). Also autogenerating summaries can be expensive.

Dynamically Retrieve Chunks Depending on your Task

RAG isn't just about question-answering about specific facts, which top-k similarity is optimized for. There can be a broad range of queries that a user might ask. Queries that are handled by naive RAG stacks include ones that ask about specific facts e.g. "Tell me about the D&I initiatives for this company in 2023" or "What did the narrator do during his time at Google". But queries can also include summarization e.g. "Can you give me a high-level overview of this document", or comparisons "Can you compare/contrast X and Y". All of these use cases may require different retrieval techniques.

LlamaIndex provides some core abstractions to help you do task-specific retrieval. This includes the router module and the data agent module, as well as some advanced query engine modules and other modules that join structured and unstructured data.

You can use these modules to do joint question-answering and summarization, or even combine structured queries with unstructured queries.

Optimize Context Embeddings

This is related to the motivation described above in "decoupling chunks used for retrieval vs. synthesis". We want to make sure that the embeddings are optimized for better retrieval over your specific data corpus. Pre-trained models may not capture the salient properties of the data relevant to your use case.

Key Techniques#

Beyond some of the techniques listed above, we can also try finetuning the embedding model. We can actually do this over an unstructured text corpus, in a label-free way.

references:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/

Monday, February 17, 2025

When using PyMuPDF4LLM, LlamaIndex is one of the output options - what are the advantages of this?

When parsing a PDF and getting the result as LlamaIndex Documents, the primary advantage is the ability to seamlessly integrate the extracted information with other data sources and readily query it using a large language model (LLM) within the LlamaIndex framework. This allows for richer, more contextual responses and analysis compared to simply extracting raw text from a PDF alone; essentially, it enables you to build sophisticated knowledge-based applications by combining data from various sources, including complex PDFs, in a unified way.

Key benefits:

Contextual Understanding:

LlamaIndex can interpret the extracted PDF data within the broader context of other related information, leading to more accurate and relevant responses when querying. 

Multi-Source Querying:

You can easily query across multiple documents, including the parsed PDF, without needing separate data processing pipelines for each source. 

Advanced Parsing with LlamaParse:

LlamaIndex provides a dedicated "LlamaParse" tool specifically designed for complex PDF parsing, including tables and figures, which can be directly integrated into your workflow. 

RAG Applications:

By representing PDF data as LlamaIndex documents, you can readily build "Retrieval Augmented Generation" (RAG) applications that can retrieve relevant information from your PDF collection based on user queries. 

references:

Gemini 



Sunday, February 16, 2025

What are the main features for PyMuPDF4LLM?

PyMuPDF4LLM is built on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:

Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents

Multi-Column Pages

The text extraction can handle document layouts with multiple columns, meaning that "newspaper"-type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will be Markdown text with references to any images that may have been found in the document. The images will be saved to the location from where you run the Python script, and the Markdown will reference them with the correct Markdown image syntax.


Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.
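

For example, the fields of the first page chunk can be inspected like this (continuing from the output above):

first = output[0]
print(list(first.keys()))   # metadata, toc_items, tables, images, graphics, text
print(first["metadata"])    # metadata for this page
print(first["text"][:300])  # page content as Markdown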


LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")


With these three simple lines of code you will receive LlamaIndex document objects from the PDF file input for use with your LLM application!



What is Test-Time Scaling technique?

Test-Time Scaling (TTS) is a technique used to improve the performance of Large Language Models (LLMs) during inference (i.e., when the model is used to generate text or make predictions, not during training).  It works by adjusting the model's output probabilities based on the observed distribution of tokens in the generated text.   

Here's a breakdown of how it works:

Standard LLM Inference:  Typically, LLMs generate text by sampling from the probability distribution over the vocabulary at each step.  The model predicts the probability of each possible next token, and then a sampling strategy (e.g., greedy decoding, beam search, temperature sampling) is used to select the next token.   

The Problem:  LLMs can sometimes produce outputs that are repetitive, generic, or lack diversity.  This is partly because the model's probability distribution might be overconfident, assigning high probabilities to a small set of tokens and low probabilities to many others.   

Test-Time Scaling: TTS addresses this issue by introducing a scaling factor to the model's output probabilities.  This scaling factor is typically applied to the logits (the pre-softmax outputs of the model).

How Scaling Works: The scaling factor is usually less than 1.  When the logits are scaled down, the probability distribution becomes "flatter" or less peaked. This has the effect of:

Increasing the probability of less frequent tokens: This helps to reduce repetition and encourages the model to explore a wider range of vocabulary.

Reducing the probability of highly frequent tokens: This can help to prevent the model from getting stuck in repetitive loops or generating overly generic text.   

Adaptive Scaling (Often Used):  In many implementations, the scaling factor is adaptive.  It's adjusted based on the characteristics of the generated text so far.  For example, if the generated text is becoming repetitive, the scaling factor might be decreased further to increase diversity.  Conversely, if the text is becoming too random or incoherent, the scaling factor might be increased to make the distribution more peaked.
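

A tiny NumPy sketch of the effect (the logits are made up): scaling logits by a factor below 1 flattens the softmax distribution, shifting probability mass toward less frequent tokens.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])  # hypothetical next-token logits

print(softmax(logits))        # peaked: most of the mass on the first token
print(softmax(logits * 0.5))  # scaled down: flatter, more diverse sampling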

Benefits of TTS:

Improved Text Quality: TTS can lead to more diverse, creative, and less repetitive text generation.

Better Performance on Downstream Tasks: For tasks like machine translation or text summarization, TTS can improve the accuracy and fluency of the generated output.

In summary: TTS is a post-processing technique applied during inference. It adjusts the LLM's output probabilities to encourage more diverse and less repetitive text generation.  By scaling down the logits, the probability distribution is flattened, making it more likely for the model to choose less frequent tokens and avoid getting stuck in repetitive loops.  Adaptive scaling makes the process even more effective by dynamically adjusting the scaling factor based on the generated text.

references:

https://www.marktechpost.com/2025/02/13/can-1b-llm-surpass-405b-llm-optimizing-computation-for-small-llms-to-outperform-larger-models/


 


Saturday, February 15, 2025

What is LLM-as-a-judge and how does it compare to RAGAS?

The idea behind LLM-is-a-judge is simple – provide an LLM with the output of your system and the ground truth answer, and ask it to score the output based on some criteria.

The challenge is to get the judge to score according to domain-specific and problem-specific standards.

in other words, we needed to evaluate the evaluators!

First, we ran a sanity test – we used our system to generate answers based on ground truth context, and scored them using the judges.

This test confirmed that both judges behaved as expected: the answers, which were based on the actual ground truth context, scored high – both in absolute terms and in relation to the scores of running the system including the retrieval phase on the same questions. 

Next, we performed an evaluation of the correctness score by comparing it to the correctness score generated by human domain experts.

Our main focus was investigating the correlation between our various LLM-as-a-judge tools to the human-labeled examples, looking at trends rather than the absolute score values.

This method helped us deal with another risk – human experts can have a subjective perception of absolute score numbers. So instead of looking at the exact score they assigned, we focused on the relative ranking of examples.

Both RAGAS and our own judge correlated reasonably well to the human scores, with our own judge being better correlated, especially in the higher score bands

The results convinced us that our LLM-as-a-Judge offers a sufficiently reliable mechanism for assessing our system’s quality – both for comparing the quality of system versions to each other in order to make decisions about release candidates, and for finding examples which can indicate systematic quality issues we need to address.

references:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/

What are a couple of issues with RAGAS?

RAGAS covers a number of key metrics useful in LLM evaluation, including answer correctness (later renamed to “factual correctness”) and context accuracy via precision and recall.

RAGAS implements correctness tests by converting both the generated answer and the ground truth (reference) into a series of simplified statements.

The score is essentially a grade for the level of overlap between statements from reference vs. the generated answer, combined with some weight for overall similarity between the answers.

When eyeballing the scores RAGAS generated, we noticed two recurring issues:

For relatively short answers, every small “missed fact” results in significant penalties.

When one of the answers was more detailed than the other, the correctness score suffered greatly, despite both answers being valid and even useful.

The latter issue was common enough, and didn’t align with our intention for the correctness metric, so we needed to find a way to evaluate the “essence” of the answers as well as the details.

references:

https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/


Monday, February 10, 2025

What is n8n?

n8n is a distributed node-based workflow automation tool (https://n8n.io/). n8n workflows can be executed in standalone mode or in queue mode. Queue mode allows setting up a master-slave architecture with one main process and multiple worker processes, so that multiple workflows can be executed in parallel. In queue mode, the main process submits execution requests to a message broker and worker processes pick up those requests and execute them. Standalone mode doesn't have the separation of main process and worker process (they are bundled together) and can be used if parallelism is not required.

Component Description

n8n Main Process

This provides the Editor UI, internal APIs and workflow triggers.


Editor UI — This is the n8n user interface where we can manually configure the workflows, execute them, monitor the execution status, set up credentials, etc

Internal APIs — These are a set of APIs that allow CRUD operations on the workflows. n8n exposes REST APIs (https://docs.n8n.io/api/) for the same

Triggers — There are several trigger nodes available in n8n to trigger workflow execution (e.g., cron, interval, webhook, etc)

Redis Message Broker

This is where the queue is set up in high-availability mode. The main process submits workflow execution requests to the queue, and worker nodes pick up and execute those requests.


n8n Worker Processes

One or more worker processes can be set-up to execute the workflows. These components do the heavy-lifting and must be configured appropriately since they have direct impact on performance and scaling. Some of the key aspects are listed below.


Memory and CPU allocation

Workflow execution concurrency

Number of worker processes

PostgreSQL Database

Workflows, executions and credentials are stored in the database. A PostgreSQL DB can be used for this.


Sunday, February 9, 2025

What is Approximate Nearest Neighbour Search?

Approximate Nearest Neighbors (ANN) algorithms are a class of algorithms designed to find the approximate nearest neighbors to a query point in a high-dimensional space.  They don't guarantee finding the absolute nearest neighbors (hence "approximate"), but they do so much faster than exact nearest neighbor search, which becomes computationally very expensive in high dimensions.  This speedup is crucial for many applications, including:   

Information Retrieval: Finding similar documents, images, or other data points.   

Recommendation Systems: Recommending items similar to those a user has interacted with.   

Clustering: Preprocessing step for some clustering algorithms.

Anomaly Detection: Identifying data points that are significantly different from others.

Why Approximate?

The curse of dimensionality makes exact nearest neighbor search computationally prohibitive in high-dimensional spaces.  The number of possible neighbors grows exponentially with the number of dimensions, making exhaustive search impractical. ANN algorithms trade off a small amount of accuracy (by finding approximate neighbors) for a massive gain in speed.   

HNSW (Hierarchical Navigable Small World):

HNSW is a graph-based ANN algorithm. It constructs a hierarchical graph structure where each node represents a data point.  The graph is designed so that you can efficiently navigate from any node to its nearest neighbors. The hierarchy allows for faster searching by starting at a higher level and progressively moving down to the lower levels.   

How it works (simplified):

Graph Construction: Data points are connected to their nearest neighbors at multiple levels of the hierarchy. Connections at higher levels are longer-range, while connections at lower levels are shorter-range.   

Search: Given a query point, the search starts at the top level of the graph. It navigates to the nearest neighbor at that level. Then, it proceeds to the next level and repeats the process until it reaches the lowest level. The nodes visited at the lowest level are considered the approximate nearest neighbors.   

Advantages:  Excellent performance, relatively low memory usage, good for various distance metrics.

Used in: Milvus (a vector database), and other vector search libraries.   

FAISS (Facebook AI Similarity Search):

FAISS is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It provides implementations of many ANN algorithms, including several variations of Locality Sensitive Hashing (LSH), Product Quantization (PQ), and IVF (Inverted File Index).  It is highly optimized for performance and can handle very large datasets.   

Key Techniques used in FAISS:

Quantization: Reduces the size of vectors by mapping them to a smaller set of representative vectors.

Locality Sensitive Hashing (LSH): Uses hash functions to map similar vectors to the same buckets, increasing the probability of finding neighbors.   

Inverted File Index (IVF): Partitions the data into clusters and builds an index on these clusters, enabling faster search by only considering relevant clusters.   

Advantages:  Highly optimized for performance, supports various ANN algorithms, handles large datasets.

Used in: Many large-scale similarity search applications.
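

A small FAISS sketch with random vectors, comparing an exact flat index with an approximate HNSW index (dimensions and sizes are arbitrary):

import numpy as np
import faiss

d = 128  # vector dimensionality
xb = np.random.random((10000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")      # query vectors

# exact search baseline
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# approximate HNSW index (32 graph neighbors per node)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

D, I = hnsw.search(xq, 5)  # distances and ids of the 5 approximate nearest neighbors
print(I)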



Wednesday, February 5, 2025

What would be a fair comparison of various PDF parsers available in the market now?


Key Considerations and Recommendations:

For general-purpose, high-performance PDF processing (including complex PDFs): PyMuPDF (fitz) is an excellent choice. Its speed and ability to handle complex layouts make it a strong contender.

For modifying or manipulating PDFs (merging, splitting, etc.): pikepdf and pdfrw are your best bets. pikepdf is generally preferred for its ease of use. pdfrw is more low-level and powerful.

For simple text and metadata extraction from relatively straightforward PDFs: PyPDF2 is a decent option. It's pure Python, so no external dependencies. However, it may not be as robust as PyMuPDF or PDFMiner.six for complex PDFs.

For extracting data from tables in PDFs: pdfplumber is specifically designed for this and does an excellent job.

For robust text and data extraction, particularly when you need more control: PDFMiner.six is a solid choice.

Which one to choose?


Most common use case (text extraction from various PDFs): PyMuPDF (fitz)

PDF Manipulation: pikepdf

Table Extraction: pdfplumber

Simple PDF text extraction (pure Python): PyPDF2 (but be aware of its limitations)

Remember to install the library you choose using pip install <library_name>.  For PyMuPDF, you'll likely need to install the pre-built wheels for your platform to avoid compilation issues.  Refer to the library's documentation for installation instructions.
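

Two quick sketches of the most common cases (file names are hypothetical): plain text extraction with PyMuPDF and table extraction with pdfplumber.

import fitz  # PyMuPDF
import pdfplumber

# plain text extraction with PyMuPDF
doc = fitz.open("sample.pdf")
text = "\n".join(page.get_text() for page in doc)

# table extraction with pdfplumber
with pdfplumber.open("sample.pdf") as pdf:
    table = pdf.pages[0].extract_table()  # list of rows, or None if no table found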


What is MSR-VTT (Microsoft Research Video to Text)

 MSR-VTT (Microsoft Research Video to Text) is a benchmark dataset developed by Microsoft Research for video captioning and retrieval tasks. It is widely used in computer vision and natural language processing (NLP) to evaluate models that generate textual descriptions from videos or retrieve videos based on textual queries.


Key Features of MSR-VTT

Large-Scale Dataset:


Contains 10,000 video clips.

Covers 257 video categories (e.g., music, news, sports, gaming).

Rich Annotations:


Each video is annotated with 20 natural language descriptions.

A total of 200,000 captions describing video content.

Diverse Video Content:


Extracted from real-world sources like YouTube.

Covers a wide range of topics (e.g., entertainment, education, sports, music).

Benchmark for Video Captioning & Retrieval:


Used for training and evaluating models in:

Video-to-Text Generation (automatic captioning).

Text-to-Video Retrieval (finding relevant videos from text queries).

Use Cases

Training AI models for automated video captioning.

Video search and retrieval using textual queries.

Improving multimodal AI systems that process both visual and textual data.

Benchmarking video understanding models in NLP and computer vision research.

Challenges in MSR-VTT

Complex Video Semantics: Understanding actions, objects, and scene context in videos.

Natural Language Variability: Different ways of describing the same video.

Multimodal Learning: Combining visual, audio, and textual information effectively.


What are main Parts of Diffusion Model

Diffusion Model Overview

Take a look at our diffusion architecture main components.


The main part is a 3D U-Net, which is good at working with videos because they have frames that change over time. This U-Net isn't just simple layers; it also uses attention. Attention helps the model focus on important parts of the video. Temporal attention looks at how frames relate to each other in time, and spatial attention focuses on the different parts of each frame. These attention layers, along with special blocks, help the model learn from the video data. Also, to make the model generate videos that match the text prompt we provide, we use text embeddings.

The model works using something called diffusion. Think of it like this: first, we add noise to the training videos until they're just random noise. Then, the model learns to undo this noise. To generate a video, it starts with pure noise and gradually removes the noise using the U-Net, with the text embeddings that we provided as a guide. Important steps here include adding noise and then removing it. The text prompt is converted to embeddings using BERT and passed to the U-Net, enabling it to generate videos from text. By doing this again and again, we get a video that matches the text we gave, which lets us make videos from words.

Instead of looking at the original complex diagram, let’s visualize a simpler and easier architecture diagram that we will be coding.




Let’s read through the flow of our architecture that we will be coding:


Input Video: The process begins with a video that we want to learn from or use as a starting point.

UNet3D Encoding: The video goes through the UNet3D’s encoder, which is like a special part of the system that shrinks the video and extracts important features from it.

UNet3D Bottleneck Processing: The shrunken video features are then processed in the UNet3D’s bottleneck, the part that has the smallest spatial dimensions of the video in feature map.

UNet3D Decoding: Next, the processed features are sent through the UNet3D’s decoder, which enlarges the features back to a video, adding the learned structure.

Text Conditioning: The provided text prompt, used to guide the generation, is converted into a text embedding, which provides input to the UNet3D at various points, allowing the video to be generated accordingly.

Diffusion Process: During training, noise is added to the video, and the model learns to remove it. During video generation, we start with noise, and the model uses the UNet3D to gradually remove the noise, generating the video.

Output Video: Finally, we get the output video that is generated by the model and is based on the input video or noise and the text provided as a prompt.
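

As a rough illustration of the "adding noise" half of the diffusion process, here is a minimal PyTorch sketch of the forward noising step on a video tensor; the shapes and the noise schedule are made up, and the referenced repo remains the authoritative implementation.

import torch

def q_sample(x0, t, alphas_cumprod):
    # add noise to a clean video batch x0 at timestep t (forward diffusion)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)  # broadcast over (B, C, T, H, W)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise

betas = torch.linspace(1e-4, 0.02, 1000)             # hypothetical noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 3, 8, 32, 32)                    # batch of 2 tiny videos
t = torch.randint(0, 1000, (2,))
x_t, noise = q_sample(x0, t, alphas_cumprod)
# during training, the 3D U-Net is asked to predict `noise` from (x_t, t, text embedding)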

references:

https://levelup.gitconnected.com/building-a-tiny-text-to-video-model-from-scratch-using-python-f31bdab12fbb

https://github.com/FareedKhan-dev/text2video-from-scratch?source=post_page-----f31bdab12fbb--------------------------------



Sunday, February 2, 2025

What is Playwright and PlayWright MCP Server ?

 Playwright is a powerful open-source framework for automating web browsers (Chromium, Firefox, and WebKit). It allows you to control browsers programmatically, which is incredibly useful for tasks like:   


End-to-end testing: Testing the full user flow of a web application, simulating real user interactions.   

Web scraping: Extracting data from websites.   

Automation of web tasks: Automating repetitive actions on websites.   

Playwright supports multiple programming languages, including JavaScript/TypeScript, Python, Java, and C#.  It provides a consistent API across these languages, making it easy to switch between them if needed.   


Key Features of Playwright:


Cross-browser support: Works with Chromium, Firefox, and WebKit.   

Auto-waiting: Playwright intelligently waits for elements to be ready before interacting with them, reducing the need for explicit waits.   

Resilient execution: Playwright handles flaky tests and automatically retries actions.   

Powerful selectors: Supports CSS and XPath selectors for precise element targeting.   

Debugging tools: Provides excellent debugging capabilities, including browser contexts, tracing, and video recording.   

Headless and headed modes: Can run browsers in headless mode (without a visible window) for faster execution or in headed mode for visual debugging.   
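

A minimal Python example using Playwright's sync API (the URL is just a placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # headless for speed; set False to watch it
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()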

Playwright MCP Server (Microsoft Playwright Server):


The Playwright MCP Server, often referred to as the browser server, is a service that manages the browsers that Playwright controls. It acts as a central point for communication between your Playwright scripts and the actual browser instances.   

Why is it used?

Remote Execution: The MCP Server enables you to run your Playwright tests or scripts on a different machine than where your code is running. This is useful for distributed testing or running tests in a cloud environment.   

Browser Management: The server handles the launching and closing of browsers, ensuring that they are properly managed. This is especially helpful when running many tests concurrently.

Scalability: The server can be scaled to handle a large number of concurrent browser sessions, making it suitable for large projects or continuous integration/continuous deployment (CI/CD) pipelines.

In summary:

Playwright: The framework that provides the API for controlling browsers.   

Playwright MCP Server: A service that manages the browsers and facilitates remote execution and scaling.   

You don't always need the MCP Server. If you're running simple Playwright scripts locally and don't need remote execution or advanced browser management, you can run Playwright directly without the server. However, for more complex scenarios, especially in testing environments, the MCP Server becomes very valuable.


Thursday, January 23, 2025

What is the wrk load testing tool?

wrk is a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU.

brew install wrk

wrk -t12 -c400 -d30s --latency http://127.0.0.1:8080/index.html

This runs a benchmark for 30 seconds, using 12 threads, and keeping 400 HTTP connections open.

Running 30s test @ http://localhost:8080/index.html

  12 threads and 400 connections

  Thread Stats   Avg      Stdev     Max   +/- Stdev

    Latency   635.91us    0.89ms  12.92ms   93.69%

    Req/Sec    56.20k     8.07k   62.00k    86.54%

Latency Distribution

  50% 250.00us

  75% 491.00us

  90% 700.00us

  99% 5.80ms  

22464657 requests in 30.00s, 17.76GB read

Requests/sec: 748868.53

Transfer/sec:    606.33MB


You can see the average latency of the requests and, most importantly (with the parameter --latency), the latency quantiles.



It's also possible, with the parameter -s, to pass a Lua script: wrk supports executing LuaJIT scripts, which is fantastic for scripting our benchmark tests.



-- init random
math.randomseed(os.time())

-- the request function that will run at each request
request = function()
   -- define the path that will search for q=%v (%v being a random
   -- number between 0 and 1000)
   url_path = "/somepath/search?q=" .. math.random(0,1000)

   -- if we want to print the path generated
   -- print(url_path)

   -- return the request object with the current URL path
   return wrk.format("GET", url_path)
end

wrk -c1 -t1 -d5s -s ./my-script.lua --latency http://localhost:8000

wrk runs in 3 phases:

setup

running

done

Setup

The setup phase begins after the target IP address has been resolved and all threads have been initialized but not yet started.

At this point you can change, for each thread being executed, the addresses found during the IP resolution phase. You are able to manipulate:

thread.addr : get or set the thread's server address

thread:get(name) : get the value of a global in the thread's env

thread:set(name, value) : set the value of a global in the thread's env

thread:stop() : stop the thread

setup = function(thread)   

   -- code

end


Running

The most useful one.

The running phase is split into 3 other steps:

init

request

response


function init(args) 

Wednesday, January 22, 2025

What are the main things to take care of when developing a FastAPI application, and how?

1. Log the critical parameters: Client IP, Timestamp etc. for monitoring and analytics

2. Authentication/Authorization of requests 

3. Rate Limiter 

4. Any domain/customer-specific restrictions

5. Cached entry? Return from cache

6. Add Gateway security Header 

7. Add response headers for OWASP 

8. Avoid caching 


In general below are some of the main features of API gateways 

Traffic Routing

Load balancing

Rate Limiting

Caching

Authentication and Authorization

Preventing unnecessary requests from landing on application servers

Central Logging

Consolidated security audit


Some of the relevant headers are below:

 # OWASP Secure Headers https://owasp.org/www-project-secure-headers/

        modified_headers['X-XSS-Protection'] = '1; mode=block'

        modified_headers['X-Frame-Options'] = 'DENY'

        modified_headers['Strict-Transport-Security'] = 'max-age=63072000; includeSubDomains'

        modified_headers['X-Content-Type-Options'] = 'nosniff'


# Avoid Caching Tokens

        modified_headers['Expires'] = '0'

        modified_headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'

        modified_headers['Pragma'] = 'no-cache'
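

A minimal sketch of wiring these headers into a FastAPI app via an HTTP middleware (header values taken from the snippet above):

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_security_headers(request: Request, call_next):
    response = await call_next(request)
    # OWASP secure headers
    response.headers["X-XSS-Protection"] = "1; mode=block"
    response.headers["X-Frame-Options"] = "DENY"
    response.headers["Strict-Transport-Security"] = "max-age=63072000; includeSubDomains"
    response.headers["X-Content-Type-Options"] = "nosniff"
    # avoid caching tokens
    response.headers["Expires"] = "0"
    response.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
    response.headers["Pragma"] = "no-cache"
    return response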


What is a gateway authentication token?

"gateway-jwt-token" likely refers to a JSON Web Token (JWT) used for authentication and authorization within an API Gateway system.   

Here's a breakdown:

API Gateway: An API Gateway is a server that acts as a single entry point for a set of microservices or backend services. It handles tasks like authentication, authorization, rate limiting, and request routing.   

JWT:

A compact and self-contained way to securely transmit information between parties as a JSON object.   

It consists of three parts:

Header: Contains metadata about the token (e.g., algorithm used).   

Payload: Contains claims about the entity (e.g., user ID, roles, permissions).   

Signature: Ensures the integrity and authenticity of the token.   

"gateway-jwt-token"

This is likely a placeholder or a specific naming convention for the JWT issued by the API Gateway.

It might be used in:

API documentation: To describe the authentication mechanism.

Code examples: To illustrate how to obtain and use the token in client applications.

Configuration files: To configure the API Gateway to issue and validate JWTs.

In essence:

The "gateway-jwt-token" represents the mechanism by which the API Gateway authenticates and authorizes requests. Clients must present a valid JWT in their requests to access protected resources. The API Gateway verifies the token's authenticity and extracts relevant information (e.g., user roles) to determine access control.   

Key Considerations:

Security: Ensure that the JWTs are properly signed and encrypted to prevent tampering.   

Token Management: Implement proper token issuance, expiration, and revocation mechanisms.

Integration: Integrate the JWT authentication mechanism with other security measures (e.g., rate limiting, IP whitelisting).



References:

Gemini


What is NotebookLM from Google?



NotebookLM is an AI-powered research and writing tool from Google that helps users organize and refine their ideas. It's an experimental product from Google Labs that's designed to improve how users interact with documents and notes. 

Features:

Summarize and extract information: NotebookLM can summarize and extract information from complex and dense sources. 

Add sources: Users can upload PDFs, Google Docs, Google Slides, website URLs, and more. 

Collaborate: Users can collaborate with NotebookLM to refine and organize their ideas. 

Share notebooks: Users can share their notebooks with others so they can ask questions and access sources. 

Audio Overviews: Users can interact with AI hosts during Audio Overviews. 

Versions:

NotebookLM: The personal version of NotebookLM 

NotebookLM Plus: A premium version with higher utilization limits, premium features, and additional sharing options and analytics 

NotebookLM Enterprise: An enterprise-ready version of NotebookLM that includes enterprise grade security and compliance 

You can use NotebookLM to: 

Read, take notes, and collaborate with NotebookLM

Summarize and extract information from complex sources

Create an "everything notebook" for general brainstorming

Create specialized notebooks for specific topics or projects



References:

https://blog.google/technology/ai/notebooklm-google-ai/


Tuesday, January 21, 2025

What is PrivateLink in AWS and its advantages?

In AWS, "PrivateLink" refers to a feature that allows you to privately connect your VPC (Virtual Private Cloud) to various AWS services and resources hosted by other AWS accounts, without exposing your network traffic to the public internet, essentially acting as a secure, private gateway between your VPC and other services within the AWS network; you can access these services using private IP addresses within your VPC, eliminating the need for public IP addresses or going through the public internet. 

Key points about PrivateLink:

Private connectivity:

The primary benefit is that all communication happens within the AWS private network, ensuring secure data transfer. 

VPC endpoints:

To connect to services via PrivateLink, you create "VPC endpoints" within your VPC which act as entry points to access the desired service. 

Service providers:

You can also expose your own services hosted in your VPC as "VPC endpoint services" to other AWS accounts, allowing them to access your services privately. 

No public IP needed:

You can access services using private IP addresses within your VPC, eliminating the need for public IP addresses. 
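As a rough sketch of the VPC endpoint point above, an interface endpoint can be created programmatically with boto3. All IDs and the service name below are placeholders.

import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs; replace with your own VPC, subnet, and security group
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.sqs",   # the AWS service to reach privately
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])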

Inter-container traffic security

To encrypt inter-container traffic, you can leverage technologies like TLS/SSL certificates, dedicated container network overlays with encryption capabilities, cloud-based key management services, and sidecar containers that handle encryption/decryption, ensuring secure communication between containers within a cluster. 

Key approaches to inter-container traffic encryption:

TLS/SSL termination:

Use standard TLS/SSL certificates at the container network level to encrypt communication between containers, similar to how web traffic is secured. 

Container network overlays with encryption:

Utilize container networking solutions like Cilium, Calico, or Flannel that offer built-in encryption capabilities for inter-container traffic. 

Sidecar containers:

Deploy dedicated sidecar containers alongside your application containers to handle encryption and decryption of data in transit, providing a dedicated layer of security. 

Cloud-managed key management services:

Leverage cloud providers like AWS KMS or Azure Key Vault to manage encryption keys centrally, ensuring proper key rotation and access control. 

Important considerations when implementing inter-container encryption:

Performance impact:

Encryption adds overhead, so carefully evaluate the performance implications of your chosen encryption method, especially in high-traffic scenarios. 

Key management and rotation:

Establish robust key management practices to securely store, rotate, and distribute encryption keys across your container environment. 

Network segmentation:

Combine encryption with network segmentation to further isolate container traffic and mitigate potential security risks. 

Application compatibility:

Ensure your applications are designed to handle encrypted communication, including proper certificate management and handling of encrypted data streams. 
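To make the TLS/SSL approach above concrete, here is a minimal mutual-TLS server sketch in Python. The certificate paths are placeholders that would typically be mounted into the container as secrets; a real deployment would usually rely on a service mesh or network overlay rather than hand-rolled sockets.

import socket
import ssl

# Require certificates on both sides (mTLS) so only trusted containers can connect
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="/certs/server.crt", keyfile="/certs/server.key")
context.load_verify_locations(cafile="/certs/ca.crt")
context.verify_mode = ssl.CERT_REQUIRED

with socket.create_server(("0.0.0.0", 8443)) as sock:
    with context.wrap_socket(sock, server_side=True) as tls_sock:
        conn, addr = tls_sock.accept()       # TLS handshake happens here
        print("encrypted connection from", addr)
        conn.sendall(b"hello over TLS\n")
        conn.close()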


References:

Gemini 

What is Variant Weight in SageMaker?

You can have multiple Production Variants behind an Amazon SageMaker endpoint. Each production variant has an initial variant weight and based on the ratio of each variant weight to the total sum of weights, SageMaker can distribute the calls to each of the models. For example, if you have only one production variant with a weight of 1, all traffic will go to this variant. If you add another production variant with an initial weight of 2, the new variant will get 2/3 of the traffic and the first variant will get 1/3.
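As a sketch, the weights are set via InitialVariantWeight when creating the endpoint configuration with boto3. The model names, instance type, and config name below are placeholders; with weights 1 and 2, traffic splits 1/3 vs 2/3 as described above.

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "variant-a",
            "ModelName": "model-a",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.large",
            "InitialVariantWeight": 1.0,   # gets 1/3 of traffic
        },
        {
            "VariantName": "variant-b",
            "ModelName": "model-b",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.large",
            "InitialVariantWeight": 2.0,   # gets 2/3 of traffic
        },
    ],
)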

references:

https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html


Kubernetes alternatives to run locally

Here is a quick comparison of the most common options for running Kubernetes locally.




Minikube (Lightweight Kubernetes for Local Development)

✅ Best for: Beginners, lightweight local development, single-node clusters

🚀 Runs on: macOS, Windows, Linux


Pros

Easy to set up and lightweight

Supports multiple Kubernetes versions

Works with Docker as a backend

⚠️ Cons

Not a full multi-node cluster

Not ideal for production testing


Kind (Kubernetes in Docker)

✅ Best for: Running multi-node clusters in containers

🚀 Runs on: macOS, Windows, Linux


Pros

Runs full Kubernetes in Docker containers

Lightweight and fast

Supports multi-node clusters

⚠️ Cons


Needs Docker installed

Not as feature-rich as full Kubernetes

K3s (Lightweight Kubernetes)

✅ Best for: Edge computing, lightweight Kubernetes clusters

🚀 Runs on: macOS (via Multipass), Linux, Raspberry Pi


Pros

Lightweight, optimized for small clusters

Fast and uses fewer resources

⚠️ Cons

Requires VM on macOS

Not as feature-rich as full Kubernetes


Docker Desktop Kubernetes

✅ Best for: Developers already using Docker Desktop

🚀 Runs on: macOS, Windows


Pros

No additional tools needed if you use Docker Desktop

Simple and fast

⚠️ Cons

Not customizable

Uses more system resources


Rancher Desktop (Alternative to Docker Desktop)

✅ Best for: Developers who want an open-source alternative to Docker Desktop

🚀 Runs on: macOS, Windows, Linux


Pros

Fully open-source

No Docker dependency

⚠️ Cons

Requires manual configuration



Final Recommendation

For Beginners: 🚀 Minikube

For Multi-Node Clusters: 🔥 Kind

For Edge Devices & IoT: 🛠 K3s

For Docker Users: 🐳 Docker Desktop

For Open-Source Fans: 🏡 Rancher Desktop



Thursday, January 16, 2025

What is Langchain Agent-Inbox?


The Agent Inbox lets you view and respond to interrupts generated from a LangGraph application.




from typing import Literal, Optional, TypedDict, Union


class HumanInterruptConfig(TypedDict):
    """Which actions the human is allowed to take on this interrupt."""
    allow_ignore: bool
    allow_respond: bool
    allow_edit: bool
    allow_accept: bool


class ActionRequest(TypedDict):
    """The action the agent proposes to take, with its arguments."""
    action: str
    args: dict


class HumanInterrupt(TypedDict):
    """The payload sent to the Agent Inbox."""
    action_request: ActionRequest
    config: HumanInterruptConfig
    description: Optional[str]


class HumanResponse(TypedDict):
    """The payload the Agent Inbox sends back."""
    type: Literal["accept", "ignore", "response", "edit"]
    args: Union[None, str, ActionRequest]

The human interrupt schema is used to define the types of interrupts, and what actions can be taken in response to each interrupt. We've landed on four types of actions:


accept: Accept the interrupt's arguments, or action. Will send an ActionRequest in the args field on HumanResponse. This ActionRequest will be the exact same as the action_request field on HumanInterrupt, but with all keys of the args field on ActionRequest converted to strings.

edit: Edit the interrupt's arguments. Sends an instance of ActionRequest in the args field on HumanResponse. The args field on ActionRequest will have the same structure as the args field on HumanInterrupt, but the values of the keys will be strings, and will contain any edits the user has made.

response: Send a response to the interrupt. Does not require any arguments. Will always send back a single string in the args field on HumanResponse.

ignore: Ignore the interrupt's arguments, or action. Returns null for the args field on HumanResponse.

You can set any combination of these actions in the config field on HumanInterrupt.


At the moment, you're required to pass a list of HumanInterrupt objects to the interrupt function; however, the UI is currently limited to rendering only the first object in the list. (We are open to suggestions for how to improve this schema, so if you have feedback, please reach out to discuss!) The same goes for the HumanResponse, which the Agent Inbox will always send back as a list containing a single HumanResponse object.
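A minimal usage sketch, assuming a recent LangGraph release that exposes interrupt from langgraph.types and the TypedDicts defined above; the action name and arguments are hypothetical:

from langgraph.types import interrupt  # assumes a recent LangGraph release

# Inside a graph node (running under a checkpointer): ask a human to review an action
request: HumanInterrupt = {
    "action_request": {"action": "send_email", "args": {"to": "user@example.com"}},
    "config": {
        "allow_ignore": True,
        "allow_respond": True,
        "allow_edit": True,
        "allow_accept": True,
    },
    "description": "Please review the drafted email before it is sent.",
}

# The Agent Inbox expects a list and resumes with a list holding one HumanResponse
responses = interrupt([request])
human_response: HumanResponse = responses[0]
if human_response["type"] == "accept":
    pass  # proceed with the (stringified) args in human_response["args"]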


Tuesday, January 14, 2025

OpenSPG in AI

OpenSPG is a knowledge graph engine developed by Ant Group in collaboration with OpenKG, based on the SPG (Semantic-enhanced Programmable Graph) framework, which is a summary of Ant Group's years of experience in constructing and applying diverse domain knowledge graphs in the financial scenarios.


SPG (Semantic-enhanced Programmable Graph) is a semantic representation framework based on the property graph model, distilled from years of the Ant Knowledge Graph platform supporting business in the financial field. It creatively integrates LPG structure with RDF semantics, overcoming the problem that RDF/OWL semantic complexity cannot be industrially landed, while fully inheriting LPG's structural simplicity and compatibility with big data systems. The framework defines and represents knowledge semantics from three aspects. First, SPG explicitly defines a formal representation and programmable framework for "knowledge", so that it can be defined, programmed, understood, and processed by machines. Second, SPG achieves compatibility and progressive advancement between knowledge levels, supporting the construction of knowledge graphs and the continuous iterative evolution of incomplete data states in industrial-level scenarios. Finally, SPG serves as an effective bridge between big data and AI technology systems, facilitating the efficient transformation of massive data into knowledge-based insights and thereby enhancing the value and application potential of the data. With the SPG framework, graph data can be constructed and managed more efficiently while better supporting business requirements and application scenarios. Because the framework is scalable and flexible, new business scenarios can quickly build their own domain models and solutions by extending the domain knowledge model and developing new operators.


references:

https://github.com/OpenSPG/openspg?tab=readme-ov-file

https://spg.openkg.cn/en-US


OpenSPG in Physics!

 What is OpenSPG


OpenSPG refers to Open Source Physics Graphics, a specialized software library or framework designed for high-performance physics simulations and visualizations. Its primary goal is to combine computational efficiency with advanced graphical rendering, enabling developers, researchers, and engineers to model and visualize complex physics phenomena.


Although there is limited information about a specific project named "OpenSPG" in mainstream open-source communities as of my knowledge cutoff, the name may refer to specialized or niche software, potentially in the following contexts:


Possible Interpretations of OpenSPG

Physics Simulation Framework:


Designed for simulating dynamic systems in areas such as mechanics, thermodynamics, electromagnetism, or fluid dynamics.

May integrate physics engines like Bullet or PyBullet for robust and accurate simulations.

Graphics-Focused Platform:


Provides APIs for rendering high-quality 2D/3D physics models, often used in educational tools, research, or entertainment.

Integrates with graphics libraries like OpenGL, Vulkan, or DirectX.

Educational or Research Tool:


May be tailored for academic purposes, offering pre-built modules to study concepts of classical or quantum physics.

Open Source Collaboration:


Likely an open-source initiative aimed at democratizing access to tools for physics modeling and visualization, fostering innovation in the field.

Core Features (Hypothetical)

If OpenSPG is similar to other physics simulation and graphics platforms, its features might include:


Cross-Platform Support: Works on various operating systems like Windows, macOS, and Linux.

Integration with Machine Learning: Used for AI-driven optimization of simulations or training reinforcement learning models in simulated environments.

High Performance: Exploits modern GPU acceleration for real-time rendering and computation.

Ease of Use: Includes user-friendly APIs and GUI tools for creating simulations without needing extensive programming expertise.


references:

OpenAI 

Monday, January 13, 2025

What is the difference between the ray[air] and ray[default] packages?

The difference between ray[air] and ray[default] lies in the set of dependencies they install when you set up the Ray library. These options allow users to customize their installation to suit their specific needs, minimizing unnecessary dependencies.

1. ray[default]

What it includes:

This is the default installation, which contains the core Ray functionality and basic dependencies required to use Ray for distributed computation.

Use case:

Ideal for general-purpose distributed applications, task orchestration, and workflows.

Provides the base components needed for Ray core features like task scheduling, actor management, and Ray's Python API.

Dependencies:

Core Ray libraries only (e.g., task scheduling and execution).

Install Command:

pip install ray[default]

2. ray[air]

What it includes:

The ray[air] installation includes all dependencies required for Ray AI Runtime (AIR), which is Ray's unified framework for machine learning workflows.

Use case:

Best suited for machine learning and AI-related tasks, such as training, hyperparameter tuning, and serving models.

Adds support for Ray Train, Ray Tune, Ray Serve, and Ray Data, which are specialized components of Ray for ML/AI workflows.

Dependencies:


Includes additional libraries such as:

Integrations for ML frameworks such as torch and tensorflow for training models (the frameworks themselves are typically installed separately).

scikit-learn for machine learning utilities.

Libraries for handling large datasets and streaming data efficiently.

Ray Serve dependencies for model deployment and serving.

Install Command:

pip install ray[air]

When to Use Which?

Use Case → Recommended Installation

Distributed task orchestration → ray[default]

General distributed computing → ray[default]

Machine learning workflows → ray[air]

Model training and hyperparameter tuning → ray[air]

Model serving and deployment → ray[air]

Handling large-scale data processing → ray[air]


Summary

ray[default]: Lightweight, general-purpose distributed computing.

ray[air]: Focused on ML/AI workflows with added dependencies for training, tuning, and serving models.
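A minimal sketch of what each installation enables, assuming Ray 2.x: ray[default] is enough for the remote-task part, while the Ray Tune part relies on the additional libraries pulled in by ray[air] (or ray[tune]).

import ray
from ray import tune

ray.init()

# Core Ray (ray[default]): distributed tasks via @ray.remote
@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

# Ray AIR component (ray[air]): hyperparameter tuning with Ray Tune
def objective(config):
    return {"score": (config["x"] - 3) ** 2}

tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3, 4])})
results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)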


Friday, January 10, 2025

What is the SimCSE embedding model?

SimCSE can be used for enhanced semantic search with sentence embeddings, clustering, or other tasks depending on the data.

A data-specific embedding function like this can improve similarity search compared to a generic embedding model (e.g., OpenAIEmbedding).

SimCSE (Simple Contrastive Sentence Embeddings) is a powerful framework for creating high-quality sentence embeddings. It is based on a contrastive learning approach and trains embeddings by minimizing the distance between embeddings of augmented sentence pairs and maximizing the distance between embeddings of different sentences. There are multiple SimCSE-based models, both unsupervised and supervised, developed using pre-trained language models as a base.


Common SimCSE-Based Embedding Models:

SimCSE-Unsupervised Models:


These models generate sentence embeddings using contrastive learning on unlabeled data.

They rely on dropout noise as the augmentation mechanism, meaning the same sentence is passed through the model twice with different dropout masks.

Examples:

bert-base-uncased with SimCSE

roberta-base with SimCSE

distilbert-base-uncased with SimCSE

SimCSE-Supervised Models:


These models are trained on datasets with labeled pairs (e.g., sentence pairs with known similarity scores) such as the Natural Language Inference (NLI) dataset.

They use contrastive learning to align embeddings of semantically similar sentence pairs and separate dissimilar ones.

Examples:

bert-large-nli-stsb trained with SimCSE

roberta-large-nli-stsb trained with SimCSE

xlnet-base-nli-stsb with SimCSE

Multilingual SimCSE Models:


Adapt SimCSE for multilingual text by using multilingual pre-trained models like mBERT or XLM-R.

Examples:

xlm-roberta-base with SimCSE

bert-multilingual with SimCSE

Domain-Specific SimCSE Models:


SimCSE fine-tuned on domain-specific corpora to create embeddings specialized for particular tasks or fields.

Examples:

Legal text embeddings

Biomedical text embeddings (e.g., using BioBERT or SciBERT)

Key Features of SimCSE Models:

They leverage existing pre-trained language models like BERT, RoBERTa, or DistilBERT as the backbone.

SimCSE embeddings are highly effective for tasks like semantic similarity, text clustering, and information retrieval.

Supervised SimCSE models generally outperform unsupervised ones due to task-specific labeled data.

Impact of SimCSE on Retrieval and Similarity Search:

High-Quality Embeddings: SimCSE embeddings capture fine-grained semantic nuances, improving retrieval accuracy.

Efficient Similarity Search: Enhanced embeddings lead to better clustering and ranking in similarity-based searches.

Domain Adaptability: By fine-tuning SimCSE models on specific domains, the performance for domain-specific retrieval improves significantly.

For implementations, you can find pre-trained SimCSE models on platforms like Hugging Face, or you can train your own model using the SimCSE framework available in the official GitHub repository.
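For example, a minimal sketch of computing SimCSE embeddings with the transformers library, assuming the publicly released princeton-nlp/sup-simcse-bert-base-uncased checkpoint; using the pooler output as the sentence embedding is one common choice:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell today.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).pooler_output  # one vector per sentence

# Cosine similarity between the first sentence and the other two
print(F.cosine_similarity(embeddings[0:1], embeddings[1:]))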



references:

OpenAI 

Thursday, January 9, 2025

What are the transports in MCP?

MCP currently defines two standard transport mechanisms for client-server communication:

stdio, communication over standard in and standard out

HTTP with Server-Sent Events (SSE)

Clients SHOULD support stdio whenever possible.

In the stdio transport:

The client launches the MCP server as a subprocess.

The server receives JSON-RPC messages on its standard input (stdin) and writes responses to its standard output (stdout).

Messages are delimited by newlines, and MUST NOT contain embedded newlines.

The server MAY write UTF-8 strings to its standard error (stderr) for logging purposes. Clients MAY capture, forward, or ignore this logging.

The server MUST NOT write anything to its stdout that is not a valid MCP message.

The client MUST NOT write anything to the server’s stdin that is not a valid MCP message.
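To make the stdio framing concrete, here is a minimal, illustrative Python sketch of a server loop (not the official SDK); real handling would dispatch on the method instead of echoing an empty result:

import json
import sys

def serve():
    # Newline-delimited JSON-RPC messages arrive on stdin; responses go to stdout
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        # Logging may go to stderr; stdout is reserved for valid MCP messages
        print(f"received {message.get('method')}", file=sys.stderr)
        if "id" in message:  # requests carry an id; notifications do not
            response = {"jsonrpc": "2.0", "id": message["id"], "result": {}}
            sys.stdout.write(json.dumps(response) + "\n")
            sys.stdout.flush()

if __name__ == "__main__":
    serve()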

HTTP with SSE 

In the SSE transport, the server operates as an independent process that can handle multiple client connections.

The server MUST provide two endpoints:

An SSE endpoint, for clients to establish a connection and receive messages from the server

A regular HTTP POST endpoint for clients to send messages to the server

When a client connects, the server MUST send an endpoint event containing a URI for the client to use for sending messages. All subsequent client messages MUST be sent as HTTP POST requests to this endpoint.

Server messages are sent as SSE message events, with the message content encoded as JSON in the event data.

Custom Transports 

Clients and servers MAY implement additional custom transport mechanisms to suit their specific needs. The protocol is transport-agnostic and can be implemented over any communication channel that supports bidirectional message exchange.

Implementers who choose to support custom transports MUST ensure they preserve the JSON-RPC message format and lifecycle requirements defined by MCP. Custom transports SHOULD document their specific connection establishment and message exchange patterns to aid interoperability.

references:

https://spec.modelcontextprotocol.io/specification/basic/transports/

Base protocol for MCP

 Base Protocol: Core JSON-RPC message types

Lifecycle Management: Connection initialization, capability negotiation, and session control

Server Features: Resources, prompts, and tools exposed by servers

Client Features: Sampling and root directory lists provided by clients

Utilities: Cross-cutting concerns like logging and argument completion


Auth 

Authentication and authorization are not currently part of the core MCP specification, but we are considering ways to introduce them in future. 


Lifecycle Phases 

Initialization 

The initialization phase MUST be the first interaction between client and server. During this phase, the client and server:


Establish protocol version compatibility

Exchange and negotiate capabilities

Share implementation details

The client MUST initiate this phase by sending an initialize request containing:


Protocol version supported

Client capabilities

Client implementation information


The server MUST respond with its own capabilities and information:

Sample client info 

{

  "jsonrpc": "2.0",

  "id": 1,

  "method": "initialize",

  "params": {

    "protocolVersion": "2024-11-05",

    "capabilities": {

      "roots": {

        "listChanged": true

      },

      "sampling": {}

    },

    "clientInfo": {

      "name": "ExampleClient",

      "version": "1.0.0"

    }

  }

}


Sample server info 

{

  "jsonrpc": "2.0",

  "id": 1,

  "result": {

    "protocolVersion": "2024-11-05",

    "capabilities": {

      "logging": {},

      "prompts": {

        "listChanged": true

      },

      "resources": {

        "subscribe": true,

        "listChanged": true

      },

      "tools": {

        "listChanged": true

      }

    },

    "serverInfo": {

      "name": "ExampleServer",

      "version": "1.0.0"

    }

  }

}


The client SHOULD NOT send requests other than pings before the server has responded to the initialize request.

The server SHOULD NOT send requests other than pings and logging before receiving the initialized notification.


Version Negotiation 

In the initialize request, the client MUST send a protocol version it supports. This SHOULD be the latest version supported by the client.


If the server supports the requested protocol version, it MUST respond with the same version. Otherwise, the server MUST respond with another protocol version it supports. This SHOULD be the latest version supported by the server.


If the client does not support the version in the server’s response, it SHOULD disconnect.




Error Handling 

Implementations SHOULD be prepared to handle these error cases:


Protocol version mismatch

Failure to negotiate required capabilities

Initialize request timeout

Shutdown timeout

Implementations SHOULD implement appropriate timeouts for all requests, to prevent hung connections and resource exhaustion.


Example initialization error:

{

  "jsonrpc": "2.0",

  "id": 1,

  "error": {

    "code": -32602,

    "message": "Unsupported protocol version",

    "data": {

      "supported": ["2024-11-05"],

      "requested": "1.0.0"

    }

  }

}

references:

https://spec.modelcontextprotocol.io/specification/basic/


Wednesday, January 8, 2025

What is Capability Negotiation in MCP

Capability Negotiation 

The Model Context Protocol uses a capability-based negotiation system where clients and servers explicitly declare their supported features during initialization. Capabilities determine which protocol features and primitives are available during a session.

Servers declare capabilities like resource subscriptions, tool support, and prompt templates

Clients declare capabilities like sampling support and notification handling

Both parties must respect declared capabilities throughout the session

Additional capabilities can be negotiated through extensions to the protocol



Each capability unlocks specific protocol features for use during the session. For example:

Implemented server features must be advertised in the server’s capabilities

Emitting resource subscription notifications requires the server to declare subscription support

Tool invocation requires the server to declare tool capabilities

Sampling requires the client to declare support in its capabilities

This capability negotiation ensures clients and servers have a clear understanding of supported functionality while maintaining protocol extensibility.

references:

https://spec.modelcontextprotocol.io/specification/architecture/


MCP What are Security and Trust & Safety principles in specification?

The Model Context Protocol enables powerful capabilities through arbitrary data access and code execution paths. With this power comes important security and trust considerations that all implementors must carefully address.


Key Principles 

User Consent and Control


Users must explicitly consent to and understand all data access and operations

Users must retain control over what data is shared and what actions are taken

Implementors should provide clear UIs for reviewing and authorizing activities

Data Privacy


Hosts must obtain explicit user consent before exposing user data to servers

Hosts must not transmit resource data elsewhere without user consent

User data should be protected with appropriate access controls

Tool Safety


Tools represent arbitrary code execution and must be treated with appropriate caution

Hosts must obtain explicit user consent before invoking any tool

Users should understand what each tool does before authorizing its use

LLM Sampling Controls


Users must explicitly approve any LLM sampling requests

Users should control:

Whether sampling occurs at all

The actual prompt that will be sent

What results the server can see

The protocol intentionally limits server visibility into prompts

Implementation Guidelines 

While MCP itself cannot enforce these security principles at the protocol level, implementors SHOULD:


Build robust consent and authorization flows into their applications

Provide clear documentation of security implications

Implement appropriate access controls and data protections

Follow security best practices in their integrations

Consider privacy implications in their feature designs

Monday, January 6, 2025

MCP Standards in a nutshell

 1. Request and Response Formats

MCP aims to standardize the structure of requests sent to models and the responses they return. This includes:

Input Formats: Ensuring models can process queries in a common format, regardless of the vendor.

Output Formats: Defining a consistent structure for model responses, including metadata like confidence scores, provenance information, and structured data (e.g., JSON).

Error Handling: Standardized error codes and messages for better debugging and reliability.

2. Context Sharing and State Management

MCP proposes mechanisms to manage and share context between models or sessions, such as:

Memory Persistence: How context is maintained across multiple queries.

Session Management: Allowing continuity in conversations or tasks by persisting user-defined context.

Global Context: Enabling multiple models or tools to access shared context seamlessly.

3. Compatibility Across Tools and APIs

The protocol aims to bridge different vendor ecosystems by:

Unified API Interfaces: A single API specification that can be implemented by all participating models.

Interoperability Standards: Enabling models, vector databases, and tools to work together in workflows like retrieval-augmented generation (RAG) without vendor lock-in.

4. Metadata and Provenance Standards

MCP emphasizes the importance of detailed metadata in model responses, including:

Source Attribution: Where information comes from, especially in multi-source systems.

Confidence Scores: How certain the model is about its outputs.

Execution Logs: Tracing the steps taken to generate a response.

5. Tool Interactions and Plugin Standards

MCP proposes standards for how models interact with external tools, databases, and APIs, including:

Plugin Interfaces: Defining a unified way to integrate tools (e.g., calculators, retrieval systems).

Execution Standards: How models should invoke tools and handle tool responses.

6. Security and Privacy

Establishing protocols to ensure:

Secure Data Transmission: Encrypting queries and responses.

Access Control: Defining who can interact with the model or tools.

Compliance: Adhering to legal and ethical standards for data handling.

7. Evaluation and Logging Standards

Proposals for how to:

Benchmark Models: Using standardized datasets or metrics.

Log Interactions: Tracking user-model interactions for auditing or improving system behavior.

Summary

MCP is essentially proposing a holistic standard that covers:


Request/Response Formats

Context and State Management

Interoperability Across Vendors and Tools

Metadata and Provenance

Security and Compliance

Tool and Plugin Interactions

By addressing these areas, MCP aims to create a more unified, efficient, and user-friendly ecosystem for working with AI models. However, its adoption depends on industry-wide collaboration and agreement.


What are GLMs (Generalized Linear Models)?

GLMs can handle a wider range of data distributions.

Explanation:

Generalized Linear Models (GLMs) are an extension of traditional linear regression models that can handle response variables with distributions other than the normal distribution. For example:

Binary outcomes (using logistic regression).

Count data (using Poisson regression).

Proportions (using binomial regression).

Key Advantages of GLMs:

Wider range of distributions: GLMs use a link function to relate the mean of the response variable to the linear predictor, allowing flexibility in modeling different types of data distributions.

Relaxation of linearity assumptions: GLMs allow for non-linear relationships between predictors and the response variable through the link function.

Misconceptions about the other options:

"Simpler to interpret than regression models": Not always true; the interpretation of coefficients in GLMs depends on the link function, which can make them less intuitive than traditional regression.

"Less computationally intensive": GLMs can be more computationally intensive due to their iterative fitting procedures (e.g., maximum likelihood estimation).

"Do not require the assumption of linearity": This is somewhat true in the sense that the relationship between the predictors and the response variable is modeled through a link function, but the linearity assumption still applies to the predictors in the linear predictor.