Tuesday, March 25, 2025

What is Node Postprocessing in LlamaIndex?

Node Postprocessor

Node Postprocessors apply transformations or filtering to a set of nodes before returning them. In LlamaIndex, node postprocessors are integrated into the query engine, functioning after the node retrieval step and before the response synthesis step. LlamaIndex provides an API for adding custom postprocessors and offers several ready-to-use node postprocessors. Some of the most commonly used node postprocessors are:


CohereRerank: This module is a component of the Cohere natural language processing system that selects the best output from a set of candidates. It uses a neural network to score each candidate based on relevance, semantic similarity, theme, and style. The candidates are then ranked according to their scores, and the top N are returned as the final output.


LLMRerank: Similar to the CohereRerank approach, but it uses an LLM to re-order nodes, returning the top N ranked nodes.


SimilarityPostprocessor: This postprocessor removes nodes that fall below a specified similarity score threshold.
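For instance, a SimilarityPostprocessor can be attached to a query engine so that retrieved nodes below a similarity cutoff are dropped before the response is synthesized. Below is a minimal sketch, assuming the llama_index.core package, a local ./data folder of documents, and default embedding/LLM settings; the cutoff value is illustrative.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor

# Build an index over local documents (assumes ./data exists and default models are configured)
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Attach the postprocessor: it runs after retrieval and before response synthesis
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

response = query_engine.query("What does the document say about pricing?")
print(response)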



Saturday, March 22, 2025

WCSS (Within-Cluster Sum of Squares) in K-Means Clustering

 WCSS stands for "Within-Cluster Sum of Squares". It's a measure of the compactness or tightness of clusters in a K-Means clustering algorithm.   

Definition:

WCSS is calculated as the sum of the squared distances between each data point and the centroid of the cluster to which it is assigned.   

Formula:

WCSS = Σ (distance(point, centroid))^2   

Where:

Σ represents the summation over all data points.

distance(point, centroid) is the Euclidean distance (or another suitable distance metric) between a data point and its cluster's centroid.

Significance:

Cluster Evaluation:

WCSS helps to evaluate the quality of the clustering.   

Lower WCSS values generally indicate tighter, more compact clusters.   

However, simply minimizing WCSS isn't the sole goal, as it can be driven to zero by increasing the number of clusters (k).

Elbow Method:

WCSS is the primary metric used in the Elbow method for determining the optimal number of clusters (k).

The Elbow method plots WCSS against different values of k.   

The "elbow" point in the plot, where the rate of decrease in WCSS sharply changes, is often considered a good estimate for the optimal k.   

Understanding Cluster Compactness:

WCSS provides a quantitative measure of how well the data points fit within their assigned clusters.   

It helps to understand the homogeneity of the clusters.   

Algorithm Optimization:

K-Means aims to minimize the WCSS during its iterative process.

The algorithm adjusts the cluster centroids to reduce the overall WCSS.

In summary:

WCSS is a crucial metric in K-Means clustering. It measures the compactness of clusters and is used to evaluate the clustering quality and to help determine the optimal number of clusters using the Elbow method. Lower WCSS values indicate tighter clusters, but the goal is to find a balance between minimizing WCSS and having a meaningful number of clusters.   
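As a quick illustration, scikit-learn exposes WCSS as the inertia_ attribute of a fitted KMeans model. The sketch below uses hypothetical random data and also computes WCSS by hand to show that the two values agree.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # replace with your data

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# WCSS computed manually: squared distance of each point to its assigned centroid
wcss_manual = sum(
    np.sum((X[km.labels_ == k] - center) ** 2)
    for k, center in enumerate(km.cluster_centers_)
)

print(km.inertia_)   # WCSS as reported by scikit-learn
print(wcss_manual)   # matches up to floating-point error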


How to initialise t-SNE using the scikit-learn library

 To use t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce dimensionality from 10 to 2 using the scikit-learn library in Python, you would initialize the TSNE class as follows:

from sklearn.manifold import TSNE

# Initialize t-SNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)


Explanation of the parameters:

n_components=2: This is the most important parameter for your requirement. It specifies that you want to reduce the dimensionality to 2 dimensions.

perplexity=30: This parameter controls the balance between local and global aspects of your data. The typical range is between 5 and 50. It is a good starting point to use 30. You may need to experiment with different values depending on your dataset.

random_state=42: This parameter sets the seed for the random number generator. Setting a random state ensures that you get reproducible results. You can use any integer value.


Complete Example:

from sklearn.manifold import TSNE

import numpy as np


# Sample 10-dimensional data (replace with your actual data)

data_10d = np.random.rand(100, 10)  # 100 samples, 10 features


# Initialize t-SNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)


# Reduce dimensionality

data_2d = tsne.fit_transform(data_10d)


# Now 'data_2d' contains the 2-dimensional representation of your data

print(data_2d.shape)  # Should output (100, 2)


Important Notes:

t-SNE is computationally expensive, especially for large datasets.

The perplexity parameter can significantly affect the visualization. Experiment with different values to find the one that best reveals the structure of your data.

t-SNE is primarily a visualization technique and is not recommended as a general-purpose dimensionality reduction step for other machine learning tasks.



  

Why Z-Score Scaling is Important in K-Means Clustering

 Z-score scaling, also known as standardization, is a data preprocessing technique that is often used before applying K-Means clustering. It's used to transform the data so that it has a mean of 0 and a standard deviation of 1.   


Why Z-Score Scaling is Important for K-Means:


Equal Feature Weights:


K-Means relies on calculating the distance between data points. If features have vastly different scales, features with larger ranges will dominate the distance calculations.   

Z-score scaling ensures that all features have a similar scale, giving them equal weight in the clustering process.   

Improved Convergence:


K-Means can converge faster and more reliably when features are scaled.

Handling Outliers:


Z-score scaling can help to mitigate the impact of outliers, which can significantly affect the centroid calculations in K-Means.

How Z-Score Scaling Works:


For each feature:


Calculate the mean (μ) of the feature.


Calculate the standard deviation (σ) of the feature.


Transform each value (x) of the feature using the formula:


z = (x - μ) / σ   

Example:


Let's say you have a feature "age" with values [20, 30, 40, 100].


Mean (μ): (20 + 30 + 40 + 100) / 4 = 47.5

Standard Deviation (σ, population): approximately 31.12

Z-scores:

(20 - 47.5) / 31.12 ≈ -0.88

(30 - 47.5) / 31.12 ≈ -0.56

(40 - 47.5) / 31.12 ≈ -0.24

(100 - 47.5) / 31.12 ≈ 1.69
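In practice you rarely compute this by hand; scikit-learn's StandardScaler applies exactly this transformation (using the population standard deviation) to every feature before clustering. A minimal sketch with hypothetical age/income data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: age and income on very different scales
X = np.array([[20, 50000], [30, 60000], [40, 55000], [100, 58000]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0 and std 1
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

print(X_scaled.round(2))  # the age column reproduces the z-scores above
print(labels)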

In Summary:


Z-score scaling is a crucial preprocessing step for K-Means clustering. It ensures that features are on a similar scale, improves convergence, and helps to mitigate the impact of outliers, leading to more accurate and reliable clustering results.

Friday, March 21, 2025

What is Perplexity value in tSNE

 The perplexity parameter in t-SNE is a crucial setting that influences the algorithm's behavior and the resulting visualization. It essentially controls the balance between preserving local and global structure in the data.   


What Perplexity Represents:


Perplexity can be thought of as a measure of the effective number of local neighbors each point considers.

It's related to the variance (spread) of the Gaussian distribution used to calculate pairwise similarities in the high-dimensional space.

In simpler terms, it determines how many nearby points each point is "concerned" with when trying to preserve its local structure.

How Perplexity Works:


Local Neighborhood Size:


A smaller perplexity value causes t-SNE to focus on very close neighbors. It will prioritize preserving the fine-grained local structure of the data.   

A larger perplexity value makes t-SNE consider a wider range of neighbors. It will attempt to preserve a more global view of the data's structure.

Balancing Local and Global:


The choice of perplexity affects the trade-off between preserving local and global relationships.   

Too low a perplexity can lead to noisy visualizations with many small, disconnected clusters.   

Too high a perplexity can obscure fine-grained local structure and make the visualization appear overly smooth.   

Impact on Visualization:


Low Perplexity:

Reveals fine-grained local patterns.   

Can produce many small, tight clusters.

May be sensitive to noise.   

High Perplexity:

Shows broader global patterns.

Produces smoother, more spread-out visualizations.

Less sensitive to noise.

Practical Considerations:


Typical Range:

Perplexity is typically set between 5 and 50.   

The optimal value depends on the size and density of your dataset.

Experimentation:

It's often necessary to experiment with different perplexity values to find the one that produces the most informative visualization.

Dataset Size:

Larger datasets generally benefit from higher perplexity values.

Smaller datasets might require lower perplexity values.

No Single "Best" Value:

There is no single "best" perplexity value. The optimal value is subjective and depends on the specific dataset and the goals of the visualization.   

In summary:


The perplexity parameter in t-SNE controls the algorithm's focus on local versus global structure. It influences the number of neighbors each point considers, affecting the resulting visualization's appearance and interpretability. Experimentation is often necessary to find a suitable value.   
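A quick way to build intuition is to run t-SNE on the same data at several perplexity values and compare the plots side by side. The sketch below uses hypothetical random data; substitute your own embeddings.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

data = np.random.rand(200, 10)  # replace with your high-dimensional data

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(data)
    ax.scatter(emb[:, 0], emb[:, 1], s=5)
    ax.set_title(f"perplexity={perplexity}")
plt.show()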


What is t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a non-linear dimensionality reduction technique primarily used for visualizing high-dimensional data in a lower-dimensional space (typically 2D or 3D). It's particularly effective at revealing the underlying structure of data by preserving local similarities.   

How it Works:

High-Dimensional Similarity:

t-SNE first calculates the pairwise similarities between data points in the original high-dimensional space.   

It uses a Gaussian distribution to model the probability of points being neighbors.

This step focuses on capturing local relationships – how close points are to each other in the high-dimensional space.

Low-Dimensional Mapping:

It then aims to find a corresponding low-dimensional representation of the data points.

It uses a t-distribution (hence the "t" in t-SNE) to model the pairwise similarities in the low-dimensional space.

The t-distribution has heavier tails than a Gaussian, which helps to spread out dissimilar points in the low-dimensional space, preventing the "crowding problem" where points tend to clump together.   

Minimizing Divergence:

t-SNE minimizes the Kullback-Leibler (KL) divergence between the high-dimensional and low-dimensional similarity distributions.   

This optimization process iteratively adjusts the positions of the points in the low-dimensional space to best preserve the local similarities from the high-dimensional space.

Characteristics of t-SNE:

Pairwise Similarity:

t-SNE focuses on preserving the pairwise similarities between data points. This is its core mechanism.   

Non-Linearity:

It's a non-linear technique, meaning it can capture complex, non-linear relationships in the data.   

Local Structure:

It excels at preserving the local structure of the data, meaning that points that are close together in the high-dimensional space will tend to be close together in the low-dimensional space.   

Visualization:

It's primarily used for visualization, not for general-purpose dimensionality reduction.



Can multiple parameters be used for performing clustering?

Yes, absolutely! In a clustering solution, you can simultaneously use multiple parameters (or features) to segment your data. This is precisely how customer segmentation (and many other clustering applications) is typically done.


How it Works:

Feature Selection:

You identify the relevant parameters or features that are likely to influence the clustering.

In your example, "frequency of purchase," "value of purchase," and "recency of purchase" are excellent choices for customer segmentation.

Data Preparation:

You prepare your data by:

Handling missing values.

Scaling or normalizing the features (to ensure that features with larger ranges don't dominate the clustering).   

Encoding categorical features if necessary.

Clustering Algorithm:

You choose a clustering algorithm (e.g., K-Means, hierarchical clustering, DBSCAN).

K-Means, for example, calculates the distance between data points based on all the selected features.   

Clustering:

The algorithm groups customers based on their similarity across all the selected features.

Customers with similar purchase frequency, purchase value, and recency will be grouped into the same cluster.

Cluster Profiling:


You analyze the characteristics of each cluster by examining the average values of the selected features for the customers in each cluster.   

This allows you to understand the distinct customer segments.

Example with Your Parameters:


Let's say you're using K-Means clustering with "frequency of purchase," "value of purchase," and "recency of purchase."


Cluster 1 (High-Value Loyalists):

High frequency of purchase.

High value of purchase.

Recent purchases.

Cluster 2 (Occasional Spenders):

Low frequency of purchase.

Moderate value of purchase.

Less recent purchases.

Cluster 3 (New or Low-Value Customers):

Low frequency of purchase.

Low value of purchase.

Potentially recent purchases.

Benefits of Using Multiple Parameters:


Comprehensive Segmentation: Provides a more holistic view of customer behavior.

Improved Accuracy: Leads to more accurate and meaningful customer segments.

Actionable Insights: Enables targeted marketing and customer relationship management strategies.   

Therefore, using multiple parameters is not only possible but also essential for effective clustering and customer segmentation.
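As a minimal sketch of the idea (hypothetical RFM-style data; the column names are illustrative), K-Means simply treats each customer as a point in a multi-feature space after scaling:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "frequency": [52, 3, 1, 40, 2],          # purchases per year
    "value":     [5000, 300, 50, 4200, 80],  # total spend
    "recency":   [5, 200, 30, 10, 15],       # days since last purchase
})

X = StandardScaler().fit_transform(customers)  # put all three features on a comparable scale
customers["cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
print(customers)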


What is Cluster Profiling and what is centroid in a cluster?

Cluster profiling is the process of analyzing and characterizing the data points that belong to each cluster identified in a clustering algorithm (like K-Means). It involves understanding the key attributes, patterns, and trends within each cluster.   

Centroid in Cluster Profiling

In the context of centroid-based clustering algorithms like K-Means, the centroid plays a crucial role in cluster profiling.

What is a Centroid?

It's the central point of a cluster, representing the average values of all the data points within that cluster.

In K-Means, the algorithm iteratively adjusts the centroids to minimize the distances between data points and their assigned cluster centroids.   

Role in Profiling:


The centroid acts as a representative of the data points within a cluster.

By examining the values of the features at the centroid, you can gain insights into the characteristics that define that particular cluster.

For example:

In customer segmentation, the centroid of a cluster might represent the average age, income, and purchase behavior of customers in that segment.   

In image analysis, the centroid could represent the average color, texture, or shape features of images within a cluster.   

In Summary:


Cluster profiling involves understanding the characteristics of each cluster. The centroid, as the central point of a cluster, provides a crucial reference point for analyzing and interpreting the data within that cluster. By examining the values of the features at the centroid, you can gain valuable insights into the defining characteristics of each cluster.
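A short sketch of cluster profiling in code (hypothetical age/income data): fit K-Means, then inspect the centroids and the per-cluster feature averages.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [22, 25, 47, 52, 46, 56, 23, 51],
    "income": [20, 25, 70, 80, 65, 90, 22, 75],  # in thousands (hypothetical)
})

X = StandardScaler().fit_transform(df)
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
df["cluster"] = km.labels_

print(km.cluster_centers_)            # centroids in the scaled feature space
print(df.groupby("cluster").mean())   # cluster profiles in the original units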


Silhouette Score in K-Means Clustering?

The silhouette score is a metric used to evaluate the quality of clusters created by algorithms like K-Means. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).   

How it Works:

For each data point:

Calculate a: The average distance of the point to all other points within the same cluster.   

Calculate b: The average distance of the point to all points in the nearest other cluster.   

Calculate the silhouette coefficient s:

s = (b - a) / max(a, b)

The silhouette score for the entire clustering is the average of the silhouette coefficients for all data points.   

Interpretation:

1: Indicates that the point is well-clustered. It's far away from neighboring clusters and close to points in its own cluster.   

0: Indicates that the point is on or very close to the decision boundary between two neighboring clusters.   

-1: Indicates that the point might be assigned to the wrong cluster.   

Range of Silhouette Score:

The silhouette score ranges from -1 to 1.   

Key Considerations:

Higher is Better: A higher silhouette score generally indicates better clustering.   

Cluster Quality: The silhouette score can help assess the quality of clusters produced by K-Means or other clustering algorithms.   

Choosing k: While not as visually intuitive as the elbow method, the silhouette score can also be used to help choose the optimal number of clusters (k). You can calculate the silhouette score for different values of k and choose the k that yields the highest score.

Limitations: The silhouette score may not always be a perfect indicator of cluster quality, especially for complex datasets.
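A minimal sketch (synthetic blob data) of using scikit-learn's silhouette_score to compare candidate values of k:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

The k with the highest average silhouette score is a reasonable candidate for the optimal number of clusters.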

What is Elbow Point?

The Elbow method is a heuristic used in determining the optimal number of clusters (k) for k-means clustering. It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters (k).   

How it Works:

Calculate WCSS for Different Values of k:

For various values of k (e.g., k = 1, 2, 3, ...), run the k-means algorithm.   

For each k, calculate the WCSS, which is the sum of the squared distances between each point and its assigned cluster's centroid.   

Plot WCSS vs. k:

Create a line plot with the number of clusters (k) on the x-axis and the WCSS on the y-axis.   

Identify the "Elbow" Point:

Look for the "elbow" point in the plot. This is the point where the rate of decrease in WCSS sharply changes.   

The elbow point represents a good trade-off between minimizing WCSS and not having too many clusters.

Why it Works:

As k increases:

The WCSS generally decreases because points are assigned to closer clusters.   

When k equals the number of data points, WCSS becomes zero because each point forms its own cluster.

The "Elbow":

The "elbow" point indicates a point of diminishing returns. After this point, increasing k doesn't significantly reduce WCSS.   

In summary:

The Elbow method plots the within-cluster sum of squares (WCSS) against different values of k to help determine the optimal number of clusters for k-means. The "elbow" in the plot is used as a visual indicator of the best k value.  
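A minimal sketch (synthetic blob data) of producing an elbow plot with scikit-learn and matplotlib:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.show()

The bend in the resulting curve is the elbow point described above.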

What is example selector in Langchain ?

If you have a large number of examples, you may need to select which ones to include in the prompt. The Example Selector is the class responsible for doing so.


from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseExampleSelector(ABC):

    """Interface for selecting examples to include in prompts."""


    @abstractmethod

    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:

        """Select which examples to use based on the inputs."""

        

    @abstractmethod

    def add_example(self, example: Dict[str, str]) -> Any:

        """Add new example to store."""


It needs to define a select_examples method, which takes in the input variables and returns a list of examples, and an add_example method for adding new examples to the store. It is up to each specific implementation how those examples are selected.
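The FewShotPromptTemplate example below assumes an example_selector has already been constructed. One way to build it is sketched here with the SemanticSimilarityExampleSelector; this assumes the langchain_openai and langchain_chroma packages are installed and an OpenAI API key is configured, and the example pool is illustrative.

from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

examples = [
    {"input": "hello", "output": "ciao"},
    {"input": "goodbye", "output": "arrivederci"},
    {"input": "cat", "output": "gatto"},
    {"input": "dog", "output": "cane"},
]

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,            # the pool of examples to choose from
    OpenAIEmbeddings(),  # embeddings used to compare the input against the examples
    Chroma,              # vector store class used for the similarity search
    k=2,                 # number of examples to include in the prompt
)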



from langchain_core.prompts.few_shot import FewShotPromptTemplate

from langchain_core.prompts.prompt import PromptTemplate


example_prompt = PromptTemplate.from_template("Input: {input} -> Output: {output}")



prompt = FewShotPromptTemplate(

    example_selector=example_selector,

    example_prompt=example_prompt,

    suffix="Input: {input} -> Output:",

    prefix="Translate the following words from English to Italain:",

    input_variables=["input"],

)


print(prompt.format(input="word"))


LangChain provides several built-in example selector types:

Similarity: Uses semantic similarity between inputs and examples to decide which examples to choose.

MMR: Uses Max Marginal Relevance between inputs and examples to decide which examples to choose.

Length: Selects examples based on how many can fit within a certain length.

Ngram: Uses n-gram overlap between inputs and examples to decide which examples to choose.




Wednesday, March 19, 2025

What are details to look for in MTEB Leader board ?

MTEB [1] is a multi-task and multi-language comparison of embedding models. It comes in the form of a leaderboard, based on multiple scores, and only one model stands at the top! Does it make it easy to choose the right model for your application? You wish! This guide is an attempt to provide tips on how to make clever use of MTEB. As our team worked on making the French benchmark available [2], the examples will rely on the French MTEB. Nonetheless, those tips apply to the entire benchmark.

MTEB is a leaderboard. It shows you scores. What it doesn't show you? Significance.

While being a great resource for discovering and comparing models, MTEB might not be as straightforward as one might expect. As of today (1st of March 2024), many SOTA models have been tested, and most of them display close average scores. For the French MTEB, these average scores are computed on 26 different tasks (and 56 for English MTEB!) and no standard deviation comes with it. Even though the top model looks better than the others, the score difference with a model that comes after it might not be significant. One can directly get the raw results to compute statistical metrics. As an example, we performed critical difference tests and found that, with a p-value of 0.05, the current 9 top models in the French MTEB leaderboard are statistically equivalent. It would require even more datasets to perceive statistical significance.

Dive into data

Do not just look at the average scores of models on the task you are interested in. Instead, look at the individual scores on the datasets that best represent your use case.

Consider the model's characteristics

Using the model displaying the best average score for your application might be tempting. However, a model comes with its characteristics, leading to its usage constraints. Make sure those constraints match yours.

Do not forget MTEB is a leaderboard...

And as leaderboards sometimes do, it encourages competing without always following the rules.

Indeed, keep in mind that many providers want to see their models on top of that list and, being based on public datasets, some malpractices such as data leakage or overfitting on data could bias MTEB evaluation.


references:

https://huggingface.co/blog/lyon-nlp-group/mteb-leaderboard-best-practices

What is RAGFlow?

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.


Key Features

🍭 "Quality in, quality out"

Deep document understanding-based knowledge extraction from unstructured data with complicated formats.

Finds "needle in a data haystack" of literally unlimited tokens.

🍱 Template-based chunking

Intelligent and explainable.

Plenty of template options to choose from.

🌱 Grounded citations with reduced hallucinations

Visualization of text chunking to allow human intervention.

Quick view of the key references and traceable citations to support grounded answers.

🍔 Compatibility with heterogeneous data sources

Supports Word, slides, excel, txt, images, scanned copies, structured data, web pages, and more.

🛀 Automated and effortless RAG workflow

Streamlined RAG orchestration catered to both personal and large businesses.

Configurable LLMs as well as embedding models.

Multiple recall paired with fused re-ranking.

Intuitive APIs for seamless integration with business.

What is vm.max_map_count in linux Systems?

In the context of Linux systems, vm.max_map_count is a sysctl parameter that defines the maximum number of memory map areas a process can have. Here's a more detailed explanation:

Memory Mapping:

Memory mapping is a technique that allows a process to access files or devices as if they were part of its virtual memory. This is done by mapping a file or device into the process's address space.   

vm.max_map_count:

This sysctl setting limits the number of these memory map areas that a single process can create.   

It's important because some applications, particularly those that heavily utilize memory-mapped files (like databases such as Elasticsearch), may require a higher max_map_count value.

Why It Matters:

If an application attempts to create more memory map areas than the vm.max_map_count limit allows, it can lead to errors or unexpected behavior, such as "out of memory" exceptions.

Therefore, in certain situations, it becomes necessary to increase this value.   

Practical Usage:

You can check the current value of vm.max_map_count using the command sysctl vm.max_map_count.

To change the value temporarily, you can use sysctl -w vm.max_map_count=<new_value>.

To make the change persistent across reboots, you can add the setting vm.max_map_count=<new_value> to the /etc/sysctl.conf file (or files within /etc/sysctl.d/).

In essence, vm.max_map_count is a kernel parameter that controls the maximum number of memory map areas a process can have, and it's often adjusted to accommodate the requirements of memory-intensive applications.


What are interactive plots based visualisation in Fiftyone Brain

FiftyOne provides a powerful fiftyone.core.plots framework that contains a variety of interactive plotting methods that enable you to visualize your datasets and uncover patterns that are not apparent from inspecting either the raw media files or aggregate statistics.


With FiftyOne, you can visualize geolocated data on maps, generate interactive evaluation reports such as confusion matrices and PR curves, create dashboards of custom statistics, and even generate low-dimensional representations of your data that you can use to identify data clusters corresponding to model failure modes, annotation gaps, and more.


What do we mean by interactive plots? First, FiftyOne plots are powered by Plotly, which means they are responsive JavaScript-based plots that can be zoomed, panned, and lasso-ed. Second, FiftyOne plots can be linked to the FiftyOne App, so that selecting points in a plot will automatically load the corresponding samples/labels in the App (and vice versa) for you to visualize! Linking plots to their source media is a paradigm that should play a critical part in any visual dataset analysis pipeline.


The builtin plots provided by FiftyOne are chosen to help you analyze and improve the quality of your datasets and models, with minimal customization required on your part to get started. At the same time, data/model interpretability is not a narrowly-defined space that can be fully automated. That’s why FiftyOne’s plotting framework is highly customizable and extensible, all by writing pure Python (no JavaScript knowledge required).

References:

https://docs.voxel51.com/user_guide/plots.html


Tuesday, March 18, 2025

What is embedding visualization in FiftyOne, and how is it useful?

Embeddings can be used to analyze data and models. Use FiftyOne's embeddings visualization capabilities to reveal hidden structure in the data, mine hard samples, pre-annotate data, recommend new samples for annotation, and more.

FiftyOne provides a powerful embeddings visualization capability that you can use to generate low-dimensional representations of the samples and objects in your datasets.

This notebook highlights several applications of visualizing image embeddings, with the goal of motivating some of the many possible workflows that you can perform.

Specifically, we’ll cover the following concepts:

Loading datasets from the FiftyOne Dataset Zoo

Using compute_visualization() to generate 2D representations of images

Providing custom embeddings to compute_visualization()

Visualizing embeddings via interactive plots connected to the FiftyOne App

And we’ll demonstrate how to use embeddings to:

Identify anomalous/incorrect image labels

Find examples of scenarios of interest

Pre-annotate unlabeled data for training

So, what’s the takeaway?

Combing through individual images in a dataset and staring at aggregate performance metrics trying to figure out how to improve the performance of a model is an ineffective and time-consuming process. Visualizing your dataset in a low-dimensional embedding space is a powerful workflow that can reveal patterns and clusters in your data that can answer important questions about the critical failure modes of your model and how to augment your dataset to address these failures.

Using the FiftyOne Brain’s embeddings visualization capability on your ML projects can help you uncover hidden patterns in your data and take action to improve the quality of your datasets and models.


References:

https://docs.voxel51.com/tutorials/image_embeddings.html


Saturday, March 15, 2025

How to generate an image using the Truepix AI service

import aiohttp
import asyncio


async def generate_t2i_image(prompt: str) -> str:

    """

    Generates an image from a text prompt using Truepix AI.

    """

    api_url = "https://api.truepix.ai" # Coming SOON

    headers = {"Authorization": "Bearer YOUR_API_KEY"}

    payload = {"prompt": prompt}


    try:

        async with aiohttp.ClientSession() as session:

            # Start image generation

            async with session.post(api_url, json=payload, headers=headers) as response:

                if response.status != 200:

                    return f"Error: Status {response.status}"

                data = await response.json()

                task_id = data.get("task_id")

                if not task_id:

                    return "Error: No task ID"


            # Check progress every 10 seconds

            while True:

                poll_url = f"https://api.truepix.ai" # Coming SOON

                async with session.get(poll_url, headers=headers) as poll_response:

                    if poll_response.status != 200:

                        return f"Error: Status {poll_response.status}"

                    result = await poll_response.json()

                    if result.get("status") == "success":

                        return result.get("result_url", "Error: No URL")

                    await asyncio.sleep(10)

    except Exception as e:

        return f"Error: {str(e)}"

What can an MCP Server do? Some FAQs

 Frequently Asked Questions (FAQs) About MCP

What exactly is the Model Context Protocol (MCP)?

MCP is an open standard that defines how AI models can interact with external tools, resources, and prompts in a consistent and secure manner. It’s designed to simplify the process of giving AI agents access to real-world data and capabilities, making them more useful and versatile. Think of it as creating a standardized “API for AI agents.”

Is MCP specific to Claude, or does it work with other AI models?

MCP is designed as an open protocol, meaning it’s not tied to any specific AI model or provider. While this tutorial uses Claude Desktop as an example host, the goal of MCP is to be compatible with various AI models and platforms. The benefit of a standard like MCP is that it fosters interoperability across the AI ecosystem. Any AI model or platform that implements the MCP client can potentially connect to any MCP server, regardless of the AI provider.

What kinds of things can I build with MCP Servers?

The possibilities are virtually limitless! If an action can be automated or data can be accessed programmatically, it can likely be exposed as an MCP tool. Examples include:

Data Retrieval: Connect to databases, spreadsheets, CRM systems, or any data source to provide AI with information.

Application Integration: Allow AI to interact with apps like email clients, calendars, task managers, social media, e-commerce platforms.

Smart Home Control: Enable AI to control smart devices, adjust lighting, temperature, security systems.

Content Generation & Manipulation: Use APIs to generate text, images, audio, video, or manipulate existing digital content.

Workflow Automation: Build complex workflows where AI orchestrates actions across multiple tools and services.


Writing an MCP server

 MCP solves a similar problem for AI, creating a standard way for AI Clients (like Claude, Cursor or others) to connect to a wide range of tools and data sources. Think of it as a standardized port that allows your AI to effortlessly access things like real-time stock prices, your email inbox, or even complex APIs, all without complicated, one-off setups.


Imagine giving your AI a Swiss Army knife. With MCP, it gains a set of tools: it can fetch information, generate images, interact with services, and even automate tasks, all while keeping your data access secure within your own infrastructure. It’s about making AI agents truly capable and versatile.




uv init server

cd server

uv venv

source .venv/bin/activate  # Windows: .venv\Scripts\activate

uv add "mcp[cli]" httpx aiohttp yfinance asyncio

touch server.py  # Windows: New-Item server.py



import yfinance as yf

from mcp.server.fastmcp import FastMCP

import asyncio

import aiohttp


mcp = FastMCP("stock_prices")


@mcp.tool()
async def get_stock_price(ticker: str) -> str:
    """Return the latest price for a ticker (implementation omitted in this outline)."""
    ...


@mcp.tool()
async def generate_t2i_image(prompt: str) -> str:
    """Generate an image from a text prompt (see the full implementation earlier in this post)."""
    ...


{

  "mcpServers": {

    "server": {

      "command": "/full/path/to/uv",  # **Replace this** with the full path to your 'uv' executable

      "args": [

        "--directory",

        "/full/path/to/server",# **Replace this** with the full absolute path to your 'server' project directory

        "run",

        "server.py"

      ]

    }

  }

}



How It All Fits Together


Claude (MCP Client/Host): Claude acts as the intelligent brain, equipped with an MCP client. It wants to extend its capabilities by using external tools.

Your MCP Server (server.py): This is the tool provider (the MCP server). It exposes specific functionalities — like fetching stock prices and generating images — via the MCP protocol.

MCP Protocol: This is the standardized language and set of rules that allows Claude (via its MCP client) to communicate with your MCP server. It’s the “bridge” enabling smooth interaction.

Your Client sends a request (e.g., “Get AAPL price”), the Server processes it, and sends back “$174.50”. Simple and fast.


Thursday, March 13, 2025

Does FiftyOne Brain support dimensionality reduction?

Here's a conceptual outline and a code snippet demonstrating how you can achieve this with FiftyOne:


Conceptual Outline


Compute Embeddings:

First, you'll need to compute the embeddings for your data (e.g., images, text). You can use any embedding model or technique you prefer (e.g., OpenAI embeddings, Sentence Transformers, etc.).

Store Embeddings in FiftyOne:

FiftyOne allows you to store these embeddings as fields on your samples in a FiftyOne dataset.

Perform Dimensionality Reduction:

FiftyOne integrates with dimensionality reduction techniques like PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding) or UMAP (Uniform Manifold Approximation and Projection).   

You can apply these techniques to reduce the high-dimensional embeddings to a lower-dimensional space (e.g., 2D or 3D) for visualization.

Visualize in FiftyOne:

FiftyOne's visualization capabilities enable you to plot these reduced-dimensional embeddings.

You can then interact with the plot, select data points, and link them back to the original data samples (e.g., images).
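Here is a minimal sketch of that workflow using the FiftyOne Brain API. It assumes the quickstart zoo dataset, the umap-learn package for UMAP, and uses random vectors as stand-in embeddings (replace them with embeddings from your own model).

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
import numpy as np

dataset = foz.load_zoo_dataset("quickstart")

# Hypothetical precomputed embeddings, one row per sample
embeddings = np.random.rand(len(dataset), 512)

# Reduce to 2D (UMAP here; "tsne" and "pca" are also supported)
results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    method="umap",
    brain_key="embeddings_viz",
)

# Plot the 2D points and link the plot to an App session for interactive selection
session = fo.launch_app(dataset)
plot = results.visualize(labels="ground_truth.label")
plot.show()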


Monday, March 10, 2025

Multi Layer Perceptron (MLP)



MLP is a fundamental building block within the transformer’s feed-forward network. Its role is to introduce non-linearity and learn complex relationships within the embedded representations. When defining an MLP module, an important parameter is n_embed, which defines the dimensionality of the input embedding.

The MLP typically consists of a hidden linear layer that expands the input dimension by a factor (often 4, which we will use), followed by a non-linear activation function, commonly ReLU. This structure allows our network to learn more complex features. Finally, a projection linear layer maps the expanded representation back to the original embedding dimension. This sequence of transformations enables the MLP to refine the representations learned by the attention mechanism.
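Here is a minimal PyTorch sketch of the MLP block described above, with n_embed as the embedding dimensionality and the 4x hidden expansion followed by ReLU and a projection back:

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embed: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),  # hidden layer expands the embedding dimension by 4x
            nn.ReLU(),                        # non-linear activation
            nn.Linear(4 * n_embed, n_embed),  # projection back to the original embedding dimension
        )

    def forward(self, x):
        return self.net(x)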


Architecture of the AI co-scientist:

 Multi-Agent Architecture

The AI co-scientist is a multi-agent system built on Gemini 2.0 and designed for scientific hypothesis generation and validation. It leverages asynchronous task execution, self-improving loops, and scaling test-time compute to enhance its reasoning capabilities. The details of multiple specialized agents and their role in hypothesis generation and validation is as follows:


Supervisor Agent — Manages asynchronous task execution, Allocates computational resources dynamically, Stores intermediate outputs in context memory for iterative refinement.

Generation Agent — Explores literature via web search, Simulates scientific debates to generate initial hypotheses, Uses iterative assumption identification to break complex ideas into testable statements.

Reflection Agent — Critically reviews hypotheses for novelty, correctness, and plausibility, Conducts deep verification by breaking down hypotheses into sub-assumptions, Uses simulation review to test hypotheses in a step-wise manner.

Ranking Agent (Tournament-Based Evaluation) — Conducts Elo-based tournaments where hypotheses are pairwise compared, Uses scientific debates to refine and improve the ranking of hypotheses.

Evolution Agent — Refines hypotheses by Adding supporting literature, Simplifying and restructuring ideas, Generating out-of-the-box variations. Ensures self-improvement over multiple iterations.

Proximity Agent — Groups similar hypotheses to avoid redundancy. Helps diversify the search space by encouraging novel directions.

Meta-Review Agent — Synthesizes common errors from scientific debates, improves feedback propagation to refine agent behaviors, and creates research overviews summarizing validated hypotheses.

references:
https://storage.googleapis.com/coscientist_paper/ai_coscientist.pdf

Sunday, March 9, 2025

Prompt Engineering Techniques for GPT-4o

Include details in your query to get more relevant answers

In order to get a highly relevant response, make sure that requests provide any important details or context. Otherwise you are leaving it up to the model to guess what you mean.




Tactic: Ask the model to adopt a persona

Based on internal evals, the gpt-4.5-preview model has a particular system message that results in better performance. Add your own system message contents after this:


SYSTEM

You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, provide clear and accurate answers, and proactively anticipate helpful follow-up information. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.


Tactic: Use delimiters to clearly indicate distinct parts of the input


Delimiters like triple quotation marks, XML tags, section titles, etc. can help demarcate sections of text to be treated differently.



USER

Summarize the text delimited by triple quotes with a haiku.


"""insert text here"""


SYSTEM

You will be provided with a pair of articles (delimited with XML tags) about the same topic. First summarize the arguments of each article. Then indicate which of them makes a better argument and explain why.

USER

<article> insert first article here </article>


<article> insert second article here </article>


SYSTEM

You will be provided with a thesis abstract and a suggested title for it. The thesis title should give the reader a good idea of the topic of the thesis but should also be eye-catching. If the title does not meet these criteria, suggest 5 alternatives.

USER

Abstract: insert abstract here


Title: insert title here


For straightforward tasks such as these, using delimiters might not make a difference in the output quality. However, the more complex a task is, the more important it is to disambiguate task details. Don't make the model work to understand exactly what you are asking of it.


Tactic: Specify the steps required to complete a task


SYSTEM

Use the following step-by-step instructions to respond to user inputs.


Step 1 - The user will provide you with text in triple quotes. Summarize this text in one sentence with a prefix that says "Summary: ".


Step 2 - Translate the summary from Step 1 into Spanish, with a prefix that says "Translation: ".

USER

"""insert text here"""



Tactic: Provide examples

Providing general instructions that apply to all examples is generally more efficient than demonstrating all permutations of a task by example, but in some cases providing examples may be easier. For example, if you intend for the model to copy a particular style of responding to user queries which is difficult to describe explicitly. This is known as "few-shot" prompting.


SYSTEM



SYSTEM

Answer in a consistent style.

USER

Teach me about patience.

ASSISTANT

The river that carves the deepest valley flows from a modest spring; the grandest symphony originates from a single note; the most intricate tapestry begins with a solitary thread.

USER

Teach me about the ocean.


Tactic: Specify the desired length of the output

You can ask the model to produce outputs that are of a given target length. The targeted output length can be specified in terms of the count of words, sentences, paragraphs, bullet points, etc. Note however that instructing the model to generate a specific number of words does not work with high precision. The model can more reliably generate outputs with a specific number of paragraphs or bullet points.


USER

Summarize the text delimited by triple quotes in about 50 words.


"""insert text here"""


USER

Summarize the text delimited by triple quotes in 2 paragraphs.


"""insert text here"""


USER

Summarize the text delimited by triple quotes in 3 bullet points.


"""insert text here"""



Strategy: Provide reference text


If we can provide a model with trusted information that is relevant to the current query, then we can instruct the model to use the provided information to compose its answer.


SYSTEM


Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write "I could not find an answer."

USER

<insert articles, each delimited by triple quotes>


Question: <insert question here>



Tactic: Instruct the model to answer with citations from a reference text

If the input has been supplemented with relevant knowledge, it's straightforward to request that the model add citations to its answers by referencing passages from provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.


SYSTEM

You will be provided with a document delimited by triple quotes and a question. Your task is to answer the question using only the provided document and to cite the passage(s) of the document used to answer the question. If the document does not contain the information needed to answer this question then simply write: "Insufficient information." If an answer to the question is provided, it must be annotated with a citation. Use the following format to cite relevant passages ({"citation": …}).

USER

"""<insert document here>"""


Question: <insert question here>


QwQ-32B model

 DeepSeek R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.

QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge.



QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.


We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards. In the initial stage, we scale RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases. As training episodes progress, performance in both domains shows continuous improvement. After the first stage, we add another stage of RL for general capabilities. It is trained with rewards from general reward model and some rule-based verifiers. We find that this stage of RL training with a small amount of steps can increase the performance of other general capabilities, such as instruction following, alignment with human preference, and agent performance, without significant performance drop in math and coding.


Below are brief examples demonstrating how to use QwQ-32B via Hugging Face Transformers and Alibaba Cloud DashScope API.


from transformers import AutoModelForCausalLM, AutoTokenizer


model_name = "Qwen/QwQ-32B"


model = AutoModelForCausalLM.from_pretrained(

    model_name,

    torch_dtype="auto",

    device_map="auto"

)

tokenizer = AutoTokenizer.from_pretrained(model_name)


prompt = "How many r's are in the word \"strawberry\""

messages = [

    {"role": "user", "content": prompt}

]

text = tokenizer.apply_chat_template(

    messages,

    tokenize=False,

    add_generation_prompt=True

)


model_inputs = tokenizer([text], return_tensors="pt").to(model.device)


generated_ids = model.generate(

    **model_inputs,

    max_new_tokens=32768

)

generated_ids = [

    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)

]


response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

from openai import OpenAI

import os


# Initialize OpenAI client

client = OpenAI(

    # If the environment variable is not configured, replace with your API Key: api_key="sk-xxx"

    # How to get an API Key:https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key

    api_key=os.getenv("DASHSCOPE_API_KEY"),

    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"

)


reasoning_content = ""

content = ""


is_answering = False


completion = client.chat.completions.create(

    model="qwq-32b",

    messages=[

        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}

    ],

    stream=True,

    # Uncomment the following line to return token usage in the last chunk

    # stream_options={

    #     "include_usage": True

    # }

)


print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")


for chunk in completion:

    # If chunk.choices is empty, print usage

    if not chunk.choices:

        print("\nUsage:")

        print(chunk.usage)

    else:

        delta = chunk.choices[0].delta

        # Print reasoning content

        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:

            print(delta.reasoning_content, end='', flush=True)

            reasoning_content += delta.reasoning_content

        else:

            if delta.content != "" and is_answering is False:

                print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")

                is_answering = True

            # Print content

            print(delta.content, end='', flush=True)

            content += delta.content


Saturday, March 8, 2025

How Mistral OCR Works:

 Mistral AI has introduced Mistral OCR, a powerful Optical Character Recognition API designed for advanced document understanding. Here's a breakdown of how it works and how to use it:   

Advanced Document Understanding:

Mistral OCR goes beyond basic text extraction. It's designed to comprehend the various elements within documents, including:

Text.

Images.   

Tables.   

Mathematical equations.   

Complex layouts (e.g., LaTeX).   

  

Multimodal and Multilingual:

It's capable of processing documents with mixed content (text and images) and supports a wide range of languages and scripts.   

"Doc-as-Prompt" Functionality:

This innovative feature allows users to use documents as prompts, enabling more precise information extraction and structured output formatting (e.g., JSON).   

Performance and Efficiency:

Mistral OCR is designed for speed and efficiency, capable of processing a high volume of documents.   

Technology:

It is powered by advanced AI models, that allow for a very high degree of accuracy, and comprehension of complex document layouts.

Parsing Documents Using Mistral OCR:


To use Mistral OCR, you'll typically interact with its API. Here's a general outline based on available information:


API Access:

You'll need access to the Mistral AI API, which may require an API key.   

The API is accessible on Mistral's developer suite, La Plateforme.   

Input Formats:

Mistral OCR supports various input formats, including:

PDF documents.   

Images.

API Requests:

You'll send API requests to the Mistral OCR endpoint, providing the document as input.   

You can specify parameters to control the output format and extraction options.   

Output:

The API returns the extracted content in a structured format, such as:

Markdown.

JSON.

This structured output makes it easier to parse and process the extracted information.   

Code examples:

Mistral AI provides code examples in languages like Python and TypeScript that can be used to interact with the API.
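For illustration, here is a sketch of an OCR request with the mistralai Python SDK as documented at the time of writing; treat the exact interface as an assumption and check the official docs, and note that the API key and document URL are placeholders.

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/sample.pdf",  # replace with your document
    },
)

# Each page comes back with its extracted content as Markdown
for page in ocr_response.pages:
    print(page.markdown)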

Key Features and Benefits:


High Accuracy:

Mistral OCR has demonstrated strong performance in benchmark tests, outperforming other leading OCR models.   

Complex Document Handling:

It excels at processing documents with intricate layouts and mixed content.   

Multilingual Support:

Its ability to handle a wide range of languages makes it suitable for global applications.   

Self-Hosting Option:

For organizations with strict data privacy requirements, Mistral AI offers a self-hosting option.   

To get the most accurate and up-to-date information on how to use Mistral OCR, I recommend referring to the official Mistral AI documentation.





Integrating Okta or Auth0 with an FGARetriever involves a few key steps:

1. Authentication and Authorization:


Okta/Auth0 as Identity Providers (IdPs):

Okta and Auth0 handle user authentication (verifying user identity) and authorization (determining user permissions).   

They provide tokens (e.g., JWTs) that contain user information and claims.   

FGARetriever's Role:

FGARetriever needs to receive and validate these tokens.

It then uses the information within the tokens to enforce access control policies.

2. Token Validation and User Context:


Token Verification:

FGARetriever must verify the authenticity and integrity of the Okta/Auth0 tokens.

This involves checking the token's signature and issuer.

Libraries in your chosen programming language (e.g., Python, Node.js) can help with JWT validation.   

Extracting User Information:

From the validated token, extract user attributes (e.g., user ID, roles, groups).

This information is essential for evaluating access control policies.

Passing Context to FGARetriever:

When a user makes a query, pass the extracted user context to the FGARetriever.

3. Access Control Policy Enforcement:


Policy Engine:

FGARetriever needs to integrate with a policy engine that can evaluate access control policies.   

This could be a custom policy engine or a dedicated access control service.

Policy Definition:

Define access control policies that specify which users or roles have access to which documents.

These policies should use the user attributes extracted from the Okta/Auth0 tokens.

Policy Evaluation:

FGARetriever uses the policy engine to evaluate the policies based on the user context and the retrieved documents.

Only documents that the user is authorized to access are returned.

4. Implementation Considerations:


Middleware or Interceptors:

Implement middleware or interceptors in your application to handle token validation and user context extraction.   

This ensures that every request is properly authenticated and authorized.

Caching:

Cache validated tokens and access control decisions to improve performance.

Error Handling:

Implement robust error handling to handle invalid tokens or authorization failures.

Security Best Practices:

Follow security best practices for token management and access control.

Use HTTPS to protect communication between your application and Okta/Auth0.   

Protect your API keys.

Conceptual Workflow:


User Authentication:

User authenticates with Okta/Auth0.

Token Issuance:

Okta/Auth0 issues a JWT to the user.

API Request:

User sends an API request to your application, including the JWT in the Authorization header.   

Token Validation:

Your application validates the JWT.

User Context Extraction:

Your application extracts user attributes from the JWT.

FGARetriever Query:

Your application passes the user query and user context to the FGARetriever.

Semantic Search:

FGARetriever performs semantic search.

Policy Evaluation:

FGARetriever evaluates access control policies using the user context.

Filtered Results:

FGARetriever returns only authorized documents.

Response:

Your application sends the response to the user.

Libraries and Tools:


JWT Libraries:

Use libraries like PyJWT (Python) or jsonwebtoken (Node.js) for JWT validation.

Okta/Auth0 SDKs:

Use the official Okta or Auth0 SDKs for easier integration.

Policy Engines:

Consider using policy engines like Open Policy Agent (OPA).

By following these steps, you can effectively integrate Okta or Auth0 with an FGARetriever to implement fine-grained access control in your applications.
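As a minimal sketch of steps 2 and 3 (token validation and user-context extraction) with PyJWT, where the signing key, audience, issuer, claim names, and the retriever call are all placeholders you would adapt to your Okta/Auth0 tenant:

import jwt  # PyJWT

def extract_user_context(token: str, signing_key: str, audience: str, issuer: str) -> dict:
    # Verify signature, audience, and issuer; raises jwt.InvalidTokenError on failure
    claims = jwt.decode(
        token,
        signing_key,
        algorithms=["RS256"],
        audience=audience,
        issuer=issuer,
    )
    # Pull out the attributes used for access-control decisions (claim names vary by tenant)
    return {
        "user_id": claims["sub"],
        "roles": claims.get("roles", []),
        "groups": claims.get("groups", []),
    }

# Hypothetical usage: pass the extracted context alongside the query to your FGARetriever
# user_context = extract_user_context(bearer_token, signing_key, "api://my-app", "https://my-tenant.auth0.com/")
# documents = fga_retriever.get_relevant_documents(query, user=user_context)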





What is FGARetriever ?

FGARetriever stands for Fine-Grained Access Control Retriever. It's a specialized type of retriever designed to incorporate fine-grained access control policies into the retrieval process. This means it doesn't just retrieve relevant documents based on semantic similarity, but also considers who is making the request and what they are authorized to see.   


Here's a breakdown of its key aspects:

Core Functionality:

Access Control Policies:

FGARetriever integrates with access control systems or policy engines.

It evaluates access policies to determine whether the user or application making the request has permission to access the retrieved documents.

Contextual Retrieval:

It combines semantic search with access control, ensuring that only authorized and relevant documents are retrieved.

This is crucial in applications where sensitive or confidential information is involved.

Fine-Grained Control:

It allows for very granular control over access, based on user roles, attributes, or other contextual factors.   

This is more sophisticated than simple role-based access control (RBAC).

Use Cases:


Enterprise Search:

In corporate environments, FGARetriever can ensure that employees only see documents they are authorized to access.

Healthcare Applications:

It can be used to protect patient data, ensuring that only authorized healthcare professionals can access sensitive medical records.   

Financial Services:

It can be used to enforce regulatory compliance and protect confidential financial information.

Any application with sensitive data:

Any application that needs to protect data, and also provide search capabilities.

How It Works (Conceptual):


User Request:

A user submits a query to the retriever.

Semantic Search:

The retriever performs a semantic search to find relevant documents.

Access Control Evaluation:

The retriever evaluates access control policies based on the user's identity and attributes.

It determines which documents the user is authorized to access.

Filtered Results:

The retriever returns only the documents that are both relevant and authorized.
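
As a rough, library-agnostic sketch of this flow (base_retriever and is_authorized are hypothetical stand-ins for your semantic retriever and policy check; this is not the API of any specific FGARetriever implementation):

from typing import Callable, Dict, List

def fga_retrieve(query: str,
                 user_ctx: Dict,
                 base_retriever: Callable[[str], List[Dict]],
                 is_authorized: Callable[[Dict, Dict], bool]) -> List[Dict]:
    """Run semantic search, then keep only the documents the caller may access."""
    candidates = base_retriever(query)              # step 2: semantic search
    return [doc for doc in candidates               # steps 3-4: policy evaluation + filtering
            if is_authorized(user_ctx, doc)]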

Key Advantages:


Enhanced Security:

It strengthens data security by preventing unauthorized access to sensitive information.

Compliance:

It helps organizations comply with data privacy regulations.

Improved User Experience:

It provides users with relevant search results while protecting sensitive data.

In essence, FGARetriever adds an access control layer to your retrieval system, making it suitable for applications that require a high level of security and compliance.

Thursday, March 6, 2025

Few Middlewares for FastAPI

CORS Middleware

=======

from fastapi import FastAPI

from fastapi.middleware.cors import CORSMiddleware


app = FastAPI()


app.add_middleware(

    CORSMiddleware,

    allow_origins=["*"],  # Allows all origins

    allow_credentials=True,

    allow_methods=["*"],

    allow_headers=["*"],

)


@app.get("/")

async def root():

    return {"message": "Hello World"}


GZipMiddleware

===============


from fastapi import FastAPI

from fastapi.middleware.gzip import GZipMiddleware


app = FastAPI()

app.add_middleware(GZipMiddleware, minimum_size=1000)  # Compress responses larger than 1000 bytes


@app.get("/")

async def root():

    return {"message": "This is a test message that will be compressed."}



HTTPSRedirect Middleware

====================

from fastapi import FastAPI

from fastapi.middleware.httpsredirect import HTTPSRedirectMiddleware


app = FastAPI()

app.add_middleware(HTTPSRedirectMiddleware)


@app.get("/")

async def root():

    return {"message": "You are being redirected to HTTPS!"}


Session Middleware

=====================

from fastapi import FastAPI, Request

from starlette.middleware.sessions import SessionMiddleware


app = FastAPI()

app.add_middleware(SessionMiddleware, secret_key="your-secret-key")


@app.get("/set/")

async def set_session_data(request: Request):

    request.session['user'] = 'john_doe'

    return {"message": "Session data set"}


@app.get("/get/")

async def get_session_data(request: Request):

    user = request.session.get('user', 'guest')

    return {"user": user}



TrustedHost Middleware

======================

from fastapi import FastAPI

from fastapi.middleware.trustedhost import TrustedHostMiddleware


app = FastAPI()

app.add_middleware(TrustedHostMiddleware, allowed_hosts=["example.com", "*.example.com"])


@app.get("/")

async def root():

    return {"message": "This request came from a trusted host."}




Error Handling Middleware

=========================

from fastapi import FastAPI, Request

from fastapi.responses import JSONResponse

from starlette.middleware.base import BaseHTTPMiddleware


class ErrorHandlingMiddleware(BaseHTTPMiddleware):

    async def dispatch(self, request: Request, call_next):

        try:

            response = await call_next(request)

        except Exception as e:

            response = JSONResponse({"error": str(e)}, status_code=500)

        return response


app = FastAPI()

app.add_middleware(ErrorHandlingMiddleware)


@app.get("/")

async def root():

    raise ValueError("This is an error!")


Rate Limiting Middleware

==========================

from fastapi import FastAPI, Request

from fastapi.responses import JSONResponse

from starlette.middleware.base import BaseHTTPMiddleware

import time


class RateLimitMiddleware(BaseHTTPMiddleware):

    def __init__(self, app, max_requests: int, window: int):

        super().__init__(app)

        self.max_requests = max_requests

        self.window = window

        self.requests = {}


    async def dispatch(self, request: Request, call_next):

        client_ip = request.client.host

        current_time = time.time()


        if client_ip not in self.requests:

            self.requests[client_ip] = []


        # Keep only the timestamps that fall inside the sliding window
        self.requests[client_ip] = [timestamp for timestamp in self.requests[client_ip] if timestamp > current_time - self.window]


        if len(self.requests[client_ip]) >= self.max_requests:

            return JSONResponse(status_code=429, content={"error": "Too many requests"})


        self.requests[client_ip].append(current_time)

        return await call_next(request)



app = FastAPI()

app.add_middleware(RateLimitMiddleware, max_requests=5, window=60)


@app.get("/")

async def root():

    return {"message": "You haven't hit the rate limit yet!"}




 Authentication Middleware

==========================

from fastapi import FastAPI, Request, HTTPException

from starlette.middleware.base import BaseHTTPMiddleware

from fastapi.responses import PlainTextResponse


class AuthMiddleware(BaseHTTPMiddleware):

    async def dispatch(self, request: Request, call_next):

        token = request.headers.get("Authorization")

        if not token or token != "Bearer valid-token":

            return PlainTextResponse(status_code=401, content="Unauthorized")

        return await call_next(request)


app = FastAPI()

app.add_middleware(AuthMiddleware)


@app.get("/secure-data/")

async def secure_data():

    return {"message": "This is secured data"}




Headers Injection Middleware

===========================

from fastapi import FastAPI

from starlette.middleware.base import BaseHTTPMiddleware


class CustomHeaderMiddleware(BaseHTTPMiddleware):

    async def dispatch(self, request, call_next):

        response = await call_next(request)

        response.headers['Cache-Control'] = 'public, max-age=3600'

        response.headers["X-Content-Type-Options"] = "nosniff"

        response.headers["X-Frame-Options"] = "DENY"

        response.headers["X-XSS-Protection"] = "1; mode=block"

        response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"

        return response


app = FastAPI()

app.add_middleware(CustomHeaderMiddleware)


@app.get("/data/")

async def get_data():

    return {"message": "This response is cached for 1 hour."}



 Logging Middleware

===================

from fastapi import FastAPI, Request

import logging

from starlette.middleware.base import BaseHTTPMiddleware


logger = logging.getLogger("my_logger")


class LoggingMiddleware(BaseHTTPMiddleware):

    async def dispatch(self, request: Request, call_next):

        logger.info(f"Request: {request.method} {request.url}")

        response = await call_next(request)

        logger.info(f"Response status: {response.status_code}")

        return response


app = FastAPI()

app.add_middleware(LoggingMiddleware)


@app.get("/")

async def root():

    return {"message": "Check your logs for the request and response details."}



Timeout Middleware

==================

from fastapi import FastAPI, Request, HTTPException

from fastapi.responses import PlainTextResponse

import asyncio

from starlette.middleware.base import BaseHTTPMiddleware


class TimeoutMiddleware(BaseHTTPMiddleware):

    def __init__(self, app, timeout: int):

        super().__init__(app)

        self.timeout = timeout


    async def dispatch(self, request: Request, call_next):

        try:

            return await asyncio.wait_for(call_next(request), timeout=self.timeout)

        except asyncio.TimeoutError:

            return PlainTextResponse(status_code=504, content="Request timed out")


app = FastAPI()

app.add_middleware(TimeoutMiddleware, timeout=5)


@app.get("/")

async def root():

    await asyncio.sleep(10)  # Simulates a long-running process

    return {"message": "This won't be reached if the timeout is less than 10 seconds."}



 IP Whitelisting Middleware

===========================


from fastapi import FastAPI, Request, HTTPException

from starlette.middleware.base import BaseHTTPMiddleware

from fastapi.responses import PlainTextResponse


class IPWhitelistMiddleware(BaseHTTPMiddleware):

    def __init__(self, app, whitelist):

        super().__init__(app)

        self.whitelist = whitelist


    async def dispatch(self, request: Request, call_next):

        client_ip = request.client.host

        if client_ip not in self.whitelist:

            return PlainTextResponse(status_code=403, content="IP not allowed")

        return await call_next(request)


app = FastAPI()

app.add_middleware(IPWhitelistMiddleware, whitelist=["127.0.0.1", "192.168.1.1"])


@app.get("/")

async def root():

    return {"message": "Your IP is whitelisted!"}




ProxyHeadersMiddleware

=========================


from fastapi import FastAPI, Request

from uvicorn.middleware.proxy_headers import ProxyHeadersMiddleware



app = FastAPI()

app.add_middleware(ProxyHeadersMiddleware)



@app.get("/")

async def root(request: Request):

    return {"client_ip": request.client.host}



CSRF Middleware

================


from fastapi import FastAPI, Request

from starlette_csrf import CSRFMiddleware


app = FastAPI()


app.add_middleware(CSRFMiddleware, secret="__CHANGE_ME__")



@app.get("/")

async def root(request: Request):

    return {"message": request.cookies.get('csrftoken')}



GlobalsMiddleware

=================


from fastapi import FastAPI, Depends

from fastapi_g_context import GlobalsMiddleware, g


app = FastAPI()

app.add_middleware(GlobalsMiddleware)


async def set_globals() -> None:

    g.username = "JohnDoe"

    g.request_id = "123456"

    g.is_admin = True


@app.get("/", dependencies=[Depends(set_globals)])

async def info():

    return {"username": g.username, "request_id": g.request_id, "is_admin": g.is_admin}




spaCyLayout and PDF Extraction

Key Features of spaCyLayout

Multi-format Support

Process PDFs, Word documents, and other formats seamlessly, offering flexibility for diverse document types.

Structured Output

Extracts clean, structured data in text-based formats, simplifying subsequent analysis.

Integration with spaCy

Creates spaCy Doc objects with labeled spans and tables for seamless integration into spaCy workflows.

Chunking Support

Supports text chunking, useful for applications like Retrieval-Augmented Generation (RAG) pipelines.


import spacy

from spacy_layout import spaCyLayout


nlp = spacy.load("en_core_web_sm")

layout = spaCyLayout(nlp)


# Assuming you have a PDF file named 'document.pdf'

doc = layout("document.pdf")


# Extract the full text

print(doc.text)

# Extract tables as DataFrames

# Enumerate the tables (layout spans); Span objects do not carry an index attribute
for i, table in enumerate(doc._.tables):

    print(f"Table {i}:")

    print(table._.data)

    print("\n")

# Access layout spans with labels and attributes

for span in doc.spans["layout"]:

    print(f"Span type: {span.label_}, Text: {span.text}")


Advanced Features of spaCyLayout

Customizable Table Rendering

Customize table rendering with the display_table callback function.

Hierarchical Section Detection

Detect and organize sections using headings for improved structure.

Multi-page Document Support

Seamlessly handle multi-page documents without losing context.

Pipeline Integration

Combine spaCyLayout with spaCy’s other NLP components for enhanced processing.
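
For instance, the Doc returned by spaCyLayout can be passed back through a loaded pipeline so that tagging and entity recognition run over the extracted text. A minimal sketch, assuming spaCy v3 (whose pipeline call accepts an existing Doc) and the same "document.pdf" as above:

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.load("en_core_web_sm")
layout = spaCyLayout(nlp)

doc = layout("document.pdf")   # layout-aware parsing produces a spaCy Doc
doc = nlp(doc)                 # run the loaded pipeline (tagger, parser, NER, ...) over that Doc

for ent in doc.ents:
    print(ent.text, ent.label_)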


Best Practices

Preprocessing

Remove unnecessary elements (e.g., headers, footers, page numbers) for cleaner output.

Model Fine-Tuning

Fine-tune spaCy models for domain-specific documents to improve accuracy.

Error Handling

Handle unexpected PDF structures gracefully to avoid processing failures.

Optimized Chunking

Experiment with chunking strategies for the right balance of detail and coherence.


Tuesday, March 4, 2025

What are the components of Attention layer?

The core components of an attention layer in a transformer are the Query (Q), Key (K), and Value (V) vectors. Let's break down what they are and how they work:

1. Query, Key, and Value Vectors:

Query (Q):

The query vector represents the "search query" for information in the input sequence.   

It asks, "What information am I looking for in the other parts of the sequence?"

Key (K):

The key vectors represent the "labels" or "identifiers" of the information in the input sequence.

They say, "Here's what information I contain."

Value (V):

The value vectors represent the actual "content" or "information" associated with each key.

They say, "Here's the actual information you can retrieve."

2. How Attention Works:


The attention mechanism calculates a weighted sum of the value vectors, where the weights are determined by the similarity between the query and key vectors. Here's a step-by-step explanation:


Linear Transformations:

The input embeddings are passed through three separate linear layers to create the Q, K, and V vectors.   

Calculating Attention Scores:

The attention scores are calculated by taking the dot product of the query and key vectors.   

This dot product represents the similarity between the query and key.

The scores are then scaled by dividing by the square root of the dimension of the key vectors (to stabilize training).   

Softmax Activation:

The scaled scores are passed through a softmax function to normalize them into probabilities.   

These probabilities represent the attention weights.   

Weighted Sum:

The attention weights are then multiplied by the value vectors.   

The resulting weighted value vectors are summed to produce the output of the attention layer.
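
In formula form, this is Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V. Below is a minimal NumPy sketch of a single attention head that mirrors the four steps above; the toy dimensions and random projection matrices simply stand in for the learned linear layers:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                        # weighted sum of value vectors

# Toy example: 3 tokens, embedding dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                                   # input embeddings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))   # stand-ins for learned projections
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(scaled_dot_product_attention(Q, K, V).shape)            # (3, 4)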

3. Intuitive Analogy:

Imagine you're at a library:

Query: You're looking for a book on "machine learning." This is your query.

Keys: The library's card catalog contains cards with titles and keywords. These are the keys.

Values: The actual books on the shelves are the values.

The attention mechanism helps you find the books (values) that are most relevant to your query (machine learning) by comparing your query with the keywords in the card catalog (keys).   

4. Significance:

Capturing Relationships: Attention allows the transformer to capture long-range dependencies and relationships between words in a sequence.   

Parallel Processing: The attention mechanism can be computed in parallel, making transformers highly efficient.   

Contextual Understanding: Attention enables the model to focus on the most relevant parts of the input sequence for each word, leading to a better contextual understanding.   

In summary: The attention layer uses Query, Key, and Value vectors to enable the model to focus on the most relevant parts of the input sequence. This mechanism is a key component of the transformer architecture and is responsible for its success in various natural language processing tasks.

Explanation of the Diagram:


Input Embeddings:

The process begins with the input sequence, which has been converted into numerical embeddings.

Linear Transformations:

The input embeddings are passed through three separate linear layers (represented by the arrows) to create the Query (Q), Key (K), and Value (V) vectors.

Dot Product (Q * Kᵀ):

The Query (Q) and transposed Key (Kᵀ) vectors are multiplied using a dot product. This calculates the similarity between each query and each key.

Scale and Softmax:

The dot product results are scaled (divided by the square root of the dimension of the key vectors) and then passed through a softmax function. This normalizes the scores into attention weights (probabilities).

Attention Weights:

The attention weights represent how much attention each key-value pair should receive.

Multiply by V:

The attention weights are multiplied by the Value (V) vectors. This creates weighted value vectors.

Weighted Value Vectors:

The weighted value vectors represent the information from the value vectors, weighted by their relevance to the query.

Summation:

The weighted value vectors are summed together to produce the final output of the attention layer.

Attention Output:

The attention output is a vector that represents the contextually relevant information from the input sequence.

Visualizing the "Attention":

Imagine drawing lines (or arrows) between the words in the input sequence, where the thickness of the line represents the attention weight. The thicker the line, the more attention the model is paying to that word.

Key Concepts in the Diagram:

Q, K, V: The core components of the attention mechanism.

Dot Product: A measure of similarity.

Softmax: Normalizes the scores into probabilities.

Weighted Sum: Combines the value vectors based on their attention weights.

This visual representation should help you understand how the attention mechanism works within a transformer layer.



Gemini 2.0 or LlamaParse?

When comparing LlamaParse and Gemini 2.0 for PDF parsing, it's essential to consider factors beyond just speed, such as accuracy, cost, and specific use-case requirements. Here's a breakdown based on available information:

LlamaParse:

Strengths:

Known for its reliability and focus on structured document parsing.   

Offers features like multilingual translation during parsing.   

Designed to handle complex document layouts.   

Allows for the plugging in of external multimodal model vendors, like Gemini 2.0.   

Considerations:

Performance can vary between free and premium versions.

Specific features, like image extraction, might have limitations.   

Gemini 2.0:

Strengths:

Leverages powerful multimodal capabilities, enabling it to understand both text and visual elements in PDFs.   

Demonstrates strong performance in processing diverse document types.

Potential for significant cost reduction in large-scale PDF processing.

It is being used within the LlamaParse framework.   

Considerations:

Accuracy can still have minor discrepancies, especially with complex formatting.

Performance and cost may vary depending on the specific Gemini 2.0 model used.

Speed and Performance:


Reports indicate that using LLMs like Gemini 2.0 can drastically reduce processing times compared to traditional PDF parsers.   

LlamaParse, especially when integrated with models like Gemini 2.0, aims to provide optimized and efficient parsing.

Therefore, it is hard to give a definitive answer as to which is faster, as it is becoming common to use Gemini within the LlamaParse framework.

"Better" Depends on Your Needs:


For high accuracy and complex layouts: LlamaParse, especially when using multimodal models, is a strong contender.   

For large-scale processing and cost-effectiveness: Gemini 2.0 shows significant promise.

For applications needing multimodal understanding: Gemini 2.0's capabilities are a clear advantage.

Key Takeaways:


The landscape of PDF parsing is rapidly evolving with the advancements in LLMs.

Both LlamaParse and Gemini 2.0 offer powerful capabilities, and their performance can be further enhanced when used in conjunction.

Consider your specific requirements, such as document complexity, processing volume, and cost constraints, when making a decision.


What is Lexoid PDF parser?

Lexoid is a document parsing library developed by Oid Labs that efficiently extracts structured data from PDF documents. It supports both Large Language Model (LLM)-based and non-LLM (static) parsing methods, offering flexibility based on specific use cases. 

Pros:

Versatility: By supporting both LLM-based and non-LLM parsing, Lexoid can adapt to various document structures and complexities.

Efficiency: The library is designed for efficient parsing, making it suitable for applications requiring quick data extraction.

Open Source: Being open-source, Lexoid allows for customization and integration into diverse projects.

Cons:

Maturity: As a relatively new tool, Lexoid may still be undergoing development and optimization, potentially leading to undiscovered bugs or limitations.

Community Support: Given its recent introduction, there might be limited community resources or documentation available.

In summary, Lexoid offers a flexible and efficient solution for PDF parsing, accommodating both LLM-based and traditional parsing approaches. However, users should be mindful of its current development stage and the potential need for community support. 

Multimodal Parsing Capabilities:

While Lexoid is designed for efficient document parsing, the available information does not specify its capabilities regarding the extraction of diverse elements such as text, paragraphs, tables, and images from PDFs. Additionally, there is no explicit mention of its support for complex layouts, including two-column formats.

Handling Complex Layouts:

The documentation does not provide details on Lexoid's ability to manage complex PDF layouts, such as multi-column formats or intricate designs.

Alternative Tools for Complex PDF Parsing:

If your requirements include parsing PDFs with complex layouts, including tables and images, you might consider the following tools:

PyMuPDF and pypdfium: These libraries have demonstrated effectiveness in handling complex layouts and paragraph structures. 

LlamaIndex's Smart PDF Loader: This tool processes PDFs by understanding their layout structures, such as nested sections, lists, paragraphs, and tables, and smartly chunks them into optimal short contexts for LLMs. 

Marker API: Provides a simple endpoint for converting PDF documents to Markdown, supporting multiple PDFs simultaneously and effectively managing complex documents. 

In summary, while Lexoid offers efficient document parsing capabilities, its support for multimodal parsing and complex layouts is not clearly documented. If your project requires handling such complexities, exploring the aforementioned alternatives may be beneficial.

What is MinerU PDF Parser

MinerU is a powerful open-source PDF data extraction tool developed by OpenDataLab. It intelligently converts PDF documents into structured data formats, supporting precise extraction of text, images, tables, and mathematical formulas. 

Advantages:

Accurate Content Extraction: MinerU combines the benefits of accurate content extraction and faster processing in text mode, along with precise span/line region recognition in OCR mode. 

Structure Preservation: The tool maintains the hierarchical structure of the original document, ensuring that the extracted data reflects the original formatting and organization. 

Multimodal Support: MinerU accurately extracts various elements, including images, tables, and captions, making it versatile for different document types. 

Formula Conversion: It recognizes mathematical formulas and converts them into LaTeX format, which is beneficial for processing scientific and technical documents. 

Multilingual OCR: The tool supports text recognition in 84 languages, enhancing its applicability across diverse linguistic documents. 

Cross-Platform Compatibility: MinerU operates on all major operating systems, providing flexibility for users across different platforms.

Disadvantages:

Complexity for Beginners: Due to its powerful features, MinerU's API can be relatively complex, resulting in a higher learning curve for beginners. 

Performance Variability: As a newer tool, MinerU may have certain pros and cons, and its performance might vary depending on specific use cases. 

In summary, MinerU offers a comprehensive solution for extracting structured data from PDFs, with robust features catering to complex documents. However, new users should be prepared for a learning curve due to its feature-rich API.

references:

OpenAI 

Saturday, March 1, 2025

What is a Cross Encoder? ( Re-ranker)

Characteristics of Cross Encoder (a.k.a reranker) models:

Calculates a similarity score given pairs of texts.

Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model.

Often slower than a Sentence Transformer model, as it requires computation for each pair rather than each text.

Due to the previous 2 characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformer model.

In Sentence Transformers, a Cross-Encoder is a model architecture designed to compute the similarity between two sentences by considering them jointly. This is in contrast to Bi-Encoders, which encode each sentence independently into vector embeddings.

Here's a breakdown of what a Cross-Encoder is and how it works:

Key Characteristics:

Joint Encoding:

A Cross-Encoder takes both sentences as input at the same time.

It processes them through the transformer network together, allowing the model to capture intricate relationships and dependencies between the words in both sentences.

Accurate Similarity Scores:

Because of this joint processing, Cross-Encoders tend to produce more accurate similarity scores than Bi-Encoders.

They can capture subtle semantic nuances that Bi-Encoders might miss.

Computational Cost:

Cross-Encoders are significantly more computationally expensive than Bi-Encoders.

They cannot pre-compute embeddings for a large corpus of text.

Similarity scores are calculated on-the-fly for each pair of sentences.

Pairwise Comparisons:

Cross-Encoders are best suited for scenarios where you need to compare a relatively small number of sentence pairs.

They excel in tasks like re-ranking search results or determining the similarity between two specific sentences.

How It Works:


Input:

The two sentences to be compared are concatenated or combined in a specific way (e.g., separated by a special token like [SEP]).

Transformer Processing:

The combined input is fed into a transformer-based model (e.g., BERT, RoBERTa).

The model processes the input jointly, attending to the relationships between words in both sentences.

Similarity Score:

The output of the transformer is typically a single value or a vector that represents the similarity between the two sentences.

This value is often passed through a sigmoid function to produce a similarity score between 0 and 1.
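
For illustration, the Sentence Transformers library exposes a CrossEncoder class that scores (query, candidate) pairs jointly; the checkpoint below is one commonly used MS MARCO re-ranker, and the texts are placeholders:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do transformers capture long-range dependencies?"
candidates = [
    "Attention lets every token attend to every other token in the sequence.",
    "Gradient boosting builds an ensemble of shallow decision trees.",
]

# Score each (query, candidate) pair jointly, then sort by relevance
scores = model.predict([(query, c) for c in candidates])
for text, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {text}")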

When to Use Cross-Encoders:


Re-ranking:

After retrieving a set of candidate documents using a Bi-Encoder, you can use a Cross-Encoder to re-rank the results for improved accuracy.

Semantic Textual Similarity (STS):

For tasks that require highly accurate similarity scores, such as determining the degree of similarity between two sentences.

Question Answering:

When comparing a question to a set of candidate answers, a Cross-Encoder can provide more accurate relevance scores.

When Not to Use Cross-Encoders:

Large-Scale Similarity Search:

If you need to find the most similar sentences in a large corpus, Bi-Encoders are much more efficient.

Real-Time Applications:

The computational cost of Cross-Encoders can make them unsuitable for real-time applications with high throughput requirements.

In essence:

Cross-Encoders prioritize accuracy over speed, making them ideal for tasks where precision is paramount and the number of comparisons is manageable. Bi-Encoders, on the other hand, prioritize speed and scalability, making them suitable for large-scale information retrieval.

References:

https://www.sbert.net/docs/cross_encoder/usage/usage.html

Thursday, February 27, 2025

What are various types of Indexes in LLamaIndex

An Index is a data structure that allows us to quickly retrieve relevant context for a user query. For LlamaIndex, it's the core foundation for retrieval-augmented generation (RAG) use-cases.


At a high level, Indexes are built from Documents. They are used to build Query Engines and Chat Engines, which enable question-and-answer and chat over your data.


Under the hood, Indexes store data in Node objects (which represent chunks of the original documents), and expose a Retriever interface that supports additional configuration and automation.


The most common index by far is the VectorStoreIndex;


The sections below describe how each index works.


Some terminology:


Node: Corresponds to a chunk of text from a Document. LlamaIndex takes in Document objects and internally parses/chunks them into Node objects.

Response Synthesis: Our module which synthesizes a response given the retrieved Node.


Summary Index (formerly List Index)

The summary index simply stores Nodes as a sequential chain.


Querying

During query time, if no other query parameters are specified, LlamaIndex simply loads all Nodes in the list into our Response Synthesis module.


The summary index also offers several other ways of querying, such as an embedding-based query that fetches the top-k neighbors, optionally combined with a keyword filter; a minimal sketch follows below.
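
A minimal sketch, assuming the current llama-index core API and a configured LLM/embedding model (e.g. via OPENAI_API_KEY); the "data" folder and the retriever_mode argument are illustrative:

from llama_index.core import SimpleDirectoryReader, SummaryIndex

documents = SimpleDirectoryReader("data").load_data()   # "data" is a placeholder folder
index = SummaryIndex.from_documents(documents)

# Default mode: load every Node into response synthesis
default_engine = index.as_query_engine()

# Embedding mode: fetch only the top-k most similar Nodes first
embedding_engine = index.as_query_engine(retriever_mode="embedding", similarity_top_k=3)
print(embedding_engine.query("Summarize the key points."))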


Vector Store Index


The vector store index stores each Node and a corresponding embedding in a Vector Store.


Querying

Querying a vector store index involves fetching the top-k most similar Nodes, and passing those into our Response Synthesis module.



Tree Index

The tree index builds a hierarchical tree from a set of Nodes (which become leaf nodes in this tree).


Querying

Querying a tree index involves traversing from root nodes down to leaf nodes. By default (child_branch_factor=1), a query chooses one child node given a parent node. If child_branch_factor=2, a query chooses two child nodes per level.


Keyword Table Index

The keyword table index extracts keywords from each Node and builds a mapping from each keyword to the corresponding Nodes of that keyword.



Querying

During query time, we extract relevant keywords from the query, and match those with pre-extracted Node keywords to fetch the corresponding Nodes. The extracted Nodes are passed to our Response Synthesis module.




Property Graph Index

The Property Graph Index works by first building a knowledge graph containing labelled nodes and relations. The construction of this graph is extremely customizable, ranging from letting the LLM extract whatever it wants, to extracting using a strict schema, to even implementing your own extraction modules.


Optionally, nodes can also be embedded for retrieval later.


You can also skip creation, and connect to an existing knowledge graph using an integration like Neo4j.


Querying

Querying a Property Graph Index is also highly flexible. Retrieval works by using several sub-retrievers and combining results. By default, keyword + synonym expansion is used, as well as vector retrieval (if your graph was embedded), to retrieve relevant triples.


You can also choose to include the source text in addition to the retrieved triples (unavailable for graphs created outside of LlamaIndex).
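
A minimal sketch under the same assumptions as the earlier examples (an LLM and embedding model configured; "data" is a placeholder folder):

from llama_index.core import SimpleDirectoryReader, PropertyGraphIndex

documents = SimpleDirectoryReader("data").load_data()       # "data" is a placeholder folder
pg_index = PropertyGraphIndex.from_documents(documents)     # LLM-driven extraction by default

# include_text=True returns the source text alongside the retrieved triples
pg_engine = pg_index.as_query_engine(include_text=True)
print(pg_engine.query("Which entities are related to the main topic?"))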





references:

https://docs.llamaindex.ai/en/stable/module_guides/indexing/