Monday, December 30, 2024

How to run MongoDB Atlas vector DB locally?

It's pretty easy:

docker pull mongodb/mongodb-atlas-local:latest

docker run -p 27017:27017 mongodb/mongodb-atlas-local


Below are the connection strings without and with authentication:

mongosh "mongodb://localhost:27017/?directConnection=true"

mongosh "mongodb://user:pass@localhost:27017/?directConnection=true"


references:

https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-deploy-docker/

How to create a docker image and push to Docker hub

Create a Dockerfile like the one below:

================================

# Use Python 3.12 as the base image

FROM python:3.12-slim

# Set the working directory

WORKDIR /app

# Copy the requirements and install dependencies

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code

COPY ./app ./app

# Expose the application port

EXPOSE 8000

# Start the FastAPI app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]


Create a requirements.txt file like below:

=====================================

fastapi

uvicorn


Create main.py like below:

===========================

from fastapi import FastAPI

app = FastAPI()

@app.get("/")

def read_root():

    return {"message": "Hello, Dockerized FastAPI!"}


Now the build and push process is as below.

The steps below are to be executed after creating an account on Docker Hub. Instead of typing your password on the terminal, you can create an access token and use that instead.


docker build -t mrrathish/crashing-docker-app:latest .

docker login

docker push mrrathish/crashing-docker-app:latest


docker run -p 8000:8000 mrrathish/crashing-docker-app:latest
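Once the container is running, a quick sanity check from Python (a sketch; assumes the requests package is installed):

import requests

# Hit the FastAPI root endpoint exposed by the container on port 8000
response = requests.get("http://localhost:8000/")
print(response.json())  # {'message': 'Hello, Dockerized FastAPI!'}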

That's pretty much it.


Friday, December 27, 2024

What is a Container, a Container Image?

A container is run from a container image.

A container image is a static version of all the files, environment variables, and the default command/program that should be present in a container. Static here means that the container image is not running, it's not being executed, it's only the packaged files and metadata.

In contrast to a "container image" that is the stored static contents, a "container" normally refers to the running instance, the thing that is being executed.

When the container is started and running (started from a container image) it could create or change files, environment variables, etc. Those changes will exist only in that container, but would not persist in the underlying container image (would not be saved to disk).

A container image is comparable to the program file and contents, e.g. python and some file main.py.

And the container itself (in contrast to the container image) is the actual running instance of the image, comparable to a process. In fact, a container is running only when it has a process running (and normally it's only a single process). The container stops when there's no process running in it.

A container image normally includes in its metadata the default program or command that should be run when the container is started and the parameters to be passed to that program. Very similar to what would be used if it was in the command line.

When a container is started, it will run that command/program (although you can override it and make it run a different command/program).

A container is running as long as the main process (command or program) is running.

A container normally has a single process, but it's also possible to start subprocesses from the main process, and that way you will have multiple processes in the same container.

But it's not possible to have a running container without at least one running process. If the main process stops, the container stops.

references:

https://fastapi.tiangolo.com/deployment/docker/#what-is-a-container-image

Sunday, December 22, 2024

How to use Pydantic to declare JSON data models (data shapes)

 First, you need to import BaseModel from pydantic and then use it to create subclasses defining the schema, or data shapes, you want to receive.

Next, you declare your data model as a class that inherits from BaseModel, using standard Python types for all the attributes:

# main.py

from typing import Optional

from fastapi import FastAPI

from pydantic import BaseModel


class Item(BaseModel):

    name: str

    description: Optional[str] = None

    price: float

    tax: Optional[float] = None


app = FastAPI()


@app.post("/items/")

async def create_item(item: Item):

    return item


When a model attribute has a default value, it is not required. Otherwise, it is required. To make an attribute optional, you can use None.


For example, the model above declares a JSON object (or Python dict) like this:


{

    "name": "Foo",

    "description": None,

    "price": 45.2,

    "tax": None

}


In this case, since description and tax are optional because they have a default value of None, this JSON object would also be valid:


{

    "name": "Foo",

    "price": 45.2

}


A JSON object that omits the default values is also valid.

Next, add the new pydantic model to your path operation as a parameter. You declare it the same way you declared path parameters:



# main.py


from typing import Optional


from fastapi import FastAPI

from pydantic import BaseModel


class Item(BaseModel):

    name: str

    description: Optional[str] = None

    price: float

    tax: Optional[float] = None


app = FastAPI()


@app.post("/items/")

async def create_item(item: Item):

    return item

The parameter item has a type hint of Item, which means that item is declared as an instance of the class Item.

With that Python type declaration, FastAPI will:

Read the body of the request as JSON

Convert the corresponding types if needed

Validate the data and return a clear error if it is invalid

Give you the received data in the parameter item—since you declared it to be of type Item, you will also have all the editor support, with completion and type checks for all the attributes and their types

By using standard type hints with pydantic, FastAPI helps you build APIs that have all these best practices by default, with little effort.
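As a small illustration of the validation step, here is a sketch using Pydantic directly, outside of FastAPI (the model is the same Item declared above):

from typing import Optional

from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    name: str
    description: Optional[str] = None
    price: float
    tax: Optional[float] = None


# Valid payload: optional fields fall back to their defaults
item = Item(name="Foo", price=45.2)
print(item)

# Invalid payload: "price" cannot be parsed as a float
try:
    Item(name="Foo", price="not-a-number")
except ValidationError as exc:
    print(exc)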


References:

https://realpython.com/fastapi-python-web-apis/#create-a-first-api


Uvicorn and alternatives

What is Uvicorn?

Uvicorn is a lightning-fast ASGI (Asynchronous Server Gateway Interface) server designed to run Python web applications. It supports asynchronous frameworks like FastAPI, Starlette, and others. Uvicorn is built on top of uvloop and httptools, providing excellent performance for handling concurrent requests in modern web applications.

Why is Uvicorn Required for FastAPI?

FastAPI is an ASGI framework, meaning it requires an ASGI server to handle HTTP requests and serve the application. Uvicorn is a popular choice because:

Asynchronous Support: It natively supports async features, which are central to FastAPI’s high-performance capabilities.

Performance: Uvicorn is optimized for speed and can efficiently handle a large number of concurrent requests.

Compatibility: Uvicorn is fully compatible with FastAPI and provides seamless integration.

Ease of Use: It's simple to install and use, with minimal configuration required.

Without a server like Uvicorn, FastAPI can't process incoming HTTP requests or serve responses.
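For reference, Uvicorn can be started from the command line (for example, uvicorn app.main:app --host 0.0.0.0 --port 8000, as in the Dockerfile above) or programmatically; a minimal sketch:

import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"message": "Hello from Uvicorn"}


if __name__ == "__main__":
    # Equivalent to: uvicorn main:app --host 0.0.0.0 --port 8000
    uvicorn.run(app, host="0.0.0.0", port=8000)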


Alternatives to Uvicorn

There are other ASGI servers available that can be used instead of Uvicorn. Here are some common alternatives:

Daphne

Developed by the Django Channels team.

Suitable for applications that require WebSocket support or compatibility with Django Channels.

Less performant than Uvicorn in general cases.

Command:


daphne myapp:app

Hypercorn


A highly configurable ASGI server.

Supports multiple protocols, including HTTP/1, HTTP/2, WebSocket, and QUIC.

A good alternative if fine-grained control over server behavior is needed.

Command:


hypercorn myapp:app

ASGI Built-in Development Server

Some ASGI frameworks come with built-in development servers for local testing.

Not recommended for production.




Saturday, December 21, 2024

What is Dependency Injection in Python?

Dependency Injection (DI) in Python is a design pattern where the dependencies of a class or function are provided (injected) from the outside, rather than being created or managed by the class or function itself. This approach makes the code more modular, testable, and easier to maintain.


Key Concepts

Dependency: Any external object or resource that a class or function needs to operate (e.g., a database connection, an API client, a logger).

Injection: Supplying the dependency from outside the class or function, typically as an argument.

Why Use Dependency Injection?

Decoupling: Reduces tight coupling between components.

Testability: Makes it easier to test classes or functions by providing mock or stub dependencies.

Flexibility: Allows swapping out dependencies without modifying the dependent class or function.

Example Without Dependency Injection

Here, the dependency (Logger) is created inside the Service class, which tightly couples the two.


class Logger:

    def log(self, message):

        print(f"LOG: {message}")


class Service:

    def __init__(self):

        self.logger = Logger()  # Dependency is created inside the class


    def perform_task(self):

        self.logger.log("Task performed")


Example With Dependency Injection

In this example, the Logger is injected into the Service class, making the Service independent of how the Logger is implemented.






class Logger:

    def log(self, message):

        print(f"LOG: {message}")


class Service:

    def __init__(self, logger):  # Dependency is injected

        self.logger = logger


    def perform_task(self):

        self.logger.log("Task performed")


# Injecting the dependency

logger = Logger()

service = Service(logger)

service.perform_task()


Benefits

The Service class does not need to know how the Logger is implemented.

You can easily swap out Logger for another implementation (e.g., FileLogger, DatabaseLogger) without modifying Service.
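For example, a fake logger can be injected in a test (a sketch, reusing the Logger/Service classes defined above):

class FakeLogger:
    def __init__(self):
        self.messages = []

    def log(self, message):
        # Record messages instead of printing, so a test can assert on them
        self.messages.append(message)


def test_perform_task_logs_message():
    fake = FakeLogger()
    service = Service(fake)  # inject the fake dependency
    service.perform_task()
    assert fake.messages == ["Task performed"]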

Dependency Injection with Frameworks

In larger applications, dependency injection frameworks (like Dependency Injector or pytest fixtures in testing) can help manage dependencies systematically.


from dependency_injector import containers, providers


class Logger:

    def log(self, message):

        print(f"LOG: {message}")


class Service:

    def __init__(self, logger):

        self.logger = logger


    def perform_task(self):

        self.logger.log("Task performed")


# DI container

class Container(containers.DeclarativeContainer):

    logger = providers.Factory(Logger)

    service = providers.Factory(Service, logger=logger)


# Using the container

container = Container()

service = container.service()

service.perform_task()



1. Providers

Providers are objects responsible for creating and managing dependencies. They can define how objects (dependencies) are created and supply these objects to other parts of the application.


Providers can:


Instantiate objects.

Return pre-configured instances.

Manage singletons (single instances reused across the application).

Provide factories for creating new instances on demand.

Types of Providers

Here are some commonly used provider types in Dependency Injector:


Factory: Creates a new instance of an object every time it's called.

Singleton: Creates and returns the same instance for every call.

Callable: Calls a specified function or callable.

Configuration: Provides values from an external configuration source (e.g., environment variables, files).

Delegate: Delegates provisioning to another provider.

Resource: Manages external resources with lifecycle hooks like initialization and cleanup


from dependency_injector import providers


# Factory provider (creates a new instance each time)

class Logger:

    def log(self, message):

        print(f"LOG: {message}")


logger_provider = providers.Factory(Logger)

logger_instance_1 = logger_provider()

logger_instance_2 = logger_provider()


print(logger_instance_1 is logger_instance_2)  # False (new instance each time)


# Singleton provider (creates one instance only)

singleton_logger = providers.Singleton(Logger)

logger_instance_3 = singleton_logger()

logger_instance_4 = singleton_logger()


print(logger_instance_3 is logger_instance_4)  # True (same instance)
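The Configuration provider mentioned above can be sketched like this (the log_level key and its value are just assumptions for illustration):

from dependency_injector import containers, providers


class Container(containers.DeclarativeContainer):
    config = providers.Configuration()


container = Container()
container.config.from_dict({"log_level": "INFO"})  # could also come from env vars or files
print(container.config.log_level())  # "INFO"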


Sunday, December 15, 2024

Create Embedding Model that adds additional context

 1. Conceptual Understanding

Core Idea: You want to create embeddings that not only represent the input text but also incorporate external knowledge or context. This can significantly improve the quality of similarity searches and downstream tasks.

Example: Imagine you're embedding product descriptions. Adding context like brand, category, or even user purchase history can lead to more relevant recommendations.

2. Methods

Concatenation:

Approach:

Obtain Context Embeddings: Generate embeddings for the context information (e.g., brand, category) using the same or a different embedding model.

Concatenate: Concatenate the context embeddings with the input text embeddings.

Example:

Input: "Comfortable shoes"

Context: "Brand: Nike, Category: Running"

Embedding: embed("Comfortable shoes") + embed("Brand: Nike") + embed("Category: Running")

Weighted Sum:

Approach:

Obtain embeddings as in concatenation.

Assign weights to each embedding based on its importance.

Calculate a weighted sum of the embeddings.

Example:

weighted_embedding = 0.7 * embed("Comfortable shoes") + 0.2 * embed("Brand: Nike") + 0.1 * embed("Category: Running")

Contextualized Embeddings:

Approach:

Use a language model (like BERT or GPT) to generate embeddings.

Feed the input text and context to the model simultaneously.

The model will generate embeddings that capture the interaction between the text and the context.

Implementation: Utilize Hugging Face Transformers library for easy access to pre-trained models.

3. Implementation Example (Concatenation with Sentence-Transformers)


Python


import numpy as np
from sentence_transformers import SentenceTransformer


model = SentenceTransformer('all-MiniLM-L6-v2') 


def embed_with_context(text, context):

    """

    Generates embeddings for the input text with added context.


    Args:

        text: The input text.

        context: A dictionary containing context information.


    Returns:

        The concatenated embedding.

    """

    text_embedding = model.encode(text)

    context_embeddings = [model.encode(f"{key}: {value}") for key, value in context.items()]

    return np.concatenate([text_embedding] + context_embeddings)


# Example Usage

input_text = "Comfortable shoes"

context = {"Brand": "Nike", "Category": "Running"}

embedding = embed_with_context(input_text, context)
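A weighted-sum variant can be sketched the same way (the 0.7 text weight is an arbitrary assumption and should be tuned for the task):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')


def embed_weighted(text, context, text_weight=0.7):
    """Weighted sum of the text embedding and equally weighted context embeddings."""
    text_embedding = model.encode(text)
    context_embeddings = [model.encode(f"{key}: {value}") for key, value in context.items()]
    context_weight = (1.0 - text_weight) / max(len(context_embeddings), 1)
    combined = text_weight * text_embedding
    for emb in context_embeddings:
        combined = combined + context_weight * emb
    return combined


embedding = embed_weighted("Comfortable shoes", {"Brand": "Nike", "Category": "Running"})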

4. Key Considerations


Context Representation: Choose a suitable format for representing context (dictionaries, lists, etc.).

Embedding Model: Select an embedding model that aligns with your context and task.

Weighting: Experiment with different weighting schemes for optimal results.

Evaluation: Thoroughly evaluate the performance of your custom embeddings on your specific task.

Remember: The effectiveness of your custom embeddings will depend on the quality and relevance of the context information you provide. Experiment with different approaches and carefully evaluate the results to find the best solution for your use case.


What are various datasets for similarity search performance?

1. SQuAD (Stanford Question Answering Dataset)

Use Case: Test vector search accuracy for question-answer retrieval.

How to Use:

Preprocess the dataset to create embeddings for questions and answers.

Use questions as queries and measure the retrieval accuracy against their respective answers.


2. MS MARCO

Use Case: Benchmark performance for document or passage ranking tasks.

How to Use:

Use passages and queries provided in the dataset.

Generate embeddings and use queries to retrieve relevant passages.


3. Quora Question Pairs

Use Case: Test duplicate or paraphrased question detection.

How to Use:

Generate embeddings for questions.

Use one question as a query and check if its paraphrase is retrieved as the most similar result.


4. TREC Datasets

Use Case: Test retrieval systems for a variety of topics, including QA, passage retrieval, and entity matching.

How to Use:

Choose a task, create embeddings, and evaluate search performance.


5. Synthetic Dataset for Quick Testing

If you want something lightweight and simple to start:



from sklearn.datasets import make_blobs

import numpy as np


# Generate synthetic data

data, _ = make_blobs(n_samples=100, centers=5, n_features=50, random_state=42)


# Convert to strings for a mock dataset

docs = [" ".join(map(str, row)) for row in data]


# Example: Simulate a search query

query = np.mean(data[:10], axis=0)  # Take an average vector as query
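To complete the quick test, the query can be compared against the synthetic vectors with cosine similarity (a sketch using scikit-learn, reusing data and query from the snippet above):

from sklearn.metrics.pairwise import cosine_similarity

# Rank the synthetic documents by similarity to the query vector
scores = cosine_similarity(query.reshape(1, -1), data)[0]
top_k = scores.argsort()[::-1][:5]
print("Top-5 most similar rows:", top_k)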


1. Sentence Similarity Datasets:

STSBenchmark: A collection of sentence pairs with human-rated similarity scores. This dataset is widely used for evaluating sentence embedding models and similarity search.   

SemEval datasets: SemEval has hosted several tasks related to semantic similarity, including paraphrase identification and textual entailment, which can provide valuable data for evaluating vector similarity search.   

Quora Question Pairs: A dataset of question pairs from Quora, where the task is to determine whether a pair of questions are duplicates.   

2. Text Retrieval Datasets:


MS MARCO: A large-scale dataset for document ranking, containing passages from Wikipedia and a set of queries.   

TREC datasets: A collection of datasets for information retrieval tasks, including question answering and document retrieval. Some TREC datasets can be adapted for vector similarity search evaluation.   

News2Dataset: A dataset of news articles and their corresponding summaries, which can be used to evaluate the retrieval of relevant documents based on query vectors.

3. Image Retrieval Datasets:


ImageNet: A large image dataset with millions of images and associated labels. It can be used to evaluate image similarity search by comparing query images to images in the dataset.   

Places365: A dataset of images categorized into 365 scene categories, which can be used to evaluate place recognition and image retrieval based on visual similarity.   

4. Code Search Datasets:


GitHub CodeSearchNet: A large dataset of code snippets and their natural language descriptions, which can be used to evaluate code search based on textual queries.   

Key Considerations When Choosing a Dataset:

Relevance to your use case: Select a dataset that is relevant to the specific application of vector similarity search you are interested in (e.g., question answering, product recommendation, image search).

Dataset size: Choose a dataset that is large enough to provide a meaningful evaluation of your system's performance.

Data quality: Ensure that the dataset is of high quality and free from errors or biases.

Availability and licensing: Make sure that the dataset is readily available and that you have the necessary rights to use it for your evaluation.

By using these datasets, you can effectively test the performance of your vector store and compare different approaches to vector similarity search.



Friday, December 13, 2024

What are some of Advanced RAG techniques

Here are some advanced RAG techniques

Input/output validation: Ensuring groundedness. This technique verifies that the input query and generated output align with specific use cases and company policies. It helps maintain control over the LLM’s responses, preventing unintended or harmful outputs.


Guardrails: Compliance and auditability. Guardrails ensure that queries and responses adhere to relevant regulations and ethical guidelines. They also make it possible to track and review interactions with the LLM for accountability and transparency.


Explainable responses: This aspect involves providing clear explanations for how the LLM arrived at its conclusions. This is crucial for building trust and understanding the reasoning behind the model’s outputs.


Caching: Efficient handling of similar queries. Semantic caching optimizes the LLM’s performance by storing and reusing the results of similar queries. This reduces latency and improves the overall efficiency of the system (see the sketch after this list).


Hybrid search: Combining semantic and keyword matching. This technique leverages both semantic understanding and exact keyword matching to retrieve the most relevant information from the knowledge base. This approach enhances the accuracy and breadth of the LLM’s responses.


Re-ranking: Improving relevance and accuracy. Re-ranking involves retrieving a set of relevant data points and reordering them based on their relevance to the specific query. This helps ensure the most pertinent information is presented to the user.


Evals: Continuous self-learning. Evals use techniques like Reinforcement Learning from Human Feedback (RLHF) to continuously improve the LLM’s performance. This involves collecting human feedback on the model’s responses and using that feedback to refine its future outputs.
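Here is the sketch of the semantic caching technique referenced above (the all-MiniLM-L6-v2 model and the 0.9 similarity threshold are assumptions for illustration, not a prescribed setup):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (query_embedding, response) pairs


def cached_answer(query, llm_call, threshold=0.9):
    """Return a cached response for semantically similar queries, otherwise call the LLM."""
    q_emb = model.encode(query)
    for emb, response in cache:
        sim = np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb))
        if sim >= threshold:
            return response  # cache hit for a similar enough query
    response = llm_call(query)
    cache.append((q_emb, response))
    return response


def fake_llm(q):
    return f"answer to: {q}"


print(cached_answer("How do I reset my password?", fake_llm))
print(cached_answer("How can I reset my password?", fake_llm))  # likely served from the cache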

references
OpenAI 
https://levelup.gitconnected.com/building-enterprise-ai-apps-with-multi-agent-rag-06356b35ba1a

LangChain working with Google Gemini

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(

    model="gemini-1.5-pro",

    temperature=0,

    max_tokens=None,

    timeout=None,

    max_retries=2,

    # other params...

)

messages = [

    (

        "system",

        "You are a helpful assistant that translates English to French. Translate the user sentence.",

    ),

    ("human", "I love programming."),

]

ai_msg = llm.invoke(messages)

ai_msg


print(ai_msg.content)



Chaining works like below:


from langchain_core.prompts import ChatPromptTemplate


prompt = ChatPromptTemplate.from_messages(

    [

        (

            "system",

            "You are a helpful assistant that translates {input_language} to {output_language}.",

        ),

        ("human", "{input}"),

    ]

)

chain = prompt | llm

chain.invoke(

    {

        "input_language": "English",

        "output_language": "German",

        "input": "I love programming.",

    }

)

references:

https://python.langchain.com/docs/integrations/chat/google_generative_ai/


Thursday, December 12, 2024

What is Gemini 2.0

 Gemini 2.0 Flash is available in Gemini API and Google AI Studio. Building on the success of Gemini 1.5 Flash, the 2.0 release introduces performance improvements, new Multimodal Live API, expanded output modalities, and native tool use.

Here's a breakdown of what's new:

Improved Performance: Gemini 2.0 Flash Experimental provides double the speed of Gemini 1.5 Pro, with enhanced capabilities across multimodal understanding, text, code, video, and spatial reasoning.

Multimodal Live API: Develop real-time applications with streaming audio and video, natural conversational patterns, and tool integration via the new Multimodal Live API.

Native Tool Use: Build intelligent agentic features with integrated tools such as Google Search (including parallel search), code execution, and custom function calling.

Sample code is as below 


import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content("Explain how AI works")

print(response.text)


Gemini 2.0 is a multimodal model with the capabilities described above.




What is Command in Langgraph

Command can be useful to combine control flow (edges) and state updates (nodes). For example, you might want to BOTH perform state updates AND decide which node to go to next in the SAME node. LangGraph provides a way to do so by returning a Command object from node functions:

from typing import Literal

from langgraph.types import Command


def my_node(state: State) -> Command[Literal["my_other_node"]]:

    return Command(

        # state update

        update={"foo": "bar"},

        # control flow

        goto="my_other_node"

    )

For example, like this below:

import random


def node_a(state: State) -> Command[Literal["node_b", "node_c"]]:

    print("Called A")

    value = random.choice(["a", "b"])

    # this is a replacement for a conditional edge function

    if value == "a":

        goto = "node_b"

    else:

        goto = "node_c"

    # note how Command allows you to BOTH update the graph state AND route to the next node

    return Command(

        # this is the state update

        update={"foo": value},

        # this is a replacement for an edge

        goto=goto,

    )
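A minimal sketch of wiring node_a into a graph (this reuses node_a from the snippet above and assumes simple node_b / node_c functions; based on LangGraph's StateGraph API):

from typing import TypedDict

from langgraph.graph import StateGraph, START


class State(TypedDict):
    foo: str


def node_b(state: State):
    print("Called B")
    return {"foo": state["foo"] + "-b"}


def node_c(state: State):
    print("Called C")
    return {"foo": state["foo"] + "-c"}


builder = StateGraph(State)
builder.add_node("node_a", node_a)  # node_a routes itself via the Command it returns
builder.add_node("node_b", node_b)
builder.add_node("node_c", node_c)
builder.add_edge(START, "node_a")

graph = builder.compile()
print(graph.invoke({"foo": ""}))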

References:

https://langchain-ai.github.io/langgraph/how-tos/command/

Does LangGraph Studio support Python apps created outside of the LangGraph CLI?

LangGraph Studio primarily works with applications created using the LangGraph CLI. This is because LangGraph Studio is designed to work within the LangGraph ecosystem, which involves defining workflows, agents, and actions using LangGraph's specific structure and configuration format.

To clarify:

LangGraph CLI: This tool is used to create, configure, and initialize applications that are compatible with LangGraph Studio. It allows you to define the graph, agents, tasks, and workflows in a way that LangGraph Studio can interpret and visualize.

Regular Python Apps: If you have a regular Python app that doesn't adhere to LangGraph's predefined structure (i.e., it's not set up using the LangGraph CLI), you can't directly load it into LangGraph Studio without adapting it to the LangGraph framework.

Integration Options:

If you have a regular Python app that you want to integrate with LangGraph Studio, you would likely need to refactor it into the LangGraph framework, defining tasks, agents, and workflows in a way LangGraph Studio can handle.

Alternatively, you could wrap your existing Python logic in a way that it becomes part of the LangGraph structure by using LangGraph’s APIs and conventions for agents and workflows.

In summary, LangGraph Studio is tailored to apps created with LangGraph CLI, but with some effort, it is possible to adapt a regular Python application to fit into the LangGraph model for integration.


References:

OpenAI 

What is Langgraph Server?

The LangGraph Server is a backend server component of the LangGraph framework, designed to facilitate complex workflows by leveraging multi-agent collaboration. It acts as the execution environment for the agents and their plans, providing a centralized mechanism to coordinate, manage, and monitor the execution of tasks across different agents.


Key Features

Execution Environment:


Executes plans generated by the Planner agent.

Supervises agents as they perform tasks, ensuring tasks are completed sequentially or concurrently as defined.

Agent Coordination:


Allows agents to communicate and share data through a Global State.

Ensures agents adhere to the task plan by managing dependencies and transitions.

State Management:


Maintains the global state of the workflow, enabling agents to retrieve or update task-related data dynamically.

Error Handling:


Handles failures during agent execution, allowing retries or fallback strategies as defined in the workflow.

Monitoring and Logging:


Provides logs for each agent's actions and state transitions.

Enables real-time tracking of task progress.

API Interface:


Exposes endpoints for interacting with the LangGraph ecosystem, such as submitting plans, retrieving execution results, or querying the global state.

Scalability:


Supports distributed execution for scaling complex workflows involving multiple agents.

How LangGraph Server Works

Input: Receives a high-level plan from the Planner agent or a user-defined workflow.

Task Assignment:

Breaks down the plan into steps.

Assigns each step to the appropriate agent.

Execution:

Executes tasks via agents, tracking their progress.

Updates the global state based on task outcomes.

Output:

Returns the final result once all tasks are complete.

Provides detailed logs or partial results during execution.

Example Use Case

Suppose you have a GenAI application where:


A Planner agent creates a workflow to diagnose network issues.

The workflow includes fetching device details, running diagnostics, and generating a report.

The LangGraph Server will:


Receive the workflow plan.

Assign steps to specific agents (e.g., a FetchAgent for device details, a DiagnosticAgent for running diagnostics).

Monitor the progress and ensure data flows correctly between agents.

Return the consolidated results to the user or a downstream system.

Benefits

Centralized orchestration for distributed multi-agent workflows.

Enhanced fault tolerance and retry mechanisms.

Streamlined integration of various agents and tools.


What is Langgraph Studio

Key Features of LangGraph Studio:

Graph Visualization:


Displays the flow of tasks and decisions made by agents in a graph format.

Shows the hierarchy and relationships between agents.

Execution Monitoring:


Allows real-time tracking of agent activity, including:

Input and output data at each stage.

Agent states and decisions.

Logs execution results, making it easier to identify bottlenecks.

Debugging Tools:


Provides insights into why an agent made a specific decision.

Highlights errors or issues in execution steps.

Facilitates rollback or re-execution of specific steps.

State Management:


Visualizes the global state used by LangGraph to share data between agents.

Tracks updates to the state as agents perform their tasks.

Support for Planner and Executor:


Helps visualize the steps generated by the Planner agent.

Shows how the Executor agent delegates tasks to specialized agents.

Interactive Workflow Editing:


Users can modify workflows or test scenarios directly within the studio.

Provides an interface to simulate different agent interactions without rewriting code.


references:

OpenAI 

https://medium.com/@kbdhunga/langgraph-studio-b3f16e51d437


Wednesday, December 11, 2024

What are multimodal RAG systems?

Multimodal RAG systems are AI systems capable of processing text and non-text data.

Multimodal RAG enables more sophisticated inferences beyond what is conveyed by text alone. For example, it could analyze someone’s facial expressions and speech tonality to give a richer context to a meeting’s transcription

There are three basic strategies, at increasing levels of sophistication:

1. Translate modalities to text

2. Text-only retrieval + MLLM

3. Multimodal retrieval + MLLM


A simple way to make a RAG system multimodal is by translating new modalities to text before storing them in the knowledge base. This could be as simple as converting meeting recordings into text transcripts, using an existing multimodal LLM (MLLM) to generate image captions, or converting tables to a readable text format (e.g., .csv or .json).


In text-only retrieval + MLLM , generate text representations of all items in the knowledge base, e.g., descriptions and meta-tags, for retrieval, but to pass the original modality to a multimodal LLM (MLLM).


In level 3, we can use multimodal embeddings to perform multimodal retrieval. This works the same way as text-based vector search, but now the embedding space co-locates similar concepts independent of its original modality. The results of such a retrieval strategy can then be passed directly to a MLLM.


references:

https://towardsdatascience.com/multimodal-rag-process-any-file-type-with-ai-e6921342c903


Saturday, November 23, 2024

Quick Rate limiting, Sentiment analysis, Safe AI response in Python for GenAI apps

import time

from functools import wraps


def rate_limit(calls: int, period: float):

    min_interval = period / calls

    last_called = [0.0]

    def decorator(func):

        @wraps(func)

        def wrapper(*args, **kwargs):

            elapsed = time.time() - last_called[0]

            if elapsed < min_interval:

                time.sleep(min_interval - elapsed)

            result = func(*args, **kwargs)

            last_called[0] = time.time()

            return result

        return wrapper

    return decorator

@rate_limit(calls=3, period=1.0)  # 3 calls per second

def rate_limited_ai(state: AgentState) -> AgentState:

    return ai(state)



from textblob import TextBlob


def analyze_sentiment(text: str) -> float:

    """Returns sentiment score between -1 (negative) and 1 (positive)"""

    return TextBlob(text).sentiment.polarity

def enhanced_ai(state: AgentState) -> AgentState:

    messages = state["messages"]

    last_message = messages[-1].content

    # Analyze user sentiment

    sentiment = analyze_sentiment(last_message)

    # Adjust system prompt based on sentiment

    base_prompt = "You are a helpful AI assistant."

    if sentiment < -0.3:

        system_prompt = f"{base_prompt} Please respond with extra empathy and support."

    elif sentiment > 0.3:

        system_prompt = f"{base_prompt} Match the user's positive energy."

    else:

        system_prompt = base_prompt

    llm = Ollama(base_url="http://localhost:11434", model="llama3")

    context = f"{system_prompt}\n\nUser: {last_message}"

    response = llm.invoke(context)

    state["messages"].append(AIMessage(content=response))

    state["next"] = "human"

    return state


def safe_ai_response(state: AgentState) -> AgentState:

    try:

        return ai(state)

    except Exception as e:

        error_message = f"An error occurred: {str(e)}"

        state["messages"].append(AIMessage(content=error_message))

        state["next"] = "human"

        return state


Monday, November 18, 2024

What is promptim - Langchain prompt optimization

Promptim is an experimental prompt optimization library to help you systematically improve your AI systems.

Promptim automates the process of improving prompts on specific tasks. You provide initial prompt, a dataset, and custom evaluators (and optional human feedback), and promptim runs an optimization loop to produce a refined prompt that aims to outperform the original.

From evaluation-driven development to prompt optimization

A core responsibility of AI engineers is prompt engineering. This involves manually tweaking the prompt to produce better results.

A useful way to approach this is through evaluation-driven development. This involves first creating a dataset of inputs (and optionally, expected outputs) and then defining a number of evaluation metrics. Every time you make a change to the prompt, you can run it over the dataset and then score the outputs. In this way, you can measure the performance of your prompt and make sure its improving, or at the very least not regressing. Tools like LangSmith help with dataset curation and evaluation.


The idea behind prompt optimization is to use these well-defined datasets and evaluation metrics to automatically improve the prompt. You can suggest changes to the prompt in an automated way, and then score the new prompt with this evaluation method. Tools like DSPy have been pioneering efforts like this for a while.

How Promptim works

We're excited to release our first attempt at prompt optimization. It is an open source library (promptim) that integrates with LangSmith, which we use for dataset management, prompt management, tracking results, and (optionally) human labeling.


The core algorithm is as follows:

Specify a LangSmith dataset, a prompt in LangSmith, and evaluators defined locally. Optionally, you can specify train/dev/test dataset splits.

We run the initial prompt over the dev (or full) dataset to get a baseline score.

We then loop over all examples in the train (or full) dataset. We run the prompt over all examples, then score them. We then pass the results (inputs, outputs, expected outputs, scores) to a metaprompt and ask it to suggest changes to the current prompt

We then use the new updated prompt to compute metrics again on the dev split.

If the metrics show improvement, the updated prompt is retained. If there is no improvement, the original prompt is kept.

This is repeated N times

Optionally, you can add a step where you leave human feedback. This is useful when you don't have good automated metrics, or want to optimize the prompt based on feedback beyond what the automated metrics can provide. This uses LangSmith's Annotation Queues.

references:

https://blog.langchain.dev/promptim/



What is OpenAI Operator?

According to a recent Bloomberg report, OpenAI is developing an AI assistant called “Operator” that can perform computer-based tasks like coding and travel booking on users’ behalf. The company reportedly plans to release it in January as a research preview and through their API.


This development aligns with a broader industry trend toward AI agents that can execute complex tasks with minimal human oversight. Anthropic has unveiled new capabilities for its GenAI model Claude, allowing it to manipulate desktop environments, a significant step toward more independent systems. Meanwhile, Salesforce introduced next-generation AI agents focused on automating intricate tasks for businesses, signaling a broader adoption of AI-driven workflows. These developments underscore a growing emphasis on creating AI systems that can perform advanced, goal-oriented functions with minimal human oversight


AI agents are software programs that can independently perform complex sequences of tasks on behalf of users, such as booking travel or writing code, by understanding context and making decisions. These agents represent an evolution beyond simple chatbots or models, as they can actively interact with computer interfaces and web services to accomplish real-world goals with minimal human supervision.


“AI can help you track your order, issue refunds, or help prevent cancellations; this frees up human agents to become product experts,” he added. “By automating with AI, human support agents become product experts to help guide customers through which products to buy, ultimately driving better revenue and customer happiness.”


References:

https://www.pymnts.com/artificial-intelligence-2/2024/openai-readies-operator-agent-with-ecommerce-web-browsing-capabilities/


Saturday, November 16, 2024

LLM Cost: A Bit of Basics

In the context of large language models, a token is a unit of text that the model processes. A token can be as small as a single character or as large as a word or punctuation mark. The exact size of a token depends on the specific tokenization algorithm used by the model. For example:

The word “computer” is one token.

The sentence “Hello, how are you?” consists of 6 tokens: “Hello”, “,”, “how”, “are”, “you”, “?”

Typically, the model splits longer texts into smaller components (tokens) for efficient processing, making it easier to understand, generate, and manipulate text at a granular level.

For many LLMs, including OpenAI’s GPT models, usage costs are determined by the number of tokens processed, which includes both input tokens (the text prompt given to the model) and output tokens (the text generated by the model). Since the computational cost of running these models is high, token-based pricing provides a fair and scalable way to charge for usage.

Calculating Tokens in a Request

Before diving into cost calculation, let’s break down how tokens are accounted for in a request:

Input Tokens:

The text or query sent to the model is split into tokens. For example, if you send a prompt like “What is the capital of France?”, this prompt will be tokenized, and each word will contribute to the token count.

Output Tokens:

The response generated by the model also consists of tokens. For example, if the model responds with “The capital of France is Paris.”, the words in this sentence are tokenized as well.

For instance:

Input: “What is the capital of France?” (7 tokens)

Output: “The capital of France is Paris.” (7 tokens)

Total tokens used in the request: 14 tokens
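Token counts can also be computed programmatically; a sketch using the tiktoken library (the cl100k_base encoding is an assumption here, and the exact tokenizer varies by model):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "What is the capital of France?"
completion = "The capital of France is Paris."

input_tokens = len(encoding.encode(prompt))
output_tokens = len(encoding.encode(completion))
print(input_tokens, output_tokens, input_tokens + output_tokens)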

Step-by-Step Guide to Calculating the Cost

1. Tokenize the Input and Output

First, determine the number of tokens in your input text and the model’s output.

Example:

Input Prompt: “What is the weather like in New York today?” (8 tokens)

Output: “The weather in New York today is sunny with a high of 75 degrees.” (14 tokens)

Total Tokens: 8 + 14 = 22 tokens

2. Identify the Pricing for the Model

Pricing will vary depending on the model provider. For this example, let’s assume the pricing is:

$0.02 per 1,000 tokens

3. Calculate Total Cost Based on Tokens

Multiply the total number of tokens by the rate per 1,000 tokens:
TOTAL COST = (22/1000) * 0.02 = 0.00044
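The same calculation as a small helper function (a sketch; the $0.02 per 1,000 tokens rate is the assumed example price above, not any provider's actual pricing):

def estimate_cost(input_tokens: int, output_tokens: int, rate_per_1k: float = 0.02) -> float:
    """Estimate request cost from token counts and a per-1,000-token rate."""
    total_tokens = input_tokens + output_tokens
    return (total_tokens / 1000) * rate_per_1k


print(estimate_cost(8, 14))  # 0.00044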

Factors Influencing Token Costs

Several factors can influence the number of tokens generated and therefore the overall cost:


Length of Input Prompts:


Longer prompts result in more input tokens, increasing the overall token count.

Length of Output Responses:


If the model generates lengthy responses, more tokens are used, leading to higher costs.

Complexity of the Task:


More complex queries that require detailed explanations or multiple steps will result in more tokens, both in the input and output.

Model Used:


Different models (e.g., GPT-3, GPT-4) may have different token limits and pricing structures. More advanced models typically charge higher rates per 1,000 tokens.

Token Limits Per Request:


Many LLM providers impose token limits on each request. For instance, a single request might be capped at 2,048 or 4,096 tokens, including both input and output tokens.


Reducing Costs When Using LLMs

Optimize Prompts:


Keep prompts concise but clear to minimize the number of input tokens. Avoid unnecessary verbosity.

Limit Response Length:


Control the length of the model’s output using the maximum tokens parameter. This prevents the model from generating overly long responses, saving on tokens.

Batch Processing:


If possible, group related queries together to reduce the number of individual requests.

Choose the Right Model:


Use smaller models when applicable, as they are often cheaper per token compared to larger, more advanced models.


 

What are Small Language Models (SLMs)?

 Types: 

1. Distilled Models

2. Pruned Models

3. Quantized Models

4. Models Trained from Scratch

Key Characteristics of Small Language Models

Model Size and Parameter Count

Small Language Models (SLMs) typically range from hundreds of millions to a few billion parameters, unlike Large Language Models (LLMs), which can have hundreds of billions of parameters. This smaller size allows SLMs to be more resource-efficient, making them easier to deploy on local devices such as smartphones or IoT devices.

Ranges from millions to a few billion parameters.

Suitable for resource-constrained environments.

Easier to run on personal or edge devices


Training Data Requirements

Require less training data overall.

Emphasize the quality of data over quantity.

Faster training cycles due to smaller model size.


Inference Speed

Reduced latency due to fewer parameters.

Suitable for real-time applications.

Can run offline on smaller devices like mobile phones or embedded systems.



Creating small language models involves different techniques, each with unique approaches and trade-offs. Here's a breakdown of the key differences among Distilled Models, Pruned Models, Quantized Models, and Models Trained from Scratch:


1. Distilled Models

Approach: Knowledge distillation involves training a smaller model (the student) to mimic the behavior of a larger, pre-trained model (the teacher). The smaller model learns by approximating the outputs or logits of the larger model, rather than directly training on raw data.

Key Focus: Reduce model size while retaining most of the teacher model's performance.

Use Case: When high accuracy is needed with a smaller computational footprint.

Advantages:

Retains significant accuracy compared to the teacher model.

Faster inference and reduced memory requirements.

Drawbacks:

The process depends on the quality of the teacher model.

May require additional resources for the distillation process.

2. Pruned Models

Approach: Model pruning removes less significant weights, neurons, or layers from a large model based on predefined criteria, such as low weight magnitudes or redundancy.

Key Focus: Reduce the number of parameters and improve efficiency.

Use Case: When the original model is overparameterized, and optimization is required for resource-constrained environments.

Advantages:

Reduces computation and memory usage.

Can target specific hardware optimizations.

Drawbacks:

Risk of accuracy loss if pruning is too aggressive.

Pruning techniques can be complex to implement effectively.

3. Quantized Models

Approach: Quantization reduces the precision of the model's parameters from floating-point (e.g., 32-bit) to lower-precision formats (e.g., 8-bit integers); see the numpy sketch after this breakdown.

Key Focus: Improve speed and reduce memory usage, especially on hardware with low-precision support.

Use Case: Optimizing models for edge devices like smartphones or IoT devices.

Advantages:

Drastically reduces model size and computational cost.

Compatible with hardware accelerators like GPUs and TPUs optimized for low-precision arithmetic.

Drawbacks:

Can lead to accuracy degradation, especially for sensitive models.

May require fine-tuning to recover performance after quantization.

4. Models Trained from Scratch

Approach: Building and training a model from the ground up, using a new or smaller dataset, rather than modifying a pre-trained large model.

Key Focus: Design a small model architecture tailored to the specific use case or dataset.

Use Case: When there is sufficient training data and computational resources to create a highly specialized model.

Advantages:

Customizable to specific tasks or domains.

No dependency on pre-trained models.

Drawbacks:

Resource-intensive training process.

Typically requires significant expertise in model design and optimization.

May underperform compared to fine-tuned pre-trained models on general tasks.
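As a rough illustration of the quantization idea from point 3 above, here is a numpy sketch of symmetric 8-bit quantization (not any specific library's implementation):

import numpy as np

# Pretend these are float32 model weights
weights = np.random.randn(5).astype(np.float32)

# Symmetric int8 quantization: map the largest magnitude to 127
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)

# De-quantize to see the (small) precision loss
restored = quantized.astype(np.float32) * scale
print(weights)
print(restored)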


References: 

https://medium.com/@kanerika/why-small-language-models-are-making-big-waves-in-ai-0bb8e0b6f20c




What is State in Langgraph

In LangGraph, State is a fundamental concept that represents the data being passed and transformed through nodes in the workflow. It acts as a shared data container for the graph, enabling nodes to read from and write to it during execution.

Breaking Down the Example

import operator
from typing import Annotated, TypedDict


class State(TypedDict):

    # The operator.add reducer fn makes this append-only

    messages: Annotated[list, operator.add]

1. TypedDict

State is a subclass of Python's TypedDict. This allows you to define the expected structure (keys and types) of the state dictionary in a strongly typed manner.

Here, the state has one key, messages, which is a list.

2. Annotated

Annotated is a way to add metadata to a type. In this case:


Annotated[list, operator.add]

It indicates that messages is a list.

The operator.add is used as a reducer function.

3. operator.add

operator.add is a Python function that performs addition for numbers or concatenation for lists.

In this context, it is used as a reducer function for the messages list.

4. Reducer Function Behavior

A reducer function specifies how new values should be combined with the existing state during updates.

By using operator.add, the messages list becomes append-only, meaning any new items added to messages will concatenate with the current list instead of replacing it.

Why Use operator.add in State?

Append-Only Behavior:

Each node in the workflow can add to the messages list without overwriting previous values. This is useful for:

Logging messages from different nodes.

Maintaining a sequential record of events.

Thread Safety:

Using a reducer function ensures that state updates are predictable and consistent, even in concurrent workflows.

Flexibility in State Updates:

Reducer functions allow complex operations during state updates, such as appending, merging dictionaries, or performing custom logic.
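A tiny illustration of the append-only behavior (a sketch that simply mimics what the reducer does when two updates to messages are merged):

import operator

# Two successive updates to the messages key
current = ["node A ran"]
update = ["node B ran"]

# With operator.add as the reducer, the lists are concatenated rather than replaced
merged = operator.add(current, update)
print(merged)  # ['node A ran', 'node B ran']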

references:

OpenAI 




Friday, November 15, 2024

What is Elixir?

Elixir is a dynamic, functional language for building scalable and maintainable applications.

Elixir runs on the Erlang VM, known for creating low-latency, distributed, and fault-tolerant systems. These capabilities and Elixir tooling allow developers to be productive in several domains, such as web development, embedded software, machine learning, data pipelines, and multimedia processing, across a wide range of industries.


Here is a peek:


iex> "Elixir" |> String.graphemes() |> Enum.frequencies()

%{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1}


Platform features

Scalability

All Elixir code runs inside lightweight threads of execution (called processes) that are isolated and exchange information via messages:


Due to their lightweight nature, you can run hundreds of thousands of processes concurrently in the same machine, using all machine resources efficiently (vertical scaling). Processes may also communicate with other processes running on different machines to coordinate work across multiple nodes (horizontal scaling).


Together with projects such as Numerical Elixir, Elixir scales across cores, clusters, and GPUs.


Fault-tolerance

The unavoidable truth about software in production is that things will go wrong. Even more when we take network, file systems, and other third-party resources into account.


To react to failures, Elixir supervisors describe how to restart parts of your system when things go awry, going back to a known initial state that is guaranteed to work:


children = [

  TCP.Pool,

  {TCP.Acceptor, port: 4040}

]


Supervisor.start_link(children, strategy: :one_for_one)

The combination of fault-tolerance and message passing makes Elixir an excellent choice for event-driven systems and robust architectures. Frameworks, such as Nerves, build on this foundation to enable productive development of reliable embedded/IoT systems.


Functional programming

Functional programming promotes a coding style that helps developers write code that is short, concise, and maintainable. For example, pattern matching allows us to elegantly match and assert specific conditions for some code to execute:


def drive(%User{age: age}) when age >= 16 do

  # Code that drives a car

end


drive(User.get("John Doe"))

#=> Fails if the user is under 16

Elixir relies on those features to ensure your software is working under the expected constraints. And when it is not, don't worry, supervisors have your back!


Extensibility and DSLs

Elixir has been designed to be extensible, allowing developers to naturally extend the language to particular domains, in order to increase their productivity.


As an example, let's write a simple test case using Elixir's test framework called ExUnit:


defmodule MathTest do

  use ExUnit.Case, async: true


  test "can add two numbers" do

    assert 1 + 1 == 2

  end

end

The async: true option allows tests to run in parallel, using as many CPU cores as possible, while the assert functionality can introspect your code, providing great reports in case of failures.


Other examples include using Elixir to write SQL queries, compiling a subset of Elixir to the GPU, and more.


Tooling features

A growing ecosystem

Elixir ships with a great set of tools to ease development. Mix is a build tool that allows you to easily create projects, manage tasks, run tests and more:


$ mix new my_app

$ cd my_app

$ mix test

.


Finished in 0.04 seconds (0.04s on load, 0.00s on tests)

1 test, 0 failures

Mix also integrates with the Hex package manager for dependency management and hosting documentation for the whole ecosystem.


Interactive development

Tools like IEx (Elixir's interactive shell) leverage the language and platform to provide auto-complete, debugging tools, code reloading, as well as nicely formatted documentation:


$ iex

Interactive Elixir - press Ctrl+C to exit (type h() ENTER for help)

iex> h String.trim           # Prints the documentation

iex> i "Hello, World"        # Prints information about a data type

iex> break! String.trim/1    # Sets a breakpoint

iex> recompile               # Recompiles the current project

Code notebooks like Livebook allow you to interact with Elixir directly from your browser, including support for plotting, flowcharts, data tables, machine learning, and much more!


What are Pros and Cons of Erlang VM

The Erlang Virtual Machine (VM), also known as BEAM, is the runtime system that executes Erlang and Elixir code. It's designed for building concurrent, distributed, and fault-tolerant systems. Below are the pros and cons of using the Erlang VM:


Pros of Erlang VM (BEAM)

1. Concurrency and Scalability

Lightweight Processes: Erlang VM supports millions of lightweight processes, which are independent and do not share memory. This is ideal for building highly concurrent systems.

Efficient Scheduling: BEAM uses preemptive scheduling to ensure fair execution among processes, making it well-suited for multi-core CPUs.

2. Fault Tolerance

Supervisor Trees: Built-in mechanisms allow processes to monitor each other and restart failed processes seamlessly.

Isolation: Processes are isolated, so a crash in one does not affect others.

3. Distributed Systems Support

Erlang VM has first-class support for distributed computing, enabling nodes to communicate over a network as easily as within the same system.

4. Real-Time Systems

Soft Real-Time Capabilities: The VM is designed to handle soft real-time requirements, ensuring timely responses in applications like telecommunications and messaging.

5. Hot Code Upgrades

BEAM allows code to be updated in a running system without downtime, which is crucial for high-availability systems.

6. Garbage Collection

Each process has its own heap and garbage collection, making memory management efficient and avoiding global pauses.

7. Built-in Tools

BEAM provides robust tools for debugging, profiling, and tracing (e.g., Observer, DTrace).

8. Community and Ecosystem

Languages like Elixir leverage BEAM, bringing modern syntax and tooling to its robust runtime.

9. Mature and Battle-Tested

BEAM has been used in production for decades, powering telecom systems, messaging platforms (e.g., WhatsApp), and databases (e.g., CouchDB).



Cons of Erlang VM (BEAM)

1. Performance Limitations

Single-threaded Execution per Scheduler: While great for concurrency, BEAM isn't optimized for raw CPU-bound tasks compared to VMs like JVM.

Limited Numerical Processing: It's less suited for heavy numerical computations or AI/ML tasks.

2. Memory Overhead

Lightweight processes consume more memory compared to raw threads in some other VMs, especially when the number of processes is extremely high.

3. Learning Curve

The functional programming paradigm, immutable data structures, and process model can be challenging for developers used to imperative programming.

4. Lack of Mainstream Libraries

While BEAM has excellent libraries for distributed systems, its ecosystem lacks the breadth of libraries available for JVM or Python.

5. Tooling

Although improving, the tooling (e.g., IDE support) may not be as polished as in more mainstream ecosystems like Java or JavaScript.

6. Latency in Large Distributed Systems

BEAM excels in small to medium-sized distributed systems but can encounter latency challenges when scaling across a very large number of nodes.

7. Limited Language Options

BEAM primarily supports Erlang and Elixir, limiting the variety of languages that can run on the VM compared to platforms like JVM or .NET.

8. Hot Code Loading Complexity

While powerful, hot code upgrades require careful planning and can introduce subtle bugs if not managed correctly.

9. Concurrency Debugging

Debugging concurrent processes and race conditions can be challenging due to the asynchronous nature of communication.

10. Not Mainstream

Erlang and Elixir are not as widely adopted as JavaScript, Python, or Java, which might make finding experienced developers or community support harder.


What is Oban queue

Oban's primary goals are reliability, consistency and observability.

Oban is a powerful and flexible library that can handle a wide range of background job use cases, and it is well-suited for systems of any size. It provides a simple and consistent API for scheduling and performing jobs, and it is built to be fault-tolerant and easy to monitor.

Oban is fundamentally different from other background job processing tools because it retains job data for historic metrics and inspection. You can leave your application running indefinitely without worrying about jobs being lost or orphaned due to crashes.

Advantages Over Other Tools

Fewer Dependencies — If you are running a web app there is a very good chance that you're running on top of a SQL database. Running your job queue within a SQL database minimizes system dependencies and simplifies data backups.

Transactional Control — Enqueue a job along with other database changes, ensuring that everything is committed or rolled back atomically.

Database Backups — Jobs are stored inside of your primary database, which means they are backed up together with the data that they relate to.

Advanced Features

Isolated Queues — Jobs are stored in a single table but are executed in distinct queues. Each queue runs in isolation, ensuring that a job in a single slow queue can't back up other faster queues.

Queue Control — Queues can be started, stopped, paused, resumed and scaled independently at runtime locally or across all running nodes (even in environments like Heroku, without distributed Erlang).

Resilient Queues — Failing queries won't crash the entire supervision tree, instead a backoff mechanism will safely retry them again in the future.

Job Canceling — Jobs can be canceled in the middle of execution regardless of which node they are running on. This stops the job at once and flags it as cancelled.

Triggered Execution — Insert triggers ensure that jobs are dispatched on all connected nodes as soon as they are inserted into the database.

Unique Jobs — Duplicate work can be avoided through unique job controls. Uniqueness can be enforced at the argument, queue, worker and even sub-argument level for any period of time.

Scheduled Jobs — Jobs can be scheduled at any time in the future, down to the second.

Periodic (CRON) Jobs — Automatically enqueue jobs on a cron-like schedule. Duplicate jobs are never enqueued, no matter how many nodes you're running.

Job Priority — Prioritize jobs within a queue to run ahead of others with ten levels of granularity.

Historic Metrics — After a job is processed the row isn't deleted. Instead, the job is retained in the database to provide metrics. This allows users to inspect historic jobs and to see aggregate data at the job, queue or argument level.

Node Metrics — Every queue records metrics to the database during runtime. These are used to monitor queue health across nodes and may be used for analytics.

Graceful Shutdown — Queue shutdown is delayed so that slow jobs can finish executing before shutdown. When shutdown starts queues are paused and stop executing new jobs. Any jobs left running after the shutdown grace period may be rescued later.

Telemetry Integration — Job life-cycle events are emitted via Telemetry integration. This enables simple logging, error reporting and health checkups without plug-ins.

References:

https://github.com/oban-bg/oban

Thursday, November 14, 2024

What is LightRAG

LightRAG is an advanced, cost-effective RAG framework that leverages knowledge graphs and vector-based retrieval for improved document interaction. In this article, we'll explore LightRAG in depth, how it compares to methods like GraphRAG, and how you can set it up on your machine.


What is LightRAG?

LightRAG is a streamlined RAG framework designed for generating responses by retrieving relevant chunks of knowledge, using knowledge graphs alongside embeddings. Traditional RAG systems typically break documents into isolated chunks, but LightRAG goes a step further — it builds entity-relationship pairs that connect individual concepts in the text.

If you’ve heard of Microsoft’s GraphRAG, it’s a similar idea but with a twist: LightRAG is faster, more affordable, and allows incremental updates to graphs without full regeneration.



Why LightRAG over Traditional RAG Systems?

RAG systems, by design, chunk documents into segments for retrieval. However, this approach misses the contextual relationships between those segments. If the meaning or context spans multiple chunks, it becomes difficult to answer complex questions accurately. LightRAG solves this issue by generating knowledge graphs — which map out the relationships between entities in your data.

Limitations of GraphRAG

GraphRAG, while innovative, is resource-intensive. It requires hundreds of API calls, typically using expensive models like GPT-4o. Every time you update data, GraphRAG has to rebuild the entire graph, increasing costs. LightRAG, on the other hand:

Uses fewer API calls and lightweight models like GPT-4-mini.

Allows incremental updates to graphs without full regeneration.

Supports dual-level retrieval (local and global), which improves response quality.

Keeping Up with New Information

In fast-changing fields, like technology or news, having outdated information can be a problem. LightRAG solves this with an incremental update system, meaning it doesn’t have to rebuild its entire knowledge base whenever something new comes in. Instead, it quickly adds fresh data on the fly, so answers stay relevant even in evolving environments.

Faster, Smarter Retrieval with Graphs

By combining graphs with vector-based search (a fancy way of saying it finds related items quickly), LightRAG ensures that responses are not just accurate but also fast. The system organizes related ideas efficiently, and its deduplication feature removes repetitive information, making sure the user only gets what matters most.
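
Since the article mentions setting LightRAG up on your machine, here is a rough setup sketch. The package name lightrag-hku, the LightRAG/QueryParam classes and the gpt_4o_mini_complete helper are assumptions based on the project's README at the time of writing; check the repository for the current interface.

# pip install lightrag-hku   (package name is an assumption; see the LightRAG repo)
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # assumed helper from the README

rag = LightRAG(
    working_dir="./lightrag_index",       # where the graph and vector indexes are stored
    llm_model_func=gpt_4o_mini_complete,  # lightweight model, as discussed above
)

# build / incrementally update the knowledge graph from raw text
with open("my_document.txt") as f:
    rag.insert(f.read())

# dual-level retrieval: "hybrid" combines local (entity-level) and global (theme-level) search
print(rag.query("What are the main themes?", param=QueryParam(mode="hybrid")))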



Tuesday, November 12, 2024

What does __call__ function do in langgraph

Yes, the __call__ method is indeed invoked when the instance of the ReturnNodeValue class is used in the context of the LangGraph node. Here's an explanation of how it works:


Code Breakdown

Class Definition (ReturnNodeValue):


This class has an __init__ method, which initializes the object with a value called node_secret.

It also defines the __call__ method, which allows an instance of the class to be "called" like a function, passing the state argument.

The __call__ Method:


This method takes in a state (likely a State object in LangGraph), prints a message, and returns a dictionary updating the "aggregate" key with the value stored in self._value.


def __call__(self, state: State) -> Any:

    print(f"Adding {self._value} to {state['aggregate']}")

    return {"aggregate": [self._value]}

When the __call__ method is invoked, it manipulates the state by adding the value to the "aggregate" key.

Using ReturnNodeValue as a Callable:


The line ReturnNodeValue("I'm A") creates an instance of the ReturnNodeValue class with the string "I'm A" as the node_secret.

In Python, if a class defines the __call__ method, then instances of that class can be called as if they were functions.

Adding the Node to the Graph:



builder.add_node("a", ReturnNodeValue("I'm A"))

This line adds a node labeled "a" to the LangGraph using the ReturnNodeValue("I'm A") instance as the node's callable value.

When this node is executed, it will trigger the __call__ method, passing in the current state.

Does it call __call__?

Yes, when the graph execution framework (LangGraph in this case) reaches node "a", it will invoke ReturnNodeValue("I'm A") like a function. This automatically calls the __call__ method, updating the state and returning the modified value.


Example Execution:

When the node is executed, you will see:



Adding I'm A to <current state of 'aggregate'>

This is because __call__ is printing that message when invoked.


Summary:

In this example:


The class ReturnNodeValue is defined with an __call__ method.

When ReturnNodeValue("I'm A") is added to the graph, it is used as a callable object.

The LangGraph framework will invoke the __call__ method of this class instance when the node is processed in the graph execution.
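
Putting these pieces together, here is a minimal runnable sketch, assuming langgraph is installed and using a simple State whose "aggregate" key accumulates values with operator.add (the State schema and the single-node graph are illustrative):

import operator
from typing import Annotated, Any, TypedDict

from langgraph.graph import StateGraph, START, END

# state schema: "aggregate" is a list that accumulates returned values via operator.add
class State(TypedDict):
    aggregate: Annotated[list, operator.add]

class ReturnNodeValue:
    def __init__(self, node_secret: str):
        self._value = node_secret

    def __call__(self, state: State) -> Any:
        print(f"Adding {self._value} to {state['aggregate']}")
        return {"aggregate": [self._value]}

builder = StateGraph(State)
builder.add_node("a", ReturnNodeValue("I'm A"))  # the instance is the node's callable
builder.add_edge(START, "a")
builder.add_edge("a", END)
graph = builder.compile()

print(graph.invoke({"aggregate": []}))  # {'aggregate': ["I'm A"]}

Running this prints the message from __call__ first, confirming that LangGraph invokes the instance like a function when node "a" executes.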



Monday, November 11, 2024

What are main two aspects for Agent Swarm?

Routines can be thought of as a set of instructions (which in the context of AI agents, can be represented by a system prompt), the agent that encompasses it, and the tools available to the agent. That may sound like quite a lot of stuff but, in Swarm, these are easily coded.

Handoffs are the transfer of control from one agent to another, just like when you phone the bank and the person answering the call passes you on to someone more expert in your particular interests. In Swarm, different agents perform different tasks but, unlike in the real world, the new agent has a record of your previous conversation. Handoffs are key to multi-agent systems.

from swarm import Swarm, Agent

client = Swarm()

agent = Agent(

    name="Agent",

    instructions="You are a helpful agent.",

)

messages = [{"role": "user", "content": "What is the capital of Portugal"}]

response = client.run(agent=agent, messages=messages)

print(response.messages[-1]["content"])

Answer will be something like below 

The capital of Portugal is Lisbon.

Handoffs

Here is an example of a simple handoff from the Swarm docs[2]. We define two agents: one speaks English and the other speaks Spanish. Additionally, we define a tool function (that returns the Spanish agent) and append it to the English agent.

english_agent = Agent(

    name="English Agent",

    instructions="You only speak English.",

)

spanish_agent = Agent(

    name="Spanish Agent",

    instructions="You only speak Spanish.",

)

def transfer_to_spanish_agent():

    """Transfer spanish speaking users immediately."""

    return spanish_agent

english_agent.functions.append(transfer_to_spanish_agent)

messages = [{"role": "user", "content": "Hi. How are you?"}]

response = client.run(agent=english_agent, messages=messages)

print(response.messages[-1]["content"])

messages = [{"role": "user", "content": "Hola. ¿Como estás?"}]

response = client.run(agent=english_agent, messages=messages)

print(response.messages[-1]["content"])


What is Langgraph Subgraph

Subgraphs allow you to build complex systems with multiple components that are themselves graphs. A common use case for using subgraphs is building multi-agent systems.


The main question when adding subgraphs is how the parent graph and subgraph communicate, i.e. how they pass the state between each other during the graph execution. There are two scenarios:


parent graph and subgraph share schema keys. In this case, you can add a node with the compiled subgraph

parent graph and subgraph have different schemas. In this case, you have to add a node function that invokes the subgraph: this is useful when the parent graph and the subgraph have different state schemas and you need to transform state before or after calling the subgraph

Below we show how to add subgraphs for each scenario.



subgraph_builder = StateGraph(SubgraphState)

subgraph_builder.add_node(subgraph_node_1)

subgraph_builder.add_node(subgraph_node_2)

subgraph_builder.add_edge(START, "subgraph_node_1")

subgraph_builder.add_edge("subgraph_node_1", "subgraph_node_2")

subgraph = subgraph_builder.compile()



builder = StateGraph(ParentState)

builder.add_node("node_1", node_1)

# note that we're adding the compiled subgraph as a node to the parent graph

builder.add_node("node_2", subgraph)

builder.add_edge(START, "node_1")

builder.add_edge("node_1", "node_2")

graph = builder.compile()


Add a node function that invokes the subgraph


def node_2(state: ParentState):

    # transform the state to the subgraph state

    response = subgraph.invoke({"bar": state["foo"]})

    # transform response back to the parent state

    return {"foo": response["bar"]}



builder = StateGraph(ParentState)

builder.add_node("node_1", node_1)

# note that instead of using the compiled subgraph we are using `node_2` function that is calling the subgraph

builder.add_node("node_2", node_2)

builder.add_edge(START, "node_1")

builder.add_edge("node_1", "node_2")

graph = builder.compile()
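
To make the second scenario concrete, here is a minimal self-contained sketch. The SubgraphState and ParentState schemas and the trivial node bodies are illustrative assumptions, not taken from the LangGraph docs:

from typing import TypedDict
from langgraph.graph import StateGraph, START

class SubgraphState(TypedDict):
    bar: str

class ParentState(TypedDict):
    foo: str

def subgraph_node_1(state: SubgraphState):
    return {"bar": state["bar"] + " -> sub1"}

def subgraph_node_2(state: SubgraphState):
    return {"bar": state["bar"] + " -> sub2"}

subgraph_builder = StateGraph(SubgraphState)
subgraph_builder.add_node(subgraph_node_1)
subgraph_builder.add_node(subgraph_node_2)
subgraph_builder.add_edge(START, "subgraph_node_1")
subgraph_builder.add_edge("subgraph_node_1", "subgraph_node_2")
subgraph = subgraph_builder.compile()

def node_1(state: ParentState):
    return {"foo": state["foo"] + " -> parent"}

def node_2(state: ParentState):
    # transform parent state to subgraph state, invoke, and transform back
    response = subgraph.invoke({"bar": state["foo"]})
    return {"foo": response["bar"]}

builder = StateGraph(ParentState)
builder.add_node("node_1", node_1)
builder.add_node("node_2", node_2)
builder.add_edge(START, "node_1")
builder.add_edge("node_1", "node_2")
graph = builder.compile()

print(graph.invoke({"foo": "hello"}))  # {'foo': 'hello -> parent -> sub1 -> sub2'}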



 references:

https://langchain-ai.github.io/langgraph/how-tos/subgraph/#add-a-node-with-the-compiled-subgraph

What does **variable do in Python?

In Python, **variable is used in two primary ways depending on the context:

1. Unpacking Keyword Arguments in Function Definitions (**kwargs):

In a function definition, **variable (commonly named **kwargs, but it can be any valid variable name) is used to collect keyword arguments into a dictionary. This allows the function to accept an arbitrary number of keyword arguments.


Example:


def print_info(**kwargs):

    for key, value in kwargs.items():

        print(f"{key}: {value}")


# Call the function with arbitrary keyword arguments

print_info(name="John", age=30, location="New York")

Output:



name: John

age: 30

location: New York

In this case, the **kwargs collects the keyword arguments (name="John", age=30, etc.) into a dictionary:



kwargs = {'name': 'John', 'age': 30, 'location': 'New York'}

2. Unpacking a Dictionary into Keyword Arguments (** in function calls):

In a function call, **variable is used to unpack a dictionary so that its key-value pairs are passed as keyword arguments to the function.


Example:


def greet(name, age):

    print(f"Hello, my name is {name} and I am {age} years old.")


person = {"name": "Alice", "age": 25}


# Unpacking the dictionary

greet(**person)

Output:



Hello, my name is Alice and I am 25 years old.

Here, **person unpacks the dictionary into name="Alice" and age=25, which are passed as keyword arguments to the greet function.


Summary:

In function definitions, **kwargs allows collecting an arbitrary number of keyword arguments into a dictionary.

In function calls, **variable allows unpacking a dictionary into keyword arguments, making it easy to pass a dictionary's contents as function arguments.
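
The two uses also compose naturally: a function can collect keyword arguments with **kwargs and forward them with ** in a call. A small sketch reusing the greet function above (log_and_call is just an illustrative helper):

def log_and_call(func, **kwargs):
    # **kwargs collects the keyword arguments into a dict ...
    print(f"Calling {func.__name__} with {kwargs}")
    # ... and ** unpacks that dict back into keyword arguments
    return func(**kwargs)

log_and_call(greet, name="Bob", age=40)
# Calling greet with {'name': 'Bob', 'age': 40}
# Hello, my name is Bob and I am 40 years old.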


What is Vectorize

Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.

Import

Upload documents or connect to external knowledge management systems, and let Vectorize extract natural language which can be used by your LLM.

Evaluate

Vectorize will analyze multiple chunking and embedding strategies in parallel, quantifying the results of each. Use our recommendation or choose your own.

Deploy

Turn your selected vector configuration into a real time vector pipeline, automatically updated when changes occur to ensure always accurate search results.

Features are:

RAG Evaluation Tools

Automatically evaluates RAG strategies to find the best one for your unique data.

Allows you to measure the performance of different embedding models and chunking strategies, usually in less than one minute.

RAG Pipeline Builder

Construct scalable RAG pipelines with our user-friendly interface (API coming soon)

Populate vector search indexes with unstructured data from documents, SaaS platforms, knowledge bases and more.

Automatically sync your vector databases with your source data so your LLM never has stale data.


Advanced Retrieval Capabilities

Use the built-in retrieval endpoint to simplify your RAG application architecture to improve RAG performance.

The retrieval endpoint:

Automatically vectorizes your input query and performs a k-ANN search on your vector search index

Provides built-in re-ranking of results

Enriches retrieved context from your vector search index with relevancy scores and cosine similarity.

Provides metadata

Real Time Vector Updates

Never worry about stale vector search indexes again

Vectorize can be configured to immediately update changes in your unstructured data sources as soon as they occur

Vector Database Integrations

Store embedding vectors in your current vector database with preconfigured connectors.

Select from a range of embedding models from OpenAI, Voyage AI, and more to generate vector representations.

Built-in support for Pinecone, Couchbase, DataStax and others coming soon.

Optimize Pipelines with RAG Evaluation

Use Vectorize to compare the accuracy of different embedding models dynamically.

Materialize the RAG evaluation results as a pipeline with the confidence that you will always retrieve the most relevant context for your LLM.



Sunday, November 10, 2024

Integrating RAGAS - Part 1

To integrate RAGAS evaluation into this pipeline we need a few things: from our pipeline we need the retrieved contexts and the generated output.

We already have the generated output; it is what we're printing above.

When initializing our AgentExecutor object we included return_intermediate_steps=True — this (unsurprisingly) returns the intermediate steps that the agent took to generate the final answer. Those steps include the response from our arxiv_search tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:

print(out["intermediate_steps"][0][1])

To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the ground-truth answers to those questions.

from datasets import load_dataset

ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")

ragas_data

We first iterate through the questions in this evaluation dataset and ask these questions to our agent.

import pandas as pd

from tqdm.auto import tqdm


df = pd.DataFrame({

    "question": [],

    "contexts": [],

    "answer": [],

    "ground_truth": []

})


limit = 5


for i, row in tqdm(enumerate(ragas_data), total=limit):

    if i >= limit:

        break

    question = row["question"]

    ground_truths = row["ground_truth"]

    try:

        out = chat(question)

        answer = out["output"]

        if len(out["intermediate_steps"]) != 0:

            contexts = out["intermediate_steps"][0][1].split("\n---\n")

        else:

            # this is where no intermediate steps are used

            contexts = []

    except ValueError:

        answer = "ERROR"

        contexts = []

    df = pd.concat([df, pd.DataFrame({

        "question": question,

        "answer": answer,

        "contexts": [contexts],

        "ground_truth": ground_truths

    })], ignore_index=True)





from datasets import Dataset

from ragas.metrics import (

    faithfulness,

    answer_relevancy,

    context_precision,

    context_relevancy,

    context_recall,

    answer_similarity,

    answer_correctness,

)


eval_data = Dataset.from_dict(df)

eval_data


from ragas import evaluate


result = evaluate(

    dataset=eval_data,

    metrics=[

        faithfulness,

        answer_relevancy,

        context_precision,

        context_relevancy,

        context_recall,

        answer_similarity,

        answer_correctness,

    ],

)

result = result.to_pandas()

references:

https://github.com/pinecone-io/examples/blob/master/learn/generation/better-rag/03-ragas-evaluation.ipynb

Tuesday, November 5, 2024

What is SelfQueryRetriever in Langchain

In Langchain, SelfQueryRetriever is a specialized retriever designed to make the process of retrieving relevant documents more dynamic and context-aware. Unlike traditional retrievers that solely rely on similarity searches (e.g., vector searches), the SelfQueryRetriever allows for more sophisticated, natural language-based queries by combining natural language understanding with structured search capabilities.

Key Features of SelfQueryRetriever:

Natural Language Queries: It allows users to input complex, free-form questions or queries in natural language.

Dynamic Query Modification: It uses a language model (LLM) to modify or enhance the query dynamically based on the user input. This ensures that the query is refined to retrieve the most relevant results.

Structured Filters: It can also convert a user's question into structured filters that help narrow down the search more effectively. For example, it can apply specific criteria like filtering by date, category, or other metadata fields that are relevant to the search.

How SelfQueryRetriever Works:

Self-Querying: The retriever can automatically generate additional filters or modify the query to help retrieve more accurate or relevant results. It does this by analyzing the user query and applying specific transformations based on the context of the search.

LLM-Powered Refinement: A language model is used to understand the query and extract essential parameters that can guide the retrieval process. These parameters can be key-value pairs or specific instructions, enhancing the retrieval operation by filtering or adjusting the search criteria.

Difference from Other Retrievers:

Standard Retriever:


Relies on similarity search techniques (like vector search or keyword matching).

Simply matches the user's query to the stored documents and retrieves the most similar ones based on embeddings.

No dynamic query modification or structured filtering is involved.

SelfQueryRetriever:


More intelligent because it uses an LLM to interpret and enhance the user query.

It can apply structured filters based on the query (e.g., filter documents by date or category).

It dynamically refines the query using the LLM to ensure that the retrieval is both accurate and relevant.

Example Use Case:

Suppose you have a database of documents with metadata such as "author," "date," "category," etc. A user asks:


“Can you show me all network security articles written after 2020?”


A Standard Retriever would search for documents based on the similarity between the query and the document content (probably looking for the keywords “network security”).

A SelfQueryRetriever would use an LLM to break down the query into actionable parts:

Retrieve documents about network security.

Filter documents where the date is after 2020.

Return only articles matching both criteria.

This makes SelfQueryRetriever far more powerful in scenarios where specific, structured information needs to be extracted from large corpora of documents.


Sample Code:

Here’s a simple example of using SelfQueryRetriever in Langchain:



from langchain.chains.query_constructor.base import AttributeInfo

from langchain.retrievers.self_query.base import SelfQueryRetriever

from langchain.vectorstores import FAISS

from langchain.llms import OpenAI

from langchain.embeddings import OpenAIEmbeddings


# Define the attributes (metadata) of your documents

metadata_field_info = [

    AttributeInfo(name="author", description="The author of the document", type="string"),

    AttributeInfo(name="date", description="The publication date of the document", type="date"),

    AttributeInfo(name="category", description="The category of the document", type="string")

]


# Initialize your vector store and LLM

embedding_model = OpenAIEmbeddings()

vector_store = FAISS.from_documents(documents, embedding_model)  # `documents` is assumed to be a prepared list of Document objects with metadata

llm = OpenAI()


# Create SelfQueryRetriever

self_query_retriever = SelfQueryRetriever.from_llm(

    llm=llm,

    vectorstore=vector_store,

    document_contents="Articles and reports with author, date and category metadata",  # a short natural-language description of the documents

    metadata_field_info=metadata_field_info,

    verbose=True

)


# Use the retriever to answer a query

query = "Show me all security reports written by Alice after 2021."

retrieved_docs = self_query_retriever.get_relevant_documents(query)


for doc in retrieved_docs:

    print(doc)

When to Use SelfQueryRetriever:

When your data has a lot of structured information (like metadata) and you need to refine queries based on that structure.

For advanced retrieval scenarios where the user queries require dynamic, intelligent modification or filtering.

In scenarios where similarity search alone might not retrieve the most relevant documents, and you need additional filtering or query modifications.


References:

OpenAI 

Friday, November 1, 2024

Does mark down syntax help LLMs?

Markdown syntax like ###, **, and --- does not directly help ChatGPT-4 or other LLMs understand the content better in terms of meaning or context. The models interpret the underlying plain text, so formatting elements such as bold text, headings, or dividers are not processed in a way that changes the actual understanding of the text.


However, Markdown can still be useful in prompts for several reasons:


Improved clarity for human readability: Markdown can make it easier for humans to read and structure their prompts or responses, especially in cases like multi-step instructions, lists, or key points. This improved readability might indirectly lead to better prompts, helping the user or developers focus on clarity when communicating with the model.


Separating sections: For complex inputs, Markdown can visually organize the information, making it clear which parts belong to certain instructions or queries. In a multi-part conversation with a model, this can help both the human and the AI keep track of different sections logically.


Implicit structure hints: While the LLM doesn't interpret ### as a heading per se, the repetition of certain patterns (like labeled sections) might help it pick up on the structure of the text, such as treating a section starting with ### Inputs as listing relevant inputs.


In summary, Markdown won't improve the model’s inherent understanding, but it can help make your prompts clearer, well-structured, and easier to follow, which can lead to more accurate outputs by guiding how you formulate your instructions.

Tuesday, October 29, 2024

What is Ray Framework?

Ray is an open-source framework designed to enable the development of scalable and distributed applications in Python. It provides a simple and flexible programming model for building distributed systems, making it easier to leverage the power of parallel and distributed computing. Some key features and capabilities of the Ray framework include:

Ray allows you to easily parallelize your Python code by executing tasks concurrently across multiple CPU cores or even across a cluster of machines. This enables faster execution and improved performance for computationally intensive tasks.

Ray provides a distributed execution model, allowing you to scale your applications beyond a single machine. It offers tools for distributed scheduling, fault tolerance, and resource management, making it easier to handle large-scale computations.

With Ray, you can define Python functions that can be executed remotely. This enables you to offload computation to different nodes in a cluster, distributing the workload and improving overall efficiency.

Ray provides high-level abstractions for distributed data processing, such as distributed data frames and distributed object stores. These features make it easier to work with large datasets and perform operations like filtering, aggregation, and transformation in a distributed manner.

Ray includes built-in support for reinforcement learning algorithms and distributed training. It provides a scalable execution environment for training and evaluating machine learning models, enabling efficient experimentation and faster training times.
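
As a quick illustration of the task API, here is a minimal sketch that parallelizes a pure Python function across local CPU cores (the function and values are illustrative):

import ray

ray.init()  # start Ray on the local machine (or connect to a cluster)

@ray.remote
def square(x):
    # a regular Python function turned into a remote task
    return x * x

# launch tasks in parallel; each call returns a future (ObjectRef)
futures = [square.remote(i) for i in range(8)]

# block until all results are available
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]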

1. Ray AI Runtime (AIR)

This open-source collection of Python libraries is designed specifically for ML engineers, data scientists, and researchers. It equips them with a unified and scalable toolkit for developing ML applications. The Ray AI Runtime consists of 5 core libraries:

Ray Data

Achieve scalability and flexibility in data loading and transformation across various stages, such as training, tuning, and prediction, regardless of the underlying framework.

Ray Train

Enables distributed model training across multiple nodes and cores, incorporating fault tolerance mechanisms that seamlessly integrate with widely used training libraries.

Ray Tune

Scale your hyperparameter tuning process to enhance model performance, ensuring optimal configurations are discovered.

Ray Serve

Effortlessly deploy models for online inference with Ray's scalable and programmable serving capabilities. Optionally, leverage micro batching to further enhance performance.

Ray RLlib

Seamlessly integrate scalable distributed reinforcement learning workloads with other Ray AIR libraries, enabling efficient execution of reinforcement learning tasks.

references:

https://www.datacamp.com/tutorial/distributed-processing-using-ray-framework-in-python

PyMuPDF - read page by page and extract images

import pymupdf


def navigate_page_by_page_pympdf():
    # navigating page by page
    doc = pymupdf.open("deployment_guide.pdf")  # open a document
    out = open("output.txt", "wb")  # create a text output
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        print("Text read is ", text)
        # out.write(text)  # write text of page
        # out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()


def extract_images_pympdf():
    doc = pymupdf.open("deployment_guide.pdf")  # open a document

    for page_index in range(len(doc)):  # iterate over pdf pages
        page = doc[page_index]  # get the page
        image_list = page.get_images()

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)

        for image_index, img in enumerate(image_list, start=1):  # enumerate the image list
            xref = img[0]  # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref)  # create a Pixmap

            if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            pix.save("page_%s-image_%s.png" % (page_index, image_index))  # save the image as png
            pix = None