Saturday, November 23, 2024

Quick rate limiting, sentiment analysis, and safe AI responses in Python for GenAI apps

import time

from functools import wraps


def rate_limit(calls: int, period: float):

    min_interval = period / calls

    last_called = [0.0]

    def decorator(func):

        @wraps(func)

        def wrapper(*args, **kwargs):

            elapsed = time.time() - last_called[0]

            if elapsed < min_interval:

                time.sleep(min_interval - elapsed)

            result = func(*args, **kwargs)

            last_called[0] = time.time()

            return result

        return wrapper

    return decorator

@rate_limit(calls=3, period=1.0)  # 3 calls per second

def rate_limited_ai(state: AgentState) -> AgentState:

    return ai(state)



from textblob import TextBlob


def analyze_sentiment(text: str) -> float:

    """Returns sentiment score between -1 (negative) and 1 (positive)"""

    return TextBlob(text).sentiment.polarity

def enhanced_ai(state: AgentState) -> AgentState:

    messages = state["messages"]

    last_message = messages[-1].content

    # Analyze user sentiment

    sentiment = analyze_sentiment(last_message)

    # Adjust system prompt based on sentiment

    base_prompt = "You are a helpful AI assistant."

    if sentiment < -0.3:

        system_prompt = f"{base_prompt} Please respond with extra empathy and support."

    elif sentiment > 0.3:

        system_prompt = f"{base_prompt} Match the user's positive energy."

    else:

        system_prompt = base_prompt

    llm = Ollama(base_url="http://localhost:11434", model="llama3")

    context = f"{system_prompt}\n\nUser: {last_message}"

    response = llm.invoke(context)

    state["messages"].append(AIMessage(content=response))

    state["next"] = "human"

    return state


def safe_ai_response(state: AgentState) -> AgentState:

    try:

        return ai(state)

    except Exception as e:

        error_message = f"An error occurred: {str(e)}"

        state["messages"].append(AIMessage(content=error_message))

        state["next"] = "human"

        return state
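The snippets above assume a few shared pieces from the surrounding LangGraph workflow that aren't shown in this post. A minimal sketch of what they might look like (AgentState, the base ai node, and the model settings are assumptions, not part of the original code):

from typing import TypedDict

from langchain_community.llms import Ollama
from langchain_core.messages import AIMessage


class AgentState(TypedDict):
    messages: list   # conversation history of HumanMessage / AIMessage objects
    next: str        # name of the next node to run ("human", "ai", ...)


def ai(state: AgentState) -> AgentState:
    """Baseline AI node that rate_limited_ai and safe_ai_response wrap."""
    llm = Ollama(base_url="http://localhost:11434", model="llama3")
    response = llm.invoke(state["messages"][-1].content)
    state["messages"].append(AIMessage(content=response))
    state["next"] = "human"
    return state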


Monday, November 18, 2024

What is promptim - Langchain prompt optimization

Promptim is an experimental prompt optimization library to help you systematically improve your AI systems.

Promptim automates the process of improving prompts on specific tasks. You provide an initial prompt, a dataset, and custom evaluators (and, optionally, human feedback), and promptim runs an optimization loop to produce a refined prompt that aims to outperform the original.

From evaluation-driven development to prompt optimization

A core responsibility of AI engineers is prompt engineering. This involves manually tweaking the prompt to produce better results.

A useful way to approach this is through evaluation-driven development. This involves first creating a dataset of inputs (and optionally, expected outputs) and then defining a number of evaluation metrics. Every time you make a change to the prompt, you can run it over the dataset and then score the outputs. In this way, you can measure the performance of your prompt and make sure it's improving, or at the very least not regressing. Tools like LangSmith help with dataset curation and evaluation.


The idea behind prompt optimization is to use these well-defined datasets and evaluation metrics to automatically improve the prompt. You can suggest changes to the prompt in an automated way, and then score the new prompt with this evaluation method. Tools like DSPy have been pioneering efforts like this for a while.

How Promptim works

We're excited to release our first attempt at prompt optimization. It is an open source library (promptim) that integrates with LangSmith, which we use for dataset management, prompt management, tracking results, and (optionally) human labeling.


The core algorithm is as follows:

Specify a LangSmith dataset, a prompt in LangSmith, and evaluators defined locally. Optionally, you can specify train/dev/test dataset splits.

We run the initial prompt over the dev (or full) dataset to get a baseline score.

We then loop over all examples in the train (or full) dataset. We run the prompt over all examples, then score them. We then pass the results (inputs, outputs, expected outputs, scores) to a metaprompt and ask it to suggest changes to the current prompt.

We then use the new updated prompt to compute metrics again on the dev split.

If the metrics show improvement, the updated prompt is retained. If not, the original prompt is kept.

This is repeated N times (a rough pseudocode sketch of the loop follows below).

Optionally, you can add a step where you leave human feedback. This is useful when you don't have good automated metrics, or want to optimize the prompt based on feedback beyond what the automated metrics can provide. This uses LangSmith's Annotation Queues.
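In rough Python-style pseudocode, the optimization loop described above looks something like this (an illustration only, not the promptim API; all function names here are made up):

def optimize_prompt(initial_prompt, train_set, dev_set, evaluators, n_epochs):
    best_prompt = initial_prompt
    best_score = score(best_prompt, dev_set, evaluators)   # baseline on the dev split

    for _ in range(n_epochs):
        # Run the current prompt over the training examples and score the outputs
        results = [run_and_score(best_prompt, example, evaluators) for example in train_set]

        # Ask a metaprompt (an LLM) to suggest an improved prompt given the results
        candidate = metaprompt_suggest(best_prompt, results)

        # Keep the candidate only if it improves the dev-split metrics
        candidate_score = score(candidate, dev_set, evaluators)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score

    return best_prompt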

References:

https://blog.langchain.dev/promptim/



What is OpenAI Operator?

According to a recent Bloomberg report, OpenAI is developing an AI assistant called “Operator” that can perform computer-based tasks like coding and travel booking on users’ behalf. The company reportedly plans to release it in January as a research preview and through their API.


This development aligns with a broader industry trend toward AI agents that can execute complex tasks with minimal human oversight. Anthropic has unveiled new capabilities for its GenAI model Claude, allowing it to manipulate desktop environments, a significant step toward more independent systems. Meanwhile, Salesforce introduced next-generation AI agents focused on automating intricate tasks for businesses, signaling a broader adoption of AI-driven workflows. These developments underscore a growing emphasis on creating AI systems that can perform advanced, goal-oriented functions with minimal human oversight.


AI agents are software programs that can independently perform complex sequences of tasks on behalf of users, such as booking travel or writing code, by understanding context and making decisions. These agents represent an evolution beyond simple chatbots or models, as they can actively interact with computer interfaces and web services to accomplish real-world goals with minimal human supervision.


“AI can help you track your order, issue refunds, or help prevent cancellations; this frees up human agents to become product experts,” he added. “By automating with AI, human support agents become product experts to help guide customers through which products to buy, ultimately driving better revenue and customer happiness.”


References:

https://www.pymnts.com/artificial-intelligence-2/2024/openai-readies-operator-agent-with-ecommerce-web-browsing-capabilities/


Saturday, November 16, 2024

LLM Cost: Bit of Basics

In the context of large language models, a token is a unit of text that the model processes. A token can be as small as a single character or as large as a word or punctuation mark. The exact size of a token depends on the specific tokenization algorithm used by the model. For example:

The word “computer” is one token.

The sentence “Hello, how are you?” consists of 6 tokens: “Hello”, “,”, “how”, “are”, “you”, “?”

Typically, the model splits longer texts into smaller components (tokens) for efficient processing, making it easier to understand, generate, and manipulate text at a granular level.

For many LLMs, including OpenAI’s GPT models, usage costs are determined by the number of tokens processed, which includes both input tokens (the text prompt given to the model) and output tokens (the text generated by the model). Since the computational cost of running these models is high, token-based pricing provides a fair and scalable way to charge for usage.

Calculating Tokens in a Request

Before diving into cost calculation, let’s break down how tokens are accounted for in a request:

Input Tokens:

The text or query sent to the model is split into tokens. For example, if you send a prompt like “What is the capital of France?”, this prompt will be tokenized, and each word will contribute to the token count.

Output Tokens:

The response generated by the model also consists of tokens. For example, if the model responds with “The capital of France is Paris.”, the words in this sentence are tokenized as well.

For instance:

Input: “What is the capital of France?” (7 tokens)

Output: “The capital of France is Paris.” (7 tokens)

Total tokens used in the request: 14 tokens

Step-by-Step Guide to Calculating the Cost

1. Tokenize the Input and Output

First, determine the number of tokens in your input text and the model’s output.

Example:

Input Prompt: “What is the weather like in New York today?” (8 tokens)

Output: “The weather in New York today is sunny with a high of 75 degrees.” (14 tokens)

Total Tokens: 8 + 14 = 22 tokens

2. Identify the Pricing for the Model

Pricing will vary depending on the model provider. For this example, let’s assume the pricing is:

$0.02 per 1,000 tokens

3. Calculate Total Cost Based on Tokens

Multiply the total number of tokens by the rate per 1,000 tokens:


Total cost = (22 / 1,000) × $0.02 = $0.00044
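As a quick sanity check, you can count tokens and estimate cost in Python (a sketch assuming the tiktoken package and the illustrative $0.02 per 1,000 tokens rate above; exact token counts depend on the tokenizer):

import tiktoken

PRICE_PER_1K_TOKENS = 0.02  # illustrative rate, not a real price list


def estimate_cost(prompt: str, completion: str, model: str = "gpt-4") -> float:
    enc = tiktoken.encoding_for_model(model)
    total_tokens = len(enc.encode(prompt)) + len(enc.encode(completion))
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS


cost = estimate_cost(
    "What is the weather like in New York today?",
    "The weather in New York today is sunny with a high of 75 degrees.",
)
print(f"Estimated cost: ${cost:.5f}")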

Factors Influencing Token Costs

Several factors can influence the number of tokens generated and therefore the overall cost:


Length of Input Prompts:


Longer prompts result in more input tokens, increasing the overall token count.

Length of Output Responses:


If the model generates lengthy responses, more tokens are used, leading to higher costs.

Complexity of the Task:


More complex queries that require detailed explanations or multiple steps will result in more tokens, both in the input and output.

Model Used:


Different models (e.g., GPT-3, GPT-4) may have different token limits and pricing structures. More advanced models typically charge higher rates per 1,000 tokens.

Token Limits Per Request:


Many LLM providers impose token limits on each request. For instance, a single request might be capped at 2,048 or 4,096 tokens, including both input and output tokens.


Reducing Costs When Using LLMs

Optimize Prompts:


Keep prompts concise but clear to minimize the number of input tokens. Avoid unnecessary verbosity.

Limit Response Length:


Control the length of the model’s output using the maximum tokens parameter. This prevents the model from generating overly long responses, saving on tokens (a short sketch follows this list).

Batch Processing:


If possible, group related queries together to reduce the number of individual requests.

Choose the Right Model:


Use smaller models when applicable, as they are often cheaper per token compared to larger, more advanced models.
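For example, you can cap output length with the max_tokens parameter (a minimal sketch using the OpenAI Python client; the model name and the 100-token limit are arbitrary choices):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    max_tokens=100,  # hard cap on output tokens to bound cost
)
print(response.choices[0].message.content)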


 

What are Small Language Models (SLMs)?

 Types: 

1. Distilled Models

2. Pruned Models

3. Quantized Models

4. Models Trained from Scratch

Key Characteristics of Small Language Models

Model Size and Parameter Count

Small Language Models (SLMs) typically range from hundreds of millions to a few billion parameters, unlike Large Language Models (LLMs), which can have hundreds of billions of parameters. This smaller size allows SLMs to be more resource-efficient, making them easier to deploy on local devices such as smartphones or IoT devices.

Ranges from millions to a few billion parameters.

Suitable for resource-constrained environments.

Easier to run on personal or edge devices.


Training Data Requirements

Require less training data overall.

Emphasize the quality of data over quantity.

Faster training cycles due to smaller model size.


Inference Speed

Reduced latency due to fewer parameters.

Suitable for real-time applications.

Can run offline on smaller devices like mobile phones or embedded systems.



Creating small language models involves different techniques, each with unique approaches and trade-offs. Here's a breakdown of the key differences among Distilled Models, Pruned Models, Quantized Models, and Models Trained from Scratch:


1. Distilled Models

Approach: Knowledge distillation involves training a smaller model (the student) to mimic the behavior of a larger, pre-trained model (the teacher). The smaller model learns by approximating the outputs or logits of the larger model, rather than directly training on raw data.

Key Focus: Reduce model size while retaining most of the teacher model's performance.

Use Case: When high accuracy is needed with a smaller computational footprint.

Advantages:

Retains significant accuracy compared to the teacher model.

Faster inference and reduced memory requirements.

Drawbacks:

The process depends on the quality of the teacher model.

May require additional resources for the distillation process.
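A common way to implement the student/teacher objective is a temperature-softened KL term blended with the usual cross-entropy on hard labels (a minimal PyTorch sketch, not tied to any particular model):

import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard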

2. Pruned Models

Approach: Model pruning removes less significant weights, neurons, or layers from a large model based on predefined criteria, such as low weight magnitudes or redundancy.

Key Focus: Reduce the number of parameters and improve efficiency.

Use Case: When the original model is overparameterized, and optimization is required for resource-constrained environments.

Advantages:

Reduces computation and memory usage.

Can target specific hardware optimizations.

Drawbacks:

Risk of accuracy loss if pruning is too aggressive.

Pruning techniques can be complex to implement effectively.
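As an illustration, PyTorch's pruning utilities can zero out the smallest-magnitude weights of a layer (a sketch; the 30% amount is arbitrary):

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the re-parametrization mask
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.0%}")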

3. Quantized Models

Approach: Quantization reduces the precision of the model's parameters from floating-point (e.g., 32-bit) to lower-precision formats (e.g., 8-bit integers).

Key Focus: Improve speed and reduce memory usage, especially on hardware with low-precision support.

Use Case: Optimizing models for edge devices like smartphones or IoT devices.

Advantages:

Drastically reduces model size and computational cost.

Compatible with hardware accelerators like GPUs and TPUs optimized for low-precision arithmetic.

Drawbacks:

Can lead to accuracy degradation, especially for sensitive models.

May require fine-tuning to recover performance after quantization.
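As an illustration, PyTorch's dynamic quantization converts the weights of selected layer types to 8-bit integers in one call (a sketch on a toy model):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Quantize Linear layer weights from 32-bit floats to 8-bit integers
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized)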

4. Models Trained from Scratch

Approach: Building and training a model from the ground up, using a new or smaller dataset, rather than modifying a pre-trained large model.

Key Focus: Design a small model architecture tailored to the specific use case or dataset.

Use Case: When there is sufficient training data and computational resources to create a highly specialized model.

Advantages:

Customizable to specific tasks or domains.

No dependency on pre-trained models.

Drawbacks:

Resource-intensive training process.

Typically requires significant expertise in model design and optimization.

May underperform compared to fine-tuned pre-trained models on general tasks.


References: 

https://medium.com/@kanerika/why-small-language-models-are-making-big-waves-in-ai-0bb8e0b6f20c




What is State in Langgraph

In LangGraph, State is a fundamental concept that represents the data being passed and transformed through nodes in the workflow. It acts as a shared data container for the graph, enabling nodes to read from and write to it during execution.

Breaking Down the Example


import operator
from typing import Annotated
from typing_extensions import TypedDict


class State(TypedDict):

    # The operator.add reducer fn makes this append-only

    messages: Annotated[list, operator.add]

1. TypedDict

State is a subclass of Python's TypedDict. This allows you to define the expected structure (keys and types) of the state dictionary in a strongly typed manner.

Here, the state has one key, messages, which is a list.

2. Annotated

Annotated is a way to add metadata to a type. In this case:


Annotated[list, operator.add]

It indicates that messages is a list.

The operator.add is used as a reducer function.

3. operator.add

operator.add is a Python function that performs addition for numbers or concatenation for lists.

In this context, it is used as a reducer function for the messages list.

4. Reducer Function Behavior

A reducer function specifies how new values should be combined with the existing state during updates.

By using operator.add, the messages list becomes append-only, meaning any new items added to messages will concatenate with the current list instead of replacing it.

Why Use operator.add in State?

Append-Only Behavior:

Each node in the workflow can add to the messages list without overwriting previous values. This is useful for:

Logging messages from different nodes.

Maintaining a sequential record of events.

Thread Safety:

Using a reducer function ensures that state updates are predictable and consistent, even in concurrent workflows.

Flexibility in State Updates:

Reducer functions allow complex operations during state updates, such as appending, merging dictionaries, or performing custom logic.
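A minimal end-to-end sketch of the append-only behavior (node names and messages here are arbitrary):

import operator
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    messages: Annotated[list, operator.add]


def node_a(state: State):
    return {"messages": ["hello from a"]}   # appended, not overwritten


def node_b(state: State):
    return {"messages": ["hello from b"]}   # appended after node_a's message


builder = StateGraph(State)
builder.add_node("a", node_a)
builder.add_node("b", node_b)
builder.add_edge(START, "a")
builder.add_edge("a", "b")
builder.add_edge("b", END)

graph = builder.compile()
print(graph.invoke({"messages": ["start"]}))
# {'messages': ['start', 'hello from a', 'hello from b']}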

References:

OpenAI 




Friday, November 15, 2024

What is Elixir?

Elixir is a dynamic, functional language for building scalable and maintainable applications.

Elixir runs on the Erlang VM, known for creating low-latency, distributed, and fault-tolerant systems. These capabilities and Elixir tooling allow developers to be productive in several domains, such as web development, embedded software, machine learning, data pipelines, and multimedia processing, across a wide range of industries.


Here is a peek:


iex> "Elixir" |> String.graphemes() |> Enum.frequencies()

%{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1}


Platform features

Scalability

All Elixir code runs inside lightweight threads of execution (called processes) that are isolated and exchange information via messages:


Due to their lightweight nature, you can run hundreds of thousands of processes concurrently in the same machine, using all machine resources efficiently (vertical scaling). Processes may also communicate with other processes running on different machines to coordinate work across multiple nodes (horizontal scaling).


Together with projects such as Numerical Elixir, Elixir scales across cores, clusters, and GPUs.


Fault-tolerance

The unavoidable truth about software in production is that things will go wrong. Even more when we take network, file systems, and other third-party resources into account.


To react to failures, Elixir supervisors describe how to restart parts of your system when things go awry, going back to a known initial state that is guaranteed to work:


children = [

  TCP.Pool,

  {TCP.Acceptor, port: 4040}

]


Supervisor.start_link(children, strategy: :one_for_one)

The combination of fault-tolerance and message passing makes Elixir an excellent choice for event-driven systems and robust architectures. Frameworks, such as Nerves, build on this foundation to enable productive development of reliable embedded/IoT systems.


Functional programming

Functional programming promotes a coding style that helps developers write code that is short, concise, and maintainable. For example, pattern matching allows us to elegantly match and assert specific conditions for some code to execute:


def drive(%User{age: age}) when age >= 16 do

  # Code that drives a car

end


drive(User.get("John Doe"))

#=> Fails if the user is under 16

Elixir relies on those features to ensure your software is working under the expected constraints. And when it is not, don't worry, supervisors have your back!


Extensibility and DSLs

Elixir has been designed to be extensible, allowing developers to naturally extend the language to particular domains, in order to increase their productivity.


As an example, let's write a simple test case using Elixir's test framework called ExUnit:


defmodule MathTest do

  use ExUnit.Case, async: true


  test "can add two numbers" do

    assert 1 + 1 == 2

  end

end

The async: true option allows tests to run in parallel, using as many CPU cores as possible, while the assert functionality can introspect your code, providing great reports in case of failures.


Other examples include using Elixir to write SQL queries, compiling a subset of Elixir to the GPU, and more.


Tooling features

A growing ecosystem

Elixir ships with a great set of tools to ease development. Mix is a build tool that allows you to easily create projects, manage tasks, run tests and more:


$ mix new my_app

$ cd my_app

$ mix test

.


Finished in 0.04 seconds (0.04s on load, 0.00s on tests)

1 test, 0 failures

Mix also integrates with the Hex package manager for dependency management and hosting documentation for the whole ecosystem.


Interactive development

Tools like IEx (Elixir's interactive shell) leverage the language and platform to provide auto-complete, debugging tools, code reloading, as well as nicely formatted documentation:


$ iex

Interactive Elixir - press Ctrl+C to exit (type h() ENTER for help)

iex> h String.trim           # Prints the documentation

iex> i "Hello, World"        # Prints information about a data type

iex> break! String.trim/1    # Sets a breakpoint

iex> recompile               # Recompiles the current project

Code notebooks like Livebook allow you to interact with Elixir directly from your browser, including support for plotting, flowcharts, data tables, machine learning, and much more!


What are Pros and Cons of Erlang VM

The Erlang Virtual Machine (VM), also known as BEAM, is the runtime system that executes Erlang and Elixir code. It's designed for building concurrent, distributed, and fault-tolerant systems. Below are the pros and cons of using the Erlang VM:


Pros of Erlang VM (BEAM)

1. Concurrency and Scalability

Lightweight Processes: Erlang VM supports millions of lightweight processes, which are independent and do not share memory. This is ideal for building highly concurrent systems.

Efficient Scheduling: BEAM uses preemptive scheduling to ensure fair execution among processes, making it well-suited for multi-core CPUs.

2. Fault Tolerance

Supervisor Trees: Built-in mechanisms allow processes to monitor each other and restart failed processes seamlessly.

Isolation: Processes are isolated, so a crash in one does not affect others.

3. Distributed Systems Support

Erlang VM has first-class support for distributed computing, enabling nodes to communicate over a network as easily as within the same system.

4. Real-Time Systems

Soft Real-Time Capabilities: The VM is designed to handle soft real-time requirements, ensuring timely responses in applications like telecommunications and messaging.

5. Hot Code Upgrades

BEAM allows code to be updated in a running system without downtime, which is crucial for high-availability systems.

6. Garbage Collection

Each process has its own heap and garbage collection, making memory management efficient and avoiding global pauses.

7. Built-in Tools

BEAM provides robust tools for debugging, profiling, and tracing (e.g., Observer, DTrace).

8. Community and Ecosystem

Languages like Elixir leverage BEAM, bringing modern syntax and tooling to its robust runtime.

9. Mature and Battle-Tested

BEAM has been used in production for decades, powering telecom systems, messaging platforms (e.g., WhatsApp), and databases (e.g., CouchDB).



Cons of Erlang VM (BEAM)

1. Performance Limitations

Single-threaded Execution per Scheduler: While great for concurrency, BEAM isn't optimized for raw CPU-bound tasks compared to VMs like JVM.

Limited Numerical Processing: It's less suited for heavy numerical computations or AI/ML tasks.

2. Memory Overhead

Lightweight processes consume more memory compared to raw threads in some other VMs, especially when the number of processes is extremely high.

3. Learning Curve

The functional programming paradigm, immutable data structures, and process model can be challenging for developers used to imperative programming.

4. Lack of Mainstream Libraries

While BEAM has excellent libraries for distributed systems, its ecosystem lacks the breadth of libraries available for JVM or Python.

5. Tooling

Although improving, the tooling (e.g., IDE support) may not be as polished as in more mainstream ecosystems like Java or JavaScript.

6. Latency in Large Distributed Systems

BEAM excels in small to medium-sized distributed systems but can encounter latency challenges when scaling across a very large number of nodes.

7. Limited Language Options

BEAM primarily supports Erlang and Elixir, limiting the variety of languages that can run on the VM compared to platforms like JVM or .NET.

8. Hot Code Loading Complexity

While powerful, hot code upgrades require careful planning and can introduce subtle bugs if not managed correctly.

9. Concurrency Debugging

Debugging concurrent processes and race conditions can be challenging due to the asynchronous nature of communication.

10. Not Mainstream

Erlang and Elixir are not as widely adopted as JavaScript, Python, or Java, which might make finding experienced developers or community support harder.


What is Oban queue

Oban's primary goals are reliability, consistency and observability.

Oban is a powerful and flexible library that can handle a wide range of background job use cases, and it is well-suited for systems of any size. It provides a simple and consistent API for scheduling and performing jobs, and it is built to be fault-tolerant and easy to monitor.

Oban is fundamentally different from other background job processing tools because it retains job data for historic metrics and inspection. You can leave your application running indefinitely without worrying about jobs being lost or orphaned due to crashes.

Advantages Over Other Tools

Fewer Dependencies — If you are running a web app there is a very good chance that you're running on top of a SQL database. Running your job queue within a SQL database minimizes system dependencies and simplifies data backups.

Transactional Control — Enqueue a job along with other database changes, ensuring that everything is committed or rolled back atomically.

Database Backups — Jobs are stored inside of your primary database, which means they are backed up together with the data that they relate to.

Advanced Features

Isolated Queues — Jobs are stored in a single table but are executed in distinct queues. Each queue runs in isolation, ensuring that a job in a single slow queue can't back up other faster queues.

Queue Control — Queues can be started, stopped, paused, resumed and scaled independently at runtime locally or across all running nodes (even in environments like Heroku, without distributed Erlang).

Resilient Queues — Failing queries won't crash the entire supervision tree, instead a backoff mechanism will safely retry them again in the future.

Job Canceling — Jobs can be canceled in the middle of execution regardless of which node they are running on. This stops the job at once and flags it as cancelled.

Triggered Execution — Insert triggers ensure that jobs are dispatched on all connected nodes as soon as they are inserted into the database.

Unique Jobs — Duplicate work can be avoided through unique job controls. Uniqueness can be enforced at the argument, queue, worker and even sub-argument level for any period of time.

Scheduled Jobs — Jobs can be scheduled at any time in the future, down to the second.

Periodic (CRON) Jobs — Automatically enqueue jobs on a cron-like schedule. Duplicate jobs are never enqueued, no matter how many nodes you're running.

Job Priority — Prioritize jobs within a queue to run ahead of others with ten levels of granularity.

Historic Metrics — After a job is processed the row isn't deleted. Instead, the job is retained in the database to provide metrics. This allows users to inspect historic jobs and to see aggregate data at the job, queue or argument level.

Node Metrics — Every queue records metrics to the database during runtime. These are used to monitor queue health across nodes and may be used for analytics.

Graceful Shutdown — Queue shutdown is delayed so that slow jobs can finish executing before shutdown. When shutdown starts queues are paused and stop executing new jobs. Any jobs left running after the shutdown grace period may be rescued later.

Telemetry Integration — Job life-cycle events are emitted via Telemetry integration. This enables simple logging, error reporting and health checkups without plug-ins.

References:

https://github.com/oban-bg/oban

Thursday, November 14, 2024

What is LightRAG

LightRAG is an advanced, cost-effective RAG framework that leverages knowledge graphs and vector-based retrieval for improved document interaction. In this article, we'll explore LightRAG in depth, how it compares to methods like GraphRAG, and how you can set it up on your machine.


What is LightRAG?

LightRAG is a streamlined RAG framework designed for generating responses by retrieving relevant chunks of knowledge, using knowledge graphs alongside embeddings. Traditional RAG systems typically break documents into isolated chunks, but LightRAG goes a step further — it builds entity-relationship pairs that connect individual concepts in the text.

If you’ve heard of Microsoft’s GraphRAG, it’s a similar idea but with a twist: LightRAG is faster, more affordable, and allows incremental updates to graphs without full regeneration.



Why LightRAG over Traditional RAG Systems?

RAG systems, by design, chunk documents into segments for retrieval. However, this approach misses the contextual relationships between those segments. If the meaning or context spans multiple chunks, it becomes difficult to answer complex questions accurately. LightRAG solves this issue by generating knowledge graphs — which map out the relationships between entities in your data.

Limitations of GraphRAG

GraphRAG, while innovative, is resource-intensive. It requires hundreds of API calls, typically using expensive models like GPT-4o. Every time you update data, GraphRAG has to rebuild the entire graph, increasing costs. LightRAG, on the other hand:

Uses fewer API calls and lightweight models like GPT-4o-mini.

Allows incremental updates to graphs without full regeneration.

Supports dual-level retrieval (local and global), which improves response quality.

Keeping Up with New Information

In fast-changing fields, like technology or news, having outdated information can be a problem. LightRAG solves this with an incremental update system, meaning it doesn’t have to rebuild its entire knowledge base whenever something new comes in. Instead, it quickly adds fresh data on the fly, so answers stay relevant even in evolving environments.

Faster, Smarter Retrieval with Graphs

By combining graphs with vector-based search (a fancy way of saying it finds related items quickly), LightRAG ensures that responses are not just accurate but also fast. The system organizes related ideas efficiently, and its deduplication feature removes repetitive information, making sure the user only gets what matters most.
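To make the idea concrete, here is a toy illustration of combining graph-neighborhood expansion with vector similarity search and deduplication (purely illustrative; this is not the LightRAG API):

import numpy as np

# Toy entity graph: entity -> related entities
GRAPH = {"rag": ["retrieval", "llm"], "retrieval": ["vector search"]}

# Toy vector index: chunk text -> embedding
CHUNKS = {
    "RAG augments an LLM with retrieved context.": np.array([0.9, 0.1]),
    "Vector search finds semantically similar chunks.": np.array([0.2, 0.8]),
    "Knowledge graphs connect related entities.": np.array([0.5, 0.5]),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(entities, query_vec, top_k=2):
    # Graph level: expand the query entities to their neighbours
    related = set(entities)
    for e in entities:
        related.update(GRAPH.get(e, []))

    # Vector level: rank chunks by similarity to the query embedding
    ranked = sorted(CHUNKS, key=lambda c: cosine(CHUNKS[c], query_vec), reverse=True)

    # Merge both views and deduplicate, keeping the highest-ranked hits
    hits = [c for c in ranked if any(e in c.lower() for e in related)]
    for c in ranked[:top_k]:
        if c not in hits:
            hits.append(c)
    return hits


print(retrieve(["rag"], np.array([0.8, 0.2])))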



Tuesday, November 12, 2024

What does the __call__ function do in LangGraph?

Yes, the __call__ method is indeed invoked when the instance of the ReturnNodeValue class is used in the context of the LangGraph node. Here's an explanation of how it works:


Code Breakdown

Class Definition (ReturnNodeValue):


This class has an __init__ method, which initializes the object with a value called node_secret.

It also defines the __call__ method, which allows an instance of the class to be "called" like a function, passing the state argument.

The __call__ Method:


This method takes in a state (likely a State object in LangGraph), prints a message, and returns a dictionary updating the "aggregate" key with the value stored in self._value.


def __call__(self, state: State) -> Any:

    print(f"Adding {self._value} to {state['aggregate']}")

    return {"aggregate": [self._value]}

When the __call__ method is invoked, it manipulates the state by adding the value to the "aggregate" key.

Using ReturnNodeValue as a Callable:


The line ReturnNodeValue("I'm A") creates an instance of the ReturnNodeValue class with the string "I'm A" as the node_secret.

In Python, if a class defines the __call__ method, then instances of that class can be called as if they were functions.

Adding the Node to the Graph:



builder.add_node("a", ReturnNodeValue("I'm A"))

This line adds a node labeled "a" to the LangGraph using the ReturnNodeValue("I'm A") instance as the node's callable value.

When this node is executed, it will trigger the __call__ method, passing in the current state.

Does it call __call__?

Yes, when the graph execution framework (LangGraph in this case) reaches node "a", it will invoke ReturnNodeValue("I'm A") like a function. This automatically calls the __call__ method, updating the state and returning the modified value.


Example Execution:

When the node is executed, you will see:



Adding I'm A to <current state of 'aggregate'>

This is because __call__ is printing that message when invoked.
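Putting the pieces together, the full class looks roughly like this (a sketch based on the breakdown above; the State definition with an additive reducer on "aggregate" is assumed, not shown in the original snippet):

import operator
from typing import Annotated, Any
from typing_extensions import TypedDict


class State(TypedDict):
    aggregate: Annotated[list, operator.add]


class ReturnNodeValue:
    def __init__(self, node_secret: str):
        self._value = node_secret

    def __call__(self, state: State) -> Any:
        print(f"Adding {self._value} to {state['aggregate']}")
        return {"aggregate": [self._value]}


# The instance is then used as the node's callable:
# builder.add_node("a", ReturnNodeValue("I'm A"))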


Summary:

In this example:


The class ReturnNodeValue is defined with an __call__ method.

When ReturnNodeValue("I'm A") is added to the graph, it is used as a callable object.

The LangGraph framework will invoke the __call__ method of this class instance when the node is processed in the graph execution.



Monday, November 11, 2024

What are the two main aspects of Agent Swarm?

Routines can be thought of as a set of instructions (which in the context of AI agents, can be represented by a system prompt), the agent that encompasses it, and the tools available to the agent. That may sound like quite a lot of stuff but, in Swarm, these are easily coded.

Handoffs are the transfer of control from one agent to another - just like when you phone the bank and the person answering may pass you on to someone more expert in your particular interest. In Swarm, different agents perform different tasks, but unlike the real world, the new agent has a record of your previous conversation. Handoffs are key to multi-agent systems.

from swarm import Swarm, Agent

client = Swarm()

agent = Agent(

    name="Agent",

    instructions="You are a helpful agent.",

)

messages = [{"role": "user", "content": "What is the capital of Portugal"}]

response = client.run(agent=agent, messages=messages)

print(response.messages[-1]["content"])

The answer will be something like:

The capital of Portugal is Lisbon.

Handoffs

Here is an example of a simple handoff from the Swarm docs[2]. We define two agents: one speaks English and the other speaks Spanish. Additionally, we define a tool function (that returns the Spanish agent) and append it to the English agent's functions.

english_agent = Agent(

    name="English Agent",

    instructions="You only speak English.",

)

spanish_agent = Agent(

    name="Spanish Agent",

    instructions="You only speak Spanish.",

)

def transfer_to_spanish_agent():

    """Transfer spanish speaking users immediately."""

    return spanish_agent

english_agent.functions.append(transfer_to_spanish_agent)

messages = [{"role": "user", "content": "Hi. How are you?"}]

response = client.run(agent=english_agent, messages=messages)

print(response.messages[-1]["content"])

messages = [{"role": "user", "content": "Hola. ¿Como estás?"}]

response = client.run(agent=english_agent, messages=messages)

print(response.messages[-1]["content"])


What is a LangGraph Subgraph?

Subgraphs allow you to build complex systems with multiple components that are themselves graphs. A common use case for using subgraphs is building multi-agent systems.


The main question when adding subgraphs is how the parent graph and subgraph communicate, i.e. how they pass the state between each other during the graph execution. There are two scenarios:


parent graph and subgraph share schema keys. In this case, you can add a node with the compiled subgraph

parent graph and subgraph have different schemas. In this case, you have to add a node function that invokes the subgraph: this is useful when the parent graph and the subgraph have different state schemas and you need to transform state before or after calling the subgraph

Below we show how to add subgraphs for each scenario.
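The snippets below reference a few names that are not shown here. For the first (shared schema) scenario, they could be defined roughly like this (a sketch loosely following the LangGraph docs; the node bodies are illustrative):

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START


class SubgraphState(TypedDict):
    foo: str   # key shared with the parent graph
    bar: str   # key internal to the subgraph


class ParentState(TypedDict):
    foo: str


def subgraph_node_1(state: SubgraphState):
    return {"bar": "bar"}


def subgraph_node_2(state: SubgraphState):
    # reads both the internal key and the shared key, then updates the shared key
    return {"foo": state["foo"] + state["bar"]}


def node_1(state: ParentState):
    return {"foo": "hi! " + state["foo"]}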



subgraph_builder = StateGraph(SubgraphState)

subgraph_builder.add_node(subgraph_node_1)

subgraph_builder.add_node(subgraph_node_2)

subgraph_builder.add_edge(START, "subgraph_node_1")

subgraph_builder.add_edge("subgraph_node_1", "subgraph_node_2")

subgraph = subgraph_builder.compile()



builder = StateGraph(ParentState)

builder.add_node("node_1", node_1)

# note that we're adding the compiled subgraph as a node to the parent graph

builder.add_node("node_2", subgraph)

builder.add_edge(START, "node_1")

builder.add_edge("node_1", "node_2")

graph = builder.compile()


Add a node function that invokes the subgraph


def node_2(state: ParentState):

    # transform the state to the subgraph state

    response = subgraph.invoke({"bar": state["foo"]})

    # transform response back to the parent state

    return {"foo": response["bar"]}



builder = StateGraph(ParentState)

builder.add_node("node_1", node_1)

# note that instead of using the compiled subgraph we are using `node_2` function that is calling the subgraph

builder.add_node("node_2", node_2)

builder.add_edge(START, "node_1")

builder.add_edge("node_1", "node_2")

graph = builder.compile()



References:

https://langchain-ai.github.io/langgraph/how-tos/subgraph/#add-a-node-with-the-compiled-subgraph

What does **variable do in Python?

In Python, **variable is used in two primary ways depending on the context:

1. Unpacking Keyword Arguments in Function Definitions (**kwargs):

In a function definition, **variable (commonly named **kwargs, but it can be any valid variable name) is used to collect keyword arguments into a dictionary. This allows the function to accept an arbitrary number of keyword arguments.


Example:


def print_info(**kwargs):

    for key, value in kwargs.items():

        print(f"{key}: {value}")


# Call the function with arbitrary keyword arguments

print_info(name="John", age=30, location="New York")

Output:



name: John

age: 30

location: New York

In this case, the **kwargs collects the keyword arguments (name="John", age=30, etc.) into a dictionary:



kwargs = {'name': 'John', 'age': 30, 'location': 'New York'}

2. Unpacking a Dictionary into Keyword Arguments (** in function calls):

In a function call, **variable is used to unpack a dictionary so that its key-value pairs are passed as keyword arguments to the function.


Example:


def greet(name, age):

    print(f"Hello, my name is {name} and I am {age} years old.")


person = {"name": "Alice", "age": 25}


# Unpacking the dictionary

greet(**person)

Output:



Hello, my name is Alice and I am 25 years old.

Here, **person unpacks the dictionary into name="Alice" and age=25, which are passed as keyword arguments to the greet function.


Summary:

In function definitions, **kwargs allows collecting an arbitrary number of keyword arguments into a dictionary.

In function calls, **variable allows unpacking a dictionary into keyword arguments, making it easy to pass a dictionary's contents as function arguments.


What is Vectorize?

Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.

Import

Upload documents or connect to external knowledge management systems, and let Vectorize extract natural language which can be used by your LLM.

Evaluate

Vectorize will analyze multiple chunking and embedding strategies in parallel, quantifying the results of each. Use our recommendation or choose your own.

Deploy

Turn your selected vector configuration into a real time vector pipeline, automatically updated when changes occur to ensure always accurate search results.

Features are:

RAG Evaluation Tools

Automatically evaluates RAG strategies to find the best one for your unique data.

Allows you to measure the performance of different embedding models and chunking strategies, usually in less than one minute.

RAG Pipeline Builder

Construct scalable RAG pipelines with our user-friendly interface (API coming soon)

Populate vector search indexes with unstructured data from documents, SaaS platforms, knowledge bases and more.

Automatically sync your vector databases with your source data so your LLM never has stale data.


Advanced Retrieval Capabilities

Use the built-in retrieval endpoint to simplify your RAG application architecture to improve RAG performance.

The retrieval endpoint:

Automatically vectorizes your input query and performs a k-ANN search on your vector search index

Provides built-in re-ranking of results

Enriches retrieved context from your vector search index with relevancy scores and cosine similarity.

Provides metadata

Real Time Vector Updates

Never worry about stale vector search indexes again

Vectorize can be configured to immediately update changes in your unstructured data sources as soon as they occur

Vector Database Integrations

Store embedding vectors in your current vector database with preconfigured connectors.

Select from a range of embedding models from OpenAI, Voyage AI, and more to generate vector representations.

Built-in support for Pinecone, Couchbase, DataStax and others coming soon.

Optimize Pipelines with RAG Evaluation

Use Vectorize to compare the accuracy of different embedding models dynamically.

Materialize the RAG evaluation results as a pipeline with the confidence that you will always retrieve the most relevant context for your LLM.



Sunday, November 10, 2024

Integrating RAGAS - Part 1

To integrate RAGAS evaluation into this pipeline we need a few things, from our pipeline we need the retrieved contexts, and the generated output.

We already have the generated output, it is what we're printing above

When initializing our AgentExecutor object we included return_intermediate_steps=True, which (unsurprisingly) returns the intermediate steps the agent took to generate the final answer. Those steps include the response from our arxiv_search tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.

We extract the contexts themselves like so:

print(out["intermediate_steps"][0][1])

To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the ground-truth answers to those questions.

ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")

ragas_data

We first iterate through the questions in this evaluation dataset and ask these questions to our agent.

import pandas as pd

from tqdm.auto import tqdm


df = pd.DataFrame({

    "question": [],

    "contexts": [],

    "answer": [],

    "ground_truth": []

})


limit = 5


for i, row in tqdm(enumerate(ragas_data), total=limit):

    if i >= limit:

        break

    question = row["question"]

    ground_truths = row["ground_truth"]

    try:

        out = chat(question)

        answer = out["output"]

        if len(out["intermediate_steps"]) != 0:

            contexts = out["intermediate_steps"][0][1].split("\n---\n")

        else:

            # this is where no intermediate steps are used

            contexts = []

    except ValueError:

        answer = "ERROR"

        contexts = []

    df = pd.concat([df, pd.DataFrame({

        "question": question,

        "answer": answer,

        "contexts": [contexts],

        "ground_truth": ground_truths

    })], ignore_index=True)





from datasets import Dataset

from ragas.metrics import (

    faithfulness,

    answer_relevancy,

    context_precision,

    context_relevancy,

    context_recall,

    answer_similarity,

    answer_correctness,

)


eval_data = Dataset.from_dict(df)

eval_data


from ragas import evaluate


result = evaluate(

    dataset=eval_data,

    metrics=[

        faithfulness,

        answer_relevancy,

        context_precision,

        context_relevancy,

        context_recall,

        answer_similarity,

        answer_correctness,

    ],

)

result = result.to_pandas()

References:

https://github.com/pinecone-io/examples/blob/master/learn/generation/better-rag/03-ragas-evaluation.ipynb

Tuesday, November 5, 2024

What is SelfQueryRetriever in Langchain

In Langchain, SelfQueryRetriever is a specialized retriever designed to make the process of retrieving relevant documents more dynamic and context-aware. Unlike traditional retrievers that solely rely on similarity searches (e.g., vector searches), the SelfQueryRetriever allows for more sophisticated, natural language-based queries by combining natural language understanding with structured search capabilities.

Key Features of SelfQueryRetriever:

Natural Language Queries: It allows users to input complex, free-form questions or queries in natural language.

Dynamic Query Modification: It uses a language model (LLM) to modify or enhance the query dynamically based on the user input. This ensures that the query is refined to retrieve the most relevant results.

Structured Filters: It can also convert a user's question into structured filters that help narrow down the search more effectively. For example, it can apply specific criteria like filtering by date, category, or other metadata fields that are relevant to the search.

How SelfQueryRetriever Works:

Self-Querying: The retriever can automatically generate additional filters or modify the query to help retrieve more accurate or relevant results. It does this by analyzing the user query and applying specific transformations based on the context of the search.

LLM-Powered Refinement: A language model is used to understand the query and extract essential parameters that can guide the retrieval process. These parameters can be key-value pairs or specific instructions, enhancing the retrieval operation by filtering or adjusting the search criteria.

Difference from Other Retrievers:

Standard Retriever:


Relies on similarity search techniques (like vector search or keyword matching).

Simply matches the user's query to the stored documents and retrieves the most similar ones based on embeddings.

No dynamic query modification or structured filtering is involved.

SelfQueryRetriever:


More intelligent because it uses an LLM to interpret and enhance the user query.

It can apply structured filters based on the query (e.g., filter documents by date or category).

It dynamically refines the query using the LLM to ensure that the retrieval is both accurate and relevant.

Example Use Case:

Suppose you have a database of documents with metadata such as "author," "date," "category," etc. A user asks:


“Can you show me all network security articles written after 2020?”


A Standard Retriever would search for documents based on the similarity between the query and the document content (probably looking for the keywords “network security”).

A SelfQueryRetriever would use an LLM to break down the query into actionable parts:

Retrieve documents about network security.

Filter documents where the date is after 2020.

Return only articles matching both criteria.

This makes SelfQueryRetriever far more powerful in scenarios where specific, structured information needs to be extracted from large corpora of documents.


Sample Code:

Here’s a simple example of using SelfQueryRetriever in Langchain:



from langchain.chains.query_constructor.base import AttributeInfo

from langchain.retrievers.self_query.base import SelfQueryRetriever

from langchain.vectorstores import Chroma  # swapped in for FAISS: Chroma ships with a built-in self-query translator

from langchain.llms import OpenAI

from langchain.embeddings import OpenAIEmbeddings


# Define the attributes (metadata) of your documents

metadata_field_info = [

    AttributeInfo(name="author", description="The author of the document", type="string"),

    AttributeInfo(name="date", description="The publication date of the document", type="date"),

    AttributeInfo(name="category", description="The category of the document", type="string")

]


# Initialize your vector store and LLM.
# Note: the self-query retriever also needs the `lark` package for query parsing.
# The example texts and metadata below are placeholders.

embedding_model = OpenAIEmbeddings()

vector_store = Chroma.from_texts(
    ["Placeholder security report content."],
    embedding_model,
    metadatas=[{"author": "Alice", "date": "2022-03-01", "category": "security"}],
)

llm = OpenAI()


# Create SelfQueryRetriever

self_query_retriever = SelfQueryRetriever.from_llm(

    vectorstore=vector_store,

    llm=llm,

    document_contents="Security reports and articles",  # a natural-language description of the documents

    metadata_field_info=metadata_field_info,

    verbose=True

)


# Use the retriever to answer a query

query = "Show me all security reports written by Alice after 2021."

retrieved_docs = self_query_retriever.get_relevant_documents(query)


for doc in retrieved_docs:

    print(doc)

When to Use SelfQueryRetriever:

When your data has a lot of structured information (like metadata) and you need to refine queries based on that structure.

For advanced retrieval scenarios where the user queries require dynamic, intelligent modification or filtering.

In scenarios where similarity search alone might not retrieve the most relevant documents, and you need additional filtering or query modifications.


References:

OpenAI 

Friday, November 1, 2024

Does Markdown syntax help LLMs?

Markdown syntax like ###, **, and --- does not directly help ChatGPT-4 or other LLMs understand the content better in terms of meaning or context. The models interpret the underlying plain text, so formatting elements such as bold text, headings, or dividers are not processed in a way that changes the actual understanding of the text.


However, Markdown can still be useful in prompts for several reasons:


Improved clarity for human readability: Markdown can make it easier for humans to read and structure their prompts or responses, especially in cases like multi-step instructions, lists, or key points. This improved readability might indirectly lead to better prompts, helping the user or developers focus on clarity when communicating with the model.


Separating sections: For complex inputs, Markdown can visually organize the information, making it clear which parts belong to certain instructions or queries. In a multi-part conversation with a model, this can help both the human and the AI keep track of different sections logically.


Implicit structure hints: While the LLM doesn't interpret ### as a heading per se, the repetition of certain patterns (like labeled sections) might help it pick up on the structure of the text, such as treating a section starting with ### Inputs as listing relevant inputs.


In summary, Markdown won't improve the model’s inherent understanding, but it can help make your prompts clearer, well-structured, and easier to follow, which can lead to more accurate outputs by guiding how you formulate your instructions.