Sunday, October 13, 2024

embed_query in langchain's OpenAIEmbeddings

In langchain, the embed_query method of the OpenAIEmbeddings class is used to generate an embedding vector for a query (text input). The idea behind embeddings is to convert text into numerical vectors, which represent semantic meanings and are used for similarity searches, such as when comparing queries with stored documents or other text.

How it works:

Query Embeddings: When you call embed_query, it sends the input query (a piece of text) to the OpenAI API, which then returns a vector representation of that text.

Usage: This embedding is typically used to match queries with stored document embeddings in a vector database to find the most relevant document or answer. It helps in similarity search tasks by comparing how "close" the query vector is to other document vectors.

Example:

from langchain.embeddings import OpenAIEmbeddings

# Initialize the OpenAI embeddings object
openai_embeddings = OpenAIEmbeddings()

# Get the embedding for a query (a string of text)
query = "What is the version of this device?"
query_embedding = openai_embeddings.embed_query(query)

# Now you can use this embedding for similarity searches, etc.

Main Purpose: embed_query is used when you want to search or match a user's query with similar documents stored in a vector database or embedding store.
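
Since embed_query returns a plain Python list of floats, a quick way to see it in action is to compare the query vector against a few document vectors. Below is a minimal sketch, assuming numpy is installed; the document texts are made-up examples:

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings()

docs = [
    "The device firmware version is 2.4.1.",
    "Reset the device by holding the power button for ten seconds.",
]
doc_vectors = openai_embeddings.embed_documents(docs)
query_vector = openai_embeddings.embed_query("What is the version of this device?")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The document that mentions the firmware version should score highest.
for doc, vec in zip(docs, doc_vectors):
    print(round(cosine_similarity(query_vector, vec), 3), doc)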

Not all embedding models in Langchain support the embed_query method directly. The availability of the embed_query method depends on the specific embedding model you are using. Here’s a breakdown of how this works:

1. Models that support embed_query:

OpenAIEmbeddings: OpenAI models, such as text-embedding-ada-002, natively support the embed_query method, which allows you to generate query embeddings for similarity search or document retrieval tasks.

Other Cloud/Managed API Models: Similar to OpenAI, some managed services like Cohere, Hugging Face embeddings, etc., also provide embed_query functionality depending on the model's API.

2. Models that may not support embed_query:

Self-Hosted Models: Some self-hosted or custom models (e.g., using locally trained models or models running on frameworks like transformers or Sentence Transformers) may not have the embed_query method, unless specifically implemented.

Custom Embedding Models: If you are using a custom embedding model or provider, you may need to implement the method yourself if it’s not already included.

3. General Implementation:

The embed_query method is generally a convenience function that converts a query into an embedding. For models that don't provide this directly, you may still be able to call a generic embedding method like embed_documents or embed_text and apply that to queries. It might just not be explicitly named embed_query.

Alternative Methods:

If embed_query isn’t supported, you can usually still use the model’s general embedding method for queries by treating queries like any other document or text.

Example:

query_embedding = model.embed_documents([query])[0]  # embed_documents returns a list of vectors

In summary, many embedding models do support embed_query, especially those from major providers like OpenAI, Cohere, etc., but custom, self-hosted, or specialized models may require you to handle the embedding process for queries manually. Always check the specific embedding model’s documentation in Langchain to confirm support.



What is the difference between the Multi Query retriever, TimeBasedVectorStoreRetriever, and Self Query retriever in Langchain

In Langchain, different retrievers serve as mechanisms to extract relevant information from various data sources for LLM-based applications. Here’s a breakdown of these three retrievers:


1. Multi Query Retriever:

The Multi Query Retriever allows an LLM to generate multiple variations of a query to improve retrieval results. This helps address scenarios where different wordings of the same query might lead to different but relevant results in a vector store or database.


Purpose: Enhance recall by increasing the chances of retrieving relevant information through multiple reformulated queries.

Process: The retriever generates alternative queries (e.g., rephrases the user's original query) and uses them to search the data store. The combined results from these queries are then ranked and returned.

Use Case: Useful when you want to cover diverse interpretations or wordings of the user's question for more comprehensive results.

Example: When a user asks, "What is the best way to secure a database?", the retriever might generate alternative queries like:


"How to improve database security?"

"Best practices for securing a database?"

"How to safeguard databases?"

This helps in retrieving different but complementary documents or information.
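
Here is a minimal sketch of wiring this up with LangChain’s MultiQueryRetriever, assuming a small FAISS vector store built from a few placeholder texts and an OpenAI chat model (import paths may differ slightly between LangChain versions):

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.vectorstores import FAISS

# Build a small vector store from placeholder documents
docs = [
    "Enable TLS for all database connections.",
    "Rotate database credentials regularly and use least-privilege roles.",
    "Audit database access logs to detect suspicious activity.",
]
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())

# The LLM generates several rewordings of the user's query; results from all
# rewordings are merged and de-duplicated before being returned.
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)

results = retriever.get_relevant_documents("What is the best way to secure a database?")
for doc in results:
    print(doc.page_content)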


2. TimeBasedVectorStoreRetriever:

The TimeBasedVectorStoreRetriever is designed for retrieving information based on time relevance from a vector store. In addition to vector similarity search, it factors in the timestamp associated with documents, ensuring that results are time-ordered or time-filtered.


Purpose: To prioritize or filter documents based on their recency or relevance to a specific time range, in addition to vector similarity.

Process: This retriever can either rank results by their timestamp or restrict retrieval to a certain time window, depending on how it's set up.

Use Case: Ideal for applications dealing with time-sensitive information, like news archives, logs, or research articles.

Example: If the user asks, "What were the latest advancements in AI?", this retriever ensures that the most recent articles or documents are prioritized over older content.
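
LangChain’s closest built-in implementation of this idea is the TimeWeightedVectorStoreRetriever, which combines semantic similarity with an exponential recency decay based on a last_accessed_at timestamp in each document’s metadata. A minimal sketch, assuming FAISS and OpenAI embeddings:

import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)  # 1536 = dimensionality of text-embedding-ada-002
vectorstore = FAISS(embeddings.embed_query, index, InMemoryDocstore({}), {})

# decay_rate controls how quickly older documents lose relevance
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.01, k=2
)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents(
    [Document(page_content="Older AI news item",
              metadata={"last_accessed_at": yesterday})]
)
retriever.add_documents([Document(page_content="Latest AI breakthrough announced today")])

print(retriever.get_relevant_documents("latest advancements in AI"))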


3. Self Query Retriever:

The Self Query Retriever is an advanced retriever that uses an LLM to automatically generate structured queries (with filters) for more specific searches based on the user's query.


Purpose: Automatically apply metadata-based filters (e.g., date ranges, categories) to retrieve more targeted results.

Process: It involves the LLM analyzing the user's query to generate a structured query with filter conditions. These filters can be based on attributes like date, author, or document type, enhancing retrieval precision.

Use Case: Useful in situations where the data has rich metadata and users may have implicit requirements. For example, finding "recent research papers on deep learning by a specific author."

Example: If the user query is "Show me articles on machine learning from 2020," the retriever will automatically generate a query that filters for "machine learning" and restricts results to documents from 2020.
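
A minimal sketch with LangChain’s SelfQueryRetriever over a Chroma store with simple metadata (the lark package is required by the query constructor; the field names and documents here are illustrative):

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.schema import Document
from langchain.vectorstores import Chroma

docs = [
    Document(page_content="A survey of machine learning methods",
             metadata={"year": 2020, "topic": "machine learning"}),
    Document(page_content="Advances in deep learning architectures",
             metadata={"year": 2023, "topic": "deep learning"}),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# Describe the metadata so the LLM can translate the user's question
# into a structured query with filter conditions.
metadata_field_info = [
    AttributeInfo(name="year", description="Publication year", type="integer"),
    AttributeInfo(name="topic", description="Subject area of the article", type="string"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(temperature=0),
    vectorstore=vectorstore,
    document_contents="Technical articles",
    metadata_field_info=metadata_field_info,
)

print(retriever.get_relevant_documents("Show me articles on machine learning from 2020"))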


Key Differences:

Multi Query Retriever: Focuses on reformulating the query to improve recall, covering multiple possible variations.

TimeBasedVectorStoreRetriever: Prioritizes or filters results by time, useful for retrieving time-sensitive information.

Self Query Retriever: Automatically creates more precise queries with filtering based on metadata.

Each of these retrievers has its own specialized purpose, and the right one depends on the specific data retrieval needs of the application.

Thursday, October 10, 2024

What is the Berkeley Function Calling Leaderboard?

The Berkeley Function Calling Leaderboard (BFCL) is a great resource for comparing how different models perform on function-calling tasks. It also provides an evaluation suite for benchmarking your own fine-tuned model on a variety of challenging tool-calling tasks. The latest dataset, BFCL v3, was recently released and now includes multi-step, multi-turn function calling, further raising the bar for tool-based reasoning tasks.

Reasoning and tool calling are each powerful on their own, and when combined they have the potential to create agents that can effectively break down complicated tasks and autonomously interact with their environment. For more examples of AI agent architectures for reasoning, planning, and tool calling, check out my team’s survey paper on ArXiv.



References:

https://gorilla.cs.berkeley.edu/leaderboard.html

Tuesday, October 8, 2024

What is contextual Embedding in Anthropic RAG? - Anthropic

A recent paper from Anthropic introduced the concept of “contextual embeddings,” which addresses the problem of missing context by adding relevant context to each chunk before embedding it.

One way to do this is to leverage an LLM for contextual retrieval: develop a prompt that instructs the model to generate concise, chunk-specific context based on the overall document, so that each chunk carries its own contextual information.

Consider a query about a specific company’s quarterly revenue growth. A relevant chunk might be something like “The company’s revenue grew by 3% over the previous quarter”: it contains the growth percentage but lacks crucial details such as the company name or the time period. This absence of context can hinder accurate retrieval.

By sending the overall document to the LLM along with each chunk, we get a contextualized_chunk: a short, chunk-specific context sentence (for example, one that names the company and the quarter) prepended to the original chunk text.

This ‘contextualized_chunk’ is then sent to an embedding model to create the embedding of the chunk.
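
Here is a rough sketch of how the chunk-specific context can be generated and prepended before embedding. The prompt wording follows the general pattern described in Anthropic’s write-up, and the document/chunk variables are placeholders:

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings

llm = ChatOpenAI(temperature=0)
embeddings = OpenAIEmbeddings()

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer only with the
succinct context and nothing else."""

def contextualize_chunk(document: str, chunk: str) -> str:
    # Prepend the LLM-generated context to the chunk before embedding it.
    context = llm.predict(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context}\n\n{chunk}"

full_document = "...the company's full quarterly report goes here..."  # placeholder
chunk = "The company's revenue grew by 3% over the previous quarter"

contextualized_chunk = contextualize_chunk(full_document, chunk)
chunk_embedding = embeddings.embed_query(contextualized_chunk)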

Hybrid search approach

While contextual embeddings have been shown to improve on traditional semantic-search RAG, a hybrid approach that also incorporates BM25 can produce even better results.

The same chunk-specific context can also be used with BM25 search to further improve retrieval performance.

Okapi BM25

BM25 is a ranking algorithm that addresses some drawbacks of TF-IDF, namely term saturation and document length normalization.

Term saturation and diminishing returns

If a document contains 100 occurrences of the term “computer,” is it really twice as relevant as a document that contains 50 occurrences? We want to limit the contribution of TF once a term is likely to be saturated. BM25 solves this by introducing a parameter k1 that controls the shape of the saturation curve. k1 can be tuned so that as TF increases, the BM25 score eventually saturates, meaning further increases in TF no longer contribute much to the score.
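
To see the saturation effect concretely, here is a small sketch of the BM25 term-frequency component with document-length normalization omitted (i.e., b = 0); the contribution flattens out and approaches k1 + 1 as the raw count grows:

def bm25_tf_component(tf: float, k1: float = 1.2) -> float:
    # BM25 term-frequency saturation (length normalization omitted for clarity)
    return tf * (k1 + 1) / (tf + k1)

for tf in [1, 2, 5, 10, 50, 100]:
    print(tf, round(bm25_tf_component(tf), 3))
# 100 occurrences score only slightly higher than 50: the increase in TF
# no longer contributes much once the term is saturated.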






References:

https://levelup.gitconnected.com/the-best-rag-technique-yet-anthropics-contextual-retrieval-and-hybrid-search-62320d99004e


Monday, October 7, 2024

What's the difference between the Chain of Thought (CoT), ReAct, and Prompt Decomposition approaches

Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Prompt Decomposition are all advanced prompting techniques for improving the reasoning capabilities of large language models (LLMs). Each approach differs in how it manages complex tasks and guides the model’s reasoning. Here’s a breakdown of the differences:


1. Chain-of-Thought (CoT):

Purpose: CoT is designed to enhance the reasoning capabilities of an LLM by encouraging it to think step-by-step.

Approach: In CoT, the model is explicitly guided to break down its reasoning process for complex questions or tasks. Instead of jumping to an answer, the model generates intermediate steps or thoughts that lead to the final result.

How It Works: When given a question, the model first generates a "chain of thought" — a logical sequence of steps that helps it arrive at a conclusion.

Use Case: CoT is useful for multi-step problems, arithmetic reasoning, logical deduction, and scenarios where intermediate steps are important for accuracy.

Example: Prompt: "If a train travels 60 miles in 2 hours, how far will it travel in 5 hours at the same speed?" Model's response using CoT:


The train travels 60 miles in 2 hours.

So, its speed is 60 ÷ 2 = 30 miles per hour.

In 5 hours, it will travel 5 × 30 = 150 miles.
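
A simple way to elicit this behaviour is a zero-shot CoT instruction appended to the prompt. Sketched here with LangChain’s ChatOpenAI, though any chat model works:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

prompt = (
    "If a train travels 60 miles in 2 hours, how far will it travel in 5 hours "
    "at the same speed? Let's think step by step."
)
print(llm.predict(prompt))
# Expected shape of the answer: speed = 60 / 2 = 30 mph, so 5 * 30 = 150 miles.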

2. ReAct (Reasoning + Acting):

Purpose: ReAct combines reasoning (thought process) and actions (interactions with external tools or APIs) to solve tasks that require external input or real-time actions.

Approach: The model alternates between reasoning and acting steps. The reasoning process helps the model figure out what information or action is needed, and the acting step involves interacting with external systems (e.g., querying a database, using a calculator, calling an API). This combination leads to more effective decision-making in tasks that involve dynamic responses or external actions.

How It Works: ReAct prompts the model to first reason about the problem and then take action based on that reasoning. It can repeat this cycle multiple times if needed.

Use Case: ReAct is ideal for interactive tasks like searching a knowledge base, answering questions that involve retrieving data from external sources, or interacting with APIs.

Example: Prompt: "What is the current temperature in New York?" Model's response using ReAct:


First, I need to find the current temperature in New York (Reasoning).

Let me call a weather API to get the temperature (Acting).

The temperature in New York is 72°F (Result).
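
A minimal sketch of this loop using LangChain’s ReAct-style agent; the get_weather tool here is a hypothetical stand-in for a real weather API:

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

def get_weather(city: str) -> str:
    # Hypothetical tool: a real implementation would call a weather API.
    return f"The temperature in {city} is 72°F."

tools = [
    Tool(
        name="get_weather",
        func=get_weather,
        description="Returns the current temperature for a city.",
    )
]

# ZERO_SHOT_REACT_DESCRIPTION drives the Thought -> Action -> Observation loop.
agent = initialize_agent(
    tools,
    ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

agent.run("What is the current temperature in New York?")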

3. Prompt Decomposition:

Purpose: Prompt Decomposition breaks down a complex task or query into smaller, manageable subtasks or steps. Each subtask is handled separately, and the results are combined to address the original query.

Approach: Instead of giving a single, complex prompt, the task is divided into multiple sub-prompts, where each part of the task is processed independently. The results from these sub-prompts are then aggregated.

How It Works: The original query is split into smaller, more focused prompts, which may be handled by different agents or functions. This modular approach ensures that each part of the query is processed accurately, especially for multi-step or multi-domain tasks.

Use Case: Prompt Decomposition is used for complex tasks that involve multiple steps, specialized sub-tasks, or require integration from multiple sources. It is common in multi-agent systems and workflows that need to be handled in parts.

Example: Prompt: "How do I configure a router, ensure it meets security standards, and monitor network traffic?"


First sub-prompt: "What are the steps to configure a router?"

Second sub-prompt: "What are the security standards for routers?"

Third sub-prompt: "What are the best practices for monitoring network traffic?"
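
A minimal sketch of running these sub-prompts and aggregating the partial answers (using LangChain’s ChatOpenAI; in practice each sub-prompt could be routed to a different agent or run in parallel):

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

sub_prompts = [
    "What are the steps to configure a router?",
    "What are the security standards for routers?",
    "What are the best practices for monitoring network traffic?",
]

# Answer each sub-task independently.
partial_answers = [llm.predict(p) for p in sub_prompts]

# Aggregate the partial answers into one response to the original question.
synthesis_prompt = (
    "Combine the following partial answers into a single, coherent guide on "
    "configuring a router, meeting security standards, and monitoring network traffic:\n\n"
    + "\n\n".join(partial_answers)
)
print(llm.predict(synthesis_prompt))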

Key Differences:

Chain-of-Thought (CoT): Focuses on internal reasoning by prompting the model to think in logical steps without external action. It’s ideal for solving reasoning-based problems.

ReAct: Combines reasoning with external actions, where the model alternates between thought processes and interactions with tools or APIs.

Prompt Decomposition: Breaks down complex tasks into smaller, simpler components to handle them individually, often involving multiple steps or agents.

Summary:

CoT is mainly for reasoning step-by-step and is self-contained within the model’s thought process.

ReAct involves reasoning combined with taking external actions (e.g., tool usage or calling APIs).

Prompt Decomposition breaks a problem into multiple smaller tasks, which can be handled independently and in parallel by different agents or processes.

Each approach is useful depending on the complexity and type of task you are dealing with, whether it requires reasoning, external actions, or task breakdown.

references:

OpenAI

How to convert M4a to wav format

The code is at the link below:


https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

On Mac, the following dependencies need to be installed:

xcode-select --install => installs the command-line developer tools (I had Python 3.12)

brew install ffmpeg

pip install pydub


Run the code and it works well, replacing the m4a file with the wav version.

Nice utility for converting m4a files from QuickTime audio recordings to wav format!
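
For reference, the core of the conversion with pydub looks roughly like this (file names are placeholders; pydub relies on the ffmpeg installed above):

from pydub import AudioSegment

# Load the QuickTime recording and export it as WAV
audio = AudioSegment.from_file("recording.m4a", format="m4a")
audio.export("recording.wav", format="wav")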


References:

https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

What's the maximum token limit or context length for various LLM Models?

The maximum context length (or token limit) for various LLMs depends on the specific model you are using. Here’s a general breakdown for common LLMs and their context lengths:


1. OpenAI GPT Models:

GPT-3.5 (gpt-3.5-turbo): 4,096 tokens (the 16k variant supports 16,384)

GPT-4 (8k variant): 8,192 tokens

GPT-4 (32k variant): 32,768 tokens

2. Anthropic Claude:

Claude 2: 100k tokens (newer versions, such as Claude 2.1 and the Claude 3 family, support up to 200k)

3. LLaMA (Meta):

LLaMA-2 (7B, 13B, 70B): 4,096 tokens (extended-context fine-tunes may support more)

4. Cohere:

Cohere Command: 4,096 tokens

5. Mistral:

Mistral Models: Typically support 8,192 tokens or more depending on the implementation and fine-tuning.

Understanding Token Limits:

Tokens are units of text. A token might be as short as one character or as long as one word. For example, "chatGPT is great!" is split into about six tokens (something like ["chat", "G", "PT", " is", " great", "!"], depending on the tokenizer).

When providing context (like cli_retriever) or a prompt (runcli_prompt), the entire length (context + user question) must stay within the token limit. If the combined size exceeds the token limit, the model will truncate the input.

Determining Token Length in LangChain:

To ensure that your context (cli_retriever) and any additional inputs (e.g., runcli_prompt) fit within the LLM's context window, you can estimate token length or use LangChain utilities to split your input text if necessary (e.g., RecursiveCharacterTextSplitter).
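
A small sketch of estimating token counts with tiktoken and pre-splitting an oversized context with RecursiveCharacterTextSplitter (cl100k_base is the encoding used by the GPT-3.5/GPT-4 family; the context string is a placeholder):

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    # Count tokens the way the GPT-3.5/GPT-4 tokenizer does
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

context = "...retrieved documents go here..."  # placeholder
question = "What is the version of this device?"
print("total tokens:", count_tokens(context) + count_tokens(question))

# If the combined input is too large, split the context into chunks
# that fit within the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(context)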

So, for your runcli_chain, the maximum size of {"context": cli_retriever, "question": RunnablePassthrough()} depends on the specific LLM you are querying. You would typically set the chain’s limits based on the LLM’s token capacity mentioned above.


references:

OpenAI