Sunday, October 13, 2024

embed_query in langchain's OpenAIEmbeddings

In langchain, the embed_query method of the OpenAIEmbeddings class is used to generate an embedding vector for a query (text input). The idea behind embeddings is to convert text into numerical vectors, which represent semantic meanings and are used for similarity searches, such as when comparing queries with stored documents or other text.

How it works:

Query Embeddings: When you call embed_query, it sends the input query (a piece of text) to the OpenAI API, which then returns a vector representation of that text.

Usage: This embedding is typically used to match queries with stored document embeddings in a vector database to find the most relevant document or answer. It helps in similarity search tasks by comparing how "close" the query vector is to other document vectors.

Example:

from langchain.embeddings import OpenAIEmbeddings

# Initialize OpenAI Embeddings object

openai_embeddings = OpenAIEmbeddings()

# Get embedding for a query (a string of text)

query = "What is the version of this device?"

query_embedding = openai_embeddings.embed_query(query)

# Now, you can use this embedding for similarity searches, etc.

Main Purpose: embed_query is used when you want to search or match a user's query with similar documents stored in a vector database or embedding store.

Not all embedding models in LangChain support the embed_query method directly. The availability of the embed_query method depends on the specific embedding model you are using. Here’s a breakdown of how this works:

1. Models that support embed_query:

OpenAIEmbeddings: OpenAI models, such as text-embedding-ada-002, natively support the embed_query method, which allows you to generate query embeddings for similarity search or document retrieval tasks.

Other Cloud/Managed API Models: Similar to OpenAI, some managed services like Cohere, Hugging Face embeddings, etc., also provide embed_query functionality depending on the model's API.

2. Models that may not support embed_query:

Self-Hosted Models: Some self-hosted or custom models (e.g., using locally trained models or models running on frameworks like transformers or Sentence Transformers) may not have the embed_query method, unless specifically implemented.

Custom Embedding Models: If you are using a custom embedding model or provider, you may need to implement the method yourself if it’s not already included.

3. General Implementation:

The embed_query method is generally a convenience function that converts a query into an embedding. For models that don't provide this directly, you may still be able to call a generic embedding method like embed_documents or embed_text and apply that to queries. It might just not be explicitly named embed_query.

Alternative Methods:

If embed_query isn’t supported, you can usually still use the model’s general embedding method for queries by treating queries like any other document or text.

Example:

query_embedding = model.embed_documents([query])[0]  # embed_documents returns a list; take the single result

In summary, many embedding models do support embed_query, especially those from major providers like OpenAI, Cohere, etc., but custom, self-hosted, or specialized models may require you to handle the embedding process for queries manually. Always check the specific embedding model’s documentation in Langchain to confirm support.



What is the difference between the Multi Query retriever, TimeBasedVectorStoreRetriever, and Self Query retriever in LangChain?

 In Langchain, different retrievers serve as mechanisms to extract relevant information from various data sources for LLM-based applications. Here’s a breakdown of the key retrievers you mentioned:


1. Multi Query Retriever:

The Multi Query Retriever allows an LLM to generate multiple variations of a query to improve retrieval results. This helps address scenarios where different wordings of the same query might lead to different but relevant results in a vector store or database.


Purpose: Enhance recall by increasing the chances of retrieving relevant information through multiple reformulated queries.

Process: The retriever generates alternative queries (e.g., rephrases the user's original query) and uses them to search the data store. The combined results from these queries are then ranked and returned.

Use Case: Useful when you want to cover diverse interpretations or wordings of the user's question for more comprehensive results.

Example: When a user asks, "What is the best way to secure a database?", the retriever might generate alternative queries like:


"How to improve database security?"

"Best practices for securing a database?"

"How to safeguard databases?"

This helps in retrieving different but complementary documents or information.
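In LangChain this pattern is available as the MultiQueryRetriever. A minimal sketch, assuming an existing vectorstore and an OpenAI API key configured in the environment:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM generates several reformulations of the question behind the scenes
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)

docs = retriever.invoke("What is the best way to secure a database?")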


2. TimeBasedVectorStoreRetriever:

The TimeBasedVectorStoreRetriever is designed for retrieving information based on time relevance from a vector store. In addition to vector similarity search, it factors in the timestamp associated with documents, ensuring that results are time-ordered or time-filtered.


Purpose: To prioritize or filter documents based on their recency or relevance to a specific time range, in addition to vector similarity.

Process: This retriever can either rank results by their timestamp or restrict retrieval to a certain time window, depending on how it's set up.

Use Case: Ideal for applications dealing with time-sensitive information, like news archives, logs, or research articles.

Example: If the user asks, "What were the latest advancements in AI?", this retriever ensures that the most recent articles or documents are prioritized over older content.
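In LangChain, the closest built-in implementation is the TimeWeightedVectorStoreRetriever, which weights similarity by a time decay on how recently each document was accessed. A rough sketch, assuming OpenAI embeddings, a FAISS index (faiss-cpu installed), and illustrative decay_rate and k values:

from datetime import datetime, timedelta

import faiss
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)  # dimensionality of text-embedding-ada-002 vectors
vectorstore = FAISS(embeddings, index, InMemoryDocstore({}), {})

# A higher decay_rate makes older (less recently accessed) documents fade faster
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=0.01, k=3)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="Older AI news", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="Latest AI advancements")])

docs = retriever.invoke("What were the latest advancements in AI?")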


3. Self Query Retriever:

The Self Query Retriever is an advanced retriever that uses an LLM to automatically generate structured queries (with filters) for more specific searches based on the user's query.


Purpose: Automatically apply metadata-based filters (e.g., date ranges, categories) to retrieve more targeted results.

Process: It involves the LLM analyzing the user's query to generate a structured query with filter conditions. These filters can be based on attributes like date, author, or document type, enhancing retrieval precision.

Use Case: Useful in situations where the data has rich metadata and users may have implicit requirements. For example, finding "recent research papers on deep learning by a specific author."

Example: If the user query is "Show me articles on machine learning from 2020," the retriever will automatically generate a query that filters for "machine learning" and restricts results to documents from 2020.
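In LangChain this is the SelfQueryRetriever. A minimal sketch, assuming the documents in vectorstore carry a "year" metadata field and the lark package is installed:

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(name="year", description="Year the article was published", type="integer"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(temperature=0),
    vectorstore=vectorstore,
    document_contents="Articles about machine learning topics",
    metadata_field_info=metadata_field_info,
)

# The LLM turns this into a similarity search for "machine learning" plus a filter year == 2020
docs = retriever.invoke("Show me articles on machine learning from 2020")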


Key Differences:

Multi Query Retriever: Focuses on reformulating the query to improve recall, covering multiple possible variations.

TimeBasedVectorStoreRetriever: Prioritizes or filters results by time, useful for retrieving time-sensitive information.

Self Query Retriever: Automatically creates more precise queries with filtering based on metadata.

Each of these retrievers has its own specialized purpose, and the right one depends on the specific data retrieval needs of the application.

Thursday, October 10, 2024

What is the Berkeley Function Calling tool

Berkeley Function Calling Leaderboard (BFCL) is a great resource for comparing how different models perform on function calling tasks. It also provides an evaluation suite to compare your own fine-tuned model on various challenging tool calling tasks. In fact, the latest dataset, BFCL v3, was just released and now includes multi-step, multi-turn function calling, further raising the bar for tool based reasoning tasks.

Both types of reasoning are powerful independently, and when combined, they have the potential to create agents that can effectively break down complicated tasks and autonomously interact with their environment. For more examples of AI agent architectures for reasoning, planning, and tool calling, check out my team’s survey paper on ArXiv.



references:

https://gorilla.cs.berkeley.edu/leaderboard.html

Tuesday, October 8, 2024

What is contextual Embedding in Anthropic RAG? - Anthropic

A recent paper from Anthropic introduced the concept of “contextual embeddings”, which addresses the problem of missing context by adding relevant context to each chunk before embedding.

One way to do this is to leverage an LLM for contextual retrieval: develop a prompt that instructs the model to generate concise, chunk-specific context based on the overall document, so that each chunk carries its own contextual information.

Consider a query about a specific company’s quarterly revenue growth. A relevant chunk might be something like: “The company’s revenue grew by 3% over the previous quarter.” It contains the growth percentage but lacks crucial details such as the company name or time period, and this absence of context can hinder accurate retrieval.

By sending the overall document to the LLM along with each chunk, we get a contextualized_chunk along the lines of: “This chunk is from ACME Corp’s Q2 2023 report; the previous quarter’s revenue was $314 million. The company’s revenue grew by 3% over the previous quarter.”

This ‘contextualized_chunk’ is then sent to an embedding model to create the embedding of the chunk.
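A rough sketch of this contextualization step, assuming llm is any LangChain chat model, embeddings is any embedding model, and full_document / chunk hold your text; the prompt wording below is illustrative, not Anthropic's exact prompt:

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall document,
to improve search retrieval of the chunk. Answer only with the context."""

def contextualize_chunk(document: str, chunk: str) -> str:
    # Ask the LLM for chunk-specific context, then prepend it to the chunk
    context = llm.invoke(CONTEXT_PROMPT.format(document=document, chunk=chunk)).content
    return f"{context}\n\n{chunk}"

# The contextualized chunk (not the raw chunk) is what gets embedded
contextualized_chunk = contextualize_chunk(full_document, chunk)
vector = embeddings.embed_query(contextualized_chunk)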

Hybrid search approach

While contextual embeddings have been shown to improve upon traditional semantic-search RAG, a hybrid approach incorporating BM25 can produce even better results.

The same chunk-specific context can also be used with BM25 search to further improve retrieval performance.

Okapi BM25

BM25 is an algorithm that addresses some drawbacks of TF-IDF concerning term saturation and document length.

Term Saturation and diminishing return

If a document contains 100 occurrences of the term “computer,” is it really twice as relevant as a document that contains 50 occurrences? We want to control the contribution of TF when a term is likely to be saturated. BM25 solves this by introducing a parameter k1 that controls the shape of the saturation curve: as TF increases, the BM25 score eventually saturates, meaning further increases in TF contribute little to the score.
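As a quick illustration of the saturation effect, here is the BM25 term-frequency component computed for two term frequencies, assuming the document has average length so the length normalization drops out:

# BM25 term-frequency component: tf * (k1 + 1) / (tf + k1),
# assuming document length equals the average length
def bm25_tf_component(tf: float, k1: float = 1.2) -> float:
    return tf * (k1 + 1) / (tf + k1)

print(bm25_tf_component(50))   # ~2.15
print(bm25_tf_component(100))  # ~2.17 -> doubling TF barely changes the score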






References:

https://levelup.gitconnected.com/the-best-rag-technique-yet-anthropics-contextual-retrieval-and-hybrid-search-62320d99004e


Monday, October 7, 2024

What are the differences between the Chain of Thought (CoT), ReAct, and Prompt Decomposition approaches?

Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Prompt Decomposition are all advanced prompting techniques for improving the reasoning capabilities of large language models (LLMs). The approaches differ in how they manage complex tasks and guide the model’s reasoning. Here’s a breakdown of the differences:


1. Chain-of-Thought (CoT):

Purpose: CoT is designed to enhance the reasoning capabilities of an LLM by encouraging it to think step-by-step.

Approach: In CoT, the model is explicitly guided to break down its reasoning process for complex questions or tasks. Instead of jumping to an answer, the model generates intermediate steps or thoughts that lead to the final result.

How It Works: When given a question, the model first generates a "chain of thought" — a logical sequence of steps that helps it arrive at a conclusion.

Use Case: CoT is useful for multi-step problems, arithmetic reasoning, logical deduction, and scenarios where intermediate steps are important for accuracy.

Example: Prompt: "If a train travels 60 miles in 2 hours, how far will it travel in 5 hours at the same speed?" Model's response using CoT:


The train travels 60 miles in 2 hours.

So, its speed is 60 ÷ 2 = 30 miles per hour.

In 5 hours, it will travel 5 × 30 = 150 miles.

2. ReAct (Reasoning + Acting):

Purpose: ReAct combines reasoning (thought process) and actions (interactions with external tools or APIs) to solve tasks that require external input or real-time actions.

Approach: The model alternates between reasoning and acting steps. The reasoning process helps the model figure out what information or action is needed, and the acting step involves interacting with external systems (e.g., querying a database, using a calculator, calling an API). This combination leads to more effective decision-making in tasks that involve dynamic responses or external actions.

How It Works: ReAct prompts the model to first reason about the problem and then take action based on that reasoning. It can repeat this cycle multiple times if needed.

Use Case: ReAct is ideal for interactive tasks like searching a knowledge base, answering questions that involve retrieving data from external sources, or interacting with APIs.

Example: Prompt: "What is the current temperature in New York?" Model's response using ReAct:


First, I need to find the current temperature in New York (Reasoning).

Let me call a weather API to get the temperature (Acting).

The temperature in New York is 72°F (Result).

3. Prompt Decomposition:

Purpose: Prompt Decomposition breaks down a complex task or query into smaller, manageable subtasks or steps. Each subtask is handled separately, and the results are combined to address the original query.

Approach: Instead of giving a single, complex prompt, the task is divided into multiple sub-prompts, where each part of the task is processed independently. The results from these sub-prompts are then aggregated.

How It Works: The original query is split into smaller, more focused prompts, which may be handled by different agents or functions. This modular approach ensures that each part of the query is processed accurately, especially for multi-step or multi-domain tasks.

Use Case: Prompt Decomposition is used for complex tasks that involve multiple steps, specialized sub-tasks, or require integration from multiple sources. It is common in multi-agent systems and workflows that need to be handled in parts.

Example: Prompt: "How do I configure a router, ensure it meets security standards, and monitor network traffic?"


First sub-prompt: "What are the steps to configure a router?"

Second sub-prompt: "What are the security standards for routers?"

Third sub-prompt: "What are the best practices for monitoring network traffic?"
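A rough sketch of how these sub-prompts could be answered and aggregated, assuming llm is any LangChain chat model:

sub_prompts = [
    "What are the steps to configure a router?",
    "What are the security standards for routers?",
    "What are the best practices for monitoring network traffic?",
]

# Each sub-task is answered independently (could also run in parallel or by different agents)
sub_answers = [llm.invoke(p).content for p in sub_prompts]

# The partial answers are then aggregated into a single response
aggregate_prompt = (
    "Combine the following partial answers into one coherent guide:\n\n"
    + "\n\n".join(sub_answers)
)
final_answer = llm.invoke(aggregate_prompt).content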

Key Differences:

Chain-of-Thought (CoT): Focuses on internal reasoning by prompting the model to think in logical steps without external action. It’s ideal for solving reasoning-based problems.

ReAct: Combines reasoning with external actions, where the model alternates between thought processes and interactions with tools or APIs.

Prompt Decomposition: Breaks down complex tasks into smaller, simpler components to handle them individually, often involving multiple steps or agents.

Summary:

CoT is mainly for reasoning step-by-step and is self-contained within the model’s thought process.

ReAct involves reasoning combined with taking external actions (e.g., tool usage or calling APIs).

Prompt Decomposition breaks a problem into multiple smaller tasks, which can be handled independently and in parallel by different agents or processes.

Each approach is useful depending on the complexity and type of task you are dealing with, whether it requires reasoning, external actions, or task breakdown.

references:

OpenAI

How to convert M4a to wav format

The code is at the below link


https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

On Mac, the below dependencies need to be installed:

xcode-select --install  => installs the command line developer tools (I had Python 3.12)

brew install ffmpeg

pip install pydub


Run the code and it works well, replacing the m4a file with the wav version.

A nice utility to convert an m4a file from a QuickTime audio recording to wav format!
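The core of the conversion with pydub looks roughly like this (ffmpeg must be installed; the filename recording.m4a is just an example):

from pydub import AudioSegment

audio = AudioSegment.from_file("recording.m4a", format="m4a")
audio.export("recording.wav", format="wav")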


References:

https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

What's the maximum token limit or context length for various LLM Models?

The maximum context length (or token limit) for various LLMs depends on the specific model you are using. Here’s a general breakdown for common LLMs and their context lengths:


1. OpenAI GPT Models:

GPT-3.5 (davinci): 4,096 tokens

GPT-4 (8k variant): 8,192 tokens

GPT-4 (32k variant): 32,768 tokens

2. Anthropic Claude:

Claude 1/2: 100k tokens (depends on version, with newer versions supporting larger contexts)

3. LLaMA (Meta):

LLaMA-2 (7B, 13B): 4,096 tokens

LLaMA-2 (70B): 8,192 tokens (some variants may support more)

4. Cohere:

Cohere Command: 4,096 tokens

5. Mistral:

Mistral Models: Typically support 8,192 tokens or more depending on the implementation and fine-tuning.

Understanding Token Limits:

Tokens are units of text. A token might be as short as one character or as long as one word. For example, "chatGPT is great!" would be split into 6 tokens (["chat", "G", "PT", " is", " great", "!"]).

When providing context (like cli_retriever) or a prompt (runcli_prompt), the entire length (context + user question) must stay within the token limit. If the combined size exceeds the token limit, the model will truncate the input.

Determining Token Length in LangChain:

To ensure that your context (cli_retriever) and any additional inputs (e.g., runcli_prompt) fit within the LLM's context window, you can estimate token length or use LangChain utilities to split your input text if necessary (e.g., RecursiveCharacterTextSplitter).

So, for your runcli_chain, the maximum size of {"context": cli_retriever, "question": RunnablePassthrough()} depends on the specific LLM you are querying. You would typically set the chain’s limits based on the LLM’s token capacity mentioned above.
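To estimate token counts before building the chain input, something like the following can be used; retrieved_context and user_question are placeholders for your own values, and the tiktoken package is assumed to be installed:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
context_tokens = len(encoding.encode(retrieved_context))   # e.g. text returned by the retriever
question_tokens = len(encoding.encode(user_question))
print(context_tokens + question_tokens)  # must stay under the model's context limit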


references:

OpenAI 

Sunday, October 6, 2024

RecursiveCharacterTextSplitter vs RecursiveJsonSplitter

RecursiveCharacterTextSplitter — LangChain text splitter documentation (api.python.langchain.com)

When preparing JSON documents for storage in a vector database, the RecursiveCharacterTextSplitter from LangChain is an effective tool. It recursively divides text into smaller, contextually meaningful chunks, which is advantageous for maintaining the semantic integrity of your data during retrieval.

Key Features of RecursiveCharacterTextSplitter:

Recursive Splitting: The splitter attempts to divide text using a list of specified characters, such as newline or space, to create chunks that are semantically coherent.

Parameter Customization: You can adjust parameters like chunk_size to control the maximum length of each chunk and chunk_overlap to specify the number of overlapping characters between chunks, ensuring flexibility based on your data's requirements.
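A minimal sketch; the chunk_size and chunk_overlap values are illustrative, long_text is assumed to hold the document text, and the langchain-text-splitters package is assumed to be installed:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_text)           # list of strings
# or: docs = splitter.split_documents(documents)  # list of Document objects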

Alternatives and Their Differences:

While RecursiveCharacterTextSplitter is recommended for generic text, other splitters are available, each suited to specific needs:

CharacterTextSplitter:

Functionality: Splits text based on a single specified character, such as a newline.

Use Case: Suitable when the text can be effectively divided by a specific delimiter.

Limitation: May not handle complex or nested structures as effectively as recursive methods.

RecursiveJsonSplitter:

Functionality: Designed for JSON data, it recursively splits JSON objects into smaller components.

Use Case: Ideal when working with structured JSON documents that require parsing into subcomponents.

Limitation: Tailored for JSON, so not as versatile for other text formats.
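A minimal sketch of the JSON variant, assuming json_data is a Python dict loaded from the JSON document and the max_chunk_size value is illustrative:

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = splitter.split_json(json_data=json_data)  # list of smaller dicts
docs = splitter.create_documents(texts=[json_data])     # or wrap the pieces as Documents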

Considerations for JSON Documents:

Although RecursiveCharacterTextSplitter is effective for generic text, when dealing with JSON documents, consider the following:

Structure Preservation: Ensure that the splitting process maintains the hierarchical structure of JSON data to prevent loss of contextual relationships.

Semantic Integrity: Retain meaningful groupings within the JSON data to facilitate accurate and efficient retrieval from the vector database.

In summary, RecursiveCharacterTextSplitter is a versatile choice for segmenting JSON documents, especially when preserving context is crucial. However, evaluate the specific requirements of your data and retrieval needs to determine the most suitable splitting strategy.


What are various text splitters and their differences?

In Langchain and other text-processing frameworks for Large Language Models (LLMs), different text-splitting strategies are available to handle large documents effectively and avoid exceeding token limits. Here are some options for text splitting and how they differ:

1. RecursiveCharacterTextSplitter

How it works: This splitter recursively breaks down large documents into smaller chunks using a hierarchy of delimiters like paragraphs, sentences, and words.

Pros:

Preserves logical structure (e.g., paragraphs).

Handles large chunks while retaining context.

Cons:

Slower for very large documents due to recursion.

Key method:

RecursiveCharacterTextSplitter.from_tiktoken_encoder: Uses token-based splitting (specific to token encoding methods like OpenAI’s tiktoken).

2. CharacterTextSplitter

How it works: This simpler splitter divides text based on a single character or set of characters (like space, newline).

Pros:

Fast and efficient for straightforward splitting.

Cons:

Can split content mid-sentence or mid-paragraph, which might disrupt context.

Key method:


Splits based on specific delimiters like newline or any character of choice.

Use Case: Fast processing, where document structure is not as important.


3. N-gram Text Splitters

How it works: Splits documents into smaller chunks based on N-grams (groups of n words or tokens).

Pros:

Good for splitting while maintaining word-level granularity.

Cons:

Might lose structural integrity (i.e., paragraphs, sentences).

Use Case: Ideal for small token batches, especially for keyword-based searches or training LLMs that need sliding window approaches.


4. SentenceTextSplitter

How it works: Splits text based on sentence boundaries using natural language processing (NLP) techniques.

Pros:

Maintains logical sentence-level coherence.

Great for applications where each sentence matters (e.g., summarization, question-answering).

Cons:

Might generate larger chunks, leading to token overflow if not used in combination with other splitters.

Use Case: Tasks that require sentence-level processing, like summarization.


5. TokenTextSplitter

How it works: Splits based on tokens (e.g., word tokens or byte-pair encodings) instead of characters.

Pros:

Precise control over token limits, perfect for LLMs like GPT that have token-based inputs.

Cons:

May cut off mid-word or mid-sentence if token boundaries don’t align with logical text structures.

Use Case: Managing token limits when interacting with models that operate on tokens.
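A minimal sketch, assuming long_text holds the document text and tiktoken is installed:

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = splitter.split_text(long_text)  # each chunk is at most 256 tokens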


6. Document Splitters (Custom)

How it works: Custom splitters that use domain-specific knowledge (e.g., splitting based on document sections or code blocks).

Pros:

Allows customization based on specific document formats (e.g., splitting code by function).

Cons:

Requires custom logic, making implementation more complex.

Use Case: Domain-specific documents like programming code, academic papers, or medical reports.


Differences Between These Methods:

Granularity: Some splitters (like SentenceTextSplitter) work on a higher granularity (sentences), while others (TokenTextSplitter, CharacterTextSplitter) can split at lower levels (characters, tokens).

Structure Preservation: Recursive splitters preserve the document structure better, while simpler methods (like CharacterTextSplitter) may sacrifice coherence for speed.

Speed: Simpler splitters like CharacterTextSplitter are faster but can disrupt text flow. Recursive methods are slower but more robust for maintaining context.

Use Cases: RecursiveCharacterTextSplitter is better for documents requiring context retention, while TokenTextSplitter is more useful for precise control over token consumption.

Depending on your task, you can choose a splitter that balances speed, context preservation, and granularity of chunks.


Postgres Vector Search (PGVector) - Adding and retrieving


import json
import os
import time

# Assuming the langchain-huggingface and langchain-postgres packages;
# `data` (a list of Documents) and `read_collection_ids()` are assumed to be defined elsewhere
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_postgres import PGVector

model_name = "sentence-transformers/all-mpnet-base-v2"

model_kwargs = {'device': 'cpu'}

encode_kwargs = {'normalize_embeddings': False}

embedding_function = HuggingFaceEmbeddings(

    model_name=model_name,

    model_kwargs=model_kwargs,

    encode_kwargs=encode_kwargs

)



vector_store = PGVector(

    embeddings=embedding_function,

    collection_name=os.environ["COLLECTION_NAME"],

    connection=os.environ["DB_CONNECTION_STRING"],

    use_jsonb=True,

)



def add_to_db_in_batches(batch_size=100):

    existing_ids = read_collection_ids()


    data_ids = [str(json.loads(item.page_content)["id"]) for item in data]


    new_ids = list(set(data_ids) - set(existing_ids))



    # print(new_ids)



    if len(new_ids) > 0:

        new_documents = [item for item in data if json.loads(item.page_content)["id"] in new_ids]



        total_products = len(new_documents)

        start_time = time.time()  # Start the timer

        

        for i in range(0, total_products, batch_size):

            batch_data = new_documents[i:i + batch_size]

            ids = [json.loads(item.page_content)["id"] for item in batch_data]

            vector_store.add_documents(batch_data, ids=ids)

            remaining = total_products - (i + len(batch_data))

            

            elapsed_time = time.time() - start_time

            batches_processed = (i // batch_size) + 1

            average_time_per_batch = elapsed_time / batches_processed if batches_processed > 0 else 0

            estimated_remaining_batches = (total_products // batch_size) - batches_processed

            estimated_remaining_time = average_time_per_batch * estimated_remaining_batches

            

            # Format estimated remaining time

            estimated_remaining_time_minutes = estimated_remaining_time // 60

            estimated_remaining_time_seconds = estimated_remaining_time % 60

            

            print(f'Added products {i + 1} to {min(i + len(batch_data), total_products)} to the database. '

                f'Remaining: {remaining}. Estimated remaining time: {int(estimated_remaining_time_minutes)} minutes and {int(estimated_remaining_time_seconds)} seconds.')


    else:

        pass




Now, to search it, the following can be done:


import json

from typing import Annotated

from fastapi import Query

from pydantic import BaseModel, Field

from .setup import vector_store



class SearchParams(BaseModel):

    query: str = Field(..., max_length=150)

    k: int = Field(5, ge=5, le=1000)



def get_search_results(params: Annotated[SearchParams, Query()]):


    results = vector_store.similarity_search(

        query=params.query,

        k=params.k

    )



    response = [json.loads(result.page_content) for result in results]


    return response
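To expose this handler as an HTTP endpoint, it could be wired into a FastAPI app roughly as follows; the module path .search is hypothetical, and a recent FastAPI version that supports Pydantic models as query parameters is assumed:

from fastapi import FastAPI

from .search import get_search_results  # hypothetical module containing the handler above

app = FastAPI()

# Registers GET /search; FastAPI parses ?query=...&k=... into SearchParams
app.get("/search")(get_search_results)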

Thursday, October 3, 2024

What are prebuilt ReAct agents and how to use memory in them

from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage

# `model`, the `execute_sql` tool, and `system_prompt` are assumed to be defined earlier

prebuilt_doc_agent = create_react_agent(model, [execute_sql],
  state_modifier = system_prompt)


from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")


prebuilt_doc_agent = create_react_agent(model, [execute_sql], 

  checkpointer=memory)



class SQLAgent:

  def __init__(self, model, tools, system_prompt = ""):

    <...>

    self.graph = graph.compile(checkpointer=memory)

    <...>



# defining thread

thread = {"configurable": {"thread_id": "1"}}

messages = [HumanMessage(content="What info do we have in ecommerce_db.users table?")]


for event in prebuilt_doc_agent.stream({"messages": messages}, thread):

    for v in event.values():

        v['messages'][-1].pretty_print()



followup_messages = [HumanMessage(content="I would like to know the column names and types. Maybe you could look it up in database using describe.")]


for event in prebuilt_doc_agent.stream({"messages": followup_messages}, thread):

    for v in event.values():

        v['messages'][-1].pretty_print()



new_thread = {"configurable": {"thread_id": "42"}}

followup_messages = [HumanMessage(content="I would like to know the column names and types. Maybe you could look it up in database using describe.")]


for event in prebuilt_doc_agent.stream({"messages": followup_messages}, new_thread):

    for v in event.values():

        v['messages'][-1].pretty_print()



references:

https://towardsdatascience.com/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787




What are different types of LLM routers?

LLM Completion Routers

LLM Function Calling Routers

Semantic Routers

Zero Shot Classification Routers

Language Classification Routers


Below is an example of LLM Completion Router

===========================================


from langchain_anthropic import ChatAnthropic

from langchain_core.output_parsers import StrOutputParser

from langchain_core.prompts import PromptTemplate


# Set up the LLM Chain to return a single word based on the query,

# and based on a list of words we provide to it in the prompt template

llm_completion_select_route_chain = (

        PromptTemplate.from_template("""

Given the user question below, classify it as either

being about `LangChain`, `Anthropic`, or `Other`.


Do not respond with more than one word.


<question>

{question}

</question>


Classification:"""

                                     )

        | ChatAnthropic(model_name="claude-3-haiku-20240307")

        | StrOutputParser()

)



# We setup an IF/Else condition to route the query to the correct chain 

# based on the LLM completion call above

def route_to_chain(route_name):

    if "anthropic" == route_name.lower():

        return anthropic_chain

    elif "langchain" == route_name.lower():

        return langchain_chain

    else:

        return general_chain


...


# Later on in the application, we can use the response from the LLM

# completion chain to control (i.e route) the flow of the application 

# to the correct chain via the route_to_chain method we created

route_name = llm_completion_select_route_chain.invoke(user_query)

chain = route_to_chain(route_name)

chain.invoke(user_query)



LLM Function Calling Router / Pydantic Router

=============================================


This leverages the function-calling ability of LLMs to pick a route to traverse. The different routes are set up as functions with appropriate descriptions in the LLM Function Call. Then, based on the query passed to the LLM, it is able to return the correct function (i.e route), for us to take.
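A rough sketch of this pattern using a Pydantic model together with LangChain's structured-output support; an OpenAI chat model is assumed here, and the route names are illustrative:

from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class RouteChoice(BaseModel):
    """Pick the chain best suited to answer the user's question."""
    route: Literal["langchain", "anthropic", "other"] = Field(
        description="The topic the question is about"
    )


router_llm = ChatOpenAI(temperature=0).with_structured_output(RouteChoice)

route = router_llm.invoke("How do I build a retriever in LangChain?").route
# -> "langchain", which can then be mapped to the matching chain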



Semantic Router

This router type leverages embeddings and similarity searches to select the best route to traverse.


Each route has a set of example queries associated with it, that become embedded and stored as vectors. The incoming query gets embedded also, and a similarity search is done against the other sample queries from the router. The route which belongs to the query with the closest match gets selected.


from semantic_router import Route


# we could use this as a guide for our chatbot to avoid political

# conversations

politics = Route(

    name="politics",

    utterances=[

        "isn't politics the best thing ever",

        "why don't you tell me about your political opinions",

        "don't you just love the president",

        "they're going to destroy this country!",

        "they will save the country!",

    ],

)


# this could be used as an indicator to our chatbot to switch to a more

# conversational prompt

chitchat = Route(

    name="chitchat",

    utterances=[

        "how's the weather today?",

        "how are things going?",

        "lovely weather today",

        "the weather is horrendous",

        "let's go to the chippy",

    ],

)


# we place both of our decisions together into single list

routes = [politics, chitchat]

from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder()


from semantic_router.layer import RouteLayer

route_layer = RouteLayer(encoder=encoder, routes=routes)

route_layer("don't you love politics?").name

# -> 'politics'


Zero Shot Classification Router

==============================

“Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes”. These routers leverage a Zero-Shot Classification model to assign a label to a piece of text, from a predefined set of labels you pass in to the router.
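A rough sketch of such a router using the Hugging Face zero-shot classification pipeline; the model choice and candidate labels are illustrative:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def route_query(query: str) -> str:
    result = classifier(query, candidate_labels=["langchain", "anthropic", "other"])
    return result["labels"][0]  # the highest-scoring label becomes the route

route_name = route_query("How do I build a retriever in LangChain?")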





references:

https://towardsdatascience.com/routing-in-rag-driven-applications-a685460a7220


What is zero-shot, one-shot, and few-shot prompting

 Zero-shot, one-shot, and few-shot prompting refer to different techniques used to guide language models (like GPT) to perform tasks by providing varying amounts of examples in the prompt. These techniques are critical in natural language processing (NLP) as they dictate how much context or task-related information is provided to the model. Here's a breakdown of each:


1. Zero-shot Prompting

In zero-shot prompting, the model is asked to perform a task without being given any example in the prompt. The model is expected to understand and generate a response based solely on the task description.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of this sentence: "I love this product."

The model is directly asked to classify sentiment without any prior examples.

Useful when the model has already been pre-trained on similar tasks and can infer the task from context.

2. One-shot Prompting

In one-shot prompting, you provide the model with one example of how the task should be performed. This single example serves as a guide for the model to understand the expected format of the response.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of this sentence: "I hate this service." Answer: Negative.

Now classify the sentiment of this sentence: "I love this product."

The model is provided with one example (I hate this service classified as Negative) to understand the task before being asked to classify a new sentence.

3. Few-shot Prompting

In few-shot prompting, you provide the model with a few examples (typically 2-5) to guide it in understanding how to perform the task. These examples help the model generate responses that match the desired output pattern.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of these sentences:

1. "I hate this service." Answer: Negative.

2. "This is the worst experience ever." Answer: Negative.

3. "I love this product." Answer: Positive.

Now classify the sentiment of this sentence: "The food was amazing."

The model is given three examples, showing both positive and negative classifications, before being asked to classify the final sentence.

When to Use Each Technique

Zero-shot: Ideal when you want the model to generalize based on prior training without providing specific examples (suitable for simple tasks where the model has context).

One-shot: Useful when the task may not be as straightforward, but a single example is enough for the model to catch on.

Few-shot: Best for more complex tasks where the model needs multiple examples to understand the nuances of how the task should be performed.



references:

OpenAI 

What is Zero-Shot Classification

Zero Shot Classification is the task of predicting a class that wasn't seen by the model during training. This method, which leverages a pre-trained language model, can be thought of as an instance of transfer learning which generally refers to using a model trained for one task in a different application than what it was originally trained for. This is particularly useful for situations where the amount of labeled data is small.

In zero shot classification, we provide the model with a prompt and a sequence of text that describes what we want our model to do, in natural language. Zero-shot classification excludes any examples of the desired task being completed. This differs from single or few-shot classification, as these tasks include a single or a few examples of the selected task.

Zero, single and few-shot classification seem to be an emergent feature of large language models. This feature seems to come about around model sizes of +100M parameters. The effectiveness of a model at a zero, single or few-shot task seems to scale with model size, meaning that larger models (models with more trainable parameters or layers) generally do better at this task.

Here is an example of a zero-shot prompt for classifying the sentiment of a sequence of text:

Classify the following input text into one of the following three categories: [positive, negative, neutral]

Input Text: Hugging Face is awesome for making all of these

state of the art models available!

Sentiment: positive

One great example of this task with a nice off-the-shelf model is available via the widget on the model's Hugging Face page, where the user can input a sequence of text and candidate labels. This is a word-level example of zero-shot classification; more elaborate and lengthy generations are available with larger models. Testing these models out and getting a feel for prompt engineering is the best way to learn how to use them.

from transformers import pipeline

pipe = pipeline(model="facebook/bart-large-mnli")

pipe("I have a problem with my iphone that needs to be resolved asap!",

    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],

)

# output

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]}


Wednesday, October 2, 2024

What is Semantic Routing in RAG

 Each route has a set of example queries associated with it, that become embedded and stored as vectors. The incoming query gets embedded also, and a similarity search is done against the other sample queries from the router. The route which belongs to the query with the closest match gets selected.

There is in fact a Python package called semantic-router that does just this. Let’s look at some implementation details to get a better idea of how the whole thing works. These examples come straight out of that library’s GitHub page.


from semantic_router import Route

# we could use this as a guide for our chatbot to avoid political

# conversations

politics = Route(

    name="politics",

    utterances=[

        "isn't politics the best thing ever",

        "why don't you tell me about your political opinions",

        "don't you just love the president",

        "they're going to destroy this country!",

        "they will save the country!",

    ],

)

# this could be used as an indicator to our chatbot to switch to a more

# conversational prompt

chitchat = Route(

    name="chitchat",

    utterances=[

        "how's the weather today?",

        "how are things going?",

        "lovely weather today",

        "the weather is horrendous",

        "let's go to the chippy",

    ],

)

# we place both of our decisions together into single list

routes = [politics, chitchat]


from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder()

from semantic_router.layer import RouteLayer

route_layer = RouteLayer(encoder=encoder, routes=routes)

route_layer("don't you love politics?").name

references:

https://towardsdatascience.com/routing-in-rag-driven-applications-a685460a7220


Some notes on Mistral 7B

Mistral 7B is a 7B-parameter model released by Mistral AI, distributed under the Apache 2.0 license. It is available both as an instruct (instruction-following) model and as a text-completion model.

The Mistral AI team has noted that Mistral 7B:

Outperforms Llama 2 13B on all benchmarks

Outperforms Llama 1 34B on many benchmarks

Approaches CodeLlama 7B performance on code, while remaining good at English tasks

Mistral 0.3 supports function calling with Ollama’s raw mode.

Example raw prompt

[AVAILABLE_TOOLS] [{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location."}}, "required": ["location", "format"]}}}][/AVAILABLE_TOOLS][INST] What is the weather like today in San Francisco [/INST]

Example response

[TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "San Francisco, CA", "format": "celsius"}}]

Variations

instruct Instruct models follow instructions

text Text models are the base foundation model without any fine-tuning for conversations, and are best used for simple text completion.

Usage

CLI

Instruct:

ollama run mistral

API

Example:

curl -X POST http://localhost:11434/api/generate -d '{

  "model": "mistral",

  "prompt":"Here is a story about llamas eating grass"

 }'

To run Mistral locally: ollama run mistral

References:

https://ollama.com/library/mistral?ref=maginative.com


Tuesday, October 1, 2024

An AI way to extract text from a PDF

import asyncio
import base64
from io import BytesIO

import pypdfium2 as pdfium

# `clienta` (an async OpenAI client) and MODEL are assumed to be configured elsewhere

async def document_analysis(filename: str) -> list[str]:

    """

    Document Understanding

    Args:

        filename: pdf filename str

    """


    pdf = pdfium.PdfDocument(filename)

    images = []

    print("Retrieved PDF ",len(pdf))

    for i in range(len(pdf)):

        print("Iter count ", i)

        page = pdf[i]

        print("Got the page ", page)

        image = page.render(scale=8).to_pil()

        buffered = BytesIO()

        image.save(buffered, format="JPEG")

        img_byte = buffered.getvalue()

        img_base64 = base64.b64encode(img_byte).decode("utf-8")

        images.append(img_base64)


    text_of_pages = await asyncio.gather(*[parse_page_with_gpt(image) for image in images])

    print("Text of pages got")

    results = []


    extracted_texts = [doc for doc in text_of_pages]

    # Clean each string in the list and append to json_results

    for text in extracted_texts:

        results.append(text)

        

    return results




async def parse_page_with_gpt(base64_image: str) -> str:

    messages=[

        {

            "role": "system",

            "content": """

            

            You are a helpful assistant that extracts information from images.

            

            """

        },

        {

            "role": "user",

            "content": [

                {"type": "text", "text": "Extract information from image into text"},

                {

                    "type": "image_url",

                    "image_url": {

                        "url": f"data:image/jpeg;base64,{base64_image}",

                        "detail": "auto"

                    },

                },

            ],

        }

    ]

    response = await clienta.chat.completions.create(

        model=MODEL,

        messages=messages,

        temperature=0,

        max_tokens=4096,

    )

    return response.choices[0].message.content or ""


What are UPF and SMF in the 5G Core?

UPF (User Plane Function) and SMF (Session Management Function) are key components in the 5G core network architecture, particularly in the 3GPP (3rd Generation Partnership Project) standards. Both are part of the Service-Based Architecture (SBA) used in modern mobile networks. Here's what each of these devices/functions does:

1. UPF (User Plane Function):

Role: UPF is responsible for managing the user plane, which deals with data transmission between the RAN (Radio Access Network) and external data networks (like the Internet).

Functions:

Routing and forwarding of data packets.

Handling traffic policies, Quality of Service (QoS), and load balancing.

Packet inspection and filtering.

Lawful interception for data traffic.

Data usage monitoring (for charging purposes).

Significance: It decouples the user plane from the control plane, allowing better scalability and flexibility, which is key for 5G's promise of low latency and high throughput.

2. SMF (Session Management Function):

Role: SMF manages the control plane part of the user session, specifically handling the setup, modification, and teardown of sessions that route user data.

Functions:

Session establishment, modification, and release.

IP address allocation and management.

Interaction with Policy Control Function (PCF) for implementing network policies.

Management of UPF connections (e.g., determining how the UPF routes user traffic).

Manages mobility, handling session continuity as users move across different network zones.

Significance: It ensures seamless service and manages user sessions across the network, optimizing performance and connectivity.

In Summary:

UPF handles the actual data traffic (user data) flowing through the network, dealing with packet forwarding, routing, and traffic policies.

SMF is responsible for managing the sessions that determine how user data is routed, including the control over the UPF.

Together, these functions are essential for enabling the high-speed, low-latency capabilities of 5G networks.


References:

OpenAI


What is Ferret-UI

Ferret-UI is a model designed to understand user interactions with a mobile screen.



Mobile UI Understanding

Hence the paradigm shift is from natural language understanding to mobile UI understanding: from understanding conversational context to understanding the current context on a mobile screen.


Visual Understanding



NLU aims to enable machines to comprehend human language and respond appropriately. It focuses on structuring unstructured conversational input to understand the user’s meaning.

This structuring is essential for making sense of human conversation across various mediums, ensuring a Conversational UI can effectively process unstructured data.

Ferret-UI thus moves from only understanding conversations to understanding screens, gleaning context from screens and user interactions as opposed to conversations only.

Ferret-UI can be considered as a RAG implementation where augmentation is not performed via retrieved documents, but rather retrieved screens.

Conversations are unstructured data, and part of a Conversational UI’s job is to create structure around this unstructured data. In a similar fashion, Ferret-UI creates structure around what is displayed on the screen.

Ferret-UI adds language to what is mapped on the screen, allowing context-rich, accurate, multi-turn conversations, a significant step up from the “single dialog-turn command and control” scenario.

Not only does Ferret-UI add a language layer to devices, but other functionality can be added as well, like task orchestration based on user behaviour, anticipating the next interaction, user guidance, and more.


 

references:

https://cobusgreyling.medium.com/moving-from-natural-language-understanding-to-mobile-ui-understanding-18cd775c11b3

What are various types of Ollama models

 

Key Features of Ollama

Easy to Use & User-Friendly Interface: Quickly download and use open-source LLMs with a straightforward setup process.

Versatile: Supports a wide variety of models, including those for text, images, audio, and multimodal tasks.

Cross-Platform Compatibility: Available on macOS, Windows, and Linux.

Offline Models: Operate large language models without needing a continuous internet connection.

High Performance: Built on llama.cpp, which offers state-of-the-art performance on a wide variety of hardware, both locally and in the cloud. It efficiently utilizes the available resources, such as your GPU, or MPS on Apple silicon.

Cost-Effective: Save on compute costs by executing models locally.

Privacy: Local processing ensures your data remains secure and private.

Limitations

While Ollama offers numerous benefits, it’s important to be aware of its limitations:

Inference Only: Ollama is designed solely for model inference. For training or fine-tuning models, you will need to use tools like Hugging Face, TensorFlow, or PyTorch.

Setup and Advanced Functionalities: For detailed configuration for model inference or training, other libraries such as Hugging Face and PyTorch are necessary.

Performance: Although Ollama is built on llama.cpp, it may still be slower than using llama.cpp directly.
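For model inference from Python, a minimal sketch using the ollama client library, assuming pip install ollama and that the model has already been pulled (e.g. ollama pull mistral):

import ollama

response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Summarize what Ollama is in one sentence."}],
)
print(response["message"]["content"])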

references:

OpenAI