Tuesday, October 29, 2024

What is Ray Framework?

Ray is an open-source framework designed to enable the development of scalable and distributed applications in Python. It provides a simple and flexible programming model for building distributed systems, making it easier to leverage the power of parallel and distributed computing. Some key features and capabilities of the Ray framework include:

Ray allows you to easily parallelize your Python code by executing tasks concurrently across multiple CPU cores or even across a cluster of machines. This enables faster execution and improved performance for computationally intensive tasks.

Ray provides a distributed execution model, allowing you to scale your applications beyond a single machine. It offers tools for distributed scheduling, fault tolerance, and resource management, making it easier to handle large-scale computations

With Ray, you can define Python functions that can be executed remotely. This enables you to offload computation to different nodes in a cluster, distributing the workload and improving overall efficiency.
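For example, here is a minimal sketch of Ray's remote tasks (the function and workload are illustrative):

import ray

ray.init()  # starts a local Ray runtime; pass an address to join an existing cluster

@ray.remote
def square(x):
    # runs as a task on any available worker, possibly on another machine
    return x * x

futures = [square.remote(i) for i in range(8)]  # schedules tasks, returns ObjectRefs immediately
print(ray.get(futures))                         # blocks until all results are ready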

Ray provides high-level abstractions for distributed data processing, such as distributed data frames and distributed object stores. These features make it easier to work with large datasets and perform operations like filtering, aggregation, and transformation in a distributed manner.

Ray includes built-in support for reinforcement learning algorithms and distributed training. It provides a scalable execution environment for training and evaluating machine learning models, enabling efficient experimentation and faster training times.

1. Ray AI Runtime (AIR)

This open-source collection of Python libraries is designed specifically for ML engineers, data scientists, and researchers. It equips them with a unified and scalable toolkit for developing ML applications. The Ray AI Runtime consists of 5 core libraries:

Ray Data

Achieve scalability and flexibility in data loading and transformation across various stages, such as training, tuning, and prediction, regardless of the underlying framework.

Ray Train

Enables distributed model training across multiple nodes and cores, incorporating fault tolerance mechanisms that seamlessly integrate with widely used training libraries.

Ray Tune

Scale your hyperparameter tuning process to enhance model performance, ensuring optimal configurations are discovered.

Ray Serve

Effortlessly deploy models for online inference with Ray's scalable and programmable serving capabilities. Optionally, leverage micro batching to further enhance performance.

Ray RLlib

Seamlessly integrate scalable distributed reinforcement learning workloads with other Ray AIR libraries, enabling efficient execution of reinforcement learning tasks.

references:

https://www.datacamp.com/tutorial/distributed-processing-using-ray-framework-in-python

PyMuPDF - read page by page and extract images

import pymupdf  # PyMuPDF

def navigate_page_by_page_pympdf():
    # navigate the document page by page
    doc = pymupdf.open("deployment_guide.pdf")  # open a document
    out = open("output.txt", "wb")  # create a text output
    for page in doc:  # iterate over the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        print("Text read is ", text)
        # out.write(text)  # write text of page
        # out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()

def extract_images_pympdf():
    doc = pymupdf.open("deployment_guide.pdf")  # open a document

    for page_index in range(len(doc)):  # iterate over pdf pages
        page = doc[page_index]  # get the page
        image_list = page.get_images()

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)

        for image_index, img in enumerate(image_list, start=1):  # enumerate the image list
            xref = img[0]  # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref)  # create a Pixmap

            if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            pix.save("page_%s-image_%s.png" % (page_index, image_index))  # save the image as png
            pix = None

Monday, October 28, 2024

PyMuPDF - How to extract tables

import pymupdf  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Read table content only of all pages in the document.

    Chatbots typically have limitations on the amount of data that
    can be passed in (number of tokens).

    We therefore only extract information on the PDF's pages that are
    contained in tables.
    As we even know that the PDF actually contains ONE logical table
    that has been segmented for reporting purposes, our approach
    is the following:
    * The cell contents of each table row are joined into one string
      separated by ";".
    * If the table segment on the first page also has an external header row,
      join the column names separated by ";". Also ignore any subsequent
      table row that equals the header string. This deals with table
      header repeat situations.
    """
    # open document
    doc = pymupdf.open(pdf_path)

    text = ""       # we will return this string
    row_count = 0   # counts table rows
    header = ""     # overall table header: output this only once!

    # iterate over the pages
    for page in doc:
        # only read the table rows on each page, ignore other content
        tables = page.find_tables()  # a "TableFinder" object
        for table in tables:

            # on first page extract external column names if present
            if page.number == 0 and table.header.external:
                # build the overall table header string
                # technical note: incomplete / complex tables may have
                # "None" in some header cells. Just use empty string then.
                header = (
                    ";".join(
                        [
                            name if name is not None else ""
                            for name in table.header.names
                        ]
                    )
                    + "\n"
                )
                text += header
                row_count += 1  # increase row counter

            # output the table body
            for row in table.extract():  # iterate over the table rows

                # again replace any "None" in cells by an empty string
                row_text = (
                    ";".join([cell if cell is not None else "" for cell in row]) + "\n"
                )
                if row_text != header:  # omit duplicates of header row
                    text += row_text
                    row_count += 1  # increase row counter
    doc.close()  # close document
    print(f"Loaded {row_count} table rows from file '{doc.name}'.\n")
    return text

 references:

https://python.plainenglish.io/why-pymupdf4llm-is-the-best-tool-for-extracting-data-from-pdfs-even-if-you-didnt-know-you-needed-7bff75313691

Saturday, October 26, 2024

What is PyMuPDF4LLM

PyMuPDF4LLM is built on top of the tried-and-tested PyMuPDF and uses that library behind the scenes to achieve the following:


Support for multi-column pages

Support for image and vector graphics extraction (and inclusion of references in the MD text)

Support for page chunking output

Direct support for output as LlamaIndex Documents


Multi-Column Pages

Text extraction can handle document layouts with multiple columns, meaning that “newspaper”-type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.


Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The result is Markdown text with references to any images found in the document. The images are saved to the directory from which you run the Python script, and the Markdown references them with the correct Markdown image syntax.

Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:


import pymupdf4llm

output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


This delivers a list of dictionary objects for each page of the document with the following schema:


metadata — dictionary consisting of the document’s metadata.

toc_items — list of Table of Contents items pointing to the page.

tables — list of tables on this page.

images — list of images on the page.

graphics — list of vector graphics rectangles on the page.

text — page content as Markdown text.
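A minimal sketch of consuming these chunks, using only the keys listed above:

import pymupdf4llm

chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

for i, chunk in enumerate(chunks):
    md_text = chunk["text"]      # page content as Markdown
    tables = chunk["tables"]     # tables detected on the page
    images = chunk["images"]     # images on the page
    print(f"Page {i}: {len(md_text)} Markdown characters, {len(tables)} tables, {len(images)} images")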

PyMuPDF4LLM also has direct support for LlamaIndex:

import pymupdf4llm

llama_reader = pymupdf4llm.LlamaMarkdownReader()

llama_docs = llama_reader.load_data("input.pdf")

Wednesday, October 23, 2024

What is Meta AI interface

Like ChatGPT or Google Gemini, the Meta AI chat interface is powered by Llama 3. However, Meta AI is only available in the US, so if you are in Europe like me, you must use a VPN to connect to meta.ai.

On meta.ai, you can either generate an image or have a chat. Both generative models are free to use.

What is impressive about meta.ai:

Token generation is extremely fast compared to Gemini or ChatGPT, which suggests substantial infrastructure behind the scenes.

Image generation is free, and as I type the prompt, the image is generated in real time based on what I am typing.

References

https://levelup.gitconnected.com/llama-3-metas-latest-ai-breakthrough-offers-power-and-open-access-166ebb94f794

 

Monday, October 21, 2024

ChromaDB Hybrid Search

Chroma's query method performs the document search and supports metadata filters, which makes it effective for hybrid search (vector similarity combined with metadata filtering).

 

import chromadb

# Initialize Chroma
client = chromadb.Client()

# Create a collection with metadata
collection = client.create_collection(name="my_collection")

# Add documents with vectors and metadata
collection.add(
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]],  # Embeddings
    documents=["Document 1", "Document 2"],         # Documents
    metadatas=[{"category": "science", "author": "Alice"},
               {"category": "history", "author": "Bob"}],  # Metadata
    ids=["doc1", "doc2"]
)

# Perform a hybrid search (vector search + metadata filtering)
results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.3]],  # Embedding query vector
    n_results=5,                         # Number of results
    where={"category": "science"}        # Metadata filter
)

# Output results
for result in results["documents"]:
    print(result)

Tuesday, October 15, 2024

What is LIMP re-ranking

 LIMP re-ranking refers to a Latent Interaction Model with Pooling (LIMP) technique used for re-ranking documents or search results. It is a method to improve the ranking of documents retrieved by an initial retrieval model by incorporating interactions between the query and the documents in a more nuanced way.

Here’s a breakdown of LIMP re-ranking:

Key Concepts of LIMP:

Latent Interaction Model:

LIMP focuses on latent interactions between the query and the document. Instead of only relying on pre-encoded representations of documents and queries (as in a traditional bi-encoder model), LIMP allows the model to capture more granular word-to-word interactions between them.

Pooling:


The pooling step in LIMP aggregates information from all latent interactions between the query and the document to compute a relevance score. This pooling mechanism can take multiple forms (e.g., max-pooling, average pooling), and it allows the model to focus on the most relevant parts of the document when determining relevance.

Re-ranking:


Re-ranking is the process of refining the order of documents after an initial retrieval phase. In the context of LIMP, once an initial set of documents is retrieved (usually using a simpler and more scalable retrieval model like a bi-encoder), LIMP is used to re-rank the documents by analyzing the deeper interactions between the query and each document. This step improves the relevance of the top-ranked documents presented to the user.

How LIMP Re-ranking Works:

Initial Retrieval:

The system first retrieves a set of documents using a traditional retrieval method, such as BM25, a bi-encoder, or another retrieval model. These documents may contain some relevant ones, but the ranking might not be optimal.

Interaction Modeling:

LIMP then applies latent interaction modeling, where the words or embeddings of the query and the document are compared directly at various levels (e.g., word-level interactions or higher-level embeddings).

Pooling Mechanism:

The latent interaction scores are aggregated using a pooling mechanism, which captures the most relevant interactions between the query and document content. Pooling could prioritize strong matches (max-pooling) or capture an average similarity across all terms (average-pooling), depending on the implementation.

Re-ranking:

The pooled interaction score is used to re-rank the set of retrieved documents. The new ranking reflects a more detailed and fine-grained relevance scoring compared to the initial retrieval method.
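Setting the paper-specific details aside, a pooling-based late-interaction scorer can be sketched roughly as follows (an illustrative NumPy sketch, not a reference LIMP implementation; it assumes pre-computed, unit-normalized token embeddings):

import numpy as np

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim)
    sim = query_vecs @ doc_vecs.T        # token-to-token similarity matrix
    per_token_best = sim.max(axis=1)     # max-pooling: best document token per query token
    return float(per_token_best.sum())   # aggregate into a single relevance score

def rerank(query_vecs, candidates):
    # candidates: list of (doc_id, doc_vecs) pairs from the initial retrieval stage
    scored = [(doc_id, late_interaction_score(query_vecs, vecs)) for doc_id, vecs in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)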

Benefits of LIMP Re-ranking:

Captures Deeper Query-Document Interactions: Unlike traditional models that may only consider holistic similarity (like cosine similarity between embeddings), LIMP focuses on word-to-word and phrase-level interactions, leading to better ranking precision.


Improved Precision: By refining the initial set of retrieved documents, LIMP can significantly improve the relevance of the top results, making it useful in applications where high accuracy is critical.


Flexible Pooling: The pooling mechanism allows the model to focus on the most important aspects of the query-document relationship, further enhancing the precision of re-ranking.


Comparison with Other Re-ranking Methods:

LIMP vs. Bi-Encoders: A bi-encoder retrieves documents by encoding both the query and document separately and comparing their embeddings. In contrast, LIMP performs more detailed latent interaction modeling, which enables it to capture more nuanced relationships and improve the ranking of the results.


LIMP vs. Cross-Encoders: While cross-encoders also encode the query and document jointly, LIMP explicitly models the interactions at a finer level and uses pooling to summarize them. It offers an intermediate approach between bi-encoders (efficiency) and cross-encoders (precision).


Use Case in RAG (Retrieval-Augmented Generation):

In RAG systems, LIMP can be used for re-ranking retrieved documents to improve the quality of the documents fed into the generator (LLM). After an initial retrieval (e.g., via a bi-encoder), LIMP can re-rank the documents by looking at the finer interactions between the query and each document, ensuring the most relevant documents are presented for further processing or generation.


Conclusion:

LIMP re-ranking is a powerful tool that combines the benefits of interaction modeling and pooling to improve the relevance of search results or retrieved documents. It is especially useful in scenarios where precision is key, and it fits well within larger RAG systems as a re-ranking mechanism after initial retrieval.

What is Cross Encoder and Bi Encoder in RAG

 In the context of Retrieval-Augmented Generation (RAG), cross encoders and bi-encoders are two different methods for encoding query-document pairs to evaluate relevance. They represent two different approaches for measuring similarity between a query and potential documents during retrieval.

1. Bi-Encoder:

What it is: In a bi-encoder architecture, both the query and the documents are encoded independently into vector representations, typically using the same model. These vectors are then compared (e.g., using cosine similarity) to determine relevance.


How it works: The bi-encoder first encodes the query and the document separately into their respective embeddings. The similarity between the query and the document is computed after both have been encoded, without direct interaction between the two during encoding.

Pros:

Efficient retrieval: Since both queries and documents are independently encoded, you can pre-compute and store document embeddings in a vector database, making it fast and scalable for large datasets.

Scalability: Works well for large-scale retrieval tasks where embeddings of many documents are compared to a query.

Cons:

Lower precision: Since the query and document are encoded separately, there is no interaction between them during encoding, which might result in lower relevance compared to a cross-encoder.

Loss of interaction: The model cannot leverage interactions between query and document words, which might miss nuanced relevance.

Use case: Ideal for tasks requiring fast retrieval over large document sets, where embeddings can be precomputed and stored for quick lookup.


2. Cross-Encoder:

What it is: In a cross-encoder architecture, the query and the document are encoded together in a single pass through the model. This allows for cross-attention between the query and document, making the similarity judgment more precise.


How it works: The query and document are concatenated and passed through a model (like a transformer). The model processes them jointly, allowing direct interaction between the two. The model then outputs a relevance score based on the joint encoding.



Pros:


Higher precision: Since the query and document are encoded together, the model can take into account word-level interactions between them, leading to more accurate relevance judgments.

Better understanding of context: By processing both query and document together, the cross-encoder can capture subtle relationships and semantic nuances between the two.

Cons:


Slow for large-scale retrieval: Since every query-document pair needs to be encoded together, it's computationally expensive for large-scale retrieval tasks.

No pre-computation: Unlike bi-encoders, you can't pre-compute the document embeddings, which limits scalability.

Use case: Best suited for re-ranking a small set of documents retrieved by a bi-encoder or other methods. It is typically used in a two-step process where a bi-encoder retrieves a broad set of documents, and a cross-encoder refines the ranking.


Comparison in RAG:

In RAG (Retrieval-Augmented Generation), typically, a bi-encoder is used to perform initial retrieval from a large corpus (due to its efficiency and scalability), followed by a cross-encoder to re-rank or refine the results for better accuracy, especially when precision is critical.


Bi-Encoder: Used for fast, scalable retrieval.

Cross-Encoder: Used for accurate re-ranking of a small subset of documents retrieved by the bi-encoder.

Example Workflow in RAG:

Bi-Encoder Retrieval: The system first uses a bi-encoder to retrieve a broad set of candidate documents that are relevant to the user's query by comparing the query's embedding with precomputed document embeddings.

Cross-Encoder Re-Ranking: Once a smaller subset of relevant documents is retrieved, the system can apply a cross-encoder to re-rank the documents by jointly encoding the query and each document and generating a more precise relevance score.
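A minimal sketch of this two-stage workflow using the sentence-transformers library (the model names and toy corpus are illustrative):

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: bi-encoder retrieval over a small corpus
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Reset the router to factory settings.",
    "The firmware version is shown on the status page.",
    "Use WPA3 to secure the wireless network.",
]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do I secure my wifi?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Stage 2: cross-encoder re-ranking of the retrieved candidates
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (q, doc), score in sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")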

Both methods have their place in RAG-based systems: bi-encoders handle large-scale retrieval, while cross-encoders improve precision for re-ranking and final selection.

Sunday, October 13, 2024

embed_query in langchain's OpenAIEmbeddings

In langchain, the embed_query method of the OpenAIEmbeddings class is used to generate an embedding vector for a query (text input). The idea behind embeddings is to convert text into numerical vectors, which represent semantic meanings and are used for similarity searches, such as when comparing queries with stored documents or other text.

How it works:

Query Embeddings: When you call embed_query, it sends the input query (a piece of text) to the OpenAI API, which then returns a vector representation of that text.

Usage: This embedding is typically used to match queries with stored document embeddings in a vector database to find the most relevant document or answer. It helps in similarity search tasks by comparing how "close" the query vector is to other document vectors.

Example:

from langchain.embeddings import OpenAIEmbeddings

# Initialize OpenAI Embeddings object
openai_embeddings = OpenAIEmbeddings()

# Get embedding for a query (a string of text)
query = "What is the version of this device?"
query_embedding = openai_embeddings.embed_query(query)

# Now, you can use this embedding for similarity searches, etc.

Main Purpose: embed_query is used when you want to search or match a user's query with similar documents stored in a vector database or embedding store.

Not all embedding models in LangChain support the embed_query method directly. The availability of the embed_query method depends on the specific embedding model you are using. Here’s a breakdown of how this works:

1. Models that support embed_query:

OpenAIEmbeddings: OpenAI models, such as text-embedding-ada-002, natively support the embed_query method, which allows you to generate query embeddings for similarity search or document retrieval tasks.

Other Cloud/Managed API Models: Similar to OpenAI, some managed services like Cohere, Hugging Face embeddings, etc., also provide embed_query functionality depending on the model's API.

2. Models that may not support embed_query:

Self-Hosted Models: Some self-hosted or custom models (e.g., using locally trained models or models running on frameworks like transformers or Sentence Transformers) may not have the embed_query method, unless specifically implemented.

Custom Embedding Models: If you are using a custom embedding model or provider, you may need to implement the method yourself if it’s not already included.

3. General Implementation:

The embed_query method is generally a convenience function that converts a query into an embedding. For models that don't provide this directly, you may still be able to call a generic embedding method like embed_documents or embed_text and apply that to queries. It might just not be explicitly named embed_query.

Alternative Methods:

If embed_query isn’t supported, you can usually still use the model’s general embedding method for queries by treating queries like any other document or text.

Example:

query_embedding = model.embed_documents([query]) 

In summary, many embedding models do support embed_query, especially those from major providers like OpenAI, Cohere, etc., but custom, self-hosted, or specialized models may require you to handle the embedding process for queries manually. Always check the specific embedding model’s documentation in Langchain to confirm support.



What is the difference between the Multi Query Retriever, TimeBasedVectorStoreRetriever, and Self Query Retriever in LangChain

 In Langchain, different retrievers serve as mechanisms to extract relevant information from various data sources for LLM-based applications. Here’s a breakdown of the key retrievers you mentioned:


1. Multi Query Retriever:

The Multi Query Retriever allows an LLM to generate multiple variations of a query to improve retrieval results. This helps address scenarios where different wordings of the same query might lead to different but relevant results in a vector store or database.


Purpose: Enhance recall by increasing the chances of retrieving relevant information through multiple reformulated queries.

Process: The retriever generates alternative queries (e.g., rephrases the user's original query) and uses them to search the data store. The combined results from these queries are then ranked and returned.

Use Case: Useful when you want to cover diverse interpretations or wordings of the user's question for more comprehensive results.

Example: When a user asks, "What is the best way to secure a database?", the retriever might generate alternative queries like:


"How to improve database security?"

"Best practices for securing a database?"

"How to safeguard databases?"

This helps in retrieving different but complementary documents or information.
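A minimal sketch with LangChain's MultiQueryRetriever (the toy corpus is illustrative, and the imports assume recent langchain, langchain-openai, and langchain-community releases):

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

texts = [
    "Rotate database credentials regularly.",
    "Enable TLS for all database connections.",
    "Restrict network access with firewall rules.",
]
vector_store = Chroma.from_texts(texts, OpenAIEmbeddings())

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=ChatOpenAI(temperature=0),  # generates the alternative query phrasings
)
docs = retriever.invoke("What is the best way to secure a database?")
for doc in docs:
    print(doc.page_content)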


2. TimeBasedVectorStoreRetriever:

The TimeBasedVectorStoreRetriever is designed for retrieving information based on time relevance from a vector store. In addition to vector similarity search, it factors in the timestamp associated with documents, ensuring that results are time-ordered or time-filtered.


Purpose: To prioritize or filter documents based on their recency or relevance to a specific time range, in addition to vector similarity.

Process: This retriever can either rank results by their timestamp or restrict retrieval to a certain time window, depending on how it's set up.

Use Case: Ideal for applications dealing with time-sensitive information, like news archives, logs, or research articles.

Example: If the user asks, "What were the latest advancements in AI?", this retriever ensures that the most recent articles or documents are prioritized over older content.


3. Self Query Retriever:

The Self Query Retriever is an advanced retriever that uses an LLM to automatically generate structured queries (with filters) for more specific searches based on the user's query.


Purpose: Automatically apply metadata-based filters (e.g., date ranges, categories) to retrieve more targeted results.

Process: It involves the LLM analyzing the user's query to generate a structured query with filter conditions. These filters can be based on attributes like date, author, or document type, enhancing retrieval precision.

Use Case: Useful in situations where the data has rich metadata and users may have implicit requirements. For example, finding "recent research papers on deep learning by a specific author."

Example: If the user query is "Show me articles on machine learning from 2020," the retriever will automatically generate a query that filters for "machine learning" and restricts results to documents from 2020.


Key Differences:

Multi Query Retriever: Focuses on reformulating the query to improve recall, covering multiple possible variations.

TimeBasedVectorStoreRetriever: Prioritizes or filters results by time, useful for retrieving time-sensitive information.

Self Query Retriever: Automatically creates more precise queries with filtering based on metadata.

Each of these retrievers has its own specialized purpose, and the right one depends on the specific data retrieval needs of the application.

Thursday, October 10, 2024

What is the Berkeley Function Calling tool

Berkeley Function Calling Leaderboard (BFCL) is a great resource for comparing how different models perform on function calling tasks. It also provides an evaluation suite to compare your own fine-tuned model on various challenging tool calling tasks. In fact, the latest dataset, BFCL v3, was just released and now includes multi-step, multi-turn function calling, further raising the bar for tool based reasoning tasks.

Both types of reasoning are powerful independently, and when combined, they have the potential to create agents that can effectively break down complicated tasks and autonomously interact with their environment. For more examples of AI agent architectures for reasoning, planning, and tool calling check out my team’s survey paper on ArXiv.



references:

https://gorilla.cs.berkeley.edu/leaderboard.html

Tuesday, October 8, 2024

What is contextual Embedding in Anthropic RAG? - Anthropic

The recent paper from Anthropic introduced the concept of “contextual embeddings”, which solves the problem of missing context by adding relevant context to each chunk before embedding.

You can leverage an LLM for Contextual Retrieval: develop a prompt that instructs the model to generate concise, chunk-specific context based on the overall document, so that each chunk carries its own contextual information.

Consider a query about a specific company’s quarterly revenue growth. A relevant chunk might read: “The company’s revenue grew by 3% over the previous quarter.” It contains the growth percentage but lacks crucial details like the company name or time period, and this absence of context can hinder accurate retrieval.

By sending the overall document to an LLM along with each chunk, we get a contextualized_chunk, for example something like: “This chunk is from the company’s quarterly financial report. The company’s revenue grew by 3% over the previous quarter.”

This ‘contextualized_chunk’ is then sent to an embedding model to create the embedding of the chunk.
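A rough sketch of that preprocessing step with the Anthropic Python SDK (the prompt wording and model name here are illustrative assumptions, not Anthropic's exact template):

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context to situate this chunk within the overall document
for the purposes of improving search retrieval of the chunk. Answer only with the
succinct context and nothing else."""

def contextualize_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative model choice
        max_tokens=200,
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    # prepend the generated context to the original chunk before embedding it
    return response.content[0].text + "\n" + chunk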

Hybrid search approach

While contextual embeddings have been shown to improve on traditional semantic-search RAG, a hybrid approach incorporating BM25 can produce even better results.

The same chunk-specific context can also be used with BM25 search to further improve retrieval performance.

Okapi BM25

BM25 is an algorithm that addresses some drawbacks of TF-IDF concerning term saturation and document length.

Term Saturation and diminishing return

If a document contains 100 occurrences of the term “computer,” is it really twice as relevant as a document that contains 50 occurrences? We want to control the contribution of TF when a term is likely to be saturated. BM25 addresses this by introducing a parameter k1 that controls the shape of the saturation curve. With k1 tuned appropriately, the BM25 score saturates as term frequency increases, meaning that further increases in TF no longer contribute much to the score.
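A quick way to experiment with BM25 scoring is the rank_bm25 package (a small sketch; the corpus and query are illustrative):

from rank_bm25 import BM25Okapi

corpus = [
    "ACME Corp revenue grew by 3% over the previous quarter",
    "The router firmware was updated in Q2",
    "Quarterly revenue growth slowed across the industry",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)  # k1 and b take sensible defaults and can be tuned per corpus

query = "acme quarterly revenue growth".split()
print(bm25.get_scores(query))              # BM25 score for every document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 documents for the query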






References:

https://levelup.gitconnected.com/the-best-rag-technique-yet-anthropics-contextual-retrieval-and-hybrid-search-62320d99004e


Monday, October 7, 2024

What's the differences between Chain of Thought (CoT), ReAct, Prompt Decomposition approaches

Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Prompt Decomposition are all advanced prompting techniques for improving the reasoning capabilities of large language models (LLMs). Each approach differs in how they manage complex tasks and guide the model’s reasoning. Here’s a breakdown of the differences:


1. Chain-of-Thought (CoT):

Purpose: CoT is designed to enhance the reasoning capabilities of an LLM by encouraging it to think step-by-step.

Approach: In CoT, the model is explicitly guided to break down its reasoning process for complex questions or tasks. Instead of jumping to an answer, the model generates intermediate steps or thoughts that lead to the final result.

How It Works: When given a question, the model first generates a "chain of thought" — a logical sequence of steps that helps it arrive at a conclusion.

Use Case: CoT is useful for multi-step problems, arithmetic reasoning, logical deduction, and scenarios where intermediate steps are important for accuracy.

Example: Prompt: "If a train travels 60 miles in 2 hours, how far will it travel in 5 hours at the same speed?" Model's response using CoT:


The train travels 60 miles in 2 hours.

So, its speed is 60 ÷ 2 = 30 miles per hour.

In 5 hours, it will travel 5 × 30 = 150 miles.

2. ReAct (Reasoning + Acting):

Purpose: ReAct combines reasoning (thought process) and actions (interactions with external tools or APIs) to solve tasks that require external input or real-time actions.

Approach: The model alternates between reasoning and acting steps. The reasoning process helps the model figure out what information or action is needed, and the acting step involves interacting with external systems (e.g., querying a database, using a calculator, calling an API). This combination leads to more effective decision-making in tasks that involve dynamic responses or external actions.

How It Works: ReAct prompts the model to first reason about the problem and then take action based on that reasoning. It can repeat this cycle multiple times if needed.

Use Case: ReAct is ideal for interactive tasks like searching a knowledge base, answering questions that involve retrieving data from external sources, or interacting with APIs.

Example: Prompt: "What is the current temperature in New York?" Model's response using ReAct:


First, I need to find the current temperature in New York (Reasoning).

Let me call a weather API to get the temperature (Acting).

The temperature in New York is 72°F (Result).

3. Prompt Decomposition:

Purpose: Prompt Decomposition breaks down a complex task or query into smaller, manageable subtasks or steps. Each subtask is handled separately, and the results are combined to address the original query.

Approach: Instead of giving a single, complex prompt, the task is divided into multiple sub-prompts, where each part of the task is processed independently. The results from these sub-prompts are then aggregated.

How It Works: The original query is split into smaller, more focused prompts, which may be handled by different agents or functions. This modular approach ensures that each part of the query is processed accurately, especially for multi-step or multi-domain tasks.

Use Case: Prompt Decomposition is used for complex tasks that involve multiple steps, specialized sub-tasks, or require integration from multiple sources. It is common in multi-agent systems and workflows that need to be handled in parts.

Example: Prompt: "How do I configure a router, ensure it meets security standards, and monitor network traffic?"


First sub-prompt: "What are the steps to configure a router?"

Second sub-prompt: "What are the security standards for routers?"

Third sub-prompt: "What are the best practices for monitoring network traffic?"
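A minimal sketch of prompt decomposition with the OpenAI Python SDK (the model name, helper function, and aggregation prompt are illustrative assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

sub_prompts = [
    "What are the steps to configure a router?",
    "What are the security standards for routers?",
    "What are the best practices for monitoring network traffic?",
]
partial_answers = [ask(p) for p in sub_prompts]

# aggregate the independent sub-answers into one final response
final_answer = ask("Combine the following answers into a single coherent guide:\n\n"
                   + "\n\n".join(partial_answers))
print(final_answer)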

Key Differences:

Chain-of-Thought (CoT): Focuses on internal reasoning by prompting the model to think in logical steps without external action. It’s ideal for solving reasoning-based problems.

ReAct: Combines reasoning with external actions, where the model alternates between thought processes and interactions with tools or APIs.

Prompt Decomposition: Breaks down complex tasks into smaller, simpler components to handle them individually, often involving multiple steps or agents.

Summary:

CoT is mainly for reasoning step-by-step and is self-contained within the model’s thought process.

ReAct involves reasoning combined with taking external actions (e.g., tool usage or calling APIs).

Prompt Decomposition breaks a problem into multiple smaller tasks, which can be handled independently and in parallel by different agents or processes.

Each approach is useful depending on the complexity and type of task you are dealing with, whether it requires reasoning, external actions, or task breakdown.

references:

OpenAI

How to convert M4a to wav format

The code is at the link below:


https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

On a Mac, the following dependencies need to be installed:

xcode-select --install => to get the command-line tools for the latest Python (I had 3.12)

brew install ffmpeg

pip install pydub

Run the code and it works well, replacing the .m4a file with the .wav version.

A nice utility to convert an .m4a file from a QuickTime audio recording to the .wav format!
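The actual conversion in that gist comes down to pydub's AudioSegment; a minimal sketch (file names are placeholders):

from pydub import AudioSegment  # requires ffmpeg on the PATH

audio = AudioSegment.from_file("recording.m4a", format="m4a")  # load the QuickTime recording
audio.export("recording.wav", format="wav")                    # write it back out as WAV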


References:

https://gist.github.com/arjunsharma97/0ecac61da2937ec52baf61af1aa1b759

What's the maximum token limit or context length for various LLM Models?

The maximum context length (or token limit) for various LLMs depends on the specific model you are using. Here’s a general breakdown for common LLMs and their context lengths:


1. OpenAI GPT Models:

GPT-3.5 (davinci): 4,096 tokens

GPT-4 (8k variant): 8,192 tokens

GPT-4 (32k variant): 32,768 tokens

2. Anthropic Claude:

Claude 1/2: 100k tokens (depends on version, with newer versions supporting larger contexts)

3. LLaMA (Meta):

LLaMA-2 (7B, 13B): 4,096 tokens

LLaMA-2 (70B): 8,192 tokens (some variants may support more)

4. Cohere:

Cohere Command: 4096 tokens

5. Mistral:

Mistral Models: Typically support 8,192 tokens or more depending on the implementation and fine-tuning.

Understanding Token Limits:

Tokens are units of text. A token might be as short as one character or as long as one word. For example, "chatGPT is great!" would be split into 6 tokens (["chat", "G", "PT", " is", " great", "!"]).

When providing context (like cli_retriever) or a prompt (runcli_prompt), the entire length (context + user question) must stay within the token limit. If the combined size exceeds the token limit, the model will truncate the input.

Determining Token Length in LangChain:

To ensure that your context (cli_retriever) and any additional inputs (e.g., runcli_prompt) fit within the LLM's context window, you can estimate token length or use LangChain utilities to split your input text if necessary (e.g., RecursiveCharacterTextSplitter).

So, for your runcli_chain, the maximum size of {"context": cli_retriever, "question": RunnablePassthrough()} depends on the specific LLM you are querying. You would typically set the chain’s limits based on the LLM’s token capacity mentioned above.
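A quick way to estimate how many tokens your context plus question will consume is the tiktoken library (a small sketch; the model name is illustrative):

import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

context = "...retrieved documents go here..."
question = "What is the version of this device?"
print("Prompt will use roughly", count_tokens(context) + count_tokens(question), "tokens")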


references:

OpenAI 

Sunday, October 6, 2024

RecursiveCharacterTextSplitter vs RecursiveJsonSplitter


When preparing JSON documents for storage in a vector database, the RecursiveCharacterTextSplitter from LangChain is an effective tool. It recursively divides text into smaller, contextually meaningful chunks, which is advantageous for maintaining the semantic integrity of your data during retrieval.

Key Features of RecursiveCharacterTextSplitter:

Recursive Splitting: The splitter attempts to divide text using a list of specified characters, such as newline or space, to create chunks that are semantically coherent (LangChain).

Parameter Customization: You can adjust parameters like chunk_size to control the maximum length of each chunk and chunk_overlap to specify the number of overlapping characters between chunks, ensuring flexibility based on your data's requirements.

Alternatives and Their Differences:

While RecursiveCharacterTextSplitter is recommended for generic text, other splitters are available, each suited to specific needs:

CharacterTextSplitter:

Functionality: Splits text based on a single specified character, such as a newline.

Use Case: Suitable when the text can be effectively divided by a specific delimiter.

Limitation: May not handle complex or nested structures as effectively as recursive methods.

RecursiveJsonSplitter:

Functionality: Designed for JSON data, it recursively splits JSON objects into smaller components.

Use Case: Ideal when working with structured JSON documents that require parsing into subcomponents.

Limitation: Tailored for JSON, so not as versatile for other text formats.

Considerations for JSON Documents:

Although RecursiveCharacterTextSplitter is effective for generic text, when dealing with JSON documents, consider the following:

Structure Preservation: Ensure that the splitting process maintains the hierarchical structure of JSON data to prevent loss of contextual relationships.

Semantic Integrity: Retain meaningful groupings within the JSON data to facilitate accurate and efficient retrieval from the vector database.

In summary, RecursiveCharacterTextSplitter is a versatile choice for segmenting JSON documents, especially when preserving context is crucial. However, evaluate the specific requirements of your data and retrieval needs to determine the most suitable splitting strategy.
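A brief sketch of both splitters side by side (chunk sizes and the sample JSON are arbitrary, and the imports assume the langchain-text-splitters package):

from langchain_text_splitters import RecursiveCharacterTextSplitter, RecursiveJsonSplitter

# Generic text: recursively split on paragraphs, then sentences, then words
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text_chunks = text_splitter.split_text("Some long document text ... " * 100)

# Structured JSON: split while keeping nested objects intact where possible
json_data = {"device": {"name": "router-1", "interfaces": [{"id": i} for i in range(50)]}}
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data=json_data)

print(len(text_chunks), "text chunks,", len(json_chunks), "json chunks")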


What are various text splitters and their differences?

In Langchain and other text-processing frameworks for Large Language Models (LLMs), different text-splitting strategies are available to handle large documents effectively and avoid exceeding token limits. Here are some options for text splitting and how they differ:

1. RecursiveCharacterTextSplitter

How it works: This splitter recursively breaks down large documents into smaller chunks using a hierarchy of delimiters like paragraphs, sentences, and words.

Pros:

Preserves logical structure (e.g., paragraphs).

Handles large chunks while retaining context.

Cons:

Slower for very large documents due to recursion.

Key method:

RecursiveCharacterTextSplitter.from_tiktoken_encoder: Uses token-based splitting (specific to token encoding methods like OpenAI’s tiktoken).

2. CharacterTextSplitter

How it works: This simpler splitter divides text based on a single character or set of characters (like space, newline).

Pros:

Fast and efficient for straightforward splitting.

Cons:

Can split content mid-sentence or mid-paragraph, which might disrupt context.

Key method:


Splits based on specific delimiters like newline or any character of choice.

Use Case: Fast processing, where document structure is not as important.


3. N-gram Text Splitters

How it works: Splits documents into smaller chunks based on N-grams (groups of n words or tokens).

Pros:

Good for splitting while maintaining word-level granularity.

Cons:

Might lose structural integrity (i.e., paragraphs, sentences).

Use Case: Ideal for small token batches, especially for keyword-based searches or training LLMs that need sliding window approaches.


4. SentenceTextSplitter

How it works: Splits text based on sentence boundaries using natural language processing (NLP) techniques.

Pros:

Maintains logical sentence-level coherence.

Great for applications where each sentence matters (e.g., summarization, question-answering).

Cons:

Might generate larger chunks, leading to token overflow if not used in combination with other splitters.

Use Case: Tasks that require sentence-level processing, like summarization.


5. TokenTextSplitter

How it works: Splits based on tokens (e.g., word tokens or byte-pair encodings) instead of characters.

Pros:

Precise control over token limits, perfect for LLMs like GPT that have token-based inputs.

Cons:

May cut off mid-word or mid-sentence if token boundaries don’t align with logical text structures.

Use Case: Managing token limits when interacting with models that operate on tokens.


6. Document Splitters (Custom)

How it works: Custom splitters that use domain-specific knowledge (e.g., splitting based on document sections or code blocks).

Pros:

Allows customization based on specific document formats (e.g., splitting code by function).

Cons:

Requires custom logic, making implementation more complex.

Use Case: Domain-specific documents like programming code, academic papers, or medical reports.


Differences Between These Methods:

Granularity: Some splitters (like SentenceTextSplitter) work on a higher granularity (sentences), while others (TokenTextSplitter, CharacterTextSplitter) can split at lower levels (characters, tokens).

Structure Preservation: Recursive splitters preserve the document structure better, while simpler methods (like CharacterTextSplitter) may sacrifice coherence for speed.

Speed: Simpler splitters like CharacterTextSplitter are faster but can disrupt text flow. Recursive methods are slower but more robust for maintaining context.

Use Cases: RecursiveCharacterTextSplitter is better for documents requiring context retention, while TokenTextSplitter is more useful for precise control over token consumption.

Depending on your task, you can choose a splitter that balances speed, context preservation, and granularity of chunks.


PostgresDB Vector Search - Adding and retrieving


import os
import json
import time

# imports assume the langchain-huggingface and langchain-postgres packages
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_postgres import PGVector

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

vector_store = PGVector(
    embeddings=embedding_function,
    collection_name=os.environ["COLLECTION_NAME"],
    connection=os.environ["DB_CONNECTION_STRING"],
    use_jsonb=True,
)


def add_to_db_in_batches(batch_size=100):
    # `data` is a list of Documents loaded elsewhere; `read_collection_ids`
    # returns the ids already present in the collection (both defined elsewhere)
    existing_ids = read_collection_ids()
    data_ids = [str(json.loads(item.page_content)["id"]) for item in data]
    new_ids = list(set(data_ids) - set(existing_ids))

    if len(new_ids) > 0:
        new_documents = [item for item in data
                         if str(json.loads(item.page_content)["id"]) in new_ids]

        total_products = len(new_documents)
        start_time = time.time()  # Start the timer

        for i in range(0, total_products, batch_size):
            batch_data = new_documents[i:i + batch_size]
            ids = [json.loads(item.page_content)["id"] for item in batch_data]
            vector_store.add_documents(batch_data, ids=ids)
            remaining = total_products - (i + len(batch_data))

            elapsed_time = time.time() - start_time
            batches_processed = (i // batch_size) + 1
            average_time_per_batch = elapsed_time / batches_processed if batches_processed > 0 else 0
            estimated_remaining_batches = (total_products // batch_size) - batches_processed
            estimated_remaining_time = average_time_per_batch * estimated_remaining_batches

            # Format estimated remaining time
            estimated_remaining_time_minutes = estimated_remaining_time // 60
            estimated_remaining_time_seconds = estimated_remaining_time % 60

            print(f'Added products {i + 1} to {min(i + len(batch_data), total_products)} to the database. '
                  f'Remaining: {remaining}. Estimated remaining time: {int(estimated_remaining_time_minutes)} minutes and {int(estimated_remaining_time_seconds)} seconds.')
    else:
        pass




Now, to search it, the following can be done:


import json
from typing import Annotated
from fastapi import Query
from pydantic import BaseModel, Field
from .setup import vector_store


class SearchParams(BaseModel):
    query: str = Field(..., max_length=150)
    k: int = Field(5, ge=5, le=1000)


def get_search_results(params: Annotated[SearchParams, Query()]):
    results = vector_store.similarity_search(
        query=params.query,
        k=params.k
    )
    response = [json.loads(result.page_content) for result in results]
    return response

Thursday, October 3, 2024

What are prebuilt ReAct agents and how to use memory in them

from langgraph.prebuilt import create_react_agent

prebuilt_doc_agent = create_react_agent(model, [execute_sql],
    state_modifier=system_prompt)


from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")

prebuilt_doc_agent = create_react_agent(model, [execute_sql],
    checkpointer=memory)


class SQLAgent:
    def __init__(self, model, tools, system_prompt=""):
        <...>
        self.graph = graph.compile(checkpointer=memory)
        <...>


# defining thread
thread = {"configurable": {"thread_id": "1"}}
messages = [HumanMessage(content="What info do we have in ecommerce_db.users table?")]

for event in prebuilt_doc_agent.stream({"messages": messages}, thread):
    for v in event.values():
        v['messages'][-1].pretty_print()


followup_messages = [HumanMessage(content="I would like to know the column names and types. Maybe you could look it up in database using describe.")]

for event in prebuilt_doc_agent.stream({"messages": followup_messages}, thread):
    for v in event.values():
        v['messages'][-1].pretty_print()


new_thread = {"configurable": {"thread_id": "42"}}
followup_messages = [HumanMessage(content="I would like to know the column names and types. Maybe you could look it up in database using describe.")]

for event in prebuilt_doc_agent.stream({"messages": followup_messages}, new_thread):
    for v in event.values():
        v['messages'][-1].pretty_print()



references:

https://towardsdatascience.com/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787




What are different types of LLM routers?

LLM Completion Routers

LLM Function Calling Routers

Semantic Routers

Zero Shot Classification Routers

Language Classification Routers


Below is an example of LLM Completion Router

===========================================


from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Set up the LLM Chain to return a single word based on the query,
# and based on a list of words we provide to it in the prompt template
llm_completion_select_route_chain = (
    PromptTemplate.from_template("""
Given the user question below, classify it as either
being about `LangChain`, `Anthropic`, or `Other`.

Do not respond with more than one word.

<question>
{question}
</question>

Classification:"""
                                 )
    | ChatAnthropic(model_name="claude-3-haiku")
    | StrOutputParser()
)


# We set up an IF/Else condition to route the query to the correct chain
# based on the LLM completion call above
def route_to_chain(route_name):
    if "anthropic" == route_name.lower():
        return anthropic_chain
    elif "langchain" == route_name.lower():
        return langchain_chain
    else:
        return general_chain

...

# Later on in the application, we can use the response from the LLM
# completion chain to control (i.e. route) the flow of the application
# to the correct chain via the route_to_chain method we created
route_name = llm_completion_select_route_chain.invoke(user_query)
chain = route_to_chain(route_name)
chain.invoke(user_query)



LLM Function Calling Router / Pydantic Router

=============================================


This leverages the function-calling ability of LLMs to pick a route to traverse. The different routes are set up as functions with appropriate descriptions in the LLM Function Call. Then, based on the query passed to the LLM, it is able to return the correct function (i.e route), for us to take.
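A minimal sketch of this router style using a Pydantic schema and structured output (the RouteQuery schema, model name, and langchain-openai usage are illustrative assumptions, not code from the referenced article):

from typing import Literal
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class RouteQuery(BaseModel):
    """Route a user question to the most relevant chain."""
    destination: Literal["langchain", "anthropic", "other"] = Field(
        ..., description="Which topic the question is about."
    )

# the LLM is asked to "call" the RouteQuery schema, which yields the chosen route
router_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RouteQuery)

route = router_llm.invoke("How do I create a retriever in LangChain?")
print(route.destination)  # -> "langchain"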



Semantic Router

This router type leverages embeddings and similarity searches to select the best route to traverse.


Each route has a set of example queries associated with it, that become embedded and stored as vectors. The incoming query gets embedded also, and a similarity search is done against the other sample queries from the router. The route which belongs to the query with the closest match gets selected.


from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder

# we could use this as a guide for our chatbot to avoid political
# conversations
politics = Route(
    name="politics",
    utterances=[
        "isn't politics the best thing ever",
        "why don't you tell me about your political opinions",
        "don't you just love the president",
        "they're going to destroy this country!",
        "they will save the country!",
    ],
)

# this could be used as an indicator to our chatbot to switch to a more
# conversational prompt
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
        "the weather is horrendous",
        "let's go to the chippy",
    ],
)

# we place both of our decisions together into a single list
routes = [politics, chitchat]
encoder = OpenAIEncoder()

from semantic_router.layer import RouteLayer

route_layer = RouteLayer(encoder=encoder, routes=routes)
route_layer("don't you love politics?").name
# -> 'politics'


Zero Shot Classification Router

==============================

“Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes”. These routers leverage a Zero-Shot Classification model to assign a label to a piece of text, from a predefined set of labels you pass in to the router.





references:

https://towardsdatascience.com/routing-in-rag-driven-applications-a685460a7220


What is zero-shot, one-shot, and few-shot prompting

 Zero-shot, one-shot, and few-shot prompting refer to different techniques used to guide language models (like GPT) to perform tasks by providing varying amounts of examples in the prompt. These techniques are critical in natural language processing (NLP) as they dictate how much context or task-related information is provided to the model. Here's a breakdown of each:


1. Zero-shot Prompting

In zero-shot prompting, the model is asked to perform a task without being given any example in the prompt. The model is expected to understand and generate a response based solely on the task description.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of this sentence: "I love this product."

The model is directly asked to classify sentiment without any prior examples.

Useful when the model has already been pre-trained on similar tasks and can infer the task from context.

2. One-shot Prompting

In one-shot prompting, you provide the model with one example of how the task should be performed. This single example serves as a guide for the model to understand the expected format of the response.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of this sentence: "I hate this service." Answer: Negative.

Now classify the sentiment of this sentence: "I love this product."

The model is provided with one example (I hate this service classified as Negative) to understand the task before being asked to classify a new sentence.

3. Few-shot Prompting

In few-shot prompting, you provide the model with a few examples (typically 2-5) to guide it in understanding how to perform the task. These examples help the model generate responses that match the desired output pattern.


Example:

Task: Classify the sentiment of a sentence.


Prompt:



Classify the sentiment of these sentences:

1. "I hate this service." Answer: Negative.

2. "This is the worst experience ever." Answer: Negative.

3. "I love this product." Answer: Positive.

Now classify the sentiment of this sentence: "The food was amazing."

The model is given three examples, showing both positive and negative classifications, before being asked to classify the final sentence.

When to Use Each Technique

Zero-shot: Ideal when you want the model to generalize based on prior training without providing specific examples (suitable for simple tasks where the model has context).

One-shot: Useful when the task may not be as straightforward, but a single example is enough for the model to catch on.

Few-shot: Best for more complex tasks where the model needs multiple examples to understand the nuances of how the task should be performed.



references:

OpenAI 

What is Zero-Shot Classification

Zero Shot Classification is the task of predicting a class that wasn't seen by the model during training. This method, which leverages a pre-trained language model, can be thought of as an instance of transfer learning which generally refers to using a model trained for one task in a different application than what it was originally trained for. This is particularly useful for situations where the amount of labeled data is small.

In zero shot classification, we provide the model with a prompt and a sequence of text that describes what we want our model to do, in natural language. Zero-shot classification excludes any examples of the desired task being completed. This differs from single or few-shot classification, as these tasks include a single or a few examples of the selected task.

Zero, single and few-shot classification seem to be an emergent feature of large language models. This feature seems to come about around model sizes of +100M parameters. The effectiveness of a model at a zero, single or few-shot task seems to scale with model size, meaning that larger models (models with more trainable parameters or layers) generally do better at this task.

Here is an example of a zero-shot prompt for classifying the sentiment of a sequence of text:

Classify the following input text into one of the following three categories: [positive, negative, neutral]

Input Text: Hugging Face is awesome for making all of these

state of the art models available!

Sentiment: positive

One great example of this task with a nice off-the-shelf model is available in the widget on Hugging Face's zero-shot classification task page, where the user can input a sequence of text and candidate labels to the model. This is a word-level example of zero-shot classification; more elaborate and lengthy generations are available with larger models. Testing these models out and getting a feel for prompt engineering is the best way to learn how to use them.

from transformers import pipeline

pipe = pipeline(model="facebook/bart-large-mnli")
pipe("I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

# output
{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]}


Wednesday, October 2, 2024

What is Semantic Routing in RAG

 Each route has a set of example queries associated with it, that become embedded and stored as vectors. The incoming query gets embedded also, and a similarity search is done against the other sample queries from the router. The route which belongs to the query with the closest match gets selected.

There is in fact a Python package called semantic-router that does just this. Let's look at some implementation details to get a better idea of how the whole thing works. These examples come straight from that library's GitHub page.


from semantic_router import Route

# we could use this as a guide for our chatbot to avoid political

# conversations

politics = Route(

    name="politics",

    utterances=[

        "isn't politics the best thing ever",

        "why don't you tell me about your political opinions",

        "don't you just love the president",

        "they're going to destroy this country!",

        "they will save the country!",

    ],

)

# this could be used as an indicator to our chatbot to switch to a more

# conversational prompt

chitchat = Route(

    name="chitchat",

    utterances=[

        "how's the weather today?",

        "how are things going?",

        "lovely weather today",

        "the weather is horrendous",

        "let's go to the chippy",

    ],

)

# we place both of our decisions together into single list

routes = [politics, chitchat]


from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder()

from semantic_router.layer import RouteLayer

route_layer = RouteLayer(encoder=encoder, routes=routes)

route_layer("don't you love politics?").name

references:

https://towardsdatascience.com/routing-in-rag-driven-applications-a685460a7220


Some notes on Mistral 7B

Mistral 7B is a 7-billion-parameter model released by Mistral AI and distributed under the Apache 2.0 license. It is available in both instruct (instruction-following) and text-completion variants.

The Mistral AI team has noted that Mistral 7B:

Outperforms Llama 2 13B on all benchmarks

Outperforms Llama 1 34B on many benchmarks

Approaches CodeLlama 7B performance on code, while remaining good at English tasks

Mistral 0.3 supports function calling with Ollama’s raw mode.

Example raw prompt

[AVAILABLE_TOOLS] [{"type": "function", "function": {"name": "get_current_weather", "description": "Get the current weather", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location."}}, "required": ["location", "format"]}}}][/AVAILABLE_TOOLS][INST] What is the weather like today in San Francisco [/INST]

Example response

[TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "San Francisco, CA", "format": "celsius"}}]
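As a hedged sketch (not taken from the Ollama docs themselves), the raw prompt above can be sent through Ollama's /api/generate endpoint with raw mode enabled. The tool schema is the one from the example prompt; a local Ollama server on the default port and the requests package are assumptions.

import json
import requests

# Tool schema taken from the example raw prompt above
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"],
                           "description": "The temperature unit to use. Infer this from the users location."},
            },
            "required": ["location", "format"],
        },
    },
}]

raw_prompt = (
    f"[AVAILABLE_TOOLS] {json.dumps(tools)}[/AVAILABLE_TOOLS]"
    "[INST] What is the weather like today in San Francisco [/INST]"
)

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": raw_prompt,
        "raw": True,      # raw mode: bypass Ollama's prompt template so the control tokens pass through
        "stream": False,  # return a single JSON object rather than a token stream
    },
)
print(response.json()["response"])  # expected to contain a [TOOL_CALLS] block like the one above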

Variations

instruct: Instruct models follow instructions.

text: Text models are the base foundation model without any fine-tuning for conversations, and are best used for simple text completion.

Usage

CLI

Instruct:

ollama run mistral

API

Example:

curl -X POST http://localhost:11434/api/generate -d '{

  "model": "mistral",

  "prompt":"Here is a story about llamas eating grass"

 }'

To run Mistral locally: ollama run mistral

References:

https://ollama.com/library/mistral?ref=maginative.com


Tuesday, October 1, 2024

AI way of extracting text from a PDF?

import asyncio
import base64
from io import BytesIO

import pypdfium2 as pdfium


async def document_analysis(filename: str) -> list:

    """

    Document Understanding

    Args:

        filename: pdf filename str

    """


    pdf = pdfium.PdfDocument(filename)

    images = []

    print("Retrieved PDF ",len(pdf))

    for i in range(len(pdf)):

        print("Iter count ", i)

        page = pdf[i]

        print("Got the page ", page)

        image = page.render(scale=8).to_pil()

        buffered = BytesIO()

        image.save(buffered, format="JPEG")

        img_byte = buffered.getvalue()

        img_base64 = base64.b64encode(img_byte).decode("utf-8")

        images.append(img_base64)


    text_of_pages = await asyncio.gather(*[parse_page_with_gpt(image) for image in images])

    print("Text of pages got")

    # asyncio.gather preserves input order, so the extracted texts are already in page order

    results = list(text_of_pages)

    return results




async def parse_page_with_gpt(base64_image: str) -> str:

    messages=[

        {

            "role": "system",

            "content": """

            

            You are a helpful assistant that extracts information from images.

            

            """

        },

        {

            "role": "user",

            "content": [

                {"type": "text", "text": "Extract information from image into text"},

                {

                    "type": "image_url",

                    "image_url": {

                        "url": f"data:image/jpeg;base64,{base64_image}",

                        "detail": "auto"

                    },

                },

            ],

        }

    ]

    response = await clienta.chat.completions.create(

        model=MODEL,

        messages=messages,

        temperature=0,

        max_tokens=4096,

    )

    return response.choices[0].message.content or ""
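The two coroutines above reference clienta and MODEL without defining them. As a hedged sketch, a driver placed in the same module might define those globals (an async OpenAI client and a vision-capable model name, both assumptions) and run the pipeline like this:

from openai import AsyncOpenAI

clienta = AsyncOpenAI()   # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"     # assumption: any vision-capable chat model

if __name__ == "__main__":
    # document_analysis returns one extracted-text string per page
    pages = asyncio.run(document_analysis("sample.pdf"))  # placeholder filename
    for page_number, page_text in enumerate(pages, start=1):
        print(f"--- page {page_number} ---")
        print(page_text)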


What is UPF and SMF in 5G Core?

UPF (User Plane Function) and SMF (Session Management Function) are key components of the 5G core network architecture, as defined in the 3GPP (3rd Generation Partnership Project) standards. Both are part of the Service-Based Architecture (SBA) used in modern mobile networks. Here's what each of these network functions does:

1. UPF (User Plane Function):

Role: UPF is responsible for managing the user plane, which deals with data transmission between the RAN (Radio Access Network) and external data networks (like the Internet).

Functions:

Routing and forwarding of data packets.

Handling traffic policies, Quality of Service (QoS), and load balancing.

Packet inspection and filtering.

Lawful interception for data traffic.

Data usage monitoring (for charging purposes).

Significance: It decouples the user plane from the control plane, allowing better scalability and flexibility, which is key for 5G's promise of low latency and high throughput.

2. SMF (Session Management Function):

Role: SMF manages the control plane part of the user session, specifically handling the setup, modification, and teardown of sessions that route user data.

Functions:

Session establishment, modification, and release.

IP address allocation and management.

Interaction with Policy Control Function (PCF) for implementing network policies.

Management of UPF connections (e.g., determining how the UPF routes user traffic).

Mobility management, maintaining session continuity as users move across different network zones.

Significance: It ensures seamless service and manages user sessions across the network, optimizing performance and connectivity.

In Summary:

UPF handles the actual data traffic (user data) flowing through the network, dealing with packet forwarding, routing, and traffic policies.

SMF is responsible for managing the sessions that determine how user data is routed, including the control over the UPF.

Together, these functions are essential for enabling the high-speed, low-latency capabilities of 5G networks.


References:

OpenAI


What is Ferret-UI

Ferret-UI is a model designed to understand user interactions with a mobile screen.



Mobile UI Understanding

Hence the paradigm shift is from natural language understanding to mobile UI understanding: from understanding conversational context to understanding the current context on a mobile screen.


Visual Understanding




NLU aims to enable machines to comprehend human language and respond appropriately. It focuses on structuring unstructured conversational input to understand the user’s meaning.

This structuring is essential for making sense of human conversation across various mediums, ensuring a Conversational UI can effectively process unstructured data.

The paradigm shift is from natural language understanding to mobile UI understanding, moving from comprehending conversational context to understanding the current context on a mobile screen.

Moving from only understanding conversations, to understanding screens.

Gleaning context from screens and user interactions, as opposed to conversations only.

Ferret-UI can be considered as a RAG implementation where augmentation is not performed via retrieved documents, but rather retrieved screens.

Conversations are unstructured data, and part of a Conversational UI's job is to create structure around this unstructured data. In a similar fashion, Ferret-UI creates structure around what is displayed on the screen.

Ferret-UI adds language to what is mapped on the screen, allowing context-rich, accurate and multi-turn conversations. This is a significant step up from the “single dialog-turn command and control” scenario.

Not only does Ferret-UI add a language layer to devices, but other functionality can be added as well, such as task orchestration based on user behaviour, anticipating the next interaction, user guidance, and more.


 

references:

https://cobusgreyling.medium.com/moving-from-natural-language-understanding-to-mobile-ui-understanding-18cd775c11b3

What are various types of Ollama models

 

Key Features of Ollama

Easy to Use & User-Friendly Interface: Quickly download and use open-source LLMs with a straightforward setup process.

Versatile: Supports a wide variety of models, including those for text, images, audio, and multimodal tasks.

Cross-Platform Compatibility: Available on macOS, Windows, and Linux.

Offline Models: Operate large language models without needing a continuous internet connection (see the local-inference sketch at the end of this section).

High Performance: Built on top of llama.cpp, which offers state-of-the-art performance on a wide variety of hardware, both locally and in the cloud. It efficiently utilizes the available resources, such as your GPU or MPS on Apple silicon.

Cost-Effective: Save on compute costs by executing models locally.

Privacy: Local processing ensures your data remains secure and private.

Limitations

While Ollama offers numerous benefits, it’s important to be aware of its limitations:

Inference Only: Ollama is designed solely for model inference. For training or fine-tuning models, you will need to use tools like Hugging Face, TensorFlow, or PyTorch.

Setup and Advanced Functionalities: For detailed configuration of model inference, or for training, other libraries such as Hugging Face and PyTorch are necessary.

Performance: Although Ollama is built on llama.cpp, it may still be slower than calling llama.cpp directly.
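To make the local, offline usage above concrete, here is a minimal sketch of chatting with a locally pulled model through Ollama's HTTP API; the model name is an example, and a local Ollama server on the default port plus the requests package are assumptions.

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral",  # any model previously pulled with `ollama pull`
        "messages": [{"role": "user", "content": "Summarize what Ollama does in one sentence."}],
        "stream": False,     # return one JSON object instead of a token stream
    },
)
print(response.json()["message"]["content"])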

references:

OpenAI