Friday, April 26, 2024

Langchain: How to create a custom PromptTemplate for RAG

Below is an example. 

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Assumes `retriever`, `format_docs`, and `llm` are already defined,
# as in the RAG quickstart post below.


template = """Use the following pieces of context to answer the question at the end.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Use three sentences maximum and keep the answer as concise as possible.

Always say "thanks for asking!" at the end of the answer.


{context}


Question: {question}


Helpful Answer:"""

custom_rag_prompt = PromptTemplate.from_template(template)


rag_chain = (

    {"context": retriever | format_docs, "question": RunnablePassthrough()}

    | custom_rag_prompt

    | llm

    | StrOutputParser()

)


rag_chain.invoke("What is Task Decomposition?")


references:

https://python.langchain.com/docs/use_cases/question_answering/quickstart/


Thursday, April 25, 2024

How does bs4.SoupStrainer work with WebBaseLoader

In the Langchain framework, bs4.SoupStrainer and WebBaseLoader work together to streamline the process of loading and parsing HTML content for specific elements. Here's a breakdown of their roles:

bs4.SoupStrainer:

Purpose: This class from the Beautiful Soup library (bs4) acts as a filter during the HTML parsing process. It allows you to specify which parts of the HTML you want to focus on, improving efficiency and reducing the amount of data processed.

Functionality: You can create a SoupStrainer object, defining the tags or attributes you're interested in. When used with a parser like BeautifulSoup, it ensures that only those matching elements are parsed and stored in the resulting soup object.

WebBaseLoader:

Purpose: This is a component within Langchain designed for loading web documents. It provides a convenient way to fetch HTML content from URLs and handle potential errors during the retrieval process.

Using SoupStrainer with WebBaseLoader: When you instantiate a WebBaseLoader object, you can optionally pass a bs_kwargs dictionary with configuration options. One of these options is parse_only, which accepts a SoupStrainer instance.

Example:


from bs4 import SoupStrainer

from langchain_community.document_loaders import WebBaseLoader

# Define a SoupStrainer to only keep the body element

only_body = SoupStrainer('body')

# Create a WebBaseLoader with the SoupStrainer

loader = WebBaseLoader(['https://example.com'], bs_kwargs={'parse_only': only_body})

# Load the documents

documents = loader.load()

# The documents list will contain Document objects whose page_content includes only the parsed <body> element


In this example, the only_body SoupStrainer instructs the parsing process to focus solely on the <body> element of the HTML content fetched from the specified URL. This reduces the amount of data processed, and the resulting documents contain only the content within the <body> tags.

Benefits of using bs4.SoupStrainer with WebBaseLoader:

Improved Efficiency: By filtering out irrelevant parts of the HTML, you can significantly improve parsing performance, especially for large or complex web pages.

Reduced Memory Usage: Only the essential elements are stored in the soup object, minimizing memory consumption during processing.

Targeted Processing: If you're only interested in specific sections of the HTML (e.g., article content, product listings), using SoupStrainer helps you focus on that data directly, simplifying subsequent processing steps.

In summary, bs4.SoupStrainer acts as a filter during parsing, and WebBaseLoader allows you to leverage this filtering functionality when loading web documents using Langchain. This combination helps you streamline web content processing and focus on the specific elements you need for your application.

What is the functionality of rlm/rag-prompt

In Retrieval-Augmented Generation (RAG) tasks, rlm/rag-prompt is a prompt specifically designed for use with the LangChain framework. It serves the purpose of guiding a Large Language Model (LLM) during question answering or similar tasks that leverage retrieved information.

Here's a breakdown of its functionality:

Functionality:

Context and Question Integration: rlm/rag-prompt incorporates both the retrieved context (relevant information for the task) and the user's question seamlessly. It structures the prompt in a way that effectively conveys both elements to the LLM.

Focus on Answer Brevity: This prompt is designed to encourage the LLM to provide concise and informative answers, typically aiming for a maximum of three sentences. This helps with readability and avoids overly verbose responses.

Knowledge Base Reference: While implementation details vary, the context passed into rlm/rag-prompt is typically retrieved from a knowledge base or corpus of documents during the retrieval stage. This retrieved context is then used to answer the question.

Benefits:

Improved Answer Quality: By providing context and focusing on brevity, rlm/rag-prompt can lead to more accurate and succinct answers compared to generic prompts that lack context or guidance on answer length.

Enhanced Reusability: This prompt template is generally reusable across various question answering tasks within the LangChain framework, simplifying development and promoting consistency.

Here's an illustrative example (assuming the retrieved context is about different types of birds):

rlm/rag-prompt

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

**Retrieved Context:**

* Birds are warm-blooded vertebrates with feathers.

* They lay eggs and have wings.

* There are over 10,000 different bird species in the world.

**Question:** What are some characteristics of birds?

In this example, the rlm/rag-prompt incorporates the retrieved context about birds and presents the question. The LLM, guided by this prompt, would ideally respond with something like:

Birds are warm-blooded animals with feathers. They lay eggs and come in a vast variety, with over 10,000 known species.

In summary, rlm/rag-prompt is a valuable tool within the LangChain framework for guiding LLMs in question answering tasks, promoting context-aware, concise, and informative responses.
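
For reference, this prompt can be pulled directly from the LangChain hub and formatted with placeholder values to inspect the final message sent to the LLM (a minimal sketch; the same pattern appears in the quickstart post below):

from langchain import hub

# Pull the community-shared rlm/rag-prompt by its handle
prompt = hub.pull("rlm/rag-prompt")

# Fill in placeholder values to see the message the LLM would receive
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

print(example_messages[0].content)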


What are the main steps involved in creating a RAG application

Indexing

Load: First we need to load our data. We’ll use DocumentLoaders for this.

Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.

Store: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.


Retrieval and generation

Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.


In this case, we load a document from the web and perform question answering (QnA) over it.


Indexing in detail:


In Langchain, documents can be loaded in many ways depending on the source, for example with TextLoader or WebBaseLoader; there are around 160 document loader integrations available.


import bs4

from langchain_community.document_loaders import WebBaseLoader


# Only keep post title, headers, and content from the full HTML.

bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

loader = WebBaseLoader(

    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),

    bs_kwargs={"parse_only": bs4_strainer},

)

docs = loader.load()


Indexing: Split


The loaded blog post is over 42k characters long, which is too large to fit into the context window of most LLMs. To handle this we'll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.


In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
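
A minimal sketch of that split step, following the quickstart (the resulting all_splits list is what gets embedded and stored below):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the loaded docs into 1000-character chunks with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)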



Indexing: Store

Now we need to index our 66 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity — we measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors).
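
As a quick illustration of cosine similarity, here is a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_embedding = np.array([0.1, 0.3, 0.5])
chunk_embedding = np.array([0.2, 0.25, 0.55])
print(cosine_similarity(query_embedding, chunk_embedding))  # close to 1.0, i.e. very similar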


We can embed and store all of our document splits in a single command using the Chroma vector store and OpenAIEmbeddings model.


from langchain_chroma import Chroma

from langchain_openai import OpenAIEmbeddings


vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())


Embeddings: Wrapper around a text embedding model, used for converting text to embeddings.
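
For example, the same Embeddings wrapper can be used on its own to turn a piece of text into a vector (a minimal sketch, assuming an OpenAI API key is configured):

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is Task Decomposition?")
len(vector)  # dimensionality of the embedding vector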


This completes the Indexing portion of the pipeline. At this point we have a query-able vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question.



Retrieval and Generation: Retrieve

We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.


The most common type of Retriever is the VectorStoreRetriever, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any VectorStore can easily be turned into a Retriever with VectorStore.as_retriever():



retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

len(retrieved_docs)

print(retrieved_docs[0].page_content)



MultiQueryRetriever generates variants of the input question to improve retrieval hit rate (a minimal sketch follows this list).

MultiVectorRetriever instead generates variants of the embeddings, also in order to improve retrieval hit rate.

Max marginal relevance selects for relevance and diversity among the retrieved documents to avoid passing in duplicate context.

Documents can be filtered during vector store retrieval using metadata filters, such as with a Self Query Retriever.
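
Here is the sketch of the MultiQueryRetriever mentioned above (assuming the vectorstore defined earlier and an llm such as a ChatOpenAI instance; the other retrievers follow a similar pattern):

from langchain.retrievers.multi_query import MultiQueryRetriever

# Uses the LLM to generate several rephrasings of the question,
# retrieves documents for each, and returns the unique union
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)
docs_for_question = multi_query_retriever.invoke("What are the approaches to Task Decomposition?")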


from langchain import hub

prompt = hub.pull("rlm/rag-prompt")


Now the prompt can be formatted like this:


example_messages = prompt.invoke(

    {"context": "filler context", "question": "filler question"}

).to_messages()

example_messages


print(example_messages[0].content)


This will print something like the following:


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: filler question

Context: filler context

Answer:



Now the RAG chain can be built like this:


from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables import RunnablePassthrough



def format_docs(docs):

    return "\n\n".join(doc.page_content for doc in docs)
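
Note that the chain also needs an llm; the quickstart builds one from ChatOpenAI (a minimal sketch, assuming an OpenAI API key is set in the environment):

from langchain_openai import ChatOpenAI

# Chat model used as the generation step of the RAG chain
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)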


rag_chain = (

    {"context": retriever | format_docs, "question": RunnablePassthrough()}

    | prompt

    | llm

    | StrOutputParser()

)


Now the RAG chain can be streamed like this:


for chunk in rag_chain.stream("What is Task Decomposition?"):

    print(chunk, end="", flush=True)



References:

https://python.langchain.com/docs/use_cases/question_answering/quickstart/


Wednesday, April 24, 2024

What is Langchain Hub

Langchain Hub is a central repository for sharing and discovering components used in building applications with the Langchain framework. Here's a breakdown of its key features and functionalities:

Purpose:

Centralized Resource: Langchain Hub acts as a one-stop shop for developers working with Langchain. It provides easy access to pre-built components like prompts, chains (workflows), and agents that can be used to create complex LLM (Large Language Model) applications.

Sharing and Discovery: Developers can upload their custom-created Langchain components to the hub, making them reusable by others. This fosters collaboration and innovation within the Langchain community.

Improved Efficiency: By leveraging pre-built and shared components, developers can save time and effort when building Langchain applications.

Components Available:

Prompts: These are instructions or starting points that guide the LLM towards generating the desired output. The hub offers a collection of prompts for various tasks like text summarization, question answering, and creative writing.

Chains (Workflows): These define sequences of operations involving prompts, models, agents, and potentially other chains. They orchestrate the overall workflow for complex LLM applications.

Agents: While prompts remain the primary focus, the hub also hosts prompts used by agents. In Langchain, an agent is an LLM-driven component that decides which tools to call and in what order while executing a workflow.

Benefits of Using Langchain Hub:

Reduced Development Time: By utilizing existing components from the hub, developers can build Langchain applications faster.

Improved Quality: Shared components can be vetted and improved by the community, leading to higher quality and reliability.

Knowledge Sharing: The hub facilitates knowledge sharing within the Langchain ecosystem, allowing developers to learn from each other's work.

Overall, Langchain Hub is a valuable resource for developers working with the Langchain framework. It promotes collaboration, accelerates development, and helps build more robust and innovative LLM applications.
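
As a concrete illustration, pulling a shared prompt from the hub takes a single call (the same pattern used in the RAG quickstart post above):

from langchain import hub

# Download the community-shared RAG prompt by its handle
prompt = hub.pull("rlm/rag-prompt")
print(prompt)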

references:

Gemini

Tuesday, April 23, 2024

Basic Replication and Support for that in Nutanix and VMWare ESXi

Basic Replication Explained

Basic replication is a data protection technique that copies data from a source system (primary) to a target system (secondary) at regular intervals. This creates a replica of the source data on the target system, allowing for quick recovery in case of a disaster or outage on the primary system. Here are some key characteristics of basic replication:


Asynchronous: Data is copied periodically (e.g., every hour), not in real-time. This means there might be some data loss between the last successful replication and the point of failure on the primary system (Recovery Point Objective - RPO).

One-way: Data flows in one direction, from the source to the target system.

Simple Setup: Basic replication is typically easier to set up and manage compared to more complex disaster recovery solutions.

Nutanix AHV and VMware ESXi capabilities for Basic Replication

Both Nutanix AHV (Acropolis Hypervisor) and VMware ESXi offer functionalities for basic replication of virtual machines (VMs). Here's a breakdown of their capabilities:


Nutanix AHV:


Built-in Functionality: AHV integrates data protection features directly within the hypervisor. It offers asynchronous replication of VMs to another AHV cluster for disaster recovery.

Snapshot-based Replication: AHV utilizes snapshots to capture the state of a VM at a specific point in time. These snapshots are then replicated to the target system.

Simple Management: AHV's web-based interface (Prism) allows for easy configuration and monitoring of replication jobs.

VMware ESXi:


Requires Additional Software: Basic replication on ESXi typically requires additional software tools from third-party vendors or from VMware itself (e.g., vSphere Replication).

Similar Functionality: Third-party tools or vSphere Replication offer functionalities similar to AHV, enabling asynchronous VM replication to another ESXi cluster for disaster recovery purposes.

Management: The management interface for replication might vary depending on the chosen solution (third-party tool vs. vSphere Replication).


Choosing Between AHV and ESXi for Basic Replication:


Ease of Use: If easy setup and management are priorities, AHV's built-in replication might be preferable.

Existing Infrastructure: If you already have a VMware environment with ESXi, using vSphere Replication or a compatible third-party tool could be a good fit.

Specific Requirements: Evaluate the specific features and functionalities offered by different solutions to match your needs (e.g., RPO requirements, supported platforms).


references:

Gemini 

Thursday, April 18, 2024

What is RAG 2.0 and is it required?

RAG 2.0, which stands for Retrieval-Augmented Generation 2.0, is an advancement in the technique of generating text by combining retrieval with large language models (LLMs). Here's a breakdown of its key aspects:

RAG (Retrieval-Augmented Generation):

The original RAG approach involved using an LLM (like GPT-3) for text generation and a separate retriever component to search for relevant information from external sources (e.g., Wikipedia, documents) based on a prompt or query.

The retrieved information was then fed into the LLM to improve the quality and coherence of the generated text.

Challenges of Traditional RAG:

Brittleness: These systems often required extensive prompting and suffered from cascading errors if the initial retrieval wasn't accurate.

Lack of Machine Learning: Individual components were not optimized together, leading to suboptimal performance.

Black-Box Nature: It was difficult to understand the reasoning behind the generated text and identify the source of retrieved information.

Improvements in RAG 2.0:

End-to-End Optimization: RAG 2.0 addresses these limitations by treating the entire system (retriever, LLM) as a single unit and jointly training all components. This allows for better synergy and optimization of the overall generation process.

Pretraining and Fine-tuning: Both the LLM and retriever are pre-trained on relevant datasets and then fine-tuned on the specific task for improved performance.

Alignment: The components are aligned during training to ensure the retrieved information is most beneficial for the LLM to generate high-quality text.

Benefits of RAG 2.0:

Improved Text Quality: RAG 2.0 can generate more informative, factually correct, and coherent text by leveraging retrieved information.

Reduced Prompting Needs: The system can potentially understand the user's intent better and generate relevant text with less explicit prompting compared to traditional RAG.

Explainability: With advancements in this area, RAG 2.0 might offer better insights into the reasoning behind the generated text and the source of retrieved information.

Applications of RAG 2.0:

Chatbots: RAG 2.0 can enhance chatbots by enabling them to access and incorporate relevant information to provide more informative and comprehensive responses.

Machine Translation: By retrieving contextually relevant information, RAG 2.0 can potentially improve the accuracy and fluency of machine translation.

Text Summarization: The retrieved information can be used to create more informative and comprehensive summaries of factual topics.

Overall, RAG 2.0 is a significant advancement in retrieval-augmented generation, offering a more robust and efficient approach to generating high-quality text with the help of external information.

The Real Question is Still Unanswered

Although it seems RAG 2.0 might become the enterprise standard shortly due to its design, which is specifically aimed at companies unwilling to share confidential data with LLM providers, there is reason to believe that RAG, no matter the version, may eventually not be required at all.

The Arrival of Huge Sequence Length

I'm sure you are aware that today's frontier models, such as Gemini 1.5 or Claude 3, have huge context windows: Gemini 1.5 supports up to a million tokens (roughly 750k words) in its production release, and up to 10 million tokens (around 7.5 million words) in the research labs.



References:

Gemini