Thursday, April 25, 2024

What are the main steps involved in creating a RAG application?

Indexing

Load: First we need to load our data. We’ll use DocumentLoaders for this.

Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.

Store: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.


Retrieval and generation

Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

Generate: A ChatModel / LLM produces an answer using a prompt that includes both the question and the retrieved data.


In this case, we load a document from the web and try to perform Q&A over it.


Indexing in detail:


In LangChain, document loading can be done in many ways depending on the data source; there are around 160 DocumentLoaders, e.g. TextLoader and WebBaseLoader. Here we use WebBaseLoader to load a blog post.


import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
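
A quick check on what was loaded (assuming the load succeeded; the size of the result is what motivates the splitting step below):

print(len(docs))                  # one Document for the whole post
print(len(docs[0].page_content))  # roughly 42k characters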


Indexing: Split


Our loaded document is over 42k characters long, which is too big to fit into the context window of most LLMs. To handle this, we'll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.


In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
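
A minimal sketch of that split step, following the referenced quickstart (it produces the all_splits list that the vector store is built from below):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)  # 66 chunks for this post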



Indexing: Store

Now we need to index our 66 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity — we measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors).
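
For intuition only, cosine similarity between two vectors can be sketched like this (toy 3-dimensional vectors; not part of the LangChain pipeline):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings"; real embeddings have hundreds or thousands of dimensions.
query_vec = np.array([0.1, 0.9, 0.2])
chunk_vec = np.array([0.2, 0.8, 0.1])
print(cosine_similarity(query_vec, chunk_vec))  # values near 1.0 mean very similar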


We can embed and store all of our document splits in a single command using the Chroma vector store and OpenAIEmbeddings model.


from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())


Embeddings: Wrapper around a text embedding model, used for converting text to embeddings.
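
For example, a single query can be embedded directly (this assumes an OpenAI API key is configured; the dimensionality depends on the embedding model):

from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()
# embed_query turns one piece of text into a list of floats.
vector = embeddings_model.embed_query("What is Task Decomposition?")
print(len(vector))  # e.g. 1536 dimensions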


This completes the Indexing portion of the pipeline. At this point we have a query-able vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question.



Retrieval and Generation: Retrieve

We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.


The most common type of Retriever is the VectorStoreRetriever, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any VectorStore can easily be turned into a Retriever with VectorStore.as_retriever():



retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
len(retrieved_docs)
print(retrieved_docs[0].page_content)



MultiQueryRetriever generates variants of the input question to improve retrieval hit rate (see the sketch after this list).

MultiVectorRetriever instead generates variants of the embeddings, also in order to improve retrieval hit rate.

Max marginal relevance selects for relevance and diversity among the retrieved documents to avoid passing in duplicate context.

Documents can be filtered during vector store retrieval using metadata filters, such as with a Self Query Retriever.
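
A rough sketch of two of these options, assuming the vectorstore built above and an OpenAI chat model (any chat model would do):

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# Max marginal relevance: trade off similarity to the query against diversity among results.
mmr_retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 6})

# MultiQueryRetriever: an LLM rewrites the question into several variants, runs each
# against the vector store, and returns the unique union of the retrieved documents.
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=ChatOpenAI(temperature=0)
)
mq_docs = multi_query_retriever.invoke("What are the approaches to Task Decomposition?")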


from langchain import hub

prompt = hub.pull("rlm/rag-prompt")


Now the prompt can be invoked with filler values to see what it produces:


example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages

print(example_messages[0].content)


This will print something like the following:


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: filler question

Context: filler context

Answer:



Now the RAG chain can be built like this:


from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Any chat model works here; the referenced quickstart uses an OpenAI chat model.
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")


def format_docs(docs):
    # Join the retrieved Documents into one string for the prompt's {context} slot.
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


Now the RAG chain can be streamed like this:


for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)
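
If streaming isn't needed, the same chain can be invoked in one call, which returns the complete answer as a string:

answer = rag_chain.invoke("What is Task Decomposition?")
print(answer)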



References:

https://python.langchain.com/docs/use_cases/question_answering/quickstart/

