Monday, July 10, 2023

Context-Aware Question-Answering Systems With LLM

The application processes user input and generates responses grounded in the document’s content. Under the hood it uses the LangChain library for document loading, text splitting, embedding generation, vector storage, and question answering, with GPT-3.5-turbo producing the answers, which are returned as JSON to the UI.


The blog post linked in the references covers this in good detail; below are the steps I tried.

git clone https://github.com/Ricoledan/llm-gpt-demo


cd backend/ 

pip install -r requirements.txt


cd frontend/

npm i


A bit of pre-processing needs to be done first, as shown below.


First, we leverage LangChain’s document_loaders.unstructured package with this import:


from langchain.document_loaders.unstructured import UnstructuredFileLoader



Then load the unstructured data like this:


loader = UnstructuredFileLoader('./docs/document.txt')

documents = loader.load()


from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

split_texts = text_splitter.split_documents(documents)


CharacterTextSplitter in LangChain takes two arguments: chunk_size and chunk_overlap. The chunk_size parameter determines the size of each text chunk, while chunk_overlap specifies the number of overlapping characters between two adjacent chunks. By setting these parameters, you can control the granularity of the text splitting and tailor it to your specific application’s requirements.
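The effect of these two parameters can be illustrated with a plain-Python sketch (this is not LangChain’s implementation, just the idea of fixed-size chunks with a shared overlap region):

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into fixed-size chunks; adjacent chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(chr(65 + i % 26) for i in range(2500))  # 2,500 characters of sample text
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=100)
# with a 100-character overlap, each chunk repeats the tail of the previous one
```

The overlap helps preserve context that would otherwise be severed at a chunk boundary, at the cost of some duplicated storage.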


Embedding Generation

Representing text numerically


For our model to leverage what’s in the text, we must first convert the textual data into numerical representations called embeddings to make sense of it. These embeddings capture the semantic meaning of the text and allow for efficient and meaningful comparisons between text segments.


from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()


In LangChain, embeddings = OpenAIEmbeddings() creates an instance of the OpenAIEmbeddings class, which generates vector embeddings of the text data. Vector embeddings are the numerical representations of the text that capture its semantic meaning. These embeddings are used in various stages of the NLP pipeline, such as similarity search and response generation.


Vector Database Storage

Efficient organization of embeddings


A vector database (also called a vector store or vector search engine) is a data storage and retrieval system designed to handle high-dimensional vector data. In the context of natural language processing (NLP) and machine learning, vector databases are used to store and efficiently query embeddings and other vector representations of data.

from langchain.vectorstores import Chroma

Similarity Search

Finding relevant matches


With the query embedding generated, a similarity search is performed in the vector database to identify the most relevant matches. The search compares the query embedding to the stored embeddings, ranking them based on similarity metrics like cosine similarity or Euclidean distance. The top matches are the most relevant text passages to the user’s query.
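The ranking step can be illustrated with a small cosine-similarity search over toy vectors (hand-made 3-d vectors standing in for real embeddings, which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy stored "embeddings" keyed by the passage they represent
store = {
    "LangChain splits documents into chunks": [0.9, 0.1, 0.0],
    "Chroma persists vectors to disk":        [0.8, 0.3, 0.1],
    "npm installs frontend packages":         [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of the user's question
ranked = sorted(store, key=lambda p: cosine_similarity(store[p], query), reverse=True)
# ranked[0] is the passage whose embedding points in the most similar direction
```

A real vector database does the same comparison, but with approximate nearest-neighbor indexes so it scales to millions of vectors.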


persist_directory = './db'

vector_db = Chroma.from_documents(documents=split_texts, embedding=embeddings, persist_directory=persist_directory)


This line of code creates an instance of the Chroma vector database using the from_documents() method. By default, Chroma keeps the database in memory, but for our project we persist it to local disk by setting the persist_directory option, passing the path in via a variable of the same name.

Response Generation

Producing informative and contextual answers

Finally, the crafted prompt is fed to ChatGPT, which generates an answer based on the input. The generated response is then returned to the user, completing the process of retrieving and delivering relevant information based on the user’s query. The language model produces a coherent, informative, and contextually appropriate response by leveraging its deep understanding of language patterns and the provided context.
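The prompt-assembly step can be sketched in plain Python (the template wording here is illustrative, not the exact one used by the demo repo):

```python
def build_prompt(question, passages):
    """Combine retrieved passages and the user's question into one grounded prompt."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What does the text splitter do?",
    ["CharacterTextSplitter cuts documents into 1,000-character chunks."],
)
# this string is what gets sent as the user message to gpt-3.5-turbo
```

Grounding the model in retrieved context this way is what keeps the answers tied to the document rather than the model’s general training data.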


References:

https://betterprogramming.pub/building-context-aware-question-answering-systems-with-llms-b6f2b6e387ec
