Tuesday, February 18, 2025

What are some advanced techniques for building production-grade RAG?

Decoupling Chunks Used for Retrieval vs. Chunks Used for Synthesis

Structured Retrieval for Larger Document Sets

Dynamically Retrieve Chunks Depending on your Task

Optimize Context Embeddings


Decoupling Chunks Used for Retrieval vs. Chunks Used for Synthesis

The chunk representation that works best for retrieval is not necessarily the representation that works best for LLM synthesis.

Key Techniques

There are two main ways to take advantage of this idea:

1. Embed a document summary, which links to chunks associated with the document.

This helps you first retrieve relevant documents at a high level, rather than retrieving chunks directly (which might come from irrelevant documents).

2. Embed a sentence, which then links to a window around the sentence.

This allows for finer-grained retrieval of relevant context (embedding giant chunks leads to “lost in the middle” problems), but also ensures enough context for LLM synthesis.
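To make the second technique concrete, here is a minimal, framework-free sketch of sentence-window retrieval: embed individual sentences for precise matching, but return a surrounding window of text for synthesis. The model name is just an example, and the in-memory list stands in for a real vector store. (LlamaIndex packages this pattern as its sentence-window node parser.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example model only -- any embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_sentence_windows(sentences: list[str], window_size: int = 2) -> list[dict]:
    """Embed each individual sentence (small unit, precise to retrieve),
    but store a window of surrounding sentences (larger unit, enough
    context for synthesis) alongside it."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    records = []
    for i, emb in enumerate(embeddings):
        lo, hi = max(0, i - window_size), min(len(sentences), i + window_size + 1)
        records.append({"embedding": emb, "window": " ".join(sentences[lo:hi])})
    return records

def retrieve_windows(query: str, records: list[dict], top_k: int = 3) -> list[str]:
    q = model.encode(query, normalize_embeddings=True)
    ranked = sorted(records, key=lambda r: float(np.dot(q, r["embedding"])),
                    reverse=True)
    # Hand the LLM the windows, not the bare sentences that were embedded.
    return [r["window"] for r in ranked[:top_k]]
```

The key design choice is that the unit you embed and the unit you synthesize over are stored together but can differ in size.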


Structured Retrieval for Larger Document Sets

A big issue with the standard RAG stack (top-k retrieval + basic text splitting) is that it doesn't do well as the number of documents scales up, e.g. if you have 100 different PDFs. In this setting, given a query you may want to use structured information to help with more precise retrieval; for instance, if you ask a question that's only relevant to two PDFs, you can use structured information to ensure those two PDFs get returned, rather than relying on raw embedding similarity with chunks.

Key Techniques

1. Metadata Filters + Auto-Retrieval: Tag each document with metadata and then store it in a vector database. At inference time, use the LLM to infer the right metadata filters to query the vector db, in addition to the semantic query string (see the sketch after this list).

Pros ✅: Supported in major vector dbs. Can filter documents across multiple dimensions.

Cons 🚫: Can be hard to define the right tags. Tags may not contain enough relevant information for more precise retrieval. Also, tags represent keyword search at the document level and don't allow for semantic lookups.

2. Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval: Embed document summaries and map them to chunks per document. Fetch at the document level first, then retrieve chunks within the matching documents.

Pros ✅: Allows for semantic lookups at the document level.

Cons 🚫: Doesn't allow for keyword lookups by structured tags (which can be more precise than semantic search). Also, auto-generating summaries can be expensive.
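As a concrete illustration of the first technique, here is a sketch using LlamaIndex's auto-retrieval module. It assumes a recent llama_index release (import paths have moved between versions), a configured default LLM and embedding model, and a purely illustrative metadata schema:

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# Toy corpus; in practice each document is tagged at ingestion time.
documents = [
    Document(text="...", metadata={"year": 2023, "doc_type": "10-K"}),
    Document(text="...", metadata={"year": 2022, "doc_type": "press release"}),
]
index = VectorStoreIndex.from_documents(documents)

# Describe the metadata schema so the LLM knows what it can filter on.
vector_store_info = VectorStoreInfo(
    content_info="Company reports and filings",
    metadata_info=[
        MetadataInfo(name="year", type="int", description="Fiscal year"),
        MetadataInfo(name="doc_type", type="str",
                     description="One of: '10-K', 'press release'"),
    ],
)

retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
# The LLM infers both metadata filters (e.g. year == 2023) and a
# semantic query string before querying the vector store.
nodes = retriever.retrieve("Tell me about the D&I initiatives in 2023")
```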

Dynamically Retrieve Chunks Depending on your Task

RAG isn't just about question-answering over specific facts, which is what top-k similarity search is optimized for. Users can ask a broad range of queries. Queries handled by naive RAG stacks ask about specific facts, e.g. "Tell me about the D&I initiatives for this company in 2023" or "What did the narrator do during his time at Google". But queries can also ask for summarization, e.g. "Can you give me a high-level overview of this document", or comparisons, e.g. "Can you compare/contrast X and Y". All of these use cases may require different retrieval techniques.

LlamaIndex provides some core abstractions to help you do task-specific retrieval: a router module, a data agent module, advanced query engine modules, and modules that join structured and unstructured data.

You can use these modules to do joint question-answering and summarization, or even combine structured queries with unstructured queries.
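For instance, here is a sketch of the router module picking between a summarization engine and a fact-lookup engine over the same data. It assumes a recent llama_index release and a configured default LLM; the toy document and tool descriptions are illustrative:

```python
from llama_index.core import Document, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

documents = [Document(text="...")]  # your corpus here

summary_tool = QueryEngineTool.from_defaults(
    query_engine=SummaryIndex.from_documents(documents).as_query_engine(
        response_mode="tree_summarize"
    ),
    description="Useful for high-level summarization questions.",
)
fact_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(documents).as_query_engine(),
    description="Useful for retrieving specific facts.",
)

# The selector LLM reads the tool descriptions and routes each query
# to the engine best suited for the task.
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, fact_tool],
)
print(query_engine.query("Give me a high-level overview of this document"))
```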

Optimize Context Embeddings

This is related to the motivation described above in "decoupling chunks used for retrieval vs. synthesis". We want to make sure the embeddings are optimized for retrieval over your specific data corpus, since pre-trained models may not capture the salient properties of the data relevant to your use case.

Key Techniques

Beyond some of the techniques listed above, we can also try fine-tuning the embedding model. We can do this over an unstructured text corpus in a label-free way, by using an LLM to generate synthetic training pairs.
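A minimal sketch of that label-free recipe, using the classic sentence-transformers training API: an LLM generates a synthetic question per chunk, and the model is fine-tuned so each question embeds close to its source chunk. Here `generate_question` is a placeholder for whatever LLM call you prefer, and the base model is just an example. (LlamaIndex also ships finetuning helpers that wrap this workflow.)

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

def generate_question(chunk: str) -> str:
    """Placeholder for an LLM call, e.g. 'Write a question that the
    following passage answers: <chunk>'."""
    raise NotImplementedError

chunks = ["..."]  # your corpus, split into chunks

# Synthetic (question, chunk) pairs -- no human labels required.
examples = [InputExample(texts=[generate_question(c), c]) for c in chunks]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # example base model
loader = DataLoader(examples, shuffle=True, batch_size=16)
# With MultipleNegativesRankingLoss, the other chunks in a batch act as
# in-batch negatives for each question.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=2, warmup_steps=50)
model.save("finetuned-embeddings")
```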

References:

https://docs.llamaindex.ai/en/stable/optimizing/production_rag/
