A recent paper from Anthropic introduced the concept of “contextual embeddings,” which addresses the problem of missing context by adding relevant context to each chunk before embedding it.
What you can do is leverage an LLM for Contextual Retrieval: develop a prompt that instructs the model to generate concise, chunk-specific context from the overall document, so that each chunk carries the contextual information it needs.
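As a minimal sketch of that step, the prompt below is adapted from the one Anthropic published; the SDK usage and the model name are illustrative assumptions rather than details from the original post.

```python
# Sketch of chunk contextualization with the Anthropic Python SDK.
# The prompt is adapted from Anthropic's Contextual Retrieval post;
# the model name is an illustrative choice, not prescribed by the post.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short, succinct context to situate this chunk within the
overall document for the purposes of improving search retrieval of the
chunk. Answer only with the succinct context and nothing else."""

def contextualize(document: str, chunk: str) -> str:
    """Ask the LLM for chunk-specific context, then prepend it to the chunk."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    context = response.content[0].text
    return f"{context} {chunk}"  # the contextualized_chunk
```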
Consider a query about a specific company’s quarterly revenue growth. A relevant chunk might read: “The company’s revenue grew by 3% over the previous quarter.” It contains the growth percentage but lacks crucial details such as the company name and the time period, and this absence of context can hinder accurate retrieval.
By sending the overall document to an LLM along with each chunk, we get a contextualized_chunk like this (the example is taken from Anthropic’s post):

“This chunk is from an SEC filing on ACME corp’s performance in Q2 2023; the previous quarter’s revenue was $314 million. The company’s revenue grew by 3% over the previous quarter.”
This contextualized_chunk is then passed to an embedding model to produce the chunk’s embedding.
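A short sketch of that step, continuing the example above; the sentence-transformers library and the model name are illustrative choices, not something the post prescribes:

```python
# Embed the contextualized chunk; sentence-transformers stands in here for
# whichever embedding model you actually use.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# `document` and `chunk` are assumed inputs from your own loading/splitting step.
contextualized_chunk = contextualize(document, chunk)  # from the sketch above
chunk_embedding = embedder.encode(contextualized_chunk)
```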
Hybrid search approach
While contextual embeddings have been shown to improve on traditional semantic-search RAG, a hybrid approach that also incorporates BM25 can produce even better results.
The same chunk-specific context can also be used with BM25 search to further improve retrieval performance.
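Here is a hedged sketch of one way to combine the two signals, using the rank_bm25 library and reciprocal rank fusion; the library, the fusion method, and the constant k=60 are illustrative choices rather than details from the post:

```python
# Hybrid retrieval sketch: BM25 and embedding similarity over the same
# contextualized chunks, fused with reciprocal rank fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi

# `document` and `chunks` are assumed to come from your own loading/splitting step.
contextualized_chunks = [contextualize(document, c) for c in chunks]

# Lexical index: BM25 over whitespace-tokenized contextualized chunks.
bm25 = BM25Okapi([c.lower().split() for c in contextualized_chunks])

# Semantic index: one embedding per contextualized chunk.
chunk_embeddings = embedder.encode(contextualized_chunks)

def hybrid_search(query: str, top_k: int = 5, k: int = 60) -> list[int]:
    """Return indices of the top_k chunks under reciprocal rank fusion."""
    # Rank chunks by BM25 score (descending).
    bm25_rank = np.argsort(-bm25.get_scores(query.lower().split()))

    # Rank chunks by cosine similarity to the query embedding (descending).
    q = embedder.encode(query)
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q)
    )
    dense_rank = np.argsort(-sims)

    # RRF: each ranking contributes 1 / (k + rank) to a chunk's fused score.
    scores = np.zeros(len(contextualized_chunks))
    for rank_list in (bm25_rank, dense_rank):
        for rank, idx in enumerate(rank_list):
            scores[idx] += 1.0 / (k + rank + 1)
    return list(np.argsort(-scores)[:top_k])
```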
Okapi BM25
BM25 is a ranking algorithm that addresses two drawbacks of TF-IDF: term saturation and document length.
Term saturation and diminishing returns
If a document contains 100 occurrences of the term “computer,” is it really twice as relevant as a document that contains 50 occurrences? We want to cap the contribution of TF once a term is likely saturated. BM25 solves this by introducing a parameter k1 that controls the shape of the saturation curve. k1 can be tuned so that as TF increases, the BM25 score eventually saturates, meaning further increases in TF no longer contribute much to the score.
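To make the diminishing returns concrete, here is BM25’s term-frequency component in isolation; this is a simplification that drops document-length normalization (equivalent to assuming a document of exactly average length):

```python
# BM25's term-frequency saturation component, with length normalization
# dropped for clarity. As tf grows, the value approaches k1 + 1.
def saturated_tf(tf: float, k1: float = 1.2) -> float:
    return tf * (k1 + 1) / (tf + k1)

print(saturated_tf(50))   # ~2.148
print(saturated_tf(100))  # ~2.174 -- doubling TF barely moves the score
```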
References:
https://levelup.gitconnected.com/the-best-rag-technique-yet-anthropics-contextual-retrieval-and-hybrid-search-62320d99004e