In LangChain, SelfQueryRetriever is a retriever that uses an LLM to turn a natural-language question into a structured query: a search string plus a filter over document metadata. Unlike traditional retrievers that rely solely on similarity search (e.g., vector search), it combines semantic search with structured filtering, so constraints stated in the question (an author, a date range, a category) are applied as actual filters rather than left to influence the similarity score.
Key Features of SelfQueryRetriever:
Natural Language Queries: It allows users to input complex, free-form questions or queries in natural language.
Dynamic Query Modification: It uses a language model (LLM) to rewrite the user input before searching. The topical part of the question becomes the search string, and any constraints it contains are pulled out separately, so the similarity search runs on the subject matter alone.
Structured Filters: It can also convert a user's question into structured filters that help narrow down the search more effectively. For example, it can apply specific criteria like filtering by date, category, or other metadata fields that are relevant to the search.
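As a rough illustration of what a structured filter amounts to, the sketch below represents a filter as a list of comparisons and checks it against a document's metadata. The names Comparison, COMPARATORS, and matches are hypothetical and chosen for this example; they are not LangChain's classes, and the "year" field is an assumed metadata field.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical, simplified representation of one filter condition.
@dataclass(frozen=True)
class Comparison:
    attribute: str   # metadata field name, e.g. "year"
    comparator: str  # one of the keys in COMPARATORS
    value: Any       # value to compare the metadata field against

# Map comparator names to the Python operator that implements them.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def matches(metadata: Dict[str, Any], filters: List[Comparison]) -> bool:
    """Return True when the metadata satisfies every comparison (logical AND)."""
    for c in filters:
        if c.attribute not in metadata:
            return False  # a missing field cannot satisfy the condition
        if not COMPARATORS[c.comparator](metadata[c.attribute], c.value):
            return False
    return True

# Filter for "security documents published after 2020" (assumed metadata fields).
filters = [Comparison("category", "eq", "security"), Comparison("year", "gt", 2020)]

print(matches({"category": "security", "year": 2022}, filters))  # True
print(matches({"category": "security", "year": 2019}, filters))  # False
```

In practice the vector store evaluates the filter, not Python code like this; the sketch only shows the logic the filter expresses.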
How SelfQueryRetriever Works:
Self-Querying: The retriever prompts an LLM with the user's question, a short description of the document contents, and a description of the available metadata fields. The LLM answers with a structured query, so the retriever in effect queries itself before it queries the vector store.
LLM-Powered Refinement: The structured query contains a search string and, where the question calls for it, a filter made of comparisons on metadata fields (for example, author equals "Alice" or date later than 2021). The filter is translated into the vector store's own filter syntax and applied alongside the similarity search.
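The flow described above can be sketched without calling a model at all. In the example below, the LLM response is a hard-coded JSON string (an assumption about what a model might return, not captured output), and to_native_filter is a hypothetical helper that emits a MongoDB-style operator dictionary similar in spirit to the filter syntax some vector stores accept. The field names and the allowed-field list are illustrative.

```python
import json
from typing import Any, Dict, List, Tuple

# Fields and comparators the retriever is willing to accept (illustrative).
ALLOWED_FIELDS = {"author", "year", "category"}
ALLOWED_COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte"}

# Hypothetical raw LLM response for:
# "Show me all security reports written by Alice after 2021."
llm_output = """
{
  "query": "security reports",
  "filter": [
    {"attribute": "author", "comparator": "eq", "value": "Alice"},
    {"attribute": "year", "comparator": "gt", "value": 2021}
  ]
}
"""

def parse_structured_query(raw: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Parse the LLM response and reject fields or comparators we do not know."""
    data = json.loads(raw)
    clauses = data.get("filter", [])
    for clause in clauses:
        if clause["attribute"] not in ALLOWED_FIELDS:
            raise ValueError(f"unknown metadata field: {clause['attribute']}")
        if clause["comparator"] not in ALLOWED_COMPARATORS:
            raise ValueError(f"unsupported comparator: {clause['comparator']}")
    return data["query"], clauses

def to_native_filter(clauses: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Translate the comparisons into a MongoDB-style operator dictionary."""
    parts = [{c["attribute"]: {"$" + c["comparator"]: c["value"]}} for c in clauses]
    if not parts:
        return {}
    return parts[0] if len(parts) == 1 else {"$and": parts}

search_text, clauses = parse_structured_query(llm_output)
native_filter = to_native_filter(clauses)
print(search_text)
print(native_filter)
```

Validating the field names matters because a model can invent metadata fields that do not exist; rejecting them early is one reasonable design choice, not the only one.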
Difference from Other Retrievers:
Standard Retriever:
Relies on similarity search techniques (like vector search or keyword matching).
Simply matches the user's query to the stored documents and retrieves the most similar ones based on embeddings.
No dynamic query modification or structured filtering is involved.
SelfQueryRetriever:
Uses an LLM to interpret the user query before any search is run.
It can apply structured filters based on the query (e.g., filter documents by date or category).
It searches on the topical part of the query only, so constraints such as dates act as hard filters instead of adding noise to the similarity score.
Example Use Case:
Suppose you have a database of documents with metadata such as "author," "date," "category," etc. A user asks:
“Can you show me all network security articles written after 2020?”
A Standard Retriever would embed the whole question and return the documents most similar to it. The phrase “after 2020” would only nudge the similarity score, so older articles about network security could still be returned.
A SelfQueryRetriever would use an LLM to break down the query into actionable parts:
Retrieve documents about network security.
Filter documents where the date is after 2020.
Return only articles matching both criteria.
This makes SelfQueryRetriever a better fit than plain similarity search when questions mix a topic with constraints on structured metadata, especially over large document collections.
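The difference can be seen on toy data. The sketch below is not LangChain code: it uses a bag-of-words cosine similarity as a stand-in for embeddings, four made-up documents, and a hand-written split of the question into search text ("network security") and a year filter, which is the step an LLM would perform in the real retriever.

```python
import math
from collections import Counter
from typing import Dict, List

# Made-up documents with an assumed "year" metadata field.
DOCS: List[Dict] = [
    {"text": "network security articles on router hardening", "year": 2019},
    {"text": "network security threats in cloud deployments", "year": 2022},
    {"text": "network performance tuning after 2020 upgrades", "year": 2021},
    {"text": "zero trust network security architecture", "year": 2023},
]

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts of the lowercased text."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(query: str, docs: List[Dict], k: int = 2) -> List[Dict]:
    """Standard retrieval: rank every document by similarity to the full query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)[:k]

def self_query_search(search_text: str, after_year: int, docs: List[Dict], k: int = 2) -> List[Dict]:
    """Self-query style retrieval: apply the metadata filter, then rank by topic only."""
    candidates = [d for d in docs if d["year"] > after_year]
    return similarity_search(search_text, candidates, k)

plain = similarity_search("network security articles written after 2020", DOCS)
refined = self_query_search("network security", 2020, DOCS)

print([d["year"] for d in plain])    # [2019, 2021]
print([d["year"] for d in refined])  # [2023, 2022]
```

On this data the plain search returns a 2019 article and an off-topic document that merely contains the words "after 2020", while the filtered search returns only network security documents published after 2020.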
Sample Code:
Here’s a simple example of using SelfQueryRetriever in LangChain:
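# Requires: langchain, langchain-community, langchain-openai, chromadb, lark
# Import paths vary between LangChain releases; adjust them to your installed version.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAI, OpenAIEmbeddings

# Define the attributes (metadata) of your documents
# The year is stored as an integer so that "after 2021" maps to a simple numeric comparison
metadata_field_info = [
    AttributeInfo(name="author", description="The author of the document", type="string"),
    AttributeInfo(name="year", description="The year the document was published", type="integer"),
    AttributeInfo(name="category", description="The category of the document", type="string"),
]

# Illustrative sample documents; replace with your own corpus
docs = [
    Document(page_content="Review of firewall incidents", metadata={"author": "Alice", "year": 2022, "category": "security"}),
    Document(page_content="Notes on VPN configuration", metadata={"author": "Bob", "year": 2020, "category": "security"}),
]

# Initialize your vector store and LLM
# Chroma replaces FAISS here: SelfQueryRetriever needs a vector store with a built-in filter translator
embedding_model = OpenAIEmbeddings()
vector_store = Chroma.from_documents(docs, embedding_model)
llm = OpenAI(temperature=0)

# Create SelfQueryRetriever
self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents="Security reports and articles",  # Brief description of what the documents contain
    metadata_field_info=metadata_field_info,
    verbose=True,
)

# Use the retriever to answer a query
query = "Show me all security reports written by Alice after 2021."
retrieved_docs = self_query_retriever.invoke(query)
for doc in retrieved_docs:
    print(doc)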
When to Use SelfQueryRetriever:
When your data has a lot of structured information (like metadata) and you need to refine queries based on that structure.
For advanced retrieval scenarios where the user queries require dynamic, intelligent modification or filtering.
In scenarios where similarity search alone might not retrieve the most relevant documents, and you need additional filtering or query modifications.