To implement this, I recommend LlamaIndex with its built-in QueryFusionRetriever. It gives you a production-ready way to perform hybrid search (BM25 + vector) and then merge the two result lists with Reciprocal Rank Fusion (RRF) for more accurate ranking.
1. Prerequisites
You will need to install the following libraries (note that LlamaIndex's vector index uses OpenAI embeddings by default, so set the OPENAI_API_KEY environment variable):
pip install llama-index llama-index-retrievers-bm25 llama-index-embeddings-openai llama-index-vector-stores-qdrant qdrant-client
2. Implementation Code
This script loads your JSON, prepares the documents, and sets up the hybrid retrieval pipeline.
import json

import qdrant_client
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.qdrant import QdrantVectorStore

# 1. Load your JSON data
cli_data = [
    {
        "command": "docker ps",
        "description": "List running containers",
        "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"],
    },
    # ... more commands
]

# 2. Transform the JSON into Documents for indexing.
# Combining command, description, and examples into one text block
# ensures the retrievers can "see" all parts during search.
documents = []
for item in cli_data:
    content = (
        f"Command: {item['command']}\n"
        f"Description: {item['description']}\n"
        f"Examples: {', '.join(item['examples'])}"
    )
    documents.append(
        Document(
            text=content,
            metadata={"command": item["command"]},  # keep the original command
        )
    )

# Parse the documents into nodes up front. With an external vector store
# like Qdrant, the index's docstore stays empty, so BM25 needs the nodes
# handed to it directly.
nodes = SentenceSplitter().get_nodes_from_documents(documents)

# 3. Set up vector storage (dense search)
client = qdrant_client.QdrantClient(location=":memory:")  # in-memory for this example
vector_store = QdrantVectorStore(client=client, collection_name="cli_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

# 4. Initialize the retrievers
# Semantic (vector) retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

# Keyword (BM25) retriever
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

# 5. The hybrid "fusion" retriever combines both result lists and
# reranks them using Reciprocal Rank Fusion (RRF)
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=1,  # set > 1 to have an LLM rewrite the query into variations
    mode="reciprocal_rerank",
    use_async=True,
)

# 6. Usage
query = "How do I see all my containers?"
results = hybrid_retriever.retrieve(query)
for node in results:
    print(f"Score: {node.score:.4f}")
    print(f"Content:\n{node.text}\n")
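One note on step 1: in practice cli_data would come from a file rather than being hardcoded. Here is a minimal sketch, assuming a hypothetical cli_commands.json with the same schema (the snippet writes a tiny sample file first so it runs standalone):

```python
import json

# Write a small sample file so the sketch is self-contained;
# in practice cli_commands.json would already exist on disk.
sample = [
    {
        "command": "docker ps",
        "description": "List running containers",
        "examples": ["docker ps -a"],
    }
]
with open("cli_commands.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)

with open("cli_commands.json", encoding="utf-8") as f:
    cli_data = json.load(f)

# Sanity-check that every entry carries the fields the indexing loop expects
for item in cli_data:
    assert {"command", "description", "examples"} <= item.keys()
```

Validating the schema up front fails fast on malformed entries instead of raising a KeyError mid-indexing.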
Why this works for CLI datasets:
* Contextual Awareness: By putting the command, description, and examples into the document text, the vector search understands that "list containers" relates to docker ps.
* Precise Flag Matching: If a user searches for a specific flag like -a or --format, the BM25 retriever will catch that exact string, which a standard vector search might ignore as "noise."
* RRF Ranking: Reciprocal Rank Fusion is convenient because it doesn't require you to manually weight the retrievers (e.g., 70% vector, 30% keyword). It rewards items by their rank position in each list, so results that appear near the top of both lists rise to the top of the fused ranking.
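To make the RRF step concrete, here is a minimal, library-free sketch of what the fusion does under the hood. The document IDs and the two ranked lists are hypothetical; k=60 is the smoothing constant commonly used for RRF:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, so items ranked highly in multiple lists win.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers
vector_hits = ["docker ps", "docker images", "docker logs"]
bm25_hits = ["docker ps -a", "docker ps", "docker rm"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

"docker ps" ends up first because it appears in both lists; everything else has only a single-list score. This is the same idea QueryFusionRetriever applies in its "reciprocal_rerank" mode, just without normalization details.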
Would you like me to show you how to add an LLM step to this so it generates a natural language explanation of the command it found?