Implementing a RAG (Retrieval-Augmented Generation) system for technical content like CLI commands requires a different approach than standard document RAG. Because CLI commands use specific syntax (e.g., --flags, grep, awk) and jargon, simple semantic search often fails to distinguish between similar-looking commands.
A strong starting point for your JSON dataset is a Hybrid Retrieval strategy combined with Context-Aware Metadata.
1. The Strategy: Hybrid Retrieval
Technical documentation usually requires two types of matching:
* Semantic Search (Dense): Understands intent. If a user asks "How do I list hidden files?", the retriever finds ls -a even if the word "hidden" isn't in the description.
* Keyword Match (Sparse/BM25): Handles exact technical terms. If a user types the specific flag --recursive, you want the retriever to prioritize commands that actually contain that exact string.
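To make the keyword side concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. The tokenizer, the `k1`/`b` defaults, and the three sample documents are illustrative assumptions, not part of your dataset; a real system would normally rely on the search engine's built-in BM25. The point it tries to show is that the tokenizer must keep flags such as `--recursive` intact, and case-sensitive, for exact matching to work.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: a flag like "--recursive" or "-a" stays one token.
# The lookbehind stops hyphenated words ("context-aware") from producing fake flags.
TOKEN_RE = re.compile(r"(?<!\w)--?[A-Za-z][\w-]*|\w+")

def tokenize(text):
    # Flags keep their case ("-A" and "-a" can differ); plain words are lowercased.
    return [t if t.startswith("-") else t.lower() for t in TOKEN_RE.findall(text)]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative documents only.
docs = [
    "cp --recursive src dst: copy a directory tree",
    "rm file: delete a file",
    "ls -a: list all files, including hidden ones",
]
scores = bm25_scores("--recursive", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document contains the exact token `--recursive`, so it is the only one with a non-zero score.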
Recommended Pipeline:
* Preprocessing: For each JSON object, create a "searchable string" that combines the command name, description, and examples.
* Indexing: Use a vector database (like Qdrant, Pinecone, or Weaviate) that supports Hybrid Search.
* Retrieval: Use Reciprocal Rank Fusion (RRF) to combine the results from the vector search and the keyword search.
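The fusion step can be sketched in a few lines. This is a minimal version of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document IDs; `k=60` is the commonly used default constant, and the two hit lists below are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the vector search and the keyword search.
dense_hits = ["git log", "git reflog", "history"]
sparse_hits = ["git log", "git shortlog", "git reflog"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only rank positions, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.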
2. Preparing Your JSON Data
Don't just embed the description. You need to structure the text to help the embedding model "see" the command.
Original JSON:
{
  "command": "docker ps",
  "description": "List running containers",
  "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]
}
Optimized Chunk for Embedding:
> Command: docker ps
> Description: List running containers.
> Examples: docker ps -a, docker ps --format '{{.ID}}'
> Keywords: docker, ps, list, containers, running, status
>
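One way to produce that chunk from the JSON object is sketched below. The function name `build_chunk` and the idea of passing keywords in by hand are assumptions for illustration; how you derive the keywords (manually, or from the command and description) is up to you.

```python
import json

def build_chunk(entry, keywords=None):
    """Render one command entry as the labelled text block that gets embedded."""
    lines = [
        f"Command: {entry['command']}",
        # Normalize to exactly one trailing period.
        f"Description: {entry['description'].rstrip('.')}.",
    ]
    examples = entry.get("examples") or []
    if examples:
        lines.append("Examples: " + ", ".join(examples))
    if keywords:
        lines.append("Keywords: " + ", ".join(keywords))
    return "\n".join(lines)

raw = '''{
  "command": "docker ps",
  "description": "List running containers",
  "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]
}'''
chunk = build_chunk(
    json.loads(raw),
    keywords=["docker", "ps", "list", "containers", "running", "status"],
)
print(chunk)
```

Keeping the labels (`Command:`, `Examples:`) in the text gives both the embedding model and the keyword index a consistent structure across all entries.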
3. Implementation Steps
A. Embedding Model Choice
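Since you are dealing with code-like structures, prefer a model that handles technical text and code well, and check candidates against a handful of real queries from your own users.
* Open Source: BAAI/bge-small-en-v1.5 (a small, general-purpose English model; a reasonable baseline for technical retrieval).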
* Proprietary: text-embedding-3-small (OpenAI) or text-embedding-004 (Gemini).
B. The Retrieval Logic
Use a "Multi-Vector" approach if your examples are very different from your descriptions:
* Summary Vector: Embed just the command and description.
* Example Vector: Embed each example separately but link them to the same parent command ID.
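The parent-child linking can be sketched without committing to a particular vector store. In this minimal example the record layout, the ID `cmd-001`, and the similarity scores are all hypothetical; the embedding and search calls are left out, and only the bookkeeping is shown.

```python
def expand_entry(parent_id, entry):
    """Return one summary record plus one record per example, all sharing parent_id."""
    records = [{
        "parent_id": parent_id,
        "kind": "summary",
        "text": f"{entry['command']}: {entry['description']}",
    }]
    for example in entry.get("examples", []):
        records.append({"parent_id": parent_id, "kind": "example", "text": example})
    return records

def collapse_to_parents(hits):
    """Keep the best score per parent so each command appears once in the results.

    `hits` is a list of (record, score) pairs from whatever vector search is used.
    """
    best = {}
    for record, score in hits:
        pid = record["parent_id"]
        if pid not in best or score > best[pid]:
            best[pid] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

entry = {
    "command": "docker ps",
    "description": "List running containers",
    "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"],
}
records = expand_entry("cmd-001", entry)

# Made-up scores: the second example matched the query better than the summary did.
hits = [(records[0], 0.41), (records[2], 0.87)]
ranked = collapse_to_parents(hits)
```

Taking the maximum score per parent is one simple choice; summing or averaging the child scores are alternatives worth comparing on your data.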
C. Handling "Short Query" Problems
CLI users often type short, ambiguous queries like "delete file." To mitigate this, expand the query before searching, for example with Hypothetical Document Embeddings (HyDE):
* Ask an LLM: "What is the technical description of the command to delete a file?"
* Embed that generated description and search your index with it instead of the user's two-word query.
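The two steps above can be sketched as a small function that takes the LLM call and the embedding call as parameters. Both `generate` and `embed` are placeholders for whatever client you use; the stand-ins below return canned values so the sketch runs on its own, and they are not real model calls.

```python
def hyde_query_vector(user_query, generate, embed):
    """Embed an LLM-written hypothetical description instead of the raw query."""
    prompt = (
        "What is the technical description of the command to "
        f"{user_query.strip().rstrip('.')}?"
    )
    hypothetical_doc = generate(prompt)
    return embed(hypothetical_doc)

def fake_generate(prompt):
    # Stand-in for a real LLM call; returns a canned answer.
    return "rm: remove files or directories from the filesystem"

def fake_embed(text):
    # Stand-in for a real embedding model: a toy vector of word lengths.
    return [float(len(word)) for word in text.split()]

vector = hyde_query_vector("delete file.", fake_generate, fake_embed)
```

The resulting vector is then used for the dense search exactly as a normal query vector would be. For the keyword search it is usually safer to keep the user's original words, since the generated text may contain terms the user never typed.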
4. Why this works for CLI
| Feature | Why it matters |
|---|---|
| Hybrid Search | Ensures git log is found for both "history" (semantic) and "log" (keyword). |
| Metadata Filtering | Allows you to filter by categories (e.g., only search git commands if the user is in a git context). |
| Example Indexing | Often, a user remembers a flag but not the command; indexing examples helps catch these. |
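As a rough illustration of the metadata filtering row, the sketch below narrows the candidate set before any ranking happens. The `category` field is an assumption: it is not in the JSON shown earlier, so you would need to add it to each entry. Most vector databases offer an equivalent filter at query time.

```python
def filter_by_category(entries, category):
    """Return only the entries whose `category` metadata matches."""
    return [e for e in entries if e.get("category") == category]

# Illustrative entries with a hypothetical `category` field.
entries = [
    {"command": "git log", "description": "Show commit logs", "category": "git"},
    {"command": "docker logs", "description": "Fetch the logs of a container", "category": "docker"},
    {"command": "git status", "description": "Show the working tree status", "category": "git"},
]

# If the user is in a git context, only git commands are searched.
candidates = filter_by_category(entries, "git")
```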
Would you like me to provide a Python code snippet using a specific library (like LangChain or LlamaIndex) to implement this hybrid search?