Friday, January 30, 2026

Create your own RAG evaluation strategy

Building a RAG (Retrieval-Augmented Generation) evaluation from scratch is a great way to understand exactly where your pipeline is failing. While frameworks like Ragas or Arize Phoenix are popular, they are essentially wrappers around specific prompts and scoring math.

To do this manually, you need to evaluate the two distinct pillars of RAG: Retrieval (finding the right info) and Generation (using that info correctly).

1. The Evaluation Dataset

You can’t evaluate without a "Golden Dataset." Create a spreadsheet with 20–50 rows containing:

 * Question: What the user asks.

 * Context/Source: The specific document snippet that contains the answer.

 * Ground Truth: The ideal, "perfect" answer.

2. Evaluating Retrieval (The "Search" Part)

This measures if your vector database is actually finding the right documents. You don't need an LLM for this; you just need basic math.

 * Hit Rate (Hit@K): Did the correct document appear in the top-k results?

   * Calculation: (Number of queries where the right doc was found) / (Total queries).

 * Mean Reciprocal Rank (MRR): Measures where the right document appeared. It rewards the system more for having the correct answer at rank 1 than rank 5.

   * Formula: MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}, where rank_i is the rank of the first correct document for query i.
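
As a reference, here is a minimal sketch of both retrieval metrics. It assumes each golden-dataset row stores the ID of the document that should be retrieved, and that retriever.search is your own top-k retrieval call (both names are illustrative, not a specific library's API):

def retrieval_metrics(golden_dataset, retriever, k=5):
    """Compute Hit Rate@k and MRR over the golden dataset."""
    hits = 0
    reciprocal_ranks = []
    for item in golden_dataset:
        # retriever.search is a stand-in for your own top-k retrieval call
        retrieved_ids = [doc.id for doc in retriever.search(item["question"], top_k=k)]
        if item["source_id"] in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(item["source_id"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(golden_dataset)
    mrr = sum(reciprocal_ranks) / len(golden_dataset)
    return hit_rate, mrr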

3. Evaluating Generation (The "LLM-as-a-Judge" Method)

Since manual grading is slow, you can use a "Judge LLM" (like GPT-4o or Claude 3.5) to grade your RAG output. You feed the judge a specific prompt for each of these three metrics:

A. Faithfulness (Groundedness)

Does the answer stay true to the retrieved context, or did the LLM hallucinate?

 * The Prompt: "Given the following context and the generated answer, list every claim in the answer. For each claim, state if it is supported by the context. Score 1.0 if all claims are supported, 0.0 otherwise."

B. Answer Relevance

Does the answer actually address the user's question?

 * The Prompt: "On a scale of 1-5, how relevant is this response to the original user question? Ignore whether the facts are true for now; focus only on whether it addresses the user's intent."

C. Context Precision

Did the retrieval step provide "clean" information, or was it full of noise?

 * The Prompt: "Check the retrieved context. Is this information actually necessary to answer the user's question? Rate 1 for useful, 0 for irrelevant."

4. Simple Python Implementation Structure

You don't need a library; a simple loop will do:

results = []

for item in golden_dataset:
    # 1. Run your RAG pipeline
    retrieved_docs = retriever.get_relevant_documents(item['question'])
    response = rag_chain.invoke(item['question'])

    # 2. Manual/LLM Scoring
    score = call_judge_llm(
        system_prompt="You are a grader...",
        user_content=f"Question: {item['question']}\nContext: {retrieved_docs}\nAnswer: {response}"
    )

    results.append({"question": item['question'], "score": score})

# 3. Calculate the mean score
final_grade = sum(r['score'] for r in results) / len(results)
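
For completeness, here is one possible implementation of the call_judge_llm helper above, sketched with the OpenAI Python SDK (the model name and the "reply with a single number" convention are assumptions, not requirements):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_judge_llm(system_prompt: str, user_content: str) -> float:
    """Ask the judge model for a single numeric score and parse it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[
            {"role": "system", "content": system_prompt + " Reply with a single number only."},
            {"role": "user", "content": user_content},
        ],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparseable replies as a failed grade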


Summary Table: What to Track

| Metric | What it tests | Key question |
|---|---|---|
| Context Recall | Retrieval | Is the ground truth present in the chunks? |
| Faithfulness | Generation | Did the LLM make things up? |
| Answer Similarity | Generation | How close is the answer to the Ground Truth? (Use semantic similarity.) |

Would you like me to write a specific "Judge Prompt" you can use to grade your RAG's faithfulness?


Arize Phoenix embedding visualization and observability

 Arize Phoenix is different from Ragas or DeepEval because it is an observability tool. Instead of just giving you a score, it launches a local web dashboard that lets you visually inspect your CLI embeddings and trace exactly how your RAG pipeline is performing in real-time.

For your CLI project, Phoenix is incredibly helpful for seeing "clusters" of commands and finding out why a specific query retrieved the wrong CLI command.

1. Prerequisites

pip install arize-phoenix llama-index-callbacks-arize-phoenix


2. Implementation Code

This script connects LlamaIndex to Phoenix. Once you run it, Phoenix launches a local web UI (open the printed URL) where you can inspect your RAG "traces."

import phoenix as px
import llama_index.core
from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.callbacks.arize_phoenix import ArizePhoenixCallbackHandler

# 1. Start the Phoenix Search & Trace server (launches a local web UI)
session = px.launch_app()

# 2. Setup LlamaIndex to send data to Phoenix
remote_callback_handler = ArizePhoenixCallbackHandler()
callback_manager = CallbackManager([remote_callback_handler])
Settings.callback_manager = callback_manager

# 3. Your CLI JSON Data
cli_data = [
    {"command": "git checkout -b", "description": "Create and switch to a new branch", "examples": ["git checkout -b feature-login"]},
    {"command": "git branch -d", "description": "Delete a local branch", "examples": ["git branch -d old-feature"]}
]

# 4. Standard LlamaIndex Ingestion
documents = [Document(text=f"{item['command']}: {item['description']}") for item in cli_data]
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# 5. Run a query
# After running this, check the Phoenix Dashboard!
response = query_engine.query("How do I make a new branch?")

print(f"Answer: {response}")
print(f"Phoenix Dashboard URL: {session.url}")

# Keep the script running so you can explore the UI
import time
time.sleep(1000)


What to look for in the Phoenix UI:

 * Traces: You will see a "timeline" of your query. You can click on it to see exactly what text was sent to the embedding model and what chunks were pulled from your JSON.

 * The Embedding Map: Phoenix can visualize your CLI commands as dots in a 3D space.

   * Example: You might see a cluster of "Docker" commands and a cluster of "Git" commands.

   * Insight: If "how do I delete a branch" pulls up a "Docker delete" command, you will see the query dot land in the wrong cluster, telling you that your embeddings need more technical context.

 * LLM Evaluation: Phoenix can run LLM-based "Evals" in the background and flag queries it scores as "Unfaithful" or as having "Poor Retrieval."

Comparison: When to use which?

| Use Case | Recommended Tool |
|---|---|
| "I want to know if my RAG is accurate." | Ragas |
| "I want to prevent breaking changes in my code." | DeepEval |
| "I want to see WHY my RAG is failing visually." | Arize Phoenix |

Would you like to know how to use Phoenix to find "Useless Commands" in your JSON (commands that never get retrieved or overlap too much with others)?


Using DeepEval

 DeepEval is often called the "Pytest for LLMs" because it allows you to write evaluation scripts that feel exactly like standard software unit tests.

For your CLI JSON project, DeepEval is particularly useful because it provides Reasoning. If a command fails the test, it will tell you exactly why (e.g., "The model suggested the --force flag, but the JSON context only mentions --recursive").

1. Prerequisites

pip install deepeval


2. The DeepEval Test File (test_cli_rag.py)

This script uses the RAG Triad (Faithfulness, Answer Relevancy, and Contextual Precision) to test your CLI commands.

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, ContextualPrecisionMetric

# 1. Setup the metrics with passing thresholds
# A threshold of 0.7 means the score must reach 0.7 for the unit test to "Pass"
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
precision_metric = ContextualPrecisionMetric(threshold=0.7)

def test_docker_ps_command():
    # --- SIMULATED RAG OUTPUT ---
    # In a real test, you would call your query_engine.query() here
    input_query = "How do I see all my containers, even stopped ones?"
    actual_output = "Use the command 'docker ps -a' to list all containers including stopped ones."
    expected_output = "Run 'docker ps -a' to list all containers, including stopped ones."
    retrieval_context = [
        "Command: docker ps. Description: List running containers. Examples: docker ps -a"
    ]

    # 2. Create the Test Case
    # expected_output is needed by the contextual precision metric
    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context
    )

    # 3. Assert the test with multiple metrics
    assert_test(test_case, [faithfulness_metric, relevancy_metric, precision_metric])

def test_non_existent_command():
    input_query = "How do I hack into NASA?"
    actual_output = "I'm sorry, I don't have information on that."
    retrieval_context = []  # Nothing found in your CLI JSON

    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )

    assert_test(test_case, [relevancy_metric])


3. Running the Test

You run this from your terminal just like a normal Python test:

deepeval test run test_cli_rag.py


4. Why DeepEval is better than Ragas for CLI:

 * The Dashboard: If you run deepeval login, all your results are uploaded to a web dashboard where you can see how your CLI tool's accuracy changes over time as you add more commands to your JSON.

 * Strict Flags: You can create a custom GEval metric in DeepEval specifically to check for "Flag Accuracy", ensuring the LLM never hallucinates a CLI flag that isn't in your documentation (see the sketch after this list).

 * CI/CD Integration: You can block a GitHub Pull Request from merging if the "CLI Accuracy" score drops below 80%.
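
A minimal sketch of such a flag-accuracy check using DeepEval's GEval metric (the criteria wording and the 0.8 threshold are assumptions, not DeepEval defaults):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

flag_accuracy_metric = GEval(
    name="Flag Accuracy",
    criteria=(
        "Check every CLI flag mentioned in the actual output. "
        "Penalize any flag that does not appear in the retrieval context."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,
)

# Use it exactly like the built-in metrics:
# assert_test(test_case, [flag_accuracy_metric])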

Comparison: Ragas vs. DeepEval

| Feature | Ragas | DeepEval |
|---|---|---|
| Primary Use | Research / Bulk Data Eval | Engineering / Unit Testing |
| Output | Raw Scores (0.0 - 1.0) | Pass/Fail + Detailed Reasoning |
| Integration | Pandas / Notebooks | Pytest / GitHub Actions |
| UI | None (requires 3rd party) | Built-in Cloud Dashboard |

Would you like me to show you how to create a "Custom Flag Metric" to ensure the LLM never invents fake CLI arguments?


Thursday, January 29, 2026

Using custom embedding models with LlamaIndex

 To use a custom model with LlamaIndex, you use the Settings object. This acts as a global configuration hub that tells LlamaIndex which "brain" (LLM) and "dictionary" (Embedding Model) to use for all operations.

Since you are working with CLI commands, I recommend using a local embedding model (no API cost and high privacy) and a custom LLM (like a specific Llama 3 variant).

1. Setup for Local Embedding & LLM

First, install the necessary integrations:

pip install llama-index-embeddings-huggingface llama-index-llms-ollama


2. Configuration Code

Here is how you replace the default OpenAI models with custom local ones.

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# 1. Set a Custom Embedding Model (Local from HuggingFace)
# BGE-Small is excellent for technical/CLI text retrieval
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# 2. Set a Custom LLM (Local via Ollama)
# This assumes you have Ollama running locally with 'llama3' pulled
Settings.llm = Ollama(model="llama3", request_timeout=60.0)

# 3. Standard LlamaIndex flow now uses these settings automatically
# index = VectorStoreIndex.from_documents(documents)
# query_engine = index.as_query_engine()


Which Custom Model Should You Choose for CLI?

Depending on your hardware and specific needs, here are the best "custom" matches for your JSON array:

| Type | Recommended Model | Why? |
|---|---|---|
| Embedding | BAAI/bge-base-en-v1.5 | Consistently tops benchmarks for retrieval accuracy. |
| Embedding | nomic-embed-text | Great for long examples and varied CLI syntax. |
| LLM | codellama or llama3 | These models understand code structure and terminal syntax better than generic models. |


Key Tips for Custom Models:

 * Device selection: If you have a GPU, the HuggingFace embedding will automatically try to use it. You can force it by adding device="cuda" or device="mps" (for Mac) inside HuggingFaceEmbedding.

 * Batch Size: For large JSON arrays, you can speed up the "custom" embedding process by setting embed_batch_size=32 in the embedding settings.

 * Persistence: Once you embed your JSON array using a custom model, you must use that same model every time you load the index later. If you change models, the "numbers" won't match, and retrieval will fail.
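
A minimal sketch of saving and reloading the index, assuming the Settings configuration above is applied on every run (the ./cli_index directory name is just an example):

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

# First run: build and save the index (uses Settings.embed_model set above)
# index = VectorStoreIndex.from_documents(documents)
# index.storage_context.persist(persist_dir="./cli_index")

# Later runs: reload without re-embedding.
# IMPORTANT: Settings.embed_model must point at the same model used at build time.
storage_context = StorageContext.from_defaults(persist_dir="./cli_index")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()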

Would you like me to show you how to save (persist) the index to your disk so you don't have to re-embed your CLI JSON every time you run the script?


RAG for technical documentation

 Implementing a RAG (Retrieval-Augmented Generation) system for technical content like CLI commands requires a different approach than standard document RAG. Because CLI commands use specific syntax (e.g., --flags, grep, awk) and jargon, simple semantic search often fails to distinguish between similar-looking commands.

The best approach for your JSON dataset is a Hybrid Retrieval strategy combined with Context-Aware Metadata.

1. The Strategy: Hybrid Retrieval

Technical documentation usually requires two types of matching:

 * Semantic Search (Dense): Understands intent. If a user asks "How do I list hidden files?", the retriever finds ls -a even if the word "hidden" isn't in the description.

 * Keyword Match (Sparse/BM25): Handles exact technical terms. If a user types the specific flag --recursive, you want the retriever to prioritize commands that actually contain that exact string.

Recommended Pipeline:

 * Preprocessing: For each JSON object, create a "searchable string" that combines the command name, description, and examples.

 * Indexing: Use a vector database (like Qdrant, Pinecone, or Weaviate) that supports Hybrid Search.

 * Retrieval: Use Reciprocal Rank Fusion (RRF) to combine the results from the vector search and the keyword search.
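
As a reference for the fusion step, here is a minimal, self-contained sketch of Reciprocal Rank Fusion over two ranked lists (k=60 is the constant commonly used in the literature; the example doc IDs are made up):

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several ranked lists of doc IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# Example: vector search and BM25 partially disagree
vector_hits = ["docker ps", "docker images", "docker stop"]
bm25_hits = ["docker ps -a", "docker ps", "docker rm"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# 'docker ps' wins because it appears near the top of both lists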

2. Preparing Your JSON Data

Don't just embed the description. You need to structure the text to help the embedding model "see" the command.

Original JSON:

{
  "command": "docker ps",
  "description": "List running containers",
  "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]
}

Optimized Chunk for Embedding:

> Command: docker ps
> Description: List running containers.
> Examples: docker ps -a, docker ps --format '{{.ID}}'
> Keywords: docker, ps, list, containers, running, status
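
A minimal helper that flattens one JSON object into such a chunk (a sketch: the to_chunk name and the hand-picked keyword list are illustrative; in practice you might derive keywords from the description or a category field):

def to_chunk(entry, keywords=None):
    """Flatten one CLI JSON object into a single searchable text block."""
    lines = [
        f"Command: {entry['command']}",
        f"Description: {entry['description']}",
        f"Examples: {', '.join(entry['examples'])}",
    ]
    if keywords:
        lines.append(f"Keywords: {', '.join(keywords)}")
    return "\n".join(lines)

entry = {
    "command": "docker ps",
    "description": "List running containers",
    "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"],
}
print(to_chunk(entry, keywords=["docker", "ps", "list", "containers", "running", "status"]))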

3. Implementation Steps

A. Embedding Model Choice

Since you are dealing with code-like structures, use a model trained on technical text or code.

 * Open Source: BAAI/bge-small-en-v1.5 (excellent for technical retrieval).

 * Proprietary: text-embedding-3-small (OpenAI) or text-embedding-004 (Gemini).

B. The Retrieval Logic

Use a "Multi-Vector" approach if your examples are very different from your descriptions:

 * Summary Vector: Embed just the command and description.

 * Example Vector: Embed each example separately but link them to the same parent command ID.
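
One lightweight way to sketch this with plain LlamaIndex Documents, using metadata to link each example back to its parent command (the "kind" field here is an assumption for illustration, not a LlamaIndex convention):

from llama_index.core import Document

def build_multi_vector_docs(entry):
    """One summary document plus one document per example, linked by the command name."""
    docs = [
        Document(
            text=f"Command: {entry['command']}\nDescription: {entry['description']}",
            metadata={"command": entry["command"], "kind": "summary"},
        )
    ]
    for example in entry["examples"]:
        docs.append(
            Document(
                text=f"Example: {example}",
                metadata={"command": entry["command"], "kind": "example"},
            )
        )
    return docs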

C. Handling "Short Query" Problems

CLI users often type short, ambiguous queries like "delete file." To solve this, use Query Expansion (Hypothetical Document Embeddings - HyDE):

 * Ask an LLM: "What is the technical description of the command to delete a file?"

 * Use that expanded description to search your JSON array instead of the user's two-word query.
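
If you are on LlamaIndex, its built-in HyDE query transform covers this pattern. A minimal sketch, assuming an index has already been built as in the earlier snippets:

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Wrap an existing query engine so each query is first expanded into a
# hypothetical answer, which is then embedded and used for retrieval.
base_engine = index.as_query_engine()        # assumes `index` was built earlier
hyde = HyDEQueryTransform(include_original=True)
hyde_engine = TransformQueryEngine(base_engine, query_transform=hyde)

response = hyde_engine.query("delete file")  # short, ambiguous query
print(response)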

4. Why this works for CLI

| Feature | Why it matters |
|---|---|
| Hybrid Search | Ensures git log is found for both "history" (semantic) and "log" (keyword). |
| Metadata Filtering | Allows you to filter by categories (e.g., only search git commands if the user is in a git context). |
| Example Indexing | Often, a user remembers a flag but not the command; indexing examples helps catch these. |

Would you like me to provide a Python code snippet using a specific library (like LangChain or LlamaIndex) to implement this hybrid search?


Query fusion retrieval

 To implement this, I recommend using LlamaIndex with its built-in QueryFusionRetriever. This is a powerful, production-ready way to perform Hybrid Search (BM25 + Vector) and then use Reciprocal Rank Fusion (RRF) to get the most accurate result.

1. Prerequisites

You will need to install the following libraries:

pip install llama-index llama-index-retrievers-bm25 llama-index-embeddings-openai llama-index-vector-stores-qdrant qdrant-client


2. Implementation Code

This script loads your JSON, prepares the documents, and sets up the hybrid retrieval pipeline.

import json
from llama_index.core import Document, VectorStoreIndex, StorageContext
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# 1. Load your JSON Data
cli_data = [
    {
        "command": "docker ps",
        "description": "List running containers",
        "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]
    },
    # ... more commands
]

# 2. Transform JSON to Documents for Indexing
documents = []
for item in cli_data:
    # We combine command, description, and examples into one text block
    # This ensures the model can "see" all parts during search
    content = f"Command: {item['command']}\nDescription: {item['description']}\nExamples: {', '.join(item['examples'])}"

    doc = Document(
        text=content,
        metadata={"command": item['command']}  # Keep original command in metadata
    )
    documents.append(doc)

# 3. Setup Vector Storage (Dense Search)
client = qdrant_client.QdrantClient(location=":memory:")  # Use local memory for this example
vector_store = QdrantVectorStore(client=client, collection_name="cli_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# 4. Initialize Retrievers
# Semantic (Vector) Retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

# Keyword (BM25) Retriever
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5
)

# 5. The Hybrid "Fusion" Retriever
# This combines both and reranks using Reciprocal Rank Fusion (RRF)
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=1,  # Set > 1 if you want the LLM to rewrite the query into multiple variations
    mode="reciprocal_rerank",
    use_async=True
)

# 6. Usage
query = "How do I see all my containers?"
nodes = hybrid_retriever.retrieve(query)

for node in nodes:
    print(f"Score: {node.score:.4f}")
    print(f"Content:\n{node.text}\n")


Why this works for CLI datasets:

 * Contextual Awareness: By putting the command, description, and examples into the document text, the vector search understands that "list containers" relates to docker ps.

 * Precise Flag Matching: If a user searches for a specific flag like -a or --format, the BM25 retriever will catch that exact string, which a standard vector search might ignore as "noise."

 * RRF Ranking: Reciprocal Rank Fusion is great because it doesn't require you to manually "weight" (e.g., 70% vector, 30% keyword). It mathematically finds the items that appear at the top of both lists.

Would you like me to show you how to add an LLM step to this so it generates a natural language explanation of the command it found?


Evaluating using Ragas

 Evaluating a RAG (Retrieval-Augmented Generation) pipeline with Ragas (Retrieval Augmented Generation Assessment) is a smart move. It moves you away from "vibes-based" testing and into actual metrics like Faithfulness, Answer Relevance, and Context Precision.

To get this running, you'll need an "Evaluation Dataset" consisting of Questions, Contexts, Answers, and (optionally) Ground Truths.

Prerequisites

First, install the necessary libraries:

pip install ragas langchain openai


Python Implementation

Here is a concise script to evaluate a set of RAG results using Ragas and OpenAI as the "LLM judge."

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# 1. Setup your API Key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# 2. Prepare your data
# 'contexts' should be a list of lists (strings retrieved from your vector db)
data_samples = {
    'question': ['When was the first iPhone released?', 'Who founded SpaceX?'],
    'answer': ['The first iPhone was released on June 29, 2007.', 'Elon Musk founded SpaceX in 2002.'],
    'contexts': [
        ['Apple Inc. released the first iPhone in mid-2007.', 'Steve Jobs announced it in January.'],
        ['SpaceX was founded by Elon Musk to reduce space transportation costs.']
    ],
    'ground_truth': ['June 29, 2007', 'Elon Musk']
}

dataset = Dataset.from_dict(data_samples)

# 3. Define the metrics you want to track
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
]

# 4. Run the evaluation
score = evaluate(dataset, metrics=metrics)

# 5. Review the results
df = score.to_pandas()
print(df)


Key Metrics Explained

Understanding what these numbers mean is half the battle:

| Metric | What it measures |
|---|---|
| Faithfulness | Is the answer derived only from the retrieved context? (Prevents hallucinations.) |
| Answer Relevancy | Does the answer actually address the user's question? |
| Context Precision | Are the truly relevant chunks ranked higher in your retrieval results? |
| Context Recall | Does the retrieved context actually contain the information needed to answer? |

Pro-Tip: Evaluation without Ground Truth

If you don't have human-annotated ground_truth data yet, you can still run Faithfulness and Answer Relevancy. Ragas is particularly powerful because it uses an LLM to "reason" through whether the retrieved context supports the generated answer.
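
A minimal sketch of that reduced run, reusing the same APIs as the script above but dropping the ground_truth column and the context metrics that rely on it:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Same shape as before, just without a 'ground_truth' column
data_samples = {
    'question': ['When was the first iPhone released?'],
    'answer': ['The first iPhone was released on June 29, 2007.'],
    'contexts': [['Apple Inc. released the first iPhone in mid-2007.']],
}
dataset = Dataset.from_dict(data_samples)

# Only reference-free metrics: no ground truth needed
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(score.to_pandas())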

Would you like me to show you how to integrate this directly with a LangChain or LlamaIndex retriever so you don't have to manually build the dataset?