Building a RAG (Retrieval-Augmented Generation) evaluation from scratch is actually a great way to deeply understand where your pipeline is failing. While frameworks like Ragas or Arize Phoenix are popular, they are essentially just wrappers for specific prompts and math.
To do this manually, you need to evaluate the two distinct pillars of RAG: Retrieval (finding the right info) and Generation (using that info correctly).
1. The Evaluation Dataset
You can’t evaluate without a "Golden Dataset." Create a spreadsheet with 20–50 rows containing:
* Question: What the user asks.
* Context/Source: The specific document snippet that contains the answer.
* Ground Truth: The ideal, "perfect" answer.
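Here is one way that dataset might look as a plain Python structure (the questions and answers below are made-up placeholders; in practice you'd load them from your spreadsheet or a CSV):

```python
# A tiny golden dataset; in practice, load this from a CSV or spreadsheet.
golden_dataset = [
    {
        "question": "What is the refund window for annual plans?",
        "context": "Annual plans can be refunded within 30 days of purchase.",
        "ground_truth": "Annual plans are refundable within 30 days of purchase.",
    },
    {
        "question": "Which regions does the service support?",
        "context": "The service is currently available in the US, EU, and UK.",
        "ground_truth": "It is available in the US, EU, and UK.",
    },
]
```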
2. Evaluating Retrieval (The "Search" Part)
This measures if your vector database is actually finding the right documents. You don't need an LLM for this; you just need basic math.
* Hit Rate @ K: Did the correct document appear in the top k results?
* Calculation: (Number of queries where the right doc appeared in the top k) / (Total queries).
* Mean Reciprocal Rank (MRR): Measures where the right document appeared. It rewards the system more for having the correct answer at rank 1 than rank 5.
* Formula: MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\text{rank}_i}, where Q is the number of queries and rank_i is the rank of the first correct document for query i.
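Both metrics are a few lines of Python once you know, for each query, the rank at which the correct document was retrieved. A sketch (the function names are mine; ranks are 1-based, and `None` means the correct doc wasn't retrieved at all):

```python
def hit_rate_at_k(ranks, k):
    """Fraction of queries whose correct document appeared in the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank across queries, counting misses as 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)

# Example: correct doc found at rank 1, rank 3, and not found at all.
ranks = [1, 3, None]
print(hit_rate_at_k(ranks, k=5))    # 2/3 ≈ 0.67
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```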
3. Evaluating Generation (The "LLM-as-a-Judge" Method)
Since manual grading is slow, you can use a "Judge LLM" (like GPT-4o or Claude 3.5) to grade your RAG output. You feed the judge a specific prompt for each of these three metrics:
A. Faithfulness (Groundedness)
Does the answer stay true to the retrieved context, or did the LLM hallucinate?
* The Prompt: "Given the following context and the generated answer, list every claim in the answer. For each claim, state if it is supported by the context. Score 1.0 if all claims are supported, 0.0 otherwise."
B. Answer Relevance
Does the answer actually address the user's question?
* The Prompt: "On a scale of 1-5, how relevant is this response to the original user question? Ignore whether the facts are true for now; focus only on whether it addresses the user's intent."
C. Context Precision
Did the retrieval step provide "clean" information, or was it full of noise?
* The Prompt: "Check the retrieved context. Is this information actually necessary to answer the user's question? Rate 1 for useful, 0 for irrelevant."
4. Simple Python Implementation Structure
You don't need a library; a simple loop will do:
```python
results = []

for item in golden_dataset:
    # 1. Run your RAG pipeline
    retrieved_docs = retriever.get_relevant_documents(item['question'])
    response = rag_chain.invoke(item['question'])

    # 2. Manual/LLM scoring
    score = call_judge_llm(
        system_prompt="You are a grader...",
        user_content=f"Question: {item['question']}\nContext: {retrieved_docs}\nAnswer: {response}"
    )
    results.append({"question": item['question'], "score": score})

# 3. Calculate the mean score
final_grade = sum(r['score'] for r in results) / len(results)
```
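The loop above assumes a `call_judge_llm` helper. Here is one possible sketch using the OpenAI Python SDK; the model name, the "reply with a single number" convention, and the regex-based parsing are assumptions you should adapt to whatever format your judge prompt actually requests:

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_judge_llm(system_prompt, user_content, model="gpt-4o"):
    """Send the grading prompt to the judge model and parse a numeric score."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt + " Reply with a single number only."},
            {"role": "user", "content": user_content},
        ],
    )
    text = completion.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0
```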
Summary Table: What to Track
| Metric | What it tests | Key question |
|---|---|---|
| Context Recall | Retrieval | Is the ground truth present in the chunks? |
| Faithfulness | Generation | Did the LLM make things up? |
| Answer Similarity | Generation | How close is the answer to the Ground Truth? (Use Semantic Similarity) |
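For Answer Similarity, a common approach is to embed both the generated answer and the ground truth and compare them with cosine similarity. A sketch using sentence-transformers (the model name is just one reasonable default, not a requirement):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; this one is small and fast.
model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_similarity(generated_answer, ground_truth):
    """Cosine similarity between the two embeddings (higher means closer)."""
    embeddings = model.encode([generated_answer, ground_truth])
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```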
Would you like me to write a specific "Judge Prompt" you can use to grade your RAG's faithfulness?