Thursday, January 29, 2026

Evaluating RAG Pipelines Using Ragas

Evaluating a RAG (Retrieval-Augmented Generation) pipeline with Ragas (Retrieval Augmented Generation Assessment) is a smart move: it replaces "vibes-based" testing with actual metrics like Faithfulness, Answer Relevancy, and Context Precision.

To get this running, you'll need an evaluation dataset consisting of questions, retrieved contexts, generated answers, and (optionally) ground truths.

Prerequisites

First, install the necessary libraries:

pip install ragas datasets langchain openai


Python Implementation

Here is a concise script to evaluate a set of RAG results using Ragas and OpenAI as the "LLM judge."

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# 1. Set up your API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# 2. Prepare your data
# 'contexts' should be a list of lists (strings retrieved from your vector db)
data_samples = {
    'question': ['When was the first iPhone released?', 'Who founded SpaceX?'],
    'answer': ['The first iPhone was released on June 29, 2007.', 'Elon Musk founded SpaceX in 2002.'],
    'contexts': [
        ['Apple Inc. released the first iPhone in mid-2007.', 'Steve Jobs announced it in January.'],
        ['SpaceX was founded by Elon Musk to reduce space transportation costs.']
    ],
    'ground_truth': ['June 29, 2007', 'Elon Musk']
}

dataset = Dataset.from_dict(data_samples)

# 3. Define the metrics you want to track
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
]

# 4. Run the evaluation
score = evaluate(dataset, metrics=metrics)

# 5. Review the results
df = score.to_pandas()
print(df)
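
The resulting DataFrame has one row per question, with a column for each metric alongside the original fields. Here's a quick follow-up sketch for summarizing and saving it; the column names below assume they match the metric names used above (adjust if your Ragas version differs), and the CSV filename is just an example:

# Average each metric across all questions
metric_cols = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
print(df[metric_cols].mean())

# Keep the per-question breakdown for later inspection (filename is arbitrary)
df.to_csv('ragas_results.csv', index=False)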


Key Metrics Explained

Understanding what these numbers mean is half the battle:

| Metric | What it measures |
|---|---|
| Faithfulness | Is the answer derived only from the retrieved context? (Prevents hallucinations.) |
| Answer Relevancy | Does the answer actually address the user's question? |
| Context Precision | Are the truly relevant chunks ranked higher in your retrieval results? |
| Context Recall | Does the retrieved context actually contain the information needed to answer? |
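
One practical way to use these scores is to flag rows that fall below a cutoff. Here's a minimal sketch reusing the df from the script above; the 0.7 threshold is an arbitrary illustration, not a Ragas default:

# Flag potential hallucinations: answers weakly grounded in the retrieved context
# (0.7 is an illustrative threshold -- tune it for your own pipeline)
low_faithfulness = df[df['faithfulness'] < 0.7]
print(f"{len(low_faithfulness)} of {len(df)} answers look weakly grounded")
print(low_faithfulness[['question', 'faithfulness']])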

Pro-Tip: Evaluation without Ground Truth

If you don't have human-annotated ground_truth data yet, you can still run Faithfulness and Answer Relevancy; Context Precision and Context Recall compare against a reference, so drop them from the metrics list. Ragas is particularly powerful here because it uses an LLM to "reason" through whether the retrieved context supports the generated answer.
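
Here's a minimal sketch of such a reference-free run, reusing the imports from the script above (the variable names are just for illustration):

# Reference-free evaluation: only question, answer, and contexts are required
no_gt_samples = {
    'question': ['When was the first iPhone released?'],
    'answer': ['The first iPhone was released on June 29, 2007.'],
    'contexts': [['Apple Inc. released the first iPhone in mid-2007.']],
}
no_gt_dataset = Dataset.from_dict(no_gt_samples)

# Context Precision and Context Recall are omitted because they need a reference
score_no_gt = evaluate(no_gt_dataset, metrics=[faithfulness, answer_relevancy])
print(score_no_gt)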

If you'd rather not build the dataset by hand, Ragas also ships integrations for LangChain and LlamaIndex that can evaluate results pulled straight from your retriever or query engine.

