Sunday, November 10, 2024

Integrating RAGAS - Part 1

To integrate RAGAS evaluation into this pipeline we need two things from it: the retrieved contexts and the generated output.

We already have the generated output; it is what we're printing above.

When initializing our AgentExecutor object we included return_intermediate_steps=True, which (unsurprisingly) returns the intermediate steps the agent took to generate the final answer. Those steps include the response from our arxiv_search tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.
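For reference, here is a minimal sketch of how that flag is set when constructing the executor; the agent and tools variables are assumptions standing in for objects built earlier in the pipeline:

from langchain.agents import AgentExecutor

# minimal sketch: `agent` and `tools` (including arxiv_search) are assumed
# to be defined earlier in the pipeline
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,  # expose each tool call and its output
    verbose=False,
)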

We extract the contexts themselves like so:

print(out["intermediate_steps"][0][1])

To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the ground truth answers to those questions.

from datasets import load_dataset

ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")
ragas_data

We first iterate through the questions in the evaluation dataset and ask each of them to our agent.

import pandas as pd
from tqdm.auto import tqdm

df = pd.DataFrame({
    "question": [],
    "contexts": [],
    "answer": [],
    "ground_truth": []
})

limit = 5

for i, row in tqdm(enumerate(ragas_data), total=limit):
    if i >= limit:
        break
    question = row["question"]
    ground_truths = row["ground_truth"]
    try:
        out = chat(question)
        answer = out["output"]
        if len(out["intermediate_steps"]) != 0:
            # the tool output joins retrieved contexts with "---" separators
            contexts = out["intermediate_steps"][0][1].split("\n---\n")
        else:
            # the agent answered directly without calling a tool
            contexts = []
    except ValueError:
        # mark failed generations so they can be filtered out later
        answer = "ERROR"
        contexts = []
    df = pd.concat([df, pd.DataFrame({
        "question": question,
        "answer": answer,
        "contexts": [contexts],
        "ground_truth": ground_truths
    })], ignore_index=True)
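The chat helper used in the loop is assumed to wrap the agent executor; a minimal sketch of what it might look like:

def chat(question: str) -> dict:
    # minimal sketch, assuming the agent_executor from earlier; with
    # return_intermediate_steps=True the returned dict contains both
    # "output" and "intermediate_steps"
    return agent_executor.invoke({"input": question})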





from datasets import Dataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_relevancy,
    context_recall,
    answer_similarity,
    answer_correctness,
)


eval_data = Dataset.from_pandas(df)
eval_data


from ragas import evaluate

result = evaluate(
    dataset=eval_data,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_relevancy,
        context_recall,
        answer_similarity,
        answer_correctness,
    ],
)
result = result.to_pandas()
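With the scores in a DataFrame we can, for example, average each metric across the evaluated questions (plain pandas, nothing RAGAS-specific):

# mean of each metric column across the evaluated questions
print(result.mean(numeric_only=True))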

References:

https://github.com/pinecone-io/examples/blob/master/learn/generation/better-rag/03-ragas-evaluation.ipynb
