To integrate RAGAS evaluation into this pipeline we need two things from it: the retrieved contexts and the generated output.
We already have the generated output; it is what we printed above.
When initializing our AgentExecutor object we included return_intermediate_steps=True. This (unsurprisingly) returns the intermediate steps the agent took to generate the final answer, including the response from our arxiv_search tool, which we can use to evaluate the retrieval portion of our pipeline with RAGAS.
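As a reminder, that flag is set when the executor is constructed. A minimal sketch of what that setup looks like (the agent and tools objects are assumed from earlier in this post, so treat the exact arguments as illustrative):
from langchain.agents import AgentExecutor

# sketch only: `agent` and `tools` are the objects built earlier in the post
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,  # keep tool calls and their outputs in the response
    verbose=True
)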
We extract the contexts themselves like so:
print(out["intermediate_steps"][0][1])
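The arxiv_search tool returns all retrieved records as a single string, with a "\n---\n" separator between records, so splitting on that separator gives us a list of individual contexts:
contexts = out["intermediate_steps"][0][1].split("\n---\n")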
To evaluate with RAGAS we need a dataset containing questions, ideal contexts, and the ground truth answers to those questions.
from datasets import load_dataset

ragas_data = load_dataset("aurelio-ai/ai-arxiv2-ragas-mixtral", split="train")
ragas_data
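Each record contains a question field and a ground_truth field (the two fields we use below); we can peek at the first row to confirm, for example:
ragas_data[0]["question"], ragas_data[0]["ground_truth"]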
We first iterate through the questions in the evaluation dataset and put each one to our agent.
import pandas as pd
from tqdm.auto import tqdm
df = pd.DataFrame({
    "question": [],
    "contexts": [],
    "answer": [],
    "ground_truth": []
})
limit = 5
for i, row in tqdm(enumerate(ragas_data), total=limit):
    if i >= limit:
        break
    question = row["question"]
    ground_truths = row["ground_truth"]
    try:
        # ask the agent the evaluation question
        out = chat(question)
        answer = out["output"]
        if len(out["intermediate_steps"]) != 0:
            # the agent used the arxiv_search tool, so split its output
            # into the individual retrieved contexts
            contexts = out["intermediate_steps"][0][1].split("\n---\n")
        else:
            # the agent answered directly, no intermediate steps were used
            contexts = []
    except ValueError:
        # if the agent call fails we record an error placeholder
        answer = "ERROR"
        contexts = []
    df = pd.concat([df, pd.DataFrame({
        "question": question,
        "answer": answer,
        "contexts": [contexts],
        "ground_truth": ground_truths
    })], ignore_index=True)
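Before handing this to RAGAS it is worth a quick sanity check that the dataframe has the expected columns and rows:
df.head()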
from datasets import Dataset
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_relevancy,
context_recall,
answer_similarity,
answer_correctness,
)
eval_data = Dataset.from_pandas(df)
eval_data
from ragas import evaluate
result = evaluate(
dataset=eval_data,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_relevancy,
context_recall,
answer_similarity,
answer_correctness,
],
)
result = result.to_pandas()
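result is now a dataframe with one row per question, containing the original fields alongside the metric scores. A quick way to summarise the run is to average the metric columns; the column names below are assumed to match the metric names:
# mean score per metric across the evaluated questions
result[[
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_relevancy",
    "context_recall",
    "answer_similarity",
    "answer_correctness",
]].mean()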
References:
https://github.com/pinecone-io/examples/blob/master/learn/generation/better-rag/03-ragas-evaluation.ipynb