RAGAS covers a number of key metrics useful in LLM evaluation, including answer correctness (later renamed “factual correctness”) and context quality, via the context precision and context recall metrics.
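As a rough illustration, the sketch below scores a single sample against those metrics. It assumes the ragas 0.1-style API and its documented dataset column names, and an LLM/embeddings backend (e.g. an OpenAI key) must be configured for the metrics to actually run.

```python
# Minimal sketch: scoring one QA sample with RAGAS (0.1-style API).
# The sample data here is illustrative, not from the original article.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall

samples = Dataset.from_dict({
    "question": ["How does the service authenticate requests?"],
    "answer": ["Requests are authenticated with a signed JWT passed in the Authorization header."],
    "contexts": [["The authentication middleware validates a JWT taken from the Authorization header."]],
    "ground_truth": ["The API validates a JWT supplied in the Authorization header."],
})

result = evaluate(samples, metrics=[answer_correctness, context_precision, context_recall])
print(result)  # per-metric scores in [0, 1]
```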
RAGAS implements its correctness check by converting both the generated answer and the ground truth (reference) into a set of simplified statements.
The score is essentially a grade for the overlap between the statements derived from the reference and those derived from the generated answer, combined with a weighted term for the overall semantic similarity between the two answers.
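A rough model of that scoring idea (not the RAGAS implementation; the weights and helper function here are illustrative) might look like this:

```python
# Illustrative sketch: classify the simplified statements into overlaps and
# misses, take an F1-style grade, then blend in an answer-similarity term.
def correctness_score(tp: int, fp: int, fn: int,
                      answer_similarity: float,
                      w_overlap: float = 0.75, w_similarity: float = 0.25) -> float:
    # tp: statements present in both the answer and the reference
    # fp: statements only in the generated answer
    # fn: reference statements the answer missed
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return w_overlap * f1 + w_similarity * answer_similarity

# A short answer that misses one reference fact (tp=2, fn=1) is penalized
# noticeably even when its overall similarity to the reference is high:
print(correctness_score(tp=2, fp=0, fn=1, answer_similarity=0.9))  # ~0.83
```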
When eyeballing the scores RAGAS generated, we noticed two recurring issues:
For relatively short answers, every small “missed fact” resulted in a significant penalty.
When one answer was more detailed than the other, the correctness score suffered greatly, even though both answers were valid and even useful.
The latter issue was common enough, and misaligned enough with our intent for the correctness metric, that we needed a way to evaluate the “essence” of the answers as well as the details.
References:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/