Saturday, February 15, 2025

What is LLM-as-a-judge and how does it compare to RAGAS?

The idea behind LLM-as-a-judge is simple – provide an LLM with your system's output and the ground-truth answer, and ask it to score the output against some criteria.

The challenge is to get the judge to score according to domain-specific and problem-specific standards.
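
For illustration, a minimal sketch of such a judge (not the exact prompt or model used by the system described here – the model name, 1-5 scale, and rubric wording are assumptions) could look like this in Python, using the OpenAI client:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer produced by a RAG system.
Question: {question}
Ground-truth answer: {ground_truth}
System answer: {answer}

Score the system answer for correctness against the ground truth on a
scale of 1-5, applying these domain-specific criteria: {criteria}
Reply with the numeric score only."""

def judge_correctness(question: str, answer: str, ground_truth: str, criteria: str) -> int:
    """Ask the judge model to score one answer; returns an integer 1-5."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer=answer,
                ground_truth=ground_truth, criteria=criteria),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())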

In other words, we needed to evaluate the evaluators!

First, we ran a sanity test – we used our system to generate answers based on ground truth context, and scored them using the judges.

This test confirmed that both judges behaved as expected: the answers, generated from the actual ground-truth context, scored high – both in absolute terms and relative to the scores obtained when running the full system, including the retrieval phase, on the same questions.
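
A minimal sketch of that sanity check, assuming hypothetical generate_answer and retrieve helpers standing in for the system's generation and retrieval steps, plus the judge_correctness function sketched above:

def mean(xs):
    return sum(xs) / len(xs)

def sanity_check(examples, criteria):
    """Compare judge scores for ground-truth-context answers vs. full-pipeline answers."""
    gt_scores, rag_scores = [], []
    for ex in examples:  # ex: {"question", "ground_truth_context", "ground_truth_answer"}
        # Answer generated directly from the ground-truth context (retrieval bypassed).
        gt_answer = generate_answer(ex["question"], ex["ground_truth_context"])
        gt_scores.append(judge_correctness(
            ex["question"], gt_answer, ex["ground_truth_answer"], criteria))

        # Answer generated by the full pipeline, including retrieval.
        rag_answer = generate_answer(ex["question"], retrieve(ex["question"]))
        rag_scores.append(judge_correctness(
            ex["question"], rag_answer, ex["ground_truth_answer"], criteria))

    # Expectation: ground-truth-context answers score high in absolute terms
    # and at least as high as the full-pipeline answers.
    print(f"ground-truth context: {mean(gt_scores):.2f}, full pipeline: {mean(rag_scores):.2f}")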

Next, we evaluated the correctness score by comparing it to correctness scores assigned by human domain experts.

Our main focus was the correlation between our various LLM-as-a-judge tools and the human-labeled examples, looking at trends rather than absolute score values.

This approach also mitigated another risk – human experts can have a subjective perception of absolute score numbers. So instead of looking at the exact scores they assigned, we focused on the relative ranking of examples.
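
One way to measure this kind of rank agreement rather than absolute values is a Spearman correlation; the sketch below assumes the judge scores and human scores are lists covering the same examples in the same order:

from scipy.stats import spearmanr

def rank_agreement(judge_scores, human_scores):
    """Spearman compares rankings, so differences in how experts calibrate
    absolute numbers do not affect the result."""
    rho, p_value = spearmanr(judge_scores, human_scores)
    return rho, p_value

rho, p = rank_agreement([4, 2, 5, 3, 1], [5, 1, 4, 3, 2])  # toy data
print(f"Spearman rho={rho:.2f} (p={p:.3f})")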

Both RAGAS and our own judge correlated reasonably well with the human scores, with our own judge correlating better, especially in the higher score bands.

The results convinced us that our LLM-as-a-Judge offers a sufficiently reliable mechanism for assessing our system’s quality – both for comparing the quality of system versions to each other in order to make decisions about release candidates, and for finding examples which can indicate systematic quality issues we need to address.

References:
https://www.qodo.ai/blog/evaluating-rag-for-large-scale-codebases/
