Tuesday, June 25, 2024

Comparison Evaluators in LangChain

Comparison evaluators in LangChain are designed to compare the outputs of two different chains or LLMs. These evaluators are useful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also be used for tasks like generating preference scores for AI-assisted reinforcement learning.

These evaluators inherit from the PairwiseStringEvaluator class, which provides a comparison interface for two strings - typically the outputs of two different prompts or models, or two versions of the same model. In essence, a comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details. The interface exposes the following methods and properties; a minimal custom-evaluator sketch follows the list.

evaluate_string_pairs: Evaluate the output string pairs. This method should be overridden when creating custom evaluators.

aevaluate_string_pairs: Asynchronously evaluate the output string pairs. This method should be overridden for asynchronous evaluation.

requires_input: This property indicates whether this evaluator requires an input string.

requires_reference: This property specifies whether this evaluator requires a reference label.
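
To make the interface concrete, here is a minimal sketch of a custom comparison evaluator. It assumes PairwiseStringEvaluator can be imported from langchain.evaluation and that, as in LangChain's custom-evaluator guide, the underscore-prefixed _evaluate_string_pairs hook is what you implement (the public evaluate_string_pairs method delegates to it). The scoring rule here, preferring the more concise answer, is purely for demonstration.

from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator


class ShorterAnswerPairwiseEvaluator(PairwiseStringEvaluator):
    """Toy comparison evaluator that prefers the more concise answer."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # score is 1 when the first prediction wins, 0 when prediction_b wins.
        score = int(len(prediction.split()) < len(prediction_b.split()))
        return {"score": score}


evaluator = ShorterAnswerPairwiseEvaluator()
print(evaluator.evaluate_string_pairs(
    prediction="there are three dogs in the park",
    prediction_b="4",
))
# -> {'score': 0}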

Often you will want to compare the predictions of an LLM, Chain, or Agent for a given input. The pairwise string comparison evaluators facilitate this so you can answer questions like:

Which LLM or prompt produces a preferred output for a given question?

Which examples should I include for few-shot example selection?

Which output is better to include for fine-tuning?

Below is sample code for this:

from langchain.evaluation import load_evaluator


def pairwise_comparison():
    # "labeled_pairwise_string" uses an LLM as the judge (GPT-4 by default) and
    # requires a reference answer, so an OpenAI API key must be configured.
    evaluator = load_evaluator("labeled_pairwise_string")
    result = evaluator.evaluate_string_pairs(
        prediction="there are three dogs",
        prediction_b="4",
        input="how many dogs are in the park?",
        reference="four",
    )
    print("Evaluation result is", result)


pairwise_comparison()


The output looks something like this:

Evaluation result is  {'reasoning': "Both Assistant A and Assistant B provided direct answers to the user's question. However, Assistant A's response is incorrect as it stated there are three dogs in the park, while the user's question indicated there are four. On the other hand, Assistant B correctly answered the user's question by stating there are four dogs in the park. Therefore, Assistant B's response is more accurate and relevant to the user's question. \n\nFinal Verdict: [[B]]", 'value': 'B', 'score': 0}
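
In this result, score is 1 when the first prediction is preferred and 0 when prediction_b is preferred (as here), and value names the winning side. If you do not have reference labels, the same guide also covers an unlabeled variant; a minimal sketch, assuming the "pairwise_string" evaluator name, looks like this:

from langchain.evaluation import load_evaluator

# The unlabeled variant needs no reference answer; the judge LLM grades the
# two outputs against the input question alone (an OpenAI key is still
# required for the default judge model).
evaluator = load_evaluator("pairwise_string")

result = evaluator.evaluate_string_pairs(
    prediction="there are three dogs",
    prediction_b="4",
    input="how many dogs are in the park?",
)
print(result)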


References:

https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/comparison/
