Saturday, June 21, 2025

What is Gecko Evaluator?

The world of generative AI is moving fast, with models like Lyria, Imagen, and Veo now capable of producing stunningly realistic and imaginative images and videos from simple text prompts. However, evaluating these models remains a significant challenge. Traditional human evaluation, while the gold standard, can be slow and costly, hindering rapid development cycles.

To address this, we're thrilled to introduce Gecko, now available through Google Cloud’s Vertex AI Evaluation Service. Gecko is a rubric-based, interpretable auto-rater for generative AI models that gives developers a more nuanced, customizable, and transparent way to assess the performance of image and video generation models.

The challenge of evaluating generative models with auto-raters

Creating useful, performant auto-raters becomes more challenging as the quality of generation improves dramatically. While specialized models can be efficient, they lack the interpretability developers need to understand model behavior and pinpoint areas for improvement. For instance, when evaluating how accurately a generated image depicts a prompt, a single score doesn't reveal why a model succeeded or failed.

Gecko is a fine-grained, interpretable, and customizable auto-rater. This Google DeepMind research paper shows that such an auto-rater can reliably evaluate image and video generation across a range of skills, reducing the dependency on costly human judgment. Notably, beyond its interpretability, Gecko exhibits strong performance and has already been instrumental in benchmarking the progress of leading models like Imagen.

Gecko makes evaluation interpretable with its clear, step-by-step, rubric-based approach. Let’s walk through an example and use Gecko to evaluate generated media for the prompt ‘a steaming cup of coffee and a croissant on a table’.

Step 1: Semantic prompt decomposition

Gecko leverages a Gemini model to first break down the input text prompt into key semantic elements that need to be verified in the generated media. This includes identifying entities, their attributes, and the relationships between them.

For the running example, the prompt is broken down into keywords: Steaming, cup of coffee, croissant, table.
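
Purely for illustration, the decomposition can be pictured as one small record per keyword; the exact structure depends on the rubric generation prompt you configure, so the field names below are hypothetical.

# Hypothetical sketch of the Step 1 output: the prompt split into semantic
# elements (entities and attributes) to verify in the generated media.
prompt = "a steaming cup of coffee and a croissant on a table"

decomposition = [
    {"keyword": "steaming", "type": "attribute"},
    {"keyword": "cup of coffee", "type": "entity"},
    {"keyword": "croissant", "type": "entity"},
    {"keyword": "table", "type": "entity"},
]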

Step 2: Question generation

Based on the decomposed prompt, the Gemini model then generates a series of question-answer pairs. These questions are specifically designed to probe the generated image or video for the presence and accuracy of the identified elements and relationships. Optionally, Gemini can provide justifications for why a particular answer is correct, further enhancing transparency.

Let’s take a look at the running example and generate question-answer pairs for each keyword. For the keyword Steaming, the generated question is ‘Is the cup of coffee steaming?’ with answer choices [‘yes’, ‘no’] and the ground-truth answer ‘yes’.
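
As a hypothetical sketch, the Step 2 output can be represented as a list of question-answer records like the one below; the field names are illustrative, since the actual format is determined by your parsing function.

# Hypothetical sketch of a Step 2 question-answer record for one keyword.
qa_records = [
    {
        "keyword": "steaming",
        "question": "Is the cup of coffee steaming?",
        "choices": ["yes", "no"],
        "answer": "yes",
        "justification": "The prompt asks for a steaming cup of coffee.",
    },
    # ... one record per remaining keyword
]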

Step 3: Scoring

Finally, the Gemini model scores the generated media against each question-answer pair. These individual scores are then aggregated to produce a final evaluation score.

For the running example, every question was answered correctly, giving a perfect final score.
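
To make the aggregation concrete, here is a minimal sketch of Step 3, assuming a simple average of per-question binary verdicts; the actual aggregation is handled by the scoring prompt and parsing function you configure.

# Minimal sketch of Step 3: one binary verdict per question, averaged
# into a final score (1.0 here, since every element was verified).
verdicts = {"steaming": 1, "cup of coffee": 1, "croissant": 1, "table": 1}
final_score = sum(verdicts.values()) / len(verdicts)
print(final_score)  # 1.0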

Evaluate with Gecko on Vertex AI

Gecko is now available via the Gen AI Evaluation Service in Vertex AI, empowering you to evaluate image or video generative models. Here's how you can get started with Gecko evaluation for images and videos on Vertex AI:
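
If you haven't already, install the Vertex AI SDK and initialize it for your Google Cloud project; the project ID and location below are placeholders to replace with your own values.

# Install the Vertex AI SDK if needed:
#   pip install --upgrade google-cloud-aiplatform

import vertexai

# Placeholder project ID and region; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")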

First, you'll need to set up configurations for both rubric generation and rubric validation.


# Import paths below assume the preview Gen AI Evaluation SDK bundled with
# google-cloud-aiplatform; adjust them for your SDK version.
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    CustomOutputConfig,
    PointwiseMetric,
    RubricBasedMetric,
    RubricGenerationConfig,
)

# Rubric Generation: turn each prompt into question-answer rubrics.
# RUBRIC_GENERATION_PROMPT and parse_json_to_qa_records are your own prompt
# template and parsing helper.
rubric_generation_config = RubricGenerationConfig(
    prompt_template=RUBRIC_GENERATION_PROMPT,
    parsing_fn=parse_json_to_qa_records,
)

# Rubric Validation: score the generated media against each rubric question.
pointwise_metric = PointwiseMetric(
    metric="gecko_metric",
    metric_prompt_template=RUBRIC_VALIDATOR_PROMPT,
    custom_output_config=CustomOutputConfig(
        return_raw_output=True,
        parsing_fn=parse_rubric_results,
    ),
)

# Rubric Metric: combine rubric generation and validation into one metric.
rubric_based_gecko = RubricBasedMetric(
    generation_config=rubric_generation_config,
    critique_metric=pointwise_metric,
)


Next, prepare your dataset for evaluation. This involves creating a Pandas DataFrame with columns for your prompts and the corresponding generated images or videos.


import pandas as pd

prompts = [
    "steaming cup of coffee and a croissant on a table",
    "steaming cup of coffee and toast in a cafe",
    # ... more prompts
]

# Each response is a JSON content record pointing at the generated media
# file in Cloud Storage.
images = [
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    # ... more image URIs
]

eval_dataset = pd.DataFrame(
    {
        "prompt": prompts,
        "image": images,  # or "video": videos for video evaluation
    }
)


Now, you can generate the rubrics based on your prompts using the configured rubric_based_gecko metric.
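
The snippet below is a sketch that assumes the RubricBasedMetric exposes a generate_rubrics helper, as in the preview Gen AI Evaluation SDK; it returns a copy of the dataset augmented with the generated question-answer rubrics.

# Generate question-answer rubrics for every prompt in the dataset.
# Assumes RubricBasedMetric.generate_rubrics() from the preview SDK; the
# exact method name may differ in your SDK version.
dataset_with_rubrics = rubric_based_gecko.generate_rubrics(eval_dataset)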


Finally, run the evaluation using the generated rubrics and your dataset. The evaluate method of EvalTask will use the rubric validator to score the generated content.


eval_task = EvalTask(
    dataset=dataset_with_rubrics,
    metrics=[rubric_based_gecko],
)

eval_result = eval_task.evaluate(response_column_name="image")  # or "video"
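
Once the run completes, you can inspect the aggregated score and the per-prompt rubric verdicts. The attribute names below (summary_metrics and metrics_table) follow the EvalResult object returned by the Gen AI Evaluation SDK, so treat this as a sketch rather than the exact API.

# Aggregated scores across the whole dataset.
print(eval_result.summary_metrics)

# Per-prompt results, including each rubric question and its verdict.
print(eval_result.metrics_table.head())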


