MTEB [1] is a multi-task and multi-language comparison of embedding models. It comes in the form of a leaderboard based on multiple scores, and only one model stands at the top! Does that make it easy to choose the right model for your application? You wish! This guide is an attempt to provide tips on how to make clever use of MTEB. As our team worked on making the French benchmark available [2], the examples rely on the French MTEB. Nonetheless, these tips apply to the entire benchmark.
MTEB is a leaderboard. It shows you scores. What it doesn't show you? Significance.
While it is a great resource for discovering and comparing models, MTEB might not be as straightforward to use as one might expect. As of today (1st of March 2024), many SOTA models have been tested, and most of them display close average scores. For the French MTEB, these average scores are computed over 26 different tasks (and 56 for the English MTEB!), and no standard deviation comes with them. Even though the top model looks better than the others, its score difference with the models that come after it might not be significant. One can download the raw per-dataset results and compute statistical metrics directly. As an example, we performed critical difference tests and found that, at a p-value of 0.05, the current 9 top models in the French MTEB leaderboard are statistically equivalent. It would take even more datasets to tell them apart with statistical confidence.
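To make this concrete, here is a minimal sketch of how one could run such tests on per-dataset scores. It assumes you have already assembled the raw scores into a models-by-datasets table (the file name below is a placeholder), and it uses a Friedman test followed by pairwise Wilcoxon signed-rank tests as a simple stand-in for a full critical-difference analysis.

```python
# Sketch: significance testing on per-dataset MTEB scores.
# Assumes a CSV of shape (models x datasets) built from the raw results;
# the file name is a placeholder.
import itertools

import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon

# rows = models, columns = the 26 French MTEB datasets
scores = pd.read_csv("french_mteb_per_dataset_scores.csv", index_col=0)

# Global test: are the models' score distributions distinguishable at all?
stat, p_value = friedmanchisquare(*[scores.loc[m] for m in scores.index])
print(f"Friedman test: statistic={stat:.3f}, p={p_value:.4f}")

# Pairwise follow-up: Wilcoxon signed-rank test on each pair of models.
# (A proper critical-difference analysis would add a multiple-comparison
# correction such as Holm or Nemenyi.)
for model_a, model_b in itertools.combinations(scores.index, 2):
    _, p = wilcoxon(scores.loc[model_a], scores.loc[model_b])
    verdict = "different" if p < 0.05 else "statistically equivalent"
    print(f"{model_a} vs {model_b}: p={p:.3f} -> {verdict}")
```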
Dive into data
Do not just look at the average scores of models on the task you are interested in. Instead, look at the individual scores on the datasets that best represent your use case.
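As a rough illustration, here is how one might re-rank models on only the datasets that match a given use case, assuming the same per-dataset score table as above. The dataset names are examples of French MTEB retrieval tasks; use the actual task names from the raw results you downloaded.

```python
# Sketch: re-rank models on the datasets closest to your use case.
# The CSV path and dataset names are illustrative; adapt them to your data.
import pandas as pd

scores = pd.read_csv("french_mteb_per_dataset_scores.csv", index_col=0)

# Say your application is document retrieval: average only over the
# retrieval-style datasets instead of over all 26 tasks.
relevant_datasets = ["SyntecRetrieval", "AlloprofRetrieval", "BSARDRetrieval"]

use_case_ranking = (
    scores[relevant_datasets]
    .mean(axis=1)               # average only over the selected datasets
    .sort_values(ascending=False)
)
print(use_case_ranking)  # may differ noticeably from the global leaderboard order
```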
Consider the model's characteristics
Using the model displaying the best average score for your application might be tempting. However, a model comes with its own characteristics (embedding dimension, maximum input length, size, license), and those translate into usage constraints. Make sure these constraints match yours.
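For example, before committing to a candidate you can quickly inspect some of those characteristics with sentence-transformers. The model name below is only an example; swap in the one you are evaluating.

```python
# Sketch: inspect practical constraints of a candidate model before adopting it.
# The model name is an example, not a recommendation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

embedding_dim = model.get_sentence_embedding_dimension()   # storage / index size
max_seq_length = model.max_seq_length                      # longer inputs get truncated
n_params = sum(p.numel() for p in model.parameters())      # rough memory / latency proxy

print(f"Embedding dimension : {embedding_dim}")
print(f"Max sequence length : {max_seq_length}")
print(f"Parameters          : {n_params / 1e6:.0f}M")
```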
Do not forget MTEB is a leaderboard...
And as leaderboards sometimes do, it encourages competing without following the rules.
Indeed, keep in mind that many providers want to see their models at the top of that list and, since MTEB is based on public datasets, malpractices such as data leakage or overfitting on the test data could bias the evaluation.
References:
https://huggingface.co/blog/lyon-nlp-group/mteb-leaderboard-best-practices