Late Interaction models are a class of models evaluated in MTEB (Massive Text Embedding Benchmark) that differ significantly from traditional bi-encoder models. Instead of collapsing each sentence or passage into a single fixed-length embedding and comparing those pooled vectors, a Late Interaction model keeps one embedding per token and defers the comparison to scoring time, where a fine-grained, token-level interaction between the two texts produces the final similarity score.
Here's a breakdown:
Late Interaction Models:
Token-Level Interactions:
Each text is still encoded independently, but into a sequence of token (or subword) embeddings rather than a single pooled vector.
The interaction then happens "late", at scoring time, where individual tokens of the two texts are compared directly, which captures more nuanced relationships between the words than a single vector comparison can.
Increased Accuracy:
By comparing texts at this granular, token level, Late Interaction models often achieve higher accuracy on tasks like retrieval and semantic textual similarity (STS) than single-vector bi-encoders.
Computational Cost:
The trade-off is cost: storing one embedding per token inflates the index, and scoring requires a token-by-token comparison rather than a single vector dot product. Document token embeddings can still be pre-computed and stored, so these models remain far cheaper at query time than cross-encoders, but they are heavier than plain bi-encoders for large-scale similarity search.
Example Architectures:
ColBERT (and successors such as ColBERTv2) is the canonical example: it encodes query and document separately into token embeddings and scores them with the MaxSim operation described below. Cross-encoders, by contrast, process the concatenated pair of texts jointly and form a separate, even more expensive family. A concrete sketch of the encoding difference follows below.
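To make the contrast concrete, here is a minimal Python/NumPy sketch. The encoders below are stand-in random functions (not a real model), used only to illustrate the output shapes and the fact that document token embeddings can be produced offline:

```python
import numpy as np

# Minimal sketch with stand-in random "encoders"; a real late-interaction
# model (e.g. ColBERT) would produce these embeddings with a transformer.
DIM = 128

def bi_encode(text: str) -> np.ndarray:
    """Bi-encoder style: one pooled vector per text, shape (DIM,)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def late_interaction_encode(text: str) -> np.ndarray:
    """Late-interaction style: one vector per token, shape (num_tokens, DIM)."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    mat = rng.standard_normal((len(tokens), DIM))
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

query = "what is late interaction"
doc = "late interaction models keep one embedding per token"

# Bi-encoder: a single dot product between two pooled vectors.
bi_score = float(bi_encode(query) @ bi_encode(doc))

# Late interaction: the document matrix can be pre-computed and stored offline;
# only the query is encoded at search time, then scored token-by-token (MaxSim).
D = late_interaction_encode(doc)    # (num_doc_tokens, DIM)
Q = late_interaction_encode(query)  # (num_query_tokens, DIM)
print(bi_score, Q.shape, D.shape)
```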
MaxSim Operation:
The MaxSim operation is the scoring step used by ColBERT-style Late Interaction models to compare the two sets of token embeddings. It is designed to find, for each query token, the document token it matches best. Here's how it works:
Pairwise Similarity:
Given the query token embeddings Q = {q_1, ..., q_n} and the document token embeddings D = {d_1, ..., d_m}, MaxSim first computes the similarity between every pair (q_i, d_j).
The similarity metric is typically cosine similarity (equivalently, a dot product over L2-normalized embeddings).
Maximum Similarity:
For each query token q_i, only the highest similarity with any document token is kept: max_j sim(q_i, d_j).
Note that the standard (ColBERT) formulation is asymmetric: maxima are taken over document tokens for each query token, not the other way around.
Aggregation:
The per-query-token maxima are then aggregated to produce the final query-document score; the original ColBERT formulation sums them (some variants average instead): score(Q, D) = sum_i max_j sim(q_i, d_j).
In essence:
MaxSim matches each query token to its best counterpart in the document and lets those best matches determine the overall similarity. This is particularly useful when a query and a passage share overlapping but not identical vocabulary.
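Putting the three steps together, here is a minimal NumPy sketch of ColBERT-style MaxSim; it assumes the token embeddings are already L2-normalized, so a dot product equals cosine similarity:

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """ColBERT-style MaxSim between query and document token embeddings.

    Q: (num_query_tokens, dim) L2-normalized query token embeddings.
    D: (num_doc_tokens, dim)   L2-normalized document token embeddings.
    """
    sim = Q @ D.T                             # step 1: all pairwise cosine similarities
    best_per_query_token = sim.max(axis=1)    # step 2: best document token for each query token
    return float(best_per_query_token.sum())  # step 3: aggregate (ColBERT sums; some variants average)
```

The asymmetry is visible in the code: the max runs along the document axis only.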
Why MaxSim?
Captures Local Similarity:
It rewards strong local matches between individual tokens, even when pooled whole-text embeddings of the two texts would not be very similar.
Robust to Word Order Variations:
It is somewhat robust to word order variations: the maximum is taken over all document tokens regardless of position, so reordering the stored document token embeddings does not change the score (see the short demonstration after this list). The encoder that produced the embeddings is still position-aware, however.
Improved Accuracy:
In some cases it has been shown to improve accuracy over computing a single cosine similarity between pooled whole-text embeddings.
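A quick sanity check of the word-order point, reusing maxsim_score from the sketch above with random vectors standing in for real token embeddings:

```python
# Permuting the stored document token embeddings leaves the score unchanged,
# since the max is taken over all document tokens anyway.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 128));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.standard_normal((12, 128)); D /= np.linalg.norm(D, axis=1, keepdims=True)

shuffled = D[rng.permutation(len(D))]
assert np.isclose(maxsim_score(Q, D), maxsim_score(Q, shuffled))
```

This only shows that the scoring step ignores token positions; positional information still enters through the encoder that produced the embeddings.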
In the context of MTEB:
When you see Late Interaction models being evaluated in MTEB, understand that they encode each text into a set of token embeddings rather than a single vector, and that MaxSim (or a variant of it) is how they turn those two sets of token embeddings into the final similarity score.