Thursday, April 3, 2025

What are Late Interaction Models and Cross Encoders

Late Interaction models are a class of models evaluated in MTEB (the Massive Text Embedding Benchmark) that differ significantly from traditional bi-encoder models. Instead of collapsing each sentence or passage into a single fixed-length embedding and comparing those vectors, Late Interaction models keep token-level representations and perform a fine-grained, token-level comparison between the two texts when computing the final similarity score.


Here's a breakdown:

Late Interaction Models:


Token-Level Interactions:

They keep a separate embedding for every token (or subword unit) of each text, and the similarity computation compares these token embeddings directly across the two texts.

This enables the model to capture more nuanced relationships and dependencies between the words in the two texts.

Increased Accuracy:

By considering the interactions at a granular level, Late Interaction models often achieve higher accuracy on tasks like semantic textual similarity (STS) and retrieval compared to bi-encoders.

Computational Cost:

The trade-off is higher computational and storage cost: each text is represented by many token-level vectors instead of one, and scoring requires comparing every token pair. This makes them heavier than bi-encoders for large-scale similarity search, even though document-side token embeddings can still be pre-computed.

Example Architectures:

ColBERT is the best-known example: it keeps per-token embeddings for each text and scores them with the MaxSim operation described below. Cross-encoders (covered in the next section) take the idea further by processing the sentence pair jointly inside a single model and outputting a similarity score directly.

MaxSim Operation:


The MaxSim operation is a specific technique used within some Late Interaction models to compute similarity between two sets of token embeddings. It is designed to capture the maximum similarity between the individual elements (token embeddings) of the two sides. Here's how it works:


Pairwise Similarity:

Given two sets of token embeddings, A and B, the MaxSim operation computes the pairwise similarity between every element of A and every element of B.

The similarity metric used is typically cosine similarity.   

Maximum Similarity:

For each element in A, the maximum similarity score with any element in B is selected.   

Similarly, for each element in B, the maximum similarity score with any element in A is selected.

Aggregation:

The resulting maximum similarity scores are then aggregated (e.g., averaged) to produce a final similarity score between the two embeddings.   

In essence:


The MaxSim operation aims to find the most similar parts of the two embeddings and use those to determine the overall similarity. This can be particularly useful when dealing with sentences or passages that have overlapping but not identical vocabulary.
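
Here is a minimal sketch of the MaxSim scoring step in PyTorch, assuming the two texts have already been encoded into per-token embedding matrices; the variable names, shapes, and the mean aggregation are illustrative (ColBERT, for example, sums the per-token maxima instead of averaging).

import torch
import torch.nn.functional as F

def maxsim_score(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """MaxSim between two token-embedding matrices.

    A: (num_tokens_a, dim) token embeddings of the first text
    B: (num_tokens_b, dim) token embeddings of the second text
    """
    # Normalize so the dot product equals cosine similarity
    A = F.normalize(A, dim=-1)
    B = F.normalize(B, dim=-1)

    # Pairwise cosine similarities: (num_tokens_a, num_tokens_b)
    sim = A @ B.T

    # For each token in A, keep its best match in B, then aggregate (here: average)
    return sim.max(dim=1).values.mean()

# Illustrative usage with random "token embeddings"
A = torch.randn(7, 128)   # e.g. 7 query tokens, 128 dimensions
B = torch.randn(12, 128)  # e.g. 12 document tokens, 128 dimensions
print(maxsim_score(A, B))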


Why MaxSim?


Captures Local Similarity:

It can capture local similarities between parts of the embeddings, even if the overall embeddings are not very similar.   

Robust to Word Order Variations:

It is somewhat robust to word order variations, as it focuses on finding the most similar elements regardless of their position.

Improved Accuracy:

In some cases, it has been shown to improve accuracy compared to simply computing the cosine similarity between the entire embeddings.

In the context of MTEB:


When you see Late Interaction models being evaluated in MTEB, understand that they compare the two texts at the token level during the scoring step, and the MaxSim operation is how many of those models compute the final similarity score.

What does a CrossEncoder do in Sentence Transformers

In Sentence Transformers, a CrossEncoder is a model architecture designed for tasks where you need to compare pairs of sentences or text passages to determine their relationship. It's particularly useful for tasks like:


Semantic Textual Similarity (STS): Determining how similar two sentences are in meaning.   

Re-ranking: Given a query and a list of documents, re-ordering the documents based on their relevance to the query.   

Here's a breakdown of what a CrossEncoder does and how it differs from a SentenceTransformer (bi-encoder):

Key Differences Between CrossEncoders and Bi-Encoders:

Bi-Encoders (SentenceTransformers):

Encode each sentence or text passage independently into a fixed-length vector (embedding).   

Calculate the similarity between two sentences by comparing their embeddings (e.g., using cosine similarity).

Efficient for large-scale similarity searches because you can pre-compute and store embeddings (see the sketch after this list).

Cross-Encoders:

Take a pair of sentences or text passages as input and process them together.   

Produce a single output score that represents the relationship between the two inputs.   

Generally more accurate than bi-encoders for pairwise comparison tasks.   

Slower than bi-encoders because they require processing each pair of sentences individually.
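
To make the bi-encoder side of this comparison concrete, here is a minimal sketch using SentenceTransformers; the checkpoint name is just one commonly used choice, and any other SentenceTransformer model would work the same way.

from sentence_transformers import SentenceTransformer, util

# Bi-encoder: each sentence is encoded independently into a single vector
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.', 'A man is eating a meal.']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Similarity is computed afterwards, on the pre-computed embeddings
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)

Because the embeddings are computed independently, they can be stored in a vector index and reused; a cross-encoder, by contrast, has to run once for every pair.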

How CrossEncoders Work:

Concatenation:

The two input sentences are concatenated (often with a special separator token like [SEP]).

Transformer Processing:

The concatenated input is fed into a Transformer-based model (e.g., BERT, RoBERTa).

Output Score:

The model produces a single output score, typically a value between 0 and 1, that represents the similarity or relevance between the two input sentences.   

For example, in an STS task, a score of 1 indicates high similarity and a score of 0 indicates low similarity.

Use Cases:

Re-ranking Search Results: When you have a large set of potentially relevant documents, a cross-encoder can be used to re-rank the top-k results from a bi-encoder search, improving accuracy (see the sketch after this list).

Question Answering: Cross-encoders can be used to determine the relevance of candidate answer passages to a given question.   

Duplicate Question Detection: Identifying duplicate questions in a forum or online platform.   
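
As a sketch of the re-ranking use case, assuming a cross-encoder checkpoint trained for query-passage relevance; the model name, query, and candidate passages below are illustrative.

from sentence_transformers import CrossEncoder

# A cross-encoder fine-tuned for query-passage relevance (illustrative choice)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'How do late interaction models score similarity?'
candidates = [
    'MaxSim compares token-level embeddings of the query and the document.',
    'The weather in Paris is mild in spring.',
    'Cross-encoders process the query and the document together.',
]

# Score each (query, candidate) pair, then sort candidates by score
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for text, score in ranked:
    print(f'{score:.3f}  {text}')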

Code Example (using Sentence Transformers):

from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/stsb-roberta-large')

sentence_pairs = [

    ('A man is eating food.', 'A man is eating a meal.'),

    ('A man is eating food.', 'The food is being eaten by a man.'),

    ('A man is eating food.', 'A man is playing a guitar.'),

]

scores = model.predict(sentence_pairs)

for pair, score in zip(sentence_pairs, scores):

    print(f"Sentence Pair: {pair}, Score: {score}")


In summary:

CrossEncoders provide high accuracy for pairwise text comparison tasks by processing sentence pairs together, but they are computationally more expensive than bi-encoders. They are most useful when accuracy is critical and you can afford the extra processing time.   



Tuesday, April 1, 2025

What is CLIP?

CLIP (Contrastive Language-Image Pre-training) is a foundational deep learning model by OpenAI that connects images with their natural language descriptions.

While traditional deep learning systems for these kinds of problems (connecting text and images) have revolutionized Computer Vision, they face some key problems.

It is very labor-intensive to label the large datasets that supervised learning requires to scale a state-of-the-art model.

Strictly supervised learning restricts a model to a single task, so such models are not good at handling multiple tasks.

There are two main reasons they are not good at multiple tasks:

1) Labeled datasets are very costly, so it is difficult to obtain, for multiple tasks, labeled datasets large enough to scale a deep learning model.

2) Because the training is strictly supervised, the model learns a narrow set of visual concepts; standard vision models are good at one task and one task only. For example, a very well-trained ResNet-101, a very good deep learning model, performs really well on ImageNet, but as soon as the task deviates a little, say to sketches of the same objects, its performance drops sharply.

CLIP is one of the most notable and impactful works done in multimodal learning.

Multimodal learning attempts to model the combination of different modalities of data, often arising in real-world applications. An example of multi-modal data is data that combines text (typically represented as discrete word count vectors) with imaging data consisting of pixel intensities and annotation tags. As these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modeling strategies and algorithms are required. (Definition taken from Wikipedia)

In easy words, we can explain multimodal deep learning as a field of artificial intelligence that focuses on developing algorithms and models that can process and understand multiple types of data, such as text, images, and audio, unlike traditional models that can only deal with a single type of data.

Multimodal deep learning is like teaching a robot to understand different things at the same time. Just like how we can see a picture and read a description to understand what’s happening in the picture, a robot can also do the same thing.

The way CLIP is designed is simple yet very effective. It uses contrastive learning, one of the main techniques for measuring similarity between representations; originally the technique was used to measure similarity between images.

For example, let’s say the robot sees a picture of a dog, but it doesn’t know what kind of dog it is. Multimodal deep learning can help the robot understand what kind of dog it is by also reading a description of the dog, like “This is a Golden Retriever”. By looking at the picture and reading the description, the robot can learn what a Golden Retriever looks like, and use that information to recognize other Golden Retrievers in the future.
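
A minimal sketch of how CLIP's contrastive image-text similarity can be queried through the Hugging Face Transformers CLIP classes; the checkpoint name, image path, and label prompts below are illustrative choices.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('dog.jpg').convert('RGB')  # replace with your image path
labels = ['a photo of a Golden Retriever', 'a photo of a cat', 'a photo of a car']

inputs = processor(text=labels, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the label prompts
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f'{p:.3f}  {label}')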


Hugging Face Transformers library for image embeddings

How to use the Hugging Face Transformers library with AutoImageProcessor and ViTModel (or any other Vision Transformer model) to extract image embeddings. The code snippet below is a valid way to achieve this.

Here's a breakdown of why it works and some additional considerations:

Explanation of the Code:

Load Pre-trained Image Processor and Model:

AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k"): Loads the pre-trained image processor associated with the specified Vision Transformer model. This processor handles image resizing, normalization, and other necessary transformations.

ViTModel.from_pretrained("google/vit-large-patch16-224-in21k"): Loads the pre-trained Vision Transformer model itself.

Prepare Input Image:

inputs = image_processor(test_image, return_tensors='pt'): Processes the input image (test_image) using the loaded image processor and converts it into PyTorch tensors.

Generate Embeddings:


with torch.no_grad(): outputs = model(**inputs): Runs the Vision Transformer model on the processed input image to generate the output. torch.no_grad() disables gradient calculations, which are not needed for inference.

embedding = outputs.last_hidden_state: Extracts the last hidden state from the model's output. This hidden state represents the image embedding.

embedding = embedding[:, 0, :]: Selects the class ([CLS]) token embedding at the first token position, leaving a tensor of shape (batch_size, hidden_size); the .squeeze(1) in the original snippet is unnecessary, since indexing with 0 already drops that dimension. This class token embedding is commonly used as the image-level embedding.

Embedding Shape:


print('embedding shape: ', embedding.shape): Prints the shape of the generated embedding.

Advantages of Using Hugging Face Transformers:


Ease of Use: Hugging Face Transformers simplifies the process of loading and using pre-trained models.

Unified API: The library provides a consistent API for working with various models.

Large Model Repository: Hugging Face Hub hosts a vast collection of pre-trained models.

Integration with PyTorch and TensorFlow: The library supports both PyTorch and TensorFlow.

Important Notes:


Dependencies: Make sure you have the transformers and torch libraries installed.

GPU Acceleration: If you have a GPU, ensure that PyTorch is configured to use it for faster processing.

Image Input: The test_image variable should be a PIL Image object or a NumPy array representing the image.

Embedding Dimension: The embedding dimension will vary depending on the specific Vision Transformer model you use.

Model Selection: You can use other Vision Transformer models from the Hugging Face Hub by changing the model identifier (e.g., "google/vit-base-patch16-224-in21k").

Batching: If you want to process multiple images, you can batch them together using the image processor (see the short sketch after these notes).

TensorFlow: The code can be adapted to TensorFlow (for example, by using TFViTModel and return_tensors='tf').
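
A short sketch of the batching note above, assuming the same checkpoint as the full example below; the file names are placeholders.

from PIL import Image
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")

# Pass a list of PIL images to process a whole batch at once
images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]
inputs = image_processor(images, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: (2, 3, 224, 224) for this checkpoint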



from transformers import AutoImageProcessor, ViTModel

import torch

from PIL import Image


# Load image

image_path = "your_image.jpg" #Replace with your image path.

test_image = Image.open(image_path).convert("RGB")


# Load pre-trained image processor and model

image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")

model = ViTModel.from_pretrained("google/vit-large-patch16-224-in21k")


# prepare input image

inputs = image_processor(test_image, return_tensors='pt')

print('input shape: ', inputs['pixel_values'].shape)


with torch.no_grad():

    outputs = model(**inputs)


embedding = outputs.last_hidden_state

embedding = embedding[:, 0, :]  # class (CLS) token embedding, shape: (batch_size, hidden_size)

print('embedding shape: ', embedding.shape)


#Use the embedding variable for similarity search.
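
Following the comment above, here is a sketch of how two such embeddings might be compared for similarity search; the helper function and image paths are placeholders.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-large-patch16-224-in21k")

def embed(image_path: str) -> torch.Tensor:
    """Return the class-token embedding of one image, shape (1, hidden_size)."""
    image = Image.open(image_path).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]

# Replace with your image paths
emb_a = embed("image_a.jpg")
emb_b = embed("image_b.jpg")

# Cosine similarity between the two image embeddings (closer to 1 = more similar)
print("cosine similarity:", F.cosine_similarity(emb_a, emb_b).item())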