Tuesday, April 1, 2025

Hugging Face Transformers library for image embeddings

How to use the Hugging Face Transformers library with AutoImageProcessor and ViTModel (or any other Vision Transformer model) to extract image embeddings. The code snippet at the end of this post is a valid way to achieve this.

Here's a breakdown of why it works and some additional considerations:

Explanation of the Code:

Load Pre-trained Image Processor and Model:

AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k"): Loads the pre-trained image processor associated with the specified Vision Transformer model. This processor handles image resizing, normalization, and other necessary transformations.

ViTModel.from_pretrained("google/vit-large-patch16-224-in21k"): Loads the pre-trained Vision Transformer model itself.

Prepare Input Image:

inputs = image_processor(test_image, return_tensors='pt'): Processes the input image (test_image) using the loaded image processor and converts it into PyTorch tensors.

Generate Embeddings:


with torch.no_grad(): outputs = model(**inputs): Runs the Vision Transformer model on the processed input image to generate the output. torch.no_grad() disables gradient calculations, which are not needed for inference.

embedding = outputs.last_hidden_state: Extracts the last hidden state from the model's output, a tensor of shape (batch_size, num_tokens, hidden_size) containing one embedding per token (the class token plus one per image patch).

embedding = embedding[:, 0, :]: Selects the class ([CLS]) token embedding (the first token), giving a tensor of shape (batch_size, hidden_size). This class token embedding is commonly used as the image-level embedding. (The .squeeze(1) in the original snippet is redundant here, since the indexing already removes the token dimension.)
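
If you want a single vector that uses all patch tokens rather than only the class token, a common alternative (not used in the snippet in this post) is to mean-pool the patch embeddings. A minimal sketch, assuming outputs comes from the same model call as above:

# Mean-pool the patch-token embeddings (index 1 onward), skipping the class token at index 0.
patch_tokens = outputs.last_hidden_state[:, 1:, :]  # (batch_size, num_patches, hidden_size)
mean_embedding = patch_tokens.mean(dim=1)           # (batch_size, hidden_size)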

Embedding Shape:


print('embedding shape: ', embedding.shape): Prints the shape of the generated embedding. For google/vit-large-patch16-224-in21k this is (1, 1024), since ViT-Large has a hidden size of 1024.

Advantages of Using Hugging Face Transformers:


Ease of Use: Hugging Face Transformers simplifies the process of loading and using pre-trained models.

Unified API: The library provides a consistent API for working with various models.

Large Model Repository: Hugging Face Hub hosts a vast collection of pre-trained models.

Integration with PyTorch and TensorFlow: The library supports both PyTorch and TensorFlow.

Important Notes:


Dependencies: Make sure you have the transformers, torch, and Pillow libraries installed (e.g., pip install transformers torch pillow).

GPU Acceleration: If you have a GPU, move both the model and the inputs to it for faster processing (see the sketch after these notes).

Image Input: The test_image variable should be a PIL Image object or a NumPy array representing the image.

Embedding Dimension: The embedding dimension will vary depending on the specific Vision Transformer model you use (e.g., 768 for ViT-Base models, 1024 for ViT-Large models).

Model Selection: You can use other Vision Transformer models from the Hugging Face Hub by changing the model identifier (e.g., "google/vit-base-patch16-224-in21k").

Batching: If you want to process multiple images, you can batch them together by passing a list of images to the image processor (see the sketch after these notes).

TensorFlow: The code can be adapted to TensorFlow by loading TFViTModel and passing return_tensors='tf' to the image processor (see the sketch after these notes).
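
A minimal sketch combining the GPU and batching notes above; the model identifier matches the post, while the image paths are placeholders you would replace with your own files:

import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

device = "cuda" if torch.cuda.is_available() else "cpu"

image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-large-patch16-224-in21k").to(device)
model.eval()

# Pass a list of PIL images to the processor to build a batch.
images = [Image.open(p).convert("RGB") for p in ["image1.jpg", "image2.jpg"]]  # placeholder paths
inputs = image_processor(images, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state[:, 0, :]  # (batch_size, hidden_size), here (2, 1024)
print('batch embedding shape: ', embeddings.shape)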
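
The TensorFlow variant mentioned in the last note follows the same structure; a minimal sketch, assuming the TensorFlow backend of transformers is installed (add from_pt=True to from_pretrained if the checkpoint only ships PyTorch weights):

from PIL import Image
from transformers import AutoImageProcessor, TFViTModel

image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")
model = TFViTModel.from_pretrained("google/vit-large-patch16-224-in21k")  # add from_pt=True if needed

test_image = Image.open("your_image.jpg").convert("RGB")  # placeholder path
inputs = image_processor(test_image, return_tensors="tf")

outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # (batch_size, hidden_size)
print('embedding shape: ', embedding.shape)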



The complete example discussed above:

from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image

# Load image
image_path = "your_image.jpg"  # Replace with your image path.
test_image = Image.open(image_path).convert("RGB")

# Load pre-trained image processor and model
image_processor = AutoImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-large-patch16-224-in21k")

# Prepare input image
inputs = image_processor(test_image, return_tensors='pt')
print('input shape: ', inputs['pixel_values'].shape)

# Generate embeddings (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Class ([CLS]) token embedding: shape (batch_size, hidden_size)
embedding = outputs.last_hidden_state[:, 0, :]
print('embedding shape: ', embedding.shape)

# Use the embedding variable for similarity search.
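
The final comment points toward similarity search; a minimal sketch of comparing two such embeddings with cosine similarity, where the second embedding is a placeholder you would compute from another image with the same pipeline:

import torch.nn.functional as F

# Placeholder: in practice, compute this from a second image with the same processor/model.
other_embedding = torch.randn_like(embedding)

similarity = F.cosine_similarity(embedding, other_embedding, dim=-1)
print('cosine similarity: ', similarity.item())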
