CLIP (Contrastive Language-Image Pre-training) is a foundational deep learning model by OpenAI that connects images with their natural language descriptions.
While traditional deep learning systems for connecting text and images have revolutionized the world of Computer Vision, they come with some key problems.
Labeling the large datasets required to train a state-of-the-art supervised model is very labor-intensive.
Strictly supervised learning also restricts a model to a single task; such models are not good at multiple tasks.
They are not good at multiple tasks because:
1) Labeled datasets are very costly, so it is difficult to obtain labeled data for multiple tasks at the scale a deep learning model needs.
2) Under strictly supervised learning, the model learns a narrow set of visual concepts; standard vision models are good at one task and one task only. For example, a well-trained ResNet-101 performs very well on ImageNet, but as soon as the task deviates a little, such as recognizing sketches of the same objects, its performance drops sharply.
CLIP is one of the most notable and impactful works in multimodal learning.
Multimodal learning attempts to model the combination of different modalities of data, often arising in real-world applications. An example of multi-modal data is data that combines text (typically represented as discrete word count vectors) with imaging data consisting of pixel intensities and annotation tags. As these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modeling strategies and algorithms are required. (Definition taken from Wikipedia)
In simple terms, multimodal deep learning is a field of artificial intelligence focused on developing algorithms and models that can process and understand multiple types of data, such as text, images, and audio, unlike traditional models that handle only a single type of data.
Multimodal deep learning is like teaching a robot to understand different things at the same time. Just as we can look at a picture and read a description to understand what's happening in it, a robot can do the same.
The way CLIP is designed is simple yet very effective. It uses contrastive learning, a technique for learning representations by measuring how similar pairs of examples are. Originally it was used to measure similarity between images; CLIP applies it to image-text pairs, as in the sketch below.
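To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style training step in PyTorch. The linear encoders, dimensions, and temperature value are toy placeholders I chose for illustration (real CLIP uses a Transformer text encoder and a ViT or ResNet image encoder); the point is only to show how matching image-caption pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart.

```python
# Toy sketch of CLIP-style contrastive learning (assumed encoders and random data).
import torch
import torch.nn.functional as F

batch_size, image_dim, text_dim, embed_dim = 8, 2048, 512, 256

# Stand-in encoders: any encoder that outputs a fixed-size vector would work here.
image_encoder = torch.nn.Linear(image_dim, embed_dim)
text_encoder = torch.nn.Linear(text_dim, embed_dim)

# Fake batch of paired image/text features (row i of each tensor belongs together).
image_features = torch.randn(batch_size, image_dim)
text_features = torch.randn(batch_size, text_dim)

# Project both modalities into the shared embedding space and L2-normalize.
image_emb = F.normalize(image_encoder(image_features), dim=-1)
text_emb = F.normalize(text_encoder(text_features), dim=-1)

# Cosine similarity between every image and every text, scaled by a temperature.
temperature = 0.07
logits = image_emb @ text_emb.t() / temperature  # shape: (batch, batch)

# The matching pair sits on the diagonal, so the "label" for row i is i.
labels = torch.arange(batch_size)

# Symmetric cross-entropy: pick the right text for each image and vice versa.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```

During training, minimizing this loss makes the similarity of correct image-text pairs high and that of incorrect pairings low, which is what lets CLIP later match an unseen image against arbitrary text descriptions.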
For example, let’s say the robot sees a picture of a dog, but it doesn’t know what kind of dog it is. Multimodal deep learning can help the robot understand what kind of dog it is by also reading a description of the dog, like “This is a Golden Retriever”. By looking at the picture and reading the description, the robot can learn what a Golden Retriever looks like, and use that information to recognize other Golden Retrievers in the future.
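Below is a hedged sketch of what that looks like in practice: zero-shot recognition with a pretrained CLIP model through the Hugging Face transformers library. The checkpoint name, the candidate descriptions, and the `dog.jpg` path are illustrative assumptions.

```python
# Zero-shot image classification with a pretrained CLIP model.
# Assumes transformers, torch, and Pillow are installed and a local "dog.jpg" exists.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")
candidate_texts = [
    "a photo of a Golden Retriever",
    "a photo of a Poodle",
    "a photo of a cat",
]

# Tokenize the texts and preprocess the image in one call.
inputs = processor(text=candidate_texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, prob in zip(candidate_texts, probs[0].tolist()):
    print(f"{text}: {prob:.2%}")
```

Because the "labels" are just sentences, you can swap in any set of descriptions without retraining, which is exactly the flexibility the Golden Retriever example is getting at.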