CLIP: Connecting Text and Images (2023)



We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.


Although deep learning has revolutionized computer vision, current approaches have several major problems: typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision.

We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: by not directly optimizing for the benchmark, we show that benchmark performance becomes much more representative. Our system closes this “robustness gap” by up to 75% while matching the zero-shot performance of the original ResNet-50 on ImageNet, without using any of the original 1.28M labeled examples.


Although the two models compared here, an ImageNet-trained ResNet and zero-shot CLIP, have the same accuracy on the ImageNet test set, CLIP’s performance is much more representative of how it will fare on datasets that measure accuracy in different, non-ImageNet settings. For instance, ObjectNet checks a model's ability to recognize objects in many different poses and with many different backgrounds inside homes, while ImageNet Rendition and ImageNet Sketch check a model's ability to recognize more abstract depictions of objects.

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade but until recently was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word vector embedding space and showed this model could predict two unseen classes. The same year DeViSE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original 1000-class training set.

Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset. They achieved this by fine-tuning an ImageNet CNN to predict a much wider set of visual concepts (visual n-grams) from the text of titles, descriptions, and tags of 30 million Flickr photos and were able to reach 11.5% accuracy on ImageNet zero-shot.

Finally, CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This line of work uses more modern architectures like the Transformer and includes VirTex, which explored autoregressive language modeling, ICMLM, which investigated masked language modeling, and ConVIRT, which studied the same contrastive objective we use for CLIP but in the field of medical imaging.


We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in our dataset.
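The proxy task above can be sketched as a contrastive loss over a batch of paired embeddings. This is a minimal NumPy illustration, not the training code: the embeddings are random stand-ins for encoder outputs, the function names are hypothetical, the fixed temperature of 0.07 is an assumed value, and the loss is made symmetric (it also scores text against images), which goes beyond the image-to-text description in the paragraph above.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a true pair.
    The temperature is an assumed constant for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # logits[i, j]: image i vs. text j
    diag = np.arange(logits.shape[0])       # the correct pairings sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # pick the text for each image
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # pick the image for each text
    return (loss_i2t + loss_t2i) / 2

# Random stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
aligned = contrastive_loss(text_emb + 0.01 * rng.normal(size=(8, 16)), text_emb)
shuffled = contrastive_loss(np.roll(text_emb, 1, axis=0), text_emb)
print(f"aligned pairs: {aligned:.3f}, mismatched pairs: {shuffled:.3f}")
```

The loss is low when each image embedding sits closest to its own text embedding and high when the pairing is scrambled, which is the signal the proxy task provides.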

In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.


CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
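The zero-shot step described above can be sketched as follows. This is an illustration under stated assumptions: the embedding vectors are hand-written stand-ins for what the image and text encoders would produce, `zero_shot_classify` is a hypothetical helper name, and the temperature is an assumed constant.

```python
import numpy as np

def zero_shot_classify(image_emb, caption_embs, class_names, temperature=0.07):
    """Return the class whose caption embedding best matches the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    logits = caps @ img / temperature        # cosine similarity per caption, scaled
    probs = np.exp(logits - logits.max())    # stable softmax over the captions
    probs /= probs.sum()
    return class_names[int(np.argmax(probs))], probs

class_names = ["dog", "cat"]
captions = [f"a photo of a {name}" for name in class_names]

# Hypothetical encoder outputs: one row per caption, plus one image vector.
caption_embs = np.array([[1.0, 0.1, 0.0],
                         [0.0, 1.0, 0.2]])
image_emb = np.array([0.9, 0.2, 0.1])

label, probs = zero_shot_classify(image_emb, caption_embs, class_names)
print(label, probs)
```

With real encoders, `caption_embs` would come from encoding the strings in `captions` and `image_emb` from encoding the image; no classifier is trained at any point.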


CLIP was designed to mitigate a number of major problems in the standard deep learning approach to computer vision:

Costly datasets: Deep learning needs a lot of data, and vision models have traditionally been trained on manually labeled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work, notably self-supervised learning, contrastive methods, self-training approaches, and generative modeling.

Narrow: An ImageNet model is good at predicting the 1000 ImageNet categories, but that’s all it can do “out of the box.” If we wish to perform any other task, an ML practitioner needs to build a new dataset, add an output head, and fine-tune the model. In contrast, CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text-encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations. The accuracy of this classifier is often competitive with fully supervised models.

We show random, non-cherry picked, predictions of zero-shot CLIP classifiers on examples from various datasets below.


Poor real-world performance: Deep learning systems are often reported to achieve human or even superhuman performance[1] on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the “cheating hypothesis”, we also measure how CLIP's performance changes when it is able to “study” for ImageNet. When a linear classifier is fitted on top of CLIP's features, it improves CLIP's accuracy on the ImageNet test set by almost 10%. However, this classifier does no better on average across an evaluation suite of 7 other datasets measuring “robust” performance.
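Fitting a linear classifier on top of frozen features, as in the “study for ImageNet” experiment above, can be sketched like this. It is a minimal softmax-regression example in NumPy on synthetic features; the feature data, the function name `fit_linear_probe`, the learning rate, and the step count are all assumptions for illustration, and the encoder itself is not modeled.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Train only a linear layer (W, b) on fixed features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # gradient of mean cross-entropy
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic "frozen features": two well-separated clusters (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0, 0.0, 0.0], [-2.0, 0.0, 0.0, 0.0]])
labels = rng.integers(0, 2, size=200)
features = centers[labels] + 0.5 * rng.normal(size=(200, 4))

W, b = fit_linear_probe(features, labels, num_classes=2)
accuracy = ((features @ W + b).argmax(axis=1) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Only `W` and `b` are updated; the features stay fixed, which is what distinguishes a linear probe from fine-tuning the whole model.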


Key Takeaways

1. CLIP is highly efficient

CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used in a zero-shot manner. We know from GPT-2 and GPT-3 that models trained on such data can achieve compelling zero-shot performance; however, such models require significant training compute. To reduce the needed compute, we focused on algorithmic ways to improve the training efficiency of our approach.

We report two algorithmic choices that led to significant compute savings. The first choice is the adoption of a contrastive objective for connecting text with images. We originally explored an image-to-text approach, similar to VirTex, but encountered difficulties scaling this to achieve state-of-the-art performance. In small to medium scale experiments, we found that the contrastive objective used by CLIP is 4x to 10x more efficient at zero-shot ImageNet classification. The second choice was the adoption of the Vision Transformer, which gave us a further 3x gain in compute efficiency over a standard ResNet. In the end, our best performing CLIP model trains on 256 GPUs for 2 weeks which is similar to existing large scale image models.

We originally explored training image-to-caption language models but found this approach struggled at zero-shot transfer. In this 16 GPU-day experiment, a language model achieves only 16% accuracy on ImageNet after training on 400 million images. CLIP is much more efficient and achieves the same accuracy roughly 10x faster.

2. CLIP is flexible and general

Because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models. We find they are able to zero-shot perform many different tasks. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.[2] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models. Above, we visualize a random non-cherry picked prediction from each zero-shot classifier.

This finding is also reflected on a standard representation learning evaluation using linear probes. The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2, on 20 out of 26 different transfer datasets we tested.

Across a suite of 27 datasets measuring tasks such as fine-grained object classification, OCR, activity recognition in videos, and geo-localization, we find that CLIP models learn more widely useful image representations. CLIP models are also more compute efficient than the models from 10 prior approaches that we compare with.


Limitations

While CLIP usually performs well on recognizing common objects, it struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo. On these two datasets, zero-shot CLIP is only slightly better than random guessing. Zero-shot CLIP also struggles compared to task specific models on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

CLIP also still has poor generalization to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP only achieves 88% accuracy, well below the 99.75% of humans on the dataset. Finally, we’ve observed that CLIP's zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well.
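One way to reduce the sensitivity to wording mentioned above is to average the text embeddings of several prompt templates per class instead of relying on a single phrasing. The sketch below assumes this averaging approach; `toy_embed_text` is a hypothetical stand-in for a real text encoder (it maps words to fixed pseudo-random vectors), and the templates are made up for illustration.

```python
import zlib
import numpy as np

def toy_embed_text(text, dim=32):
    """Hypothetical stand-in for a text encoder: mean of per-word pseudo-random vectors."""
    vecs = [np.random.default_rng(zlib.crc32(word.encode())).normal(size=dim)
            for word in text.lower().split()]
    return np.mean(vecs, axis=0)

def ensemble_class_embedding(class_name, templates, embed=toy_embed_text):
    """Average the unit-normalized embeddings of several prompts for one class."""
    embs = np.stack([embed(t.format(class_name)) for t in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)     # renormalize the averaged vector

# Illustrative templates (assumed, not taken from the text above).
templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]
dog = ensemble_class_embedding("dog", templates)
cat = ensemble_class_embedding("cat", templates)
print(dog.shape, float(dog @ cat))
```

The averaged vector can then be used in place of a single caption embedding when scoring images, so that no one phrasing dominates the result.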

Broader Impacts

CLIP allows people to design their own classifiers and removes the need for task-specific training data. The manner in which these classes are designed can heavily influence both model performance and model biases. For example, we find that when given a set of labels including Fairface race labels[3] and a handful of egregious terms such as “criminal”, “animal,” etc., the model tends to classify images of people aged 0–20 in the egregious category at a rate of ~32.3%. However, when we add the class “child” to the list of possible classes, this behaviour drops to ~8.7%.

Additionally, given that CLIP does not need task-specific training data it can unlock certain niche tasks with greater ease. Some of these tasks may raise privacy or surveillance related risks and we explore this concern by studying the performance of CLIP on celebrity identification. CLIP has a top-1 accuracy of 59.2% for “in the wild” celebrity image classification when choosing from 100 candidates and a top-1 accuracy of 43.3% when choosing from 1000 possible choices. Although it’s noteworthy to achieve these results with task agnostic pre-training, this performance is not competitive when compared to widely available production level models. We further explore challenges that CLIP poses in our paper and we hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models. We are excited to engage with the research community on such questions.


With CLIP, we’ve tested whether task agnostic pre-training on internet scale natural language, which has powered a recent breakthrough in NLP, can also be leveraged to improve the performance of deep learning for other fields. We are excited by the results we’ve seen so far applying this approach to computer vision. Like the GPT family, CLIP learns a wide variety of tasks during pre-training which we demonstrate via zero-shot transfer. We are also encouraged by our findings on ImageNet that suggest zero-shot evaluation is a more representative measure of a model’s capability.

What does the CLIP model stand for?

What is CLIP? CLIP is a multimodal (in this case, vision and text) model for computer vision tasks, released by OpenAI on January 5, 2021. From the OpenAI CLIP repository: "CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs."

What is CLIP image embedding?

CLIP is an image and text embedding model that can be used to find the text snippet that best represents a given image (as in a classical classification task), or the most suitable image given a text query (e.g., image search).

How much data was CLIP trained on?

CLIP is trained on 400 million image-text pairs. For comparison, the ImageNet dataset contains 1.2 million images. The final tuned CLIP model was trained on 256 V100 GPUs for two weeks.

What is CLIP text?

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.

What language model does CLIP use?

CLIP is a multimodal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like Transformer to get visual features and a causal language model to get the text features.

How does CLIP embedding work?

In order for the image and text pairs to be connected to each other, both are embedded. A CLIP model consists of two sub-models, called encoders, including a text encoder and an image encoder. The text encoder embeds text into a mathematical space while the image encoder embeds images into a mathematical space.

What is the CLIP text encoder?

The CLIP model consists of a text encoder and an image encoder, which encode textual and visual information into a shared multimodal embedding space. The training objective is to increase the cosine similarity between the embeddings of images and texts that are actually paired; in a batch of N (image, text) pairs there are N such correct pairings.

How many parameters does CLIP have?

In the efficiency comparison, CLIP reached the accuracy of the image-to-caption language model after roughly 33M training images rather than 400M, which makes it about 12 times more efficient. As a result of this methodology, CLIP can be applied to nearly any visual classification task and often achieves strong performance.

How much can CLIP benefit vision and language?

On all three tasks, CLIP-ViL brings sizable improvements over strong baselines: 1.4% accuracy on VQA v2.0, 6.5 CIDEr on COCO Captioning, and 4.0% success rate on Room-to-Room navigation.

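Does CLIP use BERT?

The original CLIP does not; its text encoder is a causal Transformer language model. The architecture described in this answer is an Italian adaptation that keeps the original image encoder from CLIP but uses an Italian BERT model as the text encoder, since it needs to create Italian embeddings.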

How does CLIP work in deep learning?

CLIP takes (image, text) pairs as input to learn a multimodal embedding space. CLIP jointly trains an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the correct pairs and minimize the cosine similarity of the image and text embeddings of the incorrect pairings.

How are text-to-image models trained?

Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale it, filling in finer details.

Is CLIP a foundation model?

What is a foundation model? First of all, foundation models are not something new. They've been with us for several years. These are models like BERT, RoBERTa, T5, BART, GPT-3, CLIP, DALL·E, Codex, and so on.

What are embedding techniques?

Word embedding techniques represent words mathematically. One-hot encoding, TF-IDF, Word2Vec, and FastText are frequently used word embedding methods. One of these techniques (in some cases several) is chosen according to the status, size, and purpose of the data being processed.

What is embedding and why is it important?

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words.

How does text embed work?

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.

What is CLIP retrieval?

CLIP retrieval works by converting the text query to a CLIP embedding, then using that embedding to query a kNN index of CLIP image embeddings.
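The retrieval step can be sketched as a brute-force nearest-neighbor search over normalized embeddings. This is a toy illustration: the index holds random vectors rather than real image embeddings, the query is constructed to lie near one of them, and `knn_search` is a hypothetical helper rather than the API of any retrieval library.

```python
import numpy as np

def knn_search(query_emb, index_embs, k=3):
    """Return indices and cosine similarities of the k closest index vectors."""
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = idx @ q                      # cosine similarity to every indexed vector
    top = np.argsort(-sims)[:k]         # highest similarity first
    return top, sims[top]

# Hypothetical index of 100 image embeddings and a text-query embedding near item 42.
rng = np.random.default_rng(0)
image_index = rng.normal(size=(100, 16))
query = image_index[42] + 0.05 * rng.normal(size=16)

top_ids, top_sims = knn_search(query, image_index)
print(top_ids, top_sims)
```

A production system would typically replace the exhaustive scan with an approximate nearest-neighbor index, but the scoring rule stays the same.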

Does CLIP use transformers?

Contrastive Language-Image Pretraining (CLIP) is a primarily transformer-based model released by OpenAI in 2021 [1].

What is linear probe CLIP?

Linear probe CLIP (Radford et al., 2021) trains an additional linear classifier on top of its visual encoder and follows a few-shot training manner. It is different from our bottleneck adapter that finetunes both the image feature and classifier weight in a dynamic and residual fashion.

What is the difference between CLIP and DALL·E?

DALL·E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training). CLIP is a separate model based on zero-shot learning that was trained on 400 million pairs of images with text captions scraped from the Internet.

How many parameters does each filter have?

The number of parameters in a CONV layer is ((m * n * d) + 1) * k, where m and n are the filter width and height, d is the number of filters in the previous layer, k is the number of filters in the current layer, and the 1 accounts for the bias term of each filter. Written out: ((filter width * filter height * number of filters in the previous layer) + 1) * number of filters.
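As a quick worked example of the formula above, the helper below (a hypothetical function name) counts the parameters of a layer with 3 x 3 filters, 3 input channels, and 64 filters; the layer sizes are chosen only for illustration.

```python
def conv_layer_params(filter_w, filter_h, in_channels, num_filters):
    """Weights per filter (w * h * in_channels) plus one bias, times the number of filters."""
    return (filter_w * filter_h * in_channels + 1) * num_filters

# (3 * 3 * 3 + 1) * 64 = 28 * 64
params = conv_layer_params(3, 3, 3, 64)
print(params)  # 1792
```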

What is zero-shot transfer?

The general idea of zero-shot learning is to transfer the knowledge already contained in the training instances to the task of classifying test instances. Zero-shot learning is therefore a subfield of transfer learning.

Article information

Author: Domingo Moore

Last Updated: 02/01/2023
