CLIP

A model that learns shared text and image embeddings from paired data.

CLIP trains an image encoder and a text encoder together so that matching image–caption pairs land near each other in a shared embedding space. This contrastive objective, trained on hundreds of millions of pairs, yields strong zero-shot image classification and a bridge between language and vision.

CLIP embeddings became a core component of text-to-image systems and many multimodal models.

Related papers