Architecture · Training

Learning Transferable Visual Models From Natural Language Supervision

OpenAI · February 26, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, et al.

TL;DR

CLIP learns from 400M image–text pairs to align vision and language in a shared embedding space, enabling strong zero-shot image classification.
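
As a rough illustration of what a shared embedding space buys you, here is a minimal NumPy sketch of a symmetric contrastive objective over a batch of paired embeddings, plus zero-shot classification by cosine similarity. It is a toy under stated assumptions, not the released implementation: the function names (`symmetric_contrastive_loss`, `zero_shot_classify`), the fixed `temperature=0.07`, and the idea that embeddings arrive as plain arrays from some image and text encoder are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Cross-entropy over an n x n similarity matrix, in both directions.

    Row i of image_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    The fixed temperature is an assumption made for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape [n, n]
    idx = np.arange(logits.shape[0])
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_emb):
    """Pick, for each image, the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

In use, `class_text_emb` would hold one embedding per candidate label, for example from a prompt such as "a photo of a dog", so new label sets can be classified without retraining. Only the text side changes.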

CLIP bridged language and vision, and it became a core component of text-to-image systems and many multimodal models.