Architecture·Training
Learning Transferable Visual Models From Natural Language Supervision
OpenAI·February 26, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, et al.
TL;DR
CLIP learns from 400M image–text pairs to align vision and language in a shared embedding space, enabling strong zero-shot image classification.
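As a rough illustration of how a shared embedding space enables zero-shot classification, the sketch below scores one image embedding against several text-prompt embeddings by cosine similarity and picks the best match. The function names (`l2_normalize`, `zero_shot_classify`), the 3-dimensional toy vectors, the prompt strings, and the `temperature` value are all assumptions made for illustration; in practice the embeddings would come from trained image and text encoders, which are not shown here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (best_index, probabilities) for one image against N text prompts.

    image_emb: shape (d,), text_embs: shape (N, d).
    temperature is a hypothetical fixed value chosen for this sketch.
    """
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = text_embs @ image_emb / temperature  # cosine similarities, scaled
    logits = logits - logits.max()                # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Toy stand-ins for encoder outputs (not real CLIP embeddings).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.2, 0.1])

best, probs = zero_shot_classify(image_emb, text_embs)
print(labels[best])
```

Because the class names enter only as text prompts, changing the label set means changing the list of prompts and re-embedding them; no classifier head is retrained in this sketch.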
Why it matters
CLIP bridged language and vision, and it became a core component of text-to-image systems and many multimodal models.