An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Google · October 22, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov

TL;DR

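The Vision Transformer (ViT) splits an image into fixed-size patches (for example, 16x16 pixels), treats each patch as a token, and feeds the resulting sequence to a standard Transformer encoder. When pre-trained on enough data, it matches or exceeds strong CNNs on image classification.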
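
To make the "patches as tokens" idea concrete, here is a minimal sketch of the patch-splitting step only, written in plain Python. The function name `patchify` and the nested-list image layout (height x width x channels) are assumptions chosen for illustration, not the paper's code. The learned linear projection, class token, and position embeddings that ViT applies afterwards are left out.

```python
from typing import List

# Assumed layout for this sketch: image[row][col] is a list of channel values.
Image = List[List[List[float]]]


def patchify(image: Image, patch_size: int = 16) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    Each patch is flattened to a vector of length patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then pixels in each row, then channels.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Toy input: a 32x32 RGB image of zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]
tokens = patchify(image, patch_size=16)
num_tokens = len(tokens)    # (32 / 16) ** 2 patches
token_dim = len(tokens[0])  # 16 * 16 * 3 values per patch
```

With 16x16 patches, a 224x224 RGB image would yield (224 / 16)^2 = 196 tokens, each a 768-dimensional vector before projection. The sequence length therefore grows with image area divided by patch area.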

Why it matters

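It showed that a standard Transformer, with few image-specific inductive biases, can replace convolutions for image recognition, which helped unify architectures across vision and language.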
