Architecture
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Google · October 22, 2020
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al.
TL;DR
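The Vision Transformer (ViT) splits an image into fixed-size patches (16x16 pixels in the title's example), feeds each patch to a standard Transformer as a token, and matches or beats strong CNNs when pre-trained on enough data.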
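To make the "patches as tokens" idea concrete, here is a minimal pure-Python sketch of the patch-splitting step only. The function name `patchify`, the nested-list image layout, and the 224x224 RGB input are illustrative assumptions, not the authors' code. In the actual model, each flattened patch is then linearly projected and combined with a position embedding before entering the Transformer; that part is omitted here.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns tokens in row-major patch order; each token is a flat list of
    patch_size * patch_size * C values. (Hypothetical helper for illustration.)
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            # Walk the patch row by row and concatenate the channel values.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    token.extend(pixel)
            tokens.append(token)
    return tokens


# Assumed example input: a 224 x 224 RGB image filled with zeros.
image = [[(0, 0, 0)] * 224 for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of (224 / 16)^2 = 196 tokens, each holding 16 * 16 * 3 = 768 raw values, which is the kind of sequence a Transformer built for text can consume.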
Why it matters
Brought the Transformer to vision and helped unify architectures across modalities.