Architecture

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Google · October 22, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov

TL;DR

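The Vision Transformer (ViT) splits an image into fixed-size patches (16x16 pixels in the title's example), treats each patch as a token, and feeds the resulting sequence to a standard Transformer. With enough pre-training data, it matches strong CNNs on image recognition.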

Brought the Transformer to vision and helped unify architectures across modalities.
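
The patches-as-tokens idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `patchify`, the nested-list image layout, and the 224 x 224 RGB input are assumptions made for illustration. It only shows how an image becomes a sequence of flattened patch vectors. In the actual model each vector is then linearly projected and combined with position information before entering the Transformer, and that step is omitted here.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Hypothetical helper for illustration; returns one list of numbers per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    # Walk the patch grid in row-major order (top-left to bottom-right).
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: pixel rows first, then channels within a pixel.
            token = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Assumed input: a 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

With these assumed sizes, the image yields a 14 x 14 grid of patches, so the sequence has 196 tokens of 16 * 16 * 3 = 768 values each.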