Vision Transformer

Applying the Transformer architecture directly to images.

A Vision Transformer (ViT) splits an image into fixed-size patches, embeds each patch as a token, and processes the resulting sequence with a standard Transformer encoder. It showed that, given enough pre-training data, an attention-based model can match or beat CNNs on image recognition, and it helped unify model architectures across modalities.
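
The patch-to-token step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches` are hypothetical, the random arrays stand in for learned parameters, and the prepended class token and additive position embeddings are assumptions based on the common ViT setup. The sizes (224x224 image, 16x16 patches, 64-dimensional tokens) are chosen only for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, proj, cls_token, pos_embed):
    """Turn an image into the token sequence fed to the Transformer encoder."""
    tokens = patchify(image, patch_size) @ proj                    # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N + 1, D)
    return tokens + pos_embed                                      # add position info

# Illustrative sizes only; random values stand in for learned weights.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, dim = 16, 64
num_patches = (224 // patch_size) ** 2              # 14 * 14 = 196
proj = rng.normal(size=(patch_size * patch_size * 3, dim))
cls_token = rng.normal(size=dim)
pos_embed = rng.normal(size=(num_patches + 1, dim))

tokens = embed_patches(image, patch_size, proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 64)
```

From here on, the sequence of 197 tokens is handled like any other token sequence: the encoder itself needs no image-specific changes, which is the sense in which the Transformer is applied "directly" to images.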

Related papers