
Vision Transformer: What It Is & How It Works [2024 Guide]


Dec 15, 2022

A vision transformer (ViT) is a transformer-like model that handles vision processing tasks. Learn how it works and see some examples.

Deval Shah

Guest Author

Vision Transformer (ViT) emerged as a competitive alternative to convolutional neural networks (CNNs), which have long been the state of the art in computer vision and are widely used for image recognition tasks. In the original ViT paper, models pre-trained on large datasets matched or exceeded state-of-the-art CNNs in accuracy while requiring substantially less compute to train.

Although convolutional neural networks have dominated the field of computer vision for years, new vision transformer models have also shown remarkable abilities, achieving comparable and even better performance than CNNs on many computer vision tasks.

Here’s what we’ll cover:

  • A brief history of Transformers

  • What is a Vision Transformer (ViT)

  • How do Vision Transformers work?

  • Vision Transformer model training

  • Six applications of Vision Transformers



A brief history of Transformers

Attention mechanisms combined with RNNs were the predominant architecture for tackling text tasks until 2017, when the paper “Attention Is All You Need” changed everything and gave birth to the now widely used Transformers.

A Transformer is a deep learning model that adopts the self-attention mechanism, differentially weighting the significance of each part of the input data. Transformers are increasingly the model of choice for NLP problems, replacing RNN models such as long short-term memory (LSTM).

Transformers architecture

Transformers differed from Recurrent Neural Networks in the following ways:

  1. Transformers are non-sequential, in contrast to RNNs, which process sequential data one step at a time. For example, if the input is a sentence, an RNN takes one word at a time as input. This isn’t the case for transformers: they can take all the words of the sentence as input at once.

  2. Transformers rely on the attention mechanism, which lets every token look directly at every other token in the input, so the model can use the full context. RNNs can access past information only to a certain extent, mostly through the previous hidden state, and that information fades over subsequent steps.

  3. Transformers use positional embeddings to store information regarding the position of words in a sentence.

These were the reasons why transformers became popular so quickly. Transformers have surpassed RNNs on most NLP tasks. This is why it is essential to understand this architecture and its evolution since 2017.

Pro tip: Learn more by reading The Essential Guide to Neural Network Architectures.

GPT by OpenAI


GPT Architecture. Source: OpenAI

Originally released on 11th June 2018, GPT has undergone several transformations in recent years. There is no better model to begin discussing transformers than GPT, which stands for Generative Pre-trained Transformer. It popularized using unsupervised learning for pre-training and supervised learning for fine-tuning, an approach now commonly employed by many transformers.

It was trained on the BooksCorpus dataset, which consists of over 7,000 unpublished books. The architecture of GPT consists of 12 decoder blocks stacked together, meaning it is a decoder-only model. GPT-1 consists of 117 million parameters, a small number compared to the transformers developed today! GPT was just the beginning of the era of transformers.

BERT by Google


BERT Architecture. Source: Stanford

BERT stands for Bidirectional Encoder Representations from Transformers and it was released on 11th October 2018. As the name suggests, BERT is a bidirectional model: the attention mechanism can attend to tokens on both sides of the current token, left and right. This is because BERT is made by stacking encoders (12 in the base model), meaning it is an encoder-only model. Encoders can take the full sentence as input and reference any word in the sentence to perform the task.

The base BERT model consists of 110 million parameters. Like GPT, it is pre-trained on one task and can then be fine-tuned for other tasks. The main task used to pre-train BERT was masked language modeling, that is, filling in the blanks.

What is a Vision Transformer (ViT)

The Vision Transformer (ViT) model was introduced in the research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," first released in October 2020 and published at ICLR 2021. The fine-tuning code and pre-trained ViT models are accessible on Google Research's GitHub. The ViT models were pre-trained on the ImageNet and ImageNet-21k datasets.

Vision transformers have extensive applications in popular image recognition tasks such as object detection, image segmentation, image classification, and action recognition. Moreover, ViTs are applied in generative modeling and multi-modal tasks, including visual grounding, visual question answering, and visual reasoning.

In ViTs, images are represented as sequences, and class labels for the image are predicted, which allows models to learn image structure independently. Input images are treated as a sequence of patches where every patch is flattened into a single vector by concatenating the channels of all pixels in a patch and then linearly projecting it to the desired input dimension.

Let’s examine the vision transformer architecture step by step.

  1. Split an image into patches

  2. Flatten the patches

  3. Produce lower-dimensional linear embeddings from the flattened patches

  4. Add positional embeddings

  5. Feed the sequence as an input to a standard transformer encoder

  6. Pretrain the model with image labels (fully supervised on a huge dataset)

  7. Finetune on the downstream dataset for image classification

A demo of a Vision Transformer for Image Classification (source)

Image patches are the sequence tokens (like words). The encoder block is identical to the original transformer proposed by Vaswani et al. (2017).

There are multiple blocks in the ViT encoder, and each block consists of three major processing elements:

  • Layer Norm

  • Multi-head Self-Attention (MSA)

  • Multi-Layer Perceptron (MLP)

Layer Norm normalizes the activations of each token, which keeps the training process stable and lets the model adapt to the variations among the training images.

Multi-head Self-Attention (MSA) is the component responsible for generating attention maps from the given embedded visual tokens. These attention maps help the network focus on the most critical regions in the image, such as object(s). The concept of attention maps is the same as that found in the traditional computer vision literature (e.g., saliency maps and alpha-matting).

MLP is a two-layer feed-forward network with a GELU (Gaussian Error Linear Unit) non-linearity. The final MLP block, also called the MLP head, produces the output of the transformer. Applying softmax to this output gives classification probabilities (e.g., if the application is image classification).
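To make the three elements concrete, here is a minimal NumPy sketch of one encoder block. It is an illustration under simplifying assumptions, not the reference implementation: attention is single-head, Layer Norm has no learnable scale or shift, GELU uses its tanh approximation, and the names `encoder_block` and `params` as well as all sizes are made up for the example. It follows the ordering described in the ViT paper, with Layer Norm applied before each sub-layer and a residual connection after it.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance
    # (learnable scale/shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head self-attention (a simplification of multi-head attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def encoder_block(x, params):
    # Sub-layer 1: Layer Norm -> attention -> residual connection
    x = x + self_attention(layer_norm(x), params["w_q"], params["w_k"], params["w_v"])
    # Sub-layer 2: Layer Norm -> two-layer MLP with GELU -> residual connection
    hidden = gelu(layer_norm(x) @ params["w1"] + params["b1"])
    return x + (hidden @ params["w2"] + params["b2"])

# Hypothetical sizes chosen only for the demo.
rng = np.random.default_rng(0)
num_tokens, dim, hidden_dim = 5, 8, 32
params = {
    "w_q": rng.normal(scale=0.1, size=(dim, dim)),
    "w_k": rng.normal(scale=0.1, size=(dim, dim)),
    "w_v": rng.normal(scale=0.1, size=(dim, dim)),
    "w1": rng.normal(scale=0.1, size=(dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "w2": rng.normal(scale=0.1, size=(hidden_dim, dim)),
    "b2": np.zeros(dim),
}
tokens = rng.normal(size=(num_tokens, dim))
out = encoder_block(tokens, params)
print(out.shape)  # (5, 8)
```

The block keeps the sequence shape unchanged, which is what allows several such blocks to be stacked one after another.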

ViT vs. Convolutional Neural Networks

The ViT model represents an input image as a series of image patches, like the series of word embeddings used when applying transformers to text, and directly predicts class labels for the image.

Vision Transformer (ViT) has been gaining momentum in recent years. In the following sections, we will explain the ideas from the paper entitled “Do Vision Transformers See Like Convolutional Neural Networks?” by Raghu et al., published in 2021 by Google Research and Google Brain. In particular, we’ll explore the difference between the conventionally used CNN and Vision Transformer.

The six central ideas shared in the paper are:

  1. ViT has more similarity between the representations obtained in shallow and deep layers compared to CNNs.

  2. Unlike CNNs, ViT obtains the global representation from the shallow layers, but the local representation obtained from the shallow layers is also important.

  3. Skip connections in ViT are even more influential than in CNNs (ResNet) and substantially impact the performance and similarity of representations.

  4. ViT retains more spatial information than ResNet.

  5. ViT can learn high-quality intermediate representations with large amounts of data.

  6. MLP-Mixer’s representation is closer to ViT than to ResNet.

Understanding fundamental differences between ViTs and CNNs is essential as the transformer architecture has become more ubiquitous. Transformers have extended their reach from taking over the world of language models to usurping CNNs as the de-facto vision model.

How do Vision Transformers work?

Before diving deep into how vision Transformers work, we must understand the fundamentals of attention and multi-head attention presented in the original transformer paper.

The Transformer is a model proposed in the paper “Attention Is All You Need” (Vaswani et al., 2017). It is built around a mechanism called self-attention rather than on convolutions (as in CNNs) or recurrence (as in LSTMs), and it significantly outperformed existing methods.

Note that the part labeled Multi-Head Attention in the figure below is the core part of the Transformer, but the architecture also uses skip connections like ResNet.

Transformer architecture

The attention mechanism used in the Transformer uses three variables: Q (Query), K (Key), and V (Value). Simply put, it calculates an attention weight between a Query token (a token is something like a word) and each Key token, and then uses those weights to take a weighted sum of the Values associated with the Keys.

Single Head (Self-Attention)
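The Query, Key, and Value computation can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention as defined in “Attention Is All You Need”, where the scores are divided by the square root of the key dimension before the softmax. The function name and the toy shapes are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v)."""
    d_k = q.shape[-1]
    # Similarity between every Query and every Key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the Values.
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))  # 3 query tokens
k = rng.normal(size=(6, 4))  # 6 key tokens
v = rng.normal(size=(6, 2))  # one value vector per key
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # (3, 2) (3, 6)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the Value vectors.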

Treating the Q, K, and V calculation above as a single head, the multi-head attention mechanism is defined as follows. The single-head mechanism in the figure above uses Q, K, and V directly. In the multi-head mechanism, each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and computes its attention weights from the features projected with these matrices.

Multi-Head Attention

The intuition behind multi-head attention is that it allows us to attend to different parts of the sequence differently each time. This practically means that:

  • The model can better capture positional information because each head will attend to different input segments. The combination of them will give us a more robust representation.

  • Each head will also capture different contextual information by uniquely correlating words.
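As a rough sketch of how the heads fit together, the NumPy code below runs several attention heads in parallel and concatenates their outputs. One assumption to flag: instead of storing a separate W_i^Q, W_i^K, and W_i^V per head, it packs them side by side into single matrices `w_q`, `w_k`, and `w_v` and splits the result, which is a common and equivalent layout. The output projection `w_o`, the function name, and all sizes are likewise illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n, d = x.shape
    d_head = d // num_heads

    def split_heads(t):
        # (n, d) -> (num_heads, n, d_head): each head sees its own slice.
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Separate attention weights per head: (num_heads, n, n)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ v  # (num_heads, n, d_head)
    # Concatenate the heads back into (n, d) and mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
n_tokens, dim, num_heads = 5, 8, 2
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Because `weights` has one attention map per head, the heads are free to focus on different parts of the sequence.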

Vision Transformer (ViT) is a model that applies the Transformer to the image classification task and was proposed in October 2020 (Dosovitskiy et al. 2020). The model architecture is almost the same as the original Transformer but with a twist to allow images to be treated as input, just like natural language processing.

Transformer Architecture modified for images (ViT) (source)

The paper suggests using a Transformer Encoder as a base model to extract features from the image and passing these “processed” features into a Multilayer Perceptron (MLP) head model for classification.

Transformers are already very compute-heavy—infamous for their quadratic complexity when computing the Attention matrix. This worsens as the sequence length increases.

For a 28x28 MNIST image, if we flatten it to 784 pixels, we have to deal with an attention matrix of 784x784 to see which pixels attend to one another. Because the cost grows quadratically with the number of pixels, this quickly becomes very expensive at realistic image resolutions, even for today’s hardware.

Hence, the paper suggests breaking the image down into square patches as a form of lightweight “windowed” Attention to address this issue.
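A quick back-of-the-envelope calculation shows how much patching shrinks the attention matrix. The helper below is hypothetical and only counts tokens and attention-matrix entries. The 7x7 patch size for MNIST is an arbitrary choice for illustration, while 16x16 patches on a 224x224 image mirror the setting in the paper's title.

```python
def attention_matrix_entries(height, width, patch_size=1):
    # Number of tokens in the sequence, and entries in the
    # (tokens x tokens) attention matrix. patch_size=1 means per-pixel tokens.
    tokens = (height // patch_size) * (width // patch_size)
    return tokens, tokens * tokens

print(attention_matrix_entries(28, 28))        # (784, 614656)
print(attention_matrix_entries(28, 28, 7))     # (16, 256)
print(attention_matrix_entries(224, 224))      # (50176, 2517630976)
print(attention_matrix_entries(224, 224, 16))  # (196, 38416)
```

For the 224x224 case, moving from pixels to 16x16 patches reduces the attention matrix from about 2.5 billion entries to under 40 thousand.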


The image is converted to square patches.

These patches are flattened and sent through a single linear (feed-forward) layer to get a linear patch projection. This layer contains the embedding matrix E mentioned in the paper. The matrix E is randomly initialized and learned during training.

patches flattened and sent through a single Feed Forward layer
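The patch splitting, flattening, and projection steps can be sketched with NumPy reshapes. This is a minimal illustration under assumptions: the image is a random 32x32 RGB array, the patch size of 8 and embedding size of 64 are arbitrary, and `E` is a random stand-in for the embedding matrix that would be learned during training.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (H, W, C) -> (gh, P, gw, P, C) -> (gh, gw, P, P, C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector of length P * P * C.
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # toy image
patch_size, dim = 8, 64                          # illustrative sizes
patches = image_to_patches(image, patch_size)    # (16, 192)

# Stand-in for the embedding matrix E (learned in a real model).
E = rng.normal(scale=0.02, size=(patches.shape[1], dim))
patch_embeddings = patches @ E                   # (16, 64)
print(patches.shape, patch_embeddings.shape)
```

Patches are ordered row by row, so the first row of `patches` is the top-left 8x8 block of the image.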

To help with the classification bit, the authors took inspiration from the original BERT paper by concatenating a learnable [class] embedding with the other patch projections.

Yet another problem with Transformers is that the sequence order is not enforced naturally, since the data is passed in all at once instead of timestep-wise, as is done in RNNs and LSTMs. To combat this, the original Transformer paper suggests using Positional Encodings/Embeddings that establish a certain order in the inputs.

The positional embedding matrix E_pos is randomly initialized, learned during training, and added to the concatenated matrix containing the learnable class embedding and patch projections.

D is the fixed latent vector size used throughout the Transformer. It’s what we squash the input vectors to before passing them into the Encoder.

Altogether, these patch projections and positional embeddings form a larger matrix that’ll soon be put through the Transformer Encoder.
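The assembly of the encoder input can be sketched as follows. All arrays here are random stand-ins: in a real model the patch projections come from the embedding step, and both the [class] embedding and E_pos are learned parameters. The sizes (16 patches, D = 64) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 64  # illustrative sizes; dim plays the role of D

# Stand-ins for values that a trained model would provide.
patch_embeddings = rng.normal(size=(num_patches, dim))
class_token = rng.normal(scale=0.02, size=(1, dim))          # learnable [class] embedding
E_pos = rng.normal(scale=0.02, size=(num_patches + 1, dim))  # learnable position embeddings

# Prepend the [class] embedding, then add one position embedding per token.
sequence = np.concatenate([class_token, patch_embeddings], axis=0) + E_pos
print(sequence.shape)  # (17, 64)
```

The sequence is one token longer than the number of patches because of the extra [class] embedding in the first position.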

The outputs of the Transformer Encoder are then sent into a Multilayer Perceptron for image classification. The input features capture the image's essence very well, making the MLP head’s classification task far simpler.

Pro tip: Read this Comprehensive Guide to Convolutional Neural Networks

map head outputs

The MLP head takes as input only the Transformer output that corresponds to the special [class] embedding and ignores the other outputs.
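A small sketch makes the role of the [class] token explicit. It assumes the [class] token sits at position 0 of the encoder output and, as a simplification, uses a single linear layer as the head. The weights, the number of classes, and the function name are all illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def classification_head(encoder_output, w, b):
    # Only the output for the [class] token (row 0) is used;
    # the per-patch outputs are ignored.
    class_repr = encoder_output[0]
    logits = class_repr @ w + b
    return softmax(logits)

rng = np.random.default_rng(0)
num_tokens, dim, num_classes = 17, 64, 10  # illustrative sizes
encoder_output = rng.normal(size=(num_tokens, dim))
w = rng.normal(scale=0.02, size=(dim, num_classes))
b = np.zeros(num_classes)

probs = classification_head(encoder_output, w, b)
print(probs.shape, int(probs.argmax()))
```

Changing any row other than row 0 of `encoder_output` leaves `probs` unchanged, which is exactly the "ignores the other outputs" behavior described above.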

Performance benchmark comparison of ViT vs. ResNet vs. MobileNet

While ViT shows excellent potential in learning high-quality image features, its accuracy gains come at a cost in run-time performance. In this comparison, the small gain in accuracy does not justify ViT's slower run time.

Six applications of Vision Transformers

With ever-increasing dataset sizes and the continued development of unsupervised and semi-supervised methods, developing new vision architectures that train more efficiently on these datasets becomes increasingly essential. We believe ViT is a preliminary step towards generic, scalable architectures that can solve many vision tasks. Hence, the ViT has gained prominence in research interests due to its broad applicability and scalability.

Pro tip: Read our guide to Supervised vs. Unsupervised Learning [Differences & Examples]  

Here are some of the most prominent applications of ViT:

Image Classification

The task of image classification is the most common problem in vision. CNN-based methods are state-of-the-art for image classification tasks. ViTs don’t produce comparable performance on small to medium datasets. However, they have outperformed CNNs on very large datasets.

This is because CNNs encode the local information in the image more effectively than ViTs due to the application of locally restricted receptive fields.

Image Classification Output of a ViT

Pro tip: Read this introductory Guide to Image Classification and start building your classifiers with V7.

Image captioning

A more advanced form of image categorization can be achieved by generating a caption describing the content of an image instead of a one-word label. This has become possible with the use of ViTs. ViTs learn a general representation of a given data modality instead of a crude set of labels. Therefore, it is possible to generate descriptive text for a given image. We will use an implementation of ViT trained on the COCO dataset. The results of such captioning can be seen below:

Image Caption Generation using a ViT

Pro tip: Read this guide on Image Segmentation.

Image segmentation

DPT (Dense Prediction Transformer) is a segmentation model released by Intel in March 2021 that applies vision transformers to images. It performs semantic segmentation with 49.02% mIoU on ADE20K. It can also be used for monocular depth estimation, with a relative performance improvement of up to 28% compared to a state-of-the-art fully convolutional network.
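For readers unfamiliar with the metric, mIoU (mean Intersection over Union) averages, over classes, the overlap between predicted and ground-truth regions. The sketch below computes it for a single toy label map. Note the simplification: benchmark numbers such as the ADE20K figure are typically accumulated over a whole dataset rather than per image, and the function name and toy arrays are invented for the example.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(mean_iou(pred, target, num_classes=3))  # 0.5
```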

Segmentation results using ViT (source)

Pro tip: Looking for quality training data? Check out 65+ Best Free Datasets for Machine Learning to find the right dataset for your needs.

Anomaly detection

A transformer-based image anomaly detection and localization network combines a reconstruction-based approach and patch embedding. The use of transformer networks helps preserve the spatial information of the embedded patches, which is later processed by a Gaussian mixture density network to localize the anomalous areas.


Anomaly Detection using ViT (source)

Action recognition

In ViViT (Arnab et al., 2021), the Google Research team uses pure-transformer-based models for video classification, drawing upon the recent success of such models in image classification. The model extracts spatiotemporal tokens from the input video, which are then encoded by a series of transformer layers.

To handle the long sequences of tokens encountered in video, the authors propose several efficient variants of their model that factorize the input's spatial and temporal dimensions.

Although transformer-based models are known to be effective only when large training datasets are available, the authors show how to regularise the model effectively during training and leverage pretrained image models in order to train on comparatively small datasets.
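One way to picture spatiotemporal tokens is to cut the video into small non-overlapping blocks that span a few frames each, often called tubelets, and flatten each block into a vector. The sketch below does this with NumPy reshapes. It is an assumption-laden illustration rather than the authors' code: the function name, the toy video size, and the tubelet dimensions are all made up.

```python
import numpy as np

def video_to_tubelets(video, t_size, p_size):
    """Split a (T, H, W, C) video into flattened t_size x p_size x p_size blocks."""
    t, h, w, c = video.shape
    gt, gh, gw = t // t_size, h // p_size, w // p_size
    # (T, H, W, C) -> (gt, t, gh, p, gw, p, C) -> (gt, gh, gw, t, p, p, C)
    x = video.reshape(gt, t_size, gh, p_size, gw, p_size, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One flattened vector per tubelet.
    return x.reshape(gt * gh * gw, t_size * p_size * p_size * c)

rng = np.random.default_rng(0)
video = rng.random((8, 32, 32, 3))  # 8 frames of 32x32 RGB (toy input)
tokens = video_to_tubelets(video, t_size=2, p_size=16)
print(tokens.shape)  # (16, 1536)
```

With these toy sizes, 4 temporal steps times a 2x2 spatial grid gives 16 tokens, which shows how quickly the sequence grows for longer or larger videos.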

Pro tip: Read our guide to Human Pose Estimation: Deep Learning Approach

Autonomous driving

On Tesla AI Day in 2021, Tesla revealed many intricate inner workings of the neural network powering Tesla FSD. One of the most intriguing building blocks is one dubbed “image-to-BEV transform + multi-camera fusion.” At the center of this block is a Transformer module, or more concretely, a cross-attention module.

Pro tip: Explore other applications of AI in our guide AI in Supply Chain and Logistics [20+ Practical Applications]

Key Takeaways

Here are several things to keep in mind about ViTs:

  • ViT has demonstrated its effectiveness on CV tasks; vision transformers have received considerable attention and challenged the dominance of CNNs in the CV field.

  • Since Transformers require a large amount of data for high accuracy, the data collection process can extend project time. In the case of having less data, CNNs generally perform better than Transformers.

  • With large-scale pre-training, ViTs can need less training compute than comparable CNNs to reach the same accuracy. Comparing them in terms of computational efficiency and accuracy, Transformers can be chosen if the time for model training is limited.

  • The self-attention mechanism can make the developed model more interpretable. Attention maps can be visualized to show which image regions the model focuses on, which helps developers understand its weaknesses and decide how to improve it. This process is harder for CNN-based models.

  • The chosen approach should be straightforward and fast to deploy, unless you have no time limits. Even though there are some frameworks for Transformers, CNN-based approaches are still less complex to deploy.

  • The emergence of Vision Transformers has also provided an important foundation for developing large vision models. At the time of writing, one of the largest vision models is Google’s ViT-MoE model, which has 15 billion parameters. These large models have set new records on ImageNet-1K classification.



Deval is a senior software engineer at Eagle Eye Networks and a computer vision enthusiast. He writes about complex topics related to machine learning and deep learning.

