Vision Transformer (ViT) and DeiT: Comprehensive Documentation
This document provides a comprehensive overview of the Vision Transformer (ViT) and its data-efficient variant, DeiT. We will explore their architectures, key innovations, underlying formulas, and applications in image classification and other computer vision tasks.
1. Vision Transformer (ViT)
The Vision Transformer (ViT) is a groundbreaking deep learning model that adapts the highly successful transformer architecture, originally designed for Natural Language Processing (NLP), to computer vision tasks. Instead of relying on convolutional layers, ViT treats an image as a sequence of fixed-size patches, similar to how text is treated as a sequence of words.
1.1 How ViT Works
The core workflow of ViT can be summarized in the following steps:
- Image Patching: The input image is divided into a grid of non-overlapping patches. For example, a 224x224 image split into 16x16 patches yields 14 x 14 = 196 patches.
- Patch Flattening and Embedding: Each patch is flattened into a vector. These flattened vectors are then linearly projected into an embedding space, transforming them into fixed-dimensional representations.
- Positional Embeddings: Since the transformer architecture itself is permutation-invariant (it doesn't inherently understand the order of input tokens), positional embeddings are added to the patch embeddings. This allows the model to retain spatial information about the patches.
- Transformer Encoder: The sequence of embedded patches (along with positional embeddings) is fed into a standard transformer encoder. This encoder consists of multiple layers, each containing:
  - Multi-Head Self-Attention (MHSA): Allows the model to weigh the importance of different patches relative to each other.
  - Feed-Forward Network (FFN): A simple multi-layer perceptron (MLP) applied independently to each patch representation.
  - Layer Normalization and Residual Connections: Layer normalization is applied before each MHSA and FFN sub-block, and residual connections are added around each, stabilizing training and improving gradient flow.
- Classification Head: A special learnable token, referred to as the [CLS] token (borrowed from NLP), is prepended to the sequence of patch embeddings. The final representation of this [CLS] token after passing through the transformer encoder serves as the global image representation and is fed into a simple MLP head to perform the final classification. The patching and embedding steps are sketched in code immediately after this list.
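To make the patching and embedding steps concrete, here is a minimal sketch assuming PyTorch; the tensor names, the embedding dimension of 192, and the use of unfold for patch extraction are illustrative choices rather than a reference implementation.

```python
# Minimal sketch of image patching, flattening/embedding, and positional
# embeddings (illustrative sizes; assumes PyTorch).
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224      # batch, channels, height, width
P, D = 16, 192                   # patch size, embedding dimension
N = (H // P) * (W // P)          # number of patches: 14 * 14 = 196

image = torch.randn(B, C, H, W)

# 1. Image patching: carve the image into non-overlapping P x P patches.
patches = image.unfold(2, P, P).unfold(3, P, P)                        # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)   # (B, 196, 768)

# 2. Patch flattening and embedding: shared linear projection to D dimensions.
proj = nn.Linear(C * P * P, D)
tokens = proj(patches)                                                 # (B, 196, D)

# 3. Prepend the learnable [CLS] token and add learnable positional embeddings.
cls_token = nn.Parameter(torch.zeros(B, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
z0 = torch.cat([cls_token, tokens], dim=1) + pos_embed                 # (B, 197, D)
print(z0.shape)                                                        # torch.Size([1, 197, 192])
```

In practice the flatten-and-project step is often implemented as a single strided convolution with kernel size and stride P, which is mathematically equivalent.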
1.2 ViT Architecture Components
The key components of the ViT architecture are:
- Patch Embedding:
  - Input: Image I of shape (H, W, C), where H is height, W is width, and C is channels.
  - Patching: Image I is divided into N patches, each of size P x P. Each patch is flattened into a vector of size P*P*C.
  - Embedding: A linear projection layer maps each flattened patch vector to an embedding dimension D.
  - Formula: patches ∈ ℝ^(N x (P*P*C)) → embeddings ∈ ℝ^(N x D)
- Positional Embeddings: Learnable embeddings of shape (N+1, D) are added to the patch embeddings (including the [CLS] token).
  - Formula: z₀ = [x_cls; x_patch_1; ...; x_patch_N] + E_pos
  - Here x_cls is the learnable [CLS] token embedding, x_patch_i is the embedding of the i-th patch, and E_pos are the learnable positional embeddings.
- Transformer Encoder Block: Consists of Multi-Head Self-Attention and a Feed-Forward Network (see the block sketch after this list).
  - Input: x (patch embeddings with positional information)
  - Layer 1 (MHSA): x_attn = x + MultiHeadAttention(LayerNorm(x))
  - Layer 2 (MLP): z = x_attn + MLP(LayerNorm(x_attn))
  - This block is stacked L times.
- Classification Head: A simple MLP applied to the [CLS] token's final representation.
  - Input: z_L^cls (the final representation of the [CLS] token)
  - Output: y = MLP(z_L^cls)
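The two encoder formulas above translate almost verbatim into code. Below is a minimal sketch of one pre-norm block assuming PyTorch; the class name EncoderBlock and the sizes (D = 192, 3 heads, MLP ratio 4) are illustrative assumptions.

```python
# One pre-norm transformer encoder block (illustrative; assumes PyTorch).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=192, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                                    # x: (B, N+1, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # x_attn = x + MHSA(LN(x))
        return x + self.mlp(self.norm2(x))                   # z = x_attn + MLP(LN(x_attn))

blocks = nn.Sequential(*[EncoderBlock() for _ in range(4)])  # stacked L = 4 times
out = blocks(torch.randn(1, 197, 192))                       # (1, 197, 192)
```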
1.3 ViT Formula Summary
- Input Image Shape: image ∈ ℝ^(H×W×C)
- Patching and Flattening: patches ∈ ℝ^(N×(P×P×C)), where N = (H*W)/P²
- Patch Embedding: Linear projection of each patch: x_patch_i ∈ ℝ^D
- Input to Transformer: z₀ = [x_cls; x_patch_1; ...; x_patch_N] + E_pos ∈ ℝ^((N+1)×D)
- Transformer Encoding (L layers): z₁, z₂, ..., z_L, where each z_l ∈ ℝ^((N+1)×D)
- Classification Output: y = MLP(z_L^cls) ∈ ℝ^(Num_Classes) (see the end-to-end sketch below)
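Tying the summary together, the following is a compact end-to-end sketch assuming PyTorch. The class name TinyViT, the layer sizes, and the single-Linear classification head are illustrative simplifications, not a reference ViT implementation.

```python
# End-to-end ViT-style classifier sketch (illustrative; assumes PyTorch).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2        # N = (H*W)/P^2
        # Strided convolution = flatten each P x P patch + shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, dim, patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # L layers
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # z_0 ∈ ℝ^((N+1)×D)
        x = self.encoder(x)                      # z_1, ..., z_L
        return self.head(x[:, 0])                # y = MLP(z_L^cls)

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # shape: (2, 1000)
```

Setting norm_first=True reproduces the pre-norm layout described in Section 1.2, where layer normalization precedes each attention and MLP sub-block.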
2. DeiT (Data-efficient Image Transformer)
DeiT, developed by Facebook AI (Meta), is an evolution of the ViT architecture that significantly improves its data efficiency. DeiT achieves competitive performance with substantially less data and fewer training resources, making transformers more practical for a wider range of computer vision tasks without requiring massive pre-training datasets like ImageNet-21k or JFT-300M.
2.1 DeiT Key Innovations
DeiT introduces two primary innovations:
- Distillation Token:
  - A second learnable token is added to the input sequence, similar to the [CLS] token.
  - This distillation token is trained to mimic the output of a larger, pre-trained "teacher" model (often a standard CNN like ResNet or even a larger ViT).
  - During training, the model is optimized to match both the ground truth labels (like ViT) and the softened predictions of the teacher model; a sketch of this combined loss follows the list. This "knowledge distillation" process allows the smaller DeiT model to learn more effectively from limited data.
- Efficient Training Methods:
  - DeiT leverages advanced data augmentation techniques and regularization strategies.
  - Crucially, it demonstrates that a ViT-like architecture can achieve state-of-the-art or competitive results on datasets like ImageNet-1k without relying on large-scale pre-training. This significantly lowers the barrier to entry for using transformer models in vision.
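As a rough illustration of how the two supervision signals are combined, here is a sketch of a soft-distillation loss assuming PyTorch; the function name, the temperature tau, and the weight lam are illustrative assumptions, and DeiT also reports a hard-label variant that supervises the distillation token with the teacher's argmax prediction instead of its softened distribution.

```python
# Sketch of a DeiT-style soft-distillation objective (illustrative; assumes PyTorch).
import torch
import torch.nn.functional as F

def deit_distillation_loss(cls_logits, dist_logits, teacher_logits, targets,
                           tau=3.0, lam=0.5):
    # The [CLS] token is supervised by the ground-truth labels.
    loss_cls = F.cross_entropy(cls_logits, targets)
    # The distillation token is supervised by the teacher's softened distribution.
    loss_soft = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    return (1 - lam) * loss_cls + lam * loss_soft
```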
2.2 How DeiT Enhances ViT
- Improved Generalization: DeiT models often exhibit better generalization capabilities compared to standard ViT when trained on smaller datasets.
- Lower Training Cost: Because DeiT trains effectively on smaller datasets like ImageNet-1k (1.2M images) while reaching competitive accuracy, it requires far less computational power and time than the original ViT recipe, which relied on pre-training datasets with hundreds of millions of images.
3. Applications
ViT and DeiT models have found applications in various computer vision domains:
- Image Classification: Their primary and most successful application.
- Object Detection: When integrated with architectures like DETR (DEtection TRansformer).
- Image Segmentation: Used in various segmentation frameworks.
- Vision-Language Models: Bridging the gap between visual and textual understanding.
- Action Recognition: Applied to video data.
4. Advantages of Vision Transformers (ViT & DeiT)
- No Need for Convolutions: Eliminates the inductive bias of CNNs, allowing for greater flexibility.
- Scalability: Can be scaled to very large model sizes, leading to improved performance.
- Unified Architectures: Facilitates easier integration into unified models that handle both vision and language tasks.
- Data Efficiency (DeiT): DeiT specifically addresses the data bottleneck of ViT, making it more accessible.
- Global Context: Self-attention naturally captures long-range dependencies across the entire image.
5. Limitations of Vision Transformers
- Data Requirements (ViT without DeiT): The original ViT requires very large pre-training datasets to compensate for its lack of the inductive biases built into CNNs.
- Computational Intensity: Can be computationally expensive, especially at higher input resolutions, because self-attention scales quadratically with sequence length (the number of patches); the quick calculation after this list illustrates the growth.
- Interpretability: While attention maps offer some insight, understanding the exact reasoning behind a transformer's decision remains an active research area.
- Positional Embeddings: Choice and implementation of positional embeddings can impact performance.
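A quick back-of-the-envelope calculation illustrates the quadratic growth mentioned above (plain Python; the patch size of 16 matches the earlier example):

```python
# How the number of attention pairs grows with input resolution (patch size 16).
def attention_pairs(image_size, patch_size=16):
    n = (image_size // patch_size) ** 2   # number of patches N
    return n, n * n                       # self-attention compares every pair of patches

for size in (224, 384, 448):
    n, pairs = attention_pairs(size)
    print(f"{size}x{size}: N = {n:4d} patches, {pairs:,} attention pairs")
# 224x224: N =  196 patches, 38,416 attention pairs
# 384x384: N =  576 patches, 331,776 attention pairs
# 448x448: N =  784 patches, 614,656 attention pairs
```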
6. Interview Questions
Here are some common interview questions related to ViT and DeiT:
- What is the Vision Transformer (ViT) and how does it process images?
- Describe the key components of the ViT architecture.
- How does patch embedding work in Vision Transformers?
- What is the significance of position embeddings in ViT?
- How is classification performed in Vision Transformers?
- What are the main innovations introduced by DeiT compared to ViT?
- How does the distillation token improve DeiT training?
- Why is DeiT considered more data-efficient than ViT?
- What are the advantages and limitations of using Vision Transformers?
- What are some practical use cases for ViT and DeiT models?
- How does self-attention enable transformers to capture global context in images?
- Can you explain the trade-offs between ViT and traditional CNNs?