

🎓 110/167
This post is part of the Computer vision educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary materials. Stay tuned!
Deep learning has achieved remarkable progress in computer vision over the past decade, mainly propelled by convolutional neural networks (CNNs). Models such as AlexNet, VGG, ResNet, and more advanced architectures like EfficientNet all leverage convolutional layers as the backbone for image feature extraction. Convolutions exploit inductive biases like local connectivity, weight sharing, and translation invariance, which have proven extremely effective for tasks such as image classification, object detection, and image segmentation.
However, the success of transformers in natural language processing (machine translation, language modeling, and beyond) inspired researchers to adapt these architectures to vision problems. The transformer, introduced in "Attention Is All You Need" (Vaswani and gang, NeurIPS 2017), relies heavily on self-attention mechanisms and feed-forward layers, without convolutional operations or recurrent structures. Translating this approach to images initially seemed nontrivial, since images are fundamentally 2D structures, while text is typically processed as 1D token sequences. Moreover, early attempts raised concerns about the heavy computational cost of pairwise attention among all image pixels.
Despite these challenges, the notion of global attention — enabling every image patch or "token" to attend to every other patch — offers a compelling proposition for capturing long-range dependencies. Convolutions, although powerful, inherently limit receptive fields (albeit they can expand through deeper layers). Vision transformers (ViTs) achieve this kind of global interaction from the start, with minimal inductive bias. The question then arises: how should an image be transformed into a sequence of tokens? The answer came in the form of patch embeddings, effectively slicing images into smaller patches, flattening them, and embedding them as vectors. This straightforward scheme proved highly effective, especially at scale.
In this article, I explore the world of vision transformers in great detail — covering everything from the fundamental architecture to modern variants like Swin transformer, Convolutional Vision Transformer (CvT), Dilated Neighborhood Attention Transformer (DiNAT), and more. Along the way, I discuss how these models compare to and sometimes integrate with CNNs. I explain the benefits, the known challenges, and provide pointers on how to train these models effectively. The material aims to be deeply theoretical, yet approachable, illustrating core ideas with formulas, code snippets, and references to state-of-the-art research from major ML conferences and journals (e.g., NeurIPS, ICCV, CVPR, ICML, JMLR).
Given that we are now seeing an explosion of attention-based architectures for tasks as diverse as medical imaging, advanced object detection, segmentation, and beyond, it is essential to understand the foundations on which these models are built and how they can be applied and adapted in real-world settings. Let's begin by revisiting the fundamentals of the transformer model and see how it has been adapted from natural language processing (NLP) to computer vision.
fundamentals of vision transformers
revisiting the transformer architecture: self-attention, feed-forward layers, residual connections
The transformer was originally designed for sequence-to-sequence tasks in NLP, where each sequence element (typically a subword token) attends to all other tokens in the sequence to capture contextual relationships. The key mechanism at play is self-attention, a computational block that computes attention weights between every pair of tokens in the input sequence. These tokens are first projected into three representations: queries $Q$, keys $K$, and values $V$. Formally, for an input sequence of tokens $X \in \mathbb{R}^{n \times d}$ (where $n$ is the sequence length and $d$ is the embedding dimensionality), we have:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

Each of $W_Q$, $W_K$, $W_V$ is a learned projection matrix of shape $d \times d_k$, $d \times d_k$, and $d \times d_v$ for queries, keys, and values, respectively (often $d_k = d_v = d$, but they can differ). The self-attention output is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The attention distribution $\mathrm{softmax}\!\left(Q K^\top / \sqrt{d_k}\right)$ indicates how strongly each token should weigh every other token for representation. Transformers typically feature multiple heads of attention (multi-head self-attention), each projecting queries, keys, and values into different subspaces to capture diverse aspects of relationships. Then these heads are concatenated:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O$$

where each $\mathrm{head}_i = \mathrm{Attention}(X W_Q^{(i)}, X W_K^{(i)}, X W_V^{(i)})$ and $W_O$ is another learned projection. The feed-forward layer is typically a two-layer MLP with a non-linear activation, applied after the attention block. Finally, residual connections and layer normalization are included at various points in the architecture:
- Residual connections help preserve gradients across layers and reduce vanishing or exploding gradient problems.
- Layer normalization stabilizes training by normalizing the activations across feature dimensions.
In the standard transformer, the entire architecture can be summarized as repeated blocks of: multi-head self-attention, add & norm, feed-forward, add & norm. In a vision transformer, these same components are reused almost verbatim — although the way the data is tokenized is quite distinct from language use cases.
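To make this concrete, here is a minimal PyTorch sketch of multi-head self-attention following the formulas above (the MultiHeadSelfAttention class and its fused QKV projection are illustrative choices of mine; in practice you would rely on nn.MultiheadAttention or an optimized implementation such as the one in timm):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: X -> softmax(QK^T / sqrt(d_k)) V."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)    # fused Q, K, V projections
        self.proj = nn.Linear(dim, dim)       # output projection W_O

    def forward(self, x):                     # x: (batch, n_tokens, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)           # attention distribution over tokens
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

out = MultiHeadSelfAttention(dim=768, num_heads=12)(torch.randn(2, 197, 768))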
patch embeddings for images: splitting images into patches and embedding them as tokens
A crucial step when applying transformers to vision is converting a 2D image into a set of 1D tokens. One canonical approach (introduced in Vision Transformer, Dosovitskiy and gang, ICLR 2021) is to split an image of shape $H \times W \times C$ (height, width, and channels) into non-overlapping patches of size $P \times P$. If we flatten each patch, it becomes a vector of size $P^2 \cdot C$. Next, we linearly project this flattened patch into an embedding of size $D$. By doing this for all patches, we obtain a set of $N = HW / P^2$ tokens, each of dimension $D$.
To help the model understand the position of each token, positional embeddings (fixed or learned) are usually added to the token embeddings. These can be 1D sinusoidal embeddings, trainable vectors, or other variations. Concretely, if $x_i$ is the embedding of the $i$-th patch, then the final input token is:

$$z_i = x_i + p_i$$

where $p_i$ is the positional embedding for the $i$-th token. The sequence $(z_1, \ldots, z_N)$ is then passed into the transformer encoder layers.
A further detail from the original ViT approach is to prepend a [class] token embedding (just like BERT's [CLS]) at the start of the sequence; the output of this class token is used for classification. Alternatively, one can pool the final patch token embeddings. Different variants exist, but the overarching idea is that the token for classification is learned as a global descriptor that aggregates information from other patches through the attention mechanism.

[Figure: an image split into patches, each flattened and linearly embedded, then combined with position embeddings.]
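As a concrete illustration of this tokenization, the following sketch implements a ViT-style patch embedding with a [class] token and learned positional embeddings (the PatchEmbedding class is illustrative; libraries such as timm use the equivalent trick of a strided convolution, which is the same operation as flattening patches followed by a linear layer):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the [class] token
        return x + self.pos_embed               # add positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # (2, 197, 768)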
encoder and decoder overview: understanding the core components of vit
The original transformer architecture typically has an encoder and a decoder (useful in sequence-to-sequence tasks like machine translation). However, for many vision tasks (like classification), a decoder is not strictly necessary. The standard Vision Transformer (Dosovitskiy and gang) employs only the encoder portion. That means the pipeline is:
- Compute patch embeddings and add positional embeddings.
- Pass these embeddings through multiple layers of multi-head self-attention + feed-forward sublayers + residual connections + layer normalization.
- Optionally, extract the [class] token output or perform some pooling, feed it to a classification head (often a linear layer), and produce the final class logits.
For downstream tasks such as image generation, object detection, or segmentation, some modifications or expansions might include decoder-like structures (as we see in DETR for object detection). But for standard classification, the focus is primarily on the encoder.
strengths and challenges: data requirements, interpretability, computational complexity
One of the major benefits of using a transformer-based model in vision is the ability to capture global relationships from the very first layer, thanks to self-attention across patches. Another benefit is the minimal inherent inductive bias — transformers do not enforce local connectivity like convolutions do. This can be a double-edged sword:
- Strength: The model can learn relevant features end-to-end, potentially discovering superior representations if provided with enough data.
- Challenge: Vision transformers often require massive datasets (like ImageNet-21K or JFT-300M) to unlock their full potential. Without enough data or aggressive data augmentation, they may suffer from overfitting.
- Interpretability: The attention maps in transformers can sometimes provide interpretable cues about where the model focuses. However, interpretability remains complex, as attention can be distributed in subtle ways, and not all forms of attention are easily human-interpretable.
- Computational complexity: The standard self-attention mechanism scales quadratically with the number of tokens. For high-resolution images, this can be computationally expensive. Modern variants address this challenge in different ways (e.g., shifting windows, dilating attention, or imposing local constraints).
Overall, if you have large-scale data or use pre-trained models, ViTs can yield performance on par with or better than CNNs, often with strong scaling properties as you grow the model size and data. However, these benefits come at the cost of high training requirements and more complexity in implementing efficient operations on GPUs or specialized hardware.
multi-head self-attention and positional embedding strategies
Vision transformers typically preserve the multi-head attention design from the original transformer. Each attention head learns to focus on different aspects or regions of the image. Researchers have experimented with variations on positional embedding for images, such as:
- Learned 2D embeddings that map row and column coordinates to vectors, then combine them.
- Sinusoidal 2D embeddings that extend the 1D sinusoidal approach from the original transformer to two dimensions.
- Relative positional embeddings, popular in some advanced variants like the Swin transformer, which model relative distances among patches.
- Zero or no explicit embeddings in some hierarchical approaches that rely on convolution-like operations (seen in certain hybrid models).
Each approach has trade-offs between simplicity, expressivity, and the capacity for the model to generalize to different input resolutions.
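As an example of the first option, here is a small sketch of learned 2D positional embeddings built from separate row and column tables (the class name and grid size are illustrative assumptions, not a specific library API):

import torch
import torch.nn as nn

class Learned2DPositionalEmbedding(nn.Module):
    """Separate learned embeddings for row and column indices, summed per patch."""
    def __init__(self, grid_size=14, dim=768):
        super().__init__()
        self.row_embed = nn.Embedding(grid_size, dim)
        self.col_embed = nn.Embedding(grid_size, dim)
        self.grid_size = grid_size

    def forward(self):
        rows = torch.arange(self.grid_size)
        cols = torch.arange(self.grid_size)
        # (grid, 1, dim) + (1, grid, dim) -> (grid, grid, dim), then flatten to (N, dim)
        pos = self.row_embed(rows)[:, None, :] + self.col_embed(cols)[None, :, :]
        return pos.reshape(-1, pos.shape[-1])   # one vector per patch, in row-major order

pos = Learned2DPositionalEmbedding()()          # (196, 768), added to the patch tokens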
tokenization nuances in vision vs. language
In language tasks, tokens often represent subwords or words with natural boundaries. In images, patch boundaries may not always align with semantic boundaries (like edges of objects). A single patch may contain partial or multiple objects or backgrounds. This can make the model's job more challenging — hence the desire for large datasets and robust training. Some approaches incorporate learnable patch splitting or dynamic patch shapes, but the mainstream approach in ViTs remains fixed patch sizes.
Furthermore, token ordering can be artificial in images. In text, there is a clear sequence order. In images, the row-major or column-major flattening of patches is an arbitrary linearization. The model must learn the concept of 2D spatial layout through positional embeddings. This is an additional reason why large-scale data or strong regularization might be needed.
common training pitfalls (e.g., overfitting, data augmentation, large-scale pretraining needs)
Researchers have found that training a ViT from scratch on smaller datasets (like ImageNet with 1 million images) can be quite difficult without heavy regularization. Some standard recommendations include:
- Use large-scale pretraining on bigger datasets (ImageNet-21K, JFT-300M, LAION, etc.), then fine-tune on the target dataset.
- Adopt strong data augmentation such as Mixup, CutMix, RandAugment, or augmentations from the AugReg approach (Steiner and gang, 2021).
- Apply knowledge distillation from a well-trained CNN teacher (as in DeiT, Touvron and gang, ICML 2021).
- Carefully tune hyperparameters such as learning rate, weight decay, batch size, and the type of optimizer (AdamW is common for ViTs).
When these best practices are followed, ViTs can outperform comparable CNNs in various benchmarks, showcasing the potential of attention-based models in vision.
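To illustrate one of the augmentation techniques mentioned above, here is a minimal Mixup sketch that blends pairs of images and their one-hot labels (a simplified version of the idea; in practice you would typically use timm's built-in Mixup/CutMix utilities):

import torch
import torch.nn.functional as F

def mixup(images, targets, num_classes, alpha=0.2):
    """Minimal Mixup: blend random pairs of images and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    one_hot = F.one_hot(targets, num_classes).float()
    mixed_targets = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_targets

Note that the mixed targets are soft labels, so the training loss has to accept class probabilities rather than integer class indices.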
cnn vs. vision transformers
inductive bias in cnns: local receptive fields, weight sharing, translation invariance
CNNs have dominated computer vision due to their built-in inductive biases:
- Local receptive fields: Convolutions operate over local pixel neighborhoods, which is efficient for capturing edges, textures, and simple shapes.
- Weight sharing: A single filter is applied across different spatial locations, greatly reducing the number of parameters and enabling translation-invariant feature detection.
- Translation invariance: The same pattern recognized at one location can be recognized at another location in the image. This ties well to typical object recognition tasks.
These biases make CNNs sample-efficient and easier to train on relatively smaller datasets, because the model does not have to learn these fundamental properties from scratch.
why transformers can work without strong inductive bias: global attention and flexible modeling
Vision transformers eliminate some of the strict assumptions embedded in CNNs. Rather than forcing local connectivity at each stage, ViTs allow for:
- Global attention: Each patch can potentially attend to any other patch, enabling the model to capture long-range interactions from the beginning.
- Flexible receptive fields: The receptive field can scale up or down depending on the attention patterns.
- Learned features: The model learns how to parse the image by itself, which can lead to emergent forms of invariances or hierarchical structure, but requires significant data.
While CNNs do well with smaller or medium-sized datasets, ViTs often shine when large-scale data is available or when there is a powerful pretraining scheme. This is the central trade-off: minimal inductive bias with high capacity can yield state-of-the-art performance, but usually at the expense of more data and computational resources.
scalability and data requirements: large-scale datasets vs. smaller datasets
Compared to CNNs, vision transformers exhibit impressive scaling behavior. As you increase the model size (e.g., from ViT-Base to ViT-Large or ViT-Huge) and the pretraining dataset size, the performance can keep improving steadily. Meanwhile, CNN performance often saturates faster at large parameter counts. This suggests that in scenarios where:
- You have extremely large labeled or unlabeled image corpora.
- You have the computational resources to train giant models.
ViTs can be a compelling choice. However, if you only have a small dataset (e.g., medical images with a few thousand samples), you might see better results using CNNs — or you might resort to heavy regularization, knowledge distillation, or specialized data augmentation to make a ViT work.
practical considerations (hardware, dataset size, etc.)
Vision transformers can be memory-intensive, especially at higher resolutions, because self-attention has $O(N^2)$ complexity in the number of tokens. With high-resolution images, the number of patches can be quite large. Some practical considerations include:
- Smaller patch sizes increase the length of the token sequence (and thus the attention cost), so a balance is needed. Large patches reduce sequence length but might degrade performance by not capturing fine details.
- Gradient checkpointing can reduce memory usage at the cost of slightly more computation.
- Efficient attention variants (like hierarchical windows or local attention) can mitigate the quadratic cost.
- Multi-GPU or distributed training is often necessary for large-scale training.
comparison with classical cnn architectures (e.g., resnet, efficientnet)
ResNet is a classic backbone for many vision tasks, featuring skip connections and uniform convolutional blocks. EfficientNet introduced a compound scaling approach that systematically increases network depth, width, and resolution. ViTs introduced a new dimension of scaling that does not revolve around convolution kernels but rather the size of the transformer layers, the embedding dimension, and the number of attention heads. Empirically:
- On ImageNet with around 1.2M training images, both a well-tuned ResNet and a well-tuned ViT can achieve strong accuracy, but the ResNet might be easier to train from scratch without specialized techniques.
- On extremely large datasets, ViTs can surpass CNN accuracy with fewer heuristics.
effects of data augmentation and regularization on vits vs. cnns
Data augmentation and regularization can be even more critical for ViTs due to the weaker inductive bias. Techniques like CutMix, Mixup, RandAugment, stochastic depth, label smoothing, and knowledge distillation (the DeiT approach) often serve as crucial components of a successful training recipe. These strategies help reduce overfitting and produce stronger generalization. CNNs also benefit from data augmentation, but to a lesser extent, since they rely heavily on the built-in local connectivity prior.
using pre-trained vision transformers
publicly available models (hugging face hub, etc.) and model zoos
Many pre-trained ViT checkpoints are publicly accessible, for example in the Hugging Face Hub or the official repositories of major research institutions. There are also frameworks like timm (PyTorch Image Models) by Ross Wightman, which include an extensive model zoo of vision transformers and other state-of-the-art architectures. These resources can drastically speed up your workflow by allowing you to skip the expensive stage of training from scratch on massive datasets.
pros and cons of using pre-trained backbones
Advantages:
- Save time and compute resources, especially when the pre-training was done on hundreds of millions of images.
- Benefit from the general representations learned on large corpora, leading to higher accuracy and better generalization.
- Potentially reduce the risk of overfitting on small or medium-sized datasets.
Drawbacks:
- Pre-trained weights might not perfectly fit your domain (e.g., medical imaging, satellite imagery), requiring domain adaptation.
- Large-scale pre-trained models can be big in size (hundreds of MB or more), posing challenges for memory or deployment.
- You might end up relying on high-level features that are suboptimal for specialized tasks, unless you do extensive fine-tuning.
transfer learning workflow: from loading weights to final evaluation
A typical recipe to fine-tune a pre-trained ViT on a new dataset involves:
- Load the checkpoint from the model zoo or a local checkpoint file.
- Replace the final classification head to match your dataset's number of classes.
- Optionally freeze early layers if you have very limited data, although many practitioners prefer to unfreeze all layers eventually.
- Train with a smaller learning rate initially, then adjust or use a learning rate scheduler.
- Evaluate on a validation or test set to monitor performance and avoid overfitting.
freezing layers vs. full fine-tuning: trade-offs and best practices
Freezing the lower layers can reduce overfitting when you have few training samples, since it also reduces the effective number of learnable parameters. However, it can limit the adaptability of the model to domain-specific features. As a compromise, you might freeze layers at first and then progressively unfreeze them (often called gradual unfreezing, sometimes combined with layer-wise learning rates). Generally, if you have enough data and computational capacity, a full fine-tuning approach is recommended.
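For illustration, a minimal freezing/unfreezing sketch with a timm ViT might look as follows (the blocks and head attribute names follow timm's ViT implementation; unfreezing exactly the last two blocks is just an example choice):

import torch
from timm import create_model

model = create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the backbone, keep only the classification head trainable.
for param in model.parameters():
    param.requires_grad = False
for param in model.head.parameters():
    param.requires_grad = True

# Progressive unfreezing: later in training, also enable the last transformer blocks.
for block in model.blocks[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# Only pass trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4, weight_decay=0.05)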
choosing a learning rate and optimizer: recommendations for stable training
AdamW is currently the most common optimizer for vision transformers, with a typical weight decay around 0.05–0.1. A common fine-tuning learning rate starts relatively low (e.g., 1e-5 or 2e-5) and then decays over time. Experimentation with warm-up is also beneficial: a short warm-up phase (a few epochs) at an even lower LR can stabilize early training, especially if the final layers were replaced or re-initialized.
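A possible warm-up plus cosine-decay setup in PyTorch is sketched below (the epoch counts and factors are illustrative; LinearLR and SequentialLR require a reasonably recent PyTorch release):

import torch

model = torch.nn.Linear(8, 2)                 # stand-in for the ViT being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 50
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=warmup_epochs)        # ramp the LR up
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)                 # then decay towards zero
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... one training epoch with `optimizer` ...
    scheduler.step()                          # update the learning rate once per epoch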
multi-class image classification case study: example dataset and code flow
Below is a simplified PyTorch code snippet illustrating how one might load a pre-trained ViT, replace its classification head, and then train on a hypothetical dataset of 10 classes:
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from timm import create_model

# Data transforms
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

# Example dataset
train_dataset = datasets.ImageFolder("path_to_train", transform=train_transform)
val_dataset = datasets.ImageFolder("path_to_val", transform=val_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)

# Load a pre-trained ViT base model and replace its classification head
model = create_model("vit_base_patch16_224", pretrained=True)
model.head = nn.Linear(model.head.in_features, 10)  # 10 classes

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)

# Training loop
for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f"Epoch {epoch}, Validation Accuracy: {accuracy:.2f}%")
multi-label image classification case study: handling multiple labels per image
Multi-label classification differs from multi-class classification. Instead of having a single label from a fixed set of classes, each image can have multiple labels. This typically requires per-label sigmoid outputs and a binary cross-entropy loss; in practice the head keeps producing raw logits and the sigmoid is applied inside the loss. You might change the final layer to something like:

model.head = nn.Linear(model.head.in_features, 10)  # one logit per possible label
criterion = nn.BCEWithLogitsLoss()                  # applies the sigmoid internally

where each of the 10 outputs corresponds to a particular label that can be either 0 or 1. The rest of the process (data loading, training loop, etc.) is broadly similar, except targets become multi-hot vectors and you compute metrics such as per-label accuracy or F1 scores accordingly.
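A small sketch of how the targets and loss look in this multi-label setting (the tensors below are random stand-ins for model outputs and annotations):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 10)                 # model outputs: raw logits, one per label
targets = torch.zeros(4, 10)                # multi-hot targets as float vectors
targets[0, [1, 3]] = 1.0                    # image 0 carries labels 1 and 3
targets[2, 7] = 1.0                         # image 2 carries label 7 only

loss = criterion(logits, targets)           # sigmoid is applied inside the loss
preds = (logits.sigmoid() > 0.5).float()    # threshold per label for metrics such as F1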
domain adaptation challenges for specialized tasks
When your dataset drastically differs from the images used in pre-training (e.g., medical scans, satellite imagery, IR images), the extracted features might not directly transfer. You may need advanced domain adaptation techniques such as:
- Custom data augmentations to match the domain properties.
- Fine-tuning all layers more aggressively.
- Possibly pre-training from scratch on large domain-specific data if feasible.
- Using self-supervised or semi-supervised approaches if labeled data is scarce.
hyperparameter tuning guidelines (batch size, weight decay, etc.)
For vision transformers, recommended hyperparameters might differ from typical CNN defaults. Some general tips:
- Batch size: Larger batch sizes are often preferred to stabilize training, but memory constraints can limit this. Using gradient accumulation can help.
- Weight decay: Values in the range 0.01 to 0.1 are common. It's good to experiment since it can strongly affect overfitting.
- Learning rate schedulers: Cosine decay or step-based schedules are popular. A brief warm-up period is often beneficial.
- Precision: Mixed-precision training (fp16) can speed up training significantly on GPUs with Tensor Cores.
swin transformer
key highlights and hierarchical design for scalable architectures
The Swin transformer (Liu and gang, ICCV 2021) was proposed to address the high computational cost of standard ViTs, which rely on global self-attention. Swin stands for Shifted Window — it partitions images into small, non-overlapping windows and performs self-attention within each window. Across layers, the windows shift in such a way that patches from different windows can interact, enabling a hierarchical feature representation reminiscent of CNNs.
Swin is often described as a hierarchical vision transformer because it reduces the resolution of feature maps progressively, similar to many CNN backbones (like ResNet). This allows the network to handle high-resolution inputs without the quadratic blow-up in cost.
shifted window attention mechanism: local windows and shifting strategies
In standard local attention, the image is divided into local windows, and self-attention is computed only within each window. The shifted window approach in Swin modifies the window arrangement at alternate layers. Specifically:
- On one layer, the feature map is partitioned into non-overlapping windows.
- On the next layer, the windows are shifted by some pixels (like half the window size), which causes different patches to be grouped together in a new arrangement.
This shift ensures that each patch eventually has the chance to interact with patches in neighboring windows, effectively capturing cross-window dependencies. The compute cost is significantly reduced compared to global self-attention: Swin's attention complexity scales as $O(M^2)$ per window of $M \times M$ patches, which is linear in the total number of patches for a fixed window size, rather than the $O(N^2)$ of naive global self-attention.
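A rough sketch of the window partitioning and cyclic shift is shown below (the helper name is mine, and the attention masking that Swin applies at the shifted window borders is omitted for brevity):

import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

feat = torch.randn(2, 56, 56, 96)             # e.g., a stage-1 Swin-T feature map
windows = window_partition(feat, 7)           # attention runs independently in each 7x7 window

# On alternate layers, cyclically shift the map by half a window before partitioning,
# so patches near window borders get grouped with their former neighbors.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, 7)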
advantages and performance considerations across various benchmarks
Swin has shown state-of-the-art or near state-of-the-art results on tasks like ImageNet classification, COCO object detection, and ADE20K segmentation. It is also widely used as a backbone in many computer vision toolkits (e.g., MMDetection, Detectron2). Key advantages:
- Better scalability to high-resolution inputs.
- Hierarchical representations that adapt easily to downstream dense tasks.
- Strong performance improvements over the original ViT on smaller or mid-sized datasets.
extensions: swin v2 and its improvements
Swin V2 introduced further refinements, including larger window sizes, improved normalization strategies, and novel initialization schemes that enable stable training of even bigger models (such as 3 billion parameters). These improvements push the envelope of performance on large-scale benchmarks.
larger parameter counts: trade-offs in memory usage and accuracy
Swin can scale to very large parameter counts (billions of parameters). The trade-off is memory usage and training time. As with other large models, these big Swin variants can require distributed training across many GPUs. If you're working on a resource-constrained setup, smaller variants like Swin-T (tiny) or Swin-S (small) might be more practical.
simmim self-supervised learning: mask-based training strategies
A major area of active research is masked image modeling, an analog to masked language modeling (BERT) in NLP. Swin has a self-supervised extension called SimMIM, which trains the model to reconstruct masked patches in the pixel space. This can help the model learn meaningful features without requiring large-scale labeled data, and then fine-tune for specific tasks.
applications in image restoration (swinir, swin2sr)
Other specialized variants of Swin incorporate the same shifted window principle but focus on super-resolution and image restoration tasks. SwinIR (Liang and gang, ICCV 2021) used a Swin-based architecture to achieve impressive results in super-resolution, denoising, and JPEG artifact removal. Swin2SR is a more recent iteration with even better performance. This highlights the flexibility of hierarchical transformers for tasks beyond classification and detection.
convolutional vision transformer (cvt)
recap: how vit paved the way for hybrid models
After the first wave of excitement around pure vision transformers, researchers explored various hybrid architectures to incorporate the best of both worlds: the flexibility and global attention of transformers with the strong inductive bias and efficiency of convolutions. In some early experiments, CNN layers were used to create patch embeddings for ViTs, effectively bridging convolutional feature extraction and transformer-based attention.
cnn–transformer hybrid approach: leveraging convolutional layers in attention blocks
CvT (Wu and gang, ICCV 2021) stands for Convolutional Vision Transformer. It is one such hybrid approach, introducing convolutions both in the token embedding step and within the projection step for queries, keys, and values. By adding these local operations, CvT can:
- Capture local structures more efficiently.
- Reduce the parameter count by sharing weights in a CNN-like manner.
- Achieve stable training with less data than a pure ViT.
convolutional token embedding: embedding patches via cnn-like kernels
Instead of using a simple linear projection of flattened patches, CvT employs a CNN-based token embedding. For example, a single 2D convolution layer with a certain stride can be used to reduce the spatial dimension and produce a feature map that is then flattened into tokens. This approach helps the model better capture low-level visual cues right from the start.
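A sketch of such a convolutional token embedding is shown below (the kernel, stride, and embedding sizes are illustrative values in the spirit of CvT's first stage, not the exact published configuration):

import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolutional token embedding: a strided conv produces a feature map
    that is flattened into tokens and layer-normalized."""
    def __init__(self, in_chans=3, embed_dim=64, kernel_size=7, stride=4, padding=2):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size, stride, padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, D, H', W') at reduced resolution
        B, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H'*W', D)
        return self.norm(tokens), (H, W)         # keep the grid size for later stages

tokens, grid = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))  # tokens: (2, 3136, 64)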
convolutional projection for queries, keys, and values: localizing attention operations
CvT goes further by applying depthwise separable convolutions to the queries, keys, and values. This step introduces a local receptive field in the attention mechanism, effectively combining local attention with global attention. It also reduces the computational overhead. The result is a series of CvT blocks that are reminiscent of standard transformer blocks but incorporate convolution at key steps.
architectural highlights (no positional encodings, hierarchical structure)
Because convolutions already incorporate some positional information (the arrangement of kernel filters), CvT can dispense with explicit positional embeddings. The architecture is also hierarchical, reducing the spatial dimensions as we move through the layers, very much like a traditional CNN. This hierarchical design typically ends in a global average pooling (or a class token) before the final classification layer.
comparison with other transformer-based backbones
CvT competes with other backbones like Swin, PVT (Pyramid Vision Transformer), and hierarchical ViT variants. Depending on the benchmark or task, the best choice can vary. CvT's main selling points include simpler hierarchical design, CNN-like tokenization, and potentially better efficiency than a pure ViT at moderate image sizes.
potential for bridging the best of both worlds (local vs. global feature extraction)
In essence, CvT is part of a broader trend of rethinking attention to strike a balance between:
- Local interactions: cheaply captured by convolution or local attention.
- Global interactions: captured by the self-attention mechanism.
By combining local biases with the flexible attention mechanism, these hybrid architectures often offer improved performance and efficiency.
dilated neighborhood attention transformer (dinat)
overview and motivation: addressing global context more efficiently
One limitation of local window-based approaches like Swin is that patches outside the window can only indirectly interact after multiple layers. Dilated Neighborhood Attention (DiNA) and the Dilated Neighborhood Attention Transformer (DiNAT) were proposed to cover a broader receptive field without paying the full cost of global attention. Dilated attention introduces spacing between attended patches, similar to dilated convolutions.
neighborhood attention vs. global attention: balancing locality and range
Neighborhood Attention (NA) (Hassani and gang, 2022) processes a local region around each patch for attention. Instead of every patch attending to every other patch, we focus on a neighborhood. This reduces complexity from $O(N^2)$ to $O(N \cdot k)$, where $k$ is the neighborhood size (far smaller than $N$). However, a purely local region might still hamper capturing global interactions quickly. Hence, dilated attention expands this neighborhood by skipping patches in between, effectively letting each patch attend to a sparser but more globally spread set of patches.
dilated neighborhood attention for sparse global context
Dilated NA can be visualized as taking a local region around a patch but skipping certain patches at a defined rate — like the dilation factor in dilated convolutions. This approach can capture global structure faster without the overhead of full global attention. Meanwhile, it remains more efficient than naive global self-attention.
evolution from nat to dina to dinat: key incremental improvements
- NAT introduced the concept of neighborhood attention with a localized approach.
- DiNA extended NAT with dilated attention, covering more context.
- DiNAT further refines these concepts into a full hierarchical transformer backbone for classification, detection, or segmentation tasks.
performance and use cases: large images, medical imaging, dense predictions
DiNAT can handle large images more efficiently than a pure ViT or Swin, making it suitable for high-resolution tasks like medical imaging or satellite imagery, where patch-based methods might falter due to memory constraints. It also excels in tasks requiring dense predictions (segmentation, detection), because local but well-dilated attention can capture both near and far context.
mobilevit and separable self-attention
challenges on mobile and low-resource devices
Although vision transformers have shown remarkable performance, deploying them on mobile or edge devices is challenging. The attention mechanism can be computationally expensive, and large model sizes can exceed memory constraints. There is thus a strong need for specialized, lightweight transformer designs.
original mobilevit: bridging cnns and transformers for lightweight models
MobileViT (Mehta & Rastegari, ICLR 2022) was introduced to adapt transformer blocks for mobile environments. It uses Inverted Residual Blocks (popularized by MobileNetV2) combined with lightweight transformer layers that operate on smaller feature maps. The key idea is to fuse local representations (from convolutions) with global representations (from attention), but keep the overall parameter count and FLOPs low.
the mobilevit block (local + global representations): combining convolution and attention
A MobileViT block typically has:
- A convolutional layer (or layers) to extract local features and reduce or expand the channel dimension.
- A small transformer block (often operating on unfolded or flattened feature maps) to capture global interactions.
- Another convolutional layer to merge the output with the local path.
This synergy allows the model to retain the efficiency of separable convolutions while benefiting from a global receptive field in the attention sub-block.
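The following heavily simplified sketch captures that local-then-global pattern (the real MobileViT block unfolds the feature map into patches before attention and uses a specific fusion scheme; all names and dimensions here are illustrative):

import torch
import torch.nn as nn

class SimplifiedMobileViTBlock(nn.Module):
    """Local conv features + a small transformer for global context, then fuse with the input."""
    def __init__(self, channels=64, dim=96, num_heads=4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise conv
            nn.Conv2d(channels, dim, 1))                                   # pointwise conv
        self.global_ = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim, batch_first=True)
        self.project = nn.Conv2d(dim, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):                              # x: (B, C, H, W)
        y = self.local(x)                              # local representation
        B, D, H, W = y.shape
        tokens = y.flatten(2).transpose(1, 2)          # (B, H*W, D) "global" tokens
        tokens = self.global_(tokens)                  # self-attention over all positions
        y = tokens.transpose(1, 2).reshape(B, D, H, W)
        y = self.project(y)
        return self.fuse(torch.cat([x, y], dim=1))     # merge the global path with the input

out = SimplifiedMobileViTBlock()(torch.randn(2, 64, 32, 32))   # (2, 64, 32, 32)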
separable self-attention for o(k) complexity: reducing computations
Various approaches to separable self-attention have emerged. The notion is to factorize the attention computation along spatial and channel dimensions, or to break down the query–key–value interactions into more efficient forms. By doing so, the complexity can be cut down significantly, approaching linear in the number of patches or tokens, rather than quadratic. This is especially relevant when designing mobile-friendly or low-latency architectures.
performance benchmarks on mobile devices: latency, power consumption
MobileViT outperforms many conventional CNN-based mobile models (like MobileNetV3 or ShuffleNet) in terms of accuracy–FLOPs trade-off. However, actual latency and power consumption must also be measured on target hardware, since theoretical FLOPs might not perfectly correlate with real-world performance. The model can still become bottlenecked by memory bandwidth or suboptimal kernel implementations for attention layers.
model compression and hardware-aware optimizations
To further tailor MobileViT for resource-limited environments, additional compression techniques can be applied:
- Pruning: Remove unimportant heads, tokens, or channels.
- Quantization: Use lower precision (INT8, for instance) to reduce memory usage.
- Knowledge distillation: Distill from a larger teacher model to a lightweight student.
Hardware-aware neural architecture search (NAS) might also identify the best trade-offs for a particular device (smartphone, microcontroller, etc.).
vision transformers for object detection
brief overview of object detection tasks: bounding box regression and classification
Object detection requires assigning bounding boxes to objects in an image and classifying those objects. Traditional methods rely on region proposals (R-CNN family), single-stage detectors (YOLO), or anchor-based approaches (RetinaNet). Transformers bring a novel perspective to detection by removing the need for many hand-crafted components.
detr (detection transformer): encoder–decoder structure and set-based prediction
DETR (Carion and gang, ECCV 2020) was the pioneering approach to use a transformer for detection. It processes an image with a CNN backbone (e.g., ResNet), then flattens the feature map into a sequence. A transformer encoder refines this sequence, and a transformer decoder attends to it with learnable "object queries" to predict bounding boxes and classes. One distinctive aspect is the set-based prediction loss, where the final predictions are treated as a set, and a bipartite matching cost assigns predictions to ground-truth bounding boxes.
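A toy sketch of the bipartite matching step is shown below, using the Hungarian algorithm from SciPy with a simplified cost of class probability plus L1 box distance (DETR's actual matching cost also includes a generalized IoU term, and the tensors here are random stand-ins):

import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Toy bipartite matching between predicted queries and ground-truth boxes."""
    prob = pred_logits.softmax(-1)                       # (num_queries, num_classes)
    class_cost = -prob[:, gt_labels]                     # (num_queries, num_gt)
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)    # (num_queries, num_gt)
    cost = (class_cost + box_cost).detach().cpu().numpy()
    query_idx, gt_idx = linear_sum_assignment(cost)      # one GT per matched query
    return query_idx, gt_idx                             # unmatched queries predict "no object"

# Example: 100 object queries, 3 ground-truth objects
q_idx, g_idx = match_predictions(torch.randn(100, 92), torch.rand(100, 4),
                                 torch.tensor([3, 17, 56]), torch.rand(3, 4))

The matched pairs then receive classification and box-regression losses, while unmatched queries are pushed towards the "no object" class.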
bounding box predictions in parallel: removing hand-crafted components
By leveraging the transformer architecture, DETR removes the need for anchors, region proposals, or non-maximum suppression. All bounding boxes are predicted in parallel. Although DETR can be slower to converge and less effective for small objects, subsequent variants (like Deformable DETR or Conditional DETR) address these issues.
comparison to traditional approaches (e.g., yolo, faster r-cnn)
YOLO (You Only Look Once) is a one-stage approach with anchor boxes, while Faster R-CNN is a two-stage approach with region proposals. Both rely heavily on CNN backbones and hand-designed components for bounding box generation or refinement. DETR uses a more direct approach, but typically needs more epochs to converge. Its performance is competitive, especially with improvements in the form of Deformable DETR, which reduces memory usage and speeds up training by focusing attention on relevant spatial regions.
fine-tuning for custom detection tasks: dataset requirements and hyperparameters
When using DETR or its variants, a few points to remember:
- Pretraining a CNN backbone or a hybrid CNN–transformer backbone is often beneficial.
- Training time can be high. Convergence is slower compared to methods like Faster R-CNN, so large batch sizes and many epochs might be required.
- Hyperparameters such as the number of object queries, learning rate, and matching cost components can significantly affect performance.
- Data augmentation for object detection (random crops, random scale, color jittering) is crucial to improve robustness.
extensions of detr (deformable detr, conditional detr, etc.)
- Deformable DETR (Zhu and gang, ICLR 2021) introduces multi-scale deformable attention, significantly speeding up training and improving small object detection.
- Conditional DETR modifies how queries attend to encoded features, further stabilizing training and boosting performance.
- Numerous other variants experiment with hierarchical backbones (like Swin or CvT) or incorporate extra modules (like bounding box refinements).
vision transformers for segmentation
comparison: cnn-based vs. vit-based segmentation approaches
Image segmentation tasks — semantic, instance, or panoptic — require classifying each pixel in an image. CNN-based methods (e.g., FCN, U-Net, DeepLab) leverage dilated convolutions or encoder–decoder designs to generate dense predictions. Vision transformers handle segmentation by using attention over flattened patches or tokens. This can capture richer global context but also demands careful handling of upsampling or patch reshaping in the decoder stage.
capturing global context with attention: benefits for dense prediction
One of the biggest advantages of attention-based segmentation is the capacity to handle interactions among distant pixels. This global view can improve boundary delineation and reduce confusion among classes that appear at different scales.
unified segmentation frameworks (maskformer, segformer, sam)
- MaskFormer (Cheng and gang, NeurIPS 2021) unifies semantic and instance segmentation by predicting a set of masks and associated class labels.
- SegFormer (Xie and gang, NeurIPS 2021) is a purely transformer-based approach with an encoder that outputs multi-scale features, then merges them for segmentation.
- SAM (Segment Anything Model) (Kirillov and gang, 2023) introduced a highly generic segmentation approach that can handle a wide variety of objects and tasks with strong zero-shot capabilities, using a powerful vision transformer backbone.
oneformer: task-conditioned joint training for semantic and instance segmentation
OneFormer (Jain and gang, 2022) extends the idea by training a single model for multiple segmentation tasks simultaneously. It uses a task prompt to condition the model on whether it's performing semantic, instance, or panoptic segmentation. The backbone can be a strong vision transformer, making it flexible for multi-task scenarios.
practical tips for fine-tuning on segmentation datasets (optimizer, data augmentation)
Segmentation tasks often rely on specialized data augmentation: random scaling, cropping, flipping, and color jittering. The choice of optimizer (AdamW or SGD with momentum) can depend on your architecture. The learning rate might need to be carefully scaled if you're using a pre-trained backbone (like a Swin or ViT) to avoid catastrophic forgetting of learned features. Techniques like sliding window inference or mixed precision might be necessary for large images.
multi-task learning for semantic, instance, and panoptic segmentation
One advantage of attention-based models is that they can unify tasks that require different types of supervision. With appropriate architectural additions and multi-task losses, a single transformer-based model can handle multiple segmentation paradigms. This synergy can sometimes improve performance across all tasks by sharing underlying representations.
knowledge distillation with vision transformers
what is knowledge distillation?: teacher–student model concept
Knowledge distillation is a technique where a large (teacher) model's outputs or intermediate representations are used to guide the training of a smaller (student) model. This technique was originally popularized in the context of compressing large CNN-based models, but it also applies to transformers.
teacher–student model setup in practice
A typical pipeline for distillation includes:
- Pretraining the teacher model on a large dataset.
- Forwarding a batch of images through the teacher, collecting the logits or sometimes internal feature maps.
- Forwarding the same batch through the student model.
- Computing a loss (e.g., KL divergence or MSE) between the teacher outputs and the student outputs, possibly combined with a standard task loss, as sketched below.
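A minimal sketch of such a distillation loss, combining the task loss with a temperature-scaled KL term (the temperature and weighting are illustrative defaults):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the usual task loss with a soft-target loss against the teacher."""
    task_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened distributions (scaled by T^2, as is customary)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    return alpha * task_loss + (1.0 - alpha) * soft_loss

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))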
distillation in practice (distilgpt, distilbert analogies) for vision tasks
Just as DistilBERT or DistilGPT have shrunk large language models, distillation can reduce the size of vision transformers while retaining much of their performance. DeiT famously used distillation from a CNN teacher to train a data-efficient ViT. This approach addresses the lack of inductive bias by guiding the transformer to adopt some of the CNN's learned priors.
compressing large vision transformers for edge devices: reducing parameters and latency
If you want to deploy a giant ViT on a device with limited compute, knowledge distillation helps produce a more compact model. Combined with quantization or pruning, you can achieve significant memory reductions and faster inference times.
combining distillation with quantization or pruning: further compression techniques
- Quantization: Convert the weights from float32 or float16 to int8 or even int4. Distillation can help the model remain accurate post-quantization.
- Pruning: Remove entire heads, tokens, or channels that contribute less to final accuracy. Distillation can compensate for lost capacity.
- Neural Architecture Search (NAS): Automated search methods can discover smaller, more efficient architectures that benefit from distillation.
future directions and emerging trends
larger-scale pretraining and foundation models for vision
As data keeps growing, the idea of building foundation models in vision — analogous to large language models in NLP — has gained momentum. These are massive models trained on huge unlabeled or partially labeled datasets. The vision transformer (or variants) is at the core of many of these foundational efforts (e.g., large MAE or iGPT-like approaches).
self-supervised learning beyond simmim: masked autoencoders, contrastive learning
MAE (Masked Autoencoders) (He and gang, CVPR 2022) is an approach that masks a large portion of the input image patches and trains the model to reconstruct them, providing a strong pre-training signal without labels. Contrastive learning (like MoCo, SimCLR) and CLIP (for vision–language tasks) are also pushing the boundaries of representation learning. These directions show that vision transformers can excel in self-supervised or multimodal contexts.
cross-modal transformers (vision–language, vision–audio)
The attention mechanism is domain-agnostic; it can handle text, images, and other modalities. This allows the creation of cross-modal transformers, which process multiple data streams simultaneously. CLIP (Radford and gang, ICML 2021) is an example of learning a joint vision–language embedding space. This synergy paves the way for tasks like image captioning, visual question answering, or advanced text-to-image generation.
efficient training techniques: mixed precision, gradient checkpointing, multi-gpu scaling
To train large vision transformers, it's essential to use:
- Mixed precision (fp16 or bf16) to significantly reduce memory usage and speed up training; a minimal sketch follows this list.
- Gradient checkpointing to recompute activations during backpropagation and reduce memory at the cost of compute.
- Distributed training (DP, DDP, ZeRO) or multi-GPU parallelism to handle bigger batch sizes and faster iteration.
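For the first point, a minimal mixed-precision training step with torch.cuda.amp might look like this (it assumes model, optimizer, criterion, train_loader, and device are defined as in the earlier classification example):

import torch

scaler = torch.cuda.amp.GradScaler()                 # scales the loss to avoid fp16 underflow

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # run the forward pass in mixed precision
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()                    # backprop on the scaled loss
    scaler.step(optimizer)                           # unscale gradients, then optimizer step
    scaler.update()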
real-world industrial applications: robotics, medical imaging, autonomous vehicles
Vision transformers are increasingly being tested in real-world domains:
- Robotics: Understanding complex environments with 3D geometry or multi-camera setups.
- Medical imaging: Analysis of CT, MRI, histopathology slides, which can be high-resolution and large in size — an area where hierarchical attention shines.
- Autonomous vehicles: Multi-view cameras for scene understanding, object detection, and tracking under challenging conditions.
adversarial robustness and safety: challenges in attention-based models
Recent studies show that transformers can be vulnerable to adversarial examples or may exhibit distribution shift issues. Their global attention might inadvertently amplify spurious correlations in the data. Ongoing research focuses on making ViTs more robust through specialized training procedures, adversarial training, or improved architectural designs.
integration of transformer-based vision models with large language models (llms)
We are witnessing an era of multimodal LLMs that combine vision and language under a single large-scale transformer (e.g., BLIP-2, PaLI, Flamingo). The synergy of image understanding and language understanding enables advanced tasks like describing images with high-level semantic detail, answering visual questions, or performing image retrieval conditioned on textual inputs. This intersection is likely to grow rapidly in the future.
potential breakthroughs and open research questions
- How to develop more efficient attention mechanisms for extremely high resolutions?
- How to handle data scarcity when training large transformer-based models for specific domains?
- Can we unify all computer vision tasks — classification, detection, segmentation, 3D understanding — into a single foundation architecture?
- How to ensure robust and fair usage of these powerful models in real-world applications?
With this comprehensive exploration, I hope I've provided a deeper look at vision transformers, from the core architecture to diverse variants and practical considerations. The excitement in the field is palpable, and the pace of innovation is rapid. While training large-scale ViTs can be resource-intensive, the potential rewards in performance, representation power, and flexibility are remarkable. Researchers and practitioners are actively discovering new ways to blend the best of CNNs and transformers, or push pure attention-based methods to new frontiers. Whether you aim to adopt vision transformers in a production pipeline or continue exploring advanced research questions, understanding their foundations, variants, and best practices is key to harnessing the next wave of breakthroughs in computer vision.