Hands-on: Fine-tune ViT & DETR on Custom Datasets
This guide provides a step-by-step approach to applying state-of-the-art transformer models, Vision Transformer (ViT) and DETR (DEtection TRansformer), to custom datasets for object detection and image classification tasks.
Introduction
Fine-tuning powerful deep learning models like ViT and DETR on custom datasets lets developers tailor them to specific use cases. Whether you need precise image classification for a specialized domain or robust object detection in a unique environment, this guide will equip you with the knowledge and tools to do both.
What You Will Learn
- Dataset Preparation: How to structure and format your custom datasets for both ViT and DETR.
- ViT Fine-tuning: Step-by-step process to fine-tune ViT for image classification.
- DETR Fine-tuning: Step-by-step process to fine-tune DETR for object detection.
- Tools and Libraries: Essential libraries needed for the fine-tuning process.
- Evaluation and Inference: How to assess your fine-tuned models and use them for predictions on new data.
Prerequisites
- Python: Version 3.7 or higher.
- PyTorch: The deep learning framework.
- torchvision: For dataset handling and image transformations.
- Hugging Face Transformers: For easy access to pre-trained ViT and DETR models and their associated tools.
- Dataset Format:
  - ViT (Image Classification): Folder-based structure with one sub-folder per class (as used by torchvision's ImageFolder).
  - DETR (Object Detection): COCO format (JSON annotations).
Part 1: Fine-tuning ViT on a Custom Dataset (Image Classification)
This section details the process of adapting a pre-trained Vision Transformer for image classification on your own dataset.
Step 1: Dataset Preparation
Organize your custom dataset for classification into a hierarchical folder structure. The recommended structure is as follows:
custom_dataset/
├── train/
│   ├── class1/
│   │   ├── image1.jpg
│   │   └── image2.png
│   └── class2/
│       ├── imageA.jpg
│       └── imageB.jpeg
└── val/
    ├── class1/
    │   └── image3.jpg
    └── class2/
        └── imageC.jpg
- `train/`: Contains images for the training set.
- `val/`: Contains images for the validation set.
- Each sub-directory within `train/` and `val/` represents a distinct class.
Step 2: Load and Preprocess Data
Use `torchvision.datasets.ImageFolder` to load your dataset and define the necessary image transformations.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize images to ViT's expected input size
    transforms.ToTensor(),          # Convert images to PyTorch tensors
    # Add normalization if required by the pre-trained model (e.g., ImageNet stats)
    # transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Load datasets
train_dataset = datasets.ImageFolder("custom_dataset/train", transform=transform)
val_dataset = datasets.ImageFolder("custom_dataset/val", transform=transform)
# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Get the number of labels (classes)
num_labels = len(train_dataset.classes)
print(f"Found {num_labels} classes: {train_dataset.classes}")
Step 3: Load Pretrained ViT Model
Initialize a ViT model pre-trained on a large dataset (like ImageNet) and adapt its classification head for your specific number of classes.
from transformers import ViTForImageClassification
# Load a pre-trained ViT model
# "google/vit-base-patch16-224" is a common choice, trained on ImageNet-21k
model = ViTForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
num_labels=num_labels # Update the classification head for your number of classes
)
Step 4: Training Loop
Implement a standard PyTorch training loop. This involves:
- Optimizer: Choose an optimizer (e.g., AdamW).
- Loss Function: Use a suitable loss function (e.g., Cross-Entropy Loss for classification).
- Training Epochs: Iterate through the dataset multiple times.
- Forward Pass: Feed batches of data through the model.
- Backward Pass: Calculate gradients and update model weights.
- Validation: Periodically evaluate the model on the validation set.
import torch
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss
# Define optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = CrossEntropyLoss()
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Training loop (simplified example)
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch in train_loader:
        inputs, labels = batch
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_loader):.4f}")
    # Add validation logic here (see the sketch below)
    # model.eval()
    # with torch.no_grad():
    #     ... evaluate on val_loader ...
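To fill in the validation step sketched in the comments above, a minimal accuracy check over the validation set could look like the following; it reuses the model, val_loader, and device defined earlier.
# Minimal validation pass: overall accuracy on the validation set
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in val_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        preds = outputs.logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Validation accuracy: {correct / total:.4f}")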
Part 2: Fine-tuning DETR on a Custom Dataset (Object Detection)
This section covers fine-tuning the DETR model for object detection on your custom dataset.
Step 1: Dataset Format
DETR typically uses the COCO dataset format for annotations. This format is a JSON file that describes images, annotations, and categories.
Example COCO Annotation Structure:
{
  "images": [
    {"file_name": "img1.jpg", "id": 1, "height": 480, "width": 640},
    {"file_name": "img2.jpg", "id": 2, "height": 720, "width": 1280}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [x, y, width, height], // [top-left x, top-left y, width, height] in pixels
      "area": 1000.0,
      "iscrowd": 0,
      "segmentation": [] // Optional: can be empty for bounding box detection
    },
    // ... more annotations for other objects/images
  ],
  "categories": [
    {"id": 1, "name": "object1"},
    {"id": 2, "name": "object2"}
    // ... more categories
  ]
}
You'll need to create or convert your annotations to this format.
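If your labels currently live in a simpler format, a short script can assemble them into this structure. The sketch below is illustrative only: raw_labels, its file names, and image_dir are hypothetical placeholders for however your existing labels are stored (here, one list of (class_name, [x, y, width, height]) boxes per image).
import json
import os
from PIL import Image

# Hypothetical source labels: file name -> list of (class_name, [x, y, width, height]) boxes
raw_labels = {
    "img1.jpg": [("object1", [34, 50, 120, 80])],
    "img2.jpg": [("object2", [10, 20, 60, 40]), ("object1", [200, 150, 90, 90])],
}
image_dir = "path/to/your/images"

# Build the "categories" list and a name -> id lookup
class_names = sorted({name for boxes in raw_labels.values() for name, _ in boxes})
categories = [{"id": i + 1, "name": name} for i, name in enumerate(class_names)]
name_to_id = {c["name"]: c["id"] for c in categories}

images, annotations = [], []
ann_id = 1
for img_id, (file_name, boxes) in enumerate(raw_labels.items(), start=1):
    width, height = Image.open(os.path.join(image_dir, file_name)).size
    images.append({"file_name": file_name, "id": img_id, "height": height, "width": width})
    for class_name, (x, y, w, h) in boxes:
        annotations.append({
            "id": ann_id,
            "image_id": img_id,
            "category_id": name_to_id[class_name],
            "bbox": [x, y, w, h],
            "area": w * h,
            "iscrowd": 0,
            "segmentation": [],
        })
        ann_id += 1

with open("coco_annotations.json", "w") as f:
    json.dump({"images": images, "annotations": annotations, "categories": categories}, f)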
Step 2: Load DETR Model and Processor
Use the Hugging Face `transformers` library to load a pre-trained DETR model and its associated image processor. The processor handles the necessary pre-processing steps for DETR.
from transformers import DetrForObjectDetection, DetrImageProcessor

# Load a pre-trained DETR model (e.g., DETR with ResNet-50 backbone); "facebook/detr-resnet-50" is a common choice.
# For a custom dataset, also pass num_labels (the number of entries in your COCO "categories" list) and
# ignore_mismatched_sizes=True so the original COCO classification head is replaced with a fresh one.
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=2,  # 2 for the "object1"/"object2" example above
    ignore_mismatched_sizes=True,
)
# Load the corresponding image processor
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
Step 3: Preprocess and Fine-tune
The `DetrImageProcessor` is crucial for preparing your custom dataset for the DETR model. It handles resizing, normalization, and converting annotations into the format expected by the model.
During training, you pass each image (a PIL Image) together with its COCO-formatted annotations to the processor, for example `processor(images=image, annotations={"image_id": image_id, "annotations": coco_annotations}, return_tensors="pt")`; for inference on a single image, `processor(images=image, return_tensors="pt")` is enough. The resulting encoding contains:
- `pixel_values`: the preprocessed image tensor.
- `labels`: the formatted targets (class IDs and normalized bounding box coordinates) in the structure DETR's loss expects.
When these labels are passed to the model's forward call, `DetrForObjectDetection` computes its set-prediction (Hungarian-matching) loss internally and returns it as `outputs.loss`, so a training loop boils down to:
1. Loading batches of images and their COCO-formatted annotations.
2. Running them through the processor to build the model inputs.
3. Passing the inputs to the model and reading `outputs.loss`.
4. Backpropagating the loss and stepping the optimizer.
Refer to Hugging Face's DETR documentation and examples for a complete implementation; a condensed sketch of this pattern is shown below.
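As a concrete starting point, here is a condensed sketch of that pattern, modeled on the common Hugging Face DETR fine-tuning recipe for recent transformers versions. It assumes the `model` and `processor` loaded in Step 2, the `device` defined in Part 1, a COCO-format annotation file, and that pycocotools is installed (required by torchvision's CocoDetection); the file paths, batch size, learning rate, and epoch count are placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection

class CocoDetrDataset(CocoDetection):
    """Wraps torchvision's CocoDetection so each sample is run through the DETR processor."""

    def __init__(self, img_folder, ann_file, processor):
        super().__init__(img_folder, ann_file)
        self.processor = processor

    def __getitem__(self, idx):
        img, annotations = super().__getitem__(idx)  # PIL image + list of COCO annotation dicts
        target = {"image_id": self.ids[idx], "annotations": annotations}
        encoding = self.processor(images=img, annotations=target, return_tensors="pt")
        pixel_values = encoding["pixel_values"].squeeze(0)  # drop the batch dimension
        labels = encoding["labels"][0]                       # dict with class_labels, boxes, ...
        return pixel_values, labels

def collate_fn(batch):
    # Pad images in the batch to a common size and build the matching pixel mask
    pixel_values = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    padded = processor.pad(pixel_values, return_tensors="pt")
    return {"pixel_values": padded["pixel_values"], "pixel_mask": padded["pixel_mask"], "labels": labels}

train_dataset = CocoDetrDataset("path/to/your/images", "path/to/your/coco_annotations.json", processor)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.to(device)
model.train()

for epoch in range(5):
    for batch in train_loader:
        pixel_values = batch["pixel_values"].to(device)
        pixel_mask = batch["pixel_mask"].to(device)
        labels = [{k: v.to(device) for k, v in t.items()} for t in batch["labels"]]

        # When labels are provided, DETR computes its Hungarian-matching loss internally
        outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask, labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}: loss {loss.item():.4f}")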
Evaluation and Inference
After fine-tuning, it's essential to evaluate your models' performance and use them to make predictions.
Evaluation Metrics
- ViT (Image Classification):
  - Accuracy
  - F1-Score
  - Precision
  - Recall
  - Confusion Matrix
- DETR (Object Detection):
  - mAP (mean Average Precision) at various IoU thresholds (e.g., mAP@0.5, mAP@0.5:0.95)
  - IoU (Intersection over Union)
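For the classification metrics, the validation loop shown in Part 1 already covers accuracy; for the detection metrics, one commonly used option is the torchmetrics package (an extra dependency not listed in the prerequisites), which implements COCO-style mAP. A minimal sketch with dummy predictions and ground truth, assuming torchmetrics is installed:
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")

# One dict per image; boxes are [x_min, y_min, x_max, y_max] in absolute pixels
preds = [{
    "boxes": torch.tensor([[50.0, 30.0, 200.0, 180.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 35.0, 195.0, 175.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
results = metric.compute()
print(results["map"], results["map_50"])  # mAP@0.5:0.95 and mAP@0.5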
Inference Examples
ViT Inference:
from PIL import Image
from transformers import ViTFeatureExtractor
# Assume 'model' is your fine-tuned ViTForImageClassification model
# Assume 'image_path' points to an image file
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224") # Use the same extractor
# Load and preprocess the image for inference
image = Image.open(image_path).convert("RGB")  # Ensure a 3-channel input
# Apply the same transformations used during training (or specific inference transformations)
# For ViT, the feature extractor handles resizing and normalization.
inputs = feature_extractor(images=image, return_tensors="pt")
# Move inputs to the correct device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
# Get the predicted class index
predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_class_name = model.config.id2label[predicted_class_idx]
print(f"Predicted class: {predicted_class_name}")
DETR Inference:
from PIL import Image
# Assume 'model' is your fine-tuned DetrForObjectDetection model
# Assume 'image_path' points to an image file
# Load and preprocess the image for inference using the processor
image = Image.open(image_path).convert("RGB")  # Ensure a 3-channel input
inputs = processor(images=image, return_tensors="pt")
# Move inputs to the correct device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
# Process outputs to get bounding boxes, scores, and labels
# The processor provides a post-processing helper for DETR outputs
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=[image.size[::-1]]  # one (height, width) pair per image
)
detected_objects = []
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    detected_objects.append({
        "score": round(score.item(), 2),
        "label": model.config.id2label[label.item()],
        "box": box,
    })
print("Detected objects:", detected_objects)
Applications
- ViT:
  - Medical Imaging: Classifying diseases from X-rays or MRI scans.
  - Retail: Identifying specific products on shelves.
  - Document Analysis: Categorizing scanned documents.
- DETR:
  - Traffic Analysis: Detecting vehicles, pedestrians, and traffic signs.
  - Wildlife Monitoring: Identifying and tracking animals in camera trap footage.
  - Industrial Automation: Locating parts on an assembly line or detecting defects.
  - Autonomous Driving: Perceiving the environment by detecting other vehicles, pedestrians, and road features.
Conclusion
This guide has demonstrated how to fine-tune both Vision Transformer (ViT) for image classification and DETR for object detection on your custom datasets. By leveraging the power of transformers and the Hugging Face ecosystem, you can effectively adapt these state-of-the-art models to solve domain-specific problems and achieve impressive results.
Frequently Asked Questions
- What are the key differences between Vision Transformer (ViT) and DETR architectures? ViT is primarily designed for image classification by treating images as sequences of patches. DETR is an end-to-end object detection system that uses a transformer encoder-decoder architecture to directly predict bounding boxes and class labels, eliminating the need for anchor boxes or Non-Maximum Suppression (NMS) in its original form.
- How do you prepare a custom dataset for fine-tuning a Vision Transformer? For classification, organize your dataset into a folder structure where each sub-folder represents a class, containing images of that class. `torchvision.datasets.ImageFolder` can then load this structure.
- What preprocessing steps are required before feeding images into a ViT model? Images typically need to be resized to the input dimensions expected by the pre-trained ViT (e.g., 224x224), converted to tensors, and normalized using the mean and standard deviation of the dataset it was pre-trained on (often ImageNet statistics). The `ViTFeatureExtractor` from Hugging Face simplifies this.
- How is the COCO dataset format structured for DETR training? The COCO format is a JSON file containing lists of `images`, `annotations` (with bounding boxes in `[x_min, y_min, width, height]` format), and `categories`. Each annotation links an image to a category and defines its bounding box.
- What role does the DetrImageProcessor play in DETR model fine-tuning? The `DetrImageProcessor` handles complex preprocessing for DETR, including resizing images to a common size, normalizing pixel values, and converting the COCO-style annotations into the specific tensor format (including labels and bounding box coordinates) that the DETR model expects during training. It also assists in post-processing during inference.
- Describe the training loop process for fine-tuning ViT and DETR models. Both involve standard PyTorch training loops: loading data in batches, forwarding data through the model, calculating loss (Cross-Entropy for ViT, detection loss for DETR), performing backpropagation to compute gradients, and updating model weights using an optimizer. The key difference lies in the data loading and the specific loss function used, which must match the task (classification vs. detection).
- What evaluation metrics are commonly used for ViT and DETR models? For ViT (classification): Accuracy, Precision, Recall, F1-Score. For DETR (object detection): mAP (mean Average Precision) at various Intersection over Union (IoU) thresholds.
- How does DETR handle object detection differently from traditional CNN-based detectors? DETR performs object detection as a direct set prediction problem. It uses a transformer encoder-decoder architecture and object queries to directly output a fixed-size set of predictions (bounding boxes and class labels), thereby eliminating the need for hand-designed components like anchor boxes or NMS during inference.
- What are some real-world applications where ViT and DETR models excel? ViT excels in tasks requiring nuanced image understanding for classification, such as medical image analysis, scene understanding, and fine-grained visual categorization. DETR is strong in scenarios demanding accurate localization of multiple objects, like autonomous driving, surveillance, robotics, and industrial quality control.
- How do you perform inference using fine-tuned ViT and DETR models? For inference, load your fine-tuned model and the corresponding processor/feature extractor. Preprocess the new image(s) using the processor. Pass the preprocessed data through the model. For ViT, interpret the output logits to get class probabilities. For DETR, use the processor's post-processing function to convert raw model outputs into understandable bounding boxes and class labels with confidence scores.