Chapter 12: Semantic and Instance Segmentation

This chapter delves into the advanced computer vision tasks of semantic and instance segmentation, exploring key architectures and their applications.

Key Architectures

Fully Convolutional Networks (FCNs)

Fully Convolutional Networks revolutionized semantic segmentation by adapting existing convolutional neural networks (CNNs) to output dense, pixel-wise predictions. Unlike traditional CNNs that produce a single classification per image, FCNs replace fully connected layers with convolutional layers, enabling them to process inputs of arbitrary size and produce output maps that are upsampled back to the input's spatial resolution.

  • Core Concept: Replaces fully connected layers with convolutional layers to achieve pixel-wise prediction.
  • How it works: FCNs retain spatial information throughout the network, then use upsampling techniques (e.g., transposed convolutions, also called deconvolutions) to recover the resolution lost during downsampling in earlier convolutional layers. Each pixel in the output map thus corresponds to a specific region of the input image, as the sketch below illustrates.
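
To make this concrete, here is a minimal PyTorch sketch of the idea. The VGG-16 backbone, the 21-class output, and the layer sizes are assumptions chosen for illustration, not a prescribed implementation:

```python
import torch
import torch.nn as nn
import torchvision

class MinimalFCN(nn.Module):
    """FCN-style model: conv backbone -> 1x1 conv classifier -> upsample."""
    def __init__(self, num_classes=21):
        super().__init__()
        # Reuse a classification network's convolutional layers as the
        # feature extractor (output stride 32 for VGG-16).
        self.backbone = torchvision.models.vgg16(weights=None).features
        # A 1x1 convolution replaces the fully connected classifier,
        # producing one coarse score map per class.
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        # A transposed convolution upsamples the stride-32 score maps
        # back to the input resolution.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32,
                                           padding=16)

    def forward(self, x):
        h = self.backbone(x)    # (N, 512, H/32, W/32)
        h = self.classifier(h)  # (N, C, H/32, W/32)
        return self.upsample(h) # (N, C, H, W): per-pixel class scores

scores = MinimalFCN()(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 21, 224, 224])
```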

U-Net

U-Net is a highly influential convolutional neural network architecture specifically designed for biomedical image segmentation. Its distinctive U-shaped architecture allows it to capture context at multiple scales while precisely localizing features.

  • Architecture:
    • Contracting Path (Encoder): Similar to a typical CNN, it uses convolutional and max-pooling layers to capture context and reduce spatial dimensions.
    • Expanding Path (Decoder): Employs upsampling (transposed convolutions) and concatenation of feature maps from the contracting path. This is crucial for accurate localization, as it allows the network to combine high-level semantic information with low-level spatial details.
    • Skip Connections: The key innovation of U-Net lies in its skip connections, which directly connect feature maps from the encoder to the decoder at corresponding spatial resolutions. These connections help mitigate the loss of spatial information during downsampling.
  • Strengths: Excellent performance on tasks requiring precise localization and segmentation of fine details; a minimal code sketch of the architecture follows.
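
Below is a minimal PyTorch sketch of a two-level U-Net showing the encoder, decoder, and skip connections. The channel counts, depth, and padded convolutions are simplifications of the original architecture, chosen for brevity:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net (padded here
    # so feature maps keep their size and concatenation needs no cropping).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes=1):
        super().__init__()
        # Contracting path (encoder).
        self.enc1 = conv_block(1, 64)
        self.enc2 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(128, 256)
        # Expanding path (decoder); input channels double because each
        # decoder block receives upsampled features plus a skip connection.
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)  # 256 = 128 (up) + 128 (skip)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)   # 128 = 64 (up) + 64 (skip)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution
        # Skip connections: concatenate encoder features with upsampled ones.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                # per-pixel logits

logits = TinyUNet()(torch.randn(1, 1, 128, 128))
print(logits.shape)  # torch.Size([1, 1, 128, 128])
```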

Hands-on: Segmentation Using U-Net on Biomedical Images

To implement semantic segmentation using U-Net on biomedical images, you would typically follow these steps (a training-loop sketch follows the list):

  1. Data Preparation:
    • Gather a dataset of biomedical images with corresponding ground truth segmentation masks.
    • Perform necessary preprocessing steps like resizing, normalization, and data augmentation (e.g., rotations, flips, elastic deformations) to improve model robustness.
  2. Model Implementation:
    • Build or utilize a pre-existing U-Net architecture (e.g., using TensorFlow, PyTorch).
    • Define the loss function (e.g., Binary Cross-Entropy, Dice Loss, or a combination) and optimizer (e.g., Adam, SGD).
  3. Training:
    • Train the U-Net model on the prepared dataset.
    • Monitor training progress using appropriate metrics like Intersection over Union (IoU) or Dice Coefficient.
  4. Evaluation and Inference:
    • Evaluate the trained model on a separate test set.
    • Use the model to predict segmentation masks on new, unseen biomedical images.
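
Here is a minimal sketch of steps 2 and 3, assuming PyTorch, a combined BCE + Dice objective, and the TinyUNet sketch from the previous section; the random tensors are stand-ins for a real dataset and DataLoader:

```python
import torch
import torch.nn as nn

def dice_loss(logits, targets, eps=1e-6):
    # Soft Dice loss: 1 - 2*|P ∩ T| / (|P| + |T|), averaged over the batch.
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(2, 3))
    total = probs.sum(dim=(2, 3)) + targets.sum(dim=(2, 3))
    return (1 - (2 * inter + eps) / (total + eps)).mean()

def iou(logits, targets, thresh=0.5, eps=1e-6):
    # Intersection over Union for thresholded binary predictions,
    # useful as a monitoring metric during training.
    preds = (torch.sigmoid(logits) > thresh).float()
    inter = (preds * targets).sum(dim=(2, 3))
    union = ((preds + targets) > 0).float().sum(dim=(2, 3))
    return ((inter + eps) / (union + eps)).mean()

# Stand-ins for real biomedical images and ground-truth masks.
images = torch.randn(4, 1, 128, 128)
masks = (torch.rand(4, 1, 128, 128) > 0.5).float()

model = TinyUNet()  # the U-Net sketch from the previous section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for epoch in range(3):  # a few steps on the synthetic batch
    logits = model(images)
    # A combined BCE + Dice objective is a common choice for biomedical masks.
    loss = bce(logits, masks) + dice_loss(logits, masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}, IoU {iou(logits, masks):.4f}")
```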

DeepLab Family

The DeepLab family (DeepLabv1, v2, v3, and v3+) comprises state-of-the-art architectures for semantic segmentation, known for their effective use of atrous (dilated) convolutions and atrous spatial pyramid pooling (ASPP).

  • Atrous Convolutions (Dilated Convolutions): These allow the convolutional kernels to cover a larger receptive field without increasing the number of parameters or computational cost. This helps capture multi-scale contextual information.
  • Atrous Spatial Pyramid Pooling (ASPP): This module applies atrous convolutions with different dilation rates in parallel, effectively probing features at multiple scales. It then fuses these features to obtain a richer contextual representation.
  • DeepLabv3+: The most recent iteration adds an encoder-decoder structure, pairing a powerful ASPP module in the encoder with a lightweight decoder that refines segmentation boundaries by recovering spatial detail from shallow encoder features. The sketch below illustrates the ASPP idea.
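
The following is a simplified, illustrative ASPP module in PyTorch. The real DeepLab modules also include a 1x1 convolution branch and global image pooling, omitted here for brevity; the dilation rates follow the commonly used (6, 12, 18) values:

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Illustrative ASPP: parallel atrous convolutions at several dilation
    rates, concatenated and fused with a 1x1 convolution."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            # dilation=r enlarges the receptive field without adding
            # parameters; padding=r keeps the spatial size unchanged.
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Probe features at multiple scales in parallel, then fuse them.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

out = SimpleASPP()(torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```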

Mask R-CNN

Mask R-CNN extends the Faster R-CNN architecture to perform instance segmentation. Unlike semantic segmentation, where all pixels of the same class are assigned the same label, instance segmentation distinguishes between different objects of the same class.

  • Core Functionality: It not only predicts a bounding box and class label for each object but also generates a pixel-level segmentation mask for each detected instance.
  • Architecture:
    • It builds upon Faster R-CNN by adding a third branch that predicts an object mask in parallel with the existing bounding-box and class predictions.
    • The mask head operates on feature maps produced by the RoIAlign layer (an improvement over RoI Pooling that avoids coarse spatial quantization), so extracted features are accurately aligned with each region of interest.
  • Output: For each detected object, Mask R-CNN outputs a class label, a bounding box, and a binary mask, as the inference sketch below shows.
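
torchvision ships a pretrained Mask R-CNN, which makes this output structure easy to inspect. Here is a minimal inference sketch (it assumes torchvision >= 0.13 for the weights API; the random tensor is a stand-in for a real RGB image scaled to [0, 1]):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    predictions = model([image])[0]

# One entry per detected instance: class label, box, score, and a soft mask.
print(predictions["labels"].shape)  # (num_instances,)
print(predictions["boxes"].shape)   # (num_instances, 4)
print(predictions["masks"].shape)   # (num_instances, 1, 480, 640)
```

Thresholding each soft mask (e.g., at 0.5) yields the final binary mask per instance.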

Use-Cases

Medical Imaging

  • Tumor Detection and Segmentation: Identifying and outlining tumors in MRI, CT scans, or X-rays for diagnosis, treatment planning, and monitoring.
  • Organ Segmentation: Delineating organs (e.g., heart, lungs, kidneys) for volumetric analysis, surgical guidance, or disease assessment.
  • Cell Segmentation: Isolating and segmenting individual cells in microscopy images for biological research.

Autonomous Driving

  • Road and Lane Detection: Identifying drivable areas, lane markings, and road boundaries.
  • Object Detection and Segmentation: Recognizing and segmenting various road elements like vehicles, pedestrians, cyclists, traffic signs, and traffic lights. This is crucial for scene understanding and safe navigation.
  • Semantic Scene Understanding: Providing a detailed understanding of the entire driving environment, enabling more informed decision-making for path planning and obstacle avoidance.