Chapter 7: Classical Object Detection

This chapter covers classical object detection methods that predate the widespread adoption of deep learning. It starts with the sliding window and image pyramid approach, then examines two well-known detectors built on it, HOG + SVM and Viola-Jones, along with their core concepts and applications.

7.1 Sliding Window + Image Pyramid Approach

The sliding window technique is a fundamental strategy for object detection. It involves systematically scanning an image with a fixed-size window, classifying the content within each window. To account for objects of varying sizes, this approach is often combined with an image pyramid.

7.1.1 The Sliding Window Concept

Window Movement: A fixed-size window is moved across the image, typically from left to right and top to bottom.
Feature Extraction: For each window position, relevant features are extracted from the image patch.
Classification: A classifier is used to determine if the features extracted from the window correspond to the object of interest.
Overlapping Windows: Because the window usually advances by a stride smaller than its own size, and the scan is repeated at several scales, many overlapping windows can fire on the same object. Post-processing with Non-Maximum Suppression (NMS) consolidates these detections: it keeps the highest-scoring bounding box and discards neighbors that overlap it too heavily.
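
The scanning and consolidation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `sliding_windows` and `non_max_suppression`, the `[x1, y1, x2, y2]` box layout, and the 0.5 overlap threshold are all assumptions made for the example. Feature extraction and classification are left out here; they would run on each yielded patch.

```python
import numpy as np

def sliding_windows(image, window_size, stride):
    """Yield (x, y, patch) for each window, left to right, top to bottom.

    window_size is (width, height) in pixels; stride is the step in pixels.
    """
    win_w, win_h = window_size
    img_h, img_w = image.shape[:2]
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over boxes given as rows of [x1, y1, x2, y2].

    Returns the indices of the boxes that are kept, best score first.
    """
    if len(boxes) == 0:
        return []
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        ix1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop every box that overlaps the best one too much.
        order = rest[iou <= iou_threshold]
    return keep
```

A smaller stride finds objects more reliably but produces more windows to classify and more duplicates for NMS to remove, so the stride is a trade-off between recall and cost.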

7.1.2 The Image Pyramid

To detect objects at different scales within an image, an image pyramid is constructed. This involves creating multiple resized versions (scales) of the original image.

Resizing: The original image is scaled down or up to create a series of images at different resolutions.
Sliding Window on Each Scale: The sliding window process is then applied independently to each image in the pyramid.
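Detection Across Scales: An object that is too large for the window at full resolution will fit the window at some coarser pyramid level, and a small object will fit at a finer one. Each detection is mapped back to the original image by dividing its coordinates by the scale factor of the level where it was found.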
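
A minimal sketch of a downscaling pyramid follows. To stay dependency-free it uses a hypothetical nearest-neighbor helper, `resize_nearest`, as a stand-in for a proper resize; practical pipelines normally smooth the image before subsampling to avoid aliasing. The names `image_pyramid` and `to_original`, the scale step of 1.5, and the 64x128 minimum size are assumptions chosen for illustration.

```python
import numpy as np

def resize_nearest(image, scale):
    """Nearest-neighbor resize (illustrative stand-in for a real resize)."""
    h, w = image.shape[:2]
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]

def image_pyramid(image, scale_step=1.5, min_size=(64, 128)):
    """Yield (scale, level) pairs, shrinking until the window no longer fits.

    min_size is the (width, height) of the detection window.
    """
    min_w, min_h = min_size
    scale = 1.0
    while True:
        level = image if scale == 1.0 else resize_nearest(image, scale)
        if level.shape[0] < min_h or level.shape[1] < min_w:
            break
        yield scale, level
        scale /= scale_step

def to_original(box, scale):
    """Map a box (x, y, w, h) found at a pyramid level back to the original image."""
    return tuple(v / scale for v in box)
```

A detector would run the sliding window on every yielded level, convert each hit with `to_original`, and only then apply NMS across all levels together.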

7.2 Histogram of Oriented Gradients (HOG) + Support Vector Machine (SVM)

HOG features, combined with an SVM classifier, proved to be a highly effective method for object detection. The combination is best known for pedestrian detection, where it tolerates moderate variation in pose and clothing, and it has also been applied to faces.

7.2.1 Histogram of Oriented Gradients (HOG) Features

HOG features capture the local shape of an object by describing the distribution of intensity gradients or edge directions.

Gradient Computation:
- Calculate the horizontal and vertical gradients, $G_x$ and $G_y$, for each pixel in the image.
- The original HOG detector of Dalal and Triggs uses the simple centered derivative filter $[-1, 0, 1]$ and its transpose; Sobel or Prewitt operators are common alternatives.
- For each pixel, compute the gradient magnitude and orientation:
$$ |G| = \sqrt{G_x^2 + G_y^2} $$
$$ \theta = \text{atan2}(G_y, G_x) $$
Cell Division: Divide the image into small spatial regions called "cells" (e.g., 8x8 pixels).
Histogram Creation: For each cell, create a histogram of gradient orientations. The bins typically span 0-180 degrees (unsigned gradients, commonly 9 bins of 20 degrees each) or 0-360 degrees (signed gradients). Each pixel votes for the bin matching its orientation, with the vote weighted by its gradient magnitude.
Block Formation: Group adjacent cells into larger, usually overlapping, spatial blocks (e.g., 2x2 cells).
Normalization: Normalize the concatenated cell histograms within each block (for example, by their L2 norm) to reduce the effect of local variations in illumination and contrast.
Feature Vector Concatenation: Concatenate the normalized histograms from all blocks to form the final HOG feature descriptor for the detection window. For example, a 64x128 window with 8x8 cells, 2x2-cell blocks at a one-cell stride, and 9 bins yields 7 x 15 blocks of 36 values each, or 3780 features.
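
The pipeline above can be sketched directly in NumPy. This is a simplified illustration under stated assumptions: it uses the centered $[-1, 0, 1]$ difference for gradients, unsigned orientations, hard assignment of each pixel to a single bin (production implementations usually interpolate votes between neighboring bins and cells), and plain L2 block normalization. The function name `hog_descriptor` and its parameters are hypothetical.

```python
import numpy as np

def hog_descriptor(image, cell_size=8, block_size=2, n_bins=9, eps=1e-6):
    """Simplified HOG descriptor for a single grayscale detection window."""
    image = np.asarray(image, dtype=float)

    # Gradients via centered differences; border pixels keep a zero gradient.
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]
    gy[1:-1, :] = image[2:, :] - image[:-2, :]
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned, [0, 180)

    # One magnitude-weighted orientation histogram per cell.
    n_cells_y = image.shape[0] // cell_size
    n_cells_x = image.shape[1] // cell_size
    bin_width = 180.0 / n_bins
    cell_hist = np.zeros((n_cells_y, n_cells_x, n_bins))
    for cy in range(n_cells_y):
        for cx in range(n_cells_x):
            rows = slice(cy * cell_size, (cy + 1) * cell_size)
            cols = slice(cx * cell_size, (cx + 1) * cell_size)
            bins = (orientation[rows, cols] // bin_width).astype(int) % n_bins
            cell_hist[cy, cx] = np.bincount(
                bins.ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=n_bins,
            )

    # Overlapping blocks (stride of one cell), each L2-normalized.
    blocks = []
    for by in range(n_cells_y - block_size + 1):
        for bx in range(n_cells_x - block_size + 1):
            block = cell_hist[by:by + block_size, bx:bx + block_size].ravel()
            blocks.append(block / np.sqrt(np.sum(block ** 2) + eps ** 2))
    return np.concatenate(blocks)
```

Calling `hog_descriptor` on a window that is 128 rows by 64 columns returns a vector of length 3780 with the default parameters. Because blocks overlap, each cell contributes to several blocks, each time normalized against a different neighborhood.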

7.2.2 Support Vector Machine (SVM) Classifier

An SVM is a powerful supervised learning model used for classification. In the context of HOG, it's trained to distinguish between regions containing the object of interest and those that do not.

Training: A large dataset of labeled image patches (containing the object and background) is used to train the SVM. The HOG features extracted from these patches serve as the input.
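Decision Boundary: The SVM finds the hyperplane that separates the feature vectors of the two classes (object vs. background) with the largest possible margin.
Classification: During detection, the HOG features of each sliding window are fed into the trained SVM. For a linear SVM the output is the signed distance from the hyperplane, $w \cdot x + b$, which is not a probability: a window is reported as a detection when this score exceeds a chosen threshold, and the score is also used to rank detections during NMS.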
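
As a rough sketch of the training and scoring steps, the following fits a linear SVM by batch subgradient descent on the regularized hinge loss. This is one simple way to train a linear SVM, chosen here for brevity; it is not necessarily the solver used by any particular HOG detector, and the names `train_linear_svm` and `svm_score` and the hyperparameter values are assumptions. Labels are expected to be +1 (object) and -1 (background).

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on (lam/2)*||w||^2 + mean hinge loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violators = margins < 1  # samples inside the margin or misclassified
        y_v = y[violators][:, None]
        grad_w = lam * w - (y_v * X[violators]).sum(axis=0) / n_samples
        grad_b = -y[violators].sum() / n_samples
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_score(w, b, x):
    """Signed score w.x + b; positive means the 'object' side of the hyperplane."""
    return float(np.dot(w, x) + b)
```

In a detector, `x` would be the HOG descriptor of one window, and raising the decision threshold above zero trades missed detections for fewer false positives.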

7.2.3 Application: Face and Pedestrian Detection

HOG + SVM was widely successful for:

Pedestrian Detection: Capturing the characteristic shapes and postures of humans.
Face Detection: Identifying key facial features and contours.

7.3 Viola-Jones Face Detection Framework

The Viola-Jones algorithm, published in 2001, was a pioneering face detection system and one of the first to run in real time. It relies on a combination of Haar-like features, an integral image for rapid feature computation, AdaBoost for feature selection and classifier training, and a cascade of classifiers for efficient detection.

7.3.1 Haar-like Features

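Haar-like features are simple rectangular features that represent differences in image intensity. Each feature is the sum of the pixels under its "white" rectangles minus the sum under its "black" rectangles, computed at a specific position and size within the detection window.

Feature Types:
- Edge Features (two rectangles): Respond to a boundary between a brighter and a darker region, such as the eye region being darker than the cheeks below it.
- Line Features (three rectangles): Respond to a horizontal or vertical strip that is brighter or darker than the regions on both sides of it, such as the bridge of the nose between the eyes.
- Center-Surround Features: Respond to regions with a bright center and dark surround, or vice versa. These come from later extensions of the feature set rather than the original Viola-Jones features.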
Rapid Computation with Integral Image:
- The integral image (also known as the summed-area table) allows the sum of pixel values within any rectangular region to be computed in constant time, regardless of the rectangle's size.
- An integral image $I(x, y)$ at pixel $(x, y)$ stores the sum of all pixel values in the rectangle from $(0, 0)$ to $(x, y)$, inclusive:
$$ I(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} P(i, j) $$
where $P(i, j)$ is the pixel intensity at $(i, j)$.
- The sum over the rectangle with top-left corner $(x_1, y_1)$ and bottom-right corner $(x_2, y_2)$, both inclusive, takes four lookups in the integral image:
$$ \text{Sum} = I(x_2, y_2) - I(x_1-1, y_2) - I(x_2, y_1-1) + I(x_1-1, y_1-1) $$
Any term with an index of $-1$ is taken to be zero.
- Because every Haar-like feature is a sum and difference of a few rectangles, each one costs only a handful of lookups.
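
The integral image and the four-lookup rectangle sum can be sketched as follows. One implementation detail differs from the formula above and is an assumption of this example: the table is padded with a leading row and column of zeros, so that `ii[y + 1, x + 1]` holds $I(x, y)$ and the $-1$ indices never need special handling. The function names, including the two-rectangle feature `haar_edge_horizontal`, are hypothetical.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with a zero row and column prepended.

    ii[y + 1, x + 1] equals the sum of image[0:y + 1, 0:x + 1].
    """
    h, w = image.shape
    ii = np.zeros((h + 1, w + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(image, axis=0, dtype=np.int64), axis=1)
    return ii

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of pixels with x1 <= x <= x2 and y1 <= y <= y2, using four lookups."""
    return int(ii[y2 + 1, x2 + 1] - ii[y1, x2 + 1] - ii[y2 + 1, x1] + ii[y1, x1])

def haar_edge_horizontal(ii, x, y, w, h):
    """Two-rectangle edge feature: left half minus right half (w must be even)."""
    half = w // 2
    left = rect_sum(ii, x, y, x + half - 1, y + h - 1)
    right = rect_sum(ii, x + half, y, x + w - 1, y + h - 1)
    return left - right
```

The table is built once per image in a single pass; after that, the cost of a feature no longer depends on how many pixels its rectangles cover, which is what makes evaluating very many features per window affordable.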

7.3.2 AdaBoost Learning

AdaBoost (Adaptive Boosting) is used to select a small set of the most discriminative Haar-like features and train a strong classifier by combining multiple weak classifiers.

Weak Classifiers: Each Haar-like feature, paired with a threshold and a polarity, forms a very simple "weak" classifier that labels a window as face or non-face depending on which side of the threshold the feature value falls.
Iterative Training: AdaBoost trains weak classifiers over a series of rounds. In each round:
- It selects the weak classifier with the lowest weighted error on the training samples.
- It increases the weights of the samples that classifier misclassified, so the next round focuses on them.
- It assigns the classifier a weight based on its accuracy: the lower its weighted error, the larger its say in the final vote.
Strong Classifier: The final strong classifier is a weighted combination of all the selected weak classifiers.
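
The following sketch shows the boosting loop with threshold "stumps" as weak classifiers. It uses the common textbook form of discrete AdaBoost with labels in {+1, -1}; the original Viola-Jones paper states the weight update in a different but closely related form, so treat this as an illustration of the idea rather than a transcription. The columns of `X` stand in for precomputed Haar-like feature values, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +1 where polarity * value < polarity * threshold, else -1."""
    return np.where(polarity * X[:, feature] < polarity * threshold, 1, -1)

def best_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = (None, None, None, np.inf)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = weights[pred != y].sum()
                if err < best[3]:
                    best = (feature, threshold, polarity, err)
    return best

def adaboost_train(X, y, n_rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    weights = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(n_rounds):
        feature, threshold, polarity, err = best_stump(X, y, weights)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate => larger vote
        pred = stump_predict(X, feature, threshold, polarity)
        weights = weights * np.exp(-alpha * y * pred)  # boost the mistakes
        weights /= weights.sum()
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    """Weighted vote of all stumps; returns +1 or -1 per row."""
    X = np.asarray(X, dtype=float)
    total = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.where(total >= 0, 1, -1)
```

Because each round picks exactly one feature, the number of rounds directly controls how many Haar-like features the strong classifier evaluates, which is how boosting doubles as feature selection.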

7.3.3 Cascade of Classifiers

The Viola-Jones algorithm employs a cascade of classifiers to achieve real-time performance. This is a crucial optimization.

Stage-based Detection: The full classifier is structured as a sequence of increasingly complex classifier stages.
Early Rejection: Each stage is tuned to pass nearly all faces while rejecting a substantial share of the non-face windows that reach it.
- The initial stages use a very small number of features and are highly efficient, acting as a fast rejector.
- If a window passes a stage, it proceeds to the next, more complex stage.
- If a window fails any stage, it is immediately discarded as a non-face.
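False-Positive Reduction: Later stages use more features and are more expensive to evaluate. Their job is to reject the harder non-face windows that survived the earlier stages while still keeping nearly all true faces, so the overall detection rate stays high and the overall false-positive rate falls with every stage.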
Efficiency: This cascade approach ensures that only a small fraction of image windows are processed by the more computationally intensive stages, leading to significant speedups.
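
The early-rejection logic is short enough to sketch in full. In this illustration each stage is represented as a pair of a scoring function and a threshold; that representation, and the names `cascade_classify` and `cascade_rates`, are assumptions made for the example. The second helper shows why the design works: if stages are treated as independent, the overall detection and false-positive rates are the products of the per-stage rates.

```python
def cascade_classify(window_features, stages):
    """Run a window through the cascade.

    stages is a list of (score_fn, threshold) pairs, ordered cheap to expensive.
    Returns (is_face, number_of_stages_evaluated).
    """
    for index, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window_features) < threshold:
            return False, index  # rejected: later stages are never evaluated
    return True, len(stages)

def cascade_rates(stage_detection_rate, stage_false_positive_rate, n_stages):
    """Overall (detection, false-positive) rates for identical, independent stages."""
    return (stage_detection_rate ** n_stages,
            stage_false_positive_rate ** n_stages)
```

For instance, ten stages that each keep 99% of faces and pass 30% of non-faces give an overall detection rate of about 0.90 and an overall false-positive rate of about 6 in a million, even though no single stage is very selective.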

7.3.4 Application: Real-Time Face Detection

The Viola-Jones framework was revolutionary for its ability to perform face detection in real-time on standard hardware, making it a cornerstone for many early computer vision applications involving faces.
