Geometric Vision: 3D Reconstruction from 2D Images
Chapter 5: Geometric Vision
This chapter delves into the fundamental concepts and practical applications of geometric vision, exploring how to understand and reconstruct the 3D world from 2D images. We will cover essential topics such as camera calibration, epipolar geometry, stereo vision, and various geometric transformations, with a focus on practical implementation using Python and OpenCV.
5.1 Camera Calibration
Camera calibration is the process of determining the intrinsic and extrinsic parameters of a camera. This allows us to establish a precise relationship between the 3D world and the 2D image plane.
5.1.1 Intrinsic Parameters
Intrinsic parameters describe the internal characteristics of the camera, such as focal length, optical center, and pixel aspect ratio. They define how a 3D point is projected onto the 2D image plane.
- Focal Length ($f_x, f_y$): The distance between the optical center and the image plane, typically expressed in pixels.
- Optical Center ($c_x, c_y$): The principal point or the center of the image sensor, also expressed in pixels.
- Skew Coefficient ($\alpha$): Represents the non-orthogonality of the sensor's pixel axes. In most modern cameras, this is zero.
- Distortion Coefficients: These account for lens distortions, primarily radial and tangential distortion. Common distortion types include:
- Radial Distortion: Affects points further from the image center more significantly.
- Tangential Distortion: Occurs when the lens is not perfectly parallel to the image sensor. Both distortion types follow the model shown below.
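For reference, lens distortion in OpenCV follows the standard Brown–Conrady model, where $(x, y)$ are normalized image coordinates and $r^2 = x^2 + y^2$:
$$ x_{\text{dist}} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2) $$
$$ y_{\text{dist}} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y $$
Here $k_1, k_2, k_3$ are the radial coefficients and $p_1, p_2$ the tangential coefficients returned by calibration.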
The intrinsic camera matrix ($K$) is typically represented as:
$$ K = \begin{bmatrix} f_x & \alpha & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} $$
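As a quick illustration, the following sketch builds $K$ in NumPy with hypothetical intrinsic values (not taken from a real calibration) and projects a 3D point expressed in the camera frame onto the image plane:
import numpy as np
# Hypothetical intrinsics: 800 px focal length, principal point at (320, 240), zero skew
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])
# Project a 3D point given in the camera coordinate frame (X, Y, Z), with Z > 0
P_cam = np.array([0.1, -0.05, 2.0])
p_hom = K @ P_cam            # homogeneous image coordinates
u, v = p_hom[:2] / p_hom[2]  # divide by the third component to get pixels
print(u, v)                  # -> 360.0 220.0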
5.1.2 Extrinsic Parameters
Extrinsic parameters describe the camera's position and orientation in the 3D world. They are represented by a rotation matrix ($R$) and a translation vector ($t$). These parameters define the transformation from the world coordinate system to the camera coordinate system.
- Rotation Matrix ($R$): A 3x3 matrix that describes the camera's orientation.
- Translation Vector ($t$): A 3x1 vector that describes the camera's position.
The extrinsic transformation can be represented as a 3x4 matrix $[R|t]$:
$$ [R|t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} $$
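Combining both parameter sets gives the full projection $p \propto K [R|t] X$. The sketch below uses hypothetical values (identity rotation, a small translation, and the same illustrative $K$ as above) to project a world point into pixels:
import numpy as np
# Hypothetical extrinsics: camera aligned with the world axes, shifted 0.5 m along X
R = np.eye(3)
t = np.array([[0.5], [0.0], [0.0]])
Rt = np.hstack((R, t))  # 3x4 extrinsic matrix [R|t]
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
# Project a world point (homogeneous coordinates) into the image: p ~ K [R|t] X
X_world = np.array([0.0, 0.0, 2.0, 1.0])
p = K @ Rt @ X_world
u, v = p[:2] / p[2]
print(u, v)  # -> 520.0 240.0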
5.2 Epipolar Geometry and Stereo Vision
Epipolar geometry describes the geometric relationship between two cameras observing the same 3D scene. This relationship is crucial for stereo vision, enabling depth estimation and 3D reconstruction.
5.2.1 Epipolar Constraint
For a point $P$ in the 3D world, its projection in the first camera is $p_1$ and in the second camera is $p_2$. The epipolar constraint states that for a given point $p_1$ in the first image, its corresponding point $p_2$ in the second image must lie on a specific line called the epipolar line.
5.2.2 Fundamental Matrix ($F$)
The fundamental matrix $F$ is a 3x3 matrix that encapsulates the epipolar geometry between two uncalibrated cameras. It relates corresponding image points $p_1 = [u_1, v_1, 1]^T$ and $p_2 = [u_2, v_2, 1]^T$ by the equation:
$$ p_2^T F p_1 = 0 $$
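A minimal sketch of estimating $F$ with OpenCV is shown below. It uses synthetic correspondences generated from a hypothetical two-camera setup (so the geometry is known exactly) and then checks the epipolar constraint:
import cv2
import numpy as np
# Synthetic 3D points in front of both cameras
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-1, 1, (20, 2)), rng.uniform(4, 8, (20, 1))])
# Hypothetical shared intrinsics and a small relative motion between the cameras
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]).reshape(3, 1))[0]
t = np.array([[0.2], [0.0], [0.0]])
def project(X, K, R, t):
    x = (K @ (R @ X.T + t)).T
    return (x[:, :2] / x[:, 2:]).astype(np.float32)
pts1 = project(X, K, np.eye(3), np.zeros((3, 1)))  # first camera at the origin
pts2 = project(X, K, R, t)                         # second camera
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
# Verify the epipolar constraint p2^T F p1 ≈ 0 for one correspondence
p1 = np.append(pts1[0], 1.0)
p2 = np.append(pts2[0], 1.0)
print(p2 @ F @ p1)  # should be close to zero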
5.2.3 Essential Matrix ($E$)
The essential matrix $E$ relates corresponding image points when the cameras are calibrated (i.e., intrinsic parameters are known). It is related to the fundamental matrix by:
$$ E = K_2^T F K_1 $$
where $K_1$ and $K_2$ are the intrinsic matrices of the first and second cameras, respectively. The essential matrix contains information about the relative rotation and translation between the two cameras.
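Continuing the previous sketch (reusing pts1, pts2, and K), the essential matrix and the relative pose can be estimated directly; note that the recovered translation is only defined up to scale:
import cv2
# Estimate E from calibrated correspondences (RANSAC with illustrative settings)
E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
# Decompose E into the relative rotation and unit-scale translation
retval, R_rel, t_rel, pose_mask = cv2.recoverPose(E, pts1, pts2, K)
print("Relative rotation:\n", R_rel)
print("Relative translation (up to scale):\n", t_rel)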
5.2.4 Stereo Vision
Stereo vision uses two or more cameras to infer depth information from disparities in images. Key components include:
- Stereo Rectification: A process that transforms the two image planes so that they become coplanar and corresponding epipolar lines become horizontal scanlines. This reduces the correspondence problem to a 1D search along each image row.
- Disparity Map: A map where each pixel value represents the horizontal disparity between its corresponding pixels in the left and right rectified images.
- Depth Estimation: Depth can be calculated from the disparity using the stereo camera's baseline (distance between cameras) and focal length, as shown in the formula below.
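For a rectified stereo pair, the depth relationship mentioned in the last item is:
$$ Z = \frac{f \cdot B}{d} $$
where $Z$ is the depth, $f$ the focal length in pixels, $B$ the baseline, and $d$ the disparity in pixels. For example, with $f = 800$ pixels, $B = 0.1$ m, and $d = 40$ pixels, the depth is $Z = 800 \cdot 0.1 / 40 = 2$ m.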
5.3 Homography, Affine, and Projective Transforms
These transformations describe how points in one plane are mapped to another, often used for image warping, view synthesis, and relating different camera views.
5.3.1 Projective Transform (Homography)
A projective transform, also known as a homography ($H$), is a 3x3 matrix that describes the mapping between two projective planes. It preserves straight lines but not parallelism or angles. It is commonly used to map a plane in 3D space to an image plane or to align two images of the same planar surface taken from different viewpoints.
The relationship between corresponding points $p_1 = [u_1, v_1, 1]^T$ and $p_2 = [u_2, v_2, 1]^T$ is given by:
$$ p_2 \propto H p_1 $$
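Because $p_2$ is only defined up to scale, applying a homography requires dividing by the third homogeneous coordinate. A small sketch with a hypothetical $H$:
import numpy as np
# Hypothetical homography: a 10-degree rotation, a translation, and a mild perspective term
theta = np.deg2rad(10)
H = np.array([[np.cos(theta), -np.sin(theta),  5.0],
              [np.sin(theta),  np.cos(theta), 10.0],
              [1e-4,           0.0,            1.0]])
# Map p1 to p2: multiply by H, then de-homogenize
p1 = np.array([100.0, 50.0, 1.0])
p2_hom = H @ p1
p2 = p2_hom / p2_hom[2]
print(p2[:2])  # pixel coordinates in the second image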
5.3.2 Affine Transform
An affine transform is a more restrictive transformation than a projective transform. It preserves parallelism of lines but not necessarily angles or lengths. Affine transformations can be represented by a 2x3 matrix (or a 3x3 matrix with the last row $[0, 0, 1]$). They consist of translation, rotation, scaling, and shear.
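Since an affine transform is fully determined by three point correspondences, OpenCV can build the 2x3 matrix directly. A minimal sketch with a synthetic image and hypothetical point pairs:
import cv2
import numpy as np
# Synthetic test image: a white rectangle on a black background
img = np.zeros((240, 320, 3), dtype=np.uint8)
cv2.rectangle(img, (60, 60), (200, 160), (255, 255, 255), -1)
# Three source points and where they should map to (hypothetical values)
src = np.float32([[60, 60], [200, 60], [60, 160]])
dst = np.float32([[50, 80], [210, 70], [70, 180]])
M = cv2.getAffineTransform(src, dst)  # 2x3 affine matrix
warped = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))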
5.3.3 Projective vs. Affine vs. Euclidean
- Euclidean Transform: Preserves distances, angles, and orientation (translation and rotation).
- Similarity Transform: Preserves angles and ratios of lengths (translation, rotation, and uniform scaling).
- Affine Transform: Preserves parallelism of lines (translation, rotation, scaling, shear).
- Projective Transform: Preserves incidence of points and lines (general linear transformations).
5.4 Camera Calibration with Python – OpenCV
OpenCV provides robust functions for camera calibration using chessboard patterns or other calibration objects.
5.4.1 Calibration Process
- Capture Calibration Images: Take multiple images of a calibration pattern (e.g., chessboard) from different viewpoints.
- Detect Corners: Use cv2.findChessboardCorners() to detect the corners of the pattern in each image.
- Object Points: Define the 3D coordinates of the corners of the calibration pattern in a world coordinate system.
- Image Points: Store the 2D pixel coordinates of the detected corners in each image.
- Calibrate Camera: Use cv2.calibrateCamera() with the object points and image points to compute the intrinsic matrix ($K$), distortion coefficients, rotation vectors, and translation vectors.
import cv2
import numpy as np
# Chessboard dimensions
chessboard_size = (9, 6) # Number of inner corners
# Arrays to store object points and image points from all images
objpoints = [] # 3d points in real world space
imgpoints = [] # 2d points in image plane
# Create real world coordinates for the chessboard corners
objp = np.zeros((chessboard_size[0] * chessboard_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:chessboard_size[0], 0:chessboard_size[1]].T.reshape(-1, 2)
# Load images and detect corners
images = ['image1.jpg', 'image2.jpg', ...] # List of calibration image paths
for fname in images:
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Find the chessboard corners
    ret, corners = cv2.findChessboardCorners(gray, chessboard_size, None)
    if ret:
        objpoints.append(objp)
        # Refine corner locations for higher precision
        corners_refined = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        imgpoints.append(corners_refined)
        # Draw and display the corners
        cv2.drawChessboardCorners(img, chessboard_size, corners_refined, ret)
        cv2.imshow('img', img)
        cv2.waitKey(500)
cv2.destroyAllWindows()
# Calibrate the camera
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)
print("Camera Matrix (Intrinsic Parameters):\n", mtx)
print("\nDistortion Coefficients:\n", dist)
5.4.2 Undistorting Images
After calibration, you can use the computed matrix and distortion coefficients to undistort images, correcting for lens distortions.
# For a specific image, e.g., img
img = cv2.imread('image_to_undistort.jpg')
h, w = img.shape[:2]
# Get the optimal new camera matrix
newcameramtx, roi = cv2.getOptimalNewCameraMatrix(mtx, dist, (w,h), 1, (w,h))
# Undistort the image
dst = cv2.undistort(img, mtx, dist, None, newcameramtx)
# Crop the image to remove black borders (optional)
x, y, w, h = roi
dst = dst[y:y+h, x:x+w]
cv2.imshow('Undistorted Image', dst)
cv2.waitKey(0)
cv2.destroyAllWindows()
5.5 Depth Estimation Basics
Depth estimation aims to determine the distance of objects from the camera.
5.5.1 Methods for Depth Estimation
- Structure from Motion (SfM): Reconstructs the 3D structure of a scene from a sequence of images taken from different viewpoints.
- Stereo Vision: As discussed earlier, uses two or more cameras to calculate depth based on disparity.
- Depth from Focus/Defocus: Analyzes the sharpness or blurriness of objects at different focal settings.
- Time-of-Flight (ToF) Cameras: Directly measure the time it takes for light to travel to an object and back.
- Monocular Depth Estimation: Uses a single camera and often machine learning models (e.g., Convolutional Neural Networks) trained on large datasets to predict depth.
5.6 Hands-on: Perspective Correction and Camera Calibration
This section will guide you through practical applications of geometric vision, focusing on perspective correction and performing camera calibration.
5.6.1 Perspective Correction
Perspective correction involves transforming an image so that a specific planar object within it appears as if viewed from directly above, correcting for perspective distortion.
Steps:
- Detect Key Points: Identify four corner points of the planar object in the input image.
- Define Target Shape: Specify the desired dimensions and shape of the output image (e.g., a rectangle).
- Compute Homography: Calculate the homography matrix that maps the detected corner points to the target points.
- Warp Image: Apply the homography transformation to the input image to obtain the perspective-corrected image.
import cv2
import numpy as np
# Load the image
img = cv2.imread('perspective_image.jpg')
# Define the approximate corners of the object in the image
# Format: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
pts1 = np.float32([[56, 65], [368, 52], [28, 307], [388, 295]])
# Define the desired output image dimensions (e.g., a square)
widthA = np.sqrt(((388 - 28)**2) + ((295 - 307)**2))
widthB = np.sqrt(((368 - 56)**2) + ((52 - 65)**2))
minWidth = min(int(widthA), int(widthB))
heightA = np.sqrt(((56 - 28)**2) + ((65 - 307)**2))
heightB = np.sqrt(((368 - 388)**2) + ((52 - 295)**2))
minHeight = min(int(heightA), int(heightB))
# Define the destination points in the output image
pts2 = np.float32([[0, 0], [minWidth - 1, 0], [0, minHeight - 1], [minWidth - 1, minHeight - 1]])
# Compute the perspective transform matrix (homography)
matrix = cv2.getPerspectiveTransform(pts1, pts2)
# Apply the perspective warp
result = cv2.warpPerspective(img, matrix, (minWidth, minHeight))
cv2.imshow('Original Image', img)
cv2.imshow('Perspective Corrected', result)
cv2.waitKey(0)
cv2.destroyAllWindows()
5.6.2 Hands-on Camera Calibration Example
(Refer to Section 5.4 for detailed code and explanation of camera calibration using Python and OpenCV.)
5.7 Python OpenCV – Depth Map from Stereo Images
This section demonstrates how to generate a depth map from a pair of stereo images using OpenCV.
5.7.1 Stereo Matching
Stereo matching algorithms find corresponding pixels in two rectified stereo images to calculate disparity. Common algorithms include:
- Block Matching (BM): Finds the best match for a block of pixels from one image in the other image.
- Semi-Global Block Matching (SGBM): A more advanced algorithm that aggregates matching costs along multiple image directions, often yielding better results; a parameter sketch follows below.
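A minimal StereoSGBM setup is sketched below; the parameter values are common starting points but are scene-dependent, so treat them as illustrative:
import cv2
# numDisparities must be divisible by 16; blockSize should be odd
stereo_sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=96,
    blockSize=5,
    P1=8 * 5 ** 2,    # smoothness penalty for disparity changes of 1
    P2=32 * 5 ** 2,   # larger penalty for bigger disparity jumps
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2
)
# disparity = stereo_sgbm.compute(imgL, imgR)  # imgL/imgR: rectified grayscale images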
5.7.2 Depth Map Generation
- Load Stereo Images: Load the left and right images of a stereo pair.
- Rectify Images: If the stereo camera pair is not already calibrated and rectified, perform stereo rectification. This requires camera calibration parameters.
- Create Stereo Matcher: Instantiate a stereo matcher object (e.g., cv2.StereoBM_create or cv2.StereoSGBM_create).
- Compute Disparity: Compute the disparity map using the stereo matcher.
- Convert to Depth: Convert the disparity map to a depth map using the stereo camera's parameters (baseline, focal length).
import cv2
import numpy as np
# Load stereo images
imgL = cv2.imread('stereo_left.png', 0) # Load as grayscale
imgR = cv2.imread('stereo_right.png', 0) # Load as grayscale
# Create StereoBM object
# Parameters: numDisparities, blockSize
# numDisparities: must be divisible by 16
# blockSize: must be odd and at least 5
block_size = 15
num_disp = 16 * 6  # Must be divisible by 16
stereo = cv2.StereoBM_create(numDisparities=num_disp, blockSize=block_size)
# Compute disparity map
# StereoBM returns a fixed-point map scaled by 16; invalid matches are negative
disparity = stereo.compute(imgL, imgR).astype(np.float32) / 16.0
# Convert disparity to depth (using a hypothetical baseline and focal length)
# Depth = (baseline * focal_length) / disparity
# Note: These values are illustrative. You need actual camera parameters.
baseline = 70 # Example baseline in mm
focal_length = 1000 # Example focal length in pixels
# Avoid division by zero or very small disparities
depth_map = np.zeros_like(disparity, dtype=np.float32)
valid_disp = disparity > 0
depth_map[valid_disp] = (baseline * focal_length) / disparity[valid_disp]
# Normalize the disparity map for display
# Use a different range for visualization (0 to 255)
normalizer = cv2.normalize(disparity, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U)
cv2.imshow('Disparity Map', normalizer)
cv2.imshow('Left Image', imgL)
cv2.imshow('Right Image', imgR)
cv2.waitKey(0)
cv2.destroyAllWindows()
5.8 Python OpenCV – Pose Estimation
Pose estimation involves determining the 3D position and orientation of an object relative to a camera. This is often achieved using known 3D models of objects and their corresponding 2D projections in an image.
5.8.1 Process
- Obtain 3D Model and 2D Image Points: You need a set of 3D points corresponding to known features on the object and their corresponding 2D pixel locations in the image.
- Camera Calibration: You must have the intrinsic camera matrix ($K$) and distortion coefficients.
- Solve PnP (Perspective-n-Point): Use cv2.solvePnP() or cv2.solvePnPRansac() to estimate the rotation vector ($rvec$) and translation vector ($tvec$) that transform the 3D object points into the camera's coordinate system.
- Convert to Transformation Matrix: Convert the rotation vector and translation vector into a 3x4 pose matrix $[R|t]$ (or a full 4x4 homogeneous matrix) for easier use.
import cv2
import numpy as np
# Assume you have these:
# objpoints: 3D coordinates of object features (e.g., corners of a cube)
# imgpoints: 2D pixel coordinates of those features in the image
# mtx: Intrinsic camera matrix
# dist: Distortion coefficients
# Example data (replace with your actual data)
objpoints = np.array([
[0.0, 0.0, 0.0],
[1.0, 0.0, 0.0],
[1.0, 1.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, -1.0],
[1.0, 0.0, -1.0],
[1.0, 1.0, -1.0],
[0.0, 1.0, -1.0]
], dtype=np.float32)
# Example image points (e.g., detected corners of a cube in an image)
imgpoints = np.array([
[150, 200],
[250, 210],
[260, 300],
[160, 310],
[140, 150],
[240, 160],
[250, 250],
[150, 260]
], dtype=np.float32)
# Hypothetical calibration results
mtx = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)
dist = np.zeros((4, 1), dtype=np.float32) # Assuming no distortion for simplicity
# Solve for pose
success, rotation_vector, translation_vector = cv2.solvePnP(objpoints, imgpoints, mtx, dist)
if success:
    print("Rotation Vector (rvec):\n", rotation_vector)
    print("\nTranslation Vector (tvec):\n", translation_vector)
    # Project the coordinate axes from the object's origin to visualize orientation
    axis = np.float32([[3, 0, 0], [0, 3, 0], [0, 0, -3]]).reshape(-1, 3)
    imgpts, jac = cv2.projectPoints(axis, rotation_vector, translation_vector, mtx, dist)
    # Draw the projected axes on the image
    # (You would typically do this on the original image, here we just simulate)
    print("\nProjected points for axis visualization:\n", imgpts)
    # To get the 3x4 pose matrix [R|t]:
    R, _ = cv2.Rodrigues(rotation_vector)
    pose_matrix = np.hstack((R, translation_vector))
    print("\nPose Transformation Matrix:\n", pose_matrix)
else:
    print("Pose estimation failed.")