TensorFlow 1.x CNN Implementation for MNIST Digit Classification

This documentation details a Convolutional Neural Network (CNN) built from scratch using TensorFlow 1.x to classify handwritten digits from the MNIST dataset. It covers key deep learning concepts within TensorFlow's static graph paradigm, including convolutional layers, pooling, fully connected layers, softmax classification, and gradient-based optimization.

1. Context and Overview

This implementation serves as a practical guide to understanding and building CNNs in TensorFlow 1.x. It walks through the process of defining the network architecture, setting up the training pipeline, and evaluating its performance on the MNIST dataset.

2. Stepwise Breakdown of CNN Construction and Training

Step 1: Importing Dependencies and Dataset

This step initializes the necessary libraries and loads the MNIST dataset.

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

# Load the MNIST dataset with one-hot encoded labels (downloads it on first run)
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
  • TensorFlow (tf): The core framework for building and running the computational graph.
  • NumPy (np): Used for numerical operations and data manipulation.
  • MNIST Dataset Loader: read_data_sets loads the benchmark dataset of 28x28 grayscale handwritten digits (0-9) with one-hot labels into the mnist object, which provides the train/test splits and the next_batch() helper used later during training.

Step 2: Defining the CNN Training Function run_cnn()

This function encapsulates all model, training, and evaluation logic.

Hyperparameters:

  • learning_rate = 0.0001: A low learning rate is chosen to ensure stable convergence.
  • epochs = 10: The number of full passes over the entire training dataset.
  • batch_size = 50: The size of mini-batches used in gradient descent for efficient optimization.
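For reference, here is a minimal sketch of how these hyperparameters might sit at the top of run_cnn(); the exact layout of the function body is an assumption, since the remaining steps show its contents piece by piece.

def run_cnn():
    # Hyperparameters controlling optimization (values from the list above)
    learning_rate = 0.0001  # small step size for stable convergence
    epochs = 10             # full passes over the training set
    batch_size = 50         # mini-batch size for gradient descent
    # ... the graph definition and training loop from Steps 3-8 go here ...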

Step 3: Input Placeholders and Data Preparation

Placeholders are defined to feed data into the TensorFlow graph dynamically.

x = tf.placeholder(tf.float32, [None, 784])  # Input images
x_shaped = tf.reshape(x, [-1, 28, 28, 1])    # Reshaped for convolution
y = tf.placeholder(tf.float32, [None, 10])  # One-hot encoded labels
  • x: A placeholder for input images. The [None, 784] shape indicates a dynamic batch size (None) and flattened image data (28x28 = 784 pixels).
  • x_shaped: The input x is reshaped into a 4D tensor of shape [batch_size, height, width, channels], the format required by convolutional operations. For MNIST this is [-1, 28, 28, 1]: the -1 lets TensorFlow infer the batch size, followed by height 28, width 28, and a single grayscale channel (a small illustration follows this list).
  • y: A placeholder for the target labels, represented in one-hot encoded format for the 10 digit classes (0-9).
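As a quick standalone illustration of the reshape (using NumPy rather than the graph), the -1 in [-1, 28, 28, 1] is inferred from the number of images fed in:

import numpy as np

batch = np.zeros((50, 784), dtype=np.float32)  # one mini-batch of 50 flattened images
reshaped = batch.reshape(-1, 28, 28, 1)        # -1 is inferred as 50
print(reshaped.shape)                          # (50, 28, 28, 1)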

Step 4: Building Convolutional Layers

This section details the construction of the convolutional layers, which are responsible for extracting spatial features from the input images.

# Helper function to create a new convolutional layer
def create_new_conv_layer(input_data, num_input_channels, num_filters, filter_shape, pool_shape, name):
    # Convolutional filter weights
    weights = tf.Variable(tf.truncated_normal([filter_shape[0], filter_shape[1], num_input_channels, num_filters], stddev=0.03), name=name+'_W')
    # Biases
    biases = tf.Variable(tf.truncated_normal([num_filters], stddev=0.01), name=name+'_b')

    # Apply convolution
    # Stride is [1, 1] for height and width, padding 'SAME' preserves spatial dimensions
    conv = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')
    conv += biases  # Add biases

    # Apply ReLU activation
    conv = tf.nn.relu(conv)

    # Apply max-pooling
    # Pool size and stride are defined by pool_shape
    conv = tf.nn.max_pool(conv, ksize=[1, pool_shape[0], pool_shape[1], 1], strides=[1, pool_shape[0], pool_shape[1], 1], padding='VALID')
    return conv

# Layer 1
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name='layer1')

# Layer 2
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name='layer2')

create_new_conv_layer Function Details:

  • Weights: Learnable filters (kernels) with dimensions [filter_height, filter_width, input_channels, output_filters]. Initialized using truncated_normal for stable training.
  • Biases: Added to the output of the convolution.
  • 2D Convolution (tf.nn.conv2d): Applies the learned filters to the input data.
    • strides=[1, 1, 1, 1]: The filter moves one step at a time in height and width dimensions.
    • padding='SAME': Ensures the output spatial dimensions are the same as the input by adding zero padding.
  • ReLU Activation (tf.nn.relu): Introduces non-linearity, allowing the network to learn complex patterns.
  • Max-Pooling (tf.nn.max_pool): Downsamples the feature maps, reducing spatial dimensions and computational complexity while retaining the most important features.
    • ksize=[1, pool_shape[0], pool_shape[1], 1]: The size of the pooling window (e.g., [2, 2]).
    • strides=[1, pool_shape[0], pool_shape[1], 1]: The stride of the pooling window.

Layer-wise Detail:

  • Layer 1:
    • Input channels: 1 (grayscale MNIST images)
    • Output filters: 32
    • Kernel size: 5x5
  • Layer 2:
    • Input channels: 32 (output from Layer 1)
    • Output filters: 64
    • Kernel size: 5x5
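As a quick back-of-the-envelope check (not part of the original text), each convolutional layer learns filter_height x filter_width x input_channels x output_filters weights plus one bias per filter:

layer1_params = 5 * 5 * 1 * 32 + 32    # 832 parameters
layer2_params = 5 * 5 * 32 * 64 + 64   # 51,264 parameters
print(layer1_params, layer2_params)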

Max-Pooling Effect: Each max-pooling step with a 2x2 window and stride 2 effectively halves the spatial dimensions. This progressively reduces the image representation from 28x28 to 14x14 (after Layer 1 pooling) and then to 7x7 (after Layer 2 pooling).
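These shapes can be confirmed from the graph's static shape information, assuming the layer1 and layer2 tensors defined above:

print(layer1.get_shape())   # (?, 14, 14, 32) after the first 2x2 pooling
print(layer2.get_shape())   # (?, 7, 7, 64)  after the second 2x2 pooling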

Step 5: Flattening and Fully Connected Dense Layer

After feature extraction by convolutional layers, the data is flattened and fed into a dense (fully connected) layer.

# Flatten the output from the convolutional layers
# The shape of layer2 is [batch_size, 7, 7, 64]
flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])

# Dense layer (fully connected)
# Weights for the dense layer: input_size=7*7*64, output_size=1000
wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev=0.03), name='wd1')
bd1 = tf.Variable(tf.truncated_normal([1000], stddev=0.01), name='bd1')

# Matrix multiplication and add bias
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1) # Apply ReLU activation
  • Flattening: The 4D output tensor of layer2 ([batch_size, 7, 7, 64]) is reshaped into a 2D tensor ([batch_size, 7*7*64]) suitable for input into a fully connected layer.
  • Weights (wd1) and Biases (bd1): These are the learnable parameters for the dense layer. They are initialized using truncated_normal to promote training stability.
  • Matrix Multiplication (tf.matmul): Computes the linear transformation of the flattened features.
  • ReLU Activation: Introduces non-linearity to the dense layer's output.
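It is worth noting that this dense layer holds the bulk of the network's parameters, as a quick calculation under the shapes above shows:

dense1_params = 7 * 7 * 64 * 1000 + 1000   # 3,137,000 weights and biases
print(dense1_params)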

Step 6: Output Layer with Softmax and Loss Function

The final layer maps the features to the output class probabilities, and a loss function is defined for training.

# Output layer (fully connected)
# Weights: input_size=1000, output_size=10 (for 10 classes)
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev=0.03), name='wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev=0.01), name='bd2')

# Linear transformation for output
dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2

# Softmax activation for probability distribution
y_ = tf.nn.softmax(dense_layer2)

# Cross-entropy loss function
# Calculates the loss between predicted probabilities and true labels
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=dense_layer2, labels=y))

# Optimizer: Adam Optimizer
# Minimizes the cross-entropy loss
optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
  • Final Dense Layer (wd2, bd2): Transforms the 1000 features from the previous dense layer into 10 output units, corresponding to the 10 digit classes.
  • Softmax (tf.nn.softmax): Converts the raw output scores (logits) into a probability distribution over the 10 classes. The sum of probabilities for each input image will be 1.
  • Cross-Entropy Loss (tf.nn.softmax_cross_entropy_with_logits): A standard loss function for multi-class classification. It measures the difference between the predicted probability distribution and the true label distribution. tf.reduce_mean averages the loss across the batch.
  • Adam Optimizer (tf.train.AdamOptimizer): An adaptive learning rate optimization algorithm that is generally efficient and robust. It adjusts the learning rate for each parameter and aims to minimize the cross_entropy.
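Note that the loss is computed from the raw logits (dense_layer2) rather than from y_, because softmax_cross_entropy_with_logits applies the softmax internally in a numerically stable way. A manual equivalent, shown only for intuition and not used in the implementation, would look roughly like this:

# Roughly equivalent to the fused op above, but prone to log(0) instability
manual_cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y * tf.log(tf.clip_by_value(y_, 1e-10, 1.0)), axis=1))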

Step 7: Evaluation and TensorBoard Summaries

This step defines metrics for evaluating the model's performance and sets up logging for TensorBoard visualization.

# Correct prediction calculation
# Compares the predicted class (argmax of y_) with the true class (argmax of y)
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))

# Accuracy calculation
# Casts boolean results to float and computes the mean
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initialize all variables
init_op = tf.global_variables_initializer()

# TensorBoard summaries
# Log accuracy as a scalar value
tf.summary.scalar('accuracy', accuracy)
# Merge all summaries into a single operation
merged = tf.summary.merge_all()
  • correct_prediction: Compares the index of the highest predicted probability (tf.argmax(y_, 1)) with the index of the true label (tf.argmax(y, 1)). This results in a boolean tensor.
  • accuracy: Calculates the mean of the correct_prediction tensor by casting it to floating-point numbers. This gives the proportion of correctly classified samples in the batch.
  • tf.global_variables_initializer(): Initializes all TensorFlow variables defined in the graph (weights and biases) before training begins.
  • TensorBoard Summaries:
    • tf.summary.scalar('accuracy', accuracy): Creates a summary that logs the accuracy value over time.
    • tf.summary.merge_all(): Combines all defined summaries into a single operation, making it easier to write them to disk.
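The merged summaries are written to disk by a tf.summary.FileWriter, referred to as writer in the training loop below. A minimal setup (the log directory name here is only an example) looks like this; the logs can then be viewed by running tensorboard --logdir=tensorboard_logs:

# Create a writer that stores summaries (and, later, the graph) on disk
writer = tf.summary.FileWriter('tensorboard_logs')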

Step 8: Running the Training Loop

This section executes the computational graph, trains the model using mini-batches, and evaluates performance.

# Create a TensorFlow session to run the graph
with tf.Session() as sess:
    # Initialize the variables
    sess.run(init_op)

    # Calculate the total number of batches per epoch
    total_batch = int(len(mnist.train.labels) / batch_size)

    # Training loop
    for epoch in range(epochs):
        avg_cost = 0  # Initialize average cost for the epoch
        for i in range(total_batch):
            # Get next batch of training data
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)

            # Run the optimizer and calculate the loss
            # `optimiser` and `cross_entropy` are fetched. `optimiser` performs the training step.
            _, c = sess.run([optimiser, cross_entropy], feed_dict={x: batch_x, y: batch_y})

            # Accumulate the cost
            avg_cost += c / total_batch

        # Evaluate accuracy and write TensorBoard summaries on the test set after each epoch
        # (`writer` is the tf.summary.FileWriter set up in Step 7)
        test_acc, summary = sess.run([accuracy, merged],
                                     feed_dict={x: mnist.test.images, y: mnist.test.labels})
        writer.add_summary(summary, epoch)

        print(f"Epoch: {epoch+1}/{epochs}, Cost: {avg_cost:.4f}, Test Accuracy: {test_acc:.4f}")

    print("\nTraining complete!")

    # Save the computation graph for TensorBoard visualization
    writer.add_graph(sess.graph)

    # Print the final accuracy on the test set
    final_test_accuracy = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
    print(f"Final Test Accuracy: {final_test_accuracy:.4f}")

    # Close the writer
    writer.close()
  • tf.Session(): Creates a session to execute the TensorFlow graph. The with statement ensures the session is properly closed.
  • Initialization: sess.run(init_op) initializes all variables.
  • Training Loop: Iterates through each epoch and each batch within the epoch.
    • mnist.train.next_batch(): Retrieves a mini-batch of training data.
    • sess.run([optimiser, cross_entropy], feed_dict={...}): Executes the optimization step and calculates the loss for the current batch. The feed_dict provides the batch data to the input placeholders (x and y).
    • Average Cost: The loss is accumulated and averaged per epoch to monitor training progress.
    • Test Accuracy: After each epoch, the model's accuracy is evaluated on the entire test set.
    • TensorBoard Logging: sess.run(merged, ...) generates summary data, which is then written to a file using writer.add_summary(). This allows real-time visualization of training metrics.
  • Graph Visualization: writer.add_graph(sess.graph) saves the entire computation graph to TensorBoard, which is useful for understanding the model's structure.
  • Final Output: Prints the completion message and the final accuracy on the test dataset.
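Once training has finished, the same graph can be reused for inference while the session is still open. A small illustrative sketch (not part of the original loop) that predicts the digit for the first test image:

# Must run inside the `with tf.Session() as sess:` block above
probs = sess.run(y_, feed_dict={x: mnist.test.images[:1]})
print("Predicted digit:", np.argmax(probs, axis=1)[0])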

3. Additional Notes

Memory Warnings (Example: Allocation of 1003520000 exceeds 10% of system memory)

These warnings indicate that TensorFlow is allocating a very large tensor (here roughly 1 GB) in CPU memory. The figure in the example corresponds to the first convolutional layer's float32 activations for the full 10,000-image test set (10,000 x 28 x 28 x 32 x 4 bytes = 1,003,520,000 bytes), which is fed in one pass during evaluation. To address this:

  • Evaluate in Chunks: Run the test-set evaluation over smaller slices and average the results (see the sketch below).
  • Reduce Batch Size: Use a smaller batch_size during training.
  • Hardware Upgrade: If possible, increase system RAM or use a machine with a GPU, which can handle large computations more efficiently.
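A minimal sketch of chunked evaluation, assuming the graph, session, and mnist object from the steps above; it averages per-chunk accuracies over equal-sized slices so the full 10,000-image activation tensor is never allocated at once:

eval_batch = 1000
accs = []
for start in range(0, len(mnist.test.labels), eval_batch):
    end = start + eval_batch
    accs.append(sess.run(accuracy, feed_dict={x: mnist.test.images[start:end],
                                              y: mnist.test.labels[start:end]}))
print("Chunked test accuracy:", np.mean(accs))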

TensorFlow 1.x vs. TensorFlow 2.x

This implementation uses the older TensorFlow 1.x style, which relies on explicit tf.placeholder and tf.Session for defining and executing computational graphs.

TensorFlow 2.x promotes:

  • Eager Execution: Operations are executed immediately, similar to standard Python, making debugging easier.
  • High-level APIs (tf.keras): Simplifies model creation, training, and deployment with pre-built layers and training loops.

While TF 1.x provides a deeper understanding of graph computation, TF 2.x is generally recommended for new projects due to its improved usability and flexibility.
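For comparison, a rough tf.keras sketch of the same architecture in TensorFlow 2.x (hyperparameter values taken from this tutorial; this is not part of the original implementation):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((28, 28, 1), input_shape=(784,)),
    tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])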

4. Summary and Insights

  • Placeholders: Define input/output interfaces dynamically for graph execution.
  • Convolution Layers: Extract spatial features via learnable filters and pooling.
  • Fully Connected Layers: Combine extracted features to predict class probabilities.
  • Softmax + Cross Entropy: Calculate probability distributions and measure prediction error for classification.
  • Adam Optimizer: Efficient stochastic gradient descent optimizer for faster convergence.
  • Sessions and Graphs: Execute computation within a managed environment for performance and scalability.
  • TensorBoard Integration: Enables real-time training visualization and debugging of metrics and graphs.

5. SEO Keywords

TensorFlow 1.x CNN, MNIST classification, Build CNN from scratch TensorFlow, TensorFlow 1.x convolutional neural network tutorial, MNIST digit classification TensorFlow, TensorFlow session placeholder, TensorFlow cross entropy softmax, CNN architecture TensorFlow, Visualize TensorFlow model TensorBoard, TensorFlow ReLU max pooling, TensorFlow Adam optimizer training.

6. Interview Questions

  1. Explain how a CNN processes input images in TensorFlow 1.x. A CNN processes images by applying convolutional filters to extract features, followed by pooling layers to reduce spatial dimensions. These features are then flattened and fed into fully connected layers for classification. In TensorFlow 1.x, this is defined as a static graph using tf.placeholder for input and operations like tf.nn.conv2d, tf.nn.max_pool, tf.matmul, and tf.nn.softmax. The graph is executed within a tf.Session.

  2. What is the role of tf.placeholder() in TensorFlow 1.x? tf.placeholder() is used to define input nodes in a TensorFlow graph that will be fed data during execution. It allows you to build a graph that can operate on different data batches or inputs without recompiling the graph.

  3. How do convolutional layers extract features in a CNN? Convolutional layers use learnable filters (kernels) that slide across the input image. Each filter is designed to detect specific patterns (e.g., edges, corners, textures). The dot product between the filter and the local input region produces an "activation map," highlighting where the feature is present in the image.

  4. Describe the purpose and working of max pooling in CNNs. Max pooling is a downsampling operation that reduces the spatial dimensions (width and height) of feature maps. It works by dividing the feature map into small rectangular regions and outputting the maximum value from each region. Its purpose is to reduce computational complexity, control overfitting, and make the network more robust to spatial variations in features.

  5. Why do we reshape input data before feeding into a convolutional layer? Convolutional operations in TensorFlow (tf.nn.conv2d) expect input tensors to have a specific 4D format: [batch_size, height, width, channels]. Raw input images (e.g., from MNIST) are often flattened into a 1D vector. Reshaping converts this 1D vector into the required 4D tensor format, including the channel dimension (e.g., 1 for grayscale images).

  6. What are the advantages of using the Adam optimizer over standard gradient descent? Adam (Adaptive Moment Estimation) is an adaptive learning rate optimization algorithm. Advantages over standard Stochastic Gradient Descent (SGD) include:

    • Faster Convergence: It adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients.
    • Improved Performance: Often performs well on problems with sparse gradients or noisy objectives.
    • Less Hyperparameter Tuning: Generally requires less manual tuning of the learning rate.
  7. How does softmax classification work in TensorFlow? Softmax classification involves two main steps (a small numeric example follows at the end of this list):

    • Logits: A linear transformation of the final layer's output (e.g., tf.matmul(input, weights) + biases).
    • Softmax Activation (tf.nn.softmax): Converts these logits into a probability distribution across all classes. Each output value represents the probability that the input belongs to that specific class.
  8. What is the role of tf.Session() and how is it used for training a model? In TensorFlow 1.x, tf.Session() is an object that manages the execution of the computational graph. Training a model involves:

    1. Initializing variables (sess.run(tf.global_variables_initializer())).
    2. Feeding data into placeholders (feed_dict).
    3. Running operations, such as the optimizer (sess.run(optimizer, feed_dict=...)) or loss calculation, within the session. The session is responsible for managing tensors and operations, enabling computations to be performed efficiently.
  9. How do TensorBoard summaries help during model training? TensorBoard summaries allow you to log various aspects of your training process, such as:

    • Metrics: Accuracy, loss, learning rate over time.
    • Histograms: Distribution of weights, biases, and activations.
    • Graphs: Visualization of the model's computational graph. This information is crucial for monitoring training progress, diagnosing issues, and understanding model behavior.
  10. What are the key differences between TensorFlow 1.x and 2.x in terms of model execution? The primary differences lie in how computation is defined and executed:

    • TensorFlow 1.x: Uses a static graph paradigm. You first define the entire computational graph, then compile it, and finally execute it within a tf.Session. This requires explicit placeholders and sessions.
    • TensorFlow 2.x: Employs eager execution by default, meaning operations are executed immediately as they are called, like standard Python. It also strongly encourages the use of high-level APIs like tf.keras, which abstract away much of the low-level graph management and session handling, simplifying model development.
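To make the softmax step in question 7 concrete, here is a small standalone numeric example with illustrative logit values:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)        # approx. [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -- a valid probability distribution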