Lecture Notes on Convolutional Neural Networks

Author

Your Name

Published

January 28, 2025

Introduction

This lecture introduces convolutional neural networks (CNNs) and their applications, particularly focusing on image processing tasks. We will begin by revisiting the concept of standard convolutional filters and then delve into transpose convolutional filters, which are essential for increasing the spatial resolution of feature maps.

The primary goal of this session is to understand how transpose convolutions can be used to upsample images, moving from smaller input sizes to larger outputs. This is in contrast to standard convolutions, which typically reduce the spatial dimensions.

Following the theoretical foundations, we will explore the practical applications of CNNs in computer vision. The lecture will cover three major areas:

  • Image Classification: Assigning a label to an entire image.

  • Object Detection: Identifying and localizing objects within an image.

  • Image Segmentation: Partitioning an image into meaningful regions or objects.

The latter part of this lecture is dedicated to a hands-on laboratory activity using PyTorch. In this lab, you will gain practical experience by implementing and training a CNN for image classification. The lab session will be guided by educational assistants Beatrice and Ali.

The practical exercises will include:

  • CIFAR-10 Dataset: Training a CNN to classify images from the CIFAR-10 dataset, which consists of 10 classes of objects.

  • Fashion MNIST Dataset: An exercise to adapt the learned concepts to classify grayscale images of clothing articles from the Fashion MNIST dataset.

  • Object Detection Example: An exploration of a pre-built object detection model, demonstrating the application of CNNs in more complex tasks. A ready-to-run notebook example will be provided for this purpose.

All materials for this lab, including Jupyter notebooks and exercise files, are available on Microsoft Teams under the respective lab channel. Solutions to previous exercises are also accessible on the platform. This lab aims to provide a foundational understanding of CNNs and their practical implementation in PyTorch for various computer vision tasks.

Convolutional Filters

Standard Convolution

In standard convolution, a filter, also known as a kernel, slides across the input image to produce a feature map. This operation is fundamental for extracting features from the input data. A key characteristic of standard convolution, especially when applied without padding, is the reduction in the spatial dimensions of the output feature map compared to the input image. This section will elaborate on this image size reduction and explain the representation of convolution as matrix multiplication.

Image Size Reduction

When a convolutional filter is applied to an image without padding, the output feature map’s spatial dimensions are smaller than the input image. This reduction occurs because the filter cannot be fully positioned at the borders of the input image; it must be entirely within the image boundaries to perform the convolution operation.

Consider an input image \(I\) of height \(H\) and width \(W\), denoted as \(H \times W\). Let \(K\) be a convolutional kernel with height \(h\) and width \(w\), denoted as \(h \times w\). If we apply standard convolution with a stride of 1 and no padding, the resulting output image \(O\) will have dimensions \((H-h+1) \times (W-w+1)\).

For instance, if we have a \(5 \times 5\) input image and a \(3 \times 3\) kernel, the output size will be \((5-3+1) \times (5-3+1) = 3 \times 3\). This reduction in size is a direct consequence of the convolution operation’s mechanics.
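The size formula can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_output_size` is hypothetical, and it assumes stride 1 and no padding, as in the text above.

```python
def conv_output_size(H, W, h, w):
    """Output (height, width) of a stride-1, unpadded convolution."""
    if h > H or w > W:
        raise ValueError("kernel must fit inside the input")
    return H - h + 1, W - w + 1


print(conv_output_size(5, 5, 3, 3))    # (3, 3)
print(conv_output_size(32, 32, 5, 5))  # (28, 28)
```

The second call uses the 32x32 image size and 5x5 kernel that appear later in the lab.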

Convolution as Matrix Multiplication

An alternative and insightful way to understand convolution is through matrix multiplication. This perspective is particularly useful when contrasting standard convolution with transpose convolution. To represent convolution as matrix multiplication, we need to transform both the input image and the convolutional kernel into matrix forms.

Let’s consider a simplified 1D example to illustrate the concept, which can be extended to 2D images. Suppose we have a \(1 \times 3\) input image \(I = [i_1, i_2, i_3]\) and a \(1 \times 2\) kernel \(K = [k_1, k_2]\). The standard convolution \(I \ast K\) (without padding, stride 1) can be computed directly. To represent this as matrix multiplication, we construct a convolutional matrix \(C\) from the kernel \(K\).

For a \(1 \times 2\) kernel, the convolutional matrix \(C\) that operates on a \(1 \times 3\) input vector will be a \(2 \times 3\) matrix. Each row of \(C\) corresponds to a kernel application position.

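For the first output element, the kernel sits at the first position, giving the row \([k_1, k_2, 0]\). For the second output element, the kernel moves one step to the right, giving the row \([0, k_1, k_2]\). The zeros fill the positions the kernel does not cover, so that every row has the same length as the input.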

Thus, the convolutional matrix \(C\) is constructed as: \[C = \begin{pmatrix} k_1 & k_2 & 0 \\ 0 & k_1 & k_2 \end{pmatrix}\] The input image \(I\) is reshaped into a column vector \(x = \begin{pmatrix} i_1 \\ i_2 \\ i_3 \end{pmatrix}\). The convolution operation \(y = I \ast K\) is then equivalent to the matrix multiplication \(y' = C \cdot x\).

\[y' = \begin{pmatrix} k_1 & k_2 & 0 \\ 0 & k_1 & k_2 \end{pmatrix} \begin{pmatrix} i_1 \\ i_2 \\ i_3 \end{pmatrix} = \begin{pmatrix} k_1i_1 + k_2i_2 \\ k_1i_2 + k_2i_3 \end{pmatrix}\] The result \(y' = \begin{pmatrix} y'_1 \\ y'_2 \end{pmatrix}\) is a vector representing the convolved output. In this 1D example, the output is a \(1 \times 2\) vector, reflecting the size reduction. For 2D images, this principle extends, and the convolutional matrix becomes significantly larger but follows the same logic of placing kernel weights according to their position during the sliding window operation.
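The construction of \(C\) can be sketched in plain Python. The function names (`conv_matrix_1d`, `matvec`, `sliding_window`) and the numeric values are illustrative assumptions, not part of the lecture material. As in the formulas above, the kernel is applied without flipping, which is the convention deep learning libraries use for "convolution".

```python
def conv_matrix_1d(kernel, input_len):
    """Build the (input_len - len(kernel) + 1) x input_len matrix C."""
    k = len(kernel)
    out_len = input_len - k + 1
    rows = []
    for r in range(out_len):
        # Row r holds the kernel shifted r positions to the right, zeros elsewhere.
        rows.append([0] * r + list(kernel) + [0] * (input_len - k - r))
    return rows


def matvec(M, v):
    """Plain matrix-vector product for lists of numbers."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sliding_window(signal, kernel):
    """Direct stride-1, unpadded sliding-window computation."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


I = [1, 2, 3]    # i_1, i_2, i_3 (example values)
K = [10, 100]    # k_1, k_2 (example values)

C = conv_matrix_1d(K, len(I))
print(C)                     # [[10, 100, 0], [0, 10, 100]]
print(matvec(C, I))          # [210, 320]
print(sliding_window(I, K))  # [210, 320]
```

Both routes give the same two output values, which is the equivalence the matrix view relies on.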

Transpose Convolution

Transpose convolution, also referred to as fractionally strided convolution or, somewhat misleadingly, deconvolution, is an upsampling technique. Unlike standard convolution, which typically reduces spatial dimensions, transpose convolution increases them. It reverses the effect of standard convolution on spatial resolution, but it does not in general recover the original input values.

Upsampling via Transpose Convolution

Transpose convolution employs a filter to map from a smaller input size to a larger output size. This is the reverse of what standard convolution usually does. Transpose convolution is particularly useful in applications such as:

  • Image Segmentation: To upscale feature maps to the original image resolution for pixel-wise classification.

  • Generative Models (e.g., GANs, VAEs): To increase the spatial dimensions of generated samples, starting from a lower-dimensional latent space.

Starting with a smaller input image, a transpose convolutional filter is designed to produce a larger output. This can be interpreted as learning an "inverse" convolution that effectively interpolates and upscales the input. The filter weights in transpose convolution are learned during training, allowing the network to determine the optimal upsampling strategy.
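In PyTorch, this kind of learned upsampling is available as `nn.ConvTranspose2d`. The sketch below uses hypothetical channel counts and a 2x2 kernel with stride 2, chosen only because that combination doubles the height and width. It is an illustration, not a layer taken from the lab.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes chosen for illustration.
upsample = nn.ConvTranspose2d(in_channels=8, out_channels=4,
                              kernel_size=2, stride=2)

x = torch.randn(1, 8, 4, 4)   # a batch with one 8-channel 4x4 feature map
y = upsample(x)

print(x.shape)  # torch.Size([1, 8, 4, 4])
print(y.shape)  # torch.Size([1, 4, 8, 8])

# The kernel weights are ordinary learnable parameters.
print(upsample.weight.requires_grad)  # True
```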

Matrix Representation of Transpose Convolution

Building upon the matrix representation of standard convolution, transpose convolution can be elegantly understood using the transpose of the convolutional matrix.

If standard convolution is represented by the matrix multiplication \(y' = C \cdot x\), where \(C\) is the convolutional matrix, \(x\) is the vectorized input image, and \(y'\) is the vectorized output, then transpose convolution can be represented using the transpose of \(C\), denoted as \(C^T\).

Given the convolutional matrix \(C\) derived from a standard convolution operation, the transpose convolution operation is represented as \(x' = C^T \cdot y'\). Here, \(y'\) is a vector representing a smaller, lower-resolution input feature map, and \(x'\) is the resulting vector representing the upsampled, higher-resolution output. Reshaping \(x'\) back into a matrix yields the larger output image.

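Using the transposed convolutional matrix \(C^T\) reverses the shape mapping of the standard convolution: where \(C\) maps a vector of the input size to one of the output size, \(C^T\) maps a vector of the output size back to one of the input size. It is not the matrix inverse of \(C\), so the original values are generally not recovered. This formulation clarifies how transpose convolution achieves upsampling and provides a foundation for its use in neural networks.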

For our 1D example, the transpose convolutional matrix would be \(C^T\): \[C^T = \begin{pmatrix} k_1 & 0 \\ k_2 & k_1 \\ 0 & k_2 \end{pmatrix}\] If we multiply this transposed matrix by the output of the previous convolution \(y' = \begin{pmatrix} y'_1 \\ y'_2 \end{pmatrix}\), we get: \[x' = C^T \cdot y' = \begin{pmatrix} k_1 & 0 \\ k_2 & k_1 \\ 0 & k_2 \end{pmatrix} \begin{pmatrix} y'_1 \\ y'_2 \end{pmatrix} = \begin{pmatrix} k_1y'_1 \\ k_2y'_1 + k_1y'_2 \\ k_2y'_2 \end{pmatrix}\] The resulting vector \(x' = \begin{pmatrix} x'_1 \\ x'_2 \\ x'_3 \end{pmatrix}\) is a vector of size \(1 \times 3\), which is the original input size, demonstrating the upsampling effect in this simplified example. In practice, transpose convolution learns the optimal weights for upsampling, rather than strictly inverting a predefined convolution.
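The 1D example can be reproduced numerically. The following sketch picks arbitrary example values for \(k_1, k_2\) and the input, and compares the explicit matrices with PyTorch's `F.conv1d` and `F.conv_transpose1d`. It is meant as a sanity check under those assumptions.

```python
import torch
import torch.nn.functional as F

k1, k2 = 2.0, 3.0                    # example kernel values
x = torch.tensor([1.0, 4.0, 5.0])    # example input i_1, i_2, i_3

# Convolutional matrix C from the text (2 x 3).
C = torch.tensor([[k1, k2, 0.0],
                  [0.0, k1, k2]])

y = C @ x          # standard convolution: length 3 -> length 2
x_up = C.T @ y     # transpose convolution: length 2 -> length 3

# Same operations through the functional API (shape: batch, channels, length).
kernel = torch.tensor([[[k1, k2]]])
y_torch = F.conv1d(x.view(1, 1, -1), kernel).flatten()
x_up_torch = F.conv_transpose1d(y.view(1, 1, -1), kernel).flatten()

print(y)       # tensor([14., 23.])
print(x_up)    # tensor([28., 88., 69.])
print(torch.allclose(y, y_torch), torch.allclose(x_up, x_up_torch))  # True True
```

Note that `x_up` has the original length of 3 but different values from `x`: the transpose restores the shape, not the content.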

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a class of deep neural networks that have proven exceptionally effective in processing structured grid data, particularly images. CNNs excel at automatically learning hierarchical spatial features through convolutional layers, making them ideal for various computer vision tasks.

Applications of CNNs in Computer Vision

CNNs have become the cornerstone of modern computer vision, revolutionizing several tasks. This lecture highlights the following key applications of CNNs:

Image Classification

Image classification is the task of assigning a single label to an entire image, indicating its primary content. CNNs achieve this by learning discriminative features that distinguish between different classes. For instance, a CNN can be trained to classify images into categories such as ‘airplane’, ‘car’, ‘bird’, ‘cat’, ‘dog’, etc., based on learned visual patterns.

Object Detection

Object detection is a more complex task that involves not only classifying objects within an image but also localizing them by drawing bounding boxes around each detected object. CNNs for object detection are designed to identify multiple objects in a single image and determine their positions, enabling applications like autonomous driving and surveillance systems.

Image Segmentation

Image segmentation aims to partition an image into meaningful segments, often at the pixel level. This can be further divided into:

  • Semantic Segmentation: Classifying each pixel in the image into predefined categories (e.g., pixel-wise classification of roads, buildings, and pedestrians in a street scene).

  • Instance Segmentation: Differentiating between individual instances of the same object class (e.g., identifying each car in an image as a separate instance, even if they belong to the same class ‘car’).

Image segmentation is crucial for applications requiring fine-grained understanding of image content, such as medical image analysis and robotic perception.

Lab Activity: CNN Implementation with PyTorch

This lab session provides a hands-on experience in implementing and training a CNN using the PyTorch deep learning framework. The primary objective is to build a CNN for image classification using the CIFAR-10 dataset.

Lab Assistants

For this session, Beatrice and Ali, experienced lab assistants, will be available to guide you through the exercises, answer questions, and provide support throughout the lab activity.

Accessing Lab Materials

All necessary materials for this lab, including Jupyter notebooks and exercise files, are available on Microsoft Teams. Navigate to the channel dedicated to this lab session and find the materials under the ‘Files’ tab. Solutions to previous lab exercises are also available in their respective channels, allowing you to review and catch up if needed.

Review of Previous Lab: "Weird Network" Exercise

Before starting the CNN lab, we briefly reviewed the solution to the "weird network" exercise from the previous lab (Lab 1). This exercise highlighted an unusual network configuration where a middle layer was designed to potentially become an identity function. The key takeaway was understanding how network layers can adapt and simplify during training, even in unconventional setups.

Practical CNN Implementation for CIFAR-10

In this hands-on section, we will implement a CNN in PyTorch to classify images from the CIFAR-10 dataset.

CIFAR-10 Dataset: Loading and Preprocessing

The CIFAR-10 dataset is a widely used benchmark dataset for image classification. It consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. We will use PyTorch’s ‘torchvision’ library to efficiently load and manage this dataset.

The essential preprocessing steps for CIFAR-10 include:

  • Normalization: Normalizing image pixel values is important for stable and fast training. Typically, images are first converted to tensors and then normalized channel-wise: the per-channel mean is subtracted and the result is divided by the per-channel standard deviation, using values pre-calculated for the CIFAR-10 dataset, so that each channel has approximately zero mean and unit variance. This puts the input features on a similar scale, which helps optimization.

  • Data Loaders: PyTorch ‘DataLoader’ is used to create efficient, iterable batches of the dataset. This is essential for mini-batch gradient descent. Key parameters for ‘DataLoader’ include ‘batch_size’ (number of images per batch) and ‘shuffle’ (whether to shuffle the dataset in each epoch to improve generalization). For training, shuffling is typically set to ‘True’, while for validation and testing, it is set to ‘False’.
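The two preprocessing steps might be set up as sketched below. Several things here are assumptions: the function name `build_cifar10_loaders`, the `./data` root folder, the batch size of 4, and the mean and standard deviation of 0.5, which are placeholder constants that map pixel values from [0, 1] to [-1, 1] rather than true CIFAR-10 statistics. The function is defined but not called, because calling it downloads the dataset. The lines after it use random tensors as a stand-in to show the same normalization and batching.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

MEAN = (0.5, 0.5, 0.5)   # placeholder per-channel mean (assumption)
STD = (0.5, 0.5, 0.5)    # placeholder per-channel std (assumption)


def build_cifar10_loaders(root="./data", batch_size=4):
    """Create train/test loaders for CIFAR-10 (downloads the data when called)."""
    # Imported here so the stand-in demo below only needs torch itself.
    import torchvision
    import torchvision.transforms as transforms

    transform = transforms.Compose([
        transforms.ToTensor(),             # PIL image -> float tensor in [0, 1]
        transforms.Normalize(MEAN, STD),   # channel-wise (x - mean) / std
    ])
    trainset = torchvision.datasets.CIFAR10(
        root=root, train=True, download=True, transform=transform)
    testset = torchvision.datasets.CIFAR10(
        root=root, train=False, download=True, transform=transform)
    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True)
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False)
    return trainloader, testloader


# Stand-in data with CIFAR-10's shape: 16 random RGB images of 32x32.
fake_images = torch.rand(16, 3, 32, 32)
fake_labels = torch.randint(0, 10, (16,))

mean = torch.tensor(MEAN).view(1, 3, 1, 1)
std = torch.tensor(STD).view(1, 3, 1, 1)
normalized = (fake_images - mean) / std   # same arithmetic as Normalize

demo_loader = DataLoader(TensorDataset(normalized, fake_labels),
                         batch_size=4, shuffle=True)
images, labels = next(iter(demo_loader))
print(images.shape, labels.shape)  # torch.Size([4, 3, 32, 32]) torch.Size([4])
```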

Defining the CNN Architecture in PyTorch

We will define a simple CNN architecture using PyTorch’s ‘nn’ module. This architecture will consist of convolutional layers (‘nn.Conv2d’), max-pooling layers (‘nn.MaxPool2d’), ReLU activation functions, and fully connected linear layers (‘nn.Linear’).

The architecture we will implement is as follows:

  1. Convolutional Layer 1: ‘nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)’ - Takes RGB images (3 channels) as input and outputs 6 feature maps using a 5x5 kernel.

  2. ReLU Activation: ‘F.relu()’ - Applies Rectified Linear Unit activation function element-wise.

  3. Max Pooling Layer 1: ‘nn.MaxPool2d(kernel_size=2, stride=2)’ - Reduces spatial dimensions by half using 2x2 max pooling with a stride of 2.

  4. Convolutional Layer 2: ‘nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)’ - Takes 6 input channels and outputs 16 feature maps.

  5. ReLU Activation: ‘F.relu()’ - Applies ReLU activation again.

  6. Max Pooling Layer 2: ‘nn.MaxPool2d(kernel_size=2, stride=2)’ - Further reduces spatial dimensions.

  7. Flatten Layer: Reshapes the 16 feature maps into a single vector to prepare for fully connected layers.

  8. Fully Connected Layer 1: ‘nn.Linear(in_features=16 * 5 * 5, out_features=120)’ - Maps the flattened features to 120 output features. The ‘in_features’ size follows from the output size after the second pooling layer: the spatial size goes 32 → 28 → 14 → 10 → 5, leaving 16 feature maps of 5x5.

  9. ReLU Activation: ‘F.relu()’ - ReLU activation.

  10. Fully Connected Layer 2: ‘nn.Linear(in_features=120, out_features=84)’ - Maps 120 features to 84 features.

  11. ReLU Activation: ‘F.relu()’ - ReLU activation.

  12. Fully Connected Layer 3 (Output Layer): ‘nn.Linear(in_features=84, out_features=10)’ - Maps 84 features to 10 output features, corresponding to the 10 classes in CIFAR-10.
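The value ‘16 * 5 * 5’ in step 8 can be traced layer by layer. The helper `conv_out` below is a hypothetical name, and the sketch assumes no padding anywhere, matching the layer definitions listed above.

```python
def conv_out(size, kernel, stride=1):
    """Spatial size after an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
size = conv_out(size, kernel=5)             # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5

flat_features = 16 * size * size            # 16 channels come out of conv2
print(size, flat_features)                  # 5 400
```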

Implementing the Forward Pass in PyTorch

The forward pass defines the sequence of operations through which the input data flows in the network. In PyTorch, this is implemented in the ‘forward’ method of the network class.

def forward(self, x):
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = torch.flatten(x, 1) # flatten all dimensions except batch
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x

In this ‘forward’ function:

  • Input ‘x’ is passed through the first convolutional layer (‘self.conv1’), followed by ReLU activation (‘F.relu()’) and max pooling (‘self.pool’).

  • This process is repeated for the second convolutional layer (‘self.conv2’).

  • The feature maps are then flattened into a vector using ‘torch.flatten(x, 1)’.

  • The flattened vector is passed through the fully connected layers (‘self.fc1’, ‘self.fc2’, ‘self.fc3’), with ReLU activation applied after each fully connected layer except the last one.

  • Finally, the output ‘x’ is returned. It contains the raw scores (logits) for each class; no softmax is applied in the network because ‘nn.CrossEntropyLoss’ applies log-softmax internally.
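For reference, the ‘forward’ method can be placed in a complete module and run on a dummy batch. The class name `Net` and the batch size of 4 are assumptions made for this sketch. The layer definitions follow the architecture listed earlier, with a single pooling layer object reused after both convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)      # reused after both conv layers
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)             # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                  # raw class scores (logits)


net = Net()
dummy = torch.randn(4, 3, 32, 32)   # batch of 4 CIFAR-10-sized inputs
logits = net(dummy)
print(logits.shape)                 # torch.Size([4, 10])
```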

Training the CNN: Loss Function and Optimizer

To train the CNN, we need to define a loss function and an optimizer.

  • Loss Function: We use Cross-Entropy Loss (‘nn.CrossEntropyLoss’) which is suitable for multi-class classification problems. Cross-entropy loss measures the dissimilarity between the predicted probability distribution and the true distribution of classes.

  • Optimizer: We use Stochastic Gradient Descent (SGD) (‘torch.optim.SGD’) to update the network’s weights. Key hyperparameters for SGD include:

    • Learning Rate (lr): Controls the step size at each iteration while moving towards a minimum of the loss function. A typical starting value is 0.001 or 0.01.

    • Momentum: Helps accelerate SGD in the relevant direction and dampens oscillations. A common value is 0.9.
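A minimal sketch of this setup follows. The linear stand-in model and the random logits are assumptions used only to keep the example self-contained, and the learning rate and momentum are the example values mentioned above. The last lines check that ‘nn.CrossEntropyLoss’ on raw logits matches log-softmax followed by negative log-likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Stand-in model (assumption): any nn.Module with 10 outputs works here.
net = nn.Linear(20, 10)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# CrossEntropyLoss takes raw logits and integer class labels.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 9, 1])
loss = criterion(logits, labels)

# Equivalent two-step form: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))  # True
```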

Training Loop and Backpropagation

The training process involves iterating over the dataset for a number of epochs. An epoch is a complete pass through the entire training dataset. Within each epoch, we process data in batches. The training loop in PyTorch typically follows these steps:

Input: network net, data loader trainloader, loss function criterion, optimizer, number of epochs

For each epoch:
    For each batch in trainloader:
        Get inputs and labels from the batch
        Zero the gradients: optimizer.zero_grad()
        Forward pass: outputs = net(inputs)
        Calculate loss: loss = criterion(outputs, labels)
        Backpropagation: loss.backward()
        Update weights: optimizer.step()
        Print statistics (loss, accuracy, etc.) every N batches
    Print epoch statistics

Key steps in the training loop:

  1. Zero Gradients: Before each batch, it’s crucial to zero out the gradients from the previous batch using ‘optimizer.zero_grad()’. Otherwise, gradients would accumulate, leading to incorrect updates.

  2. Forward Pass: Pass the input batch through the network to get the output predictions: ‘outputs = net(inputs)’.

  3. Calculate Loss: Compute the loss by comparing the network’s output with the true labels using the chosen loss function: ‘loss = criterion(outputs, labels)’.

  4. Backpropagation: Compute gradients of the loss with respect to all network parameters using backpropagation: ‘loss.backward()’. This step calculates how much each parameter contributed to the loss.

  5. Optimization Step: Update the network’s parameters based on the calculated gradients using the optimizer: ‘optimizer.step()’. This step adjusts the weights to reduce the loss.

  6. Statistics and Logging: Periodically print or log training statistics such as the current loss and accuracy to monitor the training progress.
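The steps above can be sketched as a runnable loop. To keep it self-contained, random tensors and a tiny linear model stand in for CIFAR-10 and the CNN. Those stand-ins, the variable names `log_every` and `epoch_losses`, and the values for batch size and epoch count are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins (assumptions): random data and a tiny model replace CIFAR-10 and the CNN.
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
trainloader = DataLoader(data, batch_size=8, shuffle=True)
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

num_epochs = 2
log_every = 4        # print the running loss every N batches
epoch_losses = []

for epoch in range(num_epochs):
    running_loss = 0.0
    epoch_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, start=1):
        optimizer.zero_grad()               # 1. clear gradients from the last batch
        outputs = net(inputs)               # 2. forward pass
        loss = criterion(outputs, labels)   # 3. compute the loss
        loss.backward()                     # 4. backpropagation
        optimizer.step()                    # 5. update the weights

        running_loss += loss.item()         # 6. statistics and logging
        epoch_loss += loss.item()
        if i % log_every == 0:
            print(f"[epoch {epoch + 1}, batch {i}] "
                  f"loss: {running_loss / log_every:.3f}")
            running_loss = 0.0

    epoch_losses.append(epoch_loss / len(trainloader))
    print(f"epoch {epoch + 1} mean loss: {epoch_losses[-1]:.3f}")
```

With random labels the loss is not expected to fall meaningfully; the point is the order of the calls inside the inner loop.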

Evaluating Model Performance on Test Set

After training, it is essential to evaluate the model’s performance on a separate test dataset to assess its generalization ability. The evaluation process involves:

  1. Prediction: Pass test images through the trained network to obtain predictions.

  2. Accuracy Calculation: Calculate the accuracy by comparing the predicted labels with the true labels for the test set. Accuracy is the percentage of correctly classified images. We can calculate overall accuracy and class-wise accuracy to understand the model’s performance in detail.
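Overall and class-wise accuracy could be computed along the following lines. The function name `evaluate` is hypothetical. The check at the end uses ‘nn.Identity’ as a stand-in model so that the inputs act directly as logits and the expected accuracies can be verified by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(net, loader, num_classes):
    """Return overall accuracy and per-class accuracies (None if a class is absent)."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    net.eval()                    # put layers such as dropout in inference mode
    with torch.no_grad():         # gradients are not needed for evaluation
        for inputs, labels in loader:
            predicted = net(inputs).argmax(dim=1)
            for c in range(num_classes):
                mask = labels == c
                total[c] += mask.sum()
                correct[c] += (predicted[mask] == c).sum()
    overall = (correct.sum() / total.sum()).item()
    per_class = [(correct[c] / total[c]).item() if total[c] > 0 else None
                 for c in range(num_classes)]
    return overall, per_class


# Stand-in check (assumption): with nn.Identity the inputs are the logits.
logits = torch.eye(3)[[0, 0, 1, 2]]    # predicted classes: 0, 0, 1, 2
labels = torch.tensor([0, 1, 1, 2])    # one class-1 sample is misclassified
loader = DataLoader(TensorDataset(logits, labels), batch_size=2)

overall, per_class = evaluate(nn.Identity(), loader, num_classes=3)
print(overall, per_class)  # 0.75 [1.0, 0.5, 1.0]
```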

Validation Set (Best Practice)

While not explicitly used in this lab example for simplicity, in practice, it is highly recommended to use a validation set. A validation set is a portion of the training data held out to monitor the model’s performance during training. It helps in:

  • Hyperparameter Tuning: Optimizing hyperparameters like learning rate, batch size, and network architecture.

  • Early Stopping: Preventing overfitting by stopping training when the validation loss starts to increase, even if the training loss is still decreasing.

The validation process is similar to the test process but is performed during training epochs to guide training decisions.
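The early-stopping rule can be illustrated without any training at all. In this sketch the function name `early_stopping_epoch`, the `patience` parameter, and the list of validation losses are all made up for illustration. A real loop would compute the validation loss after each epoch and apply the same test.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: improvement stalls after epoch index 2.
history = [0.90, 0.70, 0.62, 0.65, 0.66, 0.70]
print(early_stopping_epoch(history, patience=2))  # 4
```

In practice one would also keep a copy of the weights from the best epoch (index 2 here) rather than the weights at the stopping point.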

Additional Exercise: Fashion MNIST Dataset

As a homework exercise, you are tasked with adapting the CNN model to classify images from the Fashion MNIST dataset. Fashion MNIST is a dataset of 70,000 grayscale images of fashion articles from 10 categories. The images are 28x28 pixels.

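Two adaptations are needed for Fashion MNIST. First, change the ‘in_channels’ parameter of the first convolutional layer to 1, as Fashion MNIST images are grayscale (single channel), compared to the 3 channels (RGB) of CIFAR-10. Second, because the images are 28x28 rather than 32x32, the feature maps after the second pooling layer are 4x4 instead of 5x5 (28 → 24 → 12 → 8 → 4), so ‘in_features’ of the first fully connected layer becomes 16 * 4 * 4 = 256. Any normalization constants must likewise be given for a single channel. The rest of the network architecture and the training procedure can remain the same. This exercise will reinforce your understanding of CNNs and their adaptability to different datasets.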
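One possible shape of the adapted network is sketched below. It is a starting point, not the exercise solution. The class name `FashionNet` is an assumption. Besides the single input channel, note that 28x28 inputs shrink to 4x4 after the two convolution and pooling stages (28 → 24 → 12 → 8 → 4), so the first fully connected layer takes 16 * 4 * 4 inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)         # 1 input channel (grayscale)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)   # 28 -> 24 -> 12 -> 8 -> 4
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)            # 10 clothing categories

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


net = FashionNet()
dummy = torch.randn(4, 1, 28, 28)   # batch of 4 Fashion-MNIST-sized inputs
print(net(dummy).shape)             # torch.Size([4, 10])
```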

Additionally, an example notebook demonstrating object detection using a pre-trained model is available on Microsoft Teams for those interested in exploring more advanced applications of CNNs. This notebook provides a ready-to-run example of how CNNs are used in object detection tasks and can serve as a starting point for further exploration in this area.

Complexity Analysis of CNN

The computational complexity of a CNN is primarily determined by the convolutional layers and fully connected layers.

Convolutional Layer Complexity

For a convolutional layer with input size \(H \times W \times C_{in}\), kernel size \(K \times K\), \(C_{out}\) output channels, stride \(S\), and no padding, the number of multiply-accumulate operations per layer is approximately \(O(H_{out} \times W_{out} \times K^2 \times C_{in} \times C_{out})\), where \(H_{out} = \left\lfloor \frac{H - K}{S} \right\rfloor + 1\) and \(W_{out} = \left\lfloor \frac{W - K}{S} \right\rfloor + 1\).

Fully Connected Layer Complexity

For a fully connected layer with \(N_{in}\) input features and \(N_{out}\) output features, the complexity is \(O(N_{in} \times N_{out})\).

The overall complexity of a CNN depends on the number of layers and their parameters. CNNs are computationally intensive, especially during training, due to the large number of parameters and operations involved in convolution and backpropagation. However, their efficiency in feature extraction and hierarchical learning makes them highly effective for image and video processing tasks.
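As a worked example, the two formulas can be applied to the network from the lab. The helper names `conv_macs` and `fc_macs` are hypothetical. The counts cover multiply-accumulate operations for a single image only and ignore biases, activations, and pooling.

```python
def conv_macs(H, W, K, C_in, C_out, S=1):
    """Multiply-accumulate count of an unpadded K x K convolutional layer."""
    H_out = (H - K) // S + 1
    W_out = (W - K) // S + 1
    return H_out * W_out * K * K * C_in * C_out


def fc_macs(N_in, N_out):
    """Multiply-accumulate count of a fully connected layer."""
    return N_in * N_out


layers = {
    "conv1": conv_macs(32, 32, 5, 3, 6),    # 28*28*25*3*6
    "conv2": conv_macs(14, 14, 5, 6, 16),   # 10*10*25*6*16
    "fc1": fc_macs(16 * 5 * 5, 120),
    "fc2": fc_macs(120, 84),
    "fc3": fc_macs(84, 10),
}

for name, macs in layers.items():
    print(f"{name}: {macs}")
print("total:", sum(layers.values()))   # total: 651720
```

In this small network the two convolutional layers account for most of the operations (352,800 and 240,000), even though they hold far fewer weights than the first fully connected layer.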

Conclusion

This lecture has provided a comprehensive introduction to Convolutional Neural Networks (CNNs), covering both theoretical foundations and practical implementation aspects. We explored standard and transpose convolutional filters, their matrix representations, and the applications of CNNs in image classification, object detection, and image segmentation.

The hands-on lab session using PyTorch provided practical experience in building, training, and evaluating a CNN for image classification on the CIFAR-10 dataset. Students were guided through data loading and preprocessing, defining the CNN architecture, implementing the forward pass, setting up the training loop with a loss function and optimizer, and evaluating the model’s performance. The additional exercise with the Fashion MNIST dataset and the object detection example further extended the practical understanding of CNN applications.

Key Takeaways:

  • CNNs are powerful tools for feature extraction and hierarchical learning from image data.

  • Standard convolution reduces spatial dimensions, while transpose convolution increases them, enabling upsampling.

  • PyTorch simplifies the implementation and training of CNNs with its modular nn module and automatic differentiation capabilities.

  • Practical experience through lab exercises is crucial for mastering CNN concepts and developing hands-on skills in deep learning.

Further Exploration:

  • Experiment with different CNN architectures, hyperparameters such as learning rates and batch sizes, and explore various datasets beyond CIFAR-10 and Fashion MNIST.

  • Investigate advanced CNN architectures like ResNet, VGG, and Inception to understand deeper and more complex network designs.

  • Explore techniques for improving CNN performance, including data augmentation, regularization methods like dropout and weight decay, and batch normalization to stabilize training.

  • Study more complex applications of CNNs, such as advanced object detection frameworks, semantic segmentation for scene understanding, and generative models like GANs that utilize transpose convolutions.

We encourage you to continue working on the Fashion MNIST exercise and explore the object detection example notebook available on Microsoft Teams. Review the concepts covered in this lecture, particularly the matrix representations of convolution and transpose convolution, and experiment further with CNN implementations to solidify your understanding. We look forward to seeing you next week for further exploration in deep learning.