Lecture Notes on Convolutional Neural Networks

Author

Your Name

Published

February 3, 2025

Introduction

This lecture explores Convolutional Neural Networks (CNNs), a foundational architecture in deep learning with significant impact in fields such as computer vision and image recognition. We will examine the core concepts of CNNs, beginning with filters and kernels, and proceed to understand how convolution operations, which are essentially a series of dot products, form the basis of these networks. We will discuss the motivation behind CNNs, drawing inspiration from biological vision, and highlight their efficiency and effectiveness in processing visual data. Key topics will include the architecture of CNNs, the functions of convolutional layers such as feature extraction and mapping, and the crucial role of hyperparameters and data in training these powerful models. By the end of this lecture, you should have a solid understanding of the principles and practical aspects of Convolutional Neural Networks.

Convolutional Neural Networks

Core Concepts

Filters and Kernels

In Convolutional Neural Networks, the fundamental building blocks for feature extraction are filters or kernels. These terms are often used interchangeably.

  • Filters/Kernels: Typically small matrices or tensors containing weights that are learned during the training process. They act as feature detectors, sliding across the input data to identify specific patterns.

  • Small Matrix or Tensor: Filters are usually of a smaller dimension compared to the input image, allowing them to detect local features efficiently.

As illustrated in Figure 1, the concepts of filters and kernels are foundational to understanding convolution and dot products in CNNs.

Figure 1: Core concepts of CNNs visualized as a pyramid

Convolution as a Dot Product

At its core, the convolution operation in CNNs is a series of dot products.

  • Dot Product Operation: Convolution involves sliding a filter across the input data (e.g., an image) and, at each location, computing the dot product between the filter and the corresponding patch of the input.
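
To make this concrete, the following minimal NumPy sketch (patch and filter values chosen purely for illustration) computes one such convolution step as a plain dot product of the flattened filter and patch.

import numpy as np

# Hypothetical 3x3 image patch and 3x3 filter (values chosen for illustration)
patch = np.array([[0., 0., 2.],
                  [0., 0., 2.],
                  [0., 0., 2.]])
kernel = np.array([[-1., -1., 1.],
                   [-1., -1., 1.],
                   [-1., -1., 1.]])

# One convolution step: flatten both and take the dot product
response = np.dot(patch.ravel(), kernel.ravel())
print(response)  # 6.0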

Motivation for CNNs

Applications in Computer Vision

Convolutional Neural Networks are particularly well-suited for computer vision tasks.

  • Image Recognition and Processing: CNNs excel in tasks like image classification, object detection, and image segmentation due to their ability to automatically learn spatial hierarchies of features.

  • Feature Visibility: In vision, features such as edges, corners, and textures are spatially localized and visually discernible. CNNs are designed to effectively detect these types of features.

Inspiration from Biological Vision

The architecture of CNNs is inspired by biological visual processes.

  • Visual Cortex: The mechanisms used by the visual cortex in the human brain to process visual information are similar to convolution operations. This biological inspiration underpins the design and effectiveness of CNNs for vision tasks.

Understanding Convolution

Convolutional Filters

Definition: Sliding Window Operation

A convolutional filter operates as a sliding window over the input image.

  • Sliding Window: The filter, also known as a kernel, moves across the image both horizontally and vertically.

  • Simplified 2D Case: For simplicity, we often consider the 2D case to understand the basic operation, but the concept extends to higher dimensions.

Example: Horizontal Edge Detection

Consider an example filter designed for horizontal edge detection, as shown in Figure 2.

1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
-1.0 -1.0 -1.0 -1.0
-1.0 -1.0 -1.0 -1.0
Figure 2: A simple filter for horizontal line detection (Figure 3.1 from transcription)

This filter, with positive values in the top rows and negative values in the bottom rows, is designed to respond strongly to horizontal edges in an image.
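
As a quick check of this intuition, the NumPy sketch below (with hypothetical pixel values) applies the filter to one patch containing a horizontal edge and one uniform patch; only the edge produces a strong response.

import numpy as np

# The horizontal-edge filter from Figure 2
kernel = np.array([[ 1.,  1.,  1.,  1.],
                   [ 1.,  1.,  1.,  1.],
                   [-1., -1., -1., -1.],
                   [-1., -1., -1., -1.]])

# A patch containing a horizontal edge: bright rows above dark rows
edge_patch = np.array([[9., 9., 9., 9.],
                       [9., 9., 9., 9.],
                       [1., 1., 1., 1.],
                       [1., 1., 1., 1.]])

# A uniform patch with no edge
flat_patch = np.full((4, 4), 5.)

print(np.sum(kernel * edge_patch))  # 64.0: strong response at the edge
print(np.sum(kernel * flat_patch))  # 0.0: no response on uniform regions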

The Convolution Process

Convolution with Image Patches

The convolution process involves applying the filter to patches of the image.

  • Patch-wise Operation: The filter is convolved with a patch of the image, where the patch is of the same size as the filter.

Applying Filters to the Entire Image

The filter is systematically applied across the entire image to produce a feature map.

  • Global Application: By sliding the filter across all possible locations on the image, we apply the filter to the entire image.

Formal Definition of the Convolution Operation

Formally, convolution is an operation that computes a generalized dot product between the filter and image patches.

Definition (Convolution Operation). Given an input image \(I\) and a kernel \(K\), the convolution operation, denoted as \(V = I * K\), is defined for each spatial location \((x, y)\) as: \[V(x, y) = (I * K)(x, y) = \sum_{m} \sum_{n} I(x + m, y + n)K(m, n)\] where \(I(x+m, y+n)\) is the intensity of the input image at position \((x+m, y+n)\), and \(K(m, n)\) is the value of the kernel at position \((m, n)\). The indices \(m\) and \(n\) iterate over the dimensions of the kernel. (Strictly speaking, this index convention computes cross-correlation; deep learning libraries implement "convolution" this way, without flipping the kernel.)
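
A direct, unoptimized NumPy implementation of this definition (stride 1, no padding; written for clarity rather than speed) might look like the following.

import numpy as np

def conv2d_valid(I, K):
    """Direct implementation of the definition above (stride 1, no padding)."""
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    V = np.zeros((oh, ow))
    for x in range(oh):
        for y in range(ow):
            # Dot product of the kernel with the image patch at (x, y)
            V[x, y] = np.sum(I[x:x+kh, y:y+kw] * K)
    return V

I = np.array([[0., 0., 2., 2.]] * 4)  # the same 4x4 image used in the TensorFlow example later
K = np.array([[-1., -1., 1.]] * 3)    # the same 3x3 filter
print(conv2d_valid(I, K))             # [[6. 0.]
                                      #  [6. 0.]]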

As illustrated in Figure 3, the convolution operation takes an input image and a convolutional kernel (filter) to produce a convolved output.

Figure 3: Diagram of the convolution operation (based on transcription diagram)

The Dot Product Connection

Geometric Interpretation of the Dot Product

The dot product has a geometric interpretation related to the angle between vectors.

  • Angle and Similarity: For two vectors \(\mathbf{a}\) and \(\mathbf{b}\), their dot product is given by \(\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos(\theta)\), where \(\theta\) is the angle between them. This indicates the alignment or similarity between the vectors.

Figure 4 illustrates the geometric interpretation of the dot product.

Figure 4: Geometric interpretation of the dot product (based on transcription diagram)
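
The identity above is easy to verify numerically; the snippet below (using two arbitrary example vectors) recovers the angle between them from their dot product.

import numpy as np

a = np.array([3., 0.])
b = np.array([2., 2.])

# cos(theta) = (a . b) / (|a| |b|)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)                          # ~0.707
print(np.degrees(np.arccos(cos_theta)))  # ~45.0 degrees between the vectors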

Dot Product as a Measure of Feature Similarity

In CNNs, the kernel function acts similarly to a vector, and the dot product measures feature similarity.

  • Kernel as Feature Detector: The kernel function behaves like a vector, and when convolved with image patches, it computes dot products that indicate the presence and similarity of features in the image patch compared to the filter’s pattern.

  • Application in Digitized Vision: This process is crucial for extracting meaningful features from images, which is essential for tasks like object recognition and classification.

Why Use Convolutional Networks?

Reducing Connections and Parameters

Computational Efficiency

Convolutional networks are designed to reduce the number of connections compared to fully connected networks, leading to computational efficiency.

  • Minimize Unnecessary Information: By using local connections (filters), CNNs minimize the processing of unnecessary information during computation, especially in spatially structured data like images.

Mitigating Overfitting

Reducing the number of parameters in CNNs helps to mitigate overfitting, especially when dealing with limited data.

  • Kernel Filters: CNNs use kernel filters which have fewer parameters than fully connected layers, thus reducing the risk of overfitting.

Properties of Convolutional Kernels

Form and Value of Filters

Convolutional kernels are characterized by their form and the values they contain.

  • Form: Refers to the size and shape of the filter (e.g., \(3 \times 3\), \(5 \times 5\)).

  • Values: Refers to the numerical entries within the filter, which determine what features the filter is designed to detect.

Learning Filter Values via Backpropagation

A key advantage of deep convolutional networks is that the filter values are learned from data through backpropagation, rather than being manually designed.

  • NN Parameters: Filter values are treated as neural network parameters and are optimized during the training process using gradient descent.

  • Data-Driven Feature Learning: This learning approach allows CNNs to automatically learn effective features from large datasets, making them highly flexible and powerful.

The Importance of Data

  • Data Dependency: The effectiveness of deep learning models, including CNNs, heavily relies on the availability of large amounts of data. Filter values are learned from data, emphasizing the principle that "everything comes from data!"

Architecture of a Convolutional Neural Network

Basic Building Blocks

Banks of Convolutional Filters

CNNs typically start with a bank of convolutional filters.

  • Filter Banks: A collection of multiple filters, each designed to detect different features in the input.

  • 3D (or more) Tensor: A bank of filters can be considered as a 3D or higher-dimensional tensor, representing a stack of filters.

Non-Linear Activation Functions

Non-linear activation functions are applied after the convolution operation.

  • ReLU (Rectified Linear Unit): A common non-linear activation function used in CNNs to introduce non-linearity, enabling the network to learn complex patterns.

Figure 5 illustrates a simplified image recognition architecture using convolutional filters and activation functions.

Figure 5: Image recognition architecture with convolution filters (Figure 3.3 from transcription)

Tensor Dimensions in CNNs

3D and 4D Tensors for Images and Filters

CNNs process data represented as tensors, with specific dimensions for images and filters.

  • 3D Tensors for Images: Images are typically represented as 3D tensors (height, width, channels), where channels represent color components (e.g., RGB).

  • 4D Tensors for Batches of Images: When processing batches of images, the input becomes a 4D tensor (batch size, height, width, channels).

  • 4D Tensors for Filter Banks: Banks of filters are also represented as 4D tensors (height, width, input channels, output channels/number of filters).
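
The sketch below (with an arbitrary batch size of 32 and a hypothetical filter count of 16) simply spells out these shapes as NumPy arrays.

import numpy as np

batch = np.zeros((32, 28, 28, 3))    # 32 RGB images: (batch, height, width, channels)
filters = np.zeros((5, 5, 3, 16))    # sixteen 5x5 filters over 3 input channels
print(batch.shape, filters.shape)    # (32, 28, 28, 3) (5, 5, 3, 16)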

Feature Maps, Height, and Width

Filters are defined by their height, width, and the features they are designed to detect.

  • Filter Dimensions: Filters have a height and width that define their spatial extent.

  • Feature Detection: Each filter is designed to detect a specific feature, contributing to the creation of feature maps in the output.

Key Hyperparameters

Stride Length

Stride length is a crucial hyperparameter that controls the movement of the filter across the input.

  • Stride Definition: Stride is the distance the filter shifts at each step, both horizontally (\(S_h\)) and vertically (\(S_v\)).

  • Impact on Output Size: Stride affects the spatial dimensions of the output feature map. Larger strides lead to smaller output sizes.

Padding (Valid and Same)

Padding is used to manage the spatial dimensions of the output feature maps, particularly at the borders of the input image.

  • Valid Padding: No padding is added. Convolution is only performed where the filter fully overlaps with the input, typically reducing the output size.

  • Same Padding: Padding is added to the input such that the output feature map has the same spatial dimensions as the input (when stride is 1).
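
For an input of spatial size \(n\), a filter of size \(k\), and stride \(s\), the output size along each spatial dimension is: \[\text{Valid: } \left\lceil \frac{n - k + 1}{s} \right\rceil \qquad \text{Same: } \left\lceil \frac{n}{s} \right\rceil\] For example, with \(n = 28\), \(k = 4\), and \(s = 2\) (the settings used in the code example later in these notes), Valid padding gives \(\lceil 25/2 \rceil = 13\) while Same padding gives \(\lceil 28/2 \rceil = 14\).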

Figure 6 illustrates the effect of Valid and Same padding on the output size.

Figure 6: Types of padding

Convolution in TensorFlow

The tf.nn.conv2d Function

TensorFlow provides the tf.nn.conv2d function to perform convolution operations efficiently.

  • Single Instruction: In TensorFlow, the entire convolution process is encapsulated in the tf.nn.conv2d function.

  • Function Signature: tf.nn.conv2d(input, filters, strides, padding)

  • Arguments:

    • input: Batch of input images (4D tensor).

    • filters: Bank of filters (4D tensor).

    • strides: Stride length for each dimension.

    • padding: Padding type ('VALID' or 'SAME').

Input and Filter Tensor Shapes

Understanding the required shapes for input and filter tensors is crucial for using tf.nn.conv2d.

  • Input Shape: tf.nn.conv2d expects a 4D input tensor (batch, height, width, channels); a single image (height, width, channels) must therefore be reshaped to include a batch dimension of one.

  • Filter Shape: Filters are 4D tensors (height, width, input channels, output channels).

Illustrative Code Examples

A Simple Convolution Operation

A basic code example demonstrating convolution using tf.nn.conv2d is shown below.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

ii = [[[[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]]]] # 4x4 single-channel image, shape [1, 4, 4, 1]
I = tf.constant(ii, tf.float32)
ww = [[[[ -1.]], [[ -1.]], [[ 1.]]],
      [[[ -1.]], [[ -1.]], [[ 1.]]],
      [[[ -1.]], [[ -1.]], [[ 1.]]]] # 3x3 filter, shape [3, 3, 1, 1]
W = tf.constant(ww, tf.float32)
C = tf.nn.conv2d(I, W, strides=[1, 1, 1, 1], padding='VALID')

with tf.Session() as sess:
    print(sess.run(C)) # [[[[6.], [0.]], [[6.], [0.]]]] -- a 2x2 output
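
The output is worth unpacking: the image has a vertical transition from 0s to 2s between its second and third columns, and the filter's column of +1 values lines up with that transition when the window sits at the left edge, giving a response of 6; one step to the right, the positive and negative contributions cancel to 0. With VALID padding and a 3x3 filter on a 4x4 image, the output is 2x2, as expected.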

A Complete Convolutional Layer

A more complete example showing a convolutional layer within a neural network is provided.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Assuming 'img' is a placeholder for a batch of flattened 28x28 grayscale images
img = tf.placeholder(tf.float32, shape=[None, 28*28]) # Example placeholder

image = tf.reshape(img, [-1, 28, 28, 1]) # Turn img into a 4D tensor: [batch, 28, 28, 1]
flts = tf.Variable(tf.truncated_normal([4, 4, 1, 4], stddev=0.1)) # Four 4x4 filters (learned parameters)
convOut = tf.nn.conv2d(image, flts, strides=[1, 2, 2, 1], padding="SAME") # Stride 2 halves height and width: [batch, 14, 14, 4]
convOut = tf.nn.relu(convOut) # Don't forget to add nonlinearity
convOut = tf.reshape(convOut, [-1, 784]) # Flatten: 14 * 14 * 4 = 784 features per image
# A subsequent fully connected softmax layer
W = tf.Variable(tf.truncated_normal([784, 10])) # Example weight matrix
b = tf.Variable(tf.constant(0.1, shape=[10])) # Example bias vector
prbs = tf.nn.softmax(tf.matmul(convOut, W) + b) # Class probabilities

Training CNNs

Loss Functions and Gradient Descent

Training CNNs involves defining a loss function and using gradient descent to optimize filter values.

  • Loss Function: Measures the error between the network’s predictions and the true labels (e.g., cross-entropy loss).

  • Gradient Descent: An optimization algorithm used to update the filter values (network parameters) to minimize the loss function.
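
Continuing the convolutional-layer example from earlier (reusing its prbs output; the label placeholder and learning rate below are illustrative assumptions), a minimal TF1-style training setup could be sketched as:

lbls = tf.placeholder(tf.float32, shape=[None, 10]) # one-hot labels (assumed)

# Cross-entropy loss between predicted probabilities and true labels
xEnt = tf.reduce_mean(-tf.reduce_sum(lbls * tf.log(prbs), axis=1))

# Gradient descent updates every tf.Variable -- including the filter values flts
train = tf.train.GradientDescentOptimizer(0.5).minimize(xEnt)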

Impact of Convolution on Image Size

Reduction of Spatial Dimensions

Convolution operations, especially with valid padding or strides greater than 1, often reduce the spatial dimensions of the feature maps.

  • Size Reduction: Convolving an image generally reduces its spatial size.

Increase in the Number of Feature Maps

While spatial dimensions may decrease, the number of feature maps (channels) typically increases as we go deeper into the network, allowing for richer feature representation.

  • Information Increase: Even with reduced spatial size, the network can capture more information through an increased number of feature maps. For example, a \(28 \times 28\) input image might be transformed into a \(14 \times 14 \times 4\) feature map (as in the stride-2, four-filter convolution from the earlier code example), reducing spatial dimensions but increasing feature depth.

Typical CNN Architectures

Alternating Convolutional and Subsampling Layers

A common architecture in CNNs involves alternating convolutional layers with subsampling (pooling) layers.

  • Convolution and Subsampling: Typical CNNs consist of stacks of convolutional layers followed by subsampling or pooling layers to progressively extract features and reduce spatial resolution.

Figure 7 illustrates a typical CNN architecture with alternating convolutional and subsampling layers.

Figure 7: Convolutional network architecture for image processing (Figure 4.23 from transcription)

Functions of Convolutional Layers

Feature Extraction

Local Receptive Fields

Convolutional layers excel at feature extraction due to their use of local receptive fields.

  • Local Features: Each neuron in a convolutional layer connects to a small local region in the previous layer, forcing it to extract local features.

  • Position Invariance: Once a feature is detected, its exact location becomes less critical, as long as its relative position to other features is maintained.

Feature Mapping

Weight Sharing and Shift Invariance

Feature mapping is achieved through weight sharing, which leads to shift invariance.

  • Multiple Feature Maps: Each convolutional layer comprises multiple feature maps, where neurons within each map share the same set of weights.

  • Shift Invariance: Weight sharing enforces shift invariance, meaning the network can detect a feature regardless of its location in the input.

  • Parameter Reduction: Weight sharing significantly reduces the number of free parameters in the network.
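
The savings are dramatic: fully connecting a \(28 \times 28\) input to a \(28 \times 28\) feature map would require \(784 \times 784 = 614{,}656\) weights, whereas a single shared \(5 \times 5\) filter producing the same map needs just 25 weights (plus one bias).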

Subsampling (Pooling)

Reducing Sensitivity to Small Distortions

Subsampling or pooling layers are used to reduce the spatial resolution of feature maps and decrease sensitivity to small distortions.

  • Local Averaging and Subsampling: Pooling layers perform local averaging or max pooling, reducing the resolution of feature maps.

  • Distortion Robustness: This operation makes the network less sensitive to minor shifts and distortions in the input.
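
A minimal pooling sketch in the same TF1-compat style as the earlier examples (the feature-map shape here is a hypothetical stand-in, e.g., the output of the stride-2 convolution above):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

fmap = tf.placeholder(tf.float32, shape=[None, 14, 14, 4]) # hypothetical feature maps
pooled = tf.nn.max_pool(fmap, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
print(pooled.shape) # (?, 7, 7, 4): resolution halved, channels unchanged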

Relationship to the Universal Approximation Theorem

While CNNs are powerful for feature extraction, it is important to consider their relationship to the Universal Approximation Theorem.

  • Structured Feature Learning: The Universal Approximation Theorem guarantees that even a single-hidden-layer network can approximate any continuous function, but it says nothing about how many parameters or how much data that approximation requires. CNNs trade some of that generality for structure: locality and weight sharing make feature learning for image data far more efficient in practice.

  • Spatial Relationships: CNNs are particularly effective in tasks where spatial relationships are important, leveraging convolution to capture these relationships efficiently.

Conclusion

In summary, this lecture has provided a comprehensive overview of Convolutional Neural Networks. We have explored the foundational concepts of filters and kernels, understood the convolution operation as a series of dot products, and discussed the motivations and biological inspirations behind CNNs. We examined the architecture of CNNs, including key hyperparameters like stride and padding, and delved into the functions of convolutional layers, such as feature extraction, feature mapping, and subsampling.

Key takeaways from this lecture include:

  • CNNs are highly effective for processing spatially structured data like images due to their ability to learn hierarchical features automatically.

  • Convolutional layers utilize filters to detect local patterns and features through weight sharing, achieving shift invariance and reducing the number of parameters.

  • Subsampling layers help to reduce spatial resolution and increase robustness to small distortions.

  • The power of CNNs comes from learning filter values from data using backpropagation, making them adaptable and powerful for various computer vision tasks.

For further study, consider exploring different types of pooling layers (max pooling, average pooling), advanced CNN architectures (ResNet, Inception, etc.), and applications of CNNs in various domains beyond image recognition.

Are there any questions?