Lecture Notes on Convolutional Neural Networks
Introduction
This lecture explores Convolutional Neural Networks (CNNs), a foundational architecture in deep learning with significant impact in fields such as computer vision and image recognition. We will examine the core concepts of CNNs, beginning with filters and kernels, and proceed to understand how convolution operations, which are essentially a series of dot products, form the basis of these networks. We will discuss the motivation behind CNNs, drawing inspiration from biological vision, and highlight their efficiency and effectiveness in processing visual data. Key topics will include the architecture of CNNs, the functions of convolutional layers such as feature extraction and mapping, and the crucial role of hyperparameters and data in training these powerful models. By the end of this lecture, you should have a solid understanding of the principles and practical aspects of Convolutional Neural Networks.
Convolutional Neural Networks
Core Concepts
Filters and Kernels
In Convolutional Neural Networks, the fundamental building blocks for feature extraction are filters or kernels. These terms are often used interchangeably.
Filters/Kernels: Typically small matrices or tensors containing weights that are learned during the training process. They act as feature detectors, sliding across the input data to identify specific patterns.
Small Matrix or Tensor: Filters are usually of a smaller dimension compared to the input image, allowing them to detect local features efficiently.
As illustrated in Figure 1, the concepts of filters and kernels are foundational to understanding convolution and dot products in CNNs.
Convolution as a Dot Product
At its core, the convolution operation in CNNs is a series of dot products.
- Dot Product Operation: Convolution involves sliding a filter across the input data (e.g., an image) and, at each location, computing the dot product between the filter and the corresponding patch of the input.
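To make this concrete, here is a minimal sketch (in plain NumPy, with a made-up \(3 \times 3\) patch and filter chosen only for illustration) of the single dot product computed at one filter location:

import numpy as np

# A made-up 3x3 image patch and 3x3 filter (values chosen for illustration).
patch = np.array([[0., 0., 2.],
                  [0., 0., 2.],
                  [0., 0., 2.]])
kernel = np.array([[-1., -1., 1.],
                   [-1., -1., 1.],
                   [-1., -1., 1.]])

# The filter response at this location is the dot product of the flattened
# patch and the flattened filter, i.e., the sum of element-wise products.
response = np.sum(patch * kernel)
print(response)  # 6.0

Sliding the filter across the input repeats this computation at every location, producing one response per position.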
Motivation for CNNs
Applications in Computer Vision
Convolutional Neural Networks are particularly well-suited for computer vision tasks.
Image Recognition and Processing: CNNs excel in tasks like image classification, object detection, and image segmentation due to their ability to automatically learn spatial hierarchies of features.
Feature Visibility: In vision, features such as edges, corners, and textures are spatially localized and visually discernible. CNNs are designed to effectively detect these types of features.
Inspiration from Biological Vision
The architecture of CNNs is inspired by biological visual processes.
- Visual Cortex: The mechanisms used by the visual cortex in the human brain to process visual information are similar to convolution operations. This biological inspiration underpins the design and effectiveness of CNNs for vision tasks.
Understanding Convolution
Convolutional Filters
Definition: Sliding Window Operation
A convolutional filter operates as a sliding window over the input image.
Sliding Window: The filter, also known as a kernel, moves across the image both horizontally and vertically.
Simplified 2D Case: For simplicity, we often consider the 2D case to understand the basic operation, but the concept extends to higher dimensions.
Example: Horizontal Edge Detection
Consider an example filter designed for horizontal edge detection, as shown in Figure 2.
\[\begin{bmatrix} 1.0 & 1.0 & 1.0 & 1.0 \\ 1.0 & 1.0 & 1.0 & 1.0 \\ -1.0 & -1.0 & -1.0 & -1.0 \\ -1.0 & -1.0 & -1.0 & -1.0 \end{bmatrix}\]
This filter, with positive values in the top rows and negative values in the bottom rows, is designed to respond strongly to horizontal edges in an image.
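For example, on a patch of uniform intensity the positive and negative contributions cancel and the response is zero, while on a patch that is bright in its top half and dark in its bottom half the response is strongly positive.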
The Convolution Process
Convolution with Image Patches
The convolution process involves applying the filter to patches of the image.
- Patch-wise Operation: The filter is convolved with a patch of the image, where the patch is of the same size as the filter.
Applying Filters to the Entire Image
The filter is systematically applied across the entire image to produce a feature map.
- Global Application: By sliding the filter across all possible locations on the image, we apply the filter to the entire image.
Formal Definition of the Convolution Operation
Formally, convolution is an operation that computes a generalized dot product between the filter and image patches.
Definition (Convolution Operation): Given an input image \(I\) and a kernel \(K\), the convolution operation, denoted \(V = I * K\), is defined for each spatial location \((x, y)\) as: \[V(x, y) = (I * K)(x, y) = \sum_{m} \sum_{n} I(x + m, y + n)K(m, n)\] where \(I(x+m, y+n)\) is the intensity of the input image at position \((x+m, y+n)\), and \(K(m, n)\) is the value of the kernel at position \((m, n)\). The indices \(m\) and \(n\) iterate over the dimensions of the kernel. (Strictly speaking, this unflipped-kernel form is cross-correlation; deep learning convention nonetheless calls it convolution, and since the filter values are learned the distinction is immaterial.)
As illustrated in Figure 3, the convolution operation takes an input image and a convolutional kernel (filter) to produce a convolved output.
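As a sanity check on the definition, here is a minimal NumPy sketch of the formula above (no padding, stride 1; the function name and structure are mine, not from the lecture):

import numpy as np

def conv2d_naive(I, K):
    """Directly computes V(x, y) = sum_m sum_n I(x+m, y+n) * K(m, n)."""
    H, W = I.shape
    kh, kw = K.shape
    # With no padding, the output shrinks by the kernel size minus one.
    V = np.zeros((H - kh + 1, W - kw + 1))
    for x in range(V.shape[0]):
        for y in range(V.shape[1]):
            # Dot product between the kernel and the image patch at (x, y)
            V[x, y] = np.sum(I[x:x + kh, y:y + kw] * K)
    return V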
The Dot Product Connection
Geometric Interpretation of the Dot Product
The dot product has a geometric interpretation related to the angle between vectors.
- Angle and Similarity: For two vectors \(\mathbf{a}\) and \(\mathbf{b}\), their dot product is given by \(\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos(\theta)\), where \(\theta\) is the angle between them. This indicates the alignment or similarity between the vectors.
Figure 4 illustrates the geometric interpretation of the dot product.
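A quick numerical illustration (NumPy, with arbitrary vectors of my choosing) of how the dot product captures alignment:

import numpy as np

a = np.array([1., 2., 3.])
b = np.array([2., 4., 6.])   # parallel to a
c = np.array([3., 0., -1.])  # orthogonal to a

cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_ac = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))
print(cos_ab)  # 1.0: perfectly aligned vectors
print(cos_ac)  # 0.0: orthogonal vectors, no similarity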
Dot Product as a Measure of Feature Similarity
In CNNs, the kernel function acts similarly to a vector, and the dot product measures feature similarity.
Kernel as Feature Detector: The flattened kernel behaves like a vector, and each dot product with an image patch measures how strongly that patch matches the filter’s pattern.
Application in Digitized Vision: This process is crucial for extracting meaningful features from images, which is essential for tasks like object recognition and classification.
Why Use Convolutional Networks?
Reducing Connections and Parameters
Computational Efficiency
Convolutional networks are designed to reduce the number of connections compared to fully connected networks, leading to computational efficiency.
- Minimize Unnecessary Information: By using local connections (filters), CNNs minimize the processing of unnecessary information during computation, especially in spatially structured data like images.
Mitigating Overfitting
Reducing the number of parameters in CNNs helps to mitigate overfitting, especially when dealing with limited data.
- Kernel Filters: CNNs use kernel filters which have fewer parameters than fully connected layers, thus reducing the risk of overfitting.
Properties of Convolutional Kernels
Form and Value of Filters
Convolutional kernels are characterized by their form and the values they contain.
Form: Refers to the size and shape of the filter (e.g., \(3 \times 3\), \(5 \times 5\)).
Values: Refers to the numerical entries within the filter, which determine what features the filter is designed to detect.
Learning Filter Values via Backpropagation
A key advantage of deep convolutional networks is that the filter values are learned from data through backpropagation, rather than being manually designed.
NN Parameters: Filter values are treated as neural network parameters and are optimized during the training process using gradient descent.
Data-Driven Feature Learning: This learning approach allows CNNs to automatically learn effective features from large datasets, making them highly flexible and powerful.
The Importance of Data
- Data Dependency: The effectiveness of deep learning models, including CNNs, heavily relies on the availability of large amounts of data. Filter values are learned from data, emphasizing that "everything comes from data!".
Architecture of a Convolutional Neural Network
Basic Building Blocks
Banks of Convolutional Filters
CNNs typically start with a bank of convolutional filters.
Filter Banks: A collection of multiple filters, each designed to detect different features in the input.
3D (or more) Tensor: A bank of filters can be considered as a 3D or higher-dimensional tensor, representing a stack of filters.
Non-Linear Activation Functions
Non-linear activation functions are applied after the convolution operation.
- ReLU (Rectified Linear Unit): A common non-linear activation function used in CNNs to introduce non-linearity, enabling the network to learn complex patterns.
Figure 5 illustrates a simplified image recognition architecture using convolutional filters and activation functions.
Tensor Dimensions in CNNs
3D and 4D Tensors for Images and Filters
CNNs process data represented as tensors, with specific dimensions for images and filters.
3D Tensors for Images: Images are typically represented as 3D tensors (height, width, channels), where channels represent color components (e.g., RGB).
4D Tensors for Batches of Images: When processing batches of images, the input becomes a 4D tensor (batch size, height, width, channels).
4D Tensors for Filter Banks: Banks of filters are also represented as 4D tensors (height, width, input channels, output channels/number of filters).
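As a concrete illustration of these shapes (a hypothetical batch of RGB images and a bank of sixteen \(3 \times 3\) filters; the variable names are mine):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A batch of 32 RGB images, each 28x28: (batch, height, width, channels)
images = tf.placeholder(tf.float32, shape=[32, 28, 28, 3])
# A bank of 16 filters, each 3x3, spanning 3 input channels:
# (height, width, input channels, output channels)
filters = tf.Variable(tf.truncated_normal([3, 3, 3, 16], stddev=0.1))

out = tf.nn.conv2d(images, filters, strides=[1, 1, 1, 1], padding='SAME')
print(out.shape)  # (32, 28, 28, 16): one feature map per filter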
Feature Maps, Height, and Width
Filters are defined by their height, width, and the features they are designed to detect.
Filter Dimensions: Filters have a height and width that define their spatial extent.
Feature Detection: Each filter is designed to detect a specific feature, contributing to the creation of feature maps in the output.
Key Hyperparameters
Stride Length
Stride length is a crucial hyperparameter that controls the movement of the filter across the input.
Stride Definition: Stride is the distance the filter shifts at each step, both horizontally (\(S_h\)) and vertically (\(S_v\)).
Impact on Output Size: Stride affects the spatial dimensions of the output feature map. Larger strides lead to smaller output sizes.
Padding (Valid and Same)
Padding is used to manage the spatial dimensions of the output feature maps, particularly at the borders of the input image.
Valid Padding: No padding is added. Convolution is only performed where the filter fully overlaps with the input, typically reducing the output size.
Same Padding: Padding is added to the input such that the output feature map has the same spatial dimensions as the input (when stride is 1).
Figure 6 illustrates the effect of Valid and Same padding on the output size.
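The standard size formulas (as used by TensorFlow-style padding) make these effects precise. For an input of height \(H\), filter height \(F_h\), and vertical stride \(S_h\) (widths are analogous): \[H_{\text{out}}^{\text{valid}} = \left\lfloor \frac{H - F_h}{S_h} \right\rfloor + 1, \qquad H_{\text{out}}^{\text{same}} = \left\lceil \frac{H}{S_h} \right\rceil\] For example, a \(28 \times 28\) input with a \(4 \times 4\) filter, stride 2, and same padding yields a \(14 \times 14\) output, as in the convolutional layer example later in these notes.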
Convolution in TensorFlow
The tf.nn.conv2d Function
TensorFlow provides the tf.nn.conv2d function to perform convolution operations efficiently.
Single Instruction: In TensorFlow, the entire convolution process is encapsulated in the tf.nn.conv2d function.
Function Signature: tf.nn.conv2d(input, filters, strides, padding)
Arguments:
input: Batch of input images (4D tensor).
filters: Bank of filters (4D tensor).
strides: Stride length for each dimension.
padding: Padding type ('VALID' or 'SAME').
Input and Filter Tensor Shapes
Understanding the required shapes for input and filter tensors is crucial for using tf.nn.conv2d.
Input Shape: Individual images are 3D tensors (height, width, channels); tf.nn.conv2d expects them batched as a 4D tensor (batch, height, width, channels).
Filter Shape: Filters are 4D tensors (height, width, input channels, output channels).
Illustrative Code Examples
A Simple Convolution Operation
A basic code example demonstrating convolution using tf.nn.conv2d is shown below.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

ii = [[[[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]],
       [[0.], [0.], [2.], [2.]]]]  # "4x4" input image, shape (1, 4, 4, 1)
I = tf.constant(ii, tf.float32)
ww = [[[[-1.]], [[-1.]], [[1.]]],
      [[[-1.]], [[-1.]], [[1.]]],
      [[[-1.]], [[-1.]], [[1.]]]]  # "3x3" filter, shape (3, 3, 1, 1)
W = tf.constant(ww, tf.float32)
C = tf.nn.conv2d(I, W, strides=[1, 1, 1, 1], padding='VALID')

with tf.Session() as sess:
    print(sess.run(C))  # Output: [[[[6.], [0.]], [[6.], [0.]]]]
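The output is easy to verify by hand: at the two left filter positions, the filter’s \(+1\) column lands on the 2s while its \(-1\) columns land on 0s, giving \(3 \times 2 = 6\) per position; at the two right positions, each row contributes \(0 - 2 + 2 = 0\).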
A Complete Convolutional Layer
A more complete example showing a convolutional layer within a neural network is provided.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

img = tf.placeholder(tf.float32, shape=[None, 28*28])  # Placeholder for batches of flattened images
image = tf.reshape(img, [-1, 28, 28, 1])  # Turn img into a 4D tensor
flts = tf.Variable(tf.truncated_normal([4, 4, 1, 4], stddev=0.1))  # Create parameters for the filters
convOut = tf.nn.conv2d(image, flts, strides=[1, 2, 2, 1], padding="SAME")  # Create graph to do convolution
convOut = tf.nn.relu(convOut)  # Don't forget to add nonlinearity
convOut = tf.reshape(convOut, [-1, 784])  # Flatten the 14x14x4 feature maps (14*14*4 = 784)

W = tf.Variable(tf.truncated_normal([784, 10]))  # Weight matrix for a subsequent layer
b = tf.Variable(tf.constant(0.1, shape=[10]))  # Bias vector for that layer
prbs = tf.nn.softmax(tf.matmul(convOut, W) + b)  # Softmax output layer
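Note the shape arithmetic here: the \(28 \times 28\) input convolved with stride 2 and same padding produces \(14 \times 14\) feature maps, and with 4 filters the flattened representation has \(14 \times 14 \times 4 = 784\) entries. It is a coincidence that this equals the original input size of \(28 \times 28 = 784\).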
Training CNNs
Loss Functions and Gradient Descent
Training CNNs involves defining a loss function and using gradient descent to optimize filter values.
Loss Function: Measures the error between the network’s predictions and the true labels (e.g., cross-entropy loss).
Gradient Descent: An optimization algorithm used to update the filter values (network parameters) to minimize the loss function.
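A minimal sketch of these two steps in the same TensorFlow v1 style (continuing the layer example above; the one-hot label placeholder ans and the learning rate are my assumptions):

ans = tf.placeholder(tf.float32, shape=[None, 10])  # One-hot true labels (hypothetical placeholder)
xEnt = tf.reduce_mean(-tf.reduce_sum(ans * tf.log(prbs), axis=1))  # Cross-entropy loss
train = tf.train.GradientDescentOptimizer(0.5).minimize(xEnt)  # One gradient-descent step

Because flts is a tf.Variable, minimize backpropagates into the filter values along with W and b: this is the data-driven feature learning described earlier.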
Impact of Convolution on Image Size
Reduction of Spatial Dimensions
Convolution operations, especially with valid padding or strides greater than 1, often reduce the spatial dimensions of the feature maps.
- Size Reduction: Convolving an image generally reduces its spatial size.
Increase in the Number of Feature Maps
While spatial dimensions may decrease, the number of feature maps (channels) typically increases as we go deeper into the network, allowing for richer feature representation.
- Information Increase: Even with reduced spatial size, the network can capture more information through an increased number of feature maps. For example, a \(28 \times 28\) input image might be transformed into a \(14 \times 14 \times 4\) feature map, reducing spatial dimensions but increasing feature depth.
Typical CNN Architectures
Alternating Convolutional and Subsampling Layers
A common architecture in CNNs involves alternating convolutional layers with subsampling (pooling) layers.
- Convolution and Subsampling: Typical CNNs consist of stacks of convolutional layers followed by subsampling or pooling layers to progressively extract features and reduce spatial resolution.
Figure 7 illustrates a typical CNN architecture with alternating convolutional and subsampling layers.
Functions of Convolutional Layers
Feature Extraction
Local Receptive Fields
Convolutional layers excel at feature extraction due to their use of local receptive fields.
Local Features: Each neuron in a convolutional layer connects to a small local region in the previous layer, forcing it to extract local features.
Position Invariance: Once a feature is detected, its exact location becomes less critical, as long as its relative position to other features is maintained.
Feature Mapping
Weight Sharing and Shift Invariance
Feature mapping is achieved through weight sharing, which leads to shift invariance.
Multiple Feature Maps: Each convolutional layer comprises multiple feature maps, where neurons within each map share the same set of weights.
Shift Invariance: Weight sharing enforces shift invariance, meaning the network can detect a feature regardless of its location in the input.
Parameter Reduction: Weight sharing significantly reduces the number of free parameters in the network.
Subsampling (Pooling)
Reducing Sensitivity to Small Distortions
Subsampling or pooling layers are used to reduce the spatial resolution of feature maps and decrease sensitivity to small distortions.
Local Averaging and Subsampling: Pooling layers perform local averaging or max pooling, reducing the resolution of feature maps.
Distortion Robustness: This operation makes the network less sensitive to minor shifts and distortions in the input.
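A sketch of \(2 \times 2\) max pooling in the same TensorFlow v1 style (assuming a 4D feature-map tensor such as convOut from the earlier example, before it was flattened):

pooled = tf.nn.max_pool(convOut, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
# Each output value is the maximum over a 2x2 window, halving the height and
# width and making the result insensitive to small shifts within each window.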
Relationship to the Universal Approximation Theorem
While CNNs are powerful for feature extraction, it is important to consider their relationship to the Universal Approximation Theorem.
Structured Feature Learning: CNNs provide a structured and efficient way to learn features, especially for image data, contrasting with the broader applicability of the Universal Approximation Theorem to simpler neural networks.
Spatial Relationships: CNNs are particularly effective in tasks where spatial relationships are important, leveraging convolution to capture these relationships efficiently.
Conclusion
In summary, this lecture has provided a comprehensive overview of Convolutional Neural Networks. We have explored the foundational concepts of filters and kernels, understood the convolution operation as a series of dot products, and discussed the motivations and biological inspirations behind CNNs. We examined the architecture of CNNs, including key hyperparameters like stride and padding, and delved into the functions of convolutional layers, such as feature extraction, feature mapping, and subsampling.
Key takeaways from this lecture include:
CNNs are highly effective for processing spatially structured data like images due to their ability to learn hierarchical features automatically.
Convolutional layers utilize filters to detect local patterns and features through weight sharing, achieving shift invariance and reducing the number of parameters.
Subsampling layers help to reduce spatial resolution and increase robustness to small distortions.
The power of CNNs comes from learning filter values from data using backpropagation, making them adaptable and powerful for various computer vision tasks.
For further study, consider exploring different types of pooling layers (max pooling, average pooling), advanced CNN architectures (ResNet, Inception, etc.), and applications of CNNs in various domains beyond image recognition.
Are there any questions?