Neural Networks and Deep Learning
Introduction
This lecture covers the fundamentals of neural networks and deep learning, focusing on their architecture, behavior, and the motivations behind using deep architectures. We will explore the differences between classification and regression tasks, delve into the architecture of basic neural networks, and examine how they can approximate functions. We will also discuss the advantages of deep learning over shallow networks, supported by empirical observations and theoretical insights. Finally, we will introduce the concept of loss functions, which are crucial for training neural networks, and discuss specific loss functions used in classification and regression tasks.
Neural Network Basics
Classification vs. Regression
Classification: In classification tasks, the goal is to assign an input to one of a finite set of labels. For example, determining whether an image contains a cat or a dog.
Regression: In regression tasks, the goal is to predict a continuous value. For example, predicting the price of a house based on its features.
Basic Neural Network Architecture
A basic neural network consists of:
Input Layer: Receives the input features.
Hidden Layer(s): Composed of multiple units (neurons) that apply an activation function.
Output Layer: Produces the final prediction.
A Simple Neural Network for Regression
Let’s consider a simple neural network with one hidden layer containing three units, each using the ReLU activation function, and an output layer with a single unit.
Input Layer
The input layer receives the raw input data. Each node in the input layer represents a feature of the input data.
Hidden Layer
Each hidden unit computes a weighted sum of the input and applies the ReLU activation: \[h_d = \text{ReLU}(w_d \cdot x + b_d), \quad d = 1, 2, 3\] where \(w_d\) and \(b_d\) are the weight and bias of the \(d\)-th hidden unit and \(x\) is the input.
Output Layer
The output layer produces the final prediction. For regression, a linear activation function is often used: \[y = w_o \cdot h + b_o\] where \(h\) is the vector of hidden unit outputs, \(w_o\) is the output weight vector, and \(b_o\) is the output bias.
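To make this architecture concrete, here is a minimal NumPy sketch of the forward pass for such a network (one input, three ReLU hidden units, one linear output). The parameter values below are arbitrary and chosen only for illustration.

```python
import numpy as np

def relu(z):
    # ReLU activation: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)

def forward(x, w_h, b_h, w_o, b_o):
    # Forward pass: scalar input, D ReLU hidden units, one linear output.
    h = relu(w_h * x + b_h)      # hidden activations, shape (D,)
    return np.dot(w_o, h) + b_o  # scalar prediction y

# Arbitrary example parameters for a 3-unit hidden layer.
w_h = np.array([1.0, -2.0, 0.5])   # hidden weights (one per unit)
b_h = np.array([0.0, 1.0, -0.5])   # hidden biases
w_o = np.array([2.0, -1.0, 3.0])   # output weights
b_o = 0.1                          # output bias

print(forward(1.5, w_h, b_h, w_o, b_o))
```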
Function Approximation with Neural Networks
Interactive Visualizations
Interactive figures demonstrate how changing the parameters (weights and biases) of a neural network affects its output function. These visualizations help build intuition for how each parameter shapes the network's behavior.
Impact of Parameters on Output Function
When the parameters of the network change, the output function changes. For a network with one input, one output, and ReLU activations in the hidden layer, the output is a piecewise linear function, and the number of linear segments increases with the number of hidden units (at most \(D + 1\) segments for \(D\) units, as discussed below).
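This claim can be checked numerically. The sketch below evaluates a randomly initialized one-input ReLU network on a dense grid and counts how many regions of the input axis share the same set of active hidden units; the output is linear on each of these regions, and there can be no more than \(D + 1\) of them. The random initialization, grid range, and grid size are illustrative choices, not part of the lecture.

```python
import numpy as np

def count_linear_regions(D, x_min=-5.0, x_max=5.0, n_grid=100_000, seed=0):
    # Count the linear pieces of a random 1-input, D-hidden-unit ReLU network
    # by tracking where the set of active hidden units changes along the x-axis.
    rng = np.random.default_rng(seed)
    w_h, b_h = rng.normal(size=D), rng.normal(size=D)

    x = np.linspace(x_min, x_max, n_grid)
    pre_act = np.outer(x, w_h) + b_h              # (n_grid, D) pre-activations
    active = pre_act > 0                          # which units the ReLU lets through
    changes = np.any(active[1:] != active[:-1], axis=1)
    return int(changes.sum()) + 1                 # regions = pattern changes + 1

for D in (3, 5, 10):
    print(D, count_linear_regions(D))             # never more than D + 1
```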
Piecewise Linear and Sigmoid Approximations
Piecewise Linear: With ReLU activation, the network approximates functions with piecewise linear segments.
Sigmoid: With sigmoid activation, the network approximates functions with smooth curves, similar to a combination of sigmoid functions.
Fitting Data to a Function
In regression tasks, the network tries to find the best parameters to fit a function to the given data points. This involves adjusting the weights and biases to minimize the difference between the predicted and actual values.
Extending to Multiple Inputs
When the network has multiple inputs, the hidden units partition the feature space into regions. Each region corresponds to a different linear piece of the output function.
Feature Space Partitioning with Hyperplanes
With two inputs, each hidden unit's pre-activation is zero along a line in the input plane (a hyperplane in two dimensions), and the ReLU sets the unit's output to zero on one side of that line. The arrangement of these boundaries carves the feature space into distinct regions, and the network's output is a different plane over each region. Changing the parameters moves the boundaries, allowing the network to approximate complex functions.
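As an illustration of this partitioning, the sketch below samples a grid over a two-dimensional input space, records which hidden units are active at each point, and counts the distinct activation patterns that occur; each pattern corresponds to one linear region (up to the grid resolution). The random parameters, grid size, and range are arbitrary choices for illustration.

```python
import numpy as np

def count_2d_regions(D, grid=400, lim=5.0, seed=0):
    # Approximate the number of regions that a random 2-input, D-hidden-unit
    # ReLU layer carves out of the square [-lim, lim]^2, by counting the
    # distinct activation patterns observed on a dense grid.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(D, 2))                   # one 2-D weight vector per unit
    b = rng.normal(size=D)

    xs = np.linspace(-lim, lim, grid)
    X = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # grid points
    active = (X @ W.T + b) > 0                    # (grid*grid, D) activation patterns
    return len(np.unique(active, axis=0))         # one linear region per pattern

for D in (3, 5, 10):
    print(D, count_2d_regions(D))
```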
Complexity Analysis
For a network with one hidden layer of \(D\) units and a single input, the time complexity for a forward pass is \(O(D)\). The space complexity is also \(O(D)\) to store the weights and biases.
Deep Learning vs. Shallow Networks
Defining Deep Learning
Shallow Networks: Networks with only one hidden layer.
Deep Learning: Networks with multiple hidden layers.
Motivation for Deep Architectures
Increased Complexity of Feature Space Regions
Deep networks can partition the feature space into more regions than shallow networks with the same number of units, which allows them to model more complex functions. For example, consider a shallow network with 6 hidden units and a deep network with two hidden layers of 3 units each, obtained by composing two smaller networks. With a single input, the shallow network can divide the input space into at most 7 regions, while the composed deep network divides it into 9 regions, despite both using 6 hidden units.
Compositional Nature of Deep Networks
Deep networks compose multiple simple functions to create a complex function. Each layer builds upon the features learned by the previous layers, increasing the network’s expressive power. For instance, a deep network can be constructed by combining simpler networks.
Linear Regions and Parameter Count
Shallow Network: A shallow network with one input, one output, and \(D\) ReLU units in the hidden layer can create up to \(D + 1\) linear regions. It has \(3D + 1\) parameters: each hidden unit has one weight and one bias (\(2D\) in total), and the output unit has \(D\) weights and one bias.
Deep Network: A deep network with one input, one output, and \(K\) hidden layers of \(D\) units each can create up to \((D + 1)^K\) linear regions. It has \(3D + 1 + (K - 1) \cdot D \cdot (D + 1)\) parameters: \(2D\) in the first hidden layer, \(D \cdot (D + 1)\) in each of the remaining \(K - 1\) hidden layers, and \(D + 1\) in the output layer.
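The sketch below simply tabulates these two formulas for a few configurations, assuming the one-input, one-output, fully connected layout described above; the numbers are only as reliable as the formulas themselves.

```python
def shallow(D):
    # Max linear regions and parameter count for one hidden layer of D units.
    return D + 1, 3 * D + 1

def deep(D, K):
    # Max linear regions and parameter count for K hidden layers of D units each.
    regions = (D + 1) ** K
    params = 3 * D + 1 + (K - 1) * D * (D + 1)
    return regions, params

print("shallow, D=6    :", shallow(6))    # (7, 19)
print("deep, D=10, K=5 :", deep(10, 5))   # (161051, 471)
for K in range(1, 6):                     # regions grow exponentially with depth
    print("deep, D=5, K=%d" % K, deep(5, K))
```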
Effect of Network Depth on Region Count
Increasing the depth of the network dramatically increases the maximum number of linear regions it can create. This becomes clear when comparing networks with the same number of parameters but different depths: the deeper network can represent far more regions.
Impact of Input Dimensionality on Region Count
Increasing the number of inputs further increases the number of regions, and the effect is especially pronounced in deep networks; for example, the growth with depth is far steeper for a network with 10 inputs than for a network with a single input.
Empirical Observations
Image Classification Task Example
Experiments on image classification tasks, such as the MNIST dataset, show that deep networks often converge faster and achieve better performance compared to shallow networks.
Comparison of Convergence Rates
Deep networks tend to converge faster, meaning they require fewer training iterations to reach a good solution, because they can learn more complex features with fewer parameters. Comparing the convergence curves of architectures of different depths on the MNIST dataset illustrates this behavior.
Recommended Readings
Chapters 2, 3, and 4 of the suggested book.
Additional tutorials and videos available in the material folder and online.
Loss Functions
The Role of Loss Functions in Training
Loss functions are crucial for training neural networks. They measure the difference between the predicted and actual values, providing a signal for adjusting the network’s parameters. The goal of training a neural network is to minimize the loss function.
Just as tasting food during cooking helps adjust ingredients, the loss function helps adjust the network’s weights and biases to improve its predictions. The loss function provides a mathematical way to evaluate the performance of the network. If the "taste" (loss) is bad, we adjust the "ingredients" (weights and biases).
Loss Functions for Classification
Binary Classification Problems
In binary classification, the goal is to classify inputs into two categories (e.g., yes/no, true/false). The output of the network is typically a probability between 0 and 1, indicating the likelihood of the input belonging to the positive class.
Binary Cross-Entropy Loss Function
The binary cross-entropy loss is commonly used for binary classification problems. It measures the difference between the predicted probability and the true label.
Mathematical Definition and Interpretation
The binary cross-entropy loss for a single sample is defined as: \[L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]\] where \(y\) is the true label (0 or 1) and \(\hat{y}\) is the predicted probability. The total loss over a set of \(M\) samples is the average of the individual losses: \[\text{Loss} = \frac{1}{M} \sum_{i=1}^{M} L(y_i, \hat{y}_i)\]
Loss Calculation for Positive Samples
For positive samples (\(y = 1\)), the loss simplifies to: \[L(1, \hat{y}) = -\log(\hat{y})\]
When \(\hat{y}\) is close to 1 (i.e., the prediction is correct), the loss is close to 0.
When \(\hat{y}\) is close to 0 (i.e., the prediction is incorrect), the loss approaches infinity.
Loss Calculation for Negative Samples
For negative samples (\(y = 0\)), the loss simplifies to: \[L(0, \hat{y}) = -\log(1 - \hat{y})\]
When \(\hat{y}\) is close to 0 (i.e., the prediction is correct), the loss is close to 0.
When \(\hat{y}\) is close to 1 (i.e., the prediction is incorrect), the loss approaches infinity.
Relationship Between Prediction and Loss Value
If the prediction (\(\hat{y}\)) is close to the true label (\(y\)), the loss is close to 0.
If the prediction is far from the true label, the loss is large (approaching infinity).
| True Label (\(y\)) | Predicted Probability (\(\hat{y}\)) | Loss (\(L(y, \hat{y})\)) |
|---|---|---|
| 1 | 0.9 | 0.105 |
| 1 | 0.6 | 0.511 |
| 1 | 0.1 | 2.303 |
| 0 | 0.1 | 0.105 |
| 0 | 0.4 | 0.511 |
| 0 | 0.9 | 2.303 |
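The following sketch reproduces the loss values in the table above using the binary cross-entropy formula with the natural logarithm; it is only a check of the arithmetic, not a training routine.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Binary cross-entropy for one sample: y is the true label (0 or 1),
    # y_hat is the predicted probability of the positive class.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

samples = [(1, 0.9), (1, 0.6), (1, 0.1), (0, 0.1), (0, 0.4), (0, 0.9)]
for y, y_hat in samples:
    print(y, y_hat, round(binary_cross_entropy(y, y_hat), 3))

# Average loss over the M = 6 samples above.
print(np.mean([binary_cross_entropy(y, y_hat) for y, y_hat in samples]))
```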
Loss Functions for Regression
Mean Squared Error (MSE) Loss
The mean squared error (MSE) loss is commonly used for regression problems. It measures the average squared difference between the predicted and actual values.
Mathematical Definition
The MSE loss is defined as: \[L(y, \hat{y}) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2\] where \(y_i\) is the true value, \(\hat{y}_i\) is the predicted value for the \(i\)-th sample, and \(M\) is the number of samples.
Rationale for Using the Squared Error
The squared error ensures that positive and negative errors do not cancel each other out. Compared with the absolute value of the error, squaring also gives a loss that is differentiable everywhere, which simplifies the computation of the derivatives needed to optimize the network's parameters.
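To see why the derivatives are simple, note that differentiating the MSE loss with respect to a single prediction \(\hat{y}_i\) (a standard calculus step, not specific to this lecture) gives a term proportional to the error: \[\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{M} \sum_{j=1}^{M} (y_j - \hat{y}_j)^2 \right] = -\frac{2}{M} (y_i - \hat{y}_i)\] By contrast, the absolute error is not differentiable at \(y_i = \hat{y}_i\).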
| True Value (\(y\)) | Predicted Value (\(\hat{y}\)) | Squared Error (\((y - \hat{y})^2\)) |
|---|---|---|
| 5 | 4 | 1 |
| 3 | 3.5 | 0.25 |
| -2 | -1 | 1 |
Example Calculation
Suppose we have three samples with true values \(y = [5, 3, -2]\) and predicted values \(\hat{y} = [4, 3.5, -1]\). The MSE loss is: \[L(y, \hat{y}) = \frac{1}{3} [(5 - 4)^2 + (3 - 3.5)^2 + (-2 - (-1))^2] = \frac{1}{3} [1 + 0.25 + 1] = \frac{2.25}{3} = 0.75\]
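A minimal NumPy sketch that reproduces this calculation, along with the per-sample squared errors from the table above:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error over M samples.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [5, 3, -2]
y_pred = [4, 3.5, -1]
errors = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
print(errors)                 # [1.   0.25 1.  ]  per-sample squared errors
print(mse(y_true, y_pred))    # 0.75
```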
Conclusion
In this lecture, we covered the basics of neural networks, including their architecture and how they can approximate functions. We explored the advantages of deep learning over shallow networks, supported by empirical observations and theoretical insights. We also introduced the concept of loss functions, which are crucial for training neural networks, and discussed specific loss functions used in classification and regression tasks. The binary cross-entropy loss is used for binary classification problems, while the mean squared error loss is used for regression problems.
Key takeaways include the importance of deep architectures for modeling complex functions, the role of loss functions in training, and the specific mathematical formulations of common loss functions. For the next lecture, consider exploring different activation functions and their impact on network performance. Additionally, think about how the choice of loss function can affect the training process and the quality of the learned model. Further reading on optimization algorithms, such as gradient descent, will also be beneficial for the next lecture.