Neural Networks and Deep Learning
Introduction
This lecture covers the fundamentals of neural networks and deep learning, focusing on their architecture, behavior, and the motivations behind using deep architectures. We will explore the differences between classification and regression tasks, delve into the architecture of basic neural networks, and examine how they can approximate functions. We will also discuss the advantages of deep learning over shallow networks, supported by empirical observations and theoretical insights. Finally, we will introduce the concept of loss functions, which are crucial for training neural networks, and discuss specific loss functions used in classification and regression tasks.
Neural Network Basics
Classification vs. Regression
Classification: In classification tasks, the goal is to assign an input to one of a finite set of labels. For example, determining whether an image contains a cat or a dog.
Regression: In regression tasks, the goal is to predict a continuous value. For example, predicting the price of a house based on its features.
Basic Neural Network Architecture
A basic neural network consists of:
Input Layer: Receives the input features.
Hidden Layer(s): Composed of multiple units (neurons) that apply an activation function.
Output Layer: Produces the final prediction.
A Simple Neural Network for Regression
Let’s consider a simple neural network with one hidden layer containing three units, each using the ReLU activation function, and an output layer with a single unit.
Input Layer
The input layer receives the raw input data. Each node in the input layer represents a feature of the input data.
Hidden Layer
Each hidden unit computes a weighted sum of the input and applies the ReLU activation: \[h_d = \text{ReLU}(w_d \cdot x + b_d), \quad d = 1, 2, 3\] where \(w_d\) and \(b_d\) are the weight and bias of the \(d\)-th hidden unit and \(x\) is the input.
Output Layer
The output layer produces the final prediction. For regression, a linear activation function is often used: \[y = w_o \cdot h + b_o\] where \(h\) is the vector of hidden unit outputs, \(w_o\) is the output weight vector, and \(b_o\) is the output bias.
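To make this architecture concrete, here is a minimal NumPy sketch of the forward pass for such a network (one input, three ReLU hidden units, one linear output). The parameter values below are arbitrary and chosen only for illustration.

```python
import numpy as np

def relu(z):
    # ReLU activation: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)

def forward(x, w_h, b_h, w_o, b_o):
    # Forward pass: scalar input, D ReLU hidden units, one linear output.
    h = relu(w_h * x + b_h)      # hidden activations, shape (D,)
    return np.dot(w_o, h) + b_o  # scalar prediction y

# Arbitrary example parameters for a 3-unit hidden layer.
w_h = np.array([1.0, -2.0, 0.5])   # hidden weights (one per unit)
b_h = np.array([0.0, 1.0, -0.5])   # hidden biases
w_o = np.array([2.0, -1.0, 3.0])   # output weights
b_o = 0.1                          # output bias

print(forward(1.5, w_h, b_h, w_o, b_o))
```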
Function Approximation with Neural Networks
Interactive Visualizations
Interactive figures demonstrate how changing the parameters (weights and biases) of a neural network affects its output function. These visualizations help build intuition for how each parameter shapes the network's behavior.
Impact of Parameters on Output Function
When the parameters of the network change, the output function changes. For a network with one input, one output, and ReLU activations in the hidden layer, the output is a piecewise linear function, and the number of linear segments increases with the number of hidden units (at most \(D + 1\) segments for \(D\) units, as discussed below).
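This claim can be checked numerically. The sketch below evaluates a randomly initialized one-input ReLU network on a dense grid and counts how many regions of the input axis share the same set of active hidden units; the output is linear on each of these regions, and there can be no more than \(D + 1\) of them. The random initialization, grid range, and grid size are illustrative choices, not part of the lecture.

```python
import numpy as np

def count_linear_regions(D, x_min=-5.0, x_max=5.0, n_grid=100_000, seed=0):
    # Count the linear pieces of a random 1-input, D-hidden-unit ReLU network
    # by tracking where the set of active hidden units changes along the x-axis.
    rng = np.random.default_rng(seed)
    w_h, b_h = rng.normal(size=D), rng.normal(size=D)

    x = np.linspace(x_min, x_max, n_grid)
    pre_act = np.outer(x, w_h) + b_h              # (n_grid, D) pre-activations
    active = pre_act > 0                          # which units the ReLU lets through
    changes = np.any(active[1:] != active[:-1], axis=1)
    return int(changes.sum()) + 1                 # regions = pattern changes + 1

for D in (3, 5, 10):
    print(D, count_linear_regions(D))             # never more than D + 1
```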
Piecewise Linear and Sigmoid Approximations
Piecewise Linear: With ReLU activation, the network approximates functions with piecewise linear segments.
Sigmoid: With sigmoid activation, the network approximates functions with smooth curves, similar to a combination of sigmoid functions.
Fitting Data to a Function
In regression tasks, the network tries to find the best parameters to fit a function to the given data points. This involves adjusting the weights and biases to minimize the difference between the predicted and actual values.
Extending to Multiple Inputs
When the network has multiple inputs, the hidden units partition the feature space into regions. Each region corresponds to a different linear piece of the output function.
Feature Space Partitioning with Hyperplanes
With two inputs, each hidden unit's pre-activation is zero along a line in the input plane (a hyperplane in two dimensions), and the ReLU sets the unit's output to zero on one side of that line. The arrangement of these boundaries carves the feature space into distinct regions, and the network's output is a different plane over each region. Changing the parameters moves the boundaries, allowing the network to approximate complex functions.
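As an illustration of this partitioning, the sketch below samples a grid over a two-dimensional input space, records which hidden units are active at each point, and counts the distinct activation patterns that occur; each pattern corresponds to one linear region (up to the grid resolution). The random parameters, grid size, and range are arbitrary choices for illustration.

```python
import numpy as np

def count_2d_regions(D, grid=400, lim=5.0, seed=0):
    # Approximate the number of regions that a random 2-input, D-hidden-unit
    # ReLU layer carves out of the square [-lim, lim]^2, by counting the
    # distinct activation patterns observed on a dense grid.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(D, 2))                   # one 2-D weight vector per unit
    b = rng.normal(size=D)

    xs = np.linspace(-lim, lim, grid)
    X = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # grid points
    active = (X @ W.T + b) > 0                    # (grid*grid, D) activation patterns
    return len(np.unique(active, axis=0))         # one linear region per pattern

for D in (3, 5, 10):
    print(D, count_2d_regions(D))
```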
Complexity Analysis
For a network with one hidden layer of \(D\) units and a single input, the time complexity for a forward pass is \(O(D)\). The space complexity is also \(O(D)\) to store the weights and biases.
Deep Learning vs. Shallow Networks
Defining Deep Learning
Shallow Networks: Networks with only one hidden layer.
Deep Learning: Networks with multiple hidden layers.
Motivation for Deep Architectures
Increased Complexity of Feature Space Regions
Deep networks can partition the feature space into more regions than shallow networks with the same number of units, which allows them to model more complex functions. For example, consider a shallow network with 6 hidden units and a deep network with two hidden layers of 3 units each, obtained by composing two smaller networks. With a single input, the shallow network can divide the input space into at most 7 regions, while the composed deep network divides it into 9 regions, despite both using 6 hidden units.
Compositional Nature of Deep Networks
Deep networks compose multiple simple functions to create a complex function. Each layer builds upon the features learned by the previous layers, increasing the network’s expressive power. For instance, a deep network can be constructed by combining simpler networks.
Linear Regions and Parameter Count
Shallow Network: A shallow network with one input, one output, and \(D\) ReLU units in the hidden layer can create up to \(D + 1\) linear regions. It has \(3D + 1\) parameters: each hidden unit has one weight and one bias (\(2D\) in total), and the output unit has \(D\) weights and one bias.
Deep Network: A deep network with one input, one output, and \(K\) hidden layers of \(D\) units each can create up to \((D + 1)^K\) linear regions. It has \(3D + 1 + (K - 1) \cdot D \cdot (D + 1)\) parameters: \(2D\) in the first hidden layer, \(D \cdot (D + 1)\) in each of the remaining \(K - 1\) hidden layers, and \(D + 1\) in the output layer.
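The sketch below simply tabulates these two formulas for a few configurations, assuming the one-input, one-output, fully connected layout described above; the numbers are only as reliable as the formulas themselves.

```python
def shallow(D):
    # Max linear regions and parameter count for one hidden layer of D units.
    return D + 1, 3 * D + 1

def deep(D, K):
    # Max linear regions and parameter count for K hidden layers of D units each.
    regions = (D + 1) ** K
    params = 3 * D + 1 + (K - 1) * D * (D + 1)
    return regions, params

print("shallow, D=6    :", shallow(6))    # (7, 19)
print("deep, D=10, K=5 :", deep(10, 5))   # (161051, 471)
for K in range(1, 6):                     # regions grow exponentially with depth
    print("deep, D=5, K=%d" % K, deep(5, K))
```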
Effect of Network Depth on Region Count
Increasing the depth of the network dramatically increases the maximum number of linear regions it can create. This becomes clear when comparing networks with the same number of parameters but different depths: the deeper network can represent far more regions.
Impact of Input Dimensionality on Region Count
Increasing the number of inputs further increases the number of regions, and the effect is especially pronounced in deep networks; for example, the growth with depth is far steeper for a network with 10 inputs than for a network with a single input.
Empirical Observations
Image Classification Task Example
Experiments on image classification tasks, such as the MNIST dataset, show that deep networks often converge faster and achieve better performance compared to shallow networks.
Comparison of Convergence Rates
Deep networks tend to converge faster, meaning they require fewer training iterations to reach a good solution, because they can learn more complex features with fewer parameters. Comparing the convergence curves of architectures of different depths on the MNIST dataset illustrates this behavior.
Recommended Readings
Chapters 2, 3, and 4 of the suggested book.
Additional tutorials and videos available in the material folder and online.
Loss Functions
The Role of Loss Functions in Training
Loss functions are crucial for training neural networks. They measure the difference between the predicted and actual values, providing a signal for adjusting the network’s parameters. The goal of training a neural network is to minimize the loss function.
Just as tasting food during cooking helps adjust ingredients, the loss function helps adjust the network’s weights and biases to improve its predictions. The loss function provides a mathematical way to evaluate the performance of the network. If the "taste" (loss) is bad, we adjust the "ingredients" (weights and biases).
Loss Functions for Classification
Binary Classification Problems
In binary classification, the goal is to classify inputs into two categories (e.g., yes/no, true/false). The output of the network is typically a probability between 0 and 1, indicating the likelihood of the input belonging to the positive class.
Binary Cross-Entropy Loss Function
The binary cross-entropy loss is commonly used for binary classification problems. It measures the difference between the predicted probability and the true label.
Mathematical Definition and Interpretation
The binary cross-entropy loss for a single sample is defined as: \[L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]\] where \(y\) is the true label (0 or 1) and \(\hat{y}\) is the predicted probability. The total loss over a set of \(M\) samples is the average of the individual losses: \[\text{Loss} = \frac{1}{M} \sum_{i=1}^{M} L(y_i, \hat{y}_i)\]
Loss Calculation for Positive Samples
For positive samples (\(y = 1\)), the loss simplifies to: \[L(1, \hat{y}) = -\log(\hat{y})\]
When \(\hat{y}\) is close to 1 (i.e., the prediction is correct), the loss is close to 0.
When \(\hat{y}\) is close to 0 (i.e., the prediction is incorrect), the loss approaches infinity.
Loss Calculation for Negative Samples
For negative samples (\(y = 0\)), the loss simplifies to: \[L(0, \hat{y}) = -\log(1 - \hat{y})\]
When \(\hat{y}\) is close to 0 (i.e., the prediction is correct), the loss is close to 0.
When \(\hat{y}\) is close to 1 (i.e., the prediction is incorrect), the loss approaches infinity.
Relationship Between Prediction and Loss Value
If the prediction (\(\hat{y}\)) is close to the true label (\(y\)), the loss is close to 0.
If the prediction is far from the true label, the loss is large (approaching infinity).
| True Label (\(y\)) | Predicted Probability (\(\hat{y}\)) | Loss (\(L(y, \hat{y})\)) |
|---|---|---|
| 1 | 0.9 | 0.105 |
| 1 | 0.6 | 0.511 |
| 1 | 0.1 | 2.303 |
| 0 | 0.1 | 0.105 |
| 0 | 0.4 | 0.511 |
| 0 | 0.9 | 2.303 |
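The following sketch reproduces the loss values in the table above using the binary cross-entropy formula with the natural logarithm; it is only a check of the arithmetic, not a training routine.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Binary cross-entropy for one sample: y is the true label (0 or 1),
    # y_hat is the predicted probability of the positive class.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

samples = [(1, 0.9), (1, 0.6), (1, 0.1), (0, 0.1), (0, 0.4), (0, 0.9)]
for y, y_hat in samples:
    print(y, y_hat, round(binary_cross_entropy(y, y_hat), 3))

# Average loss over the M = 6 samples above.
print(np.mean([binary_cross_entropy(y, y_hat) for y, y_hat in samples]))
```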
Loss Functions for Regression
Mean Squared Error (MSE) Loss
The mean squared error (MSE) loss is commonly used for regression problems. It measures the average squared difference between the predicted and actual values.
Mathematical Definition
The MSE loss is defined as: \[L(y, \hat{y}) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2\] where \(y_i\) is the true value, \(\hat{y}_i\) is the predicted value for the \(i\)-th sample, and \(M\) is the number of samples.
Rationale for Using the Squared Error
The squared error ensures that positive and negative errors do not cancel each other out. Compared with the absolute value of the error, squaring also gives a loss that is differentiable everywhere, which simplifies the computation of the derivatives needed to optimize the network's parameters.
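To see why the derivatives are simple, note that differentiating the MSE loss with respect to a single prediction \(\hat{y}_i\) (a standard calculus step, not specific to this lecture) gives a term proportional to the error: \[\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{M} \sum_{j=1}^{M} (y_j - \hat{y}_j)^2 \right] = -\frac{2}{M} (y_i - \hat{y}_i)\] By contrast, the absolute error is not differentiable at \(y_i = \hat{y}_i\).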
| True Value (\(y\)) | Predicted Value (\(\hat{y}\)) | Squared Error (\((y - \hat{y})^2\)) |
|---|---|---|
| 5 | 4 | 1 |
| 3 | 3.5 | 0.25 |
| -2 | -1 | 1 |
Example Calculation
Suppose we have three samples with true values \(y = [5, 3, -2]\) and predicted values \(\hat{y} = [4, 3.5, -1]\). The MSE loss is: \[L(y, \hat{y}) = \frac{1}{3} [(5 - 4)^2 + (3 - 3.5)^2 + (-2 - (-1))^2] = \frac{1}{3} [1 + 0.25 + 1] = \frac{2.25}{3} = 0.75\]
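A minimal NumPy sketch that reproduces this calculation, along with the per-sample squared errors from the table above:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error over M samples.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [5, 3, -2]
y_pred = [4, 3.5, -1]
errors = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
print(errors)                 # [1.   0.25 1.  ]  per-sample squared errors
print(mse(y_true, y_pred))    # 0.75
```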
Conclusion
In this lecture, we covered the basics of neural networks, including their architecture and how they can approximate functions. We explored the advantages of deep learning over shallow networks, supported by empirical observations and theoretical insights. We also introduced the concept of loss functions, which are crucial for training neural networks, and discussed specific loss functions used in classification and regression tasks. The binary cross-entropy loss is used for binary classification problems, while the mean squared error loss is used for regression problems.
Key takeaways include the importance of deep architectures for modeling complex functions, the role of loss functions in training, and the specific mathematical formulations of common loss functions. For the next lecture, consider exploring different activation functions and their impact on network performance. Additionally, think about how the choice of loss function can affect the training process and the quality of the learned model. Further reading on optimization algorithms, such as gradient descent, will also be beneficial for the next lecture.