Deep Learning Lecture Notes - Optimization Techniques
Introduction
This lecture opens with practical information about the course’s exam structure, followed by a review of fundamental neural network exercises. It then transitions into a detailed exploration of optimization techniques, with particular focus on Stochastic Gradient Descent (SGD), exponentially weighted averages, and bias correction. The overarching objective is to give students a robust understanding of the advanced optimization algorithms commonly employed in deep learning, such as the Adam optimizer, so they can implement and apply these techniques effectively in practice.
Exam Information
For University of Udine Students
Written Exam
Platform and Location: The written exam will be administered through the e-learning platform. All students must register for the course on this platform prior to the exam date. The exam will be held in Lab 1 and Lab 2, located within the Computer Science building.
Allowed Resources: During the exam, no external resources such as slides, notes, books, or any other materials are permitted. However, students are allowed to use a calculator.
Validity Period: A passing grade on the written exam remains valid for one academic year, specifically until September 2025.
Booking Procedure: Students are required to book their oral exam slot through the S3 platform.
Oral Exam
Admission Requirements: A minimum grade threshold must be achieved in the written exam to be eligible for the oral exam.
Exam Content: The oral exam will comprehensively assess the student’s understanding of the course lectures and any laboratory sessions conducted throughout the term.
Retaking the Oral Exam: Students who do not pass the oral exam are required to retake the written test.
Accessing Past Exams
Past exam papers are accessible on the Teams platform. However, it is important to note that the exam content may vary from year to year. New topics introduced in the current academic year may not be reflected in past exams.
For Other Students
Exam Modality
The examination format for students not enrolled at the University of Udine is still being finalized. The current plan is a quiz-based assessment, similar in content to the University of Udine exam but potentially administered in a different format.
Timeline for Details
Further details regarding the exam modality for this student group will be communicated in the upcoming weeks.
Review of Exercises
This section reviews several exercises designed to reinforce fundamental concepts related to neural networks. Each exercise is presented with a clear statement of the problem, the correct solution, and a detailed explanation.
Exercise 1: Neural Network Logic Gate
Problem: A given neural network takes two binary inputs and produces an output. The task is to identify the logical function it computes.
Solution: E (AND)
Explanation: The network’s activation behavior mirrors that of a logical AND gate: it activates only when both inputs are 1.
Note: For clarity, the problem description should explicitly state that the activation function is a sigmoid.
Exercise 2: Neural Network Output Calculation
Problem Statement
Given a neural network with three inputs, a hidden layer consisting of two units, and a single output unit, the task is to compute the network’s output for the input vector [2, 5, 3].
Network Parameters
Weights and Biases: \(W_1 = \begin{bmatrix} 4 & -3 & 0 \\ 2 & -6 & 0 \end{bmatrix}\), \(b_1 = \begin{bmatrix} -1 \\ 8 \end{bmatrix}\), \(W_2 = \begin{bmatrix} 2 & 1 \end{bmatrix}\), \(b_2 = 2\)
Activation Function: Sigmoid, defined as \(g(z) = \frac{1}{1 + e^{-z}}\)
Forward Propagation Steps
Calculate the weighted sum of inputs for the hidden layer (\(Z_1\)): \(Z_1 = W_1 \cdot \begin{bmatrix} 2 \\ 5 \\ 3 \end{bmatrix} + b_1 = \begin{bmatrix} 4(2) - 3(5) + 0(3) \\ 2(2) - 6(5) + 0(3) \end{bmatrix} + \begin{bmatrix} -1 \\ 8 \end{bmatrix} = \begin{bmatrix} -7 \\ -26 \end{bmatrix} + \begin{bmatrix} -1 \\ 8 \end{bmatrix} = \begin{bmatrix} -8 \\ -18 \end{bmatrix}\)
Apply the sigmoid activation function to \(Z_1\) to obtain the hidden layer activations (\(A_1\)): \(A_1 = g(Z_1) = \begin{bmatrix} g(-8) \\ g(-18) \end{bmatrix} \approx \begin{bmatrix} 0.000335 \\ 1.5 \times 10^{-8} \end{bmatrix}\)
Calculate the weighted sum of inputs for the output layer (\(Z_2\)): \(Z_2 = W_2 \cdot A_1 + b_2 = \begin{bmatrix} 2 & 1 \end{bmatrix} \cdot \begin{bmatrix} 0.000335 \\ 0.000000015 \end{bmatrix} + 2 \approx 0.00067 + 2 = 2.00067\)
Apply the sigmoid activation function to \(Z_2\) to obtain the final output (\(A_2\)): \(A_2 = g(Z_2) \approx g(2.00067) \approx 0.88086\)
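These steps can be reproduced with a few lines of NumPy; the following is a minimal sketch, with variable names chosen to mirror the notation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters from the exercise
W1 = np.array([[4.0, -3.0, 0.0],
               [2.0, -6.0, 0.0]])
b1 = np.array([-1.0, 8.0])
W2 = np.array([2.0, 1.0])
b2 = 2.0

x = np.array([2.0, 5.0, 3.0])
Z1 = W1 @ x + b1          # [-8, -18]
A1 = sigmoid(Z1)          # ~[3.35e-4, 1.5e-8]
Z2 = W2 @ A1 + b2         # ~2.00067
A2 = sigmoid(Z2)
print(A2)                 # ~0.88086
```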
Exercise 3: Neural Network with Softmax
Problem Statement
Given a neural network with three inputs, a hidden layer with two units, and two output units, the task is to compute the network’s output for the input vector [2, 3, 1].
Network Parameters
Weights and Biases: \(W_1 = \begin{bmatrix} 4 & -3 & 0 \\ 2 & -6 & 0 \end{bmatrix}\), \(b_1 = \begin{bmatrix} -1 \\ 8 \end{bmatrix}\), \(W_2 = \begin{bmatrix} 2 & 1 \\ 2 & -1 \end{bmatrix}\), \(b_2 = \begin{bmatrix} 2 \\ 2 \end{bmatrix}\)
Activation Function: Sigmoid for the hidden layer, Softmax for the output layer.
Forward Propagation Steps
Calculate the weighted sum of inputs for the hidden layer (\(Z_1\)): \(Z_1 = W_1 \cdot \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + b_1 = \begin{bmatrix} 4(2) - 3(3) + 0(1) \\ 2(2) - 6(3) + 0(1) \end{bmatrix} + \begin{bmatrix} -1 \\ 8 \end{bmatrix} = \begin{bmatrix} -1 \\ -14 \end{bmatrix} + \begin{bmatrix} -1 \\ 8 \end{bmatrix} = \begin{bmatrix} -2 \\ -6 \end{bmatrix}\)
Apply the sigmoid activation function to \(Z_1\) to obtain the hidden layer activations (\(A_1\)): \(A_1 = g(Z_1) = \begin{bmatrix} g(-2) \\ g(-6) \end{bmatrix} \approx \begin{bmatrix} 0.1192 \\ 0.00247 \end{bmatrix}\)
Calculate the weighted sum of inputs for the output layer (\(Z_2\)): \(Z_2 = W_2 \cdot A_1 + b_2 = \begin{bmatrix} 2 & 1 \\ 2 & -1 \end{bmatrix} \cdot \begin{bmatrix} 0.1192 \\ 0.00247 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} \approx \begin{bmatrix} 0.24087 \\ 0.23593 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2.24087 \\ 2.23593 \end{bmatrix}\)
Apply the softmax activation function to \(Z_2\) to obtain the final output (\(A_2\)): \(A_2 = \text{softmax}(Z_2) = \begin{bmatrix} \frac{e^{2.24087}}{e^{2.24087} + e^{2.23593}} \\ \frac{e^{2.23593}}{e^{2.24087} + e^{2.23593}} \end{bmatrix} \approx \begin{bmatrix} 0.5012 \\ 0.4988 \end{bmatrix}\)
Note: The solution provided in the original transcript (0.731) appears to be incorrect based on these calculations.
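The same kind of numerical check applies here; this sketch adds a softmax that subtracts the maximum before exponentiating, a standard safeguard against overflow (not part of the exercise itself):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

W1 = np.array([[4.0, -3.0, 0.0],
               [2.0, -6.0, 0.0]])
b1 = np.array([-1.0, 8.0])
W2 = np.array([[2.0, 1.0],
               [2.0, -1.0]])
b2 = np.array([2.0, 2.0])

x = np.array([2.0, 3.0, 1.0])
A1 = sigmoid(W1 @ x + b1)    # ~[0.1192, 0.00247]
A2 = softmax(W2 @ A1 + b2)   # ~[0.5012, 0.4988]
print(A2)
```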
Exercise 4: Data Size vs. Performance
Question: In the diagram discussed during the lecture, what do the horizontal and vertical axes represent?
Solution: A. The horizontal axis represents the amount of data, and the vertical axis represents performance.
Explanation: The diagram illustrates the relationship between the amount of training data and the performance of learning algorithms. It demonstrates that as the amount of data increases, the performance of traditional learning algorithms tends to plateau, whereas the performance of neural networks continues to improve, especially with larger model sizes.
Exercise 5: ReLU Activation Function Plot
Question: Which of the following plots accurately represents the ReLU activation function?
Solution: The plot that depicts \(f(z) = 0\) for \(z < 0\) and \(f(z) = z\) for \(z \geq 0\).
Explanation: The Rectified Linear Unit (ReLU) activation function is mathematically defined as \(f(z) = \max(0, z)\). This means it outputs 0 for any negative input and the input value itself for any non-negative input.
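As a minimal sketch, ReLU is a one-liner in NumPy:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)   # 0 for z < 0, z itself for z >= 0

print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]
```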
Exercise 6: Binary Cross-Entropy Loss Function
Question: Which of the following mathematical expressions correctly represents the binary cross-entropy loss function?
Solution: A. \(L(y, \hat{y}) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]\)
Explanation: This formula is the standard representation of the binary cross-entropy loss. Here, \(y\) denotes the true label (either 0 or 1), and \(\hat{y}\) represents the predicted probability of the positive class.
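A minimal sketch of this loss; the clipping of \(\hat{y}\) away from 0 and 1 is a practical safeguard against \(\log(0)\), not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # keep log() away from 0
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # ~0.105: confident, correct prediction
print(binary_cross_entropy(1, 0.1))   # ~2.303: confident, wrong prediction
```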
Exercise 7: Output of a Computational Graph
Question: What is the output of the computational graph presented?
Solution: B. (Computational graph not provided in the transcript)
Explanation: A detailed explanation cannot be provided without the visual representation of the computational graph.
Exercise 8: Activation for Binary Classification
Question: For a binary visual classifier designed to distinguish between apples and tomatoes, which activation function is most suitable for the output layer?
Solution: Sigmoid
Explanation: The sigmoid function is well-suited for binary classification tasks because it produces an output value between 0 and 1. This output can be interpreted as the probability of the input belonging to one of the two classes.
Exercise 9: Impact of Zero Initialization
Question: If you initialize all the weights and biases of a neural network to zero, which of the following statements is true?
Solution: A. Every neuron in the first hidden layer will perform the identical computation.
Explanation: Initializing all weights and biases to zero leads to a symmetry problem. All neurons within a given layer will compute the same function, effectively making them redundant. This symmetry prevents the network from learning effectively during training.
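A small sketch illustrates the symmetry (the layer sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.zeros((4, 3))   # 4 hidden units, 3 inputs, all weights zero
b1 = np.zeros(4)
x = np.array([1.0, 2.0, 3.0])

print(sigmoid(W1 @ x + b1))   # [0.5 0.5 0.5 0.5]: identical activations
# The gradients for each row of W1 are also identical, so the units stay
# interchangeable after every update and the layer never differentiates.
```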
Exercise 10: Matrix Dimensionality in a Network
Question: What is the dimensionality of the matrix in the specified neural network?
Solution: A, B, and F (Network structure not fully specified in the transcript)
Explanation: Without a complete description of the network’s architecture, providing a definitive explanation regarding the matrix dimensionality is not possible.
Introduction to Optimization
This section introduces the crucial concept of optimization within the realm of deep learning. While gradient descent serves as a foundational optimization algorithm, the field has advanced significantly, leading to the development of more sophisticated techniques that are commonly employed in contemporary deep learning practice. A prominent example of such an advanced technique is the Adam optimizer, which leverages and extends concepts like Stochastic Gradient Descent (SGD) and exponentially weighted averages.
Overview of Gradient Descent
Gradient descent is an iterative optimization algorithm that aims to find the minimum of a function. In the specific context of neural networks, gradient descent is employed to iteratively update the network’s weights and biases, with the objective of minimizing the cost function. The update rule for gradient descent can be expressed mathematically as follows:
\[\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)\]
where:
* \(\theta\) represents the parameters of the model (weights and biases).
* \(\alpha\) denotes the learning rate, a hyperparameter that controls the step size of the updates.
* \(\nabla J(\theta_t)\) represents the gradient of the cost function \(J\) with respect to the parameters \(\theta\) at iteration \(t\).
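To make the update rule concrete, here is a minimal sketch that applies it to the one-dimensional quadratic \(J(\theta) = \theta^2\), an illustrative cost whose gradient is \(2\theta\):

```python
# Gradient descent on J(theta) = theta^2, whose gradient is 2*theta
theta = 5.0    # arbitrary starting point
alpha = 0.1    # learning rate

for t in range(50):
    grad = 2.0 * theta            # gradient of the cost at the current theta
    theta = theta - alpha * grad  # the update rule above

print(theta)   # close to 0, the minimum of J
```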
Introduction to Advanced Optimization
Modern deep learning practice often relies on optimization algorithms that significantly improve upon the standard gradient descent method. These advanced algorithms, such as Adam, frequently incorporate techniques like momentum and adaptive learning rates. The incorporation of these techniques allows for more effective navigation of the complex, non-convex loss landscapes that are characteristic of neural networks.
The Gabor Model
Non-Linear Functionality
The Gabor model is presented here as a relatively simple example of a non-linear function. Its purpose is to illustrate the challenges that arise when performing optimization in non-convex landscapes. The Gabor function is mathematically defined as:
\[g(x; \alpha, \beta) = e^{-\alpha^2 x^2} \cos(\beta x)\]
where:
* \(x\) is the input variable.
* \(\alpha\) and \(\beta\) are parameters that govern the shape of the function: \(\alpha\) controls the width of the Gaussian envelope, and \(\beta\) controls the frequency of the cosine wave.
Adapting Parameters to Data
By judiciously adjusting the parameters \(\alpha\) and \(\beta\), the Gabor function can be adapted to closely fit a given set of data points. This adaptation process involves finding the specific values of \(\alpha\) and \(\beta\) that minimize the discrepancy between the function’s output and the target data points. This is typically achieved by minimizing a suitable cost function, such as the mean squared error.
Visualizing the Cost Function
A key advantage of the Gabor model, for illustrative purposes, is that it has only two parameters (\(\alpha\) and \(\beta\)). This allows for a straightforward visualization of the cost function in a 3D plot. When visualized, the cost function reveals a non-convex landscape. This landscape is characterized by the presence of multiple local minima, flat regions (plateaus), and valleys. These features highlight the inherent challenges associated with optimization in such landscapes, as algorithms can easily get trapped in suboptimal solutions.
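To see this concretely, the cost can be evaluated over a grid of \((\alpha, \beta)\) values; the sketch below generates synthetic data points purely for illustration:

```python
import numpy as np

def gabor(x, alpha, beta):
    return np.exp(-alpha**2 * x**2) * np.cos(beta * x)

# Synthetic target data: a Gabor curve with known parameters plus noise
rng = np.random.default_rng(0)
x_data = np.linspace(-2.0, 2.0, 40)
y_data = gabor(x_data, 0.8, 3.0) + 0.05 * rng.standard_normal(40)

# Mean squared error evaluated over a grid of (alpha, beta) candidates
alphas = np.linspace(0.1, 2.0, 100)
betas = np.linspace(0.5, 6.0, 100)
loss = np.array([[np.mean((gabor(x_data, a, b) - y_data) ** 2)
                  for b in betas] for a in alphas])

# A surface or contour plot of `loss` reveals the non-convex landscape
# described above: multiple local minima, plateaus, and valleys.
```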
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a highly effective optimization algorithm that builds upon the principles of standard gradient descent. It introduces a key modification that significantly enhances its performance, particularly in the context of large datasets and complex models.
Mini-Batch Approach Explained
Unlike standard gradient descent, which computes the gradient using the entire training dataset at each iteration, SGD updates the model parameters based on a subset of the training data known as a mini-batch. This mini-batch is typically much smaller than the full dataset, leading to significant computational savings.
Introducing Noise via Mini-Batches
The use of mini-batches introduces an element of stochasticity into the optimization process. Since the gradient is estimated from a small sample of the data, it will naturally deviate from the true gradient computed over the entire dataset. This deviation can be interpreted as adding "noise" to the optimization trajectory. While seemingly counterintuitive, this noise can be beneficial, enabling the algorithm to explore the loss landscape more broadly and potentially escape from local minima.
Selecting Mini-Batches from Data
Mini-batches are typically formed by randomly sampling the training data without replacement. This ensures that each data point is included in exactly one mini-batch within a single pass through the entire dataset (an epoch).
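A minimal sketch of this sampling scheme, assuming the dataset is held in NumPy arrays:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches that together cover the dataset exactly once."""
    idx = rng.permutation(len(X))   # shuffled indices: sampling without replacement
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # toy dataset: 10 examples
y = np.arange(10)

# One full pass over this loop is one epoch
for X_batch, y_batch in minibatches(X, y, batch_size=4, rng=rng):
    pass  # one parameter update per mini-batch would happen here
```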
Advantages of Using SGD
Escaping Local Minima Effectively
The inherent noise introduced by the mini-batch approach provides a mechanism for SGD to escape from local minima. The stochastic fluctuations in the gradient can help the optimization process "jump out" of shallow local optima and potentially converge to a more globally optimal solution.
Computational Efficiency Gains
SGD offers significant computational advantages over standard gradient descent, especially when dealing with large datasets. Since each parameter update is based on a smaller mini-batch, the computational cost per update is substantially reduced. This allows for faster iteration and can make training large models feasible.
Definition of an Epoch
In the context of optimization, an epoch is defined as one complete pass through the entire training dataset. When using mini-batch SGD, an epoch is completed once all the mini-batches have been used to update the model parameters.
Learning Rate Scheduling Techniques
Learning rate scheduling is a crucial technique for optimizing the performance of SGD. It involves dynamically adjusting the learning rate during the training process. A common and effective strategy is to start with a relatively high learning rate and gradually decrease it over time, often by a constant factor every few epochs. This allows for large initial steps, facilitating faster convergence in the early stages of training, while smaller steps later on help fine-tune the solution and avoid overshooting the optimum.
A common formula for implementing learning rate decay is:
\[\alpha_t = \text{decay\_rate}^{\,t} \cdot \alpha_0\]
where:
* \(\alpha_t\) represents the learning rate at epoch \(t\).
* \(\alpha_0\) is the initial learning rate.
* \(\text{decay\_rate}\) is a hyperparameter that controls the speed of the decay (typically a value between 0 and 1).
This formula demonstrates an exponential decay of the learning rate as the number of epochs increases. Other decay schedules exist, but this exponential form is widely used.
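A minimal sketch of this schedule:

```python
def decayed_lr(alpha0, decay_rate, epoch):
    """Exponential decay: alpha_t = decay_rate**t * alpha0."""
    return (decay_rate ** epoch) * alpha0

for t in range(5):
    print(t, decayed_lr(0.1, 0.9, t))   # 0.1, 0.09, 0.081, 0.0729, 0.06561
```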
Exponentially Weighted Averages
Exponentially weighted averages provide a powerful method for smoothing out noisy sequential data and extracting underlying trends. They are particularly useful in time-series analysis and serve as a fundamental component in several advanced optimization algorithms.
Illustrative Example: Temperature Data
To illustrate the concept, consider a scenario involving daily temperature measurements. Due to various factors, daily temperature readings often exhibit fluctuations and noise. The goal is to use exponentially weighted averages to smooth out these fluctuations and obtain a more representative trend of the underlying temperature pattern.
Formula and Core Concept
The core of exponentially weighted averages lies in the following recursive formula:
\[V_t = \beta V_{t-1} + (1 - \beta) \theta_t\]
where:
* \(V_t\) represents the exponentially weighted average at time step \(t\).
* \(\theta_t\) denotes the actual observed value (e.g., temperature) at time step \(t\).
* \(\beta\) is a hyperparameter between 0 and 1 that governs the weight assigned to past values in the calculation of the average.
The formula essentially computes the new average, \(V_t\), as a weighted combination of the previous average, \(V_{t-1}\), and the current observation, \(\theta_t\).
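A direct sketch of this recursion, applied to a synthetic temperature series invented for illustration:

```python
import numpy as np

# Synthetic "daily temperature" readings: a slow trend plus noise
rng = np.random.default_rng(0)
temps = 20.0 + 5.0 * np.sin(np.linspace(0.0, 3.0, 100)) + rng.standard_normal(100)

def ewa(values, beta):
    """V_t = beta * V_{t-1} + (1 - beta) * theta_t, with V_0 = 0."""
    v, out = 0.0, []
    for theta in values:
        v = beta * v + (1.0 - beta) * theta
        out.append(v)
    return np.array(out)

smooth = ewa(temps, beta=0.9)   # smoother curve, slower to adapt
fast = ewa(temps, beta=0.5)     # tracks recent readings, noisier
```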
Influence of the Beta Parameter
The \(\beta\) parameter plays a crucial role in determining the behavior of the exponentially weighted average. It effectively controls the "memory" or "window size" of the average:
* Larger \(\beta\) (closer to 1): assigns more weight to past values. This results in a smoother average that adapts more slowly to changes in the data; it effectively averages over a longer period.
* Smaller \(\beta\) (closer to 0): assigns more weight to recent values. This leads to a faster-adapting average that is more responsive to recent fluctuations but may also be noisier; it effectively averages over a shorter period.
Mathematical Explanation of Averaging
The recursive formula can be expanded to reveal that the exponentially weighted average \(V_t\) is, in fact, a weighted sum of all past values \(\theta_i\) up to time \(t\). The weights assigned to these past values decay exponentially with time:
\[V_t = (1-\beta) \sum_{i=0}^{t-1} \beta^i \theta_{t-i}\]
This expanded form clearly shows the exponential decay of weights. The weight of a past value \(\theta_{t-i}\) is proportional to \(\beta^i\). Since \(\beta\) is between 0 and 1, the weights decrease exponentially as we go further back in time.
It can be shown that \(V_t\) effectively averages over approximately \(\frac{1}{1-\beta}\) past values. This provides a useful rule of thumb for understanding the effective window size controlled by \(\beta\). For example, if \(\beta = 0.9\), the average effectively considers the past 10 values. If \(\beta = 0.98\), it considers approximately the past 50 values.
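A quick numerical check confirms that the expanded sum and the recursion (with \(V_0 = 0\)) agree; the example values here are arbitrary:

```python
import numpy as np

beta = 0.9
theta = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # arbitrary observations

# Recursive form
v = 0.0
for th in theta:
    v = beta * v + (1.0 - beta) * th

# Expanded form: (1 - beta) * sum over i of beta**i * theta_{t-i}
t = len(theta)
expanded = (1.0 - beta) * sum(beta**i * theta[t - 1 - i] for i in range(t))

print(v, expanded)   # the two values agree
```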
Bias Correction in Exponential Averages
While exponentially weighted averages are a powerful tool, they suffer from a bias issue, particularly during the initial stages of computation. Bias correction is a technique used to address this problem and improve the accuracy of the initial estimates.
Problem with Initial Estimates
The bias arises primarily when the initial value of the exponentially weighted average, \(V_0\), is set to 0 (a common initialization). Due to the recursive nature of the formula, the initial values of \(V_t\) tend to be significantly lower than the actual values they are trying to estimate. This is because the initial estimates are heavily influenced by the zero initialization and have not yet incorporated enough information from the actual data.
The Bias Correction Formula
To mitigate this initial bias, a correction term is introduced. The corrected exponentially weighted average, denoted as \(V_t^{\text{corrected}}\), is calculated as follows:
\[V_t^{\text{corrected}} = \frac{V_t}{1 - \beta^t}\]
where:
* \(V_t\) is the uncorrected exponentially weighted average at time \(t\). * \(\beta\) is the hyperparameter controlling the weighting of past values. * \(t\) is the current time step.
Effectiveness of Bias Correction
The bias correction term, \(\frac{1}{1 - \beta^t}\), is most effective during the early stages of the averaging process:
* When \(t\) is small: \(\beta^t\) is significantly greater than 0, making the denominator \(1 - \beta^t\) less than 1. This results in an amplified value for \(V_t^{\text{corrected}}\), effectively counteracting the downward bias caused by the zero initialization.
* As \(t\) increases: \(\beta^t\) approaches 0 (since \(0 < \beta < 1\)), causing the denominator \(1 - \beta^t\) to approach 1. Consequently, the correction has a diminishing effect, and \(V_t^{\text{corrected}}\) converges towards the uncorrected \(V_t\).
In essence, the bias correction provides a "boost" to the initial estimates, helping them align more quickly with the actual values. As more data is incorporated into the average, the need for this correction diminishes, and the corrected average converges to the uncorrected average. This ensures that the long-term behavior of the exponentially weighted average remains unaffected by the initial bias.
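In code, the correction is a one-line change to the EWA sketch shown earlier; on a constant series, for example, the corrected average is exact from the very first step:

```python
import numpy as np

def ewa_corrected(values, beta):
    """Bias-corrected EWA: divide V_t by (1 - beta**t), with t starting at 1."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta
        out.append(v / (1.0 - beta ** t))   # corrected estimate
    return np.array(out)

# The uncorrected average would start near zero and climb slowly;
# the corrected one tracks the data immediately.
print(ewa_corrected([10.0, 10.0, 10.0], beta=0.9))   # [10. 10. 10.]
```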
Conclusion
This lecture provided a comprehensive overview of key topics, beginning with essential information regarding the exam structure. It then transitioned into a review of fundamental neural network exercises, reinforcing core concepts. The primary focus, however, was a deep dive into optimization techniques, specifically Stochastic Gradient Descent (SGD), exponentially weighted averages, and the crucial role of bias correction.
These concepts are foundational for understanding and effectively implementing advanced optimization algorithms, such as Adam, which are prevalent in modern deep learning. The key takeaways from this lecture include:
* Advantages of SGD: SGD’s computational efficiency, stemming from its mini-batch approach, and its ability to escape local minima due to the introduced stochasticity, make it a powerful optimization tool.
* Role of Exponentially Weighted Averages: Exponentially weighted averages provide a robust method for smoothing noisy sequential data, revealing underlying trends. The \(\beta\) parameter offers fine-grained control over the smoothing process.
* Importance of Bias Correction: Bias correction addresses the issue of initial underestimation in exponentially weighted averages, ensuring more accurate estimates, especially during the early stages of computation.
Looking ahead, the next lecture will explore gradient descent with momentum, building upon the concepts covered today. We will further investigate how exponentially weighted averages are integrated into more sophisticated gradient descent techniques, paving the way for a deeper understanding of algorithms like Adam.
Students are strongly encouraged to review the concepts discussed in this lecture and prepare any questions they may have for the next session. This will facilitate a more interactive and engaging learning experience.