Lecture Notes on Perceptrons and Feedforward Neural Networks
Introduction
This lecture introduces the fundamental concepts of perceptrons and feedforward neural networks. We begin with the perceptron algorithm for binary classification and extend it to multi-class classification and multi-layer networks. We will explore the limitations of the perceptron algorithm and motivate the need for more sophisticated approaches, such as feedforward neural networks and gradient-based optimization. Key concepts include hyperparameters, bias, epochs, geometric interpretations of perceptrons, loss functions, gradient descent, and cross-entropy loss. The lecture aims to build a solid foundation for understanding deep learning architectures by focusing on the basic building blocks and their mathematical underpinnings.
The Perceptron Algorithm
The Perceptron Algorithm is a foundational algorithm in machine learning for binary classification problems. It attempts to find a linear decision boundary to separate two classes.
Convergence and Limitations
The Perceptron Algorithm is guaranteed to converge if the data is linearly separable. However, if the data is not linearly separable, the perceptron algorithm may not converge. It might oscillate or get stuck in a loop.
Key Terminology
Hyperparameters: These are "second order" parameters, meaning they are parameters that control the learning process itself, not the model’s weights directly.
The number of iterations ‘N’ in the perceptron algorithm is a hyperparameter.
Bias: The bias term ‘b’ can be seen as fixed or external knowledge. It’s like a weight that’s always "on," like the weight of a continuously firing neuron. By using bias, we can sometimes avoid long computations or infinite loops in certain cases.
Epoch: An epoch is a single run through the entire training dataset. We often train for multiple epochs.
- Why train for multiple epochs? Well, because one pass through the data might not be enough to learn all the patterns. We need to iterate and refine the weights over multiple epochs. This is especially important for more complex models like Feed Forward Neural Networks.
Perceptron as a Single Unit
The perceptron is a single computing unit.
Geometric Interpretation
We can visualize it geometrically as a hyperplane dividing the input space. The perceptron algorithm, in a nutshell, is about finding the right hyperplane.
Mathematically, for a given input vector \(\mathbf{x} = [x_1, x_2, \dots, x_n]^T\), weight vector \(\mathbf{w} = [w_1, w_2, \dots, w_n]^T\), and bias \(b\), the output of a perceptron \(f_{\{\mathbf{w}, b\}}(\mathbf{x})\) is given by:
\[f_{\{\mathbf{w}, b\}}(\mathbf{x}) = \begin{cases} 1 & \text{if } b + \sum_{i=1}^{n} x_i w_i > 0 \\ 0 & \text{otherwise} \end{cases} \label{eq:perceptron_output}\]
where \(\{\mathbf{w}, b\}\) represents the parameters of the perceptron.
The equation \(b + \sum_{i=1}^{n} x_i w_i = 0\) defines this hyperplane. The weight vector \(\mathbf{w}\) is normal to the hyperplane, and the bias \(b\) determines the hyperplane’s position relative to the origin.
Understanding the geometric interpretation of the perceptron involves concepts from linear algebra, such as vectors, dot products, and hyperplanes. The expression \(\sum_{i=1}^{n} x_i w_i\) is the dot product of the input vector \(\mathbf{x}\) and the weight vector \(\mathbf{w}\), denoted as \(\mathbf{w} \cdot \mathbf{x}\) or \(\mathbf{w}^T \mathbf{x}\). The condition \(b + \mathbf{w} \cdot \mathbf{x} > 0\) defines a half-space on one side of the hyperplane.
Scalar Product (Dot Product): The scalar product of two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is defined as:
\[\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos \theta \label{eq:dot_product}\]
where \(\|\mathbf{x}\|\) and \(\|\mathbf{y}\|\) are the magnitudes (or norms) of vectors \(\mathbf{x}\) and \(\mathbf{y}\) respectively, and \(\theta\) is the angle between them.
Cosine Theorem: In a triangle with sides of lengths \(a, b, c\) and angle \(\theta\) opposite to side \(c\), the cosine theorem states:
\[c^2 = a^2 + b^2 - 2ab \cos \theta \label{eq:cosine_theorem}\]
This theorem can be derived using vector algebra and the scalar product, linking geometric concepts with algebraic formulations. The norm of a vector \(\mathbf{v}\) can be defined as \(\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}}\).
Norm as Length: The norm of a vector can be interpreted as its length. The cosine theorem and scalar product are related through the norm, providing a geometric measure of the scalar product.
Multi-Class Classification with Perceptrons
Multiple Perceptrons for Multiple Classes
We can use multiple perceptrons, multiple computing units. And they can share the same input. For multi-class classification, we can use multiple perceptrons, one for each class. For example, for digit recognition with 10 digits, we can have 10 perceptrons.
Input-Output Mapping
These perceptrons can operate in parallel. Each perceptron will be responsible for detecting one digit. We can think of this as mapping a 784-dimensional input vector (the pixels) to a 10-dimensional output vector.
Interpreting Outputs
These output values can be interpreted as scores or confidences for each digit class.
From Scores to Probabilities
Are these outputs real numbers? Yes. Can we interpret them as probabilities? Not directly, but we can transform them into probabilities.
Training Process
Consider a scenario with 10 perceptrons for digit classification. Given an input image and its true label (e.g., digit ‘5’), we perform the following steps:
Feed the input image to all 10 perceptrons.
Identify the perceptron that produces the highest output value.
Compare the predicted class (based on the perceptron with the highest output) with the true label.
If the predicted class is incorrect:
Adjust the weights of the perceptron corresponding to the true class to increase its output for this input.
Adjust the weights of the perceptron that produced the highest (incorrect) output to decrease its output for this input.
Perceptrons corresponding to other classes, which were correctly deactivated (i.e., did not produce the highest output), do not need weight adjustments in this step.
If the predicted class is correct, no weight adjustments are needed for this example.
Independent Perceptron Updates
The Perceptron Algorithm runs independently for each perceptron. Each perceptron adjusts its own weights based on the error signal.
Geometric View of Training
Think geometrically: what is going on? We are using linear algebra and geometric intuition to understand how the weights are being modified and how the decision boundaries are being adjusted. Geometrically, what are we doing? We are still working with hyperplanes, but now we have multiple hyperplanes, and their combination can create more complex decision regions, potentially related to convex hulls or intersections of half-spaces.
Feedforward Neural Networks
Stopping Criterion
When do we stop training? When we have encoded our ability to learn into the weights. When the weights have converged, and the model is performing well on the training data and ideally also on a validation set. These weights represent the learned knowledge, the "memory" of the model.
The Feedforward Concept
We are feeding forward information through Layers of computing units. We can have MANY computing units per layer. This is the "feedforward" part. Information flows in one direction, from input to output.
Multi-Layer Networks
Visual representation of a multi-layer network.
Multi-level perceptrons or multi-layer perceptrons.
Basic Architecture
This is The simplest architecture! A feedforward neural network with multiple layers.
Limitations of Perceptron Algorithm
Can we still use the Perceptron Algorithm when there are (many) LAYERS? Not directly. The Perceptron Algorithm is designed for single-layer perceptrons. For multi-layer networks, we need a more sophisticated learning algorithm, like backpropagation.
Loss Functions and Optimization
Introduction to Loss
Let’s lift up our point of view and think about LOSS. We’re going to introduce the concept of Loss Functions for Neural Nets, specifically Cross-entropy Loss. This is a very interesting notion, not just for neural networks, but for data science in general. We need a Loss function to measure how well our model is performing. The loss function provides a measure of the error.
The 0/1 Loss
A simple loss function could be the 0/1 loss, which is trivial in some sense. The 0/1 loss simply counts the number of misclassifications.
The 0/1 loss function for a single example is defined as: \[L_{0/1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}\] where \(y\) is the true label and \(\hat{y}\) is the predicted label. The total 0/1 loss over a dataset is the number of misclassifications.
Loss Functions over Real Numbers
But for training neural networks, we often prefer to use loss functions that are defined over the real numbers, R.
Gradient-Based Optimization
Instead of the Perceptron Algorithm rule, we’ll use a more general rule based on loss functions.
General Weight Update Rule
\[\Delta \Phi_i = - \alpha \frac{\partial L}{\partial \Phi_i}\]
This is a general form of the weight update rule, using the gradient of the loss function. L is the loss function, \(\Phi_i\) represents the parameters (weights and biases), and \(\alpha\) is the learning rate.
Our unknowns \(\Phi\) are our unknowns, the parameters we want to learn.
"Learning rate" \(\alpha\) is the learning rate, a hyperparameter that controls the step size during optimization.
Notice that the 0/1 loss is not differentiable, which makes it difficult to use with gradient-based optimization.
Gradient Descent
We can use gradient descent to minimize the loss function and find optimal parameter values. Gradient descent is a procedure or technique for finding the minimum of a function. It generalizes the idea of the Perceptron Algorithm to more complex, multi-dimensional models and loss functions.
Definition of the Gradient
The gradient \(\nabla f\) of a scalar-valued function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is a vector whose components are the partial derivatives.
\[\nabla f(\mathbf{p}) = \left[ \frac{\partial f}{\partial x_1} ; \dots ; \frac{\partial f}{\partial x_n} \right] \text{ vector } (\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n)\]
Loss as a Function of Parameters
The loss function is a function of the parameters \(L(\Phi)\). The loss function must be smooth or differentiable for gradient descent to work effectively.
Revisiting the Update Rule
Let’s revisit the weight update rule:
\[\Delta \Phi_i = - \alpha \frac{\partial L}{\partial \Phi_i}\]
This represents the change in the values of the parameters. The gradient \(\nabla L\) points in the direction of the steepest increase of the loss function. We want to decrease the loss, so we move in the opposite direction of the gradient, hence the negative sign.
We are working in many dimensions! The parameter space can be very high-dimensional, especially for deep neural networks with many weights and biases. The number of dimensions is related to the number of features and the complexity of the model.
In multi-layer networks, parameters from any layer must be taken into account when calculating gradients and updating weights.
Cross-Entropy Loss
For classification problems, a common choice for the Loss-function is Cross-entropy. It is smooth (to apply the rule) and a continuous and differentiable function. Cross-entropy is a very popular loss function for classification tasks.
Intermezzo: Entropy
When Shannon first derived his famous formula for information, he asked von Neumann what he should call it and von Neumann replied “You should call it entropy for two reasons: first because that is what the formula is in statistical mechanics but second and more important, as nobody knows what entropy is, whenever you use the term you will always be at an advantage!
The Softmax Function
We can use the output of our neural network as an estimate of a probability distribution over the classes. A probability distribution is a set of numbers between 0 and 1 that sum up to 1. To obtain a probability distribution from the raw outputs of our network, we can use the softmax function.
\[\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_i e^{z_i}}\]
The softmax function takes a vector of real numbers ‘z’ and transforms it into a probability distribution.
Terminology: softmax, the inputs to the softmax function are often called logits (the \(l_i\)’s).
\(p(l_i) = \frac{e^{l_i}}{\sum_j e^{l_j}} \propto e^{l_i}\)
The softmax function converts these logits into probabilities \(p(l_i)\).
Cross-Entropy Loss Function
\[X(\Phi, \mathbf{x}) = - \log p(\hat{a})\]
This is the cross-entropy loss function for a single example ‘x’.
\(\Phi\) are the parameters of our model.
\(\hat{a}\) is the predicted class label for input ‘x’.
Observations on the definition of Cross-entropy loss function:
probabilities \(\in [0, 1] \Rightarrow\) minus sign. Since probabilities are between 0 and 1, their logarithm is negative. The minus sign makes the loss positive.
\(p(l_i) \propto e^{l_i} \Rightarrow\) logarithm. The softmax function uses exponentials, and the cross-entropy loss uses logarithm. This combination is mathematically convenient and leads to good optimization properties.
Computational Flow
Computation flow in a neural network with softmax and cross-entropy loss.
\(l_j = b_j + \mathbf{x} \cdot \mathbf{w}_j\)
\(p(a) = \sigma_a(\mathbf{l}) = \frac{e^{l_a}}{\sum_i e^{l_i}}\)
\(X(\Phi, \mathbf{x}) = - \ln p(a)\)
First, we compute the logits \(l_j\) as a linear combination of inputs and weights, plus bias. Then, we apply the softmax function to get the predicted probability \(p(a)\) for the true class ‘a’. Finally, we compute the cross-entropy loss \(X(\Phi, \mathbf{x}) = - \ln p(a)\). We want to minimize this loss during training.
Conclusion
This lecture provided an overview of perceptrons and feedforward neural networks, starting from the basic perceptron algorithm for binary classification and extending to multi-class classification and multi-layer networks. We discussed key terminologies such as hyperparameters, bias, and epochs, and explored the geometric interpretation of perceptrons as linear classifiers. We highlighted the limitations of the perceptron algorithm and introduced feedforward neural networks as a more powerful alternative. Furthermore, we delved into the concepts of loss functions, gradient descent, and cross-entropy loss, which are essential for training neural networks.
Key takeaways from this lecture include:
The Perceptron Algorithm is a simple linear classifier that works well for linearly separable data but fails to converge otherwise.
Multi-class classification can be achieved using multiple perceptrons, each dedicated to one class.
Feedforward neural networks extend perceptrons into multi-layer architectures, enabling the learning of more complex functions.
Loss functions quantify the error of a model’s predictions, and gradient descent is a fundamental optimization algorithm used to minimize these loss functions.
Cross-entropy loss, combined with the softmax function, is a popular and effective choice for multi-class classification problems in neural networks.
Further topics to explore include backpropagation for training multi-layer neural networks, different types of activation functions, and advanced optimization techniques. Understanding these fundamental concepts is crucial for progressing into more advanced topics in deep learning.