Lecture Notes on Multilayer Perceptrons and Backpropagation

Author

Your Name

Published

February 3, 2025

Introduction

This lecture introduces **Multilayer Perceptrons (MLPs)** and the **backpropagation algorithm**, a fundamental technique for training these networks. We will explore the key characteristics of MLPs, their architecture, and the crucial role of hidden layers in feature detection. A significant portion of the lecture will formalize the backpropagation algorithm, detailing the computation flow, gradient descent, and error signal propagation. We will examine the XOR problem as a case study to understand the power of MLPs in solving non-linearly separable problems and briefly discuss the "Double Moon" experiment to illustrate their capabilities in complex classification tasks. This lecture aims to provide a solid foundation for understanding how MLPs learn and solve complex problems through backpropagation.

Multilayer Perceptrons

Key Characteristics

Multilayer Perceptrons (MLPs) possess several key characteristics that distinguish them from simpler neural network models and enable them to learn complex patterns and solve non-linearly separable problems:

Nonlinear and Differentiable Activation Functions

Each neuron in an MLP employs a **nonlinear activation function**, denoted as \(\varphi(\cdot)\). This nonlinearity is crucial for the network to learn complex relationships in data. For the backpropagation algorithm to be applicable, this activation function must also be **differentiable**, allowing us to compute gradients essential for updating the network’s weights during training.
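As a concrete illustration, here is a minimal sketch (in Python with NumPy, an implementation choice not fixed by the lecture) of two activation functions that satisfy both requirements, together with the derivatives that backpropagation will need:

```python
import numpy as np

# Two common differentiable activation functions and their derivatives.
# These are illustrative choices; the lecture only requires that
# phi be nonlinear and differentiable.

def sigmoid(v):
    """Logistic sigmoid: phi(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    """Derivative: phi'(v) = phi(v) * (1 - phi(v))."""
    s = sigmoid(v)
    return s * (1.0 - s)

def tanh_prime(v):
    """Derivative of tanh: phi'(v) = 1 - tanh(v)^2."""
    return 1.0 - np.tanh(v) ** 2
```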

Hidden Layers

MLPs are defined by the presence of one or more **hidden layers** positioned between the input and output layers. These layers are not directly accessible from the external environment. They play a critical role in learning intermediate representations of the input data, effectively acting as **feature detectors**.

High Connectivity

MLPs typically exhibit a high degree of **connectivity** between neurons in successive layers. The strength of these connections is determined by **synaptic weights**. This dense connectivity allows for complex interactions between neurons, enabling the network to model intricate patterns in the data.

Architecture and Signal Flow

The architecture of an MLP is defined by its layered structure and the direction of signal flow.

Input, Hidden, and Output Layers

An MLP consists of distinct layers:

  • **Input Layer:** Receives the input signals.

  • **Hidden Layers:** One or more layers between the input and output layers, responsible for feature extraction.

  • **Output Layer:** Produces the final output of the network.

Signals in an MLP flow in a directed manner, typically from the input layer, through the hidden layers, and finally to the output layer.

Forward and Backward Propagation

MLPs operate through two primary phases:

  • **Forward Propagation:** In this phase, **function signals** flow in the forward direction, from the input layer to the output layer. Each neuron computes its output based on the weighted sum of its inputs and its activation function. This process is repeated layer by layer until the output of the network is computed.

  • **Backward Propagation:** Following the forward pass, the network calculates the error between its output and the desired output. Then, **error signals** propagate in the backward direction, from the output layer back to the input layer. This backward flow is used to adjust the synaptic weights of the network to minimize the error. This process is formalized by the backpropagation algorithm, which we will discuss in detail later.

The Role of Hidden Layers

Hidden layers are central to the power and flexibility of Multilayer Perceptrons.

Feature Detection

The primary role of hidden layers is to act as **feature detectors**. Each neuron in a hidden layer learns to detect specific features or patterns in the input data. As signals propagate through the hidden layers, the network progressively extracts more complex and abstract features.

Operation in Feature Space

Hidden layers operate in what is referred to as a **feature space**. This space is an internal representation of the input data, transformed by the hidden layers to highlight relevant features for the task at hand. By operating in this feature space, MLPs can effectively handle complex, non-linear relationships in the data.

Supervised Learning and Feature Space Formation

The formation of the feature space in MLPs is driven by **supervised learning**. During training, the network adjusts its weights based on the error signals, effectively shaping the feature space to optimize performance on the given task. This ability to learn and adapt its feature representation distinguishes MLPs from simpler models like Rosenblatt’s perceptron, which lacks hidden layers and operates directly in the input space.

Errors and Learning

To train a Multilayer Perceptron, it is crucial to define how to measure the error and how the network learns from this error.

Defining Error

Consider a set of \(N\) examples, denoted as \(\mathcal{D} = \{(\mathbf{x}(n), \mathbf{d}(n))\}_{n=1}^{N}\), where \(\mathbf{x}(n)\) is the \(n\)-th input vector and \(\mathbf{d}(n)\) is the corresponding desired output vector.

Error for the n-th Sample

For the \(n\)-th sample, the error for the \(j\)-th neuron in the output layer, \(e_j(n)\), is defined as the difference between the desired output, \(d_j(n)\), and the actual output of the \(j\)-th neuron, \(y_j(n)\): \[\label{eq:error_n_sample} e_j(n) = d_j(n) - y_j(n)\]

Instantaneous Energy Error

The **instantaneous energy error**, \(\mathcal{E}(n)\), for the \(n\)-th sample is commonly defined as half the sum of the squared errors at the output neurons: \[\label{eq:instantaneous_error} \mathcal{E}(n) = \frac{1}{2} \sum_{j \in \text{Output Neurons}} e_j^2(n)\] This is analogous to kinetic energy in physics, having a similar quadratic form.

Total Error

The **total error** up to sample \(n\), \(E(n)\), is the sum of the instantaneous energy errors over all samples from 1 to \(n\): \[E(n) = \sum_{i=1}^{n} \mathcal{E}(i)\]

Average Error

The **average error**, \(E_{av}\), over \(N\) samples is obtained by dividing the total error by the number of samples: \[E_{av} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}(i)\] The average error is often used to evaluate the overall performance of the network.
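These error measures translate directly into code. A minimal sketch, assuming the desired and actual outputs are stored as NumPy arrays of shape \((N, \text{number of outputs})\):

```python
import numpy as np

def instantaneous_error(d_n, y_n):
    """E(n) = 1/2 * sum_j e_j(n)^2 for a single sample."""
    e = d_n - y_n                      # e_j(n) = d_j(n) - y_j(n)
    return 0.5 * np.sum(e ** 2)

def average_error(D, Y):
    """E_av = (1/N) * sum_n E(n) over all N samples."""
    N = D.shape[0]
    return sum(instantaneous_error(D[n], Y[n]) for n in range(N)) / N
```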

Online vs. Batch Learning

There are two main approaches to updating the weights of a neural network based on the error: online learning and batch learning.

  • **Online Learning (Stochastic Gradient Descent):** In online learning, the network weights are updated after each training sample. This approach is also known as stochastic gradient descent because the gradient is estimated based on a single sample.

  • **Batch Learning (Batch Gradient Descent):** In batch learning, the weights are updated after processing the entire training dataset (or a batch of samples). The error is accumulated over all samples in the batch before updating the weights. This approach calculates the true gradient over the entire batch.
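The two schedules can be contrasted in a short sketch. For concreteness it uses a single linear neuron with squared error, so the per-sample gradient has a closed form; this model choice is illustrative only, not part of the lecture's setup:

```python
import numpy as np

# For one linear neuron with squared error,
# E(n) = 1/2 (d - w.x)^2  =>  dE/dw = -(d - w.x) x.

def grad_E(w, x, d):
    return -(d - w @ x) * x

def train_online(w, data, eta, epochs):
    """Online (stochastic) learning: one weight update per sample."""
    for _ in range(epochs):
        for x, d in data:
            w = w - eta * grad_E(w, x, d)
    return w

def train_batch(w, data, eta, epochs):
    """Batch learning: accumulate over all samples, then update once."""
    for _ in range(epochs):
        g = sum(grad_E(w, x, d) for x, d in data)  # true batch gradient
        w = w - eta * g
    return w
```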

Advantages of Online Learning for Pattern Classification

Despite some potential disadvantages compared to batch learning, online learning is widely used for pattern classification problems due to several practical advantages:

  • **Simplicity of Implementation:** Online learning is straightforward to implement as weights are updated after each sample.

  • **Effectiveness for Large-Scale Problems:** Online learning often provides effective solutions for large and complex pattern classification problems.

  • **Dynamic Tracking:** Online learning can track small changes in the data distribution more dynamically, potentially leading to better generalization and escape from local minima in the error surface.

For these reasons, backpropagation, as discussed in the following sections, is often implemented in an online learning fashion.

Backpropagation: Preliminaries

To understand backpropagation, let’s first examine the structure and signals within a single neuron, specifically neuron \(j\).

Neuron j: Structure and Signals

A typical neuron \(j\) in an MLP receives multiple input signals and produces an output signal based on its internal processing.

Induced Local Field

The neuron receives inputs \(y_0, y_1, \ldots, y_m\), where \(y_0\) is a bias input fixed at \(+1\). Each input \(y_i\) is weighted by a synaptic weight \(w_{ji}\). The **induced local field** or net input to neuron \(j\), \(v_j(n)\), is the sum of these weighted inputs: \[\label{eq:induced_local_field} v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)\] Note that the weight for the bias input \(y_0\) is often denoted as the bias \(b_j\), so \(w_{j0}(n) = b_j(n)\).

Nonlinear Activation Function

The induced local field \(v_j(n)\) is then passed through a **nonlinear activation function**, \(\varphi(\cdot)\). This function introduces nonlinearity into the neuron’s response.

Function Signal

The output of the activation function is the **function signal** or output of neuron \(j\), \(y_j(n)\): \[\label{eq:function_signal} y_j(n) = \varphi(v_j(n))\]
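The computations of Equations [eq:induced_local_field] and [eq:function_signal] amount to one dot product and one function evaluation per neuron. A minimal sketch, assuming a tanh activation and the convention \(w_{j0} = b_j\):

```python
import numpy as np

def neuron_forward(w, y_prev):
    """Forward computation of one neuron j.

    w      : weights w_j0..w_jm, with w_j0 acting as the bias b_j
    y_prev : inputs y_1..y_m from the previous layer
    """
    y = np.concatenate(([1.0], y_prev))   # y_0 = +1 carries the bias
    v = w @ y                             # induced local field v_j(n)
    return np.tanh(v), v                  # function signal y_j(n) = phi(v_j(n))
```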

Error of the j-th Neuron

For output neurons, the output \(y_j(n)\) is compared with the desired output \(d_j(n)\) to calculate the **error** \(e_j(n)\), as defined in Equation [eq:error_n_sample].

Computation Flow and Gradient Descent

The core idea of backpropagation is to use gradient descent to adjust the weights of the network in a direction that minimizes the error.

Forward Propagation

The forward propagation step involves calculating the output of each neuron layer by layer.

Induced Local Field Calculation

For each neuron \(j\), the induced local field \(v_j(n)\) is calculated using Equation [eq:induced_local_field].

Output Calculation

The output of each neuron \(j\), \(y_j(n)\), is then computed using the activation function as in Equation [eq:function_signal]. This process is repeated for all layers until the output layer is reached.

Differentiating to Minimize Error

To minimize the error using gradient descent, we need to calculate the gradient of the error function with respect to each weight in the network. This involves using the chain rule of calculus.

Sensitivity Factor

We are interested in the **sensitivity factor**, which measures how sensitive the energy error \(\mathcal{E}(n)\) is to changes in a weight \(w_{ji}(n)\). This is given by the partial derivative \(\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}\).

Applying the Chain Rule

Using the chain rule, we can decompose the derivative as follows: \[\label{eq:chain_rule} \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial w_{ji}(n)}\] Let’s evaluate each term:

  • \(\frac{\partial \mathcal{E}(n)}{\partial e_j(n)} = \frac{\partial}{\partial e_j(n)} \left( \frac{1}{2} e_j^2(n) \right) = e_j(n)\)

  • \(\frac{\partial e_j(n)}{\partial y_j(n)} = \frac{\partial}{\partial y_j(n)} (d_j(n) - y_j(n)) = -1\)

  • \(\frac{\partial y_j(n)}{\partial v_j(n)} = \frac{\partial}{\partial v_j(n)} \varphi(v_j(n)) = \varphi'(v_j(n))\)

  • \(\frac{\partial v_j(n)}{\partial w_{ji}(n)} = \frac{\partial}{\partial w_{ji}(n)} \left( \sum_{k=0}^{m} w_{jk}(n) y_k(n) \right) = y_i(n)\)

Substituting these terms back into the chain rule expression, we get: \[\label{eq:sensitivity_factor_final} \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = -e_j(n) \cdot \varphi'(v_j(n)) \cdot y_i(n)\] This equation gives the variation of the error relative to the weight \(w_{ji}(n)\).
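Equation [eq:sensitivity_factor_final] can be sanity-checked numerically. The sketch below compares the analytic derivative against a central finite difference for a single tanh neuron; the random inputs and the tanh choice are illustrative assumptions:

```python
import numpy as np

# Finite-difference check of dE/dw_ji = -e_j * phi'(v_j) * y_i
# for a single tanh neuron; a sanity-check sketch, not part of training.

rng = np.random.default_rng(0)
w = rng.normal(size=4)                            # weights incl. bias w_j0
y = np.concatenate(([1.0], rng.normal(size=3)))   # inputs incl. y_0 = +1
d = 0.5                                           # desired output d_j(n)

def E(w):
    v = w @ y
    return 0.5 * (d - np.tanh(v)) ** 2

v = w @ y
e = d - np.tanh(v)
analytic = -e * (1.0 - np.tanh(v) ** 2) * y       # one entry per weight

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(4)[i]) - E(w - eps * np.eye(4)[i]))
                    / (2 * eps) for i in range(4)])
print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True
```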

Local Gradient and Error Assignment

Definition of the Local Gradient

The **local gradient** for neuron \(j\), denoted as \(\delta_j(n)\), is defined as: \[\label{eq:local_gradient_def} \delta_j(n) = - \frac{\partial \mathcal{E}(n)}{\partial v_j(n)}\] Using the chain rule, we can express \(\delta_j(n)\) in terms of previously calculated derivatives: \[\begin{aligned} \delta_j(n) &= - \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} \\ &= - e_j(n) \cdot (-1) \cdot \varphi'(v_j(n)) \\ &= e_j(n) \cdot \varphi'(v_j(n)) \end{aligned}\] Thus, \[\label{eq:local_gradient_def_val} \delta_j(n) = e_j(n) \varphi'(v_j(n))\]

Interpretation: Error Responsibility of Neuron j

The local gradient \(\delta_j(n)\) represents the portion of the error for which neuron \(j\) is considered "responsible". It quantifies how much neuron \(j\)’s induced local field contributes to the overall error.

Weight Correction and the Delta Rule

The Delta Rule for Weight Adjustment

The **delta rule** specifies how to adjust the weight \(w_{ji}(n)\) to reduce the error. The correction to the weight, \(\Delta w_{ji}(n)\), is proportional to the negative gradient of the error with respect to the weight: \[\Delta w_{ji}(n) = - \eta \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}\] where \(\eta\) is the learning rate, a positive constant that controls the step size in the gradient descent.

Gradient Descent

The delta rule implements **gradient descent**, an optimization algorithm that iteratively moves in the direction of the steepest decrease of the error function. Substituting the expression for \(\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}\) from Equation [eq:sensitivity_factor_final] and using the definition of the local gradient, we get: \[\begin{aligned} \Delta w_{ji}(n) &= - \eta \left( - \delta_j(n) \cdot y_i(n) \right) \\ &= \eta \delta_j(n) y_i(n) \end{aligned}\] Thus, the weight update rule becomes: \[\label{eq:delta_rule} \Delta w_{ji}(n) = \eta \delta_j(n) y_i(n)\]
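Putting Equations [eq:local_gradient_def_val] and [eq:delta_rule] together, the update for a single output neuron takes only a few lines. A sketch, assuming a tanh activation:

```python
import numpy as np

def delta_rule_update(w, y, d, eta=0.1):
    """One delta-rule step for an output neuron with tanh activation."""
    v = w @ y                            # induced local field v_j(n)
    out = np.tanh(v)                     # output y_j(n)
    delta = (d - out) * (1.0 - out**2)   # local gradient delta_j(n) = e_j phi'(v_j)
    return w + eta * delta * y           # w_ji <- w_ji + eta * delta_j * y_i
```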

Error Signal and Neuron Location

The calculation of the error signal \(e_j(n)\) and consequently the local gradient \(\delta_j(n)\) depends on the location of neuron \(j\) in the network.

Case 1: Output Node

If neuron \(j\) is an **output node**, the desired response \(d_j(n)\) is directly available. Therefore, the error signal \(e_j(n)\) can be directly calculated as: \[e_j(n) = d_j(n) - y_j(n)\] and the local gradient \(\delta_j(n)\) is computed using Equation [eq:local_gradient_def_val].

Case 2: Hidden Node and the Credit-Assignment Problem

If neuron \(j\) is a **hidden node**, there is no directly specified desired output. However, hidden neurons contribute to the error at the output layer. The challenge of determining how to assign "credit" or "blame" to hidden neurons for their contribution to the output error is known as the **credit-assignment problem** or **blame assignment problem**. Backpropagation provides a systematic way to solve this problem by propagating error signals backwards through the network.

General Weight Correction Formula

The general weight correction formula for backpropagation can be summarized as: \[\Delta w_{ji}(n) = \eta \cdot \delta_j(n) \cdot y_i(n)\] where:

  • \(\eta\) is the learning rate.

  • \(\delta_j(n)\) is the local gradient for neuron \(j\).

  • \(y_i(n)\) is the input signal to neuron \(j\) (output of neuron \(i\) in the previous layer).

Computing the Local Gradient: Output vs. Hidden Nodes

The method for computing the local gradient \(\delta_j(n)\) differs depending on whether neuron \(j\) is an output node or a hidden node.

Output Node: Direct Calculation

For an **output node** \(j\), the local gradient \(\delta_j(n)\) is calculated directly using the error signal \(e_j(n)\) and the derivative of the activation function: \[\delta_j(n) = e_j(n) \varphi'(v_j(n)) = (d_j(n) - y_j(n)) \varphi'(v_j(n))\]

Hidden Node: Recursive Calculation

For a **hidden node** \(j\), the local gradient \(\delta_j(n)\) is computed recursively based on the local gradients of the neurons in the next layer (closer to the output). The error signal is effectively propagated backwards. To derive this recursive formula, consider the error \(\mathcal{E}(n)\) as a function of the output of neuron \(j\), \(y_j(n)\). Changes in \(y_j(n)\) affect the induced local fields \(v_k(n)\) and consequently the errors \(e_k(n)\) of all neurons \(k\) in the next layer that receive input from neuron \(j\). Using the chain rule again, we can express \(\frac{\partial \mathcal{E}(n)}{\partial v_j(n)}\) in terms of the derivatives in the next layer.

Consider neuron \(j\) in layer \(l\). The local gradient \(\delta_j^{(l)}(n)\) is given by: \[\label{eq:hidden_delta} \delta_j^{(l)}(n) = \varphi'(v_j^{(l)}(n)) \sum_{k} \delta_k^{(l+1)}(n) w_{kj}^{(l+1)}(n)\] where the sum is over all neurons \(k\) in the next layer (\(l+1\)) that receive input from neuron \(j\) in layer \(l\), and \(w_{kj}^{(l+1)}(n)\) are the weights connecting neuron \(j\) in layer \(l\) to neuron \(k\) in layer \(l+1\). This recursive formula shows that the local gradient at a hidden layer depends on the local gradients computed for the next layer and the weights connecting these layers.
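In code, Equation [eq:hidden_delta] is a matrix–vector product followed by an elementwise scaling. A sketch assuming a tanh activation, with the convention that column 0 of each weight matrix stores the biases \(w_{k0}\) (and is therefore excluded from the back-propagated sum):

```python
import numpy as np

def hidden_deltas(v_this, delta_next, W_next):
    """delta_j^(l) = phi'(v_j^(l)) * sum_k delta_k^(l+1) w_kj^(l+1).

    v_this     : induced local fields of layer l, shape (n_this,)
    delta_next : local gradients of layer l+1, shape (n_next,)
    W_next     : weights of layer l+1, shape (n_next, n_this + 1)
    """
    phi_prime = 1.0 - np.tanh(v_this) ** 2    # phi'(v_j^(l))
    back = W_next[:, 1:].T @ delta_next       # sum over next-layer neurons k
    return phi_prime * back
```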

Backpropagation: Signal Flow and Mathematical Foundations

Signal-Flow Graph

Illustrative diagrams can be used to depict the signal-flow graph of backpropagation, highlighting the connections between neurons and the flow of signals during both forward and backward passes.

Chain Rule and Partial Differentiation

Backpropagation is fundamentally based on the **chain rule** of calculus and on **partial differentiation** of real-valued functions of several variables. It provides a systematic method for computing the gradients in complex, layered networks by repeatedly applying the chain rule to propagate error signals backwards.

Approximating Function and Partial Derivatives

A Multilayer Perceptron can be viewed as an **approximating function**, \(F[\mathbf{w}, \mathbf{x}(n)]\), where \(\mathbf{w}\) represents all the weights in the network and \(\mathbf{x}(n)\) is the input. The goal of backpropagation is to adjust the weights \(\mathbf{w}\) to minimize the error between the network’s output and the desired output. This is achieved by computing the **partial derivatives** of the error function with respect to each weight and using these derivatives to update the weights through gradient descent.

Formalizing the Network: Layer Notation

To formalize the backpropagation algorithm for networks with multiple layers, we introduce a layer notation. Let \(l = 0, 1, 2, \ldots, L\) denote the layers of the network, where \(l=0\) is the input layer and \(l=L\) is the output layer. We use the following notation for layer \(l\):

  • \(y_j^{(l)}(n)\): \(j\)-th component of the output vector of layer \(l\) for the \(n\)-th sample. For \(l=0\), \(y_j^{(0)}(n) = x_j(n)\), the \(j\)-th component of the input vector.

  • \(v_j^{(l)}(n)\): Induced local field for the \(j\)-th neuron in layer \(l\).

  • \(\delta_j^{(l)}(n)\): Local gradient for the \(j\)-th neuron in layer \(l\).

Local Gradients: Formal Definition

Using the layer notation, we can formally define the local gradients for the output and hidden layers.

Output Layer

For the output layer \(L\), the local gradient for the \(j\)-th neuron is: \[\delta_j^{(L)}(n) = e_j^{(L)}(n) \varphi'(v_j^{(L)}(n)) = (d_j(n) - y_j^{(L)}(n)) \varphi'(v_j^{(L)}(n))\]

Hidden Layer

For a hidden layer \(l < L\), the local gradient for the \(j\)-th neuron is calculated recursively: \[\delta_j^{(l)}(n) = \varphi'(v_j^{(l)}(n)) \sum_{k} \delta_k^{(l+1)}(n) w_{kj}^{(l+1)}(n)\] where the sum is over all neurons \(k\) in the next layer \(l+1\).

Practical Delta Rule: Momentum

Delta Rule with Momentum

In practice, the basic delta rule can be noisy and slow to converge. To improve convergence and stability, a **momentum term** is often added to the weight update rule. The **delta rule with momentum** is given by: \[\Delta w_{ji}(n) = \alpha \Delta w_{ji}(n-1) + \eta \delta_j(n) y_i(n)\] where \(\alpha\) is the momentum constant, typically set between 0 and 1 (\(0 \leq \alpha < 1\)). The momentum term \(\alpha \Delta w_{ji}(n-1)\) adds a fraction of the previous weight update to the current update, effectively smoothing the learning process and potentially accelerating convergence by allowing a larger learning rate \(\eta\) without causing oscillations.
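The momentum update keeps one extra piece of state, the previous correction \(\Delta w_{ji}(n-1)\). A minimal sketch for a single neuron:

```python
def momentum_update(w, prev_dw, delta, y, eta=0.1, alpha=0.9):
    """Delta rule with momentum: Delta w(n) = alpha * Delta w(n-1) + eta * delta * y."""
    dw = alpha * prev_dw + eta * delta * y   # current correction Delta w(n)
    return w + dw, dw                        # new weights, and Delta w(n) for next step
```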

Generalized Delta Rule and Historical Influence

The delta rule with momentum is sometimes referred to as the **generalized delta rule**. Expanding the recursive formula for the weight update, we can see the influence of historical gradients: \[\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{(n-t)} \delta_j(t) y_i(t)\] This shows that the current weight update is influenced by a history of past gradients, with older gradients having exponentially decaying influence due to the \(\alpha^{(n-t)}\) term.

Stopping Criteria

To determine when to stop training the network, we need to define **stopping criteria**.

Euclidean Norm of the Gradient

A common stopping criterion is to monitor the **Euclidean norm of the gradient** of the error function. Training can be stopped when this norm becomes sufficiently small, indicating that the network has reached a (local) minimum in the error surface. This is a heuristic approach, but widely used in practice. The Euclidean norm of the gradient at iteration \(n\) can be calculated as: \[\| \nabla \mathcal{E}(n) \| = \sqrt{\sum_{j,i} \left( \frac{\partial \mathcal{E}(n)}{\partial w_{ji}} \right)^2}\] and training stops when \(\| \nabla \mathcal{E}(n) \| < \epsilon\), where \(\epsilon\) is a predefined small threshold.
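In code, this criterion is a loop guard. A skeleton sketch, where `gradient` and `step` are hypothetical callables standing in for the backpropagation machinery developed above:

```python
import numpy as np

def train_until_flat(w, gradient, step, epsilon=1e-4, max_iter=100_000):
    """Iterate until the Euclidean norm of the gradient falls below epsilon."""
    for n in range(max_iter):
        g = gradient(w)                   # all partials dE/dw_ji, flattened
        if np.linalg.norm(g) < epsilon:   # ||grad E(n)|| < epsilon
            break
        w = step(w, g)                    # e.g. w - eta * g
    return w
```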

Signal-Flow Summary of Backpropagation

Illustrative diagrams can be used to provide a signal-flow graphical summary of backpropagation learning, illustrating the forward and backward passes.

Forward Pass

In the **forward pass**, input signals are propagated through the network layer by layer. For each layer \(l\):

  1. Calculate the induced local fields \(v_j^{(l)}(n)\) for all neurons \(j\) in layer \(l\).

  2. Calculate the outputs \(y_j^{(l)}(n) = \varphi(v_j^{(l)}(n))\) for all neurons \(j\) in layer \(l\).

This process continues until the output layer \(L\) is reached, producing the network outputs \(y_j^{(L)}(n)\). The errors \(e_j^{(L)}(n) = d_j(n) - y_j^{(L)}(n)\) are then calculated at the output layer.

Backward Pass

In the **backward pass**, error signals are propagated backwards from the output layer to the input layer. For each layer \(l\), starting from the output layer \(L\) down to the first hidden layer:

  1. For the output layer \(L\), calculate the local gradients \(\delta_j^{(L)}(n) = e_j^{(L)}(n) \varphi'(v_j^{(L)}(n))\).

  2. For hidden layers \(l < L\), calculate the local gradients recursively using \(\delta_j^{(l)}(n) = \varphi'(v_j^{(l)}(n)) \sum_{k} \delta_k^{(l+1)}(n) w_{kj}^{(l+1)}(n)\).

  3. Update the weights using the delta rule (possibly with momentum): \(\Delta w_{ji}^{(l)}(n) = \alpha \Delta w_{ji}^{(l)}(n-1) + \eta \delta_j^{(l)}(n) y_i^{(l-1)}(n)\).

This completes one iteration of backpropagation for a single training sample.
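The two passes can be combined into one compact routine. The sketch below performs a single backpropagation iteration for one sample on a one-hidden-layer network; the tanh activation and the bias-in-column-0 convention are implementation assumptions:

```python
import numpy as np

def backprop_step(W1, W2, x, d, eta=0.1):
    """One forward + backward pass on a one-hidden-layer tanh network.

    W1 : hidden-layer weights, shape (n_hidden, n_inputs + 1)
    W2 : output-layer weights, shape (n_outputs, n_hidden + 1)
    """
    # ---- forward pass ----
    y0 = np.concatenate(([1.0], x))              # input with bias term y_0 = +1
    v1 = W1 @ y0                                 # induced local fields, hidden layer
    y1 = np.concatenate(([1.0], np.tanh(v1)))    # hidden outputs, bias prepended
    v2 = W2 @ y1                                 # induced local fields, output layer
    y2 = np.tanh(v2)                             # network outputs

    # ---- backward pass ----
    e = d - y2                                   # output errors e_j(n)
    delta2 = e * (1.0 - y2 ** 2)                 # output local gradients
    delta1 = (1.0 - np.tanh(v1) ** 2) * (W2[:, 1:].T @ delta2)  # hidden gradients

    # ---- delta-rule weight updates ----
    W2 = W2 + eta * np.outer(delta2, y1)
    W1 = W1 + eta * np.outer(delta1, y0)
    return W1, W2
```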

The XOR Problem: A Case Study

The **XOR problem** is a classic example used to demonstrate the limitations of single-layer perceptrons and the power of multilayer perceptrons (MLPs). It highlights the concept of linear inseparability and how hidden layers can overcome this limitation.

Linear Inseparability of XOR

The XOR (exclusive OR) function is defined by the following truth table:

| Input 1 (\(x_1\)) | Input 2 (\(x_2\)) | Output (XOR) |
|:---:|:---:|:---:|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

When plotted in a 2D space, the points (0,0) and (1,1) belong to one class (output 0), and the points (0,1) and (1,0) belong to another class (output 1). These two classes are **linearly inseparable**, meaning that it is impossible to draw a single straight line to separate them. A **single-layer perceptron** cannot solve the XOR problem because it can only implement linear decision boundaries.

Solving XOR with a Multilayer Perceptron

To solve the XOR problem, we need to introduce nonlinearity through hidden layers.

Hidden Layer with Two Neurons

An MLP with **one hidden layer** containing **two neurons** is sufficient to solve the XOR problem. This network architecture allows for the creation of non-linear decision boundaries.

Linear Combination of Neuron Outputs

The hidden layer neurons act as feature detectors, and the output neuron then **linearly combines the outputs of these hidden neurons** to produce the final output. This combination of nonlinear transformations in the hidden layer followed by a linear combination in the output layer enables the network to solve non-linearly separable problems.

Network Configuration and Solution

Consider an MLP with two input neurons, two hidden neurons (Neuron 1 and Neuron 2), and one output neuron (Neuron 3).

Specific Weights and Biases

A specific set of weights and biases that solves the XOR problem is:

  • **Neuron 1 (Hidden):** Weights \(w_{11} = 1\), \(w_{12} = 1\), Bias \(b_1 = -3/2\).

  • **Neuron 2 (Hidden):** Weights \(w_{21} = 1\), \(w_{22} = 1\), Bias \(b_2 = -1/2\).

  • **Neuron 3 (Output):** Weights \(w_{31} = -2\), \(w_{32} = 1\), Bias \(b_3 = -1/2\).

Decision Boundaries of Neurons 1, 2, and 3

The decision boundaries for each neuron can be analyzed as follows:

  • **Neuron 1:** The induced local field is \(v_1 = x_1 + x_2 - \frac{3}{2}\). The decision boundary is given by \(x_1 + x_2 - \frac{3}{2} = 0\) or \(x_1 + x_2 = \frac{3}{2}\).

  • **Neuron 2:** The induced local field is \(v_2 = x_1 + x_2 - \frac{1}{2}\). The decision boundary is given by \(x_1 + x_2 - \frac{1}{2} = 0\) or \(x_1 + x_2 = \frac{1}{2}\).

  • **Neuron 3:** The induced local field is \(v_3 = -2y_1 + y_2 - \frac{1}{2}\), where \(y_1\) and \(y_2\) are the outputs of Neuron 1 and Neuron 2, respectively. The decision boundary is given by \(-2y_1 + y_2 - \frac{1}{2} = 0\) or \(y_2 = 2y_1 + \frac{1}{2}\).

These decision boundaries, when combined, effectively separate the XOR classes.

Verification of the Solution

Let’s verify how this network configuration correctly implements the XOR function for each input combination, assuming a step activation function with a threshold at 0 (output is 1 if \(v \ge 0\), and 0 if \(v < 0\)):

  • **Input (0, 0):**

    • Neuron 1: \(v_1 = 0 + 0 - \frac{3}{2} = -\frac{3}{2} < 0 \implies y_1 = 0\)

    • Neuron 2: \(v_2 = 0 + 0 - \frac{1}{2} = -\frac{1}{2} < 0 \implies y_2 = 0\)

    • Neuron 3: \(v_3 = -2(0) + 0 - \frac{1}{2} = -\frac{1}{2} < 0 \implies y_3 = 0\) (Correct XOR output)

  • **Input (1, 1):**

    • Neuron 1: \(v_1 = 1 + 1 - \frac{3}{2} = \frac{1}{2} \ge 0 \implies y_1 = 1\)

    • Neuron 2: \(v_2 = 1 + 1 - \frac{1}{2} = \frac{3}{2} \ge 0 \implies y_2 = 1\)

    • Neuron 3: \(v_3 = -2(1) + 1 - \frac{1}{2} = -\frac{3}{2} < 0 \implies y_3 = 0\) (Correct XOR output)

  • **Input (0, 1):**

    • Neuron 1: \(v_1 = 0 + 1 - \frac{3}{2} = -\frac{1}{2} < 0 \implies y_1 = 0\)

    • Neuron 2: \(v_2 = 0 + 1 - \frac{1}{2} = \frac{1}{2} \ge 0 \implies y_2 = 1\)

    • Neuron 3: \(v_3 = -2(0) + 1 - \frac{1}{2} = \frac{1}{2} \ge 0 \implies y_3 = 1\) (Correct XOR output)

  • **Input (1, 0):**

    • Neuron 1: \(v_1 = 1 + 0 - \frac{3}{2} = -\frac{1}{2} < 0 \implies y_1 = 0\)

    • Neuron 2: \(v_2 = 1 + 0 - \frac{1}{2} = \frac{1}{2} \ge 0 \implies y_2 = 1\)

    • Neuron 3: \(v_3 = -2(0) + 1 - \frac{1}{2} = \frac{1}{2} \ge 0 \implies y_3 = 1\) (Correct XOR output)

As verified, this specific network configuration correctly solves the XOR problem.
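The verification above is easy to mechanize. The following sketch encodes the stated weights, biases, and step activation, and prints the full truth table:

```python
# Verification of the XOR network above: the stated weights and biases
# with a step activation (1 if v >= 0, else 0) reproduce the truth table.

def step(v):
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    y1 = step(x1 + x2 - 1.5)         # Neuron 1: w = (1, 1), b = -3/2
    y2 = step(x1 + x2 - 0.5)         # Neuron 2: w = (1, 1), b = -1/2
    return step(-2 * y1 + y2 - 0.5)  # Neuron 3: w = (-2, 1), b = -1/2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # outputs 0, 1, 1, 0 as expected
```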

Key Takeaway: The Power of Feature Space Dimensionality

The solution to the XOR problem highlights a crucial concept: while a single-layer perceptron is limited to linear decision boundaries and cannot solve linearly inseparable problems, **MLPs overcome this limitation by operating in a higher-dimensional feature space**. By introducing hidden layers, MLPs increase the dimensionality of the feature space, enabling them to learn complex, non-linear decision boundaries and solve problems like XOR that are beyond the capability of single-layer networks. The hidden layers act as feature extractors, transforming the input data into a representation where the classes become linearly separable, allowing the output layer to perform a linear classification in this transformed space.

The "Double Moon" Experiment

The **"Double Moon" experiment** is a classic benchmark for testing the ability of classification algorithms, particularly neural networks, to handle **nonlinearly separable data**.

Testing Nonlinearly Separable Data Classification

This experiment is designed to demonstrate the effectiveness of multilayer perceptrons (MLPs), trained with backpropagation, in classifying data that cannot be separated by a linear boundary. It also serves to explore the limits of MLPs by testing them on increasingly difficult nonlinearly separable patterns.

Experimental Setup and Parameters

The experiment utilizes a "double moon" dataset, which consists of two crescent-shaped clusters of data points that are intertwined and nonlinearly separable. The experimental setup includes the following specifications for the MLP:

  • **Input Layer Size:** Number of input features for the double moon dataset (likely 2 for 2D data).

  • **Hidden Layer Size:** Number of neurons in the hidden layer(s).

  • **Output Layer Size:** Number of output neurons (likely 1 for binary classification or 2 for class probabilities).

  • **Activation Function:** Hyperbolic Tangent (tanh).

  • **Threshold Setting:** Threshold for classification decision.

  • **Learning Rate Annealing:** Details on how the learning rate is adjusted during training.

The experiment is conducted for two settings of the vertical separation between the moons, \(d=-4\) and \(d=-5\); a negative \(d\) means the two moons overlap, so decreasing \(d\) increases the overlap and thus the difficulty of the classification task.

Results and Performance

The results of the double moon experiment are typically visualized through learning curves and classification boundaries.

  • **Distance d = -4:** For \(d=-4\), the moons are **nonlinearly separable**, but with a reasonable margin. The MLP, trained with backpropagation, is expected to classify the data reasonably well, demonstrating its capability to handle nonlinear boundaries.

  • **Distance d = -5:** As the separation decreases further to \(d=-5\), the moons become more intertwined, making the classification task more challenging. However, the MLP is still expected to perform surprisingly well, showcasing its robustness in handling complex nonlinear patterns.

The experiment aims to show that MLPs can effectively learn and classify nonlinearly separable data, even as the complexity of the data increases. The learning curves would typically show the error decreasing over training epochs, and the classification boundaries would visually represent how the MLP separates the two moon-shaped classes.
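For readers who want to reproduce the setup, here is a sketch of a double-moon generator. The specific parameters (ring radius \(r = 10\), ring width \(w = 6\), and the role of \(d\) as vertical separation) follow Haykin's classic formulation and are assumptions, since the lecture does not spell them out:

```python
import numpy as np

# A sketch of a double-moon generator. Radius r = 10 and width w = 6 are
# assumed Haykin-style defaults; negative d makes the moons overlap.

def double_moon(n, r=10.0, w=6.0, d=-4.0, seed=0):
    rng = np.random.default_rng(seed)
    radius = rng.uniform(r - w / 2, r + w / 2, size=n)
    theta = rng.uniform(0.0, np.pi, size=n)
    # Upper moon (class +1), centred at the origin.
    xa = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
    # Lower moon (class -1), flipped, shifted right by r and down by d.
    xb = np.c_[radius * np.cos(theta) + r, -radius * np.sin(theta) - d]
    X = np.vstack([xa, xb])
    y = np.concatenate([np.ones(n), -np.ones(n)])
    return X, y

X, y = double_moon(1000, d=-4.0)   # the easier of the two settings above
```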

Conclusion

In this lecture, we have explored the fundamental concepts of Multilayer Perceptrons (MLPs) and the backpropagation algorithm. We began by outlining the key characteristics of MLPs, emphasizing the importance of nonlinear activation functions, hidden layers, and high connectivity. We then delved into the architecture and signal flow within MLPs, distinguishing between forward and backward propagation phases. The critical role of hidden layers as feature detectors operating in a feature space was highlighted, showcasing their ability to learn complex data representations through supervised learning.

We formalized the concept of error measurement, defining instantaneous, total, and average errors, and discussed the advantages of online learning, particularly for pattern classification problems. A significant portion of the lecture was dedicated to a detailed derivation of the backpropagation algorithm, starting from the neuron structure and signals, moving through computation flow, gradient descent, and the delta rule for weight correction. We differentiated between local gradient computation for output and hidden nodes, emphasizing the recursive nature of backpropagation in hidden layers and its foundation in the chain rule and partial differentiation. Practical improvements like momentum were also discussed to enhance the delta rule’s performance.

The XOR problem served as a crucial case study, illustrating the linear inseparability issue and how MLPs with hidden layers can solve it by increasing the dimensionality of the feature space. Finally, we briefly discussed the "Double Moon" experiment, demonstrating the capability of MLPs to classify complex, nonlinearly separable datasets in practice.

**Key Takeaways:**

  • Multilayer Perceptrons, with their nonlinear activation functions and hidden layers, are powerful tools for learning complex patterns and solving nonlinearly separable problems.

  • Backpropagation is a fundamental algorithm for training MLPs, enabling the network to learn by propagating error signals backward and adjusting weights through gradient descent.

  • Hidden layers play a crucial role in feature detection and transforming the input data into a feature space where complex relationships can be learned.

  • The XOR problem and the Double Moon experiment exemplify the ability of MLPs to overcome the limitations of linear models and handle nonlinear data effectively.

**Follow-up Questions and Topics for Next Lecture:**

  • How do different activation functions affect the performance of backpropagation?

  • What are the mathematical properties of backpropagation, such as the Jacobian and Hessian matrices, and how do they influence learning dynamics?

  • How can we analyze the convergence and generalization properties of backpropagation?

  • What are some advanced techniques to improve backpropagation, such as different optimization algorithms and regularization methods?

  • How do deeper networks (networks with more hidden layers) further enhance feature representation and problem-solving capabilities?

These questions will guide our exploration in the next lecture as we delve deeper into the theoretical underpinnings and advanced aspects of backpropagation and multilayer perceptrons.