2025-02-03
To understand how backpropagation through time (BPTT) works, let’s derive its mathematical formulation.
Consider an RNN that processes a sequence of inputs \(x_1, x_2, \ldots, x_T\). At each time step \(t\), the RNN maintains a hidden state \(h_t\), which is updated based on the current input \(x_t\) and the previous hidden state \(h_{t-1}\):
\[h_t = f(W_h h_{t-1} + W_x x_t + b)\]
where \(W_h\) and \(W_x\) are weight matrices, \(b\) is a bias vector, and \(f\) is an activation function (typically tanh or ReLU).
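As a concrete illustration, here is a minimal sketch of one update step in NumPy, assuming \(f = \tanh\); the dimensions and the small random initialization are arbitrary choices for the example, not prescribed by the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# Illustrative parameters; the initialization scheme is arbitrary.
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def step(h_prev, x_t):
    """One recurrence step: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)
```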
Assume we have a loss function \(L\) that depends on the outputs of the RNN at each time step. The total loss over the sequence is:
\[L = \sum_{t=1}^T L_t(y_t, \hat{y}_t)\]
where \(y_t\) is the true output and \(\hat{y}_t\) is the predicted output at time step \(t\).
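The map from \(h_t\) to \(\hat{y}_t\) is left unspecified above; a common choice, adopted here purely for concreteness (with a hypothetical readout matrix \(W_y\) and bias \(c\)), is a linear readout with a squared-error loss:

\[\hat{y}_t = W_y h_t + c, \qquad L_t(y_t, \hat{y}_t) = \tfrac{1}{2} \lVert y_t - \hat{y}_t \rVert^2\]

The code sketches below use this choice; any differentiable readout and per-step loss would work the same way.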
To train the RNN, we need to compute the gradients of the loss with respect to the weights \(W_h\) and \(W_x\). BPTT involves unfolding the RNN through time and applying the chain rule of calculus to compute these gradients.
Forward Pass: Compute the hidden states \(h_t\) and the outputs \(\hat{y}_t\) for \(t = 1, 2, \ldots, T\) (a code sketch follows these two steps).
Backward Pass: Compute the gradients of the loss with respect to the hidden states and weights by propagating the error backwards through time.
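Under the linear-readout and squared-error assumptions above, the forward pass might look like this sketch, continuing the earlier NumPy setup (`W_y` and `c` are the hypothetical readout parameters):

```python
import numpy as np

def forward(xs, h0, W_h, W_x, b, W_y, c):
    """Run the RNN over a sequence, caching every state for the backward pass."""
    hs, y_hats = [h0], []
    for x_t in xs:
        h_t = np.tanh(W_h @ hs[-1] + W_x @ x_t + b)  # recurrence from above
        hs.append(h_t)
        y_hats.append(W_y @ h_t + c)                  # assumed linear readout
    return hs, y_hats

def total_loss(ys, y_hats):
    """L = sum_t 0.5 * ||y_t - y_hat_t||^2, the assumed per-step loss."""
    return sum(0.5 * np.sum((y - yh) ** 2) for y, yh in zip(ys, y_hats))
```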
The gradient of the loss with respect to the hidden state at time step \(t\) is:
\[\frac{\partial L}{\partial h_t} = \sum_{k=t}^T \frac{\partial L_k}{\partial h_t}\]
Using the chain rule, we can express this as:
\[\frac{\partial L_k}{\partial h_t} = \frac{\partial L_k}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial h_k} \frac{\partial h_k}{\partial h_t}\]
The Jacobian of the hidden state \(h_k\) with respect to \(h_t\) traces the recurrent connection back through every intermediate step, and factors into a product of per-step Jacobians:
\[\frac{\partial h_k}{\partial h_t} = \prod_{j=t+1}^k \frac{\partial h_j}{\partial h_{j-1}}\]
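For concreteness, with \(f = \tanh\) each factor in this product has a closed form, since \(\tanh'(a) = 1 - \tanh^2(a)\):

\[\frac{\partial h_j}{\partial h_{j-1}} = \operatorname{diag}\!\left(1 - h_j \odot h_j\right) W_h\]

where \(\odot\) denotes elementwise multiplication.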
Finally, the gradients of the loss with respect to the weights are computed by summing the contributions from each time step, where \(\frac{\partial h_t}{\partial W_h}\) and \(\frac{\partial h_t}{\partial W_x}\) denote the immediate partial derivatives at step \(t\) (holding \(h_{t-1}\) fixed, since the dependence through earlier steps is already captured in \(\frac{\partial L}{\partial h_t}\)):
\[\frac{\partial L}{\partial W_h} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_h}\]
\[\frac{\partial L}{\partial W_x} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_x}\]
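Putting the pieces together, here is a minimal BPTT sketch under the same assumptions (tanh activation, linear readout, squared-error loss; `forward` and the parameter names come from the earlier snippets):

```python
def bptt(xs, ys, h0, W_h, W_x, b, W_y, c):
    """Accumulate the weight gradients by walking the unrolled graph backwards."""
    hs, y_hats = forward(xs, h0, W_h, W_x, b, W_y, c)
    dW_h, dW_x, db = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(b)
    dW_y, dc = np.zeros_like(W_y), np.zeros_like(c)
    dh_next = np.zeros_like(h0)  # dL/dh_t arriving from steps k > t

    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        dy = y_hats[t] - ys[t]            # dL_t/dy_hat_t for squared error
        dW_y += np.outer(dy, h_t)
        dc += dy
        dh = W_y.T @ dy + dh_next         # dL/dh_t: local plus future terms
        da = (1.0 - h_t ** 2) * dh        # through tanh: diag(1 - h_t^2)
        dW_h += np.outer(da, h_prev)      # step t's contribution to dL/dW_h
        dW_x += np.outer(da, xs[t])
        db += da
        dh_next = W_h.T @ da              # propagate to h_{t-1} via W_h

    return dW_h, dW_x, db, dW_y, dc
```

A finite-difference check against `total_loss` (perturb one weight, re-run the forward pass, and compare the loss change to the computed gradient) is a quick way to validate a sketch like this.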