Of RNNs: Ising, Hopfield, Math and Modernity

2025-02-03

Recurrent Neural Networks

Physics Foundations

  • 1925: The Ising Model laid the groundwork for understanding dynamic systems.
    • Concept: A mathematical model of ferromagnetism in statistical mechanics.
    • Relevance: Introduced ideas of interacting components and system states evolving over time, conceptually related to dynamic systems later explored in RNNs.

Hopfield Networks

  • 1982: Hopfield Networks introduced associative memory structures.
    • Concept: A type of recurrent neural network that serves as a content-addressable “associative memory” system.
    • Relevance: Demonstrated the potential of recurrent connections for memory and pattern completion, a precursor to RNNs for sequential data.

Formalizing Recurrent Networks and Training

  • 1986: Recurrent Neural Networks (RNNs) were formalized.
    • Concept: Neural networks with loops, allowing them to process sequences of inputs by maintaining a hidden state that carries information across time steps.
    • Relevance: The birth of the RNN as we know it, designed to handle sequential data.

Backpropagation Through Time

  • 1986: Backpropagation Through Time (BPTT) was developed for training RNNs.
    • Concept: An adaptation of the backpropagation algorithm that trains RNNs by unfolding the network through time and then applying standard backpropagation.
    • Relevance: Provided a practical method to train RNNs, enabling them to learn from sequential data.

Addressing the Vanishing Gradient

  • 1997: LSTM networks addressed the vanishing gradient problem.
    • Concept: Long Short-Term Memory networks, a special type of RNN with memory cells and gates that regulate information flow, designed to mitigate the vanishing gradient problem.
    • Relevance: A major breakthrough that allowed RNNs to learn long-range dependencies in sequences, significantly improving performance on tasks like NLP.

Simplifying and Maintaining Performance

  • 2014: GRUs simplified RNN structures while maintaining performance.
    • Concept: Gated Recurrent Units, a simplified version of LSTMs with fewer gates, offering comparable performance with fewer parameters.
    • Relevance: Provided a more efficient alternative to LSTMs in many cases, making RNNs more accessible and faster to train.

Modern Advancements

  • 2020: Modern LSTMs evolved for advanced applications like NLP and forecasting.
    • Concept: Continued research and development on LSTMs and related architectures, incorporating attention mechanisms, transformers, and other advancements for state-of-the-art performance.
    • Relevance: Highlights the ongoing evolution of RNNs and their continued relevance in cutting-edge applications, especially in natural language processing and time series forecasting.

Backpropagation: The Math

To understand how BPTT works, let’s derive its mathematical formulation.

Unfolding the RNN

  • Consider an RNN that processes a sequence of inputs \((x_1, x_2, \ldots, x_T)\). At each time step \(t\), the RNN maintains a hidden state \(h_t\), which is updated based on the current input \(x_t\) and the previous hidden state \(h_{t-1}\):

  • \[h_t = f(W_h h_{t-1} + W_x x_t + b)\]

  • where \(W_h\) and \(W_x\) are weight matrices, \(b\) is a bias vector, and \(f\) is an activation function (typically tanh or ReLU). A minimal code sketch of this update follows below.
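Here is a minimal sketch of this recurrence, using \(f = \tanh\); the dimensions, random initialization, and inputs are illustrative assumptions, not fixed by the formula above:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One recurrence step: h_t = f(W_h h_{t-1} + W_x x_t + b), with f = tanh."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes and random weights (assumptions for this sketch).
hidden_dim, input_dim, T = 4, 3, 5
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)               # h_0
xs = rng.normal(size=(T, input_dim))   # x_1, ..., x_T
hidden_states = []
for x_t in xs:
    h = rnn_step(h, x_t, W_h, W_x, b)  # the hidden state carries information across steps
    hidden_states.append(h)
```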

Loss Function

Assume we have a loss function \(L\) that depends on the outputs of the RNN at each time step. The total loss over the sequence is:

\[L = \sum_{t=1}^T L_t(y_t, \hat{y}_t)\]

where \(y_t\) is the true output and \(\hat{y}_t\) is the predicted output at time step \(t\).
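For concreteness, the same total loss in code, assuming a squared-error per-step loss \(L_t = \tfrac{1}{2}\lVert \hat{y}_t - y_t \rVert^2\) (the text leaves \(L_t\) unspecified):

```python
import numpy as np

def sequence_loss(y_true, y_pred):
    """Total loss L = sum over t of L_t, with an assumed squared-error L_t.

    y_true, y_pred: arrays of shape (T, output_dim).
    """
    per_step = 0.5 * np.sum((y_pred - y_true) ** 2, axis=-1)  # L_t for t = 1..T
    return per_step.sum()                                     # L
```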

Backpropagation Through Time

To train the RNN, we need to compute the gradients of the loss with respect to the weights \(W_h\) and \(W_x\). BPTT involves unfolding the RNN through time and applying the chain rule of calculus to compute these gradients.

  • Forward Pass: Compute the hidden states \(h_t\) and the outputs \(\hat{y}_t\) for \(t = 1, 2, \ldots, T\).

  • Backward Pass: Compute the gradients of the loss with respect to the hidden states and weights by propagating the error backwards through time.

  • The gradient of the loss with respect to the hidden state at time step \(t\) is:

  • \[\frac{\partial L}{\partial h_t} = \sum_{k=t}^T \frac{\partial L_k}{\partial h_t}\]

  • Using the chain rule, we can express this as:

  • \[\frac{\partial L_k}{\partial h_t} = \frac{\partial L_k}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial h_k} \frac{\partial h_k}{\partial h_t}\]

  • The gradient of the hidden state \(h_k\) with respect to \(h_t\) involves the recurrent connection:

  • \[\frac{\partial h_k}{\partial h_t} = \prod_{j=t+1}^k \frac{\partial h_j}{\partial h_{j-1}}\]

  • Finally, the gradients of the loss with respect to the weights are computed by summing the contributions from each time step (a code sketch of the full procedure follows this list):

  • \[\frac{\partial L}{\partial W_h} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_h}\]

  • \[\frac{\partial L}{\partial W_x} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_x}\]
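Putting the forward and backward passes together gives the sketch below. The linear readout \(\hat{y}_t = W_y h_t\) and the squared-error per-step loss are assumptions (the text does not specify the output layer); the gradient accumulation mirrors the sums and the Jacobian product above.

```python
import numpy as np

def bptt(xs, ys, W_h, W_x, W_y, b):
    """Forward pass, then backward pass through time.

    Assumptions (not from the text): linear readout y_hat_t = W_y h_t,
    squared-error loss L_t = 0.5 * ||y_hat_t - y_t||^2, and f = tanh.
    Returns the total loss and the gradients w.r.t. W_h and W_x.
    """
    T = len(xs)
    hidden_dim = W_h.shape[0]

    # Forward pass: store hidden states for reuse in the backward pass.
    hs = [np.zeros(hidden_dim)]            # h_0
    y_hats = []
    loss = 0.0
    for t in range(T):
        h = np.tanh(W_h @ hs[-1] + W_x @ xs[t] + b)
        hs.append(h)
        y_hat = W_y @ h
        y_hats.append(y_hat)
        loss += 0.5 * np.sum((y_hat - ys[t]) ** 2)

    # Backward pass: accumulate dL/dh_t = sum over k >= t of dL_k/dh_t.
    dW_h = np.zeros_like(W_h)
    dW_x = np.zeros_like(W_x)
    dh_next = np.zeros(hidden_dim)         # gradient flowing in from step t+1
    for t in reversed(range(T)):
        dy = y_hats[t] - ys[t]             # dL_t/dy_hat_t for squared error
        dh = W_y.T @ dy + dh_next          # local term plus term carried through time
        dz = (1.0 - hs[t + 1] ** 2) * dh   # through tanh: dh_t/dz_t = 1 - h_t^2
        dW_h += np.outer(dz, hs[t])        # contribution of step t to dL/dW_h
        dW_x += np.outer(dz, xs[t])        # contribution of step t to dL/dW_x
        dh_next = W_h.T @ dz               # propagate to h_{t-1} via the recurrent weights
    return loss, dW_h, dW_x
```

The line `dh_next = W_h.T @ dz` is where the product \(\prod_{j=t+1}^k \partial h_j / \partial h_{j-1}\) is realized: repeated multiplication by \(W_h^\top\) and the tanh derivative is exactly what makes gradients vanish or explode over long spans, the problem LSTMs and GRUs were designed to mitigate.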

A Plot