Lecture Notes on Sequence-to-Sequence Learning and Transformers
Introduction
This lecture continues the exploration of neural networks, transitioning from basic supervised learning models to advanced architectures that handle sequential data. It begins with a review of fundamental neural network types, including Single Layer Perceptrons (SLPs), Multi-Layer Perceptrons (MLPs), and Bayesian Neural Networks (BNNs). It then delves into Recurrent Neural Networks (RNNs), including early forms like Jordan and Elman networks and the more sophisticated Long Short-Term Memory (LSTM) networks, and their application to sequence data. The lecture highlights the limitations of traditional RNNs and LSTMs for complex sequence-to-sequence tasks like Machine Translation (MT), motivating the introduction of attention mechanisms and the Transformer architecture. Core concepts of sequence-to-sequence learning, encoder-decoder models, the revolutionary Transformer model, and its self-attention mechanism are explored. The main objective is to understand the evolution from RNN-based sequence models to the Transformer, emphasizing the advantages of attention and self-attention in overcoming the limitations of recurrence.
Review of Neural Network Fundamentals
Basic Neural Network Architectures
Single Layer Perceptron (SLP)
The Single Layer Perceptron (SLP) represents the simplest form of a neural network.
Multi-Layer Perceptron (MLP)
The Multi-Layer Perceptron (MLP) introduces non-linearity, enabling the model to solve more complex problems than the SLP.
Bayesian Neural Networks (BNN)
Bayesian Neural Networks (BNNs) offer a probabilistic approach to neural networks, although they were not covered in depth.
Recurrent Neural Networks (RNNs)
Early RNN Architectures: Jordan and Elman Networks
Early types of recurrent networks include Jordan and Elman networks.
Long Short-Term Memory (LSTM) Networks
Long Short-Term Memory (LSTM) networks are a more advanced type of RNN. LSTMs introduce the concept of long and short-term memory, allowing them to remember information over longer sequences. This makes them powerful for tasks such as natural language processing.
Training RNNs: Backpropagation Through Time (BPTT)
RNNs are trained using backpropagation through time (BPTT): the recurrent network is unfolded over the time steps of a sequence, and ordinary backpropagation is applied to the resulting unfolded graph. This is what makes gradient-based training on sequential data possible.
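As a rough illustration of unfolding, consider a minimal NumPy sketch (illustrative only, not code from the lecture; the function and parameter names are assumptions): the same cell parameters are reused at every time step of the unrolled graph.

```python
import numpy as np

def unroll_rnn(xs, h0, W, U, b):
    """Forward pass of a simple recurrent cell unfolded over a sequence.

    The same parameters W, U, b are applied at every time step; BPTT
    backpropagates through this unrolled chain of identical cells.
    """
    h, states = h0, []
    for x in xs:                          # one unfolded copy per time step
        h = np.tanh(W @ x + U @ h + b)    # shared weights across all steps
        states.append(h)
    return states
```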
Key Concepts and Challenges in RNNs
Architectural Rigidity: The basic RNN structure can be inflexible for very complex sequences.
Linear and Non-linear Components: RNNs use linear and non-linear components within the network layers to capture different types of relationships in the data.
Unfolding in Time: Unfolding the recurrent network over time is how backpropagation is applied to sequences.
Batch Processing and Teacher Forcing: Batch processing can be performed online or offline. Teacher Forcing (TF) is a common training technique for RNNs.
Time Complexity: The time complexity of training RNNs with BPTT is generally \(O(T \times K)\), where \(T\) is the sequence length and \(K\) represents the computational cost of processing a single time step, which depends on the network’s architecture (number of layers, units per layer).
Space Complexity: The space complexity is also influenced by the sequence length and the network’s size. During training, gradients and activations for each time step need to be stored, leading to a space complexity of \(O(T \times M)\), where \(M\) is the memory required to store activations and gradients for a single time step.
Sequence-to-Sequence Learning
Motivation: Machine Translation (MT) as a Sequence Problem
Challenges in Traditional Machine Translation
Machine Translation (MT) is a challenging task. Since around 1990, it has been recognized that expressing the mapping from one language to another in a program is difficult. A more indirect approach is to use an aligned corpus – many examples of sentence pairs that are mutual translations of each other – and require the machine to figure out the mapping for itself. This is where deep learning comes in.
Limitations of LSTMs for Complex MT Tasks
The deep learning techniques we have learned for natural-language tasks, e.g., LSTMs, by themselves are not sufficient for Machine Translation. We need more, specifically, attention mechanisms. Word-by-word translation is insufficient. Consider the example: "edited hansard number 1" and its French translation "hansard révisé numéro 1". For longer sentences like "this being the day on which parliament was convoked by proclamation of his excellency ..." and its French counterpart "parlement ayant été convoqué pour aujourd’hui, par proclamation de son excellence ...", it becomes even clearer that we need to read more than just the immediate preceding words. We need links between words across the sentence to understand the relationships and context for accurate translation.
The Sequence-to-Sequence (Seq2Seq) Paradigm
Encoder-Decoder Architecture
The core idea of the Sequence-to-Sequence (Seq2Seq) paradigm is to encode a sentence and then decode it. We take an input sentence, encode it into a representation, and then decode that representation into an output sentence, which is the translation. A typical Seq2Seq model consists of two RNNs: an encoder and a decoder. The encoder takes the input sentence word by word (e.g., "revised", "hansards", "number", "1", "STOP"). The "STOP" token signals the end of the input sequence. The encoder processes these words sequentially and produces an encoding.
The Role of the Internal State (Sentence Embedding)
This encoding is often called the internal state or sentence embedding. It’s a compressed representation of the entire input sentence. The decoder takes this internal state and generates the output sentence word by word (e.g., "hansards", "révisé", "numéro", "1", "STOP"). Again, "STOP" indicates the end of the output sentence.
Gated Recurrent Units (GRUs) as an Alternative to LSTMs
Historically, LSTMs were often used in this type of architecture, but in this setting they have largely been replaced by Gated Recurrent Unit (GRU) cells. GRUs are similar to LSTMs but have fewer gates and parameters, which often makes them more efficient. A simplified diagram of a GRU cell includes an update gate (\(z_t\)) and a reset gate (\(r_t\)).
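For reference, one common formulation of the GRU update, using the gate names above (conventions for which gate scales the old state vary slightly between presentations), is:
\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\]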
Simplifying the Problem for Initial Understanding
Focusing on Next-Word Prediction
Real MT is complicated by the fact that there is no single correct English translation of a particular French sentence. To simplify, we instead evaluate a program’s ability to predict the correct next English word given the correct previous word.
Evaluation Metric: BLEU Score and its Limitations
We’ll use the BLEU (Bilingual Evaluation Understudy) score for evaluation; BLEU measures n-gram overlap between the system output and reference translations, and it has well-known limitations (for example, it can penalize valid paraphrases of the reference).
Implementation Details of Seq2Seq Models
Backpropagation Through Time (BPTT) in Seq2Seq
The process of encoding and decoding involves backpropagation through time (unfolding), similar to regular RNNs.
Handling Variable Length Sequences: Window Size and Padding
Sequence length is often referred to as the window size. In Seq2Seq models, window size relates to the maximum length of the encoded/decoded sentence. For example, a maximum window size might be 12 or 13 words. If input sentences are shorter than the maximum window size, they are padded with a special "STOP" token to ensure consistent input length.
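A minimal sketch of this padding step (the helper name and the use of integer token ids are assumptions for illustration):

```python
def pad_to_window(token_ids, stop_id, window_size=13):
    """Truncate or pad a tokenized sentence so it is exactly window_size long."""
    ids = token_ids[:window_size]
    return ids + [stop_id] * (window_size - len(ids))

# e.g. a 4-word sentence plus its STOP id, padded out to the window size with STOP ids
```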
Managing Variable Scopes in TensorFlow
When writing a Seq2Seq MT program, it’s important to manage variable scopes to prevent naming conflicts.
TensorFlow Code Example: Encoder and Decoder
Here is a TensorFlow code example for implementing a Seq2Seq model, defining two variable scopes, "enc" for the encoder and "dec" for the decoder:
```python
with tf.variable_scope("enc"):
    F = tf.Variable(tf.random_normal((vfSz, embedSz), stddev=.1))
    embs = tf.nn.embedding_lookup(F, encIn)
    embs = tf.nn.dropout(embs, keepPrb)  # Regularization
    cell = tf.contrib.rnn.GRUCell(rnnSz)
    initState = cell.zero_state(bSz, tf.float32)
    encout, encState = tf.nn.dynamic_rnn(cell, embs, initial_state=initState)

with tf.variable_scope("dec"):
    E = tf.Variable(tf.random_normal((veSz, embedSz), stddev=.1))
    embs = tf.nn.embedding_lookup(E, decIn)
    embs = tf.nn.dropout(embs, keepPrb)
    cell = tf.contrib.rnn.GRUCell(rnnSz)
    decout, _ = tf.nn.dynamic_rnn(cell, embs, initial_state=encState)
```
Encoder Implementation Details
F = tf.Variable(tf.random_normal((vfSz,embedSz), stddev=.1)): Initializes the embedding variable for the French vocabulary (of size vfSz).
embs = tf.nn.embedding_lookup(F, encIn): Performs the French-word embedding, resulting in a 3D tensor of shape (batch size x window size x embedding size).
embs = tf.nn.dropout(embs, keepPrb): Applies dropout for regularization.
cell = tf.contrib.rnn.GRUCell(rnnSz): Instantiates the GRU cell.
initState = cell.zero_state(bSz, tf.float32): Initializes the initial state of the GRU cell to zero.
encout, encState = tf.nn.dynamic_rnn(cell, embs, initial_state=initState): Computes the per-step outputs and the final state using a dynamic RNN.
Decoder Implementation Details
The decoder is similar to the encoder but uses a separate embedding matrix E for the English vocabulary and is initialized with the final state of the encoder (encState).
Generating Logits and Probabilities from Decoder Output
After decoding, the decoder outputs are converted into probabilities over the English vocabulary. This involves a linear layer to obtain logits and then probabilities. The challenge is handling 3D logits (batch size x window size x vocabulary size).
```python
W = tf.Variable(tf.random_normal((rnnSz, veSz), stddev=.1))
b = tf.Variable(tf.random_normal((veSz,), stddev=.1))
logits = tf.tensordot(decout, W, axes=[[2], [0]]) + b
```
To solve this, a linear transformation with weight matrix W and bias b is used, and the logits are computed using tf.tensordot.
Calculating Sequence Loss
Finally, the loss is calculated using tf.contrib.seq2seq.sequence_loss:
```python
loss = tf.contrib.seq2seq.sequence_loss(logits, ans, tf.ones([bSz, wSz]))
```
This function calculates the sequence loss, which is a weighted sum of the losses over each time step in the sequence.
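Conceptually, with per-position weights \(w_{b,t}\) (here all ones) and the exact averaging depending on sequence_loss’s options, the loss has the form
\[
\text{loss} = \frac{\sum_{b=1}^{B}\sum_{t=1}^{T} w_{b,t}\,\big(-\log p(y_{b,t})\big)}{\sum_{b=1}^{B}\sum_{t=1}^{T} w_{b,t}},
\]
i.e., with the all-ones weight matrix passed above, an average cross-entropy over every position of every sentence in the batch.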
Alternative Seq2Seq Approach: Additive Summary
Using an Additive Approach for Sentence Representation
Instead of directly feeding the encoder summary to the decoder, an additive approach can be used. In this approach, the encoder outputs are added together to create a single summary vector. This summary vector is then used by the decoder.
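A hedged TensorFlow 1.x sketch of this additive variant, reusing the names from the encoder/decoder code above (illustrative, not the lecture’s exact code):

```python
# Sum the encoder outputs over the window (time) axis to get one summary
# vector per sentence, and use it in place of encState as the decoder's
# initial state.
encSummary = tf.reduce_sum(encout, axis=1)   # (bSz, wSz, rnnSz) -> (bSz, rnnSz)
decout, _ = tf.nn.dynamic_rnn(cell, embs, initial_state=encSummary)
```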
Limitations of this Simplified Approach
This approach is simpler but might lose some information by summarizing the entire input into a single vector.
Introduction to Attention Mechanisms
Motivation for Attention in Machine Translation
Both of the previously mentioned approaches, even with LSTMs or GRUs, have limitations, especially when dealing with long sequences. This is where attention mechanisms come into play. The core idea of attention is that everything matters, but not everything we encounter deserves the same level of attention. In machine translation, different words in a pair of translated sentences can influence other words more or less.
Weighted Encoder Outputs Based on Relevance
Instead of simple addition, we can weight the encoder outputs differently based on their relevance to the decoder’s current step.
We can use a weight matrix to represent these weights. This weight matrix is part of the attention scheme.
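A hedged sketch of what applying such a weight matrix might look like in the TensorFlow 1.x style used earlier (the names AT, encWSz, and decWSz are assumptions for illustration):

```python
# AT[i, j]: weight that decoder position j places on encoder state i.
AT = tf.Variable(tf.random_normal((encWSz, decWSz), stddev=.1))
# Normalize each column so the weights for one decoder position sum to 1.
wAT = tf.transpose(tf.nn.softmax(tf.transpose(AT)))
# encout: (bSz, encWSz, rnnSz) -> weighted: (bSz, decWSz, rnnSz)
weighted = tf.einsum('ij,bir->bjr', wAT, encout)
```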
Basic Position-Only Attention
In this simplified example, it’s a position-only attention scheme, meaning the attention weights depend only on the positions of the words. We are going to build an attention scheme where the attention an English word at position \(i\) pays to the state of the French encoding at position \(j\) depends only on \(i\) and \(j\). Generally, the closer \(i\) and \(j\), the greater the attention.
An imaginary weight matrix for a model in which the encoder (French) window size is 4 and the decoder window size is 3 is shown in Figure 5. Here, \(W_{i,j}\) is the weight given to the \(i\)-th French state when predicting the \(j\)-th English word. The weights used for any one English word form a column of the matrix, and each column is normalized to sum to 1. For example, for the first English word the weights on the four French states are 1/2, 1/6, 1/6, and 1/6.
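An illustrative matrix consistent with this description (a reconstruction, not necessarily the exact matrix in Figure 5) is:
\[
W =
\begin{pmatrix}
1/2 & 1/6 & 1/6 \\
1/6 & 1/2 & 1/6 \\
1/6 & 1/6 & 1/2 \\
1/6 & 1/6 & 1/6
\end{pmatrix}
\]
Rows index the four French encoder states, columns the three English decoder positions; each column sums to 1, and the largest weight in each column sits near the diagonal, reflecting that nearby positions attend to each other most.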
Time Complexity:
Without Attention: The time complexity for each step in a Seq2Seq model without attention is \(O(M \times N)\), where \(M\) is the size of the hidden state and \(N\) is the size of the vocabulary. For a sequence of length \(T\), the overall time complexity is \(O(T \times M \times N)\).
With Attention: The time complexity of adding an attention mechanism is \(O(T^2 \times M)\), as each output time step involves calculating attention scores over each input time step.
Space Complexity:
Without Attention: The space complexity is primarily determined by the size of the model’s parameters and the activations that need to be stored for backpropagation, which is \(O(M \times N)\).
With Attention: The space complexity increases due to the need to store attention weights, adding an additional \(O(T^2)\) space requirement.
Transformers and Self-Attention
The Transformer Model: A Paradigm Shift
Key Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Self-attention is a crucial concept that leads us to the Transformer architecture. The key innovation behind the Transformer is detailed in the paper "Attention Is All You Need" by Vaswani et al. (2017), presented at NIPS 2017. The authors are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, many from Google Brain and the University of Toronto.
Core Idea: Replacing Recurrence with Attention
The Transformer offers a new perspective on sequence modeling. It introduces a novel network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Advantages: Parallelization and Improved Training Efficiency
The Transformer allows for significantly more parallelization and can reach a new state-of-the-art in translation quality while requiring significantly less time to train. This is a significant advantage over recurrent models, which are inherently sequential and harder to parallelize. Instead of "memory" like RNNs, the Transformer relies solely on attention to draw global dependencies between input and output.
The Self-Attention Mechanism
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Applications of Self-Attention in Various Tasks
Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.
Detailed Architecture of the Transformer
Encoder and Decoder Stacks and their Components
The Transformer architecture consists of \(N_x\) encoder layers and \(N_x\) decoder layers, where \(N_x\) is typically 6 in the original paper. Each encoder layer contains:
Multi-Head Attention
Feed Forward
Add & Norm (applied after both Multi-Head Attention and Feed Forward)
Each decoder layer contains:
Masked Multi-Head Attention
Encoder-Decoder Attention
Feed Forward
Add & Norm (applied after each of the three components above)
Both the encoder and decoder also use Positional Encoding, Input Embedding, and Output Embedding. The decoder takes Outputs (shifted right) as input.
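The "Add & Norm" steps are the residual connection followed by layer normalization described in the paper: each sub-layer's output is
\[
\text{LayerNorm}\big(x + \text{Sublayer}(x)\big),
\]
where \(\text{Sublayer}(x)\) is the function implemented by the sub-layer itself (multi-head attention or the feed-forward network).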
The Role of Positional Encoding
Both the encoder and decoder use Positional Encoding to inject information about the position of words in the sequence.
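The paper uses fixed sinusoidal positional encodings, added to the embeddings at the bottom of both stacks:
\[
PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big), \qquad
PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big),
\]
where \(pos\) is the position in the sequence and \(i\) indexes the embedding dimension.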
Input and Output Embeddings
Both the encoder and decoder use Input Embedding and Output Embedding to convert words into vector representations.
Visualizing the Transformer: Jay Alammar’s Blog Post
High-Level Overview of the Transformer Model
For a more visual and intuitive explanation of the Transformer, refer to "The Illustrated Transformer" blog by Jay Alammar. This blog post provides a clear and visual explanation of the Transformer, starting with a high-level look at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another. For example:
INPUT: "Je suis étudiant" \(\rightarrow\) THE TRANSFORMER \(\rightarrow\) OUTPUT: "I am a student".
Detailed Explanation of Encoder and Decoder Components
Opening up the black box, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other). The decoding component is a stack of the same number of decoders. The encoders are all identical in structure (though they do not share weights). Each one is broken down into two sub-layers: a Self-Attention layer and a Feed-Forward Neural Network. The decoder has both of those layers, but between them is an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
In-depth Look at the Self-Attention Layer
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Self-attention becomes the central theme.
The Role of Feed-Forward Networks
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
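In the paper, this position-wise feed-forward network is two linear transformations with a ReLU in between:
\[
\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2.
\]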
Data Flow and Tensor Transformations
In NLP applications, we begin by turning each input word into a vector using an embedding algorithm. For example, for the sentence "Je suis étudiant", each word is embedded into a vector of size 512. The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Self-Attention Calculation: Query, Key, and Value Vectors
To calculate self-attention, the first step is to create three vectors from each of the encoder’s input vectors (embedding of each word): a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three learned matrices.
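In matrix form, stacking the input embeddings into a matrix \(X\) (one row per word), this step is
\[
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V,
\]
where \(W^Q\), \(W^K\), and \(W^V\) are the three learned matrices.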
Understanding Query, Key, and Value Vectors
Query, key, and value vectors are abstractions useful for calculating and thinking about attention. To calculate a score for self-attention, we take the dot product of the query vector with the key vector of the respective word we’re scoring.
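The paper's scaled dot-product attention completes this recipe: the scores are divided by \(\sqrt{d_k}\), passed through a softmax, and used to take a weighted sum of the value vectors, i.e. \(\text{Attention}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d_k})\,V\). A minimal NumPy sketch for a single sequence (the array names and sizes are illustrative, not from the lecture):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X: (T, d_model) input embeddings; Wq, Wk, Wv: (d_model, d_k) learned matrices.
    Returns a (T, d_k) matrix whose rows are attention-weighted sums of values.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T) scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

# Illustrative sizes: 3 words ("Je suis étudiant"), d_model = 512, d_k = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (3, 64)
```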
Time Complexity:
Self-Attention: The time complexity of self-attention is \(O(T^2 \times D)\), where \(T\) is the sequence length and \(D\) is the embedding dimension. This is because each position attends to every other position in the sequence.
Feed-Forward Network: The time complexity of the feed-forward network is \(O(T \times D^2)\), as it operates on each position independently with a complexity proportional to the square of the embedding dimension.
Overall: The overall time complexity for a single layer of the Transformer is dominated by the self-attention mechanism, resulting in \(O(T^2 \times D)\). For \(N_x\) layers, the complexity becomes \(O(N_x \times T^2 \times D)\).
Space Complexity:
Self-Attention: The space complexity is primarily determined by the attention weights, which require \(O(T^2)\) space.
Model Parameters: The space complexity for the model parameters is \(O(D^2)\), dominated by the embedding and feed-forward layers.
Overall: The overall space complexity is \(O(T^2 + D^2)\). For \(N_x\) layers, this becomes \(O(N_x \times (T^2 + D^2))\).
Conclusion
This lecture reviewed neural network fundamentals, transitioned into sequence-to-sequence learning, and introduced the Transformer model. The limitations of RNNs and LSTMs for complex tasks like machine translation were discussed, motivating the need for attention mechanisms. The sequence-to-sequence paradigm with encoder-decoder architectures was explained, along with implementation details using TensorFlow and an alternative additive approach. The Transformer model was introduced, highlighting its self-attention mechanism, parallelization capabilities, and improved training efficiency compared to traditional recurrent models. Key takeaways include understanding the evolution from RNN-based models to Transformers, the importance of attention and self-attention, and the architectural components of the Transformer model. For further study, consider exploring the mathematical details of positional encoding and multi-head attention in the Transformer, and investigate the TensorFlow or PyTorch implementations of these models. Questions to guide further exploration into advanced neural network architectures include: What are the computational costs associated with attention mechanisms compared to recurrent layers? How do different types of attention mechanisms (e.g., dot-product, additive) affect model performance?