Lecture Notes on Sequence-to-Sequence Learning and Transformers
Introduction
This lecture continues the exploration of neural networks, transitioning from basic supervised learning models to advanced architectures that handle sequential data. It begins with a review of fundamental neural network types, including Single Layer Perceptrons (SLPs), Multi-Layer Perceptrons (MLPs), and Bayesian Neural Networks (BNNs). It then delves into Recurrent Neural Networks (RNNs), including early forms like Jordan and Elman networks and the more sophisticated Long Short-Term Memory (LSTM) networks, and their application to sequence data. The lecture highlights the limitations of traditional RNNs and LSTMs for complex sequence-to-sequence tasks like Machine Translation (MT), motivating the introduction of attention mechanisms and the Transformer architecture. Core concepts of sequence-to-sequence learning, encoder-decoder models, the revolutionary Transformer model, and its self-attention mechanism are explored. The main objective is to understand the evolution from RNN-based sequence models to the Transformer, emphasizing the advantages of attention and self-attention in overcoming the limitations of recurrence.
Review of Neural Network Fundamentals
Basic Neural Network Architectures
Single Layer Perceptron (SLP)
The Single Layer Perceptron (SLP) represents the simplest form of a neural network.
Multi-Layer Perceptron (MLP)
The Multi-Layer Perceptron (MLP) introduces non-linearity, enabling the model to solve more complex problems than the SLP.
Bayesian Neural Networks (BNN)
Bayesian Neural Networks (BNNs) offer a probabilistic approach to neural networks, although they were not covered in depth.
Recurrent Neural Networks (RNNs)
Early RNN Architectures: Jordan and Elman Networks
Early types of recurrent networks include Jordan and Elman networks.
Long Short-Term Memory (LSTM) Networks
Long Short-Term Memory (LSTM) networks are a more advanced type of RNN. LSTMs introduce the concept of long and short-term memory, allowing them to remember information over longer sequences. This makes them powerful for tasks such as natural language processing.
Training RNNs: Backpropagation Through Time (BPTT)
RNNs are trained using backpropagation through time (BPTT): the recurrent network is unfolded over the time steps of a sequence, and ordinary backpropagation is applied to the resulting unfolded graph. This is what makes gradient-based training on sequential data possible.
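As a rough illustration of unfolding, consider a minimal NumPy sketch (illustrative only, not code from the lecture; the function and parameter names are assumptions): the same cell parameters are reused at every time step of the unrolled graph.

```python
import numpy as np

def unroll_rnn(xs, h0, W, U, b):
    """Forward pass of a simple recurrent cell unfolded over a sequence.

    The same parameters W, U, b are applied at every time step; BPTT
    backpropagates through this unrolled chain of identical cells.
    """
    h, states = h0, []
    for x in xs:                          # one unfolded copy per time step
        h = np.tanh(W @ x + U @ h + b)    # shared weights across all steps
        states.append(h)
    return states
```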
Key Concepts and Challenges in RNNs
Architectural Rigidity: The basic RNN structure can be inflexible for very complex sequences.
Linear and Non-linear Components: RNNs use linear and non-linear components within the network layers to capture different types of relationships in the data.
Unfolding in Time: Unfolding the recurrent network over time is how backpropagation is applied to sequences.
Batch Processing and Teacher Forcing: Batch processing can be performed online or offline. Teacher Forcing (TF) is a common training technique for RNNs.
Time Complexity: The time complexity of training RNNs with BPTT is generally \(O(T \times K)\), where \(T\) is the sequence length and \(K\) represents the computational cost of processing a single time step, which depends on the network’s architecture (number of layers, units per layer).
Space Complexity: The space complexity is also influenced by the sequence length and the network’s size. During training, gradients and activations for each time step need to be stored, leading to a space complexity of \(O(T \times M)\), where \(M\) is the memory required to store activations and gradients for a single time step.
Sequence-to-Sequence Learning
Motivation: Machine Translation (MT) as a Sequence Problem
Challenges in Traditional Machine Translation
Machine Translation (MT) is a challenging task. Since around 1990, it has been recognized that expressing the mapping from one language to another in a program is difficult. A more indirect approach is to use an aligned corpus – many examples of sentence pairs that are mutual translations of each other – and require the machine to figure out the mapping for itself. This is where deep learning comes in.
Limitations of LSTMs for Complex MT Tasks
The deep learning techniques we have learned for natural-language tasks, e.g., LSTMs, by themselves are not sufficient for Machine Translation. We need more, specifically, attention mechanisms. Word-by-word translation is insufficient. Consider the example: "edited hansard number 1" and its French translation "hansard révisé numéro 1". For longer sentences like "this being the day on which parliament was convoked by proclamation of his excellency ..." and its French counterpart "parlement ayant été convoqué pour aujourd’hui, par proclamation de son excellence ...", it becomes even clearer that we need to read more than just the immediate preceding words. We need links between words across the sentence to understand the relationships and context for accurate translation.
The Sequence-to-Sequence (Seq2Seq) Paradigm
Encoder-Decoder Architecture
The core idea of the Sequence-to-Sequence (Seq2Seq) paradigm is to encode a sentence and then decode it. We take an input sentence, encode it into a representation, and then decode that representation into an output sentence, which is the translation. A typical Seq2Seq model consists of two RNNs: an encoder and a decoder. The encoder takes the input sentence word by word (e.g., "revised", "hansards", "number", "1", "STOP"). The "STOP" token signals the end of the input sequence. The encoder processes these words sequentially and produces an encoding.
The Role of the Internal State (Sentence Embedding)
This encoding is often called the internal state or sentence embedding. It’s a compressed representation of the entire input sentence. The decoder takes this internal state and generates the output sentence word by word (e.g., "hansards", "révisé", "numéro", "1", "STOP"). Again, "STOP" indicates the end of the output sentence.
Gated Recurrent Units (GRUs) as an Alternative to LSTMs
Historically, LSTMs were often used in this type of architecture, but in this setting they have largely been replaced by Gated Recurrent Unit (GRU) cells. GRUs are similar to LSTMs but have fewer gates and parameters, which often makes them more efficient. A simplified diagram of a GRU cell includes an update gate (\(z_t\)) and a reset gate (\(r_t\)).
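For reference, one common formulation of the GRU update, using the gate names above (conventions for which gate scales the old state vary slightly between presentations), is:
\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\]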
Simplifying the Problem for Initial Understanding
Focusing on Next-Word Prediction
Real MT is complicated by the fact that there is no single correct English translation of a particular French sentence. To simplify, we instead evaluate a program’s ability to predict the correct next English word given the correct previous word.
Evaluation Metric: BLEU Score and its Limitations
We’ll use the BLEU (Bilingual Evaluation Understudy) score for evaluation; BLEU measures n-gram overlap between the system output and reference translations, and it has well-known limitations (for example, it can penalize valid paraphrases of the reference).
Implementation Details of Seq2Seq Models
Backpropagation Through Time (BPTT) in Seq2Seq
The process of encoding and decoding involves backpropagation through time (unfolding), similar to regular RNNs.
Handling Variable Length Sequences: Window Size and Padding
Sequence length is often referred to as the window size. In Seq2Seq models, window size relates to the maximum length of the encoded/decoded sentence. For example, a maximum window size might be 12 or 13 words. If input sentences are shorter than the maximum window size, they are padded with a special "STOP" token to ensure consistent input length.
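A minimal sketch of this padding step (the helper name and the use of integer token ids are assumptions for illustration):

```python
def pad_to_window(token_ids, stop_id, window_size=13):
    """Truncate or pad a tokenized sentence so it is exactly window_size long."""
    ids = token_ids[:window_size]
    return ids + [stop_id] * (window_size - len(ids))

# e.g. a 4-word sentence plus its STOP id, padded out to the window size with STOP ids
```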
Managing Variable Scopes in TensorFlow
When writing a Seq2Seq MT program, it’s important to manage variable scopes to prevent naming conflicts.
TensorFlow Code Example: Encoder and Decoder
Here is a TensorFlow code example for implementing a Seq2Seq model, defining two variable scopes, "enc" for the encoder and "dec" for the decoder:
```python
with tf.variable_scope("enc"):
    F = tf.Variable(tf.random_normal((vfSz, embedSz), stddev=.1))
    embs = tf.nn.embedding_lookup(F, encIn)
    embs = tf.nn.dropout(embs, keepPrb)  # Regularization
    cell = tf.contrib.rnn.GRUCell(rnnSz)
    initState = cell.zero_state(bSz, tf.float32)
    encout, encState = tf.nn.dynamic_rnn(cell, embs, initial_state=initState)

with tf.variable_scope("dec"):
    E = tf.Variable(tf.random_normal((veSz, embedSz), stddev=.1))
    embs = tf.nn.embedding_lookup(E, decIn)
    embs = tf.nn.dropout(embs, keepPrb)
    cell = tf.contrib.rnn.GRUCell(rnnSz)
    decout, _ = tf.nn.dynamic_rnn(cell, embs, initial_state=encState)
```
Encoder Implementation Details
F = tf.Variable(tf.random_normal((vfSz,embedSz), stddev=.1)): Initializes the embedding variable for the French vocabulary (of size vfSz).
embs = tf.nn.embedding_lookup(F, encIn): Performs the French-word embedding, resulting in a 3D tensor of shape (batch size x window size x embedding size).
embs = tf.nn.dropout(embs, keepPrb): Applies dropout for regularization.
cell = tf.contrib.rnn.GRUCell(rnnSz): Instantiates the GRU cell.
initState = cell.zero_state(bSz, tf.float32): Initializes the initial state of the GRU cell to zero.
encout, encState = tf.nn.dynamic_rnn(cell, embs, initial_state=initState): Computes the per-step outputs and the final state using a dynamic RNN.
Decoder Implementation Details
The decoder is similar to the encoder but uses a separate embedding matrix E for the English vocabulary and is initialized with the final state of the encoder (encState).
Generating Logits and Probabilities from Decoder Output
After decoding, the decoder outputs are converted into probabilities over the English vocabulary. This involves a linear layer to obtain logits and then probabilities. The challenge is handling 3D logits (batch size x window size x vocabulary size).
```python
W = tf.Variable(tf.random_normal((rnnSz, veSz), stddev=.1))
b = tf.Variable(tf.random_normal((veSz,), stddev=.1))
logits = tf.tensordot(decout, W, axes=[[2], [0]]) + b
```
To solve this, a linear transformation with weight matrix W and bias b is used, and the logits are computed using tf.tensordot.
Calculating Sequence Loss
Finally, the loss is calculated using tf.contrib.seq2seq.sequence_loss:
```python
loss = tf.contrib.seq2seq.sequence_loss(logits, ans, tf.ones([bSz, wSz]))
```
This function calculates the sequence loss, which is a weighted sum of the losses over each time step in the sequence.
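Conceptually, with per-position weights \(w_{b,t}\) (here all ones) and the exact averaging depending on sequence_loss’s options, the loss has the form
\[
\text{loss} = \frac{\sum_{b=1}^{B}\sum_{t=1}^{T} w_{b,t}\,\big(-\log p(y_{b,t})\big)}{\sum_{b=1}^{B}\sum_{t=1}^{T} w_{b,t}},
\]
i.e., with the all-ones weight matrix passed above, an average cross-entropy over every position of every sentence in the batch.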
Alternative Seq2Seq Approach: Additive Summary
Using an Additive Approach for Sentence Representation
Instead of directly feeding the encoder summary to the decoder, an additive approach can be used. In this approach, the encoder outputs are added together to create a single summary vector. This summary vector is then used by the decoder.
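A hedged TensorFlow 1.x sketch of this additive variant, reusing the names from the encoder/decoder code above (illustrative, not the lecture’s exact code):

```python
# Sum the encoder outputs over the window (time) axis to get one summary
# vector per sentence, and use it in place of encState as the decoder's
# initial state.
encSummary = tf.reduce_sum(encout, axis=1)   # (bSz, wSz, rnnSz) -> (bSz, rnnSz)
decout, _ = tf.nn.dynamic_rnn(cell, embs, initial_state=encSummary)
```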
Limitations of this Simplified Approach
This approach is simpler but might lose some information by summarizing the entire input into a single vector.
Introduction to Attention Mechanisms
Motivation for Attention in Machine Translation
Both of the previously mentioned approaches, even with LSTMs or GRUs, have limitations, especially when dealing with long sequences. This is where attention mechanisms come into play. The core idea of attention is that everything matters, but not everything we encounter deserves the same level of attention. In machine translation, different words in a pair of translated sentences can influence other words more or less.
Weighted Encoder Outputs Based on Relevance
Instead of simple addition, we can weight the encoder outputs differently based on their relevance to the decoder’s current step.
We can use a weight matrix to represent these weights. This weight matrix is part of the attention scheme.
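A hedged sketch of what applying such a weight matrix might look like in the TensorFlow 1.x style used earlier (the names AT, encWSz, and decWSz are assumptions for illustration):

```python
# AT[i, j]: weight that decoder position j places on encoder state i.
AT = tf.Variable(tf.random_normal((encWSz, decWSz), stddev=.1))
# Normalize each column so the weights for one decoder position sum to 1.
wAT = tf.transpose(tf.nn.softmax(tf.transpose(AT)))
# encout: (bSz, encWSz, rnnSz) -> weighted: (bSz, decWSz, rnnSz)
weighted = tf.einsum('ij,bir->bjr', wAT, encout)
```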
Basic Position-Only Attention
In this simplified example, it’s a position-only attention scheme, meaning the attention weights depend only on the positions of the words. We are going to build an attention scheme where the attention an English word at position \(i\) pays to the state of the French encoding at position \(j\) depends only on \(i\) and \(j\). Generally, the closer \(i\) and \(j\), the greater the attention.
An imaginary weight matrix for a model in which the encoder (French) window size is 4 and the decoder window size is 3 is shown in Figure 5. Here, \(W_{i,j}\) is the weight given to the \(i\)-th French state when predicting the \(j\)-th English word. The weights used for any one English word form a column of the matrix, and each column is normalized to sum to 1. For example, for the first English word the weights on the four French states are 1/2, 1/6, 1/6, and 1/6.
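An illustrative matrix consistent with this description (a reconstruction, not necessarily the exact matrix in Figure 5) is:
\[
W =
\begin{pmatrix}
1/2 & 1/6 & 1/6 \\
1/6 & 1/2 & 1/6 \\
1/6 & 1/6 & 1/2 \\
1/6 & 1/6 & 1/6
\end{pmatrix}
\]
Rows index the four French encoder states, columns the three English decoder positions; each column sums to 1, and the largest weight in each column sits near the diagonal, reflecting that nearby positions attend to each other most.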
Time Complexity:
Without Attention: The time complexity for each step in a Seq2Seq model without attention is \(O(M \times N)\), where \(M\) is the size of the hidden state and \(N\) is the size of the vocabulary. For a sequence of length \(T\), the overall time complexity is \(O(T \times M \times N)\).
With Attention: The time complexity of adding an attention mechanism is \(O(T^2 \times M)\), as each output time step involves calculating attention scores over each input time step.
Space Complexity:
Without Attention: The space complexity is primarily determined by the size of the model’s parameters and the activations that need to be stored for backpropagation, which is \(O(M \times N)\).
With Attention: The space complexity increases due to the need to store attention weights, adding an additional \(O(T^2)\) space requirement.
Transformers and Self-Attention
The Transformer Model: A Paradigm Shift
Key Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Self-attention is a crucial concept that leads us to the Transformer architecture. The key innovation behind the Transformer is detailed in the paper "Attention Is All You Need" by Vaswani et al. (2017), presented at NIPS 2017. The authors are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, many from Google Brain and the University of Toronto.
Core Idea: Replacing Recurrence with Attention
The Transformer offers a new perspective on sequence modeling. It introduces a novel network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Advantages: Parallelization and Improved Training Efficiency
The Transformer allows for significantly more parallelization and can reach a new state-of-the-art in translation quality while requiring significantly less time to train. This is a significant advantage over recurrent models, which are inherently sequential and harder to parallelize. Instead of "memory" like RNNs, the Transformer relies solely on attention to draw global dependencies between input and output.
The Self-Attention Mechanism
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Applications of Self-Attention in Various Tasks
Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.
Detailed Architecture of the Transformer
Encoder and Decoder Stacks and their Components
The Transformer architecture consists of \(N_x\) encoder layers and \(N_x\) decoder layers, where \(N_x\) is typically 6 in the original paper. Each encoder layer contains:
Multi-Head Attention
Feed Forward
Add & Norm (applied after both Multi-Head Attention and Feed Forward)
Each decoder layer contains:
Masked Multi-Head Attention
Encoder-Decoder Attention
Feed Forward
Add & Norm (applied after each of the three components above)
Both the encoder and decoder also use Positional Encoding, Input Embedding, and Output Embedding. The decoder takes Outputs (shifted right) as input.
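The "Add & Norm" steps are the residual connection followed by layer normalization described in the paper: each sub-layer's output is
\[
\text{LayerNorm}\big(x + \text{Sublayer}(x)\big),
\]
where \(\text{Sublayer}(x)\) is the function implemented by the sub-layer itself (multi-head attention or the feed-forward network).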
The Role of Positional Encoding
Both the encoder and decoder use Positional Encoding to inject information about the position of words in the sequence.
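The paper uses fixed sinusoidal positional encodings, added to the embeddings at the bottom of both stacks:
\[
PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big), \qquad
PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big),
\]
where \(pos\) is the position in the sequence and \(i\) indexes the embedding dimension.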
Input and Output Embeddings
Both the encoder and decoder use Input Embedding and Output Embedding to convert words into vector representations.
Visualizing the Transformer: Jay Alammar’s Blog Post
High-Level Overview of the Transformer Model
For a more visual and intuitive explanation of the Transformer, refer to "The Illustrated Transformer" blog by Jay Alammar. This blog post provides a clear and visual explanation of the Transformer, starting with a high-level look at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another. For example:
INPUT: "Je suis étudiant" \(\rightarrow\) THE TRANSFORMER \(\rightarrow\) OUTPUT: "I am a student".
Detailed Explanation of Encoder and Decoder Components
Opening up the black box, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other). The decoding component is a stack of the same number of decoders. The encoders are all identical in structure (though they do not share weights). Each one is broken down into two sub-layers: a Self-Attention layer and a Feed-Forward Neural Network. The decoder has both of those layers, but between them is an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
In-depth Look at the Self-Attention Layer
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Self-attention becomes the central theme.
The Role of Feed-Forward Networks
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
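In the paper, this position-wise feed-forward network is two linear transformations with a ReLU in between:
\[
\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2.
\]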
Data Flow and Tensor Transformations
In NLP applications, we begin by turning each input word into a vector using an embedding algorithm. For example, for the sentence "Je suis étudiant", each word is embedded into a vector of size 512. The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Self-Attention Calculation: Query, Key, and Value Vectors
To calculate self-attention, the first step is to create three vectors from each of the encoder’s input vectors (embedding of each word): a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three learned matrices.
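In matrix form, stacking the input embeddings into a matrix \(X\) (one row per word), this step is
\[
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V,
\]
where \(W^Q\), \(W^K\), and \(W^V\) are the three learned matrices.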
Understanding Query, Key, and Value Vectors
Query, key, and value vectors are abstractions useful for calculating and thinking about attention. To calculate a score for self-attention, we take the dot product of the query vector with the key vector of the respective word we’re scoring.
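The paper's scaled dot-product attention completes this recipe: the scores are divided by \(\sqrt{d_k}\), passed through a softmax, and used to take a weighted sum of the value vectors, i.e. \(\text{Attention}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d_k})\,V\). A minimal NumPy sketch for a single sequence (the array names and sizes are illustrative, not from the lecture):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X: (T, d_model) input embeddings; Wq, Wk, Wv: (d_model, d_k) learned matrices.
    Returns a (T, d_k) matrix whose rows are attention-weighted sums of values.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T) scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

# Illustrative sizes: 3 words ("Je suis étudiant"), d_model = 512, d_k = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (3, 64)
```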
Time Complexity:
Self-Attention: The time complexity of self-attention is \(O(T^2 \times D)\), where \(T\) is the sequence length and \(D\) is the embedding dimension. This is because each position attends to every other position in the sequence.
Feed-Forward Network: The time complexity of the feed-forward network is \(O(T \times D^2)\), as it operates on each position independently with a complexity proportional to the square of the embedding dimension.
Overall: The overall time complexity for a single layer of the Transformer is dominated by the self-attention mechanism, resulting in \(O(T^2 \times D)\). For \(N_x\) layers, the complexity becomes \(O(N_x \times T^2 \times D)\).
Space Complexity:
Self-Attention: The space complexity is primarily determined by the attention weights, which require \(O(T^2)\) space.
Model Parameters: The space complexity for the model parameters is \(O(D^2)\), dominated by the embedding and feed-forward layers.
Overall: The overall space complexity is \(O(T^2 + D^2)\). For \(N_x\) layers, this becomes \(O(N_x \times (T^2 + D^2))\).
Conclusion
This lecture reviewed neural network fundamentals, transitioned into sequence-to-sequence learning, and introduced the Transformer model. The limitations of RNNs and LSTMs for complex tasks like machine translation were discussed, motivating the need for attention mechanisms. The sequence-to-sequence paradigm with encoder-decoder architectures was explained, along with implementation details using TensorFlow and an alternative additive approach. The Transformer model was introduced, highlighting its self-attention mechanism, parallelization capabilities, and improved training efficiency compared to traditional recurrent models. Key takeaways include understanding the evolution from RNN-based models to Transformers, the importance of attention and self-attention, and the architectural components of the Transformer model. For further study, consider exploring the mathematical details of positional encoding and multi-head attention in the Transformer, and investigate the TensorFlow or PyTorch implementations of these models. Questions to guide further exploration into advanced neural network architectures include: What are the computational costs associated with attention mechanisms compared to recurrent layers? How do different types of attention mechanisms (e.g., dot-product, additive) affect model performance?