Lecture Notes on Transformer Models and BERT
Introduction
This lecture introduces Transformer models, a groundbreaking architecture in the field of Natural Language Processing (NLP). We will explore the Transformer architecture, focusing on its encoder component as used in BERT (Bidirectional Encoder Representations from Transformers). The lecture aims to provide a detailed understanding of how these models work, starting from the foundational "Attention is All You Need" paper to the architecture of BERT and its training methodologies. By the end of this lecture, and the beginning of the next, you should gain a solid understanding of the principles behind models like ChatGPT and other modern language processing systems.
Transformer Models
The "Attention is All You Need" Paper
Impact and Citations
The foundation of Transformer models is rooted in the highly influential paper "Attention is All You Need," presented by Google. This paper introduced the Transformer architecture, a truly exciting system that has become a cornerstone of modern Natural Language Processing (NLP). Its impact is vividly demonstrated by its enormous citation count, which grows daily, underscoring its profound relevance and influence within the research community. Initially, this groundbreaking architecture was designed for machine translation, where it achieved state-of-the-art results in translating text between languages.
Transformer Architecture Overview
Encoder and Decoder Structure
The Transformer architecture is elegantly structured into two primary components that work in concert: the encoder and the decoder.
Encoder: The encoder’s role is to meticulously process the input sequence, transforming it into a comprehensive and nuanced representation. It is built from a stack of multiple encoder blocks, each contributing to a deeper understanding of the input.
Decoder: Conversely, the decoder leverages the rich encoded representation produced by the encoder to generate the desired output sequence. For instance, in machine translation, the output would be the translated text in the target language. Mirroring the encoder, the decoder is also constructed from a series of decoder blocks.
Within these encoder and decoder blocks, self-attention mechanisms and feed-forward networks are integral, enabling the model to capture complex patterns and dependencies in the data.
Application to Machine Translation
The Transformer model was originally conceived and engineered for the task of machine translation. In this application, the encoder ingests text in a source language and meticulously processes it to distill its underlying meaning. Following this, the decoder takes over, utilizing the encoded representation to construct a semantically equivalent text in the target language. This innovative architecture marked a significant leap forward in translation quality, outperforming previous models like LSTMs, particularly when handling lengthy sequences of text where long-range dependencies are crucial.
Sequential Processing of Output
When applied to tasks like machine translation, the Transformer’s output generation process shares conceptual similarities with Recurrent Neural Networks (RNNs) such as LSTMs, especially in its sequential nature. However, the Transformer diverges significantly in its approach to dependency handling, primarily through the attention mechanism. The decoder initiates output generation by predicting the first word, often starting with a beginning-of-sentence token. Subsequent words are then predicted one by one, conditioned on the encoder’s processed input and the words already generated. This iterative process continues until the entire translated sentence is formed. Throughout this decoding phase, the information encoded by the encoder is continuously consulted, ensuring that the generated translation remains contextually faithful to the original input.
Evolution of Language Models
Building upon Word2Vec
The trajectory of language model development has been significantly shaped by earlier innovations, notably Word2Vec. As previously discussed, Word2Vec provided a method for creating word embeddings—vector representations of words in a continuous space. These embeddings are designed to capture semantic relationships, placing semantically similar words closer together in the vector space. Transformer models represent an evolutionary step beyond this, learning contextualized word representations. This means they move past the static, context-independent embeddings of models like Word2Vec, allowing for a more nuanced understanding of language.
Leading to BERT, BART, Llama, and ChatGPT
The advent of the Transformer architecture has been a catalyst for a wave of advanced language models, each building upon its principles and pushing the boundaries of NLP. Key models in this lineage include:
BERT (Bidirectional Encoder Representations from Transformers): BERT’s architecture primarily focuses on the encoder component of the Transformer. It is expertly designed for tasks requiring a deep understanding of word context within sentences, excelling in nuanced language comprehension.
BART (Bidirectional and Auto-Regressive Transformer): BART strategically combines a bidirectional encoder, similar to BERT, with an autoregressive decoder, akin to GPT. This hybrid design makes it exceptionally versatile for sequence-to-sequence tasks, effectively handling both input understanding and coherent output generation.
Llama (Large Language Model Meta AI): Developed by Meta, Llama is an open-source large language model that leverages the Transformer architecture’s power and flexibility. It is designed to be accessible and adaptable for a wide range of research and application purposes.
ChatGPT: Perhaps one of the most widely recognized Transformer-based models, ChatGPT is a conversational AI developed by OpenAI. It is built upon the Transformer architecture, with a particular emphasis on the decoder component, and is rigorously fine-tuned for generating human-like dialogue.
These models collectively represent a substantial leap in the field of NLP. They offer markedly improved performance across a spectrum of language-related tasks compared to their predecessors. Tracing their origins back to the foundational concept of Word2Vec, they illustrate the rapid and continuous progress in the domain of language modeling and understanding.
BERT and the Encoder
Focus on the Encoder Component
BERT (Bidirectional Encoder Representations from Transformers) distinguishes itself by primarily employing the encoder component of the Transformer architecture. Unlike the original Transformer, which utilizes both an encoder and a decoder, BERT’s architecture is centered around a deep, bidirectional encoder. This design choice is pivotal, enabling BERT to effectively learn contextual word representations. By processing the entire input sentence simultaneously, BERT considers both the preceding (left) and succeeding (right) context of each word, leading to a more nuanced understanding of language.
Input and Output Dimensionality
Preserving Dimensionality Through Layers
A notable characteristic of the BERT encoder is its consistent dimensionality throughout its layers. This means that if the input embeddings are of a specific dimension, say 300, the output embeddings from each subsequent encoder block will retain this same dimensionality. This approach contrasts with some neural network architectures where a reduction in dimensionality is common as data progresses through deeper layers. In BERT, maintaining dimensionality allows for consistent feature representation across different levels of processing.
Stacking Encoder Blocks
Varying the Number of Blocks
BERT models are constructed by sequentially stacking multiple encoder blocks. The number of these blocks is a key factor determining the model’s size and capacity. For example, the base configuration, BERT-base, is built with 12 encoder blocks, while the larger BERT-large incorporates 24 blocks. Generally, increasing the number of encoder blocks enhances the model’s ability to learn more intricate and abstract representations from the data. However, this also leads to a larger model size, demanding greater computational resources and training time. Extending this concept to even more advanced models, contemporary systems such as the GPT-3 family behind ChatGPT stack a significantly larger number of blocks, on the order of 96, depending on the specific model size and the demands of the application.
Fixed Input Length
Handling Variable Length Sentences with Truncation and Padding
In its standard configuration, BERT is designed to process inputs of a fixed length. This implies a predefined maximum limit on the number of tokens (words or sub-word units) that the model can handle at once. To accommodate sentences of varying lengths, two primary techniques are used:
Truncation: When an input sentence exceeds the maximum allowable length, it is truncated. Typically, this truncation occurs from the end of the sentence, effectively discarding the later parts of the text to fit within the model’s input capacity.
Padding: Conversely, if a sentence is shorter than the fixed input length, it is padded. This is commonly achieved by appending special "[PAD]" tokens to the end of the sentence until it reaches the required length. Padding ensures that all input sequences fed into the BERT model are of uniform length.
This fixed-length approach contrasts with Recurrent Neural Networks (RNNs) like LSTMs, which are inherently capable of processing variable-length inputs without explicit truncation or padding. In practice, however, padding is often applied to LSTM inputs as well, so that batches can be processed efficiently.
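As a minimal illustration of these two techniques, the following plain-Python sketch (assuming a hypothetical maximum length of 8 and a "[PAD]" token string) truncates or pads a token list at the end:

```python
def pad_or_truncate(tokens, max_len=8, pad_token="[PAD]"):
    """Force a token list to a fixed length: truncate at the end, or append [PAD] tokens."""
    if len(tokens) > max_len:
        return tokens[:max_len]                               # truncation: drop the tail
    return tokens + [pad_token] * (max_len - len(tokens))     # padding: fill to max_len

print(pad_or_truncate(["Hello", ",", "how", "are", "you", "?"]))
# ['Hello', ',', 'how', 'are', 'you', '?', '[PAD]', '[PAD]']
```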
Matrix Representation of Input
The input to the BERT encoder is structured as a matrix. Given a fixed input length \(L\) and an embedding dimension \(D\), the resulting input matrix has dimensions \(L \times D\). In this matrix, each row corresponds to a token from the input sequence, and each element within a row represents a component of that token’s embedding vector. For instance, with the standard maximum sequence length of 512 tokens and the running example of 300-dimensional embeddings (BERT-base actually uses 768), the input matrix has size \(512 \times 300\). This matrix serves as the fundamental input structure for the BERT encoder blocks to process.
Independent Weights in Each Encoder Block
A key architectural feature of BERT is that each encoder block operates with its own set of independent weights. This means that the parameters learned within one encoder block are not shared with other blocks in the network. This design is conceptually similar to Convolutional Neural Networks (CNNs), where each convolutional layer typically employs independent filters to capture different features. The use of independent weights in each BERT encoder block allows different layers to specialize in learning distinct levels of abstraction and feature representations from the input data. This contributes to the model’s capacity to capture complex patterns and hierarchical features in language.
Encoder Input Processing
Tokenization of Text
Tokenization Process and Punctuation
The initial step in preparing text for BERT processing is tokenization. This process involves segmenting the input text into meaningful units called tokens. Tokens can be words, sub-word units, or even individual characters, depending on the tokenizer used.
Tokenization Details: Sentences are parsed and divided into a sequence of tokens. Crucially, punctuation marks are treated as distinct tokens. For example, the sentence "Hello, how are you?" is tokenized into ["Hello", ",", "how", "are", "you", "?"]. Each word and punctuation mark becomes a separate element in the token sequence.
Semantic Importance of Punctuation: In early NLP practices, punctuation was sometimes disregarded as noise and removed during preprocessing. However, modern approaches, especially in models like BERT, recognize the critical role of punctuation in conveying semantics. For instance, a question mark "?" fundamentally alters the meaning of a sentence, transforming a statement into a question. Similarly, commas, periods, and other punctuation marks contribute to the grammatical structure and semantic interpretation of text. Therefore, preserving punctuation during tokenization is essential for capturing the full meaning of the input.
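For the tokenization step just described, here is a minimal sketch using a regular expression that keeps punctuation as separate tokens. Note that real BERT tokenizers operate on sub-word (WordPiece) units, so this is only an approximation of the idea:

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
```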
Numericalization of Tokens
Vocabulary Creation and Token Mapping
Following tokenization, the next step is to convert each token into a numerical format that the model can process. This is achieved through numericalization, which relies on a vocabulary dictionary.
Vocabulary Dictionary Construction: A vocabulary dictionary is constructed from the training data. This dictionary is essentially a mapping that assigns a unique integer index to each distinct token encountered in the dataset. For example, the dictionary might include entries like "A": 1, "D": 2, "hello": 34, ",": 90, "how": 15, and so forth. This dictionary serves as the model’s lexicon, encompassing all the tokens it is designed to recognize.
Token to Index Mapping: Once the vocabulary dictionary is established, each token in the tokenized input sequence is replaced by its corresponding index from the dictionary. This transformation converts the text from a sequence of strings into a sequence of integers. For instance, the token sequence ["Hello", ",", "how", "are", "you", "?"] would be transformed into a numerical sequence like [34, 90, 15, ..., ...] based on the indices defined in the vocabulary. These numerical indices act as placeholders for the actual token representations that will be used in the model.
The vocabulary dictionary is inherently domain-specific. A model trained on medical texts will have a different vocabulary compared to one trained on legal documents or general web content. The vocabulary is tailored to the specific domain of the training data to ensure effective processing of relevant terminology.
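A minimal sketch of vocabulary construction and token-to-index mapping is shown below; the special tokens and the resulting indices are illustrative and do not correspond to BERT’s actual WordPiece vocabulary:

```python
def build_vocab(tokenized_corpus, specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Assign a unique integer index to each special token and each distinct token in the corpus."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in tokenized_corpus:
        for tok in sentence:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def numericalize(tokens, vocab):
    """Replace each token with its vocabulary index; unknown tokens map to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

corpus = [["Hello", ",", "how", "are", "you", "?"]]
vocab = build_vocab(corpus)
print(numericalize(["Hello", ",", "how", "are", "you", "?"], vocab))
# e.g. [5, 6, 7, 8, 9, 10]
```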
Word Embeddings
From Token Indices to Vector Representations
The numerical indices obtained from tokenization and numericalization are further converted into dense vector representations known as word embeddings. These embeddings are crucial as they transform discrete token indices into continuous vector spaces where semantic relationships can be modeled.
Word Embeddings as Dense Vectors: Each token index is mapped to a high-dimensional vector. For example, if the token "how" has an index of 15, it might be represented by a vector with 300 dimensions. These vectors are designed to capture the semantic attributes of the tokens.
Initial vs. Learned Embeddings: Initially, pre-trained word embeddings, such as those generated by Word2Vec, could be considered as a starting point. These pre-trained embeddings offer a way to initialize the model with some prior knowledge of word semantics. However, BERT is designed to learn its own word embeddings from scratch during the training process. These learned embeddings are contextualized, meaning they are not static but are adjusted based on the context in which the words appear in the training data. BERT’s training process refines these embeddings to be optimally suited for its specific architecture and objectives, often surpassing the effectiveness of static pre-trained embeddings like Word2Vec for tasks BERT is designed for. While one might conceptually start with Word2Vec embeddings, BERT’s training is independent and will learn embeddings that are more tailored and effective for its architecture.
After tokenization, numericalization, and embedding, the original sentence is transformed into a matrix of embedding vectors. This matrix, where each row represents a word embedding, becomes the input to the BERT encoder.
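As a sketch of this lookup step in PyTorch (the vocabulary size and the 300-dimensional embeddings are illustrative running values; BERT-base actually uses 768 dimensions):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30_000, 300               # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_dim)   # one learnable vector per vocabulary index

token_ids = torch.tensor([[34, 90, 15, 42, 77, 8]])    # one sentence as 6 token indices
X = embedding(token_ids)                                # embedding matrix, shape (1, 6, 300)
print(X.shape)
```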
Padding for Uniform Sequence Length
BERT operates on fixed-length input sequences. Therefore, when processing batches of sentences, it is necessary to ensure that all sentences have the same length. Padding is the technique used to achieve this.
Padding Procedure: For sentences that are shorter than the predefined maximum sequence length, special "[PAD]" tokens are appended to the end of the sequence until it reaches the required length.
Padding Position Configuration: While padding can technically be added at the beginning or end of a sequence, the convention, and often the default configuration for models like BERT, is to apply padding at the end of the sequences.
Ensuring Batch Homogeneity: Padding is essential for efficient batch processing in deep learning. By making all input sequences in a batch the same length, it allows for straightforward matrix operations and parallel computation, which are critical for training large models like BERT. Without padding, processing variable-length sequences in batches would be significantly more complex and less efficient.
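A sketch of batch padding that also records which positions hold real tokens and which hold "[PAD]" (this attention mask is reused later by the masking step inside the encoder); index 0 is assumed to be reserved for the padding token:

```python
import torch

def pad_batch(sequences, pad_id=0):
    """Pad variable-length index sequences to the longest one; return ids and a 1/0 mask."""
    max_len = max(len(ids) for ids in sequences)
    padded, mask = [], []
    for ids in sequences:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return torch.tensor(padded), torch.tensor(mask)

input_ids, attention_mask = pad_batch([[34, 90, 15, 2, 7, 11], [12, 5, 9]])
print(input_ids.shape)    # torch.Size([2, 6])
print(attention_mask)     # tensor([[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]])
```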
Positional Encoding
Encoding Token Order in Transformers
Unlike Recurrent Neural Networks (RNNs) that inherently process sequential data in order, Transformer models, including BERT, process all tokens in the input sequence simultaneously. To provide the model with information about the position of each token in the sequence, positional encoding is employed.
Initial Positional Encoding Methods: In the original Transformer paper, positional encodings were introduced as fixed vectors added to the word embeddings. These encodings were based on sine and cosine functions of different frequencies, designed to create unique patterns for each position. A simpler, initial conceptual approach considered was to use straightforward numerical indices (1, 2, 3, ...) to represent the position of each word in the sequence.
Purpose of Positional Information: Positional encoding vectors are added element-wise to the word embeddings. This sum combines the semantic meaning of the word (from the word embedding) with information about its location in the sentence (from the positional encoding). By incorporating positional information, the model can differentiate between words not only by their meaning but also by their position in the input sequence, which is crucial for understanding sentence structure and word order.
Evolution and Refinement: While the sine and cosine-based positional encodings from the original Transformer are widely adopted and effective, simpler numerical positional encodings were considered in early explorations. However, these simpler methods were found to be less robust and lacked a strong theoretical basis compared to the trigonometric encodings. The NLP community has continually explored and refined positional encoding techniques to optimize performance and efficiency. Modern Transformers predominantly use the more sophisticated sine and cosine encodings or learned positional embeddings, which are learned during training rather than being fixed.
By combining word embeddings with positional encodings, the input to the BERT encoder effectively represents both the meaning of each token and its sequential position within the input text, providing all necessary information for the Transformer layers to process language effectively.
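A sketch of the sine/cosine positional encoding from the original paper, following the published formulas \(PE_{(pos, 2i)} = \sin(pos / 10000^{2i/D})\) and \(PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/D})\); the sequence length and dimension below are the running illustrative values:

```python
import torch

def sinusoidal_positional_encoding(seq_len, dim):
    """Fixed positional encodings: sines on even dimensions, cosines on odd dimensions."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (L, 1)
    div_term = 10000.0 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # frequency per pair
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions / div_term)
    pe[:, 1::2] = torch.cos(positions / div_term)
    return pe

pe = sinusoidal_positional_encoding(seq_len=512, dim=300)   # added element-wise to the embeddings
print(pe.shape)                                             # torch.Size([512, 300])
```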
Multi-Head Attention Mechanism
Input Splitting and Linear Projections
Query, Key, and Value Matrices
The multi-head attention mechanism is at the heart of the Transformer architecture, enabling it to weigh the importance of different parts of the input sequence when processing information. The process begins by transforming the input embedding matrix into three distinct matrices: Queries (Q), Keys (K), and Values (V). For each encoder block, the input matrix is derived from the output of the preceding block, or from the initial input embeddings in the case of the first block. In the initial step of the attention mechanism, the input matrix is taken and linearly projected into these three matrices.
Linear Transformations to Project into Different Feature Subspaces
To create the Query, Key, and Value matrices, the input matrix undergoes separate linear transformations.
Linear Projection Layers: Three distinct linear layers are employed, each with its own weight matrix, to perform these transformations. Let \(X\) represent the input matrix. The transformations are mathematically expressed as: \[\begin{aligned} Q &= X W_Q \\ K &= X W_K \\ V &= X W_V \end{aligned}\] Here, \(W_Q\), \(W_K\), and \(W_V\) are the weight matrices specific to the linear layers for Queries, Keys, and Values, respectively. These matrices are learnable parameters, adjusted during the training process to optimize the model’s performance.
Intuition Behind Projection: The purpose of these linear transformations is to project the input embeddings into different representation subspaces. This is conceptually similar to viewing an object from multiple perspectives to understand it better. By projecting the input into Query, Key, and Value spaces, the attention mechanism can explore different facets of the relationships between tokens. Analogous to projecting a face with different lighting to highlight various features, these projections allow the model to discern diverse patterns and dependencies within the input data. If the linear projections were identical, the attention mechanism might become redundant, potentially collapsing into a less effective state where it fails to capture diverse relationships. However, by initializing with random weights and training, the different projections are highly unlikely to remain the same, and the multi-head attention is designed to benefit from these diverse perspectives.
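A minimal PyTorch sketch of these three projections, using the running 300-dimensional example (in practice the projections are computed per attention head, as described next):

```python
import torch
import torch.nn as nn

D = 300                                  # embedding dimension (illustrative)
W_Q = nn.Linear(D, D, bias=False)        # learnable projection for Queries
W_K = nn.Linear(D, D, bias=False)        # learnable projection for Keys
W_V = nn.Linear(D, D, bias=False)        # learnable projection for Values

X = torch.randn(1, 6, D)                 # (batch, sequence length, D) input embeddings
Q, K, V = W_Q(X), W_K(X), W_V(X)         # three different "views" of the same input
print(Q.shape, K.shape, V.shape)         # each: torch.Size([1, 6, 300])
```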
Multi-Head Structure: Parallel Attention Layers
Concurrent Attention Heads for Capturing Richer Context
The "multi-head" aspect of the attention mechanism involves running the scaled dot-product attention process multiple times in parallel. Each of these parallel runs is termed an "attention head," and they operate independently using different learned linear projections.
Number of Heads (\(H\)): The number of attention heads, denoted as \(H\), is a hyperparameter of the model. The original Transformer used \(H = 8\); BERT-base uses 12 heads and BERT-large uses 16. The entire multi-head attention mechanism consists of \(H\) such attention heads operating in parallel.
Capturing Diverse Relationships: Each attention head, due to its unique set of projection weight matrices (\(W_Q, W_K, W_V\)), is capable of learning to focus on different types of relationships and dependencies within the input sequence. This parallel structure allows the multi-head attention mechanism to capture a more comprehensive and nuanced understanding of the context. By aggregating information from multiple, diverse attention heads, the model can discern richer patterns than would be possible with a single attention mechanism.
To simplify the explanation, we will initially focus on the operation within a single attention head before discussing how the outputs from all heads are combined.
Scaled Dot-Product Attention: Core Mechanism
Computing Attention Scores via Dot Product Similarity
Within each attention head, the scaled dot-product attention mechanism is the fundamental operation. It is designed to calculate attention scores, which quantify the relevance or similarity between each pair of tokens in the input sequence. These scores determine how much each token should be emphasized when representing other tokens.
Dot Product for Similarity: The attention scores are calculated by performing a dot product between the Query matrix \(Q\) and the Key matrix \(K\). Mathematically, this is represented as the matrix multiplication \(Q {K}^\mathsf{T}\). The resulting matrix from this multiplication contains scores where each element \((i, j)\) indicates the degree of similarity or relevance between the \(i\)-th token (represented by its query vector) and the \(j\)-th token (represented by its key vector).
Interpretation of Dot Product: The dot product serves as a measure of similarity in vector space. A larger dot product between two vectors implies greater alignment and, in this context, indicates a stronger relationship or higher relevance between the corresponding tokens. Conversely, a smaller or negative dot product suggests less similarity or relevance.
Scaling for Stable Gradients and Masking for Padding
Following the computation of dot products, two critical operations are applied to refine the attention scores: scaling and masking.
Scaling by \(\sqrt{d_k}\): The raw dot product scores are scaled down by dividing them by \(\sqrt{d_k}\), where \(d_k\) is the dimensionality of the key vectors. This scaling step is crucial for stabilizing gradients during the training process. Without scaling, dot products can become excessively large, especially with higher dimensional vectors, leading to very small gradients after the softmax function (as softmax becomes very peaky), which in turn slows down learning. Scaling mitigates this issue by ensuring that the gradients remain in a more manageable range.
Masking for Padding Tokens: Masking is applied to handle padding tokens, which are artificially added to make input sequences uniform in length. Padding tokens should be ignored by the attention mechanism as they do not represent actual content. The masking process involves applying a mask matrix to the scaled scores. This mask effectively sets the attention scores for padding tokens to \(-\infty\) before the softmax normalization. By setting these scores to a very large negative number, their contribution to the softmax output becomes virtually zero, ensuring that padding tokens do not influence the attention distribution over the valid tokens in the sequence.
Let \(d_k\) denote the dimension of the key vectors. The scaled dot-product attention scores matrix \(S\) is calculated as: \[S = \frac{Q {K}^\mathsf{T}}{\sqrt{d_k}}\] After calculating \(S\), the masking operation, denoted as \(Mask(S)\), is applied to exclude padding tokens from influencing the attention mechanism.
Softmax Normalization: From Scores to Probabilities
The next step is to transform the scaled and masked attention scores into a probability distribution using the softmax function.
Row-wise Softmax: The softmax function is applied row-wise to the scaled and masked score matrix \(S\). For each row \(i\) in \(S\), softmax is applied across all columns.
Probability Distribution Output: The softmax function normalizes the scores in each row into a probability distribution. For each query token \(i\), the resulting softmax output in row \(i\) represents the attention weights assigned to all key tokens. These weights indicate the distribution of attention from the \(i\)-th query token to all key tokens in the sequence. Crucially, after softmax, each row of the resulting attention weights matrix sums to 1, ensuring it is a valid probability distribution.
The attention weights matrix \(A\) is obtained by applying the softmax function to the masked and scaled scores: \[A = \text{softmax}(Mask(S))\] where \(Mask(S)\) represents the matrix of scaled scores after masking has been applied.
Contextualized Output via Weighted Value Summation
The final operation within the scaled dot-product attention mechanism is to compute a weighted sum of the Value matrix \(V\), using the attention weights derived from the softmax normalization.
Weighted Summation: The attention weights matrix \(A\) is multiplied by the Value matrix \(V\) to produce the output: \(A V\). This matrix multiplication is the core step in creating contextualized representations.
Generating Contextualized Representations: The result of \(A V\) is the output matrix of the attention head. Each row \(i\) in this output matrix is a contextualized representation of the \(i\)-th token from the input sequence. This representation is formed by aggregating the value vectors of all tokens in the input sequence, weighted by their corresponding attention scores from row \(i\) of matrix \(A\). In essence, for each token, the attention mechanism computes a new vector representation that is a weighted combination of all token representations in the sequence. The weights are determined by the attention scores, reflecting the relevance of each token to the current token being represented. Consider an extreme, simplified scenario: if for the word "hello," the attention scores are such that only "you" gets a weight of 1 and all other words get 0, then the new representation for "hello" becomes identical to the representation of "you." In a more typical and nuanced situation, the representation of each word becomes a weighted blend of the representations of all words in its context, with weights determined by the learned attention mechanism.
The output of a single attention head is thus given by: \[\text{Attention Head Output} = A V = \text{softmax}\left(Mask\left(\frac{Q {K}^\mathsf{T}}{\sqrt{d_k}}\right)\right) V\]
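The following sketch implements this single-head computation in PyTorch, including the padding mask from earlier (a 1/0 vector per sentence); the tensor sizes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, attention_mask=None):
    """Scores = QK^T / sqrt(d_k); mask padding with -inf; softmax row-wise; weight the values."""
    d_k = K.size(-1)
    S = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)     # (batch, L, L) scaled scores
    if attention_mask is not None:
        # attention_mask: (batch, L), 1 for real tokens, 0 for [PAD]; mask the key dimension
        S = S.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
    A = F.softmax(S, dim=-1)         # each row sums to 1: attention weights for one query token
    return torch.matmul(A, V)        # contextualized representations: weighted sums of value vectors

Q = K = V = torch.randn(1, 6, 300)
mask = torch.tensor([[1, 1, 1, 1, 0, 0]])                   # last two positions are padding
print(scaled_dot_product_attention(Q, K, V, mask).shape)    # torch.Size([1, 6, 300])
```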
Concatenation of Heads and Output Projection
Combining and Reducing Dimensionality of Multi-Head Outputs
After the parallel computation of attention outputs from all \(H\) attention heads, these outputs need to be combined and projected to maintain consistent dimensionality across the Transformer layers.
Concatenation of Outputs: The output matrices from all \(H\) attention heads are concatenated along their feature dimension. If each attention head produces an output matrix with a feature dimension of \(d_v\), then concatenating the outputs from \(H\) heads results in a combined matrix with a feature dimension of \(H \times d_v\). This concatenation step aggregates the diverse contextual information captured by each of the parallel attention heads.
Dimensionality Reduction via Output Projection: To ensure that the output of the multi-head attention mechanism has the same dimensionality as its input (and thus maintain dimensional consistency for stacking encoder blocks), a final linear projection is applied. This projection reduces the concatenated high-dimensional output back to the original input dimension.
Linear Projection Layer: A linear layer with a weight matrix \(W_O\) is used for this dimensionality reduction. Let \(C\) be the matrix resulting from the concatenation of all attention head outputs. The final output of the multi-head attention layer is then computed as \(C W_O\). This linear transformation projects the rich, concatenated features back into the desired output dimension, preparing it for subsequent layers in the Transformer encoder.
This final projection step is essential for integrating the multi-head attention mechanism into the broader Transformer architecture, allowing for the seamless stacking of encoder blocks and consistent information flow throughout the network.
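Putting the pieces together, here is a sketch of a full multi-head self-attention layer with the concatenation and the output projection \(W_O\); the dimension of 300 and the choice of 12 heads are illustrative (BERT-base uses 768 dimensions and 12 heads):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Sketch of multi-head self-attention: H parallel heads, concatenation, output projection W_O."""

    def __init__(self, dim=300, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = dim // num_heads                 # per-head feature dimension
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.W_O = nn.Linear(dim, dim, bias=False)  # projects the concatenation back to dim

    def forward(self, X, attention_mask=None):
        B, L, _ = X.shape

        def split(t):  # reshape (B, L, dim) -> (B, H, L, d_k) so each head attends separately
            return t.view(B, L, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split(self.W_Q(X)), split(self.W_K(X)), split(self.W_V(X))
        S = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)    # (B, H, L, L)
        if attention_mask is not None:
            S = S.masked_fill(attention_mask[:, None, None, :] == 0, float("-inf"))
        A = F.softmax(S, dim=-1)
        heads = torch.matmul(A, V)                                        # (B, H, L, d_k)

        concat = heads.transpose(1, 2).contiguous().view(B, L, -1)        # (B, L, H * d_k)
        return self.W_O(concat)                                           # back to (B, L, dim)

mha = MultiHeadAttention()
out = mha(torch.randn(2, 6, 300), torch.tensor([[1] * 6, [1, 1, 1, 1, 0, 0]]))
print(out.shape)   # torch.Size([2, 6, 300])
```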
Training the BERT Model
Masked Language Model (MLM)
Objective: Learning Contextual Word Representations
The core objective of MLM is to enable BERT to learn rich, contextual representations of words by training the model to predict intentionally masked words within a sentence, forcing it to understand the surrounding context.
Masking Strategy: Random Word Obscuration
In the MLM task, a crucial step is the random masking of words within input sentences.
Random Masking Percentage: During pre-training, a certain percentage of tokens in each input sequence are randomly selected to be masked. Specifically, 15% of the words in each input sentence are chosen for masking.
Mask Token Replacement: The selected words are not simply removed; instead, they are replaced with a special token, typically "[MASK]". This token signals to the model that it needs to predict the original word at this position.
Example Masking: Consider the sentence: "Hello, how are you?". Applying the masking strategy might transform it into: "Hello, how [MASK] you?". Here, the word "are" has been masked and replaced by the "[MASK]" token.
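A simplified sketch of the masking step (the real BERT recipe additionally replaces some of the selected tokens with random words or leaves them unchanged, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly select ~15% of tokens, replace them with [MASK], and remember the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)        # the model must recover this original token
        else:
            masked.append(tok)
            targets.append(None)       # unmasked position: no prediction target
    return masked, targets

masked, targets = mask_tokens(["Hello", ",", "how", "are", "you", "?"])
print(masked)   # e.g. ['Hello', ',', 'how', '[MASK]', 'you', '?']
```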
Prediction Task and Contextual Learning
The primary task for BERT in MLM is to accurately predict the original word that was masked, based solely on the context provided by the unmasked words in the sentence.
Vocabulary Probability Distribution: For each masked position in the input sequence, the BERT model is tasked with predicting the original token. To do this, the model outputs a probability distribution over the entire vocabulary. This distribution represents the model’s confidence in each word of the vocabulary being the correct replacement for the "[MASK]" token.
Classification Layer for Prediction: To facilitate this prediction, a classification layer is added on top of the final encoder output for each masked token position. This layer typically consists of a linear transformation followed by a softmax function. The linear layer projects the encoder’s output into the vocabulary space, and the softmax function converts these projections into the probability distribution over the vocabulary.
Loss Function and Weight Updates: The training objective for MLM is to minimize the cross-entropy loss between the predicted probability distribution and the actual original word. If BERT correctly predicts the masked word (i.e., assigns a high probability to the correct word), the loss is low, and minimal adjustments to the model’s weights are needed. Conversely, if the prediction is incorrect, the loss is higher, prompting the optimization algorithm (like Adam) to update the model’s weights via backpropagation. These weight updates are aimed at improving the model’s ability to predict masked words in future iterations, thereby enhancing its understanding of contextual relationships between words.
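A sketch of this prediction head and loss computation in PyTorch, with illustrative sizes and a made-up target index:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim = 30_000, 300              # illustrative sizes
mlm_head = nn.Linear(hidden_dim, vocab_size)      # projects encoder output into vocabulary space

encoder_output = torch.randn(1, 6, hidden_dim)    # stand-in for the final encoder block's output
masked_position = 3                               # suppose the 4th token was [MASK]
logits = mlm_head(encoder_output[:, masked_position, :])   # (1, vocab_size) scores over vocabulary

target_id = torch.tensor([1234])                  # vocabulary index of the original masked word
loss = F.cross_entropy(logits, target_id)         # softmax + cross-entropy against the true token
loss.backward()                                   # gradients drive the weight updates (e.g. via Adam)
```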
The MLM task is instrumental in enabling BERT to learn bidirectional contextual representations. To accurately predict a masked word, the model must consider the context from both directions – both the words preceding and following the masked word. This forces BERT to develop a deep understanding of word context. A significant advantage of MLM is its unsupervised nature. It can be trained on vast amounts of readily available text data from the internet, such as Wikipedia or books, without the need for manual annotations or labels. This allows for efficient pre-training on massive datasets, leveraging the abundance of unlabeled text data.
Next Sentence Prediction (NSP)
Objective: Understanding Sentence Relationships
The primary goal of NSP is to equip BERT with the ability to understand relationships between sentences, particularly whether two sentences are consecutive in a text.
Sentence Pair Input with Special Delimiters
For the NSP task, the input to BERT is structured as pairs of sentences, rather than single sentences in isolation.
Paired Sentences as Input: During NSP training, BERT is fed pairs of sentences. These pairs are constructed to represent two scenarios: sentences that are consecutive in the original text and sentences that are unrelated.
Special Tokens for Sentence Boundary Detection: To enable BERT to distinguish between the two sentences in a pair, special tokens are used. The "[CLS]" token is inserted at the very beginning of the first sentence, and the "[SEP]" token is used to separate the first sentence from the second sentence. For example, a sentence pair might be formatted as: "[CLS] My dog is cute [SEP] he likes playing". The "[CLS]" token is particularly important as its final hidden state is often used to represent the aggregate representation of the entire input sequence for classification tasks.
Binary Classification: Consecutive or Random Sentences
The core task in NSP is a binary classification problem. BERT must predict whether the second sentence in a given pair is the actual sentence that immediately follows the first sentence in a corpus, or if it is a randomly chosen sentence from elsewhere in the corpus.
IsNextSentence or NotNextSentence Classification: NSP is framed as a binary classification task where the model must classify each sentence pair into one of two categories: "IsNextSentence" if the second sentence is indeed consecutive to the first, or "NotNextSentence" if the second sentence is a random sentence unrelated to the first.
Dataset Creation Strategy: To generate training data for NSP, sentence pairs are sampled from a large text corpus.
Positive Samples (IsNextSentence): For positive examples, pairs of consecutive sentences are extracted directly from the corpus. The label for these pairs is "IsNextSentence", indicating they are naturally following sentences.
Negative Samples (NotNextSentence): For negative examples, the first sentence is taken from the corpus, but the second sentence is randomly selected from a different part of the corpus, ensuring it is unrelated to the first sentence. The label for these pairs is "NotNextSentence", indicating they are not consecutive.
Objective: Global Context and Coherence: The primary objective of NSP is to train BERT to understand inter-sentence relationships and discourse coherence. By predicting whether sentences are consecutive, the model is encouraged to capture higher-level contextual information that spans across sentences. This task forces the model to consider a more global context, beyond just the words within a single sentence, and to understand the flow of text and relationships between different parts of a document. Successfully performing NSP requires the model to encode and compare information from both sentences to determine if they logically follow each other, thus enhancing its ability to process and understand coherent text.
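A toy sketch of this dataset construction (a real implementation draws from a large corpus and ensures that a negative sample is not accidentally the true next sentence):

```python
import random

def make_nsp_pairs(sentences, num_pairs=4):
    """Build NSP pairs: ~half consecutive ('IsNextSentence'), ~half random ('NotNextSentence')."""
    pairs = []
    for _ in range(num_pairs):
        i = random.randrange(len(sentences) - 1)
        if random.random() < 0.5:
            second, label = sentences[i + 1], "IsNextSentence"           # positive sample
        else:
            second, label = random.choice(sentences), "NotNextSentence"  # negative sample
        pairs.append(("[CLS] " + sentences[i] + " [SEP] " + second, label))
    return pairs

corpus = ["My dog is cute", "He likes playing", "The market fell sharply", "Rain is expected today"]
for text, label in make_nsp_pairs(corpus):
    print(label, "->", text)
```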
Training from Randomly Initialized Embeddings
Bootstrapping Language Understanding from Randomness
BERT’s training process is remarkable in that it starts from a state of almost complete ignorance about language. Initially, both the word embeddings and the weights of the neural network components within BERT are randomly initialized.
Iterative Refinement of Embeddings and Weights
The learning process in BERT is iterative, involving repeated cycles of prediction and adjustment based on the MLM and NSP tasks.
Random Initialization Phase: At the outset, all parameters of the BERT model, including word embeddings and the weights in linear layers of the Transformer encoder, are set to random values. This means the model initially has no pre-existing knowledge of language.
Iterative Update Mechanism: Through the MLM and NSP training tasks, BERT undergoes iterative updates to its embeddings and network weights. In each iteration:
Prediction and Evaluation: BERT processes input data (sentences for MLM, sentence pairs for NSP) and makes predictions (masked word prediction or next sentence prediction). The model’s performance is evaluated by comparing its predictions to the actual values.
Backpropagation and Optimization: If the model’s prediction deviates from the correct answer, a loss is calculated. This loss signal is then used to adjust the model’s weights and embedding vectors through backpropagation and optimization algorithms, such as Adam. These adjustments are designed to reduce the prediction error in subsequent iterations.
Vocabulary and Initial Random Vectors: To begin the process, a vocabulary dictionary is constructed, listing all the unique tokens the model will handle. For each token in this dictionary, a random vector is assigned as its initial word embedding. These random vectors are the starting point for learning meaningful semantic representations.
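A tiny sketch of this starting point: the embedding table below is created with random values, and only the MLM and NSP training signal described above gradually turns those random vectors into meaningful representations:

```python
import torch
import torch.nn as nn

vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, ",": 3, "how": 4}          # toy vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)  # randomly initialized

# Before training, the vector for "hello" is just noise; gradient updates gradually refine it.
print(embedding(torch.tensor([vocab["hello"]]))[0, :5])
```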
Emergence of Contextualized Representations
Through the extensive iterative training process, BERT progressively refines its initially random word embeddings and model weights, gradually learning to encode nuanced aspects of language.
Progressive Embedding Refinement: The initial random word embeddings are not static; they are dynamically adjusted and refined throughout training. As BERT is exposed to vast amounts of text data and performs MLM and NSP tasks, the embeddings are updated to better capture semantic and contextual information. For instance, if BERT processes sentences where "Apple" appears as a company name, the embedding for "Apple" will be modified to reflect this context. Later, when it encounters "Apple" in the context of fruit, the embedding will be further refined to differentiate between these different senses based on the surrounding words.
Contextualization of Word Embeddings: A key outcome of this training process is that the learned word embeddings become highly contextualized. Unlike static word embeddings (e.g., from Word2Vec) where each word has a single, fixed vector representation regardless of context, BERT’s embeddings are dynamic. The representation of a word varies depending on the other words in the sentence. This contextual sensitivity is a major advantage, allowing BERT to understand polysemy and subtle semantic nuances in language.
Continuous Dynamic Updates: The word embeddings in BERT are not learned in isolation but are continuously updated as the model processes more and more text data. Every time a word is encountered in a new context during training, its embedding representation is potentially adjusted. This ongoing dynamic update process enables BERT to learn highly adaptable and context-aware word representations, which are crucial for its strong performance in various NLP tasks.
This dynamic and iterative learning paradigm, starting from random initializations, is what empowers BERT to develop a deep understanding of language, ultimately enabling it to generate high-quality, contextualized word representations that are effective for a wide range of natural language processing applications.
Conclusion
In this lecture, we explored the Transformer architecture and delved into the specifics of BERT, focusing on its encoder component. We discussed the evolution from the "Attention is All You Need" paper to the development of BERT and related models like ChatGPT. Key aspects covered include the architecture of the BERT encoder, input processing techniques such as tokenization, numericalization, embedding, padding, and positional encoding. We also examined the multi-head attention mechanism in detail, including the roles of Queries, Keys, and Values, scaled dot-product attention, and the multi-head structure. Finally, we discussed the training methodologies for BERT, specifically the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, and how BERT learns contextualized word representations starting from random embeddings.
Important Remarks and Key Takeaways:
Attention is Key: The attention mechanism is the core innovation in Transformer models, enabling them to effectively capture relationships between words in a sentence.
Contextualized Embeddings: BERT learns contextualized word embeddings, meaning the representation of a word changes based on its surrounding context, a significant advancement over static embeddings.
Unsupervised Training: BERT is pre-trained using unsupervised tasks (MLM and NSP) on large text corpora, allowing it to learn rich language representations without extensive manual labeling.
Encoder Focus: BERT primarily utilizes the encoder part of the Transformer, making it highly effective for understanding language and various downstream tasks.
Next Steps
In the upcoming sessions, we will build upon this foundational knowledge to explore the decoder component of the Transformer and its applications in models like GPT and ChatGPT. Understanding the nuances of these architectures will further clarify how systems like ChatGPT operate. We will also have a practical laboratory session focusing on LSTMs to solidify your understanding of sequence models and provide a comparative perspective against the Transformer architecture.
Follow-up Questions and Topics for the Next Lecture:
How does the decoder part of the Transformer work, and how is it used in models like GPT and ChatGPT?
What are the architectural differences between BERT and GPT, and why are they suited for different natural language tasks?
How are Transformer models fine-tuned for specific downstream applications after pre-training?
What are the computational demands and strategies for scaling the training and deployment of large Transformer models?
Laboratory Activity: Hands-on session with LSTMs for practical understanding and comparison with Transformers.