Lecture Notes on Word Embeddings and Recurrent Neural Networks
Introduction
This lecture introduces the fundamental concepts of word embeddings and recurrent neural networks (RNNs) as essential components in modern deep learning for natural language processing. We will explore how to model sequences of words, represent words as vectors, and use neural networks to predict the probability of word sequences. The lecture covers \(n\)-gram models, word embeddings, the architecture of RNNs, Long Short-Term Memory (LSTM) networks, and techniques to handle overfitting and improve model performance.
Language Modeling
The Goal: Modeling Sequences of Words
The primary goal is to build a language model that can understand and generate human language. This involves assigning probabilities to sequences of words, effectively modeling the likelihood of different sentences occurring in a language.
Probability Distributions over Sentences
We aim to model the probability distribution over sentences. While creating a probability distribution over individual words is relatively straightforward (by counting word frequencies and normalizing), language modeling requires understanding the relationships and dependencies between words in a sentence. A crucial application of this is in tasks like translation, where generating probable and contextually relevant sentences is essential.
A probability distribution over strings is important for language modeling. We are not just interested in individual words, but in the probability of entire sentences.
Conditional Probability and the Chain Rule
Basic Conditional Probability
Conditioning is a fundamental concept in statistics. The basic formula for conditional probability is given by:
\[\Prm(A|B) = \frac{\Prm(A \cap B)}{\Prm(B)}\] or equivalently, \[\Prm(A \cap B) = \Prm(A|B) \Prm(B)\]
Here, \(\Prm(A|B)\) represents the probability of event \(A\) occurring given that event \(B\) has already occurred. \(\Prm(A)\) is the prior probability of \(A\), and \(\Prm(A|B)\) is the posterior probability of \(A\) after observing \(B\).
Chain Rule for Joint Probability
The chain rule extends conditional probability to sequences of events. For a sequence of events \(A_0, A_1, \ldots, A_m\), the probability of their joint occurrence is: \[\Prm(A_0, A_1, \ldots, A_m) = \Prm(A_0) \Prm(A_1|A_0) \Prm(A_2|A_0, A_1) \cdots \Prm(A_m|A_0, \ldots, A_{m-1})\]
This rule is essential for decomposing the probability of a sentence into a product of conditional probabilities.
Interpreting the Conditioning Event
In \(\Prm(A|B)\), \(B\) represents the evidence or the condition that we are considering. It is crucial to understand that \(\Prm(A|B)\) is the probability of \(A\) given the evidence \(B\), and not the probability of some combined event ‘AB’. There is no event ‘AB’.
It is important to note that \(\Prm(A|B)\) can be high even if \(\Prm(A)\) is low. This means that the likelihood of \(A\) given \(B\) can be significant even if \(A\) is generally rare. For example, the word ‘mat’ might be uncommon in general text, but after the sequence ‘the cat sat on the’, the probability of ‘mat’ appearing next is considerably higher.
Both \(\Prm(A|B)\) and \(\Prm(B|A)\) are meaningful and depend on the context and the information we want to extract, not necessarily on chronological order or causality.
Notation for Word Sequences
We represent a sequence of words as \(E_{1,m} = (E_1, \ldots, E_m)\), where each \(E_i\) is a random variable representing the word at position \(i\). We can think of commas as representing ‘and’. For example, for the sentence "We live in a small world", an instance \(e_{1,m}\) would be (We, live, in, a, small, world), where \(m=6\) is the length of the sentence.
We are interested in the probability of a specific sentence, such as \(\Prm(E_{1,n} = e_{1,n})\), which is the probability of the sequence of words \(e_{1,n}\). For instance, \(\Prm(\text{We, live, in, a, small, world})\). Using the chain rule, this can be written as:
\[\Prm(\text{world} | \text{We, live, in, a, small}) \times \Prm(\text{small} | \text{We, live, in, a}) \times \cdots \times \Prm(\text{live} | \text{We}) \times \Prm(\text{We})\]
Note that the factors above are listed in reverse order (the last word first); reading right to left, each word is predicted from the words that precede it. In general, the probability of a sentence is:
\[\Prm(E_{1,n} = e_{1,n}) = \prod_{j=1}^{n} \Prm(E_j = e_j | E_{1,j-1} = e_{1,j-1})\]
Here, \(\prod\) denotes the product from \(j=1\) to \(n\) of the conditional probability of the \(j\)-th word given the prefix sentence. This framework forms the basis for language modeling.
Tokenization and Vocabulary
Handling Out-of-Vocabulary Words
Tokenization is the process of breaking down text into individual words or tokens. This can be complex due to punctuation and contractions. For simplicity, we assume a tokenization method is available.
To manage the complexity of natural language, we typically cap the vocabulary size. The number of possible words in a language is vast, making it impractical to model probabilities for every word. We select a vocabulary \(\V\) of a fixed size, say 40,000 words, where \(|\V|\) is the vocabulary size.
Words that are not included in our vocabulary are considered out-of-vocabulary (OOV) words. These are represented by a special unknown-word token, written here as \(\langle\text{UNK}\rangle\). If a word is encountered that is not in \(\V\), it is replaced with this token.
The Penn Treebank Corpus (PTB)
We use the Penn Treebank Corpus (PTB), which contains approximately 1,000,000 words from Wall Street Journal news articles. The PTB is already tokenized, but not "unked" in the original data, meaning unknown words are not pre-marked with \(\langle\text{UNK}\rangle\). The vocabulary size is around 50,000 words. PTB is a "treebank" because its sentences are parsed into grammatical trees, but we focus solely on the words and replace words occurring fewer than 10 times with \(\langle\text{UNK}\rangle\).
The Challenge of Long-Range Dependencies
A significant issue is that conditioning on long prefixes is difficult. Because we are multiplying probabilities (each less than 1), the probability of a sentence decreases with its length, and long prefixes are also hard to model reliably: most specific long prefixes occur rarely, if at all, in the training data (data sparsity).
To address this, we need to reduce the span of conditioning for probability computation. Instead of conditioning on the entire prefix, we can condition only on the last few words. This approach leads to models like \(n\)-gram models.
N-gram Models
Bigram Model Approximation
A bigram model simplifies probability estimation by approximating the probability of a word using only the immediately preceding word. That is, we approximate \(\Prm(E_j = e_j | E_{1,j-1} = e_{1,j-1})\) by \(\Prm(E_j = e_j | E_{j-1} = e_{j-1})\). This simplification makes probability computations far more manageable.
Model Construction Using Frequencies
Constructing a language model involves defining a probability distribution. We typically use frequencies from training data to estimate these probabilities.
Super-trivial Model: Assumes every word is equally probable, i.e., \(1/|\V|\). This model is not very useful in practice.
Trivial Model: Counts the occurrences of each word and uses maximum likelihood estimation. This is still context-agnostic.
Bigram Model: Estimates probabilities based on pairs of consecutive words. This is more reliable and can be generalized using deep learning techniques; a count-based sketch follows below.
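As a concrete illustration of the count-based (maximum likelihood) bigram model, the following sketch estimates \(\Prm(w_2|w_1)\) from a tokenized training corpus; the toy corpus and variable names are purely illustrative.

from collections import Counter

def bigram_probs(tokens):
    # Count unigrams and adjacent word pairs in the training data.
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    # Maximum likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1).
    return {(w1, w2): c / unigram_counts[w1] for (w1, w2), c in bigram_counts.items()}

corpus = "we live in a small world and we live well".split()
probs = bigram_probs(corpus)
print(probs[("we", "live")])   # 1.0: 'live' follows 'we' every time in this toy corpus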
Limitations of Simple Models
Simple models like unigram and bigram models have limitations in capturing long-range dependencies and semantic nuances in language. To overcome these, we turn to more sophisticated methods like neural networks and word embeddings.
Word Embeddings
Motivation: Representing Words as Vectors
To use deep learning for language modeling, we need to represent words numerically. Word embeddings are vectors of floating-point numbers that serve as representations of words. These embeddings act as features for language models, enabling neural networks to process and understand textual data.
Word Embeddings as Features for Language Models
Word embeddings transform words into a continuous vector space where words with similar meanings are located closer to each other. These vectors capture semantic and syntactic information, making them powerful features for various NLP tasks, including language modeling.
Creating Word Embeddings
Indexing the Vocabulary
The first step in creating word embeddings is to index the vocabulary. This involves assigning a unique integer index to each word in our vocabulary \(\V\). This allows us to refer to words by their numerical indices.
The Embedding Matrix
Next, we create an embedding matrix \(E\). This matrix has dimensions \(|\V| \times e\), where \(|\V|\) is the vocabulary size and \(e\) is the embedding size (a hyperparameter, e.g., 20, 100, or 1000). Each row in \(E\) corresponds to a word in the vocabulary, and the row itself is the word’s embedding vector. The values in \(E\) are initially random and are learned during the training process.
Neural Network Architecture for Word Prediction
Input: Word Index
The input to our neural network is the index of a word, say \(e_i\).
Embedding Lookup
We perform an embedding lookup to retrieve the embedding vector \(\mathbf{x}\) corresponding to the input word index from the embedding matrix \(E\).
Linear Layer and Softmax Output
The embedding vector \(\mathbf{x}\) is then fed into a neural network. A simple architecture could consist of a single linear layer with weights \(W\) and bias \(b\), followed by a softmax activation function \(\sigma\).
\[\text{logits} = \mathbf{x}W + b\] \[\mathbf{y} = \sigma(\text{logits})\]
where \(\mathbf{y}\) is the output vector representing the probability distribution over the vocabulary.
Output as a Probability Distribution
The output \(\mathbf{y}\) is a probability distribution over the vocabulary \(\V\). It represents the predicted probabilities for each word in \(\V\) being the next word in the sequence, given the input word. We aim for this distribution to predict the correct next word with high probability.
Training and Backpropagation
Updating Embeddings through Backpropagation
During training, we compare the predicted probability distribution with the actual next word (target word). We use a loss function, such as cross-entropy loss, to quantify the difference between the prediction and the target. Backpropagation is then used to update the network parameters, including the weights \(W\), bias \(b\), and crucially, the embedding matrix \(E\).
The embedding matrix \(E\) is a trainable parameter of the model. Initially, its values are random, typically drawn from a normal distribution with mean zero and small standard deviation. Through stochastic gradient descent during backpropagation, these values are adjusted to minimize the loss function.
The Emergence of Semantic Meaning in Embeddings
A remarkable outcome of training word embeddings is the emergence of semantic meaning. After training on a large corpus, words with similar meanings tend to have embedding vectors that are close to each other in the embedding space. Words that appear in similar contexts in the language will have similar embedding vectors.
Measuring Similarity: Cosine Similarity
To measure the "closeness" or similarity between word embedding vectors, cosine similarity is commonly used. The cosine similarity between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is defined as:
\[\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}|| \cdot ||\mathbf{y}||}\]
where \(\mathbf{x} \cdot \mathbf{y}\) is the dot product of \(\mathbf{x}\) and \(\mathbf{y}\), and \(||\mathbf{x}||\) and \(||\mathbf{y}||\) are their respective Euclidean norms. Cosine similarity yields a value between -1 and 1, representing the cosine of the angle between the vectors. A value closer to 1 indicates higher similarity.
As an example of similarity measured this way, the word "above" is most similar to "under", with a cosine similarity of 0.362.
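Cosine similarity can be computed directly from this definition. The following is a minimal NumPy sketch; the embedding matrix here is random stand-in data and the word indices are hypothetical, purely for illustration.

import numpy as np

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

E = np.random.normal(0.0, 0.1, size=(40000, 100))      # stand-in for a trained embedding matrix
above_idx, under_idx = 17, 42                           # hypothetical indices of 'above' and 'under'
print(cosine_similarity(E[above_idx], E[under_idx]))    # near 0 here; about 0.362 for trained embeddings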
TensorFlow Implementation
Placeholders for Input and Output
In TensorFlow, we define placeholders for input and output data:
inpt = tf.placeholder(tf.int32, shape=[batchSz])
answr = tf.placeholder(tf.int32, shape=[batchSz])
Here, ‘inpt’ will hold the input word indices and ‘answr’ will hold the indices of the correct next words. ‘batchSz’ represents the batch size.
Embedding Matrix as a Trainable Variable
The embedding matrix \(E\) is defined as a trainable variable:
E = tf.Variable(tf.random_normal([vocabSz, embedSz], stddev=0.1))
‘vocabSz’ is the vocabulary size and ‘embedSz’ is the embedding size. ‘tf.random_normal’ initializes the embedding matrix with random values from a normal distribution.
Embedding Lookup Operation
The embedding lookup operation is performed using ‘tf.nn.embedding_lookup’:
embed = tf.nn.embedding_lookup(E, inpt)
This operation retrieves the embedding vectors for the input word indices from the embedding matrix \(E\).
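The snippets above never construct the 'logits' tensor used in the loss in the next subsection. Under the single linear layer described earlier, it could be produced as follows; the names W and b are illustrative, not from the lecture:

W = tf.Variable(tf.random_normal([embedSz, vocabSz], stddev=0.1))
b = tf.Variable(tf.random_normal([vocabSz], stddev=0.1))
logits = tf.matmul(embed, W) + b    # shape [batchSz, vocabSz]: one score per vocabulary word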
Loss Function: Sparse Softmax Cross-Entropy
For the loss function, we use sparse softmax cross-entropy with logits:
xEnt = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=answr)
loss = tf.reduce_sum(xEnt)
This computes the cross-entropy loss between the predicted logits and the true labels (‘answr’). ‘tf.reduce_sum’ sums up the cross-entropy values across the batch to get the total loss.
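To train on this loss we still need a training op and a session loop. A minimal sketch, assuming the learning rate and the NumPy arrays inptBatch and answrBatch (each of shape [batchSz]) are supplied by the surrounding training code:

train = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
# One training step: inptBatch holds current word indices, answrBatch the words that follow them.
_, batchLoss = sess.run([train, loss], feed_dict={inpt: inptBatch, answr: answrBatch})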
Evaluation Metrics
Average Loss per Example
A basic evaluation metric is the average loss per example or per word. This is similar to the approach used in digit recognition. We aim to minimize this loss during training.
Perplexity: A Common Language Modeling Metric
A more standard metric in language modeling is perplexity. Perplexity measures how well a language model predicts a sample of text. It is related to the average probability of predicting the correct next word.
Perplexity is defined as \(e^Z\), where \(Z\) is the average negative log-probability of the correct words in a sentence or corpus \(d\) of length \(|d|\): \[Z = \frac{1}{|d|} \sum_{i=1}^{|d|} -\log(\Prm_{z_i})\] Here, \(\Prm_{z_i}\) is the probability assigned by the model to the correct \(i\)-th word. Perplexity is then: \[\text{Perplexity} = e^Z = \left( \prod_{i=1}^{|d|} \frac{1}{\Prm_{z_i}} \right)^{1/|d|}\]
Lower perplexity indicates a better language model. If the model assigns a uniform probability \(1/|\V|\) to every word, then perplexity becomes:
\[\text{Perplexity} = |\V|\]
which is equal to the vocabulary size.
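As a quick numerical check of the definition, perplexity can be computed from the probabilities the model assigns to the correct words; the probability values below are made up for illustration.

import math

def perplexity(correct_word_probs):
    # Z is the average negative log-probability of the correct words; perplexity is e^Z.
    Z = sum(-math.log(p) for p in correct_word_probs) / len(correct_word_probs)
    return math.exp(Z)

print(perplexity([0.1, 0.2, 0.05]))       # 10.0: the geometric mean of 1/p over the three words
print(perplexity([1.0 / 40000] * 3))      # 40000.0: equals the vocabulary size under the uniform model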
Beyond Bigrams: Trigrams and Higher-Order N-grams
We can extend the bigram model to trigram models and higher-order \(n\)-gram models by conditioning on more preceding words. For example, a trigram model considers the two previous words to predict the next word. This requires adjusting the input to the model to take into account the additional context words. Adding hidden layers to the neural network can also improve perplexity.
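One way to realize this in the embedding network above, shown as a hedged sketch in the lecture's TensorFlow style, is to look up the embeddings of the two previous words and concatenate them before the linear layer (the names inpt2, W2, and b2 are illustrative):

inpt2 = tf.placeholder(tf.int32, shape=[batchSz, 2])        # two context words per example
embed2 = tf.nn.embedding_lookup(E, inpt2)                   # shape [batchSz, 2, embedSz]
embed2 = tf.reshape(embed2, [batchSz, 2 * embedSz])         # concatenate the two embeddings
W2 = tf.Variable(tf.random_normal([2 * embedSz, vocabSz], stddev=0.1))
b2 = tf.Variable(tf.random_normal([vocabSz], stddev=0.1))
logits2 = tf.matmul(embed2, W2) + b2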
Overfitting and Regularization
The Problem of Overfitting in Language Models
Overfitting is a significant concern in language modeling. Language models, especially complex ones, can easily overfit to the training data, performing well on the training set but poorly on unseen data (development or test sets). This is partly because language data may not always be perfectly independently and identically distributed (i.i.d.), and training datasets may not be fully representative of the vastness of language.
Training perplexity decreases consistently over epochs, but development perplexity starts to increase after a certain point, indicating overfitting.
Regularization Techniques to Combat Overfitting
Regularization techniques are used to mitigate overfitting.
Early Stopping
Early stopping is a simple regularization method where training is stopped when the performance on a validation set (e.g., development perplexity) starts to degrade. This prevents the model from continuing to train and overfit to the training data.
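A minimal sketch of the early-stopping decision itself, assuming per-epoch development perplexities have already been computed (the values below are invented):

dev_perplexities = [210.0, 185.0, 172.0, 168.0, 171.0, 179.0]   # hypothetical per-epoch dev perplexities

best_epoch = 0
for epoch, ppl in enumerate(dev_perplexities):
    if ppl < dev_perplexities[best_epoch]:
        best_epoch = epoch                  # dev perplexity improved; remember this epoch
    elif epoch - best_epoch >= 2:           # no improvement for 2 epochs
        print("stop training; keep the model saved after epoch", best_epoch)
        break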
Dropout Regularization
Dropout is a more advanced regularization technique that involves randomly dropping out (setting to zero) a fraction of neurons during training. This forces the network to learn more robust features and reduces reliance on individual neurons. In TensorFlow, dropout can be implemented using ‘tf.nn.dropout’.
Dropout can be implemented by randomly dropping 50% of the first layer’s output units during training.
L2 Regularization
L2 regularization adds a penalty term to the loss function that is proportional to the sum of the squared weights of the network. This encourages the network to learn smaller weights, which can help prevent overfitting. The modified loss function becomes:
\[L(\Phi) = -\log(\Prm(c)) + \alpha \sum_{\phi \in \Phi} ||\phi||^2\]
where \(\Phi\) represents the set of weights, and \(\alpha\) is a hyperparameter controlling the strength of regularization. In TensorFlow, L2 regularization can be implemented using ‘tf.nn.l2_loss’.
keepP = tf.placeholder(tf.float32)
w1out = tf.nn.dropout(w1out, keepP)       # dropout: keepP is fed 0.5 during training, 1.0 at test time
l2_loss = 0.01 * tf.nn.l2_loss(W1)        # L2 regularization penalty on the weights W1
loss = original_loss + l2_loss            # add the L2 penalty to the original loss
Recurrent Neural Networks (RNNs)
Limitations of Feedforward Networks for Sequential Data
Feedforward neural networks, even with techniques like word embeddings, treat each input independently. They lack the ability to maintain state or memory of past inputs, which is crucial for processing sequential data like language where context is important.
Introducing Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are designed to handle sequential data by incorporating cycles in their network graph. Unlike feedforward networks, RNNs have connections that loop back to previous layers, allowing information to persist over time.
The Concept of State and Memory in RNNs
RNNs introduce the concept of state or memory. This state allows the network to remember information from previous inputs and use it to influence the processing of current inputs. This memory is essential for understanding context and dependencies in sequential data.
A Simple RNN Model
State Update Rule
A simple RNN model can be defined by the following equations:
\[\mathbf{s}_0 = \mathbf{0} \quad \text{(initial state is the zero vector)}\] \[\mathbf{s}_{t+1} = \rho([\mathbf{e}_{t+1}, \mathbf{s}_t]W_r + \mathbf{b}_r) \quad \text{(state update rule)}\]
Here, \(t\) represents time, \(\mathbf{s}_t\) is the state vector at time \(t\), \(\mathbf{e}_{t+1}\) is the embedding vector of the input word at time \(t+1\), \([\mathbf{e}_{t+1}, \mathbf{s}_t]\) denotes the concatenation of the two vectors, \(W_r\) and \(\mathbf{b}_r\) are the weight matrix and bias vector for the recurrent layer, and \(\rho\) is an activation function, such as tanh.
Output Calculation
The output \(\mathbf{o}\) is calculated from the updated state:
\[\mathbf{o} = \mathbf{s}_{t+1}W_o + \mathbf{b}_o \quad \text{(output calculation)}\]
where \(W_o\) and \(\mathbf{b}_o\) are the weight matrix and bias vector for the output layer. The output \(\mathbf{o}\) can then be passed through a softmax layer to obtain a probability distribution over the vocabulary for the next word prediction.
The state vector \(\mathbf{s}_t\) summarizes information from all previous inputs up to time \(t\). The dimension of the state vector is a hyperparameter. New states are computed recursively, incorporating the current input and the previous state.
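Written out in NumPy, a single step of this simple RNN, with the concatenation made explicit, looks like the following; all sizes, names, and the random weights are illustrative.

import numpy as np

embedSz, stateSz, vocabSz = 100, 64, 40000
Wr = np.random.normal(0.0, 0.1, (embedSz + stateSz, stateSz))   # recurrent weights W_r
br = np.zeros(stateSz)                                          # recurrent bias b_r
Wo = np.random.normal(0.0, 0.1, (stateSz, vocabSz))             # output weights W_o
bo = np.zeros(vocabSz)                                          # output bias b_o

def rnn_step(s_t, e_next):
    # Concatenate the current input embedding with the previous state, then apply tanh.
    s_next = np.tanh(np.concatenate([e_next, s_t]) @ Wr + br)
    logits = s_next @ Wo + bo            # scores over the vocabulary for the next word
    return s_next, logits

s = np.zeros(stateSz)                     # s_0 is the zero vector
s, logits = rnn_step(s, np.random.normal(0.0, 0.1, embedSz))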
Backpropagation Through Time (BPTT)
Training RNNs involves Backpropagation Through Time (BPTT). Since RNNs have cycles, standard backpropagation cannot be directly applied. BPTT unfolds the RNN over a fixed number of time steps (window size) and then applies backpropagation as if it were a deep feedforward network.
A simple approach is to unfold the RNN for a fixed number of steps \(N\) and then perform backpropagation. This is an approximation, as true recurrence implies potentially infinite unfolding.
Batching and Windowing for RNN Training
To train RNNs efficiently, we use batching and windowing. The corpus is first split into batches and then further divided into windows of a fixed size. For example, with a window size of 3, we process sequences of 3 words at a time.
Data is processed in batches, and within each batch, sequences of words are processed in windows. This approach allows for parallel processing within a batch and manages the computational complexity of BPTT.
When the batch size is 2 and the window size is 3, the corpus is divided into chunks of 2x3 words. Each chunk is processed sequentially, and the final state of one chunk is used as the initial state for the next chunk.
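A hedged sketch of this arrangement: split the corpus of word indices across the batch dimension so each row is a contiguous stretch of text, then walk along the rows in windows (batch size 2 and window size 3 here, as in the example above).

import numpy as np

corpus_ids = np.arange(24)                           # stand-in for a corpus of word indices
batchSz, windowSz = 2, 3
perRow = len(corpus_ids) // batchSz
data = corpus_ids[:batchSz * perRow].reshape(batchSz, perRow)    # each row is a contiguous stretch

for start in range(0, perRow - windowSz, windowSz):
    window = data[:, start:start + windowSz]             # inputs, shape [batchSz, windowSz]
    targets = data[:, start + 1:start + windowSz + 1]    # the same words shifted by one position
    # ... run one BPTT step on (window, targets); the final RNN state of this chunk
    #     becomes the initial state for the next one.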
TensorFlow Implementation of RNNs
In TensorFlow, a basic RNN cell can be created using:
rnn = tf.contrib.rnn.BasicRNNCell(rnnSz)
where ‘rnnSz’ is the size of the RNN state. The initial state is set to zero:
initialState = rnn.zero_state(batchSz, tf.float32)
and the dynamic RNN is created using ‘tf.nn.dynamic_rnn’:
outputs, nextState = tf.nn.dynamic_rnn(rnn, embeddings, initial_state=initialState)
‘embeddings’ is the input word embeddings, and ‘initialState’ is the initial RNN state. ‘tf.nn.dynamic_rnn’ handles the unfolding and recurrence over the input sequence.
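The 'outputs' tensor has shape [batchSz, windowSz, rnnSz]. To obtain next-word logits it is typically reshaped and passed through a linear layer, and at run time the computed 'nextState' is passed back in as the initial state of the following window. A minimal sketch, assuming 'answr' now holds a window of target word indices per batch element, and with Wout, bout, and windowSz as illustrative names:

Wout = tf.Variable(tf.random_normal([rnnSz, vocabSz], stddev=0.1))
bout = tf.Variable(tf.random_normal([vocabSz], stddev=0.1))

output2d = tf.reshape(outputs, [batchSz * windowSz, rnnSz])      # fold the time dimension into the batch
logits = tf.matmul(output2d, Wout) + bout
xEnt = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=tf.reshape(answr, [batchSz * windowSz]))
loss = tf.reduce_sum(xEnt)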
Complexity Analysis of RNNs
Time Complexity: The time complexity of training an RNN with BPTT is \(O(N \times B \times T)\), where \(N\) is the number of steps to unfold, \(B\) is the batch size, and \(T\) is the length of the sequence.
Space Complexity: The space complexity is primarily determined by the size of the state vector and the number of parameters in the network. It is \(O(S + P)\), where \(S\) is the size of the state vector and \(P\) is the number of parameters.
Long Short-Term Memory (LSTM) Networks
Motivation for LSTMs: Addressing RNN Limitations
Simple RNNs suffer from vanishing gradient and exploding gradient problems, especially when dealing with long sequences. This makes it difficult for them to learn long-range dependencies. Long Short-Term Memory (LSTM) networks are a type of RNN designed to address these limitations and effectively capture long-range dependencies.
LSTM Architecture and Components
LSTMs introduce a more complex cell structure with a cell state (long-term memory) and gates that control the flow of information.
Cell State: Long-Term Memory
The cell state acts as a long-term memory, carrying information across time steps. It is a central component that LSTM gates manipulate.
Gates: Forget, Input, and Output
LSTMs use three main types of gates to regulate information flow:
Forget Gate: Determines what information to discard from the cell state.
Input Gate: Determines what new information to add to the cell state.
Output Gate: Determines what information to output based on the cell state and input.
These gates are implemented using sigmoid and tanh activation functions and pointwise multiplication.
Detailed LSTM Equations and Operations
Forget Gate: Controlling Information Retention
The forget gate \(f_t\) decides what information to forget from the previous cell state \(c_{t-1}\). It is calculated as:
\[f_t = \sigma(\mathbf{h}'W_f + \mathbf{b}_f)\]
where \(\mathbf{h}' = [\mathbf{h}_{t-1}, \mathbf{e}_t]\) is the concatenation of the previous hidden state and the current input embedding, \(W_f\) and \(\mathbf{b}_f\) are weights and biases for the forget gate, and \(\sigma\) is the sigmoid function.
Input Gate: Adding New Information
The input gate has two parts: an input gate layer \(a_{1,t}\) and a candidate cell state \(a_{2,t}\) (often written \(\tilde{c}_t\)).
\[a_{1,t} = \sigma(\mathbf{h}'W_{a1} + \mathbf{b}_{a1})\] \[a_{2,t} = \tanh(\mathbf{h}'W_{a2} + \mathbf{b}_{a2})\]
where \(W_{a1}, \mathbf{b}_{a1}, W_{a2}, \mathbf{b}_{a2}\) are weights and biases for the input gate components.
Cell State Update: Combining Old and New Information
The cell state \(c_t\) is updated by combining the forgotten information from \(c_{t-1}\) and the new information from the input gate:
\[c_t = c_{t-1} \odot f_t + (a_{1,t} \odot a_{2,t})\]
where \(\odot\) denotes element-wise multiplication.
Output Gate: Controlling Information Output
The output gate \(o_t\) decides what parts of the cell state to output:
\[\mathbf{h}''_t = \mathbf{h}'W_h + \mathbf{b}_h\] \[o_t = \sigma(\mathbf{h}''_t W_o + \mathbf{b}_o)\] \[\mathbf{h}_t = \tanh(c_t) \odot o_t\]
where \(W_h, \mathbf{b}_h, W_o, \mathbf{b}_o\) are weights and biases for the output gate components, and \(\mathbf{h}_t\) is the hidden state output at time \(t\).
The sigmoid gates (forget, input, output) produce values between 0 and 1, acting as "smooth" controls to regulate information flow.
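The equations above can be written out directly. The following NumPy sketch renders this lecture's formulation of one LSTM step (including the extra linear layer \(\mathbf{h}''_t\) feeding the output gate); the sizes and randomly initialized weights are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

embedSz, hiddenSz = 100, 64
concatSz = embedSz + hiddenSz
# Weights and biases for the forget gate, the two input-gate parts, and the output gate.
Wf, bf = np.random.normal(0, 0.1, (concatSz, hiddenSz)), np.zeros(hiddenSz)
Wa1, ba1 = np.random.normal(0, 0.1, (concatSz, hiddenSz)), np.zeros(hiddenSz)
Wa2, ba2 = np.random.normal(0, 0.1, (concatSz, hiddenSz)), np.zeros(hiddenSz)
Wh, bh = np.random.normal(0, 0.1, (concatSz, hiddenSz)), np.zeros(hiddenSz)
Wo, bo = np.random.normal(0, 0.1, (hiddenSz, hiddenSz)), np.zeros(hiddenSz)

def lstm_step(h_prev, c_prev, e_t):
    hp = np.concatenate([h_prev, e_t])        # h' = [h_{t-1}, e_t]
    f = sigmoid(hp @ Wf + bf)                 # forget gate
    a1 = sigmoid(hp @ Wa1 + ba1)              # input gate
    a2 = np.tanh(hp @ Wa2 + ba2)              # candidate cell values
    c = c_prev * f + a1 * a2                  # new cell state
    hpp = hp @ Wh + bh                        # h''_t
    o = sigmoid(hpp @ Wo + bo)                # output gate
    h = np.tanh(c) * o                        # new hidden state
    return h, c

h, c = np.zeros(hiddenSz), np.zeros(hiddenSz)
h, c = lstm_step(h, c, np.random.normal(0, 0.1, embedSz))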
Complexity Analysis of LSTMs
Time Complexity: The time complexity of training an LSTM is similar to that of a simple RNN, but with a larger constant factor due to the more complex operations within each LSTM cell. It is \(O(N \times B \times T \times G)\), where \(N\) is the number of steps to unfold, \(B\) is the batch size, \(T\) is the length of the sequence, and \(G\) represents the complexity factor due to gates.
Space Complexity: The space complexity of an LSTM is higher than that of a simple RNN due to the additional parameters in the gates and the cell state. It is \(O(S + P + G)\), where \(S\) is the size of the state vector, \(P\) is the number of parameters in the basic RNN, and \(G\) is the additional parameters due to the LSTM gates.
Conclusion
In this lecture, we covered the basics of language modeling, word embeddings, and recurrent neural networks, focusing on simple RNNs and LSTMs. We discussed how word embeddings provide numerical representations of words that capture semantic meaning, and how RNNs and LSTMs process sequential data by maintaining state and memory. We also explored techniques to address overfitting, such as early stopping, dropout, and L2 regularization.
Key takeaways include:
Language models aim to predict the probability of word sequences.
Word embeddings are crucial for representing words as numerical vectors in deep learning models.
RNNs and LSTMs are powerful architectures for handling sequential data and capturing long-range dependencies.
Regularization techniques are essential to prevent overfitting in language models.
Further topics to explore might include attention mechanisms, transformers, and more advanced applications of RNNs and LSTMs in natural language processing. Understanding these concepts is crucial for building sophisticated NLP systems.