Lecture Notes: AI Fundamentals

Author

Your Name

Published

January 28, 2025

Introduction

Good afternoon, everyone. Welcome to today’s lecture. The program for today is structured into two main parts. First, we will revise the previous exercises to ensure everyone is on track and address any difficulties encountered. Following this, we will proceed with the laboratory activity. We are also expecting some students to join us for the lab session.

Regarding the schedule for next week, please note that the lesson scheduled for Wednesday might be cancelled. We will confirm this as soon as possible. However, the session on Thursday will proceed as planned.

As the head of the Artificial Intelligence Lab, I want to introduce a series of activities showcasing real-world AI projects developed in our lab. Our group has grown significantly and is currently involved in more than ten projects. These sessions are designed to provide you with insights into practical AI applications and inspire you to consider similar projects in your own professional contexts. This introduction aims to give you a glimpse into the possibilities within the field of artificial intelligence.

Course Administration and Announcements

Review of Previous Exercise Solutions

We will now begin by reviewing the solutions to the exercises from the previous lecture. This is to ensure everyone understands the fundamental concepts and to clarify any doubts you might have.

Introduction to the Laboratory Activity

Following the exercise review, we will transition to the hands-on laboratory activity planned for today. This session is designed to provide practical experience and complement the theoretical concepts we have been discussing in lectures.

Schedule Changes and Updates

Just to reiterate, the Wednesday lecture next week is potentially cancelled. Please keep an eye out for further announcements confirming the schedule. The Thursday session, focusing on AI Lab projects, is confirmed and will proceed as planned.

Overview of AI Lab Projects

As mentioned earlier, the Thursday sessions will be dedicated to showcasing the diverse projects undertaken in the Artificial Intelligence Lab. This initiative aims to bridge the gap between theoretical knowledge and practical application, giving you a clearer understanding of the real-world impact of AI technologies. We hope these sessions will inspire you and provide valuable insights into potential future projects or applications in your respective fields.

AI Technology and News Highlights

Discussion of the Agent Application

I wanted to share some recent news regarding an interesting agent application. I believe I briefly mentioned it yesterday.

Introduction to OpenAI’s Sora

Yesterday, I also mentioned Sora, OpenAI’s new system. I attempted to use it, but unfortunately, I did not have access. Sora is a new text-to-video system from OpenAI, representing significant advancements in generative AI models capable of creating realistic and imaginative video content from text instructions.

Demonstration of Avatar Generation

This morning, during a short break, I experimented with a website that allows you to create personalized avatars and generate videos using them. After a brief interaction with my camera, the system generated an avatar for me. I then used text input to create a video with this avatar. Let’s take a look at the result.

Video Playback and Transcript:
Hello, I am Giuseppe Serra, and I am an associate professor at the Università degli Studi di Udine. (Original spoken in Italian.)

As you can see, while not perfect, the lip synchronization is quite impressive. The voice quality could be improved, but overall, it demonstrates the rapid progress in this technology domain. Feel free to explore this technology and create your own avatars. This is just an example of the cutting-edge techniques currently available.

Detailed Review of Exercise Problems

Now, let’s go through the exercise problems. I will provide the correct answers and explanations to help you understand the solutions.

Importance of Data Distribution in Dev and Test Sets

For the first exercise, the question was about the relationship between the development (dev) and test datasets. The correct answer is A: the dev and test sets should come from the same distribution.

It is crucial for the dev and test sets to be drawn from the same distribution to ensure that the performance metrics obtained on the dev set are a reliable indicator of the performance on unseen test data. This comparability is essential for making informed decisions during model development and evaluation. While data augmentation can be applied to the training set to increase its diversity, the dev and test sets should reflect the real-world data distribution the model is expected to perform on.

Strategies for Addressing Underfitting Issues

The second question addressed strategies for dealing with underfitting in neural networks. The correct answers are B and C: increase the complexity of the model and increase the number of units or layers.

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. To address underfitting, we need to increase the model’s capacity, as sketched in the example after the list below. This can be achieved by:

  • Increasing Model Complexity: Using a more complex model architecture, such as adding more layers to a neural network or using a different type of model altogether.

  • Increasing Units/Layers: For neural networks, this means increasing the number of neurons in each layer or adding more layers to create a deeper network.
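
As a concrete, purely illustrative sketch of these two options in PyTorch (the layer sizes below are arbitrary and not taken from the exercise):

# Illustrative sketch: increasing model capacity to address underfitting
import torch.nn as nn

# A small network that may be too simple for the task
small_model = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

# Higher capacity: more units per layer and an additional hidden layer
larger_model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),
)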

Strategies for Addressing Overfitting Issues

The third exercise presented a scenario of building a classifier for fruits at a supermarket checkout kiosk. Given a low training error (0.5%) and a noticeably higher dev set error (7%), the question asked for promising strategies to improve the classifier. The correct answers are A and C: increase the regularization term and increase the amount of training data.

Overfitting is indicated by a significant gap between training and dev set performance. The model has memorized the training data but fails to generalize to new, unseen data. Strategies to combat overfitting include the following (a short code sketch follows the list):

  • Increasing Regularization: Regularization techniques, such as L2 regularization (weight decay), penalize large weights, encouraging the model to learn simpler patterns that generalize better.

  • Increasing Training Data: Providing more training examples can help the model learn more robust features and reduce its reliance on memorizing specific training instances.
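
As a brief, purely illustrative sketch, L2 regularization can be added in PyTorch through the optimizer’s ‘weight_decay’ argument; the model and coefficient below are placeholders, not the lab’s actual values:

# Illustrative: adding L2 regularization (weight decay) via the optimizer
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # weight_decay sets the L2 penalty strength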

Understanding Weight Decay Regularization

The fourth question was about weight decay. The statement is that weight decay, such as L2 regularization, results in gradient descent shrinking the weights on every iteration. This statement is true.

Weight decay (L2 regularization) adds a penalty term to the cost function that is proportional to the square of the weights. During gradient descent, this penalty term encourages the weights to be smaller, effectively shrinking them towards zero in each iteration. This helps in preventing overfitting by simplifying the model.
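
Written out in the usual notation (cost \(J\), learning rate \(\alpha\), regularization strength \(\lambda\), and \(m\) training examples), the regularized cost is \(J_{\text{reg}}(W) = J(W) + \frac{\lambda}{2m}\|W\|^2\), so the gradient descent update becomes

\[
W := W - \alpha\left(\frac{\partial J}{\partial W} + \frac{\lambda}{m} W\right) = \left(1 - \frac{\alpha \lambda}{m}\right) W - \alpha\,\frac{\partial J}{\partial W}.
\]

Because the factor \(1 - \frac{\alpha \lambda}{m}\) is slightly less than one, the weights are multiplied by it, and therefore shrunk, on every iteration.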

Impact of the Regularization Parameter (Lambda)

Question five asked about the effect of increasing the regularization hyperparameter \(\lambda\). The correct answer is A: weights are pushed towards becoming smaller.

The regularization parameter \(\lambda\) controls the strength of the regularization. A larger \(\lambda\) increases the penalty for large weights in the cost function. Consequently, gradient descent will more aggressively push the weights towards smaller values to minimize the cost function, thus further reducing model complexity and mitigating overfitting.

Interpretation of Cost Function Plots

Exercise six presented a cost function plot that is not smooth and asked for interpretation. The correct answer is A: if using mini-batch gradient descent, this looks acceptable, but if using batch gradient descent, something is wrong.

The jagged nature of the cost function plot suggests that mini-batch gradient descent is being used. In mini-batch gradient descent, the cost is calculated on small batches of data, which can lead to noisy updates and oscillations in the cost function. This is expected and acceptable. However, if batch gradient descent (using the entire dataset for each update) were used, a smoother, monotonically decreasing cost function would be expected. A non-smooth plot in batch gradient descent would indicate issues such as a high learning rate or problems with the cost function implementation.
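
To see where the noise comes from, here is a minimal, purely illustrative sketch with random data and a simple linear model: in mini-batch gradient descent the cost is computed on a different small batch at every update, so the curve of per-update costs oscillates, whereas batch gradient descent computes one cost per pass over the full dataset and should decrease smoothly.

# Illustrative: mini-batch updates produce a noisy per-update cost curve
import torch

X, Y = torch.randn(1000, 10), torch.randn(1000, 1)   # random data, for illustration only
w = torch.zeros(10, 1, requires_grad=True)
lr, batch_size = 0.1, 64
minibatch_costs = []

for epoch in range(5):
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], Y[i:i + batch_size]
        cost = ((xb @ w - yb) ** 2).mean()            # cost on this mini-batch only
        cost.backward()
        with torch.no_grad():
            w -= lr * w.grad
            w.grad.zero_()
        minibatch_costs.append(cost.item())           # plotting these values gives the jagged curve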

Solution to the Numerical Calculation Problem

Question seven was a numerical calculation problem. The correct answer is A.

Solution: [Detail missing: the problem statement and the numerical solution are not available in the provided transcript; please refer to the lecture slides for both.]

Analysis of Learning Rate Decay Strategies

Exercise eight asked which of the given learning rate decay schemes is not a good one. The correct answer is A.

Learning rate decay is a strategy to reduce the learning rate during training, typically starting with a larger learning rate and gradually decreasing it. Options B, C, and D represent typical decay schemes where the learning rate decreases with iteration \(T\). Option A, however, increases the learning rate exponentially with \(T\), which is counterproductive for gradient descent optimization. Gradient descent aims to converge to a minimum, and a decreasing learning rate helps in fine-tuning the convergence in later iterations.
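
The exact formulas from the exercise are on the slides, but for reference, commonly used decay schedules all decrease \(\alpha\) as the iteration (or epoch) number \(T\) grows, for example

\[
\alpha = \frac{1}{1 + k \cdot T}\,\alpha_0, \qquad \alpha = 0.95^{\,T}\,\alpha_0, \qquad \alpha = \frac{k}{\sqrt{T}}\,\alpha_0,
\]

where \(\alpha_0\) is the initial learning rate and \(k\) a decay constant. A scheme such as \(\alpha = 0.95^{-T}\,\alpha_0\), which grows exponentially with \(T\), does the opposite: the steps become larger and larger, so the optimization tends to overshoot the minimum instead of settling into it.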

Application of Gradient Descent with Momentum

Question nine presented a picture describing a learning process and asked what it represents. The correct answer is A: gradient descent with momentum.

The picture likely depicted the behavior of gradient descent with momentum, where the optimization process not only considers the current gradient but also accumulates a velocity vector from previous gradients. This momentum helps to accelerate convergence in the relevant direction and dampen oscillations, leading to a smoother and faster descent towards the minimum of the cost function.
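
For reference, the standard momentum update maintains an exponentially weighted average of past gradients (\(\beta\) is typically around 0.9):

\[
v_{dW} := \beta\, v_{dW} + (1 - \beta)\, dW, \qquad W := W - \alpha\, v_{dW}.
\]

Components of the gradient that keep pointing in the same direction accumulate in \(v_{dW}\), while components that alternate in sign cancel out, which produces exactly the damped-oscillation behavior typically shown in such pictures.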

Calculating Parameters in Fully Connected Layers

Exercise ten asked to calculate the number of parameters in a fully connected layer. Given a \(300 \times 300\) color image input and a first layer with 100 neurons, fully connected to the input, the correct answer is D.

For a fully connected layer, each neuron is connected to every input. For a \(300 \times 300\) color image, the input size is \(300 \times 300 \times 3 = 270,000\). If the first layer has 100 neurons, each neuron has 270,000 weights plus a bias term. Therefore, the total number of parameters is \((300 \times 300 \times 3 \times 100) + 100 = 27,000,100\), which corresponds to option D.
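
The count can be verified with a quick, illustrative PyTorch check:

# Quick check: parameters of a fully connected layer on a flattened 300x300x3 input
import torch.nn as nn

fc = nn.Linear(300 * 300 * 3, 100)                  # 270,000 inputs, 100 neurons
n_params = sum(p.numel() for p in fc.parameters())  # weights + biases
print(n_params)                                     # prints 27000100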

Calculating Parameters in Convolutional Layers

Exercise eleven was similar to ten, but for a convolutional layer. With the same input and a convolutional layer with 100 filters of size \(5 \times 5\), the correct answer is D.

In a convolutional layer, filters are applied across the input. Here, we have 100 filters of size \(5 \times 5\). Since it’s a color image, the filter depth is 3. Each filter has \(5 \times 5 \times 3\) weights, and there are 100 filters, plus a bias term for each filter. So, the number of parameters is \((5 \times 5 \times 3 \times 100) + 100 = 7500 + 100 = 7600\), which corresponds to option D.
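
Again, a quick, illustrative PyTorch check confirms the count:

# Quick check: parameters of a convolutional layer with 100 filters of size 5x5 on a 3-channel input
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=100, kernel_size=5)
n_params = sum(p.numel() for p in conv.parameters())  # (5*5*3)*100 weights + 100 biases
print(n_params)                                       # prints 7600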

Understanding the Max Pooling Operation

Exercise twelve described an input volume of \(32 \times 32 \times 16\) and asked about the output volume after applying max pooling with a stride of 2 and a filter size of 2. The correct answer is B.

Max pooling reduces the spatial dimensions of the input. With a \(2 \times 2\) filter and stride 2, the dimensions are halved. For a \(32 \times 32\) input, the output dimension after pooling becomes \(32/2 \times 32/2 = 16 \times 16\). The number of channels (depth) remains unchanged by max pooling. Therefore, the output volume is \(16 \times 16 \times 16\).
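
A quick, illustrative check of the output shape in PyTorch (which uses channels-first ordering):

# Quick check: 2x2 max pooling with stride 2 on a 32x32x16 volume
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)                          # torch.Size([1, 16, 16, 16])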

Analysis of Sliding Window Classifiers

For the exercise related to sliding window classifiers, the question was about the effect of increasing the stride. The correct answer is B: increasing the stride would tend to decrease accuracy and decrease the computational cost.

In a sliding window classifier, increasing the stride means the window moves in larger steps across the image. This results in:

  • Decreased Computational Cost: Fewer window positions to evaluate, thus reducing computation.

  • Decreased Accuracy: Larger strides can miss smaller objects or details, potentially reducing detection accuracy.

Therefore, option B correctly describes the trade-off.

Overview of the YOLO Algorithm

For the exercise about the YOLO (You Only Look Once) algorithm, the correct answer is A.

YOLO Algorithm: [Detail missing: the question about the YOLO algorithm and the explanation of why option A is correct are not available in the provided transcript; please refer to the lecture slides.]

Calculation of Intersection over Union (IoU)

The final exercise was about calculating Intersection over Union (IoU). The correct answer is B.

Intersection over Union (IoU): [Detail missing: the IoU question and the calculation showing why option B is correct are not available in the provided transcript; please refer to the lecture slides.]

Regarding the exercise that was initially skipped: after review, the correct answer is C, which corresponds to the value 5.

Skipped Exercise: [Detail missing: the question and the context for this skipped exercise are not available in the provided transcript; please refer to the lecture slides.]

I will share this annotated file with you so you can review the solutions again. If you have any further questions, please feel free to ask.

Hands-on Laboratory Session: Sentiment Analysis with NLP

Now, we will proceed with the laboratory activity. I will hand over to Bea and Ian who will guide you through the sentiment analysis task using Natural Language Processing (NLP).

Introduction to the Sentiment Classification Task

Bea: Okay, can you hear me online? Yes, yes, we can. Okay, great. First of all, in the Google Drive folder you can find the solution to the last lesson’s exercises and the files for today. The files for today are also available in the Lab 3 channel under the Files tab, where you will find the notebook we will use today.

Today, we will be working on a sentiment classification problem using an NLP dataset. Building upon your classroom explorations of NLP concepts, this lab offers a practical application. We will focus on sentiment classification, which involves determining the sentiment expressed in text. We will use basic deep neural networks, similar to those discussed in our previous lectures.

Before we begin, are there any questions or observations regarding the previous exercise or the last lab session? If you encounter any issues during this session, please write to us in the channel, and we will respond as soon as possible.

For today’s activity, please download the provided notebook file and upload it to your Colab environment. We will be performing sentiment classification using pre-trained GloVe embeddings.

Data Loading and Preprocessing

Downloading Pre-trained GloVe Embeddings

The first step in our lab is to download the GloVe embeddings. These are pre-trained word embeddings that provide numerical representations for words, capturing semantic relationships. We will download these embeddings from a mirror of the Stanford website. These embeddings are pre-computed and publicly available.

Loading and Inspecting Training and Test Datasets

Next, we will download and load the training and test datasets for movie reviews. These datasets contain movie reviews along with their sentiment labels, which are either positive (1) or negative (0). We will use pandas to load these CSV files and inspect their structure to understand the data format.

# Code snippet from the lab notebook (Illustrative)
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

import pandas as pd

# Load training data
train_df = pd.read_csv("train_movie_data.csv")
print("Train Data Head:")
print(train_df.head())

# Load test data
test_df = pd.read_csv("test_movie_data.csv")
print("\nTest Data Head:")
print(test_df.head())

We will extract the text reviews and sentiment labels into separate numpy arrays for easier processing. Specifically, we will save the reviews into ‘X_train’ and ‘X_test’, and the sentiment labels into ‘Y_train’ and ‘Y_test’. It is also useful to print the shape of the training and test sets to understand the number of samples in each dataset.
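
A minimal sketch of this step is shown below; the column names ‘review’ and ‘sentiment’ are assumptions for illustration, so check the output of ‘train_df.head()’ for the actual names used in the CSV files.

# Illustrative: extract reviews and labels into numpy arrays (column names are assumed)
X_train = train_df['review'].values
Y_train = train_df['sentiment'].values
X_test = test_df['review'].values
Y_test = test_df['sentiment'].values

print("Training samples:", X_train.shape)
print("Test samples:", X_test.shape)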

Text Preprocessing Techniques

Text data is often very noisy and requires preprocessing before it can be effectively used in machine learning models. We will use the Natural Language Toolkit (NLTK) library, a common library for text cleaning and normalization in Python.

Tokenization and Text Segmentation

Tokenization is the process of breaking down text into individual words or tokens. We will use NLTK’s word tokenization to segment the reviews into words. This step is crucial for converting raw text into a format that our model can understand.

Lowercasing and Text Normalization

To standardize the text and reduce vocabulary size, we will convert all text to lowercase. Additionally, we will perform normalization steps such as removing punctuation and handling special characters. This ensures that different forms of the same word (e.g., "The" and "the") are treated as identical and reduces noise in the text data. We will define a function ‘tokenize_text’ to handle these preprocessing steps.

# Code snippet illustrating text tokenization and lowercasing
import nltk
import re

nltk.download('punkt')  # download the tokenizer models required by nltk.word_tokenize

def tokenize_text(text):
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n\'t", " not ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'s", " is ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'m", " am ", text)
    text = re.sub(r'[^\w\s]',' ',text) # Remove punctuation
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    return tokens

sample_text = "This is a sample text with punctuation! And UPPERCASE words."
tokenized_text = tokenize_text(sample_text)
print("Original Text:", sample_text)
print("Tokenized Text:", tokenized_text)

After defining the ‘tokenize_text’ function, we will apply it to the full training and test datasets to process all text reviews. This will convert each review into a list of cleaned tokens.
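
A minimal sketch of this step (the full run over all reviews may take a little while):

# Apply the tokenizer to every review in the training and test sets
tokenized_train_texts = [tokenize_text(text) for text in X_train]
tokenized_test_texts = [tokenize_text(text) for text in X_test]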

Vocabulary Construction and Management

To convert text into numerical representations that can be fed into a neural network, we need to build a vocabulary. The vocabulary is a list of all unique words from our training data. We will also include special tokens in our vocabulary to handle specific cases.

Handling Special Tokens (Padding and Unknown)

We will add two special tokens to our vocabulary:

  • Padding (PAD): Represented as ‘<pad>’, used to ensure all input sequences have the same length by adding padding at the end of shorter sentences.

  • Unknown (UNK): Represented as ‘<unk>’, used to represent words that are not found in our vocabulary or the pre-trained embeddings. This is important for handling out-of-vocabulary words during inference.

We will initialize our vocabulary with these special tokens and assign them unique indices.

# Code snippet for vocabulary creation
vocab = {'<pad>': 0, '<unk>': 1}
next_index = 2

def build_vocabulary(tokenized_texts, vocab, next_index):
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = next_index
                next_index += 1
    return vocab, next_index

# Example usage (Illustrative - actual code processes full dataset)
# tokenized_train_texts = [tokenize_text(text) for text in X_train[:100]] # Process a subset for example
# vocab, next_index = build_vocabulary(tokenized_train_texts, vocab, next_index)
# print("Vocabulary Size:", len(vocab))
# print("Sample Vocabulary:", list(vocab.keys())[:10])

We will iterate through the tokenized training texts and build our vocabulary, assigning a unique index to each new word encountered. The size of the vocabulary will be printed to show the total number of unique words found in the training data.

Conversion of Text to Numerical Representations

Once we have constructed our vocabulary, we can convert each word in our tokenized text into its corresponding numerical index from the vocabulary. This process transforms text data into a numerical format suitable for neural network input. Words that are not in our vocabulary will be mapped to the index of the ‘<unk>’ token.

# Function to convert tokenized text to numerical IDs
def text_to_ids(tokenized_text, vocab):
    ids = [vocab.get(token, vocab['<unk>']) for token in tokenized_text]
    return ids

# Example usage (Illustrative)
# sample_tokenized_text = tokenize_text("example of unknown word hypotenuse")
# ids = text_to_ids(sample_tokenized_text, vocab)
# print("Tokenized Text:", sample_tokenized_text)
# print("Numerical IDs:", ids)

We will apply this ‘text_to_ids’ function to both the training and test datasets, converting all tokenized reviews into sequences of numerical IDs.

Padding and Truncation of Sequences

Neural networks typically require inputs of a fixed length. Since sentences can vary in length, we need to make all input sequences the same length. To achieve this, we will pad shorter sequences with ‘<pad>’ tokens and truncate longer sequences to a maximum length. We will choose a ‘max_sequence_length’, for example, 320.

# Function for padding sequences
def pad_sequences(id_sequences, max_length, padding_id):
    padded_sequences = []
    for ids in id_sequences:
        if len(ids) < max_length:
            ids.extend([padding_id] * (max_length - len(ids)))
        else:
            ids = ids[:max_length]
        padded_sequences.append(ids)
    return padded_sequences

max_sequence_length = 320
padding_id = vocab['<pad>']

# Example usage (Illustrative - actual code processes full dataset)
# numericalized_train_texts = [text_to_ids(tokens, vocab) for tokens in tokenized_train_texts]
# padded_train_sequences = pad_sequences(numericalized_train_texts, max_sequence_length, padding_id)
# print("Padded Sequence Example:", padded_train_sequences[0])

We will use the ‘pad_sequences’ function to process the numerical ID sequences for both training and test sets, ensuring all sequences are of ‘max_sequence_length’.

Loading and Utilizing Pre-trained GloVe Embeddings

We will load the downloaded GloVe embeddings and create a dictionary mapping words to their embedding vectors. These pre-trained embeddings are vector representations of words that capture semantic meaning. We will use these pre-trained embeddings to initialize the embedding layer of our neural network. The GloVe embeddings file contains word embeddings of 300 dimensions.

# Code for loading GloVe embeddings
import numpy as np
import pickle

def load_glove_embeddings(file_path):
    embeddings_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector
    return embeddings_dict

glove_file_path = 'glove.6B.300d.txt' # Path to downloaded GloVe file
glove_embeddings = load_glove_embeddings(glove_file_path)
print("Loaded GloVe Embeddings. Example embedding for 'the':")
print(glove_embeddings['the'][:10]) # Print the first 10 dimensions of the embedding for 'the'

We will load the GloVe embeddings from the downloaded text file into a dictionary, where keys are words and values are their corresponding 300-dimensional embedding vectors.

Construction of the Embedding Matrix

We will create an embedding matrix that will be used to initialize the weights of the embedding layer in our neural network. Each row of this matrix will correspond to a word in our vocabulary, and the values in the row will be the pre-trained GloVe embedding for that word. For the ‘<pad>’ and ‘<unk>’ tokens, and for words in our vocabulary that are not found in GloVe, we will initialize random embeddings. This ensures that every word in our vocabulary has an associated embedding vector.

# Code to construct embedding matrix
import torch

embedding_dim = 300
vocabulary_size = len(vocab)
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))

unk_embedding = np.random.randn(embedding_dim)
pad_embedding = np.random.randn(embedding_dim)

embedding_matrix[vocab['<pad>']] = pad_embedding
embedding_matrix[vocab['<unk>']] = unk_embedding

oov_words = []
glove_words_count = 0

for word, index in vocab.items():
    if word in ['<pad>', '<unk>']:
        continue
    try:
        embedding_vector = glove_embeddings[word]
        embedding_matrix[index] = embedding_vector
        glove_words_count += 1
    except KeyError:
        embedding_matrix[index] = unk_embedding # Assign UNK embedding to OOV words
        oov_words.append(word)

print(f"Words in GloVe vocabulary: {glove_words_count}")
print(f"Out-of-Vocabulary words count: {len(oov_words)}")

embedding_matrix_tensor = torch.tensor(embedding_matrix, dtype=torch.float)

We will iterate through our vocabulary. If a word is found in the GloVe embeddings, we use its pre-trained embedding; otherwise, we assign a randomly initialized embedding. We also keep count of words found in GloVe and out-of-vocabulary words. Finally, we convert the numpy embedding matrix to a PyTorch tensor.

Implementation of the Neural Network Model

We will now implement a simple neural network model for sentiment classification using PyTorch. This model will consist of an embedding layer, a mean pooling layer, and a linear classification layer.

Detailed Explanation of the Embedding Layer

The embedding layer is the first layer of our network. It takes integer indices as input, which are the numerical representations of words, and outputs dense vector embeddings. We will initialize this layer with our pre-computed embedding matrix, effectively using pre-trained GloVe embeddings. The embedding layer transforms each word index into its corresponding vector embedding.

Application of Mean Pooling

After the embedding layer, we will apply mean pooling. Mean pooling averages the embeddings of all words in a sequence (sentence) along the sequence length dimension to produce a single fixed-size vector representation for the entire sequence. This results in a sentence-level embedding that captures the average semantic meaning of the words in the sentence.

Implementation of the Linear Classification Layer

Following mean pooling, we will use a linear layer to perform the final classification. This layer will take the sentence embedding from the mean pooling layer as input and map it to a two-dimensional output. These two dimensions represent the scores for the two sentiment classes: positive and negative.

Use of the Log-Softmax Activation Function

Finally, we will apply the Log-Softmax activation function to the output of the linear layer. Log-Softmax converts the scores into log-probabilities, which are more numerically stable for training with negative log-likelihood loss. These log-probabilities represent the model’s predicted log-likelihood of each sentiment class (positive or negative).

import torch.nn as nn
import torch.nn.functional as F

class SentimentClassifier(nn.Module):
    def __init__(self, embedding_matrix, freeze_embeddings=True):
        super(SentimentClassifier, self).__init__()
        num_embeddings, embedding_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
        if freeze_embeddings:
            self.embedding.weight.requires_grad = False # Freeze embeddings
        self.linear = nn.Linear(embedding_dim, 2) # Output size is 2 for binary sentiment

    def forward(self, input_ids):
        embedded = self.embedding(input_ids) # [batch_size, seq_len, embedding_dim]
        pooled = torch.mean(embedded, dim=1) # [batch_size, embedding_dim] - Mean pooling over sequence length
        output = self.linear(pooled) # [batch_size, 2]
        return F.log_softmax(output, dim=1) # LogSoftmax for numerical stability with NLLLoss

# Initialize the model
model = SentimentClassifier(embedding_matrix, freeze_embeddings=True) # Freezing embeddings as per lab discussion
print(model)

We define the ‘SentimentClassifier’ class as a PyTorch ‘nn.Module’. In the ‘__init__’ method, we initialize the embedding layer with the pre-computed ‘embedding_matrix’. Based on the ‘freeze_embeddings’ flag, we decide whether to freeze the embedding layer weights or allow them to be updated during training. We also initialize a linear layer for classification. The ‘forward’ method defines the forward pass of the network: input IDs are passed through the embedding layer, then mean pooling is applied, followed by the linear layer and finally the Log-Softmax activation function. We initialize the model, deciding to freeze the embeddings based on the class discussion.

Discussion on Freezing the Embedding Layer Weights

We had a class discussion and vote on whether to freeze the weights of the embedding layer or allow them to be trained during the sentiment classification task. Based on the class vote, we decided to freeze the embedding layer weights.

Freezing the embedding layer means that the pre-trained GloVe embeddings remain constant throughout the training process. Only the weights of the linear classification layer are updated via backpropagation. This approach leverages the general semantic knowledge already captured in the pre-trained GloVe embeddings and can be particularly beneficial when the training dataset is relatively small. Freezing embeddings can act as a form of regularization, preventing overfitting to the training data and potentially improving generalization to unseen data. However, it also means that the model cannot fine-tune the word embeddings to be specifically optimized for the sentiment classification task at hand, which might limit performance in some cases. Training the embeddings, on the other hand, allows for task-specific adaptation but requires a larger dataset to avoid overfitting and effectively learn meaningful embeddings from scratch or fine-tune pre-trained ones.

Setting up the Dataset and DataLoader

To efficiently train our model, we need to prepare PyTorch datasets and data loaders. DataLoaders handle batching, shuffling, and parallel loading of the data, making the training process more efficient. We will create TensorDatasets from our processed training and test data and then use DataLoaders to manage these datasets.

from torch.utils.data import Dataset, DataLoader, TensorDataset

# Convert data to tensors (Illustrative - actual code processes full datasets)
# train_input_tensor = torch.tensor(padded_train_sequences, dtype=torch.long)
# train_output_tensor = torch.tensor(Y_train, dtype=torch.long) # Assuming Y_train is numerical sentiment labels
# test_input_tensor = torch.tensor(padded_test_sequences, dtype=torch.long)
# test_output_tensor = torch.tensor(Y_test, dtype=torch.long) # Assuming Y_test is numerical sentiment labels

# train_dataset = TensorDataset(train_input_tensor, train_output_tensor)
# test_dataset = TensorDataset(test_input_tensor, test_output_tensor)

batch_size = 64

# train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) # No need to shuffle test data

# Example of iterating through dataloader (Illustrative)
# for batch in train_dataloader:
#     input_batch, labels_batch = batch
#     print("Input Batch Shape:", input_batch.shape)
#     print("Labels Batch Shape:", labels_batch.shape)
#     break # Just showing one batch

We will convert our padded input sequences and sentiment labels into PyTorch tensors. Then, we create ‘TensorDataset’ instances for both training and test sets, combining input features and labels. Finally, we use ‘DataLoader’ to create iterators for these datasets, specifying a ‘batch_size’ and whether to shuffle the training data (shuffling is typically used for training but not for testing).

Model Training and Performance Evaluation

We will now proceed to train our sentiment classification model and evaluate its performance on the test dataset. We will use standard training loops and metrics, such as accuracy and loss, to assess the model’s effectiveness. We will train the model for a number of epochs and monitor the training and test performance.
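
The notebook’s exact training loop is not reproduced in these notes; the following is a minimal sketch under the assumptions already made above (NLLLoss to match the Log-Softmax output, an Adam optimizer, and the ‘train_dataloader’/‘test_dataloader’ built in the previous section), so names and hyperparameters may differ from the actual lab code.

# Minimal training/evaluation loop sketch (assumes the model and dataloaders defined above)
import torch
import torch.nn as nn

criterion = nn.NLLLoss()  # pairs with the log_softmax output of the model
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

def run_epoch(dataloader, train=True):
    model.train() if train else model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.set_grad_enabled(train):
        for input_batch, labels_batch in dataloader:
            log_probs = model(input_batch)
            loss = criterion(log_probs, labels_batch)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels_batch.size(0)
            correct += (log_probs.argmax(dim=1) == labels_batch).sum().item()
            total += labels_batch.size(0)
    return total_loss / total, correct / total

num_epochs = 10
for epoch in range(num_epochs):
    train_loss, train_acc = run_epoch(train_dataloader, train=True)
    test_loss, test_acc = run_epoch(test_dataloader, train=False)
    print(f"Epoch {epoch + 1}: train loss {train_loss:.4f}, acc {train_acc:.3f} | "
          f"test loss {test_loss:.4f}, acc {test_acc:.3f}")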

Analysis of Overfitting Behavior

During training, it is crucial to monitor the performance on both the training and test sets to detect overfitting. Overfitting occurs when the model learns the training data too well, capturing noise and specific details that do not generalize to new, unseen data. Signs of overfitting include high accuracy and low loss on the training set, but significantly lower accuracy and higher loss on the test set. We will analyze the training curves of accuracy and loss for both training and test sets to identify potential overfitting.

Comparison of Training with and without Freezing

To understand the impact of freezing pre-trained embeddings, we compare the performance of the model trained with frozen embeddings to a model trained without freezing the embeddings. By training two models—one with ‘freeze_embeddings=True’ and another with ‘freeze_embeddings=False’—and comparing their training and test performance, we can observe the effects of freezing embeddings on model generalization and overall performance. The results of this comparison are summarized in the performance plots shown below.

Performance Plots: Frozen vs. Non-Frozen Embeddings

As observed in Figure 1, the model trained with frozen embeddings (left side plots) exhibits a more stable training process. The training and test accuracies increase in tandem, and the training and test losses remain closely aligned. This indicates better generalization, as the model’s performance on the test set closely mirrors its performance on the training set. In contrast, the model trained without freezing embeddings (right side plots) shows clear signs of overfitting. The training accuracy rapidly approaches 100%, and the training loss nears zero, while the test accuracy plateaus at a considerably lower level, and the test loss diverges upwards from the training loss. This divergence highlights overfitting: the model is memorizing the training data rather than learning generalizable features. The comparison suggests that freezing the embedding layer, in this particular scenario, acts as an effective regularization technique, preventing overfitting and leading to a model that generalizes more effectively to unseen data, albeit with a slightly lower peak test accuracy compared to the overfitted model.

Conclusion

In summary, today’s session covered a review of the previous exercises and a hands-on laboratory activity focused on sentiment analysis using NLP techniques. We systematically went through the steps of data loading, preprocessing, vocabulary creation, and the implementation of a neural network model incorporating an embedding layer.

Key takeaways from the lab session include:

  • Importance of Data Preprocessing: Effective use of text data in NLP models hinges on meticulous preprocessing steps, including tokenization, lowercasing, and normalization, to reduce noise and standardize the input.

  • Leveraging Pre-trained Embeddings: GloVe embeddings offer a robust method for numerically representing words, capturing semantic relationships and enhancing model performance, particularly when training data is limited.

  • Impact of Freezing Embedding Layers: Freezing pre-trained embedding layers can serve as an effective regularization strategy, mitigating overfitting and improving a model’s ability to generalize, especially with smaller datasets. However, this approach may restrict the model’s capacity to adapt to the specific nuances of the task at hand.

  • Overfitting Diagnosis: Monitoring both training and test set performance is essential for diagnosing overfitting. A significant divergence between training and test accuracies and losses is a key indicator of overfitting.

For future exploration, it would be beneficial to experiment with various techniques aimed at mitigating overfitting when the embedding layers are not frozen. Strategies such as incorporating dropout layers or applying more intensive regularization methods could be investigated. Furthermore, comparing the performance against models trained from scratch, without leveraging pre-trained embeddings, would provide deeper insights into the advantages of transfer learning in NLP tasks and the effectiveness of pre-trained embeddings like GloVe.

Thank you, Beatrice and Ian, for leading the insightful laboratory session. The solutions and the lab notebook from today’s activity have been uploaded to the shared folder for your reference and further study. We hope you have a pleasant weekend, and we look forward to our next session with you next week.