Natural Language Processing and Word Embeddings

Author

Your Name

Published

January 28, 2025

Introduction

Good morning, everyone. Welcome and thank you for attending today’s lecture. Before we begin with the main topic, I want to share some recent news from OpenAI. They have just released Sora, a new system capable of generating videos from text prompts. While it is not yet available in Europe due to privacy regulations concerning user data, it is expected to become accessible in the near future. Sora is particularly interesting because users can generate videos simply by writing prompts, and can also browse existing videos together with the prompts used to create them, which provides a useful way to learn prompt engineering. This technology has the potential to significantly change the landscape of video generation.

Today, we will delve into Natural Language Processing (NLP) and word embeddings, foundational concepts for systems like ChatGPT and transformers. Similar to our approach in previous lectures on computer vision, we will explore some of the common tasks that the NLP research community is focused on. We will begin with document analysis, a broad and important area within NLP. The core idea of document analysis is to extract useful information from documents. In the current era, with the advent of powerful tools like ChatGPT, understanding and implementing these tasks has become more accessible and intuitive.

Natural Language Processing Tasks

Natural Language Processing encompasses a wide range of tasks designed to enable computers to understand, interpret, and generate human language. This section outlines several key tasks within NLP, illustrating the breadth and depth of this field.

Document Analysis

Document analysis is a foundational area in NLP, focusing on the automated extraction of meaningful information from textual documents. This broad task encompasses several more specific sub-tasks, each contributing to a deeper understanding of document content.

Text Summarization

Text summarization is the process of automatically condensing large volumes of text into concise summaries that capture the essential information. This is particularly useful for quickly understanding the gist of long documents, articles, or reports. Modern NLP tools, such as ChatGPT, demonstrate remarkable proficiency in text summarization, offering efficient ways to digest textual data.
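
To make this concrete, the following is a minimal sketch of automated summarization using the Hugging Face transformers library; the library choice, the sample text, and the length limits are assumptions made here for illustration, not tools prescribed by the lecture.

```python
# A minimal summarization sketch using the `transformers` pipeline
# (assumes `pip install transformers torch`; the default model downloads on first use).
from transformers import pipeline

summarizer = pipeline("summarization")

article = (
    "Natural Language Processing enables computers to understand, interpret, "
    "and generate human language. Among its many applications, text summarization "
    "condenses long documents into short summaries that preserve the essential "
    "information, which is useful for quickly digesting reports and articles."
)

# Generate a short summary; the token length limits are purely illustrative.
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```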

Named Entity Recognition

Named Entity Recognition (NER) is a task that involves identifying and classifying named entities within text. Entities are typically categorized into predefined types such as names of persons, organizations, locations, dates, and quantities. For example, in the sentence “Tim Cook is the CEO of Apple in Cupertino, California”, an NER system would identify “Tim Cook” as a person, “Apple” as an organization, and “Cupertino, California” as locations.
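
As a small illustration, the sketch below runs the example sentence through spaCy, an open-source NLP library; the library choice and the pre-trained en_core_web_sm model are assumptions made for demonstration purposes, not tools mandated by the lecture.

```python
# A minimal NER sketch using spaCy (an assumed tool choice; any NER system would do).
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook is the CEO of Apple in Cupertino, California.")

# Print each detected entity with its predicted type (PERSON, ORG, GPE, ...).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output (model-dependent):
#   Tim Cook -> PERSON
#   Apple -> ORG
#   Cupertino -> GPE
#   California -> GPE
```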

The capability of NER is crucial for knowledge graph construction and information retrieval. By identifying and cataloging entities across numerous documents, it becomes possible to establish links and relationships between them, creating a network of interconnected information. This is invaluable for advanced research, enabling users to explore and navigate large sets of documents based on key entities. For instance, recognizing the entity “Tim Cook” across various documents allows a system to link all documents mentioning him, facilitating topic-based browsing and research.

Question Answering

Question Answering (QA) systems are designed to provide direct answers to questions posed in natural language. This represents a significant shift in information-seeking behavior on the internet. Historically, internet searches relied on keyword queries that returned lists of web pages, requiring users to manually search for the answer within those pages. Modern QA systems, however, aim to provide the answer to the user’s query directly.

This paradigm shift is exemplified by systems like ChatGPT and Claude, which not only attempt to answer questions directly but also often provide citations or references to the sources from which the answers are derived. This enhances the transparency and credibility of the information provided, marking a substantial advancement in how users interact with and obtain information from digital sources. This approach mirrors the natural human process of asking a question and receiving a direct, referenced answer.

Social Media Analysis

Social Media Analysis leverages NLP techniques to process and interpret the vast amounts of text generated on social media platforms. The scale and dynamic nature of social media content necessitate automated NLP tools for effective analysis. Key tasks in this domain include sentiment analysis, trend detection, and understanding public opinion. For example, analyzing comments on social media posts from influencers can automatically summarize prevalent opinions, gauge the sentiment (positive, negative, or neutral) towards a topic, and identify emerging trends or discussion points.
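
As an illustration of the sentiment-analysis component, the sketch below uses a Hugging Face transformers pipeline; the library and the example comments are assumptions made here for demonstration, and the default model distinguishes only positive from negative sentiment.

```python
# A minimal sentiment-analysis sketch using the `transformers` pipeline
# (assumes `pip install transformers torch`; the default model downloads on first use).
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

comments = [
    "This new product looks amazing, can't wait to try it!",
    "Honestly the update made everything slower and more confusing.",
]

# Each result has a label (POSITIVE/NEGATIVE with the default model) and a confidence score.
for comment, result in zip(comments, sentiment(comments)):
    print(f"{result['label']:8s} ({result['score']:.2f})  {comment}")
```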

Prior to the widespread availability of versatile tools like ChatGPT, addressing these social media analysis tasks required the development of specialized, task-specific tools. Today, large language models offer a more generalized solution, capable of handling a diverse range of analytical tasks across different social media platforms with remarkable adaptability.

Real-Time Transcription and Translation

Real-time transcription and translation concerns the immediate conversion of spoken language into written text and its subsequent translation into another language. This is a technically challenging area, addressing the need for seamless communication across language barriers in real-time settings. Platforms like Skype and specialized companies are actively developing solutions for real-time translation, although truly seamless, accurate real-time translation remains an ongoing challenge. A comprehensive system typically involves several complex stages: accurate transcription of spoken audio into text, precise translation of the transcribed text into the target language, and natural-sounding voice synthesis to deliver the translated text audibly.
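
To illustrate just the transcription stage of such a pipeline, the sketch below uses the open-source openai-whisper package; the package choice and the file name audio.mp3 are assumptions for illustration, and the translation and voice-synthesis stages would require separate components.

```python
# A minimal sketch of the transcription stage using `openai-whisper`
# (assumes `pip install openai-whisper` and ffmpeg installed; `audio.mp3` is hypothetical).
import whisper

model = whisper.load_model("base")        # small multilingual speech-to-text model
result = model.transcribe("audio.mp3")    # returns a dict containing the recognized text
print(result["text"])

# Whisper can also translate the speech into English while transcribing:
result_en = model.transcribe("audio.mp3", task="translate")
print(result_en["text"])
```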

Innovations in this area are exemplified by companies like Heygen, which are exploring the use of avatars to enhance the user experience of real-time translation. Heygen’s technology allows for the creation of videos where digital avatars speak text provided by users, and they are further developing capabilities to create personalized avatars from user-provided video, aiming to make real-time communication more engaging and personalized.

Automated Email Responses

Automated Email Response systems utilize NLP to automatically generate responses to incoming emails, particularly those that are routine or frequently asked. Many organizations face significant challenges in managing large volumes of email, with a considerable portion consisting of repetitive inquiries regarding standard information such as business hours, service details, or basic product queries. NLP-driven systems can be deployed to automatically identify and respond to these common questions, thereby significantly reducing the workload on human staff and improving response times for routine inquiries.

By integrating NLP components into email management workflows, organizations can achieve substantial gains in efficiency and customer service. These systems can filter and address a significant portion of email traffic automatically, allowing human agents to focus on more complex, nuanced, or unique communications that require personalized attention.

Chatbot Design

Chatbot Design involves the creation of interactive conversational agents, or chatbots, capable of engaging with users in natural language. Chatbots are designed to provide information, answer questions, and perform tasks through natural text-based or voice-based interactions, and they are increasingly utilized across various sectors for customer support, information dissemination, and user engagement. Universities and other organizations are deploying chatbots on their websites to enhance user experience by providing immediate access to information and support, alleviating the burden on traditional support channels such as email and phone and giving users a more efficient and intuitive way to find information.

For example, the University of Udine is in the process of deploying a chatbot on its homepage to assist users in navigating the university website and accessing information more effectively. This initiative is a direct response to user feedback indicating that the extensive website can be challenging to navigate; a chatbot offers a more user-friendly alternative to traditional site navigation.

Fake News Detection

Fake News Detection is an increasingly critical application of NLP, particularly in the contemporary digital information environment where misinformation can proliferate rapidly through online channels. The primary objective of fake news detection systems is to automatically identify news articles and other forms of online content that are likely to be intentionally false, misleading, or propagandistic. These systems typically employ a range of NLP techniques to analyze textual content, assess the credibility of sources, and cross-reference information against trusted repositories of factual data.

The rise of sophisticated text generation models, such as ChatGPT, has amplified the importance of fake news detection. These powerful AI tools can be exploited to generate highly convincing yet entirely fabricated news articles and social media posts, making it more challenging to distinguish between credible and false information. Consequently, there is a growing emphasis on developing robust and scalable fake news detection technologies. For instance, a project involving collaboration with MIT aims to address specific challenges within this broad area. The urgency and relevance of this research are underscored by real-world examples, such as the rapid emergence of a book about King Charles’ cancer diagnosis on Amazon within hours of the public announcement. This example illustrates the speed at which content, potentially including misinformation, can be generated and disseminated, highlighting the critical need for effective fake news detection mechanisms.

Word Embeddings

The Need for Word Embeddings

As we have previously established, computers process information in the form of numbers. Therefore, to enable computers to understand and process text, it is essential to convert words into numerical representations. In the domain of image processing, we observe that similar images, even with minor pixel variations, are represented by numerical vectors that are also similar, reflecting a degree of semantic consistency. However, this direct analogy does not hold for natural language. In text, even a small change, such as altering a single letter within a word, can lead to a significant shift in meaning.

Consider, for instance, the words “sun” and “son”. These words are nearly identical in spelling, differing by just one letter, yet they possess entirely different meanings. This example illustrates a critical challenge: representing words numerically in a way that captures and preserves semantic relationships. Our objective is to develop a method where words with similar meanings are represented by numerical vectors that are also close to each other in a vector space, thereby reflecting their semantic similarity.

Numerical Representation of Words

The fundamental goal is to devise a system for representing words using numbers, such that these numerical representations effectively encode and reflect the semantic relationships between words. Several approaches have been considered, each with varying degrees of success in achieving this goal.

Simple Integer Mapping

One of the earliest and most straightforward approaches to numerical word representation is simple integer mapping. This method involves creating a vocabulary of all unique words in a corpus and assigning a unique integer to each word, often based on alphabetical order or frequency. For example, we might assign “a” the integer 1, “able” the integer 2, “about” the integer 3, and so forth.

Vocabulary: {a, able, about, ... , zebra}

  • a \(\rightarrow\) 1

  • able \(\rightarrow\) 2

  • about \(\rightarrow\) 3

  • ...

  • zebra \(\rightarrow\) N (vocabulary size)
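
A minimal Python sketch of this mapping, using a tiny illustrative vocabulary, makes the arbitrariness explicit:

```python
# Simple integer mapping: each unique word gets an integer based on its
# alphabetical position in a toy vocabulary.
vocabulary = sorted({"a", "able", "about", "glad", "grim", "happy", "zebra"})

word_to_id = {word: idx + 1 for idx, word in enumerate(vocabulary)}
print(word_to_id)
# {'a': 1, 'able': 2, 'about': 3, 'glad': 4, 'grim': 5, 'happy': 6, 'zebra': 7}

# Numeric closeness says nothing about meaning:
print(abs(word_to_id["glad"] - word_to_id["grim"]))   # 1: adjacent numbers, opposite meanings
print(abs(word_to_id["glad"] - word_to_id["happy"]))  # 2: near-synonyms end up farther apart
```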

While this method is computationally simple to implement, it completely fails to capture any semantic nuances. The assigned numerical values are arbitrary and based on superficial criteria like alphabetical order, which have no correlation with word meaning. For example, in an alphabetical mapping, “and” and “at” might be numerically close, but semantically, they are not particularly related. Conversely, words that are semantically similar could be assigned numbers that are far apart, simply due to their alphabetical positions. Critically, this approach does not facilitate the use of distance metrics to assess semantic similarity, as the numerical distances do not reflect semantic distances.

One-Hot Encoding

Another early technique for representing words numerically is one-hot encoding. In this method, each word in the vocabulary is represented as a vector where only one element is set to 1, while all other elements are 0. The position of the ‘1’ in the vector corresponds to the index of the word in the vocabulary. If a vocabulary contains \(V\) unique words, each word is represented by a \(V\)-dimensional vector.

Vocabulary: {apple, banana, cherry}

  • apple \(\rightarrow\) \([1, 0, 0]\)

  • banana \(\rightarrow\) \([0, 1, 0]\)

  • cherry \(\rightarrow\) \([0, 0, 1]\)

For a vocabulary of 10,000 words, each vector would be 10,000-dimensional.

One-hot encoding successfully converts words into a numerical format suitable for machine processing. However, it suffers from significant limitations. First, for large vocabularies, the dimensionality of the vectors becomes excessively high, leading to computational inefficiency and increased memory usage. These high-dimensional vectors are also sparse, meaning most of their elements are zero, which further contributes to computational inefficiency.

More fundamentally, one-hot encoding fails to capture semantic relationships between words. In the one-hot vector space, every word is equidistant from every other word. The dot product of any two different one-hot vectors is always zero, and the Euclidean distance is \(\sqrt{2}\), regardless of the semantic similarity of the words. For instance, the distance between the vectors for “king” and “queen” is the same as the distance between “king” and “dog”, even though “king” and “queen” are semantically much more closely related than “king” and “dog”. This uniform distance metric makes one-hot encoded vectors ineffective for tasks that require understanding semantic similarity.
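
The following numpy sketch makes this uniform-distance property concrete; the three-word vocabulary is purely illustrative.

```python
# One-hot vectors for a toy vocabulary: every pair of distinct words is equally far apart.
import numpy as np

vocabulary = ["king", "queen", "dog"]
one_hot = {w: np.eye(len(vocabulary))[i] for i, w in enumerate(vocabulary)}

print(one_hot["king"])   # [1. 0. 0.]

# The dot product between any two different words is 0, and the Euclidean distance
# is sqrt(2) in every case, regardless of meaning.
print(np.dot(one_hot["king"], one_hot["queen"]))           # 0.0
print(np.linalg.norm(one_hot["king"] - one_hot["queen"]))  # 1.4142...
print(np.linalg.norm(one_hot["king"] - one_hot["dog"]))    # 1.4142...
```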

Semantic Meaning in Vector Space

To overcome the limitations of simple integer mapping and one-hot encoding, a more sophisticated approach is needed to represent words in a way that encodes semantic meaning. Word embeddings provide such a representation by mapping words into a lower-dimensional, continuous vector space. In this space, the position of each word vector is learned based on its semantic context, such that words with similar meanings are located closer to each other.

Imagine a semantic spectrum, such as sentiment, ranging from “negative” to “positive”. We can visualize words positioned along this spectrum based on their sentiment polarity. For example, “happy” and “excited” would be placed towards the positive end, “boring” towards the negative end, and semantically neutral words like “paper” would be located in a more central position.

Words plotted on a 1D semantic axis representing sentiment.

Expanding upon this, we can consider a multi-dimensional semantic space to capture more complex semantic relationships. For instance, a 2D semantic space could be defined by two axes: one representing “sentiment” (positive to negative) and another representing “concreteness” (concrete to abstract). In this 2D space, words can be positioned based on their coordinates along these semantic dimensions. For example, “excited” might be placed in the quadrant representing positive and abstract, while “paper” would be in the quadrant representing neutral and concrete.

Words plotted in a 2D semantic space with axes representing sentiment and concreteness.

The advantages of using word embeddings are significant:

  1. Low Dimensionality: Word embeddings typically have a much lower dimensionality compared to one-hot vectors. Common dimensions range from 100 to 500, significantly less than the vocabulary size, which can be in the tens or hundreds of thousands. This reduced dimensionality makes computations more efficient and reduces memory requirements.

  2. Semantic Distance: Word embeddings are learned in such a way that the distance between word vectors in the embedding space reflects their semantic similarity. Semantically similar words are positioned closer together, while dissimilar words are further apart. This property is crucial for NLP tasks as it allows models to naturally capture and utilize semantic relationships between words.
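
To illustrate the semantic-distance property, the sketch below uses hand-picked two-dimensional vectors, loosely following the sentiment and concreteness axes described above; real embeddings are learned from data and have hundreds of dimensions, so these numbers are assumptions purely for illustration.

```python
# Toy 2D "embeddings": axis 0 ~ sentiment, axis 1 ~ concreteness (hand-picked values).
import numpy as np

embeddings = {
    "happy":   np.array([ 0.9, -0.7]),   # positive, fairly abstract
    "excited": np.array([ 0.8, -0.8]),   # positive, abstract
    "boring":  np.array([-0.8, -0.5]),   # negative, abstract
    "paper":   np.array([ 0.0,  0.9]),   # neutral, concrete
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["happy"], embeddings["excited"]))  # ~0.99: very similar
print(cosine_similarity(embeddings["happy"], embeddings["boring"]))   # ~-0.34: dissimilar
```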

Learning Word Embeddings

While the concept of a semantic space is intuitive, manually defining the semantic axes and positioning words within this space is impractical, subjective, and not scalable. The breakthrough in word embeddings came with the development of automated methods to learn these embeddings directly from large text corpora.

The underlying principle behind learning word embeddings is the distributional hypothesis, which posits that words that appear in similar contexts tend to have similar meanings. Algorithms like Word2Vec and GloVe are designed to exploit this hypothesis to automatically learn word embeddings from text data.

The general process for learning word embeddings typically involves the following steps (a minimal training sketch follows the list):

  1. Initialization: Begin by assigning random, low-dimensional vectors as initial embeddings for all words in the vocabulary. At this stage, these vectors do not yet encode any meaningful semantic information.

  2. Contextual Learning: Train a model on a large corpus of text to predict a word based on its surrounding context words (or vice versa). For example, given the sentence “The king and queen are in the castle”, the model might be trained to predict “king” and “queen” based on the context of words like “the”, “and”, “are”, “in”, and “castle”. This training is performed across a vast dataset of text.

  3. Vector Adjustment (Iterative Refinement): During the training process, the model iteratively adjusts the word vectors. Words that are found to occur in similar contexts are moved closer together in the vector space, while words with dissimilar contexts are pushed further apart. This adjustment is driven by the objective of improving the model’s ability to predict words from their contexts.
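
These steps are what off-the-shelf implementations automate. Below is a minimal training sketch using gensim’s Word2Vec; the library choice, the toy corpus, and all hyperparameters are assumptions for illustration, and a corpus this small cannot produce genuinely meaningful embeddings.

```python
# A minimal Word2Vec training sketch with gensim (assumes `pip install gensim`).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "and", "queen", "are", "in", "the", "castle"],
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "sleeps", "in", "the", "garden"],
]

# vector_size: embedding dimensionality; window: context size; sg=1: skip-gram objective.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=200)

print(model.wv["king"][:5])                   # first components of the learned vector
print(model.wv.similarity("king", "queen"))   # similarity driven by shared contexts
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the embedding space
```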

Large language models like ChatGPT and transformer networks heavily utilize word embeddings as a foundational component. When text is input into ChatGPT, each word (or, more precisely, each subword token) is first converted into its corresponding embedding vector. These vectors then serve as the input to the neural network architecture of the model. By using such embeddings, these models can effectively understand the semantic content of the input text, enabling them to generate contextually relevant and semantically coherent responses. Classic pre-trained word embeddings typically have a few hundred dimensions (often in the range of 300 to 500), while large language models use even higher-dimensional embeddings, providing a rich and nuanced representation of word semantics.

Initially, before the training process begins, the word embeddings are essentially random vectors. The extensive training on massive text datasets is what refines these initial random embeddings into meaningful semantic representations. Through this dynamic adjustment process, the model learns to position word vectors in the semantic space in a way that reflects the co-occurrence patterns and contextual relationships observed in the training data. This learned semantic space is what empowers models like ChatGPT to achieve a sophisticated understanding of natural language. Pre-trained word embeddings, such as those available from the Stanford NLP group (e.g., GloVe embeddings), exemplify the result of this learning process, providing numerical representations of words that capture rich semantic information learned from vast amounts of text data. These embeddings can then be used as a starting point for various NLP tasks.
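
As a sketch of how such pre-trained embeddings can be used in practice, the example below loads GloVe vectors through gensim’s downloader; this access path and the specific model name are assumptions (the Stanford NLP group also distributes the raw vector files directly), and the first call downloads roughly 130 MB.

```python
# Loading pre-trained 100-dimensional GloVe vectors via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

print(glove["king"].shape)                 # (100,)
print(glove.most_similar("king", topn=3))  # semantically related words

# The classic analogy: king - man + woman is closest to queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```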

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) represent a distinct class of neural networks specifically engineered to process sequential data. Unlike traditional feedforward networks that treat each input independently, RNNs are designed to handle sequences of data where the order and temporal dependencies between data points are significant. This section will detail the architecture, functionality, and key characteristics of RNNs.

Sequence Data and Applications

Sequence data is characterized by a series of data points where each point is related to the ones that precede and follow it. The order in which these data points appear is crucial for understanding the information they convey. Common examples of sequence data include:

  • Text: In natural language, text is inherently sequential. Sentences are composed of words arranged in a specific order, and altering this order can change or negate the meaning. For example, understanding the question “Do you want to join me for a coffee?” requires processing the words in the given sequence.

  • Music: Music is a sequence of notes or audio signals unfolding over time. The melody, rhythm, and harmony are all perceived and understood sequentially.

  • Video: Video is a sequence of frames displayed in rapid succession to create the illusion of motion. Understanding events in a video requires processing these frames in order.

  • Time Series Data: Data points collected over time, such as stock prices, sensor readings, or weather data, are also sequential, where the value at any point is often dependent on previous values.

In contrast to static data like images, where spatial relationships are paramount, sequence data emphasizes temporal or sequential dependencies. While an image can be considered as a single, static entity, sequence data unfolds over time, and its meaning is derived from this progression.

Recurrent Architecture and Sequential Processing

The core innovation of RNNs lies in their recurrent architecture, which allows them to maintain an internal state while processing a sequence of inputs. This internal state acts as a form of memory, enabling the network to retain information about past inputs in the sequence and use it to influence the processing of current and future inputs.

Imagine an RNN as a processing “box” that operates sequentially on each element of an input sequence. For a sentence, this box would process one word at a time. When the first word (represented as its word embedding) is fed into the RNN, the box processes it and updates its internal state. This state now holds information about the first word. As the next word in the sequence is input, the RNN processes it in conjunction with its current internal state, further updating the state to reflect the information from both the first and second words. This process repeats for each word in the sentence, with the internal state being continuously refined to accumulate information from the entire sequence.

This sequential processing mechanism is what allows RNNs to understand context and dependencies in sequential data. By maintaining and updating an internal state, RNNs can effectively “remember” and utilize information from earlier parts of the sequence when processing later parts.

Hidden States as Memory

The internal state maintained by an RNN is formally known as the hidden state. At each time step, the hidden state is a vector that encapsulates information about the sequence processed up to that point. It serves as the RNN’s memory, storing relevant features and patterns extracted from the input sequence.

When an RNN processes a new input at time step \(t\), it updates its hidden state \(h_t\) based on two primary sources of information: the current input \(x_t\) and the hidden state from the previous time step \(h_{t-1}\). This update mechanism is crucial because it allows the RNN to incorporate both the immediate input and the historical context from the sequence. The hidden state is thus dynamically adjusted at each step, effectively accumulating and refining its representation of the sequence as it progresses. This accumulated information in the hidden state is then used to make predictions or decisions based on the entire sequence processed so far.

Detailed Connections and Weight Sharing

To delve deeper into the mechanics of RNNs, let’s examine the connections and computations involved at each time step. Consider an example where each word is represented by a word embedding vector \(x_t\). In an RNN, these input vectors are processed sequentially.

At each time step \(t\), the input vector \(x_t\) is fed into the RNN unit. This input is connected to the hidden state layer through a weight matrix \(W_{ax}\). Simultaneously, the hidden state from the previous time step, \(h_{t-1}\), is also connected to the current hidden state through another weight matrix \(W_{aa}\). The current hidden state \(h_t\) is computed as a function of both the current input and the previous hidden state.

Simplified representation of an RNN unit at time step t, showing connections from input (x_t) and previous hidden state (h_{t-1}) to the current hidden state (h_t).

Mathematically, the hidden state update is typically formulated as: \[h_t = f(W_{ax} x_t + W_{aa} h_{t-1} + b_a)\] where:

  • \(h_t\) is the hidden state at time step \(t\).

  • \(x_t\) is the input vector at time step \(t\).

  • \(h_{t-1}\) is the hidden state from the previous time step \(t-1\). At the first time step there is no previous state, so \(h_0\) is usually initialized to a vector of zeros.

  • \(W_{ax}\) is the weight matrix that connects the input \(x_t\) to the hidden state. This matrix learns how to weigh the importance of the current input.

  • \(W_{aa}\) is the weight matrix that connects the previous hidden state \(h_{t-1}\) to the current hidden state. This matrix learns how to incorporate past information into the current state.

  • \(b_a\) is a bias vector, which provides an offset to the computation.

  • \(f\) is an activation function, such as tanh or ReLU, which introduces non-linearity into the network, enabling it to learn complex patterns.

A defining characteristic of RNNs is weight sharing. The weight matrices \(W_{ax}\) and \(W_{aa}\), as well as the bias vector \(b_a\), are shared across all time steps in the sequence. This means that the same set of parameters is used to process each input in the sequence, regardless of its position.
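
A minimal numpy sketch of this recurrence is given below; the dimensions and random inputs are assumptions for illustration, and the point to notice is that the same \(W_{ax}\), \(W_{aa}\), and \(b_a\) are reused at every time step.

```python
# RNN forward pass: h_t = tanh(Wax @ x_t + Waa @ h_{t-1} + b_a), with shared weights.
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim, seq_len = 8, 4, 5          # illustrative sizes

Wax = rng.normal(scale=0.1, size=(hidden_dim, embedding_dim))  # input -> hidden weights
Waa = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))     # hidden -> hidden weights
b_a = np.zeros(hidden_dim)                                     # bias

xs = rng.normal(size=(seq_len, embedding_dim))   # a toy "sentence" of embedding vectors
h = np.zeros(hidden_dim)                         # h_0 initialized to zeros

for t, x_t in enumerate(xs):
    # The same Wax, Waa, b_a are applied at every position (weight sharing).
    h = np.tanh(Wax @ x_t + Waa @ h + b_a)
    print(f"t={t}: h_t = {np.round(h, 3)}")

# After the loop, h summarizes the whole sequence and could feed a classifier.
```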

Rationale for Weight Sharing: Position Independence

Weight sharing is not merely an architectural convenience; it is fundamentally important for enabling RNNs to generalize and learn effectively from sequential data. The primary reason for weight sharing is to ensure that the network processes each element of the sequence in a consistent manner, irrespective of its position in the sequence.

Consider the task of understanding the semantic content of a sentence. The meaning of words and their relationships should ideally be interpreted consistently throughout the sentence, regardless of whether a word appears at the beginning, middle, or end. If different weight matrices were used at each time step, the network would treat identical inputs differently based solely on their position in the sequence. This would hinder the RNN’s ability to recognize patterns and semantic relationships that are position-independent.

For example, consider the sentences:

  1. “Good morning, Alex. Today the weather is great. How are you?”

  2. “Hi, Alex, how are you?”

Both sentences contain the question “how are you?”. With weight sharing, the RNN processes “how are you?” in a consistent manner in both sentences, allowing it to recognize the question irrespective of the preceding context. If weights were not shared, the network might interpret “how are you?” differently in each sentence, undermining its ability to generalize and understand language structure.

Weight sharing enforces consistency in processing across the sequence, allowing the RNN to learn generalizable features and patterns that are relevant regardless of position. This is crucial for tasks like language modeling, sentiment analysis, and sequence translation, where understanding position-independent features is essential.

Limitations: Fixed Input Sequence Length

Standard RNN architectures are typically configured to handle input sequences up to a predefined maximum length. Although weight sharing means the recurrence can, in principle, be applied to sequences of any length, the network is unrolled for a fixed number of time steps during training and inference, which imposes this practical limit. When processing sequences longer than the predefined length, truncation or other methods must be employed.

If an input sequence exceeds the maximum length that the RNN is configured to handle, a common approach is to truncate the sequence, effectively discarding the elements beyond the maximum length. For tasks like sentiment analysis of customer reviews, where reviews can vary in length, it is often practical to consider only the first \(N\) words of each review if it exceeds the maximum allowed length. While truncation simplifies processing, it may lead to information loss, especially if the later parts of the sequence contain crucial information. More advanced RNN architectures, such as LSTMs and GRUs, and attention mechanisms, have been developed to mitigate some of these limitations and handle longer sequences more effectively, but the basic RNN structure inherently has constraints related to sequence length.
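
A minimal sketch of this preprocessing step is shown below; the padding convention (pad_id = 0) and the token ids are assumptions for illustration.

```python
# Fit variable-length token sequences to a fixed length by truncating or padding.
def fit_to_length(token_ids, max_len, pad_id=0):
    if len(token_ids) > max_len:
        return token_ids[:max_len]                    # truncate: later tokens are discarded
    return token_ids + [pad_id] * (max_len - len(token_ids))  # pad shorter sequences

print(fit_to_length([5, 12, 7, 3, 9, 41, 2], max_len=5))  # [5, 12, 7, 3, 9]
print(fit_to_length([5, 12, 7], max_len=5))               # [5, 12, 7, 0, 0]
```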

Despite this limitation, RNNs are powerful tools for a wide range of sequence processing tasks, particularly those where the dependencies are relatively short-range and the input sequences are of manageable length.

Conclusion

In summary, today’s lecture provided an overview of several key tasks within Natural Language Processing, including document analysis, question answering, social media analysis, real-time transcription and translation, automated email responses, chatbot design, and fake news detection. We emphasized the increasing relevance of these tasks in our daily lives and in various industries, especially with the rise of advanced AI systems.

We then transitioned to a detailed discussion of word embeddings, a crucial concept for modern NLP. We explored the necessity of word embeddings in creating meaningful numerical representations of words that capture semantic relationships, contrasting them with less effective methods like simple integer mapping and one-hot encoding. We highlighted how word embeddings address the limitations of these simpler methods by placing semantically similar words closer together in a vector space, thus enabling NLP models to understand and leverage word relationships effectively.

Finally, we introduced Recurrent Neural Networks (RNNs) as a powerful neural network architecture specifically designed for processing sequence data. We detailed the structure of RNNs, focusing on the concept of hidden states as a mechanism for maintaining memory of past inputs and the importance of weight sharing across time steps for consistent and position-independent processing of sequential information. The key takeaways from today’s lecture are:

  • Breadth of NLP Tasks: NLP is a diverse field encompassing a wide range of tasks critical for understanding and generating human language, with applications spanning from information retrieval to combating misinformation.

  • Importance of Word Embeddings: Word embeddings are fundamental for enabling NLP models to understand semantic meaning by representing words as vectors in a continuous space where distances reflect semantic similarity.

  • RNNs for Sequence Data: Recurrent Neural Networks are specifically designed to process sequential data by utilizing hidden states to maintain context and employing weight sharing to ensure consistent processing across sequences.

In our next session, we will build upon today’s lecture by engaging in practical activities to further explore word embeddings and their properties. We will also delve deeper into the architectures and applications of Recurrent Neural Networks in NLP, examining how these networks are used to solve real-world problems. As preparation for the next lecture, please consider exploring examples of semantically similar words and reflect on how different word representation methods might capture these similarities.

Thank you for your active participation and insightful questions during today’s lecture. For those attending remotely, thank you for joining online, and please feel free to share any further questions or comments via the chat. I look forward to continuing our exploration of NLP in the next class.