Large Language Models: From BERT to ChatGPT

Author

Your Name

Published

January 28, 2025

Introduction

This lecture explores the evolution and inner workings of large language models (LLMs), focusing on the transition from BERT to ChatGPT. We will examine the architecture of transformers, the concept of word embeddings, and the application of LLMs to various natural language processing tasks. The main objectives are to understand how these models are trained, how they represent text, and how they can be fine-tuned for specific tasks like text classification, named entity recognition, and question answering. Additionally, we will delve into the architecture of GPT and its advancements, culminating in an analysis of ChatGPT’s capabilities, limitations, and human-feedback-based training process.

  • Transformer architecture and its components (encoder, decoder, self-attention)

  • Word embeddings and their role in representing text

  • Application of LLMs to NLP tasks (text classification, NER, question answering)

  • Evolution from BERT to GPT

  • ChatGPT’s architecture, training process, and capabilities

  • Limitations of ChatGPT (hallucinations, reasoning)

  • Prompt engineering techniques

Background: BERT and Transformers

Transformer Architecture

The transformer architecture, introduced in the paper "Attention is All You Need," revolutionized natural language processing. It consists of an encoder and a decoder, both of which utilize self-attention mechanisms to process input sequences. The transformer’s ability to capture long-range dependencies through self-attention was a significant advancement over previous models like RNNs and LSTMs.


BERT: Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers), developed by Google in 2018, utilizes the encoder part of the transformer. It is pre-trained on a large corpus of text, including Wikipedia and BooksCorpus, using a masked language modeling objective. In this approach, random words in a sentence are masked, and the model’s task is to predict the masked words based on the surrounding context. This bidirectional training allows BERT to understand the context from both directions (left-to-right and right-to-left), hence the name "bidirectional."

Word embeddings are vector representations of words, where semantically similar words are mapped to nearby points in the vector space. These embeddings capture semantic relationships and contextual information.
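To make "nearby points" concrete, here is a minimal sketch that compares toy embedding vectors with cosine similarity; the words and vector values below are invented for illustration and are not taken from any real model.

    import numpy as np

    # Toy 4-dimensional embeddings. Real models use hundreds of dimensions
    # and learn these values during training; the numbers here are invented.
    embeddings = {
        "king":  np.array([0.80, 0.65, 0.10, 0.05]),
        "queen": np.array([0.75, 0.70, 0.15, 0.10]),
        "apple": np.array([0.10, 0.05, 0.90, 0.80]),
    }

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors; 1.0 means identical direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~1.0)
    print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low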

BERT’s Training Process

During training, BERT updates its internal parameters to refine the word embeddings, enabling them to capture contextual information effectively. The output for a given input sentence is a new sequence of representations, denoted as \(O_1, O_2, O_3, \dots\), where each \(O_i\) corresponds to the contextualized representation of the \(i\)-th word in the input sentence. This process allows BERT to generate rich and context-aware embeddings that can be used for various downstream tasks.

Input: A sequence of words \(W = [w_1, w_2, \dots, w_n]\)
Output: Predicted probabilities for masked words

  1. Create a copy of \(W\) called \(W'\).
  2. Randomly select a subset of indices \(I\) from \([1, 2, \dots, n]\) (e.g., 15% of the words).
  3. For each \(i \in I\): with 80% probability, replace \(w'_i\) with [MASK]; with 10% probability, replace \(w'_i\) with a random word; with 10% probability, keep \(w'_i\) unchanged.
  4. Feed \(W'\) to the BERT encoder and obtain the output embeddings \(O = [O_1, O_2, \dots, O_n]\).
  5. For each \(i \in I\), predict the original word \(w_i\) using the embedding \(O_i\).
  6. Calculate the loss based on the predictions.
  7. Update BERT’s parameters using backpropagation.
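A minimal Python sketch of the 80/10/10 masking step above, assuming a toy whitespace tokenizer and an invented mini-vocabulary; real BERT implementations operate on subword token IDs rather than whole words.

    import random

    def mask_tokens(tokens, vocab, mask_rate=0.15):
        """BERT-style corruption: for ~15% of positions, replace the word with
        [MASK] 80% of the time, a random word 10%, and leave it unchanged 10%.
        Returns the corrupted sequence and the indices the model must predict."""
        masked = list(tokens)
        n_to_mask = max(1, int(len(tokens) * mask_rate))
        indices = random.sample(range(len(tokens)), n_to_mask)
        for i in indices:
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original word (the model still predicts it)
        return masked, indices

    tokens = "the cat sat on the mat".split()
    vocab = ["dog", "ran", "blue", "tree"]  # invented mini-vocabulary
    corrupted, targets = mask_tokens(tokens, vocab)
    print(corrupted, targets)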

Complexity Analysis:

  • Time Complexity: The time complexity of BERT’s training is primarily determined by the self-attention mechanism, which has a time complexity of \(O(n^2 \cdot d)\), where \(n\) is the sequence length and \(d\) is the embedding dimension.

  • Space Complexity: The space complexity is dominated by the model parameters, which grow with the number of layers, attention heads, and embedding dimensions.

Large Language Models (LLMs)

LLMs are built upon the transformer architecture and are trained on massive datasets, often containing billions of words. They serve as a foundation for various downstream tasks, providing a powerful and versatile tool for natural language processing. The scale of these models allows them to learn complex patterns and relationships in language, leading to impressive performance across a wide range of applications.

Pre-trained Language Models

A pre-trained language model is trained on a large corpus of text, such as Wikipedia, books, and web crawl data. This initial training phase allows the model to learn general language understanding and generation capabilities. Subsequently, the pre-trained model can be fine-tuned for specific tasks using smaller, task-specific datasets. This transfer learning approach reduces the need for large labeled datasets for each task, making it more efficient and practical to develop NLP applications.

  • Reduced Data Requirements: Fine-tuning requires significantly less labeled data compared to training from scratch.

  • Improved Performance: Pre-training provides a strong foundation, leading to better performance on downstream tasks.

  • Faster Training: Fine-tuning is typically faster than training a model from scratch.

  • Generalization: Pre-trained models can generalize well to new, unseen data within the same domain.

Applications of LLMs

LLMs have demonstrated remarkable capabilities in various natural language processing tasks, including:

  • Text Classification: Classifying text into predefined categories, such as sentiment analysis (positive, negative, neutral), topic classification (sports, politics, technology), and spam detection.

  • Question Answering: Finding answers to questions within a given text passage or document. This can range from simple factoid questions to complex questions requiring reasoning and inference.

  • Named Entity Recognition (NER): Identifying and classifying named entities in text, such as persons, organizations, locations, dates, and times.

  • Machine Translation: Translating sentences or entire documents between languages, given appropriate training.

  • Text Summarization: Generating concise and coherent summaries of longer texts, such as news articles or research papers.

Notable pre-trained language models include:

  • BERT (Bidirectional Encoder Representations from Transformers)

  • GPT (Generative Pre-trained Transformer)

  • RoBERTa (A Robustly Optimized BERT Pretraining Approach)

  • T5 (Text-to-Text Transfer Transformer)

  • LaMDA (Language Model for Dialogue Applications)

From Encoder to Decoder: The Rise of GPT

GPT: Generative Pre-trained Transformer

GPT (Generative Pre-trained Transformer), developed by OpenAI, represents a significant shift in the application of transformers. Unlike BERT, which uses the encoder part of the transformer for bidirectional context understanding, GPT utilizes the decoder part. This makes GPT inherently autoregressive, meaning it is trained to predict the next word in a sequence, making it particularly well-suited for text generation tasks. The initial version, GPT-1, was followed by increasingly powerful successors, GPT-2 and GPT-3.

GPT Architecture

The decoder in GPT is similar to the encoder in structure but includes a crucial modification: masked multi-head attention. This mechanism ensures that the model only attends to previous words in the sequence when predicting the next word. This is essential for autoregressive text generation, as the model must generate text sequentially, one word at a time, without access to future information.

Masked multi-head attention is a variant of multi-head attention where the attention mechanism is restricted to only consider the preceding words in the sequence. This is achieved by masking (setting to negative infinity) the attention scores corresponding to future tokens before the softmax, which drives their attention weights to zero. This prevents the model from "seeing" future tokens during training, ensuring that the model’s predictions are based solely on the preceding context.
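A minimal single-head PyTorch sketch of this masking trick, with illustrative dimensions: scores for future positions are set to negative infinity before the softmax, so their attention weights come out exactly zero.

    import torch
    import torch.nn.functional as F

    def masked_self_attention(Q, K, V):
        # Q, K, V: (seq_len, d). Single-head attention, kept simple for clarity.
        d = Q.size(-1)
        scores = Q @ K.T / d ** 0.5  # (seq_len, seq_len) attention scores
        seq_len = scores.size(0)
        # Boolean upper-triangular mask: position i may not attend to j > i.
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        weights = F.softmax(scores, dim=-1)  # future positions get weight 0
        return weights @ V

    x = torch.randn(5, 8)  # 5 tokens, 8-dimensional embeddings
    print(masked_self_attention(x, x, x).shape)  # torch.Size([5, 8])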


GPT-3 and Beyond

GPT-3, released in 2020, with its staggering 175 billion parameters, demonstrated remarkable performance in various NLP tasks, including text generation, translation, and question answering. Its architecture includes a large number of layers (96 layers) and attention heads (96 heads per layer), allowing it to capture complex relationships in the data and generate highly coherent and contextually relevant text. GPT-3 was trained on a massive dataset, including Common Crawl, WebText2, Books1, Books2, and Wikipedia. The sheer size of the model and the training data contributes to its impressive capabilities.

GPT Model Comparison

  Model   Parameters    Layers   Attention Heads
  GPT-1   117 million   12       12
  GPT-2   1.5 billion   48       16
  GPT-3   175 billion   96       96

Complexity Analysis of GPT-3:

  • Time Complexity: The time complexity of GPT-3’s inference is also dominated by the self-attention mechanism, similar to BERT, with a time complexity of \(O(n^2 \cdot d)\), where \(n\) is the sequence length and \(d\) is the embedding dimension. During training, however, masked self-attention allows all output positions to be computed in parallel (teacher forcing), making training more efficient than with recurrent models that must process tokens sequentially.

  • Space Complexity: GPT-3’s 175 billion parameters require substantial memory for storage and computation. This poses challenges for deployment and accessibility.

Further advancements beyond GPT-3 are being actively researched, including models with even more parameters and more efficient architectures. The trend suggests a continued increase in model size and complexity, pushing the boundaries of what’s possible with language models.

Fine-tuning LLMs for Specific Tasks

Fine-tuning is a crucial step in adapting pre-trained LLMs to perform specific downstream tasks. It involves adding task-specific layers on top of the pre-trained model and training these new layers, along with potentially a small portion of the pre-trained layers, on a labeled dataset relevant to the task.

Text Classification

To fine-tune an LLM for text classification, a classification layer (typically a neural network with one or more hidden layers) is added on top of the pre-trained model. This layer takes the output embeddings from the LLM as input and produces a probability distribution over the possible classification labels.

Given a text, the LLM produces embeddings \(O_1, O_2, \dots, O_n\). Often, a special token’s embedding (e.g., the [CLS] token in BERT) or the average of all output embeddings is used as input to the classification layer. A neural network with one hidden layer processes this input to produce a label (e.g., positive or negative sentiment for sentiment analysis).

Input: Text sequence \(T\), LLM (e.g., BERT), Classification Layer \(C\)
Output: Predicted class label \(y\)

  1. Convert \(T\) into input features (e.g., tokens, segment embeddings).
  2. \(O_1, O_2, \dots, O_n \gets \text{LLM}(T)\)
  3. \(O \gets \text{Aggregate}(O_1, O_2, \dots, O_n)\) (e.g., using the [CLS] token or averaging)
  4. \(y \gets C(O)\)
  5. Return \(y\).
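A minimal PyTorch sketch of this procedure, where `fake_outputs` stands in for the embeddings \(O_1, \dots, O_n\) that a real encoder such as BERT would produce; the dimensions and label count are illustrative.

    import torch
    import torch.nn as nn

    class ClassificationHead(nn.Module):
        """One hidden layer on top of an aggregated LLM embedding."""
        def __init__(self, embed_dim=768, hidden_dim=128, num_labels=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_labels),
            )

        def forward(self, llm_outputs):
            # llm_outputs: (batch, seq_len, embed_dim). Aggregate by taking the
            # first position's embedding (the [CLS] token in BERT-style models).
            cls_embedding = llm_outputs[:, 0, :]
            return self.net(cls_embedding)  # (batch, num_labels) logits

    head = ClassificationHead()
    fake_outputs = torch.randn(4, 16, 768)  # stand-in for real BERT outputs
    print(head(fake_outputs).shape)  # torch.Size([4, 2])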

Named Entity Recognition (NER)

For NER, a classification layer is added for each word’s embedding. Each classifier predicts the entity type of the corresponding word (e.g., person, organization, location, date, or none).

Given a text, the LLM produces embeddings \(O_1, O_2, \dots, O_n\). A separate classifier is applied to each \(O_i\) to predict the entity type of the \(i\)-th word. These classifiers can be independent or share some parameters.

Input: Text sequence \(T\), LLM (e.g., BERT), NER Classifier \(C\)
Output: Predicted entity labels \(Y = [y_1, y_2, \dots, y_n]\)

  1. Convert \(T\) into input features.
  2. \(O_1, O_2, \dots, O_n \gets \text{LLM}(T)\)
  3. For each \(i\): \(y_i \gets C(O_i)\)
  4. Return \(Y\).
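A minimal sketch with one shared linear classifier applied to every token embedding; the label set and dimensions are invented for illustration.

    import torch
    import torch.nn as nn

    NER_LABELS = ["O", "PER", "ORG", "LOC", "DATE"]  # illustrative tag set

    # One shared classifier applied independently to each token's embedding.
    token_classifier = nn.Linear(768, len(NER_LABELS))

    llm_outputs = torch.randn(1, 10, 768)   # stand-in for BERT outputs O_1..O_n
    logits = token_classifier(llm_outputs)  # (1, 10, 5): one prediction per token
    predictions = logits.argmax(dim=-1)
    print([NER_LABELS[i] for i in predictions[0]])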

Question Answering

In question answering, the LLM processes both the question and the context text (often concatenated as a single input sequence). A common approach is to predict the start and end positions of the answer within the text using classifiers over the token positions.

Given a question and a text, the LLM produces embeddings \(O_1, O_2, \dots, O_n\). Two separate classifiers are used: one to predict the probability of each token being the start of the answer, and another to predict the probability of each token being the end of the answer.

Input: Question \(Q\), Context Text \(C\), LLM (e.g., BERT), Start Classifier \(S\), End Classifier \(E\)
Output: Predicted start position \(s\) and end position \(e\)

  1. Concatenate \(Q\) and \(C\) into a single input sequence \(T\).
  2. Convert \(T\) into input features.
  3. \(O_1, O_2, \dots, O_n \gets \text{LLM}(T)\)
  4. \(P_{\text{start}} \gets [S(O_1), S(O_2), \dots, S(O_n)]\)
  5. \(P_{\text{end}} \gets [E(O_1), E(O_2), \dots, E(O_n)]\)
  6. \(s \gets \text{argmax}(P_{\text{start}})\), \(e \gets \text{argmax}(P_{\text{end}})\)
  7. Return \(s, e\).
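A minimal sketch of the start/end prediction, again with stand-in embeddings; real systems additionally constrain \(e \ge s\) and limit the span length.

    import torch
    import torch.nn as nn

    # Two linear classifiers score every token as a candidate answer start or
    # end, mirroring S and E in the pseudocode above.
    start_classifier = nn.Linear(768, 1)
    end_classifier = nn.Linear(768, 1)

    llm_outputs = torch.randn(1, 20, 768)  # stand-in embeddings of [Q; C]
    p_start = start_classifier(llm_outputs).squeeze(-1)  # (1, 20) start scores
    p_end = end_classifier(llm_outputs).squeeze(-1)      # (1, 20) end scores

    s = p_start.argmax(dim=-1).item()
    e = p_end.argmax(dim=-1).item()
    print(f"Predicted answer span: tokens {s} to {e}")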

Complexity Analysis for Fine-tuning:

  • Time Complexity: The time complexity of fine-tuning is generally dominated by the forward and backward passes through the LLM, which is \(O(n^2 \cdot d)\) for self-attention-based models. The added task-specific layers usually have a smaller impact on the overall time complexity.

  • Space Complexity: Fine-tuning typically involves adding a relatively small number of parameters compared to the pre-trained LLM. Therefore, the space complexity remains largely determined by the size of the LLM.

Generative Models and ChatGPT

Generative Models

Generative models, like GPT-2 and GPT-3, are trained to generate text by predicting the next word in a sequence autoregressively. The predicted word is then used as input to predict the subsequent word, and so on, creating a coherent and contextually relevant text sequence. This process continues until a stopping criterion is met, such as reaching a maximum length or generating an end-of-sequence token.

Masked Self-Attention in Generative Models

Masked self-attention is crucial in generative models to ensure that the model only uses information from previous words when predicting the next word. This prevents the model from "cheating" by looking at future tokens during training, which would lead to poor generalization performance during inference.


Input: Initial sequence \(S = [s_1, s_2, \dots, s_k]\), Generative Model \(G\)
Output: Generated sequence \(S'\)

  1. Repeat until a stopping criterion is met:
     a. Convert \(S\) into input features.
     b. \(O \gets G(S)\) (using masked self-attention)
     c. \(s_{k+1} \gets \text{Sample from } O\) (e.g., using argmax or probabilistic sampling)
     d. Append \(s_{k+1}\) to \(S\) and set \(k \gets k + 1\).
  2. \(S' \gets S\); return \(S'\).
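A minimal PyTorch sketch of this loop using greedy decoding; `DummyModel` is an invented stand-in for a real GPT that maps token IDs to next-token logits.

    import torch

    class DummyModel(torch.nn.Module):
        # Invented stand-in for a real GPT: random logits over a 100-token vocab.
        def forward(self, tokens):
            return torch.randn(tokens.size(0), tokens.size(1), 100)

    def generate(model, tokens, max_new_tokens=20, eos_id=None):
        """Greedy autoregressive generation following the pseudocode above."""
        for _ in range(max_new_tokens):
            logits = model(tokens)  # masked self-attention happens inside the model
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # last position
            tokens = torch.cat([tokens, next_id], dim=1)  # append and continue
            if eos_id is not None and next_id.item() == eos_id:
                break  # stop on an end-of-sequence token
        return tokens

    out = generate(DummyModel(), torch.tensor([[1, 2, 3]]))
    print(out.shape)  # torch.Size([1, 23])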

ChatGPT: Training and Refinement

ChatGPT, based on the GPT architecture, is trained in multiple stages to enhance its conversational abilities:

  1. Pre-training: Similar to GPT, ChatGPT is pre-trained on a massive corpus of text data from the internet, including books, articles, and websites. This stage allows the model to learn general language patterns and knowledge. The dataset used for training ChatGPT is a mix of filtered web content, books, and other text sources.

  2. Supervised Fine-tuning (SFT): In this stage, human AI trainers provide conversations where they play both the user and the AI assistant. The model is then fine-tuned on these conversations using supervised learning to improve its ability to generate relevant and coherent responses. This dataset is created by human labelers who write and rank responses to a variety of prompts.

  3. Reinforcement Learning from Human Feedback (RLHF): A reward model is trained based on human preferences. Human trainers rank multiple outputs from the model, and this data is used to train a reward model that predicts the quality of the generated responses. This reward model is then used to fine-tune ChatGPT using reinforcement learning algorithms, specifically Proximal Policy Optimization (PPO).

Reinforcement Learning from Human Feedback (RLHF)

Input: Pre-trained ChatGPT model, Reward Model \(R\)
Output: Fine-tuned ChatGPT model

  1. Collect a set of prompts \(P\).
  2. For each prompt \(p \in P\):
     a. Generate a response \(r\) using ChatGPT.
     b. Obtain a reward \(R(p, r)\) from the reward model.
  3. Update ChatGPT’s parameters using a reinforcement learning algorithm (e.g., PPO) based on the collected rewards.
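The toy sketch below shows only the data flow of this loop: sample a response, score it with a reward model, update the policy. Everything here is a deliberately simplified stand-in; real RLHF uses PPO with a clipped objective, a value baseline, and a KL penalty that keeps the policy close to the pre-trained model.

    import torch
    import torch.nn as nn

    class ToyPolicy(nn.Module):
        # Stand-in policy: a categorical distribution over 10 canned "responses".
        def __init__(self):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(10))
        def sample(self):
            dist = torch.distributions.Categorical(logits=self.logits)
            response = dist.sample()
            return response, dist.log_prob(response)

    def toy_reward(response):
        # Stand-in reward model: pretends human raters prefer response 3.
        return 1.0 if response.item() == 3 else 0.0

    policy = ToyPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)

    for step in range(200):
        response, log_prob = policy.sample()
        loss = -toy_reward(response) * log_prob  # plain REINFORCE objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(policy.logits.softmax(-1))  # probability mass shifts toward response 3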

Complexity Analysis:

  • Time Complexity: The RLHF process involves multiple steps, including generating responses, evaluating rewards, and updating the model using RL. The time complexity is influenced by the number of prompts, the length of the generated responses, and the complexity of the RL algorithm.

  • Space Complexity: The space complexity is primarily determined by the size of the ChatGPT model and the reward model.

  • Stage 1: Pre-training: Learns general language patterns from a massive text corpus.

  • Stage 2: Supervised Fine-tuning (SFT): Improves conversational abilities using human-written dialogues.

  • Stage 3: Reinforcement Learning from Human Feedback (RLHF): Refines the model based on human preferences using a reward model and reinforcement learning.

Prompt Engineering

Prompt engineering is crucial for effectively using LLMs, particularly for models like ChatGPT. It involves carefully crafting the input text (prompt) to guide the model towards generating the desired output. A well-designed prompt can significantly improve the quality, relevance, and coherence of the model’s response. OpenAI provides guidelines for crafting prompts, such as using delimiters to specify parts of the text and specifying the desired length of the output.

  • Clarity and Specificity: Be clear and specific about the task you want the model to perform. Avoid ambiguity and provide sufficient context.

  • Use of Delimiters: Use delimiters (e.g., quotation marks, triple quotes, XML tags) to clearly separate different parts of the prompt, such as instructions, context, and input data.

  • Output Length Specification: Specify the desired length of the output, such as the number of sentences, paragraphs, or words. This helps control the verbosity of the model’s response.

  • Few-Shot Examples: Provide a few examples of the desired input-output behavior to guide the model. This is particularly useful for complex tasks.

  • Iterative Refinement: Start with a simple prompt and iteratively refine it based on the model’s output. Experiment with different phrasing and formatting to achieve the best results.

  • Role-Playing: Assign a role to the model (e.g., "You are a helpful assistant") to influence its response style.

Task: Summarize a given text.

Poor Prompt: "Summarize this."

Better Prompt:

    Summarize the following text in three sentences.

    Text: """ [Insert text here] """

Explanation: The better prompt uses delimiters (triple quotes) to clearly separate the text to be summarized. It also specifies the desired output length (three sentences).

Task: Extract key information from a text.

Poor Prompt: "What are the important points?"

Better Prompt:

    Extract the names of all people and organizations mentioned in the following text.

    Text: """ [Insert text here] """

Explanation: The better prompt clearly specifies the type of information to be extracted (names of people and organizations) and uses delimiters to separate the text.

A complete example prompt combining delimiters and an output-length specification:

    Please translate the following sentence into French. Keep the translation concise and accurate.

    Sentence: """ Large language models are powerful tools for natural language processing. """

    Desired length: One sentence.
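When prompts like the one above are assembled programmatically, the delimiter and length guidance translates directly into a string template; a minimal sketch (the function name and template are illustrative):

    def build_translation_prompt(sentence, language="French"):
        # Triple quotes delimit the user-supplied text, as recommended above,
        # so instructions cannot be confused with input data.
        return (
            f"Please translate the following sentence into {language}. "
            "Keep the translation concise and accurate.\n\n"
            f'Sentence: """ {sentence} """\n\n'
            "Desired length: One sentence."
        )

    print(build_translation_prompt(
        "Large language models are powerful tools for natural language processing."))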

Further Considerations:

  • Chain-of-Thought Prompting: Encourage the model to reason step-by-step by including phrases like "Let’s think step by step."

  • Zero-Shot Prompting: Ask the model to perform a task without providing any examples. This tests the model’s generalization ability.

  • Temperature and Top-p: These parameters control the randomness and creativity of the model’s output and can be adjusted during inference to influence the diversity of the generated text; a sketch of both follows this list.
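A minimal NumPy sketch of how temperature and top-p interact during sampling, assuming a toy logits vector; real implementations apply the same two steps to the model’s full vocabulary logits.

    import numpy as np

    def sample_next_token(logits, temperature=1.0, top_p=1.0):
        """Temperature scaling followed by nucleus (top-p) filtering."""
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        # Keep the smallest set of most-probable tokens covering top_p mass.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        filtered /= filtered.sum()
        return int(np.random.choice(len(probs), p=filtered))

    logits = [2.0, 1.0, 0.5, 0.1]  # toy vocabulary of four tokens
    print(sample_next_token(logits, temperature=0.7, top_p=0.9))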

Capabilities and Limitations of ChatGPT

ChatGPT has demonstrated impressive capabilities in various natural language processing tasks, making it a valuable tool for a wide range of applications. However, it also has limitations that need to be considered when using the model.

Capabilities

ChatGPT excels in tasks that involve understanding and generating text based on the information provided in the prompt. Some of its key capabilities include:

  • Summarizing text: ChatGPT can generate concise and coherent summaries of longer texts, such as articles, documents, or conversations. It can effectively identify the main points and key information within the provided text.

  • Extracting information: ChatGPT can extract specific information from a given text, such as names, dates, locations, or other entities. It can also answer questions based on the content of the text.

  • Rephrasing text: ChatGPT can rephrase or rewrite text in different styles or tones while preserving the original meaning. This can be useful for paraphrasing, simplifying complex language, or adapting the text for different audiences.

  • Classifying information: ChatGPT can classify text into predefined categories, such as sentiment analysis (positive, negative, neutral), topic classification, or intent recognition.

  • Content Generation: It can generate various types of content, including articles, stories, poems, and scripts, given appropriate prompts.

  • Translation: It can perform translation between multiple languages, although the quality may vary depending on the language pair and the complexity of the text.

  • Question Answering: Given a context, it can answer questions related to that context accurately.

Limitations

Despite its impressive capabilities, ChatGPT has several limitations:

  • Hallucinations: ChatGPT may sometimes generate incorrect, nonsensical, or misleading information that is not grounded in the provided context or real-world facts. This is often referred to as "hallucination." These hallucinations can be subtle and may appear plausible, making it important to verify the model’s output.

  • Reasoning: ChatGPT may struggle with complex reasoning tasks, particularly those involving multi-step inference, logical deduction, or common-sense reasoning. It may produce inconsistent or illogical answers when faced with such tasks. For example, it failed to answer the following problem correctly: "I left five clothes to dry out in the sun. It took them five hours to dry completely. How long would it take to dry 30 clothes?".

  • Contextual Understanding: ChatGPT’s understanding of context is limited to the information provided in the current prompt. It does not have memory of previous interactions beyond the current conversation, which can lead to inconsistencies or a lack of coherence in longer conversations. Also, its knowledge is limited to what it has learned during training, which has a cutoff date.

  • Bias and Sensitivity: ChatGPT may reflect biases present in the training data, which can lead to generating responses that are unfair, discriminatory, or offensive. It may also be sensitive to slight changes in the input, leading to significantly different outputs for seemingly similar prompts.

  • Over-Reliance on Prompting: The quality of ChatGPT’s output is heavily dependent on the quality of the prompt. Poorly crafted prompts can lead to irrelevant, incoherent, or inaccurate responses.

  • Verbosity: It tends to be verbose, often providing longer answers than necessary.

Prompt: "What is the capital of France?"

ChatGPT’s Response: "The capital of France is Paris." (Correct)

Prompt: "What is the capital of the United States?"

ChatGPT’s Response: "The capital of the United States is Washington, D.C." (Correct)

Prompt: "What is the largest planet in our solar system?"

ChatGPT’s Response: "The largest planet in our solar system is Jupiter." (Correct)

Prompt: "Who painted the Mona Lisa?"

ChatGPT’s Response: "The Mona Lisa was painted by Leonardo da Vinci." (Correct)

Prompt: "When did World War II end?"

ChatGPT’s Response: "World War II ended in 1945." (Correct)

Prompt: "What is the highest mountain in the world?"

ChatGPT’s Response: "The highest mountain in the world is Mount Everest." (Correct)

Prompt: "What is the chemical formula for water?"

ChatGPT’s Response: "The chemical formula for water is H2O." (Correct)

Prompt: "Who wrote the theory of relativity?"

ChatGPT’s Response: "The theory of relativity was written by Albert Einstein." (Correct)

Prompt: "What is the smallest country in the world?"

ChatGPT’s Response: "The smallest country in the world is Vatican City." (Correct)

Prompt: "Who is the current president of France?"

ChatGPT’s Response: "The current president of France is Emmanuel Macron." (Correct)

Prompt: "What is the population of Mars?"

ChatGPT’s Response: "The population of Mars is approximately 7.9 billion people." (Incorrect - Hallucination)

Prompt: "I left five clothes to dry out in the sun. It took them five hours to dry completely. How long would it take to dry 30 clothes?"

ChatGPT’s Response: "If it took five hours to dry five clothes, that means each piece of clothing took one hour to dry (5 hours / 5 clothes = 1 hour/clothes). Therefore, to dry 30 clothes, it would take 30 hours (30 clothes * 1 hour/clothes = 30 hours)." (Incorrect - Reasoning Failure)

Addressing Limitations:

  • Prompt Engineering: Carefully crafting prompts can help mitigate some of the limitations, such as hallucinations and verbosity.

  • Fact Verification: Always verify the information provided by ChatGPT, especially when it involves factual claims.

  • Combining with Other Systems: Integrate ChatGPT with other systems, such as knowledge bases or search engines, to improve accuracy and reliability.

  • Human Oversight: Use human oversight to review and edit the model’s output, particularly in critical applications.

Conclusion

This lecture covered the evolution of large language models (LLMs) from BERT to ChatGPT, highlighting their architectures, training processes, and applications. We explored how LLMs can be fine-tuned for specific tasks and discussed the advancements in generative models, particularly the transition from encoder-based models like BERT to decoder-based models like GPT. While LLMs like ChatGPT offer impressive capabilities, they also have limitations, such as hallucinations and difficulties with reasoning. Understanding these aspects is crucial for effectively utilizing and improving these powerful tools.

Key Takeaways:

  • Transformers and their self-attention mechanisms are foundational to modern LLMs, enabling them to capture long-range dependencies in text.

  • BERT and GPT represent two different approaches, with BERT focusing on understanding context bidirectionally and GPT on generating text autoregressively.

  • Fine-tuning allows LLMs to be adapted for various NLP tasks with smaller datasets, leveraging the knowledge gained during pre-training.

  • ChatGPT’s training involves a multi-stage process: pre-training on a massive text corpus, supervised fine-tuning on human-written dialogues, and reinforcement learning from human feedback (RLHF) to align the model with human preferences.

  • Prompt engineering is essential for maximizing the effectiveness of LLMs, requiring careful crafting of input text to guide the model towards desired outputs.

  • The specific datasets used for training GPT-3 include Common Crawl, WebText2, Books1, Books2, and Wikipedia. ChatGPT’s training data is a mix of filtered web content, books, and other text sources, along with human-written dialogues and preference data for RLHF.

  1. Transformers (2017): Introduced self-attention mechanisms, revolutionizing NLP.

  2. BERT (2018): Bidirectional encoder model, excelling at understanding context.

  3. GPT (2018): Autoregressive decoder model, focused on text generation.

  4. GPT-2 (2019): Larger version of GPT, demonstrating improved text generation capabilities.

  5. GPT-3 (2020): 175 billion parameters, showcasing remarkable performance across various tasks.

  6. ChatGPT (2022): Fine-tuned using RLHF for conversational abilities, but still prone to limitations.

Follow-up Questions:

  • How can we mitigate the issue of hallucinations in LLMs? Potential solutions include improving training data quality, incorporating fact verification mechanisms, and developing better methods for grounding the model’s output in real-world knowledge.

  • What are some strategies for improving the reasoning abilities of LLMs? This could involve developing new architectures that can perform multi-step reasoning, integrating symbolic reasoning approaches, and training on datasets that require more complex logical inference.

  • How might future advancements in LLMs address current limitations? Future research may focus on developing more efficient architectures, improving training methods, incorporating multimodal learning, and enhancing the model’s ability to learn from limited data.

  • How can we evaluate and address biases in LLMs? This is an active area of research, with potential solutions including developing methods for detecting and mitigating biases in training data, creating more diverse and representative datasets, and incorporating fairness constraints during training.

  • What are the ethical implications of using LLMs, and how can we ensure responsible deployment? This involves considering issues such as potential misuse, job displacement, the spread of misinformation, and the need for transparency and accountability.

Suggested Improvements for Future Work:

  • Develop more robust methods for evaluating the reasoning abilities of LLMs.

  • Explore techniques for incorporating external knowledge sources to reduce hallucinations.

  • Investigate methods for improving the sample efficiency of LLM training.

  • Develop better techniques for controlling the style and tone of LLM output.

  • Create more comprehensive benchmarks for evaluating the performance of LLMs across a wider range of tasks and domains.