Predictive and Classification Methods

Author

Your Name

Published

February 12, 2025

Introduction

Summary and Introduction to Predictive Modeling

Introduction: This lecture introduces predictive modeling and classification methods, fundamental techniques in applied statistics and data analysis. We will explore the core concepts of predictive modeling, methods for assessing model accuracy, and the application of regression models for predictive purposes. Furthermore, we will transition to classification problems, focusing on predicting categorical outcomes.

Predictive modeling is defined as the process of creating or selecting a model that best forecasts the probability of an outcome. As S. Geisser articulated in “Predictive Inference: An Introduction”, the essence of predictive modeling is the development of tools for accurate prediction. At its heart, predictive modeling utilizes available data to construct mathematical or statistical models capable of forecasting future events or unknown values. This involves identifying patterns within data to extrapolate and anticipate future or unseen observations.

  • Examples of Predictive Modeling Applications:

    • Insurance Companies: Predictive models are crucial for assessing risk associated with potential policyholders in auto, health, and life insurance. These models help determine policy eligibility and calculate appropriate premiums by predicting the likelihood of claims based on applicant characteristics and historical data. For instance, in auto insurance, models might predict accident probability based on age, driving history, and vehicle type.

    • Governments: Governments employ predictive modeling to evaluate and mitigate potential risks to public safety and security. Biometric models for identifying terror suspects and fraud detection systems are examples. These systems analyze patterns and anomalies in large datasets to proactively identify and address threats, enhancing citizen protection. For example, fraud detection models in tax systems can predict potentially fraudulent returns, allowing for targeted audits and resource allocation.

    • Internet Companies: In the commercial sector, internet companies leverage predictive models to enhance customer satisfaction and profitability. Recommendation systems, for example, guide consumers towards products they are likely to purchase or investments that align with their financial goals. These models analyze user behavior, purchase history, and product attributes to personalize recommendations, thereby increasing sales and user engagement. For example, e-commerce platforms use collaborative filtering and content-based filtering to predict products a user might be interested in.

While predictive models are powerful, it is crucial to acknowledge their limitations. Several factors can lead to inaccurate predictions, undermining the reliability of model outputs.

  • Reasons for Predictive Model Failure:

    • Inadequate Data Pre-processing: The quality of input data significantly impacts model performance. Poor data quality, including missing values, errors, or biases, and inappropriate data transformations can lead to misleading results. Proper data cleaning, handling missing data, and feature scaling are essential pre-processing steps. For example, if income data is used for credit risk prediction but contains many unreported values or outliers, the model’s accuracy will be compromised.

    • Insufficient Model Selection and Validation: Choosing an inappropriate model for the data or failing to properly validate the chosen model can result in poor predictions. Model selection should be guided by the nature of the data and the problem, and validation techniques like cross-validation are necessary to ensure the model generalizes well to unseen data. For example, applying a linear model to data with highly non-linear relationships will likely yield poor predictive performance.

    • Unjustified Extrapolation: Predictive models are built based on the range of available observations in the training data. Applying a model to predict outcomes for inputs outside this range, known as extrapolation, can lead to unreliable predictions. The relationships learned by the model may not hold beyond the observed data range. For example, a model trained to predict house prices based on data from a specific city might not accurately predict prices in a different city with different market dynamics.

    • Overfitting: Overfitting occurs when a model learns the training data too well, including noise. It becomes overly complex and sensitive to the training data, failing to generalize to new data. It focuses on random patterns and misses the true function \(f\).

Developing dependable and trustworthy predictive models is therefore essential for informed decision-making. However, it is also important to understand that there is an inherent irreducible error component in predictive modeling. This irreducible error arises from factors that are fundamentally beyond the modeler’s control.

  • Irreducible Error Sources:

    • Omission of Relevant Predictor Variables: No model can account for variables that are not measured or included in the dataset, even if they are relevant to the outcome. If crucial predictors are missing, the model’s predictive accuracy will be inherently limited. For example, predicting student performance without considering their motivation or study habits, if such data is unavailable, will introduce irreducible error.

    • Unquantifiable and Unusable Variables: Some factors influencing outcomes are inherently difficult or impossible to quantify and incorporate into statistical models. Human behavior, subjective opinions, and unforeseen external events are examples of such variables. For instance, predicting market crashes is inherently difficult due to the influence of unpredictable investor sentiment and global events.

    • Limitations of Current and Past Knowledge: Predictions are fundamentally constrained by the current state of knowledge and the data available from the past. Models are built upon historical data and existing theories, which may not perfectly capture future dynamics or emerging trends. For example, predicting technological breakthroughs or paradigm shifts is limited by our current understanding and the absence of historical precedents.

Prediction versus Inference

In statistical modeling, it is crucial to distinguish between the goals of prediction and inference. While both utilize statistical models, they address different questions and prioritize different aspects of the modeling process.

Consider a scenario where we aim to model a quantitative response variable \(Y\) using \(p \geq 1\) explanatory variables (predictors) denoted as \(\bm{X}= (X_1, \dots, X_p)\). We assume a general model framework:

\[Y = f(\bm{X}) + \varepsilon \label{eq:general_model}\] where \(f(\bm{X})\) represents a fixed, unknown function that describes the systematic relationship between the predictors \(\bm{X}\) and the response \(Y\). This function encapsulates the underlying pattern we aim to learn from the data. The term \(\varepsilon\) represents a random error component, assumed to have a mean of zero, accounting for the variability in \(Y\) that is not explained by \(\bm{X}\). Regression models are built upon this fundamental framework.

  • Inference: The primary goal of inference is to understand and explain the relationship between \(Y\) and \(\bm{X}\). Inference seeks to determine the nature of the function \(f(\bm{X})\). Specifically, we want to understand how changes in the predictors \(X_1, \dots, X_p\) affect the response \(Y\). This involves estimating the form of the function \(f\), often denoted as \(\hat{f}\), using the available data. In inference, we are interested in interpreting the model parameters, assessing the statistical significance of predictors, and understanding the underlying mechanisms driving the response. For example, in a study examining the effect of fertilizer and sunlight on crop yield, inference would focus on quantifying the individual and combined effects of fertilizer and sunlight on yield and determining if these effects are statistically significant.

  • Prediction: In contrast, the goal of prediction is to accurately forecast the value of \(Y\) for a given set of input values for \(\bm{X}\). Given a new set of inputs \(\bm{X}\), prediction aims to generate an estimate \(\hat{Y}= \hat{f}(\bm{X})\) that is as close as possible to the true, but unknown, value of \(Y\). We might also be interested in providing a prediction interval, which quantifies the uncertainty associated with our point prediction. In prediction, the focus is on the accuracy of the predicted values. The model \(\hat{f}\) is often treated as a “black box”, and interpretability of the model parameters is secondary to predictive performance. For example, in sales forecasting, the primary goal is to predict future sales volume accurately, regardless of the specific functional form of the model or the interpretability of individual coefficients, as long as the predictions are reliable.

  • Applications: Many real-world problems require either prediction, inference, or a combination of both. The choice between prediction and inference depends on the specific objectives of the analysis.

    • Example: Marketing Campaign Analysis: A marketing team might want to understand the impact of advertising spending on sales (inference) to optimize their marketing strategy. Simultaneously, they need to predict future sales based on planned advertising budgets (prediction) to manage inventory and revenue expectations. In this case, both inference and prediction are crucial. Inference helps understand the effectiveness of different advertising channels, while prediction aids in planning and resource allocation.

    • Example: Credit Risk Assessment: A bank uses customer data to predict the probability of loan default (prediction). The primary goal is to accurately classify loan applicants as low-risk or high-risk to minimize potential losses. While understanding the factors contributing to default (inference) can be valuable, the immediate objective is accurate prediction for risk management.

Example: Advertising Data

To illustrate the concepts of prediction and inference, let’s consider the “advertising data” example, derived from the well-known textbook “An Introduction to Statistical Learning” by James et al. This dataset explores the relationship between product sales and advertising budgets across different media channels. The dataset includes observations from 200 distinct markets and records the sales of a particular product (response variable \(Y\)) along with advertising budgets allocated to three different media: TV (\(X_1\)), radio (\(X_2\)), and newspaper (\(X_3\)). All monetary values are recorded in thousands of dollars, and sales are in thousands of units.

Pairwise scatter plots and histograms for the advertising dataset. Sales show a positive correlation with TV and radio advertising budgets.

Pairwise scatter plots and histograms of the advertising data are shown in Figure 1. The scatter plots suggest a positive relationship between sales and both TV and radio advertising budgets, indicating that increased spending in these media is associated with higher sales. Newspaper advertising shows a weaker and less clear relationship with sales. The histograms display the distributions of each variable, providing insights into their ranges and central tendencies.

To model the relationship between advertising expenditure and sales, a multiple linear regression model can be fitted to this data. The output from fitting such a model might resemble the following:

Coefficients:
            Estimate   SE       p-value
Intercept     2.9389   0.3119   <0.0001
TV            0.0458   0.0014   <0.0001
radio         0.1885   0.0086   <0.0001
newspaper    -0.0010   0.0059   0.8599
s^2 = 2.841, Radj^2 = 0.896

From this regression output, we can make several observations. The intercept term is statistically significant, indicating a baseline level of sales even without advertising expenditure. Both TV and radio advertising coefficients are positive and highly statistically significant (p-value < 0.0001), suggesting that increasing spending on TV and radio advertising leads to a statistically significant increase in sales. In contrast, the coefficient for newspaper advertising is very small and not statistically significant (p-value = 0.8599), implying that, in this model, newspaper advertising does not have a significant impact on sales. The adjusted R-squared value of 0.896 indicates that the model explains approximately 89.6% of the variance in sales, suggesting a good fit to the data.

Based on this model, we can formulate questions relevant to both inference and prediction frameworks:

  • Inference Framework Questions: In an inference framework, we are interested in understanding the relationships between advertising media and sales. Questions we might ask include:

    • Which media types significantly drive sales? Based on the p-values, we can infer that TV and radio advertising have a statistically significant positive impact on sales, while newspaper advertising does not appear to be significant in this model.

    • Which medium generates the biggest boost in sales? Comparing the magnitudes of the coefficients for TV and radio, we observe that the coefficient for radio (0.1885) is substantially larger than that for TV (0.0458). This suggests that, for each additional dollar spent, radio advertising generates a larger increase in sales compared to TV advertising, although this interpretation should be made cautiously, considering potential differences in the scales and ranges of these variables.

    • What is the expected increase in sales associated with a given increase in TV advertising? The coefficient for TV advertising is approximately 0.0458. This implies that, on average, for every additional $1,000 spent on TV advertising, sales are expected to increase by approximately 0.0458 thousand units, or 45.8 units, holding radio and newspaper advertising budgets constant.

  • Prediction Framework Question: In a prediction framework, our primary interest is to forecast sales given specific advertising budgets. A typical prediction question would be:

    • Predict sales volume given specific budget allocations for TV, radio, and newspaper. For example, if a company plans to spend $100,000 on TV, $20,000 on radio, and $10,000 on newspaper advertising in a particular market, we can use the fitted regression model to predict the expected sales in that market. By plugging these values into the regression equation: \[\text{Predicted Sales} = 2.9389 + 0.0458 \times 100 + 0.1885 \times 20 - 0.0010 \times 10\] \[\text{Predicted Sales} \approx 11.28 \text{ thousand units}\] Thus, the predicted sales volume for these advertising budgets is approximately 11,280 units. We can also calculate prediction intervals to quantify the uncertainty associated with this prediction.
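
As a quick check of this arithmetic, the point prediction can be computed directly from the reported coefficient estimates; a minimal Python sketch (the coefficient and budget values are those quoted above, so this is an illustration rather than a refit of the model):

# Point prediction from the reported coefficients (budgets in thousands of dollars)
coef = {"Intercept": 2.9389, "TV": 0.0458, "radio": 0.1885, "newspaper": -0.0010}
budget = {"TV": 100, "radio": 20, "newspaper": 10}

predicted_sales = coef["Intercept"] + sum(coef[m] * budget[m] for m in budget)
print(round(predicted_sales, 2))  # about 11.28 thousand units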

Predictive Model Accuracy

Measuring the Quality of Fit

To effectively evaluate a predictive model, it is crucial to quantify how well its predictions align with the actual observed values. Consider a scenario where we have a set of training data consisting of input-output pairs \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\). We employ these data to train a model and obtain an estimated function \(\hat{f}\) that approximates the true relationship between \(\bm{X}\) and \(Y\).

Definition 1 (Training Mean Squared Error (MSE)). A widely used metric for assessing the fit of a regression model to the training data is the training mean squared error (MSE). It is calculated as the average of the squared differences between the actual response values and the predicted values from the model: \[\text{MSE}_{training} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(\mathbf{x}_i))^2 \label{eq:training_mse}\]

Remark 1. The training MSE quantifies how well the model fits the training data. However, it is crucial to note:

  • Fitting Accuracy versus Predictive Accuracy: The training MSE reflects the fitting accuracy of the model, indicating how well the model conforms to the training data. It does not directly measure the predictive accuracy, which is the model’s ability to generalize and make accurate predictions on new, unseen data. A model with a low training MSE may not necessarily have a low test MSE.

  • Overoptimistic Assessment of Model Performance: Evaluating a model’s performance solely based on the training MSE can lead to an overoptimistic assessment. This is because the model is trained and evaluated on the same dataset. The model has effectively "memorized" the training data and is optimized to minimize the error on these specific data points. Consequently, the training MSE tends to underestimate the error rate on new, unobserved data. For instance, consider fitting a highly complex polynomial to a small dataset. It might perfectly interpolate the training points, resulting in zero training MSE. However, such a model is likely to perform poorly on new data points due to overfitting.

To obtain a more realistic evaluation of a model’s performance in predicting new outcomes, we need to assess its accuracy on test data, which includes observations not used in training. Suppose we have test observations \((\mathbf{x}_{01}, y_{01}), \dots, (\mathbf{x}_{0m}, y_{0m})\).

Definition 2 (Test Mean Squared Error (MSE)). The test MSE is then calculated as: \[\text{MSE}_{test} = \frac{1}{m} \sum_{j=1}^{m} (y_{0j} - \hat{f}(\mathbf{x}_{0j}))^2 \label{eq:test_mse}\] where \(\hat{f}\) is the model estimated using the training observations. The test MSE provides a more reliable estimate of the model’s generalization to new data.
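
Both quantities are averages of squared prediction errors; they differ only in which observations enter the average. A minimal Python sketch, assuming f_hat is some fitted prediction function and the data are stored in NumPy arrays (all names here are illustrative):

import numpy as np

def mse(y, y_hat):
    """Average squared difference between observed responses and predictions."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# training_mse = mse(y_train, f_hat(X_train))   # computed on the data used to fit f_hat
# test_mse     = mse(y_test,  f_hat(X_test))    # computed on held-out observations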

Remark 2.

  • Availability of Test Data: In some scenarios, a distinct test dataset is readily available. This is particularly common when the original dataset is large enough to be split into training and testing sets.

  • Cross-validation: When a separate test set is not available, techniques like cross-validation can be used to estimate the test MSE using only the training data. Cross-validation involves dividing the training data into subsets, training the model on some subsets, and testing on the remaining subset, repeating this to average the test error estimate.
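
A minimal Python sketch of k-fold cross-validation for estimating the test MSE; fit and predict stand for any model-fitting and prediction routines, so the sketch is generic rather than tied to a particular method:

import numpy as np

def cv_mse(fit, predict, X, y, k=5, seed=0):
    """Estimate the test MSE by k-fold cross-validation.
    fit(X, y) returns a fitted model; predict(model, X) returns predictions.
    X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for j in range(k):
        train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[folds[j]])
        errors.append(np.mean((y[folds[j]] - y_hat) ** 2))
    return np.mean(errors)   # average held-out MSE over the k folds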

Overfitting and Bias-Variance Trade-off

A fundamental challenge in predictive modeling is to minimize the test MSE, which directly reflects the model’s ability to generalize to new data. It is important to note that simply minimizing the training MSE does not guarantee the lowest test MSE. In fact, striving for an extremely low training MSE can often lead to a higher test MSE, a phenomenon known as overfitting. This gives rise to the critical concept of the bias-variance trade-off.

Remark 3.

  • Training MSE versus Test MSE Behavior: As we increase the flexibility of a statistical learning method (e.g., by increasing the complexity of a model), we typically observe a consistent decrease in the training MSE. More flexible models can fit the training data more closely, leading to a better fit and thus a lower training MSE. However, the test MSE often exhibits a U-shaped behavior as model flexibility increases. Initially, as model flexibility increases, the test MSE decreases, indicating improved prediction accuracy on unseen data. But beyond a certain level of flexibility, the test MSE starts to increase, signifying a decline in generalization performance.

  • The Phenomenon of Overfitting: The increase in test MSE that occurs with increasing model flexibility beyond an optimal point is known as overfitting. Overfitting happens when a model becomes excessively tailored to the idiosyncrasies of the training data, including the random noise and fluctuations present in that specific dataset. An overfitted model learns the training data too closely, capturing not only the underlying patterns but also the noise. Consequently, it becomes overly complex and sensitive to the specific training dataset, failing to generalize effectively to new, unseen data. In essence, an overfitted model memorizes the training examples rather than learning the underlying relationship, leading to poor predictive performance on new observations. For example, a decision tree model allowed to grow very deep might perfectly classify all training samples but perform poorly on test data because it has created overly specific rules that do not generalize.

Theorem 1 (Bias-Variance Decomposition). For a future response \(Y_0\) at predictor value \(\mathbf{x}_0\), the expected squared prediction error (which the test MSE estimates) can be decomposed as: \[\mathbb{E}[(Y_0 - \hat{Y}_0)^2] = \mathbb{V}[\hat{f}(\mathbf{x}_0)] + [\text{Bias}(\hat{f}(\mathbf{x}_0))]^2 + \mathbb{V}[\varepsilon] \label{eq:bias_variance_decomposition}\] where \(\hat{Y}_0 = \hat{f}(\mathbf{x}_0)\) is the point predictor from the fitted model.

Remark 4.

  • Irreducible Error: The term \(\mathbb{V}[\varepsilon]\) represents the irreducible error, the variance of the error term \(\varepsilon\). It is a lower bound on the expected test MSE: no model, however flexible, can reduce the expected error below this value.

  • Bias-Variance Trade-off: To minimize the expected test MSE, we need a model that achieves both low variance and low bias; balancing the two is known as the bias-variance trade-off (illustrated in the simulation sketch after this list). Generally, as model flexibility increases:

    • Variance Increases: More flexible models are more sensitive to training data changes, increasing variance in \(\hat{f}(\mathbf{x}_0)\).

    • Bias Decreases: More flexible models better capture the true relationship between \(\bm{X}\) and \(Y\), reducing bias.
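
The opposing behavior of training and test MSE as flexibility grows can be illustrated with a small simulation; a sketch under purely illustrative assumptions (a sinusoidal true \(f\), Gaussian noise, and polynomial fits of increasing degree):

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)              # illustrative "true" function
n, sigma = 50, 0.3                               # training size and noise level

x_train = rng.uniform(0, 1, n)
y_train = f(x_train) + rng.normal(0, sigma, n)
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(0, sigma, 1000)

for degree in (1, 3, 5, 10, 15):                 # increasing model flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

The training MSE keeps decreasing with the degree, while the test MSE typically falls and then rises again once the fit starts chasing the noise.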

Prediction using Regression Models

Regression versus Classification Problems

Variables can be quantitative or qualitative (categorical). This distinction determines whether we face a regression or classification problem.

  • Regression Problems: Predictive problems with a quantitative response variable are regression problems. The aim is to predict a numerical value, such as house prices, sales, or stock prices.

  • Classification Problems: Predictive problems with a qualitative response variable are classification problems. The goal is to predict the category or class of an observation, like classifying emails as spam or not spam, diagnosing diseases, or predicting customer churn.

Remark 5. The line between regression and classification is not always sharp. Logistic regression, a classification method, is an extension of linear regression. It models probabilities on a transformed scale, suitable for binary or categorical response variables.

This section focuses on prediction using (multiple) linear regression models, revisiting concepts with two datasets. The next section will analyze classification problems.

Example: Automobile Bodily Injury Claims

Consider the “automobile bodily injury claims” dataset from “Regression Modeling with Actuarial and Financial Applications” by E.W. Frees. Data from the Insurance Research Council in 2002 on automobile bodily injury claims includes:

  • ATTORNEY: Claimant represented by an attorney (yes/no).

  • CLMSEX: Claimant gender (male/female).

  • MARITAL: Claimant marital status (married, single, widowed, divorced).

  • CLMINSUR: Driver of claimant’s vehicle uninsured (yes/no).

  • SEATBELT: Claimant wearing seatbelt/child restraint (yes/no).

  • CLMAGE: Claimant’s age (quantitative).

  • AGECLASS: Claimant’s age in five classes: 18 or younger, (18, 26], (26, 36], (36, 47], and over 47.

  • LOSS: Claimant’s total economic loss (in thousands, quantitative response).

The objective is to predict claim amounts (LOSS) for future policies. LOSS is quantitative, representing claim severity. Due to its skewed distribution, the logarithmic transformation, logLOSS, is used as the response for a more symmetric distribution.

Histograms of LOSS, LOSS (without max value), and log(LOSS). Log transformation provides a more symmetric distribution.

Marginal descriptions of explanatory variables:

Histograms of explanatory variables in the automobile bodily injury claims dataset.

Relationships between logLOSS and explanatory variables:

Boxplots of log(LOSS) against categorical explanatory variables and scatter plot against CLMAGE. ATTORNEY and SEATBELT appear to affect log(LOSS).

Attorney representation (ATTORNEY=yes) is linked to higher logLOSS:

Density estimates of log(LOSS) for ATTORNEY=yes and ATTORNEY=no. Claimants with attorneys tend to have higher losses.

A multiple linear regression model describes the joint effects of ATTORNEY (dummy variable) and CLMAGE on logLOSS:

\[\text{logLOSS} = \beta_0 + \beta_1 \text{ATTORNEY} + \beta_2 \text{CLMAGE} + \varepsilon \label{eq:logloss_model_1}\]

Coefficients:
            Estimate   SE       p-value
Intercept     0.7376   0.0851   <0.0001
ATTORNEYno   -1.3699   0.0729   <0.0001
CLMAGE        0.0160   0.0021   <0.0001
n - p = 1148, s = 1.23, Radj^2 = 0.259

Both coefficients are highly significant, indicating that ATTORNEY and CLMAGE affect the mean response. The negative coefficient for ATTORNEYno (the dummy variable for ATTORNEY = no) indicates a lower mean logLOSS for claimants who are not represented by an attorney.

For a 30-year-old with ATTORNEY=yes, estimated mean logLOSS and 95% CI:

\[\begin{aligned} \hat{y}&= 0.7376 + 30 \times 0.016 = 1.22 \\ SE &= 0.05 \\ 95\% \text{ CI} &= [1.12, 1.32]\end{aligned}\] On the LOSS scale: \[\begin{aligned} \text{Estimated mean} &= e^{1.22} = 3.38 \\ 95\% \text{ CI} &= [e^{1.12}, e^{1.32}] = [3.06, 3.73] \text{ (in thousands)}\end{aligned}\]

For a 30-year-old with ATTORNEY=no: \[\begin{aligned} \hat{y}&= 0.7376 - 1.3699 + 30 \times 0.016 = -0.15 \\ SE &= 0.05 \\ 95\% \text{ CI} &= [-0.26, -0.05]\end{aligned}\] On the LOSS scale: \[\begin{aligned} \text{Estimated mean} &= e^{-0.15} = 0.86 \\ 95\% \text{ CI} &= [e^{-0.26}, e^{-0.05}] = [0.77, 0.95] \text{ (in thousands)}\end{aligned}\]
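
These calculations can be reproduced directly from the reported estimates; a minimal Python sketch (the coefficients and the standard error of 0.05 are taken from the output above, so this is an arithmetic check rather than a refit of the model):

import numpy as np

b0, b_attorney_no, b_age = 0.7376, -1.3699, 0.0160   # estimates from the fitted model
se_fit = 0.05                                        # reported SE of the fitted mean

for attorney_no in (0, 1):                           # 0: ATTORNEY = yes, 1: ATTORNEY = no
    fit = b0 + b_attorney_no * attorney_no + b_age * 30
    lo, hi = fit - 1.96 * se_fit, fit + 1.96 * se_fit
    print(f"logLOSS {fit:6.2f} [{lo:.2f}, {hi:.2f}]   "
          f"LOSS (thousands) {np.exp(fit):.2f} [{np.exp(lo):.2f}, {np.exp(hi):.2f}]")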

Diagnostic plots (Residuals vs Fitted, Q-Q plot, Scale-Location, Residuals vs Leverage) suggest the model is acceptable.

To improve the model, add SEATBELT and a quadratic term for CLMAGE:

\[\text{logLOSS} = \beta_0 + \beta_1 \text{ATTORNEY} + \beta_2 \text{CLMAGE} + \beta_3 \text{CLMAGE}^2 + \beta_4 \text{SEATBELT} + \varepsilon \label{eq:logloss_model_2}\]

Coefficients:
            Estimate   SE       p-value
Intercept    -0.2249   0.1376   0.1024
ATTORNEYno   -1.3522   0.0725   <0.0001
CLMAGE        0.0828   0.0075   <0.0001
CLMAGE^2     -0.0009   0.0001   <0.0001
SEATBELTno    0.9241   0.2681   0.0006
s^2 = 1.404, Radj^2 = 0.321

The expanded model shows improved fit statistics (a higher adjusted R-squared, a lower AIC, and potentially a lower test MSE), and its diagnostic plots remain satisfactory. This model is therefore used for out-of-sample predictions.

For a 30-year-old insured claimant who is represented by an attorney and was wearing a seatbelt, the 95% prediction interval for log(LOSS) is [-0.89, 3.77], which corresponds to [0.41, 43.43] for LOSS (in thousands). An interval based on the cross-validated standard error is very similar: [0.41, 43.62].

Prediction intervals for logLOSS and LOSS vs. age:

95% prediction intervals for log(LOSS) and LOSS as a function of CLMAGE. Vertical lines for age 30 intervals.

Example: Advertising Data Revisited

We now revisit the advertising data. The linear additive regression model fitted earlier may not fully capture the structure of the data; its diagnostic plots suggest nonlinear covariate effects.

Diagnostic plots for linear additive regression model on advertising data, suggesting nonlinearities.

A model with additive effects of TV and radio, their interaction, and a quadratic TV effect is:

\[\text{sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{radio} + \beta_3 \text{TV}^2 + \beta_4 \text{TV} \times \text{radio} + \varepsilon \label{eq:sales_model_nonlinear}\]

Coefficients:
            Estimate   SE       p-value
Intercept     5.1371   0.1927   <0.0001
TV            0.0509   0.0022   <0.0001
radio         0.0352   0.0059   <0.0001
TV^2         -0.0001   0.0000   <0.0001
TV:radio      0.0011   0.0000   <0.0001
s^2 = 0.389, Radj^2 = 0.986

This model has a smaller AIC and a higher adjusted R-squared than the additive model. Its diagnostic plots are better, although some local deviations remain.

Diagnostic plots for nonlinear regression model on advertising data. Improvement over linear model is visible.

With a fitted multiple regression model, we can predict the response \(Y_0\) at a new predictor vector \(\mathbf{x}_0\) and quantify the uncertainty in the prediction. For instance, if we consider advertising budgets of $100,000 for TV and $20,000 for radio (and $0 for newspaper), we can obtain a 95% confidence interval for the mean sales \(\mu_0 = \mathbf{x}_0^T \boldsymbol{\beta}\), which might be [11.864, 12.112] (in thousands of units). A 95% prediction interval for the sales \(Y_0\) in a specific market with these budgets might be wider, for example, [10.752, 13.225] (in thousands of units), reflecting the additional uncertainty when predicting for a single market versus the average sales across many markets. Both intervals are centered around the point prediction of 11.988 thousand units.
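
A sketch of how such intervals could be obtained in Python with statsmodels; the file name and column names are assumptions about how the advertising data are stored:

import pandas as pd
import statsmodels.formula.api as smf

advertising = pd.read_csv("Advertising.csv")     # assumed file with columns TV, radio, sales
model = smf.ols("sales ~ TV + radio + I(TV**2) + TV:radio", data=advertising).fit()

new_market = pd.DataFrame({"TV": [100.0], "radio": [20.0]})
pred = model.get_prediction(new_market).summary_frame(alpha=0.05)

print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])   # 95% CI for the mean sales
print(pred[["obs_ci_lower", "obs_ci_upper"]])             # 95% prediction interval for Y_0

The prediction interval is wider than the confidence interval because it also accounts for the irreducible error \(\varepsilon\) of a single future observation.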

The Classification Problem

The Classification Setting

In contrast to regression problems where the response variable is quantitative, classification problems involve a qualitative or categorical response variable. In these scenarios, the objective shifts from predicting a numerical value to predicting the category or class to which a new observation belongs.

  • Classification Task: The core task in classification is to predict the class label or category for a new, unobserved data point, based on its features. This involves assigning the observation to one of a predefined set of categories.

  • Connection to Regression: Although classification deals with categorical outcomes, many classification methods, such as logistic regression and Linear Discriminant Analysis (LDA), estimate the probability of an observation belonging to each category. These probabilities are numerical scores, similar to the fitted values produced by regression models. In fact, some classification techniques can be seen as extensions of regression methods adapted for categorical outcomes.

  • Ubiquity of Classification: Classification problems are ubiquitous in various fields, including statistics, machine learning, and data science. They are essential for tasks ranging from automated decision-making to pattern recognition and data interpretation.

  • Examples of Classification Problems:

    • Email Spam Filtering: A common example is classifying emails as either "spam" or "not spam". This is a binary classification problem where the categories are mutually exclusive. Features used for classification might include email content, sender information, and email headers. The goal is to accurately separate unwanted spam emails from legitimate communications.

    • Medical Diagnosis: In medical contexts, classification is crucial for diagnosing conditions based on patient symptoms and test results. For instance, attributing a patient’s symptoms to one of several medical conditions, such as stroke, drug overdose, or epileptic seizure, is a multi-class classification problem.

    • Credit Scoring: In finance, credit scoring involves assessing the creditworthiness of loan applicants. Banks and financial institutions use classification models to determine whether an applicant is likely to default on a loan or repay it. This is typically a binary classification problem, categorizing applicants into "low-risk" (will repay) or "high-risk" (will default).

Example: Credit Scoring

To illustrate classification concepts, let’s consider the “credit scoring” dataset, derived from “Regression: Models, Methods and Applications” by Fahrmeir et al. This dataset is relevant to a bank loan service that needs to evaluate the potential loan insolvency risk associated with customers. The fundamental problem is to decide, based on customer characteristics, whether a customer will default on a loan or successfully pay it back. This is a typical credit scoring problem in the financial industry.

The dataset comprises information on \(n=1000\) private credits issued by a German bank. For each client, the dataset includes a binary response variable and several explanatory variables:

  • Y: This is the binary response variable indicating creditworthiness. It takes the value 1 if the client defaulted on the loan (did not pay back) and 0 if the client paid back the loan.

The five explanatory variables (predictors) are:

  • account: Represents the status of the client’s running account with the bank. It is a categorical variable with three levels: 0 for "no running account", 1 for "bad running account", and 2 for "good running account".

  • duration: The duration of the credit in months. This is a quantitative variable, indicating the length of the loan period.

  • amount: The credit amount in thousands of euros. This is another quantitative variable, representing the size of the loan.

  • moral: Reflects the client’s previous payment behavior. It is a categorical variable with two levels: 0 for "bad payment behavior" and 1 for "good payment behavior".

  • intuse: Indicates the intended use of the credit. It is a categorical variable with two levels: 0 for "business use" and 1 for "private use".

Histograms showing the marginal distributions of the response variable and each explanatory variable are presented in Figure 9. These histograms provide an overview of the distribution of each variable, such as the balance of classes in the response variable and the frequency of different categories or values in the predictors.

Histograms of response variable (y) and explanatory variables in the credit scoring dataset.

Further exploratory analysis involves examining the relationships between the binary response and the explanatory variables. Visualizations like mosaic plots or boxplots, as shown in Figure 10, can be used to visually identify if certain categories or ranges of predictor variables are associated with a higher or lower likelihood of loan default.

Relationships between binary response and explanatory variables in the credit scoring dataset.

Classification of a Categorical Response

Many of the model-accuracy concepts from regression carry over to classification, with some modifications. Here we focus on binary classification with response \(Y_i \in \{0, 1\}\) for \(i = 1, \dots, n\).

A classification method uses training data \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\) to build a classifier that returns a predicted class \(\hat{Y}_0 \in \{0, 1\}\) for a new predictor vector \(\mathbf{x}_0\).

Definition 3 (Training Error Rate). The training error rate is the proportion of misclassified training observations: \[\text{trainingER} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) \label{eq:training_error_rate}\] where \(\hat{y}_i\) is the predicted class for observation \(i\), and \(I(y_i \neq \hat{y}_i)\) is 1 if \(y_i \neq \hat{y}_i\), 0 otherwise.

Remark 6. Like the training MSE, the training error rate gives an overoptimistic assessment of a classifier’s predictive accuracy, because it is computed on the same data used to build the classifier.

Definition 4 (Test Error Rate). Test error rate on test observations \((\mathbf{x}_{01}, y_{01}), \dots, (\mathbf{x}_{0m}, y_{0m})\): \[\text{testER} = \frac{1}{m} \sum_{j=1}^{m} I(y_{0j} \neq \hat{y}_{0j}) \label{eq:test_error_rate}\]

Remark 7. When no separate test set is available, the test error rate can be estimated via cross-validation. The test error rate estimates the prediction (classification) error for a future response \(Y_0\) at \(\mathbf{x}_0\): \[\mathbb{E}[I(Y_0 \neq \hat{Y}_0)] = \mathbb{P}(Y_0 \neq \hat{Y}_0) \label{eq:prediction_error}\] The goal is to minimize this prediction error.
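
A minimal Python sketch of both error rates; the y_* arrays are observed classes and the yhat_* arrays are the classifier’s predictions (all names are illustrative):

import numpy as np

def error_rate(y, y_hat):
    """Proportion of misclassified observations: the average of I(y_i != yhat_i)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(y != y_hat)

# training_er = error_rate(y_train, yhat_train)   # overoptimistic
# test_er     = error_rate(y_test,  yhat_test)    # estimates the prediction error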

The Bayes Classifier

Definition 5 (Bayes Classifier). The Bayes classifier is theoretically optimal (minimizes classification error). It assigns observation \(\mathbf{x}_0\) to class 1 if: \[\mathbb{P}(Y_0 = 1 | \bm{X}_0 = \mathbf{x}_0) > \mathbb{P}(Y_0 = 0 | \bm{X}_0 = \mathbf{x}_0) \label{eq:bayes_classifier_rule}\] and to class 0 otherwise (assuming equal error costs).

Remark 8. The Bayes classifier is the ideal benchmark, but the conditional distribution \(\mathbb{P}(Y_0 = 1 | \bm{X}_0 = \mathbf{x}_0)\) is usually unknown for real data, making direct computation impossible. Practical classification methods approximate this conditional probability.

Classification based on Logistic Regression

Definition 6 (Logistic Regression Model). Logistic regression directly models \(P(Y_i = 1 | \bm{X}_i = \mathbf{x}_i)\) as: \[\mathbb{P}(Y_i = 1 | \bm{X}_i = \mathbf{x}_i) = \frac{\exp\{\mathbf{x}_i^T \boldsymbol{\beta}\}}{1 + \exp\{\mathbf{x}_i^T \boldsymbol{\beta}\}} \label{eq:logistic_regression_model}\]

Remark 9. The decision boundary, where \(\mathbb{P}(Y = 1 | \bm{X}= \mathbf{x}) = 0.5\), is linear and given by \(\mathbf{x}^T \boldsymbol{\beta}= 0\). Logistic regression is therefore effective when the Bayes decision boundary is approximately linear.

Logistic regression decision boundary vs. Bayes decision boundary for simulated two-predictor data.

Remark 10. Logistic regression builds on standard regression theory. When it is used for classification, the focus is on predictive performance rather than on inference about the coefficients. Extensions to more than two categories (multinomial logistic regression) exist but are less commonly used.
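
A sketch of logistic-regression classification in Python with scikit-learn; the synthetic data generated here is only a stand-in for a real predictor matrix and binary response such as the credit-scoring data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)   # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:, 1]           # estimated P(Y = 1 | X = x)
y_hat = (prob > 0.5).astype(int)                 # classify with the 0.5 threshold
print("test error rate:", (y_hat != y_test).mean())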

Linear Discriminant Analysis (LDA)

Definition 7 (Linear Discriminant Analysis (LDA)). Linear Discriminant Analysis (LDA) is another classification method. (A linear regression fitted directly to a binary response can give results similar to logistic regression, but may produce fitted “probabilities” outside [0, 1].) Instead of modeling the conditional probabilities directly, LDA models the distribution of the predictors within each class and uses Bayes’ theorem to obtain the conditional class probabilities.

LDA assumes that for each class \(s\), the predictors \(\bm{X}\) follow a Gaussian distribution \(N(\bm{\mu}_s, \Sigma)\), sharing a common covariance matrix \(\Sigma\) but with class-specific mean vectors \(\bm{\mu}_s\).

Using Bayes’ theorem, the probability of class \(s\) given predictor vector \(\mathbf{x}\) is: \[P(Y = s | \bm{X}= \mathbf{x}) = \frac{\pi_s f_s(\mathbf{x})}{\sum_{r=1}^{S} \pi_r f_r(\mathbf{x})} \label{eq:lda_bayes_theorem}\] where \(f_s(\mathbf{x})\) is the multivariate Gaussian density for class \(s\), and \(\pi_s = P(Y=s)\) is the prior probability of class \(s\).

The LDA classifier assigns \(\mathbf{x}\) to the class \(s\) that maximizes this posterior probability. For computational purposes, LDA uses discriminant functions \(\delta_s(\mathbf{x})\) which are linear in \(\mathbf{x}\): \[\delta_s(\mathbf{x}) = \mathbf{x}^T \Sigma^{-1} \bm{\mu}_s - \frac{1}{2} \bm{\mu}_s^T \Sigma^{-1} \bm{\mu}_s + \log(\pi_s) \label{eq:lda_discriminant_function}\] The classification rule is to assign \(\mathbf{x}\) to the class \(s\) for which \(\delta_s(\mathbf{x})\) is largest.

Remark 11. LDA assumes normality of predictors and equal covariance matrices across classes. While linear regression for binary outcomes can be similar to logistic regression, LDA provides a probabilistic framework rooted in Bayes’ theorem and Gaussian assumptions.
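
A minimal NumPy sketch of these discriminant functions, with priors, class means, and the pooled within-class covariance estimated from training data (illustrative only; in practice one would typically use a library implementation such as scikit-learn’s LinearDiscriminantAnalysis):

import numpy as np

def lda_fit(X, y):
    """Estimate class priors, class means, and the pooled within-class covariance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == s) for s in classes])
    means = np.array([X[y == s].mean(axis=0) for s in classes])
    scatter = sum(np.cov(X[y == s], rowvar=False) * (np.sum(y == s) - 1) for s in classes)
    cov = scatter / (len(y) - len(classes))
    return classes, priors, means, np.linalg.inv(cov)

def lda_predict(x, classes, priors, means, cov_inv):
    """Assign x to the class with the largest discriminant delta_s(x)."""
    deltas = [x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return classes[int(np.argmax(deltas))]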

k-Nearest Neighbors (kNN)

k-Nearest Neighbors Classification

Input: Training data \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\), new data point \(\mathbf{x}_0\), number of neighbors \(k\), distance metric \(d\)
Output: Predicted class \(\hat{y}_0\) for \(\mathbf{x}_0\)

  1. Calculate the distance \(d(\mathbf{x}_0, \mathbf{x}_i)\) between \(\mathbf{x}_0\) and each training data point \(\mathbf{x}_i\), for \(i = 1, \dots, n\).

  2. Select the \(k\) nearest neighbors of \(\mathbf{x}_0\) from the training data based on the distance metric \(d\), and let \(N_0\) be the set of indices of these \(k\) nearest neighbors.

  3. For each class \(j\), estimate the conditional probability: \[\hat{P}(Y_0 = j | \bm{X}_0 = \mathbf{x}_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j)\]

  4. Assign to \(\mathbf{x}_0\) the class label \(\hat{y}_0\) of the class \(j\) that maximizes \(\hat{P}(Y_0 = j | \bm{X}_0 = \mathbf{x}_0)\), and return \(\hat{y}_0\).
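
A from-scratch Python sketch of this procedure, using Euclidean distance as an illustrative choice of metric:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x0, k=5):
    """Predict the class of x0 by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)     # distances to every training point
    neighbors = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in neighbors)   # class frequencies among the neighbors
    return votes.most_common(1)[0][0]                # class with the largest estimated probability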

Complexity Analysis:

  • Training Phase: kNN has a lazy learning approach, and the training phase is minimal. It primarily involves storing the training data. Thus, the training complexity is \(O(1)\).

  • Prediction Phase: To predict the class for a new data point, kNN needs to calculate the distance to all \(n\) training points, sort these distances to find the \(k\) nearest neighbors, and then perform a majority vote.

    • Distance calculation: \(O(n \times p)\), where \(p\) is the number of predictors (features).

    • Sorting distances: \(O(n \log n)\) or \(O(n)\) if using selection algorithms to find the \(k\) smallest distances.

    • Majority vote: \(O(k)\).

    Therefore, the prediction cost for a single test point is dominated by the distance calculation and neighbor selection: roughly \(O(n p + n \log n)\) with a full sort, \(O(n p + n \log k)\) when a size-\(k\) heap is used, or \(O(n p + n)\) with a selection algorithm. Since distances to all \(n\) training points must be recomputed for every test point, predicting for \(m\) test points multiplies these costs by \(m\), for example \(O(m n p + m n \log k)\).

Comments on kNN

Remark 12.

  • Effectiveness: kNN can be very effective, especially when the Bayes decision boundary is highly nonlinear. Its non-parametric nature allows it to adapt to complex decision boundaries.

  • Versatility: Applicable to both numeric and categorical predictors, provided a suitable distance metric is defined. For categorical predictors, one can use metrics like Hamming distance or convert categorical variables to numerical representations.

  • Optimal k: The choice of \(k\) is critical and affects the bias-variance trade-off. A small \(k\) leads to a more flexible model (low bias, high variance), while a large \(k\) leads to a less flexible model (high bias, low variance). The optimal \(k\) is typically chosen using cross-validation to minimize the error rate on validation data, as in the sketch after this list.

  • Bias-Variance Trade-off: As \(k\) increases, the bias increases, and the variance decreases, illustrating the bias-variance trade-off. Selecting an appropriate \(k\) is crucial for balancing this trade-off and achieving good generalization.

  • Regression: kNN can also be used for regression problems. In kNN regression, the prediction for a new data point is the average (or weighted average) of the response values of its \(k\) nearest neighbors in the training data.
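
A sketch of choosing \(k\) by cross-validation with scikit-learn; the synthetic data is a stand-in for a real training set:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)   # stand-in data

cv_error = {}
for k in (1, 3, 5, 9, 15, 25, 51):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10, scoring="accuracy")
    cv_error[k] = 1 - acc.mean()                     # cross-validated error rate for this k

best_k = min(cv_error, key=cv_error.get)             # k with the smallest estimated error
print(cv_error, "-> chosen k:", best_k)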

Confusion Matrix and Evaluation Metrics

Confusion Matrix

Definition 8 (Confusion Matrix). The confusion matrix is a table that summarizes the performance of a classification model by presenting the counts of true positive, true negative, false positive, and false negative predictions. It is particularly useful for binary classification.

Confusion Matrix for Binary Classification

                            Actual Class
  Predicted Class           \(Y = 0\) (Observed)      \(Y = 1\) (Observed)
  \(Y = 0\) (Predicted)     True Negative (TN)        False Negative (FN)
  \(Y = 1\) (Predicted)     False Positive (FP)       True Positive (TP)

Classification Performance Metrics

Based on the confusion matrix, several metrics can be derived to evaluate classifier performance:

  • Accuracy: Overall correctness of the classifier: \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

  • True Positive Rate (Sensitivity, Recall): Proportion of actual positives correctly identified: \[\text{Sensitivity} = \frac{TP}{TP + FN}\]

  • True Negative Rate (Specificity): Proportion of actual negatives correctly identified: \[\text{Specificity} = \frac{TN}{TN + FP}\]

  • False Positive Rate: Proportion of actual negatives incorrectly classified as positives: \[\text{False Positive Rate} = \frac{FP}{TN + FP} = 1 - \text{Specificity}\]

  • Precision (Positive Predictive Value): Proportion of predicted positives that are actually positive: \[\text{Precision} = \frac{TP}{TP + FP}\]

  • Negative Predictive Value: Proportion of predicted negatives that are actually negative: \[\text{Negative Predictive Value} = \frac{TN}{TN + FN}\]

  • Log-Odds Ratio: \[\text{Log-Odds Ratio} = \log\left(\frac{TP \times TN}{FN \times FP}\right)\]

Remark 13. These metrics provide a more detailed view of classifier performance than just accuracy, especially when dealing with imbalanced datasets or when different types of errors have different costs.
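
A minimal Python sketch computing these metrics from the four confusion-matrix counts (the counts below are purely illustrative):

import math

tp, tn, fp, fn = 85, 890, 30, 45                     # illustrative counts

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                         # true positive rate (recall)
specificity = tn / (tn + fp)                         # true negative rate
fpr         = fp / (tn + fp)                         # 1 - specificity
precision   = tp / (tp + fp)                         # positive predictive value
npv         = tn / (tn + fn)                         # negative predictive value
log_odds_ratio = math.log((tp * tn) / (fn * fp))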

ROC Curve

Receiver Operating Characteristic (ROC) Curve

Definition 9 (Receiver Operating Characteristic (ROC) Curve). The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings.

  • Threshold Variation: For classifiers that output a probability or score, the classification decision depends on a threshold (often 0.5). The ROC curve shows performance across all possible thresholds.

  • True Positive Rate (TPR) vs. False Positive Rate (FPR):

    • TPR (Sensitivity) is on the y-axis.

    • FPR (1 - Specificity) is on the x-axis.

  • AUC (Area Under the Curve): The Area Under the ROC Curve (AUC) quantifies the overall performance of the classifier.

    • AUC ranges from 0 to 1.

    • A higher AUC indicates better classifier performance.

    • AUC = 0.5 represents a classifier no better than random guessing.

    • AUC = 1 represents a perfect classifier.

Example of an ROC curve, showing different threshold points and AUC.

Using ROC Curves for Comparison

Remark 14. ROC curves are useful for comparing different binary classifiers. The classifier with an ROC curve that is closer to the top-left corner (higher AUC) is generally considered better, as it achieves a higher true positive rate for a lower false positive rate across different thresholds.
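
A sketch of such a comparison in Python with scikit-learn, using synthetic stand-in data and two of the classifiers discussed above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)   # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in {"logistic regression": LogisticRegression(max_iter=1000),
                  "5-nearest neighbors": KNeighborsClassifier(n_neighbors=5)}.items():
    prob = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # estimated P(Y = 1 | x)
    fpr, tpr, _ = roc_curve(y_te, prob)                    # points on the ROC curve
    print(f"{name}: AUC = {roc_auc_score(y_te, prob):.3f}")

Plotting tpr against fpr for each classifier gives the ROC curves to compare.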

Further Methods in Classification

Classification is a broad field with numerous methods beyond logistic regression, LDA, and kNN. Some further useful methods include:

  • Naive Bayes: A simple probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions between predictors. Effective and computationally efficient, especially with high-dimensional data.

  • Classification Trees and Decision Stumps: Tree-based methods that partition the predictor space into regions to make classifications. Decision trees are very general and interpretable.

  • Ensemble Methods: Combine multiple classifiers to improve predictive performance. Examples include:

    • Boosting: Iteratively combines weak learners, focusing on misclassified instances.

    • Bagging: Creates multiple bootstrap samples and trains classifiers on each, averaging predictions.

    • Random Forests: An extension of bagging that decorrelates trees by random feature selection.

  • Support Vector Machines (SVMs): Powerful methods that find optimal hyperplanes to separate classes, effective in high-dimensional spaces.

  • Neural Networks and Deep Learning: Complex models capable of learning highly nonlinear relationships, widely used in various classification tasks, especially with large datasets.

Conclusion

Summary: This lecture introduced fundamental concepts in predictive modeling and classification. We covered:

  • The distinction between prediction and inference.

  • Methods for assessing predictive model accuracy, including training and test MSE, and cross-validation.

  • The bias-variance trade-off and the phenomenon of overfitting.

  • Prediction using linear regression models, illustrated with advertising and automobile claims datasets.

  • Classification problems, error rates, and the Bayes classifier.

  • Logistic Regression and Linear Discriminant Analysis (LDA) for classification.

  • k-Nearest Neighbors (kNN) as a non-parametric classification method.

  • Evaluation metrics for classification, including the confusion matrix and ROC curves.

Key Takeaways:

  • Predictive modeling is crucial for forecasting and decision-making in various fields.

  • Model accuracy assessment requires evaluating performance on unseen data to avoid overfitting.

  • The bias-variance trade-off is central to model selection and complexity tuning.

  • Regression and classification are fundamental types of predictive problems, each with specific methods and evaluation approaches.

  • Classification performance is multifaceted and requires metrics beyond simple accuracy, such as sensitivity, specificity, and AUC.

Further Topics: For the next lecture, we could delve deeper into:

  • Advanced classification methods, such as Support Vector Machines and tree-based ensemble methods.

  • Model selection and hyperparameter tuning techniques.

  • Handling imbalanced datasets in classification.

  • Applications of predictive modeling and classification in specific domains.

List of Definitions, Theorems, Examples, Remarks, and Algorithms in Tcolorboxes

Definitions

A widely used metric for assessing the fit of a regression model to the training data is the training mean squared error (MSE). It is calculated as the average of the squared differences between the actual response values and the predicted values from the model: \[\text{MSE}_{training} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(\mathbf{x}_i))^2\]

The test MSE is then calculated as: \[\text{MSE}_{test} = \frac{1}{m} \sum_{j=1}^{m} (y_{0j} - \hat{f}(\mathbf{x}_{0j}))^2\] where \(\hat{f}\) is the model estimated using the training observations. The test MSE provides a more reliable estimate of the model’s generalization to new data.

Training error rate quantifies classifier accuracy: \[\text{trainingER} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)\] where \(\hat{y}_i\) is the predicted class for observation \(i\), and \(I(y_i \neq \hat{y}_i)\) is 1 if \(y_i \neq \hat{y}_i\), 0 otherwise.

Test error rate on test observations \((\mathbf{x}_{01}, y_{01}), \dots, (\mathbf{x}_{0m}, y_{0m})\): \[\text{testER} = \frac{1}{m} \sum_{j=1}^{m} I(y_{0j} \neq \hat{y}_{0j})\]

The Bayes classifier is theoretically optimal (minimizes classification error). It assigns observation \(\mathbf{x}_0\) to class 1 if: \[\mathbb{P}(Y_0 = 1 | \bm{X}_0 = \mathbf{x}_0) > \mathbb{P}(Y_0 = 0 | \bm{X}_0 = \mathbf{x}_0)\] andto class 0 otherwise (assuming equal error costs).

Logistic regression directly models \(P(Y_i = 1 | \bm{X}_i = \mathbf{x}_i)\) as: \[\mathbb{P}(Y_i = 1 | \bm{X}_i = \mathbf{x}_i) = \frac{\exp\{\mathbf{x}_i^T \boldsymbol{\beta}\}}{1 + \exp\{\mathbf{x}_i^T \boldsymbol{\beta}\}}\]

Linear Discriminant Analysis (LDA) is another classification method. Linear regression for binary responses can give similar results to logistic regression but may produce probabilities outside [0, 1]. LDA models predictors and uses Bayes’ theorem for conditional probabilities.

LDA assumes that for each class \(s\), the predictors \(\bm{X}\) follow a Gaussian distribution \(N(\bm{\mu}_s, \Sigma)\), sharing a common covariance matrix \(\Sigma\) but with class-specific mean vectors \(\bm{\mu}_s\).

The confusion matrix is a table that summarizes the performance of a classification model by presenting the counts of true positive, true negative, false positive, and false negative predictions. It is particularly useful for binary classification.

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold settings.

Theorems

The test MSE can estimate the expected test MSE for a future random response \(Y_0\) with predictor value \(\mathbf{x}_0\). The expected squared error can be decomposed as: \[\mathbb{E}[(Y_0 - \hat{Y}_0)^2] = \mathbb{V}[\hat{f}(\mathbf{x}_0)] + [\text{Bias}(\hat{f}(\mathbf{x}_0))]^2 + \mathbb{V}[\varepsilon]\] where \(\hat{Y}_0 = \hat{f}(\mathbf{x}_0)\) is the point predictor from the fitted model.

Examples

A marketing team might want to understand the impact of advertising spending on sales (inference) to optimize their marketing strategy. Simultaneously, they need to predict future sales based on planned advertising budgets (prediction) to manage inventory and revenue expectations. In this case, both inference and prediction are crucial. Inference helps understand the effectiveness of different advertising channels, while prediction aids in planning and resource allocation.

A bank uses customer data to predict the probability of loan default (prediction). The primary goal is to accurately classify loan applicants as low-risk or high-risk to minimize potential losses. While understanding the factors contributing to default (inference) can be valuable, the immediate objective is accurate prediction for risk management.

To illustrate the concepts of prediction and inference, let’s consider the “advertising data” example, derived from the well-known textbook “An Introduction to Statistical Learning” by James et al. This dataset explores the relationship between product sales and advertising budgets across different media channels. The dataset includes observations from 200 distinct markets and records the sales of a particular product (response variable \(Y\)) along with advertising budgets allocated to three different media: TV (\(X_1\)), radio (\(X_2\)), and newspaper (\(X_3\)). All monetary values are recorded in thousands of dollars, and sales are in thousands of units.

Consider the “automobile bodily injury claims” dataset from “Regression Modeling with Actuarial and Financial Applications” by E.W. Frees. Data from the Insurance Research Council in 2002 on automobile bodily injury claims includes:

  • ATTORNEY: Claimant represented by an attorney (yes/no).

  • CLMSEX: Claimant gender (male/female).

  • MARITAL: Claimant marital status (married, single, widowed, divorced).

  • CLMINSUR: Driver of claimant’s vehicle uninsured (yes/no).

  • SEATBELT: Claimant wearing seatbelt/child restraint (yes/no).

  • CLMAGE: Claimant’s age (quantitative).

  • AGECLASS: Claimant’s age in five classes: (-18], (18,26], (26,36], (36,47], (47+].

  • LOSS: Claimant’s total economic loss (in thousands, quantitative response).

The objective is to predict claim amounts (LOSS) for future policies. LOSS is quantitative, representing claim severity. Due to its skewed distribution, the logarithmic transformation, logLOSS, is used as the response for a more symmetric distribution.

Revisiting advertising data. Linear additive regression model may not fully capture data structure. Diagnostic plots suggest nonlinear covariate effects.

A common example is classifying emails as either "spam" or "not spam". This is a binary classification problem where the categories are mutually exclusive. Features used for classification might include email content, sender information, and email headers. The goal is to accurately separate unwanted spam emails from legitimate communications.

In medical contexts, classification is crucial for diagnosing conditions based on patient symptoms and test results. For instance, attributing a patient’s symptoms to one of several medical conditions, such as stroke, drug overdose, or epileptic seizure, is a multi-class classification problem.

In finance, credit scoring involves assessing the creditworthiness of loan applicants. Banks and financial institutions use classification models to determine whether an applicant is likely to default on a loan or repay it. This is typically a binary classification problem, categorizing applicants into "low-risk" (will repay) or "high-risk" (will default).

To illustrate classification concepts, let’s consider the “credit scoring” dataset, derived from “Regression: Models, Methods and Applications” by Fahrmeir et al. This dataset is relevant to a bank loan service that needs to evaluate the potential loan insolvency risk associated with customers. The fundamental problem is to decide, based on customer characteristics, whether a customer will default on a loan or successfully pay it back. This is a typical credit scoring problem in the financial industry.

The dataset comprises information on \(n=1000\) private credits issued by a German bank. For each client, the dataset includes a binary response variable and several explanatory variables:

  • Y: This is the binary response variable indicating creditworthiness. It takes the value 1 if the client defaulted on the loan (did not pay back) and 0 if the client paid back the loan.

The five explanatory variables (predictors) are:

  • account: Represents the status of the client’s running account with the bank. It is a categorical variable with three levels: 0 for "no running account", 1 for "bad running account", and 2 for "good running account".

  • duration: The duration of the credit in months. This is a quantitative variable, indicating the length of the loan period.

  • amount: The credit amount in thousands of euros. This is another quantitative variable, representing the size of the loan.

  • moral: Reflects the client’s previous payment behavior. It is a categorical variable with two levels: 0 for "bad payment behavior" and 1 for "good payment behavior".

  • intuse: Indicates the intended use of the credit. It is a categorical variable with two levels: 0 for "business use" and 1 for "private use".
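A minimal model-fitting sketch, assuming the data are stored in a hypothetical file credit.csv with the column names above; the three-level predictor account is dummy-coded before fitting a logistic regression for the default indicator Y:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical file; columns follow the variable list above.
credit = pd.read_csv("credit.csv")

# Dummy-code the three-level account variable; moral and intuse are already 0/1.
X = pd.get_dummies(credit[["account", "duration", "amount", "moral", "intuse"]],
                   columns=["account"], drop_first=True)
y = credit["Y"]

# Hold out part of the data so predictive (not just fitting) accuracy can be assessed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```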

Remarks

The training MSE quantifies how well the model fits the training data. However, it is crucial to note:

  • Fitting Accuracy versus Predictive Accuracy: The training MSE reflects the fitting accuracy of the model, indicating how well the model conforms to the training data. It does not directly measure the predictive accuracy, which is the model’s ability to generalize and make accurate predictions on new, unseen data. A model with a low training MSE may not necessarily have a low test MSE.

  • Overoptimistic Assessment of Model Performance: Evaluating a model’s performance solely based on the training MSE can lead to an overoptimistic assessment. This is because the model is trained and evaluated on the same dataset. The model has effectively "memorized" the training data and is optimized to minimize the error on these specific data points. Consequently, the training MSE tends to underestimate the error rate on new, unobserved data. For instance, consider fitting a highly complex polynomial to a small dataset. It might perfectly interpolate the training points, resulting in zero training MSE. However, such a model is likely to perform poorly on new data points due to overfitting.

  • Availability of Test Data: In some scenarios, a distinct test dataset is readily available. This is particularly common when the original dataset is large enough to be split into training and testing sets.

  • Cross-validation: When a separate test set is not available, techniques like cross-validation can be used to estimate the test MSE using only the training data. Cross-validation involves dividing the training data into subsets, training the model on some subsets, and testing on the remaining subset, repeating this to average the test error estimate.

  • Training MSE versus Test MSE Behavior: As we increase the flexibility of a statistical learning method (e.g., by increasing the complexity of a model), we typically observe a consistent decrease in the training MSE, since more flexible models can fit the training data more closely. However, the test MSE often exhibits a U-shaped behavior: initially it decreases, indicating improved prediction accuracy on unseen data, but beyond a certain level of flexibility it starts to increase, signifying a decline in generalization performance (this pattern is illustrated in the simulation sketch following these remarks).

  • The Phenomenon of Overfitting: The increase in test MSE that occurs with increasing model flexibility beyond an optimal point is known as overfitting. Overfitting happens when a model becomes excessively tailored to the idiosyncrasies of the training data, including the random noise and fluctuations present in that specific dataset. An overfitted model learns the training data too closely, capturing not only the underlying patterns but also the noise. Consequently, it becomes overly complex and sensitive to the specific training dataset, failing to generalize effectively to new, unseen data. In essence, an overfitted model memorizes the training examples rather than learning the underlying relationship, leading to poor predictive performance on new observations. For example, a decision tree model allowed to grow very deep might perfectly classify all training samples but perform poorly on test data because it has created overly specific rules that do not generalize.

  • Irreducible Error: The term \(\mathbb{V}[\varepsilon]\) represents the irreducible error, the variance of the error term \(\varepsilon\). It is the lowest possible expected test MSE, unattainable by any model.

  • Bias-Variance Trade-off: To minimize the expected test MSE, we need a method with both low variance and low bias; the tension between these two goals is known as the bias-variance trade-off. Generally, as model flexibility increases:

    • Variance Increases: More flexible models are more sensitive to training data changes, increasing variance in \(\hat{f}(\mathbf{x}_0)\).

    • Bias Decreases: More flexible models better capture the true relationship between \(\bm{X}\) and \(Y\), reducing bias.
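The behavior described above (training MSE falling steadily while test MSE is U-shaped) can be reproduced with a small simulation; the sketch below uses simulated data, not one of the datasets above, with polynomial degree as the flexibility parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a smooth nonlinear signal plus noise (the irreducible error).
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + rng.normal(scale=0.4, size=x.size)

# First 150 observations for training, the remaining 50 for testing.
train, test = np.arange(150), np.arange(150, 200)

for degree in (1, 3, 6, 12):
    # Higher polynomial degree corresponds to a more flexible fit.
    fit = np.poly1d(np.polyfit(x[train], y[train], degree))
    train_mse = np.mean((y[train] - fit(x[train])) ** 2)
    test_mse = np.mean((y[test] - fit(x[test])) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training MSE decreases as the degree grows, while the test MSE typically bottoms out at a moderate degree and then rises again as the fit starts to track the noise.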

The line between regression and classification is not always sharp. Logistic regression, a classification method, is an extension of linear regression. It models probabilities on a transformed scale, suitable for binary or categorical response variables.

The training error rate is overoptimistic as an estimate of future performance, so the test error rate is estimated instead, for example via cross-validation. The test error rate estimates the prediction (classification) error for a future response \(Y_0\) at \(\mathbf{x}_0\): \[\mathbb{E}[I(Y_0 \neq \hat{Y}_0)] = \mathbb{P}(Y_0 \neq \hat{Y}_0)\] The goal is to minimize this prediction error.
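A minimal sketch of estimating this prediction error by cross-validation (simulated data and scikit-learn, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulated binary data stand in for a real classification problem.
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Cross-validated accuracy; one minus its mean estimates the prediction
# error P(Y0 != Yhat0) for a future observation.
accuracy = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print("estimated prediction error:", round(1 - accuracy.mean(), 3))
```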

The Bayes classifier is ideal, but \(P(Y_0 | \bm{X}_0 = \mathbf{x}_0)\) is usually unknown for real data, making direct computation impossible; practical classification methods approximate this conditional probability.

For logistic regression, the decision boundary \(P(Y = 1 | \bm{X}= \mathbf{x}) = 0.5\) is linear, given by \(\mathbf{x}^T \boldsymbol{\beta}= 0\). The method is therefore effective when the Bayes decision boundary is approximately linear.

Logistic regression builds on regression theory, but in classification the focus is on predictive performance rather than inference on the coefficients. Extensions to more than two response categories exist but are less commonly used.

LDA assumes normality of predictors and equal covariance matrices across classes. While linear regression for binary outcomes can be similar to logistic regression, LDA provides a probabilistic framework rooted in Bayes’ theorem and Gaussian assumptions.
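A minimal sketch of LDA in the setting it assumes, namely Gaussian classes with different means and a shared covariance matrix, fitted with scikit-learn's LinearDiscriminantAnalysis (simulated data; all settings are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two Gaussian classes with different means but a common covariance matrix,
# which is exactly the model LDA assumes.
n = 200
cov = [[1.0, 0.5], [0.5, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X1 = rng.multivariate_normal([2.0, 1.0], cov, size=n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Posterior class probabilities follow from Bayes' theorem under the
# Gaussian model, and the resulting decision boundary is linear.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict_proba(X[:3]))
print("training accuracy:", lda.score(X, y))
```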

  • Effectiveness: kNN can be very effective, especially when the Bayes decision boundary is highly nonlinear. Its non-parametric nature allows it to adapt to complex decision boundaries.

  • Versatility: Applicable to both numeric and categorical predictors, provided a suitable distance metric is defined. For categorical predictors, one can use metrics like Hamming distance or convert categorical variables to numerical representations.

  • Optimal k: The choice of \(k\) is critical and affects the bias-variance trade-off. A small \(k\) leads to a more flexible model (low bias, high variance), while a large \(k\) leads to a less flexible model (high bias, low variance). The optimal \(k\) is typically chosen using cross-validation to minimize the error rate on validation data (see the sketch after this list).

  • Bias-Variance Trade-off: As \(k\) increases, the bias increases, and the variance decreases, illustrating the bias-variance trade-off. Selecting an appropriate \(k\) is crucial for balancing this trade-off and achieving good generalization.

  • Regression: kNN can also be used for regression problems. In kNN regression, the prediction for a new data point is the average (or weighted average) of the response values of its \(k\) nearest neighbors in the training data.
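A minimal sketch of choosing \(k\) by cross-validation, using simulated two-class data with a nonlinear boundary and scikit-learn (all settings are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulated two-class data with a roughly circular (nonlinear) class boundary.
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400) < 0.5).astype(int)

# Small k: flexible (low bias, high variance); large k: smoother (high bias, low variance).
for k in (1, 5, 15, 50):
    accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k = {k:2d}: estimated error rate = {1 - accuracy.mean():.3f}")
```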

These metrics provide a more detailed view of classifier performance than just accuracy, especially when dealing with imbalanced datasets or when different types of errors have different costs.

ROC curves are useful for comparing different binary classifiers. The classifier with an ROC curve that is closer to the top-left corner (higher AUC) is generally considered better, as it achieves a higher true positive rate for a lower false positive rate across different thresholds.
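A sketch of comparing two classifiers by the area under the ROC curve, using scikit-learn's roc_auc_score on held-out simulated data (the classifiers and settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated binary data with three predictors.
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# The classifier whose ROC curve lies closer to the top-left corner
# has the larger AUC; AUC summarizes the curve in a single number.
for name, clf in [("logistic regression", LogisticRegression()),
                  ("kNN (k = 15)", KNeighborsClassifier(n_neighbors=15))]:
    scores = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(name, "AUC:", round(roc_auc_score(y_test, scores), 3))
```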

Algorithms

Input: Training data \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\), new data point \(\mathbf{x}_0\), number of neighbors \(k\), distance metric \(d\)
Output: Predicted class \(\hat{y}_0\) for \(\mathbf{x}_0\)

1. Calculate the distance \(d(\mathbf{x}_0, \mathbf{x}_i)\) between \(\mathbf{x}_0\) and each training data point \(\mathbf{x}_i\), for \(i = 1, \dots, n\).

2. Select the \(k\) nearest neighbors of \(\mathbf{x}_0\) from the training data based on the distance metric \(d\), and let \(N_0\) be the set of indices of these \(k\) nearest neighbors.

3. For each class \(j\), estimate the conditional probability \[\hat{P}(Y_0 = j | \bm{X}_0 = \mathbf{x}_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j)\]

4. Assign to \(\mathbf{x}_0\) the class label \(\hat{y}_0\) given by the class \(j\) that maximizes \(\hat{P}(Y_0 = j | \bm{X}_0 = \mathbf{x}_0)\).

5. Return the predicted class \(\hat{y}_0\).
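The steps above translate directly into a short NumPy sketch (Euclidean distance and a simple majority vote; np.argsort is used for clarity, although a partial selection such as np.argpartition would be cheaper, as discussed in the complexity analysis below):

```python
import numpy as np

def knn_classify(X_train, y_train, x0, k):
    """Predict the class of x0 by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distances from x0 to every training point.
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    # Step 2: indices N0 of the k nearest neighbors (full sort for clarity).
    neighbors = np.argsort(dists)[:k]
    # Steps 3-4: estimated conditional probabilities and the maximizing class.
    classes, counts = np.unique(y_train[neighbors], return_counts=True)
    return classes[np.argmax(counts)]

# Tiny usage example with made-up points.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.95, 1.05]), k=3))  # predicts class 1
```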

Complexity Analysis:

  • Training Phase: kNN is a lazy learning method, so the training phase is minimal: it primarily involves storing the training data. Thus, the training-time computational complexity is \(O(1)\).

  • Prediction Phase: To predict the class for a new data point, kNN needs to calculate the distance to all \(n\) training points, sort these distances to find the \(k\) nearest neighbors, and then perform a majority vote.

    • Distance calculation: \(O(n \times p)\), where \(p\) is the number of predictors (features).

    • Sorting distances: \(O(n \log n)\) for a full sort, or roughly \(O(n \log k)\) to \(O(n)\) when a bounded heap or a selection algorithm is used to find the \(k\) smallest distances.

    • Majority vote: \(O(k)\).

    Therefore, the prediction complexity for a single test point is dominated by the distance calculations and the neighbor selection, giving approximately \(O(n \times p + n \log k)\) with heap-based selection or \(O(n \times p + nk)\) with repeated scans. Since distances to all \(n\) training points must be recomputed for every test point, the total prediction complexity for \(m\) test points becomes \(O(m \times n \times p + m \times n \log k)\) or \(O(m \times n \times p + m \times nk)\).