Towards Multiple Linear Regression and Logistic Regression

Author

Your Name

Published

February 12, 2025

Introduction

Introduction to Multiple Regression

  • In simple linear regression, a response variable \(Y\) is regressed on a single explanatory variable.

  • Multiple linear regression generalizes this methodology to allow for multiple explanatory (predictor) variables, often referred to as covariates.

  • The multiple linear regression model is a cornerstone of statistical modeling.

  • While it relies on assumptions that may not always hold in real-world scenarios, it serves as a foundational model upon which many other statistical models are built.

  • Despite its simplicity, multiple linear regression is invaluable for both interpretation and prediction in various applications.

Illustrative Example: Factor Variables and Non-linearities

  • To illustrate the flexibility of multiple linear regression, consider a scenario where the response variable \(y\) is influenced by:

    • Two numeric regressors, \(u\) and \(v\).

    • A factor variable \(g\) with three levels, categorizing observations into three distinct groups.

  • Factor variables can be incorporated into the model using dummy variables. For a factor with three levels, we can use two dummy variables. For instance, we can define two dummy indicators showing whether an observation belongs to the first or second group (the third group being the baseline).

  • Numeric variables can be included in the model in non-linear forms, such as squared terms or interaction terms. It is crucial to remember that while the predictors can be non-linear, the model remains linear in the parameters and the error term.

  • For example, a model matrix \(\mathbf{X}\) can be constructed to account for non-linear effects of numeric regressors (\(u, v\)) and a factor regressor (\(g\)) with three levels. This matrix might include columns for:

    • Intercept (implicitly included in most regression software).

    • Dummy variables for factor levels (e.g., two columns for a three-level factor).

    • Numeric regressors (\(u, v\)).

    • Non-linear transformations of numeric regressors (e.g., \(u^2, v^2, u \cdot v\)).

  • In such a model matrix, the first few rows might represent observations from the first group, followed by rows from the second and third groups.

  • When working with factor regressors, it is essential to ensure that the model matrix \(\mathbf{X}\) has full column rank to guarantee the identifiability of the model parameters. Without full rank, the model parameters cannot be uniquely estimated from the data (a small construction sketch, including a rank check, follows this list).

  • In some formulations, particularly when focusing on deviations from a reference level for factors, the intercept column might be intentionally omitted in the model matrix. This depends on the chosen parameterization and software implementation.
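
  • To make this concrete, here is a minimal sketch in Python (NumPy) of building such a model matrix for hypothetical data; the names \(u\), \(v\), and \(g\) follow the notation above, and the specific values are made up purely for illustration.

    import numpy as np

    # Hypothetical data: two numeric regressors u, v and a three-level factor g.
    rng = np.random.default_rng(0)
    n = 9
    u = rng.normal(size=n)
    v = rng.normal(size=n)
    g = np.repeat(["A", "B", "C"], 3)                # three groups; "C" is the baseline

    # Dummy indicators for the first two levels (treatment coding).
    d1 = (g == "A").astype(float)
    d2 = (g == "B").astype(float)

    # Model matrix: intercept, dummies, numeric regressors, and non-linear terms.
    X = np.column_stack([np.ones(n), d1, d2, u, v, u**2, v**2, u * v])

    # Identifiability requires X to have full column rank.
    print(X.shape, np.linalg.matrix_rank(X) == X.shape[1])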

Multiple Linear Regression: Assumptions and Inference

Model Assumptions

  • Given data for a response variable \(Y\) and \(p\) regressors \(X_1, \dots, X_p\), the multiple linear regression model posits a linear relationship between the expected value of the response variable and the regressors. The model is mathematically expressed for the \(i\)-th observation as: \[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i \label{eq:mlr_model_2}\] where:

    • \(Y_i\) is the \(i\)-th observation of the response variable.

    • \(X_{ij}\) is the \(i\)-th observation of the \(j\)-th regressor, for \(j = 1, \dots, p\).

    • \(\beta_0, \beta_1, \dots, \beta_p\) are the regression coefficients, representing the unknown parameters to be estimated. \(\beta_0\) is the intercept, and \(\beta_j\) for \(j \geq 1\) represents the effect of a one-unit increase in \(X_j\) on the expected value of \(Y\), holding other regressors constant.

    • \(\varepsilon_i\) is the random error term for the \(i\)-th observation, representing the deviation of the observed value from the expected value.

  • The fundamental assumptions of the multiple linear regression model are centered around the error term \(\varepsilon_i\):

    1. Normality: The error terms are normally distributed, \(\varepsilon_i \sim N(0, \sigma^2)\). This assumption is crucial for hypothesis testing and constructing confidence intervals.

    2. Zero Mean: The expected value of the error term is zero, \(\mathbb{E}(\varepsilon_i) = 0\). This implies that the linear model is correctly specified in terms of the functional form for the mean response.

    3. Homoscedasticity (Constant Variance): The variance of the error terms is constant across all observations, \(\mathbb{V}(\varepsilon_i) = \sigma^2\). This means the spread of the residuals should be roughly constant across the range of predictor values.

    4. Independence: The error terms are independent of each other, \(\text{Cov}(\varepsilon_i, \varepsilon_k) = 0\) for \(i \neq k\). This assumption is particularly important for data collected over time or in clusters; violations can lead to biased standard error estimates.

  • Under these assumptions, conditional on the regressors \(X_{ij}\), the response variable \(Y_i\) is also normally distributed and independent of other responses, with a constant variance \(\sigma^2\). The expected value of \(Y_i\) is a linear combination of the covariates: \[\mathbb{E}(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} \label{eq:expected_response_2}\]

  • It is important to recognize that the linearity assumption for \(\mathbb{E}(Y_i)\) is often an approximation. In practice, relationships may be non-linear, and these assumptions may not perfectly hold. However, multiple linear regression often provides a useful and robust approximation, especially as a starting point for analysis. If diagnostics reveal violations of these assumptions, transformations of variables or more complex models may be considered.

Matrix Form of the Model

  • To streamline notation and facilitate computations, the multiple linear regression model is often expressed in matrix form. This representation is essential for understanding the underlying linear algebra and for efficient computation of model estimates and statistics. The matrix form of the model is: \[\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \label{eq:matrix_model_2}\] where:

    • \(\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}\) is an \(n \times 1\) column vector of the observed responses, where \(n\) is the number of observations.

    • \(\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}\) is the \(n \times (p + 1)\) model matrix or design matrix.

      • Each row of \(\mathbf{X}\) corresponds to an observation, and each column corresponds to a regressor.

      • The first column is a vector of ones, representing the intercept term.

      • The subsequent \(p\) columns contain the observed values of the \(p\) regressors.

      • For the model to be identifiable, the model matrix \(\mathbf{X}\) must have full column rank, which is \(p+1\). This requires \(n \geq p + 1\) (in practice \(n > p + 1\), so that the error variance can also be estimated) and that there is no perfect multicollinearity among the regressors.

    • \(\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}\) is a \((p + 1) \times 1\) column vector of the unknown regression coefficients.

    • \(\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}\) is an \(n \times 1\) column vector of the unobserved error terms.

  • Under the model assumptions, the response vector \(\mathbf{Y} = (Y_1, \dots, Y_n)^T\) follows a multivariate normal distribution: \(\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \sigma^2 \mathbf{I}_n)\), where:

    • \(\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta} = \mathbb{E}(\mathbf{Y})\) is the \(n \times 1\) mean vector, representing the expected values of the responses.

    • \(\sigma^2 \mathbf{I}_n\) is the \(n \times n\) variance-covariance matrix, where \(\sigma^2\) is the constant error variance and \(\mathbf{I}_n\) is the \(n \times n\) identity matrix, reflecting the independence and homoscedasticity assumptions.

    • Note on Notation: In matrix notation, random variables are typically represented by uppercase letters (e.g., \(\mathbf{Y}\), \(\boldsymbol{\varepsilon}\)), while observed values and fixed quantities are in lowercase (e.g., \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\mu}\), \(\sigma^2\)). However, in some contexts, particularly when parameters are treated as fixed but unknown, lowercase letters might also be used for parameters (e.g., \(\boldsymbol{\beta}\), \(\sigma^2\)).

  • The predictor variables that constitute the model matrix \(\mathbf{X}\) can be either metric (numeric) variables or factor variables. Metric variables are quantitative and continuous, while factor variables are categorical and represent groups or levels.

  • In these notes, we primarily focus on models with metric regressors. The incorporation of factor regressors will be discussed in detail in subsequent sections.

Inference

The regression coefficients are estimated by ordinary least squares (OLS): \[\widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

Distribution of \(\widehat{\boldsymbol{\beta}}\): Under the normality assumption, the OLS estimator \(\widehat{\boldsymbol{\beta}}\) is normally distributed: \[\widehat{\boldsymbol{\beta}} \sim N_{p+1}(\boldsymbol{\beta}, \mathbb{V}(\widehat{\boldsymbol{\beta}})) \label{eq:beta_distribution_2}\] with mean \(\boldsymbol{\beta}\) and variance-covariance matrix: \[\mathbb{V}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \label{eq:variance_beta_hat}\] This variance-covariance matrix is the basis for constructing confidence intervals and hypothesis tests for the regression coefficients.
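
As a minimal computational sketch (assuming generic NumPy arrays X for the model matrix and y for the response), the OLS estimate, the error-variance estimate, and the estimated variance-covariance matrix of \(\widehat{\boldsymbol{\beta}}\) can be computed as follows. Explicitly inverting \(\mathbf{X}^T\mathbf{X}\) is shown only for clarity; numerically, a QR decomposition is usually preferred.

    import numpy as np

    def ols_fit(X, y):
        """OLS estimates, residual variance and covariance of the coefficients.

        X is the n x (p+1) model matrix (first column of ones), y the response.
        """
        n, k = X.shape                               # k = p + 1
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y                 # (X'X)^{-1} X'y
        resid = y - X @ beta_hat
        sigma2_hat = resid @ resid / (n - k)         # unbiased estimate of sigma^2
        cov_beta = sigma2_hat * XtX_inv              # estimated V(beta_hat)
        se_beta = np.sqrt(np.diag(cov_beta))         # standard errors
        return beta_hat, sigma2_hat, cov_beta, se_beta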

Multiple Linear Regression: Diagnostics

Violation of Model Assumptions

  • The assumptions underlying multiple linear regression may not always be valid in practical applications. Violations can occur in both the systematic and random components of the model.

  • The assumption that the mean of \(Y_i\) is a linear combination of the \(x\)’s is often an approximation.

    • Significant deviations can include nonlinear effects of explanatory variables and the omission of relevant variables. The latter is particularly serious and requires careful consideration of the research question.

  • The assumption of independence may be violated in cases of data clustering or repeated measurements over time or space, leading to correlated errors.

  • The assumptions of constant variance (homoscedasticity) and normality of errors are frequently violated in practice. Transforming the response variable scale can sometimes make these assumptions more plausible.

  • The presence of outliers needs to be investigated as they can unduly influence the regression results.

Checking on the Residuals

  • Many assumption violations can be addressed, at least partially, if they are detected.

  • While detecting assumption failures can be challenging, regression diagnostics provide tools to identify potential problems and suggest their nature.

  • Regression diagnostics are a collection of methods designed to assess the validity of model assumptions.

  • Residual plots in multiple linear regression are interpreted similarly to those in simple linear regression. They help assess nonlinearity, non-constant variance, and departures from normality.

  • To check for nonlinearity in a multiple regression context, plotting residuals against individual predictors is often more informative than just plotting residuals against fitted values.

  • Algorithm 1 (Regression Diagnostics Check).

    Fit the multiple linear regression model using ordinary least squares (OLS) and obtain the residuals (\(\widehat{\varepsilon}_i\)) and fitted values (\(\widehat{y}_i\)). Then work through the following checks.

    Check for Nonlinearity and Heteroscedasticity:

    1. Plot residuals (\(\widehat{\varepsilon}_i\)) vs. fitted values (\(\widehat{y}_i\)).

    2. Examine for patterns (curves, non-constant spread) indicating nonlinearity or heteroscedasticity.

    3. Plot residuals (\(\widehat{\varepsilon}_i\)) vs. each predictor (\(x_{ij}\)).

    4. Examine each plot for patterns suggesting nonlinear effects of specific predictors.

    Check for Non-constant Variance (Heteroscedasticity):

    1. Create a Scale-Location plot (sqrt(|standardized residuals|) vs. fitted values).

    2. Look for trends in spread, indicating variance changes with fitted values.

    Check for Non-Normality:

    1. Create a normal Q-Q plot of the standardized residuals.

    2. Look for systematic departures from the straight reference line, which indicate non-normal errors.

    Interpret and Address Violations:

    1. Based on plots and diagnostic statistics, identify assumption violations.

    2. Consider transformations (variables, response), adding polynomial terms, or robust regression methods to address violations.

    3. Re-run diagnostics on refined model and iterate until assumptions are reasonably met.

    Complexity Analysis of Algorithm 1 (Regression Diagnostics Check):

    • Step 1 (OLS fitting): \(O(np^2 + p^3)\)

    • Step 2 (Residual calculation): \(O(n)\)

    • Step 3-6 (Diagnostic plots and statistics): \(O(n \cdot p)\) for residual plots, \(O(n \cdot p^2 + p^3)\) for influence statistics.

    • Step 7 (Interpretation and Refinement): Depends on complexity of refinement, but typically less than fitting.

    Overall Complexity: Dominated by OLS fitting and influence statistics calculation, approximately \(O(np^2 + p^3)\). A minimal computational sketch of the main diagnostic quantities follows.
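
  • The following is a minimal sketch in Python of the quantities behind the residuals-vs-fitted and Scale-Location checks, assuming NumPy arrays X (the model matrix, including the intercept column) and y for the response; the leverages \(h_{ii}\) used to standardize the residuals are discussed in the next subsection.

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumes X (n x (p+1) model matrix) and y (response) are NumPy arrays.
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted
    n, k = X.shape
    sigma_hat = np.sqrt(resid @ resid / (n - k))

    # Leverages h_ii (diagonal of the hat matrix; see the next subsection).
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    r = resid / (sigma_hat * np.sqrt(1 - h))          # standardized residuals

    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    axes[0].scatter(fitted, resid)                    # residuals vs. fitted values
    axes[0].axhline(0.0, linestyle="--")
    axes[0].set(xlabel="fitted values", ylabel="residuals")
    axes[1].scatter(fitted, np.sqrt(np.abs(r)))       # Scale-Location plot
    axes[1].set(xlabel="fitted values", ylabel="sqrt(|standardized residuals|)")
    plt.show()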

Outliers: Leverage and Influence

  • Identifying outliers becomes more complex with multiple explanatory variables. Multiple influential outliers can mask each other’s effects. Outlier analysis in multiple regression builds upon concepts from simple linear regression.

  • Leverage and influence are key concepts in outlier diagnostics.

  • If the \(i\)-th response \(y_i\) is perturbed by \(\Delta_i\) (while keeping other \(y\)-values unchanged), the fitted value \(\widehat{y}_i\) changes to \(\widehat{y}_i + h_{ii} \Delta_i\). The value \(h_{ii}\) is the leverage of the \(i\)-th observation.

  • The leverage values \(h_{ii}\) are the diagonal elements of the hat matrix \(\mathbf{H} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\), where \(\widehat{\mathbf{y}} = \mathbf{H} \mathbf{y}\).

  • High leverage values indicate observations that are distant from the centroid of the predictor variables.

  • In a model with \(p+1\) coefficients (including intercept), \(\sum_{i=1}^n h_{ii} = p+1\). As a rule of thumb, leverage values \(h_{ii}\) larger than \(2(p+1)/n\) or \(3(p+1)/n\) are considered high.

  • Influential points are those that, if removed, would substantially alter the regression results (fitted values, coefficient estimates). Influence is a combination of a point’s residual size and its leverage.

  • Cook’s distance is a common measure of influence. Using standardized residuals: \[r_i = \frac{y_i - \widehat{y}_i}{\widehat{\sigma} \sqrt{1 - h_{ii}}} \label{eq:standardized_residuals_2}\] Cook’s distance for the \(i\)-th observation is: \[D_i = \frac{1}{p+1} r_i^2 \frac{h_{ii}}{1 - h_{ii}} \label{eq:cooks_distance_2}\] It quantifies the change in model estimates when the \(i\)-th observation is removed.

  • Cook’s distance values \(D_i > 0.5\) (or even \(D_i > 1\)) are often considered suspicious, suggesting potentially influential points that significantly affect coefficient estimates and standard errors.

  • The effect of each observation on the coefficient estimates \(\widehat{\boldsymbol{\beta}}\) can be further evaluated using other influence measures. A short computational sketch of leverage and Cook’s distance is given below.
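
  • As a minimal sketch (again assuming NumPy arrays X and y as in the earlier diagnostics sketch), the leverages and Cook’s distances defined above can be computed and screened against the rule-of-thumb thresholds as follows.

    import numpy as np

    # Assumes X (n x (p+1) model matrix) and y are NumPy arrays.
    n, k = X.shape
    p = k - 1
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - k))

    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages h_ii
    r = resid / (sigma_hat * np.sqrt(1 - h))          # standardized residuals
    cooks_d = r**2 * h / ((p + 1) * (1 - h))          # Cook's distance D_i

    print(np.where(h > 2 * (p + 1) / n)[0])           # rule-of-thumb high leverage
    print(np.where(cooks_d > 0.5)[0])                 # potentially influential points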

Example: Hill Races

  • For a dataset on hill races in Northern Ireland, diagnostic plots of Cook’s distances and leverages are examined. These plots do not reveal serious problems, except for one point with a Cook’s distance slightly above 0.5.

  • Based on these diagnostics and the nature of the data, a log-log model is considered: \[\log(\text{time}_i) = \beta_0 + \beta_1 \log(\text{dist}_i) + \beta_2 \log(\text{climb}_i) + \varepsilon_i \label{eq:hill_races_log_model_2}\]

  • This is equivalent to a power relationship in the original scale: \[\text{time} = e^{\beta_0} \cdot \text{dist}^{\beta_1} \cdot \text{climb}^{\beta_2} \label{eq:hill_races_power_model_2}\]

  • The estimated coefficients are \(\widehat{\beta}_0 = -4.96\) (\(SE(\widehat{\beta}_0) = 0.273\)), \(\widehat{\beta}_1 = 0.68\) (\(SE(\widehat{\beta}_1) = 0.055\)), \(\widehat{\beta}_2 = 0.47\) (\(SE(\widehat{\beta}_2) = 0.045\)). The residual standard error is \(\widehat{\sigma} = 0.076\).

  • The low \(p\)-values for \(\log(\text{dist})\) and \(\log(\text{climb})\) confirm their significant predictive power for \(\log(\text{time})\). The global significance of the regression coefficients is also confirmed by the F-test.

  • The multiple \(R^2\) and adjusted multiple \(R^2\) are \(0.983\) and \(0.981\), respectively, indicating a very strong fit.

  • For model interpretation, it is important to recognize that different model formulations can serve different explanatory purposes.

  • In the original scale, the fitted deterministic part of the model is: \[\text{time} = 0.007 \cdot \text{dist}^{0.68} \cdot \text{climb}^{0.47} \label{eq:hill_races_fitted_power_2}\]

  • Surprisingly, the model suggests that for a fixed climb, a 1% increase in distance leads to only a 0.68% increase in time.

  • This implies that for a fixed climb, the time taken for the second mile is less than for the first mile, which seems counterintuitive. For example, with climb=1500, the times are approximately: \[\begin{aligned} 0.007 \cdot 1^{0.68} \cdot 1500^{0.47} &\approx 0.218 \\ 0.007 \cdot 2^{0.68} \cdot 1500^{0.47} &\approx 0.349 \end{aligned}\]

  • This apparent paradox arises from the condition keeping climb constant. Shorter races tend to be steeper than longer races, so holding climb constant while increasing distance is not realistic in this context.

  • If \(\log(\text{time})\) is regressed on \(\log(\text{dist})\) and \(\log(\text{climb}/\text{dist})\) instead of \(\log(\text{climb})\), the coefficient for \(\log(\text{dist})\) becomes greater than 1, which is more reassuring: \[\text{time} = 0.007 \cdot \text{dist}^{1.15} \cdot (\text{climb}/\text{dist})^{0.47} \label{eq:hill_races_alternative_power_2}\]

  • Both models provide the same fit to the data, as they are mathematically equivalent parameterizations (the algebra behind this equivalence is shown below). The choice between them depends on interpretability and the specific application goals.
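
  • A one-line check of this equivalence, using the fitted exponents reported above: \[0.007 \cdot \text{dist}^{1.15} \cdot (\text{climb}/\text{dist})^{0.47} = 0.007 \cdot \text{dist}^{1.15 - 0.47} \cdot \text{climb}^{0.47} = 0.007 \cdot \text{dist}^{0.68} \cdot \text{climb}^{0.47}\] so the exponent of \(\text{dist}\) in the alternative formulation is simply the sum of the original exponents, \(0.68 + 0.47 = 1.15\).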

Centering the Covariates

  • Centering covariates involves subtracting their sample means before including them in the regression model. For example, the model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \label{eq:mlr_model_uncentered_2}\] can be rewritten as: \[y_i = \alpha + \beta_1 (x_{i1} - \overline{x}_1) + \beta_2 (x_{i2} - \overline{x}_2) + \varepsilon_i \label{eq:mlr_model_centered_2}\] where \(\overline{x}_1\) and \(\overline{x}_2\) are the sample means of \(x_{1}\) and \(x_{2}\) respectively, and \(\alpha = \beta_0 + \beta_1 \overline{x}_1 + \beta_2 \overline{x}_2\).

  • Centering aids in model interpretability in two ways:

    • The estimated intercept \(\widehat{\alpha} = \overline{y}\) becomes the fitted value when both covariates are at their sample means.

    • Observed values can be decomposed as: \[y_i = \overline{y} + \widehat{\beta}_1 (x_{i1} - \overline{x}_1) + \widehat{\beta}_2 (x_{i2} - \overline{x}_2) + \widehat{\varepsilon}_i = \overline{y} + t_{i1} + t_{i2} + \widehat{\varepsilon}_i \label{eq:centered_decomposition_2}\] where \(t_{i1} = \widehat{\beta}_1 (x_{i1} - \overline{x}_1)\) and \(t_{i2} = \widehat{\beta}_2 (x_{i2} - \overline{x}_2)\) are zero-sum contributions from each covariate.

  • The terms \(t_{i1}\) and \(t_{i2}\) are used in partial residual plots.

Partial Residual Plot

  • A partial residual plot for covariate \(x_j\), given all other covariates, is a scatterplot of \(t_{ij} + \widehat{\varepsilon}_i\) versus \(x_{ij}\). It visualizes the part of the response not explained by covariates other than \(x_j\).

  • For example, with two covariates, the partial residual plot for \(x_1\) graphs \(t_{i1} + \widehat{\varepsilon}_i = y_i - \overline{y} - t_{i2}\) against \(x_{i1}\). It helps assess whether the relationship between \(x_1\) and the part of the response not explained by \(x_2\) is approximately linear. A minimal computational sketch follows.
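
  • The sketch below (assuming hypothetical NumPy arrays y, x1, and x2 of equal length) fits the centered model, forms the zero-sum contributions \(t_{i1}\) and \(t_{i2}\), and draws the partial residual plot for \(x_1\).

    import numpy as np
    import matplotlib.pyplot as plt

    # Assumes y, x1, x2 are NumPy arrays of equal length (hypothetical data).
    Xc = np.column_stack([np.ones(len(y)), x1 - x1.mean(), x2 - x2.mean()])
    coef = np.linalg.lstsq(Xc, y, rcond=None)[0]
    alpha_hat, b1, b2 = coef                          # alpha_hat equals y.mean()
    resid = y - Xc @ coef

    t1 = b1 * (x1 - x1.mean())                        # zero-sum contribution of x1
    t2 = b2 * (x2 - x2.mean())                        # zero-sum contribution of x2

    # Partial residual plot for x1: t1 + residuals (= y - ybar - t2) against x1.
    plt.scatter(x1, t1 + resid)
    plt.xlabel("x1")
    plt.ylabel("partial residual for x1")
    plt.show()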

Quadratic Effect of a Covariate

  • A covariate may exhibit a nonlinear effect on the mean response, which might be suggested by a partial residual plot.

  • In such cases, including polynomial terms of the covariate in the model can capture the nonlinearity.

  • For a single covariate, a model with a quadratic effect of \(x\) specifies the mean response as: \[\mathbb{E}(Y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 \label{eq:quadratic_model_2}\]

  • This model is no longer a simple linear regression model but is a special case of multiple linear regression with two covariates: \(x_{i1} = x_i\) and \(x_{i2} = x_i^2\).

  • Polynomial regression models generalize the multiple linear regression model by including squared, cross-product, and higher-order terms of the original predictor variables, allowing for more flexible modeling of nonlinear relationships; a minimal sketch of the quadratic case follows.
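
  • As a minimal sketch (assuming NumPy arrays x and y for a single covariate and the response), the quadratic model above is fitted by adding a squared column to the model matrix and applying OLS as usual.

    import numpy as np

    # Assumes x and y are NumPy arrays (hypothetical data for a single covariate).
    X = np.column_stack([np.ones(len(x)), x, x**2])   # columns: 1, x, x^2
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # (beta0_hat, beta1_hat, beta2_hat)
    fitted = X @ beta_hat                             # estimated quadratic mean response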

Example: Cars

  • For the ‘cars’ dataset, preliminary statistical analysis, supported by physical considerations, suggests that the distance taken to stop should be a non-linear function of the speed. Then, a plausible model is: \[\text{distance}_i = \beta_0 + \beta_1 \text{speed}_i + \beta_2 \text{speed}_i^2 + \varepsilon_i \label{eq:cars_quadratic_model_2}\]

  • Fitting this quadratic model to the ‘cars’ data yields estimated coefficients: \(\widehat{\beta}_0 = 2.47\) (\(SE(\widehat{\beta}_0) = 14.82\)), \(\widehat{\beta}_1 = 0.91\) (\(SE(\widehat{\beta}_1) = 2.03\)), and \(\widehat{\beta}_2 = 0.10\) (\(SE(\widehat{\beta}_2) = 0.07\)).

  • Despite the high \(p\)-values for the individual coefficients, the \(p\)-value for the global F-test is \(5.85 \times 10^{-12}\), indicating strong evidence that the quadratic model is significantly better than a constant (intercept-only) model. A sketch of this computation is given after this list.

  • The \(p\)-values for individual coefficients test whether each coefficient is zero given that other terms are already in the model. They should not be interpreted in isolation when assessing the overall significance of the quadratic effect.

  • Diagnostic plots may reveal issues such as non-constant variance and non-normality of residuals, suggesting the need for further model refinement, such as weighted least squares regression.
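
  • To make the distinction between the individual t-tests and the global F-test concrete, the overall F statistic for the quadratic fit can be computed as sketched below; this assumes NumPy arrays speed and distance (e.g. read from the ‘cars’ data, which are not bundled here).

    import numpy as np
    from scipy import stats

    # Assumes speed and distance are NumPy arrays (e.g. read from the 'cars' data).
    X = np.column_stack([np.ones(len(speed)), speed, speed**2])
    beta_hat = np.linalg.lstsq(X, distance, rcond=None)[0]
    resid = distance - X @ beta_hat

    n, k = X.shape                                    # k = p + 1 = 3
    p = k - 1
    rss = resid @ resid                               # residual sum of squares
    tss = ((distance - distance.mean())**2).sum()     # total sum of squares

    f_stat = ((tss - rss) / p) / (rss / (n - k))      # global F statistic
    p_value = stats.f.sf(f_stat, p, n - k)            # upper tail of F(p, n - p - 1)
    print(f_stat, p_value)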

Conclusion

  • Multiple linear regression extends simple linear regression to accommodate multiple predictor variables, offering a powerful tool for modeling relationships between a response variable and several covariates.

  • Key assumptions of multiple linear regression include linearity, independence of errors, homoscedasticity, and normality of errors. Diagnostic tools are essential to check the validity of these assumptions and to identify potential model inadequacies.

  • Inference in multiple linear regression involves estimating model parameters using least squares, conducting hypothesis tests for individual and global significance of predictors, and constructing confidence and prediction intervals.

  • Centering covariates can improve interpretability by making the intercept represent the expected response at the average covariate values and facilitating the creation of partial residual plots.

  • Polynomial regression extends the linear model to capture nonlinear effects of covariates, increasing model flexibility.

  • Model selection techniques, such as F-tests, AIC, BIC, and stepwise methods, help in choosing the most appropriate model among a set of candidate models, balancing model fit and complexity.

  • Multicollinearity, the presence of high correlation among predictors, can inflate variance of coefficient estimates and complicate interpretation. Variance inflation factor (VIF) is a useful measure to detect multicollinearity. Remedies include removing redundant predictors or using dimensionality reduction techniques.

  • Regression models can be extended to handle factor variables using dummy coding, allowing for the analysis of variance and covariance models within the regression framework.

  • Diagnostic checks and model validation are crucial steps in ensuring the reliability and validity of multiple linear regression models in practical applications.

Summary

This lecture introduces the fundamental concepts of multiple linear regression, extending from simple linear regression to models with multiple explanatory variables. We explore the assumptions underlying multiple linear regression, methods for parameter estimation and inference, and diagnostic techniques for model validation. The lecture covers essential topics such as model assumptions, matrix representation, least squares estimation, hypothesis testing, confidence and prediction intervals, and regression diagnostics including residual analysis, leverage, and influence. Examples are used to illustrate the application of multiple linear regression in real-world scenarios, including book weight prediction, hill race analysis, and car stopping distance modeling. The lecture also touches upon model selection criteria and methods for handling multicollinearity, providing a comprehensive overview of multiple linear regression methodology and its practical implications.

Final Remarks

In summary, this lecture has provided a detailed exploration of multiple linear regression, a versatile and fundamental statistical tool. We have covered the theoretical underpinnings, including model assumptions and inference procedures, and practical aspects such as diagnostics and model selection. Key takeaways include the importance of checking model assumptions using residual plots and diagnostic statistics, the interpretation of regression coefficients in the context of multiple predictors, and strategies for model refinement and selection. We also discussed extensions to handle nonlinear effects and factor variables, broadening the applicability of regression models. Understanding these concepts is crucial for effectively applying multiple linear regression in data analysis and for building a solid foundation for more advanced statistical modeling techniques, such as logistic regression which will be discussed in the subsequent lectures. Further study should focus on practicing model building, diagnostics, and interpretation with real datasets to solidify these concepts. What are the implications of relaxing the linearity assumption? How can we extend these models to handle non-constant error variance more robustly? These questions could guide further exploration and study in this area.

Mathematical Statements in Tcolorboxes

Section 2.1 Model Assumptions

Definition 1 (Multiple Linear Regression Model). Given data for a response variable \(Y\) and \(p\) regressors \(X_1, \dots, X_p\), the multiple linear regression model for the \(i\)-th observation is: \[Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i \label{eq:mlr_model_2_box}\] where:

  • \(Y_i\) is the \(i\)-th observation of the response variable.

  • \(X_{ij}\) is the \(i\)-th observation of the \(j\)-th regressor, for \(j = 1, \dots, p\).

  • \(\beta_0, \beta_1, \dots, \beta_p\) are the regression coefficients.

  • \(\varepsilon_i\) is the random error term for the \(i\)-th observation.

Definition 2 (Expected Value of Response in MLR). Under the assumptions of multiple linear regression, the expected value of \(Y_i\) is a linear combination of the covariates: \[\mathbb{E}(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} \label{eq:expected_response_2_box}\]

Section 2.2 Matrix Form of the Model

Definition 3 (Matrix Form of Multiple Linear Regression Model). The matrix form of the multiple linear regression model is: \[\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \label{eq:matrix_model_2_box}\] where:

  • \(\mathbf{y}\) is the vector of observed responses.

  • \(\mathbf{X}\) is the model matrix.

  • \(\boldsymbol{\beta}\) is the vector of regression coefficients.

  • \(\boldsymbol{\varepsilon}\) is the vector of error terms.

Theorem 1 (Distribution of Response Vector in MLR). Under the model assumptions, the response vector \(\mathbf{Y}\) follows a multivariate normal distribution: \[\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \sigma^2 \mathbf{I}_n) \label{eq:distribution_response_vector_box}\] where \(\boldsymbol{\mu} = \mathbf{X}\boldsymbol{\beta} = \mathbb{E}(\mathbf{Y})\) and \(\sigma^2 \mathbf{I}_n\) is the variance-covariance matrix.

Description: This theorem describes the distribution of the response vector in multiple linear regression under the standard assumptions, stating it follows a multivariate normal distribution with a mean vector determined by the model matrix and coefficients, and a variance-covariance matrix proportional to the identity matrix.

Section 2.3 Inference

Definition 4 (OLS Estimator). The Ordinary Least Squares (OLS) estimator \(\widehat{\boldsymbol{\beta}}\) is given by: \[\widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \label{eq:ols_estimator_2_box}\]

Theorem 2 (Distribution of OLS Estimator). Under the normality assumption, the OLS estimator \(\widehat{\boldsymbol{\beta}}\) is normally distributed: \[\widehat{\boldsymbol{\beta}} \sim N_{p+1}(\boldsymbol{\beta}, \mathbb{V}(\widehat{\boldsymbol{\beta}})) \label{eq:beta_distribution_2_box}\] with variance-covariance matrix: \[\mathbb{V}(\widehat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \label{eq:variance_beta_hat_box}\]

Description: This theorem describes the statistical distribution of the OLS estimator, stating it is normally distributed around the true coefficient vector with a specific variance-covariance matrix, which is crucial for inference.

Section 3.3 Outliers: Leverage and Influence

Definition 5 (Standardized Residuals). Standardized residuals \(r_i\) are defined as: \[r_i = \frac{y_i - \widehat{y}_i}{\widehat{\sigma} \sqrt{1 - h_{ii}}} \label{eq:standardized_residuals_2_box}\]

Definition 6 (Cook’s Distance). Cook’s distance for the \(i\)-th observation is: \[D_i = \frac{1}{p+1} r_i^2 \frac{h_{ii}}{1 - h_{ii}} \label{eq:cooks_distance_2_box}\]

Section 3.4 Example: Hill Races

Example 1 (Log-Log Model for Hill Races). A log-log model for hill races is given by: \[\log(\text{time}_i) = \beta_0 + \beta_1 \log(\text{dist}_i) + \beta_2 \log(\text{climb}_i) + \varepsilon_i \label{eq:hill_races_log_model_2_box}\] which is equivalent to a power relationship: \[\text{time} = e^{\beta_0} \cdot \text{dist}^{\beta_1} \cdot \text{climb}^{\beta_2} \label{eq:hill_races_power_model_2_box}\]

Example 2 (Alternative Power Model for Hill Races). An alternative power model for hill races, using climb per distance, is: \[\text{time} = 0.007 \cdot \text{dist}^{1.15} \cdot (\text{climb}/\text{dist})^{0.47} \label{eq:hill_races_alternative_power_2_box}\]

Section 3.5 Centering the Covariates

Example 3 (Uncentered MLR Model). An uncentered multiple linear regression model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \label{eq:mlr_model_uncentered_2_box}\]

Example 4 (Centered MLR Model). A centered multiple linear regression model: \[y_i = \alpha + \beta_1 (x_{i1} - \overline{x}_1) + \beta_2 (x_{i2} - \overline{x}_2) + \varepsilon_i \label{eq:mlr_model_centered_2_box}\] where \(\alpha = \beta_0 + \beta_1 \overline{x}_1 + \beta_2 \overline{x}_2\).

Example 5 (Decomposition of Observed Values in Centered Model). Decomposition of observed values in a centered model: \[y_i = \overline{y} + \widehat{\beta}_1 (x_{i1} - \overline{x}_1) + \widehat{\beta}_2 (x_{i2} - \overline{x}_2) + \widehat{\varepsilon}_i = \overline{y} + t_{i1} + t_{i2} + \widehat{\varepsilon}_i \label{eq:centered_decomposition_2_box}\]

Section 3.7 Quadratic Effect of a Covariate

Definition 7 (Quadratic Effect Model). A model with a quadratic effect of \(x\) is given by: \[\mathbb{E}(Y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 \label{eq:quadratic_model_2_box}\]

Section 3.8 Example: Cars

Example 6 (Quadratic Model for Cars Dataset). A quadratic model for the ‘cars’ dataset: \[\text{distance}_i = \beta_0 + \beta_1 \text{speed}_i + \beta_2 \text{speed}_i^2 + \varepsilon_i \label{eq:cars_quadratic_model_2_box}\]

Algorithm 1 in Tcolorbox

Algorithm 1: Regression Diagnostics Check

Fit the multiple linear regression model using ordinary least squares (OLS) and obtain the residuals (\(\widehat{\varepsilon}_i\)) and fitted values (\(\widehat{y}_i\)). Then work through the following checks.

Check for Nonlinearity and Heteroscedasticity:

  1. Plot residuals (\(\widehat{\varepsilon}_i\)) vs. fitted values (\(\widehat{y}_i\)).

  2. Examine for patterns (curves, non-constant spread) indicating nonlinearity or heteroscedasticity.

  3. Plot residuals (\(\widehat{\varepsilon}_i\)) vs. each predictor (\(x_{ij}\)).

  4. Examine each plot for patterns suggesting nonlinear effects of specific predictors.

Check for Non-constant Variance (Heteroscedasticity):

  1. Create a Scale-Location plot (sqrt(|standardized residuals|) vs. fitted values).

  2. Look for trends in spread, indicating variance changes with fitted values.

Check for Non-Normality:

  1. Create a normal Q-Q plot of the standardized residuals.

  2. Look for systematic departures from the straight reference line, which indicate non-normal errors.

Interpret and Address Violations:

  1. Based on plots and diagnostic statistics, identify assumption violations.

  2. Consider transformations (variables, response), adding polynomial terms, or robust regression methods to address violations.

  3. Re-run diagnostics on refined model and iterate until assumptions are reasonably met.

Complexity Analysis of Algorithm 1 (Regression Diagnostics Check):

  • Step 1 (OLS fitting): \(O(np^2 + p^3)\)

  • Step 2 (Residual calculation): \(O(n)\)

  • Step 3-6 (Diagnostic plots and statistics): \(O(n \cdot p)\) for residual plots, \(O(n \cdot p^2 + p^3)\) for influence statistics.

  • Step 7 (Interpretation and Refinement): Depends on complexity of refinement, but typically less than fitting.

Overall Complexity: Dominated by OLS fitting and influence statistics calculation, approximately \(O(np^2 + p^3)\).
