Applied Statistics and Data Analysis: Statistical Models

Author

Your Name

Published

February 12, 2025

Introduction

In the realm of applied statistics and data analysis, statistical models serve as the cornerstone for extracting meaningful insights from data. This lesson provides a review of fundamental inference concepts, focusing primarily on statistical models. We will explore the basics of statistical modeling, delve into probability distributions and random variables, examine common statistical models, discuss simulation techniques, and finally, address crucial model assumptions.

This lecture is based mainly on Chapter 3 of the textbook Statistical models and Chapter 2 of Core Statistics by S.N. Wood. Understanding these concepts is essential for building a solid foundation in statistical analysis and its applications across various disciplines.

Lecture Overview

This lecture is structured to provide a comprehensive understanding of statistical models, covering the following key areas:

  • Basics of Statistical Models: We begin by defining statistical models and discussing their purpose in inferential statistics. This section will lay the groundwork for understanding how models help us learn from data by representing the data generating process.

  • Probability Distributions and Random Variables: A review of essential probabilistic concepts, including random variables, distribution functions, and the distinction between discrete and continuous variables. We will also cover measures such as mean, variance, and quantiles, which are crucial for characterizing distributions.

  • Basic Statistical Models: Common Distributions: We will explore several fundamental probability distributions that are frequently used as building blocks in statistical models. These include discrete distributions such as discrete uniform, Bernoulli, binomial, and Poisson, as well as continuous distributions (to be covered in subsequent lectures).

  • Simulation of Random Samples: An introduction to the concept of simulation in statistical analysis. We will discuss how simulation techniques are used to generate random samples from specified distributions and their role in understanding model behavior and statistical procedures.

  • Model Assumptions: Finally, we will address the critical role of model assumptions in statistical modeling. We will discuss common assumptions such as independence and normality, and the importance of validating these assumptions to ensure the reliability of our statistical inferences.

This structured approach will equip you with the foundational knowledge necessary to understand and apply statistical models effectively in various data analysis scenarios.

Basics of Statistical Models

What are Statistical Models?

Inferential statistics is fundamentally about learning from data. More specifically, it is concerned with extracting information about the underlying "system" that generated the observed data, or about the population from which a sample has been drawn.

A key characteristic of most data is the presence of random variability. If we were to repeat the data gathering process multiple times, we would invariably obtain somewhat different datasets on each occasion. This inherent variability necessitates the use of statistical models.

In many physical sciences, deterministic models might suffice when variability is minimal. Deterministic models are characterized by predicting outcomes with certainty, given a set of inputs and parameters. For instance, in classical physics, predicting the trajectory of a projectile under gravity might be accurately done using deterministic equations, assuming minimal external disturbances. However, in fields like social sciences, economics, biology, and even in many complex physical systems, variability is a significant factor. Consider measuring the height of individuals; even under controlled conditions, heights will vary due to genetic and environmental factors. To properly analyze data in such contexts, we must employ models that explicitly incorporate this variability. These are known as statistical models.

Statistical models involve families of probability distributions. The aim of a statistical model is to provide an adequate and probabilistic description of the data generating system or the phenomenon of interest.

A statistical model does not predict a single outcome but rather a range of possible outcomes, each with an associated probability. This probabilistic nature allows us to quantify uncertainty and make inferences about the data generating process in the presence of variability.

Purpose of Statistical Models

  • If all model elements (e.g., model parameters) were known, an adequate statistical model could generate data that closely resembles the observed data, including reproducing its inherent variability through replication. For example, if we have a model for plant growth that includes parameters for sunlight, water, and nutrient levels, and we know these parameters, we could simulate plant growth data that reflects the variability observed in real-world experiments.

  • The primary purpose of statistical inference is to reverse this process. We use a statistical model to infer the values of the model unknowns (parameters) that are consistent with the observed data. Instead of knowing the parameters and generating data, we observe data and use the statistical model to estimate the parameters that best explain the data. This is the core of statistical inference – learning about the unknowns from the observed data using the framework of a statistical model.

  • The choice of statistical models for specific datasets is often guided by:

    • Previous experience with similar data. If similar types of data have been analyzed before, it’s often reasonable to start with models that have been successful in those contexts. For instance, in ecological studies, count data is often modeled using Poisson distributions based on prior experience.

    • Subject area knowledge. Understanding the underlying mechanisms of the phenomenon being studied is crucial. For example, in epidemiological studies of disease spread, models might incorporate knowledge about transmission routes and incubation periods.

    • Careful use of Exploratory Data Analysis (EDA) findings. EDA techniques, such as histograms, scatter plots, and summary statistics, help reveal patterns and characteristics of the data that can guide model selection. For example, observing a linear trend in a scatter plot might suggest a linear regression model.

  • Statistical models often combine two key components:

    • A deterministic component: This part captures the systematic or predictable aspects of the phenomenon. It represents the underlying signal or pattern we are trying to model. In a linear regression model, this is the linear predictor part (e.g., \(\alpha + \beta \cdot \text{weight}\)).

    • A random component: This part accounts for the inherent unpredictability and variability in the data. It represents the noise or error that obscures the deterministic signal. In a linear regression model, this is the error term (\(\varepsilon\)).

  • The random component is frequently referred to as noise or error. The deterministic part is sometimes called the signal. It is important to note that "noise" or "error" in this context does not necessarily imply something is wrong, but rather reflects the natural variability in the data. This variability can arise from various sources, such as measurement errors, unmeasured factors, or intrinsic randomness in the system.

Example: Temperatures

Consider a dataset y consisting of a 60-year record of mean annual temperatures (°F) in New Haven, Connecticut, from 1912 to 1971. This dataset provides a time series of annual average temperatures, capturing the year-to-year variations and potential long-term trends in temperature.

Numerical summaries of this data are: \(\bar{y} = 51.16\), \(y_{0.5} = 51.20\), \(s^2 = 1.60\), \(\gamma = -0.07\), \(\beta = 3.38\). Here, \(\bar{y}\) is the sample mean, \(y_{0.5}\) is the median, \(s^2\) is the sample variance, \(\gamma\) is the skewness, and \(\beta\) is the kurtosis. These summaries provide a quantitative description of the data’s central tendency, spread, and shape.

Figure 1: Histogram of mean annual temperatures with superimposed normal density curve.

Graphical and numerical summaries suggest a simple statistical model where we consider y as independent observations from a normal distribution. The histogram in Figure 1 visually compares the distribution of the temperature data to a normal distribution curve. The bell shape of the histogram and its rough alignment with the superimposed normal density curve support the idea that a normal distribution might be a reasonable model. However, it is also noted that the tails of the empirical distribution appear to be heavier than those of a typical "bell curve". This might suggest that while the normal distribution is a reasonable first approximation, it may not perfectly capture all aspects of the data, particularly in the extremes. More complex models, such as t-distributions which have heavier tails, could be considered for a better fit if capturing tail behavior is critical.
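As a sketch of what "an adequate model could generate data resembling the observed data" means, we can simulate 60 i.i.d. draws from the fitted normal model, using the mean and variance reported above as the (assumed known) parameters, and compare the resulting sample summaries with the observed ones. The simulated values here are illustrative, not the actual New Haven record.

```python
import random
import statistics

# Illustration only: simulate 60 i.i.d. draws from the fitted normal model
# N(mu = 51.16, sigma^2 = 1.60); the parameter values come from the text.
random.seed(42)
mu, sigma = 51.16, 1.60 ** 0.5
y = [random.gauss(mu, sigma) for _ in range(60)]

print(round(statistics.mean(y), 2))      # sample mean, close to 51.16
print(round(statistics.median(y), 2))    # sample median, close to the mean
print(round(statistics.variance(y), 2))  # sample variance, close to 1.60
```

Re-running without the fixed seed gives a somewhat different dataset each time, which is exactly the replication variability the model is meant to reproduce.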

Example: Roller Data

Consider data from an experiment where different weights (in tons, denoted by weight) of a roller were rolled over different parts of a lawn, and the resulting depression (in mm, denoted by depression) was measured. This experiment aims to understand the relationship between roller weight and lawn depression, which could be relevant in lawn care or construction contexts.

Figure 2: Scatterplot of roller data with least squares line.

The graphical summary, a scatterplot of the data with the least squares line superimposed, suggests a linear relationship between the weight of the roller and the depression it causes. Figure 2 displays the scatterplot, where each point represents an observation of (weight, depression). The upward trend of the points and the fitted least squares line visually indicate a positive linear association: as roller weight increases, depression tends to increase as well.

Linear Regression Model

The assumed statistical model for this data has a linear form for the deterministic part and an additive error term:

\[\text{depression} = \alpha + \beta \cdot \text{weight} + \varepsilon\]

This is known as a linear regression model. Here, \(\alpha\) and \(\beta\) are model parameters, which are constants that must be estimated using the data.

  • \(\alpha\) (intercept): Represents the expected depression when the roller weight is zero. In practice, this might not have a direct physical interpretation within the realistic range of roller weights but is a necessary parameter for defining the line.

  • \(\beta\) (slope): Represents the expected change in depression for a one-unit increase in roller weight. This is the key parameter quantifying the effect of roller weight on depression.

  • \(\varepsilon\) (error term): Represents the random variability in depression that is not explained by the roller weight. This term accounts for all other factors influencing depression.

To formalize this for individual observations, let \((x_1, y_1), \dots, (x_n, y_n)\) represent the observed data points, where \(x_i\) is the weight of the roller and \(y_i\) is the depression for the \(i\)-th observation. The model can be written as:

\[y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \dots, n\] This equation states that for each observation \(i\), the observed depression \(y_i\) is the sum of a deterministic part (\(\alpha + \beta x_i\)) and a random error term (\(\varepsilon_i\)).

Using the least squares method, we can estimate the parameters \(\alpha\) and \(\beta\). The least squares method finds the values of \(\alpha\) and \(\beta\) that minimize the sum of the squared differences between the observed depressions \(y_i\) and the values predicted by the deterministic part of the model (\(\alpha + \beta x_i\)). For this example, the parameter estimates are given as \(\hat{\alpha} = -2.087\) and \(\hat{\beta} = 2.667\). These are the values that define the least squares line shown in Figure 2.

The fitted values \(\hat{y}_i\) are defined by:

\[\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i, \quad i = 1, \dots, n\] Fitted values are the model’s predictions for depression for each observed roller weight \(x_i\), using the estimated parameters \(\hat{\alpha}\) and \(\hat{\beta}\). They represent the deterministic part of the model’s output.

The observed residuals \(\hat{\varepsilon}_i\) are given by:

\[\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - \hat{\alpha} - \hat{\beta} x_i, \quad i = 1, \dots, n\] Observed residuals are the differences between the actual observed depressions \(y_i\) and the fitted values \(\hat{y}_i\). They represent the part of the depression that is not explained by the linear relationship with roller weight, and are estimates of the error terms \(\varepsilon_i\). Analyzing residuals is crucial for checking the adequacy of the model assumptions.

Interpretation and Focus of Interest

The focus of interest in statistical modeling can often be cast in terms of:

  • Interpretation of model parameters: For example, in the roller data, \(\beta\) is a crucial parameter representing the rate of increase of depression with increasing roller weight. The estimated value \(\hat{\beta} = 2.667\) suggests that for each additional ton of roller weight, the lawn depression is expected to increase by approximately 2.667 mm, on average. This interpretation provides a quantifiable measure of the effect of roller weight.

  • Prediction: Predictions are given by the fitted values \(\hat{y}_i\). For the observed roller weights \(x_i\), the fitted values \(\hat{y}_i\) are the model’s predictions of depression. We can also predict the depression for out-of-sample roller weights, although this requires careful consideration. Predicting for weights outside the range of observed weights (extrapolation) should be done cautiously, as the linear relationship might not hold beyond the observed data range. For example, predicting depression for a very heavy roller weight might be unrealistic if the lawn’s response becomes non-linear at higher weights.

The model treats the pattern of change in depression with roller weight as a deterministic or fixed effect term. This means we assume that the linear relationship is systematic and consistent across different parts of the lawn (within the scope of the experiment). However, the measured values of depression also incorporate a random term (\(\varepsilon_i\)) that reflects:

  • Variation from one part of the lawn to another. Lawns are not perfectly uniform; soil composition, grass density, and moisture levels can vary, leading to different depressions even under the same roller weight.

  • Differences in handling the roller. Even with standardized procedures, slight variations in how the roller is applied (speed, number of passes, etc.) can introduce variability in the measured depression.

  • Measurement error. There is always some degree of imprecision in measuring depression. The measurement tools and techniques have limitations, and repeated measurements on the same spot might yield slightly different values.

It is typically assumed that the elements of the random term (\(\varepsilon_i\)) are uncorrelated. This means that the size and sign of one element do not provide any information about the other elements. In simpler terms, the random error in one measurement is not related to the random error in another measurement. This assumption is important for the validity of many statistical inference procedures in regression analysis.

Finally, it’s important to consider the scope of inference. Data from a single lawn might not be sufficient to generalize results to other lawns. The characteristics of the lawn (grass type, soil type, underlying structure) could influence the relationship between roller weight and depression. Data from multiple lawns would be essential if one wants to generalize the findings to a broader population of lawns. Collecting data from a variety of lawns would allow for a more robust and generalizable model, potentially including lawn characteristics as additional factors in the model.

A Brief Note on the Least Squares Line

It’s important to remember that the least squares line, with coefficients \(\hat{\alpha}\) and \(\hat{\beta}\), is obtained by minimizing the sum of squared residuals:

\[\sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2\] The least squares method is a widely used approach for estimating parameters in regression models because it has desirable statistical properties under certain assumptions. Intuitively, by minimizing the sum of squared residuals, we are finding the line that is "closest" to all the data points in terms of vertical distances.

Using linear algebra, it can be shown that the coefficients that minimize this sum are given by the solution to a simple linear system, resulting in the formulas:

\[\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\] \[\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}\] These formulas provide direct calculations for the least squares estimates \(\hat{\alpha}\) and \(\hat{\beta}\) based on the sample data. \(\bar{x}\) and \(\bar{y}\) represent the sample means of the roller weights and depressions, respectively.
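The closed-form formulas above translate directly into code. In this sketch the data points are made up for illustration: they lie exactly on the line \(y = -2 + 2.5x\), so the estimates recover the intercept and slope exactly.

```python
# Closed-form least squares estimates, following the formulas in the text.
# Hypothetical data: points placed exactly on the line y = -2 + 2.5 x.
x = [1.9, 3.1, 3.6, 4.3, 5.0, 5.6]
y = [-2 + 2.5 * xi for xi in x]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# beta_hat = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
           sum((xi - x_bar) ** 2 for xi in x)
# alpha_hat = y_bar - beta_hat * x_bar
alpha_hat = y_bar - beta_hat * x_bar

print(alpha_hat, beta_hat)  # approximately -2.0 and 2.5
```

With noisy data the estimates would not reproduce the true line exactly, but the same two formulas apply unchanged.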

Remark 1. In some situations, weighted least squares is useful. In this approach, fixed weights \(w_i\) are introduced, and the quantity to be minimized becomes:

\[\sum_{i=1}^{n} w_i (y_i - \hat{\alpha} - \hat{\beta} x_i)^2\] Weighted least squares is used when the variance of the error term is not constant across all observations (heteroscedasticity). The weights \(w_i\) are typically chosen to be inversely proportional to the variance of the error for the \(i\)-th observation. This gives more influence to observations with smaller variance and less influence to observations with larger variance, leading to more efficient parameter estimates in the presence of heteroscedasticity. For example, if we know that the measurement error is larger for heavier rollers, we might use weights that are smaller for observations with larger \(x_i\) values.
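Minimizing the weighted objective leads to the same closed-form structure as ordinary least squares, but with weighted means. A minimal sketch, with made-up data and hypothetical weights that downweight the noisier observations:

```python
# Weighted least squares sketch: minimize sum w_i (y_i - a - b x_i)^2,
# assuming the weights w_i are known constants. Data and weights are
# hypothetical; the last two points are treated as having larger error
# variance and hence get smaller weights.
x = [1.9, 3.1, 3.6, 4.3, 5.0, 5.6]
y = [2.0, 1.0, 5.0, 5.0, 20.0, 20.0]
w = [1.0, 1.0, 1.0, 1.0, 0.25, 0.25]

sw = sum(w)
x_bar_w = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
y_bar_w = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y

beta_hat = sum(wi * (xi - x_bar_w) * (yi - y_bar_w)
               for wi, xi, yi in zip(w, x, y)) / \
           sum(wi * (xi - x_bar_w) ** 2 for wi, xi in zip(w, x))
alpha_hat = y_bar_w - beta_hat * x_bar_w
print(alpha_hat, beta_hat)
```

Setting all weights to 1 recovers the ordinary least squares estimates.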

Probability Distributions and Random Variables

Random Variables: Building Blocks of Statistical Models

The concepts of randomness and probability are central to statistics. The perspective of data originating from a probability distribution is fundamental to understanding statistical methods.

Random variables (r.v.’s) are the essential building blocks for statistical models, particularly for their random components. A random variable is a variable whose value is a numerical outcome of a random phenomenon. Each time a random variable is observed, it may take a different numerical value, at random.

We can make probability statements about the values that are likely to occur for a random variable; this specification of probabilities is known as its probability distribution.

Distribution Function

The distribution function of a random variable \(X\), denoted by \(F(x)\), is defined as: \[F(x) = P(X \leq x)\] This function gives the probability that the random variable \(X\) will take a value less than or equal to \(x\), for any real number \(x \in \mathbb{R}\).

The distribution function \(F(x)\) is a non-decreasing function, right-continuous, with \(\lim_{x \to -\infty} F(x) = 0\) and \(\lim_{x \to +\infty} F(x) = 1\). From the distribution function \(F(x)\), we can define the set of potential values for \(X\), which is called the support \(S\) of \(X\). The support is the set of all values \(x\) for which \(f(x) > 0\) (for discrete r.v.’s) or \(f(x) > 0\) in a neighborhood of \(x\) (for continuous r.v.’s). We can also determine the probabilities of events related to \(X\), such as \(X = a\), \(X > a\), or \(a < X \leq b\), where \(a < b \in \mathbb{R}\). For example:

  • \(P(X > a) = 1 - P(X \leq a) = 1 - F(a)\)

  • \(P(a < X \leq b) = P(X \leq b) - P(X \leq a) = F(b) - F(a)\)

  • For a continuous random variable, \(P(X = a) = 0\) for any \(a \in \mathbb{R}\).
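These probability rules can be checked numerically for any distribution with a known \(F\). As an illustration (the choice of distribution is ours, not the text's), take an exponential distribution, whose distribution function is \(F(x) = 1 - e^{-\lambda x}\) for \(x \geq 0\):

```python
import math

# Probability rules derived from the distribution function, illustrated
# with an exponential distribution: F(x) = 1 - exp(-lam * x) for x >= 0.
lam = 0.5

def F(x):
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

a, b = 1.0, 3.0
p_greater = 1 - F(a)      # P(X > a) = 1 - F(a)
p_between = F(b) - F(a)   # P(a < X <= b) = F(b) - F(a)
print(p_greater, p_between)
```

The same two identities hold for any random variable, discrete or continuous, because they follow from the definition of \(F\) alone.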

Discrete and Continuous Random Variables

Random variables can be broadly classified into two types: discrete and continuous.

Discrete Random Variables

Discrete random variables take values from a discrete set, which can be finite or countably infinite. They are suitable for modeling finite or count data.

Examples of discrete random variables include:

  • The number of heads in three coin tosses (values: 0, 1, 2, 3).

  • The number of cars passing a point on a highway in an hour (values: 0, 1, 2, ...).

  • The outcome of rolling a die (values: 1, 2, 3, 4, 5, 6).

Discrete random variables are described by their probability mass function (pmf):

\[f(x) = P(X = x)\] The pmf \(f(x)\) gives the probability that the random variable \(X\) takes on a specific value \(x\).

For a valid pmf, the following conditions must hold:

  • \(0 \leq f(x) \leq 1\) for all \(x\). Probabilities must be between 0 and 1, inclusive.

  • For the potential values of \(X\), denoted as \(x_i\), where \(i \in I \subseteq \mathbb{N}\), \(f(x_i) > 0\). The pmf is positive for all possible values in the support.

  • The sum of probabilities over all possible values is equal to 1: \(\sum_{i \in I} f(x_i) = 1\). The total probability of all possible outcomes must be 1.
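The three pmf conditions can be verified directly for a concrete distribution. As a sketch, take a binomial pmf \(f(x) = \binom{n}{x} p^x (1-p)^{n-x}\) for \(x = 0, \dots, n\) (one of the discrete models previewed in the overview):

```python
from math import comb

# Check the pmf validity conditions for a binomial(n, p) distribution:
# f(x) = C(n, x) * p^x * (1 - p)^(n - x), for x = 0, ..., n.
n, p = 10, 0.3
f = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range((n + 1))]

print(all(0 <= fx <= 1 for fx in f))  # every probability lies in [0, 1]
print(sum(f))                         # the probabilities sum to 1
```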

Continuous Random Variables

Continuous random variables take values in a continuous set. For continuous random variables, the probability of taking any particular value is zero.

Examples of continuous random variables include:

  • Height or weight of a person.

  • Temperature of a room.

  • Time until failure of a device.

Continuous random variables are described by their probability density function (pdf) \(f(x)\). The probability density function \(f(x)\) does not directly give probabilities, but the probability of \(X\) falling within an interval \([a, b]\) is given by the integral of the pdf over that interval:

\[P(a \leq X \leq b) = \int_{a}^{b} f(x) dx, \quad a < b \in \mathbb{R}\] The probability is represented by the area under the pdf curve between \(a\) and \(b\).

For a valid pdf, the following conditions must hold:

  • \(f(x) \geq 0\) for all \(x\). The pdf must be non-negative everywhere.

  • The integral of the pdf over the entire real line is equal to 1: \(\int_{-\infty}^{+\infty} f(x) dx = 1\). The total area under the pdf curve must be 1.

  • The distribution function \(F(b)\) can be obtained by integrating the pdf from \(-\infty\) to \(b\): \(F(b) = \int_{-\infty}^{b} f(x) dx\). The distribution function is the cumulative integral of the pdf.

  • The derivative of the distribution function gives the pdf: \(F'(x) = f(x)\), where the derivative \(F'(x)\) exists. The pdf is the rate of change of the distribution function.

Mean, Variance, and Quantiles

Instead of fully describing the entire distribution of a random variable \(X\), often its first two moments are sufficient for many purposes. These moments, particularly the mean and variance, provide key summary characteristics of the distribution.

Expected Value (Mean)

The expected value (or mean) \(\mu = \mathbb{E}(X)\) of a random variable \(X\) represents its average value in the long run. It is a measure of the central tendency of the distribution. For discrete and continuous random variables, the expected value is calculated as follows:

Discrete case: \[\mathbb{E}(X) = \sum_{i \in I} x_i f(x_i)\] where the sum is taken over all possible values \(x_i\) in the support \(I\), weighted by their probabilities \(f(x_i)\).

Continuous case: \[\mathbb{E}(X) = \int_{-\infty}^{+\infty} x f(x) dx\] where the integral is taken over the entire support of \(X\), with each value \(x\) weighted by its probability density \(f(x)\).

The expected value can be thought of as the weighted average of all possible values of \(X\), where the weights are given by the probabilities (or probability densities).

Variance

The variance \(\sigma^2 = \mathbb{V}(X)\) measures the spread or dispersion of the random variable around its mean. It quantifies how much the values of \(X\) deviate from the expected value on average. It is defined as: \[\sigma^2 = \mathbb{V}(X) = \mathbb{E}[(X - \mu)^2]\] which is the expected value of the squared deviation of \(X\) from its mean \(\mu\).

The variance can also be calculated using the formula \(\mathbb{V}(X) = \mathbb{E}(X^2) - [\mathbb{E}(X)]^2\).
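Both the definition \(\mathbb{E}[(X-\mu)^2]\) and the shortcut \(\mathbb{E}(X^2) - [\mathbb{E}(X)]^2\) can be computed exactly for a small discrete distribution. A sketch for a fair six-sided die, where each face has probability \(1/6\):

```python
from fractions import Fraction

# Mean and variance of a fair six-sided die, from the definitions above.
support = range(1, 7)
f = Fraction(1, 6)  # pmf: each face has probability 1/6

mean = sum(x * f for x in support)                       # E(X)
var_def = sum((x - mean) ** 2 * f for x in support)      # E[(X - mu)^2]
var_short = sum(x ** 2 * f for x in support) - mean ** 2 # E(X^2) - E(X)^2

print(mean, var_def, var_short)  # 7/2, 35/12, 35/12
```

Using exact fractions makes it easy to see that the two variance formulas agree term by term.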

The square root of the variance, \(\sigma = \sqrt{\mathbb{V}(X)}\), is the standard deviation; the term standard error is reserved for the standard deviation of an estimator. The standard deviation is in the same units as \(X\), making it often more interpretable than the variance.

Indices of skewness and kurtosis are higher-order moments that can be defined similarly to further describe the shape of the distribution.

  • Skewness measures the asymmetry of the distribution. A skewness of zero indicates symmetry, positive skewness indicates a longer tail on the right, and negative skewness indicates a longer tail on the left.

  • Kurtosis measures the "tailedness" of the distribution. Higher kurtosis indicates heavier tails and a sharper peak, while lower kurtosis indicates lighter tails and a flatter peak.

Quantiles

The \(\alpha\)-quantile \(x_\alpha\) of a random variable \(X\), where \(\alpha \in (0, 1)\), is a value such that the probability of \(X\) being less than or equal to \(x_\alpha\) is \(\alpha\): \[P(X \leq x_\alpha) = \alpha\] In other words, the \(\alpha\)-quantile is the value below which a proportion \(\alpha\) of the distribution lies.

The median of \(X\) corresponds to the 0.5-quantile (\(x_{0.5}\)). The median divides the distribution into two equal halves. Quartiles divide the distribution into four parts:

  • First quartile (Q1) or 0.25-quantile (\(x_{0.25}\)): 25% of the data falls below this value.

  • Second quartile (Q2) or 0.5-quantile (\(x_{0.5}\)): This is the median.

  • Third quartile (Q3) or 0.75-quantile (\(x_{0.75}\)): 75% of the data falls below this value.

Percentiles divide the distribution into one hundred parts, with the \(p\)-th percentile being the \((p/100)\)-quantile. Quantiles are robust measures of location and spread, less sensitive to outliers than the mean and variance.
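Sample quantiles can be computed with the standard library; this sketch uses a made-up dataset, and `statistics.quantiles` with `n=4` returns the three quartiles:

```python
import statistics

# Sample quartiles for an illustrative dataset: the values 1, 2, ..., 100.
data = list(range(1, 101))

q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)  # Q1, median (Q2), Q3
```

Note that several interpolation conventions exist for sample quantiles; `statistics.quantiles` defaults to the "exclusive" method, so results can differ slightly from other software on small samples.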

Standardized Random Variable

The standardized random variable \(Z\) is obtained by transforming \(X\) as follows: \[Z = \frac{X - \mu}{\sigma}\] Standardization centers the random variable at zero (by subtracting the mean) and scales it to have unit variance (by dividing by the standard deviation).

The standardized random variable \(Z\) has a mean of 0 and a variance of 1: \(\mathbb{E}(Z) = 0\) and \(\mathbb{V}(Z) = 1\). Standardization is a common technique in statistics to compare random variables from different distributions or to simplify analysis.
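The same transformation applies to a sample: subtracting the sample mean and dividing by the (population-style) standard deviation yields values with mean 0 and variance 1. A sketch with made-up numbers:

```python
import statistics

# Standardize an illustrative sample using its mean and population
# standard deviation; the standardized values have mean 0 and variance 1.
x = [4.1, 5.7, 3.2, 6.8, 5.0, 4.9, 6.1]
mu = statistics.mean(x)
sigma = statistics.pstdev(x)  # population standard deviation

z = [(xi - mu) / sigma for xi in x]
print(round(statistics.mean(z), 10), round(statistics.pvariance(z), 10))
```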

Random Vectors

Often, statistical analysis requires considering multiple observations simultaneously. These are viewed as realizations of a random vector (or multivariate random variable).

A random vector \((X_1, \dots, X_n)\) takes values in \(\mathbb{R}^n\), meaning it is a vector of \(n\) numerical components, according to a joint probability distribution.

For example, if we measure both height and weight for a sample of individuals, we are dealing with a bivariate random vector for each individual.

Joint Distribution Function

The probability distribution of a random vector is defined by the joint distribution function: \[F(x_1, \dots, x_n) = P(X_1 \leq x_1, \dots, X_n \leq x_n)\] The joint distribution function gives the probability that each random variable \(X_i\) is less than or equal to a corresponding value \(x_i\), for all \(i = 1, \dots, n\).

Alternatively, for continuous random vectors, it can be defined by the multivariate (joint) probability density function \(f(x_1, \dots, x_n)\). The joint pdf generalizes the concept of pdf to multiple dimensions, describing the probability density at each point in \(\mathbb{R}^n\).

Marginal Density

Each component \(X_i\), for \(i = 1, \dots, n\), of a random vector is itself a random variable with a marginal density (probability) function \(f_i(x_i)\).

The marginal density of \(X_i\) describes the distribution of \(X_i\) considered in isolation, without regard to the other components of the random vector. It can be derived from the joint distribution by integrating (or summing) over all possible values of the other variables.

Independent and Identically Distributed (i.i.d.) Random Variables

Two situations greatly simplify statistical analysis:

  • Independence: The component random variables \(X_i\), \(i = 1, \dots, n\), are independent if the realization of one does not affect the probability distribution of the others. In this case, the joint density function is the product of the marginal density functions: \[f(x_1, \dots, x_n) = \prod_{i=1}^{n} f_i(x_i)\] Independence implies that knowing the value of one random variable provides no information about the values of the other random variables.

  • Independent and Identically Distributed (i.i.d.): The component random variables \(X_i\), \(i = 1, \dots, n\), are independent and identically distributed (i.i.d.) if they are independent and each component follows the same distribution with density (probability) function \(g(x)\). In this case, the joint density function is: \[f(x_1, \dots, x_n) = \prod_{i=1}^{n} g(x_i)\] I.i.d. random variables are fundamental in many statistical models, simplifying both theoretical analysis and practical applications. In random sampling, for instance, observations are often assumed to be i.i.d. realizations from the population distribution. For example, if \(X_1, \dots, X_n\) are i.i.d. from a normal distribution with mean \(\mu\) and variance \(\sigma^2\), their joint pdf is given by: \[f(x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n e^{-\sum_{i=1}^{n}\frac{(x_i - \mu)^2}{2\sigma^2}}\]
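The product form and the collapsed form of the i.i.d. normal joint pdf can be checked numerically. The sample values and parameters below are made up for illustration:

```python
import math

# Joint pdf of an i.i.d. normal sample, evaluated two ways: as the
# product of the individual densities, and via the collapsed formula.
mu, sigma2 = 0.0, 2.0
x = [0.3, -1.2, 0.8, 2.1, -0.4]  # hypothetical sample values

def phi(xi):
    """Normal density with mean mu and variance sigma2, evaluated at xi."""
    return math.exp(-(xi - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

product_form = math.prod(phi(xi) for xi in x)
collapsed = (2 * math.pi * sigma2) ** (-len(x) / 2) * \
            math.exp(-sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))
print(product_form, collapsed)  # the two forms agree up to rounding
```

In practice the log of this joint density (the log-likelihood) is used instead, since the product of many small densities underflows quickly.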

Bivariate Random Variables

The two-dimensional case (bivariate random variables) is sufficient to illustrate many concepts required for higher dimensions. Let’s consider a continuous bivariate random variable \((X, Y)\) with joint density function \(f(x, y)\). Results for the discrete case are obtained by replacing integration with summation.

Marginal Density (Bivariate Case)

The marginal density of \(X\) is obtained by integrating the joint density over all possible values of \(Y\): \[f(x) = \int_{-\infty}^{+\infty} f(x, y) dy\] Similarly, the marginal density of \(Y\) is: \[f(y) = \int_{-\infty}^{+\infty} f(x, y) dx\] Marginal densities are obtained by "integrating out" the other variable from the joint density, effectively summarizing the distribution of each variable independently.

For discrete random variables, the marginal pmf is obtained by summing the joint pmf over the possible values of the other variable.

Conditional Density

The conditional density of \(X\) given \(Y = y\) is defined as: \[f(x|y) = \frac{f(x, y)}{f(y)}\] This is defined assuming \(f(y) > 0\). Similarly, the conditional density of \(Y\) given \(X = x\) is: \[f(y|x) = \frac{f(x, y)}{f(x)}\] assuming \(f(x) > 0\). Conditional density describes the probability distribution of one random variable given that another random variable has taken a specific value.

The conditional density \(f(x|y)\) represents the probability density of \(X\) at value \(x\), given that \(Y\) is known to be equal to \(y\). It is essentially a "slice" through the joint distribution at a fixed value of \(Y\), renormalized to be a proper probability density function.

Bayes’ Theorem

Bayes’ theorem relates conditional probabilities and is a cornerstone of Bayesian statistical methods: \[f(x|y) = \frac{f(x)f(y|x)}{f(y)}\] This is again defined assuming \(f(y) > 0\). Bayes’ theorem provides a way to update probabilities based on new evidence. It allows us to reverse the conditioning: if we know \(f(y|x)\), we can find \(f(x|y)\).

Bayes’ theorem is derived directly from the definition of conditional probability and the symmetry property \(f(x, y) = f(y, x)\). It is fundamental in Bayesian inference for updating beliefs in light of new data.
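A discrete numerical sketch of Bayes' theorem (the prior and likelihood values below are hypothetical), where \(f(y)\) is computed by the law of total probability:

```python
# Bayes' theorem for a binary X and the event Y = 1:
# f(x | y=1) = f(x) * f(y=1 | x) / f(y=1)
prior_x = {0: 0.6, 1: 0.4}          # f(x), the prior
lik_y_given_x = {0: 0.2, 1: 0.9}    # f(y=1 | x), the likelihood

# f(y=1) by the law of total probability
f_y = sum(prior_x[x] * lik_y_given_x[x] for x in prior_x)

# Posterior f(x | y=1): conditioning is "reversed"
posterior = {x: prior_x[x] * lik_y_given_x[x] / f_y for x in prior_x}
print(posterior)
```

With these numbers, observing \(Y = 1\) shifts probability mass from \(X = 0\) toward \(X = 1\), since \(Y = 1\) is much more likely under \(X = 1\).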

Independence (Bivariate Case)

Two random variables \(X\) and \(Y\) are independent if and only if their joint density function is the product of their marginal density functions: \[f(x, y) = f(x)f(y)\] Independence means that the joint distribution is simply the product of the marginal distributions, implying no statistical association between the variables.

Equivalently, \(X\) and \(Y\) are independent if and only if \(f(x|y) = f(x)\) for all \(y\) with \(f(y) > 0\), or \(f(y|x) = f(y)\) for all \(x\) with \(f(x) > 0\).

Conditional Expectation

The conditional expectation (mean) of \(X\) given \(Y = y\) is: \[\mathbb{E}(X|Y = y) = \int_{-\infty}^{+\infty} x f(x|y) dx\] Similarly, the conditional expectation of \(Y\) given \(X = x\) is: \[\mathbb{E}(Y|X = x) = \int_{-\infty}^{+\infty} y f(y|x) dy\] Conditional expectation is the expected value of one random variable given that another random variable is held at a specific value.

The conditional expectation \(\mathbb{E}(X|Y = y)\) is the mean of the conditional distribution of \(X\) given \(Y = y\). It represents the average value of \(X\) when we know that \(Y\) has taken the value \(y\).

Analogous definitions exist for the conditional variance of \(X\) given \(Y = y\) and \(Y\) given \(X = x\). For example, the conditional variance of \(X\) given \(Y=y\) is \(\mathbb{V}(X|Y=y) = \mathbb{E}[(X - \mathbb{E}(X|Y=y))^2 | Y=y]\).
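In the discrete case, the conditional expectation is a weighted sum over the conditional pmf. A self-contained Python sketch with a toy joint pmf (illustrative numbers):

```python
# Conditional expectation E(X | Y = y) for a toy discrete joint pmf.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def cond_mean_x_given_y(joint, y):
    """E(X | Y = y) = sum_x x * f(x | y), with f(x | y) = f(x, y) / f(y)."""
    f_y = sum(p for (_x, yy), p in joint.items() if yy == y)   # marginal f(y)
    return sum(x * p / f_y for (x, yy), p in joint.items() if yy == y)

print(cond_mean_x_given_y(joint, 0))   # 0.30 / 0.40 = 0.75
print(cond_mean_x_given_y(joint, 1))   # 0.40 / 0.60 = 2/3
```

Note that the conditional mean of \(X\) changes with the observed value of \(Y\), which is exactly what it means for \(X\) and \(Y\) to be dependent.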

Covariance and Correlation

The covariance of \((X, Y)\), denoted as \(\mathrm{Cov}(X, Y)\) or \(\sigma_{XY}\), measures the linear relationship between \(X\) and \(Y\): \[\mathrm{Cov}(X, Y) = \sigma_{XY} = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \{x - \mathbb{E}(X)\}\{y - \mathbb{E}(Y)\}f(x, y) dx dy\] Covariance quantifies the direction and strength of the linear association between two random variables.

An alternative formula for covariance is \(\mathrm{Cov}(X, Y) = \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)\).

If \(X\) and \(Y\) are independent, then \(\mathrm{Cov}(X, Y) = 0\). However, the converse is not always true: zero covariance does not imply independence (a relevant exception is the multivariate normal distribution, for which uncorrelated components are also independent). Zero covariance implies no linear relationship, but there might still be non-linear relationships between \(X\) and \(Y\).

The Pearson correlation coefficient \(\rho_{XY}\) of \((X, Y)\) is a standardized measure of linear dependence, useful for describing the strength and direction of linear relationships: \[\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathbb{V}(X)\mathbb{V}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\] The correlation coefficient standardizes the covariance to lie between -1 and 1, making it easier to interpret the strength of linear association.

The correlation coefficient \(\rho_{XY}\) is dimensionless and always lies between -1 and 1.

  • \(\rho_{XY} = 1\): perfect positive linear correlation.

  • \(\rho_{XY} = -1\): perfect negative linear correlation.

  • \(\rho_{XY} = 0\): no linear correlation.
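Sample versions of covariance and correlation are available directly in NumPy. A quick simulated check (the seed and the linear relationship \(Y = 2X + \varepsilon\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # Y is linearly related to X plus noise

cov_xy = np.cov(x, y)[0, 1]             # sample covariance
rho_xy = np.corrcoef(x, y)[0, 1]        # sample correlation, always in [-1, 1]
print(cov_xy, rho_xy)
```

Here the theoretical covariance is \(2\,\mathbb{V}(X) = 2\) and the theoretical correlation is \(2/\sqrt{1 \cdot 5} \approx 0.894\); the sample values should land close to these for a sample this large.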

Mean Vector and Variance-Covariance Matrix

The first and second order moments of a bivariate random vector \((X, Y)\) are summarized by the mean vector \(\mu\) and the variance-covariance matrix \(\Sigma\).

The mean vector is: \[\mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} = \begin{pmatrix} \mathbb{E}(X) \\ \mathbb{E}(Y) \end{pmatrix}\] The mean vector is a vector of the expected values of each component of the random vector.

The variance-covariance matrix is: \[\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix} = \begin{pmatrix} \mathbb{V}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(Y, X) & \mathbb{V}(Y) \end{pmatrix}\] The variance-covariance matrix summarizes the variances of each variable along the diagonal and the covariances between pairs of variables off-diagonal.

The matrix \(\Sigma\) is symmetric because \(\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)\), and it is positive semi-definite, ensuring that variances are non-negative and correlations are well-behaved.
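Both properties, symmetry and positive semi-definiteness, can be verified numerically on an estimated variance-covariance matrix. A small NumPy sketch (the simulated data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 2))
data[:, 1] += 0.5 * data[:, 0]          # induce some covariance between columns

Sigma = np.cov(data, rowvar=False)      # 2x2 sample variance-covariance matrix
print(Sigma)
print(np.allclose(Sigma, Sigma.T))                 # symmetric
print(np.all(np.linalg.eigvalsh(Sigma) >= 0.0))    # eigenvalues non-negative (PSD)
```

The diagonal entries are the sample variances of the two columns; the off-diagonal entry is their sample covariance, appearing twice by symmetry.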

Statistics and Sampling Distribution

A (sample) statistic is a function of a set of random variables that does not depend on any unknown parameters. Since it is a function of random variables, a statistic is itself a random variable.

Examples of statistics include the sample mean, sample variance, sample median, etc. Statistics are used to estimate population parameters or to test hypotheses.

The probability distribution of a sample statistic is called its sampling distribution. The form of the sampling distribution depends on the joint distribution of the initial random vector.

Understanding the sampling distribution of a statistic is crucial for statistical inference, as it allows us to assess the variability and uncertainty associated with the statistic as an estimator or test statistic.

Given a random vector \((X_1, \dots, X_n)\), well-known examples of statistics include the sample mean \(\bar{X}\) and the (corrected) sample variance \(S^2\), defined as:

\[\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\] \[S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

The uncorrected sample variance is obtained by replacing the divisor \(n-1\) (the degrees of freedom) with \(n\). The corrected sample variance \(S^2\) is an unbiased estimator of the population variance \(\sigma^2\), while the uncorrected sample variance is biased, underestimating \(\sigma^2\) on average.
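The two divisors are easy to compare on a concrete sample. The NumPy `ddof` argument selects between them (`ddof=1` for the corrected version, `ddof=0` for the uncorrected one); the data vector below is illustrative:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
xbar = x.mean()

s2_corrected = ((x - xbar) ** 2).sum() / (len(x) - 1)   # divides by n - 1
s2_uncorrected = ((x - xbar) ** 2).sum() / len(x)       # divides by n

print(xbar, s2_corrected, s2_uncorrected)
print(np.var(x, ddof=1), np.var(x, ddof=0))  # the same quantities via NumPy
```

The uncorrected value is always smaller, by the factor \((n-1)/n\); the difference vanishes as \(n\) grows.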

Further examples of statistics are the sample median, sample quantiles, sample Mean Absolute Deviation (MAD), sample covariance, and sample correlation coefficient. Each statistic has its own sampling distribution, which depends on the distribution of the data and the sample size.

Useful Results for Sample Statistics

Some useful results for sample statistics are listed below, particularly when \(X_1, \dots, X_n\) are uncorrelated (or independent) random variables with the same marginal mean \(\mu\) and variance \(\sigma^2\) (e.g., this holds for identically distributed random variables).

  • For the sum of random variables: \[\mathbb{E}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathbb{E}(X_i) = n\mu\] \[\mathbb{V}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathbb{V}(X_i) = n\sigma^2 \quad \text{(if } X_i \text{ are uncorrelated)}\] For uncorrelated random variables, the variance of the sum is the sum of the variances.

  • For the sample mean: \[\mathbb{E}(\bar{X}) = \mathbb{E}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n} \mathbb{E}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n} (n\mu) = \mu\] \[\mathbb{V}(\bar{X}) = \mathbb{V}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \mathbb{V}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} (n\sigma^2) = \frac{\sigma^2}{n} \quad \text{(if } X_i \text{ are uncorrelated)}\] The sample mean is an unbiased estimator of the population mean \(\mu\), and its variance decreases as the sample size \(n\) increases.

  • For the (corrected) sample variance: \[\mathbb{E}(S^2) = \sigma^2\] The corrected sample variance \(S^2\) is an unbiased estimator of the population variance \(\sigma^2\). For the uncorrected sample variance, the expectation is \(\frac{\sigma^2(n-1)}{n} = \sigma^2 \frac{n-1}{n} < \sigma^2\). The uncorrected sample variance is biased downwards, especially for small sample sizes.
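These three results can be checked by Monte Carlo: draw many samples, compute \(\bar{X}\) and \(S^2\) for each, and compare the empirical averages to the theory. A sketch assuming normal data with \(\mu = 5\), \(\sigma^2 = 4\), \(n = 20\) (all illustrative choices, including the seed):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, n, reps = 5.0, 4.0, 20, 50_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)       # corrected sample variance per sample

print(xbars.mean())   # should be close to mu = 5
print(xbars.var())    # should be close to sigma2 / n = 0.2
print(s2s.mean())     # should be close to sigma2 = 4
```

Repeating with `ddof=0` would show the downward bias of the uncorrected variance: its average lands near \(\sigma^2 (n-1)/n = 3.8\) instead of 4.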

Weak Law of Large Numbers

The weak law of large numbers states that if \(X_1, \dots, X_n\) are i.i.d. random variables with mean \(\mu\), then the sample mean \(\bar{X}\) converges in probability to \(\mu\) as \(n \to +\infty\). In symbols, \(\bar{X} \stackrel{p}{\longrightarrow}\mu\). This means that as the sample size \(n\) increases, the distribution of \(\bar{X}\) becomes more and more concentrated around the marginal mean \(\mu\).

The weak law of large numbers is a fundamental result in probability theory, providing a theoretical justification for using the sample mean to estimate the population mean. It states that for a sufficiently large sample size, the sample mean is likely to be close to the true population mean.

A similar result holds for the (corrected and uncorrected) sample variance: \(S^2 \stackrel{p}{\longrightarrow}\sigma^2\) as \(n \to +\infty\). As the sample size increases, the sample variance converges in probability to the population variance \(\sigma^2\).

Example: Application of Weak Law of Large Numbers

Consider \(X_1, \dots, X_n\) to be i.i.d. Poisson distributed random variables with parameter \(\lambda = 5\), denoted as \(Po(\lambda)\). For a Poisson distribution, both the mean and variance are equal to \(\lambda\), so \(\mu = \sigma^2 = 5\).

A sequence of observed values for the sample mean \(\bar{X} = \sum_{i=1}^{n} X_i / n\), for \(n = 1, \dots, 1000\), is shown in Figure 3.

Sample path of the sample mean for Poisson distribution.

The sample path demonstrates that as \(n\) increases, the observed values of \(\bar{X}\) tend to be more concentrated around \(\mu = \lambda = 5\), as predicted by the weak law of large numbers. Initially, for small \(n\), the sample mean fluctuates considerably, but as \(n\) grows, the fluctuations dampen, and \(\bar{X}\) stabilizes around the true mean \(\mu = 5\).

Since the sum of independent Poisson random variables is also Poisson, the sample sum \(\sum_{i=1}^{n} X_i\) follows a \(Po(n\lambda)\) distribution. This fully determines the distribution of the sample mean \(\bar{X} = \sum_{i=1}^{n} X_i / n\): it is a scaled Poisson, taking values \(k/n\) for \(k = 0, 1, 2, \dots\) with \(Po(n\lambda)\) probabilities. Note that \(\bar{X}\) itself is not Poisson distributed; its increasing concentration around \(\mu\) as \(n\) grows is exactly the behavior predicted by the weak law of large numbers.
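The sample path in the figure can be reproduced with a few lines of NumPy: simulate a stream of \(Po(5)\) draws and track the running mean (the seed is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 5.0, 1000

draws = rng.poisson(lam, size=n)
# Running sample mean: Xbar after 1, 2, ..., n observations
running_mean = np.cumsum(draws) / np.arange(1, n + 1)

# Early values fluctuate; later values settle near lambda = 5
print(running_mean[9], running_mean[99], running_mean[999])
```

Plotting `running_mean` against the sample size reproduces the dampening fluctuations described above.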

Figure 4 illustrates the probability function of \(\bar{X}\) for different values of \(n\) (\(n = 5, 10, 25, 50\)).

Probability function of sample mean for different n values.

As \(n\) increases, the mean value remains constant at \(\mu = 5\), while the variability of the sample mean lessens, showing a more concentrated distribution around the true mean. The probability mass becomes increasingly concentrated around \(\mu = 5\), visually demonstrating the convergence in probability predicted by the weak law of large numbers.

Basic Statistical Models: Common Distributions

This section reviews some basic and commonly used statistical models, which are essentially probability distributions that are frequently employed in statistical modeling. These distributions serve as fundamental building blocks for more complex statistical models and are essential for understanding and applying statistical methods across various fields.

Discrete Uniform Distribution

The discrete uniform distribution describes an experiment where a finite number of values are equally likely to be observed. It is characterized by constant probability over a finite set of values.

A discrete random variable \(X\) follows a discrete uniform distribution with values \(x_1, \dots, x_n \in \mathbb{R}\), where \(n \in \mathbb{N}^+\), abbreviated as \(X \sim Ud(x_1, \dots, x_n)\), if its support is \(S = \{x_1, \dots, x_n\}\) and the probability mass function is: \[f(x_i) = \frac{1}{n}, \quad \text{for } i = 1, \dots, n\]

Each value in the set \(\{x_1, \dots, x_n\}\) has an equal probability of \(\frac{1}{n}\) of being observed.

The expected value and variance for a discrete uniform distribution are:

\[\mathbb{E}(X) = \sum_{i=1}^{n} \frac{x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i\] \[\mathbb{V}(X) = \sum_{i=1}^{n} \frac{\{x_i - \mathbb{E}(X)\}^2}{n} = \frac{1}{n} \sum_{i=1}^{n} \{x_i - \mathbb{E}(X)\}^2\] The expected value is the average of the values \(x_1, \dots, x_n\), and the variance measures the spread of these values around their mean, assuming equal probability for each value.
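For the fair six-sided die, these formulas give \(\mathbb{E}(X) = 3.5\) and \(\mathbb{V}(X) = 35/12 \approx 2.917\). A direct computation in Python:

```python
# Mean and variance of a discrete uniform distribution over given values,
# here the faces of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
n = len(values)

mean = sum(values) / n                              # E(X) = average of the values
var = sum((x - mean) ** 2 for x in values) / n      # V(X) = average squared deviation
print(mean, var)   # 3.5 and 35/12
```

The same code works for any finite set of equally likely values \(x_1, \dots, x_n\).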

Figure 5 shows the probability function for a discrete uniform distribution with \(n=6\) and \(x_i = i\) for \(i = 1, \dots, 6\). In this case, \(X\) can take values \(\{1, 2, 3, 4, 5, 6\}\), each with probability \(\frac{1}{6}\), resembling the outcome of rolling a fair six-sided die.

Probability function of discrete uniform distribution.

The pmf is constant across all possible outcomes, visually represented as bars of equal height in Figure 5.

Bernoulli Distribution

The Bernoulli distribution models an experiment with two possible outcomes, often labeled as "success" (1) and "failure" (0). It is the simplest discrete distribution, representing a single trial with a binary outcome.

A discrete random variable \(X\) follows a Bernoulli distribution with parameter \(p \in (0, 1)\), abbreviated as \(X \sim Ber(p)\), if it describes an experiment where the possible outcomes are "success" (or 1) and "failure" (or 0), and the probability of success is \(p\).

The support is \(S = \{0, 1\}\), and the probability mass function is: \[f(1) = P(X = 1) = p, \quad f(0) = P(X = 0) = 1 - p\]

The parameter \(p\) represents the probability of success, and \((1-p)\) is the probability of failure.

The expected value and variance of a Bernoulli distribution are:

\[\mathbb{E}(X) = 1 \cdot P(X=1) + 0 \cdot P(X=0) = p\] \[\mathbb{V}(X) = \mathbb{E}[(X - p)^2] = (1-p)^2 \cdot P(X=1) + (0-p)^2 \cdot P(X=0) = (1-p)^2 p + p^2 (1-p) = p(1-p)\] The expected value is equal to the probability of success \(p\), and the variance is maximized when \(p = 0.5\).
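A quick simulation check of \(\mathbb{E}(X) = p\) and \(\mathbb{V}(X) = p(1-p)\), using the fact that `rng.binomial(1, p)` produces Bernoulli draws (the seed and sample size are illustrative):

```python
import numpy as np

p = 1 / 3
rng = np.random.default_rng(7)
x = rng.binomial(1, p, size=100_000)   # Ber(p) draws, since Bi(1, p) = Ber(p)

print(x.mean())   # should be close to p = 1/3
print(x.var())    # should be close to p(1 - p) = 2/9
```

Trying other values of \(p\) shows the variance peaking at \(p = 0.5\), as noted above.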

Figure 6 shows the probability function for a Bernoulli distribution with \(p = 1/3\). In this case, there is a \(\frac{1}{3}\) probability of success (\(X=1\)) and a \(\frac{2}{3}\) probability of failure (\(X=0\)).

Probability function of Bernoulli distribution.

The pmf has two points, one at \(x=0\) with height \(1-p\) and another at \(x=1\) with height \(p\), as shown in Figure 6.

Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It extends the Bernoulli distribution to multiple trials, counting the total number of successes.

A discrete random variable \(X\) follows a binomial distribution with parameters \(n \in \mathbb{N}\) and \(p \in (0, 1)\), abbreviated as \(X \sim Bi(n, p)\), if it describes the number of successes in \(n\) independent Bernoulli experiments, each with the same success probability \(p\).

The support is \(S = \{0, 1, \dots, n\}\), and the probability mass function is: \[f(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & \text{if } x \in S \\0 & \text{otherwise} \end{cases}\]

The binomial coefficient \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\) counts the number of ways to choose \(x\) successes from \(n\) trials. The term \(p^x (1-p)^{n-x}\) is the probability of getting exactly \(x\) successes and \(n-x\) failures in a specific sequence of trials.


The expected value and variance of a binomial distribution are:

\[\mathbb{E}(X) = np\] \[\mathbb{V}(X) = np(1 - p)\] These formulas show that both the mean and variance of the binomial distribution are directly proportional to the number of trials \(n\) and the success probability \(p\).

Note that a \(Bi(1, p)\) distribution is equivalent to a \(Ber(p)\) distribution, as a binomial distribution with one trial reduces to a Bernoulli distribution.
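The pmf and the moment formulas can be verified directly from the definition, using `math.comb` for the binomial coefficient (the parameter values \(n = 10\), \(p = 0.3\) are illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability mass function: C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# Moments computed from the pmf itself
mean = sum(x * f for x, f in zip(range(n + 1), pmf))
var = sum((x - mean) ** 2 * f for x, f in zip(range(n + 1), pmf))
print(mean, var)   # should match np = 3.0 and np(1-p) = 2.1
```

The pmf values also sum to one over the support \(\{0, 1, \dots, n\}\), as any proper pmf must.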

Figure 7 shows the probability function for binomial distributions with different \(n\) and \(p\) values. As \(n\) increases, the distribution becomes more spread out, and as \(p\) varies, the distribution shifts towards more successes (higher \(p\)) or failures (lower \(p\)).

Probability function of binomial distribution for different parameters.

The shape of the binomial pmf varies with \(n\) and \(p\), ranging from skewed to approximately symmetric for large \(n\) and \(p \approx 0.5\).

Poisson Distribution

The Poisson distribution is often used to model the number of events occurring in a fixed interval of time or space. It is particularly useful for rare events.

A discrete random variable \(X\) follows a Poisson distribution with parameter \(\lambda > 0\), abbreviated as \(X \sim Po(\lambda)\), if its support is \(S = \mathbb{N}= \{0, 1, 2, \dots\}\) and the probability mass function is: \[f(x) = \begin{cases} \frac{\lambda^x e^{-\lambda}}{x!} & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}\]

The parameter \(\lambda\) represents the average rate of events (events per unit time or space).

The expected value and variance of a Poisson distribution are both equal to the parameter \(\lambda\):

\[\mathbb{E}(X) = \lambda\] \[\mathbb{V}(X) = \lambda\] The equality of mean and variance is a unique property of the Poisson distribution.

The Poisson distribution can be used as an approximation to the binomial distribution when \(n\) is large and \(p\) is small (a common rule of thumb is \(n \geq 50\) and \(p \leq 1/25\)), with \(\lambda = np\). This approximation is useful because Poisson calculations can be simpler than binomial calculations for large \(n\).
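The quality of the approximation is easy to inspect by comparing the two pmfs term by term. A sketch with the illustrative values \(n = 200\), \(p = 0.02\) (so \(\lambda = np = 4\)):

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def pois_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

n, p = 200, 0.02          # large n, small p
lam = n * p               # matching Poisson rate, lambda = np = 4

# Side-by-side comparison of the two pmfs near the mean
for x in range(6):
    print(x, binom_pmf(x, n, p), pois_pmf(x, lam))
```

For these parameter values the two pmfs agree to within a few thousandths at every point, and the agreement improves as \(n\) grows with \(np\) held fixed.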

Figure 8 shows the probability function for Poisson distributions with different \(\lambda\) values. As \(\lambda\) increases, the distribution shifts to the right, and its spread also increases.

Probability function of Poisson distribution for different \(\lambda\) values.

The Poisson pmf is typically skewed to the right, especially for small \(\lambda\), becoming more symmetric as \(\lambda\) increases.

Conclusion

This lecture has provided a foundational review of statistical models, starting with the basic definitions and purposes of these models. We explored the crucial role of probability distributions and random variables as building blocks for statistical models, and examined several common discrete distributions such as Discrete Uniform, Bernoulli, Binomial, and Poisson.

Key takeaways from this lecture include:

  • Statistical models are essential tools for learning from data in the presence of random variability, enabling us to make inferences and predictions.

  • Understanding the properties of random variables and their distributions is fundamental to statistical modeling, as they form the basis for describing random phenomena.

  • Common discrete distributions like the Discrete Uniform, Bernoulli, Binomial, and Poisson are widely applicable in various statistical contexts, serving as basic models for many real-world phenomena.

The distributions covered in this lecture are just the beginning. Statistical modeling employs a vast array of distributions, both discrete and continuous, to accommodate the complexities of real-world data.

Further topics to be covered in the subsequent lectures will include continuous distributions, simulation techniques, model assumptions, and more advanced statistical models. Continuous distributions such as the Normal, Exponential, and Gamma distributions are crucial for modeling continuous data and will be explored in detail. Simulation techniques are vital for understanding model behavior and validating statistical methods. Model assumptions, such as independence and normality, underpin the validity of statistical inferences and require careful consideration and checking. More advanced statistical models build upon these foundational concepts to address complex data structures and research questions.

It is recommended to review the textbook chapters mentioned in the introduction to reinforce the concepts discussed in this lecture and prepare for the upcoming topics. A solid grasp of these fundamental concepts is crucial for building a strong foundation in applied statistics and data analysis.