Applied Statistics and Data Analysis: Statistical Inference

Author

Your Name

Published

February 12, 2025

Introduction to Statistical Inference

What is Statistical Inference?

Statistical inference is the process of drawing conclusions about a population or a phenomenon of interest based on a random sample taken from that population or generated by that phenomenon. It is a major task in statistical analysis, allowing us to generalize findings from a limited dataset to a broader context.

Statistical inference provides methods to analyze data characterized by inherent random variability. The objective is to derive generally valid conclusions, even when data originates from a single set of observations. This is crucial in many scientific and practical applications where collecting data for the entire population is infeasible or impossible.

Parametric Statistical Models

For the most part, statistical inference relies on parametric statistical models. These models are families of probability distributions that are characterized by one or more unknown parameters. These parameters are intended to describe how the data might have been generated.

Parametric models simplify the complexity of real-world phenomena by assuming that the data-generating process can be approximated by a known probability distribution, such as the normal, Poisson, or binomial distribution, among others. The specific form of the distribution is chosen based on prior knowledge or assumptions about the nature of the data.

The primary aims of statistical inference within the parametric framework are:

  • Parameter Estimation: To infer the values of the unknown model parameters that are most consistent with the observed data. This involves using estimation techniques to obtain point estimates and interval estimates for the parameters.

  • Accuracy Assessment: To provide a measure of the accuracy of these inferential conclusions. This is typically achieved by quantifying the uncertainty associated with parameter estimates, often through standard errors and confidence intervals.

The focus of interest in statistical inference can vary depending on the research question:

  • Interpretation of Model Parameters: This involves understanding the substantive meaning and implications of the estimated parameters in the context of the phenomenon being studied. For example, in a linear regression model, the slope parameter might represent the change in the response variable for a unit increase in the predictor variable.

  • Prediction of Future Observations: Utilizing the estimated model to predict future outcomes or observations related to the phenomenon. This is crucial in forecasting, risk assessment, and decision-making. For instance, predicting future sales based on historical sales data and marketing expenditure.

Data and Parametric Models: Formalization

Let the observed data be represented as a vector \(y = (y_1, \ldots, y_n)\), where each \(y_i\) is an observation. We consider \(y\) as a realization of a random vector \(Y = (Y_1, \ldots, Y_n)\), which follows an unknown joint probability distribution.

A parametric statistical model is a family of joint density (or probability) functions \(\{f(y_1, \ldots, y_n; \theta) : \theta \in \Theta\}\), where \(\theta \in \Theta\) is a vector of parameters belonging to the parameter space \(\Theta\). The model is chosen with the aim of encompassing or suitably approximating the true, unknown probability distribution that generated the data.

The parameter vector \(\theta\) encapsulates the unknown parameters that define the density or probability functions within the chosen family. Crucially, specific components of \(\theta\) are designed to address the research questions concerning the system that generated the observed data \(y\). For example, \(\theta\) might include parameters representing the mean, variance, or regression coefficients, depending on the model.

Statistical models can also incorporate additional known information in the form of covariates or predictor variables, denoted as \(x\). These covariates are typically treated as fixed and are used to explain or predict the response variable \(Y\).

If the true value of \(\theta\) were known, a correctly specified statistical model would allow for the simulation of random data vectors that statistically resemble the observed data \(y\). This simulation capability is valuable for model validation and understanding the implications of different parameter values.

Random Sampling

A random sample is a subset of units selected from a larger population. In a (uniform) random sample, every unit in the population has an equal probability of being included in the sample.

Random sampling is a fundamental concept in statistical inference, ensuring that the sample is representative of the population from which it is drawn. This representativeness is crucial for generalizing findings from the sample to the broader population.

A random sample can also represent repeated measurements from a random experiment or observations of a random phenomenon over time. In such cases, each observation in the sample is considered to be independently and identically distributed (i.i.d.) according to the underlying probability distribution of the phenomenon.

Random sampling is the cornerstone of statistical inference, providing the methodological basis for drawing conclusions from a sample that are valid for the entire population or the random phenomenon under investigation. Without random sampling, the generalizability of statistical inferences becomes questionable.

When analyzing observed sample data \(y\), assumed to be generated by a random vector \(Y\), analysts often make assumptions about the distributional properties of the component random variables \(Y_i\). Common assumptions include:

  • Independence and Identical Distribution (i.i.d.): This is a widely used assumption, positing that the random variables \(Y_1, \ldots, Y_n\) are statistically independent of each other and are drawn from the same probability distribution. This assumption simplifies the statistical analysis and is often reasonable for well-designed experiments and surveys.

  • Independence but not Identical Distribution: In some situations, it may be appropriate to assume that the random variables are independent but not identically distributed. This might occur when observations are collected under different conditions or from different subgroups within the population.

  • Specific Forms of Dependence: In more complex scenarios, the random variables may exhibit structured dependence. Time series data, spatial data, and longitudinal data often involve dependencies between observations that need to be explicitly modeled.

Example: Temperatures

Consider a dataset \(y\) consisting of a 60-year record of mean annual temperatures (\(^\circ\)F) in New Haven, Connecticut, from 1912 to 1971. This dataset provides a historical record of temperature variations in a specific location.

A simple statistical model for these data is to assume that they are independent observations from a normal distribution \(N(\mu, \sigma^2)\), where the unknown parameters are \(\theta = (\mu, \sigma^2)\). Here, \(\mu\) represents the mean annual temperature, and \(\sigma^2\) represents the variance of annual temperatures.

The probability density function for a single temperature measurement \(Y_i\), for \(i = 1, \ldots, 60\), under the normal distribution assumption is: \[f(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mu)^2}{2\sigma^2} \right\} \label{eq:normal_density_temp}\] Assuming independence between annual temperature measurements, the joint density of the vector data \(Y = (Y_1, \ldots, Y_{60})\) is given by the product of the individual densities: \[f(y_1, \ldots, y_{60}; \mu, \sigma^2) = \prod_{i=1}^{60} f(y_i; \mu, \sigma^2) \label{eq:joint_normal_density_temp}\]

Numerical summaries of the observed temperature data \(y\) are: sample mean \(\bar{y} = 51.16\) and sample median \(y_{0.5} = 51.20\). These values provide initial "guesses" for the population mean \(\mu\). The sample variance \(s^2 = 1.60\) serves as a "guess" for the population variance \(\sigma^2\).

To visually assess the appropriateness of the normal distribution model, we can compare the empirical distribution of the data with the normal density function using a histogram and density plots.

Figure 1: Estimates of the generating probability distribution for New Haven temperatures: histogram, smooth density estimate, and normal density with \(\mu = 51.16\) and \(\sigma^2 = 1.60\) (normal density shown in red).

Upon examining the histogram in Figure 1, it is observed that the tails of the empirical data distribution appear to be heavier than those of the fitted normal density. This suggests that the normal distribution might not fully capture the extreme temperature variations in the data.

A potentially more suitable model could consider the data as independent observations from a Student’s t-distribution. The Student’s t-distribution is similar to the normal distribution but has heavier tails, which can better accommodate outliers and extreme values. In this case, we assume that \(Y_1, \ldots, Y_{60}\) are i.i.d. random variables such that: \[\frac{Y_i - \mu}{\sigma} \sim t(k) \label{eq:student_t_distribution_temp}\] where \(\theta = (\mu, \sigma, k)\) are the unknown parameters, and \(k\) represents the degrees of freedom of the t-distribution, controlling the tail heaviness.
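
To make the model comparison concrete, here is a minimal Python sketch (NumPy/SciPy) that fits both the normal and the Student's t model by maximum likelihood. Since the raw series is not reproduced here, the snippet uses synthetic stand-in data matching the reported summaries; with the real record (R's `nhtemp` dataset), a heavier-tailed fit would show up as a smaller estimated \(k\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data: 60 simulated annual temperatures with roughly the reported
# summaries (mean ~51.16, variance ~1.60); replace with the actual record
# (R's `nhtemp`) if available.
temps = rng.normal(loc=51.16, scale=np.sqrt(1.60), size=60)

# Normal model: MLEs are the sample mean and the uncorrected sample variance.
mu_hat = temps.mean()
sigma2_hat = temps.var()          # ddof=0 -> uncorrected variance (MLE)

# Student's t model: SciPy fits (df, loc, scale) by maximum likelihood.
k_hat, mu_t, sigma_t = stats.t.fit(temps)

print(f"normal fit: mu = {mu_hat:.2f}, sigma^2 = {sigma2_hat:.2f}")
print(f"t fit:      k = {k_hat:.1f}, mu = {mu_t:.2f}, sigma = {sigma_t:.2f}")
```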

Example: Roller Data

Consider an experiment designed to investigate the relationship between the weight of a lawn roller and the depression it creates on a lawn surface. Different weights of a roller (in tons, t) were applied to various parts of a lawn, and the resulting depression (in millimeters, mm) was measured.

Let the vector \(y\) represent the measured depression (the response variable), and the vector \(x\) represent the weights of the roller (the covariate or predictor variable), which are controlled and thus considered fixed.

A scatterplot of the depression measurements against the roller weights, with a superimposed least squares regression line (Figure 2), provides a visual summary of the relationship between these variables.

Figure 2: Scatterplot of lawn depression (y-axis, mm) versus roller weight (x-axis, t) with the superimposed least squares regression line shown in red.

The scatterplot in Figure 2 suggests a positive linear relationship between roller weight and lawn depression. As the weight of the roller increases, the depression tends to increase as well. This visual pattern motivates the use of a simple linear regression model to formally describe this relationship.

The simple linear regression model posits that the response variable \(Y_i\) for the \(i\)-th observation is linearly related to the predictor variable \(x_i\), with an added random error term \(\varepsilon_i\): \[Y_i = \alpha + \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \label{eq:linear_regression_model_roller}\] Here, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\varepsilon_i\) represents the random error, assumed to be normally distributed with mean 0 and constant variance \(\sigma^2\) (\(\varepsilon_i \sim N(0, \sigma^2)\)). Furthermore, the errors for different observations are assumed to be independent.

This model implies that, for a given roller weight \(x_i\) (treated as fixed), the observed depression \(y_i\) is a realization of a normally distributed random variable: \[Y_i \sim N(\alpha + \beta x_i, \sigma^2) \label{eq:response_normal_distribution_roller}\] and \(Y_i\) is independent of other response random variables \(Y_j\) for \(i \neq j\).

In this linear regression model, the parameter vector is \(\theta = (\alpha, \beta, \sigma^2)\). Using the least squares method to fit the model to the roller data, we obtain plausible "guesses" for the intercept and slope parameters: \(\hat{\alpha} = -2.087\) and \(\hat{\beta} = 2.667\). These estimates define the least squares regression line shown in Figure 2.
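
As a sketch of how these least squares estimates could be reproduced, the snippet below fits the simple linear regression with `scipy.stats.linregress`. The `weight` and `depression` arrays are values transcribed from the DAAG `roller` data and should be treated as illustrative; the point is the fitting call rather than the data themselves.

```python
import numpy as np
from scipy import stats

# Roller measurements: weight in tons (fixed covariate x), depression in mm
# (response y). Transcribed from the DAAG `roller` data; treat as illustrative.
weight = np.array([1.9, 3.1, 3.3, 4.8, 5.3, 6.1, 6.4, 7.6, 9.8, 12.4])
depression = np.array([2, 1, 5, 5, 20, 20, 23, 10, 30, 25], dtype=float)

fit = stats.linregress(weight, depression)
print(f"intercept alpha_hat = {fit.intercept:.3f}")   # compare with the reported -2.087
print(f"slope     beta_hat  = {fit.slope:.3f}")       # compare with the reported  2.667

# Prediction at a new (within-range) weight, e.g. an 8-ton roller.
x_new = 8.0
print(f"predicted depression at {x_new} t: {fit.intercept + fit.slope * x_new:.1f} mm")
```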

In terms of interpretation, the slope parameter \(\beta\) is of primary interest. It represents the expected change in lawn depression for each unit increase in roller weight. A positive \(\beta\) indicates that heavier rollers tend to produce greater depression. For prediction, the estimated regression model can be used to predict the depression for a given roller weight, including weights outside the range of those used in the experiment. However, caution is advised when extrapolating predictions far beyond the observed range of weights, as the linear relationship may not hold indefinitely.

Fundamental Inferential Questions

Given a dataset \(y\) and a chosen statistical model characterized by unknown parameters \(\theta\), statistical inference aims to answer several fundamental questions. These questions can be broadly categorized into four main types:

  1. Point Estimation: What are the most plausible single values for the unknown parameters \(\theta\), given the observed data \(y\)? Point estimation methods seek to find the "best" estimate of each parameter. Common point estimators include the sample mean, sample variance, and maximum likelihood estimators.

  2. Interval Estimation: Instead of a single value, can we provide a range of plausible values for \(\theta\) that are consistent with the data \(y\)? Interval estimation methods, such as confidence intervals, provide a measure of the uncertainty associated with parameter estimates by specifying a range within which the true parameter value is likely to lie.

  3. Hypothesis Testing: Is a pre-specified restriction or claim about \(\theta\) (a hypothesis) compatible with the data \(y\)? Hypothesis testing involves formulating null and alternative hypotheses about the parameters and using the data to assess the evidence against the null hypothesis. This helps in making decisions or drawing conclusions about the parameters.

  4. Model Selection and Checking: Is the chosen statistical model itself a good representation of the data-generating process? Model selection involves comparing different candidate models and choosing the one that best fits the data, while model checking assesses the adequacy of a chosen model by examining its assumptions and goodness-of-fit.

Beyond these core inferential questions, experimental and survey design plays a crucial role when the data collection process can be controlled. Careful design of experiments and surveys is essential to ensure that the collected data are informative and allow for precise and reliable answers to the inferential questions. Design considerations include sample size determination, randomization, and control of confounding variables.

There are two primary methodological frameworks for addressing these inferential questions: the frequentist approach and the Bayesian approach. These approaches differ fundamentally in their interpretation of probability and their methods for statistical inference.

The Frequentist Approach

The frequentist approach to statistical inference is rooted in classical probability theory and is the foundation for many traditional statistical methods taught in introductory courses.

In the frequentist framework:

  • Parameters as Fixed Unknowns: Model parameters \(\theta\) are considered to be fixed but unknown constants, representing true, unchanging characteristics of the population or phenomenon under study. The goal of inference is to learn about these fixed values using the information contained in the observed data \(y\).

  • Probability as Long-Run Frequency: Probability is interpreted as the long-run relative frequency of events in repeated experiments or sampling. Frequentist methods evaluate the performance of inferential procedures based on their behavior under hypothetical repeated sampling from the population.

Frequentist inference heavily relies on the concept of the likelihood function. For a given parametric model \(f(y_1, \ldots, y_n; \theta)\), the likelihood function is defined as: \[L(\theta; y) = f(y_1, \ldots, y_n; \theta), \quad \theta \in \Theta \label{eq:likelihood_function_frequentist}\] The likelihood function, viewed as a function of \(\theta\) for a fixed observed dataset \(y\), quantifies the plausibility of different parameter values \(\theta\) in light of the data. It represents the "probability" (or probability density) of observing the sample \(y\) that was actually obtained, for different possible values of the parameter \(\theta\).

A central principle in frequentist point estimation is to choose an estimate \(\hat{\theta}\) that maximizes the likelihood function. This estimate, known as the maximum likelihood estimate (MLE), is considered to be the parameter value that makes the observed data most probable under the assumed model. Mathematically, it is often more convenient to maximize the log-likelihood function, which is a monotonic transformation of the likelihood function and yields the same maximizer: \[\hat{\theta} = \arg\max_{\theta} \log L(\theta; y) \label{eq:mle_estimator_frequentist}\]
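
When no closed form is available, the maximization is carried out numerically. The following sketch, using synthetic data for illustration, minimizes the negative log-likelihood of a normal model with `scipy.optimize.minimize` and compares the result with the closed-form MLEs (sample mean and uncorrected sample variance).

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=200)   # synthetic data for illustration

def neg_log_lik(theta, data):
    """Negative log-likelihood of an i.i.d. N(mu, sigma^2) sample.
    theta = (mu, log_sigma): sigma is parameterized on the log scale
    so the optimizer cannot propose a negative standard deviation."""
    mu, log_sigma = theta
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(neg_log_lik, x0=np.array([0.0, 0.0]), args=(y,))
mu_mle, sigma_mle = res.x[0], np.exp(res.x[1])

print(f"numerical MLE:   mu = {mu_mle:.3f}, sigma^2 = {sigma_mle**2:.3f}")
print(f"closed-form MLE: mu = {y.mean():.3f}, sigma^2 = {y.var():.3f}")  # var() uses ddof=0
```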

Likelihood-based procedures provide a versatile and widely applicable framework for addressing the fundamental inferential problems outlined earlier. MLEs often possess desirable theoretical properties, such as consistency, asymptotic normality, and efficiency, under certain regularity conditions.

Within the frequentist framework, inference about \(\theta\) is often based on sample statistics, which are functions of the random variables in the sample. These statistics summarize the information in the data relevant to the parameters of interest.

For instance, if \(y = (y_1, \ldots, y_n)\) are i.i.d. observations from a normal distribution \(N(\mu, \sigma^2)\), the maximum likelihood estimators of \(\mu\) and \(\sigma^2\) are the sample mean \(\bar{Y}\) and the uncorrected sample variance \(\frac{n-1}{n}S^2\) (where \(S^2\) is the corrected sample variance), respectively: \[\begin{aligned} \hat{\mu} &= \bar{Y} \label{eq:mle_mean_frequentist} \\ \hat{\sigma}^2 &= \frac{n-1}{n}S^2 \label{eq:mle_variance_frequentist}\end{aligned}\]

Besides likelihood-based methods, another important frequentist approach is the least squares method, particularly prominent in regression analysis. Least squares estimation focuses on minimizing the sum of squared differences between the observed responses and the model’s predicted responses.

A simpler, more intuitive approach is the method of moments. This method equates sample moments (e.g., the sample mean, the sample variance) to their corresponding population moments, expressed as functions of the parameters, and solves for the parameters. For example, the sample mean is used to estimate the population mean and the (corrected) sample variance to estimate the population variance; in the same plug-in spirit, the sample median can be used to estimate the population median, although the median is a quantile rather than a moment.

Basic Concepts of Point Estimation

Estimators and Standard Errors

An estimator is a sample statistic (a function of the random sample) used to estimate a population parameter \(\theta\). The value of an estimator calculated from the observed data \(y\) is called the point estimate of \(\theta\), denoted as \(\hat{\theta} = \hat{\theta}(y)\).

To elaborate, an estimator is essentially a rule or a formula, based on the sample data, that we use to guess the value of an unknown parameter. Since it’s calculated from a random sample, the estimator itself is a random variable. The specific numerical value we get when we apply this rule to a particular observed dataset is the point estimate. For instance, if we want to estimate the average height of all students in a university, the sample mean height from a randomly selected group of students is a point estimate, and the method used to calculate this sample mean is the estimator.

Every estimator has a sampling distribution. This distribution describes how the estimator’s values would be spread out if we were to take many different random samples from the same population and calculate the estimator for each sample. Understanding the sampling distribution is crucial because it tells us about the estimator’s behavior and reliability. If the sampling distribution can’t be derived mathematically, simulation methods are often employed to approximate it.

Two key properties are highly desirable for any estimator:

  • Unbiasedness: An estimator \(\hat{\theta}\) is unbiased if its expected value is equal to the true value of the parameter \(\theta\), i.e., \(\mathbb{E}(\hat{\theta}) = \theta\). In simpler terms, if we were to take countless samples and calculate the point estimate each time, the average of these estimates would converge to the true parameter value. If \(\mathbb{E}(\hat{\theta}) \neq \theta\), the estimator is biased, and the bias is defined as \(Bias(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta\). We generally prefer estimators with low or zero bias.

  • Low Variance: The variance \(\mathbb{V}(\hat{\theta})\) of an estimator measures its precision. An estimator with low variance will produce estimates that are tightly clustered around its expected value. Lower variance indicates more consistent and reliable estimation.

In practice, achieving both unbiasedness and minimum variance simultaneously is not always possible. Often, there’s a trade-off, and choosing an estimator involves balancing these two properties. For example, one might accept a slightly biased estimator if it offers a significantly lower variance, leading to a smaller overall error in estimation.

Mean Squared Error and Standard Error

To combine the concepts of bias and variance into a single measure of estimator quality, we use the Mean Squared Error (MSE).

The mean squared error (MSE) of an estimator \(\hat{\theta}\) is defined as the expected value of the squared difference between the estimator and the true parameter value \(\theta\): \[MSE(\hat{\theta}) = \mathbb{E}\{(\hat{\theta} - \theta)^2\} = \mathbb{V}(\hat{\theta}) + |\mathbb{E}(\hat{\theta}) - \theta|^2 \label{eq:mse_definition}\] As shown, the MSE can be decomposed into the sum of the estimator’s variance and the square of its bias. This decomposition highlights how MSE accounts for both the variability and the systematic error of the estimator.

A lower MSE indicates a better estimator, as it implies that the estimator’s values are, on average, close to the true parameter value.

The standard error (SE) is the square root of the MSE: \[SE(\hat{\theta}) = \sqrt{MSE(\hat{\theta})} = \sqrt{\mathbb{V}(\hat{\theta}) + |\mathbb{E}(\hat{\theta}) - \theta|^2} \label{eq:se_definition}\] The standard error is interpreted as the standard deviation of the estimator around the true parameter value, incorporating both variance and bias. It provides a measure of the typical error magnitude in estimation, expressed in the same units as the parameter itself.

When reporting a point estimate, it is crucial to also provide its estimated standard error. Since the true SE often depends on unknown parameters, we estimate it using the sample data. This estimated SE gives an idea of the precision of our point estimate. If the estimator is unbiased, the SE simplifies to the standard deviation of the estimator’s sampling distribution, \(SE(\hat{\theta}) = \sqrt{\mathbb{V}(\hat{\theta})}\).
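
A short simulation makes the bias-variance decomposition of the MSE tangible: the sketch below, using simulated data with a known \(\sigma^2\), compares the corrected sample variance \(S^2\) (unbiased) with the uncorrected version \((n-1)S^2/n\) (biased, but with smaller variance) as estimators of \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true = 4.0
n, n_rep = 10, 100_000

# Repeated samples of size n from N(0, sigma2_true).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(n_rep, n))
s2_corrected = samples.var(axis=1, ddof=1)    # divides by n - 1 (unbiased)
s2_uncorrected = samples.var(axis=1, ddof=0)  # divides by n (biased)

for name, est in [("corrected S^2", s2_corrected),
                  ("uncorrected (n-1)S^2/n", s2_uncorrected)]:
    bias = est.mean() - sigma2_true
    var = est.var()
    mse = np.mean((est - sigma2_true) ** 2)   # approx var + bias^2
    print(f"{name:24s} bias = {bias:+.3f}  var = {var:.3f}  MSE = {mse:.3f}")
```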

In the quest for optimal estimators, Minimum Variance Unbiased Estimators (MVUE) are particularly sought after when unbiased estimation is desired. Among all unbiased estimators for a parameter, the MVUE is the one with the smallest variance, offering the highest precision within the class of unbiased estimators.

Another essential property for estimators is consistency.

An estimator \(\hat{\theta}\) is consistent if it converges in probability to the true parameter \(\theta\) as the sample size \(n\) approaches infinity, denoted as \(\hat{\theta} \xrightarrow{p} \theta\), as \(n \to +\infty\).

Consistency is a large-sample property, ensuring that with enough data, the estimator will get arbitrarily close to the true parameter value. It’s a fundamental requirement for estimators to be reliable in practice, especially when dealing with large datasets.

Under certain regularity conditions, Maximum Likelihood Estimators (MLEs) are known to be asymptotically optimal. They are consistent, asymptotically unbiased, and in large samples, their sampling distribution approaches a normal distribution. Furthermore, the variance of this asymptotic normal distribution achieves the Cramér-Rao Lower Bound, which is the theoretical minimum variance for any unbiased estimator. This asymptotic efficiency makes MLEs a cornerstone of frequentist inference.

Estimation of the Mean

The sample mean \(\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i\) is arguably the most fundamental estimator in statistics, widely used to estimate the population mean \(\mu\). It is intuitive, easy to calculate, and possesses several desirable statistical properties.

Assuming we have an i.i.d. sample \(Y_1, \ldots, Y_n\) from a population with mean \(\mu\) and variance \(\sigma^2\), the sample mean \(\bar{Y}\) exhibits the following key properties:

  • Unbiasedness: The sample mean is an unbiased estimator of the population mean, i.e., \(\mathbb{E}(\bar{Y}) = \mu\). This property holds regardless of the underlying distribution of the population, as long as the population mean exists.

  • Consistency: The sample mean is a consistent estimator of the population mean, i.e., \(\bar{Y} \xrightarrow{p} \mu\) as \(n \to +\infty\). As the sample size grows, the sample mean gets closer to the true population mean.

The Central Limit Theorem (CLT) further elucidates the importance of the sample mean. It states that for a sufficiently large sample size \(n\), the sampling distribution of the sample mean \(\bar{Y}\) is approximately normal, regardless of the original population distribution’s shape. Specifically, \(\bar{Y} \approx N(\mu, \sigma^2/n)\) for large \(n\). If the population itself is normally distributed, then the sample mean \(\bar{Y}\) is exactly normally distributed for any sample size \(n\).

The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the sample mean \(\bar{Y}\). It quantifies the variability of sample means around the population mean. \[SEM(\bar{Y}) = \frac{\sigma}{\sqrt{n}} \label{eq:sem_definition}\]

The SEM is inversely proportional to the square root of the sample size \(n\). This means that as we increase the sample size, the SEM decreases, indicating that our estimate of the mean becomes more precise.

In practice, the population standard deviation \(\sigma\) is often unknown. Therefore, we estimate the SEM by replacing \(\sigma\) with the sample standard deviation \(S\), calculated from the data. The estimated standard error of the mean is: \[\widehat{SEM}(\bar{Y}) = \frac{S}{\sqrt{n}} \label{eq:estimated_sem}\] where \(S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}\) is the sample standard deviation, and \(S^2\) is the (corrected) sample variance, an unbiased estimator of \(\sigma^2\).
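
As a minimal illustration (with a simulated sample standing in for real data), the estimated SEM is computed as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=51.2, scale=1.3, size=60)   # hypothetical i.i.d. sample

n = y.size
y_bar = y.mean()
s = y.std(ddof=1)          # corrected sample standard deviation
sem_hat = s / np.sqrt(n)   # estimated standard error of the mean

print(f"n = {n}, mean = {y_bar:.2f}, s = {s:.2f}, estimated SEM = {sem_hat:.3f}")
```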

Estimation of the Difference of Means

In comparative studies, we frequently want to estimate the difference between the means of two distinct populations. Suppose we have two independent random samples: \(X_1, \ldots, X_{n_X}\) from population X and \(Y_1, \ldots, Y_{n_Y}\) from population Y. We are interested in estimating the difference between the population means, \(\mu_X - \mu_Y\).

A natural and unbiased estimator for \(\mu_X - \mu_Y\) is the sample difference of means \(\bar{X} - \bar{Y}\), where \(\bar{X} = \frac{1}{n_X} \sum_{i=1}^{n_X} X_i\) and \(\bar{Y} = \frac{1}{n_Y} \sum_{i=1}^{n_Y} Y_i\) are the sample means for the two groups.

If \(SEM_X = \frac{\sigma_X}{\sqrt{n_X}}\) and \(SEM_Y = \frac{\sigma_Y}{\sqrt{n_Y}}\) are the standard errors of \(\bar{X}\) and \(\bar{Y}\) respectively, and assuming the two samples are independent, the standard error of the difference (SED) \(\bar{X} - \bar{Y}\) is given by: \[SED(\bar{X} - \bar{Y}) = \sqrt{SEM_X^2 + SEM_Y^2} = \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}} \label{eq:sed_definition}\] This formula arises from the property that the variance of the difference of two independent random variables is the sum of their variances.

In practice, \(\sigma_X^2\) and \(\sigma_Y^2\) are usually unknown and are estimated by the sample variances \(S_X^2\) and \(S_Y^2\). The estimated standard error of the difference is then: \[\widehat{SED}(\bar{X} - \bar{Y}) = \sqrt{\widehat{SEM}_X^2 + \widehat{SEM}_Y^2} = \sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}} \label{eq:estimated_sed_point_est}\]

In some cases, it may be reasonable to assume that the two populations have a common population standard deviation \(\sigma\), i.e., \(\sigma_X^2 = \sigma_Y^2 = \sigma^2\). In this situation, we can obtain a pooled estimate of the common variance, called the pooled sample variance \(S_p^2\). The pooled sample variance combines information from both samples to estimate \(\sigma^2\): \[S_p^2 = \frac{\sum_{i=1}^{n_X}(X_i - \bar{X})^2 + \sum_{i=1}^{n_Y}(Y_i - \bar{Y})^2}{n_X + n_Y - 2} = \frac{(n_X - 1)S_X^2 + (n_Y - 1)S_Y^2}{n_X + n_Y - 2} \label{eq:pooled_variance_point_est}\] The denominator \(n_X + n_Y - 2\) represents the degrees of freedom for the pooled variance estimate.

Using the pooled variance estimate, the estimated standard error for the difference under the assumption of equal variances becomes: \[\widehat{SED}_{pooled}(\bar{X} - \bar{Y}) = S_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}} \label{eq:estimated_sed_pooled_point_est}\] This pooled standard error is used in the independent samples t-test when assuming equal population variances.

Example: Elastic Bands

Let’s consider the elastic bands experiment described earlier, where we investigated the effect of heat on the stretch of elastic bands. We had two independent groups: a control group (X) and a heated group (Y).

The summary statistics are given as: \(\bar{x} = 253.5\) mm, \(\bar{y} = 244.1\) mm, \(s_X = 9.92\) mm, \(s_Y = 11.73\) mm, \(n_X = 10\), and \(n_Y = 11\). The standard errors of the means are \(SEM_X = \frac{s_X}{\sqrt{n_X}} = \frac{9.92}{\sqrt{10}} \approx 3.14\) mm and \(SEM_Y = \frac{s_Y}{\sqrt{n_Y}} = \frac{11.73}{\sqrt{11}} \approx 3.54\) mm.

Since the sample standard deviations \(s_X\) and \(s_Y\) are quite similar (9.92 and 11.73), it is reasonable to assume equal population variances and use the pooled standard deviation estimate. The pooled sample variance is calculated as: \[\begin{aligned} s_p^2 &= \frac{(n_X - 1)s_X^2 + (n_Y - 1)s_Y^2}{n_X + n_Y - 2} \\ &= \frac{(10 - 1)(9.92)^2 + (11 - 1)(11.73)^2}{10 + 11 - 2} \\ &\approx 119.03\end{aligned}\] The pooled standard deviation is \(s_p = \sqrt{s_p^2} \approx \sqrt{119.03} \approx 10.91\) mm.

The difference in sample means is \(\bar{x} - \bar{y} = 253.5 - 244.1 = 9.4\) mm. The standard error of the difference (SED) is calculated using the individual SEMs: \[\begin{aligned} SED &= \sqrt{SEM_X^2 + SEM_Y^2} \\ &= \sqrt{(3.14)^2 + (3.54)^2} \\ &\approx \sqrt{9.8596 + 12.5316} \\ &\approx \sqrt{22.3912} \\ &\approx 4.73 \text{ mm}\end{aligned}\] Using the pooled standard deviation, the estimated SED is: \[\begin{aligned} \widehat{SED}_{pooled} &= s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}} \\ &= 10.91 \sqrt{\frac{1}{10} + \frac{1}{11}} \\ &\approx 10.91 \times \sqrt{0.0909 + 0.1} \\ &\approx 10.91 \times \sqrt{0.1909} \\ &\approx 10.91 \times 0.4369 \\ &\approx 4.77 \text{ mm}\end{aligned}\] The ratio of the mean difference to the SED is \(\frac{9.4}{4.77} \approx 1.97\). This ratio, often referred to as a t-statistic in hypothesis testing, indicates that the observed difference in means is approximately 1.97 times larger than the standard error of the difference. This suggests a potentially meaningful difference between the groups, which can be further investigated using hypothesis testing or confidence intervals.
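
The arithmetic above can be reproduced directly from the reported summary statistics; the sketch below computes both the unpooled and the pooled SED and the resulting ratio.

```python
import numpy as np

# Summary statistics reported for the elastic bands experiment.
x_bar, y_bar = 253.5, 244.1      # sample means (mm)
s_x, s_y = 9.92, 11.73           # sample standard deviations (mm)
n_x, n_y = 10, 11                # sample sizes

sed = np.sqrt(s_x**2 / n_x + s_y**2 / n_y)                  # unpooled SED
s_p = np.sqrt(((n_x - 1) * s_x**2 + (n_y - 1) * s_y**2) / (n_x + n_y - 2))
sed_pooled = s_p * np.sqrt(1 / n_x + 1 / n_y)               # pooled SED

diff = x_bar - y_bar
print(f"difference in means = {diff:.1f} mm")
print(f"SED = {sed:.2f} mm, pooled SED = {sed_pooled:.2f} mm")
print(f"ratio (t-like statistic) = {diff / sed_pooled:.2f}")
```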

Estimation of a Proportion

The goal is to estimate the probability of success \(p\) in a sequence of \(n\) i.i.d. Bernoulli trials, represented by random variables \(Y_1, \ldots, Y_n \sim Ber(p)\). Equivalently, the total number of successes \(\sum_{i=1}^{n} Y_i\) can be viewed as a realization of a Binomial random variable \(Bi(n, p)\).

An unbiased and consistent estimator for \(p\) is the sample proportion \(\hat{p}\), which is the observed proportion of successes in \(n\) trials. This is equivalent to the sample mean of the Bernoulli random variables: \[\hat{p} = \bar{Y} \label{eq:sample_proportion}\]

The associated standard error for the sample proportion is: \[SE = \sqrt{\frac{p(1-p)}{n}} \label{eq:se_proportion}\] This standard error is estimated by substituting the unknown \(p\) with its estimate \(\hat{p}\): \[\widehat{SE} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \label{eq:estimated_se_proportion}\]

In a random sample of \(n = 132\) freshmen, we want to estimate the proportion of freshmen displaced from their home. Suppose 37 out of 132 freshmen are displaced. Then, the estimated proportion is \(\hat{p} = \frac{37}{132} \approx 0.28\). The estimated standard error is \(\widehat{SE} = \sqrt{\frac{0.28(1-0.28)}{132}} \approx 0.039\).
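
The same figures follow from a two-line computation, sketched here:

```python
import numpy as np

n, successes = 132, 37
p_hat = successes / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat = {p_hat:.3f}, estimated SE = {se_hat:.3f}")   # approx 0.280 and 0.039
```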

Sampling Distribution of z- and t- statistics

Consider an i.i.d. sample \(Y = (Y_1, \ldots, Y_n)\) from a normal distribution \(N(\mu, \sigma^2)\). The z-statistic (standardized sample mean), when \(\sigma\) is known, is: \[Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \label{eq:z_statistic}\] This z-statistic follows a standard normal distribution. However, it is not directly useful in practice when \(\sigma\) is unknown. Even when the component r.v.’s are not normally distributed, this result holds approximately for large \(n\) due to the Central Limit Theorem.

When \(\sigma\) is unknown, we use the t-statistic (studentized sample mean), obtained by substituting \(\sigma\) with the sample standard deviation \(S\): \[T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1} \label{eq:t_statistic}\] The t-statistic follows a t-distribution with \(n-1\) degrees of freedom.

Given observed data \(y\), the formula \(t = (\bar{y} - \mu) / \widehat{SEM}\) provides a standardized measure (in number of SEMs) of the distance between the sample mean \(\bar{y}\) and the true population mean \(\mu\).
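
A quick simulation illustrates why the studentized mean is compared with \(t_{n-1}\) rather than \(N(0, 1)\), especially for small \(n\): the sketch below simulates many studentized means from normal samples and compares an upper-tail frequency with the two reference distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n, n_rep = 0.0, 1.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(n_rep, n))
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

cutoff = 2.0
print(f"simulated P(T > {cutoff})    = {np.mean(t_stats > cutoff):.4f}")
print(f"t with {n - 1} df, tail prob = {stats.t.sf(cutoff, df=n - 1):.4f}")
print(f"N(0, 1), tail prob       = {stats.norm.sf(cutoff):.4f}")
```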

Estimation of the Variance

The (corrected) sample variance \(S^2\) is widely used as an estimator for the population variance \(\sigma^2\).

Under i.i.d. assumptions, the sample variance is:

  • Unbiased: \(\mathbb{E}(S^2) = \sigma^2\).

  • Consistent: \(S^2 \xrightarrow{p} \sigma^2\), as \(n \to +\infty\).

If the data are i.i.d. observations from a normal distribution \(N(\mu, \sigma^2)\), then the distribution of \(S^2\) is related to the chi-squared distribution: \[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \label{eq:chi_squared_variance}\] That is, \(S^2\) follows a scaled chi-squared distribution with \(n-1\) degrees of freedom.

When comparing variances from two independent i.i.d. samples of sizes \(n_X\) and \(n_Y\), the comparison is often based on the sample variance ratio \(S_X^2 / S_Y^2\), where \(S_X^2\) and \(S_Y^2\) are the respective corrected sample variances.

If the samples are from normal distributions \(N(\mu_X, \sigma_X^2)\) and \(N(\mu_Y, \sigma_Y^2)\), the variance-ratio statistic follows an F-distribution: \[\frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} \sim F(n_X-1, n_Y-1) \label{eq:f_distribution_variance_ratio}\] In particular, if \(\sigma_X^2 = \sigma_Y^2\), then \(\frac{S_X^2}{S_Y^2} \sim F(n_X-1, n_Y-1)\).
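
The variance-ratio result can likewise be checked by simulation under equal variances; the sketch below compares a simulated upper percentile of \(S_X^2/S_Y^2\) with the corresponding F quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_x, n_y, sigma2, n_rep = 8, 12, 2.0, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n_x))
y = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n_y))
ratios = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

# Upper 5% point of the simulated ratio vs. the F(n_x - 1, n_y - 1) quantile.
print(f"simulated 95th percentile: {np.quantile(ratios, 0.95):.3f}")
print(f"F({n_x - 1}, {n_y - 1}) 95th percentile: "
      f"{stats.f.ppf(0.95, dfn=n_x - 1, dfd=n_y - 1):.3f}")
```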

Basic Concepts of Interval Estimation

Introduction to Confidence Intervals

Confidence intervals (interval estimates) provide more informative estimation results than point estimates alone. They give an entire set of plausible values (usually an interval) for the population parameter.

Interval estimation also provides an implicit idea of the accuracy of the estimation procedure.

A \((1 - \alpha)100\%\) confidence interval for a scalar parameter \(\theta\) is an observation of a random interval, constructed from a suitable sample statistic, that is designed to have a prescribed probability \(1-\alpha\) (confidence level) of containing the true value of \(\theta\).

The inferential procedure is designed such that, over repeated sampling, the constructed intervals will include the true parameter value in a proportion of times equal to the confidence level.

It is crucial to remember that the confidence level refers to the probability associated with the random interval before data is observed, not to the probability of a specific observed confidence interval containing the true parameter value. Once the interval is calculated from data, it either contains the true value or it does not; the probability statement applies to the method’s long-run performance.

Confidence Interval for the Mean (Known Variance)

Consider an i.i.d. sample from a normal distribution \(N(\mu, \sigma^2)\) with \(\sigma^2\) known. Using the z-statistic, it can be shown that: \[P\left( \bar{Y} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{Y} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \label{eq:ci_mean_known_variance_prob}\] where \(z_{\alpha/2}\) is the \(\alpha/2\)-critical value of a \(N(0, 1)\) distribution (i.e., the value such that \(P(Z > z_{\alpha/2}) = \alpha/2\) for \(Z \sim N(0,1)\)).

Given an i.i.d. sample from a normal distribution \(N(\mu, \sigma^2)\) with known variance \(\sigma^2\), a \((1 - \alpha)100\%\) confidence interval for the population mean \(\mu\) is given by the random interval: \[\left[ \bar{Y} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right] = \left[ \bar{Y} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} , \bar{Y} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right] \label{eq:ci_mean_known_variance_interval}\] where \(z_{\alpha/2}\) is the \(\alpha/2\)-critical value of the standard normal distribution.

The observed confidence interval is obtained by substituting the sample mean \(\bar{y}\) for \(\bar{Y}\): \[\left[ \bar{y} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right] = \left[ \bar{y} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} , \bar{y} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right] \label{eq:observed_ci_mean_known_variance}\] This observed interval summarizes the information provided by the observed data about the unknown population mean \(\mu\).
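
A minimal sketch of computing an observed z-interval, using hypothetical values for \(\sigma\), \(n\), and \(\bar{y}\):

```python
import numpy as np
from scipy import stats

# Hypothetical values: known sigma, observed sample mean and sample size.
sigma, n, y_bar, alpha = 1.2, 40, 51.3, 0.05

z_crit = stats.norm.ppf(1 - alpha / 2)      # alpha/2-critical value of N(0, 1)
half_width = z_crit * sigma / np.sqrt(n)
print(f"{(1 - alpha) * 100:.0f}% CI for mu: "
      f"[{y_bar - half_width:.2f}, {y_bar + half_width:.2f}]")
```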

Confidence Interval for the Mean (Unknown Variance)

When \(\sigma^2\) is unknown, the \((1 - \alpha)100\%\)-level confidence interval for \(\mu\) is based on the t-statistic and is given by:

Given an i.i.d. sample from a normal distribution \(N(\mu, \sigma^2)\) with unknown variance \(\sigma^2\), a \((1 - \alpha)100\%\) confidence interval for the population mean \(\mu\) is given by the random interval: \[\left[ \bar{Y} \pm t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} \right] = \left[ \bar{Y} - t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} , \bar{Y} + t_{n-1;\alpha/2} \frac{S}{\sqrt{n}} \right] \label{eq:ci_mean_unknown_variance}\] where \(t_{n-1;\alpha/2}\) is the \(\alpha/2\)-critical value of a t-distribution with \(n-1\) degrees of freedom, and \(S\) is the sample standard deviation.

A simulation was conducted to demonstrate the coverage of 95% confidence intervals. 100 normal samples of size \(n=15\) were simulated from a \(N(0, 1)\) distribution (\(\mu = 0, \sigma^2 = 1\)). For each sample, a 95% confidence interval was calculated using the t-statistic. In this simulation, four intervals failed to contain the true value \(\mu = 0\).

Figure: 95% confidence intervals from 100 simulated samples of size \(n = 15\) from \(N(0, 1)\). The four intervals not containing the true mean \(\mu = 0\) (horizontal line at zero) are highlighted in red.

The estimated confidence level from this simulation is \(\frac{96}{100} = 0.96\), or 96%. Increasing the number of simulated samples would bring this result closer to the nominal 95% level.
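
The simulation just described can be reproduced with a short script; the seed, and hence the exact number of non-covering intervals, is arbitrary and will vary from run to run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, n, n_sim, alpha = 0.0, 1.0, 15, 100, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

covered = 0
for _ in range(n_sim):
    y = rng.normal(mu, sigma, size=n)
    half_width = t_crit * y.std(ddof=1) / np.sqrt(n)
    covered += (y.mean() - half_width <= mu <= y.mean() + half_width)

print(f"estimated coverage: {covered}/{n_sim} = {covered / n_sim:.2f}")
```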

Example: Cork Stoppers

Data on the total perimeter of defects (in pixels) measured in \(n = 50\) high quality cork stoppers.

Numerical summaries: \(\bar{y} = 365\), \(y_{0.5} = 363\), \(S^2 = 12167\), \(S = 110\), \(SEM = S/\sqrt{50} = 15.6\).

Observations are interpreted as i.i.d. realizations of a normal distribution. The 95% confidence interval for \(\mu\) is: \[\left[ \bar{y} \pm t_{49;0.025} SEM \right] = \left[ 365 \pm t_{49;0.025} \times 15.6 \right] \approx [334, 396] \label{eq:ci_cork_stoppers}\] where \(t_{49;0.025} \approx 2.01\).

The observed confidence interval \([334, 396]\) is produced by a statistical procedure that carries a 5% risk of giving a wrong result, that is, of yielding an interval that does not contain the true value of \(\mu\).
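
The interval can be verified from the reported summaries (a sketch, using \(t_{49;0.025}\) from SciPy rather than the rounded value 2.01):

```python
import numpy as np
from scipy import stats

n, y_bar, s = 50, 365.0, 110.0
sem = s / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)                  # approx 2.01
lo, hi = y_bar - t_crit * sem, y_bar + t_crit * sem
print(f"95% CI for mu: [{lo:.0f}, {hi:.0f}]")          # approx [334, 396]
```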

Conclusion

In this lecture, we have reviewed the fundamental concepts of statistical inference, focusing on point estimation and interval estimation within the frequentist framework. We started by defining statistical inference and parametric statistical models, emphasizing the role of random sampling. Key inferential questions were introduced, including point estimation, interval estimation, hypothesis testing, and model selection.

For point estimation, we discussed estimators, standard errors, mean squared error, and important properties such as unbiasedness, consistency, and efficiency. We explored the estimation of means, differences of means, and proportions, along with the sampling distributions of relevant statistics like z- and t-statistics.

Moving to interval estimation, we introduced confidence intervals as a method to provide a range of plausible values for population parameters, offering a measure of estimation accuracy. We specifically examined confidence intervals for the mean under both known and unknown variance scenarios, using z- and t-distributions respectively.

Key Takeaways:

  • Statistical inference is the process of drawing conclusions about populations from sample data, relying on probabilistic models to account for inherent data variability.

  • Parametric statistical models are families of probability distributions indexed by a set of parameters, which we aim to estimate to understand the data-generating process.

  • Point estimators provide single-value "guesses" for unknown parameters, while interval estimators provide a range of plausible values, reflecting estimation uncertainty.

  • Standard error is a crucial measure that quantifies the typical variability or imprecision of a point estimator. It is the standard deviation of the sampling distribution of the estimator.

  • Confidence intervals, constructed with a specified confidence level, are designed to contain the true parameter value in a certain proportion of repeated samples, providing a measure of estimation reliability.

  • The choice between z-intervals and t-intervals for the mean depends on whether the population variance is known or unknown, and the sample size.

In the next lecture, we will delve into the realm of hypothesis testing. This powerful framework allows us to formally assess specific claims or hypotheses about population parameters using sample data. We will explore how to formulate hypotheses, calculate test statistics, and interpret p-values to make informed decisions based on statistical evidence. Hypothesis testing provides a structured approach to answering questions about the validity of assumptions or theories in various fields of study.