Sample Mean Estimator¶
The Story Behind the Mathematics¶
The arithmetic mean is probably the oldest and most intuitive statistical concept in human history. Ancient Babylonians (around 2000 BC) already calculated averages to fairly distribute resources and lands. However, using the mean as a statistical estimator — a systematic way to infer the "true" value of a population from a sample — is a much more recent idea.
Carl Friedrich Gauss (1777-1855) was among the first to mathematically formalize the sample mean. In 1809, while working on astronomy and geodesy problems, Gauss wondered: "If I make \(n\) measurements of the same quantity (like a planet's position), and each measurement has random error, what final value should I report?"
Gauss proved that if errors follow a normal distribution (which he himself had characterized), the arithmetic mean is the optimal estimator — the one that minimizes mean squared error. This discovery was revolutionary: it transformed the mean from a simple "center of data" into an inference tool with provable mathematical properties.
Pierre-Simon Laplace (1749-1827) had already used the mean in his work on probability theory, but it was Gauss who established its theoretical primacy. In his monumental "Theoria Motus Corporum Coelestium" (1809), Gauss wrote:
"The arithmetic mean of many observations is always more reliable than a single observation."
This principle, obvious today, was not at all clear at the time. Gauss's mathematical formalization provided the foundation for all of modern statistics.
In the 20th century, Ronald Fisher proved that the sample mean is not just "good," but optimal in a precise sense: among all unbiased estimators of the mean of a normal population, the sample mean has the minimum variance (it's the UMVUE - Uniformly Minimum Variance Unbiased Estimator). This property, known as efficiency, completes the theoretical justification for why the sample mean is so universally used.
Why It Matters¶
The sample mean estimator is the foundation of nearly all statistical analyses. It's used in:
- Surveys: estimating average population opinion from a sample
- Scientific experiments: combining repeated measurements to reduce error
- Quality control: monitoring the average value of a production process
- Machine Learning: calculating data statistics for normalization and preprocessing
- Economics: estimating average income, average GDP growth, etc.
- Medicine: comparing average treatment effectiveness in clinical trials
Without a rigorous understanding of the sample mean and its properties, we couldn't quantify the uncertainty of our estimates or construct confidence intervals and hypothesis tests.
Prerequisites¶
- Concept of random variable and distribution
- Expected Value (mean of a distribution)
- Variance and variance properties
- Independence of random variables
The Estimator¶
Suppose we have a random sample of \(n\) observations \(X_1, X_2, \ldots, X_n\) from a population with (unknown) mean \(\mu\) and variance \(\sigma^2\). The observations are independent and identically distributed (i.i.d.).
The sample mean estimator is:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
Crucial distinction:

- \(\mu\) is a fixed (but unknown) population parameter
- \(\bar{X}\) is an estimator — a random variable that depends on the sample
- \(\bar{x}\) is the estimate — the numerical value computed from a specific sample \(x_1, \ldots, x_n\)
Derivation of Properties¶
Property 1: Unbiasedness¶
An estimator is unbiased if its expected value equals the true parameter. Let's prove that \(E[\bar{X}] = \mu\).
Proof:
By definition:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

Take the expected value of both sides:

\[
E[\bar{X}] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right]
\]

Key property: Expected value is a linear operator, so we can bring it inside the sum and extract constants:

\[
E[\bar{X}] = \frac{1}{n} E\left[\sum_{i=1}^{n} X_i\right]
\]

By linearity of expectation, the expected value of the sum is the sum of expected values:

\[
E[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} E[X_i]
\]

i.i.d. assumption: Since all \(X_i\) come from the same population, \(E[X_i] = \mu\) for all \(i\):

\[
E[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu
\]
Conclusion: \(E[\bar{X}] = \mu\). The sample mean is an unbiased estimator of the population mean.
Interpretation: If we repeated sampling infinitely many times and calculated \(\bar{X}\) each time, the average value of all these estimates would converge exactly to \(\mu\). There's no systematic bias.
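The "repeated sampling" interpretation can be checked with a quick Monte Carlo sketch. The population parameters `MU`, `SIGMA` and the sample size `N` below are illustrative choices, not values from the text:

```python
import random
import statistics

random.seed(42)
MU, SIGMA, N = 50.0, 10.0, 25   # illustrative population parameters and sample size

# Draw many independent samples of size N and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(20_000)
]

# Unbiasedness: the average of the sample means should sit very close to MU.
print(statistics.fmean(sample_means))
```

Any single `\(\bar{x}\)` misses \(\mu\), but the average over many repetitions does not drift in either direction.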
Property 2: Variance of the Sample Mean¶
The variance of \(\bar{X}\) measures how much estimates fluctuate around \(\mu\) across different samples. Let's prove that:

\[
\text{Var}(\bar{X}) = \frac{\sigma^2}{n}
\]
Proof:
By definition:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

Take the variance:

\[
\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)
\]

Key property: For a constant \(c\), \(\text{Var}(cY) = c^2 \text{Var}(Y)\). So extracting \(\frac{1}{n}\) produces \(\frac{1}{n^2}\):

\[
\text{Var}(\bar{X}) = \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^{n} X_i\right)
\]

Independence assumption: If \(X_1, \ldots, X_n\) are independent, the variance of the sum is the sum of variances:

\[
\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i)
\]

Why? For independent variables, there's no covariance: if \(X \perp Y\), then \(\text{Cov}(X, Y) = 0\), so all the cross terms vanish.

Identical distribution assumption: Each \(X_i\) has the same variance \(\sigma^2\):

\[
\sum_{i=1}^{n} \text{Var}(X_i) = n\sigma^2
\]

Substituting:

\[
\text{Var}(\bar{X}) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}
\]
Conclusion: The variance of the sample mean is \(\frac{\sigma^2}{n}\).
Crucial interpretation:

- Variance decreases with \(n\) (more data → more precise estimates)
- It decreases like \(\frac{1}{n}\), not \(\frac{1}{n^2}\)
- To halve the standard deviation of \(\bar{X}\), you need 4 times the data (because \(\text{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\))
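The \(\sigma^2 / n\) scaling is easy to verify empirically. This sketch (with made-up parameters and a helper name `var_of_sample_mean` of our choosing) estimates \(\text{Var}(\bar{X})\) at two sample sizes:

```python
import random
import statistics

random.seed(0)
MU, SIGMA = 0.0, 4.0   # illustrative population parameters (sigma^2 = 16)

def var_of_sample_mean(n, reps=20_000):
    """Empirical variance of the sample mean across many samples of size n."""
    means = [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.pvariance(means)

# Theory predicts sigma^2 / n: 16 / 4 = 4.0 and 16 / 16 = 1.0.
print(var_of_sample_mean(4), var_of_sample_mean(16))
```

Quadrupling \(n\) cuts the variance by 4, i.e. the standard deviation by 2, matching the third bullet above.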
Property 3: Standard Error of the Mean¶
The standard deviation of \(\bar{X}\) is called the standard error of the mean (SEM):

\[
\text{SE}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
\]
Practical problem: In reality, \(\sigma\) is unknown. How do we estimate SE?
We use the sample standard deviation \(s\) as an estimate of \(\sigma\):

\[
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]
Why \(n-1\) instead of \(n\)? We use \(n-1\) to get an unbiased estimator of \(\sigma^2\) (Bessel's correction). When we compute \((x_i - \bar{x})^2\), we use \(\bar{x}\) instead of the true \(\mu\). This introduces a dependency that "uses up" one degree of freedom.
The estimated standard error is:

\[
\widehat{\text{SE}}(\bar{X}) = \frac{s}{\sqrt{n}}
\]
Interpretation: The standard error measures the precision of our estimate. A small SE means that if we repeated the experiment, we'd get similar estimates of \(\bar{X}\). A large SE means high variability.
Numerical example: If we measure student heights with \(s = 10\) cm and \(n = 100\):

\[
\widehat{\text{SE}}(\bar{X}) = \frac{10}{\sqrt{100}} = 1 \text{ cm}
\]
This means that if we repeated sampling, the sample mean would typically fluctuate by about 1 cm around the true value \(\mu\).
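As a sketch, the same computation in Python. The height values below are made-up illustrative data; note that `statistics.stdev` already uses the \(n - 1\) denominator:

```python
import math
import statistics

# Hypothetical height measurements in cm (illustrative data only).
heights = [172.0, 168.5, 181.2, 175.3, 169.8, 177.1, 174.4, 170.9]

n = len(heights)
s = statistics.stdev(heights)   # sample SD, denominator n - 1 (Bessel's correction)
sem = s / math.sqrt(n)          # estimated standard error of the mean
print(round(s, 2), round(sem, 2))
```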
Property 4: Sampling Distribution (Normal Case)¶
If data come from a normal distribution \(X_i \sim \mathcal{N}(\mu, \sigma^2)\), then the sample mean has distribution:

\[
\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
\]
Intuitive proof:
The sum of independent normal variables is still normal. If \(X_i \sim \mathcal{N}(\mu, \sigma^2)\) and are independent:

\[
\sum_{i=1}^{n} X_i \sim \mathcal{N}(n\mu, n\sigma^2)
\]

Dividing by \(n\) (a linear transformation):

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
\]

Standardization: We can standardize \(\bar{X}\):

\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)
\]
This is the basis for constructing confidence intervals and hypothesis tests.
Case with unknown \(\sigma\): If we replace \(\sigma\) with its estimate \(s\), the statistic becomes:

\[
T = \frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1}
\]

which follows a Student's t-distribution with \(n-1\) degrees of freedom.
Property 5: Central Limit Theorem (CLT)¶
Remarkable result: Even if the \(X_i\) do not come from a normal distribution, for sufficiently large \(n\), the distribution of \(\bar{X}\) is approximately normal:

\[
\bar{X} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n
\]

More precisely:

\[
\sqrt{n} \, \frac{\bar{X} - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty
\]
Implications:

- We don't need to assume data normality to do inference on the mean (if \(n\) is large)
- "Large" depends on the shape of the original distribution; often \(n \geq 30\) suffices
- This explains why the normal distribution is so ubiquitous in statistics
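A minimal CLT sketch: the exponential distribution is strongly right-skewed, yet sample means of size 50 behave almost normally. We check this by counting how often \(\bar{X}\) lands within \(\pm 1.96\) theoretical standard errors of \(\mu\) (the sample size and seed are our illustrative choices):

```python
import random
import statistics

random.seed(1)
N = 50   # sample size; illustrative choice

# Exponential(1) data: mean 1, variance 1, strongly right-skewed (non-normal).
means = [statistics.fmean(random.expovariate(1.0) for _ in range(N))
         for _ in range(10_000)]

mu, se = 1.0, (1.0 / N) ** 0.5    # theoretical mean and SD of the sample mean
inside = sum(abs(m - mu) < 1.96 * se for m in means) / len(means)
print(inside)   # close to 0.95 if the normal approximation is accurate
```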
Property 6: Consistency¶
An estimator is consistent if it converges in probability to the true value as \(n \to \infty\):

\[
\bar{X} \xrightarrow{P} \mu, \quad \text{i.e.} \quad \lim_{n \to \infty} P\left(|\bar{X} - \mu| > \epsilon\right) = 0 \text{ for every } \epsilon > 0
\]
Proof via Chebyshev's Inequality:
For any \(\epsilon > 0\):

\[
P\left(|\bar{X} - \mu| \geq \epsilon\right) \leq \frac{\text{Var}(\bar{X})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \xrightarrow{n \to \infty} 0
\]
Interpretation: With enough data, the probability that \(\bar{X}\) deviates from \(\mu\) by more than any fixed amount \(\epsilon\) becomes arbitrarily small.
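Consistency can be seen on a single growing data stream: the running sample mean settles toward \(\mu\) as \(n\) grows. The distribution and the value of `MU` below are illustrative assumptions:

```python
import random
import statistics

random.seed(7)
MU = 3.0

# One data stream from Uniform(MU - 2, MU + 2), whose true mean is MU.
stream = [random.uniform(MU - 2, MU + 2) for _ in range(100_000)]

# The running sample mean gets closer to MU as n grows (consistency).
for n in (10, 1_000, 100_000):
    print(n, round(statistics.fmean(stream[:n]), 4))
```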
Property 7: Efficiency (Normal Case)¶
For normal data, the sample mean is the UMVUE (Uniformly Minimum Variance Unbiased Estimator) of \(\mu\).
Meaning: Among all unbiased estimators of \(\mu\), the sample mean has the smallest variance. There's no better estimator (in the variance sense).
Moreover, \(\bar{X}\) is the MLE (Maximum Likelihood Estimator) of \(\mu\) for normal data, as we derived in the Likelihood-Based-Statistics page.
Special Case: Bernoulli Distribution¶
When the population follows a Bernoulli distribution \(X_i \sim \text{Bernoulli}(p)\), the sample mean estimator has special interpretations and properties.
The Setup¶
For Bernoulli data:

- Each \(X_i\) takes values \(X_i = 1\) (success) with probability \(p\) and \(X_i = 0\) (failure) with probability \(1-p\)
- The population mean: \(\mu = E[X_i] = p\)
- The population variance: \(\sigma^2 = \text{Var}(X_i) = p(1-p)\)
Sample Mean as Sample Proportion¶
For Bernoulli data, the sample mean becomes the sample proportion:

\[
\hat{p} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{\text{number of successes}}{n}
\]
Interpretation: \(\bar{X}\) estimates the true probability of success \(p\) by counting the proportion of successes in the sample.
Properties Specialized for Bernoulli¶
Unbiasedness:

\[
E[\hat{p}] = p
\]

Variance:

\[
\text{Var}(\hat{p}) = \frac{p(1-p)}{n}
\]

Standard Error:

\[
\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
\]
Key insight: The variance of \(\hat{p}\) depends on \(p\) itself! This creates a unique situation where the precision of our estimator depends on the parameter we're estimating.
Maximum Variance at \(p = 0.5\)¶
The function \(p(1-p)\) (and thus the variance) is maximized when \(p = 0.5\), since \(\frac{d}{dp}\left[p(1-p)\right] = 1 - 2p = 0\) gives \(p = \frac{1}{2}\):
- Maximum variance: \(\text{Var}(\hat{p})_{\max} = \frac{0.5 \times 0.5}{n} = \frac{0.25}{n}\)
- Minimum variance: \(\text{Var}(\hat{p})_{\min} = 0\) when \(p = 0\) or \(p = 1\)
Practical implication: It's hardest to estimate probabilities near 0.5 (most uncertainty) and easiest to estimate probabilities near 0 or 1 (almost certain outcomes).
Estimated Standard Error¶
In practice, we replace \(p\) with \(\hat{p}\):

\[
\widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]

Example: In 100 trials with 35 successes, \(\hat{p} = 0.35\):

\[
\widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{0.35 \times 0.65}{100}} \approx 0.0477
\]
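The same plug-in computation as a two-line sketch:

```python
import math

n, successes = 100, 35          # figures from the example above
p_hat = successes / n           # sample proportion: 0.35
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)   # plug-in standard error
print(round(se_hat, 4))         # 0.0477
```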
Sampling Distribution¶
For large \(n\), by the Central Limit Theorem:

\[
\hat{p} \approx \mathcal{N}\left(p, \frac{p(1-p)}{n}\right)
\]
The normal approximation works well when \(np \geq 5\) and \(n(1-p) \geq 5\).
Connection to Count Data¶
The sum \(S = \sum_{i=1}^n X_i\) follows a Binomial distribution:

\[
S \sim \text{Binomial}(n, p)
\]

Therefore:

\[
\hat{p} = \frac{S}{n}, \quad E[S] = np, \quad \text{Var}(S) = np(1-p)
\]
This connection explains why proportion problems are fundamentally counting problems in disguise.
Practical Rule of Thumb for Sample Size¶
For a desired margin of error \(E\) at 95% confidence, we need \(1.96 \sqrt{\frac{p(1-p)}{n}} \leq E\), which gives:

\[
n \geq \frac{1.96^2 \, p(1-p)}{E^2}
\]

Since we don't know \(p\), we use the worst-case scenario \(p = 0.5\):

\[
n \geq \frac{1.96^2 \times 0.25}{E^2} = \frac{0.9604}{E^2}
\]

Example: For margin of error ±3% (\(E = 0.03\)):

\[
n \geq \frac{0.9604}{0.0009} \approx 1067
\]
This explains why political polls typically need around 1,000 respondents!
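The rule of thumb can be wrapped in a small helper (the function name `required_n` is ours; rounding up guarantees the margin is met):

```python
import math

def required_n(margin, p=0.5, z=1.96):
    """Smallest n such that z * sqrt(p(1-p)/n) <= margin (worst case p = 0.5)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(required_n(0.03))   # 1068 respondents for a ±3% margin of error
```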
Confidence Interval for the Mean¶
A 95% confidence interval for \(\mu\) is:

\[
\bar{x} \pm t_{n-1, 0.025} \, \frac{s}{\sqrt{n}}
\]
where \(t_{n-1, 0.025}\) is the 97.5% quantile of Student's t-distribution with \(n-1\) degrees of freedom.
Interpretation: If we repeated sampling infinitely many times and calculated this interval each time, 95% of the intervals would contain the true value \(\mu\).
For large \(n\) (typically \(n \geq 30\)), we can use the normal approximation:

\[
\bar{x} \pm 1.96 \, \frac{s}{\sqrt{n}}
\]
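A minimal sketch of the large-sample interval, using only the standard library (the function name `mean_ci_normal` and the sample data are our illustrative choices):

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_ci_normal(data, conf=0.95):
    """Large-sample confidence interval for the mean (normal approximation)."""
    n = len(data)
    xbar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)   # estimated standard error
    z = NormalDist().inv_cdf(0.5 + conf / 2)     # about 1.96 for conf = 0.95
    return xbar - z * se, xbar + z * se

random.seed(3)
sample = [random.gauss(12.0, 0.3) for _ in range(40)]  # hypothetical measurements
lo, hi = mean_ci_normal(sample)
print(round(lo, 3), round(hi, 3))
```

For small \(n\), the \(z\) quantile should be replaced by the appropriate \(t_{n-1}\) quantile, which is not available in the standard library.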
Complete Practical Example¶
Problem: We measure server response time (in ms) for \(n = 25\) requests (raw data omitted here; summary statistics below).

Step 1: Calculate the sample mean:

\[
\bar{x} = \frac{1}{25} \sum_{i=1}^{25} x_i = 130.4 \text{ ms}
\]

Step 2: Calculate the sample standard deviation:

\[
s = \sqrt{\frac{1}{24} \sum_{i=1}^{25} (x_i - \bar{x})^2} = 7.85 \text{ ms}
\]

Step 3: Calculate the standard error:

\[
\widehat{\text{SE}} = \frac{s}{\sqrt{n}} = \frac{7.85}{\sqrt{25}} = 1.57 \text{ ms}
\]

Interpretation: Our estimate of the mean is 130.4 ms, with a standard error of 1.57 ms.
Step 4: 95% confidence interval. For \(n-1 = 24\) degrees of freedom, \(t_{24, 0.025} \approx 2.064\):

\[
130.4 \pm 2.064 \times 1.57 = 130.4 \pm 3.24 = [127.16,\ 133.64] \text{ ms}
\]
Interpretation: We are 95% confident that the true average response time is between 127.16 ms and 133.64 ms.
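The steps above can be reproduced from the summary statistics alone. The value \(s = 7.85\) ms is implied by \(\text{SE} = 1.57\) with \(n = 25\), and the t-quantile is taken from a standard t-table:

```python
import math

# Summary statistics from the worked example above.
xbar, s, n = 130.4, 7.85, 25
t_crit = 2.064                    # t_{24, 0.025}, from a t-table

se = s / math.sqrt(n)             # 7.85 / 5 = 1.57
half_width = t_crit * se          # about 3.24
print(round(xbar - half_width, 2), round(xbar + half_width, 2))  # 127.16 133.64
```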
Comparison with Other Location Estimators¶
| Estimator | Formula | Advantages | Disadvantages |
|---|---|---|---|
| Mean | \(\bar{x} = \frac{1}{n}\sum x_i\) | Unbiased, efficient (normal), uses all data | Sensitive to outliers |
| Median | Middle value when sorted | Robust to outliers | Less efficient (normal), loses information |
| Trimmed Mean | Mean after removing top/bottom % | Robustness/efficiency compromise | Arbitrariness in % choice |
When to use the mean:

- Data approximately symmetric
- Few or no outliers
- Normal distribution or large \(n\) (CLT)

When NOT to use the mean:

- Heavily skewed data (e.g., incomes)
- Presence of extreme outliers
- Heavy-tailed distributions
Common Errors¶
- Confusing \(\sigma\) with \(s\): \(\sigma\) is the population parameter (fixed, unknown); \(s\) is the sample estimator (a random variable).
- Confusing SE with SD: SD (standard deviation) measures data spread; SE (standard error) measures estimator precision.
- Forgetting \(\sqrt{n}\): Precision improves as \(1/\sqrt{n}\), not \(1/n\). To halve the error requires 4 times the data.
- Using \(n\) instead of \(n-1\): To estimate \(\sigma^2\), use \(n-1\) (Bessel's correction).
- Ignoring the CLT: Even with non-normal data, for large \(n\) we can use normal approximations.
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(\mu\) | Population mean | True parameter (fixed, unknown) |
| \(\sigma^2\) | Population variance | True parameter (fixed, unknown) |
| \(X_i\) | Random variable | Model for the \(i\)-th observation |
| \(x_i\) | Observation | Realized value of \(X_i\) |
| \(\bar{X}\) | Sample mean (estimator) | Random variable \(\frac{1}{n}\sum X_i\) |
| \(\bar{x}\) | Point estimate | Numerical value computed from sample |
| \(s^2\) | Sample variance | Estimator of \(\sigma^2\) with denominator \(n-1\) |
| \(s\) | Sample standard deviation | \(\sqrt{s^2}\) |
| \(\text{SE}(\bar{X})\) | Standard error | \(\sigma/\sqrt{n}\) (theoretical) |
| \(\widehat{\text{SE}}\) | Estimated standard error | \(s/\sqrt{n}\) (practical) |
| \(n\) | Sample size | Number of observations |
Related Concepts¶
- Standard Error — Deep dive into standard error
- Confidence Interval — Constructing confidence intervals
- Student t-Distribution — Distribution when \(\sigma\) is unknown
- Central Limit Theorem — Why the mean is normal for large \(n\)
- Variance — Measure of dispersion
- Likelihood-Based Statistics — The mean as MLE estimator
- Bernoulli Distribution — Special case where mean equals probability \(p\)
References¶
- Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg.
- Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
- Fisher, R. A. (1925). "Theory of Statistical Estimation." Proceedings of the Cambridge Philosophical Society, 22:700-725.
- Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press.