Sample Mean Estimator¶
The Story Behind the Mathematics¶
The arithmetic mean is probably the oldest and most intuitive statistical concept in human history. Ancient Babylonians (around 2000 BC) already calculated averages to fairly distribute resources and lands. However, using the mean as a statistical estimator — a systematic way to infer the "true" value of a population from a sample — is a much more recent idea.
Carl Friedrich Gauss (1777-1855) was among the first to mathematically formalize the sample mean. In 1809, while working on astronomy and geodesy problems, Gauss wondered: "If I make \(n\) measurements of the same quantity (like a planet's position), and each measurement has random error, what final value should I report?"
Gauss proved that if errors follow a normal distribution (which he himself had characterized), the arithmetic mean is the optimal estimator — the one that minimizes mean squared error. This discovery was revolutionary: it transformed the mean from a simple "center of data" into an inference tool with provable mathematical properties.
Pierre-Simon Laplace (1749-1827) had already used the mean in his work on probability theory, but it was Gauss who established its theoretical primacy. In his monumental "Theoria Motus Corporum Coelestium" (1809), Gauss wrote:
"The arithmetic mean of many observations is always more reliable than a single observation."
This principle, obvious today, was not at all clear at the time. Gauss's mathematical formalization provided the foundation for all of modern statistics.
In the 20th century, Ronald Fisher proved that the sample mean is not just "good," but optimal in a precise sense: among all unbiased estimators of the mean of a normal population, the sample mean has the minimum variance (it's the UMVUE - Uniformly Minimum Variance Unbiased Estimator). This property, known as efficiency, completes the theoretical justification for why the sample mean is so universally used.
Why It Matters¶
The sample mean estimator is the foundation of nearly all statistical analyses. It's used in:
- Surveys: estimating average population opinion from a sample
- Scientific experiments: combining repeated measurements to reduce error
- Quality control: monitoring the average value of a production process
- Machine Learning: calculating data statistics for normalization and preprocessing
- Economics: estimating average income, average GDP growth, etc.
- Medicine: comparing average treatment effectiveness in clinical trials
Without a rigorous understanding of the sample mean and its properties, we couldn't quantify the uncertainty of our estimates or construct confidence intervals and hypothesis tests.
Prerequisites¶
- Concept of random variable and distribution
- Expected Value (mean of a distribution)
- Variance and variance properties
- Independence of random variables
The Estimator¶
Suppose we have a random sample of \(n\) observations \(X_1, X_2, \ldots, X_n\) from a population with (unknown) mean \(\mu\) and variance \(\sigma^2\). The observations are independent and identically distributed (i.i.d.).
The sample mean estimator is:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
Crucial distinction:

- \(\mu\) is a fixed (but unknown) population parameter
- \(\bar{X}\) is an estimator — a random variable that depends on the sample
- \(\bar{x}\) is the estimate — the numerical value computed from a specific sample \(x_1, \ldots, x_n\)
Derivation of Properties¶
Property 1: Unbiasedness¶
An estimator is unbiased if its expected value equals the true parameter. Let's prove that \(E[\bar{X}] = \mu\).
Proof:
By definition:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

Take the expected value of both sides:

\[
E[\bar{X}] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right]
\]

Key property: Expected value is a linear operator, so we can bring it inside the sum and extract constants:

\[
E[\bar{X}] = \frac{1}{n} E\left[\sum_{i=1}^{n} X_i\right]
\]

By linearity of expectation, the expected value of the sum is the sum of expected values:

\[
E[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} E[X_i]
\]

i.i.d. assumption: Since all \(X_i\) come from the same population, \(E[X_i] = \mu\) for all \(i\):

\[
E[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu
\]
Conclusion: \(E[\bar{X}] = \mu\). The sample mean is an unbiased estimator of the population mean.
Interpretation: If we repeated sampling infinitely many times and calculated \(\bar{X}\) each time, the average value of all these estimates would converge exactly to \(\mu\). There's no systematic bias.
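The "repeated sampling" interpretation can be checked with a quick Monte Carlo sketch. The population parameters `MU`, `SIGMA` and the sample size `N` below are illustrative choices, not values from the text:

```python
import random
import statistics

random.seed(42)
MU, SIGMA, N = 50.0, 10.0, 25   # illustrative population parameters and sample size

# Draw many independent samples of size N and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(20_000)
]

# Unbiasedness: the average of the sample means should sit very close to MU.
print(statistics.fmean(sample_means))
```

Any single `\(\bar{x}\)` misses \(\mu\), but the average over many repetitions does not drift in either direction.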
Property 2: Variance of the Sample Mean¶
The variance of \(\bar{X}\) measures how much estimates fluctuate around \(\mu\) across different samples. Let's prove that:

\[
\text{Var}(\bar{X}) = \frac{\sigma^2}{n}
\]
Proof:
By definition:

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

Take the variance:

\[
\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)
\]

Key property: For a constant \(c\), \(\text{Var}(cY) = c^2 \text{Var}(Y)\). So extracting \(\frac{1}{n}\) produces \(\frac{1}{n^2}\):

\[
\text{Var}(\bar{X}) = \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^{n} X_i\right)
\]

Independence assumption: If \(X_1, \ldots, X_n\) are independent, the variance of the sum is the sum of variances:

\[
\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i)
\]

Why? For independent variables, there's no covariance: if \(X \perp Y\), then \(\text{Cov}(X, Y) = 0\), so all the cross terms vanish.

Identical distribution assumption: Each \(X_i\) has the same variance \(\sigma^2\):

\[
\sum_{i=1}^{n} \text{Var}(X_i) = n\sigma^2
\]

Substituting:

\[
\text{Var}(\bar{X}) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}
\]
Conclusion: The variance of the sample mean is \(\frac{\sigma^2}{n}\).
Crucial interpretation:

- Variance decreases with \(n\) (more data → more precise estimates)
- It decreases like \(\frac{1}{n}\), not \(\frac{1}{n^2}\)
- To halve the standard deviation of \(\bar{X}\), you need 4 times the data (because \(\text{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\))
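The \(\sigma^2 / n\) scaling is easy to verify empirically. This sketch (with made-up parameters and a helper name `var_of_sample_mean` of our choosing) estimates \(\text{Var}(\bar{X})\) at two sample sizes:

```python
import random
import statistics

random.seed(0)
MU, SIGMA = 0.0, 4.0   # illustrative population parameters (sigma^2 = 16)

def var_of_sample_mean(n, reps=20_000):
    """Empirical variance of the sample mean across many samples of size n."""
    means = [statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.pvariance(means)

# Theory predicts sigma^2 / n: 16 / 4 = 4.0 and 16 / 16 = 1.0.
print(var_of_sample_mean(4), var_of_sample_mean(16))
```

Quadrupling \(n\) cuts the variance by 4, i.e. the standard deviation by 2, matching the third bullet above.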
Property 3: Standard Error of the Mean¶
The standard deviation of \(\bar{X}\) is called the standard error of the mean (SEM):

\[
\text{SE}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
\]
Practical problem: In reality, \(\sigma\) is unknown. How do we estimate SE?
We use the sample standard deviation \(s\) as an estimate of \(\sigma\):

\[
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]
Why \(n-1\) instead of \(n\)? We use \(n-1\) to get an unbiased estimator of \(\sigma^2\) (Bessel's correction). When we compute \((x_i - \bar{x})^2\), we use \(\bar{x}\) instead of the true \(\mu\). This introduces a dependency that "uses up" one degree of freedom.
The estimated standard error is:

\[
\widehat{\text{SE}}(\bar{X}) = \frac{s}{\sqrt{n}}
\]
Interpretation: The standard error measures the precision of our estimate. A small SE means that if we repeated the experiment, we'd get similar estimates of \(\bar{X}\). A large SE means high variability.
Numerical example: If we measure student heights with \(s = 10\) cm and \(n = 100\):

\[
\widehat{\text{SE}}(\bar{X}) = \frac{10}{\sqrt{100}} = 1 \text{ cm}
\]
This means that if we repeated sampling, the sample mean would typically fluctuate by about 1 cm around the true value \(\mu\).
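As a sketch, the same computation in Python. The height values below are made-up illustrative data; note that `statistics.stdev` already uses the \(n - 1\) denominator:

```python
import math
import statistics

# Hypothetical height measurements in cm (illustrative data only).
heights = [172.0, 168.5, 181.2, 175.3, 169.8, 177.1, 174.4, 170.9]

n = len(heights)
s = statistics.stdev(heights)   # sample SD, denominator n - 1 (Bessel's correction)
sem = s / math.sqrt(n)          # estimated standard error of the mean
print(round(s, 2), round(sem, 2))
```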
Property 4: Sampling Distribution (Normal Case)¶
If data come from a normal distribution \(X_i \sim \mathcal{N}(\mu, \sigma^2)\), then the sample mean has distribution:

\[
\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
\]
Intuitive proof:
The sum of independent normal variables is still normal. If \(X_i \sim \mathcal{N}(\mu, \sigma^2)\) and are independent:

\[
\sum_{i=1}^{n} X_i \sim \mathcal{N}(n\mu, n\sigma^2)
\]

Dividing by \(n\) (a linear transformation):

\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
\]

Standardization: We can standardize \(\bar{X}\):

\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)
\]
This is the basis for constructing confidence intervals and hypothesis tests.
Case with unknown \(\sigma\): If we replace \(\sigma\) with its estimate \(s\), the statistic becomes:

\[
T = \frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1}
\]

which follows a Student's t-distribution with \(n-1\) degrees of freedom.
Property 5: Central Limit Theorem (CLT)¶
Remarkable result: Even if the \(X_i\) do not come from a normal distribution, for sufficiently large \(n\), the distribution of \(\bar{X}\) is approximately normal:

\[
\bar{X} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n
\]

More precisely:

\[
\sqrt{n} \, \frac{\bar{X} - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty
\]
Implications:

- We don't need to assume data normality to do inference on the mean (if \(n\) is large)
- "Large" depends on the shape of the original distribution; often \(n \geq 30\) suffices
- This explains why the normal distribution is so ubiquitous in statistics
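A minimal CLT sketch: the exponential distribution is strongly right-skewed, yet sample means of size 50 behave almost normally. We check this by counting how often \(\bar{X}\) lands within \(\pm 1.96\) theoretical standard errors of \(\mu\) (the sample size and seed are our illustrative choices):

```python
import random
import statistics

random.seed(1)
N = 50   # sample size; illustrative choice

# Exponential(1) data: mean 1, variance 1, strongly right-skewed (non-normal).
means = [statistics.fmean(random.expovariate(1.0) for _ in range(N))
         for _ in range(10_000)]

mu, se = 1.0, (1.0 / N) ** 0.5    # theoretical mean and SD of the sample mean
inside = sum(abs(m - mu) < 1.96 * se for m in means) / len(means)
print(inside)   # close to 0.95 if the normal approximation is accurate
```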
Property 6: Consistency¶
An estimator is consistent if it converges in probability to the true value as \(n \to \infty\):

\[
\bar{X} \xrightarrow{P} \mu, \quad \text{i.e.} \quad \lim_{n \to \infty} P\left(|\bar{X} - \mu| > \epsilon\right) = 0 \text{ for every } \epsilon > 0
\]
Proof via Chebyshev's Inequality:
For any \(\epsilon > 0\):

\[
P\left(|\bar{X} - \mu| \geq \epsilon\right) \leq \frac{\text{Var}(\bar{X})}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \xrightarrow{n \to \infty} 0
\]
Interpretation: With enough data, the probability that \(\bar{X}\) deviates from \(\mu\) by more than any fixed amount \(\epsilon\) becomes arbitrarily small.
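Consistency can be seen on a single growing data stream: the running sample mean settles toward \(\mu\) as \(n\) grows. The distribution and the value of `MU` below are illustrative assumptions:

```python
import random
import statistics

random.seed(7)
MU = 3.0

# One data stream from Uniform(MU - 2, MU + 2), whose true mean is MU.
stream = [random.uniform(MU - 2, MU + 2) for _ in range(100_000)]

# The running sample mean gets closer to MU as n grows (consistency).
for n in (10, 1_000, 100_000):
    print(n, round(statistics.fmean(stream[:n]), 4))
```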
Property 7: Efficiency (Normal Case)¶
For normal data, the sample mean is the UMVUE (Uniformly Minimum Variance Unbiased Estimator) of \(\mu\).
Meaning: Among all unbiased estimators of \(\mu\), the sample mean has the smallest variance. There's no better estimator (in the variance sense).
Moreover, \(\bar{X}\) is the MLE (Maximum Likelihood Estimator) of \(\mu\) for normal data, as we derived in the Likelihood-Based-Statistics page.
Special Case: Bernoulli Distribution¶
When the population follows a Bernoulli distribution \(X_i \sim \text{Bernoulli}(p)\), the sample mean estimator has special interpretations and properties.
The Setup¶
For Bernoulli data:

- Each \(X_i\) takes values \(X_i = 1\) (success) with probability \(p\) and \(X_i = 0\) (failure) with probability \(1-p\)
- The population mean: \(\mu = E[X_i] = p\)
- The population variance: \(\sigma^2 = \text{Var}(X_i) = p(1-p)\)
Sample Mean as Sample Proportion¶
For Bernoulli data, the sample mean becomes the sample proportion:

\[
\hat{p} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{\text{number of successes}}{n}
\]
Interpretation: \(\bar{X}\) estimates the true probability of success \(p\) by counting the proportion of successes in the sample.
Properties Specialized for Bernoulli¶
Unbiasedness:

\[
E[\hat{p}] = p
\]

Variance:

\[
\text{Var}(\hat{p}) = \frac{p(1-p)}{n}
\]

Standard Error:

\[
\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
\]
Key insight: The variance of \(\hat{p}\) depends on \(p\) itself! This creates a unique situation where the precision of our estimator depends on the parameter we're estimating.
Maximum Variance at \(p = 0.5\)¶
The function \(p(1-p)\) (and thus the variance) is maximized when \(p = 0.5\), since \(\frac{d}{dp}\left[p(1-p)\right] = 1 - 2p = 0\) gives \(p = \frac{1}{2}\):
- Maximum variance: \(\text{Var}(\hat{p})_{\max} = \frac{0.5 \times 0.5}{n} = \frac{0.25}{n}\)
- Minimum variance: \(\text{Var}(\hat{p})_{\min} = 0\) when \(p = 0\) or \(p = 1\)
Practical implication: It's hardest to estimate probabilities near 0.5 (most uncertainty) and easiest to estimate probabilities near 0 or 1 (almost certain outcomes).
Estimated Standard Error¶
In practice, we replace \(p\) with \(\hat{p}\):

\[
\widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]

Example: In 100 trials with 35 successes, \(\hat{p} = 0.35\):

\[
\widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{0.35 \times 0.65}{100}} \approx 0.0477
\]
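The same plug-in computation as a two-line sketch:

```python
import math

n, successes = 100, 35          # figures from the example above
p_hat = successes / n           # sample proportion: 0.35
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)   # plug-in standard error
print(round(se_hat, 4))         # 0.0477
```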
Sampling Distribution¶
For large \(n\), by the Central Limit Theorem:

\[
\hat{p} \approx \mathcal{N}\left(p, \frac{p(1-p)}{n}\right)
\]
The normal approximation works well when \(np \geq 5\) and \(n(1-p) \geq 5\).
Connection to Count Data¶
The sum \(S = \sum_{i=1}^n X_i\) follows a Binomial distribution:

\[
S \sim \text{Binomial}(n, p)
\]

Therefore:

\[
\hat{p} = \frac{S}{n}, \quad E[S] = np, \quad \text{Var}(S) = np(1-p)
\]
This connection explains why proportion problems are fundamentally counting problems in disguise.
Practical Rule of Thumb for Sample Size¶
For a desired margin of error \(E\) at 95% confidence, we need \(1.96 \sqrt{\frac{p(1-p)}{n}} \leq E\), which gives:

\[
n \geq \frac{1.96^2 \, p(1-p)}{E^2}
\]

Since we don't know \(p\), we use the worst-case scenario \(p = 0.5\):

\[
n \geq \frac{1.96^2 \times 0.25}{E^2} = \frac{0.9604}{E^2}
\]

Example: For margin of error ±3% (\(E = 0.03\)):

\[
n \geq \frac{0.9604}{0.0009} \approx 1067
\]
This explains why political polls typically need around 1,000 respondents!
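The rule of thumb can be wrapped in a small helper (the function name `required_n` is ours; rounding up guarantees the margin is met):

```python
import math

def required_n(margin, p=0.5, z=1.96):
    """Smallest n such that z * sqrt(p(1-p)/n) <= margin (worst case p = 0.5)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(required_n(0.03))   # 1068 respondents for a ±3% margin of error
```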
Confidence Interval for the Mean¶
A 95% confidence interval for \(\mu\) is:

\[
\bar{x} \pm t_{n-1, 0.025} \, \frac{s}{\sqrt{n}}
\]
where \(t_{n-1, 0.025}\) is the 97.5% quantile of Student's t-distribution with \(n-1\) degrees of freedom.
Interpretation: If we repeated sampling infinitely many times and calculated this interval each time, 95% of the intervals would contain the true value \(\mu\).
For large \(n\) (typically \(n \geq 30\)), we can use the normal approximation:

\[
\bar{x} \pm 1.96 \, \frac{s}{\sqrt{n}}
\]
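A minimal sketch of the large-sample interval, using only the standard library (the function name `mean_ci_normal` and the sample data are our illustrative choices):

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_ci_normal(data, conf=0.95):
    """Large-sample confidence interval for the mean (normal approximation)."""
    n = len(data)
    xbar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)   # estimated standard error
    z = NormalDist().inv_cdf(0.5 + conf / 2)     # about 1.96 for conf = 0.95
    return xbar - z * se, xbar + z * se

random.seed(3)
sample = [random.gauss(12.0, 0.3) for _ in range(40)]  # hypothetical measurements
lo, hi = mean_ci_normal(sample)
print(round(lo, 3), round(hi, 3))
```

For small \(n\), the \(z\) quantile should be replaced by the appropriate \(t_{n-1}\) quantile, which is not available in the standard library.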
Complete Practical Example¶
Problem: We measure server response time (in ms) for \(n = 25\) requests (raw data omitted here; summary statistics below).

Step 1: Calculate the sample mean:

\[
\bar{x} = \frac{1}{25} \sum_{i=1}^{25} x_i = 130.4 \text{ ms}
\]

Step 2: Calculate the sample standard deviation:

\[
s = \sqrt{\frac{1}{24} \sum_{i=1}^{25} (x_i - \bar{x})^2} = 7.85 \text{ ms}
\]

Step 3: Calculate the standard error:

\[
\widehat{\text{SE}} = \frac{s}{\sqrt{n}} = \frac{7.85}{\sqrt{25}} = 1.57 \text{ ms}
\]

Interpretation: Our estimate of the mean is 130.4 ms, with a standard error of 1.57 ms.
Step 4: 95% confidence interval. For \(n-1 = 24\) degrees of freedom, \(t_{24, 0.025} \approx 2.064\):

\[
130.4 \pm 2.064 \times 1.57 = 130.4 \pm 3.24 = [127.16,\ 133.64] \text{ ms}
\]
Interpretation: We are 95% confident that the true average response time is between 127.16 ms and 133.64 ms.
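The steps above can be reproduced from the summary statistics alone. The value \(s = 7.85\) ms is implied by \(\text{SE} = 1.57\) with \(n = 25\), and the t-quantile is taken from a standard t-table:

```python
import math

# Summary statistics from the worked example above.
xbar, s, n = 130.4, 7.85, 25
t_crit = 2.064                    # t_{24, 0.025}, from a t-table

se = s / math.sqrt(n)             # 7.85 / 5 = 1.57
half_width = t_crit * se          # about 3.24
print(round(xbar - half_width, 2), round(xbar + half_width, 2))  # 127.16 133.64
```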
Comparison with Other Location Estimators¶
| Estimator | Formula | Advantages | Disadvantages |
|---|---|---|---|
| Mean | \(\bar{x} = \frac{1}{n}\sum x_i\) | Unbiased, efficient (normal), uses all data | Sensitive to outliers |
| Median | Middle value when sorted | Robust to outliers | Less efficient (normal), loses information |
| Trimmed Mean | Mean after removing top/bottom % | Robustness/efficiency compromise | Arbitrariness in % choice |
When to use the mean:

- Data approximately symmetric
- Few or no outliers
- Normal distribution or large \(n\) (CLT)

When NOT to use the mean:

- Heavily skewed data (e.g., incomes)
- Presence of extreme outliers
- Heavy-tailed distributions
Common Errors¶
- Confusing \(\sigma\) with \(s\): \(\sigma\) is the population parameter (fixed, unknown); \(s\) is the sample estimator (a random variable).
- Confusing SE with SD: SD (standard deviation) measures data spread; SE (standard error) measures estimator precision.
- Forgetting \(\sqrt{n}\): Precision improves as \(1/\sqrt{n}\), not \(1/n\). To halve the error requires 4 times the data.
- Using \(n\) instead of \(n-1\): To estimate \(\sigma^2\), use \(n-1\) (Bessel's correction).
- Ignoring the CLT: Even with non-normal data, for large \(n\) we can use normal approximations.
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(\mu\) | Population mean | True parameter (fixed, unknown) |
| \(\sigma^2\) | Population variance | True parameter (fixed, unknown) |
| \(X_i\) | Random variable | Model for the \(i\)-th observation |
| \(x_i\) | Observation | Realized value of \(X_i\) |
| \(\bar{X}\) | Sample mean (estimator) | Random variable \(\frac{1}{n}\sum X_i\) |
| \(\bar{x}\) | Point estimate | Numerical value computed from sample |
| \(s^2\) | Sample variance | Estimator of \(\sigma^2\) with denominator \(n-1\) |
| \(s\) | Sample standard deviation | \(\sqrt{s^2}\) |
| \(\text{SE}(\bar{X})\) | Standard error | \(\sigma/\sqrt{n}\) (theoretical) |
| \(\widehat{\text{SE}}\) | Estimated standard error | \(s/\sqrt{n}\) (practical) |
| \(n\) | Sample size | Number of observations |
Related Concepts¶
- Standard Error — Deep dive into standard error
- Confidence Interval — Constructing confidence intervals
- Student t-Distribution — Distribution when \(\sigma\) is unknown
- Central Limit Theorem — Why the mean is normal for large \(n\)
- Variance — Measure of dispersion
- Likelihood-Based Statistics — The mean as MLE estimator
- Bernoulli Distribution — Special case where mean equals probability \(p\)
References¶
- Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg.
- Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
- Fisher, R. A. (1925). "Theory of Statistical Estimation." Proceedings of the Cambridge Philosophical Society, 22:700-725.
- Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press.