Likelihood-Based Statistics¶
The Story Behind the Mathematics¶
In the 18th century, mathematicians faced a fundamental problem: how can reliable information be extracted from noisy, incomplete data? Thomas Bayes, Daniel Bernoulli, and Pierre-Simon Laplace were among the first to recognize that probability could be used "in reverse" — not to predict data from known parameters, but to infer parameters from observed data.
However, it was Ronald Aylmer Fisher (1890-1962), a British geneticist and statistician, who completely revolutionized this field. In the 1920s, Fisher worked at the Rothamsted Experimental Station, analyzing agricultural data. He realized that existing statistical methods were inadequate: they lacked mathematical rigor and a unifying theory.
In his seminal 1922 paper "On the Mathematical Foundations of Theoretical Statistics," Fisher introduced three revolutionary concepts:
- The Likelihood Principle — The idea that all relevant information about parameters contained in the data is captured by the likelihood function
- Maximum Likelihood Estimation (MLE) — A systematic method for finding the "best" estimators
- Fisher Information — A measure of the precision with which a parameter can be estimated
Fisher was known for his combative character. He had famous controversies with Karl Pearson and Jerzy Neyman about the philosophy of statistical inference. While Pearson preferred the method of moments (simpler but less efficient), Fisher insisted that likelihood provided the optimal theoretical framework. History proved him right: MLE is today the most widely used parametric estimation method in nearly all scientific fields.
Why It Matters¶
Likelihood-based statistics are the foundation of modern statistical inference. They are used in:
- Machine Learning: training probabilistic models (logistic regression, neural networks, language models)
- Medicine: estimating drug efficacy rates in clinical trials
- Economics: estimating parameters in econometric models (ARIMA, GARCH)
- Physics: calibrating particle detectors and analyzing experimental data
- Biology: phylogenetic inference, genomic sequence analysis
- Finance: option pricing, volatility estimation
Without likelihood theory, we wouldn't have rigorous methods to quantify the uncertainty of our estimates or to compare competing models.
Prerequisites¶
- Basic probability concepts (random variables, distributions)
- Variance and Moments
- Differential calculus (derivatives, function maximization)
- Knowledge of the Gaussian Distribution
Fundamental Concepts¶
Before deriving formulas, let's build the conceptual foundations from first principles.
What is Likelihood?¶
Likelihood is not a probability. This distinction is crucial and often misunderstood.
Probability: We fix the parameter \(\theta\) and ask: "How likely are we to observe different values of the data \(X\)?"
Likelihood: We fix the observed data \(x\) and ask: "How 'plausible' are different values of the parameter \(\theta\) given what we observed?"
Note: the mathematical function is the same (\(f\)), but the meaning is opposite. The likelihood \(L(\theta | x)\) is not a probability distribution for \(\theta\) — it doesn't integrate to 1 with respect to \(\theta\).
Intuitive Example: Imagine observing 8 heads in 10 coin flips.
- If \(\theta = 0.5\) (fair coin), the probability of this outcome is \(\binom{10}{8}(0.5)^{10} \approx 0.044\)
- If \(\theta = 0.8\) (biased coin), the same probability is \(\binom{10}{8}(0.8)^8(0.2)^2 \approx 0.302\)
Likelihood tells us: "The data are about 7 times more likely if \(\theta = 0.8\) compared to \(\theta = 0.5\)."
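This comparison is easy to reproduce. A minimal Python sketch (the function name `binomial_likelihood` is ours, not from the text):

```python
from math import comb

def binomial_likelihood(theta, k=8, n=10):
    """Likelihood of theta given k heads observed in n flips."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

L_fair = binomial_likelihood(0.5)    # ~0.044
L_biased = binomial_likelihood(0.8)  # ~0.302
print(L_biased / L_fair)             # ratio ~6.9: "about 7 times more likely"
```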
Likelihood for Independent Samples¶
If we have \(n\) independent and identically distributed (i.i.d.) observations \(x_1, x_2, \ldots, x_n\), the likelihood of the complete sample is the product of the individual likelihoods:

$$L(\theta | x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)$$

Why the product? Because of independence. The joint probability of independent events is the product of the individual probabilities:

$$f(x_1, \ldots, x_n | \theta) = f(x_1 | \theta) \cdot f(x_2 | \theta) \cdots f(x_n | \theta)$$
The Log-Likelihood¶
In practice, we almost always work with the log-likelihood:

$$\ell(\theta) = \ln L(\theta | x) = \sum_{i=1}^{n} \ln f(x_i | \theta)$$
Why take the logarithm?
- Products → Sums: \(\ln(a \cdot b) = \ln a + \ln b\). Sums are much easier to differentiate.
- Numerical stability: Multiplying many small probabilities (e.g., \(10^{-10} \times 10^{-12}\)) causes underflow on computers. Logarithms transform these numbers into manageable sums.
- Monotonicity: Since \(\ln(x)\) is strictly increasing, maximizing \(\ell(\theta)\) is equivalent to maximizing \(L(\theta)\).
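The numerical-stability point is worth seeing concretely. A small Python sketch with illustrative numbers:

```python
import math

# 500 observations, each with a tiny per-observation likelihood of ~1e-12
probs = [1e-12] * 500

product = 1.0
for p in probs:
    product *= p          # the direct product underflows to exactly 0.0
print(product)

log_lik = sum(math.log(p) for p in probs)
print(log_lik)            # ~-13815.5, perfectly representable as a float
```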
The Maximum Likelihood Estimator (MLE)¶
The maximum likelihood estimator \(\hat{\theta}_{MLE}\) is the value that maximizes the likelihood:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta | x) = \arg\max_{\theta} \ell(\theta)$$
Interpretation: "Among all possible values of \(\theta\), which one makes the observed data most plausible?"
To find it, we use differential calculus:
- Differentiate \(\ell(\theta)\) with respect to \(\theta\)
- Set the derivative equal to zero (first-order condition)
- Solve for \(\theta\)
- Verify that it's a maximum (not a minimum or saddle point)
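When no closed-form solution exists, the same recipe can be carried out numerically. A brute-force Python sketch for the coin example, with a grid search standing in for a proper optimizer:

```python
import math

def loglik(theta, k=8, n=10):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Evaluate on a fine grid over (0, 1) and take the argmax
grid = [i / 10_000 for i in range(1, 10_000)]
theta_hat = max(grid, key=loglik)
print(theta_hat)  # 0.8, matching the analytic MLE k/n
```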
Complete Derivation: Normal Distribution¶
Now let's derive the MLE estimators for the most important distribution in statistics: the normal (or Gaussian).
Problem Setup¶
Suppose we have \(n\) i.i.d. observations \(x_1, x_2, \ldots, x_n\) from a normal distribution with unknown parameters \(\mu\) (mean) and \(\sigma^2\) (variance):

$$X_i \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n$$

The probability density function (PDF) of the normal is:

$$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Objective: Find \(\hat{\mu}\) and \(\hat{\sigma}^2\) that maximize the likelihood.
Step 1: Construct the Likelihood Function¶
For \(n\) i.i.d. observations, the likelihood is:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} f(x_i | \mu, \sigma^2)$$

Substituting the PDF:

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

We can separate the product:

$$L(\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \prod_{i=1}^{n} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Simplifying (the product of exponentials is the exponential of the sum):

$$L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)$$
Step 2: Switch to Log-Likelihood¶
Take the natural logarithm of both sides. Recall the properties of logarithms:
- \(\ln(a^b) = b\ln(a)\)
- \(\ln(ab) = \ln(a) + \ln(b)\)
- \(\ln(e^x) = x\)

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Expand further using \(\ln(ab) = \ln a + \ln b\):

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Important note: The term \(-\frac{n}{2}\ln(2\pi)\) is a constant (it doesn't depend on \(\mu\) or \(\sigma^2\)), so we can ignore it in the maximization. We get:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 + \text{const}$$
Step 3: Estimate \(\mu\) (Mean)¶
To find \(\hat{\mu}\), differentiate \(\ell\) with respect to \(\mu\) and set it equal to zero.

Why differentiate? At a maximum (or minimum), the slope of the function is zero. We're looking for the peak of the log-likelihood curve.

The first term, \(-\frac{n}{2}\ln(\sigma^2)\), doesn't depend on \(\mu\), so its derivative is zero.

Calculate the derivative of the sum. Using the chain rule:

$$\frac{\partial}{\partial\mu}(x_i - \mu)^2 = -2(x_i - \mu)$$

Therefore:

$$\frac{\partial\ell}{\partial\mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[-2(x_i - \mu)\right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)$$

Set the derivative equal to zero (first-order condition):

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$$

Since \(\sigma^2 \neq 0\), we can multiply both sides by \(\sigma^2\):

$$\sum_{i=1}^{n} (x_i - \mu) = 0$$

Expand the sum:

$$\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu = 0$$

The second term is \(n\mu\) (we sum \(\mu\) a total of \(n\) times):

$$\sum_{i=1}^{n} x_i - n\mu = 0$$

Solve for \(\mu\):

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$
Result: The MLE estimator of the mean is the sample mean \(\bar{x}\).
Interpretation: The value of \(\mu\) that makes the data most likely is exactly the average of the data. Intuitively: if the data are distributed around \(\mu\), the best estimate of \(\mu\) is the center of the observed data.
Step 4: Estimate \(\sigma^2\) (Variance)¶
Now let's find \(\hat{\sigma}^2\). Differentiate \(\ell\) with respect to \(\sigma^2\) and set it equal to zero.

Technical note: We differentiate with respect to \(\sigma^2\) (not \(\sigma\)) because it makes the calculations simpler; \(\sigma^2\) is the natural parameter of the normal distribution.

Calculate the derivatives of the two terms separately.

First term: Using \(\frac{d}{dx}\ln(x) = \frac{1}{x}\):

$$\frac{\partial}{\partial\sigma^2}\left[-\frac{n}{2}\ln(\sigma^2)\right] = -\frac{n}{2\sigma^2}$$

Second term: Rewrite it as \(-\frac{1}{2}(\sigma^2)^{-1}\sum(x_i-\mu)^2\). Using the power rule \(\frac{d}{dx}x^{-1} = -x^{-2}\):

$$\frac{\partial}{\partial\sigma^2}\left[-\frac{1}{2}(\sigma^2)^{-1}\sum_{i=1}^{n}(x_i - \mu)^2\right] = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Combining the two terms:

$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Set equal to zero:

$$-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Multiply both sides by \(2(\sigma^2)^2\) to eliminate the denominators:

$$-n\sigma^2 + \sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

Solve for \(\sigma^2\):

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

But wait! We differentiated with respect to \(\sigma^2\) treating \(\mu\) as known. In reality, \(\mu\) is unknown and we must substitute its estimate \(\hat{\mu} = \bar{x}\):

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Result: The MLE estimator of the variance is the sample variance (with denominator \(n\)).
Important note on bias: This estimator is biased. Its expected value is:

$$E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 < \sigma^2$$

To obtain an unbiased estimator, we use denominator \(n-1\) instead of \(n\):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Why this bias? When we estimate \(\mu\) with \(\bar{x}\), we "use up" one degree of freedom of the data. The sample mean \(\bar{x}\) is constructed to minimize the sum of squared deviations, so \(\sum(x_i-\bar{x})^2\) systematically underestimates \(\sum(x_i-\mu)^2\). The \(n-1\) correction compensates for this underestimation.
Summary of MLE Estimates for Normal¶
For an i.i.d. sample \(x_1, \ldots, x_n\) from \(\mathcal{N}(\mu, \sigma^2)\):

$$\hat{\mu}_{MLE} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
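These closed-form estimators are easy to sanity-check by simulation. A small Python sketch (the true parameter values are illustrative):

```python
import random

random.seed(0)
mu_true, sigma_true = 5.0, 2.0
xs = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]
n = len(xs)

mu_hat = sum(xs) / n                              # MLE of the mean: the sample mean
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # MLE of the variance: denominator n
print(mu_hat, var_hat)  # close to the true values 5 and 4
```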
Other Examples: Exponential and Binomial Distributions¶
Exponential Distribution¶
PDF: \(f(x|\lambda) = \lambda e^{-\lambda x}\) for \(x \geq 0\), \(\lambda > 0\)

Log-likelihood:

$$\ell(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i$$

Derivative:

$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$$

Solution:

$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$

The MLE of the rate parameter is the reciprocal of the sample mean.
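Again this can be checked by simulation; a short Python sketch with an illustrative true rate:

```python
import random

random.seed(1)
lam_true = 2.5
xs = [random.expovariate(lam_true) for _ in range(200_000)]

lam_hat = len(xs) / sum(xs)  # MLE: reciprocal of the sample mean
print(lam_hat)  # close to the true rate 2.5
```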
Binomial Distribution¶
PMF: \(P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}\)

For \(m\) independent repetitions with \(k_1, \ldots, k_m\) successes:

$$L(p) = \prod_{j=1}^{m} \binom{n}{k_j} p^{k_j}(1-p)^{n-k_j}$$

Log-likelihood (dropping the binomial coefficients, which don't depend on \(p\)):

$$\ell(p) = \left(\sum_{j=1}^{m} k_j\right)\ln p + \left(mn - \sum_{j=1}^{m} k_j\right)\ln(1-p) + \text{const}$$

Derivative:

$$\frac{d\ell}{dp} = \frac{\sum_j k_j}{p} - \frac{mn - \sum_j k_j}{1-p} = 0$$

Solution:

$$\hat{p} = \frac{\sum_{j=1}^{m} k_j}{mn}$$

The MLE of the success probability is the empirical proportion of successes.
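A tiny worked example in Python (the success counts are hypothetical data, chosen for illustration):

```python
# Hypothetical data: m = 3 experiments of n = 10 trials each,
# with observed success counts k_j
n = 10
counts = [8, 7, 9]
m = len(counts)

p_hat = sum(counts) / (m * n)  # MLE: total successes over total trials
print(p_hat)  # 24 successes in 30 trials -> 0.8
```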
Properties of MLE Estimators¶
1. Consistency¶
As \(n \to \infty\), \(\hat{\theta}_{MLE} \overset{P}{\to} \theta_0\) (converges in probability to the true value).
Meaning: With enough data, MLE finds the true parameter.
2. Asymptotic Normality¶
For large \(n\):

$$\sqrt{n}\,\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, \frac{1}{I(\theta_0)}\right)$$

where \(I(\theta)\) is the Fisher information:

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln f(X|\theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2}\ln f(X|\theta)\right]$$
Meaning: The sampling distribution of MLE is approximately normal. This allows us to construct confidence intervals and hypothesis tests.
3. Efficiency¶
Under standard regularity conditions, the MLE asymptotically attains the Cramér-Rao lower bound: its asymptotic variance, \(1/(nI(\theta))\), is the smallest achievable.

Meaning: In large samples, no regular estimator has lower variance. The MLE is "optimal" in this sense.
4. Invariance¶
If \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\) for any function \(g\).
Example: If \(\hat{\sigma}^2\) is the MLE of variance, then \(\sqrt{\hat{\sigma}^2} = \hat{\sigma}\) is the MLE of standard deviation.
Fisher Information¶
Fisher information quantifies how much "information" an observation provides about the unknown parameter:

$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\ln f(X|\theta)\right]$$

Interpretation:
- Higher \(I(\theta)\) means we can estimate \(\theta\) more precisely
- The asymptotic variance of the MLE is \(1/(nI(\theta))\)
Example for the Normal: For \(X \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\):

$$I(\mu) = \frac{1}{\sigma^2}$$

Thus \(\text{Var}(\hat{\mu}) \approx \frac{1}{nI(\mu)} = \frac{\sigma^2}{n}\), which exactly matches the variance of the sample mean.
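The claim \(\text{Var}(\hat{\mu}) = \sigma^2/n\) can be verified empirically. A Python sketch that repeatedly draws samples of size \(n\) and measures the spread of the resulting sample means (all numbers illustrative):

```python
import random

random.seed(2)
sigma2, n, reps = 4.0, 50, 20_000

# Draw many samples of size n and record each sample mean
means = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    means.append(sum(xs) / n)

mbar = sum(means) / reps
emp_var = sum((m - mbar) ** 2 for m in means) / reps
print(emp_var, sigma2 / n)  # empirical variance vs. theoretical 1/(n*I(mu)) = 0.08
```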
Likelihood Ratio Tests¶
Likelihood is used not only to estimate parameters, but also to compare models and test hypotheses.
Likelihood Ratio Statistic¶
To test \(H_0: \theta \in \Theta_0\) against \(H_1: \theta \in \Theta_1\), we use the statistic:

$$\Lambda = -2\ln\frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} = 2\left[\ell(\hat{\theta}) - \ell(\hat{\theta}_0)\right]$$

Under \(H_0\), for large \(n\):

$$\Lambda \xrightarrow{d} \chi^2_k$$

where \(k\) is the difference in the number of free parameters between the two models.
Example: Testing if a coin is fair (\(H_0: p = 0.5\)) against \(H_1: p \neq 0.5\).
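Carrying the coin example through numerically: a Python sketch of the test, using the standard identity \(P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})\) for the 1-degree-of-freedom p-value:

```python
import math

k, n = 8, 10  # observed: 8 heads in 10 flips

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

p_hat = k / n                           # unrestricted MLE
LR = 2 * (loglik(p_hat) - loglik(0.5))  # likelihood-ratio statistic
# chi-square survival function with 1 df: P(chi2_1 > x) = erfc(sqrt(x/2))
p_value = math.erfc(math.sqrt(LR / 2))
print(LR, p_value)  # LR ~3.85, p ~0.05: borderline evidence against fairness
```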
Model Selection Criteria¶
AIC (Akaike Information Criterion)¶
$$\text{AIC} = 2k - 2\ell(\hat{\theta})$$

where \(k\) is the number of parameters. Choose the model with the minimum AIC.
Complexity penalty: The \(2k\) term penalizes models with too many parameters (avoids overfitting).
BIC (Bayesian Information Criterion)¶
$$\text{BIC} = k\ln(n) - 2\ell(\hat{\theta})$$

BIC penalizes complexity more strongly than AIC (penalty \(k\ln n\) instead of \(2k\); \(\ln n > 2\) once \(n \geq 8\)).
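Both criteria applied to the running coin example, as a Python sketch (the binomial coefficient is dropped from the log-likelihood since it is common to both models and cancels in comparisons):

```python
import math

k_heads, n = 8, 10

def loglik(p):
    return k_heads * math.log(p) + (n - k_heads) * math.log(1 - p)

# Model 0: fair coin (no free parameters); Model 1: p estimated (one parameter)
ll0, ll1 = loglik(0.5), loglik(0.8)

aic0 = 2 * 0 - 2 * ll0
aic1 = 2 * 1 - 2 * ll1
bic0 = 0 * math.log(n) - 2 * ll0
bic1 = 1 * math.log(n) - 2 * ll1
print(aic0, aic1)  # both criteria favor the one-parameter model on these data
print(bic0, bic1)  # BIC's ln(n) penalty narrows the gap
```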
Common Errors and Misconceptions¶
- Likelihood ≠ Probability: \(L(\theta|x)\) is not \(P(\theta|x)\). Likelihood is not a distribution for \(\theta\).
- MLE can be biased: as in the case of \(\hat{\sigma}^2\) for the normal. Consistency ≠ unbiasedness.
- MLE may not exist or may not be unique: in some pathological models, the likelihood has no well-defined maximum.
- Local maxima: in complex models (e.g., mixture models), \(\ell(\theta)\) can have multiple local maxima, so numerical methods are needed (EM algorithm, gradient-based optimization).
- Small samples: the optimal properties of MLE are asymptotic. For small \(n\), other estimators might perform better.
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(\theta\) | Parameter | Unknown quantity to estimate |
| \(X_i\) | Random variable | Model for the \(i\)-th observation |
| \(x_i\) | Observation | Realized value of \(X_i\) |
| \(f(x\|\theta)\) | PDF/PMF | Density/mass function |
| \(L(\theta\|x)\) | Likelihood | Plausibility of \(\theta\) given data |
| \(\ell(\theta)\) | Log-likelihood | \(\ln L(\theta)\) |
| \(\hat{\theta}_{MLE}\) | MLE estimator | Value that maximizes \(L\) or \(\ell\) |
| \(I(\theta)\) | Fisher information | Amount of information about \(\theta\) |
| \(n\) | Sample size | Number of observations |
Related Concepts¶
- Method of Moments — Alternative estimation method
- Confidence Intervals — Constructed using asymptotic normality of MLE
- Gaussian Distribution — Most common distribution for MLE
- Variance — Estimated via MLE
- Hypothesis Testing — Based on likelihood ratio
Historical and Modern References¶
- Fisher, R. A. (1922). "On the mathematical foundations of theoretical statistics." Philosophical Transactions of the Royal Society A, 222:309-368.
- Fisher, R. A. (1925). "Theory of Statistical Estimation." Proceedings of the Cambridge Philosophical Society, 22:700-725.
- Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press.
- Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation. Springer.