Bernoulli Distribution¶
The Story Behind the Mathematics¶
The Bernoulli distribution is named after Jacob Bernoulli (1655-1705), a Swiss mathematician from the celebrated Bernoulli family. Jacob worked as a professor of mathematics at the University of Basel at a time when probability theory was still in its infancy.
His foundational work, "Ars Conjectandi" (The Art of Conjecturing), was published posthumously in 1713 — eight years after his death. In this work, Bernoulli addressed a fundamental question: "How can we quantify the uncertainty of repeating events?"
Jacob studied practical problems of his time: gambling, mortality, and legal questions where evidence had to be weighed. He realized that many random phenomena could be reduced to the simplest possible form: an experiment with only two possible outcomes.
This insight was revolutionary. Bernoulli understood that by decomposing complex phenomena into a sequence of independent binary experiments, he could apply rigorous mathematical reasoning. In his book, he proved the first version of the Law of Large Numbers — the fact that by repeating an experiment many times, the relative frequency of successes converges to the true probability.
Historical paradox: Although the distribution bears his name, Jacob Bernoulli never called it that. The term "Bernoulli distribution" was coined much later, in the 19th century, when probabilists sought to systematize fundamental discrete distributions.
Pierre-Simon Laplace (1749-1827) and Siméon Denis Poisson (1781-1840) extended Bernoulli's work, applying it to problems in physics, astronomy, and jurisprudence. The simplicity of the Bernoulli distribution made it the fundamental building block for constructing more complex distributions like the Binomial, Geometric, and Negative Binomial.
Why It Matters¶
The Bernoulli distribution is the simplest and most fundamental probability distribution. It's used in:
- Machine Learning: binary classification, logistic regression, neural networks (neurons with sigmoid activation)
- A/B Testing: comparing conversion rates (click/no-click, purchase/no-purchase)
- Quality Control: checking whether a product is defective or non-defective
- Medicine: binary outcomes (recovery/non-recovery, positive/negative test)
- Genetics: presence/absence of an allele
- Physics: spin up/down, particle detection
- Finance: default/non-default on a loan
Whenever we model an event with two possible outcomes, we're using a Bernoulli. It's the base case from which all discrete distributions based on repeated trials derive.
Prerequisites¶
- Concept of discrete random variable
- Expected Value
- Variance
- Probability concept
The Distribution¶
A random variable \(X\) follows a Bernoulli distribution with parameter \(p \in [0, 1]\) if it can take only two values: 0 (failure) and 1 (success), with probabilities:

\[
P(X = 1) = p, \qquad P(X = 0) = 1 - p
\]
Notation: \(X \sim \text{Bernoulli}(p)\) or \(X \sim \text{Ber}(p)\)
Probability Mass Function (PMF)¶
The PMF can be written in compact form as:

\[
P(X = x) = p^x (1-p)^{1-x}, \qquad x \in \{0, 1\}
\]
Let's verify this formula:

- If \(x = 1\): \(P(X=1) = p^1 (1-p)^{1-1} = p \cdot 1 = p\) ✓
- If \(x = 0\): \(P(X=0) = p^0 (1-p)^{1-0} = 1 \cdot (1-p) = 1-p\) ✓
Why this form? The representation \(p^x(1-p)^{1-x}\) is elegant because:

1. It works for any \(x \in \{0,1\}\) without needing separate cases
2. It naturally generalizes to the Binomial distribution
3. It simplifies likelihood calculations
PMF Properties¶
Sums to 1 (probability axiom):

\[
\sum_{x=0}^{1} P(X = x) = (1-p) + p = 1
\]
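The compact PMF is straightforward to code. A minimal sketch in Python (the value `p = 0.3` is just an illustrative choice):

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Compact Bernoulli PMF: p^x * (1-p)^(1-x) for x in {0, 1}."""
    assert x in (0, 1), "Bernoulli support is {0, 1}"
    return p**x * (1 - p)**(1 - x)

p = 0.3
print(bernoulli_pmf(1, p))  # P(X=1) = p = 0.3
print(bernoulli_pmf(0, p))  # P(X=0) = 1-p = 0.7
# The two probabilities sum to 1 (up to float rounding)
print(bernoulli_pmf(0, p) + bernoulli_pmf(1, p))
```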
Derivation of Mean (Expected Value)¶
The expected value of a discrete variable is:

\[
E[X] = \sum_{x} x \, P(X = x)
\]

For Bernoulli:

\[
E[X] = 0 \cdot (1-p) + 1 \cdot p = p
\]
Result: \(E[X] = p\)
Interpretation: The mean of a Bernoulli is simply the probability of success. If we repeated the experiment infinitely many times, the average proportion of successes would converge to \(p\).
Example: If we flip a fair coin (\(p=0.5\)), the mean value is 0.5 — we never get 0.5 in a single flip, but it's the "center" of the distribution.
Derivation of Variance¶
Variance is defined as:

\[
\text{Var}(X) = E[X^2] - (E[X])^2
\]
Step 1: Calculate \(E[X^2]\).

\[
E[X^2] = 0^2 \cdot (1-p) + 1^2 \cdot p = p
\]
Important observation: For Bernoulli, \(E[X^2] = E[X] = p\) because \(X\) only takes values 0 or 1, and \(1^2 = 1\).
Step 2: Apply the variance formula.

\[
\text{Var}(X) = E[X^2] - (E[X])^2 = p - p^2
\]

Factoring:

\[
\text{Var}(X) = p(1-p)
\]
Result: \(\text{Var}(X) = p(1-p)\)
Interpretation:

- Variance is maximum when \(p = 0.5\) (maximum uncertainty): \(\text{Var}(X) = 0.25\)
- Variance is minimum (\(\text{Var}(X) = 0\)) when \(p = 0\) or \(p = 1\) (no uncertainty)
Mental graph: The function \(p(1-p)\) is a parabola with vertex at \(p=0.5\). When the outcome is certain (\(p\) near 0 or 1), there's no variability. When maximally uncertain (\(p=0.5\)), variance is maximum.
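The formulas \(E[X] = p\) and \(\text{Var}(X) = p(1-p)\) can be checked by simulation. A minimal sketch, where the choices `p = 0.3`, `n = 100_000`, and the seed are arbitrary:

```python
import random

random.seed(42)
p = 0.3
n = 100_000

# Draw n Bernoulli(p) samples: 1 with probability p, else 0
samples = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

print(f"empirical mean     ~ {mean:.4f}  (theory: p = {p})")
print(f"empirical variance ~ {var:.4f}  (theory: p(1-p) = {p * (1 - p):.4f})")
```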
Derivation of Other Moments¶
Third Moment (Skewness)¶
The skewness index is:

\[
\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3} = \frac{1 - 2p}{\sqrt{p(1-p)}}
\]
- If \(p < 0.5\): \(\gamma_1 > 0\) (right tail, more 0s than 1s)
- If \(p = 0.5\): \(\gamma_1 = 0\) (symmetric)
- If \(p > 0.5\): \(\gamma_1 < 0\) (left tail, more 1s than 0s)
Moment Generating Function (MGF)¶
The MGF is defined as:

\[
M_X(t) = E[e^{tX}]
\]

For Bernoulli:

\[
M_X(t) = e^{t \cdot 0} (1-p) + e^{t \cdot 1} p = (1-p) + p e^t
\]
Result: \(M_X(t) = 1 - p + pe^t\)
Verification: We can recover moments by differentiating the MGF at \(t=0\):

\[
M_X'(t) = p e^t \;\Rightarrow\; M_X'(0) = p = E[X], \qquad M_X''(0) = p = E[X^2]
\]
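One way to sanity-check these derivatives without symbolic algebra is numerical differentiation of the MGF at \(t=0\). A sketch using central differences (the step size `h` and `p = 0.3` are arbitrary choices, not part of the text):

```python
import math

def mgf(t: float, p: float) -> float:
    """Bernoulli MGF: M_X(t) = 1 - p + p*e^t."""
    return 1 - p + p * math.exp(t)

p, h = 0.3, 1e-4
# First derivative at t=0 via central difference -> E[X] = p
m1 = (mgf(h, p) - mgf(-h, p)) / (2 * h)
# Second derivative at t=0 via central difference -> E[X^2] = p
m2 = (mgf(h, p) - 2 * mgf(0.0, p) + mgf(-h, p)) / h**2
print(m1, m2)  # both ≈ 0.3
```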
Parameter Estimation: Maximum Likelihood¶
Suppose we observe \(n\) i.i.d. realizations \(x_1, \ldots, x_n\) from a Bernoulli(\(p\)). We want to estimate \(p\).
Likelihood Function¶
By independence, the likelihood of the sample is:

\[
L(p) = \prod_{i=1}^n P(X_i = x_i) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i}
\]
Simplifying:

\[
L(p) = p^{\sum_i x_i} (1-p)^{n - \sum_i x_i}
\]
Let \(S = \sum_{i=1}^n x_i\) (number of successes):

\[
L(p) = p^{S} (1-p)^{n-S}
\]
Log-Likelihood¶
\[
\ell(p) = \log L(p) = S \log p + (n - S) \log(1-p)
\]
Maximization¶
Differentiate with respect to \(p\) and set equal to zero:

\[
\frac{d\ell}{dp} = \frac{S}{p} - \frac{n-S}{1-p} = 0
\]
Multiply by \(p(1-p)\):

\[
S(1-p) - (n-S)p = 0 \;\Rightarrow\; S - np = 0 \;\Rightarrow\; \hat{p} = \frac{S}{n}
\]
Result: The MLE estimator \(\hat{p} = S/n = \frac{1}{n}\sum_{i=1}^n x_i\) is the sample proportion of successes.
Example: If we flip a coin 10 times and get 7 heads, \(\hat{p} = 7/10 = 0.7\).
Estimator Properties¶
Unbiasedness:

\[
E[\hat{p}] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{np}{n} = p
\]

Variance:

\[
\text{Var}(\hat{p}) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{p(1-p)}{n}
\]

Standard error:

\[
\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
\]

In practice, we substitute \(p\) with \(\hat{p}\):

\[
\widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
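The MLE and its estimated standard error are two lines of code. A minimal sketch using the 7-heads-in-10-flips example from above:

```python
import math

def bernoulli_mle(samples):
    """Return (p_hat, estimated standard error) for i.i.d. 0/1 samples."""
    n = len(samples)
    p_hat = sum(samples) / n               # MLE: sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # plug-in standard error
    return p_hat, se

# 7 heads in 10 coin flips
data = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
p_hat, se = bernoulli_mle(data)
print(p_hat)  # 0.7
print(se)     # sqrt(0.7 * 0.3 / 10) ≈ 0.145
```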
Relationship with Other Distributions¶
Binomial¶
If \(X_1, \ldots, X_n \sim \text{Bernoulli}(p)\) i.i.d., then:

\[
S = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p)
\]
The Binomial counts the total number of successes in \(n\) independent Bernoulli trials.
In other words: Bernoulli is Binomial with \(n=1\):

\[
\text{Bernoulli}(p) = \text{Binomial}(1, p)
\]
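The identity can be verified directly from the Binomial PMF \(\binom{n}{k} p^k (1-p)^{n-k}\) with \(n=1\). A sketch (`p = 0.3` is an arbitrary example value):

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial PMF: C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.3
# With n=1 the Binomial collapses to the Bernoulli PMF
print(binom_pmf(0, 1, p))  # 1 - p
print(binom_pmf(1, 1, p))  # p
```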
Categorical (Multinomial with k=2)¶
Bernoulli is a special case of the Categorical distribution with only 2 categories.
Confidence Interval for p¶
To construct a 95% confidence interval for \(p\), we use the normal approximation (valid for large \(n\), typically requiring \(np \geq 5\) and \(n(1-p) \geq 5\)):

\[
\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
Wilson interval (more accurate for small samples):

\[
\frac{\hat{p} + \dfrac{z^2}{2n} \pm z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
\]
where \(z = 1.96\) for 95% confidence.
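To see why the Wilson interval is preferred near the boundary, compare the two intervals on a small sample. A sketch, where 1 success in 20 trials (\(\hat{p} = 0.05\)) is an illustrative choice:

```python
import math

def normal_ci(p_hat: float, n: int, z: float = 1.96):
    """Normal-approximation CI: p_hat ± z * sqrt(p_hat(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def wilson_ci(p_hat: float, n: int, z: float = 1.96):
    """Wilson score CI; always stays inside [0, 1]."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(normal_ci(0.05, 20))  # lower bound falls below 0: not a valid probability
print(wilson_ci(0.05, 20))  # stays inside [0, 1]
```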
Complete Practical Example¶
Problem: A medical test for a rare disease gives positive or negative results. We test 200 patients and 15 test positive. Estimate the probability \(p\) of testing positive.
Data: \(n = 200\), \(S = 15\)
Step 1: MLE estimate

\[
\hat{p} = \frac{S}{n} = \frac{15}{200} = 0.075
\]

Step 2: Standard error

\[
\widehat{\text{SE}} = \sqrt{\frac{0.075 \times 0.925}{200}} \approx 0.0186
\]

Step 3: 95% confidence interval

\[
0.075 \pm 1.96 \times 0.0186 = (0.0385, 0.1115)
\]
Interpretation: We are 95% confident that the true positivity rate is between 3.85% and 11.15%.
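The three steps above can be reproduced numerically:

```python
import math

n, S = 200, 15
p_hat = S / n                                   # Step 1: MLE
se = math.sqrt(p_hat * (1 - p_hat) / n)         # Step 2: standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # Step 3: 95% CI

print(p_hat)                        # 0.075
print(round(se, 4))                 # 0.0186
print(round(lo, 4), round(hi, 4))   # 0.0385 0.1115
```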
Hypothesis Test on p¶
Test: \(H_0: p = p_0\) against \(H_1: p \neq p_0\)
Test statistic:

\[
Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}
\]
Decision rule: Reject \(H_0\) if \(|Z| > 1.96\) (5% level).
Example: We want to test if a coin is fair (\(p_0 = 0.5\)). In 100 flips we get 60 heads.

\[
Z = \frac{0.60 - 0.50}{\sqrt{0.5 \times 0.5 / 100}} = \frac{0.10}{0.05} = 2
\]

Since \(|Z| = 2 > 1.96\), we reject \(H_0\) at the 5% level. The coin appears biased.
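The coin example can be verified in a few lines:

```python
import math

def z_statistic(successes: int, n: int, p0: float) -> float:
    """Z statistic for H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

z = z_statistic(60, 100, 0.5)
print(z)  # ≈ 2.0, so |Z| > 1.96 and H0 is rejected at the 5% level
```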
Entropy¶
Shannon entropy measures the uncertainty of the distribution:

\[
H(X) = -\sum_{x} P(X=x) \log_2 P(X=x)
\]

For Bernoulli:

\[
H(X) = -p \log_2 p - (1-p) \log_2 (1-p)
\]
- Maximum when \(p=0.5\): \(H(X) = 1\) bit (maximum uncertainty)
- Minimum when \(p=0\) or \(p=1\): \(H(X) = 0\) bits (no uncertainty)
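The entropy formula, with the usual convention \(0 \log 0 = 0\), can be sketched as:

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy of Bernoulli(p) in bits; 0*log(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bernoulli_entropy(0.5))  # 1.0 bit (maximum uncertainty)
print(bernoulli_entropy(0.0))  # 0.0 bits (no uncertainty)
print(bernoulli_entropy(0.9))  # less than 1 bit
```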
Applications in Machine Learning¶
Logistic Regression¶
Logistic regression models \(P(Y=1 \mid X=x)\) using:

\[
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\]
Given \(X=x\), \(Y \sim \text{Bernoulli}(p(x))\).
Loss Function (Binary Cross-Entropy)¶
To train binary classification models, we minimize:

\[
\text{BCE} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]
\]
This is the negative log-likelihood of the Bernoulli.
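A minimal sketch of binary cross-entropy; the labels and predicted probabilities are invented toy values, and the `eps` clipping is a common numerical safeguard against \(\log 0\):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps: float = 1e-12) -> float:
    """Negative mean Bernoulli log-likelihood; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y = [1, 0, 1, 1]
p_good = [0.9, 0.1, 0.8, 0.7]   # confident, mostly correct predictions
p_bad = [0.5, 0.5, 0.5, 0.5]    # uninformative predictions

loss_good = binary_cross_entropy(y, p_good)
loss_bad = binary_cross_entropy(y, p_bad)
print(loss_good)  # smaller loss
print(loss_bad)   # log(2) ≈ 0.693 for constant 0.5 predictions
```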
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(X\) | Bernoulli variable | Takes value 0 (failure) or 1 (success) |
| \(p\) | Success probability | Parameter \(\in [0,1]\) |
| \(E[X]\) | Expected value | \(p\) |
| \(\text{Var}(X)\) | Variance | \(p(1-p)\) |
| \(\hat{p}\) | MLE estimate | \(\sum x_i / n\) |
| \(n\) | Number of trials | Sample size |
| \(S\) | Number of successes | \(\sum_{i=1}^n x_i\) |
Common Errors¶
- Confusing Bernoulli and Binomial: Bernoulli is a single trial, Binomial counts successes in \(n\) trials.
- Using symmetric intervals for \(p\) near 0 or 1: the normal interval can give values outside \([0,1]\). Use the Wilson interval.
- Forgetting that \(X^2 = X\): for Bernoulli, \(X\) is 0 or 1, so \(X^2 = X\). This simplifies many calculations.
- Interpreting \(p=0.5\) as "less informative": in terms of entropy it is the opposite; \(p=0.5\) gives maximum entropy (1 bit), i.e. maximum uncertainty per trial.
Related Concepts¶
- Binomial Distribution — Sum of \(n\) i.i.d. Bernoulli
- Geometric Distribution — Number of trials until first success
- Likelihood-Based Statistics — MLE estimation of \(p\)
- Sample Mean Estimator — \(\hat{p}\) is a sample mean
- Confidence Interval — Constructing CIs for proportions
References¶
- Bernoulli, J. (1713). Ars Conjectandi. Basel: Thurneysen Brothers.
- Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley.
- Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press.