Bernoulli Distribution

The Story Behind the Mathematics

The Bernoulli distribution is named after Jacob Bernoulli (1655-1705), a Swiss mathematician from the celebrated Bernoulli family. Jacob was a professor of mathematics at the University of Basel at a time when probability theory was still in its infancy.

His foundational work, "Ars Conjectandi" (The Art of Conjecturing), was published posthumously in 1713 — eight years after his death. In this work, Bernoulli addressed a fundamental question: "How can we quantify the uncertainty of repeating events?"

Jacob studied practical problems of his time: gambling, mortality, and legal questions where evidence had to be weighed. He realized that many random phenomena could be reduced to the simplest possible form: an experiment with only two possible outcomes.

This insight was revolutionary. Bernoulli understood that by decomposing complex phenomena into a sequence of independent binary experiments, he could apply rigorous mathematical reasoning. In his book, he proved the first version of the Law of Large Numbers — the fact that by repeating an experiment many times, the relative frequency of successes converges to the true probability.

Historical paradox: Although the distribution bears his name, Jacob Bernoulli never called it that. The term "Bernoulli distribution" was coined much later, in the 19th century, when probabilists sought to systematize fundamental discrete distributions.

Pierre-Simon Laplace (1749-1827) and Siméon Denis Poisson (1781-1840) extended Bernoulli's work, applying it to problems in physics, astronomy, and jurisprudence. The simplicity of the Bernoulli distribution made it the fundamental building block for constructing more complex distributions like the Binomial, Geometric, and Negative Binomial.

Why It Matters

The Bernoulli distribution is the simplest and most fundamental probability distribution. It's used in:

  • Machine Learning: binary classification, logistic regression, neural networks (neurons with sigmoid activation)
  • A/B Testing: comparing conversion rates (click/no-click, purchase/no-purchase)
  • Quality Control: checking whether a product is defective or non-defective
  • Medicine: binary outcomes (recovery/non-recovery, positive/negative test)
  • Genetics: presence/absence of an allele
  • Physics: spin up/down, particle detection
  • Finance: default/non-default on a loan

Whenever we model an event with two possible outcomes, we're using a Bernoulli. It's the base case from which all discrete distributions based on repeated trials derive.

The Distribution

A random variable \(X\) follows a Bernoulli distribution with parameter \(p \in [0, 1]\) if it can take only two values: 0 (failure) and 1 (success), with probabilities:

\[ X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases} \]

Notation: \(X \sim \text{Bernoulli}(p)\) or \(X \sim \text{Ber}(p)\)

Probability Mass Function (PMF)

The PMF can be written in compact form as:

\[ P(X = x) = p^x (1-p)^{1-x} \quad \text{for } x \in \{0, 1\} \]

Let's verify this formula:

  • If \(x = 1\): \(P(X=1) = p^1 (1-p)^{1-1} = p \cdot 1 = p\) ✓
  • If \(x = 0\): \(P(X=0) = p^0 (1-p)^{1-0} = 1 \cdot (1-p) = 1-p\) ✓

Why this form? The representation \(p^x(1-p)^{1-x}\) is elegant because:

  1. It works for any \(x \in \{0,1\}\) without needing separate cases
  2. It naturally generalizes to the Binomial distribution
  3. It simplifies likelihood calculations
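The compact form is easy to check numerically. A minimal sketch in Python (the function name `bernoulli_pmf` is our own):

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) = p^x * (1-p)^(1-x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("Bernoulli support is {0, 1}")
    return p**x * (1 - p)**(1 - x)

# The two cases reduce to p and 1-p, and the PMF sums to 1:
p = 0.3
assert bernoulli_pmf(1, p) == p
assert bernoulli_pmf(0, p) == 1 - p
assert bernoulli_pmf(0, p) + bernoulli_pmf(1, p) == 1.0
```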

PMF Properties

Sums to 1 (probability axiom):

\[ \sum_{x=0}^1 P(X=x) = P(X=0) + P(X=1) = (1-p) + p = 1 \quad ✓ \]

Derivation of Mean (Expected Value)

The expected value of a discrete variable is:

\[ E[X] = \sum_{x} x \cdot P(X=x) \]

For Bernoulli:

\[ E[X] = \sum_{x=0}^1 x \cdot P(X=x) \]
\[ E[X] = 0 \cdot P(X=0) + 1 \cdot P(X=1) \]
\[ E[X] = 0 \cdot (1-p) + 1 \cdot p \]
\[ E[X] = p \]

Result: \(E[X] = p\)

Interpretation: The mean of a Bernoulli is simply the probability of success. If we repeated the experiment infinitely many times, the average proportion of successes would converge to \(p\).

Example: If we flip a fair coin (\(p=0.5\)), the mean value is 0.5 — we never get 0.5 in a single flip, but it's the "center" of the distribution.
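The convergence of the sample mean to \(p\) is easy to watch in simulation. A sketch using only the standard library (the seed is arbitrary, chosen for reproducibility):

```python
import random

random.seed(42)  # arbitrary seed for reproducibility
p = 0.5
n = 100_000

# Each draw is 1 with probability p, else 0
draws = [1 if random.random() < p else 0 for _ in range(n)]
sample_mean = sum(draws) / n

# The empirical proportion of successes is close to p
assert abs(sample_mean - p) < 0.01
```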

Derivation of Variance

Variance is defined as:

\[ \text{Var}(X) = E[X^2] - (E[X])^2 \]

Step 1: Calculate \(E[X^2]\).

\[ E[X^2] = \sum_{x=0}^1 x^2 \cdot P(X=x) \]
\[ E[X^2] = 0^2 \cdot P(X=0) + 1^2 \cdot P(X=1) \]
\[ E[X^2] = 0 \cdot (1-p) + 1 \cdot p = p \]

Important observation: For Bernoulli, \(E[X^2] = E[X] = p\) because \(X\) only takes values 0 or 1, and \(1^2 = 1\).

Step 2: Apply the variance formula.

\[ \text{Var}(X) = E[X^2] - (E[X])^2 \]
\[ \text{Var}(X) = p - p^2 \]

Factoring:

\[ \text{Var}(X) = p(1-p) \]

Result: \(\text{Var}(X) = p(1-p)\)

Interpretation:

  • Variance is maximum when \(p = 0.5\) (maximum uncertainty): \(\text{Var}(X) = 0.25\)
  • Variance is minimum (\(\text{Var}(X) = 0\)) when \(p = 0\) or \(p = 1\) (no uncertainty)

Visual intuition: The function \(p(1-p)\) is a downward-opening parabola with vertex at \(p=0.5\). When the outcome is certain (\(p\) near 0 or 1), there's no variability; when it is maximally uncertain (\(p=0.5\)), variance is maximum.
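A quick numerical sweep confirms the parabola's shape (a small check, with our own helper name `bernoulli_var`):

```python
def bernoulli_var(p: float) -> float:
    """Variance of Bernoulli(p): p(1-p)."""
    return p * (1 - p)

# Sweep p over [0, 1] in steps of 0.01
ps = [i / 100 for i in range(101)]
variances = [bernoulli_var(p) for p in ps]

# Maximum variance occurs at p = 0.5 and equals 0.25
assert max(variances) == bernoulli_var(0.5) == 0.25
# No uncertainty at the endpoints
assert bernoulli_var(0.0) == bernoulli_var(1.0) == 0.0
```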

Derivation of Other Moments

Third Moment (Skewness)

\[ E[X^3] = \sum_{x=0}^1 x^3 \cdot P(X=x) = 0^3(1-p) + 1^3 p = p \]

The skewness index is \(\gamma_1 = E[(X-\mu)^3]/\sigma^3\). With \(\mu = p\), the third central moment is \(E[(X-p)^3] = (1-p)^3 p + (-p)^3 (1-p) = p(1-p)(1-2p)\), and \(\sigma^3 = (p(1-p))^{3/2}\), so:

\[ \gamma_1 = \frac{E[(X-\mu)^3]}{\sigma^3} = \frac{p(1-p)(1-2p)}{(p(1-p))^{3/2}} = \frac{1-2p}{\sqrt{p(1-p)}} \]
  • If \(p < 0.5\): \(\gamma_1 > 0\) (right tail, more 0s than 1s)
  • If \(p = 0.5\): \(\gamma_1 = 0\) (symmetric)
  • If \(p > 0.5\): \(\gamma_1 < 0\) (left tail, more 1s than 0s)

Moment Generating Function (MGF)

The MGF is defined as:

\[ M_X(t) = E[e^{tX}] \]

For Bernoulli:

\[ M_X(t) = \sum_{x=0}^1 e^{tx} \cdot P(X=x) \]
\[ M_X(t) = e^{t \cdot 0}(1-p) + e^{t \cdot 1}p \]
\[ M_X(t) = (1-p) + pe^t \]

Result: \(M_X(t) = 1 - p + pe^t\)

Verification: We can recover moments by differentiating the MGF at \(t=0\):

\[ M_X'(t) = pe^t \quad \Rightarrow \quad M_X'(0) = p = E[X] \quad ✓ \]
\[ M_X''(t) = pe^t \quad \Rightarrow \quad M_X''(0) = p = E[X^2] \quad ✓ \]
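The same verification can be done numerically with finite differences (a sketch; the step size \(h\) is an arbitrary choice):

```python
import math

def mgf(t: float, p: float) -> float:
    """Bernoulli MGF: M_X(t) = 1 - p + p*e^t."""
    return 1 - p + p * math.exp(t)

p, h = 0.3, 1e-4

# Central difference approximates M'(0) = E[X] = p
m_prime_0 = (mgf(h, p) - mgf(-h, p)) / (2 * h)
assert abs(m_prime_0 - p) < 1e-7

# Second central difference approximates M''(0) = E[X^2] = p
m_second_0 = (mgf(h, p) - 2 * mgf(0.0, p) + mgf(-h, p)) / h**2
assert abs(m_second_0 - p) < 1e-6
```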

Parameter Estimation: Maximum Likelihood

Suppose we observe \(n\) i.i.d. realizations \(x_1, \ldots, x_n\) from a Bernoulli(\(p\)). We want to estimate \(p\).

Likelihood Function

\[ L(p) = \prod_{i=1}^n P(X_i = x_i) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \]

Simplifying:

\[ L(p) = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i} \]

Let \(S = \sum_{i=1}^n x_i\) (number of successes):

\[ L(p) = p^S (1-p)^{n-S} \]

Log-Likelihood

\[ \ell(p) = \ln L(p) = S \ln p + (n-S) \ln(1-p) \]

Maximization

Differentiate with respect to \(p\) and set equal to zero:

\[ \frac{d\ell}{dp} = \frac{S}{p} - \frac{n-S}{1-p} = 0 \]

Multiply by \(p(1-p)\):

\[ S(1-p) - (n-S)p = 0 \]
\[ S - Sp - np + Sp = 0 \]
\[ S = np \]
\[ \hat{p}_{MLE} = \frac{S}{n} = \frac{\sum_{i=1}^n x_i}{n} \]

Result: The MLE estimator of \(p\) is the sample proportion of successes.

Example: If we flip a coin 10 times and get 7 heads, \(\hat{p} = 7/10 = 0.7\).

Estimator Properties

Unbiasedness:

\[ E[\hat{p}] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot np = p \]

Variance:

\[ \text{Var}(\hat{p}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n p(1-p) = \frac{p(1-p)}{n} \]

Standard error:

\[ \text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} \]

In practice, we substitute \(p\) with \(\hat{p}\):

\[ \widehat{\text{SE}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
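The estimator and its estimated standard error translate directly into code (a sketch; the function names are our own):

```python
import math

def bernoulli_mle(xs: list[int]) -> float:
    """MLE of p: the sample proportion of successes."""
    return sum(xs) / len(xs)

def bernoulli_se(xs: list[int]) -> float:
    """Estimated standard error, substituting p_hat for p."""
    n = len(xs)
    p_hat = bernoulli_mle(xs)
    return math.sqrt(p_hat * (1 - p_hat) / n)

# 7 heads out of 10 flips, as in the coin example above
flips = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
assert bernoulli_mle(flips) == 0.7
assert abs(bernoulli_se(flips) - math.sqrt(0.7 * 0.3 / 10)) < 1e-12
```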

Relationship with Other Distributions

Binomial

If \(X_1, \ldots, X_n \sim \text{Bernoulli}(p)\) i.i.d., then:

\[ Y = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p) \]

The Binomial counts the total number of successes in \(n\) independent Bernoulli trials.

In other words: Bernoulli is Binomial with \(n=1\):

\[ \text{Bernoulli}(p) = \text{Binomial}(1, p) \]
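The identity \(\text{Bernoulli}(p) = \text{Binomial}(1, p)\) can be checked against the Binomial PMF directly (a small sketch using `math.comb`):

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial PMF: C(n, k) * p^k * (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def bernoulli_pmf(x: int, p: float) -> float:
    return p**x * (1 - p)**(1 - x)

# Binomial with n = 1 reduces to Bernoulli
p = 0.3
for x in (0, 1):
    assert abs(binomial_pmf(x, 1, p) - bernoulli_pmf(x, p)) < 1e-15
```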

Categorical (Multinomial with k=2)

Bernoulli is a special case of the Categorical distribution with only 2 categories.

Confidence Interval for p

To construct a 95% confidence interval for \(p\), we use the normal approximation (valid for large \(n\) and \(np, n(1-p) \geq 5\)):

\[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Wilson interval (more accurate for small samples):

\[ \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}} \]

where \(z = 1.96\) for 95% confidence.
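Both intervals can be implemented side by side, which makes the difference near the boundary visible (a sketch, assuming \(z = 1.96\); the function names are our own):

```python
import math

def normal_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval: p_hat +/- z * SE."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval: stays inside [0, 1] even for extreme p_hat."""
    center = p_hat + z**2 / (2 * n)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - half) / denom, (center + half) / denom

# With few successes the normal interval can dip below 0; Wilson cannot
lo_wald, _ = normal_ci(0.02, 50)
lo_wilson, _ = wilson_ci(0.02, 50)
assert lo_wald < 0 <= lo_wilson
```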

Complete Practical Example

Problem: A medical test for a rare disease gives positive or negative results. We test 200 patients and 15 test positive. Estimate the probability \(p\) of testing positive.

Data: \(n = 200\), \(S = 15\)

Step 1: MLE estimate

\[ \hat{p} = \frac{S}{n} = \frac{15}{200} = 0.075 \]

Step 2: Standard error

\[ \widehat{\text{SE}} = \sqrt{\frac{0.075 \cdot 0.925}{200}} = \sqrt{\frac{0.0694}{200}} \approx 0.0186 \]

Step 3: 95% confidence interval

\[ CI_{95\%} = 0.075 \pm 1.96 \times 0.0186 = 0.075 \pm 0.0365 = [0.0385, 0.1115] \]

Interpretation: We are 95% confident that the true positivity rate is between 3.85% and 11.15%.
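The three steps above can be reproduced in a few lines (a sketch with the example's numbers):

```python
import math

n, S = 200, 15

# Step 1: MLE estimate
p_hat = S / n
assert p_hat == 0.075

# Step 2: estimated standard error
se = math.sqrt(p_hat * (1 - p_hat) / n)
assert abs(se - 0.0186) < 1e-3

# Step 3: 95% confidence interval
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
assert abs(lo - 0.0385) < 1e-3 and abs(hi - 0.1115) < 1e-3
```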

Hypothesis Test on p

Test: \(H_0: p = p_0\) against \(H_1: p \neq p_0\)

Test statistic:

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \sim \mathcal{N}(0,1) \quad \text{under } H_0 \]

Decision rule: Reject \(H_0\) if \(|Z| > 1.96\) (5% level).

Example: We want to test if a coin is fair (\(p_0 = 0.5\)). In 100 flips we get 60 heads.

\[ Z = \frac{0.6 - 0.5}{\sqrt{\frac{0.5 \cdot 0.5}{100}}} = \frac{0.1}{0.05} = 2 \]

Since \(|2| > 1.96\), we reject \(H_0\) at the 5% level. The coin appears biased.
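The same test in code (a sketch; `z_test_proportion` is our own name):

```python
import math

def z_test_proportion(p_hat: float, p0: float, n: int) -> float:
    """Z statistic for H0: p = p0 under the normal approximation."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# 60 heads in 100 flips, tested against p0 = 0.5
z = z_test_proportion(0.6, 0.5, 100)
assert abs(z - 2.0) < 1e-12

reject = abs(z) > 1.96  # 5% level
assert reject
```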

Entropy

Shannon entropy measures the uncertainty of the distribution:

\[ H(X) = -\sum_{x} P(X=x) \log_2 P(X=x) \]

For Bernoulli:

\[ H(X) = -p\log_2 p - (1-p)\log_2(1-p) \]
  • Maximum when \(p=0.5\): \(H(X) = 1\) bit (maximum uncertainty)
  • Minimum when \(p=0\) or \(p=1\): \(H(X) = 0\) bits (no uncertainty)
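A small implementation, handling the \(p=0\) and \(p=1\) endpoints where \(0 \log_2 0\) is taken as 0 (a sketch; the function name is our own):

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

assert bernoulli_entropy(0.5) == 1.0
assert bernoulli_entropy(0.0) == bernoulli_entropy(1.0) == 0.0
# Entropy is symmetric in p and 1-p
assert abs(bernoulli_entropy(0.3) - bernoulli_entropy(0.7)) < 1e-12
```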

Applications in Machine Learning

Logistic Regression

Logistic regression models \(P(Y=1|X=x)\) using:

\[ P(Y=1|X=x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]

Given \(X=x\), \(Y \sim \text{Bernoulli}(p(x))\).
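A sketch of the sigmoid mapping (the coefficients `beta0`, `beta1` below are hypothetical, chosen only for illustration):

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function: maps any real t into (0, 1)."""
    return 1 / (1 + math.exp(-t))

# Hypothetical coefficients, for illustration only
beta0, beta1 = -1.0, 2.0
x = 0.5
p_x = sigmoid(beta0 + beta1 * x)

# The output is always a valid Bernoulli parameter
assert 0 < p_x < 1
assert sigmoid(0.0) == 0.5
```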

Loss Function (Binary Cross-Entropy)

To train binary classification models, we minimize:

\[ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)] \]

This is the negative log-likelihood of the Bernoulli.
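That equivalence is easy to verify: averaging the per-sample cross-entropy terms gives the same value as taking \(-\frac{1}{n}\log\) of the product of Bernoulli PMFs (a sketch using natural log, as is standard in ML; function names are our own):

```python
import math

def bce(y_true: list[int], p_pred: list[float]) -> float:
    """Binary cross-entropy: average Bernoulli negative log-likelihood."""
    n = len(y_true)
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / n

def neg_log_likelihood(y_true: list[int], p_pred: list[float]) -> float:
    """-(1/n) * log of the product of Bernoulli PMFs."""
    n = len(y_true)
    lik = 1.0
    for y, p in zip(y_true, p_pred):
        lik *= p**y * (1 - p)**(1 - y)
    return -math.log(lik) / n

y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.7, 0.6]
assert abs(bce(y, p) - neg_log_likelihood(y, p)) < 1e-12
```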

Variables and Symbols

| Symbol | Name | Description |
|---|---|---|
| \(X\) | Bernoulli variable | Takes value 0 (failure) or 1 (success) |
| \(p\) | Success probability | Parameter \(\in [0,1]\) |
| \(E[X]\) | Expected value | \(p\) |
| \(\text{Var}(X)\) | Variance | \(p(1-p)\) |
| \(\hat{p}\) | MLE estimate | \(\sum x_i / n\) |
| \(n\) | Number of trials | Sample size |
| \(S\) | Number of successes | \(\sum_{i=1}^n x_i\) |

Common Errors

  1. Confusing Bernoulli and Binomial: Bernoulli is a single trial, Binomial counts successes in \(n\) trials.

  2. Using symmetric intervals for \(p\) near 0 or 1: Normal interval can give values outside \([0,1]\). Use Wilson interval.

  3. Forgetting that \(X^2 = X\): For Bernoulli, \(X\) is 0 or 1, so \(X^2 = X\). This simplifies many calculations.

  4. Interpreting \(p=0.5\) as "less informative": It's the most informative in terms of entropy — maximum uncertainty.

References

  • Bernoulli, J. (1713). Ars Conjectandi. Basel: Thurneysen Brothers.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley.
  • Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury Press.