Bayesian Classifier¶
The Story Behind the Mathematics¶
The story of Bayesian classification begins with one of the most revolutionary ideas in probability theory, formulated by an 18th-century Presbyterian minister who never published his work during his lifetime.
Thomas Bayes (1701-1761) was a nonconformist minister in Tunbridge Wells, England. Unlike the prominent mathematicians of his era, Bayes worked in relative obscurity, pursuing mathematics as an intellectual passion rather than a profession. His most important work, "An Essay towards solving a Problem in the Doctrine of Chances," was discovered among his papers after his death by his friend Richard Price, who published it in 1763.
The revolutionary insight: Bayes showed how to invert conditional probabilities — how to reason from effects back to causes, from data back to hypotheses. This was philosophically radical: it meant we could update our beliefs based on evidence, treating probability as a measure of uncertainty rather than just frequency.
For over 150 years, Bayes' ideas remained controversial. The dominant frequentist school, led by figures like Ronald Fisher and Jerzy Neyman, rejected the notion of probability as subjective belief. But Bayes' theorem quietly found practical applications.
Modern revival: The computational revolution of the late 20th century transformed Bayesian methods from theoretical curiosities to practical powerhouses:
- 1950s-60s: Early applications in signal processing and decision theory
- 1980s: Development of Markov Chain Monte Carlo (MCMC) made complex Bayesian computations feasible
- 1990s: Naive Bayes classifiers emerged as surprisingly effective for text classification (spam filtering)
- 2000s: Bayesian networks became standard in medical diagnosis and expert systems
- 2010s-present: Bayesian methods underpin modern machine learning (Bayesian optimization, variational inference, probabilistic programming)
Why "Naive" Bayes?: The most widely used Bayesian classifier makes a "naive" assumption — that all features are conditionally independent given the class. In other words: once we know which class an example belongs to, knowing the value of one feature tells us nothing about the values of the other features.
Example: In classifying emails as spam or not spam, the naive assumption says that if we know an email is spam, the presence of the word "free" is independent of the presence of the word "win". In reality these words tend to appear together (if "free" is present it's more likely that "win" is also present), so the assumption is "naive".
This assumption is almost always wrong, yet the method works remarkably well in practice. This paradox fascinated researchers and made Naive Bayes a fundamental benchmark in machine learning.
Philosophical significance: Bayesian thinking represents a different philosophy of inference: knowledge is uncertain, beliefs should be quantified as probabilities, and evidence should update these beliefs in a principled way. This framework now dominates fields from artificial intelligence to neuroscience to judicial reasoning.
Why It Matters¶
Bayesian classifiers are fundamental to modern statistical learning and are used in:
- Email Filtering: Spam detection (classic application of Naive Bayes)
- Text Classification: Document categorization, sentiment analysis, topic modeling
- Medical Diagnosis: Disease prediction from symptoms and test results
- Recommender Systems: Content filtering, user preference prediction
- Bioinformatics: Protein classification, gene expression analysis
- Computer Vision: Object recognition, image classification
- Natural Language Processing: Part-of-speech tagging, named entity recognition
- Fraud Detection: Credit card fraud, insurance fraud
- Weather Forecasting: Probabilistic forecasts based on atmospheric conditions
- Speech Recognition: Phoneme classification, word recognition
Bayesian classifiers provide probabilistic predictions — not just class labels, but confidence estimates — making them invaluable when uncertainty quantification matters.
Prerequisites¶
- Probability fundamentals (random variables, conditional probability)
- Bayes' Theorem (essential!)
- Bernoulli Distribution and discrete distributions
- Gaussian Distribution (for continuous features)
- Likelihood-Based Statistics (for parameter estimation)
- Concept of Independence and conditional independence
Fundamental Concepts¶
We'll build Bayesian classification from first principles, starting with Bayes' theorem and deriving the optimal decision rule.
The Classification Problem¶
We observe:

- Features: \(\mathbf{X} = (X_1, X_2, \ldots, X_p)\) (could be discrete, continuous, or mixed)
- Class label: \(Y \in \{1, 2, \ldots, K\}\) (discrete)
Goal: Given new features \(\mathbf{x}\), predict the class \(y\).
Bayesian approach: Use probability theory to quantify uncertainty and make optimal predictions.
Bayes' Theorem: The Foundation¶
Bayes' theorem for classification:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)}{P(\mathbf{X} = \mathbf{x})}\]
Components:
- Posterior probability \(P(Y = k | \mathbf{X} = \mathbf{x})\): Probability of class \(k\) given features \(\mathbf{x}\) (what we want)
- Likelihood \(P(\mathbf{X} = \mathbf{x} | Y = k)\): Probability of observing \(\mathbf{x}\) in class \(k\) (from data)
- Prior \(P(Y = k)\): Probability of class \(k\) before seeing data (from class frequencies or domain knowledge)
- Evidence \(P(\mathbf{X} = \mathbf{x})\): Total probability of observing \(\mathbf{x}\) (normalizing constant)
Expanding the evidence using the law of total probability:

\[P(\mathbf{X} = \mathbf{x}) = \sum_{l=1}^{K} P(\mathbf{X} = \mathbf{x} \mid Y = l)\, P(Y = l)\]
Full Bayes' theorem:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)}{\sum_{l=1}^{K} P(\mathbf{X} = \mathbf{x} \mid Y = l)\, P(Y = l)}\]
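The arithmetic above can be checked in a few lines; a minimal sketch, with made-up prior and likelihood values for a single observed \(\mathbf{x}\):

```python
# Checking Bayes' theorem numerically for a two-class problem.
# The prior and likelihood values are made up, for illustration only.
prior = {"spam": 0.6, "ham": 0.4}
likelihood = {"spam": 0.05, "ham": 0.01}   # P(x | class) for one observed x

# Evidence: total probability of x (law of total probability)
evidence = sum(likelihood[k] * prior[k] for k in prior)

# Posterior: likelihood times prior, normalized by the evidence
posterior = {k: likelihood[k] * prior[k] / evidence for k in prior}
# posterior["spam"] = 0.03 / 0.034, roughly 0.882
```

Note that the posteriors sum to one by construction, since the evidence is exactly the normalizing constant.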
The Optimal Bayesian Decision Rule¶
Question: Given posterior probabilities \(P(Y = k | \mathbf{X} = \mathbf{x})\) for all classes, how should we classify?
Bayes classifier (also called Bayes optimal classifier):

\[\hat{y} = h^*(\mathbf{x}) = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k \mid \mathbf{X} = \mathbf{x})\]
Interpretation: Choose the class with the highest posterior probability.
Why is this optimal? This rule minimizes the probability of misclassification.
Proof: The probability of error for a classifier \(h(\mathbf{x})\) is:

\[P(\text{error}) = P(h(\mathbf{X}) \neq Y) = 1 - P(h(\mathbf{X}) = Y)\]
To minimize error, we must maximize \(P(h(\mathbf{X}) = Y)\). For each \(\mathbf{x}\), the best choice is:

\[h^*(\mathbf{x}) = \arg\max_k P(Y = k \mid \mathbf{X} = \mathbf{x})\]
Result: No other classifier can achieve lower error rate than the Bayes classifier (if we know the true probabilities).
Bayes error rate: The minimum possible error rate:

\[\text{Err}_{\text{Bayes}} = 1 - \mathbb{E}_{\mathbf{X}}\left[\max_k P(Y = k \mid \mathbf{X})\right]\]
This is an irreducible error floor — the best any classifier can do given the feature space.
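When the joint distribution is known exactly, the Bayes error rate can be computed directly. A toy sketch with one binary feature and two classes, using illustrative probabilities:

```python
# Bayes error rate for a toy problem where the joint distribution is known.
# joint[(x, y)] = P(X=x, Y=y); the numbers are illustrative.
joint = {
    (0, "a"): 0.30, (0, "b"): 0.10,
    (1, "a"): 0.15, (1, "b"): 0.45,
}

# Err_Bayes = sum_x P(x) * (1 - max_k P(k | x)) = sum_x (P(x) - max_k P(x, k))
bayes_error = 0.0
for x in (0, 1):
    p_x = joint[(x, "a")] + joint[(x, "b")]
    bayes_error += p_x - max(joint[(x, "a")], joint[(x, "b")])
# bayes_error = 0.25: even the optimal rule errs 25% of the time here
```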
Simplification for Classification¶
Since the evidence \(P(\mathbf{X} = \mathbf{x})\) is the same for all classes, we can ignore it for classification:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)\]
Equivalent classification rule:

\[\hat{y} = \arg\max_k P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)\]
In words: Multiply the likelihood by the prior, and choose the class with the highest product.
Log form (often more convenient numerically):

\[\hat{y} = \arg\max_k \left[\log P(\mathbf{X} = \mathbf{x} \mid Y = k) + \log P(Y = k)\right]\]
Naive Bayes Classifier¶
The Naive Bayes classifier is the most widely used Bayesian classifier. It makes a strong simplifying assumption.
The Naive Independence Assumption¶
Assumption: Features are conditionally independent given the class:

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Meaning: Once we know the class, knowing one feature's value gives no information about another feature's value.
Example: For spam classification, this assumes that given an email is spam, the presence of the word "free" is independent of the presence of "winner" — clearly false! Yet Naive Bayes works.
Why "naive"? This assumption is almost always wrong in real data, but it makes computation tractable.
Naive Bayes Classification Rule¶
Substituting the independence assumption into Bayes' rule:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Classification rule:

\[\hat{y} = \arg\max_k P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Log form (preferred for numerical stability):

\[\hat{y} = \arg\max_k \left[\log P(Y = k) + \sum_{j=1}^{p} \log P(X_j = x_j \mid Y = k)\right]\]
Interpretation: Add up log-probabilities (evidence) from each feature plus the log-prior.
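The log-form rule translates almost directly into code. In this sketch the `log_prior` and `log_lik` lookup tables are hypothetical placeholders for estimates learned from data:

```python
import math

# Naive Bayes scoring in log form: log-prior plus per-feature log-likelihoods.
# log_prior[k] = log P(Y=k); log_lik[k][j][v] = log P(X_j = v | Y=k).
def nb_log_score(x, log_prior, log_lik):
    """Return log P(Y=k) + sum_j log P(X_j = x_j | Y=k) for every class k."""
    return {
        k: log_prior[k] + sum(log_lik[k][j][xj] for j, xj in enumerate(x))
        for k in log_prior
    }

def nb_predict(x, log_prior, log_lik):
    """Pick the class with the highest log-score."""
    scores = nb_log_score(x, log_prior, log_lik)
    return max(scores, key=scores.get)
```

Working in log-space avoids underflow when many small per-feature probabilities would otherwise be multiplied together.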
Estimating Parameters from Data¶
To use Naive Bayes, we must estimate:

1. Class priors \(P(Y = k)\)
2. Feature likelihoods \(P(X_j = x \mid Y = k)\) for each feature \(j\) and class \(k\)
Training data: \(\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}\)
Estimating Class Priors¶
Maximum likelihood estimate:

\[\hat{P}(Y = k) = \frac{n_k}{n}\]
where \(n_k\) is the number of training examples in class \(k\).
Interpretation: Fraction of training data belonging to class \(k\).
Estimating Feature Likelihoods¶
The estimation method depends on the feature type.
For discrete features (e.g., word presence, categorical variables), use frequency counts:

\[\hat{P}(X_j = v \mid Y = k) = \frac{\#\{i : x_{ij} = v,\ y_i = k\}}{n_k}\]
For continuous features (e.g., measurements):
Assume a distribution (typically Gaussian):

\[P(X_j = x \mid Y = k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x - \mu_{jk})^2}{2\sigma_{jk}^2}\right)\]

Estimate the parameters from data:

\[\hat{\mu}_{jk} = \frac{1}{n_k} \sum_{i : y_i = k} x_{ij}, \qquad \hat{\sigma}_{jk}^2 = \frac{1}{n_k} \sum_{i : y_i = k} (x_{ij} - \hat{\mu}_{jk})^2\]
Laplace Smoothing (Additive Smoothing)¶
Problem: What if a feature value \(v\) never appears with class \(k\) in the training data?

Then \(\hat{P}(X_j = v \mid Y = k) = 0\), so \(P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto 0\) for any \(\mathbf{x}\) containing \(v\) — the class gets zero probability, no matter how strongly the other features support it!
Solution: Laplace smoothing adds a small pseudo-count \(\alpha > 0\):

\[\hat{P}(X_j = v \mid Y = k) = \frac{\#\{i : x_{ij} = v,\ y_i = k\} + \alpha}{n_k + \alpha |V_j|}\]
where \(|V_j|\) is the number of possible values for feature \(j\).
Common choice: \(\alpha = 1\) (Laplace's original "add-one" smoothing)
Effect:

- Never assigns zero probability to unseen events
- Regularizes estimates (reduces overfitting)
- Has minimal impact when counts are large
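A minimal helper implementing the smoothed estimate; the counts passed in below are illustrative:

```python
# Laplace-smoothed estimate of P(X_j = v | Y = k), following the formula above:
# (count + alpha) / (n_k + alpha * |V_j|).
def smoothed_prob(count_v_in_k, n_k, n_values, alpha=1.0):
    """count_v_in_k: times value v occurs with class k; n_values: |V_j|."""
    return (count_v_in_k + alpha) / (n_k + alpha * n_values)

# An unseen value (count 0) still gets nonzero probability:
p_unseen = smoothed_prob(0, n_k=2, n_values=2)   # (0 + 1) / (2 + 2) = 0.25
p_seen = smoothed_prob(3, n_k=3, n_values=2)     # (3 + 1) / (3 + 2) = 0.8
```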
Complete Worked Example: Text Classification¶
Problem: Classify emails as spam or ham (not spam) using word occurrences.
Training data:
| # | Class | Contains "free"? | Contains "money"? | Contains "meeting"? |
|---|---|---|---|---|
| 1 | Spam | Yes | Yes | No |
| 2 | Spam | Yes | No | No |
| 3 | Ham | No | No | Yes |
| 4 | Ham | No | Yes | Yes |
| 5 | Spam | Yes | Yes | No |
Step 1: Estimate class priors.
- \(n_{\text{spam}} = 3\), \(n_{\text{ham}} = 2\), \(n = 5\)
- \(\hat{P}(Y = \text{spam}) = 3/5 = 0.6\)
- \(\hat{P}(Y = \text{ham}) = 2/5 = 0.4\)
Step 2: Estimate feature likelihoods (with Laplace smoothing \(\alpha = 1\)).
For "free":
- Spam: appears in 3/3 spam emails, so \(\hat{P}(\text{free} \mid \text{spam}) = \frac{3 + 1}{3 + 2} = 0.8\)
- Ham: appears in 0/2 ham emails, so \(\hat{P}(\text{free} \mid \text{ham}) = \frac{0 + 1}{2 + 2} = 0.25\)

For "money":
- Spam: appears in 2/3 spam emails, so \(\hat{P}(\text{money} \mid \text{spam}) = \frac{2 + 1}{3 + 2} = 0.6\)
- Ham: appears in 1/2 ham emails, so \(\hat{P}(\text{money} \mid \text{ham}) = \frac{1 + 1}{2 + 2} = 0.5\)

For "meeting":
- Spam: appears in 0/3 spam emails, so \(\hat{P}(\text{meeting} \mid \text{spam}) = \frac{0 + 1}{3 + 2} = 0.2\)
- Ham: appears in 2/2 ham emails, so \(\hat{P}(\text{meeting} \mid \text{ham}) = \frac{2 + 1}{2 + 2} = 0.75\)

(Each denominator is \(n_k + \alpha |V_j|\), with \(|V_j| = 2\) for binary features.)
Step 3: Classify a new email with features: free=yes, money=no, meeting=yes.
For spam:

\[\hat{P}(\text{spam}) \cdot \hat{P}(\text{free} \mid \text{spam}) \cdot \left(1 - \hat{P}(\text{money} \mid \text{spam})\right) \cdot \hat{P}(\text{meeting} \mid \text{spam}) = 0.6 \times 0.8 \times 0.4 \times 0.2 = 0.0384\]

For ham:

\[0.4 \times 0.25 \times 0.5 \times 0.75 = 0.0375\]
Decision: Since \(0.0384 > 0.0375\), classify as spam.
Note: We can also compute normalized posterior probabilities:

\[P(\text{spam} \mid \mathbf{x}) = \frac{0.0384}{0.0384 + 0.0375} \approx 0.506, \qquad P(\text{ham} \mid \mathbf{x}) \approx 0.494\]
The email is 50.6% likely to be spam.
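The whole calculation can be reproduced in a short script, following the worked example above with \(\alpha = 1\) smoothing and binary features:

```python
# Reproducing the worked spam/ham example (alpha = 1 Laplace smoothing).
# Rows: (class, free, money, meeting), with 1 = word present.
data = [
    ("spam", 1, 1, 0),
    ("spam", 1, 0, 0),
    ("ham",  0, 0, 1),
    ("ham",  0, 1, 1),
    ("spam", 1, 1, 0),
]
alpha, n_values = 1.0, 2  # binary features have |V_j| = 2

classes = {"spam", "ham"}
n = len(data)
n_k = {k: sum(1 for row in data if row[0] == k) for k in classes}
prior = {k: n_k[k] / n for k in classes}

# Smoothed P(feature_j = 1 | class k)
p_yes = {
    k: [
        (sum(row[j + 1] for row in data if row[0] == k) + alpha)
        / (n_k[k] + alpha * n_values)
        for j in range(3)
    ]
    for k in classes
}

def score(k, x):
    """Unnormalized posterior: prior times product of feature likelihoods."""
    s = prior[k]
    for j, xj in enumerate(x):
        s *= p_yes[k][j] if xj else (1 - p_yes[k][j])
    return s

x_new = (1, 0, 1)  # free=yes, money=no, meeting=yes
s_spam, s_ham = score("spam", x_new), score("ham", x_new)
print(round(s_spam, 4), round(s_ham, 4))    # 0.0384 0.0375
print(round(s_spam / (s_spam + s_ham), 3))  # 0.506
```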
Gaussian Naive Bayes¶
When features are continuous (real-valued measurements), we typically assume Gaussian distributions.
Model: For class \(k\) and feature \(j\):

\[X_j \mid Y = k \sim \mathcal{N}(\mu_{jk}, \sigma_{jk}^2)\]
Classification rule:

\[\hat{y} = \arg\max_k \left[\log P(Y = k) + \sum_{j=1}^{p} \log \mathcal{N}(x_j;\, \mu_{jk}, \sigma_{jk}^2)\right]\]
Log-likelihood for Gaussian:

\[\log \mathcal{N}(x_j;\, \mu_{jk}, \sigma_{jk}^2) = -\frac{1}{2}\log(2\pi\sigma_{jk}^2) - \frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\]
Simplification (dropping constants independent of \(k\)):

\[\hat{y} = \arg\max_k \left[\log P(Y = k) - \sum_{j=1}^{p} \left(\frac{1}{2}\log \sigma_{jk}^2 + \frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)\right]\]
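A from-scratch sketch of Gaussian Naive Bayes using the formulas above, with maximum-likelihood means and variances per class and feature (function names and data layout are our own choices):

```python
import math

# Gaussian Naive Bayes sketch: per-class, per-feature mean and variance,
# scored with the Gaussian log-likelihood.
def fit_gaussian_nb(X, y):
    """X: list of feature lists; y: list of class labels."""
    params = {}
    for k in set(y):
        rows = [x for x, yi in zip(X, y) if yi == k]
        prior = len(rows) / len(X)
        mus = [sum(col) / len(col) for col in zip(*rows)]
        vars_ = [
            sum((v - mu) ** 2 for v in col) / len(col)  # MLE variance (1/n_k)
            for col, mu in zip(zip(*rows), mus)
        ]
        params[k] = (prior, mus, vars_)
    return params

def predict(params, x):
    """Class with the highest log-posterior score."""
    def log_score(k):
        prior, mus, vars_ = params[k]
        s = math.log(prior)
        for xj, mu, var in zip(x, mus, vars_):
            s += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        return s
    return max(params, key=log_score)
```

In practice a small constant is often added to each variance to guard against zero-variance features; that detail is omitted here for brevity.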
Comparison with LDA¶
Gaussian Naive Bayes vs. Linear Discriminant Analysis:
- Naive Bayes: Assumes diagonal covariance (features independent within each class)
- LDA: Allows correlated features (full covariance matrix, shared across classes)
When Naive Bayes is better:
- High-dimensional features (\(p\) large)
- Features are approximately independent
- Limited training data (fewer parameters to estimate)

When LDA is better:
- Features are strongly correlated
- Sufficient training data is available
- The shared-covariance assumption holds
Variants of Bayesian Classifiers¶
Multinomial Naive Bayes¶
For count data (e.g., word counts in documents):

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) \propto \prod_{j=1}^{p} \theta_{jk}^{x_j}\]
where \(\theta_{jk} = P(\text{feature } j | Y = k)\) and \(\sum_j \theta_{jk} = 1\).
Estimation (with Laplace smoothing):

\[\hat{\theta}_{jk} = \frac{N_{jk} + \alpha}{N_k + \alpha p}\]

where \(N_{jk}\) is the total count of feature \(j\) across class-\(k\) documents and \(N_k = \sum_j N_{jk}\).
Use case: Text classification with word counts (more sophisticated than binary word presence).
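A sketch of the smoothed multinomial estimator, assuming documents arrive as per-class count vectors (the data layout and function name are our own):

```python
# Smoothed multinomial parameter estimates: (N_jk + alpha) / (N_k + alpha * p).
# docs: list of (class, count_vector); the counts below are illustrative.
def estimate_theta(docs, n_features, alpha=1.0):
    totals = {}  # per class: (per-feature count sums, total count)
    for k, counts in docs:
        feat, tot = totals.setdefault(k, ([0] * n_features, 0))
        for j, c in enumerate(counts):
            feat[j] += c
        totals[k] = (feat, tot + sum(counts))
    return {
        k: [(feat[j] + alpha) / (tot + alpha * n_features)
            for j in range(n_features)]
        for k, (feat, tot) in totals.items()
    }
```

By construction the smoothed \(\hat{\theta}_{jk}\) sum to one over \(j\) within each class.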
Bernoulli Naive Bayes¶
For binary features (presence/absence):

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^{p} \theta_{jk}^{x_j} (1 - \theta_{jk})^{1 - x_j}, \qquad x_j \in \{0, 1\}\]
Use case: Document classification with binary word indicators.
Bayesian Networks (General Case)¶
Naive Bayes limitation: Independence assumption is too strong.
Bayesian network: Directed acyclic graph (DAG) encoding conditional dependencies.
Nodes: Variables (\(Y, X_1, \ldots, X_p\))
Edges: Direct dependencies
Joint distribution factorizes according to the graph (here with \(Y\) as a root node):

\[P(Y, X_1, \ldots, X_p) = P(Y) \prod_{j=1}^{p} P(X_j \mid \text{parents}(X_j))\]
Example: Tree-Augmented Naive Bayes (TAN) allows each feature to depend on at most one other feature.
Why Does Naive Bayes Work Despite Wrong Assumptions?¶
Paradox: Naive independence assumption is almost always violated, yet Naive Bayes often performs well. Why?
Explanation 1: Classification vs. Density Estimation
- Density estimation: Requires accurate \(P(Y=k|\mathbf{X})\) everywhere
- Classification: Only requires correct \(\arg\max_k P(Y=k|\mathbf{X})\) — the ranking, not exact probabilities
Even if \(P(Y=k|\mathbf{X})\) is poorly estimated, the class with the highest probability may still be correct.
Explanation 2: Bias-Variance Trade-off
- Naive Bayes has high bias (wrong model) but low variance (few parameters)
- With limited data, low variance often outweighs high bias
- Complex models may overfit, while Naive Bayes generalizes better
Explanation 3: Feature Redundancy
If multiple features provide redundant information about the class, the independence assumption may not hurt much — the classifier gets multiple "votes" pointing in the same direction.
Empirical observation: Naive Bayes is particularly effective when:
- Features are not too strongly correlated
- Sample size is small relative to feature dimension
- Decision boundaries are relatively simple
Bayesian Classifier vs. Other Methods¶
Bayesian vs. Logistic Regression¶
- Naive Bayes: Generative (models \(P(\mathbf{X}|Y)\) and \(P(Y)\))
- Logistic Regression: Discriminative (directly models \(P(Y|\mathbf{X})\))
Trade-off:
- Naive Bayes: More assumptions, fewer parameters, faster training, better with small data
- Logistic Regression: Fewer assumptions, more flexible, better with large data
Ng & Jordan (2002) showed: Naive Bayes reaches asymptotic error faster (with less data), but Logistic Regression has lower asymptotic error (with infinite data).
Bayesian vs. Decision Trees¶
- Bayesian: Probabilistic, smooth decision boundaries, handles uncertainty
- Decision Trees: Non-probabilistic, axis-aligned splits, interpretable rules
Trade-off:
- Bayesian: Better probability estimates, handles missing data naturally
- Trees: Handle feature interactions better, no distributional assumptions
Bayesian vs. k-NN¶
- Bayesian: Parametric (learns model parameters)
- k-NN: Non-parametric (memorizes training data)
Trade-off:
- Bayesian: Faster at test time, but requires distributional assumptions
- k-NN: No assumptions, but slow at test time and memory-intensive
Practical Considerations¶
Handling Missing Data¶
Advantage of Bayesian methods: Missing features are naturally handled.
If feature \(X_j\) is missing for observation \(\mathbf{x}\), simply omit it from the product:

\[\hat{y} = \arg\max_k P(Y = k) \prod_{j\,:\,x_j \text{ observed}} P(X_j = x_j \mid Y = k)\]
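A sketch of this: mark missing features (here with `None`, an arbitrary convention) and skip their terms in the sum of logs. The lookup tables are hypothetical placeholders for learned estimates:

```python
import math

# Naive Bayes log-scores that skip missing features.
# log_prior[k] = log P(Y=k); log_lik[k][j][v] = log P(X_j = v | Y=k).
def nb_log_score_missing(x, log_prior, log_lik):
    """A feature value of None is treated as missing and contributes nothing."""
    return {
        k: log_prior[k]
        + sum(log_lik[k][j][xj] for j, xj in enumerate(x) if xj is not None)
        for k in log_prior
    }
```

If every feature is missing, the scores reduce to the log-priors, which is exactly the right fallback.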
Feature Selection¶
Not all features are equally informative. Use mutual information to rank features:

\[I(X_j; Y) = \sum_{x} \sum_{k=1}^{K} P(X_j = x, Y = k) \log \frac{P(X_j = x, Y = k)}{P(X_j = x)\, P(Y = k)}\]
Select features with highest mutual information with the class.
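Mutual information between a discrete feature and the class can be computed from a joint probability table; a small sketch (the table format is our own choice):

```python
import math

# Mutual information I(X; Y) from a joint distribution table,
# usable as a feature-ranking score.
def mutual_information(joint):
    """joint[(x, y)] = P(X=x, Y=y); returns I(X; Y) in nats."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal P(X=x)
        py[y] = py.get(y, 0.0) + p   # marginal P(Y=y)
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

Independent variables give \(I = 0\); a feature that determines the class binary-symmetrically gives \(I = \log 2\) nats.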
Computational Complexity¶
Training: \(O(np)\) — linear in data size and features
Prediction: \(O(Kp)\) — linear in classes and features
Scalability: Naive Bayes scales to very large datasets and high-dimensional feature spaces.
Confidence and Calibration¶
Posterior probabilities \(P(Y = k | \mathbf{X})\) can be interpreted as confidence.
Problem: Naive Bayes probabilities are often poorly calibrated (too extreme — too close to 0 or 1).
Why? Independence assumption causes overconfidence when features are correlated.
Solution: Apply calibration methods:
- Platt scaling (fit a logistic regression to the Naive Bayes scores)
- Isotonic regression (fit a monotonic function to the scores)
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(Y\) | Class variable | Discrete class label \(\in \{1, \ldots, K\}\) |
| \(\mathbf{X}\) | Feature vector | \((X_1, \ldots, X_p)\) features |
| \(K\) | Number of classes | Total classes |
| \(p\) | Number of features | Dimensionality |
| \(P(Y=k)\) | Prior | Probability of class \(k\) before seeing data |
| \(P(\mathbf{X}\|Y=k)\) | Likelihood | Probability of features given class \(k\) |
| \(P(Y=k\|\mathbf{X})\) | Posterior | Probability of class \(k\) given features |
| \(\hat{y}\) | Predicted class | \(\arg\max_k P(Y=k\|\mathbf{X})\) |
| \(\alpha\) | Smoothing parameter | Laplace smoothing pseudo-count |
Related Concepts¶
- Bayes' Theorem — Foundation of Bayesian classification
- Likelihood-Based Statistics — Parameter estimation
- Linear Discriminant Analysis — Related Gaussian classifier
- Logistic Regression — Discriminative alternative
- Bernoulli Distribution — For binary features
- Gaussian Distribution — For continuous features
- ROC Curve — Evaluating classifier performance
Historical and Modern References¶
- Bayes, T., & Price, R. (1763). "An Essay towards solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society, 53, 370-418.
- Maron, M. E. (1961). "Automatic Indexing: An Experimental Inquiry." Journal of the ACM, 8(3), 404-417. [Early application to text classification]
- Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley.
- Domingos, P., & Pazzani, M. (1997). "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss." Machine Learning, 29(2-3), 103-130.
- Ng, A. Y., & Jordan, M. I. (2002). "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes." NIPS 14, 841-848.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 3]
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 6.6.3]