Bayesian Classifier¶
The Story Behind the Mathematics¶
The story of Bayesian classification begins with one of the most revolutionary ideas in probability theory, formulated by an 18th-century Presbyterian minister who never published his work during his lifetime.
Thomas Bayes (1701-1761) was a nonconformist minister in Tunbridge Wells, England. Unlike the prominent mathematicians of his era, Bayes worked in relative obscurity, pursuing mathematics as an intellectual passion rather than a profession. His most important work, "An Essay towards solving a Problem in the Doctrine of Chances," was discovered among his papers after his death by his friend Richard Price, who published it in 1763.
The revolutionary insight: Bayes showed how to invert conditional probabilities — how to reason from effects back to causes, from data back to hypotheses. This was philosophically radical: it meant we could update our beliefs based on evidence, treating probability as a measure of uncertainty rather than just frequency.
For over 150 years, Bayes' ideas remained controversial. The dominant frequentist school, led by figures like Ronald Fisher and Jerzy Neyman, rejected the notion of probability as subjective belief. But Bayes' theorem quietly found practical applications.
Modern revival: The computational revolution of the late 20th century transformed Bayesian methods from theoretical curiosities to practical powerhouses:
- 1950s-60s: Early applications in signal processing and decision theory
- 1980s: Development of Markov Chain Monte Carlo (MCMC) made complex Bayesian computations feasible
- 1990s: Naive Bayes classifiers emerged as surprisingly effective for text classification (spam filtering)
- 2000s: Bayesian networks became standard in medical diagnosis and expert systems
- 2010s-present: Bayesian methods underpin modern machine learning (Bayesian optimization, variational inference, probabilistic programming)
Why "Naive" Bayes?: The most widely used Bayesian classifier makes a "naive" assumption — that all features are conditionally independent given the class. In other words: once we know which class an example belongs to, knowing the value of one feature tells us nothing about the values of the other features.
Example: In classifying emails as spam or not spam, the naive assumption says that if we know an email is spam, the presence of the word "free" is independent of the presence of the word "win". In reality these words tend to appear together (if "free" is present it's more likely that "win" is also present), so the assumption is "naive".
This assumption is almost always wrong, yet the method works remarkably well in practice. This paradox fascinated researchers and made Naive Bayes a fundamental benchmark in machine learning.
Philosophical significance: Bayesian thinking represents a different philosophy of inference: knowledge is uncertain, beliefs should be quantified as probabilities, and evidence should update these beliefs in a principled way. This framework now dominates fields from artificial intelligence to neuroscience to judicial reasoning.
Why It Matters¶
Bayesian classifiers are fundamental to modern statistical learning and are used in:
- Email Filtering: Spam detection (classic application of Naive Bayes)
- Text Classification: Document categorization, sentiment analysis, topic modeling
- Medical Diagnosis: Disease prediction from symptoms and test results
- Recommender Systems: Content filtering, user preference prediction
- Bioinformatics: Protein classification, gene expression analysis
- Computer Vision: Object recognition, image classification
- Natural Language Processing: Part-of-speech tagging, named entity recognition
- Fraud Detection: Credit card fraud, insurance fraud
- Weather Forecasting: Probabilistic forecasts based on atmospheric conditions
- Speech Recognition: Phoneme classification, word recognition
Bayesian classifiers provide probabilistic predictions — not just class labels, but confidence estimates — making them invaluable when uncertainty quantification matters.
Prerequisites¶
- Probability fundamentals (random variables, conditional probability)
- Bayes' Theorem (essential!)
- Bernoulli Distribution and discrete distributions
- Gaussian Distribution (for continuous features)
- Likelihood-Based Statistics (for parameter estimation)
- Concept of Independence and conditional independence
Fundamental Concepts¶
We'll build Bayesian classification from first principles, starting with Bayes' theorem and deriving the optimal decision rule.
The Classification Problem¶
We observe:

- Features: \(\mathbf{X} = (X_1, X_2, \ldots, X_p)\) (could be discrete, continuous, or mixed)
- Class label: \(Y \in \{1, 2, \ldots, K\}\) (discrete)
Goal: Given new features \(\mathbf{x}\), predict the class \(y\).
Bayesian approach: Use probability theory to quantify uncertainty and make optimal predictions.
Bayes' Theorem: The Foundation¶
Bayes' theorem for classification:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)}{P(\mathbf{X} = \mathbf{x})}\]
Components:
- Posterior probability \(P(Y = k | \mathbf{X} = \mathbf{x})\): Probability of class \(k\) given features \(\mathbf{x}\) (what we want)
- Likelihood \(P(\mathbf{X} = \mathbf{x} | Y = k)\): Probability of observing \(\mathbf{x}\) in class \(k\) (from data)
- Prior \(P(Y = k)\): Probability of class \(k\) before seeing data (from class frequencies or domain knowledge)
- Evidence \(P(\mathbf{X} = \mathbf{x})\): Total probability of observing \(\mathbf{x}\) (normalizing constant)
Expanding the evidence using the law of total probability:

\[P(\mathbf{X} = \mathbf{x}) = \sum_{l=1}^{K} P(\mathbf{X} = \mathbf{x} \mid Y = l)\, P(Y = l)\]
Full Bayes' theorem:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)}{\sum_{l=1}^{K} P(\mathbf{X} = \mathbf{x} \mid Y = l)\, P(Y = l)}\]
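The arithmetic above can be checked in a few lines; a minimal sketch, with made-up prior and likelihood values for a single observed \(\mathbf{x}\):

```python
# Checking Bayes' theorem numerically for a two-class problem.
# The prior and likelihood values are made up, for illustration only.
prior = {"spam": 0.6, "ham": 0.4}
likelihood = {"spam": 0.05, "ham": 0.01}   # P(x | class) for one observed x

# Evidence: total probability of x (law of total probability)
evidence = sum(likelihood[k] * prior[k] for k in prior)

# Posterior: likelihood times prior, normalized by the evidence
posterior = {k: likelihood[k] * prior[k] / evidence for k in prior}
# posterior["spam"] = 0.03 / 0.034, roughly 0.882
```

Note that the posteriors sum to one by construction, since the evidence is exactly the normalizing constant.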
The Optimal Bayesian Decision Rule¶
Question: Given posterior probabilities \(P(Y = k | \mathbf{X} = \mathbf{x})\) for all classes, how should we classify?
Bayes classifier (also called Bayes optimal classifier):

\[\hat{y} = h^*(\mathbf{x}) = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k \mid \mathbf{X} = \mathbf{x})\]
Interpretation: Choose the class with the highest posterior probability.
Why is this optimal? This rule minimizes the probability of misclassification.
Proof: The probability of error for a classifier \(h(\mathbf{x})\) is:

\[P(\text{error}) = P(h(\mathbf{X}) \neq Y) = 1 - P(h(\mathbf{X}) = Y)\]
To minimize error, we must maximize \(P(h(\mathbf{X}) = Y)\). For each \(\mathbf{x}\), the best choice is:

\[h^*(\mathbf{x}) = \arg\max_k P(Y = k \mid \mathbf{X} = \mathbf{x})\]
Result: No other classifier can achieve lower error rate than the Bayes classifier (if we know the true probabilities).
Bayes error rate: The minimum possible error rate:

\[\text{Err}_{\text{Bayes}} = 1 - \mathbb{E}_{\mathbf{X}}\left[\max_k P(Y = k \mid \mathbf{X})\right]\]
This is an irreducible error floor — the best any classifier can do given the feature space.
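When the joint distribution is known exactly, the Bayes error rate can be computed directly. A toy sketch with one binary feature and two classes, using illustrative probabilities:

```python
# Bayes error rate for a toy problem where the joint distribution is known.
# joint[(x, y)] = P(X=x, Y=y); the numbers are illustrative.
joint = {
    (0, "a"): 0.30, (0, "b"): 0.10,
    (1, "a"): 0.15, (1, "b"): 0.45,
}

# Err_Bayes = sum_x P(x) * (1 - max_k P(k | x)) = sum_x (P(x) - max_k P(x, k))
bayes_error = 0.0
for x in (0, 1):
    p_x = joint[(x, "a")] + joint[(x, "b")]
    bayes_error += p_x - max(joint[(x, "a")], joint[(x, "b")])
# bayes_error = 0.25: even the optimal rule errs 25% of the time here
```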
Simplification for Classification¶
Since the evidence \(P(\mathbf{X} = \mathbf{x})\) is the same for all classes, we can ignore it for classification:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)\]
Equivalent classification rule:

\[\hat{y} = \arg\max_k P(\mathbf{X} = \mathbf{x} \mid Y = k)\, P(Y = k)\]
In words: Multiply the likelihood by the prior, and choose the class with the highest product.
Log form (often more convenient numerically):

\[\hat{y} = \arg\max_k \left[\log P(\mathbf{X} = \mathbf{x} \mid Y = k) + \log P(Y = k)\right]\]
Naive Bayes Classifier¶
The Naive Bayes classifier is the most widely used Bayesian classifier. It makes a strong simplifying assumption.
The Naive Independence Assumption¶
Assumption: Features are conditionally independent given the class:

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Meaning: Once we know the class, knowing one feature's value gives no information about another feature's value.
Example: For spam classification, this assumes that given an email is spam, the presence of the word "free" is independent of the presence of "winner" — clearly false! Yet Naive Bayes works.
Why "naive"? This assumption is almost always wrong in real data, but it makes computation tractable.
Naive Bayes Classification Rule¶
Substituting the independence assumption into Bayes' rule:

\[P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Classification rule:

\[\hat{y} = \arg\max_k P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\]
Log form (preferred for numerical stability):

\[\hat{y} = \arg\max_k \left[\log P(Y = k) + \sum_{j=1}^{p} \log P(X_j = x_j \mid Y = k)\right]\]
Interpretation: Add up log-probabilities (evidence) from each feature plus the log-prior.
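The log-form rule translates almost directly into code. In this sketch the `log_prior` and `log_lik` lookup tables are hypothetical placeholders for estimates learned from data:

```python
import math

# Naive Bayes scoring in log form: log-prior plus per-feature log-likelihoods.
# log_prior[k] = log P(Y=k); log_lik[k][j][v] = log P(X_j = v | Y=k).
def nb_log_score(x, log_prior, log_lik):
    """Return log P(Y=k) + sum_j log P(X_j = x_j | Y=k) for every class k."""
    return {
        k: log_prior[k] + sum(log_lik[k][j][xj] for j, xj in enumerate(x))
        for k in log_prior
    }

def nb_predict(x, log_prior, log_lik):
    """Pick the class with the highest log-score."""
    scores = nb_log_score(x, log_prior, log_lik)
    return max(scores, key=scores.get)
```

Working in log-space avoids underflow when many small per-feature probabilities would otherwise be multiplied together.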
Estimating Parameters from Data¶
To use Naive Bayes, we must estimate:

1. Class priors \(P(Y = k)\)
2. Feature likelihoods \(P(X_j = x \mid Y = k)\) for each feature \(j\) and class \(k\)
Training data: \(\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}\)
Estimating Class Priors¶
Maximum likelihood estimate:

\[\hat{P}(Y = k) = \frac{n_k}{n}\]
where \(n_k\) is the number of training examples in class \(k\).
Interpretation: Fraction of training data belonging to class \(k\).
Estimating Feature Likelihoods¶
The estimation method depends on the feature type.
For discrete features (e.g., word presence, categorical variables), use frequency counts:

\[\hat{P}(X_j = v \mid Y = k) = \frac{\#\{i : x_{ij} = v,\ y_i = k\}}{n_k}\]
For continuous features (e.g., measurements):
Assume a distribution (typically Gaussian):

\[P(X_j = x \mid Y = k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x - \mu_{jk})^2}{2\sigma_{jk}^2}\right)\]

Estimate the parameters from data:

\[\hat{\mu}_{jk} = \frac{1}{n_k} \sum_{i : y_i = k} x_{ij}, \qquad \hat{\sigma}_{jk}^2 = \frac{1}{n_k} \sum_{i : y_i = k} (x_{ij} - \hat{\mu}_{jk})^2\]
Laplace Smoothing (Additive Smoothing)¶
Problem: What if a feature value \(v\) never appears with class \(k\) in the training data?

Then \(\hat{P}(X_j = v \mid Y = k) = 0\), so \(P(Y = k \mid \mathbf{X} = \mathbf{x}) \propto 0\) for any \(\mathbf{x}\) containing \(v\) — the class gets zero probability, no matter how strongly the other features support it!
Solution: Laplace smoothing adds a small pseudo-count \(\alpha > 0\):

\[\hat{P}(X_j = v \mid Y = k) = \frac{\#\{i : x_{ij} = v,\ y_i = k\} + \alpha}{n_k + \alpha |V_j|}\]
where \(|V_j|\) is the number of possible values for feature \(j\).
Common choice: \(\alpha = 1\) (Laplace's original "add-one" smoothing)
Effect:

- Never assigns zero probability to unseen events
- Regularizes estimates (reduces overfitting)
- Has minimal impact when counts are large
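A minimal helper implementing the smoothed estimate; the counts passed in below are illustrative:

```python
# Laplace-smoothed estimate of P(X_j = v | Y = k), following the formula above:
# (count + alpha) / (n_k + alpha * |V_j|).
def smoothed_prob(count_v_in_k, n_k, n_values, alpha=1.0):
    """count_v_in_k: times value v occurs with class k; n_values: |V_j|."""
    return (count_v_in_k + alpha) / (n_k + alpha * n_values)

# An unseen value (count 0) still gets nonzero probability:
p_unseen = smoothed_prob(0, n_k=2, n_values=2)   # (0 + 1) / (2 + 2) = 0.25
p_seen = smoothed_prob(3, n_k=3, n_values=2)     # (3 + 1) / (3 + 2) = 0.8
```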
Complete Worked Example: Text Classification¶
Problem: Classify emails as spam or ham (not spam) using word occurrences.
Training data:
| # | Class | Contains "free"? | Contains "money"? | Contains "meeting"? |
|---|---|---|---|---|
| 1 | Spam | Yes | Yes | No |
| 2 | Spam | Yes | No | No |
| 3 | Ham | No | No | Yes |
| 4 | Ham | No | Yes | Yes |
| 5 | Spam | Yes | Yes | No |
Step 1: Estimate class priors.
- \(n_{\text{spam}} = 3\), \(n_{\text{ham}} = 2\), \(n = 5\)
- \(\hat{P}(Y = \text{spam}) = 3/5 = 0.6\)
- \(\hat{P}(Y = \text{ham}) = 2/5 = 0.4\)
Step 2: Estimate feature likelihoods (with Laplace smoothing \(\alpha = 1\)).
For "free":
- Spam: appears in 3/3 spam emails, so \(\hat{P}(\text{free} \mid \text{spam}) = \frac{3 + 1}{3 + 2} = 0.8\)
- Ham: appears in 0/2 ham emails, so \(\hat{P}(\text{free} \mid \text{ham}) = \frac{0 + 1}{2 + 2} = 0.25\)

For "money":
- Spam: appears in 2/3 spam emails, so \(\hat{P}(\text{money} \mid \text{spam}) = \frac{2 + 1}{3 + 2} = 0.6\)
- Ham: appears in 1/2 ham emails, so \(\hat{P}(\text{money} \mid \text{ham}) = \frac{1 + 1}{2 + 2} = 0.5\)

For "meeting":
- Spam: appears in 0/3 spam emails, so \(\hat{P}(\text{meeting} \mid \text{spam}) = \frac{0 + 1}{3 + 2} = 0.2\)
- Ham: appears in 2/2 ham emails, so \(\hat{P}(\text{meeting} \mid \text{ham}) = \frac{2 + 1}{2 + 2} = 0.75\)

(Each denominator is \(n_k + \alpha |V_j|\), with \(|V_j| = 2\) for binary features.)
Step 3: Classify a new email with features: free=yes, money=no, meeting=yes.
For spam:

\[\hat{P}(\text{spam}) \cdot \hat{P}(\text{free} \mid \text{spam}) \cdot \left(1 - \hat{P}(\text{money} \mid \text{spam})\right) \cdot \hat{P}(\text{meeting} \mid \text{spam}) = 0.6 \times 0.8 \times 0.4 \times 0.2 = 0.0384\]

For ham:

\[0.4 \times 0.25 \times 0.5 \times 0.75 = 0.0375\]
Decision: Since \(0.0384 > 0.0375\), classify as spam.
Note: We can also compute normalized posterior probabilities:

\[P(\text{spam} \mid \mathbf{x}) = \frac{0.0384}{0.0384 + 0.0375} \approx 0.506, \qquad P(\text{ham} \mid \mathbf{x}) \approx 0.494\]
The email is 50.6% likely to be spam.
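The whole calculation can be reproduced in a short script, following the worked example above with \(\alpha = 1\) smoothing and binary features:

```python
# Reproducing the worked spam/ham example (alpha = 1 Laplace smoothing).
# Rows: (class, free, money, meeting), with 1 = word present.
data = [
    ("spam", 1, 1, 0),
    ("spam", 1, 0, 0),
    ("ham",  0, 0, 1),
    ("ham",  0, 1, 1),
    ("spam", 1, 1, 0),
]
alpha, n_values = 1.0, 2  # binary features have |V_j| = 2

classes = {"spam", "ham"}
n = len(data)
n_k = {k: sum(1 for row in data if row[0] == k) for k in classes}
prior = {k: n_k[k] / n for k in classes}

# Smoothed P(feature_j = 1 | class k)
p_yes = {
    k: [
        (sum(row[j + 1] for row in data if row[0] == k) + alpha)
        / (n_k[k] + alpha * n_values)
        for j in range(3)
    ]
    for k in classes
}

def score(k, x):
    """Unnormalized posterior: prior times product of feature likelihoods."""
    s = prior[k]
    for j, xj in enumerate(x):
        s *= p_yes[k][j] if xj else (1 - p_yes[k][j])
    return s

x_new = (1, 0, 1)  # free=yes, money=no, meeting=yes
s_spam, s_ham = score("spam", x_new), score("ham", x_new)
print(round(s_spam, 4), round(s_ham, 4))    # 0.0384 0.0375
print(round(s_spam / (s_spam + s_ham), 3))  # 0.506
```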
Gaussian Naive Bayes¶
When features are continuous (real-valued measurements), we typically assume Gaussian distributions.
Model: For class \(k\) and feature \(j\):

\[X_j \mid Y = k \sim \mathcal{N}(\mu_{jk}, \sigma_{jk}^2)\]
Classification rule:

\[\hat{y} = \arg\max_k \left[\log P(Y = k) + \sum_{j=1}^{p} \log \mathcal{N}(x_j;\, \mu_{jk}, \sigma_{jk}^2)\right]\]
Log-likelihood for Gaussian:

\[\log \mathcal{N}(x_j;\, \mu_{jk}, \sigma_{jk}^2) = -\frac{1}{2}\log(2\pi\sigma_{jk}^2) - \frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\]
Simplification (dropping constants independent of \(k\)):

\[\hat{y} = \arg\max_k \left[\log P(Y = k) - \sum_{j=1}^{p} \left(\frac{1}{2}\log \sigma_{jk}^2 + \frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)\right]\]
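A from-scratch sketch of Gaussian Naive Bayes using the formulas above, with maximum-likelihood means and variances per class and feature (function names and data layout are our own choices):

```python
import math

# Gaussian Naive Bayes sketch: per-class, per-feature mean and variance,
# scored with the Gaussian log-likelihood.
def fit_gaussian_nb(X, y):
    """X: list of feature lists; y: list of class labels."""
    params = {}
    for k in set(y):
        rows = [x for x, yi in zip(X, y) if yi == k]
        prior = len(rows) / len(X)
        mus = [sum(col) / len(col) for col in zip(*rows)]
        vars_ = [
            sum((v - mu) ** 2 for v in col) / len(col)  # MLE variance (1/n_k)
            for col, mu in zip(zip(*rows), mus)
        ]
        params[k] = (prior, mus, vars_)
    return params

def predict(params, x):
    """Class with the highest log-posterior score."""
    def log_score(k):
        prior, mus, vars_ = params[k]
        s = math.log(prior)
        for xj, mu, var in zip(x, mus, vars_):
            s += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        return s
    return max(params, key=log_score)
```

In practice a small constant is often added to each variance to guard against zero-variance features; that detail is omitted here for brevity.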
Comparison with LDA¶
Gaussian Naive Bayes vs. Linear Discriminant Analysis:
- Naive Bayes: Assumes diagonal covariance (features independent within each class)
- LDA: Allows correlated features (full covariance matrix, shared across classes)
When Naive Bayes is better:
- High-dimensional features (\(p\) large)
- Features are approximately independent
- Limited training data (fewer parameters to estimate)

When LDA is better:
- Features are strongly correlated
- Sufficient training data is available
- The shared-covariance assumption holds
Variants of Bayesian Classifiers¶
Multinomial Naive Bayes¶
For count data (e.g., word counts in documents):

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) \propto \prod_{j=1}^{p} \theta_{jk}^{x_j}\]
where \(\theta_{jk} = P(\text{feature } j | Y = k)\) and \(\sum_j \theta_{jk} = 1\).
Estimation (with Laplace smoothing):

\[\hat{\theta}_{jk} = \frac{N_{jk} + \alpha}{N_k + \alpha p}\]

where \(N_{jk}\) is the total count of feature \(j\) across class-\(k\) documents and \(N_k = \sum_j N_{jk}\).
Use case: Text classification with word counts (more sophisticated than binary word presence).
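A sketch of the smoothed multinomial estimator, assuming documents arrive as per-class count vectors (the data layout and function name are our own):

```python
# Smoothed multinomial parameter estimates: (N_jk + alpha) / (N_k + alpha * p).
# docs: list of (class, count_vector); the counts below are illustrative.
def estimate_theta(docs, n_features, alpha=1.0):
    totals = {}  # per class: (per-feature count sums, total count)
    for k, counts in docs:
        feat, tot = totals.setdefault(k, ([0] * n_features, 0))
        for j, c in enumerate(counts):
            feat[j] += c
        totals[k] = (feat, tot + sum(counts))
    return {
        k: [(feat[j] + alpha) / (tot + alpha * n_features)
            for j in range(n_features)]
        for k, (feat, tot) in totals.items()
    }
```

By construction the smoothed \(\hat{\theta}_{jk}\) sum to one over \(j\) within each class.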
Bernoulli Naive Bayes¶
For binary features (presence/absence):

\[P(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^{p} \theta_{jk}^{x_j} (1 - \theta_{jk})^{1 - x_j}, \qquad x_j \in \{0, 1\}\]
Use case: Document classification with binary word indicators.
Bayesian Networks (General Case)¶
Naive Bayes limitation: Independence assumption is too strong.
Bayesian network: Directed acyclic graph (DAG) encoding conditional dependencies.
Nodes: Variables (\(Y, X_1, \ldots, X_p\))
Edges: Direct dependencies
Joint distribution factorizes according to the graph (here with \(Y\) as a root node):

\[P(Y, X_1, \ldots, X_p) = P(Y) \prod_{j=1}^{p} P(X_j \mid \text{parents}(X_j))\]
Example: Tree-Augmented Naive Bayes (TAN) allows each feature to depend on at most one other feature.
Why Does Naive Bayes Work Despite Wrong Assumptions?¶
Paradox: Naive independence assumption is almost always violated, yet Naive Bayes often performs well. Why?
Explanation 1: Classification vs. Density Estimation
- Density estimation: Requires accurate \(P(Y=k|\mathbf{X})\) everywhere
- Classification: Only requires correct \(\arg\max_k P(Y=k|\mathbf{X})\) — the ranking, not exact probabilities
Even if \(P(Y=k|\mathbf{X})\) is poorly estimated, the class with the highest probability may still be correct.
Explanation 2: Bias-Variance Trade-off
- Naive Bayes has high bias (wrong model) but low variance (few parameters)
- With limited data, low variance often outweighs high bias
- Complex models may overfit, while Naive Bayes generalizes better
Explanation 3: Feature Redundancy
If multiple features provide redundant information about the class, the independence assumption may not hurt much — the classifier gets multiple "votes" pointing in the same direction.
Empirical observation: Naive Bayes is particularly effective when:
- Features are not too strongly correlated
- Sample size is small relative to feature dimension
- Decision boundaries are relatively simple
Bayesian Classifier vs. Other Methods¶
Bayesian vs. Logistic Regression¶
- Naive Bayes: Generative (models \(P(\mathbf{X}|Y)\) and \(P(Y)\))
- Logistic Regression: Discriminative (directly models \(P(Y|\mathbf{X})\))
Trade-off:
- Naive Bayes: More assumptions, fewer parameters, faster training, better with small data
- Logistic Regression: Fewer assumptions, more flexible, better with large data
Ng & Jordan (2002) showed: Naive Bayes reaches asymptotic error faster (with less data), but Logistic Regression has lower asymptotic error (with infinite data).
Bayesian vs. Decision Trees¶
- Bayesian: Probabilistic, smooth decision boundaries, handles uncertainty
- Decision Trees: Non-probabilistic, axis-aligned splits, interpretable rules
Trade-off:
- Bayesian: Better probability estimates, handles missing data naturally
- Trees: Handle feature interactions better, no distributional assumptions
Bayesian vs. k-NN¶
- Bayesian: Parametric (learns model parameters)
- k-NN: Non-parametric (memorizes training data)
Trade-off:
- Bayesian: Faster at test time, but requires distributional assumptions
- k-NN: No assumptions, but slow at test time and memory-intensive
Practical Considerations¶
Handling Missing Data¶
Advantage of Bayesian methods: Missing features are naturally handled.
If feature \(X_j\) is missing for observation \(\mathbf{x}\), simply omit it from the product:

\[\hat{y} = \arg\max_k P(Y = k) \prod_{j\,:\,x_j \text{ observed}} P(X_j = x_j \mid Y = k)\]
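A sketch of this: mark missing features (here with `None`, an arbitrary convention) and skip their terms in the sum of logs. The lookup tables are hypothetical placeholders for learned estimates:

```python
import math

# Naive Bayes log-scores that skip missing features.
# log_prior[k] = log P(Y=k); log_lik[k][j][v] = log P(X_j = v | Y=k).
def nb_log_score_missing(x, log_prior, log_lik):
    """A feature value of None is treated as missing and contributes nothing."""
    return {
        k: log_prior[k]
        + sum(log_lik[k][j][xj] for j, xj in enumerate(x) if xj is not None)
        for k in log_prior
    }
```

If every feature is missing, the scores reduce to the log-priors, which is exactly the right fallback.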
Feature Selection¶
Not all features are equally informative. Use mutual information to rank features:

\[I(X_j; Y) = \sum_{x} \sum_{k=1}^{K} P(X_j = x, Y = k) \log \frac{P(X_j = x, Y = k)}{P(X_j = x)\, P(Y = k)}\]
Select features with highest mutual information with the class.
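Mutual information between a discrete feature and the class can be computed from a joint probability table; a small sketch (the table format is our own choice):

```python
import math

# Mutual information I(X; Y) from a joint distribution table,
# usable as a feature-ranking score.
def mutual_information(joint):
    """joint[(x, y)] = P(X=x, Y=y); returns I(X; Y) in nats."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal P(X=x)
        py[y] = py.get(y, 0.0) + p   # marginal P(Y=y)
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

Independent variables give \(I = 0\); a feature that determines the class binary-symmetrically gives \(I = \log 2\) nats.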
Computational Complexity¶
Training: \(O(np)\) — linear in data size and features
Prediction: \(O(Kp)\) — linear in classes and features
Scalability: Naive Bayes scales to very large datasets and high-dimensional feature spaces.
Confidence and Calibration¶
Posterior probabilities \(P(Y = k | \mathbf{X})\) can be interpreted as confidence.
Problem: Naive Bayes probabilities are often poorly calibrated (too extreme — too close to 0 or 1).
Why? Independence assumption causes overconfidence when features are correlated.
Solution: Apply calibration methods:
- Platt scaling (fit a logistic regression to the Naive Bayes scores)
- Isotonic regression (fit a monotonic function to the scores)
Variables and Symbols¶
| Symbol | Name | Description |
|---|---|---|
| \(Y\) | Class variable | Discrete class label \(\in \{1, \ldots, K\}\) |
| \(\mathbf{X}\) | Feature vector | \((X_1, \ldots, X_p)\) features |
| \(K\) | Number of classes | Total classes |
| \(p\) | Number of features | Dimensionality |
| \(P(Y=k)\) | Prior | Probability of class \(k\) before seeing data |
| \(P(\mathbf{X}\|Y=k)\) | Likelihood | Probability of features given class \(k\) |
| \(P(Y=k\|\mathbf{X})\) | Posterior | Probability of class \(k\) given features |
| \(\hat{y}\) | Predicted class | \(\arg\max_k P(Y=k\|\mathbf{X})\) |
| \(\alpha\) | Smoothing parameter | Laplace smoothing pseudo-count |
Related Concepts¶
- Bayes' Theorem — Foundation of Bayesian classification
- Likelihood-Based Statistics — Parameter estimation
- Linear Discriminant Analysis — Related Gaussian classifier
- Logistic Regression — Discriminative alternative
- Bernoulli Distribution — For binary features
- Gaussian Distribution — For continuous features
- ROC Curve — Evaluating classifier performance
Historical and Modern References¶
- Bayes, T., & Price, R. (1763). "An Essay towards solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society, 53, 370-418.
- Maron, M. E. (1961). "Automatic Indexing: An Experimental Inquiry." Journal of the ACM, 8(3), 404-417. [Early application to text classification]
- Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley.
- Domingos, P., & Pazzani, M. (1997). "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss." Machine Learning, 29(2-3), 103-130.
- Ng, A. Y., & Jordan, M. I. (2002). "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes." NIPS 14, 841-848.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Chapter 3]
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [Chapter 6.6.3]