Regression Trees, Bagging, and Boosting

The Story Behind the Math

In the 1980s, statisticians faced a growing crisis. Linear models—linear regression, logistic regression—were elegant and interpretable, but the real world was desperately non-linear. Researchers had data with complex interactions, thresholds, and hierarchical structures that linear equations simply couldn't capture.

Leo Breiman (1928-2005), a statistician at UC Berkeley, was among the first to recognize that the field needed a paradigm shift. After leaving academia to become a consultant, he saw firsthand how messy real data was. In 1984, he co-authored "Classification and Regression Trees" (CART) with Jerome Friedman, Richard Olshen, and Charles Stone. Their radical insight: instead of fitting a single global equation, partition the data space recursively into simpler regions.

But trees had a fatal flaw—they were unstable. Small changes in training data could produce dramatically different trees. This variance problem limited their practical use.

The breakthrough came in 1996 when Leo Breiman introduced Bagging (Bootstrap Aggregating). His idea was almost absurd in its simplicity: if one tree is unstable, train hundreds of trees on random subsets of data and average their predictions. The variance drops, and accuracy soars.

But the real revolution was Boosting. In 1997, Yoav Freund and Robert Schapire published AdaBoost, proving that a sequence of "weak" learners could combine into a "strong" learner. Instead of averaging independent trees, boosting trains trees sequentially, with each new tree focusing on the examples the previous ones got wrong.

The paradigm shift: from "finding the one best model" to "combining many simple models." This became the foundation of modern ensemble methods, and the AdaBoost work earned Freund and Schapire the 2003 Gödel Prize for its theoretical significance.

Why It Matters

Tree-based ensemble methods are now the dominant approach in machine learning:

  • Kaggle Competitions: XGBoost and LightGBM (boosted trees) have won the majority of structured data competitions
  • Finance: Credit scoring, fraud detection, algorithmic trading
  • Medicine: Disease diagnosis, drug response prediction, survival analysis
  • E-commerce: Recommendation systems, churn prediction, price optimization
  • Genomics: Gene expression analysis, SNP detection
  • Natural Language Processing: Feature engineering, text classification
  • IoT: Sensor data analysis, anomaly detection

These methods are the workhorses of applied machine learning because they:

  • Handle mixed data types (numeric, categorical) naturally
  • Capture non-linear relationships without explicit feature engineering
  • Provide feature importance measures
  • Are robust to outliers and missing values
  • Scale well to large datasets

Prerequisites

  • Likelihood-Based-Statistics — Understanding estimation and optimization
  • Variance — Key concept for understanding bias-variance tradeoff
  • Gaussian-Distribution — For understanding loss functions
  • Basic calculus (gradients, optimization)
  • Familiarity with statistical learning concepts (overfitting, cross-validation)

The Core Insight

The Bias-Variance Decomposition

The fundamental problem in supervised learning can be understood through the bias-variance tradeoff. For a prediction model \(\hat{f}(x)\) of the true function \(f(x)\), the expected prediction error decomposes as:

\[ E[(Y - \hat{f}(X))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

Where:

  • Bias: \(E[\hat{f}(X)] - f(X)\) — How much the average prediction differs from the truth
  • Variance: \(E[(\hat{f}(X) - E[\hat{f}(X)])^2]\) — How much predictions vary across different training sets

Key insight: There's a tradeoff. Simple models (linear regression) have high bias but low variance. Complex models (deep trees) have low bias but high variance. Ensemble methods reduce variance without increasing bias by averaging multiple models.

Regression Trees

What is a Regression Tree?

A regression tree is a piecewise constant function that partitions the input space into rectangular regions and assigns a constant prediction to each region.

\[ \hat{f}(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{R_m}(x) \]

Where \(R_1, \ldots, R_M\) are the regions and \(c_m\) is the prediction in region \(R_m\).

Building a Tree: The Greedy Algorithm

Step 1: Splitting Criterion

At each node, we look for the split that minimizes the sum of squared errors:

\[ \min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right] \]

Where:

  • \(j\) is the feature to split on
  • \(s\) is the split point
  • \(R_1(j,s) = \{X \mid X_j \leq s\}\) and \(R_2(j,s) = \{X \mid X_j > s\}\)

Step 2: Optimal Prediction in Each Region

The optimal constant \(c_m\) is simply the mean of the responses in that region:

\[ \hat{c}_m = \frac{1}{n_m} \sum_{x_i \in R_m} y_i = \bar{y}_{R_m} \]

Step 3: Recursive Splitting

Repeat the splitting process on each new region until a stopping criterion is met (e.g., minimum samples per leaf, maximum depth, or no improvement in loss).

Why squared error? It leads to the sample mean as the optimal prediction. Other losses lead to different predictions (e.g., absolute error leads to the median).
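The greedy criterion above can be sketched in a few lines of Python (an illustrative toy implementation, not a library API): exhaustively search features and thresholds, scoring each candidate split by the summed SSE of its two regions.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean; 0 for an empty region."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(X, y):
    """Greedy CART-style search: the (feature j, threshold s) pair that
    minimizes SSE(R1) + SSE(R2), each region predicted by its mean."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:      # midpoints between values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, s, total)
    return best

# Toy data: y steps from 1 to 9 as feature 0 crosses the gap between 3 and 10
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [10.0, 2.0], [11.0, 3.0]])
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0])
j, s, loss = best_split(X, y)
print(j, s, loss)
```

Note that `sse` centers each region on its own mean, which is exactly the optimal constant from Step 2.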

Why Trees Have High Variance

Trees are unstable because:

  1. The greedy splitting process is sensitive to small changes in the data
  2. The hierarchical nature compounds errors: early splits cause cascading effects
  3. The piecewise constant structure creates discontinuities

A tree of depth \(d\) has up to \(2^d\) leaves, making it highly flexible—but flexibility comes at the cost of variance.

Bagging (Bootstrap Aggregating)

The Bootstrap Principle

The bootstrap is a resampling technique. Given a dataset of size \(n\), we create \(B\) bootstrap samples by sampling with replacement. Each bootstrap sample also has size \(n\), but only about \(63.2\%\) of the original observations appear in it (the remaining draws are duplicates).

Why \(63.2\%\)? The probability that a given observation is not selected in one draw is \((1 - 1/n)\). For \(n\) draws:

\[ P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368 \]

So the probability of being selected at least once is \(1 - 0.368 = 0.632\).
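The \(63.2\%\) figure is easy to verify numerically; a quick sanity check, assuming NumPy:

```python
import numpy as np

# Fraction of unique indices in a bootstrap sample of size n,
# averaged over many resamples; should be close to 1 - e^{-1}.
rng = np.random.default_rng(0)
n, B = 1000, 2000
unique_fracs = [
    len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(B)
]
print(np.mean(unique_fracs))     # simulated inclusion probability
print(1 - (1 - 1 / n) ** n)      # exact finite-n value, also ~0.632
```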

The Bagging Algorithm

Step 1: Generate Bootstrap Samples

For \(b = 1, \ldots, B\), sample \(n\) observations with replacement from the training data to form \(S_b\).

Step 2: Train Base Learners

For each bootstrap sample \(b\), train a regression tree \(\hat{f}_b(x)\).

Step 3: Aggregate Predictions

For regression, average the predictions:

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \]

For classification, take the majority vote.
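The three steps can be sketched as follows; a minimal bagging implementation using depth-1 "stump" trees as base learners (stumps keep the example short; real implementations bag full trees):

```python
import numpy as np

def fit_stump(X, y):
    """Depth-1 regression tree: best single split, mean prediction per side."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= s
            loss = ((y[m] - y[m].mean()) ** 2).sum() + \
                   ((y[~m] - y[~m].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s, y[m].mean(), y[~m].mean())
    if best is None:                              # degenerate bootstrap sample
        c = y.mean()
        return lambda Z: np.full(len(Z), c)
    _, j, s, cl, cr = best
    return lambda Z: np.where(Z[:, j] <= s, cl, cr)

def bagged_predict(X, y, Xnew, B=50, seed=0):
    """Bagging: one stump per bootstrap sample, predictions averaged."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))  # n draws with replacement
        preds.append(fit_stump(X[idx], y[idx])(Xnew))
    return np.mean(preds, axis=0)

# Toy data: a step in the single feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0])
pred = bagged_predict(X, y, np.array([[2.0], [10.5]]))
print(pred)
```

The averaged prediction is smoother than any single stump: trees fit to different bootstrap samples disagree near the threshold, and the ensemble interpolates between them.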

Why Bagging Works: Variance Reduction

Key Theorem: If \(\hat{f}_b\) are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho\), then:

\[ \text{Var}\left(\frac{1}{B} \sum_{b=1}^{B} \hat{f}_b\right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 \]

Proof:

Let \(Z_b = \hat{f}_b(x)\). Then:

\[ \text{Var}\left(\frac{1}{B} \sum_{b=1}^{B} Z_b\right) = \frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} \text{Cov}(Z_i, Z_j) \]

There are \(B\) variance terms and \(B(B-1)\) covariance terms:

\[ = \frac{1}{B^2} \left[ B \cdot \sigma^2 + B(B-1) \cdot \rho \sigma^2 \right] \]
\[ = \frac{\sigma^2}{B} + \frac{B(B-1)\rho \sigma^2}{B^2} \]
\[ = \frac{\sigma^2}{B} + \rho \sigma^2 \left(1 - \frac{1}{B}\right) \]
\[ = \rho \sigma^2 + \frac{(1 - \rho)\sigma^2}{B} \]

Interpretation:

  • As \(B \to \infty\), the variance approaches \(\rho \sigma^2\), the floor set by the correlation between trees
  • If trees were uncorrelated (\(\rho = 0\)), the variance would go to zero as \(1/B\)
  • Bootstrap sampling creates diverse trees while keeping them reasonably accurate
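The identity can be checked deterministically: build the exact covariance matrix of \(B\) equicorrelated predictors and compute the variance of their average by linear algebra.

```python
import numpy as np

# Var of the mean of B predictors with variance s2 and pairwise correlation rho:
# linear-algebra result w' Sigma w should equal rho*s2 + (1 - rho)*s2/B.
B, s2, rho = 25, 4.0, 0.3
Sigma = np.full((B, B), rho * s2)     # off-diagonal covariances
np.fill_diagonal(Sigma, s2)           # variances on the diagonal
w = np.full(B, 1.0 / B)               # averaging weights
var_mean = w @ Sigma @ w              # Var((1/B) * sum Z_b)
closed_form = rho * s2 + (1 - rho) * s2 / B
print(var_mean, closed_form)
```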

Out-of-Bag Error Estimation

Each observation appears in approximately \(63.2\%\) of bootstrap samples. The remaining \(36.8\%\) are out-of-bag (OOB).

For observation \(i\), the OOB prediction is:

\[ \hat{f}_{\text{OOB}}(x_i) = \frac{1}{B_i} \sum_{b: i \notin S_b} \hat{f}_b(x_i) \]

Where \(B_i\) is the number of trees for which observation \(i\) was out-of-bag. The OOB error is an approximately unbiased estimate of the test error—no separate cross-validation needed!

Random Forests

The Problem with Bagging Trees

Bagged trees are still correlated because they all use the same strong features for early splits. This limits variance reduction.

Random Forest: Decorrelation via Feature Subsampling

At each split, Random Forest considers only a random subset of \(m \ll p\) features.

Algorithm:

  1. For \(b = 1, \ldots, B\):
     • Draw bootstrap sample \(S_b\)
     • Train tree \(\hat{f}_b\) where each split considers only \(m\) randomly selected features
  2. Average predictions: \(\hat{f}_{\text{rf}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\)

Typical value: \(m = \sqrt{p}\) for classification, \(m = p/3\) for regression.

Why this reduces correlation:

  • If one feature is very predictive, all bagged trees will use it for the first split
  • Random Forest forces trees to consider other features, creating diversity
  • Lower correlation \(\rho\) means lower ensemble variance
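The mechanism can be sketched by restricting a stump's split search to a random subset of \(m\) features (illustrative code; `fit_rf_stump` is a made-up helper name):

```python
import numpy as np

def fit_rf_stump(X, y, m, rng):
    """Depth-1 tree, Random-Forest style: the split may only use a random
    subset of m features, which decorrelates trees across the ensemble."""
    features = rng.choice(X.shape[1], size=m, replace=False)
    best = None
    for j in features:
        for s in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= s
            loss = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                   ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best  # (loss, feature, threshold)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.where(X[:, 0] <= 0.0, 0.0, 10.0) + 0.1 * rng.normal(size=60)

# With m = p, every tree greedily grabs the dominant feature 0 ...
full = fit_rf_stump(X, y, m=4, rng=rng)
# ... with m = 1, trees are forced onto other features, creating diversity
chosen = {fit_rf_stump(X, y, m=1, rng=rng)[1] for _ in range(20)}
print(full[1], chosen)
```

When every tree sees all features (`m = p`), they all split on feature 0 and stay highly correlated; restricting `m` spreads the splits across features.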

Boosting

The Fundamental Difference

While bagging trains trees independently and averages them, boosting trains trees sequentially, with each tree focusing on the errors of the previous ones.

AdaBoost (Adaptive Boosting)

Core Idea: Train a sequence of weak learners, weighting each observation by how badly previous learners classified it.

Algorithm:

Initialize observation weights: \(w_i = 1/n\) for all \(i\).

For \(m = 1, \ldots, M\):

  1. Fit weak learner \(G_m(x)\) to weighted training data

  2. Compute weighted error:

$$ \text{err}_m = \frac{\sum_{i=1}^{n} w_i \cdot \mathbb{1}_{[y_i \neq G_m(x_i)]}}{\sum_{i=1}^{n} w_i} $$

  3. Compute learner weight:

$$ \alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right) $$

  4. Update observation weights:

$$ w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot \mathbb{1}_{[y_i \neq G_m(x_i)]}) $$

Normalize weights to sum to 1.

Final classifier:

\[ G(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right) \]

Why exponential weights? AdaBoost minimizes the exponential loss:

\[ L(y, f(x)) = \exp(-y \cdot f(x)) \]

Where \(f(x) = \sum_m \alpha_m G_m(x)\) is the additive model.
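A minimal sketch of the algorithm above, using decision stumps as weak learners on labels in \(\{-1, +1\}\) (toy code, not a library implementation):

```python
import numpy as np

def adaboost(X, y, M=20):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Returns a list of (alpha, feature, threshold, sign) weak learners."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # uniform initial weights
    model = []
    for _ in range(M):
        # Weighted-error-minimizing stump: predict `sign` if x_j <= s, else -sign
        best = None
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= s, sign, -sign)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, s, sign)
        err, j, s, sign = best
        err = max(err, 1e-12)                      # guard against log(0)
        alpha = np.log((1 - err) / err)            # learner weight
        pred = np.where(X[:, j] <= s, sign, -sign)
        w *= np.exp(alpha * (pred != y))           # upweight the mistakes
        w /= w.sum()                               # renormalize
        model.append((alpha, j, s, sign))
    return model

def ada_predict(model, X):
    score = sum(a * np.where(X[:, j] <= s, sg, -sg) for a, j, s, sg in model)
    return np.sign(score)

# Interval-shaped labels: no single stump fits them, but a boosted sum does
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, 1, 1, -1, -1])
model = adaboost(X, y)
acc = (ada_predict(model, X) == y).mean()
print(acc)
```

Because the stump search tries both signs, the weighted error never exceeds \(1/2\), so every \(\alpha_m \geq 0\).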

Gradient Boosting

AdaBoost was designed for classification. Gradient Boosting generalizes the idea to arbitrary differentiable loss functions, including regression losses.

Core Idea: Fit new trees to the negative gradient (pseudo-residuals) of the loss function.

Algorithm:

  1. Initialize: \(\hat{f}_0(x) = \arg\min_c \sum_{i=1}^{n} L(y_i, c)\)

  2. For \(m = 1, \ldots, M\):

a. Compute pseudo-residuals:

\[ r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = \hat{f}_{m-1}} \]

b. Fit regression tree to residuals \(r_{im}\), giving terminal regions \(R_{jm}\)

c. Compute optimal prediction in each region:

\[ c_{jm} = \arg\min_c \sum_{x_i \in R_{jm}} L(y_i, \hat{f}_{m-1}(x_i) + c) \]

d. Update model:

\[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \nu \cdot \sum_{j} c_{jm} \cdot \mathbb{1}_{R_{jm}}(x) \]

Where \(\nu \in (0, 1]\) is the learning rate (shrinkage).

For squared error loss: \(L(y, f) = (y - f)^2\)

The pseudo-residual is:

\[ r_{im} = -\frac{\partial}{\partial f} (y_i - f)^2 \bigg|_{f = \hat{f}_{m-1}(x_i)} = 2(y_i - \hat{f}_{m-1}(x_i)) \]

So we fit trees to the current residuals!
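For squared error, the algorithm therefore reduces to repeatedly fitting trees to the current residuals and adding them with shrinkage \(\nu\); a stump-based sketch (illustrative helper names, assumes distinct feature values):

```python
import numpy as np

def fit_stump(X, y):
    """Least-squares depth-1 tree; returns a prediction function."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= s
            loss = ((y[m] - y[m].mean()) ** 2).sum() + \
                   ((y[~m] - y[~m].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s, y[m].mean(), y[~m].mean())
    _, j, s, cl, cr = best
    return lambda Z: np.where(Z[:, j] <= s, cl, cr)

def gradient_boost(X, y, M=200, nu=0.1):
    """Gradient boosting with squared-error loss: each stump is fit to the
    current residuals, then added to the model with shrinkage nu."""
    f0 = y.mean()                        # argmin_c sum (y_i - c)^2
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        tree = fit_stump(X, y - pred)    # pseudo-residuals = plain residuals
        trees.append(tree)
        pred += nu * tree(X)             # shrunken update
    return lambda Z: f0 + nu * sum(t(Z) for t in trees)

X = np.arange(1.0, 9.0).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
f = gradient_boost(X, y)
mse = np.mean((f(X) - y) ** 2)
print(mse)
```

Each round removes a fraction of the remaining residual energy, so the training MSE shrinks toward zero as \(M\) grows; the learning rate `nu` controls how fast.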

Why Shrinkage Helps

The learning rate \(\nu\) slows down learning:

\[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \nu \cdot h_m(x) \]

Tradeoff: Smaller \(\nu\) requires more trees \(M\) but typically achieves better generalization. It's like gradient descent with a smaller step size—slower but more precise.

Typical values: \(\nu = 0.01\) to \(0.1\), with \(M = 100\) to \(1000\) trees.

Comparing the Methods

| Aspect           | Single Tree   | Bagging            | Random Forest      | Boosting                        |
|------------------|---------------|--------------------|--------------------|---------------------------------|
| Training         | Single model  | Parallel           | Parallel           | Sequential                      |
| Bias             | Low (if deep) | Same as tree       | Same as tree       | Low (reduced)                   |
| Variance         | High          | Reduced            | Further reduced    | Reduced                         |
| Overfitting      | Yes           | Less               | Even less          | Possible (control with \(\nu\)) |
| Outliers         | Sensitive     | Robust             | Robust             | Sensitive                       |
| Interpretability | High          | Low                | Low                | Low                             |
| Training Speed   | Fast          | Parallelizable     | Parallelizable     | Slow (sequential)               |
| Prediction Speed | Fast          | Slow (\(B\) trees) | Slow (\(B\) trees) | Slow (\(M\) trees)              |

Feature Importance

Mean Decrease in Impurity (MDI)

For tree-based methods, importance of feature \(j\) is:

\[ \text{Importance}_j = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_{b,j}} n_t \cdot \Delta i(t) \]

Where: - \(T_{b,j}\) is the set of splits on feature \(j\) in tree \(b\) - \(n_t\) is the number of samples reaching node \(t\) - \(\Delta i(t)\) is the impurity decrease from the split

Permutation Importance

More reliable but computationally expensive:

  1. Train the model and compute the baseline error
  2. For each feature \(j\):
     • Randomly permute the values of feature \(j\) in the validation set
     • Recompute the error
     • Importance of \(j\) = increase in error over the baseline
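A sketch of the procedure; the model here is a hypothetical already-fitted stump that uses only feature 0, so feature 0's importance should dominate:

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    """Permutation importance: increase in MSE after shuffling one column."""
    base = np.mean((predict(X) - y) ** 2)
    imps = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])       # break feature j only
        imps.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(imps)

# Hypothetical fitted model: a stump that looks only at feature 0
predict = lambda X: np.where(X[:, 0] <= 0.5, 0.0, 1.0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(float)   # the target depends on feature 0 only
imp = permutation_importance(predict, X, y, rng)
print(imp)
```

Permuting feature 1 leaves the predictions untouched, so its measured importance is exactly zero, while permuting feature 0 destroys the fit.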

Practical Guidelines

When to Use What

Single Tree: When interpretability is paramount and relationships are simple

Random Forest:

  • Default choice for tabular data
  • Good baseline before trying more complex methods
  • Robust, works well out-of-the-box
  • Built-in OOB error estimation

Gradient Boosting (XGBoost, LightGBM, CatBoost):

  • When you need maximum accuracy
  • Large datasets (boosting scales better)
  • Willing to tune hyperparameters
  • Kaggle competitions

Hyperparameter Tuning

Random Forest:

  • n_estimators: more is better (until convergence), typically 100-500
  • max_features: \(\sqrt{p}\) for classification, \(p/3\) for regression
  • min_samples_leaf: 1 by default; increase to reduce overfitting
  • max_depth: None (grow fully), or limit for regularization

Gradient Boosting:

  • learning_rate (\(\nu\)): 0.01-0.1 (smaller = more trees needed)
  • n_estimators (\(M\)): 100-1000+ (depends on learning rate)
  • max_depth: 3-6 (shallow trees prevent overfitting)
  • subsample: 0.5-1.0 (stochastic gradient boosting)

Regularization in Boosting

Modern implementations include several regularization techniques:

  1. Shrinkage (learning rate): Slow learning prevents overfitting
  2. Subsampling: Train each tree on random subset (like bagging)
  3. Column subsampling: Use random feature subsets (like Random Forest)
  4. L1/L2 regularization: Penalize leaf values
  5. Early stopping: Stop when validation error stops improving

Common Misconceptions

  1. "More trees always better": For bagging/Random Forest, yes. For boosting, too many trees can overfit (unless using early stopping).

  2. "Boosting always beats Random Forest": Not necessarily. On some datasets, Random Forest performs better, especially with default hyperparameters.

  3. "Trees don't need feature scaling": True: splits depend only on the ordering of feature values, so scaling is unnecessary for single trees and tree ensembles alike. It matters only when trees are combined with scale-sensitive components (e.g., a linear model in a stacked ensemble).

  4. "Feature importance tells you causal relationships": No! Importance measures predictive power, not causation. Correlated features split importance.

  5. "Outliers don't affect trees": Single trees are robust, but boosting can be sensitive to outliers because it focuses on hard examples.

References

  • Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
  • Breiman, L. (1996). "Bagging predictors." Machine Learning, 24(2):123-140.
  • Breiman, L. (2001). "Random forests." Machine Learning, 45(1):5-32.
  • Freund, Y., & Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences, 55(1):119-139.
  • Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics, 29(5):1189-1232.
  • Friedman, J. H. (2002). "Stochastic gradient boosting." Computational Statistics & Data Analysis, 38(4):367-378.
  • Chen, T., & Guestrin, C. (2016). "XGBoost: A scalable tree boosting system." KDD.
  • Ke, G., et al. (2017). "LightGBM: A highly efficient gradient boosting decision tree." NIPS.