Regression Trees, Bagging, and Boosting

The Story Behind the Math

In the 1980s, statisticians faced a growing crisis. Linear models—linear regression, logistic regression—were elegant and interpretable, but the real world was desperately non-linear. Researchers had data with complex interactions, thresholds, and hierarchical structures that linear equations simply couldn't capture.

Leo Breiman (1928-2005), a statistician at UC Berkeley, was among the first to recognize that the field needed a paradigm shift. After leaving academia to become a consultant, he saw firsthand how messy real data was. In 1984, he co-authored "Classification and Regression Trees" (CART) with Jerome Friedman, Richard Olshen, and Charles Stone. Their radical insight: instead of fitting a single global equation, partition the data space recursively into simpler regions.

But trees had a fatal flaw—they were unstable. Small changes in training data could produce dramatically different trees. This variance problem limited their practical use.

The breakthrough came in 1996 when Leo Breiman introduced Bagging (Bootstrap Aggregating). His idea was almost absurd in its simplicity: if one tree is unstable, train hundreds of trees on random subsets of data and average their predictions. The variance drops, and accuracy soars.

But the real revolution was Boosting. In 1997, Yoav Freund and Robert Schapire published AdaBoost, proving that a sequence of "weak" learners could combine into a "strong" learner. Instead of averaging independent trees, boosting trains trees sequentially, with each new tree focusing on the examples the previous ones got wrong.

The paradigm shift: from "finding the one best model" to "combining many simple models." This became the foundation of modern ensemble methods, and the AdaBoost work earned Freund and Schapire the 2003 Gödel Prize for its theoretical significance.

Why It Matters

Tree-based ensemble methods are now the dominant approach in machine learning:

  • Kaggle Competitions: XGBoost and LightGBM (boosted trees) have won the majority of structured data competitions
  • Finance: Credit scoring, fraud detection, algorithmic trading
  • Medicine: Disease diagnosis, drug response prediction, survival analysis
  • E-commerce: Recommendation systems, churn prediction, price optimization
  • Genomics: Gene expression analysis, SNP detection
  • Natural Language Processing: Feature engineering, text classification
  • IoT: Sensor data analysis, anomaly detection

These methods are the workhorses of applied machine learning because they:

  • Handle mixed data types (numeric, categorical) naturally
  • Capture non-linear relationships without explicit feature engineering
  • Provide feature importance measures
  • Are robust to outliers and missing values
  • Scale well to large datasets

Prerequisites

  • Likelihood-Based-Statistics — Understanding estimation and optimization
  • Variance — Key concept for understanding bias-variance tradeoff
  • Gaussian-Distribution — For understanding loss functions
  • Basic calculus (gradients, optimization)
  • Familiarity with statistical learning concepts (overfitting, cross-validation)

The Core Insight

The Bias-Variance Decomposition

The fundamental problem in supervised learning can be understood through the bias-variance tradeoff. For a prediction model \(\hat{f}(x)\) of the true function \(f(x)\), the expected prediction error decomposes as:

\[ E[(Y - \hat{f}(X))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

Where:

  • Bias: \(E[\hat{f}(X)] - f(X)\) — How much the average prediction differs from the truth
  • Variance: \(E[(\hat{f}(X) - E[\hat{f}(X)])^2]\) — How much predictions vary across different training sets

Key insight: There's a tradeoff. Simple models (linear regression) have high bias but low variance. Complex models (deep trees) have low bias but high variance. Ensemble methods reduce variance without increasing bias by averaging multiple models.

Regression Trees

What is a Regression Tree?

A regression tree is a piecewise constant function that partitions the input space into rectangular regions and assigns a constant prediction to each region.

\[ \hat{f}(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{R_m}(x) \]

Where \(R_1, \ldots, R_M\) are the regions and \(c_m\) is the prediction in region \(R_m\).

Building a Tree: The Greedy Algorithm

Step 1: Splitting Criterion

At each node, we look for the split that minimizes the sum of squared errors:

\[ \min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right] \]

Where:

  • \(j\) is the feature to split on
  • \(s\) is the split point
  • \(R_1(j,s) = \{X \mid X_j \leq s\}\) and \(R_2(j,s) = \{X \mid X_j > s\}\)

Step 2: Optimal Prediction in Each Region

The optimal constant \(c_m\) is simply the mean of the responses in that region:

\[ \hat{c}_m = \frac{1}{n_m} \sum_{x_i \in R_m} y_i = \bar{y}_{R_m} \]

Step 3: Recursive Splitting

Repeat the splitting process on each new region until a stopping criterion is met (e.g., minimum samples per leaf, maximum depth, or no improvement in loss).

Why squared error? It leads to the sample mean as the optimal prediction. Other losses lead to different predictions (e.g., absolute error leads to the median).
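The greedy criterion above can be sketched in a few lines of Python (an illustrative toy implementation, not a library API): exhaustively search features and thresholds, scoring each candidate split by the summed SSE of its two regions.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean; 0 for an empty region."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def best_split(X, y):
    """Greedy CART-style search: the (feature j, threshold s) pair that
    minimizes SSE(R1) + SSE(R2), each region predicted by its mean."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:      # midpoints between values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, s, total)
    return best

# Toy data: y steps from 1 to 9 as feature 0 crosses the gap between 3 and 10
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [10.0, 2.0], [11.0, 3.0]])
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0])
j, s, loss = best_split(X, y)
print(j, s, loss)
```

Note that `sse` centers each region on its own mean, which is exactly the optimal constant from Step 2.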

Why Trees Have High Variance

Trees are unstable because:

  1. The greedy splitting process is sensitive to small changes in the data
  2. The hierarchical nature compounds errors: early splits cause cascading effects
  3. The piecewise constant structure creates discontinuities

A tree of depth \(d\) has up to \(2^d\) leaves, making it highly flexible—but flexibility comes at the cost of variance.

Bagging (Bootstrap Aggregating)

The Bootstrap Principle

The bootstrap is a resampling technique. Given a dataset of size \(n\), we create \(B\) bootstrap samples by sampling with replacement. Each bootstrap sample also has size \(n\), but only about \(63.2\%\) of the original observations appear in it (the remaining draws are duplicates).

Why \(63.2\%\)? The probability that a given observation is not selected in one draw is \((1 - 1/n)\). For \(n\) draws:

\[ P(\text{not selected}) = \left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368 \]

So the probability of being selected at least once is \(1 - 0.368 = 0.632\).
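The \(63.2\%\) figure is easy to verify numerically; a quick sanity check, assuming NumPy:

```python
import numpy as np

# Fraction of unique indices in a bootstrap sample of size n,
# averaged over many resamples; should be close to 1 - e^{-1}.
rng = np.random.default_rng(0)
n, B = 1000, 2000
unique_fracs = [
    len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(B)
]
print(np.mean(unique_fracs))     # simulated inclusion probability
print(1 - (1 - 1 / n) ** n)      # exact finite-n value, also ~0.632
```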

The Bagging Algorithm

Step 1: Generate Bootstrap Samples

For \(b = 1, \ldots, B\), sample \(n\) observations with replacement from the training data to form \(S_b\).

Step 2: Train Base Learners

For each bootstrap sample \(b\), train a regression tree \(\hat{f}_b(x)\).

Step 3: Aggregate Predictions

For regression, average the predictions:

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \]

For classification, take the majority vote.
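The three steps can be sketched as follows; a minimal bagging implementation using depth-1 "stump" trees as base learners (stumps keep the example short; real implementations bag full trees):

```python
import numpy as np

def fit_stump(X, y):
    """Depth-1 regression tree: best single split, mean prediction per side."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= s
            loss = ((y[m] - y[m].mean()) ** 2).sum() + \
                   ((y[~m] - y[~m].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s, y[m].mean(), y[~m].mean())
    if best is None:                              # degenerate bootstrap sample
        c = y.mean()
        return lambda Z: np.full(len(Z), c)
    _, j, s, cl, cr = best
    return lambda Z: np.where(Z[:, j] <= s, cl, cr)

def bagged_predict(X, y, Xnew, B=50, seed=0):
    """Bagging: one stump per bootstrap sample, predictions averaged."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))  # n draws with replacement
        preds.append(fit_stump(X[idx], y[idx])(Xnew))
    return np.mean(preds, axis=0)

# Toy data: a step in the single feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0])
pred = bagged_predict(X, y, np.array([[2.0], [10.5]]))
print(pred)
```

The averaged prediction is smoother than any single stump: trees fit to different bootstrap samples disagree near the threshold, and the ensemble interpolates between them.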

Why Bagging Works: Variance Reduction

Key Theorem: If \(\hat{f}_b\) are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho\), then:

\[ \text{Var}\left(\frac{1}{B} \sum_{b=1}^{B} \hat{f}_b\right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 \]

Proof:

Let \(Z_b = \hat{f}_b(x)\). Then:

\[ \text{Var}\left(\frac{1}{B} \sum_{b=1}^{B} Z_b\right) = \frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} \text{Cov}(Z_i, Z_j) \]

There are \(B\) variance terms and \(B(B-1)\) covariance terms:

\[ = \frac{1}{B^2} \left[ B \cdot \sigma^2 + B(B-1) \cdot \rho \sigma^2 \right] \]
\[ = \frac{\sigma^2}{B} + \frac{B(B-1)\rho \sigma^2}{B^2} \]
\[ = \frac{\sigma^2}{B} + \rho \sigma^2 \left(1 - \frac{1}{B}\right) \]
\[ = \rho \sigma^2 + \frac{(1 - \rho)\sigma^2}{B} \]

Interpretation:

  • As \(B \to \infty\), the variance approaches \(\rho \sigma^2\), the floor set by the correlation between trees
  • If trees were uncorrelated (\(\rho = 0\)), the variance would go to zero as \(1/B\)
  • Bootstrap sampling creates diverse trees while keeping them reasonably accurate
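The identity can be checked deterministically: build the exact covariance matrix of \(B\) equicorrelated predictors and compute the variance of their average by linear algebra.

```python
import numpy as np

# Var of the mean of B predictors with variance s2 and pairwise correlation rho:
# linear-algebra result w' Sigma w should equal rho*s2 + (1 - rho)*s2/B.
B, s2, rho = 25, 4.0, 0.3
Sigma = np.full((B, B), rho * s2)     # off-diagonal covariances
np.fill_diagonal(Sigma, s2)           # variances on the diagonal
w = np.full(B, 1.0 / B)               # averaging weights
var_mean = w @ Sigma @ w              # Var((1/B) * sum Z_b)
closed_form = rho * s2 + (1 - rho) * s2 / B
print(var_mean, closed_form)
```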

Out-of-Bag Error Estimation

Each observation appears in approximately \(63.2\%\) of bootstrap samples. The remaining \(36.8\%\) are out-of-bag (OOB).

For observation \(i\), the OOB prediction is:

\[ \hat{f}_{\text{OOB}}(x_i) = \frac{1}{B_i} \sum_{b: i \notin S_b} \hat{f}_b(x_i) \]

Where \(B_i\) is the number of trees for which observation \(i\) was out-of-bag. The OOB error is an approximately unbiased estimate of the test error—no separate cross-validation needed!

Random Forests

The Problem with Bagging Trees

Bagged trees are still correlated because they all use the same strong features for early splits. This limits variance reduction.

Random Forest: Decorrelation via Feature Subsampling

At each split, Random Forest considers only a random subset of \(m \ll p\) features.

Algorithm:

  1. For \(b = 1, \ldots, B\):
     • Draw bootstrap sample \(S_b\)
     • Train tree \(\hat{f}_b\) where each split considers only \(m\) randomly selected features
  2. Average predictions: \(\hat{f}_{\text{rf}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\)

Typical value: \(m = \sqrt{p}\) for classification, \(m = p/3\) for regression.

Why this reduces correlation:

  • If one feature is very predictive, all bagged trees will use it for the first split
  • Random Forest forces trees to consider other features, creating diversity
  • Lower correlation \(\rho\) means lower ensemble variance
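The mechanism can be sketched by restricting a stump's split search to a random subset of \(m\) features (illustrative code; `fit_rf_stump` is a made-up helper name):

```python
import numpy as np

def fit_rf_stump(X, y, m, rng):
    """Depth-1 tree, Random-Forest style: the split may only use a random
    subset of m features, which decorrelates trees across the ensemble."""
    features = rng.choice(X.shape[1], size=m, replace=False)
    best = None
    for j in features:
        for s in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= s
            loss = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                   ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best  # (loss, feature, threshold)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.where(X[:, 0] <= 0.0, 0.0, 10.0) + 0.1 * rng.normal(size=60)

# With m = p, every tree greedily grabs the dominant feature 0 ...
full = fit_rf_stump(X, y, m=4, rng=rng)
# ... with m = 1, trees are forced onto other features, creating diversity
chosen = {fit_rf_stump(X, y, m=1, rng=rng)[1] for _ in range(20)}
print(full[1], chosen)
```

When every tree sees all features (`m = p`), they all split on feature 0 and stay highly correlated; restricting `m` spreads the splits across features.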

Boosting

The Fundamental Difference

While bagging trains trees independently and averages them, boosting trains trees sequentially, with each tree focusing on the errors of the previous ones.

AdaBoost (Adaptive Boosting)

Core Idea: Train a sequence of weak learners, weighting each observation by how badly previous learners classified it.

Algorithm:

Initialize observation weights: \(w_i = 1/n\) for all \(i\).

For \(m = 1, \ldots, M\):

  1. Fit weak learner \(G_m(x)\) to weighted training data

  2. Compute weighted error:

$$ \text{err}_m = \frac{\sum_{i=1}^{n} w_i \cdot \mathbb{1}_{[y_i \neq G_m(x_i)]}}{\sum_{i=1}^{n} w_i} $$

  3. Compute learner weight:

$$ \alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right) $$

  4. Update observation weights:

$$ w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot \mathbb{1}_{[y_i \neq G_m(x_i)]}) $$

Normalize weights to sum to 1.

Final classifier:

\[ G(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right) \]

Why exponential weights? AdaBoost minimizes the exponential loss:

\[ L(y, f(x)) = \exp(-y \cdot f(x)) \]

Where \(f(x) = \sum_m \alpha_m G_m(x)\) is the additive model.
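A minimal sketch of the algorithm above, using decision stumps as weak learners on labels in \(\{-1, +1\}\) (toy code, not a library implementation):

```python
import numpy as np

def adaboost(X, y, M=20):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Returns a list of (alpha, feature, threshold, sign) weak learners."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # uniform initial weights
    model = []
    for _ in range(M):
        # Weighted-error-minimizing stump: predict `sign` if x_j <= s, else -sign
        best = None
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= s, sign, -sign)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, s, sign)
        err, j, s, sign = best
        err = max(err, 1e-12)                      # guard against log(0)
        alpha = np.log((1 - err) / err)            # learner weight
        pred = np.where(X[:, j] <= s, sign, -sign)
        w *= np.exp(alpha * (pred != y))           # upweight the mistakes
        w /= w.sum()                               # renormalize
        model.append((alpha, j, s, sign))
    return model

def ada_predict(model, X):
    score = sum(a * np.where(X[:, j] <= s, sg, -sg) for a, j, s, sg in model)
    return np.sign(score)

# Interval-shaped labels: no single stump fits them, but a boosted sum does
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, 1, 1, -1, -1])
model = adaboost(X, y)
acc = (ada_predict(model, X) == y).mean()
print(acc)
```

Because the stump search tries both signs, the weighted error never exceeds \(1/2\), so every \(\alpha_m \geq 0\).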

Gradient Boosting

AdaBoost was designed for classification. Gradient Boosting generalizes the idea to arbitrary differentiable loss functions, including regression losses.

Core Idea: Fit new trees to the negative gradient (pseudo-residuals) of the loss function.

Algorithm:

  1. Initialize: \(\hat{f}_0(x) = \arg\min_c \sum_{i=1}^{n} L(y_i, c)\)

  2. For \(m = 1, \ldots, M\):

a. Compute pseudo-residuals:

\[ r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = \hat{f}_{m-1}} \]

b. Fit regression tree to residuals \(r_{im}\), giving terminal regions \(R_{jm}\)

c. Compute optimal prediction in each region:

\[ c_{jm} = \arg\min_c \sum_{x_i \in R_{jm}} L(y_i, \hat{f}_{m-1}(x_i) + c) \]

d. Update model:

\[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \nu \cdot \sum_{j} c_{jm} \cdot \mathbb{1}_{R_{jm}}(x) \]

Where \(\nu \in (0, 1]\) is the learning rate (shrinkage).

For squared error loss: \(L(y, f) = (y - f)^2\)

The pseudo-residual is:

\[ r_{im} = -\frac{\partial}{\partial f} (y_i - f)^2 \bigg|_{f = \hat{f}_{m-1}(x_i)} = 2(y_i - \hat{f}_{m-1}(x_i)) \]

So we fit trees to the current residuals!
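For squared error, the algorithm therefore reduces to repeatedly fitting trees to the current residuals and adding them with shrinkage \(\nu\); a stump-based sketch (illustrative helper names, assumes distinct feature values):

```python
import numpy as np

def fit_stump(X, y):
    """Least-squares depth-1 tree; returns a prediction function."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= s
            loss = ((y[m] - y[m].mean()) ** 2).sum() + \
                   ((y[~m] - y[~m].mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s, y[m].mean(), y[~m].mean())
    _, j, s, cl, cr = best
    return lambda Z: np.where(Z[:, j] <= s, cl, cr)

def gradient_boost(X, y, M=200, nu=0.1):
    """Gradient boosting with squared-error loss: each stump is fit to the
    current residuals, then added to the model with shrinkage nu."""
    f0 = y.mean()                        # argmin_c sum (y_i - c)^2
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        tree = fit_stump(X, y - pred)    # pseudo-residuals = plain residuals
        trees.append(tree)
        pred += nu * tree(X)             # shrunken update
    return lambda Z: f0 + nu * sum(t(Z) for t in trees)

X = np.arange(1.0, 9.0).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
f = gradient_boost(X, y)
mse = np.mean((f(X) - y) ** 2)
print(mse)
```

Each round removes a fraction of the remaining residual energy, so the training MSE shrinks toward zero as \(M\) grows; the learning rate `nu` controls how fast.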

Why Shrinkage Helps

The learning rate \(\nu\) slows down learning:

\[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \nu \cdot h_m(x) \]

Tradeoff: Smaller \(\nu\) requires more trees \(M\) but typically achieves better generalization. It's like gradient descent with a smaller step size—slower but more precise.

Typical values: \(\nu = 0.01\) to \(0.1\), with \(M = 100\) to \(1000\) trees.

Comparing the Methods

| Aspect           | Single Tree   | Bagging            | Random Forest      | Boosting                        |
|------------------|---------------|--------------------|--------------------|---------------------------------|
| Training         | Single model  | Parallel           | Parallel           | Sequential                      |
| Bias             | Low (if deep) | Same as tree       | Same as tree       | Low (reduced)                   |
| Variance         | High          | Reduced            | Further reduced    | Reduced                         |
| Overfitting      | Yes           | Less               | Even less          | Possible (control with \(\nu\)) |
| Outliers         | Sensitive     | Robust             | Robust             | Sensitive                       |
| Interpretability | High          | Low                | Low                | Low                             |
| Training Speed   | Fast          | Parallelizable     | Parallelizable     | Slow (sequential)               |
| Prediction Speed | Fast          | Slow (\(B\) trees) | Slow (\(B\) trees) | Slow (\(M\) trees)              |

Feature Importance

Mean Decrease in Impurity (MDI)

For tree-based methods, importance of feature \(j\) is:

\[ \text{Importance}_j = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_{b,j}} n_t \cdot \Delta i(t) \]

Where: - \(T_{b,j}\) is the set of splits on feature \(j\) in tree \(b\) - \(n_t\) is the number of samples reaching node \(t\) - \(\Delta i(t)\) is the impurity decrease from the split

Permutation Importance

More reliable but computationally expensive:

  1. Train the model and compute the baseline error
  2. For each feature \(j\):
     • Randomly permute the values of feature \(j\) in the validation set
     • Recompute the error
     • Importance of \(j\) = increase in error over the baseline
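A sketch of the procedure; the model here is a hypothetical already-fitted stump that uses only feature 0, so feature 0's importance should dominate:

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    """Permutation importance: increase in MSE after shuffling one column."""
    base = np.mean((predict(X) - y) ** 2)
    imps = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])       # break feature j only
        imps.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(imps)

# Hypothetical fitted model: a stump that looks only at feature 0
predict = lambda X: np.where(X[:, 0] <= 0.5, 0.0, 1.0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(float)   # the target depends on feature 0 only
imp = permutation_importance(predict, X, y, rng)
print(imp)
```

Permuting feature 1 leaves the predictions untouched, so its measured importance is exactly zero, while permuting feature 0 destroys the fit.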

Practical Guidelines

When to Use What

Single Tree: When interpretability is paramount and relationships are simple

Random Forest:

  • Default choice for tabular data
  • Good baseline before trying more complex methods
  • Robust, works well out-of-the-box
  • Built-in OOB error estimation

Gradient Boosting (XGBoost, LightGBM, CatBoost):

  • When you need maximum accuracy
  • Large datasets (boosting scales better)
  • Willing to tune hyperparameters
  • Kaggle competitions

Hyperparameter Tuning

Random Forest:

  • n_estimators: more is better (until convergence), typically 100-500
  • max_features: \(\sqrt{p}\) for classification, \(p/3\) for regression
  • min_samples_leaf: 1 by default; increase to reduce overfitting
  • max_depth: None (grow fully), or limit for regularization

Gradient Boosting:

  • learning_rate (\(\nu\)): 0.01-0.1 (smaller = more trees needed)
  • n_estimators (\(M\)): 100-1000+ (depends on learning rate)
  • max_depth: 3-6 (shallow trees prevent overfitting)
  • subsample: 0.5-1.0 (stochastic gradient boosting)

Regularization in Boosting

Modern implementations include several regularization techniques:

  1. Shrinkage (learning rate): Slow learning prevents overfitting
  2. Subsampling: Train each tree on random subset (like bagging)
  3. Column subsampling: Use random feature subsets (like Random Forest)
  4. L1/L2 regularization: Penalize leaf values
  5. Early stopping: Stop when validation error stops improving

Common Misconceptions

  1. "More trees always better": For bagging/Random Forest, yes. For boosting, too many trees can overfit (unless using early stopping).

  2. "Boosting always beats Random Forest": Not necessarily. On some datasets, Random Forest performs better, especially with default hyperparameters.

  3. "Trees don't need feature scaling": True: splits depend only on the ordering of feature values, so scaling is unnecessary for single trees and tree ensembles alike. It matters only when trees are combined with scale-sensitive components (e.g., a linear model in a stacked ensemble).

  4. "Feature importance tells you causal relationships": No! Importance measures predictive power, not causation. Correlated features split importance.

  5. "Outliers don't affect trees": Single trees are robust, but boosting can be sensitive to outliers because it focuses on hard examples.

References

  • Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
  • Breiman, L. (1996). "Bagging predictors." Machine Learning, 24(2):123-140.
  • Breiman, L. (2001). "Random forests." Machine Learning, 45(1):5-32.
  • Freund, Y., & Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." Journal of Computer and System Sciences, 55(1):119-139.
  • Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics, 29(5):1189-1232.
  • Friedman, J. H. (2002). "Stochastic gradient boosting." Computational Statistics & Data Analysis, 38(4):367-378.
  • Chen, T., & Guestrin, C. (2016). "XGBoost: A scalable tree boosting system." KDD.
  • Ke, G., et al. (2017). "LightGBM: A highly efficient gradient boosting decision tree." NIPS.