Lecture 8: Tree-based Methods

Author

Your Name

Published

February 12, 2025

Introduction

Tree-based methods, also known as decision tree methods, are a versatile and interpretable set of statistical learning techniques applicable to both regression and classification problems. The core idea behind these methods is to partition the \(p\)-dimensional predictor space into a number of simpler regions. Within each region, a straightforward model, such as a constant, is then fitted. These partitioning rules can be visually represented as a tree structure, making them inherently easy to understand and interpret. This lecture will primarily focus on the CART (Classification and Regression Trees) approach, a well-established methodology in the field.

Advantages of Tree-based Methods

Tree-based methods offer several compelling advantages that contribute to their popularity and wide applicability:

  • Interpretability: Decision trees are remarkably easy to visualize and interpret. Their tree-like structure provides a transparent decision-making process that can be readily explained, even to individuals without a strong statistical background. The path from the root to a leaf node clearly outlines the decision rules leading to a particular prediction.

  • Handling Different Data Types: These methods can seamlessly handle both numerical and categorical predictors. Unlike some other statistical techniques, tree-based methods do not require extensive preprocessing, such as the creation of dummy variables for categorical features. They can directly process categorical variables in their original format.

  • Non-linear Relationships Modeling: Tree-based methods excel at naturally modeling non-linear relationships between predictors and the response variable. They achieve this without requiring the explicit specification of a functional form. The tree structure itself adapts to capture complex non-linear patterns in the data.

  • Feature Interactions Capture: Decision trees implicitly capture interactions between predictor variables. The splits in a tree are based on conditions involving one or more predictors, effectively modeling how different predictors interact to influence the response. This is a significant advantage as it automatically detects and incorporates interaction effects without manual specification.

  • Suitability for Large Datasets: Tree-based methods are computationally efficient and can be effectively applied to large datasets. They are also particularly useful in situations where there is uncertainty about the underlying form of the relationship between the explanatory variables and the response. Their non-parametric nature makes them robust in such scenarios.

  • High Interpretability and Ease of Explanation

  • Versatile Handling of Numerical and Categorical Data

  • Effective Modeling of Non-linear Relationships

  • Automatic Capture of Feature Interactions

  • Efficient Use with Large Datasets

Limitations of Tree-based Methods

Despite their numerous advantages, tree-based methods also have certain limitations that should be considered:

  • High Variance: A primary drawback of single decision trees is their high variance: small alterations in the training dataset can lead to significantly different tree structures. This instability often results in overfitting and poor generalization, especially if the tree is grown too deep.

  • Limited Predictive Accuracy: While highly interpretable, single decision trees often do not achieve the same level of predictive accuracy as more sophisticated methods, particularly when dealing with complex datasets. Their simplicity, which contributes to interpretability, can sometimes limit their ability to capture intricate patterns necessary for optimal prediction.

  • Loss of Information with Continuous Predictors: When handling continuous predictors, decision trees categorize them at each split. This categorization, while simplifying the decision process, can lead to a potential loss of information compared to methods that treat continuous variables as such throughout the modeling process. By discretizing continuous variables, some of the nuances in the data might be overlooked.

  • Interpretability Challenges with Large Trees: While small trees are easily interpretable, very large and complex trees can become difficult to understand and explain. As the tree size increases, tracing the decision paths and extracting meaningful insights becomes challenging, diminishing the interpretability advantage. In such cases, they can start to behave like ‘black boxes’.

  • Sensitivity to Data Changes: As mentioned earlier, small changes in the dataset can lead to significant changes in the tree structure. This sensitivity implies that the top splits, which are crucial for the overall tree structure, can be particularly affected by data perturbations. An error in an early split can propagate down and influence all subsequent splits, impacting the tree’s robustness.

  • Susceptibility to High Variance and Overfitting

  • Generally Lower Predictive Accuracy Compared to Complex Methods

  • Potential Information Loss from Continuous Predictor Categorization

  • Reduced Interpretability with Very Large Trees

  • Instability and Sensitivity to Training Data Variations

Ensemble Methods

To address the limitations of single decision trees, particularly their high variance and potentially lower predictive accuracy, ensemble methods based on decision trees have been developed. These methods strategically combine multiple decision trees to create more robust and accurate prediction models. By aggregating predictions from an ensemble of trees, these techniques aim to reduce variance and improve overall predictive performance. Key ensemble methods that will be discussed in this lecture include:

  • Bagging (Bootstrap Aggregating): A method that reduces variance by averaging predictions from multiple trees built on bootstrap samples of the training data.

  • Random Forests: An extension of bagging that further decorrelates the trees by randomly selecting a subset of predictors at each split, leading to even greater variance reduction and often improved accuracy.

  • Boosting: A sequential technique where trees are built iteratively, with each subsequent tree attempting to correct the errors of its predecessors. Boosting focuses on observations that are difficult to predict, thereby enhancing the model’s accuracy, often significantly.

Ensemble methods leverage the strengths of decision trees while mitigating their weaknesses, resulting in powerful and widely used techniques in statistical learning and data analysis. They are computationally intensive but provide improved predictive performance compared to single trees.

For instance, consider the UCI Machine Learning Repository spam dataset, which contains data on 4601 email items, with the goal of classifying each email as ‘spam’ or ‘not spam’. A simple classification tree might use features such as the frequency of certain words (e.g., ‘dollar’, ‘money’, ‘make’) or characters and character patterns (e.g., ‘!’, ‘000’, the total length of capital-letter runs, crl.tot) to make predictions. While a single tree can provide a basic classification, ensemble methods can combine multiple such trees to achieve a more accurate and reliable spam detection system. For example, a simple classification tree for spam detection might look like this:

                                 dollar < 0.0555?
                                 /             \
                             bang < 0.0915?      Predict 'spam' (y)
                             /             \
                         Predict 'not spam'(n)   crl.tot < 85.5?
                                                  /            \
                                             bang < 0.7735?    Predict 'spam' (y)
                                             /            \
                                        Predict 'not spam' (n)  crl.tot < 17?
                                                                 /            \
                                                            Predict 'not spam' (n) Predict 'spam' (y)

This example illustrates how tree-based methods, and especially ensemble methods, can be applied to real-world classification problems, offering both interpretability and, with ensemble techniques, enhanced predictive power.

Regression Trees

The Basic Idea

Regression trees are employed when the response variable is quantitative, aiming to predict its continuous value based on several predictor variables. The fundamental approach involves segmenting the predictor space into a set of non-overlapping regions. For each region, a simple prediction model, typically the mean of the response values of the training observations within that region, is used.

Formally, given a training dataset \(\{(x_1, y_1), \dots, (x_n, y_n)\}\), where \(x_i = (x_{i1}, \dots, x_{ip})\) represents the predictor variables for the \(i\)-th observation and \(y_i\) is the corresponding continuous response, the construction of a regression tree proceeds in two primary steps:

  1. Predictor Space Partitioning: The \(p\)-dimensional predictor space is divided into \(J\) disjoint regions, \(R_1, R_2, \dots, R_J\). For simplicity and interpretability, these regions are generally defined as rectangles or boxes in the predictor space. This partitioning is achieved through a recursive process.

  2. Region-based Prediction: For each region \(R_j\), the prediction for any observation falling into this region is the average of the response values of all training observations that belong to \(R_j\). Let \(\hat{y}_{R_j}\) be the predicted response for region \(R_j\), calculated as: \[\hat{y}_{R_j} = \text{mean}(y_i \mid x_i \in R_j)\]

The goal is to identify regions \(R_1, \dots, R_J\) that minimize the Residual Sum of Squares (RSS), which quantifies the total discrepancy between the observed and predicted responses: \[\label{eq:rss_regression_tree_content} \text{RSS}= \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\] where \(\hat{y}_{R_j}\) is constant for all \(x_i \in R_j\).

However, finding the globally optimal partition that minimizes the RSS across all possible partitions is computationally challenging, especially for large datasets. Therefore, regression trees employ a greedy and computationally efficient approach known as recursive binary splitting.

Recursive Binary Splitting

Definition 1 (Recursive Binary Splitting). Recursive binary splitting is a top-down, greedy algorithm used to construct regression trees. It is termed "top-down" because it starts at the root of the tree, where all training observations reside in a single region, and successively splits the space. The approach is "greedy" because at each step, it makes the locally optimal decision, without looking ahead to see if that split will lead to a globally optimal tree.

The process begins with all training data in a single region and proceeds iteratively:

  1. Initialization: Start with the entire predictor space as a single region.

  2. Iterative Splitting: For each existing region, consider every predictor \(X_j\) and every possible split point \(s\). For a predictor \(X_j\), a split point \(s\) divides the region into two new sub-regions: \[\begin{aligned} R_1(j, s) &= \{x \mid x_j < s\} \\ R_2(j, s) &= \{x \mid x_j \geq s\} \end{aligned}\] For each potential split \((j, s)\), calculate the RSS.

  3. Best Split Selection: Choose the predictor \(X_j\) and the split point \(s\) that result in the greatest reduction in RSS. This is achieved by finding the \((j, s)\) pair that minimizes the sum of the RSS for the two resulting regions: \[\label{eq:rss_split_selection_content} \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2\] where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the mean responses in regions \(R_1(j, s)\) and \(R_2(j, s)\), respectively.

  4. Region Update and Recursion: Divide the region into the two new regions \(R_1(j, s)\) and \(R_2(j, s)\) using the selected best split. Repeat steps 2 and 3 for each of these new regions. This recursive process continues until a predefined stopping criterion is met.

The recursive binary splitting algorithm has a relatively low computational complexity. For each split, it iterates through each predictor and possible split point. If there are \(p\) predictors and \(n\) observations, and assuming roughly \(n\) possible split points for each predictor, the complexity at each node is approximately \(O(p \cdot n \log n)\) if sorting is used to efficiently find split points, or \(O(p \cdot n^2)\) in a naive implementation. Since the splitting process is binary and creates a tree structure, the overall complexity to build a tree of depth \(d\) is roughly \(O(p \cdot n \log n \cdot d)\) or \(O(p \cdot n^2 \cdot d)\). In practice, the depth \(d\) is often logarithmic in \(n\) in balanced trees, leading to an efficient tree construction process.
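To make the greedy split search concrete, the following is a minimal Python sketch of Steps 2 and 3, assuming NumPy arrays \(X\) of shape \(n \times p\) and \(y\) of length \(n\); the function names rss and best_split are illustrative, not taken from any particular library:

    import numpy as np

    def rss(y):
        """Residual sum of squares of y around its mean (0 for an empty region)."""
        return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

    def best_split(X, y):
        """Greedy search over all predictors j and split points s for the
        (j, s) pair minimising the summed RSS of the two half-spaces
        R1 = {x : x_j < s} and R2 = {x : x_j >= s}."""
        n, p = X.shape
        best = (None, None, np.inf)          # (j, s, total RSS)
        for j in range(p):
            # candidate split points: midpoints between consecutive distinct values
            values = np.unique(X[:, j])
            for s in (values[:-1] + values[1:]) / 2:
                left, right = y[X[:, j] < s], y[X[:, j] >= s]
                total = rss(left) + rss(right)
                if total < best[2]:
                    best = (j, s, total)
        return best

Applied recursively to the two regions produced at each step, this search builds the tree; the stopping rules discussed next decide when the recursion ends.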

Stopping Criteria

The recursive splitting process should stop to prevent overfitting and to create trees that are interpretable. Common stopping criteria include:

  • Minimum node size: Stop splitting a node if it contains fewer than a certain minimum number of observations (e.g., 5).

  • RSS reduction threshold: Stop splitting if the reduction in RSS from a split is below a certain threshold.

  • Tree depth limit: Limit the maximum depth of the tree.

For categorical predictors, splits are defined by factor levels. For example, if a predictor is "Color" with levels "Red", "Green", "Blue", a split might be "Color is Red" vs "Color is Green or Blue".
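In software, the stopping rules above typically appear as tuning parameters of the tree implementation. As a hedged illustration, scikit-learn’s DecisionTreeRegressor exposes parameters that correspond roughly to these criteria; the threshold values shown are arbitrary examples, and X_train, y_train are assumed to hold the training data:

    from sklearn.tree import DecisionTreeRegressor

    tree = DecisionTreeRegressor(
        min_samples_leaf=5,          # minimum node size
        min_impurity_decrease=0.01,  # required decrease in (mean squared) error per split
        max_depth=6,                 # tree depth limit
    )
    # tree.fit(X_train, y_train)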

Example: Automobile Data

Example 1 (Automobile Data Example). Consider the automobile dataset with \(n=60\) observations. We want to predict "Mileage" (fuel consumption in miles per US gallon) based on "Weight" (weight of the vehicle in pounds).

Scatterplot of Mileage vs. Weight with fitted smooth and linear regression curves.

Description of Figure 1: The scatterplot in Figure 1 displays Mileage against Weight. Overlaid are a fitted smooth curve (blue) and a linear regression curve (red). The smooth curve suggests a non-linear relationship between Mileage and Weight, indicating that a simple linear regression model might not optimally capture the underlying pattern.

Applying recursive binary splitting to this dataset, with Weight as the predictor and Mileage as the response, results in a partition of the one-dimensional predictor space (Weight) and a corresponding regression tree.

Partition of the predictor space and the resulting regression tree for the automobile data.

Description of Figure 2: Figure 2 illustrates the recursive binary splitting process in four steps:

  • No Split: The data is shown with a horizontal line representing the mean Mileage of all observations, indicating the prediction if no split is made.

  • First Split: A vertical blue line marks the first split point in Weight. Two horizontal black lines show the mean Mileage for the two regions created by this split, representing the improved prediction after the first split.

  • Second Split: A second vertical blue line further divides one of the regions from the first split. The horizontal black lines are updated to reflect the mean Mileage in these newly formed regions, showing further refinement in prediction.

  • Final Split: The final partition is shown with blue vertical lines indicating all split points. The black step function represents the regression curve formed by the means in the final regions, illustrating the piecewise constant prediction of the regression tree.

Below these partition plots, the corresponding regression tree is depicted. This tree visually summarizes the sequence of splits and the predicted Mileage values in the terminal nodes (leaves).

The regression tree derived from the automobile data can be represented as:

                                Weight >= 2568?
                                /          \
                       Weight >= 3088?        30.93
                       /          \
                20.41      Weight >= 2748?
                           /          \
                        23.8        25.62

This tree indicates that if a car’s weight is below 2568 pounds, the predicted mileage is 30.93. For heavier cars, further splits based on weight refine the prediction: 20.41 for weights of at least 3088 pounds, 23.8 for weights between 2748 and 3088 pounds, and 25.62 for weights between 2568 and 2748 pounds.
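Because the fitted tree is just a collection of nested rules, it can be written directly as code. The following sketch implements the tree above as a plain Python function (the split points and leaf means are read off the tree; the function name is illustrative):

    def predict_mileage(weight):
        """Piecewise-constant prediction from the regression tree above;
        left branches correspond to the split condition being true."""
        if weight >= 2568:
            if weight >= 3088:
                return 20.41
            elif weight >= 2748:
                return 23.8
            else:
                return 25.62
        return 30.93

    print(predict_mileage(3200))   # 20.41 (heavy car)
    print(predict_mileage(2400))   # 30.93 (light car)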

Pruning the Tree

A fully grown regression tree, developed through recursive binary splitting until a stopping criterion is met, often becomes excessively complex. Such complexity leads to overfitting, where the tree model fits the training data very closely, capturing noise as well as signal, and consequently performs poorly on unseen test data. Pruning is a crucial technique to simplify the tree by reducing its size. This simplification aims to decrease variance and improve the tree’s ability to generalize to new, unseen data, thereby enhancing prediction accuracy on independent datasets.

Cost Complexity Pruning (Weakest Link Pruning)

Definition 2 (Cost Complexity Pruning (Weakest Link Pruning)). Cost complexity pruning, also known as weakest link pruning, is a method used to prune a regression tree in a way that balances its complexity and its fit to the data. It works by considering a sequence of subtrees of the original, fully grown tree. For each subtree in the sequence, a cost complexity value is calculated. This value is composed of two parts: the RSS of the subtree and a penalty term proportional to the number of terminal nodes in the subtree.

The cost complexity measure is defined as: \[\label{eq:cost_complexity_measure_content} C_\alpha(T) = \text{RSS}(T) + \alpha \cdot |T|\] where:

  • \(T\) is a subtree of the original tree \(T_0\).

  • \(\text{RSS}(T)\) is the residual sum of squares calculated using the subtree \(T\). It measures how well the subtree fits the training data.

  • \(|T|\) is the number of terminal nodes (leaves) in the subtree \(T\). This term represents the complexity of the subtree; a larger \(|T|\) indicates a more complex tree.

  • \(\alpha \geq 0\) is a non-negative tuning parameter, often referred to as the complexity parameter. It controls the trade-off between the tree’s goodness of fit to the training data (RSS) and its complexity (\(|T|\)).

For a given value of \(\alpha\), we find the subtree \(T_\alpha\) that minimizes \(C_\alpha(T)\).

Algorithm for Pruning

  1. Grow a Maximal Tree: Use recursive binary splitting to grow a large tree \(T_0\) on the training data, stopping only when the terminal nodes have a small minimum size.

  2. Cost Complexity Pruning: Obtain a sequence of best subtrees by applying cost complexity pruning to \(T_0\) for a range of values of \(\alpha \geq 0\). For each \(\alpha\), find the subtree \(T_\alpha\) that minimizes \(C_\alpha(T) = \text{RSS}(T) + \alpha \cdot |T|\).

  3. K-fold Cross-Validation: Use K-fold cross-validation to choose the best value of \(\alpha\). Divide the training data into \(K\) folds. For each fold \(k\), train the full sequence of subtrees on the data excluding the \(k\)-th fold and evaluate the prediction error of each subtree in the sequence on the \(k\)-th fold. For each value of \(\alpha\), average the errors across all \(K\) folds to estimate the cross-validation error for that \(\alpha\).

  4. Optimal \(\alpha\) Selection: Select the value of \(\alpha\), \(\alpha_{min}\), that minimizes the cross-validated error. \(T_{\alpha_{min}}\) is the best subtree.

  5. One-Standard-Deviation Rule (Optional): Calculate the standard deviation of the cross-validation error for \(\alpha_{min}\). Choose the smallest tree in the sequence whose cross-validation error is within one standard deviation of the minimum error.

  6. Return Pruned Tree: Return the subtree corresponding to the chosen value of \(\alpha\).

Complexity Analysis of Pruning Algorithm:

  • Growing the maximal tree (Step 1) has the complexity discussed above for recursive binary splitting, roughly \(O(p \cdot n \log n \cdot d_{max})\) or \(O(p \cdot n^2 \cdot d_{max})\), where \(d_{max}\) is the maximum depth.

  • Cost complexity pruning (Step 2) is relatively efficient, often linear in the number of nodes of the maximal tree.

  • K-fold cross-validation (Step 3) involves refitting and evaluating trees \(K\) times. If there are \(L\) subtrees in the sequence, and evaluating each tree on a fold of size \(n/K\) is approximately \(O(n/K)\), the total cost for cross-validation is roughly \(O(K \cdot L \cdot n/K) = O(L \cdot n)\). The number of subtrees \(L\) is at most the number of terminal nodes in the maximal tree, which is less than \(n\).

  • The selection steps (Steps 4 and 5) are computationally negligible.

Overall, the pruning process is computationally feasible, especially compared to exhaustively searching all possible subtrees. The dominant cost is typically in growing the initial maximal tree and in the cross-validation step, but these are manageable for typical datasets.
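As a hedged illustration of this workflow, scikit-learn implements cost complexity pruning through the ccp_alpha parameter of its tree estimators and provides cost_complexity_pruning_path to obtain the sequence of candidate \(\alpha\) values; the sketch below combines this with 5-fold cross-validation, assuming X and y hold the training predictors and response:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # Step 1: grow a large tree and extract the sequence of effective alphas.
    path = DecisionTreeRegressor(min_samples_leaf=5,
                                 random_state=0).cost_complexity_pruning_path(X, y)
    alphas = path.ccp_alphas

    # Steps 2-3: estimate the prediction error of each pruned tree by K-fold CV.
    cv_mse = []
    for a in alphas:
        tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
        cv_mse.append(-scores.mean())

    # Step 4: refit on all training data with the error-minimising alpha.
    alpha_min = alphas[int(np.argmin(cv_mse))]
    pruned = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha_min,
                                   random_state=0).fit(X, y)

The one-standard-deviation rule (Step 5) would instead pick the largest \(\alpha\) (i.e., the smallest tree) whose cross-validation error lies within one standard deviation of the minimum.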

Example: Baseball Data

Example 2 (Baseball Data Example). Consider the baseball dataset, aiming to predict the logarithm of "Salary" of baseball players based on predictors such as "Years" in the major league, "Hits", and "RBI".

Unpruned regression tree for the baseball data.

Description of Figure 3: Figure 3 displays a complex, unpruned regression tree for predicting baseball player salaries. The tree is characterized by a large number of terminal nodes, suggesting a high degree of complexity and potential for overfitting.

To mitigate overfitting and improve generalization, cost complexity pruning is applied in conjunction with cross-validation.

Cross-validation error as a function of tree size and \(c_p\) for the baseball data.

Description of Figure 4: Figure 4 shows two curves: a black curve with error bars representing the cross-validation relative error, and a red curve showing the training error, both plotted against tree size (number of terminal nodes) and the cost complexity parameter \(c_p\) (related to \(\alpha\)). A horizontal blue line is drawn at one standard deviation above the minimum cross-validation error. The plot helps in selecting a pruned tree size that balances bias and variance, often using the one-standard-deviation rule to choose a simpler tree within acceptable error limits.

Based on the cross-validation results and applying the one-standard-deviation rule, a pruned tree with fewer terminal nodes is selected, aiming for a balance between model complexity and predictive performance.

Pruned regression tree for the baseball data.

Description of Figure 5: Figure 5 presents a simpler, pruned regression tree for the baseball data. This pruned tree, with only three terminal nodes, is significantly less complex than the unpruned tree. It is more interpretable and is expected to generalize better to new data due to the reduction in overfitting achieved through pruning.

Classification Trees

Classification trees are used when the response variable is qualitative or categorical. The process is very similar to regression trees, but with some key differences in the splitting criteria and prediction methods.

Introduction to Classification Trees

Similar to regression trees, classification trees partition the predictor space into regions. However, instead of predicting a continuous value in each region, we predict a class label.

Key differences from regression trees:

  • Response type: Classification trees are for categorical responses, while regression trees are for continuous responses.

  • Prediction in regions: In each region of a classification tree, we predict the class that is most frequent among the training observations in that region.

  • Splitting criteria: RSS is not appropriate for classification. We need different measures to evaluate the quality of a split.

Measures of Node Purity

When building a classification tree, we want to choose splits that create nodes that are "pure", meaning that the majority of observations in each node belong to a single class. Several measures can be used to quantify node purity.

Let \(p_{jk}\) be the proportion of training observations in region \(R_j\) that belong to class \(k\), for \(k = 1, 2, \dots, K\) (where \(K\) is the number of classes).

Common measures of node impurity:

Definition 3 (Classification Error Rate). The fraction of training observations in region \(R_j\) that do not belong to the most common class in \(R_j\). \[\label{eq:classification_error_rate_content} E_j = 1 - \max_k \{p_{jk}\}\] We aim to minimize the overall classification error rate.

Definition 4 (Gini Index). Measures the total variance across the classes in region \(R_j\). \[\label{eq:gini_index_content} G_j = \sum_{k=1}^{K} p_{jk} (1 - p_{jk}) = 1 - \sum_{k=1}^{K} p_{jk}^2\] The Gini index is small when \(p_{jk}\) is close to 0 or 1 for all \(k\).

Definition 5 (Cross-Entropy (Deviance)). Measures the deviance or disorder in region \(R_j\). \[\label{eq:cross_entropy_content} D_j = - \sum_{k=1}^{K} p_{jk} \log(p_{jk})\] Cross-entropy is also small when \(p_{jk}\) is close to 0 or 1.

When performing recursive binary splitting for classification trees, at each step, we choose the split that reduces node impurity the most, using one of these measures (Gini index or cross-entropy are generally preferred over classification error rate for tree growing, as they are more sensitive to node purity).
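These impurity measures are straightforward to compute from the class proportions of a node. A minimal sketch follows (function names are illustrative):

    import numpy as np

    def classification_error(p):
        """E_j = 1 - max_k p_jk."""
        return 1.0 - np.max(p)

    def gini_index(p):
        """G_j = 1 - sum_k p_jk^2."""
        return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

    def cross_entropy(p):
        """D_j = -sum_k p_jk log p_jk (zero proportions contribute nothing)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    print(gini_index([0.5, 0.3, 0.2]))      # about 0.62
    print(cross_entropy([0.5, 0.3, 0.2]))   # about 1.03

The same class proportions are revisited in the node impurity exercise (Exercise 8) at the end of this lecture.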

For pruning classification trees, cost complexity pruning can also be used, replacing RSS with a measure of classification error in the cost complexity formula.

Example: Spam Data Set

Example 3 (Spam Data Set Example). Consider the spam data set. We want to classify emails as "spam" or "not spam" based on features like frequency of certain words or characters.

Cross-validation classification error for the spam data set.

Description of Figure 6: Figure 6 shows the cross-validation relative error for classification trees applied to the spam data set, as a function of tree size and cost complexity parameter \(c_p\). A horizontal line is drawn at one standard deviation above the minimum CV error.

Applying cost complexity pruning and the one-standard-deviation rule, we can obtain a pruned classification tree for spam detection.

Pruned classification tree for the spam data set.

Description of Figure 7: Figure 7 shows a pruned classification tree for the spam data set. The tree uses features like "dollar", "bang", "crl.tot", "n000", and "money" to classify emails as "n" (not spam) or "y" (spam).

Ensemble Methods: Bagging, Random Forests, Boosting

Ensemble methods aim to improve the prediction accuracy and robustness of single decision trees by combining multiple trees.

Introduction to Ensemble Methods

Remark 1 (Introduction to Ensemble Methods). Ensemble methods use multiple classification or regression procedures to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. They are based on iterated applications of simple classifiers or regression models, often decision trees.

Why ensemble methods?

  • Reduce variance: Decision trees can have high variance. Ensemble methods like bagging and random forests reduce variance by averaging predictions from multiple trees.

  • Improve accuracy: Boosting methods sequentially build trees, focusing on observations that are difficult to predict, leading to improved accuracy.

Bagging (Bootstrap Aggregation)

Definition 6 (Bagging (Bootstrap Aggregation)). Bagging is a general technique for reducing the variance of a statistical learning method. It is particularly useful for methods like decision trees that have high variance.

Bagging algorithm for regression trees:

  1. Bootstrap Sampling: Generate \(B\) bootstrap datasets \(D_1, D_2, \dots, D_B\) by randomly sampling \(n\) observations with replacement from the original training data \(D\).

  2. Build Trees: Train a regression tree \(T_b\) on each bootstrap dataset \(D_b\) without pruning.

  3. Prediction Aggregation: For a new input \(x\), obtain predictions \(\hat{f}_b(x)\) from each tree \(T_b\). The bagged prediction is the average: \(\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\).

Description of the Bagging Algorithm for Regression Trees: the algorithm reduces variance by averaging predictions from multiple trees. It involves creating bootstrap samples, training a regression tree on each sample, and then averaging the predictions from all trees to obtain the final bagged prediction.

Complexity Analysis of Bagging:

  • Bootstrap Sampling (Step 1): Creating \(B\) bootstrap samples, each of size \(n\), has a complexity of \(O(B \cdot n)\).

  • Tree Building (Step 2): Building each tree \(T_b\) has a complexity of approximately \(O(p \cdot n \log n \cdot d)\) or \(O(p \cdot n^2 \cdot d)\), where \(d\) is the average depth of the trees. Since we build \(B\) trees, the total complexity for this step is \(O(B \cdot p \cdot n \log n \cdot d)\) or \(O(B \cdot p \cdot n^2 \cdot d)\).

  • Prediction Aggregation (Step 3): For each new prediction, we need to average \(B\) predictions, which takes \(O(B)\) time.

The overall training complexity of bagging is dominated by the tree building step, making it computationally intensive but parallelizable, as each tree is built independently.

For classification trees, bagging follows a similar procedure, but instead of averaging predictions, it aggregates predictions by majority voting. For a new input \(x\), each tree \(T_b\) predicts a class. The bagged classification is the class that receives the majority of votes across all \(B\) trees.
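A minimal from-scratch sketch of bagging for regression is given below, using scikit-learn’s DecisionTreeRegressor as the base learner and NumPy arrays X, y (assumed available); for classification, the mean in bagged_predict would be replaced by a majority vote. All names and the choice B = 100 are illustrative:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bag_regression_trees(X, y, B=100, random_state=0):
        """Train B unpruned regression trees on bootstrap samples of (X, y)."""
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)     # draw n rows with replacement
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return trees

    def bagged_predict(trees, X_new):
        """Average the predictions of all trees at the new inputs."""
        preds = np.column_stack([t.predict(X_new) for t in trees])
        return preds.mean(axis=1)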

Out-of-Bag (OOB) Error Estimation

Remark 2 (Out-of-Bag (OOB) Error Estimation). Bagging provides a natural way to estimate the test error without using cross-validation. For each tree \(T_b\), some observations are "out-of-bag" (not used to build \(T_b\)); on average, roughly one third of the observations are left out of any given bootstrap sample. For each observation, averaging (or, for classification, majority-voting) the predictions of the trees for which it is out-of-bag gives an OOB prediction, and the resulting error serves as an estimate of the test error.
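A sketch of the OOB computation, continuing the bagging sketch above (all names are illustrative): for each bootstrap sample we record which observations were left out, predict them with that tree, and average each observation’s out-of-bag predictions before computing the error.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def oob_mse(X, y, B=100, random_state=0):
        """Estimate the test MSE of a bagged regression forest from OOB predictions."""
        rng = np.random.default_rng(random_state)
        n = X.shape[0]
        oob_sum = np.zeros(n)      # running sum of OOB predictions per observation
        oob_count = np.zeros(n)    # number of trees for which each observation is OOB
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeRegressor().fit(X[idx], y[idx])
            oob = np.ones(n, dtype=bool)
            oob[idx] = False                     # rows never drawn are out-of-bag
            if oob.any():
                oob_sum[oob] += tree.predict(X[oob])
                oob_count[oob] += 1
        seen = oob_count > 0
        oob_pred = oob_sum[seen] / oob_count[seen]
        return float(np.mean((y[seen] - oob_pred) ** 2))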

Random Forests

Definition 7 (Random Forests). Random forests are an improvement over bagging that decorrelates the trees further, thus reducing variance even more.

Random Forest Algorithm:

  1. Bootstrap Sampling: Generate \(B\) bootstrap datasets \(D_1, D_2, \dots, D_B\) from the original training data \(D\).

  2. Modified Tree Building: Train a regression tree \(T_b\) on each bootstrap dataset \(D_b\) with the following modification: at each node, randomly select \(m\) predictors from the full set of \(p\) predictors (a common choice is \(m \approx p/3\) for regression and \(m \approx \sqrt{p}\) for classification), consider only these \(m\) predictors for splitting, and choose the best predictor and split point among them. Grow each tree \(T_b\) to maximum depth without pruning.

  3. Prediction Aggregation: For a new input \(x\), obtain predictions \(\hat{f}_b(x)\) from each tree \(T_b\). The random forest prediction is the average: \(\hat{f}_{\text{rf}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)\).

Description of the Random Forest Algorithm for Regression Trees: the algorithm builds upon bagging by adding feature randomness. For each tree and at each node, only a random subset of predictors is considered for splitting, further decorrelating the trees and reducing variance.

Complexity Analysis of Random Forests: The complexity is similar to bagging, with an additional factor for feature subsampling. At each node split, instead of considering all \(p\) predictors, we consider only \(m\) predictors. Thus, the tree building complexity for each tree becomes roughly \(O(m \cdot n \log n \cdot d)\) or \(O(m \cdot n^2 \cdot d)\), and the overall complexity for building \(B\) trees is \(O(B \cdot m \cdot n \log n \cdot d)\) or \(O(B \cdot m \cdot n^2 \cdot d)\). Since \(m < p\), random forests can be computationally more efficient than bagging, especially when \(p\) is large.
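In practice random forests are rarely coded by hand. As a hedged illustration, scikit-learn’s RandomForestRegressor implements this scheme, with max_features playing the role of \(m\); the parameter values below are illustrative (and \(m = 3\) matches the random forest reported for the baseball data later in this lecture):

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(
        n_estimators=500,   # number of trees B
        max_features=3,     # m predictors considered at each split
        oob_score=True,     # also compute the out-of-bag error estimate
        random_state=0,
    )
    # rf.fit(X_train, y_train); rf.oob_score_ then gives the OOB R^2.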

Boosting

Definition 8 (Boosting). Boosting is another ensemble method that sequentially builds trees. Unlike bagging and random forests, boosting does not use bootstrap sampling. Instead, it builds trees in a sequential manner, where each tree tries to correct the errors of the previous trees.

Boosting Algorithm for Regression Trees (Gradient Boosting):

  1. Initialization: Initialize predictions \(\hat{f}(x_i) = 0\) and residuals \(r_i = y_i\) for all \(i = 1, 2, \dots, n\).

  2. Iterative Tree Building: For \(b = 1, 2, \dots, B\), repeat the following: fit a regression tree \(\hat{f}^b\) with depth \(d\) to the training data \((x_i, r_i)\); update the prediction, \(\hat{f}(x_i) \leftarrow \hat{f}(x_i) + \lambda \hat{f}^b(x_i)\); and update the residuals, \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\).

  3. Output: The boosted model is \(\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)\).

Description of the Boosting Algorithm for Regression Trees (Gradient Boosting): the algorithm sequentially builds trees to correct the errors of previous trees. It initializes predictions and residuals, then iteratively fits trees to the residuals, updates the predictions, and recalculates the residuals, thereby focusing on observations that are difficult to predict.

Complexity Analysis of Boosting:

  • Initialization (Step 1): \(O(n)\).

  • Iterative Tree Building (Step 2): For each of the \(B\) iterations, we build a tree of depth \(d\). The complexity to build each tree is approximately \(O(p \cdot n \log n \cdot d)\) or \(O(p \cdot n^2 \cdot d)\). Since we repeat this \(B\) times, the total complexity is \(O(B \cdot p \cdot n \log n \cdot d)\) or \(O(B \cdot p \cdot n^2 \cdot d)\).

  • Output (Step 3): Prediction involves summing predictions from \(B\) trees, which takes \(O(B)\) time for each new prediction.
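The sequential structure of the algorithm above is easy to see in a short from-scratch sketch with squared-error loss; the parameter values B, \(\lambda\) and \(d\) are illustrative and would be tuned as discussed next:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_regression_trees(X, y, B=1000, lam=0.01, d=3):
        """Gradient boosting for regression: each tree is fit to the current
        residuals and added to the model with shrinkage lam."""
        f_hat = np.zeros(len(y))      # current predictions, initialised to 0
        r = y.astype(float)           # current residuals, initialised to y
        trees = []
        for _ in range(B):
            tree = DecisionTreeRegressor(max_depth=d).fit(X, r)
            step = lam * tree.predict(X)
            f_hat += step             # update the prediction
            r -= step                 # update the residuals
            trees.append(tree)
        return trees

    def boosted_predict(trees, X_new, lam=0.01):
        """The boosted model is the shrunken sum of all B trees."""
        return lam * sum(t.predict(X_new) for t in trees)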

Tuning Parameters in Boosting

Remark 3 (Tuning Parameters in Boosting). The main tuning parameters in boosting are listed below; a cross-validation sketch for selecting them follows the list.

  • Number of trees \(B\): Unlike bagging and random forests, boosting can overfit if \(B\) is too large. \(B\) is typically chosen using cross-validation.

  • Shrinkage parameter \(\lambda\): Controls the learning rate. Small \(\lambda\) requires larger \(B\).

  • Interaction depth \(d\): Controls the complexity of each tree. Often small values like \(d=1\) (decision stumps) or \(d=2, 3\) are sufficient.
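As a hedged sketch of how these parameters are chosen in practice, scikit-learn’s GradientBoostingRegressor exposes them as n_estimators (\(B\)), learning_rate (\(\lambda\)), and max_depth (\(d\)), and they can be tuned jointly by cross-validation; the grid values below are illustrative only:

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [500, 1000, 2000],    # number of trees B
        "learning_rate": [0.001, 0.01, 0.1],  # shrinkage parameter lambda
        "max_depth": [1, 2, 3],               # interaction depth d
    }
    search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                          param_grid, cv=5,
                          scoring="neg_mean_squared_error")
    # search.fit(X_train, y_train); search.best_params_ holds the selected values.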

Example: Baseball and Spam Data with Ensemble Methods

Example 4 (Ensemble Methods on Baseball and Spam Data). Applying bagging, random forests, and boosting to the baseball and spam data sets shows improved performance compared to single decision trees.

Baseball Data: Ensemble methods reduce the test MSE compared to a single regression tree.

                      Tree (size 3)   Bagging   Random Forest (m=3)   Boosting (d=3)
Test MSE estimate:      0.418         0.257         0.241              0.281

Spam Data: Ensemble methods provide slightly better accuracy, true negative rate, and true positive rate compared to a single classification tree.

                     tot. accuracy   true negative   true positive
Trees                   0.87            0.94            0.77
Boosting                0.87            0.93            0.79
Bagging                 0.88            0.92            0.81
Random Forests          0.88            0.95            0.78

Conclusion

Remark 4 (Conclusion on Tree-based Methods). Tree-based methods stand out as highly interpretable and remarkably flexible tools in the realm of statistical learning, adept at tackling both regression and classification challenges. Their inherent visual appeal and ease of interpretation make them accessible and valuable across diverse fields.

While individual decision trees offer simplicity and interpretability, they are not without limitations. Their primary weaknesses include high variance and a ceiling on predictive accuracy, especially when confronted with complex datasets. These shortcomings, however, are effectively addressed by ensemble methods.

Techniques such as bagging, random forests, and boosting represent a significant evolution in tree-based modeling. By intelligently combining multiple decision trees, these ensemble methods harness the collective strength of individual trees, leading to models that are not only more robust but also substantially more accurate in their predictions.

The power of ensemble methods lies in their ability to mitigate the instability of single trees and to capture more intricate patterns in data. Bagging reduces variance through averaging, random forests further decorrelate trees for enhanced stability, and boosting sequentially refines the model’s focus, driving accuracy improvements.

In conclusion, tree-based methods, particularly in their ensemble forms, provide a compelling balance between interpretability and predictive power. This balance, coupled with their versatility and effectiveness, has cemented their widespread use across a multitude of disciplines, making them indispensable tools for modern data analysis and predictive modeling.

  • Tree-based methods: interpretable and flexible for regression and classification.

  • Single trees: simple but suffer from high variance and limited accuracy.

  • Ensemble methods (Bagging, Random Forests, Boosting): combine trees for robustness and accuracy.

  • Ensemble methods effectively address limitations of single trees.

  • Tree-based methods, especially ensembles, balance interpretability and predictive power, ensuring wide applicability.

Exercises

  1. Recursive Binary Splitting for Regression Trees:

    1. Explain the concept of recursive binary splitting in the context of regression trees.

    2. Describe the top-down, greedy approach and how it is applied to partition the predictor space into regions \(R_1, R_2, \dots, R_J\).

    3. What is the objective function that recursive binary splitting aims to minimize at each step, and why is a greedy approach necessary for this process, especially with large datasets?

  2. Cost Complexity Pruning:

    1. What is cost complexity pruning, also known as weakest link pruning, and what problem in decision trees does it address?

    2. Explain the cost complexity measure formula: \(C_\alpha(T) = \text{RSS}(T) + \alpha \cdot |T|\). Define each term (\(\text{RSS}(T)\), \(\alpha\), \(|T|\)) and explain how the tuning parameter \(\alpha\) controls the trade-off between tree complexity and goodness of fit.

    3. Why is pruning important for decision trees, and how does cost complexity pruning help in preventing overfitting and improving generalization performance on unseen data?

  3. Regression Trees vs. Classification Trees:

    1. Compare and contrast regression trees and classification trees in terms of the type of response variable they are designed to predict (quantitative vs. qualitative).

    2. Discuss the key differences in how predictions are made within the regions of regression trees versus classification trees.

    3. Explain why the Residual Sum of Squares (RSS) is used as a splitting criterion for regression trees but not for classification trees. What measures of node purity are used for classification trees instead, and why are these measures appropriate for categorical responses?

  4. Bagging Algorithm:

    1. Describe the bagging algorithm in detail, outlining the steps involved in creating bootstrap samples, building trees, and aggregating predictions for regression trees.

    2. Explain how bagging reduces the variance of decision trees. Why does averaging predictions from multiple trees, each trained on a bootstrap sample, lead to a lower variance compared to a single tree trained on the original dataset?

    3. How is the aggregation of predictions handled differently in bagging for classification trees compared to regression trees? Explain the use of majority voting in classification bagging.

  5. Random Forests Improvement over Bagging:

    1. How do random forests improve upon bagging to further reduce variance and decorrelate the trees in the ensemble?

    2. Explain the concept of feature subsampling in random forests. How is the number of predictors \(m\) chosen at each split, and what is a typical heuristic for selecting the value of \(m\) in relation to the total number of predictors \(p\)?

    3. Why does random feature selection at each split lead to less correlated trees and potentially better predictive performance compared to bagging, especially when strong predictors are present?

  6. Boosting Algorithm:

    1. Explain the boosting algorithm for regression trees (Gradient Boosting), detailing the sequential tree building process and how it differs from bagging and random forests in its approach to ensemble creation.

    2. Describe how boosting iteratively fits trees to the residuals from previous trees. What is the role of residuals in the boosting process, and how does it focus on improving predictions for observations that are difficult to predict?

    3. How does boosting improve accuracy compared to single trees and bagging? Explain the significance of the shrinkage parameter \(\lambda\) in controlling the learning rate and preventing overfitting in boosting, and discuss the role of the interaction depth \(d\).

  7. Advantages and Disadvantages of Tree-based Methods vs. Linear Models:

    1. Discuss the advantages and disadvantages of using tree-based methods compared to linear models in terms of interpretability, flexibility in modeling non-linear relationships, handling different types of predictors, and predictive accuracy.

    2. In what types of scenarios or datasets might tree-based methods be preferred over linear models due to their strengths? Conversely, when might linear models be more appropriate than tree-based methods, considering their limitations?

  8. Node Impurity Calculation: For a classification problem with three classes, calculate the Gini index and cross-entropy for a node with the following class proportions: \(p_1 = 0.5\), \(p_2 = 0.3\), \(p_3 = 0.2\). Show your calculations for both the Gini index (\(G_j = \sum_{k=1}^{K} p_{jk} (1 - p_{jk})\)) and cross-entropy (\(D_j = - \sum_{k=1}^{K} p_{jk} \log(p_{jk})\)). Comment on which measure indicates higher impurity for this node.

    Solution:

    1. Gini Index Calculation: \[\begin{aligned} G_j &= \sum_{k=1}^{3} p_{jk} (1 - p_{jk}) \\ &= p_{j1}(1-p_{j1}) + p_{j2}(1-p_{j2}) + p_{j3}(1-p_{j3}) \\ &= 0.5(1-0.5) + 0.3(1-0.3) + 0.2(1-0.2) \\ &= 0.5(0.5) + 0.3(0.7) + 0.2(0.8) \\ &= 0.25 + 0.21 + 0.16 \\ &= 0.62 \end{aligned}\]

    2. Cross-Entropy Calculation: \[\begin{aligned} D_j &= - \sum_{k=1}^{3} p_{jk} \log(p_{jk}) \\ &= - [p_{j1}\log(p_{j1}) + p_{j2}\log(p_{j2}) + p_{j3}\log(p_{j3})] \\ &= - [0.5\log(0.5) + 0.3\log(0.3) + 0.2\log(0.2)] \\ &\approx - [0.5(-0.693) + 0.3(-1.204) + 0.2(-1.609)] \\ &\approx - [-0.3465 - 0.3612 - 0.3218] \\ &\approx - [-1.0295] \\ &\approx 1.030 \end{aligned}\] Both the Gini index and cross-entropy quantify node impurity, with higher values indicating greater impurity. Their numerical values are not directly comparable, however, because the two measures lie on different scales: with three classes the Gini index is at most \(1 - 1/3 \approx 0.667\) and the cross-entropy (using the natural logarithm) is at most \(\log 3 \approx 1.099\), both attained at equal class proportions. Relative to these maxima, the node’s Gini index (0.62) and cross-entropy (1.030) both indicate a highly impure node with substantially mixed class proportions.

  9. Out-of-Bag (OOB) Error Estimation:

    1. Explain the concept of Out-of-Bag (OOB) error estimation in bagging. What are OOB observations, and why are they inherently generated in the bootstrap sampling process used in bagging?

    2. How is the OOB error rate or MSE calculated for bagged models? Describe the process of obtaining OOB predictions for each observation and then computing the overall error as an estimate of the test error.

    3. What are the advantages of using OOB error estimation compared to traditional cross-validation for estimating the test error of a bagged model, particularly in terms of computational efficiency and convenience?

  10. Tuning Parameters in Boosting:

    1. What are the key tuning parameters in boosting, specifically for Gradient Boosting (number of trees \(B\), shrinkage parameter \(\lambda\), and interaction depth \(d\))?

    2. Explain how each of these tuning parameters affects the boosting model’s performance, complexity, and risk of overfitting. For instance, how does increasing or decreasing the number of trees, shrinkage parameter, or interaction depth influence the bias-variance trade-off?

    3. Discuss the importance of tuning these parameters using techniques like cross-validation. What strategies can be used to select optimal values for \(B\), \(\lambda\), and \(d\) to achieve the best balance between model fit and generalization, and to avoid overfitting to the training data?