Tree-Based Regression Models with Scikit-learn

machine learning

tree-based models

scikit-learn

Python

Published

Mar 30, 2025

As we continue to explore the scikit-learn package, we now consider the following tree-based models: decision trees, bagging, random forests, and boosting. In this post, we will keep working on a regression problem and train these tree-based models using scikit-learn. We will also use lightgbm, a Python package for the LightGBM library that provides a highly efficient gradient boosting framework.

To start, we import the ames dataset from the Github repo and replace Sale_Price, the target variable, with its logarithm to the base 10.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ames = pd.read_csv('https://raw.githubusercontent.com/barisguven/datasets/main/data/ames_tm.csv')
ames['Sale_Price'] = np.log10(ames['Sale_Price'])

Let’s split our data into train and test sets by reserving 20% of the data for the test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  ames.drop(columns='Sale_Price'), 
  ames.loc[:, 'Sale_Price'], 
  test_size=0.2, 
  random_state=123
)

Decision Trees

To grow a single decision tree to predict the sale price of houses, we can use the DecisionTreeRegressor class from the tree module of scikit-learn. Below, we create an instance of it and call its get_params() method to view its parameters and their default values. The minimum number of observations required for splitting a node, min_samples_split, is 2, and the minimum number of observations in a leaf node, min_samples_leaf, is 1 by default. With such default values, the decision tree regressor will grow a highly complex/deep tree and overfit the data.

from sklearn.tree import DecisionTreeRegressor

DecisionTreeRegressor().get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

To grow a less complex tree at the expense of some bias but with the benefit of less variance for our predictions, we need to optimize some of its parameters using cross-validation. We can do this in two ways:

Optimize such parameters as max_depth, min_samples_split, or min_samples_leaf;
Grow a relatively deep tree and then optimize ccp_alpha, the cost complexity parameter, to prune it (i.e., implement cross-validated cost complexity pruning).

While decision trees can accommodate categorical variables, scikit-learn does not support them and requires data to have numeric type. Below, we create a preprocessor that one-hot encodes the categorical features and keeps the numeric features as they are.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector

preprocessor = ColumnTransformer(
  transformers=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    ('', 'passthrough', make_column_selector(dtype_include=np.number))
  ]
)

We now combine the preprocessor with an instance of DecisionTreeRegressor class to form a model. Decision trees are constructed by finding the feature/split that improves the criterion most (e.g., mean squared error for regression problems). Since they randomly break ties between features that improve the criterion equally, we set the random_state parameter to an integer to obtain the same splits across different runs.

from sklearn.pipeline import Pipeline

pipeline = Pipeline(
  steps=[
    ('preprocess', preprocessor),
    ('tree', DecisionTreeRegressor(random_state=3))
  ]
)

Scikit-learn offers several classes to find the optimum hyperparameter values through cross-validation. The GridSearchCV class from the model_selection module conducts an exhaustive search over specified parameter values. It requires an estimator (i.e., a model) and a parameter grid. The parameters and their values that will be tried out are specified as a dictionary. The cross-validation classes rely on the ParameterGrid class to generate all combinations of the hyperparameter grid.

We can construct a parameter grid using parameter names as keys and the list of parameter values as the corresponding values in a dictionary. Since our estimator is a pipeline, we must let the GridSearchCV know that the parameters belong to the DecisionTreeRegressor estimator. To do so, we need to use the step name for the estimator specified in the pipeline, which is 'tree' in our case. For every parameter, the step name must be followed by two underscores, __, and the parameter name to create a dictionary key.

The cv option of GridSearchCV can be specified similarly to the cross_validate() function. We can set it to 10 for a 10-fold cross-validation. To illustrate another option, we can also set it to an instance of a cross-validator, for instance, KFold. We can set the shuffle option of KFold to True to shuffle the observations before splitting data into k folds.¹

Below, we create an instance of GridSearchCV. We set its refit option to True to refit the estimator to the whole training data using the best parameters found. To conduct the grid search, we call the fit() method by passing the training data into it.

from sklearn.model_selection import GridSearchCV, KFold

param_grid = {
  'tree__max_depth': [3, 5, 10, 15, 20],
  'tree__min_samples_split': [2, 10, 20, 40]
}

kfold = KFold(n_splits=10, shuffle=True, random_state=1)

grid = GridSearchCV(
  estimator=pipeline,
  param_grid=param_grid,
  scoring='neg_mean_squared_error',
  cv=kfold,
  refit=True
)

grid.fit(X_train, y_train)
grid.best_params_

{'tree__max_depth': 10, 'tree__min_samples_split': 20}

We can view the best parameter combination by accessing the best_params_ attribute of grid. We get a relatively complex tree with a depth of 10.

The best_score_ attribute of grid stores the best (mean) test score. Its cv_results_ attribute, a dictionary, stores the cross-validation results, including test scores, parameter combinations, and more. Its best_estimator_ stores the decision tree regressor fitted to the entire training data using the best parameter values.

Let’s use the tree trained with the best parameters to predict the test set and compute the error. Since we previously set the refit option of GridSearchCV to True, we can directly call the predict() method of grid.

from sklearn.metrics import root_mean_squared_error

root_mean_squared_error(y_test, grid.predict(X_test))

0.0853639022906632

Finally, we plot the tree’s top section using the plot_tree function. To do so, we first access the best tree of the grid.² Since scikit-learn drops feature names from data frames to create arrays, we call the get_feature_names_out() method of the preprocessor to generate the feature names.³

from sklearn.tree import plot_tree

best = grid.best_estimator_._final_estimator
preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()

fig, ax = plt.subplots(figsize=(8, 6), dpi=200)
plot_tree(best, max_depth=2, feature_names=feature_names, ax=ax, fontsize=6, filled=True)
plt.show()

Year_Built seems to be the most important factor in determining Sale_Price. Houses built before 1986 with no fireplaces and gross living areas of less than 812 sqft have the lowest average sale price.

Pruning a tree

Another option to avoid overfitting involves optimizing the cost complexity parameter, $α$ . A deep tree is first grown in a strategy known as (minimal) cost complexity pruning (or weakest link pruning). For each value of nonnegative $α$ , there is a subtree that minimizes the following cost-complexity measure:

$\sum_{m = 1}^{| T |} \sum_{i : x_{i} \in R_{m}} (y_{i} - {\hat{y}}_{R_{m}})^{2} + α | T |$

where $T$ is the number of terminal nodes in the subtree, $R_{m}$ is the region corresponding to the mth terminal node, and ${\hat{y}}_{R_{m}}$ is the predicted response value for that region.⁴ The measure implies a trade-off between the cost (mean squared error) and tree complexity indicated by the number of terminal nodes, $| T |$ . As $α$ grows from zero to higher values, the subtree gets less complex as more nodes are pruned. The cost_complexity_pruning_path() method of the DecisionTreeRegressor class returns a sequence of $α$ values. A K-fold cross-validation can then be used to optimize $α$ . This means that for each value of $α$ , a subtree is trained on the analysis set (i.e., on $k - 1$ folds) and assessed on the kth fold.

Letting a tree grow as far as it can leads to a depth of 25 and over 2,200 leaves. This implies a long list of $α$ values to try out (over 2,000), increasing the optimization cost. Instead, we let the tree depth to be 10. This drastically reduces the number of alphas to 547.

tree = DecisionTreeRegressor(max_depth=10, random_state=3)

ccp_path = tree.cost_complexity_pruning_path(
  preprocessor.fit_transform(X_train), 
  y_train
)

ccp_path.ccp_alphas.shape

(547,)

We are ready to run a grid search. Given that we still have many $α$ values to try out, we can benefit from parallel processing. We can set the n_jobs parameter of GridSearchCV to the number of processes we want to run in parallel. I set it to 8, the number of (physical) cores on my machine.

grid = GridSearchCV(
  estimator=pipeline,
  param_grid={'tree__ccp_alpha': ccp_path.ccp_alphas},
  scoring='neg_mean_squared_error',
  cv=kfold,
  refit=True,
  n_jobs=8
)

grid.fit(X_train, y_train)
grid.best_params_

{'tree__ccp_alpha': np.float64(2.5419090091888236e-05)}

Let’s assess the performance of the pruned tree on the test set. The pruned tree does slightly worse than the previous tree that we tuned previously. Moreover, the first option takes less time and consumes less energy.

root_mean_squared_error(y_test, grid.predict(X_test))

0.08780243011867586

Bagging

Single trees are weak learners, and their predictions show high variance. Ensembling a sufficiently large number of complex trees can obtain low-variance learners. In the case of regression trees, bagging involves averaging the predictions from multiple trees grown on the bootstrapped samples of the training data, significantly reducing the variance of the averaged predictions. We can use the BaggingRegressor class from the ensemble module to build a bagging estimator.⁵

from sklearn.ensemble import BaggingRegressor

BaggingRegressor().get_params()

{'bootstrap': True,
 'bootstrap_features': False,
 'estimator': None,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Below, we construct a pipeline using an instance of the BaggingRegressor class as the final estimator. We pass n_estimators=500 to average the predictions from 500 trees and random_state=4 for reproducibility. Note that the class uses the bootstrap sampling by default and can grow trees with parallel processing, which can be activated through the n_jobs parameter.

pipeline = Pipeline(
  steps=[
    ('preprocess', preprocessor),
    ('bag', BaggingRegressor(n_estimators=500, n_jobs=5, random_state=4))
  ]
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
root_mean_squared_error(y_test, y_pred)

0.058854906717615035

The bagging model has a lower RMSE than the decision tree model we optimized previously. Using the test data, we finally plot the predicted sale price against the true sale price.

Code

ax = plt.subplots()[1]
ax.scatter(y_test, y_pred, alpha=0.4)
ax.set_xlabel('True Sale Price (log $)')
ax.set_ylabel('Predicted Sale Price (log $)')
ax.set_title('Test Performance of Bagging - 200 Trees')
plt.show()

Random Forests

Like bagging, random forests also average the predictions from multiple trees trained on the bootstrapped samples. Unlike bagging, it randomly selects m features out of all p features to consider each time a split is attempted. This prevents trees from looking alike and helps decorrelate them, which in turn helps reduce the variance of the predictions further.

We can use the RandomForestRegressor class from the ensemble module to build a random forests model, another ensemble estimator.

Both for bagging and random forests, a sufficiently high number of trees that stabilizes the error is picked, so there is no need for tuning the number of trees in the forest. While $p / 3$ features are recommended for regression problems, we must tune $m$ to find its optimal value. Since each tree is trained on the bootstrapped sample of the training data, the out-of-bag (OOB) samples can be used for cross-validation. We can use the oob_score parameter of the RandomForestRegressor class to enable this. It can be set to True to use r2_score by default or to a callable to use a custom metric to compute the OOB-error.

np.floor(preprocessor.fit_transform(X_train).shape[1] / 3)

np.float64(104.0)

Below, we instantiate a random forests regressor and preset the number of trees to 500. We pass in oob_score=root_mean_squared_error to compute the RMSE on the OOB samples. We also pass in n_jobs=5 to assign the training work to 5 processes. We create the random forests model by combining the regressor with the preprocessor in a pipeline. Next, we define a list of 5 candidates for the max_features parameter of the regressor and train the model using each candidate. We use the set_params() method of the pipeline to update the max_features parameter of the regressor with the candidates iterated over in the loop. Note that we pass in a dictionary to the method. Its key combines 'rf__', which designates the step of the pipeline, with the parameter name, 'max_features', which is to be set to the provided value.⁶ Finally, we print the OOB-RMSE for each candidate by accessing the oob_score_ attribute of the trained model.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
  n_estimators=500,
  oob_score=root_mean_squared_error,
  n_jobs=5,
  random_state=4
)

pipeline = Pipeline([('preprocess', preprocessor), ('rf', rf)])

max_features = [1, 25, 50, 103, 150, 200]

for num in max_features:
  pipeline.set_params(**{'rf__max_features': num})
  pipeline.fit(X_train, y_train)
  print(f'{num} features; OOB-RMSE: {pipeline._final_estimator.oob_score_}')

1 features; OOB-RMSE: 0.07897349374129993
25 features; OOB-RMSE: 0.06356656433531577
50 features; OOB-RMSE: 0.061893200952276635
103 features; OOB-RMSE: 0.06188636870372245
150 features; OOB-RMSE: 0.06200775371602834
200 features; OOB-RMSE: 0.06211195372354106

The lowest RMSE is obtained with 103 features. Below, we also compute the test RMSE.

pipeline.set_params(**{'rf__max_features': 103})
pipeline.fit(X_train, y_train)
rf_rmse = root_mean_squared_error(y_test, pipeline.predict(X_test))
rf_rmse

0.05788936960826042

The test error of the optimized random forests model is only slightly lower than that of the bagging model. Let’s finally take a look at feature importances. The feature_importances_ attribute of the RandomForestClassifier class stores importance scores computed as the mean of accumulation of the impurity decrease within each tree, where impurity refers to mean squared error for regression problems. Year_Built and Gr_Liv_Area are the two most important features.

importances = pd.Series(
  pipeline['rf'].feature_importances_, 
  index=pipeline['preprocess'].get_feature_names_out()
)
importances.sort_values(ascending=False)[:10]

__Year_Built        0.174668
__Gr_Liv_Area       0.165319
__Garage_Cars       0.093800
__Total_Bsmt_SF     0.062179
__Garage_Area       0.059375
__First_Flr_SF      0.048567
__Full_Bath         0.048047
__Year_Remod_Add    0.038761
__Fireplaces        0.029639
__Lot_Area          0.018698
dtype: float64

Boosting

Unlike bagging and random forests, boosting trains each tree in the ensemble on the modified version of the training data. Each successive tree uses the previous tree’s prediction errors (i.e., residuals). While we prefer individual trees to be complex/deep in bagging and random forests, simple/shallow trees are preferred in boosting. Simple trees learn slowly. Moreover, a learning rate parameter additionally controls the magnitude of residuals. The number of trees in the ensemble, tree complexity, and learning rate can be optimized through cross-validation.

The GradientBoostingRegressor class from the ensemble module of scikit-learn can build boosting ensembles. However, three are more efficient libraries to train boosting models such as XGBoost and LightGBM for which there are Python (and R) interface packages. We will use the lightgbm Python package in this post. Please install it in your (virtual) environment by running pip install lightgbm.

The lightgbm package offers the LGBMRegressor class to build boosting regressors. Below, we import it along with two functions: early_stopping() to create early stopping callback and plot_importance() to plot feature importances.

from lightgbm import LGBMRegressor, early_stopping, plot_importance

One strategy in optimizing a boosting model is to fix the number of trees to a relatively large number and have a callback attached to the training process to activate early stopping if the model does not improve in at least a preset number of rounds. To decide whether the model improves, a performance metric(s), say RMSE, is computed on a fraction of training data, called validation/evaluation data, after each boosting iteration. For instance, we may want to stop adding trees to the ensemble if the RMSE of the ensemble on the validation set does not improve for at least 100 boosting rounds/iterations.

To create the validation dataset, we will split our data twice. Because of this operation, we will preprocess our data before the first split.⁷ Below, we redefine the preprocessor for clarity and then create the training, validation, and test sets.

preprocessor = ColumnTransformer(
  transformers=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    ('', 'passthrough', make_column_selector(dtype_include=np.number))
  ]
)

X = preprocessor.fit_transform(ames.drop(columns='Sale_Price'))
y = ames['Sale_Price']
feature_names = preprocessor.get_feature_names_out()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

Next, we define a parameter grid. For the max_depth parameter, which controls tree complexity, we provide a list of small values as we want trees to be simple and learn slowly.

param_grid = {
  'max_depth': [1, 3, 5, 10],
  'learning_rate': [.1, .01, .001]
}

We now create a callback object using the early_stopping() function. We specify the number of rounds for which the performance metric computed on the validation set does not improve to 200. We then use the callback object to specify another parameters dictionary that will be passed to the fit() method of the LGBMRegressor class. In the same dictionary, we also specify the evaluation/validation set as a list of tuple pairs.⁸

early_stop_callback = early_stopping(stopping_rounds=200)

lgbm_params = {
  'callbacks': [early_stop_callback],
  'eval_set': [(X_val, y_val)]
}

Finally, we set up a grid search to evaluate the parameter grid using a 10-fold cross-validation with parallel processing. We use an instance of LGBMRegressor with 5000 trees as the estimator. We pass in lgbm_param to the fit() method of grid, which passes them to the same estimator method. The grid search takes approximately 8 minutes on my machine.

grid = GridSearchCV(
  estimator=LGBMRegressor(n_estimators=5000, random_state=2),
  param_grid=param_grid,
  scoring='neg_root_mean_squared_error',
  cv=10,
  n_jobs=8,
  refit=True
)

grid.fit(X_train, y_train, **lgbm_params)

The optimum value for the learning rate and maximum tree depth are 0.01 and 3, respectively.

grid.best_params_

{'learning_rate': 0.01, 'max_depth': 3}

We see almost a 12% improvement in test RMSE over random forests.

boost_rmse = root_mean_squared_error(y_test, grid.predict(X_test))
boost_rmse, 100 * (1 - boost_rmse / rf_rmse)

(0.051049829783347156, 11.814845922829509)

Feature importance

The LGBMRegressor also returns feature importance scores. There are two types of them. The ‘split’ type counts the number of times the feature is used in a model. The ‘gain’ type computes the total gains of splits, which uses the feature where the gain is the fall in L2 loss after a given split. The plot_importance() function can plot the feature importances of both types.⁹

Since we preprocessed our data, which dropped the feature names from the data, plot_importance() will use the Column_{column_index} pattern to name columns on the plot. Below, we call the function to plot the top 10 features by gain, get the y-axis tick labels, and rename them using the feature_names list we previously created by calling the get_feature_names_out() method of the preprocessor. We then use the new column names to update the tick labels. We see that Year_Built and Gr_Liv_Area are the two most important features.

fig, ax = plt.subplots()
plot_importance(
  grid.best_estimator_, max_num_features=10, 
  importance_type='gain', ax=ax, precision=1
)

yticklabels = ax.get_yaxis().get_majorticklabels()
labels = [label.get_text() for label in yticklabels]
column_idx = [int(label.replace('Column_', '')) for label in labels]
column_names = [feature_names[i].replace('__', '') for i in column_idx]
column_names = [column.replace('encoder', '') for column in column_names]
ax.set_yticklabels(column_names)
ax.set_title('Top 10 Important Features by L2 Loss Gain')
plt.show()

Summary

In this post, we used the scikit-learn and lightgbm packages to train the main tree-based models focusing on a regression problem. In the next post, we will keep exploring scikit-learn by concentrating on a classification problem.

Footnotes

The random_state parameter controls the shuffle.↩︎
The best_estimator_ attribute stores the best pipeline. To get the tree in the best pipeline, we access its _final_estimator attribute.↩︎
Once a ColumnTransformer is fit, the names of its transformers are used to create names for the transformed features.↩︎
See Hastie et al. (2021) for a detailed discussion of the strategy.↩︎
Since bagging is a special case of random forests, one can also use the RandomForestRegressor class from the same module to fit a bagging estimator.↩︎
When arguments are passed to a function as a dictionary, the dictionary must be preceded by **. For more on this, see the Python tutorial.↩︎
Since we only encode categorical variables, we do not need to worry about data leakage from the test set to the train set.↩︎
LGBMRegressor uses L2 loss, $\sqrt{\sum_{i}^{n} (y_{i} - {\hat{y}}_{i})^{2}}$ , both as the objective function to find the best splits and the default evaluation metric. See this exchange for the difference between L2 loss and mean squared error.↩︎
LGBMRegressor‘s importance_type parameter is set to 'split' by default to return the ’split’ type scores. It stores them in its feature_importances_ attribute. ↩︎