import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
= pd.read_csv('https://raw.githubusercontent.com/barisguven/datasets/main/data/ames_tm.csv')
ames 'Sale_Price'] = np.log10(ames['Sale_Price']) ames[
Tree-Based Regression Models with Scikit-learn
As we continue to explore the scikit-learn package, we now consider the following tree-based models: decision trees, bagging, random forests, and boosting. In this post, we will keep working on a regression problem and train these tree-based models using scikit-learn. We will also use lightgbm, a Python package for the LightGBM library that provides a highly efficient gradient boosting framework.
To start, we import the ames
dataset from the Github repo and replace Sale_Price
, the target variable, with its logarithm to the base 10.
Let’s split our data into train and test sets by reserving 20% of the data for the test set.
from sklearn.model_selection import train_test_split
= train_test_split(
X_train, X_test, y_train, y_test ='Sale_Price'),
ames.drop(columns'Sale_Price'],
ames.loc[:, =0.2,
test_size=123
random_state )
Decision Trees
To grow a single decision tree to predict the sale price of houses, we can use the DecisionTreeRegressor class from the tree
module of scikit-learn. Below, we create an instance of it and call its get_params()
method to view its parameters and their default values. The minimum number of observations required for splitting a node, min_samples_split
, is 2, and the minimum number of observations in a leaf node, min_samples_leaf
, is 1 by default. With such default values, the decision tree regressor will grow a highly complex/deep tree and overfit the data.
from sklearn.tree import DecisionTreeRegressor
DecisionTreeRegressor().get_params()
{'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'monotonic_cst': None,
'random_state': None,
'splitter': 'best'}
To grow a less complex tree at the expense of some bias but with the benefit of less variance for our predictions, we need to optimize some of its parameters using cross-validation. We can do this in two ways:
Optimize such parameters as
max_depth
,min_samples_split
, ormin_samples_leaf
;Grow a relatively deep tree and then optimize
ccp_alpha
, the cost complexity parameter, to prune it (i.e., implement cross-validated cost complexity pruning).
While decision trees can accommodate categorical variables, scikit-learn does not support them and requires data to have numeric type. Below, we create a preprocessor that one-hot encodes the categorical features and keeps the numeric features as they are.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
= ColumnTransformer(
preprocessor =[
transformers'encoder', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
('', 'passthrough', make_column_selector(dtype_include=np.number))
(
] )
We now combine the preprocessor with an instance of DecisionTreeRegressor
class to form a model. Decision trees are constructed by finding the feature/split that improves the criterion most (e.g., mean squared error for regression problems). Since they randomly break ties between features that improve the criterion equally, we set the random_state
parameter to an integer to obtain the same splits across different runs.
from sklearn.pipeline import Pipeline
= Pipeline(
pipeline =[
steps'preprocess', preprocessor),
('tree', DecisionTreeRegressor(random_state=3))
(
] )
Scikit-learn offers several classes to find the optimum hyperparameter values through cross-validation. The GridSearchCV class from the model_selection
module conducts an exhaustive search over specified parameter values. It requires an estimator (i.e., a model) and a parameter grid. The parameters and their values that will be tried out are specified as a dictionary. The cross-validation classes rely on the ParameterGrid class to generate all combinations of the hyperparameter grid.
We can construct a parameter grid using parameter names as keys and the list of parameter values as the corresponding values in a dictionary. Since our estimator is a pipeline, we must let the GridSearchCV
know that the parameters belong to the DecisionTreeRegressor
estimator. To do so, we need to use the step name for the estimator specified in the pipeline, which is 'tree'
in our case. For every parameter, the step name must be followed by two underscores, __
, and the parameter name to create a dictionary key.
The cv
option of GridSearchCV
can be specified similarly to the cross_validate()
function. We can set it to 10 for a 10-fold cross-validation. To illustrate another option, we can also set it to an instance of a cross-validator, for instance, KFold. We can set the shuffle
option of KFold
to True
to shuffle the observations before splitting data into k folds.1
Below, we create an instance of GridSearchCV
. We set its refit
option to True
to refit the estimator to the whole training data using the best parameters found. To conduct the grid search, we call the fit()
method by passing the training data into it.
from sklearn.model_selection import GridSearchCV, KFold
= {
param_grid 'tree__max_depth': [3, 5, 10, 15, 20],
'tree__min_samples_split': [2, 10, 20, 40]
}
= KFold(n_splits=10, shuffle=True, random_state=1)
kfold
= GridSearchCV(
grid =pipeline,
estimator=param_grid,
param_grid='neg_mean_squared_error',
scoring=kfold,
cv=True
refit
)
grid.fit(X_train, y_train) grid.best_params_
{'tree__max_depth': 10, 'tree__min_samples_split': 20}
We can view the best parameter combination by accessing the best_params_
attribute of grid
. We get a relatively complex tree with a depth of 10.
The best_score_
attribute of grid
stores the best (mean) test score. Its cv_results_
attribute, a dictionary, stores the cross-validation results, including test scores, parameter combinations, and more. Its best_estimator_
stores the decision tree regressor fitted to the entire training data using the best parameter values.
Let’s use the tree trained with the best parameters to predict the test set and compute the error. Since we previously set the refit
option of GridSearchCV
to True
, we can directly call the predict()
method of grid
.
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test, grid.predict(X_test))
0.0853639022906632
Finally, we plot the tree’s top section using the plot_tree
function. To do so, we first access the best tree of the grid.2 Since scikit-learn drops feature names from data frames to create arrays, we call the get_feature_names_out()
method of the preprocessor to generate the feature names.3
from sklearn.tree import plot_tree
= grid.best_estimator_._final_estimator
best
preprocessor.fit_transform(X_train)= preprocessor.get_feature_names_out()
feature_names
= plt.subplots(figsize=(8, 6), dpi=200)
fig, ax =2, feature_names=feature_names, ax=ax, fontsize=6, filled=True)
plot_tree(best, max_depth plt.show()
Year_Built
seems to be the most important factor in determining Sale_Price
. Houses built before 1986 with no fireplaces and gross living areas of less than 812 sqft have the lowest average sale price.
Pruning a tree
Another option to avoid overfitting involves optimizing the cost complexity parameter,
where cost_complexity_pruning_path()
method of the DecisionTreeRegressor
class returns a sequence of
Letting a tree grow as far as it can leads to a depth of 25 and over 2,200 leaves. This implies a long list of
= DecisionTreeRegressor(max_depth=10, random_state=3)
tree
= tree.cost_complexity_pruning_path(
ccp_path
preprocessor.fit_transform(X_train),
y_train
)
ccp_path.ccp_alphas.shape
(547,)
We are ready to run a grid search. Given that we still have many n_jobs
parameter of GridSearchCV
to the number of processes we want to run in parallel. I set it to 8, the number of (physical) cores on my machine.
= GridSearchCV(
grid =pipeline,
estimator={'tree__ccp_alpha': ccp_path.ccp_alphas},
param_grid='neg_mean_squared_error',
scoring=kfold,
cv=True,
refit=8
n_jobs
)
grid.fit(X_train, y_train) grid.best_params_
{'tree__ccp_alpha': np.float64(2.5419090091888236e-05)}
Let’s assess the performance of the pruned tree on the test set. The pruned tree does slightly worse than the previous tree that we tuned previously. Moreover, the first option takes less time and consumes less energy.
root_mean_squared_error(y_test, grid.predict(X_test))
0.08780243011867586
Bagging
Single trees are weak learners, and their predictions show high variance. Ensembling a sufficiently large number of complex trees can obtain low-variance learners. In the case of regression trees, bagging involves averaging the predictions from multiple trees grown on the bootstrapped samples of the training data, significantly reducing the variance of the averaged predictions. We can use the BaggingRegressor class from the ensemble module to build a bagging estimator.5
from sklearn.ensemble import BaggingRegressor
BaggingRegressor().get_params()
{'bootstrap': True,
'bootstrap_features': False,
'estimator': None,
'max_features': 1.0,
'max_samples': 1.0,
'n_estimators': 10,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
Below, we construct a pipeline using an instance of the BaggingRegressor
class as the final estimator. We pass n_estimators=500
to average the predictions from 500 trees and random_state=4
for reproducibility. Note that the class uses the bootstrap sampling by default and can grow trees with parallel processing, which can be activated through the n_jobs
parameter.
= Pipeline(
pipeline =[
steps'preprocess', preprocessor),
('bag', BaggingRegressor(n_estimators=500, n_jobs=5, random_state=4))
(
]
)
pipeline.fit(X_train, y_train)= pipeline.predict(X_test)
y_pred root_mean_squared_error(y_test, y_pred)
0.058854906717615035
The bagging model has a lower RMSE than the decision tree model we optimized previously. Using the test data, we finally plot the predicted sale price against the true sale price.
Code
= plt.subplots()[1]
ax =0.4)
ax.scatter(y_test, y_pred, alpha'True Sale Price (log $)')
ax.set_xlabel('Predicted Sale Price (log $)')
ax.set_ylabel('Test Performance of Bagging - 200 Trees')
ax.set_title( plt.show()
Random Forests
Like bagging, random forests also average the predictions from multiple trees trained on the bootstrapped samples. Unlike bagging, it randomly selects m features out of all p features to consider each time a split is attempted. This prevents trees from looking alike and helps decorrelate them, which in turn helps reduce the variance of the predictions further.
We can use the RandomForestRegressor class from the ensemble module to build a random forests model, another ensemble estimator.
Both for bagging and random forests, a sufficiently high number of trees that stabilizes the error is picked, so there is no need for tuning the number of trees in the forest. While oob_score
parameter of the RandomForestRegressor
class to enable this. It can be set to True
to use r2_score
by default or to a callable to use a custom metric to compute the OOB-error.
1] / 3) np.floor(preprocessor.fit_transform(X_train).shape[
np.float64(104.0)
Below, we instantiate a random forests regressor and preset the number of trees to 500. We pass in oob_score=root_mean_squared_error
to compute the RMSE on the OOB samples. We also pass in n_jobs=5
to assign the training work to 5 processes. We create the random forests model by combining the regressor with the preprocessor in a pipeline. Next, we define a list of 5 candidates for the max_features
parameter of the regressor and train the model using each candidate. We use the set_params()
method of the pipeline to update the max_features
parameter of the regressor with the candidates iterated over in the loop. Note that we pass in a dictionary to the method. Its key combines 'rf__'
, which designates the step of the pipeline, with the parameter name, 'max_features'
, which is to be set to the provided value.6 Finally, we print the OOB-RMSE for each candidate by accessing the oob_score_
attribute of the trained model.
from sklearn.ensemble import RandomForestRegressor
= RandomForestRegressor(
rf =500,
n_estimators=root_mean_squared_error,
oob_score=5,
n_jobs=4
random_state
)
= Pipeline([('preprocess', preprocessor), ('rf', rf)])
pipeline
= [1, 25, 50, 103, 150, 200]
max_features
for num in max_features:
**{'rf__max_features': num})
pipeline.set_params(
pipeline.fit(X_train, y_train)print(f'{num} features; OOB-RMSE: {pipeline._final_estimator.oob_score_}')
1 features; OOB-RMSE: 0.07897349374129993
25 features; OOB-RMSE: 0.06356656433531577
50 features; OOB-RMSE: 0.061893200952276635
103 features; OOB-RMSE: 0.06188636870372245
150 features; OOB-RMSE: 0.06200775371602834
200 features; OOB-RMSE: 0.06211195372354106
The lowest RMSE is obtained with 103 features. Below, we also compute the test RMSE.
**{'rf__max_features': 103})
pipeline.set_params(
pipeline.fit(X_train, y_train)= root_mean_squared_error(y_test, pipeline.predict(X_test))
rf_rmse rf_rmse
0.05788936960826042
The test error of the optimized random forests model is only slightly lower than that of the bagging model. Let’s finally take a look at feature importances. The feature_importances_
attribute of the RandomForestClassifier
class stores importance scores computed as the mean of accumulation of the impurity decrease within each tree, where impurity refers to mean squared error for regression problems. Year_Built
and Gr_Liv_Area
are the two most important features.
= pd.Series(
importances 'rf'].feature_importances_,
pipeline[=pipeline['preprocess'].get_feature_names_out()
index
)=False)[:10] importances.sort_values(ascending
__Year_Built 0.174668
__Gr_Liv_Area 0.165319
__Garage_Cars 0.093800
__Total_Bsmt_SF 0.062179
__Garage_Area 0.059375
__First_Flr_SF 0.048567
__Full_Bath 0.048047
__Year_Remod_Add 0.038761
__Fireplaces 0.029639
__Lot_Area 0.018698
dtype: float64
Boosting
Unlike bagging and random forests, boosting trains each tree in the ensemble on the modified version of the training data. Each successive tree uses the previous tree’s prediction errors (i.e., residuals). While we prefer individual trees to be complex/deep in bagging and random forests, simple/shallow trees are preferred in boosting. Simple trees learn slowly. Moreover, a learning rate parameter additionally controls the magnitude of residuals. The number of trees in the ensemble, tree complexity, and learning rate can be optimized through cross-validation.
The GradientBoostingRegressor class from the ensemble module of scikit-learn can build boosting ensembles. However, three are more efficient libraries to train boosting models such as XGBoost and LightGBM for which there are Python (and R) interface packages. We will use the lightgbm
Python package in this post. Please install it in your (virtual) environment by running pip install lightgbm
.
The lightgbm
package offers the LGBMRegressor class to build boosting regressors. Below, we import it along with two functions: early_stopping()
to create early stopping callback and plot_importance()
to plot feature importances.
from lightgbm import LGBMRegressor, early_stopping, plot_importance
One strategy in optimizing a boosting model is to fix the number of trees to a relatively large number and have a callback attached to the training process to activate early stopping if the model does not improve in at least a preset number of rounds. To decide whether the model improves, a performance metric(s), say RMSE, is computed on a fraction of training data, called validation/evaluation data, after each boosting iteration. For instance, we may want to stop adding trees to the ensemble if the RMSE of the ensemble on the validation set does not improve for at least 100 boosting rounds/iterations.
To create the validation dataset, we will split our data twice. Because of this operation, we will preprocess our data before the first split.7 Below, we redefine the preprocessor for clarity and then create the training, validation, and test sets.
= ColumnTransformer(
preprocessor =[
transformers'encoder', OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
('', 'passthrough', make_column_selector(dtype_include=np.number))
(
]
)
= preprocessor.fit_transform(ames.drop(columns='Sale_Price'))
X = ames['Sale_Price']
y = preprocessor.get_feature_names_out()
feature_names
= train_test_split(X, y, test_size=0.2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.1, random_state=0) X_train, X_val, y_train, y_val
Next, we define a parameter grid. For the max_depth
parameter, which controls tree complexity, we provide a list of small values as we want trees to be simple and learn slowly.
= {
param_grid 'max_depth': [1, 3, 5, 10],
'learning_rate': [.1, .01, .001]
}
We now create a callback object using the early_stopping() function. We specify the number of rounds for which the performance metric computed on the validation set does not improve to 200. We then use the callback object to specify another parameters dictionary that will be passed to the fit() method of the LGBMRegressor
class. In the same dictionary, we also specify the evaluation/validation set as a list of tuple pairs.8
= early_stopping(stopping_rounds=200)
early_stop_callback
= {
lgbm_params 'callbacks': [early_stop_callback],
'eval_set': [(X_val, y_val)]
}
Finally, we set up a grid search to evaluate the parameter grid using a 10-fold cross-validation with parallel processing. We use an instance of LGBMRegressor
with 5000 trees as the estimator. We pass in lgbm_param
to the fit()
method of grid
, which passes them to the same estimator method. The grid search takes approximately 8 minutes on my machine.
= GridSearchCV(
grid =LGBMRegressor(n_estimators=5000, random_state=2),
estimator=param_grid,
param_grid='neg_root_mean_squared_error',
scoring=10,
cv=8,
n_jobs=True
refit
)
**lgbm_params) grid.fit(X_train, y_train,
The optimum value for the learning rate and maximum tree depth are 0.01 and 3, respectively.
grid.best_params_
{'learning_rate': 0.01, 'max_depth': 3}
We see almost a 12% improvement in test RMSE over random forests.
= root_mean_squared_error(y_test, grid.predict(X_test))
boost_rmse 100 * (1 - boost_rmse / rf_rmse) boost_rmse,
(0.051049829783347156, 11.814845922829509)
Feature importance
The LGBMRegressor
also returns feature importance scores. There are two types of them. The ‘split’ type counts the number of times the feature is used in a model. The ‘gain’ type computes the total gains of splits, which uses the feature where the gain is the fall in L2 loss after a given split. The plot_importance()
function can plot the feature importances of both types.9
Since we preprocessed our data, which dropped the feature names from the data, plot_importance()
will use the Column_{column_index}
pattern to name columns on the plot. Below, we call the function to plot the top 10 features by gain, get the y-axis tick labels, and rename them using the feature_names
list we previously created by calling the get_feature_names_out()
method of the preprocessor. We then use the new column names to update the tick labels. We see that Year_Built
and Gr_Liv_Area
are the two most important features.
= plt.subplots()
fig, ax
plot_importance(=10,
grid.best_estimator_, max_num_features='gain', ax=ax, precision=1
importance_type
)
= ax.get_yaxis().get_majorticklabels()
yticklabels = [label.get_text() for label in yticklabels]
labels = [int(label.replace('Column_', '')) for label in labels]
column_idx = [feature_names[i].replace('__', '') for i in column_idx]
column_names = [column.replace('encoder', '') for column in column_names]
column_names
ax.set_yticklabels(column_names)'Top 10 Important Features by L2 Loss Gain')
ax.set_title( plt.show()
Summary
In this post, we used the scikit-learn and lightgbm packages to train the main tree-based models focusing on a regression problem. In the next post, we will keep exploring scikit-learn by concentrating on a classification problem.
Footnotes
The
random_state
parameter controls the shuffle.↩︎The
best_estimator_
attribute stores the best pipeline. To get the tree in the best pipeline, we access its_final_estimator
attribute.↩︎Once a
ColumnTransformer
is fit, the names of its transformers are used to create names for the transformed features.↩︎See Hastie et al. (2021) for a detailed discussion of the strategy.↩︎
Since bagging is a special case of random forests, one can also use the
RandomForestRegressor
class from the same module to fit a bagging estimator.↩︎When arguments are passed to a function as a dictionary, the dictionary must be preceded by
**
. For more on this, see the Python tutorial.↩︎Since we only encode categorical variables, we do not need to worry about data leakage from the test set to the train set.↩︎
LGBMRegressor
uses L2 loss, , both as the objective function to find the best splits and the default evaluation metric. See this exchange for the difference between L2 loss and mean squared error.↩︎LGBMRegressor
‘simportance_type
parameter is set to'split'
by default to return the ’split’ type scores. It stores them in itsfeature_importances_
attribute. ↩︎