Scikit-learn: Sampling, Feature Engineering, and Pipelines

machine learning
scikit-learn
Python
Published Mar 23, 2025

I have published four posts on machine learning in R, illustrating how to use the great tidymodels package to implement the most common machine learning tasks: (re)sampling, feature engineering (preprocessing), creating workflows by combining preprocessing with model specification, and tuning model hyperparameters with parallel processing. In the first three posts, I focused on regression problems and covered the following statistical models: linear regression, decision trees, bagged trees, random forests, and boosted trees. In the last post, focusing on classification problems, I covered logistic regression, linear discriminant analysis, quadratic discriminant analysis, the support vector classifier (hard and soft margin classifiers), and support vector machines.

In a new series of posts, I will illustrate how to use scikit-learn, a popular machine learning package written in Python, to implement the same machine learning tasks I covered in those four posts. The new series should be especially useful for readers who are familiar with machine learning in R and want to pick up the same skills in Python.

To follow along, you should install the scikit-learn package on your machine. I recommend following the steps on its install guide. If you create a virtual environment as recommended, please install the pandas and matplotlib packages in the same environment.

Loading data and initial inspection

I will use the pandas package to load and process data, numpy for math operations, and the pyplot module of the matplotlib package to visualize data. Below, I load them with import statements.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We will work on the ames housing data as before. This dataset is not included in scikit-learn’s built-in datasets, so we have to load it ourselves. I wrote the ames dataset from tidymodels to a CSV file and made it available in a GitHub repository. We can use the following line to read the dataset and create a dataframe from it.

ames = pd.read_csv('https://raw.githubusercontent.com/barisguven/datasets/main/data/ames_tm.csv')

The dataset contains 2930 records and 74 variables:

ames.shape
(2930, 74)

As previously observed, our response variable, Sale_Price, is right-skewed. The log transformation can drive it towards a normal distribution. Let’s replace the Sale_Price column with its log transformation.

ames['Sale_Price'] = np.log10(ames['Sale_Price'])

Splitting and (re)sampling data

If you are familiar with tidymodels, you know it is a meta-package containing many packages. Similarly, scikit-learn is a package containing many modules. Each module offers a set of functions and classes to undertake different machine learning tasks. The model_selection module has a variety of functions that allow random sampling from the original data to create training and test sets. It also has functions that handle tuning model hyperparameters using cross-validation resamples.

To get an initial split of our data into training and test sets, we can use the utility function train_test_split() from the sklearn.model_selection module. We start by importing the train_test_split() function and the ShuffleSplit class from sklearn’s model_selection module. Unlike R functions, Python functions can return multiple objects at once: for every data array it receives, train_test_split() returns a training and a test set. Below, we create the features data by naming it X and the target data y. We then pass these to train_test_split(), specifying the training set size and a random state for reproducibility.

from sklearn.model_selection import train_test_split, ShuffleSplit

X = ames.drop(columns='Sale_Price')
y = ames.loc[:, 'Sale_Price']

X_train, X_test, y_train, y_test = train_test_split(
  X, y, train_size=0.8, random_state=123
)

Both X_train and y_train contain 2344 (2930 x 0.8) rows as expected.

(X_train.shape, y_train.shape)
((2344, 73), (2344,))

Under the hood, train_test_split() relies on the ShuffleSplit class, a CV splitter (or cross-validator), to generate the training and test sets. The function generates random row indices to sample the training data from the original data, just like the initial_split() function in tidymodels. Running the code below, you can confirm that train_test_split() uses the same rows as ShuffleSplit, as expected.

rs = ShuffleSplit(n_splits=1, train_size=0.8, random_state=123)
split = rs.split(X)
train_index, test_index = next(split)
print(X.iloc[train_index].index.isin(X_train.index).sum() == X_train.shape[0])

The model_selection module also offers the KFold class to split data into k consecutive folds; the RepeatedKFold class performs repeated K-Fold. For classification problems, it also offers their stratified versions: StratifiedShuffleSplit, StratifiedKFold, and RepeatedStratifiedKFold. We will see how to use them in future posts.
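As a brief preview (a minimal sketch, not part of the workflow in this post), KFold can be used to generate fold indices from the features data X we created above; each iteration yields the row indices of the analysis and assessment sets:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=123)
for fold, (train_index, test_index) in enumerate(kf.split(X)):
  # each fold holds out roughly one fifth of the rows as the assessment set
  print(fold, len(train_index), len(test_index))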

Feature engineering

Three of scikit-learn’s modules offer tools to preprocess features:

  • preprocessing contains a variety of methods to preprocess features.
  • feature_selection offers additional methods to use in the preprocessing stage.
  • compose provides helper functions to select columns for preprocessing and chain preprocessing steps.

There is also a package called feature_engine (not part of scikit-learn) that provides preprocessing tools that complement scikit-learn’s capabilities. Please install it as we will use it below.

Log-transformation

To log-transform a feature, we can use the FunctionTransformer class in the preprocessing module. This class expects a function to apply to the features we want to transform. By passing np.log10 to it, we can compute the base-10 logarithm of a selected feature. Below, we create an instance of the FunctionTransformer class and then use its transform() method to log-transform the predictor Gr_Liv_Area.

from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log10)
transformer.transform(X_train['Gr_Liv_Area'])
1588    3.128399
1667    2.977724
970     3.248219
714     3.108565
354     3.235023
          ...   
1147    3.208710
2154    3.222196
1766    3.377488
1122    3.154728
1346    3.279211
Name: Gr_Liv_Area, Length: 2344, dtype: float64

Standardization

Standardization (or normalization) first centers a variable by subtracting its mean and then divides the centered variable by its standard deviation. For a variable X, its standardized version is given by

$$\frac{X - \text{mean}(X)}{\text{sd}(X)}$$

Consider now the feature Bedroom_AbvGr, the number of bedrooms above ground, in our data. Let’s compute its mean and standard deviation first.

mean = np.mean(X_train['Bedroom_AbvGr'])
std = np.std(X_train['Bedroom_AbvGr'])

(mean, std)
(np.float64(2.85580204778157), np.float64(0.8413536073230645))

We can now standardize it as follows:

(X_train['Bedroom_AbvGr'] - mean) / std
1588   -1.017173
1667    0.171388
970     0.171388
714    -1.017173
354     0.171388
          ...   
1147   -1.017173
2154    0.171388
1766    1.359949
1122    0.171388
1346    1.359949
Name: Bedroom_AbvGr, Length: 2344, dtype: float64

The StandardScaler class in the preprocessing module can do all that for us. Unlike the log-transformation case, we first need to compute the variable’s mean and standard deviation before transforming it. We can use the fit() method to achieve this. Below, we call the fit() method on the scaler object to obtain the mean and standard deviation of Bedroom_AbvGr. After the fit() method is executed, the mean and standard deviation statistics can be accessed through the mean_ and scale_ attributes of the scaler object.

from sklearn.preprocessing import StandardScaler

bedrooms = X_train[['Bedroom_AbvGr']]

scaler = StandardScaler()
scaler.fit(bedrooms)

(scaler.mean_, scaler.scale_)
(array([2.85580205]), array([0.84135361]))

Now that we have the statistics computed, we can standardize the variable by calling the .transform() method:

scaler.transform(bedrooms)
array([[-1.01717285],
       [ 0.17138805],
       [ 0.17138805],
       ...,
       [ 1.35994895],
       [ 0.17138805],
       [ 1.35994895]], shape=(2344, 1))

We can standardize a variable in a single step using the .fit_transform() method instead:

scaler.fit_transform(bedrooms)
array([[-1.01717285],
       [ 0.17138805],
       [ 0.17138805],
       ...,
       [ 1.35994895],
       [ 0.17138805],
       [ 1.35994895]], shape=(2344, 1))

StandardScaler also accepts two options: with_mean and with_std. Passing in with_mean=False means we want only to scale the variable; passing in with_std=False means we want only to center it. Both are set to True by default. The other scaler classes available in the preprocessing module are MaxAbsScaler, MinMaxScaler, and RobustScaler.
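As a quick illustration of the centering and scaling options (a sketch, not part of the original recipe), applied to the bedrooms data from above:

center_only = StandardScaler(with_std=False)  # subtract the mean only
scale_only = StandardScaler(with_mean=False)  # divide by the standard deviation only

center_only.fit_transform(bedrooms)[:3]
scale_only.fit_transform(bedrooms)[:3]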

Preprocessing multiple features in one step

Oftentimes, we want to apply a certain transformation to many features. In tidymodels, column selectors can select multiple variables for the same transformation. In scikit-learn, this can be achieved by combining the ColumnTransformer estimator with the make_column_selector() function, both from the compose module.

We first create an instance of ColumnTransformer by specifying two of its parameters. Its transformers parameter expects a list of 3-item tuples where each tuple specifies a certain transformer/estimator. The first item of a given tuple sets a name for the transformer. The second item is the estimator itself, or the string 'drop' or 'passthrough'. The third and final item is set to the specific column names/positions or a make_column_selector() call.

The make_column_selector() function selects columns by name or data type (dtype). For instance, to select only the numeric columns, we set its dtype_include option to np.number. To select string/categorical variables, we set it to object. To select columns by name, its pattern parameter can be set to a regular expression, as shown below.

The remainder option of ColumnTransformer should be set to 'passthrough' to keep the untransformed features in the feature space. It is set to 'drop' by default, which means that all other variables are dropped and only the transformed variables are returned.

from sklearn.compose import ColumnTransformer, make_column_selector

ct = ColumnTransformer(
  transformers=[('scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
  remainder='passthrough'
)

ct.fit_transform(X_train[['Bedroom_AbvGr', 'First_Flr_SF', 'Bldg_Type']])
array([[-1.0171728513822815, 0.4472547012821348, 'Duplex'],
       [0.1713880477403844, -0.5404799563533553, 'OneFam'],
       [0.1713880477403844, 1.5177184038464857, 'OneFam'],
       ...,
       [1.3599489468630503, -0.14438331699699125, 'OneFam'],
       [0.1713880477403844, 0.657837724737417, 'OneFam'],
       [1.3599489468630503, 0.15143569214257177, 'OneFam']],
      shape=(2344, 3), dtype=object)

Note that the array returned by the fit_transform() method contains two numeric features, the standardized versions of Bedroom_AbvGr and First_Flr_SF, respectively. The object-type Bldg_Type column is returned untouched.

One-hot and dummy encoding

We can use the OneHotEncoder estimator in the preprocessing module to encode categorical features into numeric features. OneHotEncoder can perform both one-hot and dummy encoding. The former creates a binary variable for each category of a given feature.

Bldg_Type is a string/object feature with five categories:

X_train['Bldg_Type'].unique()
array(['Duplex', 'OneFam', 'Twnhs', 'TwnhsE', 'TwoFmCon'], dtype=object)

Below, we first instantiate a one-hot encoder by passing handle_unknown='ignore' to OneHotEncoder. With this option, if an unknown category is encountered during transform, all five of the resulting binary variables will be 0 for that row, so unknown categories are effectively ignored. Since OneHotEncoder returns a sparse matrix by default, we use the toarray() method to obtain an array representation.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit_transform(X_train[['Bldg_Type']]).toarray()
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], shape=(2344, 5))

We can access the categories_ attribute of the encoder to view the categories that match the columns of the previous array. Note that only the first column of the first row is 1, which means that the first house in the training set is a ‘Duplex’. The second and third are ‘OneFam’ and so on.

encoder.categories_
[array(['Duplex', 'OneFam', 'Twnhs', 'TwnhsE', 'TwoFmCon'], dtype=object)]

To dummy encode categorical features, we need to set the drop option of OneHotEncoder to 'first', which drops the first category and generates one less binary variable than the number of total categories.

encoder = OneHotEncoder(handle_unknown='ignore', drop='first')
encoder.fit_transform(X_train[['Bldg_Type']]).toarray()
array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], shape=(2344, 4))

Note now that all columns of the first row are zero, which means that the first row belongs to the dropped category, ‘Duplex’. To view the index of the dropped category, you can access the encoder’s drop_idx_ attribute.
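For instance, after the dummy-encoding fit above:

encoder.drop_idx_
# should be array([0]): the category at index 0 of categories_, 'Duplex', is the dropped baseline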

Lumping

In tidymodels, you can use the step_other() function to lump the infrequent categories into the ‘Other’ category, which restricts the number of binary variables to be created. In scikit-learn, this can be achieved using OneHotEncoder. By setting its min_frequency parameter to a float between 0 and 1, we can make it treat categories whose proportional frequency is below that threshold as infrequent; the infrequent categories are then grouped into a single binary column. In the current data partitioning, we have six neighborhoods with a proportional frequency of less than 0.01, which will be considered infrequent.

(X_train['Neighborhood'].value_counts(ascending=True, dropna=False) / len(X_train['Neighborhood'])).head(10)
Neighborhood
Landmark               0.000427
Green_Hills            0.000853
Blueste                0.002986
Greens                 0.003413
Veenker                0.007253
Northpark_Villa        0.008106
Bloomington_Heights    0.010239
Briardale              0.010239
Meadow_Village         0.013225
Clear_Creek            0.014932
Name: count, dtype: float64

There are 28 categories in total:

X_train['Neighborhood'].nunique()
28

Given that six categories are considered infrequent, we should get 28 - 6 + 1 = 23 binary variables after one-hot encoding, which we do.

encoder = OneHotEncoder(min_frequency=0.01)
encoder.fit_transform(X_train[['Neighborhood']]).toarray().shape[1]
23

We can view the infrequent categories by accessing the infrequent_categories_ attribute of the encoder.

encoder.infrequent_categories_
[array(['Blueste', 'Green_Hills', 'Greens', 'Landmark', 'Northpark_Villa',
        'Veenker'], dtype=object)]

Yeo-Johnson transformation

We can use the YeoJohnsonTransformer estimator in the transformation module of the feature_engine package to drive features towards normality. Below, we apply the transformer to the Lot_Frontage, Gr_Liv_Area, and Latitude features and compare the transformed versions with the raw ones. It is important to note that the Yeo-Johnson transformation does not guarantee a (perfectly) normal distribution; its success depends on the original distribution.

from feature_engine.transformation import YeoJohnsonTransformer

yjt = YeoJohnsonTransformer()
cols = ['Lot_Frontage', 'Gr_Liv_Area', 'Latitude']

yjt_data = yjt.fit_transform(X_train[cols])
yjt_data = yjt_data.join(X_train[cols], how='inner', lsuffix=' (transformed)')
fig, axs = plt.subplots(3, 2, figsize=(8, 6), dpi=200)

for i, col in enumerate(cols):
  axs[i, 0].hist(yjt_data[col])
  axs[i, 0].set_xlabel(col)
  axs[i, 1].hist(yjt_data[col + ' (transformed)'])
  axs[i, 1].set_xlabel(col + ' (transformed)')

fig.suptitle('Yeo-Johnson Transformation')
plt.tight_layout()

Near-zero variance

The VarianceThreshold estimator in scikit-learn’s feature_selection module removes all low-variance features. Its threshold option is set to zero by default to remove zero-variance features. The downside of this feature selector is that it only deals with numeric features. Moreover, it is not obvious how to choose a threshold variance value for selecting the low-variance features.
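For completeness, here is a minimal sketch of VarianceThreshold applied to the numeric columns only; it is an illustration and not part of the preprocessing we build below.

from sklearn.feature_selection import VarianceThreshold

num_train = X_train.select_dtypes(include=np.number)  # numeric features only
vt = VarianceThreshold(threshold=0.0)  # keep the default: drop zero-variance features
vt.fit_transform(num_train).shape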

Taking inspiration from the nearZeroVar() function in the R package caret, we can write a similar function to identify zero-variance and near-zero-variance features, whether numeric or string/object. nearZeroVar() checks two conditions to determine whether a feature has near-zero variance:

  1. the feature has very few unique values relative to the number of samples (by default below 10%), and

  2. the ratio of the frequency of the most common value to the frequency of the second most common value is large (by default above 19 = 95/5)

Below, we write a similar function that takes in a dataframe and returns a list of features with either zero-variance or near-zero variance using the same default values as the nearZeroVar() function.

def near_zero_var(df, percent_unique=10, freq_ratio=19, sd=False):
  '''Returns a list of features in a DataFrame that have either zero variance or near-zero variance'''

  results = {}

  for col in df.columns:
    # ratio of the frequency of the most common value to that of the second most common value
    counts = df[col].value_counts(ascending=False)
    freq_ratio_ = counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else np.inf

    # percent of unique values relative to the number of samples
    percent_unique_ = 100 * df[col].nunique() / len(df)

    # zero variance: the feature has a single distinct value
    zero_var = df[col].nunique() == 1

    # near-zero variance: few unique values and a dominant most common value
    nzv = percent_unique_ < percent_unique and freq_ratio_ > freq_ratio

    if sd is False:
      results.update({col: (freq_ratio_, percent_unique_, zero_var, nzv)})
    else:
      sd_ = np.std(df[col]) if df[col].dtype != object else np.nan
      results.update({col: (freq_ratio_, percent_unique_, zero_var, nzv, sd_)})

  df_res = pd.DataFrame(results).transpose()
  columns = {0: 'freqRatio', 1: 'percentUnique', 2: 'zeroVar', 3: 'nzv'}

  if sd is True:
    columns.update({4: 'sd'})

  df_res.rename(columns=columns, inplace=True)
  selection = df_res.query('zeroVar == True or nzv == True')
  print(selection)
  return list(selection.index)

nzv_cols = near_zero_var(X_train)
                     freqRatio percentUnique zeroVar   nzv
Street                   233.4      0.085324   False  True
Alley                22.010101      0.127986   False  True
Land_Contour          21.71134      0.170648   False  True
Utilities               1170.5      0.127986   False  True
Land_Slope           22.785714      0.127986   False  True
Condition_2         193.166667      0.341297   False  True
Roof_Matl           135.823529      0.341297   False  True
Bsmt_Cond            22.052632      0.255973   False  True
BsmtFin_Type_2       23.552941      0.298635   False  True
BsmtFin_SF_2             413.8      9.897611   False  True
Heating             110.047619      0.255973   False  True
Kitchen_AbvGr        21.084906      0.170648   False  True
Functional           42.134615      0.341297   False  True
Enclosed_Porch        124.3125      6.953925   False  True
Three_season_porch       772.0      0.981229   False  True
Screen_Porch        237.555556      4.308874   False  True
Pool_Area               2334.0      0.469283   False  True
Pool_QC                  778.0      0.213311   False  True
Misc_Feature         22.333333      0.213311   False  True
Misc_Val            189.333333      1.237201   False  True

No feature has zero variance, but 20 features have near-zero variance. We now create a ColumnTransformer and set its transformer to 'drop', which will drop the columns in nzv_cols from the training data. Since there are 73 features in our data and we will drop 20, we will be left with a feature space of 53 features.

ct = ColumnTransformer([('nzv', 'drop', nzv_cols)], remainder='passthrough')

ct.fit_transform(X_train).shape[1]
53

Alternatively, we can use the DropConstantFeatures transformer from the selection module of the feature_engine package. Its tol parameter sets a threshold for identifying the constant and quasi-constant features to drop from a dataframe. With tol=0.9, it removes variables whose most frequent value appears in at least 90% of the observations. This criterion is close in spirit to the frequency-ratio condition above, so the transformer acts like a version of the nearZeroVar() function that checks only the prevalence of the most common value. Below, we create an instance of the transformer setting tol=0.9 and missing_values='include', then apply its fit_transform() method to X_train.

from feature_engine.selection import DropConstantFeatures

dcf = DropConstantFeatures(tol=0.9, missing_values='include')
dcf.fit_transform(X_train).shape[1]
53

We can view the features selected by the DropConstantFeatures transformer by accessing its features_to_drop_ attribute. Note that 20 features are identified as quasi-constant features. Note also that there is a good deal of overlap between this and the previous list of features.
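For instance, the list can be printed directly:

dcf.features_to_drop_  # list of the 20 quasi-constant feature names identified by DropConstantFeatures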

Separate vs. sequential transformations

Separate transformations

So far, we have only considered single transformations. The transformer list of ColumnTransformer can have more than one transformer. However, it is important to note that these transformers will work separately. In other words, each transformer will work on the designated column(s) separately and return the transformed columns. The transformed columns returned by each transformer will then be concatenated to form the final feature space. There is no sequentiality here, and we are not chaining the transformers.

Let’s consider an example. Below, we instantiate a ColumnTransformer with log and scaler transformers. The log transformer will take the logarithm of Gr_Liv_Area to the base 10 and return the log version. The scaler will standardize Gr_Liv_Area and return the standardized version. So the first column of the array returned after applying the fit_transform() method is log(Gr_Liv_Area) and the second is the standardized Gr_Liv_Area, not the standardized log(Gr_Liv_Area).

ct = ColumnTransformer(
  [
    ('log', FunctionTransformer(np.log10), ['Gr_Liv_Area']),
    ('scaler', StandardScaler(), ['Gr_Liv_Area'])
  ]
)

ct.fit_transform(X_train)
array([[ 3.12839927, -0.32133513],
       [ 2.97772361, -1.0894528 ],
       [ 3.24821856,  0.51111726],
       ...,
       [ 3.37748838,  1.70813311],
       [ 3.15472821, -0.15757401],
       [ 3.27921051,  0.76650663]], shape=(2344, 2))

Separate transformations can be very useful for preprocessing features of different data types. For instance, we can scale all numeric columns and one-hot encode all string columns before combining them as below.

ct = ColumnTransformer(
  [
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
    ('encoder', OneHotEncoder(), make_column_selector(dtype_include=object))
  ]
)

Suppose we want to apply the Yeo-Johnson transformation to all numeric features except for Gr_Liv_Area, which we want to log-transform. We can add a log transformer to the ColumnTransformer for Gr_Liv_Area and use the pattern parameter of make_column_selector() to exclude it from the Yeo-Johnson transformation. The pattern parameter expects a regular expression, and we can set it to ^(?!Gr_Liv_Area) to skip Gr_Liv_Area when transforming all other numeric variables.

Once a ColumnTransformer’s fit() method is called, its transformers_ attribute stores the fitted transformers and the names of the features they apply to in a list of tuples. ct.transformers_[1] is the second tuple of the ct object, and ct.transformers_[1][2] is the third item of that tuple, which holds the list of feature names designated for the Yeo-Johnson transformation. Finally, we confirm that Gr_Liv_Area is not in that list.

ct = ColumnTransformer(
  [
    ('log10', FunctionTransformer(np.log10), ['Gr_Liv_Area']),
    ('yeojohnson', YeoJohnsonTransformer(), make_column_selector(pattern='^(?!Gr_Liv_Area)', dtype_include=np.number))
  ]
)

ct.fit(X_train)

'Gr_Liv_Area' in ct.transformers_[1][2]
False

Sequential transformations

Suppose we want to log-transform Gr_Liv_Area and then standardize it. The Pipeline class from the pipeline module of sklearn allows us to apply a list of transformers to preprocess the data sequentially. Its steps parameter expects a list of 2-item tuples to specify the transformers. The first item of a given tuple sets the name of the transformer. The transformer itself is specified in the second item.

from sklearn.pipeline import Pipeline

preproc = Pipeline(
  steps=[
    ('log10', FunctionTransformer(np.log10)),
    ('scaler', StandardScaler())
  ]
)

preproc.fit_transform(X_train[['Gr_Liv_Area']])
array([[-0.19150261],
       [-1.25717671],
       [ 0.65593561],
       ...,
       [ 1.570214  ],
       [-0.00528762],
       [ 0.8751304 ]], shape=(2344, 1))

Putting it all together

The ColumnTransformer and Pipeline classes can be used together to implement more complex preprocessing steps. Suppose we want to drop the constant and quasi-constant features, apply the Yeo-Johnson transformation to all remaining numeric predictors, and then standardize them. Moreover, after dropping the constant and quasi-constant features, we want to one-hot encode the categorical features, treating categories with a prevalence of less than one percent as infrequent. Since we want to perform different transformations on numeric and categorical features, we build one sequential chain of transformations per type with the Pipeline class and then combine the two chains in a ColumnTransformer.

num_seq = Pipeline(
  steps=[
    ('nzv', DropConstantFeatures(tol=0.9, missing_values='include')),
    ('yeojohnson', YeoJohnsonTransformer()),
    ('scaler', StandardScaler())
  ]
)

cat_seq = Pipeline(
  steps=[
    ('nzv', DropConstantFeatures(tol=0.9, missing_values='include')),
    ('encoder', OneHotEncoder(min_frequency=0.01, handle_unknown='ignore'))
  ]
)

preprocessor = ColumnTransformer(
  [
    ('yeojohnson_scaler', num_seq, make_column_selector(dtype_include=np.number)),
    ('encoder', cat_seq, make_column_selector(dtype_include=object))
  ]
)

preprocessor.fit_transform(X_train[['Gr_Liv_Area', 'Bedroom_AbvGr', 'Heating', 'Bldg_Type']])
array([[-1.86167631e-01, -1.01707990e+00,  1.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [-1.26060869e+00,  1.91901915e-01,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 6.59202847e-01,  1.91901915e-01,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 1.56230891e+00,  1.34300318e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 2.78447133e-04,  1.91901915e-01,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 8.76563171e-01,  1.34300318e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],
      shape=(2344, 7))

Note that Gr_Liv_Area and Bedroom_AbvGr are transformed by YeoJohnsonTransformer and then standardized. Heating is dropped because it is quasi-constant. Finally, Bldg_Type, which has five categories, is one-hot encoded, yielding five binary columns. So, we have seven columns in the final feature space. The transformers and the features they apply to can be viewed by accessing the transformers_ attribute of preprocessor.
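For example, after the fit above:

preprocessor.transformers_  # list of (name, fitted transformer/pipeline, selected columns) tuples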

Model training

scikit-learn provides a long list of predictor classes to specify models. For instance, the LinearRegression class from the linear_model module specifies a linear regression model. The RandomForestClassifier from the ensemble module specifies a random forest classifier.

Unlike transformers, which have fit(), transform(), and fit_transform() methods, model classes (also called estimators) have fit() and predict() methods. Below, we create example instances of the LinearRegression and RandomForestClassifier classes.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

lm = LinearRegression()
rfc = RandomForestClassifier()

Unlike R and the tidymodels package, scikit-learn has no formula object for specifying the response/target variable and the features. The fit() method of model classes takes the features and the target variable as separate data, and each column of the features data represents a term in the implicit formula.

Moreover, unlike the model functions in R, which automatically create dummies for the categorical features by default, scikit-learn’s model classes expect categorical features to be already transformed into numeric type.

Below, we train a linear regression model to predict Sale_Price from only two features, Gr_Liv_Area and Bedroom_AbvGr, both already numeric. We then view the estimated coefficients by accessing the model object’s coef_ attribute. The order of coefficients matches the order of columns in the features data.

lm.fit(X_train[['Gr_Liv_Area', 'Bedroom_AbvGr']], y_train)
lm.coef_
array([ 0.00028543, -0.05046664])

We can also view the features by accessing the feature_names_in_ attribute of the model object:

lm.feature_names_in_
array(['Gr_Liv_Area', 'Bedroom_AbvGr'], dtype=object)

We have seen that we can use the ColumnTransformer class to perform separate transformations on select features and the Pipeline class to perform sequential transformations. We have also seen that we can combine these two classes to specify sequential transformations on select features.

We can use the Pipeline class to add a final predictor to the chain of transformations (i.e., the preprocessor) to specify a complete machine learning model. The Pipeline class has fit() and predict() methods just like the model classes. Below, we combine the preprocessor we created previously with an instance of the LinearRegression class to create a full model. We then fit it to the training data and use the pipeline’s predict() method on the features data to obtain the predicted values.

pipeline = Pipeline(
  [
    ('preprocessor', preprocessor),
    ('lm', LinearRegression())
  ]
)

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_train)

Let’s plot the predicted values of Sale_Price against its true values to have a visual sense of the model’s success:

fig, ax = plt.subplots()

ax.scatter(y_train, pred, s=20, alpha=0.3)

lower_lim = np.min([np.min(pred), np.min(y_train)])
upper_lim = np.max([np.max(pred), np.max(y_train)])
diag = np.linspace(lower_lim, upper_lim)
ax.plot(diag, diag, '--', alpha=0.5)

ax.set_xlabel('True Sale Price (log $)')
ax.set_ylabel('Predicted Sale Price (log $)')
plt.show()

The coefficient of determination, R2, of the prediction can be computed by calling the score() method of the pipeline, which calls the score() method of the final predictor, LinearRegression().

pipeline.score(X_train, y_train)
0.9131760404703481

To compute the root mean squared error (RMSE) of the prediction, we can use the root_mean_squared_error() function from the metrics module of scikit-learn.

from sklearn.metrics import root_mean_squared_error

root_mean_squared_error(y_train, pred)
0.052730940537809405

Model selection

We can decide between two model specifications by comparing their test error. Suppose we want to compare a specification that only drops the constant and quasi-constant features and one-hot encodes the categorical features against the specification we trained before, which additionally applies the Yeo-Johnson transformation to the numeric features and standardizes them. Below, we create a preprocessor to transform the features according to the first specification. We then recreate the previous pipeline. Finally, we create a list of tuples where each tuple holds a name and the corresponding pipeline.

preprocessor_basic = ColumnTransformer(
  transformers=[
    ('nzv', DropConstantFeatures(tol=0.9, missing_values='include'), make_column_selector(dtype_include=np.number)),
    ('nzv-encode', cat_seq, make_column_selector(dtype_include=object))
  ]
)

pipeline1 = Pipeline(
  steps=[
    ('preprocessor', preprocessor_basic),
    ('lm', LinearRegression())
  ]
)

pipeline2 = Pipeline(
  [
    ('preprocessor', preprocessor),
    ('lm', LinearRegression())
  ]
)

predictors = [('basic', pipeline1), ('yj-scale', pipeline2)]

We can use either the cross_validate() or cross_val_score() function from the model_selection module of scikit-learn to evaluate metric(s) by cross-validation. Unlike cross_validate(), which returns metrics and fit/score times, cross_val_score() returns only metrics. Each function expects an estimator/predictor and the training data.

The cv option of both functions defaults to None to perform 5-fold cross-validation using the KFold splitter for regression problems and the StratifiedKFold splitter for classification problems. The cv option can also be set to an integer to specify the number of folds. Alternatively, it can be set to a pre-defined CV splitter.

The scoring option of both functions also defaults to None in which case they use the estimator’s default scorer in its score() method. The scoring option can also be set to a string name or a callable (i.e., scorer function) from the metrics module of scikit-learn. Unlike cross_validate(), cross_val_score() permits a single metric. Multiple metrics can be passed into cross_validate() as a list or dictionary.

The string name ‘r2’ corresponds to the r2_score() function, the default scorer for the LinearRegression class, and ‘neg_root_mean_squared_error’ corresponds to the root_mean_squared_error() function. Since scorers follow the convention that higher return values are better than lower return values, error metrics such as root_mean_squared_error() are made available as ‘neg_root_mean_squared_error’, which returns the negated value of the metric. The scoring functions expect the true and predicted target/response values to compute the respective scores.
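As a hedged sketch of the multi-metric case (not run in the original post), cross_validate() accepts a list of string names and returns one test_<name> array per metric; here it is applied with 5-fold cross-validation to the pipeline2 object defined above:

from sklearn.model_selection import cross_validate

cv_res = cross_validate(
  pipeline2, X, y, cv=5,
  scoring=['r2', 'neg_root_mean_squared_error']
)
# mean R-squared and mean RMSE (negated scores flipped back) across the five folds
(cv_res['test_r2'].mean(), -cv_res['test_neg_root_mean_squared_error'].mean())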

Given that we have two models to compare and no optimization involved, we need to fit both models to the training data and then obtain their test error to decide which model performs better. Even though we do not need cross-validation proper, we can still use the cross_validate() function for this purpose. Passing n_splits=1 to ShuffleSplit yields a single split of the data into train and test sets. We can then pass the splitter to cross_validate(), which will fit each model to the training data and compute its test error. It returns a dictionary, and we can use the 'test_score' key to access the test score.

from sklearn.model_selection import cross_validate

split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=123)

print('Test scores')
for pred_name, predictor in predictors:
  cv_results = cross_validate(
    predictor, X, y, scoring='neg_root_mean_squared_error', cv=split
  )
  print(f"{pred_name}: {-cv_results['test_score']}")
Test scores
basic: [0.05658546]
yj-scale: [0.052866]

Judging by their (log) RMSE, the second model performs around 6.5% better than the first on the test set.
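For reference, the relative improvement implied by the two printed test RMSEs can be computed directly:

(0.05658546 - 0.052866) / 0.05658546  # ≈ 0.0657, i.e., roughly 6.5%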

Summary

In this post, we saw how to implement (re)sampling, preprocessing, model training, and model selection using scikit-learn, the Python package for predictive modeling. In the next post, we will learn how to train tree-based models and optimize their hyperparameters with cross-validation and parallel processing.

Footnotes

  1. scikit-learn installs numpy, scipy, and some additional packages automatically.

  2. Kaggle provides the original Ames housing data as well, but the tidymodels version has the additional Latitude and Longitude columns, with some columns removed.

  3. You can also use the LogTransformer class from the feature_engine package, which allows setting a base.

  4. For other feature scaling methods, see https://feaz-book.com/numeric.

  5. On whether one-hot and dummy encoding are different, see this exchange.

  6. The PowerTransformer class in the preprocessing module of scikit-learn can also do Yeo-Johnson and Box-Cox transformations and standardize the transformed features. However, my earlier inspection showed that it fails to transform the Latitude feature, so I won’t use it here.

  7. For more on the Yeo-Johnson transformation, see here.

  8. Run dcf.features_to_drop_ to view them.

  9. scikit-learn’s documentation refers to both transformers and models as estimators.

  10. There is also the cross_val_predict() function, which returns the cross-validated predictions, i.e., the predictions on the assessment sets (holdouts).

  11. See scikit-learn’s user guide for a broader list of scoring functions and their corresponding string names. The get_scorer_names() function from the metrics module returns the complete list of the string names of the scoring functions.

  12. We will optimize model hyperparameters using cross-validation in future posts.