Classification with Scikit-learn

classification
machine learning
scikit-learn
Python
Published: April 7, 2025

In the previous two posts of this series on scikit-learn, we focused on a regression problem and trained linear regression, decision trees, bagging, random forests, and boosting models. We will now concentrate on a classification problem and continue to use scikit-learn to train logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, k-nearest neighbors, and support vector classifiers. Let’s import the following packages to start.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Data

Initial inspection

We will use the stackoverflow dataset from tidymodels, the R meta-package for machine learning and statistical analysis. I made it available on a GitHub repository as a CSV file, which we can read into a data frame.

so = pd.read_csv("https://raw.githubusercontent.com/barisguven/datasets/main/data/stackoverflow_tm.csv")

The dataset contains 5594 observations and 21 variables, two of which are categorical, and the rest are numeric. There are no missing values.

so.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5594 entries, 0 to 5593
Data columns (total 21 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Country                               5594 non-null   object 
 1   Salary                                5594 non-null   float64
 2   YearsCodedJob                         5594 non-null   int64  
 3   OpenSource                            5594 non-null   int64  
 4   Hobby                                 5594 non-null   int64  
 5   CompanySizeNumber                     5594 non-null   int64  
 6   Remote                                5594 non-null   object 
 7   CareerSatisfaction                    5594 non-null   int64  
 8   Data_scientist                        5594 non-null   int64  
 9   Database_administrator                5594 non-null   int64  
 10  Desktop_applications_developer        5594 non-null   int64  
 11  Developer_with_stats_math_background  5594 non-null   int64  
 12  DevOps                                5594 non-null   int64  
 13  Embedded_developer                    5594 non-null   int64  
 14  Graphic_designer                      5594 non-null   int64  
 15  Graphics_programming                  5594 non-null   int64  
 16  Machine_learning_specialist           5594 non-null   int64  
 17  Mobile_developer                      5594 non-null   int64  
 18  Quality_assurance_engineer            5594 non-null   int64  
 19  Systems_administrator                 5594 non-null   int64  
 20  Web_developer                         5594 non-null   int64  
dtypes: float64(1), int64(18), object(2)
memory usage: 917.9+ KB

The variable Remote indicates the work status of the developers who participated in the StackOverflow survey. It takes only two values, 'Not remote' and 'Remote', which serve as our classes. It is our target variable, and we want to find the best classification model for predicting work status from the available features. We therefore have a binomial, or two-class, classification problem. Only about 10% of developers reported 'Remote' work status (the minority class), implying a class imbalance in our sample.

so.value_counts('Remote', normalize=True)
Remote
Not remote    0.897211
Remote        0.102789
Name: proportion, dtype: float64

The developers were from five countries, and half were from the United States.

so.value_counts('Country', normalize=True)
Country
United States     0.498927
United Kingdom    0.182875
Germany           0.135324
India             0.096174
Canada            0.086700
Name: proportion, dtype: float64

While CompanySizeNumber is a numeric feature, it only takes eight distinct values, and we can one-hot encode it to improve the model performance.

so.loc[:, 'CompanySizeNumber'].nunique()
8

Encoding the class labels

While scikit-learn estimators can work with string target labels, string labels can occasionally cause errors. We can use the LabelEncoder utility class from the preprocessing module to normalize the target labels so that they contain only the values 0, 1, and so on. In our case, 'Not remote' will be encoded as 0 and 'Remote' as 1.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(so.loc[:, 'Remote'])
np.unique(y), y.shape
(array([0, 1]), (5594,))

We can view the class labels of the encoder by accessing its classes_ attribute. To confirm that 0 represents “Not remote” and 1 “Remote”, we can use the inverse_transform() method of the encoder.

le.classes_
array(['Not remote', 'Remote'], dtype=object)
le.inverse_transform([1, 0])
array(['Remote', 'Not remote'], dtype=object)

Initial split and down-sampling

The class imbalance in our sample will push our classification models toward predicting the majority class most of the time, making them barely better than a naive classifier that always predicts the majority class. One way to address this problem is to down-sample the majority class, i.e., randomly drop majority-class observations until its frequency matches that of the minority class. Unlike tidymodels, scikit-learn has no function or class that does this in a single step, but we can use the resample function from its utils module, which randomly selects a desired number of observations from each array it takes as input. Down-sampling must be done after the data is split into train and test sets to avoid data leakage. Below, we perform the initial split, passing stratify=y to train_test_split() so that the train and test sets preserve the original class proportions, and then down-sample the majority class in the train set to balance the two class frequencies. Finally, we combine the down-sampled majority class with the minority class.

from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# initial train/test split
X_train, X_test, y_train, y_test = train_test_split(
  so.drop(columns='Remote'),
  y,
  stratify=y,
  test_size=0.2,
  random_state=0
)

# class filters
major_class = (y_train == 0)
minor_class = (y_train == 1)

# down-sample the majority class in the train set
X_train_down, y_train_down = resample(
  X_train.loc[major_class],
  y_train[major_class],
  replace=False,
  n_samples=np.sum(minor_class),
  random_state=1
)

# combine down-sampled majority class with minority class
X_train = pd.concat([X_train_down, X_train.loc[minor_class]])
y_train = np.concatenate([y_train_down, y_train[minor_class]])

[data.shape for data in [X_train_down, y_train_down, X_train, y_train]]
[(460, 20), (460,), (920, 20), (920,)]

Feature engineering

Scikit-learn estimators require numeric input data, so we need to convert Country into a numeric representation, which we do by one-hot encoding it. Given the earlier observation, we also one-hot encode CompanySizeNumber and exclude it from the second transformer, which passes through the remaining numeric features. Since some classification models need features to be scaled, we add StandardScaler as the second step of the pipeline that forms our preprocessor.

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline, make_pipeline

ct = ColumnTransformer(
  transformers=[
    # one-hot encode Country and the eight-valued CompanySizeNumber
    ('encode', OneHotEncoder(handle_unknown='ignore'), ['Country', 'CompanySizeNumber']),
    # pass through the remaining numeric columns (everything except CompanySizeNumber)
    ('num', 'passthrough', make_column_selector(pattern='^(?!CompanySizeNumber)', dtype_include=np.number))
  ]
)

preprocessor = Pipeline([('encode', ct), ('scale', StandardScaler())])
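As a quick sanity check (a minimal sketch, not part of the original pipeline runs), we can fit the preprocessor on the training features alone and inspect the names of the columns it produces; the exact number of one-hot columns depends on the categories present in the train set.

# fit the preprocessor by itself and list the transformed feature names
feature_names = preprocessor.fit(X_train).get_feature_names_out()
len(feature_names), feature_names[:3]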

Logistic Regression

The LogisticRegression estimator from the linear_model module of scikit-learn can fit a logistic regression model. The logistic regression classifier uses the standard logistic function, i.e., the sigmoid function, to model the conditional probability that observation $i$ belongs to the positive class:

$$P(y_i = 1 \mid X_i) = p(X_i) = \frac{1}{1 + e^{-(X_i w + w_0)}}.$$

The LogisticRegression estimator uses the log loss (negative log-likelihood) as its objective function and applies L2 regularization by default. This means that it minimizes the following loss function:

$$\min_{w} \frac{1}{S}\sum_{i=1}^{n} s_i \left[-y_i \log(p(X_i)) - (1 - y_i)\log(1 - p(X_i))\right] + \frac{r(w)}{S\,C}$$

where $s_i$ is the weight assigned to observation $i$ (the product of its sample weight and the weight assigned to its class), with $S = \sum_{i=1}^{n} s_i$; $r(w)$ is the regularization term; and $C$ is the inverse regularization strength. When no sample or class weights are provided, every $s_i$ equals one and $S = n$. With this, it can be shown that minimizing this loss is equivalent to maximizing the log-likelihood.
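To make the objective concrete, here is a minimal NumPy sketch of the unweighted case ($s_i = 1$, $S = n$) with the default L2 penalty, $r(w) = \lVert w \rVert_2^2 / 2$. The names w, w0, X, and y are hypothetical stand-ins for a coefficient vector, an intercept, a numeric feature matrix, and a 0/1 label vector; they are not taken from the fitted model below.

def logit_objective(w, w0, X, y, C=1.0):
  # sigmoid: P(y_i = 1 | X_i)
  p = 1 / (1 + np.exp(-(X @ w + w0)))
  # average negative log-likelihood (s_i = 1, S = n)
  log_loss = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
  # L2 penalty r(w) = ||w||^2 / 2, scaled by 1 / (S * C)
  return log_loss + (w @ w) / (2 * len(y) * C)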

Let's first train a logistic regression model with no regularization and obtain its test accuracy. The penalty parameter of the LogisticRegression class controls the type of regularization, $r(w)$ in the previous equation; it can be set to None to drop the regularization term from the loss function. Below, we combine the preprocessor we previously created with the logistic classifier using the make_pipeline() function. We then fit the classifier to the training data and access its classes_ attribute to view the classes. The second class is our positive class ('Remote').

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(penalty=None)
pipeline = make_pipeline(preprocessor, logit)
pipeline.fit(X_train, y_train)
pipeline[1].classes_
array([0, 1])

The predict() method of the classifier computes the class probabilities and assigns a test observation to the positive class whenever its predicted probability of belonging to that class exceeds 0.5. Below, we report the test accuracy of the classifier by computing the fraction of predicted classes that match the actual classes.

y_pred = pipeline.predict(X_test)
np.mean(y_test == y_pred)
np.float64(0.6398570151921358)

Let’s confirm the previous claim. We can use the predict_proba() method of the classifier models to compute the class probabilities. The second column of y_prob stores the probability that a test observation is from the second class ‘Remote’.

y_prob = pipeline.predict_proba(X_test)
y_prob[:5, ]
array([[0.04759533, 0.95240467],
       [0.28911888, 0.71088112],
       [0.57955305, 0.42044695],
       [0.22462084, 0.77537916],
       [0.57075499, 0.42924501]])

We now apply the same rule ourselves, assigning observations whose positive-class probability exceeds 0.5 to the 'Remote' class, and compute the accuracy, which matches the previous result.

y_pred = np.array([0]*len(y_test))
y_pred[y_prob[:, 1] > 0.5] = 1
np.mean(y_test == y_pred)
np.float64(0.6398570151921358)

Finding the optimum C through cross-validation

LogisticRegression uses regularization to prevent overfitting. The inverse regularization strength, C, is preset to one, but we can tune it using the LogisticRegressionCV class. Its Cs parameter can be set to an integer, in which case the class builds a grid of that many C values on a logarithmic scale between $10^{-4}$ and $10^4$. Once its fit() method is called, it conducts a grid search using stratified k-fold cross-validation, which can be controlled through its cv parameter, and then trains the model with the optimal C value it finds. Below, we try 10 C values using stratified 10-fold cross-validation and compute the test accuracy.

from sklearn.linear_model import LogisticRegressionCV

logit = LogisticRegressionCV(Cs=10, cv=10, random_state=2)

pipeline = make_pipeline(preprocessor, logit)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
np.mean(y_pred == y_test)
np.float64(0.6407506702412868)

The optimum C value is stored in the C_ attribute of the LogisticRegressionCV class.

pipeline[1].C_
array([0.35938137])
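For reference, the candidate grid implied by Cs=10 contains ten values evenly spaced on a logarithmic scale between $10^{-4}$ and $10^4$; after fitting, it is stored in the Cs_ attribute (a quick check whose output we omit).

pipeline[1].Cs_  # equivalent to np.logspace(-4, 4, num=10)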

To further assess the model’s predictive performance, we can compute the confusion matrix and the related statistics such as accuracy, true positive rate (recall or sensitivity), positive predictive value (precision), and F1 score.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

ax = plt.subplots(figsize=(5, 4))[1]
ax.grid(visible=False)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=['Not remote', 'Remote'])
disp.plot(ax=ax)
plt.show()

print('Accuracy:', (79 + 638) / (79 + 638 + 36 + 366))
print('Recall/sensitivity:', 79 / (79 + 36))
print('Precision:', 79 / (79 + 366))
print('F1 score:', 2 * 79 / (2 * 79 + 366 + 36))
Accuracy: 0.6407506702412868
Recall/sensitivity: 0.6869565217391305
Precision: 0.17752808988764046
F1 score: 0.28214285714285714

Alternatively, we can use the scoring functions from the metrics module to compute the same statistics.

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

(
  accuracy_score(y_test, y_pred),
  recall_score(y_test, y_pred), 
  precision_score(y_test, y_pred),
  f1_score(y_test, y_pred)
)
(0.6407506702412868,
 0.6869565217391305,
 0.17752808988764046,
 0.28214285714285714)

Since we will use both accuracy and recall to assess the performance of the classifiers we train in this post, we write a utility function that computes and returns these scores. We also create a dictionary to store the test scores of the classifiers.

def accuracy_recall(y_test, y_pred):
  scores = {
    'accuracy': accuracy_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred)
  }
  return scores

test_results = {}
logit_res = accuracy_recall(y_test, y_pred)
test_results.update({'LOGIT': logit_res})

Finally, let's plot the ROC curve. The roc_curve() function takes the true class labels and the positive-class probabilities and returns the false positive rates, true positive rates, and the thresholds used in the decision rule to produce them. The RocCurveDisplay class takes the false and true positive rates to draw the ROC curve. The auc() function takes the same inputs and returns the area under the ROC curve, which can be passed to RocCurveDisplay to display it at the bottom right of the plot.

from sklearn.metrics import roc_curve, auc, RocCurveDisplay

fpr, tpr, thresholds = roc_curve(y_test, y_prob[:, 1])
disp = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc(fpr, tpr))
disp.plot()
plt.show()

Generative Models

Unlike the logistic classifier, which models the conditional probability that an observation comes from a given class, generative models directly model the distribution of predictors separately for each response class.

Linear Discriminant Analysis

Linear discriminant analysis (LDA) assumes that the predictors are drawn from a multivariate Gaussian (i.e., normal) distribution with a class-specific mean vector and a common covariance matrix. It estimates the class-specific means and the common covariance matrix from the training data and combines them with the prior class probabilities to compute the posterior probability that an observation belongs to a given class. The resulting discriminant function, which defines the decision boundary, is linear in the predictors.
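Concretely, with shared covariance matrix $\Sigma$, class means $\mu_k$, and prior class probabilities $\pi_k$, the discriminant function for class $k$ takes the familiar linear form

$$\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k,$$

and an observation is assigned to the class with the largest $\delta_k(x)$.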

We can use the LinearDiscriminantAnalysis class from the discriminant_analysis module to train an LDA model. We turn on its store_covariance option to store the covariance matrix.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(store_covariance=True)
pipeline = make_pipeline(ct, lda)

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
lda_res = accuracy_recall(y_test, y_pred)
lda_res
{'accuracy': 0.6380697050938338, 'recall': 0.6869565217391305}

The LDA model has a similar accuracy and recall (true positive rate) to the logistic classifier. You can use the priors_, means_, and covariance_ attributes of lda to access the prior class probabilities, class-specific means, and common covariance matrix, respectively. We now add the test results to our results dictionary.

test_results.update({'LDA': lda_res})
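As a quick look at the fitted generative model (a minimal check whose output we omit), these attributes can be inspected directly:

lda_fit = pipeline[1]  # the fitted LDA step of the pipeline
lda_fit.priors_, lda_fit.means_.shape, lda_fit.covariance_.shape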

Quadratic Discriminant Analysis

Like LDA, quadratic discriminant analysis (QDA) also assumes a class-specific mean vector. But unlike LDA, it assumes a class-specific covariance matrix for predictors. The second assumption makes the discriminant function quadratic in predictors and generates a non-linear decision boundary.
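With a class-specific covariance matrix $\Sigma_k$, the discriminant function picks up terms that are quadratic in $x$:

$$\delta_k(x) = -\frac{1}{2}\log\lvert\Sigma_k\rvert - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log \pi_k.$$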

We can use the QuadraticDiscriminantAnalysis class from the discriminant_analysis module to train a QDA model. Its reg_param parameter, preset to 0, regularizes the per-class covariance estimates; increasing it helps reduce collinearity in the covariance matrices. Below, we try 101 evenly spaced values in the range [0, 1] and compute the cross-validated accuracy and recall for each value.

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

pipeline = Pipeline([('preprocess', ct), ('qda', QDA())])

reg_params = np.linspace(0, 1, num=101)

grid = GridSearchCV(
  pipeline,
  param_grid={'qda__reg_param': reg_params},
  scoring=['accuracy', 'recall'],
  cv=10,
  refit=False
)
grid.fit(X_train, y_train)

results = {
  'regularization parameter': reg_params,
  'accuracy': grid.cv_results_['mean_test_accuracy'],
  'recall': grid.cv_results_['mean_test_recall']
}

results = pd.DataFrame(results)
results.plot(x='regularization parameter')
plt.show()

Note that the default value of the reg_param parameter (0) yields the lowest accuracy and highest recall score. We can use the parameter value that yields the highest accuracy to train the model, which is 0.26.

reg_params[np.argmax(results['accuracy'])]
np.float64(0.26)

The QDA classifier has a higher recall and a lower accuracy than the LDA classifier.

qda =  QDA(reg_param=reg_params[np.argmax(results['accuracy'])])

pipeline = make_pipeline(ct, qda)

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
qda_res = accuracy_recall(y_test, y_pred)
qda_res
{'accuracy': 0.6184092940125112, 'recall': 0.7391304347826086}
test_results.update({'QDA': qda_res})

Naive Bayes

The naive Bayes (NB) classifier assumes that the predictors are independent within each class. We can use the GaussianNB class to train an NB classifier. This class additionally assumes that each predictor follows a normal distribution within each class and uses maximum likelihood to estimate the class-specific mean and variance of each predictor.
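Under these assumptions, the class posterior factorizes over the predictors, each contributing a univariate Gaussian density with class-specific mean $\mu_{kj}$ and variance $\sigma_{kj}^2$:

$$P(y_i = k \mid X_i) \propto \pi_k \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_{kj}^{2}}} \exp\!\left(-\frac{(x_{ij} - \mu_{kj})^{2}}{2\sigma_{kj}^{2}}\right).$$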

from sklearn.naive_bayes import GaussianNB

pipeline = make_pipeline(ct, GaussianNB())

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
nb_res = accuracy_recall(y_test, y_pred)
nb_res
{'accuracy': 0.7033065236818588, 'recall': 0.6173913043478261}

The NB classifier has a noticeably higher accuracy and a noticeably lower recall than the LDA classifier. We now add the test scores to the results dictionary.

test_results.update({'NBayes': nb_res})

K-Nearest Neighbors

Unlike the previous methods we have looked at, K-nearest neighbors (KNN) is a non-parametric method. For a given test observation $x_0$ and a positive integer $K$, the KNN classifier first uses the Minkowski distance to identify the $K$ points closest to $x_0$ in the training data. It then estimates the conditional probability of each class as the fraction of those $K$ nearest neighbors that belong to that class. Finally, it assigns $x_0$ to the class with the largest estimated probability. We can use the KNeighborsClassifier class to train a KNN classifier.
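In symbols, writing $\mathcal{N}_0$ for the set of the $K$ training points closest to $x_0$ under the Minkowski distance of power $p$, the distance and the estimated class probabilities are

$$d(x_0, x_i) = \Bigl(\sum_{j} \lvert x_{0j} - x_{ij}\rvert^{p}\Bigr)^{1/p}, \qquad \hat{P}(y = k \mid x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = k),$$

where $I(\cdot)$ is the indicator function and the sum over $j$ runs over the features.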

The number of nearest neighbors and the distance power can be tuned via cross-validation. Below, we run 10-fold cross-validation and evaluate accuracy and recall over a grid of seventy-five values of K and six distance powers (450 candidates in total). Since we evaluate the model with two metrics, we pass refit='accuracy' to GridSearchCV so that, once the grid search is done, it retrains the classifier with the parameter combination that maximizes accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('preprocess', preprocessor), ('knn', KNeighborsClassifier())])

# grid parameters
neighbors = np.arange(1, 301, 4)
dist_power = [1, 2, 3, 4, 5, 6]

metrics = ['accuracy', 'recall']

grid = GridSearchCV(
  pipeline,
  param_grid={
    'knn__p': dist_power,
    'knn__n_neighbors': neighbors
  },
  scoring=metrics,
  cv=10,
  n_jobs=5,
  refit='accuracy'
)
grid.fit(X_train, y_train)

# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6), dpi=200)[1]

for i in [0, 1]:
  cv_results = grid.cv_results_['mean_test_' + metrics[i]]
  cv_results = cv_results.reshape(len(neighbors), len(dist_power))
  for p in dist_power:
    axs[i].plot(neighbors, cv_results[:, p-1], label=p)
    axs[i].legend(title='dist_power', loc='lower right')
    axs[i].set_xlabel('# of neighbors')

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()

The $(K, p) = (141, 1)$ combination yields the highest accuracy. Interestingly, recall increases as the distance power gets larger and the number of neighbors grows to around 200. However, this comes at the expense of accuracy, because the number of false positives keeps rising as the classifier predicts 'Remote' more and more often.

grid.best_params_
{'knn__n_neighbors': np.int64(141), 'knn__p': 1}

Finally, we train the classifier with the optimum values, obtain the test scores, and add them to the results dictionary.

pipeline.set_params(**grid.best_params_)
y_pred = pipeline.fit(X_train, y_train).predict(X_test)

knn_res = accuracy_recall(y_test, y_pred)
knn_res
{'accuracy': 0.5415549597855228, 'recall': 0.7652173913043478}
test_results.update({'KNN': knn_res})

Support Vector Machines

Support vector machines (SVMs) find a separating hyperplane to classify observations. It can be shown that the resulting decision function can be written as a weighted sum of kernel evaluations, where the kernel is a function of the inner products of pairs of observations and measures their similarity. The kernel can be linear in these inner products, while polynomial and radial kernels are non-linear in them.

We can use the SVC class from the svm module of scikit-learn to train support vector classifiers. The C parameter controls the cost of observations violating the margin. Its kernel option can be set to 'linear', 'poly', or 'rbf' to specify a linear, polynomial, or radial kernel, respectively. The degree of the polynomial kernel is controlled by the degree parameter, and for the polynomial and radial kernels, the gamma parameter controls the $\gamma$ coefficient in the kernel function.
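For reference, the kernel functions behind these options are, in scikit-learn's parameterization (with $d$ the degree, $\gamma$ the kernel coefficient, and $r$ the coef0 parameter, which we leave at its default of 0):

$$K_{\text{linear}}(x, x') = \langle x, x'\rangle, \qquad K_{\text{poly}}(x, x') = (\gamma\langle x, x'\rangle + r)^{d}, \qquad K_{\text{rbf}}(x, x') = \exp(-\gamma\lVert x - x'\rVert^{2}).$$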

We start with tuning the cost parameter of the support vector classifier (SVC) that uses a linear kernel.

from sklearn.svm import SVC

pipeline = Pipeline(
  steps=[
    ('preprocess', preprocessor),
    ('svc_lin', SVC(kernel='linear'))
  ]
)

cost = [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]

grid = GridSearchCV(
  pipeline,
  param_grid={'svc_lin__C': cost},
  scoring=['accuracy', 'recall'],
  cv=10,
  refit='accuracy'
)
grid.fit(X_train, y_train)

results = {
  'accuracy': grid.cv_results_['mean_test_accuracy'],
  'recall': grid.cv_results_['mean_test_recall']
}
results = pd.DataFrame(results)

ax = plt.subplots()[1]
results.plot(ax=ax)
ax.set_xticks(ticks=range(len(cost)), labels=cost)
ax.set_xlabel('Cost')
plt.show()

The highest accuracy is obtained with C=1.

grid.best_params_
{'svc_lin__C': 1}

We now train the classifier with the optimum cost value to obtain the test scores and add them to the results dictionary.

pipeline.set_params(**grid.best_params_)
y_pred = pipeline.fit(X_train, y_train).predict(X_test)

svc_lin_res = accuracy_recall(y_test, y_pred)
svc_lin_res
{'accuracy': 0.6496872207327972, 'recall': 0.7217391304347827}
test_results.update({'SVC_L': svc_lin_res})

Next, we tune the cost and degree parameters of the SVC that uses a polynomial kernel.

pipeline = Pipeline(
  steps=[
    ('preprocess', preprocessor),
    ('svc_poly', SVC(kernel='poly'))
  ]
)

cost = [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
degree = [1, 2, 3, 4]

grid = GridSearchCV(
  pipeline,
  param_grid={'svc_poly__C': cost, 'svc_poly__degree': degree},
  scoring=['accuracy', 'recall'],
  cv=10,
  refit='accuracy'
)

grid.fit(X_train, y_train)

# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6))[1]

for i in [0, 1]:
  cv_results = grid.cv_results_['mean_test_' + metrics[i]]
  cv_results = cv_results.reshape(6, 4)

  for j in range(len(degree)):
    axs[i].plot(cv_results[:, j], label=degree[j])
    axs[i].legend(title='degree', loc='lower right')
    axs[i].set_xlabel('Cost')

  axs[i].set_xticks(ticks=range(len(cost)), labels=cost)

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()

grid.best_params_
{'svc_poly__C': 10.0, 'svc_poly__degree': 1}
pipeline.set_params(**{'svc_poly__C': 10, 'svc_poly__degree': 1})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
svc_poly_res = accuracy_recall(y_test, y_pred)
svc_poly_res
{'accuracy': 0.6470062555853441, 'recall': 0.7217391304347827}
test_results.update({'SVC_P': svc_poly_res})

Note that, as far as recall is concerned, the combination of a degree of 4 and a cost of 0.001 is the best: a classifier trained with these values achieves a recall of 1 but an accuracy of only 10.5%. This classifier is useless, however, because it predicts 'Remote' almost all the time; a naive classifier that always predicts 'Remote' would achieve the same recall without any computation.

pipeline.set_params(**{'svc_poly__C': 1e-3, 'svc_poly__degree': 4})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
accuracy_recall(y_test, y_pred)
{'accuracy': 0.10545129579982127, 'recall': 1.0}
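Such a baseline is easy to reproduce with scikit-learn's DummyClassifier, which ignores the features entirely (a minimal sketch; its accuracy should match the share of 'Remote' observations in the test set):

from sklearn.dummy import DummyClassifier

# a naive baseline that always predicts the positive class ("Remote")
dummy = DummyClassifier(strategy='constant', constant=1)
dummy.fit(X_train, y_train)
accuracy_recall(y_test, dummy.predict(X_test))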

Finally, we tune the cost and gamma parameters of the SVC that uses a radial kernel.

pipeline = Pipeline(
  steps=[
    ('preprocess', preprocessor),
    ('svc_rad', SVC(kernel='rbf'))
  ]
)

cost = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3]
gamma = [1e-4, 1e-3, 1e-2, 1e-1]

grid = GridSearchCV(
  pipeline,
  param_grid={'svc_rad__C': cost, 'svc_rad__gamma': gamma},
  scoring=['accuracy', 'recall'],
  cv=10,
  refit='accuracy'
)
grid.fit(X_train, y_train)

# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6))[1]

for i in [0, 1]:
  cv_results = grid.cv_results_['mean_test_' + metrics[i]]
  cv_results = cv_results.reshape(6, 4)

  for j in range(len(gamma)):
    axs[i].plot(cv_results[:, j], label=gamma[j])
    axs[i].legend(title='gamma')
    axs[i].set_xlabel('Cost')
  
  axs[i].set_xticks(ticks=range(len(cost)), labels=cost)

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()

$(C, \gamma) = (100, 10^{-4})$ is the best parameter combination. We use these values to train the radial SVC and add the test scores to the results dictionary.

grid.best_params_
{'svc_rad__C': 100.0, 'svc_rad__gamma': 0.0001}
pipeline.set_params(**{'svc_rad__C': 100, 'svc_rad__gamma': 1e-4})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
svc_rad_res = accuracy_recall(y_test, y_pred)
svc_rad_res
{'accuracy': 0.6639857015192135, 'recall': 0.7304347826086957}
test_results.update({'SVC_R': svc_rad_res})

Comparing Test Results

All classifiers except for the NB classifier yield better recall than accuracy. The KNN classifier has the highest recall but by far the lowest accuracy. The linear SVC, polynomial SVC of degree 1, and logistic classifier have similar accuracy scores. The NB classifier tops the accuracy ranking but is at the bottom of the recall ranking. Given all that, we can pick the radial SVC as the winner, which performs relatively well on accuracy and recall. All the same, it is only slightly better than the rest.

df = pd.DataFrame(test_results).transpose()
df.plot()
plt.show()

df['accuracy'].sort_values(ascending=False)
NBayes    0.703307
SVC_R     0.663986
SVC_L     0.649687
SVC_P     0.647006
LOGIT     0.640751
LDA       0.638070
QDA       0.618409
KNN       0.541555
Name: accuracy, dtype: float64
df['recall'].sort_values(ascending=False)
KNN       0.765217
QDA       0.739130
SVC_R     0.730435
SVC_L     0.721739
SVC_P     0.721739
LOGIT     0.686957
LDA       0.686957
NBayes    0.617391
Name: recall, dtype: float64

Footnotes

  1. For more about the dataset, see here.↩︎

  2. 1/C corresponds to the λ parameter that is usually used to weigh the regularization term in the objective function. See the scikit-learn documentation for more details.↩︎

  3. The other alternatives are L1, L2, or both combined (i.e., elastic net).↩︎

  4. The cross-validation results (i.e., accuracy) are stored in the scores_ attribute. It turns out that most CV results are pretty close to each other, indicating that L2 regularization does not improve the model performance importantly.↩︎

  5. See the scikit-learn user guide.↩︎

  6. The class_prior_, theta_, and var_ attributes of the GaussianNB class store the prior class probabilities, predictor means, and predictor variances, respectively.↩︎

  7. For a brief overview of the method, see here.↩︎

  8. For a brief overview of the support vector machines, see here.↩︎

  9. Scikit-learn recommends the LinearSVC and SGDClassifier classes for large datasets.↩︎