Classification with Scikit-learn
In the previous two posts of this series on scikit-learn, we focused on a regression problem and trained linear regression, decision trees, bagging, random forests, and boosting models. We will now concentrate on a classification problem and continue to use scikit-learn to train logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, k-nearest neighbors, and support vector classifiers. Let’s import the following packages to start.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Data
Initial inspection
We will use the stackoverflow
dataset from tidymodels, the R meta-package for machine learning and statistical analysis. I made it available on a GitHub repository as a CSV file, which we can read into a data frame.
so = pd.read_csv("https://raw.githubusercontent.com/barisguven/datasets/main/data/stackoverflow_tm.csv")
The dataset contains 5594 observations and 21 variables, two of which are categorical, and the rest are numeric. There are no missing values.
so.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5594 entries, 0 to 5593
Data columns (total 21 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Country                               5594 non-null   object
 1   Salary                                5594 non-null   float64
 2   YearsCodedJob                         5594 non-null   int64
 3   OpenSource                            5594 non-null   int64
 4   Hobby                                 5594 non-null   int64
 5   CompanySizeNumber                     5594 non-null   int64
 6   Remote                                5594 non-null   object
 7   CareerSatisfaction                    5594 non-null   int64
 8   Data_scientist                        5594 non-null   int64
 9   Database_administrator                5594 non-null   int64
 10  Desktop_applications_developer        5594 non-null   int64
 11  Developer_with_stats_math_background  5594 non-null   int64
 12  DevOps                                5594 non-null   int64
 13  Embedded_developer                    5594 non-null   int64
 14  Graphic_designer                      5594 non-null   int64
 15  Graphics_programming                  5594 non-null   int64
 16  Machine_learning_specialist           5594 non-null   int64
 17  Mobile_developer                      5594 non-null   int64
 18  Quality_assurance_engineer            5594 non-null   int64
 19  Systems_administrator                 5594 non-null   int64
 20  Web_developer                         5594 non-null   int64
dtypes: float64(1), int64(18), object(2)
memory usage: 917.9+ KB
The variable Remote
indicates the work status of the developers who participated in the StackOverflow survey. It only takes on ‘Not remote’ or ‘Remote’, which represent classes. It is our target variable, and we want to find the best classification model to predict the work status based on the available features. As such, we have a binomial or two-class classification problem. Only 10% of developers reported ‘Remote’ work status (the minority class), implying a class imbalance in our sample.
so.value_counts('Remote', normalize=True)
Remote
Not remote 0.897211
Remote 0.102789
Name: proportion, dtype: float64
The developers were from five countries, and half were from the United States.
so.value_counts('Country', normalize=True)
Country
United States 0.498927
United Kingdom 0.182875
Germany 0.135324
India 0.096174
Canada 0.086700
Name: proportion, dtype: float64
While CompanySizeNumber
is a numeric feature, it only takes eight distinct values, and we can one-hot encode it to improve the model performance.1
so.loc[:, 'CompanySizeNumber'].nunique()
8
Encoding the class labels
While scikit-learn estimators can work with string target labels, they can sometimes cause errors. We can use the LabelEncoder
utility class from the preprocessing
module to normalize target labels such that they contain only values 0, 1, etc. In our case, this means that “Not remote” will be replaced by zero and “Remote” by 1.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(so.loc[:, 'Remote'])
np.unique(y), y.shape
(array([0, 1]), (5594,))
We can view the class labels of the encoder by accessing its classes_
attribute. To confirm that 0 represents “Not remote” and 1 “Remote”, we can use the inverse_transform()
method of the encoder.
le.classes_
array(['Not remote', 'Remote'], dtype=object)
le.inverse_transform([1, 0])
array(['Remote', 'Not remote'], dtype=object)
Initial split and down-sampling
The presence of a class imbalance in our sample will make our classification models predict the majority class most of the time. As such, our models will not be any better than a naive classifier that always predicts the majority class. One way to address the class imbalance problem is to down-sample the majority class, i.e., reduce the frequency of the majority class to make it equal to the frequency of the minority class. Unlike tidymodels, scikit-learn has no function or class that can do this for us in a single step. However, we can use the resample function from the utils module of scikit-learn to perform down-sampling. This function randomly selects a desired number of observations from each array it takes as input. Down-sampling must be done after the data is split into train and test sets to avoid data leakage. Below, we perform an initial split and down-sample the majority class in the train set to balance the frequencies of the two classes. We pass stratify=y to the train_test_split() function to preserve the class imbalance in the initial train/test split. We then combine the minority class with the down-sampled majority class.
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
# initial train/test split
X_train, X_test, y_train, y_test = train_test_split(
    so.drop(columns='Remote'),
    y,
    stratify=y,
    test_size=0.2,
    random_state=0
)

# class filters
major_class = (y_train == 0)
minor_class = (y_train == 1)

# down-sample the majority class in the train set
X_train_down, y_train_down = resample(
    X_train.loc[major_class],
    y_train[major_class],
    replace=False,
    n_samples=np.sum(minor_class),
    random_state=1
)

# combine down-sampled majority class with minority class
X_train = pd.concat([X_train_down, X_train.loc[minor_class]])
y_train = np.concatenate([y_train_down, y_train[minor_class]])

[data.shape for data in [X_train_down, y_train_down, X_train, y_train]]
[(460, 20), (460,), (920, 20), (920,)]
Feature engineering
Scikit-learn estimators require input data to be numeric, so we need to convert Country
into numeric type, which we do by one-hot encoding it. Given the previous observation, we also one-hot encode CompanySizeNumber
and exclude it from the second transformer that selects the numeric features. Some classification models need features to be scaled, so we include StandardScaler
in the second step of the pipeline to build our preprocessor.
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline, make_pipeline
ct = ColumnTransformer(
    transformers=[
        ('encode', OneHotEncoder(handle_unknown='ignore'), ['Country', 'CompanySizeNumber']),
        ('keep', 'passthrough', make_column_selector(pattern='^(?!CompanySizeNumber)', dtype_include=np.number))
    ]
)

preprocessor = Pipeline([('encode', ct), ('scale', StandardScaler())])
Logistic Regression
The LogisticRegression estimator from the linear_model module of scikit-learn can fit a logistic regression model. The logistic regression classifier uses the standard logistic function, the sigmoid function, to model the conditional probability that observation $x_i$ belongs to the positive class:

$$P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-x_i^\top \beta}}$$
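As a quick, minimal sketch (not part of the modeling pipeline below), the sigmoid squashes any real-valued score into the (0, 1) interval, which is what lets its output be read as a probability:

# the sigmoid maps any real-valued score into (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

sigmoid(np.array([-2.0, 0.0, 2.0]))  # array([0.11920292, 0.5, 0.88079708])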
The LogisticRegression estimator uses negative log loss as the objective function and does L2 regularization by default. This means that it minimizes the following loss function:

$$\min_{\beta} \; \frac{1}{2}\lVert \beta \rVert_2^2 + C \sum_{i=1}^{n} \log\!\left(1 + e^{-\tilde{y}_i x_i^\top \beta}\right),$$

where $\tilde{y}_i \in \{-1, 1\}$ is the class label and $C$ is the inverse of the regularization strength.2
Let’s first train a logistic regression model with no regularization and obtain its test error. The penalty parameter of the LogisticRegression class controls the type of regularization and can be set to None to remove the regularization term from the loss function.3 Below, we combine the preprocessor we previously created with the logistic classifier using the make_pipeline() function. We then fit the classifier to the training data and access the classes_ attribute of the classifier to view the classes. The second class is our positive class (“Remote”).
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(penalty=None)
pipeline = make_pipeline(preprocessor, logit)

pipeline.fit(X_train, y_train)
pipeline[1].classes_
array([0, 1])
The predict() method of the classifiers computes the class probabilities and uses the ‘probability that a given observation belongs to the positive class > 0.5’ rule to assign class labels to the test observations. Below, we report the test accuracy of the classifier by computing the fraction of predicted classes that match the actual classes.
y_pred = pipeline.predict(X_test)
np.mean(y_test == y_pred)
np.float64(0.6398570151921358)
Let’s confirm the previous claim. We can use the predict_proba()
method of the classifier models to compute the class probabilities. The second column of y_prob
stores the probability that a test observation is from the second class ‘Remote’.
y_prob = pipeline.predict_proba(X_test)
y_prob[:5]
array([[0.04759533, 0.95240467],
[0.28911888, 0.71088112],
[0.57955305, 0.42044695],
[0.22462084, 0.77537916],
[0.57075499, 0.42924501]])
We now use the ‘probability that a given observation belongs to the positive class > 0.5’ rule to assign observations with a probability greater than 0.5 to the Remote
class and compute accuracy, which is the same as the previous one.
y_pred = np.array([0] * len(y_test))
y_pred[y_prob[:, 1] > 0.5] = 1
np.mean(y_test == y_pred)
np.float64(0.6398570151921358)
Finding the optimum C through cross-validation
LogisticRegression uses regularization to prevent overfitting. The inverse regularization strength, $C$, can be tuned through cross-validation using the LogisticRegressionCV class. This class has a Cs parameter, which can be set to an integer. The class then chooses a grid of $C$ values on a logarithmic scale. When its fit() method is called, it conducts a grid search using stratified k-fold cross-validation, which can be controlled through its cv parameter. Once the optimum value of $C$ is found, the classifier is refit with it on the entire training set.
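For reference, when Cs is an integer, the scikit-learn documentation describes the grid as chosen on a logarithmic scale between 1e-4 and 1e4, i.e., roughly the following (a sketch for illustration):

np.logspace(-4, 4, num=10)  # candidate values of C, from 1e-4 to 1e4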
from sklearn.linear_model import LogisticRegressionCV
logit = LogisticRegressionCV(Cs=10, cv=10, random_state=2)
pipeline = make_pipeline(preprocessor, logit)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
np.mean(y_pred == y_test)
np.float64(0.6407506702412868)
The optimum $C$ is stored in the C_ attribute of the LogisticRegressionCV class.4
pipeline[1].C_
array([0.35938137])
To further assess the model’s predictive performance, we can compute the confusion matrix and the related statistics such as accuracy, true positive rate (recall or sensitivity), positive predictive value (precision), and F1 score.
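As a reminder, with $TP$, $FP$, $TN$, and $FN$ denoting the counts of true positives, false positives, true negatives, and false negatives, these statistics are defined as

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{recall} = \frac{TP}{TP + FN},$$

$$\text{precision} = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2TP}{2TP + FP + FN}.$$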
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
ax = plt.subplots(figsize=(5, 4))[1]
ax.grid(visible=False)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=['Not remote', 'Remote'])
disp.plot(ax=ax)
plt.show()
print('Accuracy:', (79 + 638) / (79 + 638 + 36 + 366))
print('Recall/sensitivity:', 79 / (79 + 36))
print('Precision:', 79 / (79 + 366))
print('F1 score:', 2 * 79 / (2 * 79 + 366 + 36))
Accuracy: 0.6407506702412868
Recall/sensitivity: 0.6869565217391305
Precision: 0.17752808988764046
F1 score: 0.28214285714285714
Alternatively, we can use the scoring functions from the metrics
module to compute the same statistics.
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
(
    accuracy_score(y_test, y_pred),
    recall_score(y_test, y_pred),
    precision_score(y_test, y_pred),
    f1_score(y_test, y_pred)
)
(0.6407506702412868,
0.6869565217391305,
0.17752808988764046,
0.28214285714285714)
Since we will use both accuracy and recall scores to assess the performance of the classifiers we train in this post, we write a utility function that computes and returns these scores. We also create a dictionary to store the test scores of the classifiers.
def accuracy_recall(y_test, y_pred):
    scores = {
        'accuracy': accuracy_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred)
    }
    return scores

test_results = {}
logit_res = accuracy_recall(y_test, y_pred)
test_results.update({'LOGIT': logit_res})
Finally, let’s plot the ROC curve. The roc_curve() function takes the true class labels and the positive class probabilities and returns the false positive rates, true positive rates, and the thresholds used in the decision rule to produce them. The RocCurveDisplay class takes the first two to create the ROC curve plot. The auc() function takes them as well and returns the area under the ROC curve, which can be passed to RocCurveDisplay to display it at the bottom right of the plot.
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
fpr, tpr, thresholds = roc_curve(y_test, y_prob[:, 1])
disp = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc(fpr, tpr))

disp.plot()
plt.show()
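As an aside, the arrays returned by roc_curve() can also be used to choose an operating threshold other than 0.5. A minimal sketch, reusing fpr, tpr, and thresholds from above and Youden’s J statistic (one common heuristic, not something this post’s analysis depends on):

# Youden's J = TPR - FPR; the threshold maximizing it balances
# sensitivity against the false positive rate
j = tpr - fpr
thresholds[np.argmax(j)]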
Generative Models
Unlike the logistic classifier, which models the conditional probability that an observation comes from a given class, generative models directly model the distribution of predictors separately for each response class.
Linear Discriminant Analysis
Linear discriminant analysis (LDA) assumes that the predictors are drawn from a multivariate Gaussian (i.e., normal) distribution with a class-specific mean vector and a common covariance matrix. It estimates the per-class means of the predictors and their common covariance matrix from the training data (along with the prior class probabilities) to compute the posterior probability that an observation comes from a given class. The resulting discriminant function, which defines the decision boundary, is linear in the predictors.
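Concretely, with prior probability $\pi_k$, class mean vector $\mu_k$, and common covariance matrix $\Sigma$, the LDA discriminant score for class $k$ takes the standard form

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k,$$

and an observation $x$ is assigned to the class with the largest score.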
We can use the LinearDiscriminantAnalysis class from the discriminant_analysis module to train an LDA model. We turn on its store_covariance option to store the covariance matrix.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(store_covariance=True)
pipeline = make_pipeline(ct, lda)

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
lda_res = accuracy_recall(y_test, y_pred)
lda_res
{'accuracy': 0.6380697050938338, 'recall': 0.6869565217391305}
The LDA model has a similar accuracy and recall (true positive rate) to the logistic classifier. You can use the priors_, means_, and covariance_ attributes of lda to access the prior class probabilities, class-specific means, and common covariance matrix, respectively. We now add the test results to our results dictionary.
test_results.update({'LDA': lda_res})
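Before moving on, here is a quick sketch of those attributes (a scikit-learn Pipeline fits its steps in place, so the lda object holds the fitted estimates):

lda.priors_            # prior class probabilities; about 0.5 each after down-sampling
lda.means_.shape       # one mean vector per class
lda.covariance_.shape  # common covariance matrix (stored via store_covariance=True)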
Quadratic Discriminant Analysis
Like LDA, quadratic discriminant analysis (QDA) also assumes a class-specific mean vector. But unlike LDA, it assumes a class-specific covariance matrix for predictors. The second assumption makes the discriminant function quadratic in predictors and generates a non-linear decision boundary.
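With class-specific covariance matrices $\Sigma_k$, the standard QDA discriminant score becomes

$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log \lvert \Sigma_k \rvert + \log \pi_k,$$

which is quadratic in $x$.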
We can use the QuadraticDiscriminantAnalysis class from the discriminant_analysis module to train a QDA model. Its reg_param parameter, which defaults to 0, regularizes the per-class covariance estimates. Increasing it helps reduce the collinearity in the covariance matrices. Below, we try 101 values from the range [0, 1] and compute accuracy and recall for each parameter value.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
pipeline = Pipeline([('preprocess', ct), ('qda', QDA())])

reg_params = np.linspace(0, 1, num=101)

grid = GridSearchCV(
    pipeline,
    param_grid={'qda__reg_param': reg_params},
    scoring=['accuracy', 'recall'],
    cv=10,
    refit=False
)
grid.fit(X_train, y_train)

results = {
    'regularization parameter': reg_params,
    'accuracy': grid.cv_results_['mean_test_accuracy'],
    'recall': grid.cv_results_['mean_test_recall']
}
results = pd.DataFrame(results)
results.plot(x='regularization parameter')
plt.show()
Note that the default value of the reg_param
parameter (0) yields the lowest accuracy and highest recall score. We can use the parameter value that yields the highest accuracy to train the model, which is 0.26.
reg_params[np.argmax(results['accuracy'])]
np.float64(0.26)
The QDA classifier has a higher recall and a lower accuracy than the LDA classifier.
qda = QDA(reg_param=reg_params[np.argmax(results['accuracy'])])
pipeline = make_pipeline(ct, qda)

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
qda_res = accuracy_recall(y_test, y_pred)
qda_res
{'accuracy': 0.6184092940125112, 'recall': 0.7391304347826086}
test_results.update({'QDA': qda_res})
Naive Bayes
The naive Bayes (NB) classifier assumes that predictors are independent within each class. We can use the GaussianNB
class to train a NB classifier. This class assumes additionally that predictors are drawn from the normal distribution and uses maximum likelihood to estimate the class-specific mean and variance of each predictor.5
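That is, within class $k$, each predictor $x_j$ is modeled with its own Gaussian density

$$P(x_j \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\!\left(-\frac{(x_j - \theta_{jk})^2}{2\sigma_{jk}^2}\right),$$

where $\theta_{jk}$ and $\sigma_{jk}^2$ are the class-specific mean and variance of predictor $j$ (notation mirroring the theta_ and var_ attribute names), and the independence assumption makes the joint class-conditional density the product of these univariate densities.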
from sklearn.naive_bayes import GaussianNB
pipeline = make_pipeline(ct, GaussianNB())

y_pred = pipeline.fit(X_train, y_train).predict(X_test)
nb_res = accuracy_recall(y_test, y_pred)
nb_res
{'accuracy': 0.7033065236818588, 'recall': 0.6173913043478261}
The NB classifier has a significantly higher accuracy and a significantly lower recall than the LDA classifier.6 We now add the test scores to the results dictionary.
test_results.update({'NBayes': nb_res})
K-Nearest Neighbors
Unlike the previous methods we have looked at, K-nearest neighbors (KNN) is a non-parametric method. For a given test observation $x_0$, KNN identifies the $K$ training observations closest to $x_0$ and assigns $x_0$ to the class that is most frequent among those neighbors. We can use KNeighborsClassifier to train a KNN classifier.
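Formally, letting $\mathcal{N}_0$ denote the set of the $K$ training observations nearest to $x_0$, KNN estimates the conditional class probabilities as

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j),$$

and classifies $x_0$ to the class with the largest estimated probability.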
The number of nearest neighbors and the distance power can be tuned via cross-validation. Below, we do 10-fold cross-validation and evaluate accuracy and recall on a grid of seventy-five values for the number of neighbors and six values for the distance power. We pass refit='accuracy' to GridSearchCV so that, once the grid search is done, it trains the classifier using the parameter combination that maximizes accuracy.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([('preprocess', preprocessor), ('knn', KNeighborsClassifier())])

# grid parameters
neighbors = np.arange(1, 301, 4)
dist_power = [1, 2, 3, 4, 5, 6]

metrics = ['accuracy', 'recall']

grid = GridSearchCV(
    pipeline,
    param_grid={
        'knn__p': dist_power,
        'knn__n_neighbors': neighbors
    },
    scoring=metrics,
    cv=10,
    n_jobs=5,
    refit='accuracy'
)
grid.fit(X_train, y_train)
# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6), dpi=200)[1]

for i in [0, 1]:
    cv_results = grid.cv_results_['mean_test_' + metrics[i]]
    cv_results = cv_results.reshape(len(neighbors), len(dist_power))
    for p in dist_power:
        axs[i].plot(neighbors, cv_results[:, p-1], label=p)
    axs[i].legend(title='dist_power', loc='lower right')
    axs[i].set_xlabel('# of neighbors')

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()
The best parameter combination is stored in the best_params_ attribute of the fitted grid object: 141 neighbors with a distance power of 1.
grid.best_params_
{'knn__n_neighbors': np.int64(141), 'knn__p': 1}
Finally, we train the classifier with the optimum values, obtain the test scores, and add them to the results dictionary.
pipeline.set_params(**grid.best_params_)
y_pred = pipeline.fit(X_train, y_train).predict(X_test)

knn_res = accuracy_recall(y_test, y_pred)
knn_res
{'accuracy': 0.5415549597855228, 'recall': 0.7652173913043478}
test_results.update({'KNN': knn_res})
Support Vector Machines
Support vector machines (SVM) find a separating hyperplane to classify observations. It has been shown that the hyperplane can be expressed as a weighted sum of kernels, where a kernel is a function of the inner products of observations that measures the similarity between them. The kernel function can be linear in the inner products, while polynomial and radial kernels are non-linear in them.8
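In scikit-learn’s parametrization, the three kernels used below are

$$K(x, x') = \langle x, x' \rangle \ \text{(linear)}, \qquad K(x, x') = (\gamma \langle x, x' \rangle + r)^d \ \text{(polynomial)},$$

$$K(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^2\right) \ \text{(radial)},$$

where $d$ is the degree parameter, $r$ is the coef0 parameter, and $\gamma$ corresponds to gamma.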
We can use the SVC class from the svm module of scikit-learn to train support vector classifiers.9 The C parameter controls the cost of observations violating the margin. Its kernel option can be set to 'linear', 'poly', or 'rbf' to specify a linear, polynomial, or radial kernel, respectively. The degree of the polynomial kernel is controlled by the degree parameter. For the radial kernel, the gamma parameter controls the $\gamma$ constant in the kernel function and hence how far the influence of a single training observation reaches.
We start with tuning the cost parameter of the support vector classifier (SVC) that uses a linear kernel.
from sklearn.svm import SVC
pipeline = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('svc_lin', SVC(kernel='linear'))
    ]
)

cost = [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]

grid = GridSearchCV(
    pipeline,
    param_grid={'svc_lin__C': cost},
    scoring=['accuracy', 'recall'],
    cv=10,
    refit='accuracy'
)
grid.fit(X_train, y_train)

results = {
    'accuracy': grid.cv_results_['mean_test_accuracy'],
    'recall': grid.cv_results_['mean_test_recall']
}
results = pd.DataFrame(results)

ax = plt.subplots()[1]
results.plot(ax=ax)
ax.set_xticks(ticks=range(len(cost)), labels=cost)
ax.set_xlabel('Cost')
plt.show()
The highest accuracy is obtained with a cost of 1.
grid.best_params_
{'svc_lin__C': 1}
We now train the classifier with the optimum cost value to obtain the test scores and add them to the results dictionary.
pipeline.set_params(**grid.best_params_)
y_pred = pipeline.fit(X_train, y_train).predict(X_test)

svc_lin_res = accuracy_recall(y_test, y_pred)
svc_lin_res
{'accuracy': 0.6496872207327972, 'recall': 0.7217391304347827}
test_results.update({'SVC_L': svc_lin_res})
Next, we tune the cost and degree parameters of the SVC that uses a polynomial kernel.
pipeline = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('svc_poly', SVC(kernel='poly'))
    ]
)

cost = [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2]
degree = [1, 2, 3, 4]

grid = GridSearchCV(
    pipeline,
    param_grid={'svc_poly__C': cost, 'svc_poly__degree': degree},
    scoring=['accuracy', 'recall'],
    cv=10,
    refit='accuracy'
)
grid.fit(X_train, y_train)

# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6))[1]

for i in [0, 1]:
    cv_results = grid.cv_results_['mean_test_' + metrics[i]]
    cv_results = cv_results.reshape(len(cost), len(degree))
    for j in range(len(degree)):
        axs[i].plot(cv_results[:, j], label=degree[j])
    axs[i].legend(title='degree', loc='lower right')
    axs[i].set_xlabel('Cost')
    axs[i].set_xticks(ticks=range(len(cost)), labels=cost)

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()
grid.best_params_
{'svc_poly__C': 10.0, 'svc_poly__degree': 1}
pipeline.set_params(**{'svc_poly__C': 10, 'svc_poly__degree': 1})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
svc_poly_res = accuracy_recall(y_test, y_pred)
svc_poly_res
{'accuracy': 0.6470062555853441, 'recall': 0.7217391304347827}
test_results.update({'SVC_P': svc_poly_res})
Note that as far as recall is concerned, the combination of a degree of 4 and a cost of 0.001 is the best, and the classifier trained with these values yields a recall of 1 and an accuracy of 10.5%. But this classifier is useless as it almost always predicts “Remote”. A naive classifier can always predict “Remote” without requiring any computation.
pipeline.set_params(**{'svc_poly__C': 1e-3, 'svc_poly__degree': 4})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
accuracy_recall(y_test, y_pred)
{'accuracy': 0.10545129579982127, 'recall': 1.0}
Finally, we tune the cost and gamma parameters of the SVC that uses a radial kernel.
pipeline = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('svc_rad', SVC(kernel='rbf'))
    ]
)

cost = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3]
gamma = [1e-4, 1e-3, 1e-2, 1e-1]

grid = GridSearchCV(
    pipeline,
    param_grid={'svc_rad__C': cost, 'svc_rad__gamma': gamma},
    scoring=['accuracy', 'recall'],
    cv=10,
    refit='accuracy'
)
grid.fit(X_train, y_train)

# plot CV results
axs = plt.subplots(1, 2, figsize=(10, 6))[1]

for i in [0, 1]:
    cv_results = grid.cv_results_['mean_test_' + metrics[i]]
    cv_results = cv_results.reshape(len(cost), len(gamma))
    for j in range(len(gamma)):
        axs[i].plot(cv_results[:, j], label=gamma[j])
    axs[i].legend(title='gamma')
    axs[i].set_xlabel('Cost')
    axs[i].set_xticks(ticks=range(len(cost)), labels=cost)

axs[0].set_title('Accuracy')
axs[1].set_title('Recall')
plt.show()
grid.best_params_
{'svc_rad__C': 100.0, 'svc_rad__gamma': 0.0001}
pipeline.set_params(**{'svc_rad__C': 100, 'svc_rad__gamma': 1e-4})
y_pred = pipeline.fit(X_train, y_train).predict(X_test)
svc_rad_res = accuracy_recall(y_test, y_pred)
svc_rad_res
{'accuracy': 0.6639857015192135, 'recall': 0.7304347826086957}
test_results.update({'SVC_R': svc_rad_res})
Comparing Test Results
All classifiers except for the NB classifier yield better recall than accuracy. The KNN classifier has the highest recall but by far the lowest accuracy. The linear SVC, the polynomial SVC of degree 1, and the logistic classifier have similar accuracy scores. The NB classifier tops the accuracy ranking but sits at the bottom of the recall ranking. Given all that, we can pick the radial SVC as the winner, as it performs relatively well on both accuracy and recall. All the same, it is only slightly better than the rest.
df = pd.DataFrame(test_results).transpose()

df.plot()
plt.show()
df['accuracy'].sort_values(ascending=False)
NBayes 0.703307
SVC_R 0.663986
SVC_L 0.649687
SVC_P 0.647006
LOGIT 0.640751
LDA 0.638070
QDA 0.618409
KNN 0.541555
Name: accuracy, dtype: float64
df['recall'].sort_values(ascending=False)
KNN 0.765217
QDA 0.739130
SVC_R 0.730435
SVC_L 0.721739
SVC_P 0.721739
LOGIT 0.686957
LDA 0.686957
NBayes 0.617391
Name: recall, dtype: float64
Footnotes
2. The inverse of $C$ corresponds to the parameter that is usually used to weigh the regularization term in the objective function. See the scikit-learn documentation for more details.↩︎
3. The other alternatives are L1, L2, or both.↩︎
4. The cross-validation results (i.e., accuracy) are stored in the scores_ attribute. It turns out that most CV results are pretty close to each other, indicating that L2 regularization does not meaningfully improve the model performance.↩︎
5. See the scikit-learn user guide.↩︎
6. The class_prior_, theta_, and var_ attributes of the GaussianNB class store the prior class probabilities, predictor means, and predictor variances, respectively.↩︎
8. For a brief overview of support vector machines, see here.↩︎
9. Scikit-learn recommends the LinearSVC and SGDClassifier classes for large datasets.↩︎