Machine Learning with tidymodels – Part 4

Classification problems

machine learning
classification
R
Published: Feb 22, 2025

In the previous three posts, we worked on a dataset where the response variable was quantitative/continuous (the sale price of a house). Oftentimes, the response variable is qualitative/categorical, taking on only a limited number of unique values. For instance, it can be either “Yes” or “No” to represent whether a patient is diagnosed with a certain medical condition, or it can be one of “Very Bad”, “Bad”, “Okay”, “Good”, or “Excellent” to represent the customer rating of a service. When we want to predict a response variable whose unique values represent distinct classes, we engage in classification. When the response variable takes on only two values, and hence is binary/binomial, we have a two-class problem; when it is multinomial, we have a multi-class problem. The statistical methods used for classification problems are sometimes called classifiers. In this blog post, we work on a two-class problem and consider the following classifiers: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, k-nearest neighbors, the support vector classifier, and support vector machines.

Data

We will use the stackoverflow dataset included in the modeldata package of tidymodels.

library(tidymodels)
tidymodels_prefer()

data(stackoverflow, package = "modeldata")

The dataset contains 5,594 observations, each representing a unique respondent to the annual StackOverflow developer survey. The data is complete, with no missing values. The variable Remote denotes whether a respondent worked remotely during the survey year. Using the remaining variables as predictors, we want to predict respondents’ remote status.

YearsCodedJob measures how many years of professional coding experience each respondent has. The columns from Data_scientist to Web_developer are binary columns that identify respondents’ job positions. Since some individuals reported multiple jobs, these columns are not mutually exclusive.

glimpse(stackoverflow)
#> Rows: 5,594
#> Columns: 21
#> $ Country                              <fct> United Kingdom, United States, Un…
#> $ Salary                               <dbl> 100000.000, 130000.000, 175000.00…
#> $ YearsCodedJob                        <int> 20, 20, 16, 4, 1, 1, 13, 4, 7, 17…
#> $ OpenSource                           <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
#> $ Hobby                                <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, …
#> $ CompanySizeNumber                    <dbl> 5000, 1000, 10000, 1000, 5000, 20…
#> $ Remote                               <fct> Remote, Remote, Not remote, Not r…
#> $ CareerSatisfaction                   <int> 8, 9, 7, 9, 5, 8, 7, 7, 8, 9, 10,…
#> $ Data_scientist                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Database_administrator               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Desktop_applications_developer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
#> $ Developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ DevOps                               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Embedded_developer                   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphic_designer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphics_programming                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Machine_learning_specialist          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Mobile_developer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ Quality_assurance_engineer           <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Systems_administrator                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Web_developer                        <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, …

The respondents were from five different countries, and almost half of them were from the United States.

100 * prop.table(table(stackoverflow$Country))
#> 
#>         Canada        Germany          India United Kingdom  United States 
#>       8.670004      13.532356       9.617447      18.287451      49.892742

The share of developers who worked remotely was highest in the US and lowest in Germany.

100 * with(stackoverflow, prop.table(table(Country, Remote), margin = 1))
#>                 Remote
#> Country             Remote Not remote
#>   Canada          5.773196  94.226804
#>   Germany         5.284016  94.715984
#>   India          10.408922  89.591078
#>   United Kingdom  6.842620  93.157380
#>   United States  13.651021  86.348979

Those who worked remotely earned higher salaries and were more experienced than those who worked in an office. Both types of developers reported a similar level of career satisfaction.

stackoverflow |>
  select(Remote, Salary, YearsCodedJob, CareerSatisfaction) |>
  pivot_longer(-Remote, names_to = "var", values_to = "value") |>
  mutate(var = factor(var, levels = c("Salary", "YearsCodedJob", "CareerSatisfaction"))) |>
  ggplot(aes(x = var, y = value, color = Remote)) +
  geom_boxplot() +
  labs(x = NULL, y = NULL) +
  facet_wrap(vars(var), scales = "free")

Even though CompanySizeNumber is a numeric column, it has only eight unique values representing different company sizes. It was relatively more common for developers to work for companies with 20, 100, or 10,000 employees. Given the sparsity in the distribution of company size, we may gain from converting it into a factor variable so that dummy variables can represent the different company sizes. We will delay this until the last section, where we define recipes and tune the model hyperparameters.

stackoverflow |>
  mutate(CompanySizeNumber = as.factor(CompanySizeNumber)) |>
  ggplot(aes(x = CompanySizeNumber)) +
  geom_bar()

It is crucial to know the proportion of each class in the data when tackling classification problems. We see that almost 90% of the respondents reported “Not remote” work status. This is a case of class imbalance. The “Not remote” class, which dominates the data, is called the majority class. In such cases, most machine learning methods will fail to perform well for the simple reason that the overrepresentation of the majority class pushes them to predict the majority class most of the time. As such, they will not be any better than a naive classifier that predicts “Not remote” for every respondent! A naive classifier would achieve about 90% accuracy overall but zero accuracy on the minority class.

prop.table(table(stackoverflow$Remote))
#> 
#>     Remote Not remote 
#>  0.1027887  0.8972113

A relatively simple way to deal with class imbalance is to randomly remove enough of the majority class observations to make the number of observations from both classes equal in the training data. We will see below how to integrate this procedure into the tidymodels framework. For now, we will use the downSample() function from the caret package to eliminate the class imbalance in the training data.

set.seed(321)
so_split = initial_split(stackoverflow, prop = 0.8, strata = "Remote")
so_train = training(so_split)
so_test = testing(so_split)

# downsampling to address class imbalance
set.seed(322)
so_train_down = caret::downSample(
  x = select(so_train, - Remote),
  y = pull(so_train, Remote), 
  yname = "Remote"
)

set.seed(323)
so_folds = vfold_cv(so_train, v = 10, strata = "Remote")

Logistic Regression

Model

For binary classification problems, the response variable can be assumed to take the value of 1 when an observation belongs to a class of interest and 0 otherwise:

$$Y_i = \begin{cases} 1 & \text{if obs. } i \text{ belongs to Class 1} \\ 0 & \text{if obs. } i \text{ belongs to Class 2} \end{cases}$$

Logistic regression models the relationship between the probability that the response variable equals 1 and the predictors using the logistic or sigmoid function:

$$\Pr(Y = 1 | X) = p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

where $X_1, \ldots, X_p$ represent the predictors and $Y$ represents the response variable. It can easily be shown that

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

The left-hand side of the above equation is called the log-odds. Given that $p(X)$ is the probability that an observation belongs to class 1, $1 - p(X)$ is the probability that it belongs to the other class, class 2. The odds therefore measure how many times more likely it is for an observation to belong to class 1 than to class 2. Even though the logistic function posits a nonlinear relationship between the response variable and the predictors, the log-odds is linear in the predictors.

When there are $K > 2$ classes, the relationship between the response variable and the predictors can still be modeled using the sigmoid function by choosing an arbitrary class, say class $K$, as the baseline class:

$$\Pr(Y = k | X) = \frac{e^{\beta_{k0} + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} X_1 + \cdots + \beta_{lp} X_p}}$$

for $k = 1, \ldots, K - 1$, and

$$\Pr(Y = K | X) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} X_1 + \cdots + \beta_{lp} X_p}}$$

It follows that the log-odds is given by

$$\log\left(\frac{\Pr(Y = k | X)}{\Pr(Y = K | X)}\right) = \beta_{k0} + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p$$

Similar to the two-class case, the odds measure how many times more likely an observation is to belong to class $k$ than to class $K$. In both cases, the coefficients are estimated using the maximum likelihood method.

Below, we use the glm() function in base R and the downsampled stackoverflow training data to fit a binary logistic regression model. We then predict on the test set and classify an observation as “Not remote” whenever its predicted probability of “Not remote” exceeds 0.5.

When the response variable is a factor variable with two levels, the glm() function treats the first factor level as failure and the second as success. This means that “Remote” status will be denoted as failure and “Not remote” as success. Unlike glm(), tidymodels functions treat the first level as success and the second level as failure. Since we are going to rely heavily on tidymodels, we will keep the order of levels for the Remote column as is.
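As a quick check of the level ordering that drives this coding (a small aside; it only prints the factor levels):

# "Remote" is the first level and "Not remote" the second, so glm() models
# Pr(Not remote), while yardstick will treat "Remote" as the event by default.
levels(stackoverflow$Remote)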

glm_fit = glm(
  Remote ~ .,
  family = binomial,
  data = so_train_down
)

probs = predict(glm_fit, newdata = so_test, type = "response")
preds = rep("Remote", nrow(so_test)) # test set size is 1119
preds[probs > .5] = "Not remote"

We see that it is more likely for higher-salary earners, more experienced developers, those working on open-source projects, and those working for smaller companies to work remotely. Developers in the US, UK, and India are more likely to work remotely than those in Canada (the base country).

tidy(glm_fit) |>
  filter(term != "(Intercept)") |>
  head(10)
#> # A tibble: 10 × 5
#>    term                    estimate  std.error statistic       p.value
#>    <chr>                      <dbl>      <dbl>     <dbl>         <dbl>
#>  1 CountryGermany        -0.595     0.378        -1.57   0.115        
#>  2 CountryIndia          -2.40      0.409        -5.86   0.00000000458
#>  3 CountryUnited Kingdom -0.733     0.344        -2.13   0.0332       
#>  4 CountryUnited States  -1.05      0.328        -3.19   0.00140      
#>  5 Salary                -0.0000104 0.00000321   -3.25   0.00115      
#>  6 YearsCodedJob         -0.0651    0.0146       -4.47   0.00000779   
#>  7 OpenSource            -0.341     0.157        -2.18   0.0295       
#>  8 Hobby                  0.0138    0.177         0.0779 0.938        
#>  9 CompanySizeNumber      0.0000392 0.0000206     1.90   0.0571       
#> 10 CareerSatisfaction    -0.0568    0.0459       -1.24   0.216
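Because the estimates are on the log-odds scale, exponentiating them turns them into odds ratios, which are often easier to interpret. A quick aside using broom’s exponentiate argument:

# Odds ratios: exp(coefficients). Values below 1 mean lower odds of the modeled
# event ("Not remote"), i.e., higher odds of working remotely, all else equal.
tidy(glm_fit, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  head(10)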

Confusion Matrix

There are a variety of statistics associated with classification problems. A useful construct is the cross-tabulation of observed/actual and predicted classes, called the confusion matrix. Below, we use the confusionMatrix() function from the caret package to obtain the confusion matrix and associated statistics.

caret::confusionMatrix(
  data = factor(preds, levels = c("Remote", "Not remote")),
  reference = so_test$Remote,
  positive = "Remote"
)
#> Confusion Matrix and Statistics
#> 
#>             Reference
#> Prediction   Remote Not remote
#>   Remote         70        387
#>   Not remote     45        617
#>                                           
#>                Accuracy : 0.6139          
#>                  95% CI : (0.5847, 0.6426)
#>     No Information Rate : 0.8972          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.0964          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.60870         
#>             Specificity : 0.61454         
#>          Pos Pred Value : 0.15317         
#>          Neg Pred Value : 0.93202         
#>              Prevalence : 0.10277         
#>          Detection Rate : 0.06256         
#>    Detection Prevalence : 0.40840         
#>       Balanced Accuracy : 0.61162         
#>                                           
#>        'Positive' Class : Remote          
#> 

Taking the “Remote” class as our positive class (i.e., the event of interest or success), we observe the following:

  • The logistic regression model correctly classified 70 test observations as “Remote” class: 70 true positives (TP).

  • It also correctly classified 617 observations as “Not remote”: 617 true negatives (TN).

  • 387 observations are predicted as “Remote,” but they are, in fact, in the “Not remote” class: 387 false positives (FP).

  • 45 observations are predicted as “Not remote,” but they are, in fact, “Remote” class: 45 false negatives (FN).

Based on these numbers, the following common statistics are computed:

  • Accuracy: (617 + 70) / 1119 = 0.6139

  • Sensitivity (recall, true positive rate, or hit rate) = TP / P = 70 / (70 + 45) = 0.60870

  • Specificity (true negative rate) = TN / N = 617 / (617 + 387) = 0.61454

  • Positive Predictive Value (precision) = TP / (TP + FP) = 70 / (70 + 387) = 0.15317

  • Negative Predictive Value = TN / (TN + FN) = 617 / (617 + 45) = 0.93202

The no information rate, also reported by the confusionMatrix() function, is simply the proportion of the majority class. In other words, if we were to simply predict “Not remote” for all test observations, we would correctly classify 89.72% of them. Compared to this naive classifier, the logistic classifier has only a 61.39% accuracy rate! However, this is not all bad. The naive classifier would have 0% sensitivity since it would not detect any “Remote” respondents, whereas the logistic classifier has a sensitivity of 60.87%.
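The same statistics can be reproduced with yardstick’s metric functions. A minimal sketch using the predicted labels from above (glm_class_results is just a throwaway name; yardstick’s default event level is the first factor level, “Remote”, which matches positive = "Remote" above):

# Put truth and predicted class into a tibble and apply a metric set.
glm_class_results = tibble(
  truth    = so_test$Remote,
  estimate = factor(preds, levels = levels(so_test$Remote))
)

metric_set(accuracy, sensitivity, specificity, ppv, npv)(
  glm_class_results,
  truth = truth,
  estimate = estimate
)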

ROC Curve

Another useful construct associated with classification problems is the receiver operating characteristic curve or ROC curve. The ROC curve plots the true positive rate against the false positive rate at each threshold value.

The false positive rate is the ratio of false positives to (actual) negatives, FP/N. Since N = FP + TN, the false positive rate is also given by (N - TN) / N = 1 - TN / N. That is, it equals 1 minus the true negative rate or 1 - specificity. So the ROC curve plots sensitivity against 1 - specificity.
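Plugging in the numbers from the confusion matrix above makes the identity concrete:

# False positive rate from the logistic regression confusion matrix
fp = 387; tn = 617
fp / (fp + tn)      # false positive rate, about 0.3855
1 - tn / (fp + tn)  # 1 - specificity, the same value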

Suppose that we have a very strict rule for classifying test observations as the event of interest, using 0.99 as the threshold value, so the rule is “p(event) > 0.99”. Such a rule is unlikely to detect any positives, yielding a true positive rate of zero and a false positive rate of zero. So (0, 0) represents the point on the ROC curve at the 0.99 threshold. As we relax the rule and reduce the threshold, both the true positive rate and the false positive rate start increasing since we classify more and more observations as successes. For a good classifier, the increase in the true positive rate should outpace the increase in the false positive rate, so that the curve bows out toward the upper left corner as much as possible.

Once we obtain the ROC curve, we can compute the area under it. This area measure is a common performance metric used to evaluate the classifiers. Below, we combine the Remote column from the test set with the predicted probabilities and pass the combined data to the roc_curve() function in the yardstick package to generate sensitivity and specificity measures at different threshold values. We then plot the ROC curve.

roc_data = bind_cols(truth = so_test$Remote, prob_event = probs)

roc_data |>
  roc_curve(truth, prob_event, event_level = "second") |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) + 
  coord_equal()

We can compute the area under the ROC curve using the roc_auc() function, which is also from the yardstick package.

roc_data |>
  roc_auc(truth, prob_event, event_level = "second")
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.671

Finally, we specify a logistic regression model using tidymodels so that we can compare its performance against the other models we consider below.

logistic_spec = 
  logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
logistic_spec
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm

Generative Models

The logistic regression uses the logistic function to directly model the conditional probability that an observation comes from class $k$, $\Pr(Y = k | X = x)$. Some alternative approaches instead model the distribution of the predictors separately for each response class.

Let $\pi_k$ represent the prior probability that a randomly chosen observation comes from the $k$th class. Let $f_k(X) \equiv \Pr(X | Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. According to Bayes’ theorem,

$$\Pr(Y = k | X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$p_k(x) = \Pr(Y = k | X = x)$ is the posterior conditional probability that an observation belongs to the $k$th class.

When we have a random sample from the population, we can easily estimate the $\pi_k$'s by computing the proportion of each class in the training data. The generative models differ in how they model the density function of the predictors within each class.

Linear Discriminant Analysis

Linear discriminant analysis (LDA) assumes that $X = (X_1, \ldots, X_p)$ is drawn from a multivariate Gaussian (or normal) distribution with a class-specific mean vector and a common covariance matrix:

$$X \sim N(\mu_k, \Sigma)$$

where $\mu_k$ and $\Sigma$ denote the class-specific mean vector ($p \times 1$) and the common covariance matrix ($p \times p$), respectively. The multivariate Gaussian distribution assumes that each predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

The LDA classifier approximates the Bayes classifier that assigns each observation to the most likely class by assuming that $f_k(x)$ is the multivariate Gaussian density function with a mean vector $\mu_k$ and a common covariance matrix $\Sigma$. It can be shown that the posterior probability that an observation belongs to class $k$ is largest when the following discriminant function is largest:

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

So the LDA classifier classifies an observation $x$ to the class $k$ for which $\delta_k(x)$ is largest.

When there is a single predictor ($p = 1$), the covariance matrix is simply given by the variance of that predictor, $\sigma^2$. It follows that

$$\delta_k(x) = \frac{x \mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log \pi_k$$

Note that the discriminant function $\delta_k(x)$ is linear in $x$ (hence the linear discriminant analysis).

LDA estimates $\mu_1, \ldots, \mu_K$, $\Sigma$, and $\pi_1, \ldots, \pi_K$ and then evaluates the discriminant function for each training observation and each class. It then assigns each observation to the class for which the discriminant function is largest. Since each mean vector contains $p$ means for the $p$ predictors and the $p \times p$ covariance matrix has $p(p+1)/2$ unique parameters, LDA estimates $Kp + p(p+1)/2 + K$ parameters in total.
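Before moving to the parsnip specification, here is a minimal sketch of fitting LDA directly with MASS on the downsampled training data. It uses only three numeric predictors for illustration (lda_raw is just an illustrative name) and simply prints the estimated priors and class means:

# Fit LDA with a small illustrative formula (not the full predictor set).
lda_raw = MASS::lda(
  Remote ~ Salary + YearsCodedJob + CareerSatisfaction,
  data = so_train_down
)
lda_raw$prior  # estimated class priors pi_k (0.5 each after downsampling)
lda_raw$means  # class-specific mean vectors mu_k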

We now use the discrim_linear() function from the parsnip extension package discrim to specify an LDA classifier. The function uses the MASS package as its default underlying engine.

library(discrim)

lda_spec = discrim_linear() |>
  set_engine("MASS") |>
  set_mode("classification")
lda_spec
#> Linear Discriminant Model Specification (classification)
#> 
#> Computational engine: MASS

Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) makes the same assumptions as LDA except that it assumes each class has its own covariance matrix:

$$X \sim N(\mu_k, \Sigma_k)$$

This difference then leads to the following discriminant function:

$$\delta_k(x) = -\frac{1}{2} x^\top \Sigma_k^{-1} x + x^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \log |\Sigma_k| + \log \pi_k$$

When there is a single predictor ($p = 1$), the discriminant function simplifies to

$$\delta_k(x) = -\frac{x^2}{2 \sigma_k^2} + \frac{x \mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2 \sigma_k^2} - \log \sigma_k + \log \pi_k$$

So, by assuming a class-specific covariance matrix, the QDA classifier generates a nonlinear (e.g., quadratic) decision boundary, which can be superior to the LDA classifier when the true decision boundaries are nonlinear.

Since QDA assumes a class-specific covariance matrix, it estimates $Kp + K p(p+1)/2 + K$ parameters in total, many more than LDA. Therefore, QDA is a much more flexible classifier than LDA and has a larger variance. When there are relatively few training observations, LDA can be preferred over QDA to reduce the variance of the classifier. Note also that LDA is a special case of QDA, as it assumes a common covariance matrix.
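As a back-of-the-envelope illustration of the difference (the exact $p$ for our data depends on how the factor predictors end up being encoded), suppose $p = 20$ predictors and $K = 2$ classes:

# Rough parameter counts for LDA vs. QDA with p = 20 and K = 2 (illustrative).
p = 20; K = 2
K * p + p * (p + 1) / 2 + K      # LDA: 252 parameters
K * p + K * p * (p + 1) / 2 + K  # QDA: 462 parameters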

We now use the discrim_quad() function from the discrim package to specify a QDA classifier.

qda_spec = discrim_quad() |>
  set_engine("MASS") |>
  set_mode("classification")
qda_spec
#> Quadratic Discriminant Model Specification (classification)
#> 
#> Computational engine: MASS

Naive Bayes

LDA and QDA assume a multivariate Gaussian density function of $X$ for an observation from the $k$th class, $f_k(x)$. The naive Bayes (NB) classifier assumes instead that, within each class, the $p$ predictors are independent, which means that for $k = 1, \ldots, K$,

$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$

where $f_{kj}$ is the one-dimensional density function of the $j$th predictor among observations in the $k$th class. The $f_k(x)$'s are then inserted into the equation for the posterior probability.

There are a few options for estimating the one-dimensional density functions $f_{kj}$ from the training data:

  • If $X_j$ is quantitative, assume that $X_j | Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$ and estimate $\mu_{jk}$ and $\sigma_{jk}^2$, or use a non-parametric estimate of $f_{kj}$ such as a kernel density estimator;

  • If $X_j$ is qualitative, simply use the proportion of training observations taking each value of the $j$th predictor within each class.
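To make the qualitative case concrete: for a binary predictor such as OpenSource, the within-class estimate is just the class-conditional proportion, which we can compute directly on the downsampled training data (a quick illustration):

# Proportion of each OpenSource value within each Remote class.
with(so_train_down, prop.table(table(Remote, OpenSource), margin = 1))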

As Hastie et al. (2021) discuss, any classifier with a linear decision boundary is a special case of the naive Bayes classifier, so LDA is a special case of naive Bayes. Moreover, naive Bayes estimates fewer parameters than LDA and QDA and is less flexible. So when $p$ is larger or $n$ is smaller, naive Bayes can perform better than LDA and QDA by reducing the variance of the classifier.

We now use the naive_Bayes() function from parsnip to specify a naive Bayes classifier. This function uses the klaR package as its default underlying engine.

nb_spec = naive_Bayes() |>
  set_engine("klaR") |>
  set_mode("classification")
nb_spec
#> Naive Bayes Model Specification (classification)
#> 
#> Computational engine: klaR

K-Nearest Neighbors

Like the logistic regression classifier, the K-nearest neighbors (KNN) classifier estimates the conditional probability of the event of interest to classify a given observation. Unlike logistic regression, however, KNN is a non-parametric method and does not estimate any parameters. For a given test observation $x_0$ and a positive integer $K$, the KNN classifier first identifies the $K$ points in the training data closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability of class $j$ as the fraction of points in $\mathcal{N}_0$ that belong to class $j$:

$$\Pr(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$$

KNN then classifies $x_0$ to the class with the largest probability from the above equation, i.e., the class occurring most frequently among the neighbors.

There are various distance measures for identifying the $K$ points closest to a given observation. The Minkowski distance is a general distance measure between two $p$-dimensional points $x_a$ and $x_b$, given by

$$\left( \sum_{j=1}^{p} |x_{aj} - x_{bj}|^q \right)^{1/q}$$

with $q > 0$. When $q = 2$, the Minkowski distance equals the Euclidean distance, and when $q = 1$, it equals the Manhattan distance.
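A tiny numerical illustration with base R’s dist(), which implements the Minkowski distance through its p argument:

# Two points in R^2: (1, 2) and (4, 6).
pts = rbind(a = c(1, 2), b = c(4, 6))
dist(pts, method = "minkowski", p = 2)  # Euclidean distance: 5
dist(pts, method = "minkowski", p = 1)  # Manhattan distance: 7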

Given that we need to use a distance measure to quantify how close two points are, the scale of the predictors matters a lot. To prevent the predictors on a larger scale from dominating the distance measure, we must center and scale (i.e., normalize) all predictors in a preprocessing step.

Unlike many other machine learning methods, KNN requires the training data at prediction time because each test observation is compared against the training observations to find its $K$ nearest neighbors, which makes it a lazy learner. So we cannot save the model without the training data and use it later to predict on new data. Another drawback of KNN is that the cost of predicting a single observation grows with $n \times p$. Still, KNN can perform well when the decision boundary is nonlinear. Moreover, it can be used to impute missing values in preprocessing.

The number of nearest neighbors, $K$, is a natural tuning parameter. One can also choose which distance measure to use by optimizing $q$. Below, we use the nearest_neighbor() function from the parsnip package to specify a KNN model. We tag its neighbors and dist_power arguments, corresponding to $K$ and $q$, respectively, for tuning. The function uses the kknn() function from the kknn package as the underlying engine to fit KNN models.

knn_spec = 
  nearest_neighbor(neighbors = tune(), dist_power = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
knn_spec
#> K-Nearest Neighbor Model Specification (classification)
#> 
#> Main Arguments:
#>   neighbors = tune()
#>   dist_power = tune()
#> 
#> Computational engine: kknn

Support Vector Machines

Some classifiers aim to find a separating hyperplane to classify observations. These classifiers are known by different names depending on how perfect the separation is and whether it is linear or nonlinear.

Maximal Margin Classifier

Suppose we have only two features, $X_1$ and $X_2$, and a binomial response variable that takes on either the “A” or the “B” class label, as shown in the figure below. The class B observations tend to have greater $X_1$ and $X_2$ values than the class A observations. What is also interesting is that the observations from classes A and B are entirely separable: one can draw a line, or a hyperplane, such that all class A observations lie to its left and all class B observations lie to its right. In fact, there are infinitely many lines that can perfectly separate the observations from classes A and B.

library(latex2exp)

set.seed(123)
svm_data = matrix(rnorm(50 * 2), ncol = 2)
class = rep(c("A", "B"), each = 25)

svm_data[class == "B", ] = svm_data[class == "B", ] + 4
svm_data = data.frame(svm_data, class)

ggplot(svm_data, aes(X1, X2, color = class)) +
  geom_point() +
  geom_abline(intercept = (4.97 + 9.17) / 2, slope = -2) +
  geom_abline(intercept = 4.97, slope = -2, lty = 3) +
  geom_abline(intercept = 9.17, slope = -2, lty = 3) +
  labs(x = TeX(r"($X_1$)"), y = TeX(r"($X_2$)"))

The maximal margin classifier, or the hard margin classifier, finds a separating hyperplane such that the shortest distance between the hyperplane and the observations on both sides, called the margin, is largest (hence the maximal margin).

In the figure above, the solid line represents the separating hyperplane. The shortest distance between the solid line and the dashed lines on both sides is called the margin. The observations on the dashed lines are called support vectors.

The maximal margin classifier depends directly on the support vectors; other observations do not matter. Different support vectors will lead to a different hyperplane, which makes the maximal margin classifier likely to overfit the data. Moreover, it is not always possible to separate the classes perfectly. The next classifier addresses the shortcomings of the maximal margin classifier.

Support Vector Classifier

The support vector classifier, or the soft margin classifier, also finds a separating hyperplane, but it does so by allowing some observations to be on the wrong side of the margin or even the hyperplane. As we see below, it is impossible to find a line to separate class A and B observations perfectly. Yet, one can still insert a line to classify most of them correctly at the expense of some misclassifications.

set.seed(125)
svm_data = matrix(rnorm(50 * 2), ncol = 2)
class = rep(c("A", "B"), each = 25)

svm_data[class == "B", ] = svm_data[class == "B", ] + 1.5
svm_data = data.frame(svm_data, class)

ggplot(svm_data, aes(X1, X2, color = class)) +
  geom_point() +
  geom_abline(intercept = (2.03 + 2.95) / 2, slope = -2) +
  geom_abline(intercept = 2.03, slope = -2, lty = 3) +
  geom_abline(intercept = 2.95, slope = -2, lty = 3) +
  labs(x = TeX(r"($X_1$)"), y = TeX(r"($X_2$)"))

In the example above, there are 25 observations from each class. The soft margin classifier will misclassify five class A and five class B observations. Yet, it will correctly classify 20 class A and 20 class B observations, yielding 80% accuracy (100 × 40/50).

Consider a set of training observations $x_1, \ldots, x_n \in \mathbb{R}^p$ and associated class labels $y_1, \ldots, y_n \in \{-1, 1\}$. Mathematically, the hyperplane is the solution to the following optimization problem:

$$\begin{aligned}
& \max_{\beta_0, \ldots, \beta_p, \, \epsilon_1, \ldots, \epsilon_n, \, M} \; M \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \\
& y_i (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M (1 - \epsilon_i), \\
& \epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C,
\end{aligned}$$

where $C$ is a nonnegative tuning parameter. When $C = 0$, we have zero budget for constraint violations, i.e., $\epsilon_i = 0$ for $i = 1, \ldots, n$, and the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ and the margin $M$ are so chosen that all training observations are perfectly separated into their classes, assuming such a hyperplane exists. So $C = 0$ yields the hard margin classifier. As $C$ gets larger, we tolerate more and more observations being on the wrong side of the margin or the hyperplane, so that we get a soft margin classifier.

The hyperplane is given by $f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, and a test observation $x$ is classified as $-1$ or $1$ depending on the sign of $f(x)$.

Only the observations that either lie on the margin or violate the margin affect the hyperplane, and such observations are called support vectors. So, for the hard and soft margin classifiers, only a small subset of the training observations, the support vectors, matters for the hyperplane, making these classifiers robust to the behavior of observations far away from the hyperplane.
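As a minimal sketch (reusing the toy svm_data generated in the chunk above, and toy_fit as an illustrative name), we can fit a soft margin classifier with kernlab’s ksvm() and a linear kernel, then inspect the number of support vectors and the training confusion table:

library(kernlab)

toy_fit = ksvm(
  as.factor(class) ~ X1 + X2,
  data = svm_data,
  kernel = "vanilladot",  # linear kernel, i.e., a soft margin classifier
  C = 1                   # cost of constraint violations
)
toy_fit  # the printed summary reports the number of support vectors
table(predicted = predict(toy_fit, svm_data), actual = svm_data$class)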

Support Vector Machines

In some cases, letting some observations lie on the wrong side of the margin is not enough: there simply does not exist a linear decision boundary that can separate the observations reasonably well. If you are familiar with the linear regression model, recall that adding nonlinear terms (quadratic, cubic, etc.) to the model equation allows the model to capture nonlinear relationships between the response and the predictors.

The support vector machine (SVM) is an extension of the support vector classifier that deals with nonlinear class boundaries by enlarging the feature space using kernels.

It has been shown that the solution to the support vector classifier optimization problem laid out above involves only the inner products of the observations. For two observations $x_i$ and $x_{i'}$, their inner product is given by

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}$$

Moreover, the hyperplane can be written in a general form as

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i K(x, x_i)$$

where $K$ is some function called a kernel. The kernel is a function of the inner products and quantifies the similarity between two observations. A linear kernel is given by

$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$$

When the kernel is linear in the predictors, $f(x)$ represents the support vector classifier. A support vector classifier with a non-linear kernel is a support vector machine. The non-linear kernel could be a polynomial kernel of degree $d$:

$$K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d$$

Another non-linear kernel, the radial kernel, is given by

$$K(x_i, x_{i'}) = \exp\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)$$

where $\gamma > 0$.
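kernlab exposes these kernel functions directly: rbfdot() implements $\exp(-\sigma \, \lVert x - x' \rVert^2)$, so its sigma parameter plays the role of $\gamma$ above. A tiny sketch:

library(kernlab)

rbf = rbfdot(sigma = 0.5)    # radial basis kernel with sigma = 0.5
rbf(c(1, 2, 3), c(2, 1, 0))  # kernel value for two toy observations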

To estimate the parameters in the hyperplane equation, $\binom{n}{2} = n(n-1)/2$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations must be computed. This means that SVMs can be slow to train on data where $n$ is much larger than $p$.

The classifiers based on finding a separating hyperplane can also accommodate multinomial classification problems. For instance, the kernlab package uses one-versus-one classification, in which it trains $\binom{K}{2} = K(K-1)/2$ binary classifiers, each assigning observations to one of a pair of classes. Each observation is finally assigned to the class to which it was most frequently assigned across all pairwise classifications.

Model Specification

The kernlab package can fit the hard margin, soft margin, and support vector machine classifiers. Its ksvm() function can be used for both classification and regression problems. The C argument of the ksvm() function, which defaults to 1, sets the cost of constraint violations and determines how wide the margin will be. Note that this cost plays the opposite role of the budget $C$ in the optimization problem above: a higher cost value means a narrower margin. Setting the kernel option to vanilladot uses a linear kernel and fits a hard or soft margin classifier; a relatively high cost combined with vanilladot approximates a hard margin classifier. The kernel option defaults to rbfdot, the radial basis kernel. To use a polynomial kernel, we can set it to polydot.

The parsnip package offers different specification functions for SVM models, setting the kernel option under the hood. Below, we specify three SVM models using linear, polynomial, and radial basis kernels. In each specification, we tag the cost argument for tuning and pass prob.model = TRUE to set_engine(). This tells the underlying ksvm() function to generate class probabilities so that we can compute the area under the ROC curve. For the polynomial and radial basis specifications, we also tag the degree and rbf_sigma arguments, respectively, for tuning. rbf_sigma is passed to kernlab as the sigma parameter of the radial basis kernel, which plays the role of the gamma parameter in the radial kernel above.

svm_lin_spec = 
  svm_linear(cost = tune()) |>
  set_engine("kernlab", prob.model = TRUE) |>
  set_mode("classification")

svm_poly_spec = 
  svm_poly(cost = tune(), degree = tune()) |>
  set_engine("kernlab", prob.model = TRUE) |>
  set_mode("classification")

svm_rbf_spec = 
  svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab", prob.model = TRUE) |>
  set_mode("classification")

The ksvm() function, by default, scales and centers (i.e., normalizes) all non-binary/non-factor variables in the data through its scaled = TRUE option. This means, for instance, that the Country column will be one-hot encoded to create five dummies, while the rest of the predictors, which are integer and double variables, will be normalized. Since we will tune the model hyperparameters using cross-validation, we will include a preprocessing step to normalize the predictors later.

Tuning Classifiers

So far, we have reviewed eight classifiers and used the parsnip model specification functions to specify a model for each classifier, tagging the model hyperparameters for tuning. We now create recipes to state the model formula and preprocess the data before tuning the hyperparameters.

We need to address the class imbalance problem for all models. As we have seen before, we can downsample to make the proportion of classes equal in the training data. In fact, we can add downsampling to recipes as a preprocessing step. The step_downsample() function from the themis package does precisely this. We pass the Remote column to it to equalize the class proportions.

We observed above that, even though it is numeric, the CompanySizeNumber variable has only eight unique values. We can use the step_num2factor() function from the recipes package to convert it into a factor variable whose levels correspond to those eight distinct size values. This function requires the numeric values to be 1, 2, 3, and so on before conversion; its transform argument expects a function that maps the raw numeric data to those integers. Below, we define a binning function that transforms the company sizes 1, 10, 20, …, into 1, 2, 3, …. We also create a character vector that contains the levels for the factor variable to be created.

# convert CompanySizeNumber values to 1, 2, etc.
binner = function(x) {
  x = cut(x, breaks = c(1, 10, 20, 100, 500, 1000, 5000, 10000, 15000), right = FALSE)
  as.numeric(x)
}

sizes = c("one", "ten", "twenty", "hundred", "five_hundred", 
          "thousand", "five_thousand", "ten_thousand")

For the first recipe, we then add the step_dummy() function to create dummy variables for the factor variables Country and CompanySizeNumber. We will use this first recipe for the logistic regression, LDA, QDA, and naive Bayes models. Finally, we add a normalization step to the first recipe to create a second recipe for the KNN and SVM models, passing the all_predictors() selector function to step_normalize() to normalize all predictors.
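Before building the recipes, a quick sanity check (illustrative only) that the binner defined above maps the observed company sizes onto 1 through 8:

observed_sizes = sort(unique(stackoverflow$CompanySizeNumber))
observed_sizes
binner(observed_sizes)  # should return 1, 2, ..., 8 for the eight distinct sizes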

library(themis)

so_rec = 
  recipe(Remote ~ ., data = so_train) |>
  step_downsample(Remote) |>
  step_num2factor(
    CompanySizeNumber,
    transform = binner,
    levels = sizes
  ) |>
  step_dummy(Country, CompanySizeNumber)

so_rec_norm = so_rec |>
  step_normalize(all_predictors())

We now combine the recipes with the models to create two workflow sets, and then stack the two sets with the bind_rows() function. The final workflow set, all_workflows, contains eight workflows, each corresponding to a different model.

models1 = list(
  LOGIT = logistic_spec, 
  LDA = lda_spec, 
  QDA = qda_spec,
  NBayes = nb_spec
)

models2 = list(
  KNN = knn_spec,
  SVC = svm_lin_spec,
  SVM_P = svm_poly_spec,
  SVM_R = svm_rbf_spec
)

wflow_set1 = workflow_set(list(so_rec), models1)
wflow_set2 = workflow_set(list(so_rec_norm), models2)

all_workflows = bind_rows(wflow_set1, wflow_set2) |>
  mutate(wflow_id = gsub("recipe_", "", wflow_id))

all_workflows
#> # A workflow set/tibble: 8 × 4
#>   wflow_id info             option    result    
#>   <chr>    <list>           <list>    <list>    
#> 1 LOGIT    <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 LDA      <tibble [1 × 4]> <opts[0]> <list [0]>
#> 3 QDA      <tibble [1 × 4]> <opts[0]> <list [0]>
#> 4 NBayes   <tibble [1 × 4]> <opts[0]> <list [0]>
#> 5 KNN      <tibble [1 × 4]> <opts[0]> <list [0]>
#> 6 SVC      <tibble [1 × 4]> <opts[0]> <list [0]>
#> 7 SVM_P    <tibble [1 × 4]> <opts[0]> <list [0]>
#> 8 SVM_R    <tibble [1 × 4]> <opts[0]> <list [0]>

We are now ready for tuning. Given that we have a workflow set to train, we need to use the workflow_map() function from the workflowsets package. As the tuning function, we will use tune_race_anova() from the auxiliary finetune package to enable racing. This function uses the first metric in the supplied metric set (accuracy here) to decide which model configurations stay in the race. We also use the future package to conduct tuning via parallel processing: plan(multisession, workers = 8) will start eight R processes to perform the tuning.

library(finetune)
library(future)

plan(multisession, workers = 8)

race_ctrl = 
  control_race(
    parallel_over = "resamples",
    burn_in = 5
  )

race_results = 
  all_workflows |>
  workflow_map(
    fn = "tune_race_anova",
    seed = 345,
    resamples = so_folds,
    grid = 30,
    control = race_ctrl,
    metrics = metric_set(accuracy, roc_auc, sensitivity)
  )

Since we defined a metric set with three metrics, autoplot() plots the performance of the eight models across the three metrics. The ranking order is determined by the accuracy results. The radial SVM model tops the accuracy ranking but has the worst sensitivity. The naive Bayes model, on the other hand, has the worst accuracy, yet it tops the sensitivity ranking. These results suggest that one could rely on the naive Bayes model if there are good reasons to detect the positive class (“Remote” in our case) more accurately than the negative class.

autoplot(race_results, select_best = TRUE)

We can view the best accuracy results using the rank_results() function.

race_results |>
  rank_results(rank_metric = "accuracy", select_best = TRUE) |>
  filter(.metric == "accuracy") |>
  select(wflow_id, accuracy = mean, rank)
#> # A tibble: 8 × 3
#>   wflow_id accuracy  rank
#>   <chr>       <dbl> <int>
#> 1 SVM_R       0.826     1
#> 2 SVC         0.654     2
#> 3 SVM_P       0.651     3
#> 4 LDA         0.641     4
#> 5 LOGIT       0.640     5
#> 6 KNN         0.575     6
#> 7 QDA         0.543     7
#> 8 NBayes      0.414     8

Instead of ranking the workflows using the accuracy metric, we can use the roc_auc metric. This shows that the LDA model is the best, closely followed by the logistic regression model.

race_results |>
  rank_results(rank_metric = "roc_auc", select_best = TRUE) |>
  filter(.metric == "roc_auc") |>
  select(wflow_id, roc_auc = mean, rank)
#> # A tibble: 8 × 3
#>   wflow_id roc_auc  rank
#>   <chr>      <dbl> <int>
#> 1 LDA        0.705     1
#> 2 LOGIT      0.704     2
#> 3 SVM_P      0.702     3
#> 4 SVC        0.684     4
#> 5 NBayes     0.674     5
#> 6 SVM_R      0.673     6
#> 7 KNN        0.641     7
#> 8 QDA        0.641     8

Note also that the polynomial SVM model achieves its best roc_auc value when the degree equals one, i.e., when the kernel is linear. The fact that the top five models are linear in the predictors suggests that a linear function of the predictors better approximates the decision boundary separating developers by work status.

race_results |>
  extract_workflow_set_result(id = "SVM_P") |>
  select_best(metric = "roc_auc")
#> # A tibble: 1 × 3
#>    cost degree .config              
#>   <dbl>  <int> <chr>                
#> 1  7.64      1 Preprocessor1_Model16

Given that the LDA classifier has the highest roc_auc value and an accuracy rate almost as high as the top models in the accuracy ranking, we can choose it as the winner. We now extract the LDA workflow from race_results and pipe it to the last_fit() function to fit it to the entire training set and obtain predictions on the test set. We then collect the predictions using the collect_predictions() function. Finally, we report the confusion matrix and associated statistics. The LDA classifier has an accuracy rate of 65.24% and a sensitivity of 62.61% on the test set.

preds = 
  race_results |>
  extract_workflow("LDA") |>
  last_fit(so_split) |>
  collect_predictions()

caret::confusionMatrix(
  data = preds$.pred_class,
  reference = preds$Remote,
  positive = "Remote"
)
#> Confusion Matrix and Statistics
#> 
#>             Reference
#> Prediction   Remote Not remote
#>   Remote         72        346
#>   Not remote     43        658
#>                                           
#>                Accuracy : 0.6524          
#>                  95% CI : (0.6236, 0.6803)
#>     No Information Rate : 0.8972          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.1299          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.62609         
#>             Specificity : 0.65538         
#>          Pos Pred Value : 0.17225         
#>          Neg Pred Value : 0.93866         
#>              Prevalence : 0.10277         
#>          Detection Rate : 0.06434         
#>    Detection Prevalence : 0.37355         
#>       Balanced Accuracy : 0.64073         
#>                                           
#>        'Positive' Class : Remote          
#> 

Footnotes

  1. See Hastie et al. (2021).↩︎

  2. For the definition of other statistics reported by the confusionMatrix() function, see its documentation.↩︎

  3. Also in the test set since we stratified on the Remote variable.↩︎

  4. If we didn’t do downsampling and used the raw training data, the logistic regression would have predicted “Not remote” for all training and test observations. In other words, that would have simply made it a naive classifier!↩︎

  5. For a detailed exposition, see Hastie et al. (2021) which we rely on here.↩︎

  6. Note that the density functions for LDA and QDA are, in contrast, p-dimensional.↩︎

  7. See Hastie et al. (2021). Most implementations of KNN classifiers randomly break ties if they emerge.↩︎

  8. See Boehmke and Greenwell (2020). Some observations might be the same distance from a given observation such that there are more than K nearest neighbors. In such cases, KNN implementations may randomly choose from equal-distance observations to have exactly K neighbors.↩︎

  9. See Boehmke and Greenwell (2020).↩︎

  10. kknn() also uses kernel functions to weight the neighbors according to their distances. Its kernel option must be set to "rectangular" for unweighted KNN classification; it defaults to kernel = "optimal". This option can also be tuned by passing weight_func = tune() to the nearest_neighbor() function, which we skip doing here.↩︎

  11. For a detailed exposition, see Hastie et al. (2021).↩︎

  12. See Boehmke and Greenwell (2020).↩︎

  13. Another alternative is one-versus-all classification, where $K$ binary classifiers are trained to compare one of the $K$ classes to the remaining $K - 1$ classes.↩︎

  14. In fact, this is the default behavior of the parsnip specification functions.↩︎

  15. There won’t be any tuning for the models in the first workflow set since we have not tagged any parameter for tuning. tune_race_anova() will consider each as a single model configuration.↩︎

  16. Recall that tune_race_anova() uses the accuracy metric to decide which model configurations to pursue further. This means some model configurations with better roc_auc values can be eliminated earlier, which means that the post-tuning ranking based on roc_auc may not fully match the ranking from racing using roc_auc.↩︎