Machine Learning with tidymodels – Part 3

Tuning with parallel processing and racing

parallel processing
racing
tuning
R
Published

Feb 6, 2025

If you reviewed the code in the previous post, you may have noticed that tuning a boosted trees model takes a lot of time (approximately 20 minutes on my machine). This is pretty discouraging given the small data set we worked on. How can we ever train computationally demanding models on the much larger data sets we typically encounter in the real world? The good news is that tuning can be conducted much faster. In this blog post, we will learn how to use parallel processing and racing methods to tune model hyperparameters within the tidymodels framework.

Parallel processing/execution in R

Most computers today have multi-core CPUs. A core is a processing unit available on a computer, and a given CPU can have multiple cores. For instance, my laptop has an eight-core CPU. In a statistical computing context, unless multiple cores are enabled, a computer will rely on a single core to run the operations its user requests.

Consider the following example: For a simulation exercise, we want to draw 10,000, 100,000, and 1,000,000 numbers from the standard normal distribution. A typical way to do this involves calling the rnorm() function in the current R session three times, each time with a different sample size, as shown below.

x = rnorm(1e4)
y = rnorm(1e5)
z = rnorm(1e6)

The computations here are sequential. In the current session, R first runs x = rnorm(1e4). Once that finishes, it runs y = rnorm(1e5); once that finishes, it runs z = rnorm(1e6). Not only is each draw large; we also have to wait for one to finish before the next can start.

In parallel processing, code is executed simultaneously in different processes or sessions. In the above example, this would mean running three different R sessions (or processes), each busy drawing a different number of random values from the standard normal distribution. What is key to this type of parallel processing is that the chunks of computation must be completely separate: they must not depend on one another, so no communication between the processes is needed. This kind of computation is called embarrassingly parallel computation. The steps involved in such a computational scheme are:

  1. Start up M “worker” processes, and do any initialization needed on the workers.
  2. Send any data required for each task to the workers.
  3. Split the task into M roughly equally-sized chunks and send the chunks (including the R code needed) to the workers.
  4. Wait for all the workers to complete their tasks and ask them for their results.
  5. Repeat steps 2-4 for any further tasks.
  6. Shut down the worker processes.

To summarize, this divide-and-conquer strategy divides the work and assigns the parts to multiple processes running simultaneously on multiple cores on a single machine or cluster. This is where the enormous gains in computing speed come from. We will now see how we can run our R code in parallel.
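
To make these steps concrete, here is a minimal sketch (not part of the original random-number example) that uses the parallel package directly: we start two worker processes, send them the drawing task in chunks, collect the results, and shut the workers down.

library(parallel)

# Step 1: start up two worker processes
cl = makeCluster(2)

# Steps 2-3: split the task into chunks and send them (with the code) to the workers
sizes = c(1e4, 1e5, 1e6)
draws = parLapply(cl, sizes, function(n) rnorm(n))

# Step 4: the results come back as a list, one element per chunk
sapply(draws, length)

# Step 6: shut down the worker processes
stopCluster(cl)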

Parallel processing with the foreach and doParallel packages

Let’s say we want to take the natural logarithm of 1, 2, and 3. We can do this in two ways: a) We can use the (vectorized) log() function, or b) we can use one of the looping constructs available in base R, such as the for loop. In both cases, taking the natural logarithm is repeated 3 times.

# Using a vectorized function
log(1:3)
#> [1] 0.0000000 0.6931472 1.0986123

# Using a looping construct
x = numeric(3)
for (i in 1:3) {x[i] = log(i)}

The foreach package provides another looping construct for executing R code repeatedly. But unlike the vectorized functions or the looping constructs in base R, it can perform the computational operations in parallel, i.e., execute them on multiple cores.

Combining the foreach() function with the %do% operator, both provided by the foreach package, runs the code repeatedly but in a sequential order just like the vectorized functions and looping constructs in base R. This means that each operation is executed on a single core one after another. Below, we use the foreach() function with the %do% operator to take the natural logarithm of 1, 2, and 3.

library(foreach)

foreach(i = 1:3, .combine = "c") %do% {log(i)}
#> [1] 0.0000000 0.6931472 1.0986123

The code that we want to run multiple times, log(i), is placed between curly braces (or parentheses) after the %do% operator. i = 1:3, passed to foreach(), is called the iteration variable and specifies the values we want our code to loop over. We set the .combine option to "c" to tell the function to combine the results of the individual log() calls with base R's c() function, so that it returns a vector.
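
For comparison, if we drop the .combine option, foreach() collects the results in a list, which is its default behavior (see the footnotes):

# Without .combine, the three results are returned as a list of length 3
foreach(i = 1:3) %do% {log(i)}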

Consider another example. Let’s say we want to draw five numbers from the normal distribution three times, each time using a different mean value for the normal distribution. We then want to combine those three length-five vectors into a matrix using them as columns.

set.seed(1)

foreach(a = 1:4, b = rep(0.5, 3), .combine = "cbind") %do% {
  rnorm(5, mean = a, sd  = b)
}
#>       result.1 result.2 result.3
#> [1,] 0.6867731 1.589766 3.755891
#> [2,] 1.0918217 2.243715 3.194922
#> [3,] 0.5821857 2.369162 2.689380
#> [4,] 1.7976404 2.287891 1.892650
#> [5,] 1.1647539 1.847306 3.562465

Both a and b are iteration variables. The first is used for the mean and the second for the standard deviation of the normal distribution. We provide four values for a and three for b, yet we get a matrix with only three columns. This happens because a and b change simultaneously: foreach() iterates over them together, so the number of iterations is determined by the shorter of the two. That is why random sampling happens three times and a = 4 is never used.

We need two things to run our code in parallel. First, we must use the %dopar% operator instead of %do%. Second, we need a package that can provide a parallel backend. Some commonly used packages that provide a parallel backend are parallel, doParallel, and doMC. Below, we use the first two to create a setup that allows parallel execution.

The doParallel package requires the parallel package, so loading the former will load the latter. The detectCores() function in the parallel package attempts to detect the number of CPUs/cores on the host machine.

library(doParallel) # loads also the parallel package

n_cores = detectCores()
n_cores
#> [1] 8

Next, we create a cluster (of processes) for parallel computing as outlined above. The makeCluster() function creates a set of copies of R that run in parallel and communicate over sockets. It takes two main arguments: spec and type. We use the first to specify the number of cores we want in the cluster. Setting type to its default value, "PSOCK", sets up "a worker process [an R instance] which listens on a socket for expressions to evaluate and returns the results". Below, we set spec to n_cores to use all cores.

cluster = makeCluster(spec = n_cores, type = "PSOCK")

Finally, we can use the registerDoParallel() function to register the parallel backend we set up with the foreach package.

registerDoParallel(cluster)

We are now ready for parallel processing. Suppose we want to train a random forest model on some random data using 1000 trees. Rather than growing 1000 trees sequentially on a single core, we can split the work into eight parts and run them on multiple cores in parallel. Below, we generate some random data and then use the randomForest() function from the randomForest package to grow 1000 trees in parallel. Given the eight cores on my machine, each worker process will grow 125 trees. To combine the results from the worker processes at the end, we set the .combine option of foreach() to the combine() function from the randomForest package. Since the worker processes do not load the randomForest package automatically, it must be exported to them; we do that by passing .packages = "randomForest" to foreach().

library(randomForest)

x = matrix(runif(1000), nrow = 100) # 100-by-10 predictor matrix
y = gl(2, 50) # 100-by-1 class vector with two levels each repeating 50 times

foreach(
  ntree = rep(125, 8), .combine = combine, 
  .packages = "randomForest"
) %dopar% {
  randomForest(x = x, y = y, ntree = ntree)
}
#> 
#> Call:
#>  randomForest(x = x, y = y, ntree = ntree) 
#>                Type of random forest: classification
#>                      Number of trees: 1000
#> No. of variables tried at each split: 3

Once parallel computing is complete, it is recommended that we stop the cluster (i.e., shut down the worker processes) by calling the stopCluster() function from the parallel package.

stopCluster(cluster)

Regardless of how computationally intensive our code is, setting up a cluster and managing the worker processes takes some time, called overhead time. As illustrated here, overhead time can make parallel computing slower than sequential computing when the overall amount of computation is relatively small. So, we should prefer parallel computing only when intensive computations are repeated many times, such as when fitting a large random forest model or tuning a boosted trees model with 10 cross-validation resamples.
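
As a rough illustration of this overhead (a sketch; exact timings will vary by machine), a trivial task can actually take longer in parallel than sequentially:

# Set up and register a small cluster again
cluster = makeCluster(2)
registerDoParallel(cluster)

# A trivial computation: the time spent sending tasks to the workers and
# collecting their results dominates the computation itself
system.time(foreach(i = 1:100, .combine = "c") %do%    {sqrt(i)})
system.time(foreach(i = 1:100, .combine = "c") %dopar% {sqrt(i)})

stopCluster(cluster)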

Parallel processing with the future package

The future package simplifies parallel computing even further. This package, too, uses the parallel package, but it is an alternative to the foreach package. It provides a new construct that uses the %<-% assignment operator to create R expressions that can be evaluated asynchronously, for instance, in parallel. Moreover, using its plan() function, we can easily switch between sequential and parallel computing. To do sequential computing, i.e., to evaluate R expressions sequentially in the current R session, we set its strategy option to sequential.

library(future)
plan(sequential)

To enable parallel computing using multiple cores on a single machine, we now set the strategy option to multisession and the number of workers/processes to 8.

plan(multisession, workers = 8)

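As a quick illustration of the %<-% construct (a sketch, separate from the tuning workflow below), the two expressions here are evaluated in separate background sessions at the same time:

# Each %<-% expression becomes a future evaluated in its own background session
a %<-% {Sys.sleep(2); rnorm(5, mean = 1)}
b %<-% {Sys.sleep(2); rnorm(5, mean = 2)}

# Reading a and b blocks until the futures are resolved; because they ran in
# parallel, this takes roughly two seconds rather than four
c(mean(a), mean(b))
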
Below, we will use the future package to activate parallel processing for model tuning.

Tuning using parallel processing

Let’s run the same code from the previous posts again to generate the training and test sets and cross-validation resamples.

library(tidymodels)
tidymodels_prefer()

data(ames)
ames = mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(2025)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)

set.seed(342)
ames_folds = vfold_cv(ames_train, strata = Sale_Price)

Since our point is to illustrate how tuning can be conducted faster with parallel processing, we will use an alternative engine this time: lightgbm, rather than the xgboost engine we used in the previous post. The underlying LightGBM library is faster than the XGBoost library, so we can tune even faster. To use the lightgbm engine within the tidymodels framework, we need to attach the bonsai package, which provides additional wrappers for tree-based models fit with the lightgbm engine.

Let’s now create a boosted trees workflow, tagging four hyperparameters for tuning. Rather than tuning the number of trees, we tune stop_iter, which corresponds to early_stopping_rounds in both the xgboost and lightgbm engines. This parameter controls the number of boosting iterations (i.e., trees) by stopping boosting when the performance metric computed on the assessment/validation set does not improve for early_stopping_rounds consecutive boosting rounds.

library(bonsai)

boost_spec = 
  boost_tree(
    trees = 2000, tree_depth = tune(), min_n = tune(), 
    learn_rate = tune(), stop_iter = tune()
  ) |>
    set_engine("lightgbm") |>
    set_mode("regression")

boost_wflow = 
  workflow() |>
  add_formula(Sale_Price ~ .) |>
  add_model(boost_spec)

Below, we create a parameters object containing the tagged tuning parameters and their default ranges. We then update the range for the learn_rate parameter, changing it from the default [10^-10, 10^-1] to [10^-4, 10^-0.5], as a preliminary inspection showed that very small learning rates tend to yield larger RMSE values; they slow down boosting and extend the tuning time for no good reason.

boost_param = boost_wflow |> 
  extract_parameter_set_dials() 

boost_param = boost_param |>
  update(learn_rate = learn_rate(c(-4, -0.5)))

We now create a regular grid with 81 (= 3^4) parameter combinations.

grid_reg = boost_param |> 
  grid_regular(levels = 3)

We have 10 resamples and 81 parameter candidates, which means that tuning will involve 810 model fits (to the analysis sets) and 810 predictions (with the assessment sets)! As Kuhn and Silge (2023) discuss, the tune_grid() function can execute computations in parallel in two ways:

  1. For each resample, 81 model configurations can be fitted. If we have 10 cores on our machine, each core/process can perform 81 fits using a distinct resample. In this case, parallelization is over resamples.

  2. The 810 fits can be treated as 810 separate tasks, where each task represents a fit with a distinct combination of resample and parameter candidate. In this case, each core can perform 81 fits, potentially using different resamples. Parallelization is over everything.

If our model involves many or intensive preprocessing steps, the former will run faster because preprocessing will be performed only once for each resample. The latter, on the other hand, will perform preprocessing 810 times!

In both cases, computations per fit iteration are completely separable, so we have a case for embarrassingly parallel computation.
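
To make the task counts concrete, here is a small sketch using the objects created above (the numbers come from our 10 resamples and 81 candidates):

# Number of parallel tasks under each scheme
n_resamples  = nrow(ames_folds) # 10
n_candidates = nrow(grid_reg)   # 81

c(
  over_resamples  = n_resamples,                # scheme 1: one task per resample
  over_everything = n_resamples * n_candidates  # scheme 2: one task per resample-candidate pair
)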

The default behavior of the tuning and fitting functions in tidymodels is to run computations in parallel if a parallel backend is registered; otherwise, they run sequentially. For tune_grid(), the parallel_over option of the associated control_grid() function is set to "resamples" by default when more than one resample is provided. For the second scheme mentioned above, it should be set to "everything". We will use the second scheme, given that our model does not involve any preprocessing.

To set up a parallel backend and enable parallel execution, we use the plan() function from the future package. We also use the system.time() function in base R to time how long it takes for tune_grid() to finish tuning.

plan(multisession, workers = 8)

boost_time_par = system.time({
  tune_res_par = tune_grid(
      boost_wflow,
      resamples = ames_folds,
      grid = grid_reg,
      metrics = metric_set(rmse),
      control = control_grid(parallel_over = "everything")
  )
})[["elapsed"]]

We see that (min_n, tree_depth, learn_rate, stop_iter) = (2, 1, 0.316, 3) is the best parameter candidate.

show_best(tune_res_par, metric = "rmse")
#> # A tibble: 5 × 10
#>   min_n tree_depth learn_rate stop_iter .metric .estimator   mean     n std_err
#>   <int>      <int>      <dbl>     <int> <chr>   <chr>       <dbl> <int>   <dbl>
#> 1     2          1    0.316           3 rmse    standard   0.0557    10 0.00304
#> 2     2          1    0.316          11 rmse    standard   0.0557    10 0.00304
#> 3     2          1    0.316          20 rmse    standard   0.0557    10 0.00304
#> 4    40         15    0.00562         3 rmse    standard   0.0557    10 0.00307
#> 5    40         15    0.00562        11 rmse    standard   0.0557    10 0.00307
#> # ℹ 1 more variable: .config <chr>

Racing parameter candidates

Whether using parallel processing or not, tuning fits a model for every parameter combination (or model configuration) using all resamples. An alternative method drops some parameter candidates during tuning as soon as there is enough statistical evidence that they are not performing as well as the best ones.

This is the key idea behind racing methods. The tune_race_anova() function is one of the racing functions in the finetune package. The function initially fits every model configuration across three randomly selected burn-in resamples. Each parameter candidate is then represented by a dummy variable, and the resampled performance statistic, here the RMSE computed on the assessment sets, is used as the outcome variable. Taking the resample-to-resample effect into account, an ANOVA analysis is conducted to determine which candidates are underperforming relative to the best one. Those worse-performing parameter candidates are eliminated and no longer considered, and the same ANOVA analysis is repeated after each subsequent resample.
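
To get a feel for the kind of model involved, here is a rough conceptual sketch using the per-resample metrics from the earlier grid search. Treat it only as an illustration of the idea; the actual implementation in finetune handles the resample effect and the filtering of candidates more carefully.

# Conceptual sketch (not the finetune internals): model the per-resample RMSE
# values with the candidate configuration as a dummy-coded predictor and the
# resample id as a blocking term, then compare candidates with an ANOVA
rmse_by_resample = tune_res_par |>
  collect_metrics(summarize = FALSE) |>
  filter(.metric == "rmse")

anova(lm(.estimate ~ .config + id, data = rmse_by_resample))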

Below, we switch to sequential processing and then time the tune_race_anova() function. We also make use of the control_race() function, turning on its verbose_elim option so that tune_race_anova() prints the number of parameter combinations eliminated at each step. control_race() has allow_par = TRUE and parallel_over = "everything" by default, meaning tuning will run in parallel if a parallel backend is registered. Since we are in sequential mode, these settings have no effect here.

library(finetune)

plan(sequential)

boost_time_race = system.time({
  tune_res_race = 
  tune_race_anova(
    boost_wflow,
    resamples = ames_folds,
    grid = grid_reg,
    control = control_race(verbose_elim = TRUE)
  )
})[["elapsed"]]
#> ℹ Evaluating against the initial 3 burn-in resamples.
#> ℹ Racing will minimize the rmse metric.
#> ℹ Resamples are analyzed in a random order.
#> ℹ Fold06: 42 eliminated; 39 candidates remain.
#> 
#> ℹ Fold04: 12 eliminated; 27 candidates remain.
#> 
#> ℹ Fold01: 0 eliminated; 27 candidates remain.
#> 
#> ℹ Fold09: 6 eliminated; 21 candidates remain.
#> 
#> ℹ Fold08: 0 eliminated; 21 candidates remain.
#> 
#> ℹ Fold10: 0 eliminated; 21 candidates remain.
#> 
#> ℹ Fold07: 0 eliminated; 21 candidates remain.

The plot_race() function from the finetune package plots the mean RMSE of the parameter candidates against the resamples (or stages), visually representing the information printed by tune_race_anova() above. Forty-two candidates were eliminated immediately after the three burn-in resamples, while 21 stayed in the race and saw the finish line.

plot_race(tune_res_race)

Below, we see that racing selected the same best parameter candidates that we identified before.

show_best(tune_res_race, metric = "rmse")
#> # A tibble: 5 × 10
#>   min_n tree_depth learn_rate stop_iter .metric .estimator   mean     n std_err
#>   <int>      <int>      <dbl>     <int> <chr>   <chr>       <dbl> <int>   <dbl>
#> 1     2          1    0.316           3 rmse    standard   0.0557    10 0.00304
#> 2     2          1    0.316          11 rmse    standard   0.0557    10 0.00304
#> 3     2          1    0.316          20 rmse    standard   0.0557    10 0.00304
#> 4    40         15    0.00562         3 rmse    standard   0.0557    10 0.00307
#> 5    40         15    0.00562        11 rmse    standard   0.0557    10 0.00307
#> # ℹ 1 more variable: .config <chr>

Tuning with parallel processing and racing

Parallel processing and racing are not rival methods of speeding up tuning. We can use both simultaneously! To do so, we need to enable parallel computing and call tune_race_anova(), which we do below.

plan(multisession, workers = 8)

boost_time_par_race = system.time({
  tune_res_par_race = 
  tune_race_anova(
    boost_wflow,
    resamples = ames_folds,
    grid = grid_reg
  )
})[["elapsed"]]

How do these three options compare in terms of computing time? Below, we see the number of seconds each option took. Racing with sequential processing took about one and a half times as long as parallel processing. Racing combined with parallel processing was around twice as fast as parallel processing alone and about three times faster than sequential racing.

tibble(
  parallel_time = boost_time_par,
  race_time = boost_time_race,
  parallel_race_time = boost_time_par_race
)
#> # A tibble: 1 × 3
#>   parallel_time race_time parallel_race_time
#>           <dbl>     <dbl>              <dbl>
#> 1          394.      607.               207.

Summary

Reducing tuning time is essential when exploring computationally demanding models’ performance on large data sets across many parameter candidates. Parallel processing and racing together cut tuning time dramatically. A simple function call enables parallel processing in R, and once it is enabled, the tuning functions in tidymodels take advantage of it and finish tuning much faster.

Footnotes

  1. Some CPUs offer hyperthreading or simultaneous multithreading, which means they can have multiple logical cores sharing the same physical cores.

  2. For more on parallel processing in a statistical computing context, see here.

  3. Worker and process usually mean the same thing.

  4. These steps are outlined in the parallel package documentation, and I draw on them extensively here.

  5. By default, foreach() returns a list.

  6. See the package documentation.

  7. We do not have to use all cores and can manually set spec to a lower number.

  8. See the package documentation for other options.

  9. See also the linked StackExchange post on this matter in Simon P. Couch’s post.

  10. You can make a much denser grid to conduct a more intensive grid search, which may find better parameter candidates. We limit the grid size to 81 to save time/energy.

  11. Model configuration refers to a model with a given set of parameter values.

  12. The associated control_* functions have allow_par = TRUE by default for this reason.

  13. This is the default behavior if only one resample is provided.

  14. This suggests that the best boosting uses very shallow trees. We also see that boosting with much deeper trees yields the same RMSE but uses a much smaller learning rate, which should not be surprising.

  15. This package provides extra functions for model tuning, including tools for simulated annealing and racing, which optimize the grid search process.

  16. This is the default number and can be changed.

  17. As you might expect, the performance metrics correlate across resamples. In other words, a given model configuration can generate consistently higher or lower performance metrics than others.

  18. I skipped regular tuning with sequential processing. As illustrated here, a similar exercise that used the lightgbm engine took about 17 minutes on a 10-core machine! One can also activate parallel processing for the underlying engine, but as illustrated here, the gains from doing so are minimal.