The goal of modeltuning is to provide common model selection and tuning utilities in an intuitive manner. Additionally, modeltuning aims to be:

  • Fairly lightweight, without forcing you to learn an entirely new modeling paradigm
  • Model- and data-type agnostic, working easily with most R modeling packages and a variety of data types, including data frames, standard dense matrices, and Matrix sparse matrices
  • Easily parallelizable; modeltuning is built on top of the future package and is compatible with any of the (many!) available parallelization backends.

Installation

You can install the released version of modeltuning from CRAN with:

install.packages("modeltuning")

and the development version of modeltuning with:

# install.packages("pak")
pak::pkg_install("dmolitor/modeltuning")

Usage

These simple examples use the built-in iris dataset to illustrate the basic functionality of modeltuning.

Cross Validation

First, we’ll train a binary classification decision tree that predicts whether the flowers in iris are of species virginica, and we’ll use a 3-fold cross validation scheme, stratified by Species, to estimate the model’s true error rate.

Let’s start by splitting our data into training and test sets.

library(future)
library(modeltuning)
library(rpart)
library(rsample)
library(yardstick)

# Shuffle the rows and recode Species as a binary factor (virginica vs. not)
iris_new <- iris[sample(1:nrow(iris), nrow(iris)), ]
iris_new$Species <- factor(iris_new$Species == "virginica")

# Hold out the last 50 rows as a test set
iris_train <- iris_new[1:100, ]
iris_test <- iris_new[101:150, ]
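
Since we’ll stratify the folds by Species below, it’s worth a quick sanity check that both classes appear in the training split:

# Class balance in the training split (TRUE = virginica)
table(iris_train$Species)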

Next, we’ll define a splitter function that uses rsample’s vfold_cv() to generate cross validation splits, returning the in-sample row indices (in_id) for each fold.

splitter <- function(data, ...) lapply(vfold_cv(data, ...)$splits, \(.x) .x$in_id)
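
You can call it directly to check the output: a list with one integer vector of training-row indices per fold.

# Three folds, each a vector of in-sample (training) row indices
folds <- splitter(iris_train, v = 3, strata = "Species")
str(folds)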

Now, let’s specify and fit a 3-fold cross validation scheme, calculating F-measure, accuracy, and ROC AUC on each fold’s hold-out set. The prediction_args are per-metric arguments passed on to predict(), and convert_predictions lets you post-process predictions before scoring; here it extracts the predicted probability of the "FALSE" class (the first factor level, which yardstick treats as the event by default) for the ROC AUC.

# Specify cross validation schema
iris_cv <- CV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  splitter = splitter,
  splitter_args = list(v = 3, strata = "Species"),
  scorer = list(
    f_meas = f_meas_vec,
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ), 
  prediction_args = list(
    f_meas = list(type = "class"),
    accuracy = list(type = "class"), 
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    f_meas = NULL,
    accuracy = NULL,
    auc = function(.x) .x[, "FALSE"]
  )
)

# Fit cross validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_train)

Now, let’s check our evaluation metrics averaged across folds.

iris_cv_fitted$mean_metrics
#> $f_meas
#> [1] 0.9393568
#> 
#> $accuracy
#> [1] 0.9199787
#> 
#> $auc
#> [1] 0.9146985
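
For intuition, each fold is doing something like the following, written out by hand for our single train/test split; the CV object just repeats this across folds and averages the scores:

# One manual "fold": fit on the training rows, score the held-out rows
tree_fit <- rpart(Species ~ ., data = iris_train, method = "class")
test_class <- predict(tree_fit, iris_test, type = "class")
accuracy_vec(truth = iris_test$Species, estimate = test_class)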

Grid Search

Another common model-tuning method is grid search. We’ll use it to tune the minsplit and maxdepth parameters of our decision tree, choosing the hyperparameters that maximize ROC AUC on our hold-out test set.

# Specify Grid Search schema
iris_grid <- GridSearch$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  evaluation_data = list(x = iris_test, y = iris_test$Species),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)
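
Assuming GridSearch fits one model per combination of tune_params (the full Cartesian product), that’s 30 candidate trees, which you can enumerate yourself with expand.grid():

# 5 minsplit values x 6 maxdepth values = 30 candidate models
candidate_grid <- expand.grid(
  minsplit = seq(10, 30, by = 5),
  maxdepth = seq(20, 30, by = 2)
)
nrow(candidate_grid)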

# Fit models across grid
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data = iris_train
)

Let’s check out the optimal decision tree hyperparameters.

iris_grid_fitted$best_params
#> $minsplit
#> [1] 10
#> 
#> $maxdepth
#> [1] 20
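
If you want a single final model with those values, one option (sketched here; modeltuning may expose the fitted models directly) is to pass best_params straight to rpart, which forwards control parameters such as minsplit and maxdepth to rpart.control():

# Refit a final tree on the training data using the selected hyperparameters
final_tree <- do.call(
  rpart,
  c(list(formula = Species ~ ., data = iris_train, method = "class"),
    iris_grid_fitted$best_params)
)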

Grid Search with Cross Validation

Finally, modeltuning supports grid search tuning that uses cross validation, rather than a single hold-out validation set, to estimate each candidate model’s true error rate. We’ll tune the same parameters as above.

# Specify Grid Search schema with cross validation
iris_grid_cv <- GridSearchCV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  splitter = splitter,
  splitter_args = list(v = 3, strata = "Species"),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)

# Fit models across grid
iris_grid_cv_fitted <- iris_grid_cv$fit(
  formula = Species ~ .,
  data = iris_train
)

Let’s check out the optimal decision tree hyperparameters

iris_grid_cv_fitted$best_params
#> $minsplit
#> [1] 10
#> 
#> $maxdepth
#> [1] 28

as well as the cross validation ROC AUC for those parameters

iris_grid_cv_fitted$best_metric
#> [1] 0.9616682
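
Note that this is also the most expensive option so far: with every parameter combination fit on every fold, the number of model fits multiplies quickly, which is where parallelization comes in.

# 5 minsplit values x 6 maxdepth values x 3 folds = 90 model fits
length(seq(10, 30, by = 5)) * length(seq(20, 30, by = 2)) * 3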

Parallelization

As noted above, modeltuning is built on top of the future package and can use any parallelization backend that future supports when fitting cross-validated models or tuning models with grid search. The code below re-fits the cross-validated binary classification model from above using local parallelization.

plan(multisession)

# Fit cross validation model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_train)

plan(sequential)

# Model performance metrics
iris_cv_fitted$mean_metrics
#> $f_meas
#> [1] 0.9507937
#> 
#> $accuracy
#> [1] 0.939951
#> 
#> $auc
#> [1] 0.9411477
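
By default, plan(multisession) decides how many background R sessions to launch; if you want to control that explicitly, pass a workers argument:

# Cap the number of parallel workers, e.g. leave one core free
plan(multisession, workers = max(1, availableCores() - 1))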

And voilà!

Examples

For a bunch of worked examples with common ML frameworks, check out the /examples directory!