The goal of modeltuning is to provide common model selection and tuning utilities in an intuitive manner. Additionally, modeltuning aims to be:
- Fairly lightweight, not forcing you to learn an entirely new modeling paradigm
- Model/type agnostic, working easily with most R modeling packages and various data types including data frames, standard dense matrices, and Matrix sparse matrices
- Easily parallelizable; modeltuning is built on top of the future package and is compatible with any of the (many!) available parallelization backends.
Installation
You can install the released version of modeltuning from CRAN with:
install.packages("modeltuning")and the development version of modeltuning with:
# install.packages("pak")
pak::pkg_install("dmolitor/modeltuning")
Usage
These are simple examples that use the built-in iris data-set to illustrate the basic functionality of modeltuning.
Cross Validation
First, we’ll train a binary classification decision tree to predict whether the flowers in iris are of the species virginica, and we’ll specify a 3-fold cross validation scheme, stratified by Species, to estimate our model’s true error rate.
First, let’s split our data into a train and test set.
library(future)
library(modeltuning)
library(rpart)
library(rsample)
library(yardstick)
# Shuffle the rows of iris and recode Species as a two-level factor (TRUE = virginica)
iris_new <- iris[sample(1:nrow(iris), nrow(iris)), ]
iris_new$Species <- factor(iris_new$Species == "virginica")
iris_train <- iris_new[1:100, ]
iris_test <- iris_new[101:150, ]
Next, we’ll define a function to generate cross validation splits.
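A minimal splitter built on rsample::vfold_cv() could look like the sketch below. It assumes modeltuning calls the splitter with the data plus the splitter_args (v and strata) and expects an rsample-style set of folds in return; the exact signature shown here is illustrative, not part of the modeltuning API.
# Sketch only: assumes the splitter receives the data plus the splitter_args
# (v, strata) and returns an rsample-style set of folds
splitter <- function(data, v, strata = NULL) {
  vfold_cv(data, v = v, strata = {{ strata }})
}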
Now, let’s specify and fit a 3-fold cross validation scheme and calculate the F-Measure, Accuracy, and ROC AUC as our hold-out set evaluation metrics.
# Specify cross validation schema
iris_cv <- CV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  splitter = splitter,
  splitter_args = list(v = 3, strata = Species),
  scorer = list(
    f_meas = f_meas_vec,
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  prediction_args = list(
    f_meas = list(type = "class"),
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    f_meas = NULL,
    accuracy = NULL,
    # roc_auc_vec needs the probability of the event class; with levels
    # FALSE/TRUE, yardstick treats the first level ("FALSE") as the event
    auc = function(.x) .x[, "FALSE"]
  )
)
# Fit cross validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)
Now, let’s check our evaluation metrics averaged across folds.
iris_cv_fitted$mean_metrics
#> $f_meas
#> [1] 0.9393568
#>
#> $accuracy
#> [1] 0.9199787
#>
#> $auc
#> [1] 0.9146985
Grid Search
Another common model-tuning method is grid search. We’ll use it to tune the minsplit and maxdepth parameters of our decision tree, and we’ll choose our optimal hyperparameters as those that maximize the ROC AUC on the validation set.
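As a quick sanity check on the size of the search space, the grid is simply every combination of the two candidate sequences (plain base R below, not a modeltuning call):
# 5 minsplit values x 6 maxdepth values = 30 candidate models
param_grid <- expand.grid(
  minsplit = seq(10, 30, by = 5),
  maxdepth = seq(20, 30, by = 2)
)
nrow(param_grid)
#> [1] 30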
# Specify Grid Search schema
iris_grid <- GridSearch$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  evaluation_data = list(x = iris_test, y = iris_test$Species),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)
# Fit models across grid
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data = iris_train
)
Let’s check out the optimal decision tree hyperparameters.
iris_grid_fitted$best_params
#> $minsplit
#> [1] 10
#>
#> $maxdepth
#> [1] 20
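If you want a single fitted tree with those values, you can refit it with plain rpart (this is not a modeltuning API); rpart forwards minsplit and maxdepth to rpart.control():
# Refit one tree on the training data using the selected hyperparameters
best_tree <- rpart(
  Species ~ .,
  data = iris_train,
  method = "class",
  minsplit = 10,
  maxdepth = 20
)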
Grid Search with cross validation
Finally, modeltuning supports model tuning with grid search using cross validation, rather than a hold-out validation set, to estimate each model’s true error rate. We’ll use cross validation to tune the same parameters as above.
# Specify Grid Search schema with cross validation
iris_grid_cv <- GridSearchCV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  splitter = splitter,
  splitter_args = list(v = 3, strata = Species),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)
# Fit models across grid
iris_grid_cv_fitted <- iris_grid_cv$fit(
  formula = Species ~ .,
  data = iris_train
)
Let’s check out the optimal decision tree hyperparameters
iris_grid_cv_fitted$best_params
#> $minsplit
#> [1] 10
#>
#> $maxdepth
#> [1] 28
as well as the cross validation ROC AUC for those parameters
iris_grid_cv_fitted$best_metric
#> [1] 0.9616682
Parallelization
As noted above, modeltuning is built on top of the future package and can use any parallelization backend that future supports when fitting cross-validated models or tuning models with grid search. The code below refits the cross validation schema from above (this time on iris_train) using local parallelization.
plan(multisession)
# Fit cross validation model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_train)
plan(sequential)
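# Any other future backend can be swapped in the same way, for example (illustrative):
# plan(multisession, workers = 2)               # cap the number of local workers
# plan(cluster, workers = c("node1", "node2"))  # PSOCK cluster across machines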
# Model performance metrics
iris_cv_fitted$mean_metrics
#> $f_meas
#> [1] 0.9507937
#>
#> $accuracy
#> [1] 0.939951
#>
#> $auc
#> [1] 0.9411477
And voila!
