The goal of `modelselection` is to provide a few common model selection and tuning utilities in an intuitive manner, without requiring you to adopt a whole new modeling paradigm. The provided functionality is type agnostic and works with data frames, standard dense matrices, and `Matrix` sparse matrices. Finally, it is easily distributable, as it’s built on top of the `future.apply` package.
## Installation
You can install the development version of `modelselection` with:

```r
# install.packages("devtools")
devtools::install_github("dmolitor/modelselection")
```
## Usage
These are simple examples that use the built-in `iris` data set to illustrate the basic functionality of `modelselection`.
### Cross Validation
We’ll train a binary classification decision tree to predict whether the flowers in `iris` are of species `virginica`, and we’ll specify a 3-fold cross-validation scheme with stratification by `Species` to estimate our model’s true error rate.

First, let’s split our data into a train and test set.
```r
library(future)
library(modelselection)
library(rpart)
library(rsample)
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.

# Shuffle the rows of iris and recode Species as a binary factor
# (TRUE = virginica, FALSE = everything else)
iris_new <- iris[sample(1:nrow(iris), nrow(iris)), ]
iris_new$Species <- factor(iris_new$Species == "virginica")

# Split into 100 training rows and 50 test rows
iris_train <- iris_new[1:100, ]
iris_test <- iris_new[101:150, ]
```
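Since we’ll stratify the cross-validation folds by `Species`, it’s worth a quick sanity check that both classes appear in each split:

```r
# Class balance in the train and test sets
table(iris_train$Species)
table(iris_test$Species)
```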
Now, let’s specify and fit a 3-fold cross-validation scheme, calculating the F-measure, accuracy, and ROC AUC as our hold-out fold evaluation metrics.
```r
# Specify the cross-validation schema
iris_cv <- CV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  splitter = vfold_cv,
  splitter_args = list(v = 3, strata = "Species"),
  scorer = list(
    "f_meas" = f_meas_vec,
    "accuracy" = accuracy_vec,
    "auc" = roc_auc_vec
  ),
  prediction_args = list(
    "f_meas" = list(type = "class"),
    "accuracy" = list(type = "class"),
    "auc" = list(type = "prob")
  ),
  # One element per scorer, in the same order; ROC AUC needs the predicted
  # probability of the event level ("FALSE") rather than the class labels
  convert_predictions = list(
    NULL,
    NULL,
    function(.x) .x[, "FALSE"]
  )
)

# Fit the cross-validated model on the training set
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_train)
```
Now, let’s check our evaluation metrics averaged across folds.
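A minimal sketch of that check, assuming the fitted `CV` object exposes its averaged scores through a field along these lines (the field name here is a guess, not a confirmed part of the API; consult the package documentation for the actual accessor):

```r
# Hypothetical accessor -- the real field name may differ
iris_cv_fitted$mean_metrics
```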
### Grid Search
Another common model-tuning method is grid search. We’ll use it to tune the `minsplit` and `maxdepth` parameters of our decision tree, choosing as optimal the hyper-parameters that maximize the ROC AUC on the validation set.
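Grid search fits one model for every combination of the candidate parameter values. A quick base R sketch confirms the size of the search space the code below will iterate over:

```r
# Every (minsplit, maxdepth) pair the search will evaluate
grid <- expand.grid(
  minsplit = seq(10, 30, by = 5),
  maxdepth = seq(20, 30, by = 2)
)
nrow(grid)
#> [1] 30
```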
```r
# Specify the grid search schema
iris_grid <- GridSearch$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  # Hold-out validation set used to score every candidate model
  evaluation_data = list(x = iris_test, y = iris_test$Species),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)

# Fit models across the grid
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data = iris_train
)
```
Let’s check out some details on our optimal decision tree model.
```r
cat(
  "Optimal Hyper-parameters:\n -",
  paste0(
    paste0(names(iris_grid_fitted$best_params), ": ", iris_grid_fitted$best_params),
    collapse = "\n - "
  ),
  "\nOptimal ROC AUC:",
  round(iris_grid_fitted$best_metric, 4)
)
#> Optimal Hyper-parameters:
#>  - minsplit: 10
#>  - maxdepth: 20
#> Optimal ROC AUC: 0.9253
```
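As a sanity check on the winning configuration, we can refit a plain `rpart` tree with those hyper-parameters and score it on the test set ourselves; this should closely reproduce the optimal ROC AUC reported above (modulo any differences in the grid search internals):

```r
# Refit with the selected hyper-parameters
best_tree <- rpart(
  Species ~ .,
  data = iris_train,
  method = "class",
  control = rpart.control(minsplit = 10, maxdepth = 20)
)

# Score the predicted probability of the event level ("FALSE") on the test set
test_probs <- predict(best_tree, newdata = iris_test, type = "prob")[, "FALSE"]
roc_auc_vec(iris_test$Species, test_probs)
```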
### Grid Search with Cross Validation
Finally, `modelselection` supports model tuning with grid search using cross-validation, instead of a hold-out validation set, to estimate each model’s true error rate. We’ll use cross-validation to tune the same parameters as above.
```r
# Specify the grid search schema with cross-validation
iris_grid_cv <- GridSearchCV$new(
  learner = rpart,
  learner_args = list(method = "class"),
  tune_params = list(
    minsplit = seq(10, 30, by = 5),
    maxdepth = seq(20, 30, by = 2)
  ),
  # 3-fold cross-validation replaces the hold-out evaluation set
  splitter = vfold_cv,
  splitter_args = list(v = 3),
  scorer = list(
    accuracy = accuracy_vec,
    auc = roc_auc_vec
  ),
  optimize_score = "max",
  prediction_args = list(
    accuracy = list(type = "class"),
    auc = list(type = "prob")
  ),
  convert_predictions = list(
    accuracy = NULL,
    auc = function(i) i[, "FALSE"]
  )
)

# Fit models across the grid
iris_grid_cv_fitted <- iris_grid_cv$fit(
  formula = Species ~ .,
  data = iris_train
)
```
Let’s check out some details on our optimal decision tree model.
```r
cat(
  "Optimal Hyper-parameters:\n -",
  paste0(
    paste0(
      names(iris_grid_cv_fitted$best_params),
      ": ",
      iris_grid_cv_fitted$best_params
    ),
    collapse = "\n - "
  ),
  "\nOptimal ROC AUC:",
  round(iris_grid_cv_fitted$best_metric, 4)
)
#> Optimal Hyper-parameters:
#>  - minsplit: 20
#>  - maxdepth: 30
#> Optimal ROC AUC: 0.9722
```
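Note that both the selected hyper-parameters and the optimal score differ from the hold-out-set search above; the cross-validated estimate averages performance over three validation folds of the training data, so neither the winning configuration nor its score needs to agree with a single hold-out split.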
## Parallelization
As noted above, `modelselection` is built on top of the `future.apply` package and can utilize any parallelization method supported by the `future` package when fitting cross-validated models or tuning models with grid search. The code below evaluates the same cross-validated binary classification model using local multi-core parallelization.
```r
# Initialize a local parallel (multisession) strategy
plan(multisession)

# Fit the cross-validated model in parallel
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_train)
```
And voila!
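When you’re done, you can revert to sequential execution with the standard `future` idiom:

```r
# Return to the default sequential strategy
plan(sequential)
```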