
Basic usage
The goal of this vignette is to walk through modeltuning
usage in detail. We'll be training a classification model on the
iris dataset to predict whether a flower's species is
virginica or not.
Load Packages
library(e1071)
library(modeltuning) # devtools::install_github("dmolitor/modeltuning")
library(yardstick)

Data Prep
First, let's generate a larger synthetic dataset by stacking ten
copies of the original iris data, adding random noise to the
features, and combining it all into one data frame.
iris_new <- do.call(
  what = rbind,
  args = replicate(n = 10, iris, simplify = FALSE)
) |>
  transform(
    Sepal.Length = jitter(Sepal.Length, 0.1),
    Sepal.Width = jitter(Sepal.Width, 0.1),
    Petal.Length = jitter(Petal.Length, 0.1),
    Petal.Width = jitter(Petal.Width, 0.1),
    Species = factor(Species == "virginica")
  )
# Shuffle the dataset
iris_new <- iris_new[sample(nrow(iris_new)), ]
# Quick overview of the dataset
summary(iris_new[, 1:4])
#  Sepal.Length    Sepal.Width     Petal.Length     Petal.Width
#  Min.   :4.299   Min.   :1.999   Min.   :0.9986   Min.   :0.09801
#  1st Qu.:5.101   1st Qu.:2.799   1st Qu.:1.5984   1st Qu.:0.29972
#  Median :5.799   Median :3.001   Median :4.3499   Median :1.30126
#  Mean   :5.843   Mean   :3.057   Mean   :3.7580   Mean   :1.19938
#  3rd Qu.:6.400   3rd Qu.:3.302   3rd Qu.:5.1002   3rd Qu.:1.80059
#  Max.   :7.901   Max.   :4.401   Max.   :6.9020   Max.   :2.50193

Function arguments
Common arguments
The same modeling approach holds for the CV,
GridSearch, and GridSearchCV classes, which are
all slight variations of each other. Their common arguments are as
follows:
- learner: The predictive modeling function; in our case a Support Vector Machine, so e1071::svm from the e1071 package.
- scorer: A named list of metric functions that evaluate the model's predictive performance. Each metric function should have two arguments, truth and estimate, that take the true outcome values and the predicted outcome values, and it should output a scalar numeric score. The yardstick package provides a wide array of these metric functions that should cover most common cases. E.g. for the RMSE of a regression, scorer = list(rmse = yardstick::rmse_vec).
- learner_args: A named list of function arguments that get passed directly to the learner function. For example, the e1071::svm function takes a type argument specifying whether it is a regression or classification task. You could specify a classification task as learner_args = list(type = "C-classification").
- scorer_args: A named list of function arguments to pass to the scorer functions in scorer. This list should have one element per element in scorer. E.g. if scorer = list(rmse = rmse_vec, mae = mae_vec) then scorer_args = list(rmse = list(...), mae = list(...)).
- prediction_args: Similar to learner_args, a named list of function arguments passed to the predict method. E.g. our SVM learner's predict method has a probability argument that controls whether to predict outcome classes or class probabilities. With a single metric named roc_auc in scorer, specify class probabilities as prediction_args = list(roc_auc = list(probability = TRUE)). Like scorer_args, this list should have one element per element in scorer.
- convert_predictions: A named list of functions to transform the output of predict(...) into a vector of predictions, since the model's predicted values may not always be a vector. E.g. predict(svm_model, probability = TRUE) returns the predicted classes with a matrix of class probabilities for both classes attached as an attribute, while predict(svm_model, probability = FALSE) returns a plain vector. Calculating model accuracy needs class predictions, while ROC AUC requires class probabilities. Suppose that scorer = list(accuracy = accuracy_vec, auc = roc_auc_vec). To ensure that accuracy gets class predictions and ROC AUC gets class probabilities, you can provide the corresponding prediction arguments prediction_args = list(accuracy = NULL, auc = list(probability = TRUE)) and then convert those predictions into a vector with convert_predictions = list(accuracy = NULL, auc = function(.x) attr(.x, "probabilities")[, "FALSE"]). A short sketch pulling these arguments together follows this list.
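To make the accuracy/ROC AUC example concrete, here is a minimal sketch of how these common arguments fit together for this vignette's SVM learner. The list name common_args is purely illustrative, and the "FALSE" column reflects the Species factor built above, whose first level (yardstick's default event) is FALSE.

# Illustrative only: a plain list bundling the common arguments, assuming
# the SVM setup above (Species is a two-level factor with levels FALSE/TRUE)
common_args <- list(
  learner = svm,
  # Passed straight through to e1071::svm()
  learner_args = list(type = "C-classification", probability = TRUE),
  # Accuracy scores class predictions; ROC AUC scores class probabilities
  scorer = list(accuracy = accuracy_vec, auc = roc_auc_vec),
  prediction_args = list(accuracy = NULL, auc = list(probability = TRUE)),
  # predict() on an e1071 SVM attaches probabilities as an attribute;
  # extract the column for the first factor level, "FALSE"
  convert_predictions = list(
    accuracy = NULL,
    auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
  )
)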
Cross validation arguments
The following arguments are specific to the CV
cross-validation class.
- splitter: A function that takes the training dataset and returns a list of cross-validation indices. modeltuning provides a very simple function, cv_split(), that handles the most basic version of this; a hypothetical hand-rolled alternative is sketched after this list.
- splitter_args: A named list of function arguments to pass to splitter. E.g. cv_split() has the argument v, which specifies the number of cross-validation folds. For 3 folds, set splitter_args = list(v = 3).
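For illustration, here is a hypothetical hand-rolled splitter. It assumes, per the description above, that a splitter takes the training data and returns a list of held-out row indices, one element per fold; in practice cv_split() already provides this.

# Hypothetical custom splitter: return a list with one vector of
# held-out row indices per fold (the contract described above)
random_folds <- function(data, v = 5) {
  # Randomly assign every row to one of v folds
  fold_id <- sample(rep(seq_len(v), length.out = nrow(data)))
  lapply(seq_len(v), function(i) which(fold_id == i))
}

# Usable in place of cv_split: splitter = random_folds, splitter_args = list(v = 3)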
Grid search arguments
The following arguments are specific to the GridSearch
class.
- evaluation_data: A list containing validation data, e.g. list(x = x_eval, y = y_eval).
- optimize_score: One of "max" or "min"; whether to maximize or minimize the metric defined in scorer when choosing the optimal grid search parameters. Note: if you specify multiple metric functions in scorer, modeltuning will use the last metric function to find the optimal parameters, as sketched below.
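A quick sketch of the shapes involved (the eval_rows split here is hypothetical): with two metrics in scorer, only the last one, roc_auc, determines the winning parameters.

# Hypothetical evaluation split; column 5 of iris_new is the Species outcome
eval_rows <- 1:100
evaluation_data <- list(x = iris_new[eval_rows, -5], y = iris_new[eval_rows, 5])

# With two metrics, the LAST one listed (roc_auc) picks best_params
scorer <- list(accuracy = accuracy_vec, roc_auc = roc_auc_vec)
optimize_score <- "max"  # i.e. maximize roc_auc over the parameter grid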
Examples
We'll show simple examples of each of the CV,
GridSearch, and GridSearchCV classes.
CV
iris_cv <- CV$new(
  learner = svm,
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"])
)
# Fit the cross-validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)
iris_cv_fitted$mean_metrics
# $roc_auc
# [1] 0.9983162

GridSearch
iris_new_train <- iris_new[1:1000, ]
iris_new_eval <- iris_new[1001:nrow(iris_new), ]
iris_grid <- GridSearch$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  evaluation_data = list(x = iris_new_eval[, -5], y = iris_new_eval[, 5]),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
  optimize_score = "max"
)
# Fit the grid search model on the training split
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new_train)
iris_grid_fitted$best_params
# $cost
# [1] 6
#
# $kernel
# [1] "polynomial"

GridSearchCV
iris_grid <- GridSearchCV$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
  optimize_score = "max"
)
# Fit the cross-validated grid search model
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new)
iris_grid_fitted$best_params
# $cost
# [1] 6
#
# $kernel
# [1] "polynomial"