Scaling with AWS
This vignette walks through an example of scaling a modelselection analysis with AWS via the Paws SDK. The analysis depends on AWS credentials passed via environment variables. For a detailed outline of the different ways to set AWS credentials, check out this how-to document.
Data and Model Prep
Data Prep
We’ll be training a classification model on the iris dataset to predict whether a flower’s species is virginica or not. First, let’s generate a larger synthetic dataset by stacking ten copies of iris into one big data frame and adding a small amount of random noise to the four numeric features.
# Stack ten copies of iris, then jitter the numeric features
iris_new <- do.call(
  what = rbind,
  args = replicate(n = 10, iris, simplify = FALSE)
) |>
  transform(
    Sepal.Length = jitter(Sepal.Length, 0.1),
    Sepal.Width = jitter(Sepal.Width, 0.1),
    Petal.Length = jitter(Petal.Length, 0.1),
    Petal.Width = jitter(Petal.Width, 0.1),
    # Binary target: TRUE for virginica, FALSE otherwise
    Species = factor(Species == "virginica")
  )
# Shuffle the dataset
iris_new <- iris_new[sample(nrow(iris_new)), ]
# Quick overview of the dataset
summary(iris_new[, 1:4])
#   Sepal.Length    Sepal.Width     Petal.Length     Petal.Width
#  Min.   :4.299   Min.   :1.998   Min.   :0.9985   Min.   :0.09801
#  1st Qu.:5.101   1st Qu.:2.799   1st Qu.:1.5984   1st Qu.:0.30038
#  Median :5.799   Median :3.001   Median :4.3499   Median :1.30105
#  Mean   :5.843   Mean   :3.057   Mean   :3.7580   Mean   :1.19937
#  3rd Qu.:6.400   3rd Qu.:3.302   3rd Qu.:5.1000   3rd Qu.:1.80117
#  Max.   :7.901   Max.   :4.401   Max.   :6.9012   Max.   :2.50176
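Because the target is "virginica or not", the classes are not balanced, which is worth keeping in mind when reading accuracy and F-measure later on. The following is a small sketch that rebuilds only the label column (the name `species_flag` is illustrative and not part of the original analysis) and tabulates it:

```r
# jitter() only touches the numeric features, so the label counts can be
# checked from ten stacked copies of the original Species column alone.
species_flag <- factor(rep(iris$Species, times = 10) == "virginica")

# Absolute counts per class: 1000 FALSE, 500 TRUE
table(species_flag)

# Class proportions: two thirds FALSE, one third TRUE
prop.table(table(species_flag))
```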
Grid Search Specification
Now that we’ve got the data prepped, let’s specify our predictive modeling approach. For this analysis I’m going to train a Support Vector classifier using the e1071 package, and I’m going to use Grid Search in combination with 5-fold Cross-Validation to find the optimal values for the cost and kernel hyper-parameters.
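library(e1071)          # svm()
library(modelselection) # GridSearchCV
library(rsample)        # vfold_cv()
library(yardstick)      # accuracy_vec(), f_meas_vec(), roc_auc_vec()

iris_grid <- GridSearchCV$new(
  learner = svm,
  # Hyper-parameter values to search over
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  # Fixed arguments passed to svm() for every model
  learner_args = list(
    scale = TRUE,
    type = "C-classification",
    probability = TRUE
  ),
  splitter = vfold_cv,
  splitter_args = list(v = 5),
  scorer = list(
    accuracy = accuracy_vec,
    f_measure = f_meas_vec,
    auc = roc_auc_vec
  ),
  # Only the AUC scorer needs class probabilities from predict()
  prediction_args = list(
    accuracy = NULL,
    f_measure = NULL,
    auc = list(probability = TRUE)
  ),
  # yardstick treats the first factor level ("FALSE") as the event by
  # default, so extract that column of the probability matrix
  convert_predictions = list(
    accuracy = NULL,
    f_measure = NULL,
    auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
  ),
  optimize_score = "max"
)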
Now that we’ve specified our Grid Search schema, let’s check out the hyper-parameter grid and see how many models we’re going to estimate.
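As a rough, package-independent way to see the size of the search, the same parameter lists can be expanded with base R. This is only a sketch: `param_grid` and `n_folds` are illustrative names, and the count below covers cross-validation fits only, under the assumption that every grid row is fit once per fold.

```r
# Same hyper-parameter values as in the GridSearchCV specification
tune_params <- list(
  cost = c(0.01, 0.1, 0.5, 1, 3, 6),
  kernel = c("polynomial", "radial", "sigmoid")
)

# One row per (cost, kernel) combination
param_grid <- expand.grid(tune_params, stringsAsFactors = FALSE)
head(param_grid)

# 6 cost values x 3 kernels = 18 candidate models
nrow(param_grid)

# With 5-fold cross-validation, that is 18 x 5 = 90 model fits
n_folds <- 5
nrow(param_grid) * n_folds
```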
Launch AWS Resources
To speed up the estimation of our models, let’s create a remote cluster of 6 worker nodes to estimate the models in parallel.
Launch EC2 Instances
First, we will launch 6 instances using a custom AMI that contains R 4.1.3 and a set of essential R packages. This AMI is not available as a community AMI, but there are good AMIs out there that have a comprehensive set of R packages and corresponding tools installed. Note: the parameters you need to specify when launching EC2 instances may vary greatly depending on your account’s security configuration.
library(paws)

ec2_client <- ec2()
# `key_name` (name of an EC2 key pair) and `security_group` (security group
# ID(s)) are account-specific values that must be defined beforehand
# Request Instances
instance_req <- ec2_client$run_instances(
  ImageId = "ami-06dd49fc9e3a5acee",
  InstanceType = "t2.large",
  KeyName = key_name,
  MaxCount = 6,
  MinCount = 6,
  InstanceInitiatedShutdownBehavior = "terminate",
  SecurityGroupIds = security_group,
  # This names the instances
  TagSpecifications = list(
    list(
      ResourceType = "instance",
      Tags = list(
        list(
          Key = "Name",
          Value = "Worker Node"
        )
      )
    )
  )
)
Now that we’ve launched the instances, we need to wait until they all report a state of "running" before we try to do anything. We also need to give the instances some extra time (roughly a minute in total) to initialize, or they’ll reject our SSH login attempts.
# Chalk up a quick function to return instance IDs from our request
instance_ids <- function(response) {
  vapply(response$Instances, function(i) i$InstanceId, character(1))
}
# Wait for instances to all respond as 'running'
while (
  !all(
    vapply(
      ec2_client$
        describe_instances(InstanceIds = instance_ids(instance_req))$
        Reservations[[1]]$
        Instances,
      function(i) i$State$Name,
      character(1)
    ) == "running"
  )
) {
  Sys.sleep(5)
}
# Rough heuristic -- give additional 45 seconds for instances to initialize
Sys.sleep(45)
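The loop above polls indefinitely, which is fine for a demo but can hang if an instance never reaches "running". A minimal sketch of a polling helper with a timeout follows. Both `wait_until()` and `instance_states()` are hypothetical helpers written for this illustration (they are not part of paws), and `fake_response` only mimics the nested list shape used above so the example runs without AWS access.

```r
# Repeatedly call `check()` until it returns TRUE or `timeout` seconds pass
wait_until <- function(check, interval = 5, timeout = 300) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(invisible(TRUE))
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds", call. = FALSE)
    }
    Sys.sleep(interval)
  }
}

# Collect instance states across every reservation in a response
instance_states <- function(response) {
  unlist(lapply(response$Reservations, function(r) {
    vapply(r$Instances, function(i) i$State$Name, character(1))
  }))
}

# Stand-in for a describe_instances() response (assumed shape)
fake_response <- list(
  Reservations = list(
    list(Instances = list(
      list(State = list(Name = "running")),
      list(State = list(Name = "pending"))
    )),
    list(Instances = list(
      list(State = list(Name = "running"))
    ))
  )
)

instance_states(fake_response)
all(instance_states(fake_response) == "running")  # FALSE: one is pending

# With a real client, the check could look like this (not run here):
# wait_until(function() {
#   resp <- ec2_client$describe_instances(
#     InstanceIds = instance_ids(instance_req)
#   )
#   all(instance_states(resp) == "running")
# })
```

Flattening over all reservations avoids relying on `Reservations[[1]]`, which holds here because every instance came from a single `run_instances()` call.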
Create Cluster
Now, in order to set up our compute cluster we need to get the IP addresses from these instances.
# Get public IPs
inst_public_ips <- vapply(
  ec2_client$
    describe_instances(InstanceIds = instance_ids(instance_req))$
    Reservations[[1]]$
    Instances,
  function(i) i$PublicIpAddress,
  character(1)
)
Finally, we can create a compute cluster on these worker nodes via SSH.
library(parallelly) # makeClusterPSOCK()

# `pem_fp` is the local file path to the private key of the EC2 key pair
cl <- makeClusterPSOCK(
  workers = inst_public_ips,
  user = "ubuntu",
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", pem_fp
  ),
  connectTimeout = 25,
  tries = 3
)
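Before sending real work to the cluster, it can help to confirm that every worker responds. The sketch below uses two local workers as a stand-in (`cl_local` is an illustrative name) so it runs without AWS; in the setting above, the same `clusterCall()` would presumably be issued against `cl` instead.

```r
library(parallel)

# Stand-in cluster: two R sessions on the local machine
cl_local <- makeCluster(2)

# Ask each worker for its host name, R version, and core count
node_info <- clusterCall(cl_local, function() {
  list(
    host = Sys.info()[["nodename"]],
    r_version = R.version.string,
    cores = parallel::detectCores()
  )
})

# One entry per worker that answered
length(node_info)
node_info[[1]]$r_version

# Shut the stand-in workers down again
stopCluster(cl_local)
```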
Estimate Models
Now that we’ve created our compute cluster, we can use the future package to specify our parallelization plan. Since modelselection is built on top of the future framework, it will automatically parallelize the model estimation across our 6-worker cluster. The parallelization topology used here tells future to parallelize the grid-search models across the compute cluster, and to parallelize each model’s cross-validation across the cores of the instance it is being evaluated on.
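A two-level topology like the one described above could be declared roughly as follows. This is a sketch, assuming `cl` is the cluster object returned by `makeClusterPSOCK()`; `set_nested_plan()` is a hypothetical wrapper introduced here for illustration and is not part of future or modelselection.

```r
library(future)

# Outer level: distribute futures across the workers in `cl`.
# Inner level: on each worker, use multiple local R sessions.
set_nested_plan <- function(cl) {
  plan(list(
    tweak(cluster, workers = cl),
    multisession
  ))
}

# Usage, once the cluster exists (not run here):
# set_nested_plan(cl)
#
# Afterwards, return to sequential processing:
# plan(sequential)
```

The first list element applies to futures created in the main session and the second to futures created inside those, which is how the grid search and the per-model cross-validation end up on different levels.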
Finally, let’s estimate our Grid Search models in parallel!
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data = iris_new,
  progress = TRUE
)
Best Model/Parameters
Let’s check out the info on our best model.
best_idx <- iris_grid_fitted$best_idx
metrics <- iris_grid_fitted$metrics
# Print model metrics of best model
cat(
  " Accuracy:", round(100 * metrics$accuracy[[best_idx]], 2),
  "%\nF-Measure:", round(100 * metrics$f_measure[[best_idx]], 2),
  "%\n AUC:", round(metrics$auc[[best_idx]], 4), "\n"
)
# Accuracy: 98.27 %
# F-Measure: 98.69 %
# AUC: 0.999
params <- iris_grid_fitted$best_params
# Print the best hyper-parameters
cat(
  " Optimal Cost:", params[["cost"]],
  "\nOptimal Kernel:", params[["kernel"]], "\n"
)
# Optimal Cost: 6
# Optimal Kernel: polynomial