Double/Debiased Machine Learning for Treatment and Structural Parameters

V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins

Causal Inference

Obtain valid inferential statements about a low-dimensional parameter (ATE/LATE) in the presence of potentially high-dimensional confounding using modern ML methods.

Published

November 23, 2022

Double/Debiased Machine Learning, proposed by Chernozhukov et al., tackles the problem of inference on some parameter, \(\theta_0\), in the setting where there is a potentially high-dimensional set of control variables/confounders. To do so, this method harnesses generic ML algorithms which perform well in high-dimensional settings. DML can be applied to learn the ATE and LATE in a partially linear regression model and partially linear instrumental variables model, respectively. It can also be used to estimate the ATE and LATE in the interactive model and interactive instrumental variables model, where treatment effects are fully heterogeneous.

This post is going to show how to implement estimation of the ATE and LATE in both the partially linear and interactive settings. This will be a pretty minimal prototype, implemented entirely in base R. I’ll use a couple other packages for the ML methods, but which models you choose to use are entirely separate from the general method outlined here.

Packages

library(estimatr)
library(glmnet)
library(hdm)
library(ranger)

Data

For clarity, we’ll use one of the same empirical examples used in the actual paper. This example is described as follows:

We use the DML method to estimate the effect of 401(k) eligibility, the treatment variable, and 401(k) participation, a self‐selected decision to receive the treatment that we instrument for with assignment to the treatment state, on accumulated assets. In this example, the treatment variable is not randomly assigned and we aim to eliminate the potential biases due to the lack of random assignment by flexibly controlling for a rich set of variables.

data(pension)
pension <- pension[sample(1:nrow(pension), nrow(pension)), ]

# Construct model inputs
form <- ~ (
  poly(age, 2) + poly(inc, 2) + poly(educ, 2) + poly(fsize,2)
  + as.factor(marr) + as.factor(twoearn) + as.factor(db) + as.factor(pira)
  + as.factor(hown)
)^2
X <- model.matrix(form, data = pension)
X_RF <- pension[, c("age", "inc", "fsize", "educ", "db", "marr", "twoearn", "pira", "hown")]
D <- pension$e401
Y <- pension$net_tfa

Now, let’s define a few simple functions for applying three different ML models (Random Forest, LASSO, and ElasticNet with \(\alpha = 0.5\)) to our data. These three models will be used across each of the four DML estimation strategies.

# LASSO learner and prediction functions
y_learner_lasso <- \(x, y) cv.glmnet(x, y, nfolds = 5)
d_learner_lasso <- \(x, y) cv.glmnet(x, y, nfolds = 5, family = "binomial")
pred_fn_lasso <- \(mod, data) predict(mod, data,"lambda.min", "response")

# ElasticNet learner and prediction functions
y_learner_enet <- \(x, y) cv.glmnet(x, y, nfolds = 5, alpha = 0.5)
d_learner_enet <- \(x, y) cv.glmnet(x, y, nfolds = 5, alpha = 0.5, family = "binomial")
pred_fn_enet <- \(mod, data) predict(mod, data, "lambda.min", "response")

# Random Forest learner and prediction functions
y_learner_rf <- \(x, y) ranger(x = x, y = y, num.trees = 1000)
d_learner_rf <- \(x, y) ranger(x = x, y = y, num.trees = 1000, probability = TRUE)
pred_fn_rf_y <- \(mod, data) predict(mod, data)$predictions
pred_fn_rf_d <- \(mod, data) predict(mod, data)$predictions[, "1"]

Estimating the ATE of 401(k) Eligibility on Net Financial Assets

In the example in this paper, we use the same data as in Chernozhukov and Hansen (2004). We use net financial assets – defined as the sum of IRA balances, 401(k) balances, checking accounts, US saving bonds, other interest‐earning accounts in banks and other financial institutions, other interest‐earning assets (such as bonds held personally), stocks, and mutual funds less non‐mortgage debt – as the outcome variable, Y, in our analysis. Our treatment variable, D, is an indicator for being eligible to enroll in a 401(k) plan. The vector of raw covariates, X, consists of age, income, family size, years of education, a married indicator, a two‐earner status indicator, a defined benefit pension status indicator, an IRA participation indicator, and a home‐ownership indicator.

Partially Linear Regression Model

First we will calculate the ATE of 401(k) eligibility on net financial assets in the partially linear model.

The following function implements DML for the partially linear model.

dml_plm <- function(X, Y, D, y_learner, d_learner, pred_fn_y, pred_fn_d = pred_fn_y) {
  
  # Create folds for sample splitting
  fold1 <- sample(1:nrow(X), size = round((1/3)*nrow(X)))
  fold2 <- sample((1:nrow(X))[-fold1], size = round((1/3)*nrow(X)))
  fold3 <- (1:nrow(X))[-c(fold1, fold2)]
  
  # Create data.frame to store residuals
  resids <- data.frame()
  
  # Generate outcome model across folds
  for (fold in list(fold1, fold2, fold3)) {
    
    # Get the training/prediction indices
    idx_train <- (1:nrow(X))[-fold]
    idx_predict <- fold
    
    # Outcome prediction model
    outcome_model <- y_learner(
      x = X[idx_train, , drop = FALSE],
      y = Y[idx_train]
    )
    
    # Residualize outcome
    outcome_predictions <- pred_fn_y(outcome_model, X[idx_predict, , drop = FALSE])
    outcome_resids <- Y[idx_predict] - outcome_predictions
    
    # Treatment prediction model
    treatment_model <- d_learner(
      x = X[idx_train, , drop = FALSE],
      y = D[idx_train]
    )
    
    # Residualize treatment
    treatment_predictions <- pred_fn_d(treatment_model, X[idx_predict, , drop = FALSE])
    treatment_predictions <- pmin(pmax(treatment_predictions, 0.01), .99)
    treatment_resids <- (
      # This is necessary to deal with cases where D is a factor
      as.numeric(as.character(D[idx_predict])) - treatment_predictions
    )

    # Collect residuals
    new_resids <- data.frame(
      "y_resid" = unname(outcome_resids),
      "d_resid" = unname(treatment_resids),
      "y_hat" = unname(outcome_predictions),
      "d_hat" = unname(treatment_predictions),
      "idx" = idx_predict
    )
    resids <- do.call(rbind, list(resids, new_resids))
  }
  
  # Return data in original ordering
  return(resids[order(resids[, "idx", drop = TRUE]), , drop = FALSE])
  
}

Now that we’ve defined the function, let’s estimate the ATE using our three different ML methods.

# Estimate DML first stage
set.seed(123)
first_stage_lasso <- dml_plm(
  X = X,
  Y = Y,
  D = D,
  y_learner = y_learner_lasso,
  d_learner = d_learner_lasso,
  pred_fn_y = pred_fn_lasso
)
first_stage_enet <- dml_plm(
  X = X,
  Y = Y,
  D = D,
  y_learner = y_learner_enet,
  d_learner = d_learner_enet,
  pred_fn_y = pred_fn_enet
)
first_stage_rf <- dml_plm(
  X = X_RF,
  Y = Y,
  D = factor(D),
  y_learner = y_learner_rf,
  d_learner = d_learner_rf,
  pred_fn_y = pred_fn_rf_y,
  pred_fn_d = pred_fn_rf_d
)

# Estimate the ATE using OLS
ate_lasso <- lm_robust(y_resid ~ d_resid, first_stage_lasso, se_type = "HC1")
ate_enet <- lm_robust(y_resid ~ d_resid, first_stage_enet, se_type = "HC1")
ate_rf <- lm_robust(y_resid ~ d_resid, first_stage_rf, se_type = "HC1")

	LASSO	ElasticNet	RandomForest
Estimate	9738.496	9816.064	8986.493
Std. Error	1372.968	1412.982	1274.455
RMSE Y	53255.882	53939.421	54589.826
RMSE D	0.444	0.444	0.447

Interactive Regression Model

Now, we’ll define the function to estimate the ATE in the fully heterogeneous model.

dml_irm <- function(X,
                    Y,
                    D,
                    y_learner,
                    d_learner,
                    pred_fn_y,
                    pred_fn_d = pred_fn_y) {
  
  # Create folds for sample splitting
  fold1 <- sample(1:nrow(X), size = round((1/3)*nrow(X)))
  fold2 <- sample((1:nrow(X))[-fold1], size = round((1/3)*nrow(X)))
  fold3 <- (1:nrow(X))[-c(fold1, fold2)]
  
  # Create data.frame to store residuals
  resids <- data.frame()
  
  # Generate outcome model across folds
  for (fold in list(fold1, fold2, fold3)) {
    
    # Get the training/prediction indices
    idx_train <- (1:nrow(X))[-fold]
    idx_predict <- fold
    numD <- as.numeric(as.character(D))
    
    # Treatment prediction model
    treatment_model <- d_learner(
      x = X[idx_train, , drop = FALSE],
      y = D[idx_train]
    )
    
    # Residualize treatment
    treatment_predictions <- pred_fn_d(treatment_model, X[idx_predict, , drop = FALSE])
    treatment_predictions <- pmin(pmax(treatment_predictions, 0.01), .99)
    treatment_resids <- (numD[idx_predict] - treatment_predictions)
    
    # Outcome prediction model
    XD <- cbind("D" = numD, X)
    outcome_model <- y_learner(
      x = XD[idx_train, , drop = FALSE],
      y = Y[idx_train]
    )
    
    # Outcome predictions under control
    XD[, "D"] <- 0
    outcome_preds_d0 <- pred_fn_y(outcome_model, XD[idx_predict, , drop = FALSE])
    
    # Outcome predictions under treatment
    XD[, "D"] <- 1
    outcome_preds_d1 <- pred_fn_y(outcome_model, XD[idx_predict, , drop = FALSE])
    outcome_predictions <- (
      numD[idx_predict]*outcome_preds_d1
      + (1 - numD[idx_predict])*outcome_preds_d0
    )
    
    # Calculate individual level weights
    outcome_resids <- Y[idx_predict] - outcome_predictions
    inv_prop_scores <- (
      numD[idx_predict]/treatment_predictions
      - (1 - numD[idx_predict])/(1 - treatment_predictions)
    )
    expected_diff <- outcome_preds_d1 - outcome_preds_d0
    effect_estimates <- expected_diff + inv_prop_scores*outcome_resids
    
    # Collect residuals
    new_resids <- data.frame(
      "effect_estimates" = unname(effect_estimates),
      "y_resid" = unname(outcome_resids),
      "d_resid" = unname(treatment_resids),
      "y_hat" = unname(outcome_predictions),
      "d_hat" = unname(treatment_predictions),
      "idx" = idx_predict
    )
    resids <- do.call(rbind, list(resids, new_resids))
  }
  
  # Return data in original ordering
  return(resids[order(resids[, "idx", drop = TRUE]), , drop = FALSE])
  
}

Let’s estimate the ATE!

# Estimate DML first stage
set.seed(123)
first_stage_lasso <- dml_irm(
  X = X,
  Y = Y,
  D = D,
  y_learner = y_learner_lasso,
  d_learner = d_learner_lasso,
  pred_fn_y = pred_fn_lasso
)
first_stage_enet <- dml_irm(
  X = X,
  Y = Y,
  D = D,
  y_learner = y_learner_enet,
  d_learner = d_learner_enet,
  pred_fn_y = pred_fn_enet
)
first_stage_rf <- dml_irm(
  X = X_RF,
  Y = Y,
  D = factor(D),
  y_learner = y_learner_rf,
  d_learner = d_learner_rf,
  pred_fn_y = pred_fn_rf_y,
  pred_fn_d = pred_fn_rf_d
)

# Estimate the ATE using OLS
ate_lasso <- lm_robust(effect_estimates ~ 1, first_stage_lasso, se_type = "HC1")
ate_enet <- lm_robust(effect_estimates ~ 1, first_stage_enet, se_type = "HC1")
ate_rf <- lm_robust(effect_estimates ~ 1, first_stage_rf, se_type = "HC1")

	LASSO	ElasticNet	RandomForest
Estimate	9078.850	9311.623	8100.171
Std. Error	1412.741	1462.927	1149.504
RMSE Y	53085.113	53732.086	54427.880
RMSE D	0.444	0.444	0.447

Estimating the LATE of 401(k) Participation on Net Financial Assets

We now will estimate the LATE of 401(k) participation on net financial assets, using 401(k) eligibility as our instrument and 401(k) participation as our treatment.

D <- pension$p401
Z <- pension$e401

Partially Linear IV Model

The following function estimates the partially linear IV model.

dml_plivm <- function(X,
                      Y,
                      D,
                      Z,
                      y_learner,
                      d_learner,
                      z_learner = d_learner,
                      pred_fn_y,
                      pred_fn_d = pred_fn_y,
                      pred_fn_z = pred_fn_d) {
  
  # Create folds for sample splitting
  fold1 <- sample(1:nrow(X), size = round((1/3)*nrow(X)))
  fold2 <- sample((1:nrow(X))[-fold1], size = round((1/3)*nrow(X)))
  fold3 <- (1:nrow(X))[-c(fold1, fold2)]
  
  # Create data.frame to store residuals
  resids <- data.frame()
  
  # Generate outcome model across folds
  for (fold in list(fold1, fold2, fold3)) {
    
    # Get the training/prediction indices
    idx_train <- (1:nrow(X))[-fold]
    idx_predict <- fold
    
    # Convert treatment and instrument to numeric
    numD <- as.numeric(as.character(D[idx_predict]))
    numZ <- as.numeric(as.character(Z[idx_predict]))
    
    # Outcome prediction model
    outcome_model <- y_learner(
      x = X[idx_train, , drop = FALSE],
      y = Y[idx_train]
    )
    
    # Residualize outcome
    outcome_predictions <- pred_fn_y(outcome_model, X[idx_predict, , drop = FALSE])
    outcome_resids <- Y[idx_predict] - outcome_predictions
    
    # Treatment prediction model
    treatment_model <- d_learner(
      x = X[idx_train, , drop = FALSE],
      y = D[idx_train]
    )
    
    # Residualize treatment
    treatment_predictions <- pred_fn_d(treatment_model, X[idx_predict, , drop = FALSE])
    treatment_resids <- numD - treatment_predictions
    
    # Instrument prediction model
    instrument_model <- z_learner(
      x = X[idx_train, , drop = FALSE],
      y = Z[idx_train]
    )
    
    # Residualize instrument
    instrument_predictions <- pred_fn_z(instrument_model, X[idx_predict, , drop = FALSE])
    instrument_predictions <- pmin(pmax(instrument_predictions, 0.01), .99)
    instrument_resids <- numZ - instrument_predictions
    
    # Collect residuals
    new_resids <- data.frame(
      "y_resid" = unname(outcome_resids),
      "d_resid" = unname(treatment_resids),
      "z_resid" = unname(instrument_resids),
      "y_hat" = unname(outcome_predictions),
      "d_hat" = unname(treatment_predictions),
      "z_hat" = unname(instrument_predictions),
      "idx" = idx_predict
    )
    resids <- do.call(rbind, list(resids, new_resids))
  }
  
  # Return data in original ordering
  return(resids[order(resids[, "idx", drop = TRUE]), , drop = FALSE])
  
}

Let’s estimate the LATE now!

# Estimate DML first stage
set.seed(123)
first_stage_lasso <- dml_plivm(
  X = X,
  Y = Y,
  D = D,
  Z = Z,
  y_learner = y_learner_lasso,
  d_learner = d_learner_lasso,
  pred_fn_y = pred_fn_lasso
)
first_stage_enet <- dml_plivm(
  X = X,
  Y = Y,
  D = D,
  Z = Z,
  y_learner = y_learner_enet,
  d_learner = d_learner_enet,
  pred_fn_y = pred_fn_enet
)
first_stage_rf <- dml_plivm(
  X = X_RF,
  Y = Y,
  D = factor(D),
  Z = factor(Z),
  y_learner = y_learner_rf,
  d_learner = d_learner_rf,
  pred_fn_y = pred_fn_rf_y,
  pred_fn_d = pred_fn_rf_d
)

# TSLS with LASSO residuals
d_hat_lasso <- lm(d_resid ~ z_resid, first_stage_lasso)$fitted.values
ate_lasso <- lm_robust(first_stage_lasso$y_resid ~ d_hat_lasso, se_type = "HC1")

# TSLS with LASSO residuals
d_hat_enet <- lm(d_resid ~ z_resid, first_stage_enet)$fitted.values
ate_enet <- lm_robust(first_stage_enet$y_resid ~ d_hat_enet, se_type = "HC1")

# TSLS with RF residuals
d_hat_rf <- lm(d_resid ~ z_resid, first_stage_rf)$fitted.values
ate_rf <- lm_robust(first_stage_rf$y_resid ~ d_hat_rf, se_type = "HC1")

	LASSO	ElasticNet	RandomForest
Estimate	13982.593	13809.441	12781.930
Std. Error	1980.323	1977.618	1938.429
RMSE Y	53280.924	53123.786	56275.848
RMSE D	0.415	0.414	0.418
RMSE Z	0.444	0.444	0.449

Interactive IV Model

Finally, the following function estimates the interactive IV model.

dml_iivm <- function(X,
                     Y,
                     D,
                     Z,
                     y_learner,
                     d_learner,
                     z_learner = d_learner,
                     pred_fn_y,
                     pred_fn_d = pred_fn_y,
                     pred_fn_z = pred_fn_d,
                     always_takers = TRUE,
                     never_takers = TRUE) {
  
  # Create folds for sample splitting
  fold1 <- sample(1:nrow(X), size = round((1/3)*nrow(X)))
  fold2 <- sample((1:nrow(X))[-fold1], size = round((1/3)*nrow(X)))
  fold3 <- (1:nrow(X))[-c(fold1, fold2)]
  
  # Create data.frame to store residuals
  resids <- data.frame()
  
  # Generate outcome model across folds
  for (fold in list(fold1, fold2, fold3)) {
    
    # Get the training/prediction indices
    idx_train <- (1:nrow(X))[-fold]
    idx_predict <- fold
    
    # Convert treatment and instrument to numeric
    numD <- as.numeric(as.character(D))
    numZ <- as.numeric(as.character(Z))
    
    # Outcome model
    XZ <- cbind("Z" = numZ, X)
    outcome_model <- y_learner(
      x = XZ[idx_train, , drop = FALSE],
      y = Y[idx_train]
    )
    
    # Outcome predictions under Instrument == 0
    XZ[, "Z"] <- 0
    outcome_preds_z0 <- pred_fn_y(outcome_model, XZ[idx_predict, , drop = FALSE])
    
    # Outcome predictions under Instrument == 1
    XZ[, "Z"] <- 1
    outcome_preds_z1 <- pred_fn_y(outcome_model, XZ[idx_predict, , drop = FALSE])
    
    # Outcome predictions and residuals
    outcome_predictions <- (
      numZ[idx_predict]*outcome_preds_z1
      + (1 - numZ[idx_predict])*outcome_preds_z0
    )
    outcome_resids <- Y[idx_predict] - outcome_predictions
    
    # Treatment model
    XZ <- cbind("Z" = numZ, X)
    treatment_model <- d_learner(
      x = XZ[idx_train, , drop = FALSE],
      y = D[idx_train]
    )
    
    # Treatment predictions under Instrument == 0
    if (always_takers == FALSE) {
      treatment_preds_z0 <- rep(0, length(D[idx_predict]))
    } else {
      XZ[, "Z"] <- 0
      treatment_preds_z0 <- pred_fn_d(treatment_model, XZ[idx_predict, , drop = FALSE])
    }
    
    # Treatment predictions under Instrument == 1
    if (never_takers == FALSE) {
      treatment_preds_z1 <- rep(1, length(D[idx_predict]))
    } else {
      XZ[, "Z"] <- 1
      treatment_preds_z1 <- pred_fn_d(treatment_model, XZ[idx_predict, , drop = FALSE])
    }
    
    # Treatment predictions and residuals
    treatment_predictions <- (
      numZ[idx_predict]*treatment_preds_z1
      + (1 - numZ[idx_predict])*treatment_preds_z0
    )
    treatment_resids <- numD[idx_predict] - treatment_predictions
    
    # Instrument prediction model
    instrument_model <- z_learner(
      x = X[idx_train, , drop = FALSE],
      y = Z[idx_train]
    )
    
    # Instrument predictions and residuals
    instrument_predictions <- pred_fn_z(instrument_model, X[idx_predict, , drop = FALSE])
    treatment_predictions <- pmin(pmax(treatment_predictions, 0.01), .99)
    instrument_resids <- numZ[idx_predict] - instrument_predictions
    
    # Construct effect estimates
    effect_estimates <- (
      (outcome_preds_z1 - outcome_preds_z0)
      + (numZ[idx_predict]*(Y[idx_predict] - outcome_preds_z1))
        /instrument_predictions
      - ((1 - numZ[idx_predict])*(Y[idx_predict] - outcome_preds_z0))
        /(1 - instrument_predictions)
      - (
        (treatment_preds_z1 - treatment_preds_z0)
        + (numZ[idx_predict]*(numD[idx_predict] - treatment_preds_z1))
          /instrument_predictions
        - ((1 - numZ[idx_predict])*(numD[idx_predict] - treatment_preds_z0))
          /(1 - instrument_predictions)
      )
    )
    
    # Collect residuals
    new_resids <- data.frame(
      "effect_estimates" = unname(effect_estimates),
      "y_resid" = unname(outcome_resids),
      "d_resid" = unname(treatment_resids),
      "z_resid" = unname(instrument_resids),
      "y_hat" = unname(outcome_predictions),
      "d_hat" = unname(treatment_predictions),
      "z_hat" = unname(instrument_predictions),
      "idx" = idx_predict
    )
    resids <- do.call(rbind, list(resids, new_resids))
  }
  
  # Return data in original ordering
  return(resids[order(resids[, "idx", drop = TRUE]), , drop = FALSE])
  
}

Let’s estimate the LATE now!

# Estimate DML first stage
set.seed(123)
first_stage_lasso <- dml_iivm(
  X = X,
  Y = Y,
  D = D,
  Z = Z,
  y_learner = y_learner_lasso,
  d_learner = d_learner_lasso,
  pred_fn_y = pred_fn_lasso,
  always_takers = FALSE
)
first_stage_enet <- dml_iivm(
  X = X,
  Y = Y,
  D = D,
  Z = Z,
  y_learner = y_learner_enet,
  d_learner = d_learner_enet,
  pred_fn_y = pred_fn_enet,
  always_takers = FALSE
)
first_stage_rf <- dml_iivm(
  X = X_RF,
  Y = Y,
  D = factor(D),
  Z = factor(Z),
  y_learner = y_learner_rf,
  d_learner = d_learner_rf,
  pred_fn_y = pred_fn_rf_y,
  pred_fn_d = pred_fn_rf_d,
  always_takers = FALSE
)

# ATEs
ate_lasso <- lm_robust(first_stage_lasso$effect_estimates ~ 1, se_type = "HC1")
ate_enet <- lm_robust(first_stage_enet$effect_estimates ~ 1, se_type = "HC1")
ate_rf <- lm_robust(first_stage_rf$effect_estimates ~ 1, se_type = "HC1")

	LASSO	ElasticNet	RandomForest
Estimate	9155.272	8686.631	7687.822
Std. Error	1458.567	1323.662	1213.721
RMSE Y	53083.322	52912.153	55839.966
RMSE D	0.275	0.275	0.277
RMSE Z	0.444	0.444	0.449

Conclusion

And that’s it! There are some additional tidbits, for example estimating the ATTE instead of the ATE, estimating confidence intervals with a multiplier bootstrap, and dealing with multiple treatments, but this has covered (IMO) the essentials. This really helped me solidify the underlying estimation strategy, which I found very helpful.