Program Evaluation and Causal Inference with High-Dimensional Data

A. Belloni, V. Chernozhukov, I. Fernandez-Val, C. Hansen

ML
Causal Inference
Estimators and honest confidence bands for a variety of treatment effects, including local average treatment effects (LATE) and average treatment effects (ATE), in data-rich environments.
Published

November 8, 2022

Packages

library(dplyr)
library(glmnet)
library(here)
library(Matrix)
library(readr)

This post replicates (to a small degree) Chernozhukov et al.'s work on estimating causal effects in high-dimensional settings. That paper shows how to estimate treatment effects, including the LATE (in an instrumental-variables setting) and the ATE, and how to construct honest confidence bands around them. The authors demonstrate the methods in a specific empirical setting: a prior study that uses instrumental-variables estimation to quantify the effect of 401(k) participation on accumulated assets. The instrument is 401(k) eligibility, which is arguably exogenous after conditioning on income and other related confounders.

Study Data

First, I will import the data that this study is based on.

iv_data <- read_tsv(here("posts/causal-inf-high-dim/data/high-dim-iv.dat"))

# Variable transformations
iv_data <- iv_data |>
  mutate(
    age = (age - 25)/(64 - 25),
    inc = (inc + 2652)/(242124 + 2652),
    fsize = fsize/13,
    educ = educ/18
  )

The primary variables of interest are our continuous outcome \(Y=\) Total Assets (tw), our binary instrument \(Z=\) 401(k) Eligibility (e401), and our binary treatment \(D=\) 401(k) Participation (p401). The definitions of the remaining variables are somewhat obscure, and I'm not sure exactly what each one measures. However, the authors define their specification precisely, so I'm simply going to mirror what they use in the paper.

Model Specification

The full set of controls they construct is defined below as iv_control_spec. It's a pretty hairy model spec! Let's create the input matrix and check its dimensions.

iv_control_spec <- ~ (
  (
    marr + twoearn + db + pira + hown + age + I(age^2) + I(age^3)
    + educ + I(educ^2) + fsize + I(fsize^2)
  )^2
  * (i1 + i2 + i3 + i4 + i5 + i6 + inc + I(inc^2))
  + (i1 + i2 + i3 + i4 + i5 + i6 + inc + I(inc^2))^2
) - 1
iv_X <- sparse.model.matrix(iv_control_spec, data = iv_data)
N Observations: 9915; N Confounders: 738

While this isn’t quite the full set of potential controls considered in the paper, it should be close enough to get the point across.
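To get a feel for what this expansion produces, we can peek at a few of the generated column names (main effects, polynomial terms, and pairwise interactions):

head(colnames(iv_X), 10)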

First-Stage Estimates

Conditional Expected Outcomes

In order to calculate the LATE in our IV framework, we need estimates of the conditional expectations \(E[Y|Z=0,X]\), \(E[Y|Z=1,X]\), \(E[D|Z=1,X]\), and \(E[Z|X]\); we will use post-LASSO for each. (We don't need \(E[D|Z=0,X]\): it is identically zero, since only eligible workers can participate.) Let's get to estimating! As in the paper, we will fit the LASSO with a pre-determined, data-driven choice of regularization parameter, \(\lambda\).
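For reference, the penalty level in the code below follows the data-driven rule used in this literature,

\[
\lambda = 2c\sqrt{N}\,\Phi^{-1}\!\left(1 - \frac{\gamma}{2p}\right), \qquad c = 1.1, \quad \gamma = \frac{0.1}{\log(N)},
\]

where the code plugs in \(p = 2P\) (my reading of the factor of two, given that separate models are fit on the two instrument subsamples). The logit penalty is then divided by \(2N\), which I take to be a rescaling to glmnet's per-observation objective.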

N <- nrow(iv_X)
P <- ncol(iv_X)

# Data-driven regularization parameters, following the rule above;
# the logit penalty is rescaled to glmnet's per-observation objective
lambda <- 2.2 * sqrt(N) * qnorm(1 - (0.1/log(N))/(2 * (2 * P)))
logit_lambda <- lambda/(2 * N)

# Estimate models where instrument == 0

## Outcome Post-LASSO model
id_z0 <- iv_data$e401 == 0
ey_z0 <- glmnet(x = iv_X[id_z0, ], y = iv_data$tw[id_z0], lambda = lambda)
ey_z0_selected <- names(which(drop(coef(ey_z0)) != 0))
ey_z0_post_data <- cbind(tw = iv_data$tw[id_z0], as.matrix(iv_X[id_z0, ]))
ey_z0_post_form <- paste("tw ~", paste0(ey_z0_selected[-1], collapse = "+"))
ey_z0_lm <- lm(formula(ey_z0_post_form), data = as.data.frame(ey_z0_post_data))

## Treatment Post-LASSO model: not needed, since E[D|Z=0,X] = 0 (D = 1 only if Z = 1)

# Estimate models where instrument == 1

## Outcome Post-LASSO model
id_z1 <- iv_data$e401 == 1
ey_z1 <- glmnet(x = iv_X[id_z1, ], y = iv_data$tw[id_z1], lambda = lambda)
ey_z1_selected <- names(which(drop(coef(ey_z1)) != 0))
ey_z1_post_data <- cbind(tw = iv_data$tw[id_z1], as.matrix(iv_X[id_z1, ]))
ey_z1_post_form <- paste("tw ~", paste0(ey_z1_selected[-1], collapse = "+"))
ey_z1_lm <- lm(formula(ey_z1_post_form), data = as.data.frame(ey_z1_post_data))

## Treatment Post-LASSO model
ed_z1 <- glmnet(
  x = iv_X[id_z1, ],
  y = iv_data$p401[id_z1],
  family = "binomial",
  lambda = logit_lambda
)
ed_z1_selected <- names(which(drop(coef(ed_z1)) != 0))
ed_z1_post_data <- cbind(p401 = iv_data$p401[id_z1], as.matrix(iv_X[id_z1, ]))
ed_z1_post_form <- paste("p401 ~", paste0(ed_z1_selected[-1], collapse = "+"))
ed_z1_lm <- glm(
  formula(ed_z1_post_form),
  family = "binomial",
  data = as.data.frame(ed_z1_post_data)
)

# Estimate instrument as a function of X; Post-LASSO
ez <- glmnet(
  x = iv_X,
  y = iv_data$e401,
  family = "binomial",
  lambda = logit_lambda
)
ez_selected <- names(which(drop(coef(ez)) != 0))
ez_post_data <- cbind(e401 = iv_data$e401, as.matrix(iv_X))
ez_post_form <- paste("e401 ~", paste0(ez_selected[-1], collapse = "+"))
ez_lm <- glm(
  formula(ez_post_form),
  family = "binomial",
  data = as.data.frame(ez_post_data)
)

Calculate LATE

Now that we've estimated models for the expected values of \(Y\) and \(D\) under the different values of our instrument \(Z\), let's create a data.frame with the estimated expected values of these variables for every observation. As is standard (and as implemented in the paper), we will trim observations to ensure that the estimated propensities of our instrument are bounded away from \(0\) and \(1\).
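Concretely, the plug-in terms computed below are the orthogonal (doubly robust) moment contributions

\[
\alpha_Y(1) = \frac{Z\,\big(Y - \hat{E}[Y|Z=1,X]\big)}{\hat{E}[Z|X]} + \hat{E}[Y|Z=1,X], \qquad
\alpha_Y(0) = \frac{(1-Z)\,\big(Y - \hat{E}[Y|Z=0,X]\big)}{1 - \hat{E}[Z|X]} + \hat{E}[Y|Z=0,X],
\]

with \(\alpha_D(1)\) defined analogously (swapping \(D\) for \(Y\)) and \(\alpha_D(0) = 0\), since nobody can participate without being eligible. The LATE estimate is then the ratio of sample means, \(\overline{\alpha_Y(1) - \alpha_Y(0)} \,/\, \overline{\alpha_D(1) - \alpha_D(0)}\), computed in the next section.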

prediction_data <- as.data.frame(as.matrix(iv_X))
iv_expected_values <- data.frame(
  y = iv_data$tw,
  d = iv_data$p401,
  z = iv_data$e401,
  ey_z0 = predict(ey_z0_lm, prediction_data),
  ey_z1 = predict(ey_z1_lm, prediction_data),
  ed_z0 = 0,
  ed_z1 = predict(ed_z1_lm, prediction_data, type = "response"),
  ez = predict(ez_lm, prediction_data, type = "response")
)

# Trim instrument propensity scores -- No observations are dropped here
iv_expected_values <- iv_expected_values |>
  filter(ez >= 1e-12 & ez <= (1 - 1e-12))

# Estimate LATE plug-in values; the LATE itself is a ratio of the means
# of these terms, computed in the next section
iv_expected_values <- iv_expected_values |>
  mutate(
    ay_1 = z*(y - ey_z1)/ez + ey_z1,
    ay_0 = (1 - z)*(y - ey_z0)/(1 - ez) + ey_z0,
    ad_1 = z*(d - ed_z1)/ez + ed_z1,
    ad_0 = 0
  )

Confidence via Bootstrap

Now that we've computed the plug-in values, let's calculate the LATE and generate a confidence interval using the multiplier bootstrap described in the paper. In the output below, the analytic standard error appears in parentheses and the bootstrap standard error in braces.

# Calculate LATE
mean_ay_1 <- mean(iv_expected_values$ay_1)
mean_ay_0 <- mean(iv_expected_values$ay_0)
mean_ad_1 <- mean(iv_expected_values$ad_1)
mean_ad_0 <- mean(iv_expected_values$ad_0)
LATE <- (mean_ay_1 - mean_ay_0)/(mean_ad_1 - mean_ad_0)

# Confidence intervals: both analytic and bootstrap
# (note: the analytic SE below treats the denominator mean(ad_1 - ad_0) as fixed)
analytic_se <- sqrt(
  (1/(nrow(iv_expected_values) - 1))
  * sum(
    (
      (iv_expected_values$ay_1 - iv_expected_values$ay_0)
      /(mean_ad_1 - mean_ad_0)
      - LATE
    )^2
  )
  /nrow(iv_expected_values)
)

# Function to generate multiplier weights with mean 1 and variance 1
mw <- function(n) {
  1 + rnorm(n)/sqrt(2) + (rnorm(n)^2 - 1)/2
}
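These weights have mean 1 and variance 1: the two normal draws are independent, and each term contributes variance 1/2. A quick simulation check (purely for intuition, not part of the paper's procedure):

w <- mw(1e6)
c(mean = mean(w), var = var(w))  # both should be approximately 1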

bootstrap_LATEs <- vapply(
  1:500,
  function(i) {
    weights <- mw(nrow(iv_expected_values))
    (
      mean((iv_expected_values$ay_1 - iv_expected_values$ay_0)*weights)
      /mean((iv_expected_values$ad_1 - iv_expected_values$ad_0)*weights)
    )
  },
  numeric(1)
)
# Robust bootstrap SE: IQR of the draws rescaled by the standard normal IQR
bootstrap_se <- (
  (quantile(bootstrap_LATEs, .75) - quantile(bootstrap_LATEs, .25))
  / (qnorm(.75) - qnorm(.25))
)
LATE: 8391.53 (3305.03) {3098.99}

Conclusion

And voilà, we've estimated the LATE using IV estimation in a high-dimensional setting! A specific, but very useful, case of this general framework arises when we want to directly estimate the effect of a treatment variable that is conditionally exogenous. In that case, we can run the algorithm above with \(Z = D\); everything else stays exactly the same.
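To see why, note that with \(Z = D\) we have \(\hat{E}[D|Z=1,X] = 1\) and \(\hat{E}[D|Z=0,X] = 0\), so \(\alpha_D(1) - \alpha_D(0) = 1\) for every observation, and the ratio of means collapses to the familiar AIPW estimator of the ATE:

\[
\hat{\theta}_{\mathrm{ATE}} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{D_i\,\big(Y_i - \hat{\mu}_1(X_i)\big)}{\hat{e}(X_i)} + \hat{\mu}_1(X_i) - \frac{(1-D_i)\,\big(Y_i - \hat{\mu}_0(X_i)\big)}{1 - \hat{e}(X_i)} - \hat{\mu}_0(X_i)\right],
\]

where \(\hat{\mu}_d(X) = \hat{E}[Y|D=d,X]\) and \(\hat{e}(X) = \hat{E}[D|X]\) are the post-LASSO first-stage fits.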

HDM Package

If you want a quick and easy implementation of these methods, check out the hdm package. Its source code is relatively easy to follow, and it works with sparse matrices right out of the box, which is nice. It's not the most user-friendly package, but it gets the job done.
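For instance, here is a minimal sketch of the equivalent hdm call (argument names per my reading of the package documentation; double-check against the installed version):

library(hdm)

# LATE with post-LASSO first stages via hdm
hdm_late <- rlassoLATE(
  x = iv_X,          # high-dimensional control matrix (sparse is fine)
  d = iv_data$p401,  # treatment: 401(k) participation
  y = iv_data$tw,    # outcome: total wealth
  z = iv_data$e401   # instrument: 401(k) eligibility
)
summary(hdm_late)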