Program Evaluation and Causal Inference with High-Dimensional Data

A. Belloni, V. Chernozhukov, I. Fernandez-Val, C. Hansen

Packages

```r
library(dplyr)   # data manipulation
library(glmnet)  # LASSO estimation
library(here)    # project-relative file paths
library(Matrix)  # sparse model matrices
library(readr)   # data import
```
This post replicates (to a small degree) Belloni, Chernozhukov, Fernandez-Val, and Hansen's work on estimating causal effects in high-dimensional settings. Their paper demonstrates how to estimate treatment effects, including the LATE (in an instrumental variables setting) and the ATE among others, as well as how to construct honest confidence bands. They demonstrate the method in a specific empirical setting: a prior study that uses instrumental variables estimation to quantify the effect of 401(k) participation on accumulated assets. The instrument in this setting is 401(k) eligibility, which is arguably exogenous after conditioning on income and other related confounders.
Study Data
First, I will import the data that this study is based on.
```r
iv_data <- read_tsv(here("posts/causal-inf-high-dim/data/high-dim-iv.dat"))

# Variable transformations: rescale covariates to the unit interval
iv_data <- iv_data |>
  mutate(
    age = (age - 25)/(64 - 25),
    inc = (inc + 2652)/(242124 + 2652),
    fsize = fsize/13,
    educ = educ/18
  )
```
The primary variables of interest are our continuous outcome \(Y=\) Total Assets, our binary instrument \(Z=\) 401(k) Eligibility, and our binary treatment \(D=\) 401(k) Participation. The definitions of all the rest of these variables are kind of obscure, and I’m not sure what they all are exactly. However, they define their specification precisely so I’m simply going to mirror what they use in their paper.
Model Specification
The specification of the full set of controls they construct is defined below as iv_control_spec. It's a pretty hairy model spec! Let's create the input matrix and check its dimensions.
```r
iv_control_spec <- ~ (
  (
    twoearn + db + pira + hown + age + I(age^2) + I(age^3)
    + marr + educ + I(educ^2) + fsize + I(fsize^2)
  )^2
  * (i1 + i2 + i3 + i4 + i5 + i6 + inc + I(inc^2))
  + (i1 + i2 + i3 + i4 + i5 + i6 + inc + I(inc^2))^2
  - 1
)
iv_X <- sparse.model.matrix(iv_control_spec, data = iv_data)
```
N Observations: 9915; N Confounders: 738
While this isn’t quite the full set of potential controls considered in the paper, it should be close enough to get the point across.
First-Stage Estimates
Conditional Expected Outcomes
In order to calculate the LATE in our IV framework, we will estimate the following expected values: \(E[Y|Z=0,X]\), \(E[Y|Z=1,X]\), \(E[D|Z=1,X]\), and \(E[Z|X]\) using post-LASSO estimates for each of these values. Let’s get to estimating! As in the paper, we will estimate using LASSO with a pre-determined, data-driven choice of regularization parameter, \(\lambda\).
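For reference, these conditional expectations combine into the covariate-conditional Wald ratio that identifies the LATE (in this application \(E[D \mid Z=0, X] = 0\), since no one participates without being eligible):

\[
\theta_{\mathrm{LATE}} = \frac{E\big[\,E[Y \mid Z=1, X] - E[Y \mid Z=0, X]\,\big]}{E\big[\,E[D \mid Z=1, X] - E[D \mid Z=0, X]\,\big]}
\]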
```r
N <- nrow(iv_X)
P <- ncol(iv_X)

# Data-driven regularization parameters
lambda <- 2.2 * sqrt(N) * qnorm(1 - (0.1/log(N))/(2 * (2 * P)))
logit_lambda <- lambda/(2 * N)
```
```r
# Estimate models where instrument == 0
## Outcome Post-LASSO model
id_z0 <- iv_data$e401 == 0
ey_z0 <- glmnet(x = iv_X[id_z0, ], y = iv_data$tw[id_z0], lambda = lambda)
ey_z0_selected <- names((c <- coef(ey_z0))[(drop(c) != 0), ])
ey_z0_post_data <- cbind(tw = iv_data$tw[id_z0], as.matrix(iv_X[id_z0, ]))
ey_z0_post_form <- paste("tw ~", paste0(ey_z0_selected[-1], collapse = "+"))
ey_z0_lm <- lm(formula(ey_z0_post_form), data = as.data.frame(ey_z0_post_data))

## Treatment Post-LASSO model - not needed: E[D|Z=0,X] = 0, since
## participation requires eligibility (D = 1 only if Z = 1)

# Estimate models where instrument == 1
## Outcome Post-LASSO model
id_z1 <- iv_data$e401 == 1
ey_z1 <- glmnet(x = iv_X[id_z1, ], y = iv_data$tw[id_z1], lambda = lambda)
ey_z1_selected <- names((c <- coef(ey_z1))[(drop(c) != 0), ])
ey_z1_post_data <- cbind(tw = iv_data$tw[id_z1], as.matrix(iv_X[id_z1, ]))
ey_z1_post_form <- paste("tw ~", paste0(ey_z1_selected[-1], collapse = "+"))
ey_z1_lm <- lm(formula(ey_z1_post_form), data = as.data.frame(ey_z1_post_data))

## Treatment Post-LASSO model
ed_z1 <- glmnet(
  x = iv_X[id_z1, ],
  y = iv_data$p401[id_z1],
  family = "binomial",
  lambda = logit_lambda
)
ed_z1_selected <- names((c <- coef(ed_z1))[(drop(c) != 0), ])
ed_z1_post_data <- cbind(p401 = iv_data$p401[id_z1], as.matrix(iv_X[id_z1, ]))
ed_z1_post_form <- paste("p401 ~", paste0(ed_z1_selected[-1], collapse = "+"))
ed_z1_lm <- glm(
  formula(ed_z1_post_form),
  family = "binomial",
  data = as.data.frame(ed_z1_post_data)
)
```
```r
# Estimate instrument as a function of X; Post-LASSO
ez <- glmnet(
  x = iv_X,
  y = iv_data$e401,
  family = "binomial",
  lambda = logit_lambda
)
ez_selected <- names((c <- coef(ez))[(drop(c) != 0), ])
ez_post_data <- cbind(e401 = iv_data$e401, as.matrix(iv_X))
ez_post_form <- paste("e401 ~", paste0(ez_selected[-1], collapse = "+"))
ez_lm <- glm(
  formula(ez_post_form),
  family = "binomial",
  data = as.data.frame(ez_post_data)
)
```
Calculate LATE
Now, that we’ve estimated models for the expected value of \(Y\) and \(D\) under the different values of our instrument \(Z\), let’s create a data.frame that has the estimated expected values of these variables for every observation. As is standard (and implemented in the paper), we will trim observations to ensure that estimated propensities of our instrument are bounded away from \({0, 1}\).
```r
prediction_data <- as.data.frame(as.matrix(iv_X))

iv_expected_values <- data.frame(
  y = iv_data$tw,
  d = iv_data$p401,
  z = iv_data$e401,
  ey_z0 = predict(ey_z0_lm, prediction_data),
  ey_z1 = predict(ey_z1_lm, prediction_data),
  ed_z0 = 0,
  ed_z1 = predict(ed_z1_lm, prediction_data, type = "response"),
  ez = predict(ez_lm, prediction_data, type = "response")
)

# Trim instrument propensity scores -- no observations are dropped here
iv_expected_values <- iv_expected_values |>
  filter(ez >= 1e-12 & ez <= (1 - 1e-12))

# Estimate LATE plug-in values
iv_expected_values <- iv_expected_values |>
  mutate(
    ay_1 = z*(y - ey_z1)/ez + ey_z1,
    ay_0 = (1 - z)*(y - ey_z0)/(1 - ez) + ey_z0,
    ad_1 = z*(d - ed_z1)/ez + ed_z1,
    ad_0 = 0,
    LATE = (ay_1 - ay_0)/(ad_1 - ad_0)
  )
```
Confidence via Bootstrap
Now that we’ve estimated the plug-in values, let’s calculate the LATE and generate a confidence interval using the described multiplier bootstrap.
```r
# Calculate LATE
mean_ay_1 <- mean(iv_expected_values$ay_1)
mean_ay_0 <- mean(iv_expected_values$ay_0)
mean_ad_1 <- mean(iv_expected_values$ad_1)
mean_ad_0 <- mean(iv_expected_values$ad_0)
LATE <- (mean_ay_1 - mean_ay_0)/(mean_ad_1 - mean_ad_0)

# Confidence intervals: both analytic and bootstrap
analytic_se <- sqrt(
  (1/(nrow(iv_expected_values) - 1))
  * sum(
    (
      (iv_expected_values$ay_1 - iv_expected_values$ay_0)
      /(mean_ad_1 - mean_ad_0)
      - LATE
    )^2
  )/nrow(iv_expected_values)
)

# Function to generate multiplier weights (mean 1, variance 1)
mw <- function(n) {
  1 + rnorm(n)/sqrt(2) + (rnorm(n)^2 - 1)/2
}

bootstrap_LATEs <- vapply(
  1:500,
  function(i) {
    weights <- mw(nrow(iv_expected_values))
    (
      mean((iv_expected_values$ay_1 - iv_expected_values$ay_0)*weights)
      /mean((iv_expected_values$ad_1 - iv_expected_values$ad_0)*weights)
    )
  },
  numeric(1)
)

# Bootstrap SE from the normal-referenced interquartile range
bootstrap_se <- (
  (quantile(bootstrap_LATEs, .75) - quantile(bootstrap_LATEs, .25))
  / (qnorm(.75) - qnorm(.25))
)
```
LATE: 8391.53 (analytic SE: 3305.03) {bootstrap SE: 3098.99}
Conclusion
And voilà, we've estimated the LATE using IV estimation in a high-dimensional setting! A specific, but very useful, case of this general framework is directly estimating the effect of a treatment variable that is conditionally exogenous. In that case, we can execute the algorithm shown above, but with \(Z = D\). Other than that, everything is exactly the same.
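To see why, note that with \(Z = D\) the treatment score collapses: \(\hat\alpha_{D,i}(1) = D_i(D_i - 1)/\hat e(X_i) + 1 = 1\), so the denominator is exactly 1 and the estimator reduces to the familiar augmented-inverse-propensity (doubly robust) ATE,

\[
\widehat{\mathrm{ATE}} = \frac{1}{n}\sum_i \left[ \frac{D_i\big(Y_i - \hat\mu_1(X_i)\big)}{\hat e(X_i)} + \hat\mu_1(X_i) - \frac{(1 - D_i)\big(Y_i - \hat\mu_0(X_i)\big)}{1 - \hat e(X_i)} - \hat\mu_0(X_i) \right],
\]

where \(\hat\mu_d(X) = \hat E[Y \mid D = d, X]\) and \(\hat e(X) = \hat E[D \mid X]\) are the post-LASSO fits.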
HDM Package
If you want a quick and easy implementation of these methods, check out the hdm package. The package's source is relatively easy to follow, and it works with sparse matrices right out of the box, which is nice. It's not the most user-friendly package, but it seems to get the job done.
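As a rough sketch of what that looks like, here's how I'd expect a call to go. I'm using the `pension` dataset that ships with hdm (the same 401(k) data) and a deliberately simplified control set; `rlassoLATE`'s interface may differ slightly across package versions, so check `?rlassoLATE` and treat this as a sketch rather than a drop-in replacement for the code above.

```r
# Sketch: post-LASSO LATE via hdm, using its bundled 401(k) data
library(hdm)

data(pension)

y <- pension$tw    # total wealth (outcome)
d <- pension$p401  # 401(k) participation (treatment)
z <- pension$e401  # 401(k) eligibility (instrument)

# A deliberately simplified control matrix -- the paper's spec is far richer
X <- model.matrix(
  ~ -1 + age + inc + educ + fsize + marr + twoearn + db + pira + hown,
  data = pension
)

# LATE with the package's built-in LASSO selection and inference
late_fit <- rlassoLATE(X, d, y, z)
summary(late_fit)
```

The package handles the variable selection, orthogonal moment construction, and standard errors internally, so the whole pipeline above collapses into one call.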