<- function(data, models) {
estfun_or # grab scores
<- grab_psiFUN(models$m0, data)
m0_scores <- grab_psiFUN(models$m1, data)
m1_scores
# design matrices
<- grab_design_matrix(data, grab_fixed_formula(models$m0))
X0 <- grab_design_matrix(data, grab_fixed_formula(models$m1))
X1
# parameter indexes
<- 1:ncol(X0)
ii0 <- (ncol(X0) + 1):(ncol(X0) + ncol(X1))
ii1
<- data$A
A
# stacked equations
function(theta) {
c(m0_scores(theta[ii0]) * (A == 0),
m1_scores(theta[ii1]) * (A == 1),
%*% theta[ii1] - X0 %*% theta[ii0] - theta[length(theta)])
X1
} }
In this post
A simple causal example
To fix concepts, let’s start with a simple example where we’re interested in estimating the effect of a single binary time-fixed treatment, \(A\), on outcome \(Y\). Define \(Y^a\) as the potential outcome under a hypothetical intervention which sets \(A\) to \(a\). Our interest is in causal contrasts such as the average treatment effect, i.e. \[ \psi \equiv E(Y^{a=1} - Y^{a=0}). \] Because the focus of this post is on estimation, let’s assume the ideal observational setting in which measured covariates, \(L\), are sufficient to control confounding. Under standard identifiability conditions, namely:
- Exchangeability: \(Y^{a} \perp\!\!\!\perp A \mid L\)
- Consistency: \(Y^{a} = Y \text{ if } A = a\)
- Positivity: \(1 > \Pr(A = a \mid L = l) > 0\)
it can be shown that \(\psi\) is identified by \[ \psi_{om} \equiv E\{E(Y | A = 1, L)\} - E\{E(Y | A = 0, L)\}, \] and, equivalently, by \[ \psi_{ipw} \equiv E\left\{\frac{I(A = 1)}{\Pr(A = 1 \mid L)} Y\right\} - E\left\{\frac{I(A = 0)}{\Pr(A = 0 \mid L)} Y\right\}. \]
To simplify, define \(\mu_a(L) = E(Y | A = a, L)\) and \(e(L) = \Pr(A = 1 | L)\). Each expression suggests a corresponding plug-in estimator, i.e. \[ \hat{\psi}_{om} \equiv \dfrac{1}{n} \sum_{i=1}^n \hat{\mu}_1(L_i) - \hat{\mu}_0(L_i), \] and \[ \hat{\psi}_{ipw} = \dfrac{1}{n} \sum_{i=1}^n \dfrac{A_i}{\hat{e}(L_i)}Y_i - \sum_{i=1}^n \dfrac{1 - A_i}{1 - \hat{e}(L_i)}Y_i, \] where the first is termed the outcome model or regression estimator because it relies on a model for \(E(Y | A = a, L)\) and the second is the inverse probability weighting estimator because it relies on a model for \(\Pr(A = 1 | L)\).
Hernan and Robins describe, in detail, the necessary analysis steps to obtain estimates \(\hat{\psi}_{om}\) and \(\hat{\psi}_{ipw}\). Briefly, the outcome regression estimator is obtained by
- Regressing the outcome, \(Y\), on covariates, \(L\), either a) separately within the subset with \(A = 1\) and \(A = 0\), or b) by pooling into one model of \(Y\) on \(L\) and \(A\).
- Create two copies of the dataset and artificially set \(A = 1\) for all individuals in one and set \(A = 0\) for all individuals in the other.
- Obtain predicted values \(\hat{\mu}_1(L_i)\) and \(\hat{\mu}_0(L_i)\) by applying the model in step 1 to the observed values of \(L_i\) in each dataset.
- Take the mean of these predicted values and subtract them to obtain \(\hat{\psi}_{om}\).
and the inverse probability weighting estimator may be obtained by
- Regressing the treatment, \(A\), on covariates, \(L\), using, for instance, a logistic model.
- Obtain propensity scores \(\hat{e}(L_i)\) from predicted values of logistic model.
- For each individual form weights based on the inverse probability of the treatment they received \[ W_i = \dfrac{A_i}{\hat{e}(L_i)} \]
- Calculate the weighted mean and take the difference.
The problem
Implementing
In their book, Hernan and Robins suggest variance estimates can be obtained via the bootstrap. However, there are at least two issues. First the bootstrap may be computationally intensive especially as problems become more complex (e.g. once we consider time-varying treatments).
The basics of M-estimation
The basic set up is as follows. Let \(\theta\) be a finite dimensional parameter vector. An M-estimator, \(\hat{\theta}\), for this vector is the solution to sample moment equation \[ \sum_{i=1}^n \psi(O_i; \hat{\theta}) = 0 \] where \(\psi(O_i; \hat{\theta})\) is often referred to as an estimating function. Under standard regularity conditions, the asymptotic distribution of \(\hat{\theta}\) is given by \[ \sqrt{n}(\theta - \hat{\theta}) \overset{d}{\longrightarrow} N(0, \mathbb V(\hat{\theta})) \] where the asymptotic variance, \(\mathbb V_n(\hat{\theta})\), can be estimated from \[ \mathbb V_n(\hat{\theta}) = \mathbb B_n(\hat{\theta})^{-1} \mathbb F_n(\hat{\theta})\{\mathbb B_n(\hat{\theta})^{-1}\}^T \] with \[ \mathbb B_n(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^n \left\{-\dfrac{\partial\psi(O_i; \hat{\theta})}{\partial\theta}\right\} \] and \[ \mathbb F_n(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^n \left\{\partial\psi(O_i; \hat{\theta})\partial\psi(O_i; \hat{\theta})^T\right\} \] In M-estimation parlance \(\mathbb V_n(\hat{\theta})\) is referred to as the empirical sandwich variance estimator where \(\mathbb B_n(\hat{\theta})\) is the “bread” and \(\mathbb F_n(\hat{\theta})\) is the “meat”. While analytical solutions can often be derived for the asymptotic variance for a given set of estimating functions, an advantage of M-estimation is that the variance can also be obtained numerically and therefore can be easily extended to any arbitrary set of estimating functions.
The beauty of M-estimation is that many common estimation problems can be posed as an estimating function or sequence of estimating functions. Estimating functions can be ``stacked’’
- Can I re-express my estimator as a set of stacked estimating equations?
- Do I meet the
The geex
package
Outcome regression
We can convert \(\hat{\psi}_{om}\) into the following set of stacked estimating equations \[ \psi(O; \beta_0, \beta_1, \psi) = \begin{bmatrix} A \{Y - \mu_1(L;\beta_1)\} \\ (1 -A) \{Y - \mu_0(L;\beta_0)\} \\ \mu_1(L;\beta_1) - \mu_0(L;\beta_1) - \psi \end{bmatrix} \] where the first two are the score equations for the models \(\mu_1(L;\beta_1)\) and \(\mu_0(L;\beta_1)\) respectively and the last obtains the estimate of the difference in predicted values.
In the geex
package the stacked estimating functions for outcome regression can be implemented as follows.
Inverse probability weighting
Using some probability theory we can show that \(\psi_{om}\) is equal to \[ \psi_{ipw} \equiv E\left\{\frac{I(A = 1)}{\Pr(A = 1 \mid L)} Y\right\} - E\left\{\frac{I(A = 0)}{\Pr(A = 0 \mid L)} Y\right\} \] and simple plug-in estimator
\[ \hat{\psi}_{ipw} = \sum_{i=1}^n \dfrac{A_i}{\hat{e}(L_i)}Y_i - \sum_{i=1}^n \dfrac{1 - A_i}{1 - \hat{e}(L_i)}Y_i. \]
The variance is complicated by the estimation of the nuisance terms \(\hat{\mu}_1(L)\) and \(\hat{\mu}_0(L)\).
\[ \psi(O; \alpha, \psi) = \begin{bmatrix} \{A - e(L;\alpha)\} L \\ \dfrac{A}{\hat{e}(L;\alpha)}Y - \dfrac{1 - A}{1 - \hat{e}(L;\alpha)}Y - \psi \end{bmatrix} \]
<- function(data, models) {
estfun_ipw # grab scores
<- grab_psiFUN(models$e, data)
e_scores
# design matrix
<- grab_design_matrix(data, grab_fixed_formula(models$e))
X
# parameter indexes
<- 1:ncol(X)
ii
<- data$A
A <- data$Y
Y
# stacked equations
function(theta) {
<- plogis(X %*% theta[ii])
e c(e_scores(theta[ii]),
* Y / e) - ((1 - A) * Y / (1 - e)) - theta[length(theta)])
(A
} }
G-estimation
\[ \begin{aligned} E\{Y - E(Y \mid A = a, X)\} = 0 \\ E[\{A - E(A | X)\}\{Y - \psi A\}] = 0 \end{aligned} \]
Putting it all together
library(geex)
Attaching package: 'geex'
The following object is masked from 'package:methods':
show
<- function(n) {
gendata <- MASS::mvrnorm(
L n = n,
mu = rep(0, 3),
Sigma = matrix(
data = c(1, 0.3, 0.3,
0.3, 1, 0.3,
0.3, 0.3, 1),
nrow = 3,
ncol = 3,
byrow = TRUE)
) <- rbinom(n, 1, plogis(-2 + L[, 1] + sqrt(exp(L[, 2]))+ L[, 3]^2))
A <- rnorm(n, -A + 0.5 * (L[, 1] + L[, 2] + L[, 3] + L[, 1] * L[, 3]))
Y
colnames(L) <- paste0("L", 1:3)
<- data.frame(L, A, Y)
mat return(mat)
}
<- function(data, models) {
estfun <- grab_psiFUN(models$m0, data)
m0 <- grab_psiFUN(models$m1, data)
m1 <- grab_design_matrix(data, rhs_formula = grab_fixed_formula(models$m0))
X0 <- grab_design_matrix(data, rhs_formula = grab_fixed_formula(models$m1))
X1 <- 1:ncol(X0)
ii0 <- (ncol(X0) + 1):(ncol(X0) + ncol(X1))
ii1
<- data$A
A function(theta) {
c(m0(theta[ii0]) * (A == 0),
m1(theta[ii1]) * (A == 1),
%*% theta[ii1] - X0 %*% theta[ii0] - theta[length(theta)] )
X1
}
}
<- gendata(1000)
d
<- list(
models m0 = glm(formula = Y ~ L1 + L2 + L3 + L1:L3,
family = gaussian,
subset = A == 0,
data = d),
m1 = glm(formula = Y ~ L1 + L2 + L3 + L1:L3,
family = gaussian,
subset = A == 1,
data = d)
)
<- sum(sapply(models, function(x) length(coef(x)))) + 1
nparms
<- m_estimate(
res estFUN = estfun,
data = d,
root_control = setup_root_control(start = rep(0, nparms)),
outer_args = list(models = models)
)
c(
roots(res)[11],
roots(res)[11] - 1.96 * sqrt(vcov(res)[11,11]),
roots(res)[11] + 1.96 * sqrt(vcov(res)[11,11])
)
[1] -1.0167416 -1.1525034 -0.8809799
Unifying these approaches with M-estimation
Rather than computing plug-in estimators from predicted values, we can view these estimation problems as solving estimating equations. This approach—known as M-estimation—is especially useful because it provides a principled way to compute standard errors using the sandwich variance formula and forms the basis for more advanced estimators like doubly robust estimators or TMLE.
To formalize, we write the estimand \(\psi\) as the solution to an estimating equation: \[ \frac{1}{n} \sum_{i=1}^n m(O_i; \psi, \eta) = 0 \] where \(O_i = (Y_i, A_i, L_i)\) is observed data, \(\psi\) is the target parameter, and \(\eta\) includes any nuisance parameters.
Let’s walk through how both \(\hat\psi_{om}\) and \(\hat\psi_{ipw}\) can be obtained this way.
Outcome regression as M-estimator
Let \(\hat{\mu}_a(L)\) be a model for \(E(Y \mid A=a, L)\). Then define:
<- function(data, psi, nuisance) {
m_om <- predict(nuisance$model1, newdata = transform(data, A = 1))
mu1 <- predict(nuisance$model0, newdata = transform(data, A = 0))
mu0 - mu0 - psi
mu1 }
We solve for \(\psi\) such that the average of this score is zero.
IPW as M-estimator
Let \(\hat{e}(L)\) be the propensity model. Then the estimating function is:
<- function(data, psi, nuisance) {
m_ipw <- data$A
A <- data$Y
Y <- nuisance$ps
ps * Y / ps - (1 - A) * Y / (1 - ps) - psi
A }
Again, solve for \(\psi\) so the sample mean of m_ipw
is zero.
Estimation in R with geex
The geex
package provides a convenient framework for implementing M-estimation and computing standard errors via the sandwich formula.
Here’s how we might implement the IPW estimator using geex
:
library(geex)
<- function(data) {
ipw_estimator <- function(data) {
estfun <- data$A
A <- data$Y
Y <- glm(A ~ L, family = binomial, data = data)
ps_model <- predict(ps_model, type = "response")
ps function(theta) {
* Y / ps - (1 - A) * Y / (1 - ps) - theta
A
}
}m_estimate(
estFUN = estfun,
data = data,
root_control = setup_root_control(start = 0)
) }
Wrap-up
M-estimation provides a flexible, principled approach to estimation in causal inference that unifies many standard methods under a common framework. By framing estimators as solutions to estimating equations, we can take advantage of general results for asymptotic inference, including robust (sandwich) standard errors.
In upcoming posts, I’ll show how this M-estimation framework generalizes to doubly robust estimators and targeted learning.
Reuse
Citation
@online{boyer2024,
author = {Boyer, Christopher},
title = {M-Estimation for Causal Inference in {R}},
date = {2024-02-16},
langid = {en}
}