Overview

The cfperformance package supports flexible machine learning methods for nuisance model estimation (propensity scores and outcome models). When using ML methods, cross-fitting is essential for valid inference—this is handled automatically when you use the ml_learner() specification.

This vignette demonstrates how to use ML learners with the counterfactual performance estimators.

Why Use ML for Nuisance Models?

The doubly robust (DR) estimator requires two nuisance models (combined as sketched after this list):

  1. Propensity scores: P(A = a | X), the probability of receiving treatment level a given covariates
  2. Outcome model: E[Y | X, A = a], the expected outcome given covariates and treatment
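
To see how the two models combine, here is a standard doubly robust construction for the counterfactual MSE under treatment level a; the notation is illustrative and may differ from the package internals:

$$
\hat{\psi}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{I(A_i = a)}{\hat{\pi}_a(X_i)} \big[ L_i - \hat{m}_a(X_i) \big] + \hat{m}_a(X_i) \right\}, \qquad L_i = \big(Y_i - f(X_i)\big)^2,
$$

where f is the prediction model under evaluation, π̂_a(X) estimates P(A = a | X), and m̂_a(X) estimates E[L | X, A = a]. The estimator remains consistent if either nuisance model is correctly specified, which is the sense in which it is doubly robust.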

Traditional approaches use logistic regression, which may be misspecified when the true relationships are complex. Machine learning methods can capture nonlinear relationships and interactions without explicit specification.

However, flexible ML methods introduce a challenge: if the same data are used both to fit the nuisance models and to compute the estimator, the resulting inference can be invalid. Cross-fitting (sample splitting) solves this by:

  1. Splitting data into K folds
  2. For each fold, fitting nuisance models on the other K-1 folds
  3. Using the held-out fold for estimation

This approach, known as “double/debiased machine learning” (Chernozhukov et al., 2018), maintains valid inference while leveraging flexible estimation.
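
To make the procedure concrete, here is a minimal sketch of cross-fitting a propensity model. The package performs these steps automatically when cross_fit = TRUE; the toy data and GLM below are illustrative stand-ins, not the package internals.

# Toy data: one covariate and a treatment indicator
set.seed(1)
n_toy <- 200
toy <- data.frame(x = rnorm(n_toy))
toy$a <- rbinom(n_toy, 1, plogis(toy$x))

# Assign each observation to one of K folds
K <- 5
fold <- sample(rep(seq_len(K), length.out = n_toy))

ps_hat <- numeric(n_toy)
for (k in seq_len(K)) {
  # Fit the nuisance model on the other K-1 folds...
  fit_k <- glm(a ~ x, data = toy[fold != k, ], family = binomial())
  # ...and predict on the held-out fold
  ps_hat[fold == k] <- predict(fit_k, newdata = toy[fold == k, ], type = "response")
}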

The ml_learner() Interface

The ml_learner() function creates a learner specification that can be passed to the propensity_model or outcome_model arguments:

library(cfperformance)

# Create a learner specification
ranger_learner <- ml_learner("ranger", num.trees = 500)
print(ranger_learner)
#> ML Learner Specification
#> ------------------------
#> Method: ranger 
#> Arguments:
#>   num.trees: 500

Supported Learners

Method           Package        Description
"ranger"         ranger         Fast random forest implementation
"xgboost"        xgboost        Gradient boosting (XGBoost)
"grf"            grf            Generalized random forests with honesty
"glmnet"         glmnet         Elastic net with cross-validated λ
"superlearner"   SuperLearner   Ensemble of multiple learners
"custom"         (none)         User-supplied fit/predict functions

Basic Usage

Here’s a complete example using random forests for nuisance estimation:

# Generate example data
set.seed(123)
n <- 1000

# Covariates with nonlinear relationships
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Treatment depends on covariates (nonlinearly)
ps_true <- plogis(-0.5 + 0.8 * x1 - 0.5 * x2 + 0.3 * x1 * x2)
a <- rbinom(n, 1, ps_true)

# Outcome depends on covariates and treatment
y_prob <- plogis(-1 + 0.5 * x1 + 0.3 * x2^2 + 0.2 * x3 - 0.4 * a)
y <- rbinom(n, 1, y_prob)

# Prediction model (intentionally simplified)
pred <- plogis(-1 + 0.4 * x1 + 0.2 * x2)

# Create data frame
df <- data.frame(x1 = x1, x2 = x2, x3 = x3)

Using GLM (Default)

# Standard approach with GLM
result_glm <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_glm)
#> 
#> Counterfactual MSE Estimation
#> ---------------------------------------- 
#> Estimator: dr 
#> Treatment level: 0 
#> N observations: 1000 
#> 
#> Estimate: 0.2301 (SE: 0.0098 )
#> 95% CI: [0.211, 0.2492]
#> 
#> Naive estimate: 0.2081

Using Random Forest

# With random forest nuisance models
result_rf <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 200),
  outcome_model = ml_learner("ranger", num.trees = 200),
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_rf)
#> 
#> Counterfactual MSE Estimation
#> ---------------------------------------- 
#> Estimator: dr 
#> Treatment level: 0 
#> N observations: 1000 
#> 
#> Estimate: 0.2154 (SE: 0.0114 )
#> 95% CI: [0.193, 0.2378]
#> 
#> Naive estimate: 0.2081

Learner-Specific Examples

In the snippets below, ... stands for the data arguments (predictions, outcomes, treatment, and so on) from the Basic Usage example above.

ranger (Random Forest)

cf_mse(
  ...,
  propensity_model = ml_learner(
    "ranger",
    num.trees = 500,      # Number of trees
    mtry = 3,             # Variables per split
    min.node.size = 10    # Minimum node size
  ),
  cross_fit = TRUE
)

xgboost (Gradient Boosting)

cf_mse(
  ...,
  propensity_model = ml_learner(
    "xgboost",
    nrounds = 100,        # Boosting rounds
    max_depth = 4,        # Tree depth
    eta = 0.1             # Learning rate
  ),
  cross_fit = TRUE
)

glmnet (Elastic Net)

cf_mse(
  ...,
  propensity_model = ml_learner(
    "glmnet",
    alpha = 0.5,          # 0=ridge, 1=lasso, 0.5=elastic net
    nfolds = 10           # CV folds for lambda selection
  ),
  cross_fit = TRUE
)

grf (Generalized Random Forest)

cf_mse(
  ...,
  propensity_model = ml_learner(
    "grf",
    num.trees = 2000,     # More trees for honesty
    honesty = TRUE        # Honest estimation (default)
  ),
  cross_fit = TRUE
)

SuperLearner (Ensemble)

cf_mse(
  ...,
  propensity_model = ml_learner(
    "superlearner",
    SL.library = c("SL.glm", "SL.ranger", "SL.xgboost")
  ),
  cross_fit = TRUE
)

Custom Learners

You can define custom learners by providing fit_fun and predict_fun:

# Define a custom learner (e.g., GAM with specific settings)
custom_fit <- function(formula, data, family, ...) {
  if (!requireNamespace("mgcv", quietly = TRUE)) {
    stop("mgcv is required for this custom learner")
  }
  # This learner hard-codes a binomial family as its "specific setting";
  # the family argument passed by the caller is accepted but ignored here
  mgcv::gam(formula, data = data, family = binomial())
}

custom_predict <- function(object, newdata, ...) {
  predict(object, newdata = newdata, type = "response")
}

# Bundle the functions into a learner specification
custom_learner <- ml_learner(
  "custom",
  fit_fun = custom_fit,
  predict_fun = custom_predict
)
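
The resulting specification can then be passed like any built-in learner. A sketch, reusing the data arguments from the earlier examples:

cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = custom_learner,
  outcome_model = custom_learner,
  cross_fit = TRUE
)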

Mixing Learners

You can use different learners for propensity and outcome models:

# Random forest for propensity, elastic net for outcome
cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 500),
  outcome_model = ml_learner("glmnet", alpha = 0.5),
  cross_fit = TRUE
)

Current Support

ML learners with cross-fitting are currently fully supported in:

  • cf_mse() - Counterfactual mean squared error
  • cf_auc() - Counterfactual AUC (doubly robust estimator only)

Support for cf_calibration() and the transportability variants is planned for a future release.

MSE with ML Learners

# Full example with ML nuisance models
result_mse <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 100),
  outcome_model = ml_learner("ranger", num.trees = 100),
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_mse)
#> 
#> Counterfactual MSE Estimation
#> ---------------------------------------- 
#> Estimator: dr 
#> Treatment level: 0 
#> N observations: 1000 
#> 
#> Estimate: 0.2246 (SE: 0.0115 )
#> 95% CI: [0.2021, 0.2471]
#> 
#> Naive estimate: 0.2081

AUC with ML Learners

For AUC estimation, we recommend using GRF (Generalized Random Forests) rather than standard random forests. GRF’s “honesty” property produces well-calibrated probability estimates, which is critical for the doubly robust AUC estimator. Standard random forests can produce extreme probability predictions that destabilize the estimator.

# AUC estimation with GRF nuisance models (recommended)
result_auc <- cf_auc(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("grf", num.trees = 500),
  outcome_model = ml_learner("grf", num.trees = 500),
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_auc)
#> 
#> Counterfactual AUC Estimation
#> ---------------------------------------- 
#> Estimator: dr 
#> Treatment level: 0 
#> N observations: 1000 
#> 
#> Estimate: 0.5779 (SE: 0.0197 )
#> 95% CI: [0.5393, 0.6165]
#> 
#> Naive estimate: 0.5994

Best Practices

  1. Always use cross-fitting with ML: The package will warn you if you pass an ml_learner without cross_fit = TRUE.

  2. Choose an appropriate number of folds: 5-10 folds is typical. More folds mean more training data for each nuisance model, at the cost of more computation.

  3. Tune hyperparameters thoughtfully: Default values work reasonably well, but tuning can improve performance. Consider using external cross-validation for hyperparameter selection.

  4. Consider computational cost: ML methods are slower than GLM. For initial exploration, use fewer trees/rounds, then increase for final analysis.

  5. Check for extreme propensity scores: ML methods can produce very extreme propensity scores. The package truncates these at [0.025, 0.975] by default (see the sketch after this list).

  6. Use GRF for AUC estimation: The doubly robust AUC estimator is sensitive to poorly calibrated probability estimates. GRF’s “honest” estimation produces well-calibrated probabilities, while standard random forests (ranger) can produce extreme predictions that destabilize the estimator.
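
For item 5, truncation amounts to clipping estimated propensity scores to the stated bounds. A minimal sketch, where truncate_ps is a hypothetical helper (the package applies the equivalent internally):

# Hypothetical helper illustrating propensity score truncation
truncate_ps <- function(ps, lower = 0.025, upper = 0.975) {
  pmin(pmax(ps, lower), upper)
}

truncate_ps(c(0.001, 0.50, 0.999))
#> [1] 0.025 0.500 0.975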

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.

Boyer, C. B., Dahabreh, I. J., & Steingrimsson, J. A. (2025). Estimating and evaluating counterfactual prediction models. Statistics in Medicine, 44(23-24), e70287.