A simulated dataset for demonstrating counterfactual prediction model performance estimation. The data mimics a scenario where patients receive a cardiovascular treatment based on their risk factors, and we want to evaluate how well a prediction model would perform under different treatment policies.
Format
A data frame with 1000 rows and 6 variables:
- age
Standardized age (mean 0, SD 1)
- bp
Standardized blood pressure (mean 0, SD 1)
- chol
Standardized cholesterol level (mean 0, SD 1)
- treatment
Binary treatment indicator (1 = treated, 0 = untreated). Treatment assignment is confounded by age and blood pressure.
- event
Binary outcome indicating cardiovascular event (1 = event, 0 = no event). Risk depends on age, bp, chol, and is reduced by treatment.
- risk_score
Predicted probability of event from a logistic regression model fit on observed data using age, bp, and chol as predictors.
Source
Simulated data based on the framework in Boyer, Dahabreh & Steingrimsson (2025). "Estimating and evaluating counterfactual prediction models." Statistics in Medicine, 44(23-24), e70287. doi:10.1002/sim.70287
Details
The data generating process is:
Covariates are independent standard normal
Treatment probability:
plogis(-0.3 + 0.4*age + 0.3*bp + 0.1*chol)Outcome probability:
plogis(-2 + 0.6*age + 0.5*bp + 0.3*chol - 0.4*treatment)
This creates confounding because sicker patients (higher age, bp) are more likely to receive treatment, but treatment also affects outcomes.
Examples
data(cvd_sim)
head(cvd_sim)
#> age bp chol treatment event risk_score
#> 1 -0.2078913 -0.43879526 -0.5697974 0 0 0.07548152
#> 2 -1.2517361 1.30171507 0.7798967 0 0 0.13491154
#> 3 1.7957878 -0.39076092 -0.1731313 1 0 0.18333022
#> 4 -1.2464064 0.08506276 0.0269594 1 0 0.07145644
#> 5 -0.5880067 0.10358176 0.8346190 1 0 0.11730651
#> 6 -0.9132198 0.88158838 0.6061392 0 0 0.12684642
# Estimate counterfactual MSE under no treatment
result <- cf_mse(
predictions = cvd_sim$risk_score,
outcomes = cvd_sim$event,
treatment = cvd_sim$treatment,
covariates = cvd_sim[, c("age", "bp", "chol")],
treatment_level = 0,
estimator = "dr"
)
result
#>
#> Counterfactual MSE Estimation
#> ----------------------------------------
#> Estimator: dr
#> Treatment level: 0
#> N observations: 2500
#>
#> Estimate: 0.1186 (SE: 0.0065 )
#> 95% CI: [0.1064, 0.1309]
#>
#> Naive estimate: 0.1086
#>
