---
title: "Comparing Estimators Under Different Data-Generating Processes"
author: "mixedsubjects"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Comparing Estimators Under Different Data-Generating Processes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

```{r, eval=FALSE}
# Install from github
devtools::install_github('klintkanopka/mixedsubjects')
```

```{r load-package, echo=FALSE, message=FALSE}
# For the vignette, we source the files directly

library(mixedsubjects)

pkg_dir <- system.file(package = "mixedsubjects")
if (pkg_dir == "") {
  # Running from source
  pkg_dir <- "../R"
  for (f in list.files(pkg_dir, pattern = "\\.R$", full.names = TRUE)) {
    source(f)
  }
}
```

## Overview

Each of the seven estimators in the `mixedsubjects` package has a different
variance formula, so each dominates under a different data-generating process.
This document constructs targeted DGP scenarios, runs Monte Carlo simulations
under each, and confirms which estimator achieves the smallest variance.

The key quantities that drive the comparison:

| Quantity | Favors |
|----------|--------|
| $\rho(Y, S)$ — prediction quality | All prediction-based estimators over DiM |
| $\rho_1 \neq \rho_0$ — heterogeneous quality across arms | D-T and D-T DiP over single-$\lambda$ estimators |
| $\text{Cov}(S^{(1)}, S^{(0)})$ — common-mode prediction error | DiP family over PPI/D-T family |
| $m / n$ — ratio of unlabeled to labeled | Prediction-based estimators (larger $m$ helps more) |


## Simulation Engine

```{r simulation-engine}
#' Run a Monte Carlo comparison of all seven estimators
#'
#' @param dgp_fn A function(seed) that returns a list with components
#'   obs_df, unobs_df (data.frames), and true_tau (scalar).
#' @param n_sims Number of Monte Carlo replications.
#' @param n_folds Number of cross-fitting folds.
#' @return A data.frame with columns: estimator, mean_est, bias, variance, mse.
run_comparison <- function(dgp_fn, n_sims = 2000, n_folds = 2) {
  estimator_names <- c("dim", "greg", "ppi", "dt", "dip", "dip_pp", "dt_dip")
  results <- matrix(NA_real_, nrow = n_sims, ncol = length(estimator_names))
  colnames(results) <- estimator_names

  for (i in seq_len(n_sims)) {
    d <- dgp_fn(seed = i)
    msd <- msd_data(observed = d$obs_df, unobserved = d$unobs_df)

    results[i, "dim"] <- tryCatch(msd_dim(msd)$estimate, error = function(e) NA)
    results[i, "greg"] <- tryCatch(msd_greg(msd)$estimate, error = function(e) NA)
    results[i, "ppi"] <- tryCatch(msd_ppi(msd, n_folds = n_folds)$estimate, error = function(e) NA)
    results[i, "dt"] <- tryCatch(msd_dt(msd, n_folds = n_folds)$estimate, error = function(e) NA)
    results[i, "dip"] <- tryCatch(msd_dip(msd)$estimate, error = function(e) NA)
    results[i, "dip_pp"] <- tryCatch(msd_dip_pp(msd, n_folds = n_folds)$estimate, error = function(e) NA)
    results[i, "dt_dip"] <- tryCatch(msd_dt_dip(msd, n_folds = n_folds)$estimate, error = function(e) NA)
  }

  true_tau <- dgp_fn(seed = 1)$true_tau

  data.frame(
    estimator = estimator_names,
    mean_est  = colMeans(results, na.rm = TRUE),
    bias      = colMeans(results, na.rm = TRUE) - true_tau,
    variance  = apply(results, 2, var, na.rm = TRUE),
    mse       = apply(results, 2, function(x) mean((x - true_tau)^2, na.rm = TRUE)),
    stringsAsFactors = FALSE
  )
}

#' Pretty-print a comparison table, highlighting the minimum-variance estimator
print_comparison <- function(comp, title = "") {
  if (nchar(title) > 0) cat("###", title, "\n\n")
  comp$variance <- round(comp$variance, 6)
  comp$bias <- round(comp$bias, 4)
  comp$mse <- round(comp$mse, 6)
  best <- comp$estimator[which.min(comp$variance)]
  comp$best <- ifelse(comp$estimator == best, " <-- min", "")
  print(comp[, c("estimator", "bias", "variance", "mse", "best")], row.names = FALSE)
  cat("\nLowest variance:", best, "\n\n")
}
```


## Scenario 1: Poor Predictions — DiM Wins

When predictions are essentially noise ($\rho(Y,S) \approx 0$), incorporating
them only adds variance. DiM ignores predictions entirely, so it should
dominate.

```{r dgp-poor-predictions}
dgp_poor_predictions <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 200; m <- 400

  D_obs <- rep(c(1, 0), each = n / 2)
  Y <- rnorm(n) + true_tau * D_obs

  # Predictions are pure noise — no correlation with Y
  S1_obs <- rnorm(n, 0.1, 1)
  S0_obs <- rnorm(n, 0, 1)

  D_unobs <- rep(c(1, 0), each = m / 2)
  S1_unobs <- rnorm(m, 0.1, 1)
  S0_unobs <- rnorm(m, 0, 1)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-1}
comp1 <- run_comparison(dgp_poor_predictions)
print_comparison(comp1, "Scenario 1: Poor Predictions")
```

**Why**: All prediction-based estimators add $\lambda^2 \text{Var}(S)/m$ to the
variance without any offsetting covariance reduction. The optimal $\lambda$
shrinks toward 0 for PPI++/D-T/DiP++/D-T DiP, making them roughly equal to DiM,
while GREG and DiP (fixed $\lambda = 1$) are strictly worse.


## Scenario 2: Negatively Correlated Predictions

When a covariate $X$ has **opposite effects** on $Y(1)$ and $Y(0)$
(e.g., $Y(0) = \varepsilon + X$, $Y(1) = \varepsilon' - X + \tau$),
the potential outcomes are negatively correlated, and so are the predictions
$S^{(1)}$ and $S^{(0)}$. This inflates $\text{Var}(S^{(1)} - S^{(0)})$
relative to the positive-correlation case, which **reduces** DiP's advantage
over PPI — but does not eliminate it, because DiP still benefits structurally
from using all $m$ unobserved units rather than splitting by arm ($m/2$ each).

```{r dgp-high-quality-balanced}
dgp_neg_corr_predictions <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 200; m <- 500

  D_obs <- rep(c(1, 0), each = n / 2)
  # X has opposite effects on Y(1) vs Y(0), creating negative Cov(Y1, Y0)
  X_obs <- rnorm(n)
  Y0_obs <- rnorm(n, 0, 0.3) + 1.0 * X_obs
  Y1_obs <- rnorm(n, 0, 0.3) - 1.0 * X_obs + true_tau
  Y <- D_obs * Y1_obs + (1 - D_obs) * Y0_obs

  # Good predictions of each potential outcome
  S1_obs <- 0.85 * Y1_obs + rnorm(n, 0, 0.2)
  S0_obs <- 0.85 * Y0_obs + rnorm(n, 0, 0.2)

  D_unobs <- rep(c(1, 0), each = m / 2)
  X_unobs <- rnorm(m)
  Y0_unobs <- rnorm(m, 0, 0.3) + 1.0 * X_unobs
  Y1_unobs <- rnorm(m, 0, 0.3) - 1.0 * X_unobs + true_tau
  S1_unobs <- 0.85 * Y1_unobs + rnorm(m, 0, 0.2)
  S0_unobs <- 0.85 * Y0_unobs + rnorm(m, 0, 0.2)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-2}
comp2 <- run_comparison(dgp_neg_corr_predictions)
print_comparison(comp2, "Scenario 2: Negatively Correlated Predictions")
```

**Why**: With $\text{Cov}(S^{(1)}, S^{(0)}) < 0$, differencing inflates the
unobserved variance: $\text{Var}(S^{(1)} - S^{(0)}) =
\text{Var}(S^{(1)}) + \text{Var}(S^{(0)}) - 2\text{Cov}(S^{(1)}, S^{(0)}) >
\text{Var}(S^{(1)}) + \text{Var}(S^{(0)})$.
However, DiP still outperforms PPI because its unobserved term divides by $m$
while PPI divides by $m/2$ per arm. In general,
$\frac{\text{Var}(S^{(1)} - S^{(0)})}{m} \leq \frac{\text{Var}(S^{(1)})}{m_1} + \frac{\text{Var}(S^{(0)})}{m_0}$
always holds (by Cauchy–Schwarz), so the DiP family structurally dominates
the PPI family in the unobserved variance component regardless of the
sign of $\text{Cov}(S^{(1)}, S^{(0)})$. Negative correlation merely
*shrinks* DiP's advantage relative to the positive-correlation case.


## Scenario 3: Heterogeneous Prediction Quality — Double-Tuned Wins

When the LLM predicts much better in one arm than the other
($\rho_1 \gg \rho_0$), double-tuned estimators (D-T, D-T DiP) that set
arm-specific $\lambda_1, \lambda_0$ outperform their single-$\lambda$
counterparts (PPI++, DiP++). D-T DiP achieves the lowest variance by
combining the double-tuning advantage with DiP's structural advantage
of using all $m$ unobserved units.

```{r dgp-heterogeneous-quality}
dgp_heterogeneous_quality <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 200; m <- 500

  D_obs <- rep(c(1, 0), each = n / 2)
  Y0_obs <- rnorm(n)
  Y1_obs <- Y0_obs + true_tau
  Y <- D_obs * Y1_obs + (1 - D_obs) * Y0_obs

  # Treatment arm: excellent predictions (rho ~ 0.85)
  # Control arm: mediocre predictions (rho ~ 0.25)
  S1_obs <- 0.9 * Y1_obs + rnorm(n, 0, 0.3)
  S0_obs <- 0.2 * Y0_obs + rnorm(n, 0, 0.9)

  D_unobs <- rep(c(1, 0), each = m / 2)
  Y0_unobs <- rnorm(m)
  Y1_unobs <- Y0_unobs + true_tau
  S1_unobs <- 0.9 * Y1_unobs + rnorm(m, 0, 0.3)
  S0_unobs <- 0.2 * Y0_unobs + rnorm(m, 0, 0.9)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-3}
comp3 <- run_comparison(dgp_heterogeneous_quality)
print_comparison(comp3, "Scenario 3: Heterogeneous Quality Across Arms")
```

**Why**: A single $\lambda$ must compromise between the two arms: PPI++ and
DiP++ cannot fully exploit the good treatment-arm predictions without also
amplifying noise in the control arm. Double-tuned estimators (D-T, D-T DiP)
set $\lambda_1$ high and $\lambda_0$ low, avoiding this trade-off. Within
each tuning family, DiP-style estimators beat their PPI-style counterparts
(D-T DiP < D-T, DiP++ < PPI++) because they use all $m$ unobserved units
rather than splitting by arm.


## Scenario 4: High Common-Mode Error

When predictions for $S^{(1)}$ and $S^{(0)}$ share a large common error
component (e.g., the LLM has a stable unit-level bias that cancels in
$S^{(1)} - S^{(0)}$), the DiP family exploits $\text{Cov}(S^{(1)}, S^{(0)}) > 0$
to reduce the unobserved variance component.

```{r dgp-high-common-mode}
dgp_high_common_mode <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 200; m <- 500

  D_obs <- rep(c(1, 0), each = n / 2)
  Y0_obs <- rnorm(n)
  Y1_obs <- Y0_obs + true_tau
  Y <- D_obs * Y1_obs + (1 - D_obs) * Y0_obs

  # Common-mode prediction error (shared LLM bias per unit, not in Y)
  common_obs <- rnorm(n, 0, 0.8)
  S1_obs <- 0.9 * Y1_obs + common_obs + rnorm(n, 0, 0.1)
  S0_obs <- 0.9 * Y0_obs + common_obs + rnorm(n, 0, 0.1)

  D_unobs <- rep(c(1, 0), each = m / 2)
  Y0_unobs <- rnorm(m)
  Y1_unobs <- Y0_unobs + true_tau
  common_unobs <- rnorm(m, 0, 0.8)
  S1_unobs <- 0.9 * Y1_unobs + common_unobs + rnorm(m, 0, 0.1)
  S0_unobs <- 0.9 * Y0_unobs + common_unobs + rnorm(m, 0, 0.1)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-4}
comp4 <- run_comparison(dgp_high_common_mode)
print_comparison(comp4, "Scenario 4: High Common-Mode Prediction Error")
```

**Why**: $\text{Var}(S^{(1)})$ and $\text{Var}(S^{(0)})$ are inflated by the
common error, which hurts PPI++/D-T through the $\lambda^2 \text{Var}(S)/m$
term. But $\text{Var}(S^{(1)} - S^{(0)})$ cancels the common error, so
the DiP unobserved term $\lambda^2 \text{Var}(S^{(1)} - S^{(0)})/m$ is much
smaller.


## Scenario 5: Common-Mode Error + Heterogeneous Quality

Combine the advantages of scenarios 3 and 4: predictions share common-mode
error (favoring DiP) but quality differs across arms (favoring D-T). D-T DiP
should beat both D-T and DiP++.

```{r dgp-common-mode-heterogeneous}
dgp_common_mode_heterogeneous <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 200; m <- 500

  D_obs <- rep(c(1, 0), each = n / 2)
  Y0_obs <- rnorm(n)
  Y1_obs <- Y0_obs + true_tau
  Y <- D_obs * Y1_obs + (1 - D_obs) * Y0_obs

  common_obs <- rnorm(n, 0, 1.2)
  # Treatment arm: good signal; control arm: weak signal
  S1_obs <- 0.9 * Y1_obs + common_obs + rnorm(n, 0, 0.3)
  S0_obs <- 0.2 * Y0_obs + common_obs + rnorm(n, 0, 0.5)

  D_unobs <- rep(c(1, 0), each = m / 2)
  Y0_unobs <- rnorm(m)
  Y1_unobs <- Y0_unobs + true_tau
  common_unobs <- rnorm(m, 0, 1.2)
  S1_unobs <- 0.9 * Y1_unobs + common_unobs + rnorm(m, 0, 0.3)
  S0_unobs <- 0.2 * Y0_unobs + common_unobs + rnorm(m, 0, 0.5)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-5}
comp5 <- run_comparison(dgp_common_mode_heterogeneous)
print_comparison(comp5, "Scenario 5: Common-Mode Error + Heterogeneous Quality")
```

**Why**: D-T DiP can set $\lambda_1$ large (good treatment predictions) and
$\lambda_0$ small (noisy control predictions), while still exploiting the
common-mode cancellation in $\lambda_1 S^{(1)} - \lambda_0 S^{(0)}$.


## Scenario 6: Near-Perfect Predictions

When predictions are nearly unbiased and have very high correlation with $Y$
($\rho \approx 0.995$), the optimal $\lambda^*$ is close to 1. With small $n$
(few observations per cross-fitting fold), estimating $\lambda$ adds
finite-sample noise. DiP with fixed $\lambda = 1$ dominates: it avoids
the lambda-estimation cost while retaining the structural advantage of
using all $m$ unobserved units.

```{r dgp-near-perfect}
dgp_near_perfect <- function(seed) {
  set.seed(seed)
  true_tau <- 0.5
  n <- 100; m <- 500

  D_obs <- rep(c(1, 0), each = n / 2)
  Y0_obs <- rnorm(n)
  Y1_obs <- Y0_obs + true_tau
  Y <- D_obs * Y1_obs + (1 - D_obs) * Y0_obs

  # Near-perfect predictions: rho = 1/sqrt(1 + 0.01) ~ 0.995
  S1_obs <- Y1_obs + rnorm(n, 0, 0.1)
  S0_obs <- Y0_obs + rnorm(n, 0, 0.1)

  D_unobs <- rep(c(1, 0), each = m / 2)
  Y0_unobs <- rnorm(m)
  Y1_unobs <- Y0_unobs + true_tau
  S1_unobs <- Y1_unobs + rnorm(m, 0, 0.1)
  S0_unobs <- Y0_unobs + rnorm(m, 0, 0.1)

  list(
    obs_df = data.frame(Y = Y, D = D_obs, S0 = S0_obs, S1 = S1_obs),
    unobs_df = data.frame(D = D_unobs, S0 = S0_unobs, S1 = S1_unobs),
    true_tau = true_tau
  )
}
```

```{r run-scenario-6}
comp6 <- run_comparison(dgp_near_perfect)
print_comparison(comp6, "Scenario 6: Near-Perfect Predictions")
```

**Why**: When $\lambda^* \approx 1$, estimating $\lambda$ via cross-fitting
(with only $n/2$ observations per fold) adds finite-sample variance without
meaningful bias reduction. DiP with fixed $\lambda = 1$ beats DiP++ and D-T DiP
by avoiding this estimation cost. Meanwhile, DiP dominates all PPI-style
estimators (GREG, PPI++, D-T) by a large margin thanks to the structural
advantage of using all $m$ unobserved units rather than splitting by arm.


## Summary Table

```{r summary-table}
scenarios <- list(
  "1: Poor predictions"        = comp1,
  "2: Neg corr predictions"    = comp2,
  "3: Heterogeneous quality"   = comp3,
  "4: High common-mode error"  = comp4,
  "5: Common-mode + hetero"    = comp5,
  "6: Near-perfect"            = comp6
)

summary_df <- do.call(rbind, lapply(names(scenarios), function(nm) {
  comp <- scenarios[[nm]]
  best_idx <- which.min(comp$mse)
  data.frame(
    scenario = nm,
    winner = comp$estimator[best_idx],
    winner_mse = comp$mse[best_idx],
    winner_var = comp$variance[best_idx],
    dim_var = comp$variance[comp$estimator == "dim"],
    reduction_pct = round(
      (1 - comp$variance[best_idx] / comp$variance[comp$estimator == "dim"]) * 100, 1
    ),
    stringsAsFactors = FALSE
  )
}))

knitr::kable(summary_df,
             col.names = c("Scenario", "Best Estimator", "Best MSE",
                           "Best Var", "DiM Var", "Var Reduction (%)"),
             digits = 5,
             caption = "Which estimator achieves the lowest Monte Carlo MSE under each DGP?")
```