Two-Stage Estimation With g-estimation

library(trtswitch)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

Introduction

TSEgest is an extension of the simple two-stage estimation (TSE) method by incorporating a structural nested model (SNM) and utilizing g-estimation. This allows a delay between disease progression (secondary baseline) and treatment switch, provided that time-dependent confounding variables that predict switching and survival are measured beyond the secondary baseline and included in the model for treatment switching. One key assumption for the TSEgest method is no unmeasured confounding, i.e., switching is independent of potential outcomes conditional on measured variables.

Estimation of \(\psi\)

To derive the g-estimate of \(\psi\), we utilize a logistic regression model for switching \[ \textrm{logit}(p(E_{ik})) = \alpha U_{i,\psi} + \sum_{j} \beta_j x_{ijk} \] alongside a structural model for counterfactual survival times \[ U_{i,\psi} = T_{C_i} + e^{\psi}T_{E_i} \]

Key Components

Estimation of Hazard Ratio

Once \(\psi\) has been estimated, we can derive an adjusted data set and fit a (potentially stratified) Cox proportional hazards model to the adjusted data set to obtain an estimate of the hazard ratio. The confidence interval for the hazard ratio can be derived by bootstrapping the entire adjustment and subsequent model-fitting process.

Example

We start by preparing the data.

sim1 <- tsegestsim(
  n = 500, allocation1 = 2, allocation2 = 1, pbprog = 0.5, 
  trtlghr = -0.5, bprogsl = 0.3, shape1 = 1.8, 
  scale1 = 0.000025, shape2 = 1.7, scale2 = 0.000015, 
  pmix = 0.5, admin = 5000, pcatnotrtbprog = 0.5, 
  pcattrtbprog = 0.25, pcatnotrt = 0.2, pcattrt = 0.1, 
  catmult = 0.5, tdxo = 1, ppoor = 0.1, pgood = 0.04, 
  ppoormet = 0.4, pgoodmet = 0.2, xomult = 1.4188308, 
  milestone = 546, swtrt_control_only = TRUE,
  outputRawDataset = 1, seed = 2000)

Next we apply the TSE method with g-estimation.

fit1 <- tsegest(
  data = sim1$paneldata, id = "id", 
  tstart = "tstart", tstop = "tstop", event = "died", 
  treat = "trtrand", censor_time = "censor_time", 
  pd = "progressed", pd_time = "timePFSobs", swtrt = "xo", 
  swtrt_time = "xotime", swtrt_time_upper = "xotime_upper",
  base_cov = "bprog", conf_cov = "bprog*catlag", 
  low_psi = -3, hi_psi = 3, strata_main_effect_only = TRUE,
  recensor = TRUE, admin_recensor_only = TRUE, 
  swtrt_control_only = TRUE, alpha = 0.05, ties = "efron", 
  tol = 1.0e-6, boot = FALSE)

The Kaplan-Meier plot for the control arm demonstrates that treatment switching can occur at the secondary baseline and at each of the ensuing five scheduled visits, spaced 21 days apart.

switched <- fit1$analysis_switch$data_switch[[1]]$data %>% 
  filter(swtrt == 1)
table(switched$swtrt_time)
#> 
#>   0  21  42  63  84 105 
#>  16   8  15  14  11  15
ggplot(fit1$analysis_switch$km_switch[[1]]$data, 
       aes(x=time, y=survival)) + 
  geom_step() + 
  scale_y_continuous(limits = c(0,1)) + 
  scale_x_continuous(breaks = seq(0,105,21)) + 
  xlab("time from progression to switch") + 
  ggtitle(paste("trtrand: ", 
                fit1$analysis_switch$km_switch[[1]]$trtrand)) + 
  theme(panel.grid.minor.x = element_blank())

We can examine the logistic regression switching model to confirm that the coefficient associated with the counterfactual survival time (the martingale residuals for the null Cox model) is effectively zero. To account for the potential correlation of multiple observations within a subject, a robust sandwich variance estimator is employed, clustering on the subject level for the logistic regression model.

parest <- fit1$analysis_switch$fit_logis[[1]]$fit$parest
parest[, c("param", "beta", "sebeta", "z")]
#>            param          beta    sebeta             z
#> 1    (Intercept) -3.2107237230 0.2606354 -12.318832320
#> 2 counterfactual  0.0003012517 0.1503624   0.002003504
#> 3          bprog  1.6012096263 0.3166177   5.057233481
#> 4         catlag  0.9283091909 0.5326442   1.742831583
#> 5   bprog.catlag -0.0119634791 0.6394798  -0.018708141

The plot of \(Z(\psi)\) versus \(\psi\) shows that the estimation process worked well.

c(fit1$psi, fit1$psi_CI)
#> [1] -0.5193008 -0.9105605 -0.2065572
```{r Z(psi)}
psi_CI_width <- fit1$psi_CI[2] - fit1$psi_CI[1]

ggplot(fit1$analysis_switch$eval_z[[1]]$data %>% 
         filter(psi > fit1$psi_CI[1] - psi_CI_width*0.25 & 
                  psi < fit1$psi_CI[2] + psi_CI_width*0.25), 
       aes(x=psi, y=Z)) + 
  geom_line() + 
  geom_hline(yintercept = c(0, -1.96, 1.96), linetype = 2) + 
  scale_y_continuous(breaks = c(0, -1.96, 1.96)) + 
  geom_vline(xintercept = c(fit1$psi, fit1$psi_CI), linetype = 2) + 
  scale_x_continuous(breaks = round(c(fit1$psi, fit1$psi_CI), 3)) + 
  ylab("Wald Z for counterfactual") + 
  theme(panel.grid.minor = element_blank())
```

Now we fit the outcome Cox model and compare the treatment hazard ratio estimate with the reported.

fit1$fit_outcome$parest[, c("param", "beta", "sebeta", "z")]
#>     param       beta     sebeta         z
#> 1 treated -0.5345281 0.09762388 -5.475383
#> 2   bprog  0.3511980 0.09132221  3.845702
c(fit1$hr, fit1$hr_CI)
#> [1] 0.5859457 0.4839047 0.7095042

Finally, to ensure the uncertainty is accurately represented, the entire adjustment process and subsequent survival modeling can be bootstrapped.

```{r boot}
fit2 <- tsegest(
  data = sim1$paneldata, id = "id", 
  tstart = "tstart", tstop = "tstop", event = "died", 
  treat = "trtrand", censor_time = "censor_time", 
  pd = "progressed", pd_time = "timePFSobs", swtrt = "xo", 
  swtrt_time = "xotime", swtrt_time_upper = "xotime_upper",
  base_cov = "bprog", conf_cov = "bprog*catlag", 
  low_psi = -3, hi_psi = 3, strata_main_effect_only = TRUE,
  recensor = TRUE, admin_recensor_only = TRUE, 
  swtrt_control_only = TRUE, alpha = 0.05, ties = "efron", 
  tol = 1.0e-6, boot = TRUE, n_boot = 1000, seed = 12345)

c(fit2$hr, fit2$hr_CI)
```