--- title: "Introduction to CDsampling" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to CDsampling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(CDsampling) ``` ```{r} set.seed(123) ``` In the context of paid research studies and clinical trials, budget considerations and patient sampling from available populations are subject to inherent constraints. **CDsampling** integrates optimal design theories within the framework of constrained sampling. * This package offers the possibility to find both D-optimal approximate and exact allocations for sampling with or without constraints. * Additionally, it provides functions to find constrained uniform sampling as a robust sampling strategy with limited model information. * It also provides tool for computation of Fisher information matrix of the generalized linear models (GLMs) including regular linear regression model and the multinomial logistic models (MLMs). * Two datasets are embedded in the package for application examples Reference: Constrained D-optimal Design for Paid Research Study (2024+), Statistica Sinica, to appear, DOI: 10.5705/ss.202022.0414, Yifei Huang, Liping Tong, and Jie Yang ## Computation of Fisher information matrix #### Example GLM Fisher information matrix Consider a research study with a simple logistic regression model $$\log(\frac{\mu_i}{1-\mu_i}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$ where $\mu_i = E(Y_i\mid {\mathbf x}_i)$, ${\mathbf x}_i = (x_{i1}, x_{i2})^\top \in \{(-1, -1), (-1, +1), (+1, -1)\}$ and parameters $\boldsymbol \beta = (\beta_0, \beta_1, \beta_2) = (0.5, 0.5, 0.5)$. The approximate allocation to the three design points is $\mathbf w = (1/3,1/3,1/3)^\top$ To calculate Fisher information matrix of the design with GLM, we can use *F_func_GLM* in the package. ```{r} beta = c(0.5, 0.5, 0.5) X = matrix(data=c(1,-1,-1,1,-1,1,1,1,-1), byrow=TRUE, nrow=3) w = c(1/3,1/3,1/3) #approximate allocation CDsampling::F_func_GLM(w=w, beta=beta, X=X, link='logit') ``` #### Example of MLM Fisher information matrix Consider a research study with cumulative npo multinomial logit model with $J=5$ response levels with covariates $(x_{i1}, x_{i2})=\{(1,0),(2,0),(3,0), (4,0),(1,1),(2,1),(3,1),(4,1)\}$. The model can be written as:$$\log(\frac{\pi_{i1}}{\pi_{i2}+\dots+\pi_{i5}}) = \beta_{11}+\beta_{12}x_{i1}+\beta_{13}x_{i2}$$ $$\log(\frac{\pi_{i1}+\pi_{i2}}{\pi_{i3}+\pi_{i4}+\pi_{i5}}) = \beta_{21}+\beta_{22}x_{i1}+\beta_{23}x_{i2}$$ $$\log(\frac{\pi_{i1}+\pi_{i2}+\pi_{i3}}{\pi_{i4}+\pi_{i5}}) = \beta_{31}+\beta_{32}x_{i1}+\beta_{33}x_{i2}$$ $$\log(\frac{\pi_{i1}+\dots+\pi_{i4}}{\pi_{i5}}) = \beta_{41}+\beta_{42}x_{i1}+\beta_{43}x_{i2}$$ where $i=1,\dots,8$, we assume the parameters $\boldsymbol \beta = (\beta_{11}, \beta_{12}, \beta_{13}, \beta_{21}, \beta_{22}, \beta_{23}, \beta_{31}, \beta_{32}, \beta_{33}, \beta_{41}, \beta_{42}, \beta_{43})^\top = (-4.047, -0.131, 4.214, -2.225, -0.376,$ $3.519, -0.302, -0.237, 2.420, 1.386, -0.120, 1.284)^\top$. The approximate allocation for the eight design points is $\mathbf w = (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8)^\top$. To calculate the Fisher information matrix with MLM, we can use the *F_func_MLM* in the package. ```{r} J=5; p=12; m=8; beta = c(-4.047, -0.131, 4.214, -2.225, -0.376, 3.519, -0.302, -0.237, 2.420, 1.386, -0.120, 1.284) Xi=rep(0,J*p*m) dim(Xi)=c(J,p,m) Xi[,,1] = rbind(c( 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,2] = rbind(c( 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,3] = rbind(c( 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,4] = rbind(c( 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,5] = rbind(c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,6] = rbind(c( 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,7] = rbind(c( 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,8] = rbind(c( 1, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 4, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) alloc = rep(1/8,m) #approximate allocation CDsampling::F_func_MLM(w=alloc, beta=beta, X=Xi, link='cumulative') ``` ## Data: trial_data & Constrained sampling with GLM *trial_data* is a simulated dataset that contains $N=500$ volunteers gender, age, and final efficacy information. The details as below: gender is coded as $0$ for female and $1$ for male. Age group is coded as both $age\_1 = 0$ and $age\_2 = 0$ for ages $18\sim25$, $age\_1 = 1$ for ages $26\sim64$, and $age\_2 = 1$ for ages $65$ and above. The number of design points $m=6$, that is, the number of combinations of gender and age groups $(x_{gender\_i}, x_{age\_1i}, x_{age\_2i})$ corresponds to $(0,0,0)$, $(0,1,0)$, $(0,0,1)$, $(1,0,0)$, $(1,1,0)$, $(1,0,1)$. We have the number of patients in each category as $(N_1, N_2, \ldots, N_6) = (50, 40, 10, $ $200, 150, 50)$. Suppose that a sample of $n=200$ participants is required due to budget limit.Our goal is to find the constrained D-optimal allocation $(w_1, w_2, \dots, w_6)$ with feasible allocation $S = \{(w_1, \ldots, w_m)^T \in S_0 \mid n w_i \leq N_i, i=1, \ldots, m\}$. We can use constrained lift-one algorithm *liftone_constrained_GLM* to find the locally D-optimal approximate sampling allocations. We consider the logistic regression model for $j=1,\dots,m$, $i=1,\dots,n_j$ with $\boldsymbol \beta=(\beta_{1}, \beta_{21}, \beta_{22}= (0,3,3,3)$: \begin{equation}\label{eq:trial_logistic_model} {\rm logit} \{P(Y_{ij}=1 \mid x_{gender\_i}, x_{age\_1i}, x_{age\_2i})\} = \beta_0 + \beta_1 x_{gender\_i} + \beta_{21} x_{age\_1i} + \beta_{22} x_{age\_2i} \end{equation} We use the following R codes to define the beta coefficients, sample size, and design matrix: ```{r} beta = c(0, 3, 3, 3) #coefficients X=matrix(data=c(1,0,0,0,1,0,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,1), ncol=4, byrow=TRUE) #design matrix X nsample=200 #sample size ``` To run the *liftone_constrained_GLM* function, we also need to the $\mathbf W$ matrix from the calculation of Fisher information matrix, we can use the *W_func_GLM* function in the package: ```{r} W_matrix=CDsampling::W_func_GLM(X=X, b=beta, link="logit") #W matrix ``` Lastly, we also need to define the constraints (number of patients from different gender and age group) and boundaries for constrained sampling (please see the reference for details of lower bound $r_{i1}$ and upper bound $r_{i2}$): ```{r} rc = c(50, 40, 10, 200, 150, 50)/nsample m = 6 g.con = matrix(0,nrow=(2*m+1), ncol=m) g.con[1,] = rep(1, m) g.con[2:(m+1),] = diag(m) g.con[(m+2):(2*m+1), ] = diag(m) g.dir = c("==", rep("<=", m), rep(">=", m)) g.rhs = c(1, rc, rep(0, m)) lower.bound=function(i, w){ nsample = 200 rc = c(50, 40, 10, 200, 150, 50)/nsample m=length(w) #num of categories temp = rep(0,m) temp[w>0]=1-pmin(1,rc[w>0])*(1-w[i])/w[w>0]; temp[i]=0; max(0,temp); } upper.bound=function(i, w){ nsample = 200 rc = c(50, 40, 10, 200, 150, 50)/nsample m=length(w) #num of categories rc[i]; min(1,rc[i]) } ``` Now, we can run the constrained lift-one algorithm: ```{r} set.seed(2025) approximate_design = CDsampling::liftone_constrained_GLM(X=X, W=W_matrix, g.con=g.con, g.dir=g.dir, g.rhs=g.rhs, lower.bound=lower.bound, upper.bound=upper.bound, reltol=1e-10, maxit=100, random=TRUE, nram=4, w00=NULL, epsilon=1e-8) print(approximate_design) ``` To find the exact allocation (integer value of allocation), we can use the *approxtoexact_constrained_func*: ```{r} exact_design = CDsampling::approxtoexact_constrained_func(n=200, w=approximate_design$w, m=6, beta=beta, link='logit', X=X, Fdet_func=Fdet_func_GLM, iset_func=iset_func_trial) print(exact_design) ``` **Note:** ```{} If we cannot obtain an approximated value of beta coefficients, we can also try to find the EW D-optimal allocations instead. For EW D-optimal allocation, we can come up with some prior distributions of beta coefficients, and replace the coefficients value with expectation of those prior distributions in the R codes. (Please see reference for more details). ``` If there is no knowledge of beta coefficients, we can use constrained uniform design: ```{r} w00 = rep(1/200, 6) unif_design = CDsampling::approxtoexact_constrained_func(n=200, w=w00, m=6, beta=NULL, link=NULL, X=NULL, Fdet_func=Fdet_func_unif, iset_func=iset_func_trial) print(unif_design) ``` ## Data: trauma_data & Constrained sampling with MLM *trauma_data* consists of $N=802$ trauma patients. After the treatment, there are five outcomes, Death ($1$), Vegetative state ($2$), Major disability ($3$), Minor disability ($4$) and Good recovery ($5$). We consider two covariates, the symptoms of the trauma ($x_{i1} = 0$ for mild or $1$ for moderate /severe), there are $392$ mild symptoms patients, and $410$ moderate/severe symptoms patients. The other covariate is the treatment dose levels ($x_{i2} = 1$ (Placebo), $2$ (Low dose), $3$ (Medium dose), and $4$ (High dose). In total, we have $m=8$ design points from the combination of levels from the two covariates. In this study, we plan to enroll $n=600$ patients and the collection of feasible allocation is $S=\{(w_1, \ldots, w_8)^\top \in S_0\mid n(w_1+w_2+w_3+w_4) \leq 392, n(w_5+w_6+w_7+w_8) \leq 410\}$. The model can be written as:$$\log(\frac{\pi_{i1}}{\pi_{i2}+\dots+\pi_{i5}}) = \beta_{11}+\beta_{12}x_{i1}+\beta_{13}x_{i2}$$ $$\log(\frac{\pi_{i1}+\pi_{i2}}{\pi_{i3}+\pi_{i4}+\pi_{i5}}) = \beta_{21}+\beta_{22}x_{i1}+\beta_{23}x_{i2}$$ $$\log(\frac{\pi_{i1}+\pi_{i2}+\pi_{i3}}{\pi_{i4}+\pi_{i5}}) = \beta_{31}+\beta_{32}x_{i1}+\beta_{33}x_{i2}$$ $$\log(\frac{\pi_{i1}+\dots+\pi_{i4}}{\pi_{i5}}) = \beta_{41}+\beta_{42}x_{i1}+\beta_{43}x_{i2}$$ In this example, we assume the parameter to be $\boldsymbol\beta = (\hat\beta_{11}, \hat\beta_{21}, \hat\beta_{31}, \hat\beta_{41}, \hat\beta_{12}, \hat\beta_{22}, \hat\beta_{32}, \hat\beta_{42}, \hat\beta_{13}, \hat\beta_{23}, \hat\beta_{33}, \hat\beta_{43})^\top = (-4.047, -2.225, -0.302, 1.386, 4.214, 3.519,\\ 2.420, 1.284, -0.131, -0.376, -0.237, -0.120)^\top$. In total, we have $m=8$ design points from the combination of levels from the two covariates. In this study, we plan to enroll $n=600$ patients and the collection of feasible allocation is $S=\{(w_1, \ldots, w_8)^\top \in S_0\mid n(w_1+w_2+w_3+w_4) \leq 392, n(w_5+w_6+w_7+w_8) \leq 410\}$. For this example, we can reuse the set-up of design matrix and beta coefficients from the MLM Fisher information matrix: ```{r} J=5; p=12; m=8; beta = c(-4.047, -0.131, 4.214, -2.225, -0.376, 3.519, -0.302, -0.237, 2.420, 1.386, -0.120, 1.284) Xi=rep(0,J*p*m) dim(Xi)=c(J,p,m) Xi[,,1] = rbind(c( 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,2] = rbind(c( 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,3] = rbind(c( 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,4] = rbind(c( 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,5] = rbind(c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,6] = rbind(c( 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,7] = rbind(c( 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) Xi[,,8] = rbind(c( 1, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 1, 4, 1, 0, 0, 0, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 0, 0), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1), c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)) ``` Similarly to the GLM example, we define the sample size, constraints, and boundaries using the following R codes: ```{r} nsample=600 #sample size constraint = c(392, 410) #mild:severe lower.bound <- function(i, w0){ n = 600 constraint = c(392,410) if(i <= 4){ a.lower <- (sum(w0[5:8])-(constraint[2]/n)*(1-w0[i]))/(sum(w0[5:8])) } else{ a.lower <- (sum(w0[1:4])-(constraint[1]/n)*(1-w0[i]))/(sum(w0[1:4])) } a.lower } upper.bound <- function(i, w0){ n = 600 constraint = c(392,410) if(i <= 4){ b.upper <- ((constraint[1]/n)*(1-w0[i]) - (sum(w0[1:4])-w0[i]))/(1-sum(w0[1:4])) } else{ b.upper <- ((constraint[2]/n)*(1-w0[i]) - (sum(w0[5:8])-w0[i]))/(1-sum(w0[5:8])) } b.upper } g.con = matrix(0,nrow=length(constraint)+1+m, ncol=m) g.con[2:3,] = matrix(data=c(1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1), ncol = m, byrow=TRUE) g.con[1,] = rep(1, m) g.con[4:(length(constraint)+1+m), ] = diag(1, nrow=m) g.dir = c("==", "<=","<=", rep(">=",m)) g.rhs = c(1, ifelse((constraint/nsample<1),constraint/nsample,1), rep(0, m)) ``` Then, we can use the constrained lift-one algorithm to find the locally D-optimal approximate sampling with *liftone_constrained_MLM* function in the package: ```{r} set.seed(123) approx_design = CDsampling::liftone_constrained_MLM(m=m, p=p, Xi=Xi, J=J, beta=beta, lower.bound=lower.bound, upper.bound=upper.bound, g.con=g.con, g.dir=g.dir, g.rhs=g.rhs, w00=NULL, link='cumulative', Fi.func = Fi_func_MLM, reltol=1e-5, maxit=500, delta = 1e-6, epsilon=1e-8, random=TRUE, nram=1) print(approx_design) ``` To get the exact allocation, we can again apply the *approxtoexact_constrained_func* in the package: ```{r} exact_design = CDsampling::approxtoexact_constrained_func(n=600, w=approx_design$w, m=8, beta=beta, link='cumulative', X=Xi, Fdet_func=Fdet_func_MLM, iset_func=iset_func_trauma) print(exact_design) ``` If there is no coefficients information, we can find the constrained uniform sampling with the folloiwng codes: ```{r} iset_func_trauma <- function(allocation){ iset = rep(1,8) if(sum(allocation[1:4])>=392){iset[1:4]=0} if(sum(allocation[5:8])>=410){iset[5:8]=0} return(iset) } unif_design = CDsampling::approxtoexact_constrained_func(n=600, w=rep(1/600,8), m=8, beta=NULL, link=NULL, X=NULL, Fdet_func=Fdet_func_unif, iset_func=iset_func_trauma) unif_design ```