The conditional dependence structure (CDS) of a data matrix may be modeled as an undirected graph, where an edge connects two variables if, and only if, they are dependent given the remaining variables.
There is extensive GGM literature and many R packages for GGMs; however, all of these make the restrictive assumption that the precision matrix is homogeneous throughout the data, or that there exists a partition of homogeneous subgroups. covdepGE avoids this strong assumption by using information sharing to model the CDS as varying continuously with an extraneous covariate. Intuitively, this implies that observations with similar extraneous covariate values will have similar precision matrices.
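The intuition behind covariate-driven information sharing can be sketched with a simple kernel: observations whose extraneous covariate values are close receive large mutual weights. The Gaussian kernel and bandwidth below are illustrative stand-ins, not the package's exact internal choice.

``` r
# Sketch: turn covariate similarity into pairwise weights
# (Gaussian kernel and bandwidth tau are illustrative only)
similarity_weights <- function(z, tau = 1) {
  d <- as.matrix(dist(z))   # pairwise distances between covariate values
  exp(-(d / tau)^2 / 2)     # closer covariates -> weights nearer 1
}

w <- similarity_weights(c(0, 0.1, 5))
```

Observations 1 and 2 (covariates 0 and 0.1) get a weight near 1, while observations 1 and 3 (covariates 0 and 5) get a weight near 0, so the first pair share far more information than the second.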
To facilitate information sharing while managing complexity, covdepGE uses an efficient variational approximation conducted under the novel weighted pseudo-likelihood framework proposed by (1). covdepGE further accelerates inference by employing parallelism and executing expensive iterative computations in C++. Additionally, covdepGE offers a principled, data-driven approach to hyperparameter specification that only requires the user to input data and extraneous covariates to perform inference. Finally, covdepGE offers several wrappers around ggplot2 for seamless visualization of resulting estimates, such as matViz, inclusionCurve, and the S3 method plot.covdepGE.
You can install the released version of covdepGE from CRAN with:
install.packages("covdepGE")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("JacobHelwig/covdepGE")
If you have an idea to improve covdepGE, consider forking the repo and creating a pull request, or opening an issue.
library(covdepGE)
library(ggplot2)
# get the data
set.seed(12)
data <- generateData()
X <- data$X
Z <- data$Z
interval <- data$interval
prec <- data$true_precision
# get overall and within interval sample sizes
p <- ncol(X)
n <- nrow(X)
n1 <- sum(interval == 1)
n2 <- sum(interval == 2)
n3 <- sum(interval == 3)
# visualize the distribution of the extraneous covariate
ggplot(data.frame(Z = Z, interval = as.factor(interval))) +
geom_histogram(aes(Z, fill = interval), color = "black", bins = n %/% 5)
# visualize the true precision matrices in each of the intervals
# interval 1
matViz(prec[[1]], incl_val = TRUE) +
ggtitle(paste0("True precision matrix, interval 1, observations 1,...,", n1))
# interval 2 (varies continuously with Z)
cat("\nInterval 2, observations ", n1 + 1, ",...,", n1 + n2, sep = "")
#>
#> Interval 2, observations 61,...,120
int2_mats <- prec[interval == 2]
int2_inds <- c(5, n2 %/% 2, n2 - 5)
lapply(int2_inds, function(j) matViz(int2_mats[[j]], incl_val = TRUE) +
  ggtitle(paste("True precision matrix, interval 2, observation", j + n1)))
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
# interval 3
matViz(prec[[length(prec)]], incl_val = TRUE) +
  ggtitle(paste0("True precision matrix, interval 3, observations ",
                 n1 + n2 + 1, ",...,", n1 + n2 + n3))
# fit the model and visualize the estimated graphs
(out <- covdepGE(X, Z, parallel = T, num_workers = p))
#> Warning in covdepGE(X, Z, parallel = T, num_workers = p): No registered workers
#> detected; registering doParallel with 5 workers
#> Covariate Dependent Graphical Model
#>
#> ELBO: -171501.68 # Unique Graphs: 3
#> n: 180, variables: 5 Hyperparameter grid size: 125 points
#> Model fit completed in 2.229 secs
plot(out)
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
# visualize the posterior inclusion probabilities for variables (1, 2) and (1, 3)
inclusionCurve(out, 1, 2)
inclusionCurve(out, 1, 3)
Math is correctly rendered in the version of this document available on CRAN.
Suppose that the data are Gaussian, with a precision matrix that varies as a continuous function of an extraneous covariate. Given data satisfying these assumptions, the covdepGE function employs the algorithm described in (1) to estimate a graphical representation of the CDS.
Graphs are constructed using a pseudo-likelihood approach: each of the columns of the data matrix is fixed in turn and regressed on the remaining columns. The resulting matrix of posterior inclusion probabilities is symmetrized according to sym_method (e.g., by taking the mean of each pair of corresponding probabilities). If a symmetrized inclusion probability exceeds edge_threshold, an edge will be included between the corresponding pair of variables.
To model the CDS as a function of the extraneous covariate, each of the regressions is weighted according to the similarity of the observations' covariate values.
Spike-and-slab posterior quantities are estimated using a block
mean-field variational approximation. Coordinate Ascent Variational
Inference (CAVI) is performed for each of the weighted regressions to
select the variational parameters that maximize the ELBO. The variational
parameters for each regression coefficient are the mean and variance of the
slab, along with the probability that the coefficient is non-zero.
CAVI for a regression is considered converged when the largest absolute
change in these inclusion probabilities falls below alpha_tol,
or after max_iter iterations of CAVI have been performed.
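The stopping rule can be sketched as follows; the update inside the loop is a toy contraction standing in for the actual CAVI coordinate updates, which are not reproduced here.

``` r
# Sketch of the CAVI stopping rule: iterate until the largest absolute
# change in the inclusion probabilities alpha falls below alpha_tol,
# or until max_iter iterations have been performed
# (the update step is a toy stand-in, not the real CAVI update)
alpha <- rep(0.5, 4)
alpha_tol <- 1e-5
max_iter <- 100
for (iter in seq_len(max_iter)) {
  alpha_new <- (alpha + 0.9) / 2   # toy contraction toward 0.9
  if (max(abs(alpha_new - alpha)) < alpha_tol) break
  alpha <- alpha_new
}
```

With this toy update, the change halves every iteration, so the tolerance is reached long before max_iter.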
Note that since the regressions performed for one variable are independent of those performed for the other variables, they may be run in parallel by setting parallel = T. Registering a parallel backend with more workers than there are variables offers no additional benefit, since there is one regression per variable.
Each regression requires the specification of hyperparameters: the prior inclusion probability, the error variance, and the slab variance. covdepGE offers three methods for hyperparameter specification via the hp_method argument: grid_search, model_average, and hybrid. Empirically, grid_search offers the best sensitivity and model_average offers the best specificity, while hybrid sits between the other two methods in both metrics.
The hyperparameter candidate grid is generated by taking the Cartesian product of ssq, sbsq, and pip (candidate values for the error variance, the slab variance, and the prior inclusion probability, respectively).
In grid_search, the point from the grid that produces the model with the greatest total ELBO is selected, where the total ELBO is calculated by summing the ELBOs of the individual weighted regressions.
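The selection step can be sketched in a few lines of base R; the ELBO values below are made up purely for illustration.

``` r
# Sketch of grid_search's selection: form the Cartesian product of the
# candidate grids and pick the point with the greatest total ELBO
# (total_elbo values here are an arbitrary stand-in)
grid <- expand.grid(ssq = c(0.1, 1), sbsq = c(0.5, 2), pip = c(0.01, 0.1))
grid$total_elbo <- -(grid$ssq - 1)^2 - (grid$sbsq - 2)^2 - grid$pip
best <- grid[which.max(grid$total_elbo), ]
```

Here the stand-in objective is maximized at ssq = 1, sbsq = 2, pip = 0.01, so that grid point would be selected.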
Instead of selecting only one model as in grid_search, models are averaged over in model_average. With model_average, the models corresponding to each of the grid points are combined using ELBO-based weights.
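The idea of ELBO-based averaging can be sketched as below; the softmax-style weighting and the numbers are illustrative of the concept, not the package's exact formula.

``` r
# Sketch of ELBO-based model averaging: convert each grid point's ELBO
# into a normalized weight and average the posterior inclusion
# probabilities (softmax weighting is illustrative, not the exact rule)
elbo <- c(-105, -100, -102)                  # made-up ELBOs, 3 grid points
pips <- rbind(c(0.9, 0.1), c(0.7, 0.3), c(0.8, 0.2))
w <- exp(elbo - max(elbo)); w <- w / sum(w)  # subtract max for stability
avg_pip <- colSums(w * pips)
```

Grid points with larger ELBOs dominate the average, so the combined inclusion probabilities are pulled toward the best-fitting models.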
Finally, hybrid combines grid_search and model_average. Fixing pip at each of its candidate values in turn, the point in the grid defined by the Cartesian product of ssq and sbsq is selected by maximizing the total ELBO summed across the regressions; the resulting models, one for each candidate value of pip, are then averaged over.
Note that in the search step of grid_search and hybrid, CAVI for each of the grid points is performed for at most max_iter_grid iterations. A second CAVI is then performed for max_iter iterations using the hyperparameters selected in the search step. Setting max_iter_grid to be less than max_iter (as is the default) will result in a more efficient search.
The candidate grids (ssq, sbsq, and pip) may be passed as arguments; however, by default, these grids are generated automatically. Each of the grids is spaced uniformly between an upper endpoint and a lower endpoint. The number of points in each grid is controlled by nssq, nsbsq, and npip. The lower endpoints (ssq_lower, sbsq_lower, and pip_lower) are all 1e-5 by default. The upper endpoints are calculated depending on the variable currently fixed as the response.
ssq_upper is simply the variance of the response variable scaled by ssq_mult. By default, ssq_mult is 1.5.
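The default grid construction can be sketched for ssq as below; the response column x_j is simulated, and the endpoints follow the text (1e-5 lower bound, ssq_mult times the response variance as the upper bound). The grid size of 5 is arbitrary here.

``` r
# Sketch of the default ssq grid: uniformly spaced candidates between the
# 1e-5 lower endpoint and the data-driven upper endpoint ssq_mult * var
set.seed(1)
x_j <- rnorm(100)        # stand-in for the response column
ssq_mult <- 1.5          # default multiplier per the text
ssq_upper <- ssq_mult * var(x_j)
ssq <- seq(1e-5, ssq_upper, length.out = 5)
```

The sbsq and pip grids are built the same way, with their own counts and upper endpoints.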
pip_upper is calculated by regressing the fixed variable on the remaining variables using the LASSO, with the regularization parameter chosen via lambda.1se. The number of non-zero coefficients estimated by the LASSO is then divided by p - 1 to calculate pip_upper. Note that if the LASSO estimate of the number of non-zero coefficients is 0 or p - 1, it is adjusted so that pip_upper is greater than 0 and less than 1.
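The arithmetic can be sketched as follows; the clamp to [1, p - 2] is one simple way to keep pip_upper strictly between 0 and 1, shown here for illustration rather than as the package's exact adjustment.

``` r
# Sketch of the pip_upper calculation: divide the LASSO support size by
# p - 1, clamping the count away from 0 and p - 1 first so 0 < pip_upper < 1
# (the clamp shown is illustrative, not necessarily the exact adjustment)
p <- 5
n_nonzero <- 0                              # suppose LASSO selected nothing
n_nonzero <- min(max(n_nonzero, 1), p - 2)  # clamp to [1, p - 2]
pip_upper <- n_nonzero / (p - 1)
```

With p = 5 and an empty LASSO support, the clamped count is 1 and pip_upper is 0.25.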
Finally, an upper bound is induced on sbsq: an expression involving pip_upper and sbsq_upper gives an upper bound on the signal-to-noise ratio, and setting this bound equal to snr_upper yields an expression for sbsq_upper.
The similarity weight for each pair of observations is calculated from the distance between their extraneous covariate values, with the distance determined by the norm argument. The bandwidth tau may be passed as an argument; however, by default, it is estimated using the methodology given in (2), which describes a two-step approach for density estimation: in the first step, an initial estimate is calculated using Silverman's rule of thumb for initializing bandwidth values, and in the second step, the density is refined by updating the bandwidth values. This methodology is used here to select tau.
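The first of the two steps, Silverman's rule-of-thumb bandwidth, is available in base R as bw.nrd0; the refinement step from (2) is omitted here, so this is only a sketch of the initialization.

``` r
# Sketch of the first step of bandwidth estimation: Silverman's rule of
# thumb, 0.9 * min(sd, IQR / 1.34) * n^(-1/5), via base R's bw.nrd0
# (the second, refining step of the two-step method is omitted)
set.seed(1)
Z <- rnorm(200)          # stand-in for the extraneous covariate
tau0 <- bw.nrd0(Z)       # initial bandwidth estimate
```

For 200 standard-normal covariate values this gives an initial bandwidth of roughly 0.3, which the second step would then refine.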
Sutanoy Dasgupta, Peng Zhao, Prasenjit Ghosh, Debdeep Pati, and Bani Mallick. An approximate Bayesian approach to covariate-dependent graphical modeling. pages 1–59, 2022.
Sutanoy Dasgupta, Debdeep Pati, and Anuj Srivastava. A Two-Step Geometric Framework For Density Modeling. Statistica Sinica, 30(4):2155–2177, 2020.