bmass Introduction – Real Data

Michael C Turchin

2019-05-16

This is an advanced introductory vignette for running bmass. Here, we will use real data and explore multiple components of the bmass() output. Specifically, we will:

For a more basic introductory vignette using a smaller, simulated dataset, please see here.

Downloading GlobalLipids2013

For this vigenette, we will be download and analyze the GlobalLipids2013 dataset. To download the necessary files from http://csg.sph.umich.edu/willer/public/lipids2013/, run the following:

mkdir my-directory/GlobalLipids2013
cd my-directory/GlobalLipids2013

wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_HDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TG.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TC.txt.gz

Formatting & QC’ing our dataset

To run bmass, it is recommended that at least a minimal level of QC is performed on the datasets of interest. Additionally, it is important that all the single nucleotide polymorphisms (SNPs) being analyzed are oriented to the same reference allele across all the phenotypes being included. This latter situation is emphasized so that the correct direction of effect is assigned to each phenotype for a given SNP (which is oriented based on reference allele).

Below is example bash code on how to reformat, conduct some basic QC, and orient the reference alleles on the GlobalLipids2013 dataset. First, we will reformat our input datafiles to match the requirements for bmass. bmass expects input files to have the following columns –

To create files that match these specifications, we will conduct the following: extract the columns that are needed, orient SNPs to the minor allele, and extract the direction of effect from the provided effect sizes.

zcat jointGwasMc_HDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_HDL.formatted.txt.gz 

zcat jointGwasMc_LDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_LDL.formatted.txt.gz 

zcat jointGwasMc_TG.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TG.formatted.txt.gz 

zcat jointGwasMc_TC.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TC.formatted.txt.gz 

Next, we will conduct some basic QC. These steps include: remove SNPs that have duplicate entries, that do not have entries in every field, do not have MAF information, that are fixed (MAF = 0), are rare (MAF < .01), have sample sample sizes < 50000, and have effect sizes of 0 (ie direction of effect is indeterminable). We will also orient each SNP/phenotype file to its respective minor allele.

join <(zcat jointGwasMc_HDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_HDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_HDL.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_LDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_LDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_LDL.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_TG.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TG.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TG.formatted.QCed.txt.gz 

join <(zcat jointGwasMc_TC.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TC.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TC.formatted.QCed.txt.gz 

Lastly, we will orient all phenotype files to the HDL minor allele:

join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_LDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz

join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TG.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz

join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TC.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz

Running bmass

Now with our four GlobalLipids2013 files properly formatted and QCed, we are able to run bmass. To do so, we will also need the list of published univariate GWAS SNPs from the GlobalLipids2013 publication (presented as two columns per SNP, chromosome and basepair position) – these have been provided in the data subdirectory (GlobalLipids2013.GWASsnps.txt). To run bmass with our prepared input datafiles and our list of univariate GWAS SNPs, we run the following R code:

library("bmass");
HDL <- read.table("jointGwasMc_HDL.formatted.QCed.txt.gz", header=T);
LDL <- read.table("jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz", header=T);
TG <- read.table("jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz", header=T); 
TC <- read.table("jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz", header=T);
load("bmassDirectory/data/GlobalLipids2013.GWASsnps.rda");
Phenotypes <- c("HDL", "LDL", "TG", "TC");
bmassResults <- bmass(Phenotypes, GlobalLipids2013.GWASsnps);

Note once again that to run bmass we supply a vector of phenotype names that also correspond to the variable names referencing each phenotype’s respective data file. Also note that you will see a warning that one or more of the GlobalLipids2013 datafiles contains p-values smaller than the threshold at which R can handle converting them into logarthmic scale – this is to be expected.

Accessing & exploring results

bmass returns a list containing multiple areas of information, including separate sublists pertaining to the previous univariate GWAS SNPs used for training the multivariate model priors and the new multivariate SNPs (if any) bmass has idenfitifed as significant. Run summary(bmassResults) and you should see the following:

> summary(bmassResults)
                      Length Class  Mode     
MarginalSNPs             3   -none- list     
PreviousSNPs             4   -none- list     
NewSNPs                  3   -none- list     
LogFile                 20   -none- character
ZScoresCorMatrix        16   -none- numeric  
Models                 324   -none- numeric  
ModelPriors           1134   -none- numeric  
GWASlogBFMinThreshold    1   -none- numeric  

We will touch on all the above outputs.

NewSNPs

The NewSNPs sublist contains information pertaining to the new significant multivariate associations found by bmass, if any were identified. Running summary(bmassResults$NewSNPs) you should see the following:

bmassResults$NewSNPs$SNPs contains the ‘main’ bmass output of interest – the list of new multivariate associations found by bmass and those SNPs’ related information. The format of this output appears as follows:

NewSNPs$logBFs contains the log10 Bayes Factor (BF) for each multivariate model tested (in the form of a Model x SNP matrix, with the first columns of the matrix describing each model). NewSNPs$Posteriors is a similar matrix, just with each matrix entry representing the posterior probability for each model/SNP combination instead of the log10 BF.

PreviousSNPs

PreviousSNPs contains similar information as NewSNPs does, but pertaining to the set of previous univariate GWAS significant SNPs used to train the multivariate model priors and set the empirical significance threshold for weighted average BF. Running summary(bmassResults$PreviousSNPs) you see:

PreviousSNPs$SNPs, PreviousSNPs$logBFs, and PreviousSNPs$Posteriors all match the descriptions as above with NewSNPs. However PreviousSNPs$DontPassSNPs refers to any GWAS SNPs which are included in the final publicaiton list but which do meet the original study’s univariate GWAS p-value threshold based on the dataset released. This can for instance occur when a final GWAS SNP list is produced from combining a discovery dataset with a replication dataset, and the former data is the only part of the study publicly released. Therefore bmass keeps track of these SNPs so as not to call them a completely novel multivariate SNP, but also to not use them to train the multivariate model priors since they do not technically reach univariate GWAS significance based on the dataset provided alone.

The format of DontPassSNPs is the same seen in NewSNPs$SNPs and PreviousSNPs$SNPs.

MarginalSNPs

MarginalSNPs contains the same information as NewSNPs and PreviousSNPs, but pertaining to the set of marginally significant SNPs extracted from the intermediate MergedDataSources dataset (a data.frame that contains all the input phenotype files combined) based on the SNPMarginalUnivariateThreshold and SNPMarginalMultivariateThreshold values (default of 1e-6 for each). summary(bmassResults$MarginalSNPs) returns:

MarginalSNPs$SNPs, MarginalSNPs$logBFs, and MarginalSNPs$Posteriors are all as described above.

ZScoresCorMatrix

This is the phenotype correlation matrix (V_0 hat in Detailed Methods (Global Lipids Analysis) of Stephens 2013 PLoS ONE) derived from extracting all the ‘null’ SNPs (abs(ZScore) < 2 for every phenotype) in the dataset.

Models

This is a matrix displaying all the model combinations possible given the set of d input phenotypes and 3 model categories of directly associated (1), indirectly associated (2), and unassociated (0). The total number of models displayed should be 3^d.

ModelPriors

This is the set of trained priors determined for each multivariate model using the previous univariate GWAS SNPs and an EM algorithm (see eq. 36 in Stephens 2013). Note that for any model which does not contain at least one phenotype in the directly associated category, the prior was automatically set to 0 (aside from the global null model of all phenotypes set to unassociated).

Also note that ModelPriors contains a full set of model priors for each value of the hyperparameter sigma_alpha (see “Prior on sigma_alpha” in Stephens 2013); these values are stored in SigmaAlphas, and by default are set to c(0.005,0.0075,0.01,0.015,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.15) (hence 81 models * 14 sigma_alphas = 1134 priors). To see a more interpretable form of the model priors, use the GetModelPriorMatrix() function as shown below.

ModelPriorMatrix contains the model descriptions as provided in Models along with the combined trained priors across each sigma_alpha value; the ModelPriomatrix is sorted by decreasing model prior value and also shows the cumulative distribution of the priors as well as the original order position of each model.

GWASlogBFMinThreshold

This is the empirical weighted average BF threshold (BFavg) used to determine whether any new SNPs are ‘multivariate significant’. As described in Stephens 2013, each SNP has a single summary metric combining all the information across each multivariate model tested; this summary metric is the weighted average of each model’s BF, where weights are determined from the previous univariate GWAS SNPs (eq. 1 in Turchin and Stephens 2019 or eq. 8 in Stephens 2013). This summary metric is also calculated for the previous unviariate GWAS SNPs themselves, and the smallest BFavg among them becomes the threshold above which new SNPs are considered multivariate significant.

So every new multivariate SNP bmass identifies in this GlobalLipids2013 analysis has a BFavg value >= 4.289906 (and is also more than a 1Mb window away from a previous univariate GWAS SNP).