scMappR relies on the following dependencies which should be downloaded/updated with scMappR automatically. Please ensure that these packages are not open when installing scMappR.
Install GSVA and pcaMethods from bioconductor first, as
devtools::install_githb()
will automatically install
CRAN.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
::install("pcaMethods")
BiocManager::install("GSVA")
BiocManager
::install_github("wilsonlabgroup/scMappR") devtools
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
::install("pcaMethods")
BiocManager::install("GSVA")
BiocManager
install.packages("scMappR")
Before using scMappR, we strongly recommend that you download data
files from “https://github.com/wilsonlabgroup/scMappR_Data”.Tthe
rda_path
parameter will allow users to set the directory
where these data are stored. If the correct files are not stored in this
directory, then scMappR will download these as temporary files from “https://github.com/wilsonlabgroup/scMappR_Data”, use
them in the function, and not save them. This parameter is set to “” as
a default and thus data are downloaded as a default.
Additionally, the get_signature_matrices()
function
allows the user to get the list of all stored signature matrices at any
given time. Input “all”, “pVal”, or “OR” to get the matrix of p-values,
odds-ratios, or both. This function outputs the signature matrices, the
top 30 cell-type markers for each cell-type, and the predicted cluster
label from the Fisher’s exact test and GSVA.
<- get_signature_matrices(type = "all") #return a list of cell-type labels, p-values, and odds-ratios. signatures
Many of the functions in scMappR print files or generate directories.
To return the full output of scMappR, please change the ‘toSave’
parameters from FALSE
to TRUE
in any of the
functions being used. Otherwise, the functions in scMappR will only
return a small portion of what scMappR has to offer. Due to CRAN
packages not allowing for their packages to print files/make
directories, toSave is set to FALSE
as default.
Furthermore, a path directory must be set. For example
path="./"
will print files and directories to the working
directory. These measures are to prevent any file overwriting. This
vignette sets path
to tempdir()
. When using
scMappR, change tempdir()
to the directories where files
should be saved.
scMappR_and_pathway_analysis()
: This function generates
cell-weighted fold changes (cellWeighted_Foldchanges), visualizes them
in a heatmap, and completes pathway enrichment of
cellWeighted_Foldchanges and bulk gene list.two_method_pathway_enrichment()
: This function
completes additional pathway enrichment tests from cwFold-changes
calculated in scMappR_and_pathway_analysis
. Specifically,
it re-ranks cwFoldChanges based on their absolute ct specificity scores
(per-celltype) as well as their rank increase in cell-type specificity
before completing an ordered pathway analysis. In the second method,
only genes with a rank increase in cell-type specificity were
included.cwFoldChange_evaluate()
. At the level of the gene,
cwFold-changes are scored so that they sum to 1. For each gene,
cell-types whose cwFold-change are greater than the cell-type proportion
while accounting for an abnormally high proportion of the fold change (+
3 median absolute deviations from median cell-type specificity) are
considered cell-type specific. At the level of the cell-type, bulk DEGs
and cwFold-changes for each cell-type are correlated. The difference in
the rank of DEG is also measured.tissue_scMappR_custom()
: This function visualizes
signature matrix, clusters subsetted genes, completes enrichment of
individual cell-types and co-enrichment.tissue_scMappR_internal()
: This function loops through
every signature matrix in a particular tissue and generates heatmaps,
cell-type preferences, and co-enrichment.tissue_by_celltype_enrichment()
: This function
completes a fishers exact test of an input list of genes against one of
the two curated tissue by cell-type marker datasets from scMappR.process_dgTMatrix_lists()
: This function takes a list
of count matrices, processes them, calls cell-types, and generates
signature matrices.The scMappR_and_pathway_analysis()
function generates
cellWeighted_Foldchanges based on an inputted signature matrix,
normalized RNA-seq count matrix (e.g. TPM, RPKM, CPM), and list of
differentially-expressed genes (DEGs) before creating cell-weighted Fold
Changes (cwFold-changes). cwFold-changes are row-normalized and
visualized in a heatmap. Cell-type markers within calculated
cwFold-changes are also visualized both as cell-type markers on their
own (i.e. just their cell-type specificity), and as cwFold-changes.
Then, for each cell-type, this function filters genes expressed in each
cell-type (cwFold-change > 1e-5), reorders the genes by
cwFold-change, then completes an ordered pathway analysis with
g:ProfileR or gprofiler2 package (user chooses). All genes detected in
the bulk RNA-seq are used as the background. The example below has
toSave = FALSE
, up_and_downregulated = FALSE
and internet = FALSE
. When running scMappR yourself, it is
strongly recommended to set all to TRUE
.
toSave = TRUE
allows for the printing of folders, images,
and files onto a desktop/cluster.
up_and_downregulated = TRUE
repeats pathway analysis for up
and down-regulated genes separately. internet = TRUE
allows
for the completion of pathway analysis altogether. Tissue-types are
available in data(scMappR_tissues)
, and the signature
matrices themselves from
get_signature_matrices(type = "OR")
.
data(PBMC_scMappR) # load data example of PBMC bulk- and cell-sorted RNA-seq data
<- PBMC_example$bulk_DE_cors # 59 sex-specific DEGs in bulk PBMC (up-regulated = female-biased)
bulk_DE_cors
<- PBMC_example$bulk_normalized # log CPM normalized bulk RNA-seq data
bulk_normalized
<- PBMC_example$odds_ratio_in # signature matrix developed from cell-sorted RNA-seq
odds_ratio_in
<- "_female" # flag for 'cases' (up-regulated), index is also acceptable
case_grep
<- "_male" # flag for 'control' (down-regulated), index is also acceptable
control_grep
<- 10 # maximum cell-type proportion change -- this is good for cell-types that are uncomon in population and small absolute changes may yield large relative changes
max_proportion_change
<- "human" # these RNA-seq data have human gene symbols (and are also from human)
theSpecies
# When running scMappR, it is strongly recommended to use scMappR_and_pathway analysis with the parameters below.
<- scMappR_and_pathway_analysis(bulk_normalized, odds_ratio_in,
toOut case_grep = case_grep,
bulk_DE_cors, control_grep = control_grep, rda_path = "",
max_proportion_change = 10, print_plots = TRUE,
plot_names = "scMappR_vignette_", theSpecies = "human",
output_directory = "scMappR_vignette_",
sig_matrix_size = 3000, up_and_downregulated = TRUE,
internet = TRUE, toSave = TRUE, path = tempdir())
Assuming toSave = T
and
up_and_downregulated = T
,
scMappR_and_pathway_analysis()
will also generate a folder
in your current directory with a considerable amount of
data/figures.
Here, we will walk through the output of the above example sorting by name and working downwards.
In addition to completing pathway enrichment of DEGs re-ranked by their cwFold-change, it may be informative to interrogate which DEGs were the most influenced by their cell-type specificity in the context of each cell-type.
In this example, we are extending the output from the
scMappR_and_pathway_analysis
function above.
<- two_method_pathway_enrichment(bulk_DE_cors, "human",
twoOutFiles scMappR_vals = toOut$cellWeighted_Foldchange, background_genes = rownames(bulk_normalized),
output_directory = "newfun_test",plot_names = "nonreranked_", toSave = FALSE)
scMappR_and_pathway_analysis
. Please refer to previous
section for output.An important aspect of scMappR is to calculate which DEGs from bulk differential analysis may be differentially expressed due to a change in expression of specific cell-types. We complete this task my measuring if the cell-type specificity of each DEG contains outliers for a specific cell-type, and if the distribution of DEGs within a cell-type (measured through cwFold-changes) is different from the distribution of DEGs measured in bulk.
In this example, we are extending the output from the
scMappR_and_pathway_analysis
function above.
<- cwFoldChange_evaluate(toOut$cellWeighted_Foldchange, toOut$cellType_Proportions, bulk_DE_cors) evaluated
cwFoldchange_gene_flagged_FP
list. There is a
possibility that these genes are not false positives, but should be
inspected further by the researcher if they are to be included in
downstream analysis. By default we suggest that these genes are not
considered cell-type specific.Given a tissue and an unranked list of genes (i.e. without a count
matrix or summary statistics), the tissue_scMappR_custom()
and tissue_scMappR_internal()
functions visualizes
cell-type markers contained within the gene-list. Then, they test for
the enrichment of cell-types within that tissue. When there is no custom
signature matrix, providing a tissue present in
get_signature_matrices(type = "pVal")
will complete
cell-type specific gene visualization and enrichment for every signature
matrix in the tissue.
data(POA_example) # region to preoptic area
<- POA_example$POA_Rank_signature # signature matrix
Signature
<- get_gene_symbol(Signature) # get signature
rowname
rownames(Signature) <- rowname$rowname
<- rownames(Signature)[1:60]
genes
= "" # data directory (if it exists)
rda_path1
# Identify tissues available for tissue_scMappR_internal
data(scMappR_tissues)
"Hypothalamus" %in% toupper(scMappR_tissues)
<- tissue_scMappR_internal(genes, "mouse", output_directory = "scMappR_Test_Internal",
internal tissue = "hypothalamus", rda_path = rda_path1, toSave = TRUE, path = tempdir())
This function returns a list “internal”. Using the example of
internal[[4]]
, the pre-optic area of the hypothalamus, you
can see four objects.
When toSave == TRUE
, a directory is generated with
visualization of each signature matrix, the signature matrix subsetted
by input genes, and statistical enrichment.
tissue_scMappR_custom()
assumes a custom signature
matrix. We suggest a gene x cell-type matrix populated with the
-log10(Padj)*sign(FC)
(i.e. Rank) of the cell-type marker.
We identify the cell-type marker with the FindMarker
function from the Seurat package. The statistical test within that
function is based on the biological question that the user interested
in. Pre-computed signature matrices use the Wilcoxon test. If the value
filling the matrix is not rank, users will need to change the
gene_cutoff
parameter to what they deem significant. If
is_pvalue == TRUE
, tissue_scMappR_custom()
will transform the p-values into Ranks.
# Acquiring the gene list
data(POA_example)
<- POA_example$POA_Rank_signature
Signature
<- get_gene_symbol(Signature)
rowname
rownames(Signature) <- rowname$rowname
<- rownames(Signature)[1:200]
genes
#running tisue_scMappR_custom
<- tissue_scMappR_custom(genes,Signature,output_directory = "scMappR_Test_custom", toSave = F) internal
The output is identical to that found in
tissue_scMappR_internal()
but with only one study, or the
example shown above.
If you choose to set toSave = TRUE
. This function will
return a directory with a number of relevant files.
When toSave == TRUE
, a directory is generated with
visualization of each signature matrix, the signature matrix subsetted
by input genes, and statistical enrichment.
scMappR allows for gene-set enrichment of every cell-type marker,
sorted by cell-type, tissue, and study, curated while preprocessing all
of the signature matrices for the functions within scMappR. The dataset
may be downloaded from https://github.com/DustinSokolowski/scMappR_Data,
processed into a gmt file using a number of packages (qusage,
activepathways etc.) and then used with traditional gene-set-enrichment
tools (GSEA, GSVA, gprofiler). Additionally, scMappR contains the
tissue_by_celltype_enrichment
function, that enriches a
gene list against all cell-type markers using a Fisher’s-exact test.
Users can also download all cell-type markers by setting
return_gmt = TRUE
when using the function.
Here, we will investigate the tissue x cell-type enrichment of the top 100 genes in the Preoptic area signature matrix.
data(POA_example)
<- POA_example$POA_generes
POA_generes <- POA_example$POA_OR_signature
POA_OR_signature <- POA_example$POA_Rank_signature
POA_Rank_signature <- POA_Rank_signature
Signature <- get_gene_symbol(Signature)
rowname rownames(Signature) <- rowname$rowname
<- rownames(Signature)[1:100]
genes
<- tissue_by_celltype_enrichment(gene_list = genes,
enriched species = "mouse",p_thresh = 0.05, isect_size = 3)
tissue_by_celltype_enrichment()
function. This function
returns gene set enrichment compatible with plotBP()
for
all cell-type markers significantly enriched via the function.return_gmt == TRUE
, then this downloaded gmt will be
returned with the enrichment.toSave == TRUE
, then the -log10(P_adj) of the top 10
tissue/cell-types that are enriched are plotted.scMappR inputs a list of count matrices (of class list, dCGMatrix, or matrix) and re-processes it using the standard Seurat V4 vignette (+ removal of cells with > 2 standard deviations of mt contamination than mean). Then, it finds cell-type markers and identifies potential cell-type names using the GSVA and Fisher’s exact methods on the CellMarker and Panglao databases. Finally, it creates a signature matrix of odds ratios and ranks. There are options to save the Seurat object, GSVA cell-type identities and list of cell-type markers. To identify what naming-preferences options described here: “brain”, “epithelial”, “endothelial”, “blood”, “connective”,“eye”, “epidermis”, “Digestive”, “Immune”, “pancreas”, “liver”, “reproductive”, “kidney”, “respiratory”
data(sm)
<- list(example = sm)
toProcess
<- process_dgTMatrix_lists(toProcess, name = "testProcess", species_name = "mouse",
tst1 naming_preference = "eye", rda_path = "",
toSave = TRUE, saveSCObject = TRUE, path = tempdir())
It is recommended to set toSave == TRUE
, allowing for
important data objects to be saved. Here, the above function is repeated
with toSave == TRUE
and saveSCObject == TRUE
,
and the outputted files will be briefly discussed.
Here, the following objects are saved.
process_dgTMatrix_lists
was used and cell-types were
predicted within scMappR, Each cell-type is labelled as followed.
When there are multiple scRNA-seq runs, we use the integration
anchors feature to combine runs. Here, each run is a different element
of the dgTMatrix_list
parameter.
Making multiple run scRNA-seq data. As seen in many cases, the first dataset is identified with “.1” and the second with “.2”
# generating scRNA-seq data with multiple runs.
data(sm)
<- sm2 <- sm
sm1 colnames(sm1) <- paste0(colnames(sm1), ".1")
colnames(sm2) <- paste0(colnames(sm2),".2")
<- cbind(sm1,sm2) combined_counts
Combining datasets with the integration anchors batch correction
<- list()
toProcess for(i in 1:2) {
paste0("example",i)]] <- combined_counts[,grep(paste0(".",i), colnames(combined_counts))]
toProcess[[
}<- process_dgTMatrix_lists(toProcess, name = "testProcess", species_name = "mouse",
tst1 naming_preference = "eye", rda_path = "",
toSave = TRUE, saveSCObject = TRUE, path = tempdir())
If you do not want to use the integration anchors feature and process the scRNA-seq data as if it were one run, then keep all of these data in the same dgCMatrix matrix.
<- process_dgTMatrix_lists(combined_counts, name = "testProcess", species_name = "mouse",
tst1 naming_preference = "eye", rda_path = "",
toSave = TRUE, saveSCObject = TRUE, path = tempdir())
Please continue to use the process_from_count
function.
The only species specific information required in this function is the
mitochondrial gene symbol to regress out mitochondria. In each of your
count matrices, make sure that your michondial genes have a “MT-”
preface. Then select “human” as the species. This can also be done with
an “mt-” preface and selecting mouse.
It may be common to generate a signature matrix when clusters and cell-types have already been given for every cell. These examples follow how to make this signature matrix from:
Generating the Seurat Object for example and making up cell-types. This example will be used from 1-2.
data(sm)
<- list(sm = sm)
toProcess
<- process_from_count(toProcess, "test_vignette",theSpecies = "mouse")
seurat_example
levels(seurat_example@active.ident) <- c("Myoblast", "Neutrophil", "cardiomyoblast", "Mesothelial")
generes
object and each signature matrix
is in gene_out
.<- seurat_to_generes(pbmc = seurat_example, test = "wilcox")
generes
<- generes_to_heatmap(generes, make_names = FALSE) gene_out
#Create the cell-type ids and matrix
<- seurat_example@active.ident
Cell_type_id
<- sm
count_file
<- get_gene_symbol(count_file)
rownames_example
rownames(count_file) <- rownames_example$rowname
# make seurat object
<- process_from_count(count_file, "test_vignette",theSpecies = "mouse")
seurat_example
# Intersect column names (cell-types) with labelled CTs
<- intersect(colnames(seurat_example), names(Cell_type_id))
inters
<- seurat_example[,inters]
seurat_example_inter
<- Cell_type_id[inters]
Cell_type_id_inter
@active.ident <- Cell_type_id_inter
seurat_example_inter
# Making signature matrices
<- seurat_to_generes(pbmc = seurat_example_inter, test = "wilcox")
generes
<- generes_to_heatmap(generes, make_names = FALSE) gene_out
scMappR generates heatmaps and barplots. The barplots are generated
with plotBP
and make_TF_barplot
. The plotting
code for plotBP
is provided. Inputs are a matrix called
ordered_back_all
of -log10(padj) and term names with the
column names log10 and term_name respectively.
# making an example matrix
<- c("one", "two", "three")
term_name <- c(1.5, 4, 2.1)
log10
<- as.data.frame(cbind(term_name,log10))
ordered_back_all
#plotting
<- ggplot2::ggplot(ordered_back_all, ggplot2::aes(x = stats::reorder(term_name,
g y = log10)) + ggplot2::geom_bar(stat = "identity",
log10), fill = "turquoise") + ggplot2::coord_flip() + ggplot2::labs(y = "-log10(Padj)",
x = "Gene Ontology")
<- g + ggplot2::theme(axis.text.x = ggplot2::element_text(face = NULL,
y color = "black", size = 12, angle = 35), axis.text.y = ggplot2::element_text(face = NULL,
color = "black", size = 12, angle = 35), axis.title = ggplot2::element_text(size = 16,
color = "black"))
print(y)
Here, the heatmaps are for plotting cwFold-changes and cell-type proportions. The same heatmap is used so just an example of one is given.
# Generating a heatmap
# Acquiring the gene list
data(POA_example)
<- POA_example$POA_Rank_signature
Signature
<- get_gene_symbol(Signature)
rowname
rownames(Signature) <- rowname$rowname
<- rownames(Signature)[1:200]
genes
#running tisue_scMappR_custom
<- tissue_scMappR_custom(genes,Signature,output_directory = "scMappR_Test_custom", toSave = F)
internal
<- internal$gene_list_heatmap$geneHeat
toPlot
#Plotting the heatmap
= 0.2 # size of genes
cex
<- grDevices::colorRampPalette(c("lightblue", "white", "orange"))(256)
myheatcol ::pheatmap(as.matrix(toPlot), color = myheatcol, scale = "row", fontsize_row = cex, fontsize_col = 10) pheatmap