%\VignetteIndexEntry{NMF: generating heatmaps} %\VignetteDepends{utils,NMF,RColorBrewer,knitr} %\VignetteKeyword{aplot} %\VignetteCompiler{knitr} %\VignetteEngine{knitr::knitr} \documentclass[a4paper]{article} %\usepackage[OT1]{fontenc} \usepackage[colorlinks]{hyperref} \usepackage{a4wide} \usepackage{xspace} \usepackage[all]{hypcap} % for linking to the top of the figures or tables % add preamble from pkgmaker <>= library(NMF) latex_preamble() if(!requireNamespace("Biobase")) BiocManager::install("Biobase") @ \newcommand{\nmfpack}{\pkgname{NMF}} \newcommand{\MATLAB}{MATLAB\textsuperscript{\textregistered}\xspace} \newcommand{\refeqn}[1]{(\ref{#1})} % REFERENCES \usepackage[citestyle=authoryear-icomp , doi=true , url=true , maxnames=1 , maxbibnames=15 , backref=true , backend=bibtex]{biblatex} \AtEveryCitekey{\clearfield{url}} <>= latex_bibliography('NMF') @ \newcommand{\citet}[1]{\textcite{#1}} \renewcommand{\cite}[1]{\parencite{#1}} \DefineBibliographyStrings{english}{% backrefpage = {see p.}, % for single page number backrefpages = {see pp.} % for multiple page numbers } % % boxed figures \usepackage{float} \floatstyle{boxed} \restylefloat{figure} \usepackage{array} \usepackage{tabularx} \usepackage{mathabx} \usepackage{url} \urlstyle{rm} % use cleveref for automatic reference label formatting \usepackage[capitalise, noabbrev]{cleveref} % define commands for notes \usepackage{todonotes} \newcommand{\nbnote}[1]{\ \bigskip\todo[inline, backgroundcolor=blue!20!white]{\scriptsize\textsf{\textbf{NB:} #1}}\ \\} % put table of contents on two columns \usepackage[toc]{multitoc} \setkeys{Gin}{width=0.95\textwidth} \begin{document} <>= #options(prompt=' ') #options(continue=' ') set.seed(123456) @ \title{Generating heatmaps for Nonnegative Matrix Factorization\\ \small Package \nmfpack\ - Version \Sexpr{utils::packageVersion('NMF')}} \author{Renaud Gaujoux} \maketitle \begin{abstract} This vignette describes how to produce different informative heatmaps from NMF objects, such as returned by the function \code{nmf} in the \citeCRANpkg{NMF}. The main drawing engine is implemented by the function \code{aheatmap}, which is a highly enhanced modification of the function \code{pheatmap} from the \CRANpkg{pheatmap}, and provides convenient and quick ways of producing high quality and customizable annotated heatmaps. Currently this function is part of the package \nmfpack, but may eventually compose a separate package on its own. \end{abstract} {\small \tableofcontents} \section{Preliminaries} \subsection{Quick reminder on NMF models} Given a nonnegative target matrix $X$ of dimension $n\times p$, NMF algorithms aim at finding a rank $k$ approximation of the form: $$ X \approx W H, $$ where $W$ and $H$ are nonnegative matrices of dimensions $n\times k$ and $k\times p$ respectively. The matrix $W$ is the basis matrix, whose columns are the basis components. The matrix $H$ is the mixture coefficient or weight matrix, whose columns contain the contribution of each basis component to the corresponding column of $X$. We call the rows of $H$ the basis profiles. \subsection{Heatmaps for NMF} Because NMF objects essentially wrap up a pair of matrices, heatmaps are convenient to visualise the results of NMF runs. The package \nmfpack provides several specialised heatmap functions, designed to produce heatmaps with sensible default configurations according to the data being drawn. Being all based on a common drawing engine, they share almost identical interfaces and capabilities. The following specialised functions are currently implemented: \begin{description} \item[\code{basismap}] draws heatmaps of the basis matrix \item[\code{coefmap}] draws heatmaps of the mixture coefficient matrix \item[\code{consensusmap}] draws heatmaps of the consensus matrix, for results of multiple NMF runs. \end{description} \subsection{Heatmap engine} All the above functions eventually call a common heatmap engine, with different default parameters, chosen to be relevant for the given underlying data. The engine is implemented by the function \code{aheatmap}. Its development started as modification of the function \code{pheatmap} from the \pkgname{pheatmap} package. The initial objective was to improve and increase its capabilities, as well as defining a simplified interface, more consistent with the R core function \code{heatmap}. We eventually aim at providing a general, flexible, powerful and easy to use engine for drawing annotated heatmaps. The function \code{aheatmap} has many advantages compared to other heatmap functions such as \code{heatmap}, \code{gplots::heatmap2}, \code{heatmap.plus::heatmap.plus} , or even \code{pheatmap}: \begin{itemize} \item Annotations: unlimited number of annotation tracks can be added to \emph{both} columns and rows, with automated colouring for categorical and numeric variables. \item Compatibility with both base and grid graphics: the function can be directly called in drawing contexts such as grid, mfrow or layout. This is a feature many R users were looking for, and that was strictly impossible with base heatmaps. \item Legends: default automatic legend and colouring; \item Customisation: clustering methods, annotations, colours and legend can all be customised, even separately for rows and columns; \item Convenient interface: many arguments provide multiple ways of specifying their value(s), which speeds up developping/writing and reduce the amount of code required to generate customised plots (e.g. see \cref{sec:colour_spec}). \item Aesthetics: the heatmaps look globally cleaner, the image and text components are by default well proportioned relatively to each other, and all fit within the graphic device. \end{itemize} \subsection{Data and model} \label{sec:data} For the purpose of illustrating the use of each heatmap function, we generate a random target matrix, as well as some annotations or covariates: <>= # random data that follow an 3-rank NMF model (with quite some noise: sd=2) X <- syntheticNMF(100, 3, 20, noise=2, factors = TRUE) Xmat <- X[[1]] # row annotations and covariates n <- nrow(Xmat) d <- rnorm(n) e <- unlist(mapply(rep, c('X', 'Y', 'Z'), 10)) e <- c(e, rep(NA, n-length(e))) rdata <- data.frame(Var=d, Type=e) # column annotations and covariates p <- ncol(Xmat) a <- sample(c('alpha', 'beta', 'gamma'), p, replace=TRUE) # define covariates: true groups and some numeric variable c <- rnorm(p) # gather them in a data.frame covariates <- data.frame(a, X$pData, c) @ %\SweaveOpts{fig.width=14,fig.height=7} <>= library(knitr) opts_chunk$set(fig.width=14, fig.height=7) @ Note that in the code above, the object \code{X} returned by \code{syntheticNMF} \emph{really is} a matrix object, but wrapped through the function \code{ExposedAttribute} object, which exposes its attributes via a more friendly and access controlled interface \code{\$}. Of particular interests are attributes \code{'pData'} and \code{'fData'}, which are lists that contain a factor named \code{'Group'} that indicates the true underlying clusters. These are respectively defined as each sample's most contrbuting basis component and the basis component to which each feature contributes the most. They are useful to annotate heatmaps and assess the ability of NMF methods to recover the true clusters. As an example, one can conveniently visualize the target matrix as a heatmap, with or without the relevant sample and feature annotations, using simple calls to the \code{aheatmap} function: <>= par(mfrow=c(1,2)) aheatmap(Xmat, annCol=covariates, annRow=X$fData) aheatmap(Xmat) @ Then, we fit an NMF model using multiple runs, that will be used throughtout this vignette to illustrate the use of NMF heatmaps: <>= res <- nmf(Xmat, 3, nrun=10) res @ \nbnote{To keep the vignette simple, we always use the default NMF method (i.e. \code{'brunet'}), but all steps could be performed using a different method, or multiple methods in order to compare their perfromances.} \section{Mixture Coefficient matrix: \texttt{coefmap}} The coefficient matrix of the result can be plotted using the function \code{coefmap}. The default behaviour for multiple NMF runs is to add two annotation tracks that show the clusters obtained by the best fit and the hierarchical clustering of the consensus matrix\footnote{The hierarchical clustering is computed using the consensus matrix itself as a similarity measure, and average linkage. See \code{?consensushc}.}. In the legend, these tracks are named \emph{basis} and \emph{consensus} respectively. For single NMF run or NMF model objects, no consensus data are available, and only the clusters from the fit are displayed. <>= opar <- par(mfrow=c(1,2)) # coefmap from multiple run fit: includes a consensus track coefmap(res) # coefmap of a single run fit: no consensus track coefmap(minfit(res)) par(opar) @ \nbnote{Note how both heatmaps were drawn on the same plot, simply using the standard call to \code{par(mfrow=c(1,2)}. This is impossible to achieve with the R core function \code{heatmap}. See \cref{sec:aheatmap} for more details about compatibility with base and grid graphics.} By default: \begin{itemize} \item the rows are not ordered; \item the columns use the default ordering of \code{aheatmap}, but may easily be ordered according to the clusters defined by the dominant basis component for each column with \code{Colv="basis"}, or according to those implied by the consensus matrix, i.e. as in \code{consensusmap}, with \code{Colv="consensus"}; \item each column is scaled to sum up to one; \item the color palette used is \code{'YlOrRd'} from the \citeCRANpkg{RColorBrewer}, with 50 breaks. \end{itemize} In term of arguments passed to the heatmap engine \code{aheatmap}, these default settings translate as: <>= Rowv = NA Colv = TRUE scale = 'c1' color = 'YlOrRd:50' annCol = predict(object) + predict(object, 'consensus') @ If the ordering does not come from a hierarchical clustering (e.g., if \code{Colv='basis'}), then no dendrogram is displayed. The default behaviour of \code{aheatmap} can be obtained by setting arguments \code{Rowv=TRUE, Colv=TRUE, scale='none'}. \medskip The automatic annotation tracks can be hidden all together by setting argument \code{tracks=NA}, displayed separately by passing only one of the given names (e.g. \code{tracks=':basis'} or \code{tracks='basis:'} for the row or column respectively), and their legend names may be changed by specifying e.g. \code{tracks=c(Metagene=':basis', 'consensus')}. Beside this, they are handled by the heatmap engine function \code{aheatmap} and can be customised as any other annotation tracks -- that can be added via the same argument \code{annCol} (see \cref{sec:aheatmap} or \code{?aheatmap} for more details). <>= opar <- par(mfrow=c(1,2)) # removing all automatic annotation tracks coefmap(res, tracks=NA) # customized plot coefmap(res, Colv = 'euclidean' , main = "Metagene contributions in each sample", labCol = NULL , annRow = list(Metagene=':basis'), annCol = list(':basis', Class=a, Index=c) , annColors = list(Metagene='Set2') , info = TRUE) par(opar) @ \nbnote{The feature that allows to display some information about the fit at the bottom of the plot via argument \code{info=TRUE} is still experimental. It is helpful mostly when developing algorithms or doing an analysis, but would seldom be used in publications.} \section{Basis matrix: \texttt{basismap}} The basis matrix can be plotted using the function \code{basismap}. The default behaviour is to add an annotation track that shows for each row the dominant basis component. That is, for each row, the index of the basis component with the highest loading. This track can be disabled by setting \code{tracks=NA}, and extra row annotations can be added using the same argument \code{annRow}. <>= opar <- par(mfrow=c(1,2)) # default plot basismap(res) # customized plot: only use row special annotation track. basismap(res, main="Metagenes", annRow=list(d, e), tracks=c(Metagene=':basis')) par(opar) @ By default: \begin{itemize} \item the columns are not ordered; \item the rows are ordered by hierarchical clustering using default distance and linkage methods (\code{'eculidean'} and \code{'complete'}); \item each row is scaled to sum up to one; \item the color palette used is \code{'YlOrRd'} from the \citeCRANpkg{RColorBrewer}, with 50 breaks. \end{itemize} In term of arguments passed to the heatmap engine \code{aheatmap}, these default settings translate as: <>= Colv = NA scale = 'r1' color = 'YlOrRd:50' annRow = predict(object, 'features') @ \section{Consensus matrix: \texttt{consensusmap}} When doing clustering with NMF, a common way of assessing the stability of the clusters obtained for a given rank is to consider the consensus matrix computed over multiple independent NMF runs, which is the average of the connectivity matrices of each separate run \footnote{Hence, stability here means robustness with regards to the initial starting point, and shall not be interpreted as in e.g. cross-validation/bootstrap analysis. However, one can argue that having very consistent clusters across runs somehow supports for a certain regularity or the presence of an underlying pattern in the data.}. This procedure is usually repeated over a certain range of factorization ranks, and the results are compared to identify which rank gives the best clusters, possibly in the light of some extra knowledge one could have about the samples (e.g. covariates). The functions \code{nmf} and \code{consensusmap} make it easy to implement this whole process. \nbnote{The consensus plots can also be generated for fits obtained from single NMF runs, in which case the consensus matrix simply reduces to a single connectivity matrix. This is a binary matrix (i.e. entries are either 0 or 1), that will always produce a bi-colour heatmap, and by default clear blocks for each cluster.} \subsection{Single fit} In section \cref{sec:data}, the NMF fit \code{res} was computed with argument \code{nrun=10}, and therefore contains the best fit over 10 runs, as well as the consensus matrix computed over all the runs \footnote{If one were interested in keeping the fits from all the runs, the function \code{nmf} should have been called with argument \code{.options='k'}. See section \emph{Options} in \code{?nmf}. The downstream hanlding of the result would remain identical.}. This can be ploted using the function \code{consensusmap}, which allows for the same kind of customization as the other NMF heatmap functions: <>= opar <- par(mfrow=c(1,2)) # default plot consensusmap(res) # customized plot consensusmap(res, annCol=covariates, annColors=list(c='blue') , labCol='sample ', main='Cluster stability' , sub='Consensus matrix and all covariates') par(opar) @ By default: \begin{itemize} \item the rows and columns of the consensus heatmap are symmetrically ordered by hierarchical clustering using the consensus matrix as a similarity measure and average linkage, and the associated dendrogram is displayed; \item the color palette used is the reverse of \code{'RdYlBu'} from the \citeCRANpkg{RColorBrewer}. \end{itemize} In term of arguments passed to the heatmap engine \code{aheatmap}, these default settings translate as: <>= distfun = function(x) as.dist(1-x) # x being the consensus matrix hclustfun = 'average' Rowv = TRUE Colv = "Rowv" color = '-RdYlBu' @ \subsection{Single method over a range of ranks} The function \code{nmf} accepts a range of value for the rank (argument \code{rank}), making it fit NMF models for each value in the given range \footnote{Before version 0.6, this feature was provided by the function \code{nmfEstimateRank}. From version 0.6, the function \code{nmf} accepts ranges of ranks, and internally calls the function \code{nmfEstimateRank} -- that remains exported and can still be called directly. See documentation \code{?nmfEstimateRank} for more details on the returned value.}: <>= res2_7 <- nmf(Xmat, 2:7, nrun=10, .options='v') class(res2_7) @ The result \code{res2\_7} is an S3 object of class \code{'NMF.rank'}, that contains -- amongst other data -- a list of the best fits obtained for each value of the rank in range $\ldbrack 2, 7\rdbrack]$. The method of \code{consensusmap} defined for class \code{'NMF.rank'}, which plots all the consensus matrices on the same plot: <>= consensusmap(res2_7) @ \nbnote{ The main title of each consensus heatmap can be customized by passing to argument \code{main} a character vector or a list whose elements specify each title. All other arguments are used in each internal call to consensusmap, and will therefore affect all the plots simultaneously. The layout can be specified via argument \code{layout} as a numeric vector giving the number of rows and columns in a \code{mfrow}-like way, or as a matrix that will be passed to R core function \code{layout}. See \code{?consensusmap} for more details and example code. } \subsection{Single rank over a range of methods} If one is interested in comparing methods, for a given factorization rank, then on can fit an NMF model for each method by providing the function \code{nmf} with a \code{list} in argument \code{method}: <>= res_methods <- nmf(Xmat, 3, list('lee', 'brunet', 'nsNMF'), nrun=10) class(res_methods) @ The result \code{res\_methods} is an S4 object of class \code{NMFList}, which is essentially a named list, that contains each fits and the CPU time required by the whole computation. As previously, the sequence of consensus matrices is plotted with \code{consensusmap}: <>= consensusmap(res_methods) @ \section{Generic heatmap engine: \texttt{aheatmap}} \label{sec:aheatmap} This section still needs to be written, but many examples of annotated heatmaps can be found in the demos \code{'aheatmap'} and \code{'heatmaps'}: <>= demo('aheatmap') # or demo('heatmaps') @ These demos and the plots they generate can also be browsed online at \url{http://nmf.r-forge.r-project.org/_DEMOS.html}. \section{Session Info} <>= toLatex(sessionInfo()) @ \printbibliography[heading=bibintoc] \end{document}