\documentclass{article} % \VignettePackage{adephylo} % \VignetteIndexEntry{adephylo: exploratory analyses for the phylogenetic comparative method} \usepackage{graphicx} \usepackage[colorlinks=true,urlcolor=blue]{hyperref} \usepackage{array} \usepackage{color} \usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote() \newcommand{\code}[1]{{{\tt #1}}} \title{\code{adephylo}: exploratory analyses for the phylogenetic comparative method} \author{Thibaut Jombart and St\'ephane Dray} \date{\today} \sloppy \hyphenpenalty 10000 \begin{document} \SweaveOpts{concordance=TRUE} \definecolor{Soutput}{rgb}{0,0,0.56} \definecolor{Sinput}{rgb}{0.56,0,0} \DefineVerbatimEnvironment{Sinput}{Verbatim} {formatcom={\color{Sinput}},fontsize=\footnotesize, baselinestretch=0.75} \DefineVerbatimEnvironment{Soutput}{Verbatim} {formatcom={\color{Soutput}},fontsize=\footnotesize, baselinestretch=0.75} \color{black} \maketitle \tableofcontents %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \SweaveOpts{prefix.string = figs/adephylo, fig = FALSE, eps = FALSE, pdf = TRUE, width = 6, height = 6} This document describes the \code{adephylo} package for the R software. \code{adephylo} aims at implementing exploratory methods for the analysis of phylogenetic comparative data, i.e. biological traits measured for taxa whose phylogeny is also provided. This package extends and replaces implementation of phylogeny-related methods in the ade4 package \url{http://pbil.univ-lyon1.fr/ADE-4/home.php?lang=eng}. Procedures implemented in \code{adephylo} rely on exploratory data analysis. They include data visualization and manipulation, tests for phylogenetic autocorrelation, multivariate analysis, computation of phylogenetic proximities and distances, and modelling phylogenetic signal using orthonormal bases. \\ These methods can be used to visualize, test, remove or investigate the phylogenetic signal in comparative data. The purpose of this document is to provide a general view of the main functionalities of \code{adephylo}, and to show how this package can be used along with \code{ape}, \code{phylobase} and \code{ade4} to analyse comparative data. %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \section{First steps} %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \subsection{Data representation: why we are not reinventing the weel} %%%%%%%%%%%%%%%%%%%%% Data representation can be defined as the way data are stored in a software (R, in our case). Technically, data representation is defined by classes of objects that contain the information. In the case of phylogeny and comparative data, very efficient data representation are already defined in other packages. Hence, it makes much more sense to use directly objects from these classes. \\ Phylogenies are best represented in Emmanuel Paradis's \code{ape} package (\url{http://ape.mpl.ird.fr/}), as the class \code{phylo}. As \code{ape} is by far the largest package dedicated to phylogeny, using the \code{phylo} class assures a good interoperability of data. This class is defined in an online document: \url{http://ape.mpl.ird.fr/misc/FormatTreeR_28July2008.pdf}. \\ However, data that are to be analyzed in \code{adephylo} do not only contain trees, but also traits associated to the tips of a tree. The package \code{phylobase} (\url{http://r-forge.r-project.org/projects/phylobase/}) is a collaborative effort designed to handling such data. Its representation of phylogenies slightly differs from that of \code{ape}; the class \code{phylo4} was originally an extension of the \code{phylo} class into formal (S4) class, but it has now evolved into something more original. The S4 class \code{phylo4d} (`d' for `data') can be used to store a tree and data associated to tips, internal nodes, or even edges of a tree. Classes of \code{phylobase} are described in a vignette of the package, accessible by typing: <>= vignette("phylobase") @ ~\\ As trees and comparative data are already handled by \code{ape} and \code{phylobase}, no particular data format shall be defined in \code{adephylo}. In particular, we are no longer using \code{phylog} objects, which were used to represent phylogenies in \code{ade4} in a very \textit{ad hoc} way, without much compatibility with other packages. This class is now deprecated, but all previous functionalities available for \code{phylog} objects have been re-implemented and -- in some cases -- improved in \code{adephylo}. %%%%%%%%%%%%%%%%%%%%% \subsection{Installing the package} %%%%%%%%%%%%%%%%%%%%% What is tricky here is that a vignette is basically available once the package is installed. Assuming you got this document before installing the package, here are some clues about installing \code{adephylo}. \\ First of all, \code{adephylo} depends on other packages, being \code{methods}, \code{ape}, \code{phylobase}, and \code{ade4}. These dependencies are mandatory, that is, you actually need to have these packages installed before using \code{adephylo}. Also, it is better to make sure you are using the latest versions of these packages. This can be achieved using the \texttt{update.packages} command, or by installing devel versions from R-Forge (\url{http://r-forge.r-project.org/}). In all cases, the latest version of \code{adephylo} can be found from \url{http://r-forge.r-project.org/R/?group_id=303}. \\ We load \textit{adephylo}, alongside some useful packages: <>= library(ape) library(phylobase) library(ade4) library(adephylo) search() @ %%%%%%%%%%%%%%%%%%%%% \subsection{Getting started} %%%%%%%%%%%%%%%%%%%%% All the material of the package is summarized in a manpage accessible by typing: <>= ?adephylo @ The html version of this manpage may be preferred to browse easily the content of \code{adephylo}; this is accessible by typing: <>= help("adephylo", package="adephylo", html=TRUE) @ %%%%%%%%%%%%%%%%%%%%% \subsection{Putting data into shape} %%%%%%%%%%%%%%%%%%%%% While this is not the purpose of this document to go through the details of \code{phylo}, \code{phylo4} and \code{phylo4d} objects, we shall show briefly how these objects can be obtained. % % % % % % % % % % % \subsubsection{Making a \code{phylo} object} % % % % % % % % % % % The simplest way of turning a tree into a \code{phylo} object is using ape's function \code{read.tree}. This function reads a tree with the Newick (or `parentetic') format, from a file (default, argument \code{file}) of from a character string (argument \code{text}). <>= data(ungulates) ungulates$tre myTree <- read.tree(text=ungulates$tre) myTree plot(myTree, main="ape's plotting of a tree") @ It is easy to convert \code{ade4}'s \code{phylog} objects to a \code{phylo}, as \code{phylog} objects store the Newick format of the tree in the \code{\$tre} component. \\ Note that \code{phylo} trees can also be constructed from alignements (see \code{read.GenBank}, \code{read.dna}, \code{dist.dna}, \code{nj}, \code{bionj}, and \code{mlphylo}, all in \code{ape}), or even simulated (for instance, see \code{rtree}). \\ Also note that, if needed, conversion can be done back and forward with \code{phylo4} trees: <<>>= temp <- as(myTree, "phylo4") class(temp) temp <- as(temp, "phylo") class(temp) all.equal(temp, myTree) @ % % % % % % % % % % % \subsubsection{Making a \code{phylo4d} object} % % % % % % % % % % % \code{phylo4d} objects are S4 objects, and are thus created in a particular way. These objects can be obtained in two ways, by reading a Nexus file containing tree and data information, or by `assembling' a tree and data provided for tips, nodes, or edges. Nexus files containing both tree and data can be read by \code{phylobase}'s function \code{readNexus} (see corresponding manpage for more information). The other way of creating a \code{phylo4d} object is using the constructor, also named \code{phylo4d}. This is a function that takes two arguments: a tree (\code{phylo} or \code{phylo4} format) and a \code{data.frame} containing data, for tips by default (see \code{?phylo4d} for more information). Here is an example: <>= ung <- phylo4d(myTree, ungulates$tab) class(ung) table.phylo4d(ung) @ %% \noindent Note that the constructor checks the consistency of the %% names used for the tips of the tree and for the rows of the data.frame. %% Inconsistencies issue an error. %% To override this behaviour, one can specify %% \code{use.tip.names=FALSE}. %% However, this can be tricky: often, mismatches between names can %% indicate that data are not sorted adequately; moreover, object created %% with such mismatches will often be invalid objects, and may issue %% errors in further analyses. %% \\ Data are stored inside the \code{@data} slot of the object. They can be accessed using the function \code{tdata}: <<>>= x <- tdata(ung, type="tip") head(x) @ %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \section{Exploratory data analysis} %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%% \subsection{Quantifying and testing phylogenetic signal} %%%%%%%%%%%%%%%%%%%%% In this document, the terms `phylogenetic signal' and `phylogenetic autocorrelation' are used interchangeably. They refer to the fact that values of life-history traits or ecological features are not independent in closely related taxa. Several procedures are implemented by \code{adephylo} to measure and test phylogenetic autocorrelation. % % % % % % % % % % % \subsubsection{Moran's $I$} % % % % % % % % % % % The function \code{moran.idx} computes Moran's $I$, the most widely-used autocorrelation measure. It can also provide additionnal information (argument \code{addInfo}), being the null value of $I$ (i.e., the expected value in absence of phylogenetic autocorrelation), and the range of variation of $I$. It requires the degree of relatedness of tips on the phylogeny to be modelled by a matrix of phylogenetic proximities. Such a matrix can be obtained using different methods implemented by the function \code{proxTips}. <>= W <- proxTips(myTree, met="Abouheif") moran.idx(tdata(ung, type="tip")$afbw, W) moran.idx(tdata(ung, type="tip")[,1], W, addInfo=TRUE) @ From here, it is quite straightforward to build a non-parametric test based on Moran's $I$. For instance (taken from \code{?moran.idx}): <>= afbw <- tdata(ung, type="tip")$afbw sim <- replicate(499, moran.idx(sample(afbw), W)) # permutations sim <- c(moran.idx(afbw, W), sim) cat("\n=== p-value (right-tail) === \n") pval <- mean(sim>=sim[1]) pval plot(density(sim), main="Moran's I Monte Carlo test for 'bif'") # plot mtext("Density of permutations, and observation (in red)") abline(v=sim[1], col="red", lwd=3) @ \noindent Here, \code{afbw} is likely not phylogenetically autocorrelated. % % % % % % % % % % % \subsubsection{Abouheif's test} % % % % % % % % % % % The test of Abouheif (see reference in \code{?abouheif.moran}) is designed to test the existence of phylogenetic signal. In fact, it has been shown that this test amounts to a Moran's $I$ test with a particular proximity matrix (again, see references in the manpage). The implementation in \code{abouheif.moran} proposes different phylogenetic proximities, using by default the original one. The function can be used on different objects; in particular, it can be used with a \code{phylo4d} object. In such case, all traits inside the object are tested. The returned object is a \code{krandtest}, a class of object defined by \code{ade4} to store multiple Monte Carlo tests. Here is an example using the ungulates dataset: <>= ung.abTests <- abouheif.moran(ung) ung.abTests plot(ung.abTests) @ \noindent In this case, it seems that all variables but \code{afbm} are phylogenetically structured. \\ Note that other proximities than those proposed in \code{abouheif.moran} can be used: on has just to pass the appropriate proximity matrix to the function (argument \code{W}). For instance, we would like to use the correlation corresponding to a Brownian motion as a measure of phylogenetic proximity. First, we must estimate branch lengths, as our tree does not have any (ideally, we would already have a tree with meaningful branch lengths): <<>>= hasEdgeLength(ung) myTree.withBrLe <- compute.brlen(myTree) @ \noindent Now, we can use ape's function \code{vcv.phylo} to compute the matrix of phylogenetic proximities, and use this matrix in Abouheif's test: <<>>= myProx <- vcv.phylo(myTree.withBrLe) abouheif.moran(ung, W=myProx) @ \noindent In the present case, traits no longer appear as phylogenetically autocorrelated. Several explanation can be proposed: the procedure for estimating branch length may not be appropriate in this case, or the Brownian motion may fail to describe the evolution of the traits under study for this set of taxa. % % % % % % % % % % % \subsubsection{Phylogenetic decomposition of trait variation} % % % % % % % % % % % The phylogenetic decomposition of the variation of a trait proposed by Ollier et al. (2005, see references in \code{?orthogram}) is implemented by the function \code{orthogram}. This function replaces the former, deprecated version from \code{ade4}. \\ The idea behind the method is to model different levels of variation on a phylogeny. Basically, these levels can be obtained from dummy vectors indicating which tip descends from a given node. A partition of tips can then be obtained for each node. This job is achieved by the function \code{treePart}. Here is an example using a small simulated tree: <>= x <- as(rtree(5),"phylo4") plot(x,show.n=TRUE) @ <<>>= x.part <- treePart(x) x.part @ \noindent The obtained partition can also be plotted: <>= temp <- phylo4d(x, x.part) table.phylo4d(temp, cent=FALSE, scale=FALSE) @ \noindent What we would like to do is assess where the variation of a trait is structured on the phylogeny; to do so, we could use these dummy vectors as regressors and see how variation is distributed among these vectors. However, these dummy vectors cannot be used as regressors because they are linearly dependent. The orthogram circumvents this issue by transforming and selecting dummy vectors into a new set of variables that are orthonormal. The obtained orthonormal basis can be used to decompose the variation of the trait. Even if not necessary to get an orthogram, this basis can be obtained from \code{treePart}: <<>>= args(treePart) temp <- phylo4d(x, treePart(x, result="orthobasis") ) @ \noindent And here are the first 8 vectors of the orthonormal basis for the ungulate dataset: <>= temp <- phylo4d(myTree, treePart(myTree, result="orthobasis") ) par(mar=rep(.1,4)) table.phylo4d(temp, repVar=1:8, ratio.tree=.3) @ The decomposition of variance achieved by projecting a trait onto this orthonormal basis gives rise to several test statistics, that are performed by the function \code{orthogram}. Like the \code{abouheif.moran} function, \code{orthogram} outputs a \code{krandtest} object: <>= afbw.ortgTest <- orthogram(afbw, myTree) afbw.ortgTest @ \noindent Here again, \code{afbw} does not seem to be phylogenetically structured. %%%%%%%%%%%%%%%%%%%%% \subsection{Modelling phylogenetic signal} %%%%%%%%%%%%%%%%%%%%% % % % % % % % % % % % \subsubsection{Using orthonormal bases} % % % % % % % % % % % The previous section describing the orthogram has shown that testing phylogenetic signal underlies a model of phylogenetic structure. In the case of the orthogram, several tests are based on the decomposition of the variance of a trait onto an orthonormal basis describing tree topology. In fact, it is possible to extend this principle to any orthonormal basis modelling phylogenetic topology. Another example of such bases is offered by Moran's eigenvectors, which can be used to model different observable phylogenetic structures (see references in \code{me.phylo}). Moran's phylogenetic eigenvectors are implemented by the function \code{me.phylo} (also nicknamed \code{orthobasis.phylo}). The returned object is a data.frame with the class \code{orthobasis} defined in \code{ade4}; columns of this object are Moran's eigenvectors. An \code{orthobasis} can be coerced to a regular \code{data.frame} or to a matrix using \code{as.data.frame} and \code{as.matrix}. <<>>= me.phylo(myTree.withBrLe) @ \noindent Moran's eigenvectors are constructed from a matrix of phylogenetic proximities between tips. Any proximity can be used (argument \code{prox}); the 5 proximities implemented by the \code{proxTips} function are available by default, giving rise to different orthobases: <>= ung.listBas <- list() ung.listBas[[1]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree.withBrLe, method="patristic"))) ung.listBas[[2]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="nNodes"))) ung.listBas[[3]]<- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="Abouheif"))) ung.listBas[[4]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="sumDD"))) par(mar=rep(.1,4), mfrow=c(2,2)) invisible(lapply(ung.listBas, table.phylo4d, repVar=1:5, cex.sym=.7, show.tip.label=FALSE, show.node=FALSE)) @ \includegraphics[width=.8\textwidth]{figs/adephylo-figFourBas} \noindent In this case, the first Moran's eigenvectors are essentially similar. In other cases, however, the orthobases built from different proximities can be quite different. \\ One of the interests of Moran's eigenvectors in phylogeny is to account for phylogenetic autocorrelation in a linear model. This can be achieved using the appropriate eigenvector as covariate. Here is an example when studying the link of two traits in ungulate dataset. <>= afbw <- log(ungulates$tab[,1]) neonatw <- log((ungulates$tab[,2]+ungulates$tab[,3])/2) names(afbw) <- myTree$tip.label names(neonatw) <- myTree$tip.label plot(afbw, neonatw, main="Relationship between afbw and neonatw") lm1 <- lm(neonatw~afbw) abline(lm1, col="blue") anova(lm1) @ \noindent Are the residuals of this model independent? <>= resid <- residuals(lm1) names(resid) <- myTree$tip.label temp <- phylo4d(myTree,data.frame(resid)) abouheif.moran(temp) table.phylo4d(temp) @ \noindent No, residuals are clearly not independent, as they exhibit strong phylogenetic autocorrelation. In this case, autocorrelation can be removed by using the first Moran's eigenvector as a covariate. In general, the appropriate eigenvector(s) can be chosen by usual variable-selection approaches, like the forward selection, or using a selection based on the existence of autocorrelation in the residuals. <<>>= myBasis <- me.phylo(myTree, method="Abouheif") lm2 <- lm(neonatw~myBasis[,1] + afbw) resid <- residuals(lm2) names(resid) <- myTree$tip.label temp <- phylo4d(myTree,data.frame(resid)) abouheif.moran(temp) anova(lm2) @ The link between the two variables is still very statistically significant, but this time the model is not invalid because of non-independence of residuals. % % % % % % % % % % % \subsubsection{Autoregressive models} % % % % % % % % % % % Autoregressive models can also be used to remove phylogenetic autocorrelation from residuals. This approach implies the use of a phylogenetically lagged vector, for some or all of the variates of a model (see references in \code{?proxTips}). The lag vector of a trait $x$, denoted $\tilde{x}$, is computed as: $$ \tilde{x} = Wx $$ \noindent where $W$ is a matrix of phylogenetic proximities, as returned by \code{proxTips}. Hence, one can use an autoregressive approach to remove phylogenetic autocorrelation quite simply. We here re-use the example from the previous section: <<>>= W <- proxTips(myTree, method="Abouheif", sym=FALSE) lagNeonatw <- W %*% neonatw lm3 <- lm(neonatw ~ lagNeonatw + afbw) resid <- residuals(lm3) abouheif.moran(resid,W) @ \noindent Here, this most simple autoregressive model may not be sufficient to account for all phylogenetic signal; yet, phylogenetic autocorrelation is no longer detected at the usual threshold $\alpha=0.05$. %%%%%%%%%%%%%%%%%%%%% \subsection{Using multivariate analyses} %%%%%%%%%%%%%%%%%%%%% Multivariate analyses can be used to identify the main biodemographic strategies in a large set of traits. This could be the topic of an entire book. Such application is not particular to \code{adephylo}, but some practices are made easier by the package, used together with \code{ade4}. We here provide a simple example, using the \code{maples} dataset. This dataset contains a tree and a set of 31 quantitative traits (see \code{?maples}). First of all, we seek a summary of the variability in traits using a principal component analysis. Missing data are replaced by mean values, so they are placed at the origin of the axes (the `non-informative' point). <>= f1 <- function(x){ m <- mean(x,na.rm=TRUE) x[is.na(x)] <- m return(x) } data(maples) traits <- apply(maples$tab, 2, f1) pca1 <- dudi.pca(traits, scannf=FALSE, nf=1) barplot(pca1$eig, main="PCA eigenvalues", col=heat.colors(16)) @ \noindent One axis shall be retained. Does this axis reflect a phylogenetic structure? We can represent this principal component onto the phylogeny. In some cases, positive autocorrelation can be better perceived by examining the lag vector (see previous section on autoregressive models) instead of the original vector. Here, we shall plot both the retained principal component, and its lag vector: <>= tre <- read.tree(text=maples$tre) W <- proxTips(tre) myComp <- data.frame(PC1=pca1$li[,1], lagPC1=W %*% pca1$li[,1]) myComp.4d <- phylo4d(tre, myComp) nodeLabels(myComp.4d) <- names(nodeLabels(myComp.4d)) table.phylo4d(myComp.4d) @ \noindent It is quite clear that the main component of diversity among taxa separates descendants from node 19 from descendants of node 24. Phylogenetic autocorrelation can be checked in `PC1' (note that testing it in the lag vector would be circulary, as the lag vector already otimizes positive autocorrelation), for instance using Abouheif's test: <>= myTest <- abouheif.moran(myComp[,1], W=W) plot(myTest, main="Abouheif's test using patristic proximity") mtext("First principal component - maples data", col="blue", line=1) @ \noindent To dig further into the interpretation of this structure, one can have a look at the loadings of the traits, to see to which biological traits these opposed life histories correspond: <>= ldgs <- pca1$c1[,1] plot(ldgs, type="h", xlab="Variable", xaxt="n", ylab="Loadings") s.label(cbind(1:31, ldgs), lab=colnames(traits), add.p=TRUE, clab=.8) temp <- abs(ldgs) thres <- quantile(temp, .75) abline(h=thres * c(-1,1), lty=2, col="blue3", lwd=3) title("Loadings for PC1") mtext("Quarter of most contributing variables indicated in blue", col="blue") @ \noindent As a reminder, species with a large black symbol would be on the top of this graph, while species with a large white symbol would lie on the bottom. %%%%%%%%%%%%%%%%%%%%% %\subsection{Performing a phylogenetic Principal Component Analysis} %%%%%%%%%%%%%%%%%%%%% \end{document}