\documentclass{article}
% \VignettePackage{adephylo}
% \VignetteIndexEntry{adephylo: exploratory analyses for the phylogenetic comparative method}

\usepackage{graphicx}
\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
\usepackage{array}
\usepackage{color}

\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
\newcommand{\code}[1]{{{\tt #1}}}
\title{\code{adephylo}: exploratory analyses for the phylogenetic comparative method}
\author{Thibaut Jombart and St\'ephane Dray}
\date{\today}


\sloppy
\hyphenpenalty 10000


\begin{document}
\SweaveOpts{concordance=TRUE}


\definecolor{Soutput}{rgb}{0,0,0.56}
\definecolor{Sinput}{rgb}{0.56,0,0}
\DefineVerbatimEnvironment{Sinput}{Verbatim}
{formatcom={\color{Sinput}},fontsize=\footnotesize, baselinestretch=0.75}
\DefineVerbatimEnvironment{Soutput}{Verbatim}
{formatcom={\color{Soutput}},fontsize=\footnotesize, baselinestretch=0.75}

\color{black}

\maketitle
\tableofcontents


%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%

\SweaveOpts{prefix.string = figs/adephylo, fig = FALSE, eps = FALSE, pdf = TRUE, width = 6, height = 6}


This document describes the \code{adephylo} package for the R software.
\code{adephylo} aims at implementing exploratory methods for the
analysis of phylogenetic comparative data, i.e. biological traits measured for
taxa whose phylogeny is also provided.
This package extends and replaces implementation of phylogeny-related
methods in the ade4 package \url{http://pbil.univ-lyon1.fr/ADE-4/home.php?lang=eng}.

Procedures implemented in \code{adephylo} rely on exploratory data analysis.  They include data
visualization and manipulation, tests for phylogenetic autocorrelation, multivariate analysis,
computation of phylogenetic proximities and distances, and modelling phylogenetic signal using
orthonormal bases.  \\

These methods can be used to visualize, test, remove or investigate the phylogenetic signal in
comparative data.  The purpose of this document is to provide a general view of the main
functionalities of \code{adephylo}, and to show how this package can be used along with \code{ape},
\code{phylobase} and \code{ade4} to analyse comparative data.


%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
\section{First steps}
%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%
\subsection{Data representation: why we are not reinventing the weel}
%%%%%%%%%%%%%%%%%%%%%

Data representation can be defined as the way data are stored in a software
(R, in our case).  Technically, data representation is defined by classes of objects that contain
the information.  In the case of phylogeny and comparative data, very efficient data representation
are already defined in other packages.  Hence, it makes much more sense to use directly objects from
these classes.  \\


Phylogenies are best represented in Emmanuel Paradis's \code{ape} package
(\url{http://ape.mpl.ird.fr/}), as the class \code{phylo}.  As \code{ape} is by far the largest
package dedicated to phylogeny, using the \code{phylo} class assures a good interoperability of
data. This class is defined in an online document:
\url{http://ape.mpl.ird.fr/misc/FormatTreeR_28July2008.pdf}.  \\

However, data that are to be analyzed in \code{adephylo} do not only contain trees, but also traits
associated to the tips of a tree.  The package \code{phylobase}
(\url{http://r-forge.r-project.org/projects/phylobase/}) is a collaborative effort designed to
handling such data.  Its representation of phylogenies slightly differs from that of \code{ape}; the
class \code{phylo4} was originally an extension of the \code{phylo} class into formal (S4) class, but it
has now evolved into something more original. The S4 class \code{phylo4d} (`d' for `data') can be used to store a
tree and data associated to tips, internal nodes, or even edges of a tree.  Classes of
\code{phylobase} are described in a vignette of the package, accessible by typing:
<<eval=FALSE>>=
vignette("phylobase")
@

~\\ As trees and comparative data are already handled by \code{ape} and \code{phylobase}, no
particular data format shall be defined in \code{adephylo}.  In particular, we are no longer using
\code{phylog} objects, which were used to represent phylogenies in \code{ade4} in a very \textit{ad
  hoc} way, without much compatibility with other packages.  This class is now deprecated, but all
previous functionalities available for \code{phylog} objects have been re-implemented and -- in some
cases -- improved in \code{adephylo}.


%%%%%%%%%%%%%%%%%%%%%
\subsection{Installing the package}
%%%%%%%%%%%%%%%%%%%%%

What is tricky here is that a vignette is basically available once the package
is installed.  Assuming you got this document before installing the package, here are some clues
about installing \code{adephylo}.  \\

First of all, \code{adephylo} depends on other packages, being \code{methods}, \code{ape},
\code{phylobase}, and \code{ade4}.  These dependencies are mandatory, that is, you actually need to
have these packages installed before using \code{adephylo}.  Also, it is better to make sure you are
using the latest versions of these packages. This can be achieved using
the \texttt{update.packages} command, or by installing devel versions from R-Forge
(\url{http://r-forge.r-project.org/}).  In all cases, the latest version of \code{adephylo} can be
found from \url{http://r-forge.r-project.org/R/?group_id=303}.  \\

We load \textit{adephylo}, alongside some useful packages:
<<load>>=
library(ape)
library(phylobase)
library(ade4)
library(adephylo)
search()
@


%%%%%%%%%%%%%%%%%%%%%
\subsection{Getting started}
%%%%%%%%%%%%%%%%%%%%%
All the material of the package is summarized in a manpage accessible
by typing:
<<eval=FALSE>>=
?adephylo
@

The html version of this manpage may be preferred to browse easily the content
of \code{adephylo}; this is accessible by typing:
<<eval=FALSE>>=
help("adephylo", package="adephylo", html=TRUE)
@


%%%%%%%%%%%%%%%%%%%%%
\subsection{Putting data into shape}
%%%%%%%%%%%%%%%%%%%%%

While this is not the purpose of this document to go through the details of
\code{phylo}, \code{phylo4} and \code{phylo4d} objects, we shall show briefly how these objects can
be obtained.


% % % % % % % % % % %
\subsubsection{Making a \code{phylo} object}
% % % % % % % % % % %
The simplest way of turning a tree into a \code{phylo} object is using
ape's function \code{read.tree}.
This function reads a tree with the Newick (or `parentetic') format,
from a file (default, argument \code{file}) of from a character string
(argument \code{text}).
<<readTree, plot=TRUE>>=
data(ungulates)
ungulates$tre
myTree <- read.tree(text=ungulates$tre)
myTree
plot(myTree, main="ape's plotting of a tree")
@


It is easy to convert \code{ade4}'s \code{phylog} objects to a
\code{phylo}, as \code{phylog} objects store the Newick format of the
tree in the \code{\$tre} component.
\\

Note that \code{phylo} trees can also be constructed from alignements
(see \code{read.GenBank}, \code{read.dna},
\code{dist.dna}, \code{nj}, \code{bionj}, and \code{mlphylo}, all in
\code{ape}), or even simulated (for instance, see \code{rtree}).
\\

Also note that, if needed, conversion can be done back and forward
with \code{phylo4} trees:
<<>>=
temp <- as(myTree, "phylo4")
class(temp)
temp <- as(temp, "phylo")
class(temp)
all.equal(temp, myTree)
@


% % % % % % % % % % %
\subsubsection{Making a \code{phylo4d} object}
% % % % % % % % % % %

\code{phylo4d} objects are S4 objects, and are thus created in a particular
way.  These objects can be obtained in two ways, by reading a Nexus file containing tree and data
information, or by `assembling' a tree and data provided for tips, nodes, or edges.

Nexus files containing both tree and data can be read by \code{phylobase}'s function
\code{readNexus} (see corresponding manpage for more information).
The other way of creating a \code{phylo4d} object is using the
constructor, also named \code{phylo4d}.  This is a function that takes two arguments: a tree
(\code{phylo} or \code{phylo4} format) and a \code{data.frame} containing data, for tips by default (see
\code{?phylo4d} for more information).  Here is an example:
<<phylo4d, fig=TRUE>>=
ung <- phylo4d(myTree, ungulates$tab)
class(ung)
table.phylo4d(ung)
@

%% \noindent Note that the constructor checks the consistency of the
%% names used for the tips of the tree and for the rows of the data.frame.
%% Inconsistencies issue an error.
%% To override this behaviour, one can specify
%% \code{use.tip.names=FALSE}.
%% However, this can be tricky: often, mismatches between names can
%% indicate that data are not sorted adequately; moreover, object created
%% with such mismatches will often be invalid objects, and may issue
%% errors in further analyses.
%% \\

Data are stored inside the \code{@data} slot of the object.
They can be accessed using the function \code{tdata}:
<<>>=
x <- tdata(ung, type="tip")
head(x)
@


%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%
\section{Exploratory data analysis}
%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%
\subsection{Quantifying and testing phylogenetic signal}
%%%%%%%%%%%%%%%%%%%%%

In this document, the terms `phylogenetic signal' and `phylogenetic autocorrelation' are used
interchangeably.  They refer to the fact that values of life-history traits or ecological features
are not independent in closely related taxa.  Several procedures are implemented by \code{adephylo}
to measure and test phylogenetic autocorrelation.


% % % % % % % % % % %
\subsubsection{Moran's $I$}
% % % % % % % % % % %

The function \code{moran.idx} computes Moran's $I$, the most widely-used autocorrelation measure.
It can also provide additionnal information (argument \code{addInfo}), being the null value of $I$
(i.e., the expected value in absence of phylogenetic autocorrelation), and the range of variation of
$I$.  It requires the degree of relatedness of tips on the phylogeny to be modelled by a matrix of
phylogenetic proximities.  Such a matrix can be obtained using different methods implemented by the
function \code{proxTips}.

<<moranI>>=
W <- proxTips(myTree, met="Abouheif")
moran.idx(tdata(ung, type="tip")$afbw, W)
moran.idx(tdata(ung, type="tip")[,1], W, addInfo=TRUE)
@

From here, it is quite straightforward to build a non-parametric test
based on Moran's $I$.
For instance (taken from \code{?moran.idx}):
<<fig=TRUE>>=
afbw <- tdata(ung, type="tip")$afbw
sim <- replicate(499, moran.idx(sample(afbw), W)) # permutations
sim <- c(moran.idx(afbw, W), sim)

cat("\n=== p-value (right-tail) === \n")
pval <- mean(sim>=sim[1])
pval

plot(density(sim), main="Moran's I Monte Carlo test for 'bif'") # plot
mtext("Density of permutations, and observation (in red)")
abline(v=sim[1], col="red", lwd=3)

@

\noindent Here, \code{afbw} is likely not phylogenetically autocorrelated.


% % % % % % % % % % %
\subsubsection{Abouheif's test}
% % % % % % % % % % %

The test of Abouheif (see reference in \code{?abouheif.moran}) is
designed to test the existence of phylogenetic signal.
In fact, it has been shown that this test amounts to a Moran's $I$
test with a particular proximity matrix (again, see references in the manpage).
The implementation in \code{abouheif.moran} proposes different phylogenetic proximities,
using by default the original one.

The function can be used on different objects; in particular, it can
be used with a \code{phylo4d} object.
In such case, all traits inside the object are tested.
The returned object is a \code{krandtest}, a class of object defined
by \code{ade4} to store multiple Monte Carlo tests.
Here is an example using the ungulates dataset:
<<abouheif, plot=TRUE>>=
ung.abTests <- abouheif.moran(ung)
ung.abTests
plot(ung.abTests)
@

\noindent In this case, it seems that all variables but \code{afbm} are
phylogenetically structured.
\\

Note that other proximities than those proposed in
\code{abouheif.moran} can be used: on has just to pass the appropriate
proximity matrix to the function (argument \code{W}).
For instance, we would like to use the correlation corresponding to a
Brownian motion as a measure of phylogenetic proximity.

First, we must estimate branch lengths, as our tree does
not have any (ideally, we would already have a tree with meaningful branch lengths):
<<>>=
hasEdgeLength(ung)
myTree.withBrLe <- compute.brlen(myTree)
@

\noindent Now, we can use ape's function \code{vcv.phylo} to compute
the matrix of phylogenetic proximities, and use this matrix in
Abouheif's test:
<<>>=
myProx <- vcv.phylo(myTree.withBrLe)
abouheif.moran(ung, W=myProx)
@

\noindent In the present case, traits no longer appear as phylogenetically autocorrelated.  Several
explanation can be proposed: the procedure for estimating branch length may not be appropriate in
this case, or the Brownian motion may fail to describe the evolution of the traits under study for
this set of taxa.


% % % % % % % % % % %
\subsubsection{Phylogenetic decomposition of trait variation}
% % % % % % % % % % %
The phylogenetic decomposition of the variation of a trait proposed by Ollier
et al. (2005, see references in \code{?orthogram}) is implemented by
the function \code{orthogram}.
This function replaces the former, deprecated version from \code{ade4}.
\\

The idea behind the method is to model different levels of variation
on a phylogeny.
Basically, these levels can be obtained from dummy vectors indicating
which tip descends from a given node.
A partition of tips can then be obtained for each node.
This job is achieved by the function \code{treePart}.
Here is an example using a small simulated tree:
<<fig=TRUE>>=
x <- as(rtree(5),"phylo4")
plot(x,show.n=TRUE)
@

<<>>=
x.part <- treePart(x)
x.part
@
\noindent The obtained partition can also be plotted:
<<fig=TRUE>>=
temp <- phylo4d(x, x.part)
table.phylo4d(temp, cent=FALSE, scale=FALSE)
@

\noindent What we would like to do is assess where the variation of a trait is structured on the
phylogeny; to do so, we could use these dummy vectors as regressors and see how variation is
distributed among these vectors.  However, these dummy vectors cannot be used as regressors because
they are linearly dependent.  The orthogram circumvents this issue by transforming and selecting
dummy vectors into a new set of variables that are orthonormal.  The obtained orthonormal basis can
be used to decompose the variation of the trait.  Even if not necessary to get an orthogram, this basis
can be obtained from \code{treePart}:
<<>>=
args(treePart)
temp <- phylo4d(x, treePart(x, result="orthobasis") )
@

\noindent And here are the first 8 vectors of the orthonormal basis
for the ungulate dataset:
<<orthobas1, fig=TRUE>>=
temp <- phylo4d(myTree, treePart(myTree, result="orthobasis") )
par(mar=rep(.1,4))
table.phylo4d(temp, repVar=1:8, ratio.tree=.3)
@

The decomposition of variance achieved by projecting a trait onto this
orthonormal basis gives rise to several test statistics, that are
performed by the function \code{orthogram}.
Like the \code{abouheif.moran} function, \code{orthogram} outputs a
\code{krandtest} object:
<<orthogram, plot=TRUE>>=
afbw.ortgTest <- orthogram(afbw, myTree)
afbw.ortgTest
@

\noindent Here again, \code{afbw} does not seem to be phylogenetically structured.


%%%%%%%%%%%%%%%%%%%%%
\subsection{Modelling phylogenetic signal}
%%%%%%%%%%%%%%%%%%%%%

% % % % % % % % % % %
\subsubsection{Using orthonormal bases}
% % % % % % % % % % %

The previous section describing the orthogram has shown that testing phylogenetic signal underlies a
model of phylogenetic structure. In the case of the orthogram, several tests are based on the
decomposition of the variance of a trait onto an orthonormal basis describing tree topology.  In
fact, it is possible to extend this principle to any orthonormal basis modelling phylogenetic
topology.  Another example of such bases is offered by Moran's eigenvectors, which can be used to
model different observable phylogenetic structures (see references in \code{me.phylo}).

Moran's phylogenetic eigenvectors are implemented by the function \code{me.phylo} (also nicknamed
\code{orthobasis.phylo}).  The returned object is a data.frame with the class \code{orthobasis}
defined in \code{ade4}; columns of this object are Moran's eigenvectors.  An \code{orthobasis} can
be coerced to a regular \code{data.frame} or to a matrix using \code{as.data.frame} and \code{as.matrix}.
<<>>=
me.phylo(myTree.withBrLe)
@

\noindent Moran's eigenvectors are constructed from a matrix of
phylogenetic proximities between tips.
Any proximity can be used (argument \code{prox}); the 5 proximities
implemented by the \code{proxTips} function are available by default, giving rise
to different orthobases:
<<figFourBas, fig=TRUE,include=FALSE, print=FALSE>>=
ung.listBas <- list()
ung.listBas[[1]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree.withBrLe, method="patristic")))
ung.listBas[[2]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="nNodes")))
ung.listBas[[3]]<- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="Abouheif")))
ung.listBas[[4]] <- phylo4d(myTree, as.data.frame(me.phylo(myTree, method="sumDD")))
par(mar=rep(.1,4), mfrow=c(2,2))
invisible(lapply(ung.listBas, table.phylo4d, repVar=1:5, cex.sym=.7, show.tip.label=FALSE, show.node=FALSE))
@

\includegraphics[width=.8\textwidth]{figs/adephylo-figFourBas}

\noindent In this case, the first Moran's eigenvectors are essentially similar.  In other cases,
however, the orthobases built from different proximities can be quite different. \\

One of the interests of Moran's eigenvectors in phylogeny is to account for phylogenetic
autocorrelation in a linear model.  This can be achieved using the appropriate eigenvector as
covariate.  Here is an example when studying the link of two traits in ungulate dataset.
<<lm1, fig=TRUE>>=
afbw <- log(ungulates$tab[,1])
neonatw <- log((ungulates$tab[,2]+ungulates$tab[,3])/2)
names(afbw) <- myTree$tip.label
names(neonatw) <- myTree$tip.label
plot(afbw, neonatw, main="Relationship between afbw and neonatw")
lm1 <- lm(neonatw~afbw)
abline(lm1, col="blue")
anova(lm1)
@

\noindent Are the residuals of this model independent?
<<resid, fig=TRUE>>=
resid <- residuals(lm1)
names(resid) <- myTree$tip.label
temp <- phylo4d(myTree,data.frame(resid))
abouheif.moran(temp)
table.phylo4d(temp)
@

\noindent No, residuals are clearly not independent, as they exhibit
strong phylogenetic autocorrelation.
In this case, autocorrelation can be removed by using the first
Moran's eigenvector as a covariate.
In general, the appropriate eigenvector(s) can be chosen by usual
variable-selection approaches, like the forward selection, or using a
selection based on the existence of autocorrelation in the residuals.
<<>>=
myBasis <- me.phylo(myTree, method="Abouheif")
lm2 <- lm(neonatw~myBasis[,1] + afbw)
resid <- residuals(lm2)
names(resid) <- myTree$tip.label
temp <- phylo4d(myTree,data.frame(resid))
abouheif.moran(temp)
anova(lm2)
@

The link between the two variables is still very statistically
significant, but this time the model is not invalid because of
non-independence of residuals.


% % % % % % % % % % %
\subsubsection{Autoregressive models}
% % % % % % % % % % %
Autoregressive models can also be used to remove phylogenetic
autocorrelation from residuals.
This approach implies the use of a phylogenetically lagged vector, for
some or all of the variates of a model (see references in \code{?proxTips}).
The lag vector of a trait $x$, denoted $\tilde{x}$, is computed as:
$$
\tilde{x} = Wx
$$
\noindent where $W$ is a matrix of phylogenetic proximities, as
returned by \code{proxTips}.
Hence, one can use an autoregressive approach to remove phylogenetic
autocorrelation quite simply.
We here re-use the example from the previous section:
<<>>=
W <- proxTips(myTree, method="Abouheif", sym=FALSE)
lagNeonatw <- W %*% neonatw
lm3 <- lm(neonatw ~ lagNeonatw + afbw)
resid <- residuals(lm3)
abouheif.moran(resid,W)
@

\noindent Here, this most simple autoregressive model may not be
sufficient to account for all phylogenetic signal; yet, phylogenetic
autocorrelation is no longer detected at the usual threshold
$\alpha=0.05$.


%%%%%%%%%%%%%%%%%%%%%
\subsection{Using multivariate analyses}
%%%%%%%%%%%%%%%%%%%%%

Multivariate analyses can be used to identify the main biodemographic strategies in a large set of
traits.  This could be the topic of an entire book.  Such application is not particular to
\code{adephylo}, but some practices are made easier by the package, used together with \code{ade4}.
We here provide a simple example, using the \code{maples} dataset.  This dataset contains a tree and
a set of 31 quantitative traits (see \code{?maples}).

First of all, we seek a summary of the variability in traits using a principal component analysis.
Missing data are replaced by mean values, so they are placed at the origin of the axes (the
`non-informative' point).
<<pca1, fig=TRUE>>=
f1 <- function(x){
    m <- mean(x,na.rm=TRUE)
    x[is.na(x)] <- m
    return(x)
}

data(maples)
traits <- apply(maples$tab, 2, f1)
pca1 <- dudi.pca(traits, scannf=FALSE, nf=1)
barplot(pca1$eig, main="PCA eigenvalues", col=heat.colors(16))
@

\noindent One axis shall be retained.  Does this axis reflect a phylogenetic structure?  We can
represent this principal component onto the phylogeny.  In some cases, positive autocorrelation can
be better perceived by examining the lag vector (see previous section on autoregressive models)
instead of the original vector.  Here, we shall plot both the retained principal component, and its
lag vector:
<<pca2, fig=TRUE>>=
tre <- read.tree(text=maples$tre)
W <- proxTips(tre)
myComp <- data.frame(PC1=pca1$li[,1], lagPC1=W %*% pca1$li[,1])
myComp.4d <- phylo4d(tre, myComp)
nodeLabels(myComp.4d) <- names(nodeLabels(myComp.4d))
table.phylo4d(myComp.4d)
@

\noindent It is quite clear that the main component of diversity among taxa separates descendants
from node 19 from descendants of node 24.  Phylogenetic autocorrelation can be checked in `PC1'
(note that testing it in the lag vector would be circulary, as the lag vector already otimizes
positive autocorrelation), for instance using Abouheif's test:
<<aboutest, fig=TRUE>>=
myTest <- abouheif.moran(myComp[,1], W=W)
plot(myTest, main="Abouheif's test using patristic proximity")
mtext("First principal component - maples data", col="blue", line=1)
@

\noindent To dig further into the interpretation of this structure,
one can have a look at the loadings of the traits, to see to which
biological traits these opposed life histories correspond:
<<loadings, fig=TRUE>>=
ldgs <- pca1$c1[,1]
plot(ldgs, type="h", xlab="Variable", xaxt="n", ylab="Loadings")
s.label(cbind(1:31, ldgs), lab=colnames(traits), add.p=TRUE, clab=.8)
temp <- abs(ldgs)
thres <- quantile(temp, .75)
abline(h=thres * c(-1,1), lty=2, col="blue3", lwd=3)
title("Loadings for PC1")
mtext("Quarter of most contributing variables indicated in blue", col="blue")
@

\noindent As a reminder, species with a large black symbol would be on
the top of this graph, while species with a large white symbol would
lie on the bottom.


%%%%%%%%%%%%%%%%%%%%%
%\subsection{Performing a phylogenetic Principal Component Analysis}
%%%%%%%%%%%%%%%%%%%%%


\end{document}