                                  fcontrast



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Continuous character contrasts

Description

   Reads a tree from a tree file, and a data set with continuous
   characters data, and produces the independent contrasts for those
   characters, for use in any multivariate statistics package. Will also
   produce covariances, regressions and correlations between characters
   for those contrasts. Can also correct for within-species sampling
   variation when individual phenotypes are available within a population.

Algorithm

   This program implements the contrasts calculation described in my 1985
   paper on the comparative method (Felsenstein, 1985d). It reads in a
   data set of the standard quantitative characters sort, and also a tree
   from the treefile. It then forms the contrasts between species that,
   according to that tree, are statistically independent. This is done for
   each character. The contrasts are all standardized by branch lengths
   (actually, square roots of branch lengths).

   The method is explained in the 1985 paper. It assumes a Brownian motion
   model. This model was introduced by Edwards and Cavalli-Sforza (1964;
   Cavalli-Sforza and Edwards, 1967) as an approximation to the evolution
   of gene frequencies. I have discussed (Felsenstein, 1973b, 1981c,
   1985d, 1988b) the difficulties inherent in using it as a model for the
   evolution of quantitative characters. Chief among these is that the
   characters do not necessarily evolve independently or at equal rates.
   This program allows one to evaluate this, if there is independent
   information on the phylogeny. You can compute the variance of the
   contrasts for each character, as a measure of the variance accumulating
   per unit branch length. You can also test covariances of characters.

   The statistics that are printed out include the covariances between all
   pairs of characters, the regressions of each character on each other
   (column j is regressed on row i), and the correlations between all
   pairs of characters. In assessing degress of freedom it is important to
   realize that each contrast was taken to have expectation zero, which is
   known because each contrast could as easily have been computed xi-xj
   instead of xj-xi. Thus there is no loss of a degree of freedom for
   estimation of a mean. The degrees of freedom is thus the same as the
   number of contrasts, namely one less than the number of species (tips).
   If you feed these contrasts into a multivariate statistics program make
   sure that it knows that each variable has expectation exactly zero.

  Within-species variation

   With the W option selected, CONTRAST analyzes data sets with variation
   within species, using a model like that proposed by Michael Lynch
   (1990). The method is described in vague terms in my book (Felsenstein,
   2004, pp. 441). If you select the W option for within-species
   variation, the data set should have this structure (on the left are the
   data, on the right my comments:

   10    5                           number of species, number of characters
Alpha        2                       name of 1st species, # of individuals
 2.01 5.3 1.5  -3.41 0.3             data for individual #1
 1.98 4.3 2.1  -2.98 0.45            data for individual #2
Gammarus     3                       name of 2nd species, # of individuals
 6.57 3.1 2.0  -1.89 0.6             data for individual #1
 7.62 3.4 1.9  -2.01 0.7             data for individual #2
 6.02 3.0 1.9  -2.03 0.6             data for individual #3
...                                  (and so on)



   The covariances, correlations, and regressions for the "additive"
   (between-species evolutionary variation) and "environmental"
   (within-species phenotypic variation) are printed out (the maximum
   likelihood estimates of each). The program also estimates the
   within-species phenotypic variation in the case where the
   between-species evolutionary covariances are forced to be zero. The
   log-likelihoods of these two cases are compared and a likelihood ratio
   test (LRT) is carried out. The program prints the result of this test
   as a chi-square variate, and gives the number of degrees of freedom of
   the LRT. You have to look up the chi-square variable on a table of the
   chi-square distribution. The A option is available (if the W option is
   invoked) to allow you to turn off the doing of this test if you want
   to.

   The log-likelihood of the data under the models with and without
   between-species For the moment the program cannot handle the case where
   within-species variation is to be taken into account but where only
   species means are available. (It can handle cases where some species
   have only one member in their sample).

   We hope to fix this soon. We are also on our way to incorporating
   full-sib, half-sib, or clonal groups within species, so as to do one
   analysis for within-species genetic and between-species phylogenetic
   variation.

   The data set used as an example below is the example from a paper by
   Michael Lynch (1990), his characters having been log-transformed. In
   the case where there is only one specimen per species, Lynch's model is
   identical to our model of within-species variation (for multiple
   individuals per species it is not a subcase of his model).

Usage

   Here is a sample session with fcontrast


% fcontrast
Continuous character contrasts
Input file: contrast.dat
Phylip tree file (optional): contrast.tree
Phylip contrast program output file [contrast.fcontrast]:


Output written to file "contrast.fcontrast"

Done.



   Go to the input files for this example
   Go to the output files for this example

Command line arguments

Continuous character contrasts
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-infile]            frequencies File containing one or more sets of data
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fcontrast] Phylip contrast program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -varywithin         boolean    [N] Within-population variation in data
*  -[no]reg            boolean    [Y] Print out correlations and regressions
*  -writecont          boolean    [N] Print out contrasts
*  -[no]nophylo        boolean    [Y] LRT test of no phylogenetic component,
                                  with and without VarA
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   fcontrast reads continuous character data.

  Continuous character data

   The programs in this group use gene frequencies and quantitative
   character values. One (CONTML) constructs maximum likelihood estimates
   of the phylogeny, another (GENDIST) computes genetic distances for use
   in the distance matrix programs, and the third (CONTRAST) examines
   correlation of traits as they evolve along a given phylogeny.

   When the gene frequencies data are used in CONTML or GENDIST, this
   involves the following assumptions:
    1. Different lineages evolve independently.
    2. After two lineages split, their characters change independently.
    3. Each gene frequency changes by genetic drift, with or without
       mutation (this varies from method to method).
    4. Different loci or characters drift independently.

   How these assumptions affect the methods will be seen in my papers on
   inference of phylogenies from gene frequency and continuous character
   data (Felsenstein, 1973b, 1981c, 1985c).

   The input formats are fairly similar to the discrete-character
   programs, but with one difference. When CONTML is used in the
   gene-frequency mode (its usual, default mode), or when GENDIST is used,
   the first line contains the number of species (or populations) and the
   number of loci and the options information. There then follows a line
   which gives the numbers of alleles at each locus, in order. This must
   be the full number of alleles, not the number of alleles which will be
   input: i. e. for a two-allele locus the number should be 2, not 1.
   There then follow the species (population) data, each species beginning
   on a new line. The first 10 characters are taken as the name, and
   thereafter the values of the individual characters are read
   free-format, preceded and separated by blanks. They can go to a new
   line if desired, though of course not in the middle of a number.
   Missing data is not allowed - an important limitation. In the default
   configuration, for each locus, the numbers should be the frequencies of
   all but one allele. The menu option A (All) signals that the
   frequencies of all alleles are provided in the input data -- the
   program will then automatically ignore the last of them. So without the
   A option, for a three-allele locus there should be two numbers, the
   frequencies of two of the alleles (and of course it must always be the
   same two!). Here is a typical data set without the A option:

     5    3
2 3 2
Alpha      0.90 0.80 0.10 0.56
Beta       0.72 0.54 0.30 0.20
Gamma      0.38 0.10 0.05  0.98
Delta      0.42 0.40 0.43 0.97
Epsilon    0.10 0.30 0.70 0.62

   whereas here is what it would have to look like if the A option were
   invoked:

     5    3
2 3 2
Alpha      0.90 0.10 0.80 0.10 0.10 0.56 0.44
Beta       0.72 0.28 0.54 0.30 0.16 0.20 0.80
Gamma      0.38 0.62 0.10 0.05 0.85  0.98 0.02
Delta      0.42 0.58 0.40 0.43 0.17 0.97 0.03
Epsilon    0.10 0.90 0.30 0.70 0.00 0.62 0.38


   The first line has the number of species (or populations) and the
   number of loci. The second line has the number of alleles for each of
   the 3 loci. The species lines have names (filled out to 10 characters
   with blanks) followed by the gene frequencies of the 2 alleles for the
   first locus, the 3 alleles for the second locus, and the 2 alleles for
   the third locus. You can start a new line after any of these allele
   frequencies, and continue to give the frequencies on that line (without
   repeating the species name).

   If all alleles of a locus are given, it is important to have them add
   up to 1. Roundoff of the frequencies may cause the program to conclude
   that the numbers do not sum to 1, and stop with an error message.

   While many compilers may be more tolerant, it is probably wise to make
   sure that each number, including the first, is preceded by a blank, and
   that there are digits both preceding and following any decimal points.

   CONTML and CONTRAST also treat quantitative characters (the
   continuous-characters mode in CONTML, which is option C). It is assumed
   that each character is evolving according to a Brownian motion model,
   at the same rate, and independently. In reality it is almost always
   impossible to guarantee this. The issue is discussed at length in my
   review article in Annual Review of Ecology and Systematics
   (Felsenstein, 1988a), where I point out the difficulty of transforming
   the characters so that they are not only genetically independent but
   have independent selection acting on them. If you are going to use
   CONTML to model evolution of continuous characters, then you should at
   least make some attempt to remove genetic correlations between the
   characters (usually all one can do is remove phenotypic correlations by
   transforming the characters so that there is no within-population
   covariance and so that the within-population variances of the
   characters are equal -- this is equivalent to using Canonical
   Variates). However, this will only guarantee that one has removed
   phenotypic covariances between characters. Genetic covariances could
   only be removed by knowing the coheritabilities of the characters,
   which would require genetic experiments, and selective covariances
   (covariances due to covariation of selection pressures) would require
   knowledge of the sources and extent of selection pressure in all
   variables.

   CONTRAST is a program designed to infer, for a given phylogeny that is
   provided to the program, the covariation between characters in a data
   set. Thus we have a program in this set that allow us to take
   information about the covariation and rates of evolution of characters
   and make an estimate of the phylogeny (CONTML), and a program that
   takes an estimate of the phylogeny and infers the variances and
   covariances of the character changes. But we have no program that
   infers both the phylogenies and the character covariation from the same
   data set.

   In the quantitative characters mode, a typical small data set would be:

     5   6
Alpha      0.345 0.467 1.213  2.2  -1.2 1.0
Beta       0.457 0.444 1.1    1.987 -0.2 2.678
Gamma      0.6 0.12 0.97 2.3  -0.11 1.54
Delta      0.68  0.203 0.888 2.0  1.67
Epsilon    0.297  0.22 0.90 1.9 1.74


   Note that in the latter case, there is no line giving the numbers of
   alleles at each locus. In this latter case no square-root
   transformation of the coordinates is done: each is assumed to give
   directly the position on the Brownian motion scale.

   For further discussion of options and modifiable constants in CONTML,
   GENDIST, and CONTRAST see the documentation files for those programs.

  Input files for usage example

  File: contrast.dat

    5   2
Homo        4.09434  4.74493
Pongo       3.61092  3.33220
Macaca      2.37024  3.36730
Ateles      2.02815  2.89037
Galago     -1.46968  2.30259

  File: contrast.tree

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);

Output file format

   fcontrast statistics that are printed out include the covariances
   between all pairs of characters, the regressions of each character on
   each other (column j is regressed on row i), and the correlations
   between all pairs of characters. In assessing degress of freedom it is
   important to realize that each contrast was taken to have expectation
   zero, which is known because each contrast could as easily have been
   computed xi-xj instead of xj-xi. Thus there is no loss of a degree of
   freedom for estimation of a mean. The degrees of freedom is thus the
   same as the number of contrasts, namely one less than the number of
   species (tips). If you feed these contrasts into a multivariate
   statistics program make sure that it knows that each variable has
   expectation exactly zero. With the W option selected, the covariances,
   correlations, and regressions for the "additive" (between-species
   evolutionary variation) and "environmental" (within-species phenotypic
   variation) are printed out (the maximum likelihood estimates of each).
   The program also estimates the within-species phenotypic variation in
   the case where the between-species evolutionary covariances are forced
   to be zero. The log-likelihoods of these two cases are compared and a
   likelihood ratio test (LRT) is carried out. The program prints the
   result of this test as a chi-square variate, and gives the number of
   degrees of freedom of the LRT. You have to look up the chi-square
   variable on a table of the chi-square distribution. The A option is
   available (if the W option is invoked) to allow you to turn off the
   doing of this test if you want to.

  Output files for usage example

  File: contrast.fcontrast


Covariance matrix
---------- ------

    3.9423    1.7028
    1.7028    1.7062

Regressions (columns on rows)
----------- -------- -- -----

    1.0000    0.4319
    0.9980    1.0000

Correlations
------------

    1.0000    0.6566
    0.6566    1.0000


Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                  Description
                    econtml      Continuous character maximum likelihood method
                    econtrast    Continuous character contrasts

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Please report all bugs to the EMBOSS bug team
                    (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

                    Written (2004) - Joe Felsenstein, University of Washington.

                    Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.

Comments

None
