Apoderoides Tutorial

Satoshi Aoki

2024-01-12

1 Aim of this package, Apoderoides

Apoderoides is an R package for finding and deleting erroneous taxa from a phylogenetic tree by calculating scores for each taxa. The score shows how erroneous the taxon and can prioritize which taxon should be deleted first. Apoderoides especially focuses on erroneous taxa caused by taxon mistake or misidentification.

2 Installation

You can install Apoderoides by the following commands on R console. This command requires the Internet connection.

install.packages("Apoderoides")

3 How to use

First, please load the package by the following command:

library(Apoderoides)

3.1 Import a phylogenetic tree

We need a phylogenetic tree for the analysis. The next command imports a test tree included in this package.

data(testTree)

Otherwise, please load your own phylogenetic tree by a command like this:

testTree <- read.tree(file="/directory/yourTree.tre")

3.2 Score calculation

3.2.1. Score at genus level

Let’s calculate the score of taxa in the loaded tree at genus level. Taxa with higher scores are more harmful to monophyly of genera, and such taxa are considered erroneous. The score is calculated by the next command:

#for the test tree
calc.Score(testTree)
#for the user imported tree
calc.Score(tree)

Here, let’s have a look at the top 10 scores of the test tree.

calc.Score(testTree,show_progress=FALSE)[1:10,]
#>       OTU                            perCladeOTUScore sum   intruder outlier
#>  [1,] "Araucaria_cunninghamii"       "248"            "248" "0"      "248"  
#>  [2,] "Thuja_orientalis"             "21"             "21"  "1"      "20"   
#>  [3,] "Callitropsis_vietnamensis"    "7"              "7"   "0"      "7"    
#>  [4,] "Callitropsis_funebris"        "6"              "6"   "0"      "6"    
#>  [5,] "Goniophlebium_niponicum"      "5"              "5"   "0"      "5"    
#>  [6,] "Dryopteris_crassirhizoma"     "4"              "4"   "0"      "4"    
#>  [7,] "Thelypteris_aurita"           "4"              "4"   "0"      "4"    
#>  [8,] "Thelypteris_decursivepinnata" "4"              "4"   "0"      "4"    
#>  [9,] "Cyclosorus_dentatus"          "4"              "4"   "1"      "3"    
#> [10,] "Juniperus_chinensis"          "3"              "3"   "1"      "2"    
#>       #clade
#>  [1,] "136" 
#>  [2,] "121" 
#>  [3,] "124" 
#>  [4,] "123" 
#>  [5,] "65"  
#>  [6,] "49"  
#>  [7,] "79"  
#>  [8,] "81"  
#>  [9,] "87"  
#> [10,] "125"

The columns of the calculation results show the following information:

  • “OTU”: The names of the tree tips.

  • “perCladeOTUscore”: The final score of the tree tip calculated by “sum” divided by the number of taxa with the same “#clade”.

  • “sum”: The sum of “intruder” and “outlier”.

  • “intruder”: The intruder score of the tree tip.

  • “outlier”: The outlier score of the tree tip.

  • “#clade”: Identifier of clades of the same rank (Here, genus). Different clades have different #clade.

In short, the intruder score shows how many clades of the other ranks the tree tip is intruding, and the outlier score shows how far the tree tip is from the main clade of the belonging rank.

The result shows that “Araucaria_cunninghamii” is by far the top candidate to delete from the tree due to its high score.

Please note that this function assumes that all the names of tree tips are scientific names connected by underbars like “Homo_sapiens”. If the tree tips are named otherwise, please see the next chapter.

3.2.2. Score at the other rank

When you want to calculate score for the rank other than genus, or when the tree tips are not scientific names, you need a list of belonging ranks of the tree tips. Let’s see the rank list of the test tree called by the following command:

data("testRankList")

The contains of the rank list is like this:

data("testRankList")
testRankList[[1]][1:10]
#>  [1] "Wollemia_nobilis"                    
#>  [2] "Araucaria_bidwillii"                 
#>  [3] "Araucaria_araucana"                  
#>  [4] "Araucaria_angustifolia"              
#>  [5] "Ephedra_sinica"                      
#>  [6] "Ephedra_przewalskii"                 
#>  [7] "Ephedra_monosperma"                  
#>  [8] "Cathaya_argyrophylla"                
#>  [9] "Pseudotsuga_sinensis_var._wilsoniana"
#> [10] "Larix_gmelinii"
testRankList[[2]][1:10]
#>  [1] "Araucariaceae" "Araucariaceae" "Araucariaceae" "Araucariaceae"
#>  [5] "Ephedraceae"   "Ephedraceae"   "Ephedraceae"   "Pinaceae"     
#>  [9] "Pinaceae"      "Pinaceae"

The rank list is a list of size 2. The first element is equivalent to a character vector of the tree tips (obtained by like testTree$tip). The second element is a character vector of the rank names corresponding to the first element of the rank list. In this test data, the rank list indicates the family of the test tree tips. When the tree tips are not scientific names and you want to calculate the score for genus, you can calculate it by setting the genus names in the second element of the rank list.

Using this rank list, the score of test tree for family can be calculated by the following command:

calc.Score(testTree,testRankList)

The output can be interpreted just like the score for genus. The only difference is the score is based on monophyly of genus or family.

3.3 Auto deletion of erroneous tree tips

The score tells us which tree tip(s) are most erroneous in the tree. Therefore, repeating score calculation and deleting the top-score tip(s) until all tips have 0 scores can provide a tree without erroneous tips with the small number of deleted tips. The following commands conduct such auto deletion of erroneous taxa for the test tree:

#for genus level
autoDeletion(testTree)
#for family level
autoDeletion(testTree,testRankList)

The output of autoDeletion() is a list of size 3. The first element is the tree without erroneous tips. The second element is a character vector of deleted tree tips. The last element is a lists of scores repeatedly calculated until all the erroneous tips are deleted.

3.4 The other utilities to help analysis

The functions calc.Score() and autoDeletion() have arguments of show_progress and num_threads. show_progress is a boolean (TRUE or FALSE) and TRUE by default. When it is TRUE, the progress of calculation is reported on the R console. When it is FALSE, it provides no reports but the calculation will be slightly faster.

num_threads is a positive integer and 1 by default. You can specify the number of threads for faster calculation by this argument. However, this option validly works only when OpenMP is available, and the default compiler in MacOS does not support OpenMP. Single thread calculation is still available for MacOS, but if you want to use multiple threads in MacOS, you need to get OpenMP. One way to install OpenMP is using following commands in the terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install libomp

The function calc.Score() also has an argument of sort. sort is a boolean and TRUE by default. When it is FALSE, the resultant score is no longer sorted by the descending order, and it will be remained as the original order of the tree tips.

The function get.upperRank() returns the genus name of given scientific names assuming that they are connected by underbars. This may be useful to search upper ranks to make a rank list.