Date post: 25-Jan-2021
  • 1

    Avoiding potential biases in ses.PD estimations with the Picante software package

    Rafael Molina-Venegas¹

    1. Institute of Plant Sciences, University of Bern, Altenbergrain 21, Bern 3013, Switzerland.

    Contact: [email protected]

    Running title: avoiding biases in ses.PD estimations

    bioRxiv preprint doi: https://doi.org/10.1101/579300; this version posted March 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

  • 2

    Abstract

    1. Faith’s phylogenetic diversity (PD) is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante R package, which is nowadays a reference software for phylogenetic analyses.

    2. Because PD is not statistically independent of species richness, the routine procedure is to standardize the observed PD values for unequal richness across samples. The function ses.pd, which is also implemented in the Picante R package, is the reference function to conduct such standardization.

    3. Unfortunately, I have detected an anomaly in the function that may result in biased estimations of standardized PD values, particularly in communities with low species richness (i.e. fewer than four species) and unbalanced phylogenies.

    4. I conduct a simple simulation exercise to illustrate the issue and propose two alternative, easy-to-implement solutions to work around the problem.

    Keywords: Phylogenetic diversity; Picante; ses.pd; standardization.


  • 3

    Introduction

    Faith’s phylogenetic diversity (PD; Faith, 1992), defined as the sum of all branch lengths connecting taxa in a sample, is one of the most widely used indices of phylogenetic structure in the eco-phylogenetic literature. The metric became notably popular with the publication of the function pd as part of the Picante software (Kembel et al., 2010), which is nowadays a reference R package for phylogenetic analyses. The pd function takes three arguments: (i) “samp”, a community data matrix (samples in rows and species in columns), (ii) “tree”, a phylo tree object including all the species in the community data matrix, and (iii) “include.root”, a logical argument. If the latter is set to TRUE (the default), then the PD of every sample in the community data matrix will include the distance between the most recent common ancestor (MRCA) of the species in the sample and the root of the supplied phylogeny (hereafter “MRCA – root” distance). Otherwise, the MRCA – root distance is excluded from the computations.
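
The effect of the include.root argument can be illustrated on a toy tree. The following is a minimal Python sketch, not Picante's R implementation: the tree, its branch lengths and the helper name `faith_pd` are all hypothetical, and the tree is stored as a simple child-to-parent mapping. Returning 0 for a single-tip sample when the root is excluded is a convention of this sketch only.

```python
# Hypothetical ultrametric toy tree: child -> (parent, branch length).
# Every tip (A, B, C, D) lies 2 units from the root.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 1.0),
    "C": ("n2", 0.5),
    "D": ("n2", 0.5),
    "n2": ("root", 1.5),
}


def faith_pd(tree, sample, include_root=True):
    """Faith's PD of the tips in `sample`.

    include_root=True  -> total length of the union of tip-to-root paths.
    include_root=False -> the same total minus the branches shared by every
                          sampled tip, i.e. minus the MRCA-to-root distance.
    """
    tips = set(sample)
    if not tips:
        return 0.0
    lengths = {}  # branch (named by its child node) -> length
    counts = {}   # branch -> number of sampled tips whose root path uses it
    for tip in tips:
        node = tip
        while node in tree:  # walk up until the root is reached
            parent, length = tree[node]
            lengths[node] = length
            counts[node] = counts.get(node, 0) + 1
            node = parent
    total = sum(lengths.values())
    if include_root:
        return total
    # Branches used by all sampled tips lie between their MRCA and the root.
    shared = sum(lengths[n] for n, c in counts.items() if c == len(tips))
    return total - shared


print(faith_pd(TREE, ["A", "B"], include_root=True))   # 3.0
print(faith_pd(TREE, ["A", "B"], include_root=False))  # 2.0
print(faith_pd(TREE, ["A", "C"], include_root=True))   # 4.0
print(faith_pd(TREE, ["A", "C"], include_root=False))  # 4.0
```

The two options differ only for samples whose MRCA is not the root (here A and B); for a sample that spans the root (A and C), the MRCA – root distance is zero and both options agree.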

    Importantly, PD is not statistically independent of species richness, and the routine procedure is to standardize the observed PD values for unequal richness across samples (see the documentation for the pd function in the Picante R package; Kembel et al., 2010). Typically, the observed PD is compared to a null distribution of PD values generated by shuffling species names across the tips of the phylogeny many times (e.g. 999 times). The function ses.pd, which is also implemented in Picante, is the reference function to conduct such standardization. In addition to the arguments of pd, ses.pd includes multiple null models that can be used to generate null distributions (the default model is “taxa.labels”, which shuffles taxa labels across the phylogenetic tips). Since ses.pd calls pd internally, the user can specify whether the MRCA – root distance should be included in the calculation of the observed PD and the corresponding null PD values. However, I have noted that ses.pd computes null PD values without including the MRCA – root distance regardless of the logical value specified in the include.root argument. That is, if include.root = TRUE (the default), the observed PD will include the MRCA – root distance, but the null PD values will not (see Supplementary Material). Unfortunately, this anomaly in the ses.pd function may result in biased estimations of standardized PD values (hereafter “ses.PD”), particularly in samples with low species richness. This is because the lower the species richness of the samples, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance from the computations when it should be included (i.e. when include.root = TRUE). Here, I conduct a simple simulation exercise to illustrate this issue.

    Materials and Methods

    I simulated four different community data matrices with n = 50 samples (rows) and m = 25, 50, 100 and 200 species (columns), respectively. Then, I used the function pbtree implemented in the phytools R package (Revell, 2012) to simulate 500 pure-birth phylogenies (root-to-tip distance scaled to one) for each of m = 25, 50, 100 and 200 tips (2000 phylogenies in total), representing the species in the community data matrices. Finally, I simulated four different community datasets per community data matrix (each community data matrix represents a different species pool) with fixed species richness within datasets (i.e. equal row sums). To do so, I assigned 2, 4, 8 and 16 species, respectively, to each of the samples in the community data matrices by randomly picking species from the corresponding pools. For each dataset (16 in total), I computed ses.PD values for the samples using the 500 simulated phylogenies and the ses.pd function as implemented in Picante (Kembel et al., 2010; hereafter “ses.pd-Picante”). The argument include.root was set to TRUE, and null distributions were generated using the taxa.labels model and 999 randomizations. Then, I reanalyzed the data using a corrected version of the ses.pd-Picante function that actually includes the MRCA – root distance in all the computations if the argument include.root is set to TRUE. Both functions are identical in all other respects (see Supplementary Material). I used the same seed (random number generator) to analyze the data with both functions. Finally, I compared the ses.PD values derived from each function using the cross-validation R-squared ($R^2_{cv}$) (Molina-Venegas, Moreno-Saiz, Castro, Davies, Peres-Neto & Rodríguez, 2018). $R^2_{cv}$ = 1 indicates a perfect match between the ses.PD values obtained from both functions, and $R^2_{cv}$ < 1 indicates an imperfect match. $R^2_{cv}$ varies from 1 to minus infinity. Since I used the same seed to analyze the data with both functions, the randomization pattern was preserved, and therefore $R^2_{cv}$ will be equal to 1 if both functions yield identical results. The R code to reproduce all the analyses, along with the corrected version of the ses.pd-Picante function, is provided as Supplementary Material. All analyses were conducted using Picante version 1.7 (latest version) and R version 3.4.3 (R Core Team, 2017), yet results were the same regardless of the version of the package (the ses.pd function was first implemented in Picante version 0.7-2 and released on the CRAN repository in July 2009).
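
Two steps of the design above lend themselves to a short sketch: drawing samples with fixed richness from a species pool, and scoring the agreement between two sets of ses.PD values. The Python code below is illustrative only and is not the Supplementary Material. The function names are hypothetical, and $R^2_{cv}$ is assumed to take the usual predictive form, 1 − Σ(reference − predicted)² / Σ(reference − mean(reference))², with the corrected function's values as the reference; the text does not state which set of values plays that role.

```python
import random


def simulate_community(n_samples, pool_size, richness, seed):
    """Presence/absence matrix (rows = samples) with equal row sums."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    matrix = []
    for _ in range(n_samples):
        present = set(rng.sample(range(pool_size), richness))
        matrix.append([1 if j in present else 0 for j in range(pool_size)])
    return matrix


def r2_cv(reference, predicted):
    """Cross-validation R-squared; 1 means identical values, no lower bound."""
    ref_mean = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - ref_mean) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# 50 samples of 2 species each, drawn from a pool of 25 species.
community = simulate_community(n_samples=50, pool_size=25, richness=2, seed=1)

# Made-up ses.PD values, used only to exercise the function.
reference = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(r2_cv(reference, reference))                    # 1.0
print(r2_cv(reference, [-1.0, -0.5, 0.0, 1.0, 2.0]))  # 0.875
```

Because the score is 1 only when every pair of values coincides, any departure from 1 under a shared seed reflects a difference between the two functions rather than randomization noise.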

    Results and Discussion

    I found substantial mismatch between the ses.PD values derived from the ses.pd-Picante function and its corrected version, particularly at low species richness and regardless of the size of the species pool (Figs. 1 and S1 in Appendix 1). More specifically, the ses.pd-Picante function yielded higher ses.PD values than expected (i.e. above the 1:1 line) on the negative side of the distribution (Figs. 2 and S2 in Appendix 1). However, results derived from both functions rapidly converged following an exponential trend as species richness increased (Figs. 1 and S1 in Appendix 1), suggesting that the ses.pd-Picante function will introduce biases only when species richness is very low (i.e. fewer than four species). Fortunately, such low richness levels are rare in natural communities, yet they are occasionally reported along with their ses.PD values (e.g. Mennes, Moerland, Rath, Smets & Merckx, 2015; Geedicke, Schultz, Rudolph & Oldeland, 2016; Nowakowski, Frishkoff, Thompson, Smith & Todd, 2018). Moreover, some simulation analyses have also reported ses.PD values for samples including only two species (e.g. Mazel et al., 2016), and diversity experiments often include plots with very few species (e.g. Symstad et al., 2003).

    Figs. 3 and S3 show that mismatches were more likely to occur with highly unbalanced trees (i.e. those with internal nodes defining divergent lineages of unequal size). This is because the higher the imbalance of the phylogeny, the lower the probability that the phylogenetic branches connecting species in the null samples traverse the root node of the supplied phylogeny, and therefore the higher the impact of excluding the MRCA – root distance when it should be included. Given the unbalanced nature of most real phylogenies, I conclude that future studies can avoid potential biases in ses.PD estimations (particularly in communities with very low species richness) by either removing the MRCA – root distance from all the computations conducted by the ses.pd-Picante function (i.e. setting the include.root argument to FALSE) or using its corrected version if the MRCA – root distance is to be included (see Supplementary Material).
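
The probabilistic argument above can be made concrete with a little combinatorics. The Python sketch below rests on simplifying assumptions: the root is bifurcating, a null sample is a uniformly random set of tips, and only the split at the root is considered (Colless' index, by contrast, aggregates imbalance over all internal nodes). The function name `p_span_root` is hypothetical. A null sample fails to traverse the root exactly when all its tips fall in the same root-daughter clade, so the probability of traversing it is 1 − [C(n_left, k) + C(n_right, k)] / C(n, k) for richness k.

```python
from math import comb


def p_span_root(n_left, n_right, richness):
    """Probability that a random sample of `richness` tips contains tips
    from both clades descending from a bifurcating root."""
    n = n_left + n_right
    # comb() returns 0 when the clade is smaller than the sample.
    same_side = comb(n_left, richness) + comb(n_right, richness)
    return 1 - same_side / comb(n, richness)


# 25-tip pool: a nearly even root split versus a maximally uneven one.
for split in [(12, 13), (1, 24)]:
    for richness in (2, 4, 8, 16):
        print(split, richness, round(p_span_root(*split, richness), 3))
```

For a 25-tip tree, a two-species null sample spans the root with probability 0.52 under a 12/13 split but only 0.08 under a 1/24 split, whereas a 16-species sample always spans the root under the 12/13 split. Whenever the null sample does not span the root, its MRCA – root distance is positive and is the quantity dropped from the null PD.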

    Acknowledgments

    I thank the Scientific Computation Centre of Andalusia (CICA) for the computing services they provided.

    Supporting Information

    Appendix 1. R code used for the analyses.

    REFERENCES

    Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1–10. doi:10.1016/0006-3207(92)91201-3

    Geedicke, I., Schultz, M., Rudolph, B., & Oldeland, J. (2016). Phylogenetic clustering found in lichen but not in plant communities in European heathlands. Community Ecology, 17, 216–224. doi:10.1556/168.2016.17.2.10

    Kembel, S. W., Cowan, P. D., Helmus, M. R., Cornwell, W. K., Morlon, H., Ackerly, D. D., … Webb, C. O. (2010). Picante: R tools for integrating phylogenies and ecology. Bioinformatics, 26, 1463–1464.

    Mazel, F., Davies, T. J., Gallien, L., Renaud, J., Groussin, M., Münkemüller, T., & Thuiller, W. (2016). Influence of tree shape and evolutionary time-scale on phylogenetic diversity metrics. Ecography, 39, 913–920. doi:10.1111/ecog.01694

    Mennes, C. B., Moerland, M. S., Rath, M., Smets, E. F., & Merckx, V. S. F. T. (2015). Evolution of mycoheterotrophy in Polygalaceae: The case of Epirixanthes. American Journal of Botany, 102, 598–608. doi:10.3732/ajb.1400549

    Molina-Venegas, R., Moreno-Saiz, J. C., Parga, I. C., Davies, T. J., Peres-Neto, P. R., & Rodríguez, M. Á. (2018). Assessing among-lineage variability in phylogenetic imputation of functional trait datasets. Ecography, 41, 1740–1749. doi:10.1111/ecog.03480

    Nowakowski, A. J., Frishkoff, L. O., Thompson, M. E., Smith, T. M., & Todd, B. D. (2018). Phylogenetic homogenization of amphibian assemblages in human-altered habitats across the globe. Proceedings of the National Academy of Sciences, 115, E3454–E3462. doi:10.1073/pnas.1714891115

    R Core Team. (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

    Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217–223.

    Symstad, A. J., Chapin, F. S., Wall, D. H., Gross, K. L., Huenneke, L. F., Mittelbach, G. G., … Tilman, D. (2003). Long-term and large-scale perspectives on the relationship between biodiversity and ecosystem functioning. BioScience, 53, 89–98. doi:10.1641/0006-3568(2003)053[0089:LTALSP]2.0.CO;2


  • 9

    Figure 1. Violin plot showing the cross-validation R-squared scores for the comparisons between ses.PD values derived from the function ses.pd-Picante (Kembel et al., 2010) and its corrected version. Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool (community data matrix) of 25 species (see Fig. S1 in Appendix 1 for results derived from community data matrices with 50, 100 and 200 species).


  • 10

    Figure 2. Scatter plots showing the relationship between ses.PD values (25,000 per plot) derived from the function ses.pd-Picante (Kembel et al., 2010; y-axis) and its corrected version (x-axis). Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S2 for results derived from species pools of 50, 100 and 200 species). The grey lines represent the expected 1:1 relationship.


  • 11

    Figure 3. Relationship between cross-validation R-squared scores (y-axis) and tree imbalance (i.e. Colless’ index, x-axis). Analyses were conducted using datasets with species richness = 2, 4, 8 and 16, respectively, 500 simulated phylogenies and a species pool of 25 species (see Fig. S3 for results derived from species pools of 50, 100 and 200 species). The higher the value of Colless’ index, the higher the imbalance of the phylogeny (values scaled between 0 and 1).
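
For readers unfamiliar with the x-axis of Figure 3, Colless' index sums, over all internal nodes of a bifurcating tree, the absolute difference between the numbers of tips descending from the two daughter branches. The Python sketch below is a minimal illustration on nested tuples. The function names are hypothetical, and the scaling by the maximum attainable value, (n − 1)(n − 2) / 2 for n tips, is an assumption: the caption states only that values were scaled between 0 and 1, not which normalization was used.

```python
def colless(tree):
    """Return (number of tips, raw Colless index) for a binary tree given as
    nested 2-tuples, e.g. (("A", "B"), "C"). Any non-tuple is a tip."""
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def colless_scaled(tree):
    """Colless index divided by its maximum, (n - 1)(n - 2) / 2 (needs n >= 3)."""
    n_tips, raw = colless(tree)
    return raw / ((n_tips - 1) * (n_tips - 2) / 2)


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(colless_scaled(balanced))     # 0.0
print(colless_scaled(caterpillar))  # 1.0
```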
