Efficiency of the Neighbor-Joining Method in Reconstructing Deep and Shallow Evolutionary Relationships in Large Phylogenies

Sudhir Kumar, Sudhindra R. Gadagkar

Department of Biology, Arizona State University, Tempe, AZ 85287-1501, USA

Received: 7 March 2000 / Accepted: 2 August 2000

Abstract. The neighbor-joining (NJ) method is widely used in reconstructing large phylogenies because of its computational speed and the high accuracy in phylogenetic inference as revealed in computer simulation studies. However, most computer simulation studies have quantified the overall performance of the NJ method in terms of the percentage of branches inferred correctly or the percentage of replications in which the correct tree is recovered. We have examined other aspects of its performance, such as the relative efficiency in correctly reconstructing shallow (close to the external branches of the tree) and deep branches in large phylogenies; the contribution of zero-length branches to topological errors in the inferred trees; and the influence of increasing the tree size (number of sequences), evolutionary rate, and sequence length on the efficiency of the NJ method. Results show that the correct reconstruction of deep branches is no more difficult than that of shallower branches. The presence of zero-length branches in realized trees contributes significantly to the overall error observed in the NJ tree, especially in large phylogenies or slowly evolving genes. Furthermore, the tree size does not influence the efficiency of NJ in reconstructing shallow and deep branches in our simulation study, in which the evolutionary process is assumed to be homogeneous in all lineages.

Key words: Phylogenetic inference — Neighbor-joining method — Large phylogenies — Zero-length branches — Accuracy — Deep versus shallow branches

Introduction

The scope of molecular phylogenetic studies for inferring short- and long-term evolutionary histories of organisms and multigene families has expanded greatly beyond molecular systematics due to an explosive growth in the number of sequences available in genetic databases (e.g., Balczarek et al. 1997; Duret et al. 1994; Higgins et al. 1996; Kumar et al. 1996; Kumar and Rzhetsky 1996; Li 1997; Nei and Kumar 2000). With this growth, data sets for molecular phylogenetics have increased in terms of the number of sequences being analyzed, and the neighbor-joining (NJ) method (Saitou and Nei 1987) has become one of the most commonly used methods. It is computationally efficient, has desirable statistical properties, and is known to produce trees as accurate as, or better than, more computationally intensive and global searching methods (Charleston et al. 1993; Gascuel 1994, 1997; Kuhner and Felsenstein 1994; Nei and Kumar 2000; Nei et al. 1998; Rzhetsky and Nei 1992; Tateno et al. 1994).

Computer simulations provide a convenient way to assess the efficiency of tree-making methods (reviewed by Nei and Kumar 2000). For the NJ method, most of these computer simulation studies have evaluated its overall performance in inferring phylogenetic trees by either calculating its performance in inferring the true tree topology completely or estimating the proportion of the correct interior branches in the inferred tree (e.g., Hillis 1996; Kim 1998; Nei et al. 1998; Strimmer and von Haeseler 1996). However, a number of specific questions regarding the performance of the NJ method remain unexplored. Are shallow branches (branches closer to the tips of the tree) easier to reconstruct than deeper branches? In this case, shallow branches correspond to more recent evolutionary divergences, whereas deep branches establish evolutionary relationships among groups that have diverged earlier in the evolutionary history. How does an increase in the number of sequences affect the correct inference of shallow and deep branches? What are the relative contributions of the evolutionary rate and sequence length on the efficiency of the NJ method? How long should an interior branch be, in terms of the total number of substitutions, in order to be reconstructed correctly?

Correspondence to: Dr. Sudhir Kumar, Life Sciences A-371, Department of Biology, Arizona State University, Tempe, AZ 85287-1501, USA; e-mail: [email protected]

J Mol Evol (2000) 51:544–553
DOI: 10.1007/s002390010118

© Springer-Verlag New York Inc. 2000

Another common feature of previous simulation studies has been that often no distinction was made between expected and realized trees. An expected tree is one in which all branch lengths are expressed in terms of the expected number of nucleotide (or amino acid) substitutions per site, whether or not the evolutionary rate is constant among lineages. A realized tree, on the other hand, has branch lengths equal to the actual number of substitutions per site (Kumar 1996; Nei 1987). The same branch in the realized and expected trees differs in length because evolution is a stochastic process in which the realized tree is one “realization” of the expected tree. The NJ method uses the extant sequences to infer the realized tree rather than the expected tree (Nei and Kumar 2000). Note that the realized tree is not a mere “sample” of the expected tree. Rather, it is an actual quantity to be estimated because the sequences in a real data set are unique products of the evolutionary process, which occurs only once for a given gene.

When the expected number of substitutions on a branch is small, the probability that one or more realized branch lengths is equal to zero is high (Kumar 1996). This suggests that for closely related sequences (slowly evolving genes or population level divergence), the topology of the realized tree may contain multifurcations. Therefore, the performance of all tree-making methods should be evaluated by comparing the inferred tree to the realized tree rather than the expected tree (e.g., Kumar 1996; Tateno 1990). What is the difference in the efficiencies of the NJ method in reconstructing the realized versus the expected trees?

It is worth noting that the expected tree can also have branches with expected length equal to zero simply because the product of the evolutionary time elapsed, the length of the gene, and its rate of evolution is practically zero. In such cases, the topology of the expected tree is still bifurcating, but some interior branches are of zero expected length (e.g., Saitou 1996). For simplicity, we have assumed that all interior branches in the model tree have expected branch lengths ≥ 1 per sequence.

In this paper, we have taken the first step to address the questions raised above. We discuss the results obtained in relation to the presumed increase in complexity of phylogenetic reconstruction with increasing number of sequences.

Fig. 1. A–D. The four basic model topologies used in this study, with the relative branch lengths shown. Composite trees were constructed by stacking these four trees to give Ax, Bx, Cx, and Dx trees. For example, E, F, and G are composite trees consisting of two A trees (A2), two C trees (C2), and eight D trees (D8), respectively. All interior branches in the stacked trees have equal relative lengths.

Computer Simulations

Model Trees. Following Saitou and Imanishi (1989) and Kumar (1996), we considered four basic six-taxon model trees. These trees are drawn in an unrooted fashion in Figs. 1A–D to reflect the fact that the NJ method produces unrooted trees. Previously these model trees have been drawn with a root (indicated by a filled circle) to specify an arbitrary starting point in the computer simulation. Using these four basic trees, we constructed larger composite phylogenies (Fig. 1), as in Kumar (1996). For instance, Fig. 1E is a composite tree consisting of two copies of tree A, where one copy has been grafted onto the other. We refer to topologies generated in this manner as Ax trees, where x refers to the number of copies in the composite tree. We constructed Ax, Bx, Cx, and Dx trees, where x = 1, 2, ..., 10, 16, and 32 (a total of 48 model trees containing up to 192 taxa). In all of these model trees, each interior branch was made to be 1 unit long and the lengths of the external branches are given in multiples of the interior branch length (Figs. 1A–D).

All expected interior branch lengths in a given model tree were kept equal in magnitude to compare directly the performance of the NJ method in reconstructing branches at different depths in the tree as a function of the branch location (depth) alone. The stacked tree structure of our large phylogenies also allowed us to study the change in performance of the NJ method from small trees to the larger composite trees. Alternatively, our composite trees can also be viewed as consisting of multiple monophyletic groups, with each group containing the same number of sequences. This situation is similar to that in multigene family evolutionary studies, where gene duplication events need to be inferred and the data are often available for a similar set of model organisms. While our composite trees are convenient for statistical comparisons, the situation in real life is obviously more complicated. Therefore, we also conducted computer simulations using “hybrid” composite trees that were stacked with trees taken at random from among the four basic trees (Figs. 1A–D), as well as a much larger, 228-sequence, chloroplast rbcL gene tree (Hillis 1996) containing interior branches of varying expected lengths (Fig. 2) and lacking the repeated phylogenetic structure found in our composite trees. This allowed us to evaluate the generality of the results obtained from the composite trees.

Rates of Evolution and Sequence Length. We conducted computer simulations using many sequence lengths and rates of evolution. Because we are comparing the relative performance of the NJ method in correctly reconstructing small and large phylogenies, we discuss the evolutionary rate in the context of the lengths of interior branches, rather than the maximum pairwise distance between sequences, as the latter depends upon the number of sequences in the data. A low rate of evolution refers to an interior branch length of 0.00625 substitution/site. Multiples of this rate (r = 0.00625) were used for all A–D model trees as well as the hybrid model trees. For rbcL trees, we conducted computer simulations with up to 10-fold rate differences. The sequence lengths employed were in multiples of 100 sites for all the model trees.

Simulating Evolutionary Change. For the computer simulation, the starting point was chosen for each tree (marked by the filled circle in Figs. 1A–D), and for this “root” an ancestral sequence of a given length was first generated by randomly selecting nucleotides such that the four nucleotides are expected to occur with equal frequency in the ancestral sequence. This sequence was evolved by introducing random nucleotide substitutions to generate the immediate descendents. In any given branch, the actual (realized) number of nucleotide substitutions was obtained by selecting a random number from a Poisson distribution with mean equal to the expected number of substitutions (rate × sequence length). A given nucleotide was allowed to change to any of the other three with equal probability, resulting in the Jukes and Cantor (1969) model of nucleotide substitution. This process was carried out for all branches moving away from the root, and a set of sequences was generated at the end of this process. The final set of sequences at the external nodes was then used to reconstruct their evolutionary relationships using the NJ method. We generated 1000 simulation replicates for each case, except for the “hybrid” trees, the 96- and 192-taxon A–D trees, and the 228-taxon rbcL trees, where 100 replications were generated.
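The per-branch step described above can be sketched as follows. This is only an illustrative sketch, not the authors' simulation program: the function names (`poisson`, `random_sequence`, `evolve_branch`) are hypothetical, the Poisson sampler is a simple textbook method chosen here for self-containment, and sites are assumed to be hit uniformly at random.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def poisson(mean, rng):
    """Draw a Poisson variate by multiplying uniforms (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def random_sequence(length, rng):
    """Ancestral sequence with the four nucleotides equally likely at each site."""
    return [rng.choice(NUCLEOTIDES) for _ in range(length)]

def evolve_branch(parent, rate, rng):
    """Evolve a copy of `parent` along one branch.

    The realized number of substitutions is Poisson with mean rate * length;
    each substitution changes a random site to one of the other three
    nucleotides with equal probability (Jukes-Cantor style).
    """
    child = list(parent)
    realized = poisson(rate * len(parent), rng)
    for _ in range(realized):
        site = rng.randrange(len(child))
        others = [n for n in NUCLEOTIDES if n != child[site]]
        child[site] = rng.choice(others)
    return child, realized

rng = random.Random(2000)          # arbitrary seed, assumption for reproducibility
root = random_sequence(200, rng)
child, realized = evolve_branch(root, 0.00625, rng)
print(realized, sum(a != b for a, b in zip(root, child)))
```

A realized count of zero leaves the child identical to its parent, which is how zero-length branches arise in the realized tree. Because the same site can be hit more than once, the number of observed differences never exceeds the realized number of substitutions.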

Fig. 2. A 228-taxon rbcL tree.


Definitions. Tree size refers to the number of sequences. An interior branch partitions an unrooted tree into two subtrees, each containing at least two taxa. The cluster size for a given interior branch is defined as the minimum of the two subtree sizes. The cluster size thus directly measures the minimum depth of a branch in terms of the number of sequences contained in the smaller of the two subtrees that it defines. By this definition, the complexity involved in inferring deep branches is higher than that for shallow branches, because the minimum number of taxa to be joined in inferring deep branches is larger than that for shallow branches in the NJ algorithm. Therefore, the depth of a branch depends only on the subtree sizes rather than the subtree heights in terms of the number of substitutions. This definition of branch depth is more relevant to our analysis because the NJ algorithm always clusters shallow branches before deeper branches, irrespective of the number of substitutions.
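As a small illustration of the cluster-size definition above, the sketch below computes the depth of an interior branch from the set of taxa on one side of it. The function name and the representation of a branch as a set of taxon labels are assumptions made for this example only.

```python
def cluster_size(taxa_on_one_side, tree_size):
    """Depth of an interior branch: the smaller of the two subtree sizes.

    `taxa_on_one_side` is any collection of taxon labels found on one side
    of the branch; `tree_size` is the total number of sequences in the tree.
    """
    side = len(set(taxa_on_one_side))
    other = tree_size - side
    if side < 2 or other < 2:
        raise ValueError("an interior branch has at least two taxa on each side")
    return min(side, other)

# A branch separating 4 of 6 taxa from the other 2 has depth 2.
print(cluster_size({"t1", "t2", "t3", "t4"}, 6))
# A branch separating one six-taxon group from the rest of a 12-taxon tree has depth 6.
print(cluster_size(range(6), 12))
```

Under this definition the deepest possible branch in a tree of m sequences has depth m // 2.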

Performance measures. A number of different measures were used to quantify the performance of the NJ method.

P_M represents the proportion of all simulation replicates in which the topology of the NJ tree is identical to that of the model tree.

P_BM is the proportion of all branches of the model tree that are reconstructed correctly in the NJ tree. P_BM = c_{avg}/(m − 3), where c_{avg} is the average number of correctly inferred interior branches of the model tree in all simulation replications, and m − 3 is the number of interior branches for an unrooted tree containing m sequences.

P_0 is the proportion of branches in the realized tree that receive zero substitutions (zero-length branches). P_0 = b_{0,avg}/(m − 3), where b_{0,avg} is the average number of zero-length interior branches in the realized tree, in all the simulation replications.

P_BR is the proportion of all non-zero-length branches of the realized trees that are reconstructed correctly in the NJ tree. P_BR = c_{>0,avg}/[(m − 3) − b_{0,avg}], where c_{>0,avg} is the average number of correctly inferred non-zero-length interior branches in the realized tree, in all simulation replications.

p_B represents the percentage efficiency in correctly estimating branches of a given depth (in terms of the number of taxa) or length (in terms of the number of substitutions). p_B = b/B, where B is the total number of occurrences of the desired type of branches (always non-zero length) in all the simulation replicates, and b is the number of cases in which that branch was found in the NJ tree.
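The three branch-level measures defined above (P_BM, P_0, and P_BR) can be computed from per-replicate counts as in the following sketch. The function name, the argument layout, and the two-replicate input are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
def branch_measures(correct_model, zero_length, correct_nonzero, m):
    """Return (P_BM, P_0, P_BR) from per-replicate branch counts.

    correct_model   : correctly inferred interior branches of the model tree
    zero_length     : zero-length interior branches in the realized tree
    correct_nonzero : correctly inferred non-zero-length interior branches
    m               : number of sequences (an unrooted tree has m - 3 interior branches)
    """
    interior = m - 3
    replicates = len(correct_model)
    c_avg = sum(correct_model) / replicates
    b0_avg = sum(zero_length) / replicates
    c_pos_avg = sum(correct_nonzero) / replicates
    p_bm = c_avg / interior
    p_0 = b0_avg / interior
    # Undefined if every interior branch has zero length in every replicate.
    p_br = c_pos_avg / (interior - b0_avg)
    return p_bm, p_0, p_br

# Hypothetical counts for two replicates of a six-taxon tree (3 interior branches).
p_bm, p_0, p_br = branch_measures([2, 3], [1, 0], [2, 3], m=6)
print(round(p_bm, 3), round(p_0, 3), round(p_br, 3))
```

In this made-up example the one branch missed against the model tree is a zero-length branch, so P_BM falls below 1 while P_BR stays at 1, which is the distinction the measures are designed to expose.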

Results

Accurate Inference of Complete Trees

Table 1 shows the percentage replicates in which the model tree topology was reconstructed correctly (P_M) for trees containing increasing numbers of sequences and sequence lengths, with r = 0.0125. As expected, it is more difficult to reconstruct trees when they contain large numbers of sequences or if the sequences are short (e.g., Kumar 1996; Strimmer and von Haeseler 1996). This is because all m − 3 interior branches (nontrivial partitions) need to be reconstructed correctly for correct inference of the complete tree, which requires selecting the sole true tree from a large number of possible trees (Table 2). Longer sequences improve the efficiency of tree-making methods, partly because the pairwise distances can be estimated with better accuracy (lower variance). Table 1 shows slower rates of P_M decline for larger sequences as the number of sequences increases. For instance, for 18 sequences, P_M is 8% for s = 200 and 96% for s = 1000. When the number of sequences increases to 192, P_M declines to only 46% for s = 1000.
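To give a sense of how quickly the search space referred to above grows, the sketch below counts unrooted bifurcating trees using the standard double-factorial formula (2m − 5)!! for m labeled sequences. The formula is a well-known combinatorial result rather than something stated in the text, and the function names are chosen here for illustration.

```python
import math

def unrooted_tree_count(m):
    """Number of unrooted bifurcating trees for m labeled sequences: (2m - 5)!!."""
    if m < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for k in range(3, 2 * m - 4, 2):   # 3 * 5 * ... * (2m - 5)
        count *= k
    return count

def interior_branch_count(m):
    """Number of interior branches in an unrooted bifurcating tree."""
    return m - 3

for m in (4, 6, 12, 18):
    trees = unrooted_tree_count(m)
    print(m, trees, round(math.log10(trees), 2), interior_branch_count(m))
```

For 6 and 12 sequences this gives 105 and 654,729,075 trees, matching the corresponding entries of Table 2; larger sizes are easier to read on the log10 scale, as in that table.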

Influence of Zero-Length Branches on the Efficiency of NJ

Figure 3 shows the mean number of zero-length branches per replication for different tree sizes, with r and s fixed at 0.00625 and 200, respectively. The probability that a given lineage (interior branch) has experienced zero substitutions is given by e^(−b), where b is the expected branch length in terms of the total number of substitutions per sequence. For s = 200 and r = 0.00625, b = 0.00625 × 200 = 1.25 substitutions. Since all interior branches are of equal expected length in our model trees, the expected proportion of zero-length branches is e^(−b). This expectation is confirmed in the computer simulation results shown in Fig. 3.

Table 1. Percentage replicates in which the complete model tree is reconstructed correctly (P_M) by the NJ method^a

              Sequence length
Sequences     200     500    1000
    6          57      87      98
   12          22      74      96
   18           8      63      96
   24           3      54      94
   30           1      46      93
   36           1      39      91
   42           0      33      90
   48           0      28      89
   54           0      25      86
   60           0      21      86
   96           0       8      79
  192           0       0      46

^a Each value is the arithmetic mean over all the topologies given in Figs. 1A–D, with r = 0.0125.

Table 2. Number of unrooted trees and the corresponding number of interior branches for the complete and subtree sizes (numbers of sequences) in the simulation study

Sequences     Unrooted trees     Interior branches
    4                      3                     1
    5                     15                     2
    6                    105                     3
    7                    945                     4
    8                 10,395                     5
    9                135,135                     6
   10              2,027,025                     7
   12            654,729,075                     9
   18               10^17.28                    15
   24               10^26.75                    21
   30               10^36.94                    27
   36               10^47.69                    33
   42               10^58.90                    39
   48               10^70.51                    45
   54               10^82.45                    51
   60               10^94.70                    57
   96              10^173.10                    93
  192              10^407.79                   189
  228              10^502.06                   225
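The expected proportion of zero-length branches, e^(−b), follows from the Poisson model for the realized number of substitutions on a branch. A minimal sketch of the calculation is given below; the function name is an assumption for this example, and the sequence lengths other than 200 are picked only to show the trend.

```python
import math

def zero_length_probability(rate, sites):
    """Poisson probability of zero substitutions on a branch with mean rate * sites."""
    expected_substitutions = rate * sites
    return math.exp(-expected_substitutions)

for sites in (100, 200, 500, 1000):
    print(sites, round(zero_length_probability(0.00625, sites), 4))
```

For r = 0.00625 and s = 200 the value is e^(−1.25), or about 0.287, so roughly 29% of interior branches are expected to carry no substitutions at that rate and length. Increasing the sequence length shrinks this proportion rapidly.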

Figure 3 also shows the percentage of branches in the model and realized trees that were reconstructed correctly. As expected, the percentage branches correctly inferred increases with increasing sequence length, for model as well as realized trees. However, comparison of the realized tree to the inferred tree shows much higher P_BR values even for small s values. Interestingly, P_BM (i.e., for the model tree) is essentially a mirror image of P_0, the proportion of zero-length branches in the realized tree. This suggests that the zero-length branches in the realized tree contribute significantly towards the decline in NJ efficiency. Therefore, zero-length branches should be properly discounted in any estimation of the NJ efficiency, as the NJ method reconstructs realized rather than model trees. For this reason, we report only the efficiency of the NJ method in reconstructing non-zero-length branches.

Percentage Branches Reconstructed Correctly

In order to present succinctly the rather voluminous computer simulation results (from the thousands of model trees used) in one place, we first present a summary table (Table 3). In this table, the NJ efficiencies (P_BR) were averaged over all rates, topologies, and sequence lengths for a given tree size. Results show that the minimum, maximum, and average NJ efficiencies are similar across tree sizes, which differ 32-fold in the number of sequences (6 to 192). This is further illustrated in Fig. 4, where NJ efficiencies are similar across tree sizes and evolutionary rates, for fixed sequence lengths (100, 200, 500 and 1000). In the figure each value is an average taken from all the topologies for a given tree size and evolutionary rate, for a given sequence length. In general, we find that a medium evolutionary rate leads to a slightly higher performance when compared to lower or higher rates.

Our large phylogenies consist of four basic trees, each of which constitutes a monophyletic group. Table 4 shows the efficiency with which these monophyletic groups were inferred correctly; observed efficiencies are similar for different tree sizes for a given sequence length. It is thus clear that the NJ method is able to infer groups of the same size with similarly high efficiencies in large as well as small phylogenies.

Effect of Branch Depth on NJ Efficiency (Table 5)

As mentioned earlier, the depth of a branch is defined by the size of the smallest subtree connected to it. Furthermore, a branch is considered correctly inferred when it partitions the tree into two clusters, each containing the same set of sequences as in the original tree. In trees with varying expected internal branch lengths (e.g., Fig. 2), the efficiency of reconstructing an internal branch could be influenced by the branch depth and/or branch length. This is not the case in our study because all the interior branches in a given tree are of equal expected length. (Of course, the realized interior branch lengths may differ among branches in any given replication.) This design allows us to look at only the location (depth) of the branch, independent of its length, and facilitates direct comparison across different parts of a tree. Figure 5 shows the efficiency of reconstruction of branches of various depths for two sequence lengths. For each sequence length we find that the efficiency of reconstruction of interior branches is largely similar across all depths of the tree, with deeper branches in fact being reconstructed with higher efficiency in some cases. This observation is somewhat counterintuitive because deep branches are often thought to be more difficult to reconstruct than the shallow ones (see Discussion later). Furthermore, the efficiency is high for sequence lengths of 500 sites or more and relatively lower for smaller sequences.

Fig. 3. Percentage branches of the model trees (P_BM) and realized trees (P_BR) reconstructed correctly by the NJ method, with increasing number of sites, and the corresponding proportion of zero-length branches (P_0; filled circles). The values were averaged over all four topologies and all tree sizes (Ax, Bx, Cx, Dx trees), for r = 0.00625.

Table 3. Percentage branches reconstructed correctly (P_BR) for trees of different sizes^a

              Overall efficiency
Sequences     Average     Minimum     Maximum
    6              92          53         100
   12              94          55         100
   18              94          54         100
   24              95          61         100
   30              95          56         100
   36              95          60         100
   42              95          60         100
   48              95          60         100
   54              95          60         100
   60              95          63         100
   96              95          64         100
  192              95          61         100

^a Each value is an average over all rates of evolution (r = 0.00625 to 0.0625, in steps of 0.00625), numbers of sites (s = 100 to 1000, in steps of 100), and all topologies (Ax, Bx, Cx, and Dx).

Fig. 4. Percentage efficiency (P_BR) of the NJ method for varying evolutionary rates (r) and tree sizes (up to 192 sequences). Average values of P_BR from all Ax, Bx, Cx, and Dx trees are shown.

Efficiency in Reconstructing Branches of Different Realized Lengths

The number of substitutions per sequence that actually occurred in a given branch constitutes the realized length of that branch. This length varies from replication to replication, whereas the expected branch lengths are identical. As mentioned under Computer Simulations, the realized branch lengths are obtained by drawing a random number from a Poisson distribution, with the expected branch length as the mean of the distribution, to simulate the stochastic nature of the evolutionary process. How large should the realized branch length be in order to obtain an NJ efficiency of 95% or higher? Furthermore, how does this length change with sequence length and evolutionary rate (the two determinants of expected branch length)? To address these questions, we computed the percentage efficiency with which branches of different lengths were reconstructed correctly, irrespective of their position in the tree. The resulting branch lengths were standardized; l_95 = [(b − e)/e] × 100, where b is the minimum branch length required for a 95% efficiency, and e is the expected branch length. A negative standardized value shows that a tree with realized branch length that is smaller than the expected branch length can still be reconstructed correctly on average. This standardization allows us to make comparisons across different rates of evolution and sequence lengths (Fig. 6).

Table 4. Percentage replicates in which monophyletic clusters of six taxa were reconstructed correctly^a

              Sequence length
Tree size     200     500    1000
   12          77      94      99
   18          80      96     100
   24          83      97     100
   30          83      97     100
   36          85      97     100
   42          84      97     100
   48          84      97     100
   54          85      97     100
   60          85      97     100
   96          84      97     100
  192          84      96     100

^a Each value is the arithmetic mean over all the topologies given in Figs. 1A–D and evolutionary rates used.

Table 5. Efficiency of reconstructing branches of various depths^a

                Sequence length
Branch depth    200     500    1000
     2           86      96      99
     3           85      97     100
     4           87      97     100
     5           86      97      99
     6           83      97     100
    12           87      99     100
    18           89      99     100
    24           91      99     100
    30           90      99     100
    48           92     100     100
    96           93     100     100

^a Each value is a percentage, averaged over all the topologies given in Figs. 1A–D, tree sizes, and evolutionary rates used.

Fig. 5. Probability of correct reconstruction of branches (p_B) at various depths in trees of different sizes. Each p_B value is an average over 10 evolutionary rates and four topologies (Ax, Bx, Cx, and Dx trees).
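The standardization l_95 = [(b − e)/e] × 100 defined above can be written as a one-line function. The sketch below uses hypothetical input values purely to show the sign convention; they are not results from the study.

```python
def standardized_l95(min_realized_length, expected_length):
    """Percentage difference between the minimum realized branch length needed
    for 95% efficiency (b) and the expected branch length (e)."""
    return (min_realized_length - expected_length) / expected_length * 100.0

# Hypothetical case: expected length 1.25 substitutions, 95% efficiency reached at 1.0.
print(standardized_l95(1.0, 1.25))
# Hypothetical case: 95% efficiency needs twice the expected length.
print(standardized_l95(2.5, 1.25))
```

A negative value means branches shorter than expected are already recovered with 95% efficiency; a positive value means the realized branch must exceed its expectation.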

Figure 6 shows that an increase in sequence length (with evolutionary rate held constant) leads to a significant decrease in l_95. However, an increase in evolutionary rate (with the sequence length held constant) does not change l_95. Therefore, when the total numbers of substitutions in the expected tree are the same, data with longer sequences will perform better than those with faster evolutionary rates. This is not unexpected because the same expected Jukes–Cantor distance will be estimated with lower variance in the former case.

Discussion

In this work, we have presented results from our analysis of large phylogenies in which all interior branches in the expected trees were made equal for any given tree. This stipulation allowed us to examine the relative efficiency with which the deep and shallow interior branches are reconstructed correctly. Furthermore, the stacked structure of our composite model trees is suitable for examining the relative efficiency of correctly reconstructing the same branch in small and large trees. As a result, we are now in a position to establish a “baseline” profile of the NJ performance. In the following we discuss the significance of these results and assess their generality by comparing them to the results obtained from computer simulations involving a 228-taxon rbcL tree (Fig. 2) and some “hybrid” composite trees (see Computer Simulations).

Our simulation results clearly establish the adverse effect of the zero-length branches in the realized tree on the performance of the NJ method (Fig. 3). The NJ method is for reconstructing realized trees rather than the expected trees, and therefore, its efficiency should be measured by comparing the inferred tree to the realized tree. For instance, Hillis (1996) conducted a computer simulation using the model tree in Fig. 2 and showed that the NJ method recovers the expected (“model”) tree when the sequence length was ~5000 sites. The increase in efficiency of the NJ method in recovering the model tree with increasing number of sites can be attributed to (1) decreasing variance of distance estimates and/or (2) the decrease in the number of zero-length branches. Using the expected branch lengths employed by Hillis (1996), we examined the performance of the NJ method by considering the influence of the zero-length branches (Fig. 7). Our results for the efficiency of the NJ method in reconstructing the model tree are similar to those of Hillis (1996). However, now it is clear that an increase in the sequence length directly reduces the number of zero-length branches, and this is highly correlated (almost as a mirror image) with the efficiency of the NJ method (P_BM). In fact, the NJ tree is almost identical to the multifurcating realized tree (>99% branches are correctly inferred) even for only ~500 sites. This result, along with those in Fig. 3, underscores the importance of the distinction between the realized and the model trees in examining the performance of NJ and other methods. In fact, a similar effect is seen when the stepwise addition algorithm is used for the maximum-parsimony method in computer simulations involving the rbcL model tree (results not shown). Zero-length branches can be eliminated either by increasing the sequence length or by increasing the evolutionary rate. Our simulations suggest that the former is more effective than the latter, as the distances can be estimated with lower variances in the former case (also see Fig. 6).

Fig. 6. Average percentage difference between the expected and the minimum realized branch length per sequence needed for p_B ≥ 95%.

Fig. 7. Percentage branches of the model (P_BM) and realized (P_BR) rbcL trees reconstructed correctly by the NJ method, plotted against increasing number of sites. Filled circles show the corresponding proportions of zero-length branches.


The efficiency of NJ in terms of the proportion of branches reconstructed correctly (P_BR) in the realized trees is similar for trees consisting of vastly different number of sequences. Strimmer and von Haeseler (1996) have reported similar results, but they did not remove the negative contribution made by zero-length branches. Since P_BR is similar for large and small phylogenies (Table 3), the efficiency of reconstructing deeper branches is likely to be no worse than that of the shallow branches, if their expected branch lengths are equal. This was indeed the case, as the branches at different depths are inferred correctly with similar efficiencies (Fig. 5). In fact, deeper branches appear to be reconstructed correctly with a higher probability, in some cases. This is due in part to the fact that the estimate of the average distance between groups of sequences, used in reconstructing deep branches, has a lower variance.

An extrapolation of the results in Fig. 5 comes from the analysis of our “hybrid” trees as well as the unequal-internal branch length rbcL tree, which contains branches with depths ranging from 2 to 82 taxa and of different lengths. The reconstruction efficiency remained similar across different branch depths, as long as only the nonzero-length interior branches in the realized trees were considered for measuring the efficiency (Fig. 8). Recently, many investigators have considered the effect of taxon sampling on the efficiency of tree-making methods, in which the main emphasis has been to study and remedy the effect of long branch attraction for small and large phylogenies (see Graybeal 1998; Hillis 1998; Kim 1996; Purvis and Quicke 1997; Yang and Goldman 1997). The large model trees used in our computer simulations were formed by stacking smaller trees. This increases the tree size by the addition of sister groups to existing clusters, rather than the addition of taxa to break up long branches, as done in taxon-sampling studies. Therefore, a comparison of our results to those of the above authors is not straightforward.

The NJ method works in a stepwise fashion, inferring shallow branches first. Therefore, the topological errors in the early stages of tree reconstruction may propagate as we move toward inferring deeper branches. Consequently, one may expect deep branches to be more difficult to reconstruct correctly than shallow branches, all else being equal. However, this intuitive argument is clearly not supported, as the efficiency does not decline with depth (Figs. 5 and 8), suggesting that the accuracy of the NJ method in the later stages of clustering (deep branches) is largely independent of the accuracy at the early stages (shallow branches). To look for an explanation of why this may happen, let us consider the theoretical aspects of the minimum-evolution (ME) principle (Rzhetsky et al. 1995; Rzhetsky and Nei 1992) that forms the basis of the NJ method.
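The stepwise nature of NJ can be seen in a compact sketch of the algorithm. This is a minimal illustration and not the implementation used in the simulations: it uses the standard Q-criterion form of the pair selection, records only the order in which clusters are formed, and omits branch-length estimation. The function names and the small additive distance matrix are hypothetical.

```python
from itertools import combinations

def nj_join_order(names, dist):
    """Return the leaf sets of the clusters in the order NJ forms them.

    dist maps frozenset({x, y}) to the distance between x and y.
    Joining stops when three nodes remain (unrooted tree).
    """
    members = {name: frozenset([name]) for name in names}  # node -> leaves below it
    d = dict(dist)
    order = []
    while len(members) > 3:
        n = len(members)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in members if k != i) for i in members}
        # Join the pair minimizing Q(i, j) = (n - 2) * d_ij - r_i - r_j.
        i, j = min(combinations(members, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        u = ("interior", len(order))  # label for the new interior node
        for k in members:
            if k != i and k != j:
                d[frozenset((u, k))] = 0.5 * (d[frozenset((i, k))]
                                              + d[frozenset((j, k))]
                                              - d[frozenset((i, j))])
        members[u] = members.pop(i) | members.pop(j)
        order.append(members[u])
    return order

def example_distances():
    """A small additive distance matrix, used only for illustration."""
    names = ["a", "b", "c", "d", "e"]
    rows = [[0, 5, 9, 9, 8],
            [5, 0, 10, 10, 9],
            [9, 10, 0, 8, 7],
            [9, 10, 8, 0, 3],
            [8, 9, 7, 3, 0]]
    dist = {frozenset((names[i], names[j])): rows[i][j]
            for i, j in combinations(range(len(names)), 2)}
    return names, dist

names, dist = example_distances()
print([sorted(cluster) for cluster in nj_join_order(names, dist)])
```

Each pass of the loop commits one cluster before any deeper cluster is considered, which is why errors made early could in principle propagate to later steps.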

Fig. 8. Reconstruction efficiency for branches at different depths in the rbcL tree, for 200 sites (A) and 500 sites (B).

Consider the tree in Fig. 9, where groups I and J are neighbors, as are groups K and L. If groups I, J, K, and L are reconstructed correctly, then, under the ME principle, the correct inference of branch e is not affected by inaccuracies in inferring within-group phylogenies (Rzhetsky et al. 1995). That is, errors in phylogeny within groups do not affect higher-level clustering as long as the monophyly of a group is inferred correctly, and the realized branch length (branch e) will dictate the efficiency of correct reconstruction of that branch. When monophyletic relationships within groups (I, J, K, and L) are not correctly inferred, then one of two things may happen. First, group I may contain some taxa that belong to group J, and vice versa (or such swapping may occur for K and L). In this case, the reconstruction of e may not be affected because the true neighbor groups will tend to cluster together anyway. The second possibility is the incorrect grouping of taxa from more distantly related clusters (e.g., taxa from group I clustering within group K). This would depend on the length of branch e: for longer e it is more difficult for a taxon to cross over to a nonsister group. For a deep branch, the number of taxa around it is large and the random possibility of crossover of one or more taxa is potentially larger than for a shallow branch. However, computer simulations for deep branches show that this is not the case. Rather, the efficiency is at least the same, or sometimes greater, which, as mentioned earlier, is because the average distances between groups have lower variance than the pairwise distances for individual sequences.
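The role of the four clusters around an interior branch can be illustrated with average between-cluster distances. The sketch below is our illustration and not the four-cluster test of Rzhetsky et al. (1995): for additive distances, the pairing of clusters that matches the true interior branch gives the smallest sum of average between-cluster distances, and the other two pairings exceed it by twice the length of that branch. Only cluster membership enters the calculation, not the arrangement of taxa within a cluster. The function names and the distance matrix are hypothetical.

```python
from itertools import combinations, product

def mean_between(g1, g2, dist):
    """Average pairwise distance between two clusters of taxa."""
    total = sum(dist[frozenset((x, y))] for x, y in product(g1, g2))
    return total / (len(g1) * len(g2))

def pairing_sums(I, J, K, L, dist):
    """Sum of average between-cluster distances for each of the three pairings."""
    return {
        "IJ|KL": mean_between(I, J, dist) + mean_between(K, L, dist),
        "IK|JL": mean_between(I, K, dist) + mean_between(J, L, dist),
        "IL|JK": mean_between(I, L, dist) + mean_between(J, K, dist),
    }

def example_distances():
    """A small additive distance matrix, used only for illustration."""
    names = ["a", "b", "c", "d", "e"]
    rows = [[0, 5, 9, 9, 8],
            [5, 0, 10, 10, 9],
            [9, 10, 0, 8, 7],
            [9, 10, 8, 0, 3],
            [8, 9, 7, 3, 0]]
    return {frozenset((names[i], names[j])): rows[i][j]
            for i, j in combinations(range(len(names)), 2)}

dist = example_distances()
sums = pairing_sums(["a"], ["b"], ["c"], ["d", "e"], dist)
print(min(sums, key=sums.get), sums)
```

In this example the interior branch separating {a, b} from {c, d, e} has length 3, so the two incorrect pairings are larger than the correct one by 6.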

In maximum-parsimony analysis, it is generally thought that homoplasy will hinder accurate reconstruction of higher-level relationships (deep branches), as the phylogenetic signal to infer deep branches may deteriorate with later evolutionary changes. This intuition needs to be examined by computer simulation. In the case of distance-matrix methods such as the NJ method, however, pairwise distances are computed in a step independent of the reconstruction of the evolutionary histories, and, perhaps consequently, the degradation of the phylogenetic signal does not appear to occur.

The argument presented above is only approximate, as the NJ method implements the ME criterion locally (at each stage of clustering) rather than globally. We have examined the performance of NJ with respect to the same optimality criterion as for ME. That is, we compared the sum of ordinary least-squares estimates of branch lengths (S) of each of our NJ trees with that of the corresponding model trees. We found that the NJ tree is less optimal than the true tree in only 22% of the replicates (see also Nei et al. 1998). In those cases, the least optimal NJ tree was only 3% worse than the true tree in terms of the S value. Relating the effects of sequence length and evolutionary rate on the percentage optimality score difference (Nei et al. 1998) between the NJ and the model trees (not shown), we found that for a given sequence length there is little difference between the optimality score for the NJ topology and that for the model topology, irrespective of the tree size. The worst NJ performance was for large trees with very small sequence lengths (or very slow evolutionary rates), as there was a very large number of statistically equally good trees (Kumar 1996). Therefore, the results presented in this paper are generally applicable to methods with underlying principles similar to the NJ method (e.g., Gascuel 1997).
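The S value compared above can be computed for any topology by ordinary least squares. The following sketch is a minimal illustration under our own assumptions: the topology is given as a list of bipartitions (one per branch), the normal equations are solved by plain Gauss-Jordan elimination, and the percentage difference is taken as 100 * (S_alt - S_ref) / S_ref, which is one plausible reading of the score difference and is not confirmed by this section. All names and the distance matrix are hypothetical.

```python
from itertools import combinations

def ols_tree_length(names, splits, dist):
    """Sum of ordinary least-squares branch lengths (S) for an unrooted topology.

    splits: one set of leaf names per branch (the leaves on one side of it),
            including a singleton set for every external branch.
    dist:   maps frozenset({x, y}) to the pairwise distance.
    """
    pairs = list(combinations(names, 2))
    # A[p][b] = 1 when branch b lies on the path joining the leaves of pair p,
    # i.e. when exactly one of the two leaves is inside the split.
    A = [[1.0 if (u in s) != (v in s) else 0.0 for s in splits] for u, v in pairs]
    obs = [float(dist[frozenset(p)]) for p in pairs]
    m = len(splits)
    # Normal equations (A^T A) b = A^T obs, stored as an augmented matrix.
    M = [[sum(A[p][r] * A[p][c] for p in range(len(pairs))) for c in range(m)]
         + [sum(A[p][r] * obs[p] for p in range(len(pairs)))] for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return sum(M[r][m] / M[r][r] for r in range(m))

def percent_difference(s_alt, s_ref):
    """Assumed definition of the percentage optimality score difference."""
    return 100.0 * (s_alt - s_ref) / s_ref

names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]  # additive, illustration only
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i, j in combinations(range(5), 2)}
external = [{x} for x in names]
s_true = ols_tree_length(names, external + [{"a", "b"}, {"d", "e"}], dist)
s_alt = ols_tree_length(names, external + [{"a", "c"}, {"d", "e"}], dist)
print(s_true, s_alt, percent_difference(s_alt, s_true))
```

With exactly additive distances the topology that generated them has the smaller S, which is the property the ME principle relies on; with estimated distances the comparison is subject to sampling error.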

It is important to exercise caution in extrapolating results from any computer simulation to real-life situations. We have assumed that the evolutionary processes among the lineages have remained the same throughout the evolutionary history, i.e., the evolutionary process is stationary. This condition is often met in short-term evolution (e.g., population data) and in slowly evolving genes but is likely to be violated when we consider long-term evolutionary histories of genes and species. If the stationarity condition is not met, the correct inference of deep branches is likely to be adversely affected (e.g., Steel et al. 1993). This aspect will be examined in further computer simulation studies.

Acknowledgments. We would like to thank Roman G. Johnson and Tushar S. Gadagkar for helping with the simulations and Dr. Thomas E. Dowling for his comments on the manuscript. We thank two anonymous reviewers for their helpful comments. This research was supported by a National Institutes of Health research grant to S.K.

References

Balczarek K, Lai Z-C, Kumar S (1997) Evolution and functional diversification of paired box (Pax) DNA-binding domains. Mol Biol Evol 14:829–842

Charleston MA, Hendy MD, Penny D (1993) Neighbor-joining uses the optimal weight for net divergence. Mol Phylogenet Evol 2:6–12

Duret L, Mouchiroud D, Gouy M (1994) HOVERGEN: A database of homologous vertebrate genes. Nucleic Acids Res 22:2360–2365

Gascuel O (1994) A note on Sattath and Tversky’s, Saitou and Nei’s, and Studier and Keppler’s algorithms for inferring phylogenies from evolutionary distances. Mol Biol Evol 11:961–963

Gascuel O (1997) Concerning the NJ algorithm and its unweighted version, UNJ. In: Mirkin B, McMorris FR, Roberts FS, Rzhetsky A (eds) Mathematical hierarchies and biology. American Mathematical Society, Providence, RI, pp 149–170

Graybeal A (1998) Is it better to add taxa or characters to a difficult phylogenetic problem? Syst Biol 47:9–17

Higgins DG, Thompson JD, Gibson TJ (1996) Using CLUSTAL for multiple sequence alignments. In: Doolittle RF (ed) Methods in enzymology. Academic Press, San Diego, pp 383–401

Hillis DM (1996) Inferring complex phylogenies. Nature 383:130–131

Hillis DM (1998) Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst Biol 47:3–8

Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic Press, New York, pp 21–132

Kim J (1996) General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst Biol 45:363–374

Fig. 9. A schematic showing the topological configuration around an interior branch (e). Four clusters (I, J, K, and L) always surround any given internal branch.

Kim J (1998) Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst Biol 47:43–60

Kuhner MK, Felsenstein J (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:459–468

Kumar S (1996) A stepwise algorithm for finding minimum evolution trees. Mol Biol Evol 13:584–593

Kumar S, Rzhetsky A (1996) Evolutionary relationships of eukaryotic kingdoms. J Mol Evol 42:183–193

Kumar S, Balczarek KA, Lai Z-C (1996) Evolution of the hedgehog gene family. Genetics 142:965–972

Li W-H (1997) Molecular evolution. Sinauer Associates, Sunderland, MA

Nei M (1987) Molecular evolutionary genetics. Columbia University Press, New York

Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, New York

Nei M, Kumar S, Takahashi K (1998) The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc Natl Acad Sci USA 95:12390–12397

Purvis A, Quicke DLJ (1997) Building phylogenies: Are the big easy? Trends Ecol Evol 12:49–50

Rzhetsky A, Nei M (1992) Statistical properties of the ordinary least-squares, generalized least-squares and minimum-evolution methods of phylogenetic inference. J Mol Evol 35:367–375

Rzhetsky A, Kumar S, Nei M (1995) Four cluster analysis: A simple method to test phylogenetic hypotheses. Mol Biol Evol 12:163–167

Saitou N (1996) Reconstruction of gene trees from sequence data. Methods Enzymol 266:427–449

Saitou N, Imanishi M (1989) Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree reconstruction in obtaining the correct tree. Mol Biol Evol 6:514–525

Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

Steel MA, Lockhart PJ, Penny D (1993) Confidence in evolutionary trees from biological sequence data. Nature 364:440–442

Strimmer K, von Haeseler A (1996) Accuracy of neighbor joining for n-taxon trees. Syst Biol 45:516–523

Tateno Y (1990) A method for molecular phylogeny construction by direct use of nucleotide sequence data. J Mol Evol 30:85–93

Tateno Y, Takezaki N, Nei M (1994) Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol Biol Evol 11:261–277

Yang Z, Goldman N (1997) Are big trees indeed easy? Trends Ecol Evol 12:357
