
Annu. Rev. Genet. 1996. 30:371-403 Copyright © 1996 by Annual Reviews Inc. All rights reserved

PHYLOGENETIC ANALYSIS IN MOLECULAR EVOLUTIONARY GENETICS Masatoshi Nei Institute of Molecular Evolutionary Genetics and Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802; e-mail: [email protected]

KEY WORDS: molecular phylogeny, tree-building methods, four-cluster analysis, linearized trees, ancestral proteins


ABSTRACT Recent developments of statistical methods in molecular phylogenetics are reviewed. It is shown that the mathematical foundations of these methods are not well established, but computer simulations and empirical data indicate that currently used methods such as neighbor joining, minimum evolution, likelihood, and parsimony methods produce reasonably good phylogenetic trees when a sufficiently large number of nucleotides or amino acids are used. However, when the rate of evolution varies extensively from branch to branch, many methods may fail to recover the true topology. Solid statistical tests for examining the accuracy of trees obtained by neighbor joining, minimum evolution, and least-squares methods are available, but the methods for likelihood and parsimony trees are yet to be refined. Parsimony, likelihood, and distance methods can all be used for inferring amino acid sequences of the proteins of ancestral organisms that have become extinct.

INTRODUCTION Phylogenetic analysis of DNA or protein sequences has become an important tool for studying the evolutionary history of organisms from bacteria to humans. Since the rate of sequence evolution varies extensively with gene or DNA segment (17, 88, 142), one can study the evolutionary relationships of virtually all levels of classification of organisms (kingdoms, phyla, classes, families, genera, species, and intraspecific populations). Phylogenetic analysis is also important for clarifying the evolutionary pattern of multigene families (4, 44, 93) as well as for understanding the adaptive evolution at the molecular level (15, 64, 143). This technique also gives much deeper insight into the mechanism of maintenance of polymorphic alleles in populations (34, 128).

Reconstruction of phylogenetic trees by using statistical methods was initiated independently in numerical taxonomy for morphological characters (120) and in population genetics for gene frequency data (13). Some of the statistical methods developed for these purposes are still used for phylogenetic analysis of molecular data, but in recent years many new methods have been developed. Felsenstein (31) and Swofford et al (124) reviewed various statistical methods from mathematical points of view. In this review, I discuss only recently developed methods or newly clarified statistical properties of previous methods, with emphasis on practical utilities rather than mathematical details or mathematical possibilities. Because of space limitation, I do not discuss the phylogenetic analysis of gene frequency data. Citation of the literature is also restricted to papers directly related to the subject.

PHYLOGENETIC ANALYSIS OF DNA OR PROTEIN SEQUENCES It is now customary to consider the reconstruction of a phylogenetic tree as a statistical inference of a true phylogenetic tree, which is unknown. There are two processes involved in this inference: "estimation" of the topology (branching pattern of a tree) and estimation of branch lengths for a given tree topology. When a topology is known, statistical estimation of branch lengths is relatively simple, and one can use several statistical methods such as the least squares and the maximum likelihood methods. The problem is the estimation or reconstruction of a topology. When there are a sizable number of DNA or protein sequences (say 10), the number of possible topologies is enormously large (more than 1 million) (14), and it is generally very difficult to choose the correct topology among them.
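The size of the search space can be checked with a few lines of Python. The sketch below relies on the standard counting result that m labelled sequences admit 3 x 5 x ... x (2m - 5) unrooted bifurcating topologies; the formula and the function names are not given in the text and are supplied here only for illustration.

```python
def count_unrooted_topologies(m: int) -> int:
    """Number of unrooted bifurcating topologies for m labelled sequences.

    Uses the product 3 * 5 * ... * (2m - 5), which equals 1 for m = 3.
    """
    if m < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for odd in range(3, 2 * m - 4, 2):
        count *= odd
    return count


def count_rooted_topologies(m: int) -> int:
    """A root behaves like one extra sequence, so shift m by one."""
    return count_unrooted_topologies(m + 1)


for m in (4, 5, 10):
    print(m, count_unrooted_topologies(m), count_rooted_topologies(m))
```

For m = 10 the count of unrooted topologies is 2,027,025, in line with the "more than 1 million" quoted above.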

In phylogenetic inference a certain optimization principle such as the maximum likelihood (ML) or minimum evolution (ME) principle is often used for choosing the most likely topology. The ML method is a well-established statistical method of parameter estimation; it gives the smallest variance of a parameter estimate when sample size is large. However, this is true only when a given probability space is considered. In the construction of phylogenetic trees, maximization of the likelihood is done for each topology separately by using a different likelihood function (28), and the topology with the highest (maximum) likelihood is chosen as an estimate of the true topology. However, since different topologies represent different probability spaces of parameters, it is not clear whether the highest likelihood tree is expected to be the true tree unless an infinite number of nucleotides are examined (88, 148, 150). Indeed, it is not difficult to find examples in which the ML method is inferior to other methods in obtaining the true tree, as is mentioned later. Note also that the regularity conditions (continuity and differentiability of the likelihood function) required for the asymptotic properties of ML estimators are not satisfied in phylogenetic reconstruction (148, 150). Some authors have suggested that topologies are parameters, but these parameters are not included in the likelihood function that is maximized.

Extending Cavalli-Sforza & Edwards' (14) idea, Rannala & Yang (97) attempted to estimate a topology under the assumption that a new species is formed following the birth-death process in statistics. In this case, a topology is treated as a random variable. Mathematically, this is a reasonable approach, but since the birth-death process is unlikely to describe the real speciation events (25), it is still unclear how useful their new approach is in real data analysis. Note also that the pattern of nucleotide substitution often changes with site and time, particularly when long-term evolution is considered (see later), and at this moment no study has been made on this problem.

A similar criticism applies to all other tree-building methods, though the nature of the criticism varies with the method. That is, the statistical foundation of topology estimation by any optimization principle is not well established. Nevertheless, computer simulations have shown that the optimization principles currently used generally work well under biologically realistic conditions.

The methods of phylogenetic inference currently used in molecular phylogenetics can be classified into three major groups: distance methods, likelihood methods, and parsimony methods. Recently, Hendy and colleagues (53, 55, 56) proposed the use of the Hadamard conjugation for phylogenetic reconstruction (closest tree method). However, its practical utility is yet to be examined.

Distance Methods In distance methods, an evolutionary distance is computed for all pairs of sequences, and a phylogenetic tree is constructed from pairwise distances by using the least squares, minimum evolution, or some other criterion. The evolutionary distance used for this purpose is usually an estimate of the number of nucleotide or amino acid substitutions per site, but other distance measures may also be used. There are a large number of distance methods for constructing phylogenetic trees (31, 88, 89), but those commonly used are based on the principles of least squares and minimum evolution.

LEAST-SQUARES (LS) METHODS The principle of LS methods is to compute the minimum sum of squared differences between observed pairwise distances and estimated pairwise distances (patristic distances) (88) for a given topology and to choose a topology that shows the smallest minimum sum of squared differences. Cavalli-Sforza & Edwards (14) suggested that the ordinary or generalized LS methods can be used for distances computed from gene frequency data, whereas Fitch & Margoliash (37) used a weighted LS method. Later Bulmer (9) implemented and formalized the generalized LS method for DNA and protein sequence data.
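For four sequences the ordinary LS estimates have a simple closed form, which makes the criterion easy to illustrate. The following Python sketch is only an illustration under that restriction: the function name, the data layout, and the toy distances (generated from a tree with exterior branches 1, 2, 3, 4 and interior branch 5) are assumptions, not part of the methods cited above.

```python
from itertools import combinations


def ols_four_taxon(d, split):
    """Ordinary LS branch lengths and residual sum of squares for ((A,B),(C,D)).

    d maps frozenset({x, y}) to the observed distance; split names the topology.
    """
    (A, B), (C, D) = split
    g = lambda x, y: d[frozenset((x, y))]
    ab, cd = g(A, B), g(C, D)
    ac, ad, bc, bd = g(A, C), g(A, D), g(B, C), g(B, D)
    lengths = {
        A: ab / 2 + (ac + ad - bc - bd) / 4,
        B: ab / 2 + (bc + bd - ac - ad) / 4,
        C: cd / 2 + (ac + bc - ad - bd) / 4,
        D: cd / 2 + (ad + bd - ac - bc) / 4,
        "interior": (ac + ad + bc + bd) / 4 - (ab + cd) / 2,
    }

    def patristic(x, y):
        # Pairs on the same side of the interior branch do not cross it.
        same_side = {x, y} in ({A, B}, {C, D})
        return lengths[x] + lengths[y] + (0 if same_side else lengths["interior"])

    rss = sum((g(x, y) - patristic(x, y)) ** 2
              for x, y in combinations((A, B, C, D), 2))
    return lengths, rss


# Hypothetical additive distances from the tree ((A:1, B:2):5, C:3, D:4).
observed = {frozenset(k): v for k, v in {
    ("A", "B"): 3, ("C", "D"): 7, ("A", "C"): 9,
    ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 11}.items()}

for split in ((("A", "B"), ("C", "D")),
              (("A", "C"), ("B", "D")),
              (("A", "D"), ("B", "C"))):
    lengths, rss = ols_four_taxon(observed, split)
    print(split, "interior =", lengths["interior"], "RSS =", rss)
```

With these toy distances the generating topology fits exactly, whereas the two alternative topologies leave a positive residual and receive a negative interior branch estimate.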

However, LS methods often give negative branch lengths, and mainly for this reason the accuracy of the topology obtained is not particularly high (74, 111, 112, 121). One way to rectify this problem is to conduct the least squares estimation of branch lengths with the restriction of no negative branch lengths (14, 31). Bulmer (personal communication) and Gascuel (39) have shown that in the case of four sequences, this restricted LS method gives the same results as those obtained by the neighbor joining method, which is mentioned later. However, this does not seem to be the case when the number of sequences is greater than four, because neighbor joining also occasionally produces negative branches.

MINIMUM EVOLUTION (ME) METHODS In this method, the branch lengths of a tree are estimated by a certain algorithm from pairwise distance data, and the total sum (S) of branch lengths is computed for each of the possible topologies. The topology that shows the smallest S value will then be chosen as the most likely tree (23). In this method, branch lengths are estimated either by Fitch & Margoliash's algorithm (110) or by the ordinary LS method (69, 102). Rzhetsky & Nei (102) presented a formal mathematical treatment of this method for DNA and protein sequence data and simplified the computational algorithm considerably. They (104) also presented a theoretical foundation of this method by showing that the expected value of S is smallest for the true topology when unbiased estimators of nucleotide or amino acid substitutions are used as distance measures. Of course, this does not mean that a tree with the smallest S value is expected to be the true tree unless a large number of nucleotides or amino acids are used.
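The ME criterion can be illustrated for four sequences, where the sum of the ordinary LS branch length estimates reduces to a simple average of the pairwise distances. The Python sketch below assumes that closed form; the function names and the toy distances are hypothetical and serve only to show how S is compared across the three possible topologies.

```python
def tree_length(d, split):
    """Sum S of ordinary LS branch lengths for the four-sequence tree ((A,B),(C,D))."""
    (A, B), (C, D) = split
    g = lambda x, y: d[frozenset((x, y))]
    between = g(A, C) + g(A, D) + g(B, C) + g(B, D)
    within = g(A, B) + g(C, D)
    return between / 4 + within / 2


def minimum_evolution_tree(d, names):
    """Return (split, S) with the smallest S among the three unrooted topologies."""
    a, b, c, e = names
    candidates = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    scored = [(split, tree_length(d, split)) for split in candidates]
    return min(scored, key=lambda item: item[1])


# Hypothetical additive distances from the tree ((A:1, B:2):5, C:3, D:4).
observed = {frozenset(k): v for k, v in {
    ("A", "B"): 3, ("C", "D"): 7, ("A", "C"): 9,
    ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 11}.items()}

best_split, best_S = minimum_evolution_tree(observed, ("A", "B", "C", "D"))
print(best_split, best_S)
```

For these distances S equals the true total branch length (15) on the generating topology and is larger on the other two, so the generating topology is selected.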

Kidd & Sgaramella-Zonta (69) suggested that the total branch lengths [L(S)] be computed by summing the absolute values of all branch lengths under the conjecture that there are no negative branches for the true topology. However, L(S) does not have a nice statistical property that permits the fast computation of S values and the statistical tests as developed by Rzhetsky & Nei (102, 104). Note also that in the presence of statistical errors, some branch lengths may become negative by chance even for a correct topology (119). Furthermore, if one wants to have an ME tree without negative branches, a better way would be to estimate branch lengths by the least squares method under the constraint of nonnegative branches.

Although the ME method is statistically appealing, it requires a large amount of computational time to examine all different topologies if the number of sequences (m) is greater than 10. For this reason, Rzhetsky & Nei (102, 105) suggested that the neighbor joining (NJ) tree (see below) be first constructed and then a set of topologies close to this NJ tree be examined to find a tree with a smaller S value (temporary ME tree). A new set of topologies close to this temporary ME tree (excluding previously examined topologies) are now examined to find a tree with an even smaller S value. This process will be continued until no tree with a smaller S is found, and the tree with the smallest S is regarded as the ME tree. The theoretical basis of this strategy is that the ME tree is generally identical or close to the NJ tree when m is relatively small (102, 110) and thus the NJ tree can be used as a starting tree when m is large. They (105) also suggested that a special type of bootstrapping could be used for generating topologies for examination. Kumar (77) devised a new algorithm to obtain an ME tree, extending the NJ algorithm to examine many potential ME trees. This algorithm does not examine all topologies, but computer simulation has shown that it almost always examines the true tree even if m is quite large.

FOUR-CLUSTER ANALYSIS In phylogenetic analysis, it is often important to establish the evolutionary relationships of four groups of organisms. For example, the evolutionary relationships of animals, plants, fungi, and protists have been studied for many decades, yet we do not have a definitive answer, partly because each group contains so many different kinds of organisms (5, 45, 116, 139). In most methods of phylogenetic analysis the number of organisms to be included is limited because of computational difficulties. For this reason, only a few representative organisms are used from each group, but this procedure often gives erroneous conclusions (2).

The four-cluster analysis (101) is an application of the theory of the ME method (104) and can handle a large number of species from each group of organisms as long as each group is known to be monophyletic, and it does not require any information regarding the branching order of organisms within groups. Let A, B, C, and D be the four monophyletic groups or clusters, and suppose that A, B, C, and D contain mA, mB, mC, and mD sequences, respectively. In this case, there are three possible unrooted trees of clusters, i.e. T1 = ((AB)(CD)), T2 = ((AC)(BD)), and T3 = ((AD)(BC)), and one of them must be correct. This correct tree is expected to have the smallest sum of branch lengths. Let S1, S2, and S3 be the sums of branch lengths for trees T1, T2, and T3. To compute S1, S2, and S3, we have to know the phylogenetic relationships of all sequences within clusters, but what we need is to compute the differences S1 - S2, S1 - S3, and S2 - S3. These differences can be computed by a simple algorithm without knowing the phylogenetic relationships within clusters, and the statistical significance of each difference can be tested.

This technique was applied to resolve the branching pattern of animals, plants, fungi, and a group of protists, using ribosomal RNA genes. It was concluded that animals and fungi are significantly closer to each other than to the others (78).

NEIGHBOR JOINING (NJ) METHOD This method (112) is a simplified version of the ME method for inferring a bifurcating tree. In this method, the S value is not computed for all or many different topologies, but the examination of different topologies is embedded in the algorithm, so that only one final tree is produced. Computation of S starts with a star phylogeny, in which all interior branches are assumed to be 0. This tree is clearly incorrect, so the S value (S0) is much higher than the S for the true tree. The next step is to compute Sij for a tree in which sequences i and j are paired and are separated from the rest of the sequences that still form a star tree. If i and j are the neighbors connected by only one node (e.g. sequences 1 and 2 in Figure 1A), then Sij is smaller than S0. Therefore, computing Sij's for all pairs of sequences and choosing the smallest Sij, we can identify a pair of neighbors. Once this pair is identified, they are combined as a single unit and treated as a single sequence in the next step. This process is continued until all multifurcating nodes are resolved into bifurcating ones. In practice, any distance measures are subject to stochastic errors, so that the NJ tree obtained may not necessarily be the true tree.
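The stepwise pairing described above can be sketched in Python as follows. This is an illustrative implementation, not the original program: instead of computing Sij directly, it uses the widely used rate-corrected criterion Q = (n - 2)d(i, j) - r(i) - r(j), which is generally understood to select the same pair as the smallest Sij. The function name, the node labels, and the five-sequence distance table are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return edges (parent, child, length) of an unrooted NJ tree.

    names: list of at least three labels; dist[x][y]: distance between x and y.
    """
    d = {x: dict(dist[x]) for x in names}      # working copy of the matrix
    active = list(names)
    edges, next_id = [], 0
    while len(active) > 3:
        n = len(active)
        # r[x]: sum of distances from x to every other active node
        r = {x: sum(d[x][y] for y in active if y != x) for x in active}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                x, y = active[i], active[j]
                q = (n - 2) * d[x][y] - r[x] - r[y]
                if best is None or q < best[0]:
                    best = (q, x, y)
        _, x, y = best
        node = f"node{next_id}"
        next_id += 1
        len_x = d[x][y] / 2 + (r[x] - r[y]) / (2 * (n - 2))
        len_y = d[x][y] - len_x
        edges += [(node, x, len_x), (node, y, len_y)]
        # Distances from the new combined unit to the remaining nodes.
        d[node] = {}
        for z in active:
            if z not in (x, y):
                d[node][z] = d[z][node] = (d[x][z] + d[y][z] - d[x][y]) / 2
        active = [z for z in active if z not in (x, y)] + [node]
    # Three nodes remain: join them at one central node.
    x, y, z = active
    centre = f"node{next_id}"
    edges.append((centre, x, (d[x][y] + d[x][z] - d[y][z]) / 2))
    edges.append((centre, y, (d[x][y] + d[y][z] - d[x][z]) / 2))
    edges.append((centre, z, (d[x][z] + d[y][z] - d[x][y]) / 2))
    return edges


# Hypothetical additive distances for five sequences.
names = ["a", "b", "c", "d", "e"]
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

edges = neighbor_joining(names, dist)
for parent, child, length in edges:
    print(parent, child, length)
```

Because the toy distances are additive, the sketch recovers the generating tree, pairing a with b first; with real data the distances carry sampling error and the result need not be the true tree, as noted above.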

According to computer simulations (77, 102), this method nearly always produces the same topology as that of the ME tree when the extent of sequence differences is sufficiently large and the number of nucleotides examined is large (>500). When the latter condition is not satisfied, however, the NJ tree can be considerably different from the ME tree (104), yet the difference in S between the NJ and ME trees is usually statistically nonsignificant. Furthermore, using computer simulation, Kumar (77) showed that when the NJ and ME trees are different, the latter tree is not necessarily closer to the true tree (topology). When a large number of closely related sequences are used as in the case of Vigilant et al's data (138), both NJ and ME trees will give only a rough evolutionary relationship of the sequences, and there is no point in spending excessive efforts to find the ME tree (77). At any rate, the high efficiency of the NJ method in obtaining the ME tree or the true tree seems to stem from the fact that in each step of sequence clustering the principle of minimum evolution is applied, and the repeated application of this principle would reduce the effects of sampling errors in topology construction.

Figure 1 (A)-(C), three different unrooted trees for five sequences; a-g, expected branch lengths; (D)-(E), two unrooted trees for four sequences with different branch lengths. These two model trees were used for computer simulation, the results of which are presented in Table 1.

DISTANCE MEASURES Theoretically, if the total number of substitutions between any pair of sequences is known, all the above distance methods produce the correct phylogenetic tree (additive tree). In practice, however, this number is almost always unknown, and thus many different methods for estimating this number have been proposed, some being quite sophisticated (6, 79, 107, 124, 146, 154). Examples are Kimura's (70) and Hasegawa et al's (51) methods. These methods are certainly useful for correcting for parallel and backward mutations and are expected to give better estimates of the number of substitutions per site (d) than the simple proportion of nucleotide differences (p distance). However, these estimates usually have a larger variance than the uncorrected p distance. Partly for this reason, the p distance or a simple distance measure such as the Jukes-Cantor distance (67) tends to produce the correct topology more often than sophisticated distance measures when the rate of nucleotide substitution is more or less constant in all evolutionary lineages and the number of nucleotides examined is not very large (108, 111, 112, 121, 127). In this case, therefore, it is preferable to use p distance for topology construction rather than a more sophisticated distance. However, the advantage of p distance over more sophisticated distance measures diminishes when the number of nucleotides examined is large and the evolutionary rate varies extensively with evolutionary lineage, and in this case it is better to use unbiased distance measures (79, 112).
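The simpler distance measures mentioned here can be written down compactly. The Python sketch below uses the standard textbook forms of the p distance, the Jukes-Cantor distance, and Kimura's two-parameter distance; the formulas are not spelled out in this review, and the function names and the two toy sequences are assumptions made for illustration. Each corrected distance returns None when the argument of a logarithm is not positive.

```python
import math

PURINES = {"A", "G"}


def site_differences(s1, s2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    n = len(s1)
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    return transitions / n, transversions / n


def p_distance(s1, s2):
    P, Q = site_differences(s1, s2)
    return P + Q


def jukes_cantor(p):
    """d = -(3/4) ln(1 - 4p/3); None if the logarithm is undefined."""
    arg = 1 - 4 * p / 3
    return None if arg <= 0 else -0.75 * math.log(arg)


def kimura_2p(P, Q):
    """d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q); None if undefined."""
    a1, a2 = 1 - 2 * P - Q, 1 - 2 * Q
    if a1 <= 0 or a2 <= 0:
        return None
    return -0.5 * math.log(a1) - 0.25 * math.log(a2)


# Hypothetical sequences with one transition and one transversion in ten sites.
s1, s2 = "AAAAAAAAAA", "GCAAAAAAAA"
P, Q = site_differences(s1, s2)
print(p_distance(s1, s2), jukes_cantor(P + Q), kimura_2p(P, Q))
```

Both corrected distances exceed the p distance, reflecting the allowance for parallel and backward substitutions.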


The efficiency of distance measures in obtaining the correct tree depends on at least two factors: the linear relationship with the number of substitutions and the standard error or coefficient of variation of the estimate of the distance measure. For Kimura's (70) model of nucleotide substitution, several authors have attempted to produce better distance measures than the original estimator (43, 114, 127), but the utility of these distance measures with actual data has not been tested.

Many distance measures for estimating the number of nucleotide substitutions per site (6) become inapplicable when the distance is very large, because they usually involve logarithmic terms in the mathematical formula and the arguments of the logarithms often become negative. This problem can be avoided by expanding the logarithmic terms into an infinite series, but the variance of the distance obtained in this way seems to be quite large when the sequence divergence is high (106, 126). However, phylogenetic trees are generally constructed with pairwise distances whose values are rather small (say d < 0.5). In this case, the use of p, Jukes-Cantor, and Kimura distances is usually sufficient for topology construction. However, if d is large or if there is evidence that the substitution rate varies extensively among sites and with evolutionary lineage, the gamma distance (65, 79) is expected to produce better trees. For some special sets of data, even more complicated distance measures may give better results. For example, DNA sequences of the control (D-loop) region of mitochondrial DNA evolve in a complicated way (73), and a special model (131) has been developed for analyzing data for this region.
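To give a feel for how rate variation among sites changes a distance estimate, the sketch below implements the Jukes-Cantor form of the gamma distance, d = (3/4)a[(1 - 4p/3)^(-1/a) - 1], where a is the gamma shape parameter. This particular form is chosen here for brevity and is an assumption; the works cited above may use other substitution models. The function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance without rate variation among sites."""
    arg = 1 - 4 * p / 3
    return None if arg <= 0 else -0.75 * math.log(arg)


def jc_gamma_distance(p, a):
    """Jukes-Cantor distance with gamma-distributed rates (shape parameter a)."""
    base = 1 - 4 * p / 3
    if base <= 0 or a <= 0:
        return None
    return 0.75 * a * (base ** (-1 / a) - 1)


p = 0.3
for a in (0.5, 1.0, 2.0, 1e6):
    print(a, jc_gamma_distance(p, a))
print("no rate variation:", jukes_cantor(p))
```

The smaller the value of a (that is, the stronger the rate heterogeneity), the larger the estimated distance for the same p; for very large a the estimate approaches the ordinary Jukes-Cantor distance.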

Maximum Likelihood (ML) Methods DNA LIKELIHOOD METHOD The idea of using an ML method for phylogenetic inference was first presented by Cavalli-Sforza & Edwards (14) for gene frequency data, but they encountered a number of problems in implementing the method. Later, considering nucleotide sequence data, Felsenstein (28) developed an algorithm for constructing a phylogenetic tree by the ML method. The first model of nucleotide substitution used was rather simple and did not take into account the transition/transversion bias, which is often observed in actual data. This deficiency was later removed by using a more realistic model involving five parameters (31). A number of authors developed more general models (33, 51, 131, 146), and some of them have been implemented in computer programs (1, 32, 92, 149).
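The likelihood of a given topology with given branch lengths can be computed site by site with a simple recursion over the tree. The Python sketch below uses the Jukes-Cantor model and a pruning-style recursion of the kind usually associated with Felsenstein's algorithm; it is an illustration only, with branch lengths fixed rather than optimized, and the tree encoding, function names, and toy alignment are assumptions.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of base y after branch length t, starting from x."""
    e = math.exp(-4 * t / 3)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods of the data below node for each state at node.

    A leaf is a sequence name; an interior node is a tuple of (child, length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result


def site_likelihood(tree, site):
    return sum(0.25 * value for value in partial(tree, site).values())


def log_likelihood(tree, alignment):
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(math.log(site_likelihood(tree, {k: alignment[k][i] for k in names}))
               for i in range(length))


# Two hypothetical topologies with all branch lengths set to 0.1.
tree_ab = (((("A", 0.1), ("B", 0.1)), 0.1), ("C", 0.1), ("D", 0.1))
tree_ac = (((("A", 0.1), ("C", 0.1)), 0.1), ("B", 0.1), ("D", 0.1))
alignment = {"A": "AAGTC", "B": "AAGTT", "C": "ACCTC", "D": "ACCGC"}
print(log_likelihood(tree_ab, alignment), log_likelihood(tree_ac, alignment))
```

In an actual ML analysis the branch lengths of each topology would be optimized before the likelihoods are compared; here they are held fixed so that only the recursion is shown.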

However, for obtaining the true topology, sophisticated models do not necessarily give better results than simple models such as the Jukes-Cantor model, though the likelihood value of an ML tree is almost always higher for the former models than for the latter (40, 100). In fact, computer simulations have shown that under certain circumstances a simple model gives a higher probability of obtaining the true tree than a complicated model, even if sequence evolution has occurred following the latter model (150). Of course, this result was obtained in a simulation with four sequences, and it is not clear what will happen if more than four sequences are used. However, this problem is the same as that for distance methods discussed above, and it emphasizes the difficulty of topology construction mentioned earlier. It should also be noted that the pattern of nucleotide substitution varies from site to site (132) and with evolutionary time (3, 49, 76, 132), particularly when long-term evolution is considered. At this moment, it is not clear how these factors affect the topology estimation by ML.

A serious problem with ML is the computational time. Even if the number of sequences is about ten, it requires an enormous amount of computational time. Olsen et al (92) developed a faster algorithm, but it still requires a large amount of time if many topologies are to be examined. Saitou (109) and Adachi & Hasegawa (1) took a different approach to tackle this problem using a new algorithm. Their algorithm (star-decomposition algorithm) is essentially the same as that of the NJ method, except that the ML principle is used in finding neighbors instead of the ME principle. In Adachi & Hasegawa's (1) computer software MOLPHY, the star-decomposition (SD) tree is regarded as a first potential ML tree from which trees with higher ML values are searched for by using other algorithms such as local branch rearrangement. In practice, however, the SD algorithm seems to be quite efficient in obtaining the true tree, and the relationship between the SD and exhaustive search algorithms for ML trees may be similar to that between the NJ and ME algorithms (100; T Sitnikova & M Nei, unpublished results).

PROTEIN-LIKELIHOOD METHOD When the DNA sequences are relatively closely related to one another, DNA likelihood methods seem to work well. However, if they are distantly related and encode protein sequences, many complications arise because the rate of synonymous substitution is generally much higher than that of nonsynonymous substitution and the transition/transversion bias exists. The relative frequencies of the four nucleotides at third codon positions also vary considerably with sequence (63, 76), suggesting that the stationary model of nucleotide substitution is not appropriate. By contrast, the evolutionary change of protein sequences does not suffer very much from these problems and seems to be much simpler than that of DNA sequences when long-term evolution is considered. Noting this property, Kishino et al (72) proposed a protein-likelihood method by using Dayhoff et al's (18) empirical transition matrix for 20 different amino acids. Adachi & Hasegawa (1, 3) extended this method by using various transition matrices including Jones et al's (66) matrix for nuclear proteins and their own for mitochondrial proteins. They applied these methods to various sequence data and obtained reasonably good trees for several groups of vertebrate organisms (11, 12). Analyzing mitochondrial gene data for 11 vertebrate species, Russo et al (100) showed that protein sequences are more reliable than DNA sequences for obtaining the correct phylogeny.

Maximum Parsimony (MP) Methods In MP methods, a given set of nucleotide (or amino acid) sequences are considered, and the nucleotides (or amino acids) of ancestral sequences for a hypothetical topology are inferred under the assumption that mutational changes occur in all directions among the four different nucleotides (or 20 amino acids). The smallest number of nucleotide substitutions that explain the entire evolutionary process for the given topology is then computed. This computation is done for all other topologies, and the topology that requires the smallest number of substitutions is chosen to be the best tree (22, 35, 47).
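The counting step can be illustrated with the well-known set-intersection procedure generally attributed to Fitch, which gives the minimum number of changes at a site when all substitutions are weighted equally. The Python sketch below is an illustration of that procedure and not necessarily the exact method of the works cited; the tree encoding, function names, and toy alignment are assumptions.

```python
def fitch_site(node, states):
    """Return (possible states at node, minimum changes below node) for one site.

    A leaf is a sequence name; an interior node is a (left, right) pair.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: one substitution is needed on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Minimum number of substitutions over all sites for the given topology."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    return sum(fitch_site(tree, {k: alignment[k][i] for k in names})[1]
               for i in range(n_sites))


# Hypothetical alignment: two sites favor (A,B) vs (C,D); two sites are singletons.
alignment = {"A": "AAGTC", "B": "AAGTT", "C": "ACCTC", "D": "ACCGC"}
trees = [(("A", "B"), ("C", "D")),
         (("A", "C"), ("B", "D")),
         (("A", "D"), ("B", "C"))]
tree_lengths = [parsimony_length(tree, alignment) for tree in trees]
print(tree_lengths)
```

For this toy alignment the first topology requires 4 substitutions and the other two require 6, so the first is the MP tree. The arbitrary placement of the root in the encoding does not affect the count.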

If there are no multiple substitutions at each site, MP is expected to generate the correct topology as long as enough parsimony-informative sites (79) are examined. In practice, nucleotide sequences are often subject to backward and parallel substitutions, and this introduces uncertainties in phylogenetic inference. When the true tree has a special type of topology and branch lengths, MP may generate an incorrect topology even if an infinite number of nucleotides are examined (27). This can happen even if the rate of nucleotide substitution is constant for all evolutionary lineages (55, 129, 157). Furthermore, in parsimony analysis it is difficult to treat the phylogenetic inference in a statistical framework because there is no natural way to compute the means and variances of minimum numbers of substitutions obtained by the parsimony procedure. However, under certain circumstances, MP is quite efficient in obtaining the correct topology (89). Note also that MP is the only method that can easily take care of insertions and deletions of nucleotides, which sometimes give important phylogenetic information.

WEIGHTED PARSIMONY One factor that makes MP inefficient is the transition/transversion bias and the heterogeneity of substitution rate among different nucleotide sites. In the control region of mitochondrial DNA, the transition/transversion ratio (R) is as high as 15 in humans (131), and the rate heterogeneity as measured by the inverse (1/a) of the gamma parameter (65) seems to be as high as 6.7 (73, 131, 140). In this case, nucleotide sites with transitional changes or high substitution rates are not very informative for phylogenetic construction when relatively distantly related sequences are used. The reason is that at these sites multiple substitutions are likely to have occurred, and this will introduce noise in phylogenetic inference.

One way to reduce this noise is to give higher weights to transversional changes or slowly evolving sites and lower weights to transitional changes or fast evolving sites (26, 124). In this case, the tree length no longer gives an estimate of the minimum number of nucleotide substitutions, but this method substantially improves the probability of obtaining the correct topology (60, 91). One problem with this approach is that we do not know the actual R value for the data set under investigation. In this case, it is possible to use a so-called dynamically weighted parsimony method (113, 141). In this method, a probable R value is first used to generate an MP tree, and then a new R value is estimated from the tree obtained. This new R value is then used to generate a new MP tree. (In practice, all different nucleotide pairs are weighted differently.) This process is repeated until a stable MP tree (or trees) is obtained. This is a time-consuming method and does not guarantee the convergence of an MP tree. Nevertheless, computer simulations have shown that this method substantially improves the probability of obtaining the correct tree when R is high (134). A similar weighted method can also be used to take into account the variation of substitution rate among different sites (38, 141).
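The effect of a fixed weighting scheme on tree length can be shown with a small dynamic-programming sketch. The code below uses a Sankoff-style recursion with cost 1 for a transition and a user-chosen cost for a transversion; this recursion is substituted here for illustration and is not claimed to be the procedure of the works cited, and it does not implement the iterative re-estimation of R. The function names and the toy alignment are assumptions.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def cost(x, y, tv_weight):
    """Cost of changing x to y: 0 if equal, 1 for a transition, tv_weight otherwise."""
    if x == y:
        return 0
    is_transition = (x in PURINES) == (y in PURINES)
    return 1 if is_transition else tv_weight


def min_costs(node, states, tv_weight):
    """Minimum cost of the subtree below node for each possible state at node."""
    if isinstance(node, str):
        return {x: 0 if x == states[node] else float("inf") for x in BASES}
    total = {x: 0 for x in BASES}
    for child in node:
        below = min_costs(child, states, tv_weight)
        for x in BASES:
            total[x] += min(cost(x, y, tv_weight) + below[y] for y in BASES)
    return total


def weighted_length(tree, alignment, tv_weight):
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    return sum(min(min_costs(tree, {k: alignment[k][i] for k in names},
                             tv_weight).values())
               for i in range(n_sites))


# Hypothetical data: two transition sites favor (A,B); one transversion site favors (A,C).
alignment = {"A": "AAA", "B": "AAC", "C": "GGA", "D": "GGC"}
tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))
for w in (1, 5):
    print(w, weighted_length(tree_ab, alignment, w),
          weighted_length(tree_ac, alignment, w))
```

With equal weights the first topology is shorter (4 versus 5), whereas with a transversion weight of 5 the second becomes shorter (9 versus 12), showing how strongly the choice of weights can influence the tree selected.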

STATISTICAL TESTS OF PHYLOGENETIC TREES During the past two decades, many authors have studied the statistical methods for testing the reliability of the tree obtained. Some of these studies have been reviewed by Felsenstein (31), Li & Gouy (83), and Li & Zharkikh (84). Here I present a brief summary of recent studies on the subject. Statistical tests of phylogenetic trees can be divided into two categories: a test of reliability of a tree obtained and a test of topological differences between two or more different trees obtainable from the same data set.

Reliability of an Estimated Tree INTERIOR BRANCH TESTS One way of knowing the reliability of an estimated tree is to examine the reliability of each interior branch. This is particularly appropriate for trees constructed by distance methods. Consider the tree for five sequences given in Figure 1(A). In the case of five sequences, there are 15 possible unrooted bifurcating trees, and each tree is composed of five exterior branches and two interior branches. Suppose one obtains tree (A) in Figure 1 by some tree-building method. The reliability of this tree (topology) is assured if the two interior branch lengths f and g are different from 0 and positive. Therefore, by testing the null hypothesis of f = 0 and g = 0, we can establish the validity of the tree. In general, the null hypothesis of an interior branch length b = 0 can be tested by computing the standard error [s(b̂)] of an estimate (b̂) of b. Since b̂ is known approximately to follow the normal distribution even when the number of nucleotides examined is as small as 100 (102), the null hypothesis of b = 0 can be tested by examining the statistical significance of the normal deviate Z = b̂/s(b̂). This test is called the interior-branch test (119).
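Once an estimate of an interior branch length and its standard error are available, the test itself is a one-line calculation. The Python sketch below computes the normal deviate and the corresponding one-sided normal probability; treating the confidence level as one-sided is an assumption made here, as are the function names and the numerical values.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def interior_branch_test(b_hat, se):
    """Return (Z, one-sided confidence level) for the null hypothesis b = 0."""
    z = b_hat / se
    return z, normal_cdf(z)


# Hypothetical estimate: branch length 0.02 with standard error 0.01.
z, confidence = interior_branch_test(0.02, 0.01)
print(z, confidence)
```

In this hypothetical case Z = 2 and the one-sided confidence level is about 0.977, so the branch would be judged significantly positive at the 95% level but not at the 99% level.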

This type of test was first used by Nei et al (90) for a UPGMA tree and then by Li (81) for an unrooted tree when the number of sequences is four or five. Later Rzhetsky & Nei (102, 104) developed a fast algorithm for computing s(b̂) using the ordinary least-squares approach and made it possible to use this test for a large number of sequences. This method requires a specific model of nucleotide substitution, and the test seems to be robust to the choice of substitution model unless the extent of sequence divergence is high (134).

In this method, the confidence probability (PC) that b̂ > 0 is computed by using the Z test, and if the probability is higher than 95% or 99%, then b̂ is considered to be significantly positive. One theoretical problem concerning Rzhetsky & Nei's (102) method was that PC was computed without considering the estimation error of the topology obtained. If we take into account this error in the computation of confidence probability, the actual confidence probability (P'C) can be smaller than the original PC value (119). In practice, however, the difference between PC and P'C is small if we consider the region of PC ≥ 0.99. Sitnikova (117) produced an approximate formula P'C = 3PC - 2, where PC > 2/3. This is a useful formula for computing P'C, because the computation of PC is much simpler than that of P'C.
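The approximate correction quoted above is simple enough to tabulate. The short Python sketch below applies the formula for a few values of the uncorrected confidence probability; the function name is hypothetical, and values at or below 2/3 are rejected because the formula is stated only for larger values.

```python
def corrected_confidence(pc):
    """Approximate corrected confidence probability, 3 * PC - 2, for PC > 2/3."""
    if pc <= 2 / 3:
        raise ValueError("the approximation is stated only for PC > 2/3")
    return 3 * pc - 2


for pc in (0.90, 0.95, 0.99):
    print(pc, round(corrected_confidence(pc), 4))
```

An uncorrected value of 0.99 corresponds to a corrected value of about 0.97, whereas 0.95 corresponds to about 0.85, which shows why the two probabilities are close only when the uncorrected value is very high.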

Another interior branch test that is applicable to distance trees is Rodrigo & Dopazo's (98) bootstrap test. This test is different from Felsenstein's bootstrap test, which is mentioned below, and is intended to examine the reliability of each interior branch for a given topology. As in the case of Felsenstein's test, the same number of nucleotides as that of the original sequence is randomly sampled with replacement for each set of sequences, and the lengths of all branches are estimated by a given tree-building method for a given topology, which was obtained by the original sequence data. This process is repeated many times for the same topology. Therefore, the length of an interior branch varies from replication to replication. We count the number of replications in which a given interior branch takes a positive value (b̂ > 0), and the proportion of these among the entire set of replications is used as an estimate of the confidence probability (PC) of the interior branch. The advantage of this method is that it requires no particular substitution models and thus is applicable for a wide variety of situations (117). The computational time required is also much shorter than the analytical method when the number of sequences is large. However, when the number of nucleotides examined is small, this may give biased estimates of PC's.
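The resampling scheme just described can be sketched for the simplest case of four sequences, in which the ordinary LS estimate of the interior branch has a closed form. In the Python code below, the use of the p distance, the closed-form estimator, the function names, and the toy alignment (12 sites supporting the pairing of A with B, 4 conflicting sites, 24 constant sites) are all assumptions made for illustration.

```python
import random


def p_dist(s1, s2, cols):
    """Proportion of differences between two sequences over the sampled columns."""
    return sum(s1[i] != s2[i] for i in cols) / len(cols)


def interior_length(seqs, split, cols):
    """Ordinary LS interior branch estimate for the fixed topology ((A,B),(C,D))."""
    (A, B), (C, D) = split
    d = lambda x, y: p_dist(seqs[x], seqs[y], cols)
    return (d(A, C) + d(A, D) + d(B, C) + d(B, D)) / 4 - (d(A, B) + d(C, D)) / 2


def interior_branch_bootstrap(seqs, split, replicates=500, seed=0):
    """Proportion of bootstrap replicates in which the interior branch is positive."""
    rng = random.Random(seed)
    n = len(next(iter(seqs.values())))
    positive = 0
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]   # sites drawn with replacement
        if interior_length(seqs, split, cols) > 0:
            positive += 1
    return positive / replicates


# Hypothetical alignment of 40 sites.
seqs = {"A": "A" * 12 + "A" * 4 + "A" * 24,
        "B": "A" * 12 + "G" * 4 + "A" * 24,
        "C": "G" * 12 + "A" * 4 + "A" * 24,
        "D": "G" * 12 + "G" * 4 + "A" * 24}

support = interior_branch_bootstrap(seqs, (("A", "B"), ("C", "D")))
wrong = interior_branch_bootstrap(seqs, (("A", "C"), ("B", "D")))
print(support, wrong)
```

The topology is held fixed throughout; only the branch length is re-estimated in each replicate, which is what distinguishes this test from the bootstrap test of tree topology.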

A number of authors (31, 124) have suggested that the null hypothesis of b = 0 could be tested by the likelihood ratio test, because the trees with b = 0 and with b ≥ 0 are nested. In the case of phylogenetic trees, however, this test does not seem to be justified (42, 150, 151), and computer simulations (40, 134) have shown that the test may give strong statistical support for a wrong topology. The DNAML program in PHYLIP computes the confidence interval of a branch length, but this interval also does not seem to be reliable (134).

FELSENSTEIN’S BOOTSTRAP TEST One of the most commonly used tests of the reliability of an inferred tree is Felsenstein’s (30) bootstrap test. In this test, the reliability of an inferred tree is examined by using Efron’s (24) bootstrap resampling technique. A set of nucleotide sites is randomly sampled with replacement from the original set, and this random set is used for constructing a new phylogenetic tree. This process is repeated many times, and the proportion of replications in which a given sequence cluster (sequence partitioning; e.g. sequences 1 and 2 vs others in Figure 1A) appears is computed. If this proportion (PB) is high (say, PB > 0.95) for a sequence cluster, this cluster is considered to be statistically significant. The null hypothesis of this test as implemented in MEGA (79) is the same as that of the interior branch test. In recent years, a number of authors (57, 119, 155, 156, 158) have shown that this test is generally very conservative except for high PB values close to 1. Zharkikh & Li (158) then invented a method to rectify the conservativeness of PB, but it is not yet clear whether this correction method is applicable to the cases of many sequences.
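A minimal sketch of this resampling scheme for four sequences follows. As a stand-in for a real tree-building method, it assumes p distances and chooses, in each replicate, the partition with the smallest sum of within-pair distances; the function names and toy data are hypothetical.

```python
import random

# The three possible partitions of four sequences (indices 0..3).
SPLITS = {
    "01|23": ((0, 1), (2, 3)),
    "02|13": ((0, 2), (1, 3)),
    "03|12": ((0, 3), (1, 2)),
}


def p_distance(s1, s2, cols):
    """Proportion of differing sites over the resampled columns."""
    return sum(s1[c] != s2[c] for c in cols) / len(cols)


def infer_split(seqs, cols):
    """Pick the partition whose two within-pair distances have the smallest sum.

    Ties are resolved in favor of the first partition listed in SPLITS.
    """
    def score(pairs):
        return sum(p_distance(seqs[i], seqs[j], cols) for i, j in pairs)
    return min(SPLITS, key=lambda name: score(SPLITS[name]))


def bootstrap_support(seqs, split="01|23", replicates=1000, seed=0):
    """PB: proportion of replicates in which the given partition is inferred."""
    rng = random.Random(seed)
    n = len(seqs[0])
    hits = 0
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]  # sites sampled with replacement
        hits += infer_split(seqs, cols) == split
    return hits / replicates


toy = ["AAAAACCCCC", "AAAAACCCCT", "GGAAACCTTT", "GGAAACCTTA"]
print(bootstrap_support(toy, replicates=500))
```

Unlike the interior branch test, the topology is re-inferred from each resampled data set, so PB reflects how often the cluster itself is recovered.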

Although Felsenstein’s bootstrap test can be very conservative, I believe that it is a useful method for evaluating the statistical reliability of an inferred tree. The actual pattern of nucleotide substitution is very complicated (76, 146) and often changes with site and evolutionary time (132). Therefore, it is better to use a conservative test for examining trees for distantly related sequences. However, we have to be cautious even with this conservative test, because any tree-building method may generate an incorrect tree almost consistently for a given data set, as is mentioned later. Furthermore, when a tree is produced for closely related sequences, one may use mathematically more rigorous methods such as the interior-branch test (119) or the Zharkikh & Li test (158). Note that Felsenstein’s (30) original bootstrap method is slightly different from the method described above and is for testing a consensus tree generated by repeated resampling. The null hypothesis of this test is not clearly specified, but it should be similar to the one mentioned above.

Tests of Topological Differences

MINIMUM EVOLUTION TREES The second class of tests is a comparison of two topologies in terms of the quantity used for an optimization process of phylogenetic inference. Previously we mentioned that the minimum evolution tree is a tree with the smallest sum SM of branch lengths. In practice, however, there may be several other trees whose S is greater than SM but is not significantly different from the latter. These trees are potentially correct trees, and thus one may want to keep them until other data are obtained to identify the true tree. Rzhetsky & Nei (102) developed a statistical method for testing the difference (D = SB - SA) in S between two topologies. This test is equivalent to the test of the lengths of the interior branches at which the two topologies are different.

Suppose that topology (A) in Figure 1 is the correct tree and (B) and (C) are incorrect ones. When unbiased distance estimators are used, the expectation [E(D)] of D = SB - SA for topologies (A) and (B) is given by f/2, where f is the true length of the left interior branch of topology (A) (102). Therefore, if D is significantly greater than 0, we can establish the validity of topology (A). However, if D is significantly smaller than 0, topology (B) will be the correct one. Similarly, comparison of topologies (A) and (C) gives the expectation of D (= SC - SA) equal to 3(f + g)/4 (102). Therefore, the null hypothesis for the test of D has a clear-cut biological meaning. Using this test, one can identify topologies that are not significantly different from the ME tree.

Rzhetsky & Nei's (102) method for testing D's is dependent on the mathematical model on which a particular distance measure is based and requires an intensive computation when m is large. Nei (89) suggested that the hypothesis E(D) = 0 can be tested by a bootstrap method. In this test S is computed for a given pair of topologies (i and j) for each sequence resampling, and Dij = Si - Sj is computed. If this is repeated many times, we can compute the mean D and its standard error. Therefore, we can test the null hypothesis of E(D) = 0. When there are several potentially correct trees, Dij can be computed for all pairs of i and j by using the same set of resampled sequences.

The D test mentioned above is clearly related to the interior branch test mentioned earlier, but the exact relationship remains unclear. One might speculate that if every interior branch of the ME tree is significant, the D test will also establish that SM is significantly smaller than S for any other tree. If this is the case, the interior branch test or the Felsenstein bootstrap test would be simpler than the D test in finding a reliable tree.

Some authors (124) criticized the NJ method as producing only one final tree, rather than several potentially correct trees. This criticism is valid. However, if all the interior branches of a NJ tree are statistically supported, there will be no need to consider other alternative trees. By contrast, if some of the interior branches are not statistically supported, one may consider the alternative trees that can be generated by changing the branching pattern for each nonsignificant interior branch. This approach would be simpler than the enumeration of alternative trees by using the D test mentioned above. Another approach to this problem is to construct a condensed tree (79) in which all weakly supported interior branch lengths are reduced to 0. This tree is conceptually similar to a consensus tree (123), which is constructed for MP trees.
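The bootstrap version of the D test described above reduces to simple arithmetic once the sums of branch lengths have been computed for each resampled data set. The sketch below assumes that the standard error of D is taken as the standard deviation of the replicate values and that a two-sided normal approximation is used for the test; both are assumptions of this illustration, and the S values shown are hypothetical.

```python
import math
import statistics


def bootstrap_d_test(s_i, s_j):
    """Test E(D) = 0 from S values of topologies i and j on the same replicates.

    Returns (mean D, standard error, Z, two-sided normal p-value).
    """
    d = [a - b for a, b in zip(s_i, s_j)]
    mean_d = statistics.mean(d)
    se_d = statistics.stdev(d)  # assumed: bootstrap SE = SD over replicates
    if se_d == 0.0:
        raise ValueError("D does not vary across replicates")
    z = mean_d / se_d
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return mean_d, se_d, z, p_two_sided


# Hypothetical sums of branch lengths from five bootstrap replicates.
s_b = [1.32, 1.28, 1.35, 1.30, 1.33]
s_a = [1.25, 1.24, 1.27, 1.26, 1.25]
print(bootstrap_d_test(s_b, s_a))
```

Using the same resampled sequences for both topologies, as the text recommends, is what makes the paired difference Dij meaningful.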

ML TREES One might think that a simple test for the difference in topology between an ML and a suboptimal tree would be to use the standard log likelihood ratio (LR) test. Unfortunately, this cannot be done because all bifurcating trees have the same number of degrees of freedom (31). Kishino & Hasegawa (71) suggested that the difference in log likelihood value between an ML tree and a suboptimal tree be tested by using the variance of the difference in single-site log likelihood between the two trees. However, these authors have not specified the null hypothesis of this test in relation to the topological differences as given in Figure 1. Without knowing what is being tested, it is difficult to interpret the results of the test. Clearly, a detailed study of the theoretical basis of the test is necessary.
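The arithmetic behind the Kishino & Hasegawa comparison can be sketched as follows. The sketch assumes the commonly used estimator in which the variance of the total log-likelihood difference is n/(n - 1) times the sum of squared deviations of the single-site differences; that exact form, the function name, and the inputs are assumptions of this illustration.

```python
import math


def kishino_hasegawa(site_lnl_a, site_lnl_b):
    """Return (total difference, its standard error, Z) for two trees.

    site_lnl_a[j] and site_lnl_b[j] are the log likelihoods of site j
    under trees A and B, respectively.
    """
    n = len(site_lnl_a)
    delta = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    total = sum(delta)
    mean = total / n
    var_total = n / (n - 1) * sum((x - mean) ** 2 for x in delta)
    se = math.sqrt(var_total)
    return total, se, total / se


# Hypothetical single-site log likelihoods for two topologies.
lnl_a = [-1.0, -2.0, -1.5, -1.0]
lnl_b = [-1.5, -2.0, -2.5, -1.0]
print(kishino_hasegawa(lnl_a, lnl_b))
```

The sketch computes only the statistic; as noted above, what null hypothesis about topologies it actually tests remains unspecified.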

As mentioned earlier, the number of possible topologies is enormously large even when the number of sequences is about 10. If an unlimited amount of computer time is available, it is possible to conduct a large number of bootstrap resamplings and evaluate the probability of occurrence of each topology. If a particular topology occurs with a high probability (say P > 0.95), one may conclude that this topology is most likely to be the correct one. In practice, it is usually unnecessary to consider all topologies because most topologies have an extremely low probability of occurrence. Furthermore, it is often possible to identify a small number (say 10) of potentially correct topologies on the basis of other biological information. In these cases, one can evaluate the relative bootstrap probability values of the potentially correct topologies. However, even if we consider a small number of potentially correct topologies, Felsenstein's bootstrap procedure requires a prohibitive amount of computer time for ML trees.

To cope with this problem, Kishino et al (72) developed two approximate methods of computing the bootstrap probability of the i-th topology (Pi). In the first method, the MND method, the log likelihoods ℓ(1), ℓ(2), ..., and ℓ(k) for topologies 1, 2, ..., and k are assumed to follow a multivariate normal distribution, of which the means, variances, and covariances are estimated by Kishino & Hasegawa's method (71). One can then choose a random set of ℓ(1), ℓ(2), ..., ℓ(k) from this distribution and then determine the topology that has the highest likelihood. This topology is the ML tree in this sample. This process is repeated many times, and the probability [P(i)] that the i-th topology is the ML tree is determined. If a particular topology has a P(i) value of 95% or higher, then this topology is assumed to be the correct topology. In the second method, the RELL method, the log likelihood at each site [ℓ(j)] is computed for all k topologies, and for each set of bootstrap-resampled sites, the sum [Σ ℓ(j)] of log likelihoods for individual sites is computed for each topology. If this process is repeated many times, one can determine the P(i)'s.
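The RELL procedure lends itself to a short sketch, because the single-site log likelihoods are computed once and only resampled afterwards. The function name, the tie-breaking rule, and the numbers below are assumptions made for illustration.

```python
import random


def rell_probabilities(site_lnl, replicates=10000, seed=0):
    """Estimate P(i) for each topology by resampling site log likelihoods.

    site_lnl[t][j] is the log likelihood of site j under topology t.
    Ties are counted for the topology with the lowest index.
    """
    rng = random.Random(seed)
    k, n = len(site_lnl), len(site_lnl[0])
    wins = [0] * k
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]  # sites sampled with replacement
        totals = [sum(row[c] for c in cols) for row in site_lnl]
        wins[totals.index(max(totals))] += 1
    return [w / replicates for w in wins]


# Hypothetical single-site log likelihoods for three topologies and six sites.
lnl = [
    [-1.2, -0.8, -2.0, -1.1, -0.9, -1.5],
    [-1.3, -0.9, -1.6, -1.2, -1.0, -1.4],
    [-1.6, -1.1, -2.2, -1.4, -1.2, -1.9],
]
print(rell_probabilities(lnl, replicates=2000))
```

No likelihood is re-maximized inside the loop, which is the source of the large saving in computer time relative to a full bootstrap of ML trees.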

The above test procedures, particularly the latter one, seem to be more reasonable than Kishino & Hasegawa's earlier test (71). However, this method has one problem, which is how to choose a set of potentially correct trees, particularly when the number of sequences is large. Kishino et al (72) proposed an ad hoc procedure to solve this problem, but its utility is still untested.

MP TREES As mentioned earlier, it is difficult to develop any parametric test for MP trees because of the nonrandom nature of “minimum numbers of substitutions.” Templeton (135) suggested a nonparametric test for comparing two topologies that is similar to Kishino & Hasegawa’s (71) test for ML trees. However, the null hypothesis of this test is also unclear in relation to the topologies to be compared. Probably the best way of testing the reliability of inferred MP trees would be Felsenstein’s bootstrap test, though one has to be cautious about the possibility of inconsistency of MP methods (see below).

MERITS AND DEMERITS OF DIFFERENT TREE-BUILDING METHODS

Criteria of Comparison

Because there are many different tree-building methods, one is naturally interested in the merits and demerits of different methods. There are several different criteria for comparing different tree-building methods. Important ones are (a) computational speed, (b) consistency as an estimator of a topology, (c) statistical tests of phylogenetic trees, (d) probability of obtaining the correct topology, and (e) reliability of branch length estimates.

The computational speed of each tree-building method can be measured relatively easily, though it depends on the algorithm used. According to this criterion, the NJ method is superior to most other tree-building methods that are currently in use. This method can handle a large number of sequences (m > 100) even with a personal computer, and the application of bootstrap tests is easy. The orthodox MP, LS, ME, and ML methods examine all possible topologies searching for the MP, LS, ME, and ML trees, respectively. Since the possible number of topologies rapidly increases with m (14), it is difficult to use these methods when m is large. In the case of ME, however, simplified algorithms (77, 104, 105) seem to be as efficient as the exhaustive search in obtaining the correct tree. It is hoped that similar simplified algorithms will be developed for other methods as well. [In the case of MP methods the branch and bound method (54) may be used when m ≤ 20.] Note that a vast majority of tree topologies are clearly incorrect when m is large, and there is no need to examine all these trees. The algorithm suggested by Rzhetsky & Nei (104, 105) may be used for identifying MP and ML or suboptimal MP or ML trees rapidly.
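How quickly the number of topologies grows with m can be made concrete with the standard counting formulas for bifurcating trees, (2m - 5)!! for unrooted and (2m - 3)!! for rooted trees. These formulas are a well-known result and are not stated in the text itself; the function names are hypothetical.

```python
def unrooted_topologies(m: int) -> int:
    """Number of unrooted bifurcating topologies for m sequences: (2m - 5)!!"""
    if m < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for k in range(3, 2 * m - 4, 2):  # 3 * 5 * ... * (2m - 5)
        count *= k
    return count


def rooted_topologies(m: int) -> int:
    """Number of rooted bifurcating topologies for m sequences: (2m - 3)!!"""
    if m < 2:
        raise ValueError("need at least two sequences")
    return unrooted_topologies(m + 1)


for m in (4, 10, 20):
    print(m, unrooted_topologies(m), rooted_topologies(m))
```

For m = 10 the unrooted count is already 2,027,025, which illustrates why exhaustive searches become impractical for even moderately large m.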

A tree-building method is said to be a “consistent estimator” if the method tends to give the correct topology as the number of nucleotides used (n) approaches infinity (27). The NJ, ME, and LS methods are consistent estimators if unbiased estimates of nucleotide substitutions are used as distance measures (19, 102, 112), and so is the ML method when the correct model of nucleotide substitution is used (148). By contrast, MP is often inconsistent, as mentioned earlier. In practice, however, n is usually of the order of hundreds to thousands, and in this case even NJ, ME, LS, and ML may fail to produce the correct tree with a relatively high probability when MP fails (60, 62, 115). Therefore, consistency is not always a useful criterion for comparing the efficiencies of different tree-building methods.

We have already discussed statistical tests of phylogenetic trees for several different tree-building methods. At present, the statistical methods for testing NJ and ME trees are well established. Solid statistical tests are also available for trees obtained by the generalized LS method (9, 137). In the case of ML methods, however, there seem to be many complications, as mentioned above. The best method for testing MP trees is probably Felsenstein’s bootstrap test, as long as the cases of inconsistency are avoided (30).

The probability of obtaining the correct topology is probably the most important criterion for comparing different tree-building methods, but this is also the most difficult problem to study. During the past 15 years, many authors have studied this problem, yet we do not have a clear-cut answer, as is discussed below. Another important criterion for comparing different methods is the reliability of branch length estimates. Once the correct topology is obtained for a given data set, this problem can be studied relatively easily. Theoretically, ML, LS, NJ, and ME are expected to give more reliable estimates of branch lengths than MP. At present, MP (and sometimes ML) trees are almost always presented without branch length estimates. This practice is regrettable because it gives a distorted picture of a phylogenetic tree. Since computer programs are available for estimating branch lengths of MP trees (123), any tree should be provided with branch length estimates.

Probability of Obtaining the Correct Topology

In most cases of phylogenetic reconstruction, we never know the true phylogeny for real data under investigation, so it is difficult to study this problem empirically. However, if we use an appropriate mathematical model, we can simulate the evolutionary changes of DNA sequences following a given model tree. We can then reconstruct a tree by various methods using the artificially generated present-day sequences and compare the topology of the tree obtained with that of the model tree. If this process is repeated many times, we can estimate the probability of obtaining the correct topology (PT), and this probability can be used for comparing the efficiencies of different tree-building methods (7, 89, 95, 133).
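The simulation scheme just described can be sketched for a model tree of four sequences. The sketch assumes the Jukes-Cantor model for generating sequences, Jukes-Cantor distances, and a simple distance criterion (smallest sum of within-pair distances) in place of a full tree-building method; the branch lengths, function names, and replicate numbers are hypothetical choices for illustration.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, d, rng):
    """Evolve a sequence along a branch of length d (substitutions per site) under JC."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return out


def jc_distance(s1, s2):
    """Jukes-Cantor distance; infinite when the p distance reaches saturation."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def simulate_once(ext, interior, n, rng):
    """Simulate the model tree ((0,1),(2,3)) and report whether it is recovered."""
    x = [rng.choice(BASES) for _ in range(n)]  # node joining sequences 0 and 1
    y = evolve(x, interior, rng)               # node joining sequences 2 and 3
    tips = [evolve(x, ext[0], rng), evolve(x, ext[1], rng),
            evolve(y, ext[2], rng), evolve(y, ext[3], rng)]

    def d(i, j):
        return jc_distance(tips[i], tips[j])

    correct = d(0, 1) + d(2, 3)
    wrong = min(d(0, 2) + d(1, 3), d(0, 3) + d(1, 2))
    return correct < wrong  # ties are counted as failures


def estimate_pt(ext, interior, n, replicates=200, seed=1):
    """Proportion of replicates in which the model topology is recovered."""
    rng = random.Random(seed)
    return sum(simulate_once(ext, interior, n, rng) for _ in range(replicates)) / replicates


print(estimate_pt(ext=[0.3, 0.05, 0.3, 0.05], interior=0.05, n=200))
```

Changing the branch lengths, n, or the reconstruction criterion and re-running the function gives PT estimates that can be compared across conditions, which is the logic of the simulation studies reviewed below.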


THEORETICAL STUDY When the number of sequences examined (m) is small (four or five), it is possible to evaluate PT analytically for the NJ, LS, and MP methods (111, 119, 155). These studies have shown that when the evolutionary rate is more or less constant for all four or five sequences, NJ has a slightly higher PT value than MP, which in turn has a somewhat higher PT than Fitch & Margoliash's (37) LS method (111). Both the ordinary and generalized LS methods are inferior to the ME method in obtaining the correct topology (103). This inferiority seems to be partly due to the fact that the LS methods often generate negative branches, as mentioned earlier. However, analytical evaluation of PT is very difficult when m is large, and the conclusion obtained from these studies may not apply to a wide variety of situations. No study has been made for ML trees even in the case of m = 4. For this reason, comparison of PT among different methods is usually done by computer simulation.

COMPUTER SIMULATION If we use computer simulation, PT's can be estimated for a variety of evolutionary conditions. Thus, a large number of simulation studies have been done during the past 15 years. The results obtained before 1990 have been summarized by Nei (89), but there are many recent studies (40, 48, 50, 60, 61, 74, 91, 102, 115, 148, 150). It is not easy to summarize these studies because different authors considered different evolutionary models and used different computer algorithms.

One of the most popular model trees used in computer simulation is the unrooted tree of four sequences in the form given in Figure 1(D), where a, b, and c represent the expected number of nucleotide substitutions per site. When a = b = c and a is greater than 0.1 but smaller than 0.5, almost any tree-building method produces the correct topology if n is greater than 100. Therefore, this model tree is not useful for discriminating the efficiencies of different methods. For this reason, many authors have assumed a > b. If we use the Jukes-Cantor model of nucleotide substitution, the MP method becomes inconsistent when b = c = 0.05 and a ≥ 0.394 (134). Therefore, MP always fails to produce the correct tree when a large number of nucleotides is used. However, NJ and ML usually recover the correct tree in this case if a < 0.5.

Some authors (59) have used cases of an extremely high degree of sequence divergence (a = 2.83; p distance = 0.65, and b = c = 0.05; p = 0.05) to show a superiority of ML methods. However, such divergent sequences are almost never used in practice because of the difficulty of sequence alignment. Therefore, such a study is not biologically meaningful. For the same reason, a large part of the computer simulations conducted by Huelsenbeck (60) also do not seem to be biologically meaningful (91). Although he considered the complete two-dimensional space for a and b = c (0 ≤ p ≤ 0.75; 0 ≤ corrected distance d ≤ ∞) for the sake of completeness, actual data used for phylogenetic analysis fall into a relatively small portion of the space near the origin (108). When b and c are of the order of 0.05 and 0.1 < a ≤ 0.5, MP is generally less efficient than NJ, which is in turn less efficient than ML (48, 50, 60, 134). However, when a, b, and c are all of the order of 0.01-0.025 and n is about 1000, all three methods reconstruct the true tree quite easily (134).

Note that the comparison of different tree-building methods is not always straightforward when a complicated model of nucleotide substitution is used, because appropriate computer programs are not always available. Thus, Tateno et al (134) compared the robustness of MP, NJ, and ML using available computer programs for the case where the substitution rate varies among nucleotide sites following the gamma distribution. Since the computer program for ML was not available, their comparison of NJ and ML was not adequate. Using a newly developed ML algorithm with the gamma distribution (147), Huelsenbeck (61) attempted to rectify Tateno et al's inadequate comparison between NJ and ML. However, he used a continuous gamma distribution for NJ but a discrete version for ML. Although this difference would not affect the final conclusion significantly, it illustrates a difficulty in computer simulation. This problem is compounded by the fact that for the NJ or ME methods, biased distance measures often give a higher PT value than unbiased distances.

The model tree (D) in Figure 1 obviously does not cover all possible types of trees for four sequences. The model tree (E) is different from tree (D) in that two long branches with length a are now neighbors and two short branches with length b are also neighbors. Interestingly, this model tree gives different relative PT values compared to those for tree (D). Some results for the two trees are given in Table 1. In tree (D), ML gives the highest PT value among the three methods ML, MP, and NJ, and NJ with p distance shows the lowest value. In tree (E), however, ML gives the lowest PT, whereas NJ with p distance gives the highest. Furthermore, both unweighted and weighted MP show much higher PT's than ML. These results were obtained apparently because in parsimony and NJ with p distance short branches tend to attract each other. Yang (150) has also shown that even when the evolutionary rate is constant, ML can be inferior to unweighted MP. These results indicate the difficulty of obtaining a general conclusion about the relative efficiencies of different tree-building methods, even for the simplest case of m = 4.

Table 1 Percent probabilities of obtaining the correct tree topology

Number of                        Tree D                                       Tree E
nucleotides (n)   NJ p  NJ JC  NJ K2  MP UW   MP W     ML      NJ p  NJ JC  NJ K2  MP UW   MP W     ML
            100     44     68     12     41     64     76        98     73     14     88     96     64
            200     41     79     81     52     80     84       100     83     82     91     99     76
            300     43     81     88     59     80     92       100     88     86     98    100     82
            500     35     94     95     62     89     97       100     96     94    100    100     90
            800     29     96     96     63     94     98       100     98     96    100    100     94
           1000     35     99     99     66     98    100       100     99     99    100    100     96

In both trees D and E in Figure 1, a = 0.4, b = 0.1, and c = 0.05 were assumed. Sequence data were generated by using Kimura's (70) two-parameter model with a transition/transversion rate ratio of 2, and the method of simulation was the same as that of Nei et al (91).
Abbreviations: NJ, neighbor-joining method; p, p distances; JC, Jukes-Cantor distance; K2, modified Kimura distance (32); MP, maximum parsimony method; UW, unweighted; W, weighted; ML, maximum likelihood method.

A number of simulation studies have been done for the cases of six or more sequences, although it is difficult to consider more than a dozen sequences. When m is very large, the interior branch lengths become very small if we want to make the most divergent sequence pair biologically reasonable (d ≤ 1.0). For this reason, PT becomes very low for any method, and an enormous amount of computer time is required (77, 133). The model trees considered usually represent the case of constant rate or its modifications (50, 110, 112, 121, 122). In some of these studies (110, 121), the exhaustive search of MP or ML trees was not done because of the excessive computer time required, but the true topology was always included. Therefore, the simulations were somewhat more favorable for MP or ML than for NJ. In general, these simulation studies have shown that ML is as good as or better than NJ, which is in turn often better than MP. However, the number of these studies is quite limited, and it is difficult to extrapolate these results to other cases.

A somewhat different type of simulation was conducted by Kuhner & Felsenstein (74). They generated a model tree of 10 sequences following the branching process in statistics in each replication, and the sequence data generated according to this model tree were used to reconstruct a phylogenetic tree. The topology of this tree was then compared with that of the model tree. The topological difference between the model tree and the estimated tree was measured by the number of the nonidentical sequence partitions between the two trees being compared (dT) (96). They considered a case of low divergence with an expected value of the root-to-tip branch length equal to 0.0193 and a case of high divergence with an expected value of 0.193. The average dT values for the low divergence case with a constant rate were 1.95, 1.82, and 1.64 for MP, NJ, and ML, respectively, when n = 1,000, whereas dT's for the high divergence case were 0.68, 0.67, and 0.54 for MP, NJ, and ML, respectively. Therefore, on the basis of dT values, ML is better than NJ, which is in turn slightly better than MP. However, the differences in dT among the three different methods are very small. Note that the above comparison was done with very special types of model trees that seem to have had very short interior branches occasionally. (None of the model trees used was published.) Therefore, many inferred trees should have had multifurcating nodes, yet the authors did not treat them as such; they accepted whatever resolution of the multifurcation a particular computer algorithm produced. Here again, we see an example where the comparison of different methods is algorithm-dependent. Note that ML algorithms often give zero branch lengths even if the true tree is apparently bifurcating (16).
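The topological distance dT used in this comparison can be computed directly once each tree is written as the set of sequence partitions defined by its interior branches. In the sketch below, each partition is specified by one of its two sides; the representation, the function names, and the example trees are assumptions made for illustration.

```python
def canonical_splits(sides, taxa):
    """Store each partition by the side that contains the smallest-labelled taxon."""
    all_taxa = frozenset(taxa)
    ref = min(all_taxa)
    out = set()
    for side in sides:
        side = frozenset(side)
        out.add(side if ref in side else all_taxa - side)
    return out


def partition_distance(sides_a, sides_b, taxa):
    """dT: number of partitions present in one tree but not in the other."""
    return len(canonical_splits(sides_a, taxa) ^ canonical_splits(sides_b, taxa))


taxa = [1, 2, 3, 4, 5]
tree_a = [{1, 2}, {4, 5}]  # interior branches of ((1,2),3,(4,5))
tree_b = [{1, 3}, {4, 5}]  # interior branches of ((1,3),2,(4,5))
print(partition_distance(tree_a, tree_b, taxa))
```

For two fully bifurcating trees dT is always an even number, so the fractional averages quoted above arise from averaging over replications.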

Despite many recent computer simulations, the interpretation of the results is not as straightforward as was originally expected, and more careful studies are needed to know the relative efficiencies of different methods. However, it is now clear that no method is almighty, that there are situations in which one method is more efficient than others in obtaining the true tree, and that, unless the evolutionary rate varies drastically among evolutionary lineages, all three methods considered here generally give the same or similar topologies (110). Computer simulations have also indicated that one of the most important factors is the number of nucleotides or amino acids used per sequence and that if this number is small, one cannot produce reliable trees.

TESTS BASED ON KNOWN PHYLOGENIES Although it is generally difficult to know the true topology in real data analysis, there are a few such cases. One is a phylogenetic tree experimentally produced by artificial mutagenesis with T7 phages (58). However, this type of experiment produces only one or a few replications, so it is difficult to compare different methods statistically. Furthermore, the pattern of nucleotide changes produced by mutagens seems to be somewhat unusual (8). It is thus unclear whether we can extrapolate the results obtained from these experiments to real cases.

However, there are a few instances in which the phylogenetic tree for a group of organisms is firmly established on paleontological and morphological bases. One such example is given in Figure 2(A). The complete nucleotide sequence of mitochondrial DNA (mtDNA) has recently been published for the 11 vertebrate species given in this figure. MtDNA in these species contains 13 protein-coding genes, the number of shared codons for each gene varying from 52 to 582. A phylogenetic tree was reconstructed for each of these genes and for the entire set of genes (3682 codons), and the trees obtained were compared with the true tree (100). In this study, amino acid sequences rather than nucleotide sequences were used, because the former produced more reliable trees.

Figure 2 (A) Known phylogeny for 11 vertebrate species. The total amino acid sequences of 13 coding genes of mitochondrial DNA produced the correct phylogeny with a bootstrap value of 100% for each interior branch. (B) When the lamprey and sea urchin sequences were added, an incorrect topology was produced with high bootstrap values.

When all 13 genes were used, all tree-building methods (NJ, ML, and MP) produced the correct tree irrespective of the algorithm used. A few genes (usually large genes) such as Nd5, Cytb, and Co3 also produced the correct or nearly correct topology. However, some genes (e.g. Co2, Nd1, Nd3, and Nd4l) almost always produced incorrect trees regardless of the method and algorithm used. This result clearly indicates that some genes are more suitable than others in phylogenetic inference and that all the tree-building methods tend to produce the same topology whether the topology is correct or not. Similar results were obtained by Kumazawa & Nishida (80). Since only 13 genes were studied, it was difficult to evaluate the relative efficiencies of the different tree-building methods. In general, however, sophisticated methods such as the ML method with Jones et al's substitution model were no better than simple methods such as NJ with p distance or the ML star-decomposition algorithm. Similar results were obtained by Cao et al (11, 12). These results suggest that the pattern and the rate of amino acid substitution vary with the group of organisms (and also with evolutionary time) and thus sophisticated mathematical models do not necessarily generate better results.

However, a surprising result was obtained when the lamprey and sea urchin sequences were added to the 11 sequences in Figure 2(A): a clearly wrong tree [Figure 2(B)] was obtained by all tree-building methods even when all genes were used, and a bootstrap test showed strong statistical support for this wrong tree! The reason for this is unclear, but the unusually slow rate of evolution of fish genes and the change in the pattern of amino acid substitutions with site and time (76) seem to be contributing factors.

Empirical studies of a few cases of known phylogenies have shown that when the sequences used are relatively closely related, the correct phylogeny is generally obtained as the number of codons or nucleotides increases, but that the topology of distantly related sequences may well be incorrect even when a large number of codons or nucleotides are used, and a bootstrap test may give strong statistical support for it.

THE MOLECULAR CLOCK AND LINEARIZED TREES

The molecular clock is one of the most important concepts in molecular evolutionary genetics, yet it has been controversial for many decades (21, 36, 82). Strictly speaking, the rate of nucleotide or amino acid substitution would never be constant over the entire evolutionary process because nucleotide or amino acid substitution is a complicated process that is dependent on the evolutionary stability and functional changes of genes. Therefore, if we study a large number of nucleotides or amino acids and the extent of sequence divergence is sufficiently large, we would surely be able to detect the heterogeneity of evolutionary rate. Yet, the extent of rate heterogeneity is usually moderate when relatively closely related sequences are used, so that one can use an approximate clock to obtain rough estimates of times of divergence between sequences from molecular data. Actually, a number of molecular evolutionists (75, 136) have attempted to estimate divergence times even when the molecular clock fails.

To use a molecular clock for estimating divergence times, however, it is important to test the applicability of a clock for the data set under consideration. If a molecular clock does not hold, we must identify and eliminate the sequences that deviate significantly from the assumption of rate constancy. After elimination of these sequences, we can reestimate the branch lengths of the tree for the remaining sequences under the assumption of rate constancy. A tree constructed in this way is called a linearized tree and can be used for estimating the divergence time of any pair of sequences provided that the rate of substitution can be estimated from other sources such as fossil records or geological dates (130). In this case, the test of a molecular clock need not be very strict, because the estimates of divergence times obtainable are generally very rough. Actually, we may retain certain important sequences even if they evolve significantly faster or slower than the average, unless they distort the tree substantially.

A commonly used test of the molecular clock is the relative rate test for three sequences (36,87,125,145), but this test is not appropriate for our purpose. We need a test that is applicable for many sequences simultaneously. Felsenstein (29) suggested that for trees constructed by distance methods the test be done by comparing the least-squares residual sum obtained under the assumption of rate constancy (Rc) with that for the case of no such assumption (RN) using Fisher's F test. When the ordinary or weighted least-squares method is used to compute Rc and RN (Fitch and Kitsch programs in the PHYLIP package), it is implicitly assumed that pairwise distance estimates are independently and normally distributed. Normality may not be seriously violated, but pairwise distances are positively correlated because of the tree-like relationships of the sequences. Therefore, this is not a rigorous statistical test (31).
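The F comparison of the two residual sums can be sketched as follows. This is only an illustration under stated assumptions: the degrees of freedom used here (m - 2 for the numerator and m(m - 1)/2 - (2m - 3) for the denominator, with m sequences) are an assumption based on counting branch lengths with and without a clock, the function name and the residual values are hypothetical, and the caveat about correlated distances applies in full.

```python
def clock_f_statistic(r_c, r_n, m):
    """Return (F, df1, df2) for least-squares fits with and without a clock.

    r_c: residual sum of squares under rate constancy (Rc)
    r_n: residual sum of squares without that assumption (RN)
    m:   number of sequences in the tree
    """
    if m < 4:
        raise ValueError("need at least 4 sequences for a positive denominator df")
    if r_n <= 0 or r_c < r_n:
        raise ValueError("expected Rc >= RN > 0")
    n_dist = m * (m - 1) // 2      # number of pairwise distances
    df1 = m - 2                    # (2m - 3) branch lengths minus (m - 1) node heights
    df2 = n_dist - (2 * m - 3)     # residual df of the unconstrained fit (assumed)
    f = ((r_c - r_n) / df1) / (r_n / df2)
    return f, df1, df2


# Hypothetical residual sums for a tree of 8 sequences.
f, df1, df2 = clock_f_statistic(r_c=0.0125, r_n=0.0050, m=8)
print(f"F = {f:.2f} on ({df1}, {df2}) degrees of freedom")
```

The resulting F value would be compared with a tabulated critical value, keeping in mind that the nominal significance level is only approximate.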

The hypothesis of rate constancy can also be tested by computing the likelihood values with and without the assumption of rate constancy (31). Twice the difference of the log likelihood values between the two cases is expected to follow the χ² distribution asymptotically with m - 2 degrees of freedom (m being the number of sequences). Goldman (42) questioned the χ² approximation of the test statistic, but Yang et al's (151) simulation study suggests that the χ² approximation is acceptable in most cases.

Takezaki et al (130) presented two simple methods of testing rate constancy specifically designed to identify sequences evolving excessively fast or slow: the two-cluster and the branch-length tests. In these methods, the root of the tree is first established by using outgroup sequences as in the case of Figure 2(A), where the fish genes can be regarded as outgroups. The two-cluster test examines whether the difference in average branch length between two clusters of sequences created by an interior node is statistically significant or not. This test is done by computing the standard error of the difference between the two average branch lengths and applying the Z test. If this Z test shows that one branch length is significantly different from the other, the one that is more different from the average root-to-tip distance for all sequences is eliminated. In the branch-length test, the root-to-tip branch length (y) is computed for all sequences, and the difference between the y value for a particular sequence and the average (ȳ) for all sequences is computed. This difference is then subjected to a statistical test to identify the sequences that evolve significantly faster or slower.
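Returning to the likelihood ratio test of rate constancy described above, a minimal sketch is given here. It assumes that the two maximized log likelihoods have already been obtained from some likelihood program; the function names and the numerical values are hypothetical. The χ² tail probability is computed from closed-form expressions for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{i < df/2} h^i / i!
        term = math.exp(-h)
        total = term
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return total
    # Odd df: erfc(sqrt(h)) + exp(-h) * sum_{i=1..(df-1)/2} h^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (i + 0.5)
    return total


def clock_lrt(lnl_clock, lnl_free, m):
    """Likelihood ratio test of rate constancy for a tree of m sequences."""
    df = m - 2
    if df < 1:
        raise ValueError("need at least 3 sequences")
    stat = 2.0 * (lnl_free - lnl_clock)   # twice the log likelihood difference
    return stat, df, chi2_sf(stat, df)


# Hypothetical log likelihoods for a 10-sequence data set.
stat, df, p = clock_lrt(lnl_clock=-5241.9, lnl_free=-5230.4, m=10)
print(f"2*dlnL = {stat:.1f}, df = {df}, P = {p:.4f}")
```

A small P value would indicate rejection of rate constancy for the data set as a whole; it does not by itself identify which sequences are responsible.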

The deviation of y from ȳ can also be tested by Uyenoyama's (137) generalized least-squares method. Since the generalized least-squares estimates of branch lengths have a smaller variance than the ordinary least-squares estimates, this test is expected to be more powerful than the above methods. However, the application of this method to the case of a large number of sequences is difficult because it requires a large amount of computer time. For our purpose, we do not need a very powerful method to detect the rate heterogeneity, because we are interested only in approximate rate constancy.
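The simpler branch-length test can be illustrated with a short sketch. It assumes that the root-to-tip length of each sequence and the standard error of its deviation from the average have already been estimated (the variance computation itself is not shown), and it applies a two-sided normal-deviate test to each deviation. All names and numbers are hypothetical.

```python
import math


def branch_length_test(root_to_tip, se_dev, alpha=0.01):
    """Flag sequences whose root-to-tip length deviates from the average.

    root_to_tip: dict of sequence name -> root-to-tip length y
    se_dev:      dict of sequence name -> standard error of (y - average)
    Returns (average, results), where results maps each name to
    (deviation, z, two-sided P, flagged).
    """
    y_bar = sum(root_to_tip.values()) / len(root_to_tip)
    results = {}
    for name, y in root_to_tip.items():
        dev = y - y_bar
        z = dev / se_dev[name]
        p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail area
        results[name] = (dev, z, p, p < alpha)
    return y_bar, results


# Hypothetical root-to-tip lengths (substitutions per site) and standard errors.
y = {"A": 0.100, "B": 0.105, "C": 0.098, "D": 0.150, "E": 0.097}
se = {name: 0.012 for name in y}
y_bar, results = branch_length_test(y, se)
flagged = [name for name, r in results.items() if r[3]]
print("average root-to-tip length:", round(y_bar, 3))
print("sequences deviating at the 1% level:", flagged)
```

In this made-up example only sequence D would be a candidate for elimination before the branch lengths are reestimated.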

Once excessively fast- or slow-evolving sequences are eliminated, one can construct a linearized tree under the assumption of rate constancy, using the method described by Takezaki et al (130). Linearized trees have been constructed to estimate the times of divergence for various pairs of Drosophila species using the alcohol dehydrogenase gene sequences and the geological estimates of the times of formation of the Hawaiian islands (99, 130). These studies indicate that when many sequences (about 40 sequences) are used, the time estimates remain nearly the same even if some sequences that evolved significantly slower or faster (1% level) than the average are included. The same method has also been used for estimating the times of origin of the orders of placental mammals and birds (52).
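The way a linearized tree is converted into time estimates can be sketched as follows, under the assumption that the height of each interior node (its distance to the tips, in substitutions per site) has already been estimated under rate constancy and that one node has an independently known date. The node names, heights, and calibration date are hypothetical and do not come from the studies cited.

```python
def substitution_rate(calib_height, calib_time):
    """Rate per site per unit time from one dated node of a linearized tree."""
    if calib_height <= 0 or calib_time <= 0:
        raise ValueError("calibration height and time must be positive")
    return calib_height / calib_time


def divergence_times(node_heights, rate):
    """Convert node heights (substitutions per site) into divergence times."""
    return {node: height / rate for node, height in node_heights.items()}


# Hypothetical calibration: a node of height 0.05 dated at 5 million years.
rate = substitution_rate(calib_height=0.05, calib_time=5.0)

# Hypothetical heights of three other nodes in the same linearized tree.
heights = {"node1": 0.05, "node2": 0.12, "node3": 0.30}
times = divergence_times(heights, rate)
for node, t in times.items():
    print(f"{node}: {t:.1f} million years")
```

Because a single rate is applied to every node, any error in the calibration date propagates proportionally to all time estimates, which is one reason the estimates should be regarded as rough.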

PERSPECTIVES In this review, I have discussed recent developments in phylogenetic analysis that are biologically important. I have emphasized that the statistical foundation of phylogenetic inference is not well established for any tree-building method and that there is an urgent need to clarify this foundation. However, computer simulations and a few empirical studies suggest that currently used methods such as the NJ, ME, MP, and ML methods generate reasonably good phylogenetic trees (topologies) when a sufficiently large number of nucleotides or amino acids are used. In general, all these methods produce the same or similar trees unless the evolutionary rate extensively varies with evolutionary lineage. When the evolutionary rate varies with evolutionary lineage, MP tends to be less efficient than other methods in obtaining the true topology, but if the extent of rate heterogeneity is very high, all methods may fail to identify the true topology. MP methods also tend to give more biased estimates of branch lengths than others. However, MP methods have an advantage over other methods in that they can easily utilize information generated by insertions/deletions.

In practice, the number of nucleotides or amino acids used is sometimes quite small, and the sequences analyzed are closely related. In this case, any tree-building method would make some errors in topology construction, and the use of a simple method would be sufficient for obtaining rough evolutionary relationships of sequences. By contrast, when the extent of sequence divergence is very high and the pattern of nucleotide or amino acid substitution remains nearly the same for all sites during the entire evolutionary time, ML methods are expected to give better results in topology estimation than other methods. The only problem is whether the substitution pattern remains the same or not for a long evolutionary time. If this pattern varies with site and changes with time, the advantage of ML methods over other methods would decline, because other methods require less rigid assumptions about the substitution pattern than ML methods. Another complicating factor is sequence alignment. When there are many deletions and insertions, it is often very difficult to have a reliable sequence alignment, and a relatively small difference in sequence alignment often has a profound effect on the phylogenetic tree reconstructed. Yet, few studies have been made on the sensitivities of different tree-building methods to the alignment differences. At this moment, the relative efficiencies of different tree-building methods remain unclear, particularly when various biological factors are considered.

Our current knowledge of the relative efficiencies of different tree-building methods is largely based on computer simulation and some theoretical considerations. However, our ability to simulate the real process of DNA sequence evolution is limited, and in the future the relative efficiencies should be studied by using actual sequence data for known phylogenies. Fortunately, as the amount of sequence data increases, the number of known phylogenies for which such a study can be done is increasing. Therefore, we will probably be able to know the empirical relative efficiencies in the near future.

In the past two decades many investigators have studied the molecular phylogeny of three basic forms of life, archaebacteria, eubacteria, and eukaryotes, but the results obtained are still conflicting (20, 46, 68, 144). To resolve this problem, Golding & Gupta (41) examined the relationships of gram-positive bacteria, gram-negative bacteria, archaebacteria, and eukaryotes using 24 different protein sequences, and showed that different proteins generate different but statistically significant topologies. On the basis of this result, they hypothesized that eukaryotes evolved by fusion of an archaebacterium and a gram-negative bacterium. While this hypothesis is interesting and should be pursued in more detail, it should be kept in mind that current statistical tests are not very reliable for examining the evolutionary events in such ancient times.

This proviso applies even to phylogenetic trees of different phyla, classes, and families. Note that we could not reconstruct the correct phylogeny even for vertebrates and invertebrates by using mtDNA (Figure 2A). Therefore, great caution is necessary in interpreting molecular phylogenies for distantly related organisms. Since our statistical methods depend on so many simplifying assumptions, it is not always clear whether we are uncovering the genealogical history of early evolution or merely describing the extent of functional differentiation of genes that have been highly conserved. [Note that there are only two amino acid differences between calf and pea in a sequence of 105 amino acids of histone H4 (88). Do they represent stochastic changes of amino acid sequences?] In this type of study, we should use both quantitative and qualitative characters (68). Here we need some principles of the cladistic approach as well as the statistical approach. It is also important to examine a large number of genes and analyze them simultaneously (20).

In this article, I have been concerned primarily with statistical inference of phylogenetic trees. However, phylogenetic analysis of DNA or protein sequences is also useful for understanding the mechanism of evolution as mentioned in the Introduction. One approach to this problem is to infer the amino acid sequences of proteins in ancestral organisms from sequence data of extant organisms and to study how each amino acid substitution has changed the function of genes in the evolutionary process. This approach was suggested as early as 1963 by Pauling & Zuckerkandl (94), but until recently it was not used very often because of the difficulty in predicting the function of ancestral genes by this approach alone.

In recent years statistical methods for inferring ancestral amino acid sequences have been refined, and we can use any of the parsimony (35, 41, 85, 86), likelihood (152), and distance (153) approaches for this purpose. Computer simulations have shown that the likelihood and distance methods give more accurate inference than does the parsimony method, but all methods give reasonably good results when extant sequences are relatively closely related (153). Once the ancestral sequences are inferred, it is possible to reconstruct the ancestral proteins by site-directed mutagenesis. Jermann et al (64) used this technique to study the evolutionary change of catalytic activity of ribonucleases in artiodactyls in relation to the evolution of ruminant digestion. Similarly, Chandrasekharan et al (15) reconstructed an inferred chymase (an enzyme belonging to the serine protease family) of an ancestral organism of mammals and showed that the catalytic activity of the ancestral enzyme was higher than that of the chymase of current mammals.
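As an illustration of the parsimony approach to ancestral inference, the sketch below applies Fitch-style set operations to a single amino acid site on a small rooted binary tree. It is a simplified version under stated assumptions: the tree, the taxon names, and the residues are hypothetical, and the downward pass returns only one of the possibly several equally parsimonious reconstructions rather than enumerating them all.

```python
def fitch_reconstruct(tree, leaf_states):
    """Fitch parsimony for one site on a rooted binary tree.

    tree:        nested 2-tuples with leaf names (str) at the tips
    leaf_states: dict mapping each leaf name to its observed residue
    Returns (assigned, changes): one most parsimonious state per node and
    the minimum number of substitutions required.
    """
    sets = {}
    changes = 0

    def post_order(node):
        nonlocal changes
        if isinstance(node, str):
            sets[node] = {leaf_states[node]}
            return
        left, right = node
        post_order(left)
        post_order(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common                      # children agree on some state
        else:
            sets[node] = sets[left] | sets[right]    # disagreement costs one change
            changes += 1

    assigned = {}

    def pre_order(node, parent_state):
        candidates = sets[node]
        # Keep the parent's state when allowed; otherwise break ties alphabetically.
        state = parent_state if parent_state in candidates else min(candidates)
        assigned[node] = state
        if not isinstance(node, str):
            for child in node:
                pre_order(child, state)

    post_order(tree)
    pre_order(tree, None)
    return assigned, changes


# Hypothetical four-taxon tree and residues at one site.
tree = ((("a", "b"), "c"), "d")
site = {"a": "K", "b": "R", "c": "K", "d": "K"}
assigned, changes = fitch_reconstruct(tree, site)
for node in [("a", "b"), (("a", "b"), "c"), tree]:
    print(node, "->", assigned[node])
print("substitutions required:", changes)
```

Repeating this site by site gives an inferred ancestral sequence for each interior node; the likelihood and distance approaches mentioned above replace the set operations with probabilistic or distance-based criteria.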

This type of study will be conducted more often in the future, because both statistical and biochemical techniques are now available. It is no longer a dream to reconstruct proteins of many extinct organisms by using these techniques and to study how morphological and physiological characters have evolved. In the study of evolution the phylogenetic analysis of DNA or protein sequences will play a more important role in the future.

ACKNOWLEDGMENTS

I thank Arndt von Haeseler, Blair Hedges, John Huelsenbeck, Sudhir Kumar, Andrey Rzhetsky, Tanya Sitnikova, Naoko Takezaki, and Ziheng Yang for their comments on an earlier version of this paper. I am also grateful to Naoko Takezaki for her help in making the illustrations. This work was supported by grants from NIH (GM20293) and NSF (DEB-9520832).


1. Adachi J, Hasegawa M. 1995. MOLPHY: Programs for Molecular Phylogenetics. Tokyo: Inst. Statist. Math.

2. Adachi J, Hasegawa M. 1996. Instability of quartet analyses of molecular sequence data by the maximum likelihood method: the Cetacea/Artiodactyla relationships. Mol. Phyl. Evol. 6:72-76

3. Adachi J, Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42:459-68

4. Atchley WR, Fitch WM, Bronner-Fraser M. 1994. Molecular evolution of the MyoD family of transcription factors. Proc. Natl. Acad. Sci. USA 91:11522-26

5. Baldauf SL, Palmer JD. 1993. Animals and fungi are each other's closest relatives: congruent evidence from multiple proteins. Proc. Natl. Acad. Sci. USA 90:11558-62

6. Barry D, Hartigan JA. 1987. Asynchronous distance between homologous DNA sequences. Biometrics 43:261-76

7. Blanken RL, Klotz LC, Hinnebusch AG. 1982. Computer comparison of new and existing criteria for constructing evolutionary trees from sequence data. J. Mol. Evol. 19:9-19

8. Bull JJ, Cunningham CW, Molineux IJ, Badgett MR, Hillis DM. 1993. Experimental molecular evolution of bacteriophage T7. Evolution 47:993-1007

9. Bulmer M. 1991. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. 8:868-83

10. Deleted in proof

11. Cao Y, Adachi J, Hasegawa M. 1994. Eutherian phylogeny as inferred from mitochondrial DNA sequence data. Jpn. J. Genet. 69:455-72

12. Cao Y, Adachi J, Janke A, Paabo S, Hasegawa M. 1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519-27

13. Cavalli-Sforza LL, Edwards AWF. 1964. Analysis of human evolution. In Genetics Today, ed. SJ Geerts, pp. 923-33. Proc. Int. Congr. Genet., 11th, The Hague, The Netherlands: Pergamon

14. Cavalli-Sforza LL, Edwards AWF. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233-57

15. Chandrasekharan UM, Sanker S, Glynias MJ, Karnik SS, Husain A. 1996. Angiotensin II-forming activity in a reconstructed ancestral chymase. Science 271:502-5

16. Cooper A, Mourer-Chauvire C, Chambers GK, von Haeseler A, Wilson AC, Paabo S. 1992. Independent origins of New Zealand moas and kiwis. Proc. Natl. Acad. Sci. USA 89:8741-44

17. Dayhoff MO. 1972. Atlas of Protein Sequence and Structure. Silver Springs, MD: Natl. Biomed. Res. Found.

18. Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, ed. MO Dayhoff, pp. 345-52. Washington, DC: Natl. Biomed. Res. Found.

19. DeBry RW. 1992. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol. Biol. Evol. 9:537-51

20. Doolittle RF, Feng D-F, Tsang S, Cho G, Little E. 1996. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science 271:470-77

21. Easteal S, Collet C, Betty D. 1995. The Mammalian Molecular Clock. Austin, TX: Landes

22. Eck RV, Dayhoff MO. 1966. Atlas of Protein Sequence and Structure. Silver Springs, MD: Natl. Biomed. Res. Found.

23. Edwards AWF, Cavalli-Sforza LL. 1963. The reconstruction of evolution. Heredity 18:553. Abstr.

24. Efron B. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia, PA: Soc. Ind. Appl. Math.

25. Eldredge N, Gould SJ. 1972. Punctuated equilibria: an alternative to phyletic gradualism. In Models in Paleobiology, ed. TJM Schopf, pp. 82-115. San Francisco: Freeman

26. Farris JS. 1969. A successive approximations approach to character weighting. Syst. Zool. 18:374-85

27. Felsenstein J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-10

28. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-76

29. Felsenstein J. 1984. Distance methods for inferring phylogenies: a justification. Evolution 38:16-24

30. Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-91

31. Felsenstein J. 1988. Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22:521-65

32. Felsenstein J. 1993. PHYLIP: Phylogenetic Inference Package. Seattle, WA: Univ. Wash.

33. Felsenstein J, Churchill GA. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104

34. Figueroa F, Gunther E, Klein J. 1988. MHC polymorphism pre-dating speciation. Nature 335:265-67

35. Fitch WM. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406-16

36. Fitch WM. 1976. Molecular evolutionary clocks. In Molecular Evolution, ed. FJ Ayala, pp. 160-78. Sunderland, MA: Sinauer

37. Fitch WM, Margoliash E. 1967. Construction of phylogenetic trees. Science 155:279-84

38. Fitch WM, Ye J. 1991. Weighted parsimony: Does it work? In Phylogenetic Analysis of DNA Sequences, ed. MM Miyamoto, J Cracraft, pp. 147-54. New York, NY: Oxford Univ. Press

39. Gascuel O. 1994. A note on Sattath and Tversky's, Saitou and Nei's, and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol. Biol. Evol. 11:961-63

40. Gaut BS, Lewis PO. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-62

41. Golding GB, Gupta R. 1995. Protein-based phylogenies support a chimeric origin of the eukaryotic genome. Mol. Biol. Evol. 12:1-6

42. Goldman N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-98

43. Goldstein DB, Pollock DD. 1994. Least squares estimation of molecular distance-noise abatement in phylogenetic reconstruction. Theor. Popul. Biol. 45:219-26

44. Goodwin RL, Baumann H, Berger FG. 1996. Patterns of divergence during evolution of α1-proteinase inhibitors in mammals. Mol. Biol. Evol. 13:346-58

45. Gouy M, Li W-H. 1989. Molecular phylogeny of the kingdoms animalia, plantae, and fungi. Mol. Biol. Evol. 6:109-22

46. Gupta RS, Aitken K, Falah M, Singh B. 1994. Cloning of Giardia lamblia heat shock protein HSP70 homologs: implications regarding origin of eukaryotic cells and of endoplasmic reticulum. Proc. Natl. Acad. Sci. USA 91:2895-99

47. Hartigan JA. 1973. Minimum evolution fits to a given tree. Biometrics 29:53-65

48. Hasegawa M, Fujiwara M. 1993. Relative efficiencies of the maximum likelihood, maximum-parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phyl. Evol. 2:1-5

49. Hasegawa M, Hashimoto T. 1993. Ribosomal RNA trees misleading? Nature 361:23

50. Hasegawa M, Kishino H, Saitou N. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443-45

51. Hasegawa M, Kishino H, Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-74

52. Hedges SB, Parker PH, Sibley CG, Kumar S. 1996. Continental breakup and ordinal diversification of birds and mammals. Nature 381:226-29

53. Hendy MD, Charleston MA. 1993. Hadamard conjugation: a versatile tool for modelling nucleotide sequence evolution. N. Z. J. Bot. 31:231-37

54. Hendy MD, Penny D. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosci. 59:277-90

55. Hendy MD, Penny D. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297-309

56. Hendy MD, Penny D, Steel MA. 1994. A discrete Fourier analysis of evolutionary trees. Proc. Natl. Acad. Sci. USA 91:3339-43

57. Hillis DM, Bull JJ. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42:182-92

58. Hillis DM, Bull JJ, White ME, Badgett MR, Molineux IJ. 1992. Experimental phylogenetics: generation of a known phylogeny. Science 255:589-92

59. Hillis DM, Huelsenbeck JP, Swofford DL. 1994. Hobgoblin of phylogenetics? Nature 369:363-64

60. Huelsenbeck JP. 1995. The performance of phylogenetic methods in simulation. Syst. Biol. 44:17-48

61. Huelsenbeck JP. 1995. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12:843-49

62. Huelsenbeck JP, Hillis DM. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247-64

63. Hughes AL, Nei M. 1989. Nucleotide substitution at major histocompatibility complex II loci: evidence for overdominant selection. Proc. Natl. Acad. Sci. USA 86:958-62

64. Jermann TM, Opitz JG, Stackhouse J, Benner SA. 1995. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57-59

65. Jin L, Nei M. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7:82-102

66. Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-82

67. Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In Mammalian Protein Metabolism, ed. HN Munro, pp. 21-132. New York: Academic

68. Keeling PJ, Charlebois RL, Doolittle WF. 1994. Archaebacterial genomes: eubacterial form and eukaryotic content. Curr. Biol. 4:816-22

69. Kidd KK, Sgaramella-Zonta LA. 1971. Phylogenetic analysis: concepts and methods. Am. J. Hum. Genet. 23:235-52

70. Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-20

71. Kishino H, Hasegawa M. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topology from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170-79

72. Kishino H, Miyata T, Hasegawa M. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151-60

73. Kocher TD, Wilson AC. 1991. Sequence evolution of mitochondrial DNA in humans and chimpanzees: control region and a protein-coding region. In Evolution of Life, ed. S Osawa, T Honjo, pp. 391-413. New York: Springer-Verlag

74. Kuhner MK, Felsenstein J. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459-68

75. Kumada Y, Benson DR, Hillemann D, Hosted TJ, Rochefort DA, et al. 1993. Evolution of the glutamine synthetase gene, one of the oldest existing and functioning genes. Proc. Natl. Acad. Sci. USA 90:3009-13

76. Kumar S. 1996. Patterns of nucleotide substitution in mitochondrial protein coding genes of vertebrates. Genetics 143:537-48

77. Kumar S. 1996. A stepwise algorithm for finding minimum evolution trees. Mol. Biol. Evol. 13:584-93

78. Kumar S, Rzhetsky A. 1996. Evolutionary relationships of eukaryotic kingdoms. J. Mol. Evol. 42:183-93

79. Kumar S, Tamura K, Nei M. 1993. MEGA: Molecular Evolutionary Genetics Analysis. University Park: Penn. State Univ.

80. Kumazawa Y, Nishida M. 1996. Phylogenetic utility of mitochondrial transfer RNA genes for deep divergence in animals. See Ref. 90a, pp. 23-35

81. Li W-H. 1989. A statistical test of phylogenies estimated from sequence data. Mol. Biol. Evol. 6:424-35

82. Li W-H. 1993. So, what about the molecular clock hypothesis? Curr. Opin. Genet. Dev. 3:896-901

83. Li W-H, Gouy M. 1991. Statistical methods for testing phylogenies. In Phylogenetic Analysis of DNA Sequences, ed. MM Miyamoto, J Cracraft, pp. 249-77. New York: Oxford Univ. Press

84. Li W-H, Zharkikh A. 1995. Statistical tests of DNA phylogenies. Syst. Biol. 44:49-63

85. Maddison WP. 1995. Calculating the probability distributions of ancestral states reconstructed by parsimony on phylogenetic trees. Syst. Biol. 44:474-81

86. Maddison WP, Maddison DR. 1992. MacClade: Analysis of Phylogeny and Character Evolution. Sunderland, MA: Sinauer

87. Muse SV, Weir BS. 1992. Testing for equality of evolutionary rates. Genetics 132:269-76

88. Nei M. 1987. Molecular Evolutionary Genetics. New York: Columbia Univ. Press

89. Nei M. 1991. Relative efficiencies of different tree making methods for molecular data. In Recent Advances in Phylogenetic Studies of DNA Sequences, ed. MM Miyamoto, JL Cracraft, pp. 133-47. Oxford: Oxford Univ. Press

90. Nei M, Stephens JC, Saitou N. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2:66-85

90a. Nei M, Takahata N, eds. 1995. Current Topics on Molecular Evolution. Penn. State Univ., USA, and Grad. Univ. Adv. Stud., Hayama, Jpn.

91. Nei M, Takezaki N, Sitnikova T. 1995. Assessing molecular phylogenies. Science 267:253-55

92. Olsen GJ, Matsuda H, Hagstrom R, Overbeek R. 1994. FastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:41-48

93. Ota T, Nei M. 1994. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol. Biol. Evol. 11:469-82

94. Pauling L, Zuckerkandl E. 1963. Chemical paleogenetics: molecular "restoration studies" of extinct forms of life. Acta Chem. Scand. [B] 17:S9-16

95. Peacock D, Boulter D. 1975. Use of amino acid sequence data in phylogeny and evaluation of methods using computer simulation. J. Mol. Biol. 95:513-27

96. Penny D, Hendy MD. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82

97. Rannala B, Yang Z. 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43:304-11

98. Rodrigo MJ, Dopazo J. 1995. Evolutionary analysis of picornavirus family. J. Mol. Evol. 40:362-71

99. Russo CAM, Takezaki N, Nei M. 1995. Molecular phylogeny and divergence times of drosophilid species. Mol. Biol. Evol. 12:391-404

100. Russo CAM, Takezaki N, Nei M. 1996. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13:525-36

101. Rzhetsky A, Kumar S, Nei M. 1995. Four-cluster analysis: a simple method to test phylogenetic hypotheses. Mol. Biol. Evol. 12:163-67

102. Rzhetsky A, Nei M. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9:945-67

103. Rzhetsky A, Nei M. 1992. Statistical properties of the ordinary least-squares, generalized least-squares and minimum-evolution methods of phylogenetic inference. J. Mol. Evol. 35:367-75

104. Rzhetsky A, Nei M. 1993. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol. 10:1073-95

105. Rzhetsky A, Nei M. 1994. METREE: a program package for inferring and testing minimum-evolution trees. Comput. Appl. Biosci. 10:409-12

106. Rzhetsky A, Nei M. 1994. Unbiased estimates of the number of nucleotide substitutions when substitution rate varies among different sites. J. Mol. Evol. 38:295-99

107. Rzhetsky A, Nei M. 1995. Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol. 12:131-51

108. Rzhetsky A, Sitnikova T. 1995. Using a wrong substitution model in tree-making. See Ref. 90a, pp. 125-35

109. Saitou N. 1988. Property and efficiency of the maximum likelihood method for molecular phylogeny. J. Mol. Evol. 27:261-73

110. Saitou N, Imanishi M. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic reconstructions in obtaining the correct tree. Mol. Biol. Evol. 6:514-25

111. Saitou N, Nei M. 1986. The number of nucleotides required to determine the branching order of three species, with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189-204

112. Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-25

113. Sankoff D, Cedergren RJ. 1983. Simultaneous comparison of three or more sequences related by a tree. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, ed. D Sankoff, JB Kruskal, pp. 253-63. Reading, MA: Addison-Wesley

114. Schoniger M, von Haeseler A. 1993. A simple method to improve the reliability of tree reconstructions. Mol. Biol. Evol. 10:471-83

115. Schoniger M, von Haeseler A. 1995. Performance of the maximum likelihood, neighbor joining, and maximum parsimony methods when sequence sites are not independent. Syst. Biol. 44:533-47

116. Sidow A, Thomas WK. 1994. A molecular evolutionary framework for eukaryotic model organisms. Curr. Biol. 4:596-603

117. Sitnikova T. 1996. Bootstrap method of interior-branch test for phylogenetic trees. Mol. Biol. Evol. 13:605-11

118. Deleted in proof

119. Sitnikova T, Rzhetsky A, Nei M. 1995. Interior-branch and the bootstrap tests of phylogenetic trees. Mol. Biol. Evol. 12:319-33

120. Sokal RR, Sneath PHA. 1963. Principles of Numerical Taxonomy. San Francisco: Freeman

121. Sourdis J, Krimbas C. 1987. Accuracy of phylogenetic trees estimated from DNA sequence data. Mol. Biol. Evol. 4:159-66

122. Sourdis J, Nei M. 1988. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol. 5:298-311

123. Swofford DL. 1993. PAUP: Phylogenetic Analysis Using Parsimony. Champaign, IL: Ill. Natl. Hist. Surv.

124. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. 1996. Phylogenetic inference. In Molecular Systematics, ed. DM Hillis, C Moritz, BK Mable, pp. 407-514. Sunderland, MA: Sinauer. 2nd ed.

125. Tajima F. 1993. Simple methods for testing molecular clock hypothesis. Genetics 135:599-607

126. Tajima F. 1993. Unbiased estimate of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 10:677-88

127. Tajima F, Takezaki N. 1994. Estimation of evolutionary distance for reconstructing molecular phylogenetic trees. Mol. Biol. Evol. 11:278-86

128. Takahata N. 1993. Allelic genealogy and human evolution. Mol. Biol. Evol. 10:2-22

129. Takezaki N, Nei M. 1994. Inconsistency of the maximum parsimony method when the rate of nucleotide substitution is constant. J. Mol. Evol. 39:210-18

130. Takezaki N, Rzhetsky A, Nei M. 1995. Phylogenetic test of the molecular clock and linearized tree. Mol. Biol. Evol. 12:823-33

131. Tamura K, Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-26

132. Tanaka T, Nei M. 1989. Positive Darwinian selection observed at the variable region genes of immunoglobulins. Mol. Biol. Evol. 6:447-59

133. Tateno Y, Nei M, Tajima F. 1982. Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species. J. Mol. Evol. 18:387-404

134. Tateno Y, Takezaki N, Nei M. 1994. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. 11:261-77

135. Templeton AR. 1983. Phylogenetic inference from restriction cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221-44

136. Thomas RH, Hunt JA. 1993. Phylogenetic relationships in Drosophila: a conflict between molecular and morphological data. Mol. Biol. Evol. 10:362-74

137. Uyenoyama M. 1995. A generalized least-squares estimate of the origin of sporophytic self-incompatibility. Genetics 139:975-92

138. Vigilant L, Stoneking M, Harpending H, Hawkes K, Wilson AC. 1991. African populations and the evolution of human mitochondrial DNA. Science 253:1503-7

139. Wainright PO, Hinkle G, Sogin ML, Stickel SK. 1993. Monophyletic origins of metazoa: an evolutionary link with fungi. Science 260:340-42

140. Wakeley J. 1993. Substitution rate variation among sites in hypervariable region I of human mitochondrial DNA. J. Mol. Evol. 37:613-23

141. Williams PL, Fitch WM. 1990. Phylogeny determination using dynamically weighted parsimony method. In Methods in Enzymology, ed. RF Doolittle, pp. 615-26. San Diego, CA: Academic

142. Wilson AC, Carlson SS, White TJ. 1977. Biochemical evolution. Annu. Rev. Biochem. 46:573-639

143. Wistow G. 1993. Lens crystallins: gene recruitment and evolutionary dynamism. Trends Biochem. Sci. 18:301-6

144. Woese CR, Kandler O, Wheelis ML. 1990. Towards a natural system of organisms: proposal for the domains archaea, bacteria, and eucarya. Proc. Natl. Acad. Sci. USA 87:4576-79

145. Wu C-I, Li W-H. 1985. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl. Acad. Sci. USA 82:1741-45

146. Yang Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105-11

147. Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-14

148. Yang Z. 1994. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst. Biol. 43:329-42

149. Yang Z. 1995. PAML: Phylogenetic Analysis by Maximum Likelihood. University Park: Inst. Mol. Evol. Genet., Penn. State Univ.

150. Yang Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294-307

151. Yang Z, Goldman N, Friday AE. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst. Biol. 44:384-99

152. Yang Z, Kumar S, Nei M. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641-50

153. Zhang J, Nei M. 1996. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. In press

154. Zharkikh A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-29

155. Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol. Biol. Evol. 9:1119-47

156. Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. II. Four taxa without a molecular clock. J. Mol. Evol. 35:356-66

157. Zharkikh A, Li W-H. 1993. Inconsistency of the maximum parsimony method: the case of five taxa with a molecular clock. Syst. Biol. 42:113-25

158. Zharkikh A, Li W-H. 1995. Estimation of confidence in phylogeny: complete-and-partial bootstrap technique. Mol. Phyl. Evol. 4:44-63

[Figure 1, six panels. (a) Monte Carlo simulation (x-axis: step, 0 to 200000); (b) Convergence vs. number of generations, varying population size; (c) Joint search, number of phi/psi-pair selections per cross-over; (d) Carry forward percentage; (e) Steps of side-chain minimisation; (f) Standard deviation recalculation frequency. Panels b to f share the x-axis Generations, 0 to 60.]

Figure 1. Parameter optimization simulations performed on the Bar-1 fragment from barnase. a, MC simulation trajectory from which the initial GA population was drawn; b, effect of population size; c, effect of joint search; d, fraction of conformations carried forward to the next generation; e, effect of side-chain free energy minimization; f, effect of ΔG [subscript illegible] standard deviation recalculation frequency. Simulations were performed relative to a standard set of parameters: population size, 50; number of generations, 60; number of [illegible] applied to joint, 50; [illegible] recalculation frequency, 60; fraction of conformations carried forward to the next generation, [value illegible]; rounds of side-chain minimization, 50; T, 300 K. The key in each figure shows the parameter values used.
