
Phylogenetic trees School B&I TCD Bioinformatics May 2010.

Date post: 31-Dec-2015
Upload: brett-robinson
Page 1: Phylogenetic trees School B&I TCD Bioinformatics May 2010.

Phylogenetic trees

School B&I TCD Bioinformatics

May 2010


Why do trees?

 

Phylogeny 101

• OTUs: operational taxonomic units (species, populations, individuals)

• Nodes, internal: often ancestors

• Nodes, external: terminal, often living species or individuals

• Branches, length scaled: length proportional to evolutionary distance

• Branches, length unscaled: nominal, arbitrary

• Outgroup: an OTU that is most distantly related to all the other OTUs in the study

• Choose outgroup carefully

Phylogeny 102

• Trees rooted: N = (2n-3)! / (2^{n-2} (n-2)!)

• Trees unrooted: N = (2n-5)! / (2^{n-3} (n-3)!)

OTUs   #rooted trees   #unrooted trees
2      1               1
3      3               1
4      15              3
5      105             15
6      945             105
7      10395           945
8      135135          10395
9      2027025         135135
10     34459425        2027025
20     8.2*10^21       2.2*10^20
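The two formulas can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the unrooted formula is evaluated for n >= 3 (for n = 2 there is trivially one tree).

```python
from math import factorial

def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

# Print a few rows of the table: OTUs, rooted, unrooted
for n in (3, 4, 5, 6, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the number of unrooted trees for n OTUs equals the number of rooted trees for n-1 OTUs, which is why the two table columns are shifted copies of each other.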

Four key aspects of tree

• Topology

• Branch lengths

• Root

• Confidence

[Figure: trees of four OTUs (A, B, C, D): a basic tree plus panels labelled Topology, Branch lengths, Root and Confidence; node values 100 and 78 appear in the figure]

Distances from sequence

• Use Phylip Protdist or DNAdist

• D = non-identical residues / total sequence length

• Correction for multiple hits is necessary because a site can change more than once, so D undercounts the substitutions that actually occurred

• Jukes-Cantor assumes all substitutions equally likely

• Kimura: transition rate not equal to transversion rate

• Ts usually > Tv

[Figure: bases G, A, A illustrating multiple hits at one site]
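As a rough illustration of these distances, the sketch below computes D and the standard Jukes-Cantor and Kimura two-parameter corrections for two short DNA sequences. It assumes equal-length, gap-free, upper-case sequences; the function names are invented here, and this is not a reimplementation of Phylip DNAdist.

```python
from math import log, sqrt

PURINES = {"A", "G"}

def p_distance(s1: str, s2: str) -> float:
    """D = non-identical residues / total sequence length."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jukes_cantor(s1: str, s2: str) -> float:
    """Jukes-Cantor distance: all substitutions assumed equally likely."""
    p = p_distance(s1, s2)
    return -0.75 * log(1 - 4 * p / 3)

def kimura_2p(s1: str, s2: str) -> float:
    """Kimura two-parameter distance: transitions and transversions counted separately."""
    n = len(s1)
    # Transition: both bases are purines or both are pyrimidines
    ts = sum(a != b and (a in PURINES) == (b in PURINES) for a, b in zip(s1, s2))
    # Transversion: one purine, one pyrimidine
    tv = sum(a != b and (a in PURINES) != (b in PURINES) for a, b in zip(s1, s2))
    P, Q = ts / n, tv / n
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

seq1 = "AAGAGTGCA"
seq2 = "AGCCGTGCG"
print(p_distance(seq1, seq2), jukes_cantor(seq1, seq2), kimura_2p(seq1, seq2))
```

Both corrected distances come out larger than the raw D, which is the point of correcting for multiple hits.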

Methods

• Distance matrix
  – UPGMA
  – Neighbour joining (NJ)

• Maximum parsimony (MP)
  – Tree requiring fewest changes

• Maximum likelihood (ML)
  – Most likely tree

• Bayesian: sort of ML
  – Samples large number of “pretty good” trees

Trees: NJ

• Distance matrix

• Neighbour joining is very fast

• Often a “good enough” tree

• Embedded in ClustalW
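To give a feel for why neighbour joining is fast, here is a sketch of its first step only: choosing which pair of OTUs to join using the usual Q criterion. The distances are made up for illustration, the function name is hypothetical, and this is not ClustalW's code. A full NJ run would replace the chosen pair with a new node, recompute distances and repeat.

```python
from itertools import combinations

def nj_pick_pair(names, dist):
    """Return the pair that minimises the neighbour-joining Q criterion, and its Q value."""
    n = len(names)
    # Sum of distances from each OTU to all the others
    row_sum = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}
    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * dist[frozenset((a, b))] - row_sum[a] - row_sum[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair, best_q

# Made-up pairwise distances for five OTUs (illustration only)
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(k): v for k, v in pairs.items()}

pair, q = nj_pick_pair(names, dist)
print(pair, q)  # ('a', 'b') -50
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the Q criterion also accounts for how far each OTU is from everything else.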

Trees: MP

• Maximum parsimony

• Minimum # mutations to construct tree

• Better than NJ (information is lost in the distance matrix) but much slower

• Sensitive to long-branch attraction
  – Long branches clustered together

• No explicit evolutionary model

• Protpars refuses to estimate branch lengths

• Informative sites

Long-branch attraction

Rodents evolve faster than primates.

[Figure: the true tree and the false “LBA” tree for MusHBA, MusHBB, HumHBA and HumHBB]

Maximum parsimony

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
              *   *   *

Here we have a representative alignment. It is a good alignment, clearly aligning homologous sites without gaps. We want to determine the phylogenetic relationships among the OTUs.

There are 3 possible trees for 4 taxa (OTUs):

1       3     1       2     1       2
 \_____/       \_____/       \_____/
 /     \       /     \       /     \
2       4     3       4     4       3

Or (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3).

Aim to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.
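A site is informative under unweighted parsimony only if at least two different states each occur in at least two OTUs. A minimal sketch of that test on the example alignment (the function name is made up for illustration):

```python
from collections import Counter

alignment = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}

def informative_sites(seqs):
    """1-based positions where at least two states each occur in at least two sequences."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(position)
    return sites

print(informative_sites(alignment.values()))  # [5, 7, 9]
```

This rule ignores any weighting of transitions against transversions, which can make further sites informative.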

Sites 1, 6 and 8 are identical in all OTUs, so they are useless for phylogenetic purposes.

Site 2 is also useless: OTU1’s A could be grouped with any of the Gs, so every tree needs exactly one mutation.

Site 4 is uninformative, as each OTU has a different base, UNLESS transitions are weighted, in which case it favours (1,4)(2,3).

For site 3, each tree can be made with a minimum of 2 mutations:

(1,2)(3,4)

G       A     G       A     G       A
 \     /       \     /       \     /
  G---A         C---A         A---A
 /     \       /     \       /     \
C       A     C       A     C       A

(1,3)(2,4)

G       C     can do worse:     G       C
 \     /                         \     /
  A---A                           G---A
 /     \                         /     \
A       A                       A       A

(1,4)(2,3)

G       C
 \     /
  A---A
 /     \
A       A

So site 3 is (counterintuitively) NOT informative.

Site 5, however, is informative because one tree is shorter than the other two.

(1,2)(3,4)      (1,3)(2,4)      (1,4)(2,3)

G       A       G       G       G       G
 \     /         \     /         \     /
  G---A           A---A           G---G
 /     \         /     \         /     \
G       A       A       A       A       A

Likewise sites 7 and 9 are informative. By majority rule the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites (sites 5 and 7; site 9 supports (1,3)(2,4)).
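The site-by-site counting above can be reproduced with a small sketch of Fitch-style parsimony scoring for four taxa. The helper names are invented for illustration, all changes are weighted equally, and this is not how Protpars itself is implemented.

```python
alignment = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

# The three unrooted four-taxon trees, each written as (left pair, right pair)
TREES = {
    "(1,2)(3,4)": ((1, 2), (3, 4)),
    "(1,3)(2,4)": ((1, 3), (2, 4)),
    "(1,4)(2,3)": ((1, 4), (2, 3)),
}

def fitch_join(x, y):
    """Combine two state sets; return (new set, 1 if a change is needed else 0)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)

def site_length(tree, site):
    """Minimum number of changes at one 0-based site on a four-taxon tree."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_join({alignment[a][site]}, {alignment[b][site]})
    right, cost_right = fitch_join({alignment[c][site]}, {alignment[d][site]})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle

def tree_length(tree):
    """Total number of changes over all nine sites."""
    return sum(site_length(tree, site) for site in range(9))

lengths = {name: tree_length(tree) for name, tree in TREES.items()}
print(lengths)  # {'(1,2)(3,4)': 10, '(1,3)(2,4)': 11, '(1,4)(2,3)': 12}
```

Counting total changes gives the same answer as the majority rule over informative sites: (1,2)(3,4) is the shortest tree.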

Protpars

• infile:

    8   370
BRU       MSQNSLRLVE DNSV-DKTKA LDAALSQIER
RLR       ---------- ---V-DKSKA LEAALSQIER
NGR       ---------- -MSD-DKSKA LAAALAQIEK
ECO       ---------- AIDE-NKQKA LAAALGQIEK
YPR       ---------M AIDE-NKQKA LAAALGQIEK
PSE       ---------- -MDD-NKKRA LAAALGQIER
TTH       ---------- -MEE-NKRKS LENALKTIEK
ACD       ---------- -MDEPGGKIE FSPAFMQIEG

• treefile:

(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
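The treefile is written in Newick (nested parentheses) format. As a sketch of how such a string can be read, the hypothetical parser below handles names only, with no branch lengths or node labels, and turns the string into nested tuples.

```python
def parse_newick(text):
    """Parse a Newick string with names only (no branch lengths) into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # otherwise read a leaf name
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()

def leaves(tree):
    """List the leaf names of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

tree = parse_newick("(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);")
print(leaves(tree))
# ['ACD', 'TTH', 'PSE', 'YPR', 'ECO', 'NGR', 'RLR', 'BRU']
```

For real work an established tree library is a better choice; the point here is only that the nesting of the parentheses is the tree.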

• outfile:

One most parsimonious tree found:

                   +--ACD
           +-------7
           !       +--TTH
        +--6
        !  !    +-----PSE
        !  +----5
     +--3       !  +--YPR
     !  !       +--4
     !  !          +--ECO
  +--2  !
  !  !  +-------------NGR
--1  !
  !  +----------------RLR
  !
  +-------------------BRU

remember: this is an unrooted tree!

requires a total of 853.000 steps

Clustalw

****** PHYLOGENETIC TREE MENU ******

    1. Input an alignment
    2. Exclude positions with gaps?        = ON
    3. Correct for multiple substitutions? = ON
    4. Draw tree now
    5. Bootstrap tree
    6. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu


Trees

General guidelines – NOT rules

• More data is better

• Excellent alignment = few informative sites

• Exclude unreliable data – toss all gaps?

• Use seqs/sites evolving at appropriate rate
  – Phylip DISTANCE
  – 3rd positions saturated
  – 2nd positions invariant
  – Fast evolving seqs for closely related taxa
  – Eliminate transitions (homoplasy)


• Beware base composition bias in unrelated taxa

• Are sites (hairpins?) independent?

• Are substitution rates equal across dataset?

• Long branches prone to error: remove them?
  – Choose outgroup carefully


Bootstrapping

• Random re-sampling of the data
  – with replacement

• The MSA stays the same

• Each column of aligned residues in the MSA is a “site”.

• The sites are what is re-sampled.

Bootstrap 2

• Having resampled the data
  – to get a new dataset/alignment
  – based on the original
  – the same length

• Redraw the tree from that dataset

• For each node
  – ask: is this node retained in the tree from the resampled data?

• Re-iterate 100, 1000 or 10,000 times

Bootstrap dataset

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
              *   *   *

4 OTUs and 9 “sites”
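The resampling procedure can be sketched on this 4 OTU, 9 site dataset. The code below is an illustration only: it uses simple unweighted parsimony to redraw the tree, the function names and the seed are arbitrary choices, and a replicate counts as support only when the target tree is the single shortest tree. With four OTUs each tree has exactly one internal node (split), so asking whether the node is retained is the same as asking whether the same tree is recovered.

```python
import random

alignment = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}
N_SITES = 9

TREES = {
    "(1,2)(3,4)": ((1, 2), (3, 4)),
    "(1,3)(2,4)": ((1, 3), (2, 4)),
    "(1,4)(2,3)": ((1, 4), (2, 3)),
}

def resample_sites(n_sites, rng):
    """Draw n_sites column indices at random, with replacement."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]

def site_length(tree, site):
    """Minimum number of changes at one 0-based site on a four-taxon tree."""
    (a, b), (c, d) = tree
    left = {alignment[a][site], alignment[b][site]}
    right = {alignment[c][site], alignment[d][site]}
    # One change per mixed pair, plus one if the two pairs share no state
    return (len(left) - 1) + (len(right) - 1) + (0 if left & right else 1)

def best_trees(columns):
    """Names of the tree(s) needing the fewest changes over the chosen columns."""
    lengths = {name: sum(site_length(tree, s) for s in columns)
               for name, tree in TREES.items()}
    shortest = min(lengths.values())
    return [name for name, length in lengths.items() if length == shortest]

def bootstrap_support(target, replicates=1000, seed=0):
    """Fraction of replicates in which target is the unique shortest tree."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        columns = resample_sites(N_SITES, rng)
        if best_trees(columns) == [target]:
            hits += 1
    return hits / replicates

print(best_trees(range(N_SITES)))           # tree from the real data
print(bootstrap_support("(1,2)(3,4)"))      # fraction of replicates recovering it
```

Multiplying the returned fraction by 100 gives the kind of percentage usually written on a node.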


What do the little numbers mean?

 


Why does it work?

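• The tree based on the real data is the best tree: the best estimate of what happened in evolution.

• If a node is based on many bits of info then some of these will be resampled in almost every replicate, so the node is usually retained.

• If the node is based on a single site then that site is missing from a sizeable fraction of the resampled datasets (about 37% of them for a long alignment), so the node is often lost and we are less confident in that node.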
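How often a single supporting site drops out of a resampled dataset can be worked out directly: each of the n draws misses that site with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n. A short sketch (the function name is made up for illustration):

```python
def prob_site_left_out(n_sites: int) -> float:
    """Chance that one particular site is never drawn in n_sites draws with replacement."""
    return (1 - 1 / n_sites) ** n_sites

# The 9-site example, then longer alignments
for n in (9, 100, 1000):
    print(n, round(prob_site_left_out(n), 3))
```

The value tends to 1/e (about 0.37) as the alignment gets longer, so a node resting on one site is at risk in roughly a third of replicates, while a node resting on many sites almost always keeps at least some of them.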

