
Molecular Phylogeny

Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.

One tree of life: A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,

Haeckel (1879) Pace (2001)

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.

Pre-molecular analysis: The great apes (chimpanzee, gorilla and orangutan) are separate from the human.

Molecular analysis: The chimpanzee is more closely related to the human than the gorilla is.

[Figure: two trees of chimpanzee, gorilla and orangutan, one per analysis]

What can we learn from a phylogenetic tree?

• Was the extinct quagga more like a zebra or a horse?

1. Determine the closest relatives of one organism in which we are interested

Which species are closest to human?

[Figure: trees of chimpanzee, gorilla and orangutan]

Example: Metagenomics

A new field in genomics that aims to study the genomes recovered from environmental samples.

A powerful tool to access the rich biodiversity of native environmental samples.

2. Help to find the relationship between the species and identify new species

10^6 cells/ml seawater

10^7 virus particles/ml seawater

>99% uncultivated microbes

Incredible microbial diversity in a drop of seawater

Metagenomics workflow (figure): community sample → (extraction bias) → community DNA → (cloning bias) → 3–4 kb shotgun library → paired-end sequence (F / R) → composite contig assembly

…ACGGCTGCGTTACATCGATCATTTACGAACATCGATCATTTACGATACCATTG…

From: “The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples”, Williamson et al., PLOS ONE 2008

3. Discover a function of an unknown gene or protein

[Figure: tree with leaves RBP1_HS, RBP2_pig, RBP_RAT, ALP_HS, ALPEC_BV, ALPA1_RAT and a hypothetical protein]

Relationships can be represented by a phylogenetic tree or dendrogram.

[Figure: tree with leaves A, B, C, D]

Phylogenetic Tree Terminology

• Graph composed of nodes & branches

• Each branch connects two adjacent nodes
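The two bullets above can be illustrated with a minimal Python sketch. The tree shape and the internal node names `x` and `y` are hypothetical; only the leaf labels A, B, C, D come from the slide figure.

```python
# Hypothetical unrooted tree: leaves A, B, C, D and two internal nodes x, y.
# Each branch is written as the pair of adjacent nodes it connects.
branches = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]


def build_adjacency(branches):
    """Map each node to the set of nodes it shares a branch with."""
    adjacency = {}
    for u, v in branches:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def leaves(adjacency):
    """Leaves (terminal nodes) are the nodes that touch exactly one branch."""
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)


adjacency = build_adjacency(branches)
print(leaves(adjacency))  # ['A', 'B', 'C', 'D']
```

In this representation the internal nodes are simply the nodes with more than one neighbour.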


Rooted vs. unrooted trees

• Rooted tree: based on a priori knowledge

• Un-rooted tree

[Figure: rooted and un-rooted trees of Human, Chimp, Gorilla and Chicken]

How can we build a tree with molecular data?

• Trees based on DNA sequences (rRNA)

• Trees based on protein sequences

Questions:

• Can DNA and protein sequences from the same gene produce different trees?

• Can different genes have different evolutionary histories?

• Can different regions of the same gene produce different trees?

Methods

Approach 1 - Distance methods

• Two steps:
– Compute the distance between every pair of sequences in the multiple sequence alignment (MSA).
– Find the tree that agrees most with the distance table.

• Algorithms:
– Neighbor joining
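As a sketch of the first step, the Python below builds a distance table from an alignment. The slides do not say which distance measure is used; this example assumes the simple p-distance (the fraction of gap-free aligned columns that differ), and the three toy sequences are made up for illustration.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Fraction of gap-free aligned columns at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in columns) / len(columns)


def distance_table(msa):
    """Pairwise distances for a dict {name: aligned sequence}."""
    return {(a, b): p_distance(msa[a], msa[b])
            for a, b in combinations(sorted(msa), 2)}


# Made-up toy alignment (assumption, not data from the slides).
msa = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTTC",
    "s3": "ACGAAC-TTC",
}
table = distance_table(msa)
for pair, d in table.items():
    print(pair, round(d, 3))
```

In practice a corrected distance (for example one derived from a substitution model) is often preferred over the raw p-distance. The table can be passed to a tree-building step either way.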

Approach 2 - State methods

• Algorithms:
– Maximum parsimony (MP)
– Maximum likelihood (ML)

Neighbor Joining (NJ)

• Reconstructs an unrooted tree

• Calculates branch lengths based on pairwise distances

• In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. This is done recursively until all of the nodes are paired together.

Star Structure

Assumption: Divergence of sequences is assumed to occur at a constant rate. Distance to root equals

Basic Algorithm

Initial star diagram and distance matrix:

    a  b  c  d
a   0  8  7  5
b   8  0  3  9
c   7  3  0  8
d   5  9  8  0

Choose the nodes with the shortest distance and fuse them.

Selection step

Then recalculate the distances from the remaining sequences (a and d) to the new node (e) and remove the fused nodes from the table.

[Figure: star diagram with a, d and the new node e, which fuses c and b]

    a  d  e
a   0  5  6
d   5  0  7
e   6  7  0

D(EA) = (D(AC) + D(AB) - D(CB))/2 = (7 + 8 - 3)/2 = 6

Next Step

D(ED) = (D(DC) + D(DB) - D(CB))/2 = (8 + 9 - 3)/2 = 7
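The selection and recalculation steps can be sketched in Python as follows. The function names `closest_pair` and `fuse` are illustrative. The distance table is the a, b, c, d matrix from the slides, and the update rule is the D(EA) / D(ED) formula given above.

```python
def closest_pair(dist):
    """Return the two distinct nodes with the smallest pairwise distance."""
    names = sorted(dist)
    candidates = [(dist[x][y], x, y)
                  for i, x in enumerate(names) for y in names[i + 1:]]
    _, x, y = min(candidates)
    return x, y


def fuse(dist, x, y, new):
    """Return a reduced table in which x and y are replaced by the node `new`."""
    rest = [k for k in dist if k not in (x, y)]
    reduced = {k: {m: dist[k][m] for m in rest} for k in rest}
    reduced[new] = {new: 0.0}
    for k in rest:
        # Same rule as D(EA) = (D(AC) + D(AB) - D(CB)) / 2
        d = (dist[k][x] + dist[k][y] - dist[x][y]) / 2
        reduced[k][new] = d
        reduced[new][k] = d
    return reduced


dist = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}
x, y = closest_pair(dist)            # b and c, at distance 3
reduced = fuse(dist, x, y, "e")
print(reduced["e"]["a"], reduced["e"]["d"])  # 6.0 7.0
```

Note that picking the raw minimum only matches this simplified star-diagram walkthrough. The slides state later that the pair is normally chosen from the relative distance matrix (Mij).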


In order to get a tree, un-fuse c and b by calculating their distance to the new node (e)
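The slides do not give the formula for this step. As an assumption, the sketch below uses the standard neighbor-joining branch-length estimate, which splits D(BC) between the two branches according to how much farther b is than c, on average, from the other nodes. The function name `limb_lengths` is illustrative.

```python
def limb_lengths(dist, x, y):
    """Branch lengths from x and from y to the new node that joins them.

    Assumed formula (standard neighbor joining, not stated in the slides):
    L(x) = D(x,y)/2 + (mean over other nodes k of D(x,k) - D(y,k)) / 2
    L(y) = D(x,y) - L(x)
    """
    others = [k for k in dist if k not in (x, y)]
    mean_diff = sum(dist[x][k] - dist[y][k] for k in others) / len(others)
    length_x = (dist[x][y] + mean_diff) / 2
    return length_x, dist[x][y] - length_x


dist = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}
b_to_e, c_to_e = limb_lengths(dist, "b", "c")
print(b_to_e, c_to_e)  # 2.0 1.0
```

For this matrix the two branch lengths add up to D(BC) = 3, as they should.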


Next…

D(EF) = (D(EA) + D(ED) - D(AD))/2 = (6 + 7 - 5)/2 = 4, where f is the new node that joins a and d.
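With only a, d and e left, the same three-point rule gives all three remaining branch lengths at once. A minimal sketch, assuming the reduced distances D(AD) = 5, D(EA) = 6 and D(ED) = 7 from the table above (the function name `three_point` is illustrative):

```python
def three_point(d_xy, d_xz, d_yz):
    """Lengths of the branches from x, y and z to the single node joining them."""
    length_x = (d_xy + d_xz - d_yz) / 2
    length_y = (d_xy + d_yz - d_xz) / 2
    length_z = (d_xz + d_yz - d_xy) / 2
    return length_x, length_y, length_z


# x = a, y = d, z = e; f is the internal node that joins them.
a_to_f, d_to_f, e_to_f = three_point(d_xy=5, d_xz=6, d_yz=7)
print(a_to_f, d_to_f, e_to_f)  # 2.0 3.0 4.0
```

The third value is the D(EF) of the formula above. The first two add up to D(AD) = 5.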


IMPORTANT!

• Usually we do not start from a star diagram, and in order to choose the nodes to fuse we have to calculate the relative distance matrix (Mij), which represents the relative distance of each node to all other nodes.

EXAMPLE

Original distance matrix

    A   B   C   D   E
D   7  10   7
E   6   9   6   5
F   8  11   8   9   8

Relative distance matrix (Mij)

    A      B      C      D      E
C   -11    -11
D   -10    -10    -10.5
E   -10    -10    -11    -13
F   -10.5  -10.5  -11    -11.5  -11.5

The Mij table is used only to choose the closest pair, not to calculate the distances.
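The slides do not spell out how Mij is computed. The sketch below assumes the usual neighbor-joining definition, Mij = Dij - (ri + rj)/(n - 2), where ri is the sum of all distances from node i and n is the number of nodes. This definition reproduces the D/E entry of the example: 5 - (38 + 34)/4 = -13. The code applies it to the smaller a, b, c, d matrix used earlier, and the function name is illustrative.

```python
def relative_distance_matrix(dist):
    """Mij = Dij - (ri + rj) / (n - 2) for every unordered pair of nodes."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three nodes")
    r = {k: sum(dist[k].values()) for k in dist}   # row sums
    names = sorted(dist)
    return {(x, y): dist[x][y] - (r[x] + r[y]) / (n - 2)
            for i, x in enumerate(names) for y in names[i + 1:]}


dist = {
    "a": {"a": 0, "b": 8, "c": 7, "d": 5},
    "b": {"a": 8, "b": 0, "c": 3, "d": 9},
    "c": {"a": 7, "b": 3, "c": 0, "d": 8},
    "d": {"a": 5, "b": 9, "c": 8, "d": 0},
}
m = relative_distance_matrix(dist)
lowest = min(m.values())
print(sorted(pair for pair, value in m.items() if value == lowest))
# [('a', 'd'), ('b', 'c')]
```

With four nodes the two lowest entries tie, because pairing b with c implies pairing a with d. Either choice leads to the same unrooted tree.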

Advantages and disadvantages of the neighbor-joining method

Advantages
• It is fast and thus suited for large datasets
• It permits lineages with largely different branch lengths

Disadvantages
• Sequence information is reduced
• It gives only one possible tree
