Date post: 10-Aug-2020
Bioinformatics course Phylogeny and Comparative genomics 10/23/13 1
Page 1: bioinformatics course phylogeney and comparative genomics€¦ · • Nodes - hypothetical most recent common ancestors • Edges length - “time” from one speciation to the •

Bioinformatics course: Phylogeny and Comparative genomics

10/23/13 1


Contents: phylogeny

•  Introduction: biology, life, classification, taxonomy
•  Phylogenetics: tree of life, tree representation
•  Why study phylogeny?
•  Types of trees
•  Phylogenetic tree-building methods

Contents: comparative genomics

•  Comparative genomics: concepts
•  Why study comparative genomics?
•  What can be compared?
•  What do we learn from comparison?

Biology – The Study of Life

•  Life arose more than 3.5 billion years ago

•  First organisms (living things) were single celled

•  Single-celled organisms were the only life on Earth for well over a billion years

•  Organisms changed over time (evolved)


•  New organisms arose from older kinds

•  Today there are millions of species

•  They inhabit almost every region of Earth today


Themes of Biology

•  Cell structure and function
•  Stability and homeostasis
•  Reproduction and inheritance
•  Evolution
•  Interdependence of organisms
•  Matter, energy, and organization

Cell Structure and Function

•  The cell is the basic unit of life
•  All organisms are made of and develop from cells
•  Some organisms are composed of only a single cell (unicellular); the offspring cell is usually identical to its parent

•  Cells contain specialized structures (organelles) that carry out the cell’s life processes

•  Many different kinds of cells exist

•  All cells surrounded by a plasma membrane

•  Contain a set of instructions called DNA (genetic information)


Genotype and Phenotype


Biological classification


Genome size

THE FIRST LOOK AT A GENOME: SEQUENCE STATISTICS

Table 1.1 Some of the genomes discussed in this book, with their size and date of completion

Organism          Completion date   Size        Description
phage phiX174     1978              5,368 bp    1st viral genome
human mtDNA       1980              16,571 bp   1st organelle genome
lambda phage      1982              48,502 bp   important virus model
HIV               1985              9,193 bp    AIDS retrovirus
H. influenzae     1995              1,830 Kb    1st bacterial genome
M. genitalium     1995              580 Kb      smallest bacterial genome
S. cerevisiae     1996              12.5 Mb     1st eukaryotic genome
E. coli K12       1997              4.6 Mb      bacterial model organism
C. trachomatis    1998              1,042 Kb    internal parasite of eukaryotes
D. melanogaster   2000              180 Mb      fruit fly, model insect
A. thaliana       2000              125 Mb      thale cress, model plant
H. sapiens        2001              3,000 Mb    human
SARS              2003              29,751 bp   coronavirus

involves various phases, and is never really complete. However, most sequencing projects perform at least two steps: a first (usually simpler) analysis, aimed at identifying all of the main structures and characteristics of a genome; then a second (often more complicated) phase, aimed at predicting the biological function of these structures. The first chapters of this book present some of the basic tools that allow us to perform sequence annotation. We leave the more advanced topic of sequence assembly – the initial step of constructing the entire genome sequence that must occur before any analyses begin – to more advanced courses in bioinformatics.

Now that we have the complete genome sequences of various species, and of various individuals of the same species, scientists can begin to make whole-genome comparisons and analyze the differences between organisms. Of course the completion of the draft human genome sequence in 2001 attracted headlines, but this was just one of the many milestones of the genomic era, to be followed soon thereafter by mouse, rat, dog, chimp, mosquito, and others. Table 1.1 lists some important model organisms as well as all of the organisms used in examples throughout this book, with their completion dates and genome length (the units of length will be defined in the next section). We should stress here that there were many challenges in data storage, sharing, and management that had to be solved before many of the new analyses we discuss could even be considered.

In the rest of this chapter we begin our analysis of genomic data by reproducing some of the original analyses of the early genome papers. We continue this aim in the following chapters, providing the reader with the data, tools, and concepts necessary to repeat these landmark analyses. Before we start our first statistical examination of a complete genome; however, we will need to summarize some key biological facts about how DNA is organized in cells and the key statistical issues involved in the analysis of DNA.

Total number of eukaryotic species


Phylogenetics

•  Phylogenetics is the study of evolutionary relatedness among various groups of organisms (e.g., species, populations)

•  The results of phylogenetic analysis are usually presented as a collection of nodes and branches, that is, a tree

•  In such a tree, taxa that are closely related in an evolutionary sense appear close to each other, and taxa that are distantly related sit on different (distant) branches of the tree

Why study phylogeny?

•  Find evolutionary ties between organisms, that is, analyze the changes that occurred in different organisms during evolution

•  Find the relationship between an ancestral sequence and its descendants

•  Estimate the time of divergence between groups of organisms that share a common ancestor

A Long, Long Time Ago…

Charles Darwin (1809–1882) produced one of the first evolutionary trees of life. He was very cautious about the possibility of reconstructing the history of life.

On the Origin of Species (1859)

TREE OF LIFE


Phylogenetic trees

•  Leaves - present-day species
•  Internal nodes - hypothetical most recent common ancestors
•  Edge lengths - “time” from one speciation to the next
•  External nodes: the things under comparison; operational taxonomic units (OTUs)

[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Rooted and unrooted trees

[Figure: a rooted tree and an unrooted tree]

Common players of evolution

•  Relatedness: homology, orthology, paralogy
•  Mutations
•  SNPs
•  Duplications
•  Translocations, and many more


Kinds of mutations

De novo mutations

Single nucleotide polymorphisms (SNPs)

Gene vs species trees

[Figure: a species tree shown alongside a gene tree]

Evolution-Related Concepts

Homologs:
•  Genes that share a common ancestor and generally retain the same function

Orthologs:
•  Homologous genes in different species derived from a single ancestral gene in the last common ancestor (LCA); they arise from speciation

Paralogs:
•  Homologs in the same species related via duplication
   –  Duplication before speciation (ancient duplication): out-paralogs; may not have the same function
   –  Duplication after speciation (recent duplication): in-paralogs; likely to have the same function

Gene vs species trees

•  Species tree
   •  The genealogy of taxa, individuals of a population, etc.
   •  Nodes represent speciation or other taxonomic events.
   •  Contains sequences from orthologous genes only.

•  Gene tree
   •  Evolutionary history of the genes
   •  Nodes provide evidence for gene duplication events as well as speciation events.
   •  Contains sequences from different homologs. Subsequent analyses should cluster the orthologs, revealing their evolutionary history.

Molecular clock

•  Mutations accumulate at the same rate on all branches of the tree

•  The mutation rate is the same at all positions along the sequences

•  The molecular clock hypothesis is most suitable for closely related species


Tree building

•  Distance methods
   –  UPGMA
   –  Neighbor-joining
   –  For distances, use a matrix like PAM or BLOSUM
•  Parsimony-based methods
   –  Maximum parsimony
•  Character-based methods
   –  Maximum likelihood
   –  Bayesian inference

What do edge lengths represent?

•  In some trees edges represent time, in which case all modern sequences should be the same distance from the root.

•  Sometimes edge lengths represent the product µ·t of the rate of change µ and time t, in which case different tips can be at different distances from the root, provided that the rate has varied across the tree.

[Figure: tree with leaves Cat, Dog, Rat, Cow and edge lengths 1, 1, 2, 2, 4]

Distance matrices

•  There are many ways of building phylogenetic trees; one family of methods uses a distance matrix as a starting point.

•  A distance matrix is a table of pairwise dissimilarities, for instance:

       Cat   Dog   Rat   Cow
Cat     0     2     4     7
Dog     2     0     5     6
Rat     4     5     0     3
Cow     7     6     3     0

       A     B     C     D
B     400    -     -     -
C     300   300    -     -
D     250   150   250    -
E     250   250   500   200

Where do we get distances from?

•  Distances can be derived from multiple sequence alignments (MSAs).

•  The most basic distance is the number of sites that differ between two sequences, divided by the sequence length. These are sometimes known as p-distances.

Cat   ATTTGCGGTA
Dog   ATCTGCGATA
Rat   ATTGCCGTTT
Cow   TTCGCTGTTT

       Cat   Dog   Rat   Cow
Cat     0    0.2   0.4   0.7
Dog    0.2    0    0.5   0.6
Rat    0.4   0.5    0    0.3
Cow    0.7   0.6   0.3    0
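
As a minimal sketch of the p-distance calculation, the Python snippet below counts mismatched sites for each pair of the four aligned sequences on this slide. The function name `p_distance` is an illustrative choice, and the sketch assumes gap-free sequences of equal length.

```python
from itertools import combinations

# Aligned sequences from the slide (equal length, no gaps).
alignment = {
    "Cat": "ATTTGCGGTA",
    "Dog": "ATCTGCGATA",
    "Rat": "ATTGCCGTTT",
    "Cow": "TTCGCTGTTT",
}

def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

# Pairwise p-distances, keyed by the pair of taxon names.
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}

for (x, y), d in distances.items():
    print(f"d({x},{y}) = {d:.1f}")
```

Real alignments contain gaps and ambiguous bases, so a practical version would need a rule for those columns; that choice is left open here.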


Properties of distances

•  d(x,x) = 0
•  d(x,y) = d(y,x)
•  d(x,y) + d(y,z) >= d(x,z) (the triangle inequality)

•  The distances used in phylogenetics always have the first two properties but sometimes not the third.
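
The three properties can be checked mechanically. The sketch below is one possible way to do so in Python; the function name `check_distance_properties` and the small `bad` matrix are made up for illustration, while the Cat/Dog/Rat/Cow values are the integer matrix from the "Distance matrices" slide.

```python
from itertools import permutations

def check_distance_properties(taxa, d):
    """Report which of the three properties hold for d[x][y] over the taxa."""
    identity = all(d[x][x] == 0 for x in taxa)
    symmetry = all(d[x][y] == d[y][x] for x in taxa for y in taxa)
    triangle = all(
        d[x][y] + d[y][z] >= d[x][z] for x, y, z in permutations(taxa, 3)
    )
    return {"identity": identity, "symmetry": symmetry, "triangle": triangle}

taxa = ["Cat", "Dog", "Rat", "Cow"]
rows = [[0, 2, 4, 7], [2, 0, 5, 6], [4, 5, 0, 3], [7, 6, 3, 0]]
d = {x: dict(zip(taxa, row)) for x, row in zip(taxa, rows)}
result = check_distance_properties(taxa, d)

# A made-up matrix that breaks the triangle inequality: 1 + 1 < 5.
bad_taxa = ["x", "y", "z"]
bad = {
    "x": {"x": 0, "y": 1, "z": 5},
    "y": {"x": 1, "y": 0, "z": 1},
    "z": {"x": 5, "y": 1, "z": 0},
}
bad_result = check_distance_properties(bad_taxa, bad)

print(result)
print(bad_result)
```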


I want to build a tree - will any old distances do?

•  Not all distances will be suitable for building trees.

•  Tree-building methods do not discriminate: they will return a tree regardless of whether you give them road-map distances or distances based on a sequence alignment.

•  Some distances are perfectly “tree-like”.

Perfectly “tree-like” distances

       Cat   Dog   Rat
Dog     3
Rat     4     5
Cow     6     7     6

[Figure: tree with leaves Cat, Dog, Rat, Cow and edge lengths 1, 1, 2, 2, 4]
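
To see what "perfectly tree-like" means, the following Python sketch sums edge weights along the path between each pair of leaves and compares the result with the matrix above. The original figure did not survive extraction, so the assignment of the weights 1, 1, 2, 2, 4 to particular edges is an assumption inferred from the matrix, and the internal node names `u` and `v` are hypothetical.

```python
from itertools import combinations

# Weighted unrooted tree as an adjacency map; "u" and "v" are internal nodes.
# ASSUMPTION: which edge carries which weight is inferred from the distance
# matrix, because the original figure is not available.
tree = {
    "Cat": {"u": 1},
    "Dog": {"u": 2},
    "Rat": {"v": 2},
    "Cow": {"v": 4},
    "u": {"Cat": 1, "Dog": 2, "v": 1},
    "v": {"Rat": 2, "Cow": 4, "u": 1},
}

def path_length(tree, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]  # (node, node we came from, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, weight in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + weight))
    raise ValueError(f"{goal} is not reachable from {start}")

leaves = ["Cat", "Dog", "Rat", "Cow"]
tree_distances = {
    frozenset(pair): path_length(tree, *pair)
    for pair in combinations(leaves, 2)
}

for pair, dist in tree_distances.items():
    print(sorted(pair), dist)
```

Every path length equals the corresponding matrix entry, which is exactly what it means for the distances to fit on the tree.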


The 4-Point Condition

•  Distances that fit exactly on a tree can be characterised by a condition that must hold for every quartet of taxa i, j, k, l.

•  We write d(x,y) for the distance between x and y. Given four taxa i, j, k, l, of the three sums

   §  d(i,j) + d(k,l)
   §  d(i,k) + d(j,l)
   §  d(i,l) + d(j,k)

   the largest two are equal.

•  Distances with this property are called additive, because the weights on the paths along the tree add up to the values in the distance matrix.
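
A short Python sketch of the four-point test follows. The function names `is_additive` and `square` are illustrative; the two matrices are the tree-like Cat/Dog/Rat/Cow distances and the integer matrix from the "Distance matrices" slide.

```python
from itertools import combinations

def is_additive(taxa, d, tol=1e-9):
    """True if every quartet satisfies the four-point condition."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must be equal (within a small tolerance).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

def square(taxa, rows):
    """Turn a list of rows into a nested dict indexed by taxon name."""
    return {x: dict(zip(taxa, row)) for x, row in zip(taxa, rows)}

taxa = ["Cat", "Dog", "Rat", "Cow"]
tree_like = square(taxa, [[0, 3, 4, 6], [3, 0, 5, 7], [4, 5, 0, 6], [6, 7, 6, 0]])
not_tree_like = square(taxa, [[0, 2, 4, 7], [2, 0, 5, 6], [4, 5, 0, 3], [7, 6, 3, 0]])

print(is_additive(taxa, tree_like))      # sums are 9, 11, 11
print(is_additive(taxa, not_tree_like))  # sums are 5, 10, 12
```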


Why is this true of tree-like distances?

[Figure: the four-taxon tree with i and j on one side of the internal edge and k and l on the other, drawn three times, once for each way of pairing the taxa]

d(i,j) + d(k,l) < d(i,k) + d(j,l) = d(i,l) + d(j,k)

The two larger sums each cross the internal edge twice, while the smallest sum does not cross it at all.

Clock-like distances

•  An even stricter condition on distances is that they fit on a clock-like tree.

•  Distances with this property are called ultrametric.

[Figure: clock-like tree on taxa i, j, k drawn against a time axis]

d(i,k) = d(j,k) > d(i,j)
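
The relation above generalises to a three-point test: for every triple of taxa, the two largest of the three pairwise distances must be equal. The Python sketch below applies that test; the function name `is_ultrametric` and the numbers in `clock_like` are made up for illustration.

```python
from itertools import combinations

def is_ultrametric(taxa, d, tol=1e-9):
    """True if, for every triple, the two largest distances are equal."""
    for i, j, k in combinations(taxa, 3):
        smallest, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical clock-like distances: d(i,k) = d(j,k) = 6 > d(i,j) = 2.
clock_like = {
    "i": {"j": 2, "k": 6},
    "j": {"i": 2, "k": 6},
    "k": {"i": 6, "j": 6},
}

# The additive Cat/Dog/Rat/Cow distances are tree-like but not clock-like.
additive_only = {
    "Cat": {"Dog": 3, "Rat": 4, "Cow": 6},
    "Dog": {"Cat": 3, "Rat": 5, "Cow": 7},
    "Rat": {"Cat": 4, "Dog": 5, "Cow": 6},
    "Cow": {"Cat": 6, "Dog": 7, "Rat": 6},
}

print(is_ultrametric(["i", "j", "k"], clock_like))
print(is_ultrametric(["Cat", "Dog", "Rat", "Cow"], additive_only))
```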


Tree building - UPGMA

UPGMA works by progressively clustering the most similar taxa until all the taxa form a rooted, clock-like tree.

1.  Find the smallest entry in the distance matrix, say d(x,y).

2.  Form a new internal node, z, that is a parent to x and y, placed at height ½ d(x,y) above the leaves. When x and y are single taxa, the edges from z to x and from z to y both have length ½ d(x,y).

3.  Update the distance matrix by setting the distance from the new node z to each remaining group to the average of the distances from the members of x and y to that group.

REPEAT until all groups have been joined.


What precisely is meant by the average distance?

•  If we are joining two groups i and j that already have n_i and n_j members, we update the distance to every other group k using

$$D_{(i,j),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$$
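
The UPGMA loop with the size-weighted average can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: the function name `upgma`, the nested-parenthesis labels, and the tie-breaking rule (the first smallest pair in list order wins) are all assumptions. The input is the seven-taxon matrix A to G used in the worked example in this deck.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa with UPGMA.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    Returns the merges as (left, right, height) in the order they were made.
    """
    d = dict(dist)
    clusters = [(name, 1) for name in names]  # (label, number of taxa)
    merges = []
    while len(clusters) > 1:
        # Step 1: smallest entry; ties go to the first pair encountered.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: d[frozenset((clusters[p[0]][0], clusters[p[1]][0]))],
        )
        (a, n_a), (b, n_b) = clusters[i], clusters[j]
        # Step 2: the new node sits at height d(a,b) / 2.
        height = d[frozenset((a, b))] / 2
        new = f"({a},{b})"
        # Step 3: size-weighted average distance to every other cluster.
        for c, _ in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
                ) / (n_a + n_b)
        merges.append((a, b, height))
        clusters[i] = (new, n_a + n_b)
        del clusters[j]
    return merges

names = list("ABCDEFG")
lower = {"B": [2], "C": [4, 4], "D": [4, 4, 2], "E": [7, 7, 7, 7],
         "F": [5, 5, 5, 5, 6], "G": [8, 8, 8, 8, 9, 5]}
dist = {frozenset((row, col)): value
        for row, values in lower.items()
        for col, value in zip(names, values)}

merges = upgma(names, dist)
for left, right, height in merges:
    print(f"join {left} and {right} at height {height:.2f}")
```

The length of an edge is the height of the parent node minus the height of the child, so a different tie-breaking rule can produce a different tree from the same matrix.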


d(i,j)   A   B   C   D   E   F
A        -
B        2   -
C        4   4   -
D        4   4   2   -
E        7   7   7   7   -
F        5   5   5   5   6   -
G        8   8   8   8   9   5

Step 1 – Find the smallest entry in the distance matrix.

Step 2 – Cluster taxa A and B and form a new internal node I. Calculate the lengths of the new edges: d(A,I) = d(B,I) = ½ d(A,B) = 1.

Step 3 – Update the distance matrix: d(C,I) = ½(d(A,C) + d(B,C)) = 4, etc.

[Figure: taxa A to G, with A and B joined at node I by two edges of length 1]

d(i,j)    I (A+B)   C   D   E   F
I (A+B)   -
C         4         -
D         4         2   -
E         7         7   7   -
F         5         5   5   6   -
G         8         8   8   9   5

Step 1 – Find the smallest entry in the distance matrix.

Step 2 – Cluster taxa C and D and form a new internal node II. Calculate the lengths of the new edges: d(C,II) = d(D,II) = ½ d(C,D) = 1.

Step 3 – Update the distance matrix: d(I,II) = ½(d(I,C) + d(I,D)) = 4, d(E,II) = ½(d(E,C) + d(E,D)) = 7, etc.

[Figure: A and B joined at node I and C and D joined at node II, each by edges of length 1; E, F and G not yet clustered]

And so on...

[Figure: successive UPGMA steps forming internal nodes I to VI over taxa A, B, C, D, F, E, G; the edge labels in the final tree are 1 (six times), 0.5, 2.5, 0.9, 3.4, 0.4 and 3.8]

...until we have a rooted tree.

But is it the right tree?

[Figure: the distance matrix d(i,j) for taxa A to G, an unrooted tree with nine edges of length 1 and two of length 4, and the rooted tree built by UPGMA with internal nodes I to VI]

The tree that matches the distances is not recovered by UPGMA.

UPGMA is not consistent for additive distances.

Inconsistency


•  When a method is given “perfect” data but still gets the wrong tree it is said to be inconsistent.

•  UPGMA is inconsistent for data that isn’t ultrametric (clock-like).

•  Next we’ll look at a method that is consistent for any additive data.


Neighbor-joining (NJ)

NJ works by progressively clustering taxa until all the taxa form an unrooted tree.

1.  Rather than using the distance matrix directly to determine which taxa should be clustered at each stage, NJ uses the S matrix, where

    S(i,j) = (N-2) d(i,j) - R(i) - R(j)

    N is the number of taxa, R(i) is the sum of the ith row of the distance matrix, and R(j) is the sum of the jth row.

2.  Find the smallest entry in the S matrix, say S(x,y).

3.  Form a new internal node, z, that joins x and y, and calculate the edge lengths from z to x and from z to y:

    d(x,z) = 1/(2(N-2)) [(N-2) d(x,y) + R(x) – R(y)]
    d(y,z) = d(x,y) – d(x,z)

4.  Update the distance matrix: for every other taxon w,

    d(w,z) = ½ (d(x,w) + d(y,w) – d(x,y))

REPEAT until only two things are left to be joined.
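
One round of steps 1 to 4 can be sketched in Python as follows. The function name `nj_step`, its return shape, and the tie-breaking rule (the first smallest entry of S wins) are assumptions made for this illustration; the input is the Cat/Dog/Rat/Cow matrix from the "Perfectly tree-like distances" slide.

```python
from itertools import combinations

def nj_step(taxa, d, new_name="z"):
    """One neighbor-joining step on a symmetric distance dict d[x][y].

    Returns (x, y, d(x,z), d(y,z), reduced taxa list, reduced distances).
    Requires at least three taxa.
    """
    n = len(taxa)
    r = {i: sum(d[i][j] for j in taxa if j != i) for i in taxa}
    # Steps 1-2: build S and pick its smallest entry (first pair wins ties).
    s = {(i, j): (n - 2) * d[i][j] - r[i] - r[j]
         for i, j in combinations(taxa, 2)}
    x, y = min(s, key=s.get)
    # Step 3: edge lengths from the new node z to x and y.
    d_xz = ((n - 2) * d[x][y] + r[x] - r[y]) / (2 * (n - 2))
    d_yz = d[x][y] - d_xz
    # Step 4: distances from z to every remaining taxon w.
    rest = [w for w in taxa if w not in (x, y)]
    new_d = {w: {v: d[w][v] for v in rest if v != w} for w in rest}
    new_d[new_name] = {}
    for w in rest:
        d_wz = (d[x][w] + d[y][w] - d[x][y]) / 2
        new_d[w][new_name] = d_wz
        new_d[new_name][w] = d_wz
    return x, y, d_xz, d_yz, rest + [new_name], new_d

taxa = ["Cat", "Dog", "Rat", "Cow"]
pairs = {("Cat", "Dog"): 3, ("Cat", "Rat"): 4, ("Dog", "Rat"): 5,
         ("Cat", "Cow"): 6, ("Dog", "Cow"): 7, ("Rat", "Cow"): 6}
d = {t: {} for t in taxa}
for (a, b), value in pairs.items():
    d[a][b] = d[b][a] = value

x, y, d_xz, d_yz, taxa2, d2 = nj_step(taxa, d)
print(x, y, d_xz, d_yz)
print(d2["z"])
```

Calling `nj_step` again on `taxa2` and `d2` (with a fresh `new_name`) continues the clustering.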


NJ Example

Step 1

D =
       Cat   Dog   Rat
Dog     3
Rat     4     5
Cow     6     7     6

R(cat) = 13, R(dog) = 15, R(rat) = 15, R(cow) = 19

S =
       Cat   Dog   Rat
Dog    -22
Rat    -20   -20
Cow    -20   -20   -22

e.g. S(cat,dog) = (4-2)×3 – 13 – 15 = -22
     S(cat,rat) = (4-2)×4 – 13 – 15 = -20

Step 2

The smallest entry of S is -22, at S(cat,dog) (tied with S(rat,cow)); cat and dog are joined at a new internal node z.

Step 3

d(cat,z) = ¼ [2 d(cat,dog) + R(cat) – R(dog)] = ¼ [6 + 13 – 15] = 1

d(dog,z) = 3 – 1 = 2

[Figure: unrooted tree in which Cat and Dog are joined at node z, with Rat and Cow still to be placed]

Gene expression and evolution


Evolution of vocal learning in birds


Did the Florida Dentist infect his patients with HIV?

[Figure: phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people. Tip labels include DENTIST, Patients A to G, and Local controls 2, 3, 9 and 35.]

Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

(Two other groups in the figure are labelled “No”.)

From Ou et al., Science. 1992; 256:1165-71.

Convergent evolution/homoplasy

Ancestral character or plesiomorphic state

Major software for molecular phylogeny

PHYLIP
URL: evolution.genetics.washington.edu/phylip.html
Description: includes programs to carry out parsimony, distance matrix methods, maximum likelihood, and other methods on a variety of types of data, including DNA and RNA sequences, protein sequences, restriction sites, 0/1 discrete characters data, gene frequencies, continuous characters and distance matrices.

PAUP*
URL: paup.csit.fsu.edu/
Description: originally a parsimony program, it has become much broader with the inclusion of more methods. It includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and statistical tests.

PAML
URL: abacus.gene.ucl.ac.uk/software/paml.html
Description: a package of programs for the maximum likelihood analysis of nucleotide or protein sequences, including codon-based methods that take into account both amino acids and nucleotides. The programs can estimate phylogenetic trees by maximum likelihood and Bayesian Markov Chain Monte Carlo methods.

MrBayes
URL: morphbank.ebc.uu.se/mrbayes/
Description: a program for Bayesian inference of phylogenies from nucleic acid or protein sequences. It assumes a prior distribution of tree topologies and uses Markov Chain Monte Carlo (MCMC) methods to search tree space and infer the posterior distribution of topologies.

Tree-Puzzle
URL: www.tree-puzzle.de/
Description: a program for maximum likelihood analysis for nucleotide and amino acid alignments. TREE-PUZZLE infers phylogenies by "quartet puzzling", a method that applies maximum likelihood tree reconstruction to all possible quartets of taxa and subsequently tries to combine most of the four-taxa maximum likelihood trees to construct an overall maximum likelihood tree. A consensus tree generated from the quartet puzzling trees shows nodes that are well supported.

Comparative genomics

•  Comparative genomics studies differences between genome sequences, pinpointing changes over time.

•  Comparison of the number and type of changes against the background of expected “neutral” changes provides a better understanding of the forces that shaped genomes and traits.

•  Insights into evolution

Defining comparative genomics

“The combination of genomics data and comparative evolutionary biology to address questions of genome structure, evolution and function”

What is Comparative Genomics?

http://www.compsysbio.org

Why comparative genomics?

•  To understand the genomic basis of the present
   –  Differences in lifestyle
      •  pathogenic vs. non-pathogenic
      •  obligate vs. free-living
   –  Host specificity
   –  In the case of emerging pathogens, this understanding should help us fight disease (drug discovery, vaccines)
•  To understand the past
   –  How organisms evolved to be what they are now

What can we learn from cross species comparison?

•  Genome conservation
   •  Transfer knowledge gained from model organisms to non-model organisms
•  Genome variation
   •  Understand how genomes change over time to identify evolutionary processes and constraints
•  Detecting functional elements
   •  Identifying coding and non-coding sequences

Comparing sequences of different organisms

•  Helps in gene prediction
•  Helps in understanding evolution
•  Non-coding sequences that are conserved between species are reliable guides to regulatory elements
•  Differences between evolutionarily closely related sequences help to discover gene functions

What to compare?

•  What is the common set of proteins?
•  What sequences show a signature of purifying selection and are likely functional?
•  What sequence features are unique to individual species?

Synteny

•  Refers to regions of two genomes that show considerable similarity in terms of
   –  sequence and
   –  conservation of the order of genes
•  Such regions are likely to be related by common descent

[Figure: blocks of synteny between Organism A (genes 1a to 6a) and Organism B (genes 2b, 4b, 7b, 3b, 8b, 9b)]

Conservation highlights exons

[Figure annotations: “Novel exon?”, “Regulatory element?”]

Sequence conservation doesn’t imply function conservation

Odom D. et al. (2007); Schmidt D. et al. (2010)

Despite conservation of binding preferences and binding sites, only a small proportion of TF binding events is conserved across species

Lessons from comparative genomics

Changes of protein coding repertoires and contributions to phenotypic differences

[Figure legend: same, different, contraction, expansion]

Demuth J.P. et al. (2006)

Tools for comparative genomics

Tool name             Purpose
UCSC genome browser   Conserved regions using tracks
Ensembl               Contains information on several genomes
MUMmer                Whole-genome sequence alignment
BLAT                  BLAST-like sequence alignment tool (large genomes)
BLAST                 Alignment tools for smaller sequences
Genscan               Gene prediction tool