
1 Genome sizes (sample). 2 Some genomics history 1995: first bacterial genome, Haemophilus...

Date post: 04-Jan-2016
Upload: christian-crawford
Transcript
Page 1: 1 Genome sizes (sample). 2 Some genomics history 1995: first bacterial genome, Haemophilus influenza, 1.8 Mbp, sequenced at TIGR first use of whole-genome.

Genome sizes (sample)

Page 2:

Some genomics history

• 1995: first bacterial genome, Haemophilus influenzae, 1.8 Mbp, sequenced at TIGR
  • first use of whole-genome shotgun for a bacterium
  • Fleischmann et al. 1995 became most-cited paper of the year
  • 2869 citations to date
• 1995-6: 2nd and 3rd bacteria published by TIGR: Mycoplasma genitalium, Methanococcus jannaschii
• 1996: first eukaryote, S. cerevisiae (yeast), 13 Mbp, sequenced by a consortium of (mostly European) labs
• 1997: E. coli finished (7th bacterial genome)
• 1998-2001: T. pallidum (syphilis), B. burgdorferi (Lyme disease), M. tuberculosis, Vibrio cholerae, Neisseria meningitidis, Streptococcus pneumoniae, Chlamydia pneumoniae [all at TIGR]
• 2000: fruit fly, Drosophila melanogaster
• 2000: first plant genome, Arabidopsis thaliana
• 2001: human genome, first draft
• 2002: malaria genome, Plasmodium falciparum
• 2002: anthrax genome, Bacillus anthracis
• TODAY (Sept 4, 2008):
  • 744 complete microbial genomes!
  • 1199 microbial genomes in progress!
  • 476 eukaryotic genomes in progress!


Page 4:

New directions: sequencing ancient DNA

(some assembly required)

Page 5:

J. P. Noonan et al., Science 309, 597-599 (2005)

Page 6:

J. P. Noonan et al., Science 309, 597-599 (2005). Published by AAAS.

Fig. 1. Schematic illustration of the ancient DNA extraction and library construction process

Page 7:

J. P. Noonan et al., Science 309, 597-599 (2005). Published by AAAS.

Fig. 2. Characterization of two independent cave bear genomic libraries. Predicted origins of 9035 clones from library CB1 (A) and 4992 clones from library CB2 (B) are shown, as determined by BLAST comparison to GenBank and environmental sequence databases. "Other" refers to viral or plasmid-derived DNAs. Distributions of sequence annotation features in 6,775 nucleotides of carnivore sequence from library CB1 (C) and 20,086 nucleotides of carnivore sequence from library CB2 (D) are shown, as determined by alignment to the July 2004 dog genome assembly.


Page 10:

H. N. Poinar et al., Science 311, 392-394 (2006). Published by AAAS.

Fig. 1. Characterization of the mammoth metagenomic library, including percentage of read distributions to various taxa


Page 12:

Journals

The very best:
• Science: www.sciencemag.org
• Nature: www.nature.com/nature
• PLoS Biology: www.plosbiology.org

Page 13:

Bioinformatics Journals

• Bioinformatics: bioinformatics.oxfordjournals.org
• BMC Bioinformatics: www.biomedcentral.com/bioinformatics
• PLoS Computational Biology: compbiol.plosjournals.org
• Journal of Computational Biology: www.liebertpub.com/cmb

Page 14:

Radically new journals

• PLoS ONE: www.plosone.org
• Biology Direct: www.biology-direct.com
  • Reviewers’ comments are public

Both journals can be annotated by readers.

Papers can be negative results, confirmations of other results, or brand new.

Page 15:

Genomics Journals

• Genome Biology: genomebiology.com
• Genome Research: www.genome.org
• Nucleic Acids Research: nar.oxfordjournals.org
• BMC Genomics: www.biomedcentral.com/bmcgenomics

Page 16:

Before assembly…

… we need to cover a basic sequence alignment algorithm

Page 17:

Sequence Alignment

When we have very similar sequences:
• Closely related species
• Very little changed sequence
• Small differences can be very important
• Computationally “easy” to align
• Assembly ONLY deals with these

When sequences are not so similar:
• Distantly related species
• Most positions changed
• Sequences that are most highly conserved are under the strongest selective (evolutionary) pressure.
  – E.g., some genes in humans and E. coli clearly have a common ancestor, and the proteins can be aligned
• Computationally “difficult” to align

Page 18:

Sequence Alignment

Algorithms for sequence alignment
• Choose best alignment, subject to some mutation model.
• A common (but overly simplistic) model for DNA mutations is called “edits”, which counts the number of substitutions, insertions and deletions.
• The resulting alignment suggests a possible “history” for the sequence.

This slide and subsequent alignment slides courtesy of Nathan Edwards, available at www.umiacs.umd.edu/~nedwards/teaching/CMSC858E_Fall_2005/

Page 19:

Example Alignments

ACGTCTAG
||*****^
ACTCTAG-

2 matches, 5 mismatches, 1 not aligned

Page 20:

Example Alignments

ACGTCTAG
^**|||||
-ACTCTAG

5 matches, 2 mismatches, 1 not aligned

Page 21:

Example Alignments

ACGTCTAG
||^|||||
AC-TCTAG

7 matches, 0 mismatches, 1 not aligned

Edit distance here = 1
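The counts on the three example slides can be checked mechanically. The sketch below is an illustration only: the function names `alignment_stats` and `alignment_cost` are made up for this example, and it assumes "-" marks an unaligned (gap) position, as in the slides.

```python
def alignment_stats(top: str, bottom: str, gap: str = "-") -> dict:
    """Count match, mismatch and gap columns of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    stats = {"matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            stats["gaps"] += 1          # column marked '^' on the slides
        elif a == b:
            stats["matches"] += 1       # column marked '|'
        else:
            stats["mismatches"] += 1    # column marked '*'
    return stats


def alignment_cost(top: str, bottom: str) -> int:
    """Number of edits implied by this particular alignment (unit costs)."""
    s = alignment_stats(top, bottom)
    return s["mismatches"] + s["gaps"]


# The three candidate alignments of ACGTCTAG against ACTCTAG from the slides.
for bottom in ("ACTCTAG-", "-ACTCTAG", "AC-TCTAG"):
    print(bottom, alignment_stats("ACGTCTAG", bottom),
          "cost =", alignment_cost("ACGTCTAG", bottom))
```

Only the third alignment has cost 1, which is why it is the one whose cost equals the edit distance.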

Page 22:

Example Alignments

...AACTGAGTTTACGCGCATAGA...
            |^^^||^|^^|
            T---CG-A--G

Many equally good alignments!

Even exact matching sequence can be found (at random) in long enough sequences
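A rough calculation shows why chance matches are unavoidable in long sequences. The sketch below assumes independent, uniformly distributed bases, which is a simplification that real genomes do not satisfy; the function name `expected_random_hits` and the 3-billion-base sequence length are illustrative choices, not taken from the slides.

```python
def expected_random_hits(k: int, length: int) -> float:
    """Expected number of exact occurrences of one fixed k-mer in a random
    DNA sequence of the given length.

    Assumption (not from the slides): each base is drawn independently and
    uniformly from {A, C, G, T}, so any fixed k-mer matches a given start
    position with probability 4**-k.
    """
    if k > length:
        return 0.0
    start_positions = length - k + 1
    return start_positions / 4 ** k


# An 11-base query against a sequence of about 3 billion bases:
# hundreds of exact matches are expected by chance alone.
print(expected_random_hits(11, 3_000_000_000))
# A 20-base query is far less likely to match by chance.
print(expected_random_hits(20, 3_000_000_000))
```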

Page 23:

Global Alignment problem

Given two related sequences, S (length n) and T (length m), find an alignment of S and T.

Edit distance: minimum number of substitutions, insertions and deletions.

Page 24:

Dynamic Programming for pairwise alignment

Page 25:

Dynamic Programming Formulation

Definition: Let D(i,j) be the edit distance of the alignment of S[1...i] and T[1...j].

Edit distance of S and T, then, is D(n,m).

Dynamic programming solves the global alignment problem by computing D(i,j) for all i=0...n and j=0...m.

Page 26:

Recurrence Relation for D

Computation of D is a recursive/iterative process.
• D(i,j) is defined in terms of D(i’,j’) for i’ ≤ i and j’ ≤ j, with (i’,j’) ≠ (i,j).

Base conditions for D(i,j):
• D(i,0) = i, for all i = 0,...,n
• D(0,j) = j, for all j = 0,...,m

Page 27:

Recurrence relation for D

For i > 0, j > 0:

D(i,j) = min { D(i-1,j) + 1,
               D(i,j-1) + 1,
               D(i-1,j-1) + δ(S(i),T(j)) }

where δ(a,b) = 0 if a = b and 1 otherwise.
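As a minimal sketch, the recurrence and its base conditions can be transcribed almost literally into a recursive function. The names `delta` and `edit_distance_recursive` are chosen for this example, and δ is taken to be the unit substitution cost (0 for a match, 1 otherwise). Python strings are 0-based, so S(i) on the slide is `S[i - 1]` here.

```python
def delta(a: str, b: str) -> int:
    """Unit substitution cost: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1


def edit_distance_recursive(S: str, T: str, i: int = None, j: int = None) -> int:
    """D(i, j): edit distance between the prefixes S[1..i] and T[1..j]."""
    if i is None:
        i = len(S)
    if j is None:
        j = len(T)
    # Base conditions: aligning a prefix against the empty string.
    if i == 0:
        return j
    if j == 0:
        return i
    return min(
        edit_distance_recursive(S, T, i - 1, j) + 1,   # S(i) left unaligned
        edit_distance_recursive(S, T, i, j - 1) + 1,   # T(j) left unaligned
        edit_distance_recursive(S, T, i - 1, j - 1)    # S(i) paired with T(j)
        + delta(S[i - 1], T[j - 1]),
    )


print(edit_distance_recursive("ACGTCTAG", "ACTCTAG"))
```

This version is only practical for short strings, because the same D(i,j) is recomputed many times.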

Page 28:

Dynamic programming

D(i,j) is computed by optimally solving sub-problems

The optimal solution to D(i,j) is a simple combination of optimally solved subproblems: add the cost of one step to each smaller subproblem and take the minimum

Page 29:

Using the recurrence

We could code this as a recursive function call...
• ...but an exponential number of function evaluations
  – each position explores 3 alternatives

There are only (n+1)x(m+1) pairs i and j
• We must be evaluating D(i,j) multiple times
• Why not cache the results?
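The effect of caching can be measured directly. The sketch below counts how often the body of D runs with and without a cache; `count_evaluations` is a name made up for this example, and `functools.lru_cache` is just one convenient way to cache results, not necessarily what the lecture had in mind.

```python
from functools import lru_cache


def count_evaluations(S: str, T: str, use_cache: bool):
    """Return (edit distance, number of times the body of D was executed)."""
    evaluations = 0

    def D(i: int, j: int) -> int:
        nonlocal evaluations
        evaluations += 1
        if i == 0:
            return j
        if j == 0:
            return i
        sub = 0 if S[i - 1] == T[j - 1] else 1
        return min(D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + sub)

    if use_cache:
        # Rebinding the name D makes the recursive calls above go through
        # the cache too, so each (i, j) pair is evaluated at most once.
        D = lru_cache(maxsize=None)(D)

    return D(len(S), len(T)), evaluations


S, T = "ACGTCTAG", "ACTCTAG"
print("no cache:  ", count_evaluations(S, T, use_cache=False))
print("with cache:", count_evaluations(S, T, use_cache=True))
```

With the cache, the number of evaluations is exactly (n+1)(m+1), one per table cell; without it the count grows exponentially with the sequence lengths.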

Page 30:

Using the recurrence

Compute D(i,j) bottom up.

Store the intermediate results in a table (the table we already saw).

Start with smallest (i,j) = (1,1).

Compute D(i,j) after D(i-1,j), D(i,j-1), and D(i-1,j-1) have been determined.

(n+1)(m+1) cells to fill, so O(nm) time.
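A bottom-up fill might look like the following sketch. The function name `edit_distance_table` is an illustrative choice, and the table is stored as a list of rows with rows indexed by i (positions in S) and columns by j (positions in T). Filling row by row guarantees that the three neighbouring cells are ready before D(i,j) is computed.

```python
def edit_distance_table(S: str, T: str) -> list:
    """Fill the (n+1) x (m+1) table D bottom up and return it."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: first column and first row.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j

    # Row-major order: D[i-1][j], D[i][j-1] and D[i-1][j-1] are already set.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + sub)
    return D


table = edit_distance_table("ACGTCTAG", "ACTCTAG")
for row in table:
    print(" ".join(f"{cell:2d}" for cell in row))
print("edit distance:", table[-1][-1])
```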

Page 31:

Traceback

Our dynamic programming table helps us compute the edit distance “score”

We need the actual alignment corresponding to this edit distance

The corresponding alignment can be read off, by doing a little extra accounting.

Page 32:

Traceback

If D(i,j) == D(i-1,j) + 1, then Pointer(i,j) = (i-1,j)

If D(i,j) == D(i,j-1) + 1, then Pointer(i,j) = (i,j-1)

If D(i,j) == D(i-1,j-1) + δ(S(i),T(j)), then Pointer(i,j) = (i-1,j-1)

Break ties arbitrarily, or keep multiple pointers

Page 33:

Traceback

Follow the pointers from cell (n,m).

Any path to (0,0) corresponds to the (reverse of the) edits of an optimal alignment
• “horizontal” pointers: insertion in S
• “vertical” pointers: insertion in T
• “diagonal” pointers: match or substitution

Once the table is filled, an optimal alignment can be found in O(n+m) time.
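The traceback can be sketched as follows. This version does not store pointers; it re-derives them from the filled table while walking back from (n,m). The function name `align`, the "-" gap symbol, and the tie-breaking order (diagonal first, then vertical, then horizontal) are arbitrary choices made for this example.

```python
def align(S: str, T: str):
    """Return (edit distance, aligned S, aligned T) for one optimal alignment."""
    n, m = len(S), len(T)

    # Fill the table bottom up (rows index S, columns index T).
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + sub)

    # Walk back from (n, m) to (0, 0); each step emits one alignment column.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and D[i][j] == D[i - 1][j - 1] + (0 if S[i - 1] == T[j - 1] else 1)):
            top.append(S[i - 1]); bottom.append(T[j - 1])   # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(S[i - 1]); bottom.append("-")        # vertical pointer
            i -= 1
        else:
            top.append("-"); bottom.append(T[j - 1])        # horizontal pointer
            j -= 1

    # Columns were collected from the end of the alignment, so reverse them.
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, a, b = align("ACGTCTAG", "ACTCTAG")
print(distance)
print(a)
print(b)
```

The walk takes at most n+m steps because every step decreases i, j, or both.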

Page 34:

Original references

T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. Molecular Biology (1981), 147(1):195-7.

S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Molecular Biology (1990), 215(3):403-10.

- 24,113 citations!

