Date post: | 30-Jun-2015 |
(an example of)
Computing the Microbial World
Rob Beiko
June 25, 2014
Siddique et al. (2014) Front Microbiol
Lawley et al., PLoS Genet (2012)
The Breakfast Organisms
"Bacon Fields", author: Michael DeForge
240M “pieces”, each 150 nucleotides long
3.6 x 10^10 nucleotides
~40 GB
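The totals above follow from simple arithmetic. A quick check, assuming one byte per nucleotide (the gap between the raw figure and the ~40 GB on the slide is presumably file overhead such as read headers, which is an assumption here):

```python
# Back-of-the-envelope check of the data volume on the slide.
reads = 240_000_000          # 240M sequenced "pieces"
read_length = 150            # nucleotides per piece

total_nucleotides = reads * read_length

# Assumption: one byte per nucleotide, no headers or quality scores.
raw_gigabytes = total_nucleotides / 1e9

print(f"{total_nucleotides:.1e} nucleotides, about {raw_gigabytes:.0f} GB raw")
```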
Hundreds of “species”
Genomes between 1.5M – 6M nucleotides
150 nt x 150 nt
We know this
And this
But not this
who is doing what?
Marker genes: WHO
Environmental “Shotgun”: WHAT
The challenge of METAGENOME CLASSIFICATION
Clues – Sequence similarity (homology)
150 nt x 150 nt
Reference genes
Take the WHOLE SEQUENCE
Best
Worst
Clues – composition
150 nt x 150 nt
Reference genome
k-mer profiles
Genome #1: 20% G & C, 30% A & T
Genome #2: 24% G & C, 26% A & T
Best
Worst
Take a K-MER FREQUENCY
DECOMPOSITION
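A k-mer frequency decomposition can be sketched in a few lines. This is a minimal illustration, not the code used in the talk; the function name `kmer_profile` is made up for this example. With k = 1 it reduces to the base composition (G & C versus A & T) shown on the previous slide.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Return the relative frequency of every k-mer observed in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # An empty Counter (seq shorter than k) yields an empty profile.
    return {kmer: n / total for kmer, n in counts.items()}

# k = 1 gives the base composition; larger k gives a richer signature.
base_composition = kmer_profile("GGCTGGACCA", 1)
dinucleotides = kmer_profile("GGCTGGACCA", 2)
```

Longer k-mers are more specific to a genome but need more sequence to estimate reliably, which matters for 150 nt fragments.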
Homology >> Composition
* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
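To make the homology comparison concrete, the sketch below counts mismatches between the three sequences on the slide. It assumes that the sequence marked * is the query and that 1 and 2 are candidate matches; that reading of the slide, and the helper name `mismatches`, are assumptions.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

query = "GGCTGGACCA"
candidates = {"1": "GACTGGACCA", "2": "GGCCGGACTA"}

# The candidate with the fewest mismatches is the closest by this measure.
distances = {name: mismatches(query, seq) for name, seq in candidates.items()}
closest = min(distances, key=distances.get)
```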
But homology evidence can mislead or be absent
Homology + Composition > Homology alone
Query: GGCTGGACCA
Subject: GCCTGGTCCA, GCCAGGTGCA, GCCTGTCCA (each followed by a run of N)
Exact string search? NO
BLAST? OK, but SLOW!
A compromise: UBLAST
• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search
• Typically, a query is compared against all candidate DB sequences, but most will return no hits
UBLAST:
Query: GGCTGGACCA
DB sequences: GCCTGTCCA, GCCAGGTGCA, GCCTGGTCCA (each followed by a run of N)
(1) Query, DB sequences
(2) k-mer table
(3) Rank DB based on k-mer matching
(4) Do detailed search until there is no more point (X)
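The four steps can be sketched as a toy filter. This is not the UBLAST implementation: the function names, the ungapped `identity` stand-in for the detailed search, and the `min_identity` and `max_rejects` parameters are all assumptions made for illustration. The idea it shows is that ranking by shared k-mers lets the search stop early instead of comparing the query against every DB sequence.

```python
from collections import defaultdict

def kmers(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_kmer_table(db, k):
    """Step 2: map each k-mer to the names of the DB sequences containing it."""
    table = defaultdict(set)
    for name, seq in db.items():
        for word in kmers(seq, k):
            table[word].add(name)
    return table

def rank_candidates(query, table, k):
    """Step 3: order DB sequences by the number of k-mers shared with the query."""
    shared = defaultdict(int)
    for word in kmers(query, k):
        for name in table.get(word, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))

def identity(a, b):
    """Toy stand-in for the detailed search: ungapped identity from position 0."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def search(query, db, k=3, min_identity=0.75, max_rejects=1):
    """Step 4: examine candidates in rank order, stopping after max_rejects failures."""
    table = build_kmer_table(db, k)
    hits, rejects = [], 0
    for name in rank_candidates(query, table, k):
        score = identity(query, db[name])
        if score >= min_identity:
            hits.append((name, score))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # no more point in continuing down the ranking
    return hits

# Step 1: the query and DB sequences from the slide (labels a, b, c are made up).
db = {"a": "GCCTGTCCA", "b": "GCCAGGTGCA", "c": "GCCTGGTCCA"}
hits = search("GGCTGGACCA", db)
```

With these settings the search accepts the top-ranked sequence, rejects the second, and never examines the third.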
Compositional models
• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies
• One model per genome
• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence
• Model construction can take a while
k = 4 k = 5 k = 6 k = 7
PhymmBL: Brady and Salzberg (2009) Nat Methods
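The sketch below illustrates the general idea of conditioning each base on the longest preceding context that was seen often enough in the genome. It is a simplified back-off model, not the interpolated Markov model used in PhymmBL, which weights several context lengths together. The function names, `min_count`, and `pseudocount` are assumptions chosen for this example.

```python
import math
from collections import defaultdict

def train_context_counts(genome, max_order):
    """Count which base follows every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for order in range(max_order + 1):
        for i in range(order, len(genome)):
            counts[genome[i - order:i]][genome[i]] += 1
    return counts

def base_probability(counts, context, base, min_count=4, pseudocount=1.0):
    """P(base | context), shortening the context until it has enough support."""
    while context and sum(counts.get(context, {}).values()) < min_count:
        context = context[1:]  # drop the oldest base and back off
    seen = counts.get(context, {})
    total = sum(seen.values())
    # Pseudocounts over the 4 bases keep unseen events from scoring zero.
    return (seen.get(base, 0) + pseudocount) / (total + 4 * pseudocount)

def log_likelihood(counts, fragment, max_order):
    """Sum of log-probabilities of each base given its preceding context."""
    score = 0.0
    for i, base in enumerate(fragment):
        context = fragment[max(0, i - max_order):i]
        score += math.log(base_probability(counts, context, base))
    return score

# One model per genome; toy genomes stand in for real references.
max_order = 3
model_at = train_context_counts("ATATTAAT" * 50, max_order)
model_gc = train_context_counts("GCGGCCGC" * 50, max_order)
fragment = "ATTAATAT"
score_at = log_likelihood(model_at, fragment, max_order)
score_gc = log_likelihood(model_gc, fragment, max_order)
```

Training touches every position of every genome once per context length, which hints at why model construction can take a while on real references.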
An alternative: Naïve Bayes
• Just compute the frequency of each k-mer for a fixed length k
• Build one frequency model for each genome
• FAST
• Assumes conditional independence – may not matter
Probability of a query fragment originating from genome Gi:
for all k-mers in the fragment, take the product of the frequency of that k-mer in Gi
Parks et al. (2011) BMC Bioinformatics
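Under the conditional-independence assumption, the score for a fragment F against genome Gi is the product of the frequencies in Gi of the k-mers in F, usually computed as a sum of logs to avoid underflow. A minimal sketch, assuming a pseudocount for unseen k-mers and skipping k-mers that contain ambiguous bases (both choices, and all function names, are assumptions rather than details from the cited paper):

```python
import math
from collections import Counter
from itertools import product

def kmer_model(genome, k, pseudocount=1.0):
    """Frequency of every possible k-mer over ACGT in one genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}

def log_score(fragment, model, k):
    """log P(fragment | genome): sum of log k-mer frequencies."""
    words = (fragment[i:i + k] for i in range(len(fragment) - k + 1))
    # k-mers with ambiguous bases (e.g. N) are not in the model and are skipped.
    return sum(math.log(model[w]) for w in words if w in model)

def classify(fragment, models, k):
    """Return the name of the genome model with the highest log score."""
    return max(models, key=lambda name: log_score(fragment, models[name], k))

# One frequency model per genome; toy genomes stand in for real references.
k = 3
models = {
    "at_rich": kmer_model("ATATTAAT" * 50, k),
    "gc_rich": kmer_model("GCGGCCGC" * 50, k),
}
prediction = classify("TTAATATA", models, k)
```

Building a model is a single counting pass per genome, which is why this approach is fast compared with adaptive models.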
RITA: Rapid Identification of Taxonomic Assignments
UBLAST filter
MacDonald et al. (2012) Nucleic Acids Res
Evaluation set
• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials
• Build reference models from other genomes – can leave close relatives out of reference model
  • Leave out other strains within the same species – not so hard
  • Leave out other classes in the same phylum – HARD
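The evaluation design can be sketched as two small helpers: one that samples fixed-length fragments from known genomes, and one that builds a reference set with relatives of the source genome left out. The genome names, the `taxonomy` dictionary layout, and the function names are hypothetical; this is an illustration of the design, not the evaluation code behind the results.

```python
import random

def sample_fragments(genomes, length, n, seed=0):
    """Randomly sample n fragments of the given length, labelled by source genome."""
    rng = random.Random(seed)
    eligible = [name for name, seq in genomes.items() if len(seq) >= length]
    if not eligible:
        raise ValueError("no genome is long enough for this fragment length")
    fragments = []
    for _ in range(n):
        name = rng.choice(eligible)
        seq = genomes[name]
        start = rng.randrange(len(seq) - length + 1)
        fragments.append((name, seq[start:start + length]))
    return fragments

def reference_set(genomes, taxonomy, source, rank):
    """Drop every genome that shares `rank` with the source genome."""
    excluded = taxonomy[source][rank]
    return {name: seq for name, seq in genomes.items()
            if taxonomy[name][rank] != excluded}

# Toy data: three made-up genomes with a hypothetical taxonomy.
genomes = {"g1": "ACGT" * 100, "g2": "GGCC" * 100, "g3": "ATAT" * 100}
taxonomy = {
    "g1": {"species": "s1", "class": "c1"},
    "g2": {"species": "s2", "class": "c1"},
    "g3": {"species": "s3", "class": "c2"},
}

fragments = sample_fragments(genomes, length=50, n=20)
easy_reference = reference_set(genomes, taxonomy, "g1", "species")  # same species removed
hard_reference = reference_set(genomes, taxonomy, "g1", "class")    # same class removed
```

Repeating the sampling with lengths of 50, 100, 200 and 1000 nt gives the separate trials described above.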
But does it work?
[Figure: results by DNA sequence length. Series: Full RITA; Best class (homology and composition agree). Panels: Predicting genus from different species; Predicting phylum from different class.]
Conclusions
• Careful attention needs to be paid to the choice of approach – simple is better
• RITA illustrates two key points in (microbial) bioinformatics:
1. Homology: How heuristic are you willing to go?
2. Naïve Bayes: Keep it simple until told otherwise
• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years
FIN