Remote RNA homology detection

Date post: 04-Feb-2016
Page 1: Remote RNA homology detection

Remote RNA homology detection

Sean R. Eddy

HHMI Janelia Farm Research Campus

Probability theory is nothing but common sense reduced to calculation. - Laplace (1819)

Page 2: Remote RNA homology detection

RNAs conserve both secondary structure and sequence

Page 3: Remote RNA homology detection

There are many RNA structures of interest

Wade Winkler and Ron Breaker, ChemBioChem 4:1024 2003

Page 4: Remote RNA homology detection

Recognizing homologous RNAs is not easy

Page 5: Remote RNA homology detection

RNAs have a lot of information in pairwise correlations

Page 6: Remote RNA homology detection

Models before algorithms: a short sermon

Prob(data | model, parameters)

Always write down the probability of everything. - Steve Gull

Probabilistic (Bayesian) inference: no arbitrary scores

DJC MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge, 2003

Page 7: Remote RNA homology detection

Single residue scores extend to pairwise residue scores

background frequencies of residue a (or residues a,b)

probability of observing residue a (or residues a,b) ‘here’ in the model (‘here’ = aligned to a particular residue, site, or state)

Steve Altschul, J. Mol. Biol. 219:555, 1991
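The scoring formula on this slide did not survive in the transcript. A minimal sketch, assuming the standard log-odds form (model probability divided by background frequency, in bits) that the two captions above describe; the function names and all numbers are hypothetical, chosen only for illustration.

```python
import math

def log_odds_single(p_a, f_a):
    """Log-odds score in bits for one residue: log2(p_a / f_a)."""
    return math.log2(p_a / f_a)

def log_odds_pair(p_ab, f_a, f_b):
    """Log-odds score in bits for a residue pair.

    Assumes the background treats the two residues as independent,
    so the null probability of the pair is f_a * f_b.
    """
    return math.log2(p_ab / (f_a * f_b))

# Hypothetical uniform RNA background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

# A single-stranded position where the model expects G half the time.
single_score = log_odds_single(0.5, background["G"])

# A base-paired position where the model expects G-C a quarter of the time.
pair_score = log_odds_pair(0.25, background["G"], background["C"])

print(single_score, pair_score)
```

The pairwise score uses the same ratio as the single-residue score; only the event being scored changes from one residue to a joint pair.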

Page 8: Remote RNA homology detection

Formal grammars as models of biological sequences

Noam Chomsky, 1958

Page 9: Remote RNA homology detection

An SCFG parse tree corresponds to an RNA structure

David Searls, Am. Scientist 80:579, 1992
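To make the parse tree / structure correspondence concrete, here is a small sketch. The grammar (pair, left, right, bifurcation, and end productions) and the tuple encoding of the tree are assumptions made for illustration, not the notation used on the slide.

```python
def render(tree):
    """Return (sequence, dot-bracket structure) for a toy SCFG parse tree.

    Node forms (hypothetical encoding):
      ("pair", x, y, child)   S -> x S y   x and y are base-paired
      ("left", x, child)      S -> x S     unpaired residue on the 5' side
      ("right", child, y)     S -> S y     unpaired residue on the 3' side
      ("bif", left, right)    S -> S S     bifurcation into two substructures
      ("end",)                S -> epsilon
    """
    kind = tree[0]
    if kind == "end":
        return "", ""
    if kind == "pair":
        _, x, y, child = tree
        seq, ss = render(child)
        return x + seq + y, "(" + ss + ")"
    if kind == "left":
        _, x, child = tree
        seq, ss = render(child)
        return x + seq, "." + ss
    if kind == "right":
        _, child, y = tree
        seq, ss = render(child)
        return seq + y, ss + "."
    if kind == "bif":
        _, left, right = tree
        lseq, lss = render(left)
        rseq, rss = render(right)
        return lseq + rseq, lss + rss
    raise ValueError(f"unknown node type: {kind}")

# A hairpin: two base pairs (G-C, C-G) closing a GAAA loop.
loop = ("left", "G", ("left", "A", ("left", "A", ("left", "A", ("end",)))))
hairpin = ("pair", "G", "C", ("pair", "C", "G", loop))
seq, ss = render(hairpin)
print(seq)
print(ss)
```

Each pair production contributes one matched pair of brackets, so nested productions give nested (nonpseudoknotted) base pairs.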

Page 10: Remote RNA homology detection

Probabilistic models of biological sequences

Goal                       HMM algorithms (sequence)   SCFG algorithms (RNA structure)
optimal alignment          Viterbi                     CYK
P(sequence | model)        Forward                     Inside
EM parameter estimation    Forward-Backward            Inside-Outside
memory complexity          O(MN)                       O(MN^2)
time complexity (general)  O(M^2 N)                    O(M^3 N^3)
time complexity (as used)  O(MN)                       O(MN^3)

• we can analyze target sequences with secondary structure models;
• but the algorithms are computationally expensive.

Richard Durbin, Sean Eddy, Graeme Mitchison, Anders Krogh, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge, 1998

Page 11: Remote RNA homology detection

Covariance models (profile SCFGs): query construction

Infernal User’s Guide, 2007: http://infernal.janelia.org/

Page 12: Remote RNA homology detection

CM is organized on a consensus ‘guide tree’

Infernal User’s Guide, 2007: http://infernal.janelia.org/

Page 13: Remote RNA homology detection

Each node contains one or more SCFG ‘states’

Infernal User’s Guide, 2007: http://infernal.janelia.org/

Page 14: Remote RNA homology detection

CM state graph looks complex but follows simple rules

Infernal User’s Guide, 2007: http://infernal.janelia.org/

Page 15: Remote RNA homology detection

Homologous structures: different parses from same model

P(seq, parse tree | CM) is the product of all transition and emission probabilities used by the parse tree

Infernal User’s Guide, 2007: http://infernal.janelia.org/
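A minimal sketch of that product, computed as a sum of logs to avoid underflow. The list-of-steps representation and the probabilities are hypothetical; non-emitting states are given an emission probability of 1.0.

```python
import math

def parse_log_prob(steps):
    """log2 P(seq, parse tree | model) for a parse given as a list of
    (transition_prob, emission_prob) pairs, one per state visited."""
    return sum(math.log2(t) + math.log2(e) for t, e in steps)

# Hypothetical three-state parse: two emitting states, then an end state.
steps = [(0.5, 0.25), (0.5, 0.5), (1.0, 1.0)]
log_p = parse_log_prob(steps)
prob = 2.0 ** log_p
print(log_p, prob)
```

Two homologous sequences that take different paths through the same model simply use different lists of steps, and so get different joint probabilities.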

Page 16: Remote RNA homology detection

Structure, parse, alignment: different views of same data

Page 17: Remote RNA homology detection

Covariance models (profile SCFGs): summary

• Query is nonpseudoknotted RNA secondary structure/sequence
  – either consensus of a multiple RNA alignment, or a single RNA structure

• Position-specific scoring parameters are derived from probabilities
  – easiest: frequencies of events in a deep alignment (maximum likelihood estimation)
  – usually: maximum a posteriori estimation from counts and a Dirichlet prior
  – in the limit of one sequence: position-independent substitution matrix

• Generative probabilistic model assigns P(seq, parse tree | CM), by factorizing into a product of transition (indel) and emission probabilities
  – affine gap (gap-open, gap-extend)
  – position-specific, but 0th order Markov: no stacking correlations

• Now, to be useful, we also need:
  – algorithm for identifying best alignment given a sequence: the CYK algorithm
  – algorithm for calculating likelihood P(seq | CM): the Inside algorithm
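The difference between the "easiest" and "usually" parameter estimates can be sketched as follows. This is an illustration under stated assumptions: a single-component Dirichlet prior (not a mixture), the posterior-mean form of the estimate (counts plus pseudocounts, renormalized) rather than a strict posterior mode, and made-up counts and prior parameters.

```python
def ml_estimate(counts):
    """Maximum likelihood: observed frequencies in the alignment column."""
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def dirichlet_mean_estimate(counts, alpha):
    """Posterior-mean estimate under a single Dirichlet prior:
    (c_a + alpha_a) / (sum of counts + sum of alphas)."""
    total = sum(counts.values()) + sum(alpha.values())
    return {a: (counts[a] + alpha[a]) / total for a in counts}

# Hypothetical column of 8 aligned residues and a hypothetical flat prior.
counts = {"A": 0, "C": 1, "G": 7, "U": 0}
alpha = {"A": 0.5, "C": 0.5, "G": 0.5, "U": 0.5}

ml = ml_estimate(counts)
smoothed = dirichlet_mean_estimate(counts, alpha)
print(ml)
print(smoothed)
```

With few sequences the prior keeps unseen residues from getting probability zero (and a log-odds score of minus infinity); with a deep alignment the counts dominate and the two estimates converge.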

Page 18: Remote RNA homology detection

Dynamic programming algorithms for SCFGs

most states:

bifurcation states:

Recursively calculates log probability that CM subgraph rooted at v generates subsequence i..j.

This is a 3D dynamic programming lattice, requiring O(ML^2) memory.

Bifurcation calculations cost O(L) per cell, and there are O(ML^2) cells in the lattice: so algorithm is O(ML^3) time worst case.

Most models have few bifurcations, so running time is between ML^2 and ML^3 in practice.
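The two recursions labelled "most states" and "bifurcation states" were images and are missing from the transcript. A hedged reconstruction of the usual CYK fill for this kind of model is given below; the symbols are assumptions, not the slide's own notation: C_v is the set of children of state v, t_v(y) a transition probability, e_v an emission probability (taken as 1 for non-emitting states, and depending only on the residue or residues that v actually emits), Delta^L_v and Delta^R_v (each 0 or 1) the number of residues v emits on the left and right, and y, z the two children of a bifurcation state.

```latex
\begin{align*}
\text{most states:}\quad
\alpha_v(i,j) &= \log e_v(x_i, x_j)
  + \max_{y \in C_v}\bigl[\log t_v(y) + \alpha_y(i+\Delta^L_v,\; j-\Delta^R_v)\bigr] \\
\text{bifurcation states:}\quad
\alpha_v(i,j) &= \max_{i-1 \le k \le j}\bigl[\alpha_y(i,k) + \alpha_z(k+1,j)\bigr]
\end{align*}
```

The maximization over the split point k in the second line is the O(L)-per-cell cost quoted above.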

Page 19: Remote RNA homology detection

3D SCFG dynamic programming lattices

answer here: root state 0, i=1, j=L (complete sequence)

initialize here: end states, length 0 subsequences on diagonal
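The fill order described above (initialize end states on the length-0 diagonal, read the answer at the root for the full sequence) can be sketched on a toy model. Everything here is an assumption made for illustration: the state encoding, the seven-state model, and the emission numbers. It uses 0-based half-open indices, so the answer sits at alpha[0][0][L] rather than at i=1, j=L, and it returns only the score, with no traceback.

```python
import math

NEG_INF = float("-inf")

def safe_log2(p):
    return math.log2(p) if p > 0 else NEG_INF

def cyk_score(seq, states, pair_emit, single_emit):
    """Best log2 P(seq, parse | model) for a toy state-based SCFG.

    states[v] = (kind, children) with kind in "S" (no emission), "P" (pair),
    "L" (left singlet), "B" (bifurcation), "E" (end). children is a list of
    (child_index, transition_prob), or a (left, right) index pair for "B".
    Children never have a smaller index than their parent.
    alpha[v][i][j] scores state v generating seq[i:j].
    """
    L, M = len(seq), len(states)
    alpha = [[[NEG_INF] * (L + 1) for _ in range(L + 1)] for _ in range(M)]
    for v in range(M - 1, -1, -1):            # children before parents
        kind, children = states[v]
        for d in range(L + 1):                # shorter subsequences first
            for i in range(L - d + 1):
                j = i + d
                if kind == "E":
                    best = 0.0 if d == 0 else NEG_INF
                elif kind == "B":
                    y, z = children
                    best = max(alpha[y][i][k] + alpha[z][k][j]
                               for k in range(i, j + 1))
                else:
                    dl = 1 if kind in ("P", "L") else 0   # emitted on the left
                    dr = 1 if kind == "P" else 0          # emitted on the right
                    if d < dl + dr:
                        best = NEG_INF
                    else:
                        if kind == "P":
                            emit = safe_log2(pair_emit(seq[i], seq[j - 1]))
                        elif kind == "L":
                            emit = safe_log2(single_emit(seq[i]))
                        else:
                            emit = 0.0
                        best = emit + max(
                            safe_log2(t) + alpha[y][i + dl][j - dr]
                            for y, t in children)
                alpha[v][i][j] = best
    return alpha[0][0][L]

WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}

def pair_emit(a, b):
    # Hypothetical: 0.22 per Watson-Crick pair, 0.01 for each of the other 12.
    return 0.22 if (a, b) in WATSON_CRICK else 0.01

def single_emit(a):
    return 0.25  # hypothetical uniform singlet emissions

# Toy model: a one-pair hairpin with a variable loop, then a 3' tail.
states = [
    ("S", [(1, 1.0)]),            # 0: root
    ("B", (2, 5)),                # 1: bifurcation
    ("P", [(3, 1.0)]),            # 2: base pair
    ("L", [(3, 0.5), (4, 0.5)]),  # 3: loop residues (self-loop)
    ("E", []),                    # 4
    ("L", [(5, 0.5), (6, 0.5)]),  # 5: tail residues (self-loop)
    ("E", []),                    # 6
]
score = cyk_score("GAAACU", states, pair_emit, single_emit)
print(score)
```

The three nested loops over v, d, and i are the O(ML^2) lattice; the extra loop over k appears only in bifurcation states, which is why running time depends on how many bifurcations the model has.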

Page 20: Remote RNA homology detection

Bacillus subtilis RNase P example (E. coli query)

Page 21: Remote RNA homology detection

C. elegans RNase P (found with human RNase P query)

Page 22: Remote RNA homology detection

The glycine riboswitch

A glycine-dependent riboswitch that uses cooperative binding to control gene expression. Mandal et al. (Breaker lab), Science 306:275 (2004)

Page 23: Remote RNA homology detection

Memory is no longer a limitation, but time is

A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure.

S.R. Eddy, BMC Bioinformatics, 3:18 (2002)

The divide and conquer algorithm (2002): Myers/Miller extended to 3D SCFG lattices. Memory requirement now O(L^2 log M)

Page 24: Remote RNA homology detection

Ways to accelerate CM searches

• custom filtering programs (tRNAscan-SE, Lowe and Eddy, 1997)

• BLAST prefilter (Rfam database does this)

• linear profile-HMM filters, including “rigorous filters” (Zasha Weinberg, Larry Ruzzo)

• extend the BLAST algorithm to 3D (Diana Kolbe, work in progress)

• more generally: banded dynamic programming (various strategies possible)

• query-dependent banding (QDB): Nawrocki and Eddy, 2007

Page 25: Remote RNA homology detection

QDB algorithm

QDB recursively calculates the probability that a subgraph rooted at state v will generate a subsequence of length d, for all v and d, using the generative model.

Subsequence lengths with negligible probability may then be ignored in DP alignment to any target sequence.

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)
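A rough sketch of the band calculation described on this slide, on a hypothetical seven-state toy model. The state encoding, the truncation length max_len, and the rule of trimming at most beta/2 of probability mass from each tail are assumptions of this sketch, not details taken from the slides.

```python
def length_distributions(states, max_len):
    """gamma[v][d] = probability that the subgraph rooted at state v
    generates a subsequence of exactly d residues (truncated at max_len)."""
    emitted = {"P": 2, "L": 1, "R": 1, "S": 0}   # residues emitted per state kind
    M = len(states)
    gamma = [[0.0] * (max_len + 1) for _ in range(M)]
    for v in range(M - 1, -1, -1):               # children before parents
        kind, children = states[v]
        for d in range(max_len + 1):             # short lengths first (self-loops)
            if kind == "E":
                gamma[v][d] = 1.0 if d == 0 else 0.0
            elif kind == "B":
                y, z = children
                gamma[v][d] = sum(gamma[y][k] * gamma[z][d - k]
                                  for k in range(d + 1))
            elif d >= emitted[kind]:
                gamma[v][d] = sum(t * gamma[y][d - emitted[kind]]
                                  for y, t in children)
    return gamma

def band(dist, beta):
    """[dmin, dmax] that drops less than beta/2 of probability from each tail
    (assumed rule; max_len must be large enough to hold nearly all the mass)."""
    dmin, left = 0, 0.0
    while dmin < len(dist) - 1 and left + dist[dmin] < beta / 2:
        left += dist[dmin]
        dmin += 1
    dmax, right = len(dist) - 1, 0.0
    while dmax > dmin and right + dist[dmax] < beta / 2:
        right += dist[dmax]
        dmax -= 1
    return dmin, dmax

# Hypothetical toy model: one base pair around a variable loop, then a tail.
states = [
    ("S", [(1, 1.0)]),            # 0: root
    ("B", (2, 5)),                # 1: bifurcation
    ("P", [(3, 1.0)]),            # 2: base pair
    ("L", [(3, 0.5), (4, 0.5)]),  # 3: loop residues (self-loop)
    ("E", []),                    # 4
    ("L", [(5, 0.5), (6, 0.5)]),  # 5: tail residues (self-loop)
    ("E", []),                    # 6
]
gamma = length_distributions(states, max_len=30)
root_band = band(gamma[0], beta=0.05)
print(root_band)
```

Because the bands come from the model alone, they can be computed once per query and reused for every target sequence; a DP fill would then only visit subsequence lengths d inside each state's band.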

Page 26: Remote RNA homology detection

Examples of QDB bands

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

Page 27: Remote RNA homology detection

CYK dynamic programming with QDB

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

Page 28: Remote RNA homology detection

One free parameter: negligible probability mass threshold

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

Page 29: Remote RNA homology detection

Four- to six-fold acceleration

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

with QDB, not far from O(MN), the same complexity as BLAST or Smith/Waterman (albeit with a big constant)

Page 30: Remote RNA homology detection

QDB has little effect on sensitivity/specificity

Page 31: Remote RNA homology detection

BRAliBase III benchmark

Eva Freyhult, Jonathan Bollback, and Paul Gardner, Genome Research 17:117 (2007)

Page 32: Remote RNA homology detection

Software integration & availability: a final sermon

http://hmmer.janelia.org/

http://infernal.janelia.org/

http://pfam.janelia.org/

http://rfam.janelia.org/

soon: first release of Easel, the code library underlying HMMER and Infernal

all freely available; currently GPL, soon to be under the (BSD-like) Janelia Software License

Page 33: Remote RNA homology detection

HHMI Janelia Farm

Now that scientific research has become a regular profession on the payroll of the state, the observer can no longer afford to concentrate for extended periods of time on one subject, and must work even harder. Gone are the days of yore...

Santiago Ramon y Cajal, 1916

http://selab.janelia.org/

Infernal, RNA homology search: Diana Kolbe, Eric Nawrocki

HMMER, protein homology search: Sergi Castellano, Alex Coventry

ncRNA genefinding: Jennifer Davila-Aponte, Seolkyoung Jung, Elena Rivas

secret agent man: Tom Jones

The Rfam Consortium, led by Sam Griffiths-Jones (Sanger, Cambridge UK)

Zasha Weinberg (Yale)
Larry Ruzzo (U Washington, Seattle)

Ron Breaker (Yale)
Norm Pace (U Colorado, Boulder)

Page 34: Remote RNA homology detection

Mixture Dirichlet priors: base pairs

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)

Page 35: Remote RNA homology detection

Mixture Dirichlet priors: singlets

Query-dependent banding (QDB) for faster RNA similarity searches. Eric Nawrocki and Sean Eddy, PLoS Computational Biology, in press (2007)
