Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis...


Profile Hidden Markov Models

Bioinformatics, Fall 2004
Dr Webb Miller and Dr Claude Depamphilis

Dhiraj Joshi
Department of Computer Science and Engineering
The Pennsylvania State University

Outline

Introduction to HMMs
Profile HMMs
Available resources for Profile HMMs
Some online demonstrations

Introduction to HMMs: Hidden Markov Models – Formalism

HMMs are statistical techniques for modeling patterns in data.

First-order Markov property (memorylessness): the next state depends only on the current state.

$P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_0) = P(x_t \mid x_{t-1})$

A state is generally a hidden entity that emits symbols or features.

The same symbol can be emitted by several states.

An HMM is characterized by its transition probabilities and its emission distribution $e_{s_t}(a)$, the probability that state $s_t$ emits symbol $a$.
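As a rough illustration of the formalism above, the following sketch computes the joint probability of a state path and the symbols it emits. The two states `H` and `L`, the DNA alphabet, and all numbers are made-up assumptions for this example, not values from the lecture.

```python
import math

# Toy two-state HMM. State names and all probabilities are illustrative assumptions.
initial = {"H": 0.5, "L": 0.5}
transition = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emission = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def joint_log_prob(states, symbols):
    """log P(states, symbols) under the first-order Markov assumption."""
    # First position: initial state probability times its emission.
    logp = math.log(initial[states[0]]) + math.log(emission[states[0]][symbols[0]])
    # Each later position depends only on the previous state.
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        logp += math.log(transition[prev][cur]) + math.log(emission[cur][sym])
    return logp

# The same symbols can be explained by different hidden paths.
print(joint_log_prob(["H", "H"], "GC"))
print(joint_log_prob(["L", "L"], "GC"))
```

Working in log space avoids numerical underflow when sequences become long.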

Introduction to HMMs: Hidden Markov Models – Parameter Estimation

Parameters: transition probabilities and emission probabilities.

Iterative computational algorithms are used, e.g. the EM algorithm and the Viterbi algorithm.

These algorithms are based on dynamic programming to save computational cost.

The iterations usually involve variants of the following two steps:

(1) estimate the state sequence that maximizes the likelihood under the current parameter set
(2) update the parameter set based on the estimated state sequence

The algorithms may converge to a local optimum rather than the global one.
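The first of the two steps above, finding the most likely state sequence under a fixed parameter set, can be sketched with a log-space Viterbi recursion. This is a minimal sketch for a generic HMM, not the implementation of any particular package; the toy states `H` and `L` and their probabilities are assumptions chosen so the result is easy to check by hand.

```python
import math

def viterbi(symbols, states, initial, transition, emission):
    """Return (best_log_prob, best_path) for the given symbol sequence."""
    # v[s] = best log probability of any path that ends in state s.
    v = {s: math.log(initial[s]) + math.log(emission[s][symbols[0]]) for s in states}
    backpointers = []
    for sym in symbols[1:]:
        new_v, ptr = {}, {}
        for s in states:
            # Best predecessor of s at this position.
            best_prev = max(states, key=lambda p: v[p] + math.log(transition[p][s]))
            ptr[s] = best_prev
            new_v[s] = (v[best_prev] + math.log(transition[best_prev][s])
                        + math.log(emission[s][sym]))
        v = new_v
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path

# Toy parameters (assumptions): uniform transitions, strongly biased emissions.
states = ["H", "L"]
initial = {"H": 0.5, "L": 0.5}
transition = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.5, "L": 0.5}}
emission = {"H": {"x": 0.9, "y": 0.1}, "L": {"x": 0.1, "y": 0.9}}

score, path = viterbi("xxyy", states, initial, transition, emission)
print(score, path)
```

In Viterbi training, the decoded paths would then be used to recount transitions and emissions, and the two steps would repeat until the parameters stop changing.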

Profile Hidden Markov Models

Stochastic methods for modeling multiple sequence alignments of protein and DNA sequences.

Potential application domains:

Constructing a profile HMM: a protein family can be modeled as an HMM or a group of HMMs.

Aligning a sequence with a stored profile HMM: new protein sequences can be aligned with stored models to detect remote homology.

Finding statistical similarities between two profile HMMs: two or more protein family profile HMMs can be aligned to detect homology.

Profile Hidden Markov Models: Constructing a profile HMM

A multiple sequence alignment is assumed as input.

Each consensus column can exist in 3 states: match, insert and delete.

The number of states depends on the length of the alignment.
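One way to see how the number of states follows from the alignment is to mark consensus columns first. The sketch below uses a common heuristic that the slides do not state: a column counts as a match column when fewer than half of its entries are gaps. The 50% threshold, the function names, the toy alignment, and the convention of one extra insert state before the first match column are all assumptions.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as consensus (match) columns."""
    n_seqs = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n_seqs < max_gap_fraction:
            cols.append(j)
    return cols

def state_counts(n_match):
    """State counts for a model with n_match consensus columns.

    Assumes one insert state per match column plus one before the first.
    """
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}

# Toy alignment of seven short protein fragments (illustrative only).
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

cols = match_columns(alignment)
print(cols)
print(state_counts(len(cols)))
```

Columns 3 and 4 are mostly gaps, so their residues would be emitted by an insert state rather than a match state.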

Profile Hidden Markov Models: A typical profile HMM architecture

squares represent match states
diamonds represent insert states
circles represent delete states
arrows represent transitions

Profile Hidden Markov Models: A typical profile HMM architecture

transition between match states: $a_{M_j M_{j+1}}$
transition from match state to insert state: $a_{M_j I_j}$
transition within insert state: $a_{I_j I_j}$
transition from match state to delete state: $a_{M_j D_{j+1}}$
transition between delete states: $a_{D_j D_{j+1}}$
emission of symbol $a$ at state $s$: $e_s(a)$

Profile Hidden Markov Models: Estimation of parameters

Transition probabilities are estimated as the frequency of a transition in a given alignment.

Emission probabilities are estimated as the frequency of an emission in a given alignment.

Pseudo counts are usually introduced to account for transitions and emissions that were not present in the alignment.

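Profile Hidden Markov Models: Estimation of parameters

Without pseudo counts, the emission probability is the relative frequency of the symbol:

$e_s(a) = \frac{E_s(a)}{\sum_{a'} E_s(a')}$

With pseudo counts:

$e_s(a) = \frac{E_s(a) + A}{\sum_{a'} \left( E_s(a') + A \right)}$

Here $E_s(a)$ is the number of times symbol $a$ is emitted from state $s$ in the alignment and $A$ is the pseudo count.

A Dirichlet prior distribution is used to determine the pseudo counts.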
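A minimal sketch of frequency-based emission estimation for one match column, with a constant pseudo count added to every symbol. The function name, the DNA alphabet, and the constant pseudo count of 1 are assumptions for illustration; a Dirichlet-based scheme would supply different pseudo counts per symbol.

```python
from collections import Counter

def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate emission probabilities for one alignment column.

    Gap characters are ignored; every symbol in the alphabet receives
    the same pseudo count before normalizing.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Column with two A's, one C, and one gap.
with_pseudo = emission_probs("AAC-", "ACGT", pseudocount=1.0)
without_pseudo = emission_probs("AAC-", "ACGT", pseudocount=0.0)
print(with_pseudo)
print(without_pseudo)
```

Without pseudo counts, the unseen symbols G and T get probability zero, so any sequence containing them at this position would score as impossible. With pseudo counts they keep a small nonzero probability.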

Profile Hidden Markov Models: Scoring a sequence against a profile HMM

The Viterbi algorithm is used to find the best state path.

Simulated-annealing-based methods are also used.

Maximization criterion: log likelihood or log odds.

The log-likelihood score depends strongly on the length of the sequence and is therefore generally not preferred.

If an alignment is not given initially, it can be learnt iteratively using Viterbi.
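The difference between the two criteria can be illustrated with a deliberately simplified model: a gap-free profile with match states only, scored against a uniform background. Ignoring insert and delete states and transition probabilities is an assumption made to keep the sketch short, and the profile values are invented.

```python
import math

def log_likelihood(seq, profile):
    """Sum of log emission probabilities along the match states."""
    return sum(math.log(profile[j][x]) for j, x in enumerate(seq))

def log_odds(seq, profile, background):
    """Log ratio of the profile probability to a background (null) model."""
    return sum(math.log(profile[j][x] / background[x]) for j, x in enumerate(seq))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Two-position toy profile: position 0 prefers A, position 1 prefers G.
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

print(log_likelihood("AG", profile), log_odds("AG", profile, background))
print(log_likelihood("CT", profile), log_odds("CT", profile, background))
```

Every added position makes the log likelihood more negative, even for a perfect match, so longer sequences are penalized. The log-odds score is positive when the profile explains the sequence better than the background and negative otherwise, which makes scores more comparable across lengths.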

Profile Hidden Markov Models: Comparing two profile HMMs

Profile-profile comparison tool based on information theory:

based on the Kullback-Leibler divergence criterion for comparing 2 statistical distributions
dynamic programming is used to compare entire profiles
can detect weak similarities between models
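As a sketch of the underlying criterion only, the Kullback-Leibler divergence between the emission distributions of two profile columns can be computed as below. The two distributions are invented, and this is not the scoring function of any specific comparison tool; a full profile-profile method would combine such column scores with dynamic programming.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

# Emission distributions of two hypothetical profile columns.
p = {"A": 0.5, "C": 0.5}
q = {"A": 0.25, "C": 0.75}

print(kl_divergence(p, p))  # identical columns
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

The divergence is zero for identical distributions and is not symmetric, so a practical comparison would typically use a symmetrized variant.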

Available resources for Profile HMMs

HMMER and SAM were among the first available programs for profile HMMs.

HMMER: S. Eddy at Washington University
SAM (Sequence Alignment and Modeling System): R. Hughey at University of California, Santa Cruz

Available free for research.

SAM has online servers to perform sequence comparisons:
http://www.cse.ucsc.edu/research/compbio/sam.html

The InterPro consortium in Europe has many resources for protein data:

Database of protein families and domains
Brings together several different databases under one umbrella

Pfam and Superfamily are profile HMM libraries associated with InterPro.

Pfam is based on HMMER search; Superfamily is based on SAM search and modeling.

SAM’s iterative approach for building an HMM:

(1) find a set of close homologs using BLASTP
(2) learn the alignment and build a model using the close homologs
(3) use BLASTP to get more remote homologs using the first set of sequences (relax the E-value)
(4) iteratively refine the HMM

SAM uses Dirichlet priors as pseudo counts for parameters.

Unlike HMMER, SAM does not require hand-tuned seed alignments, because the alignments are learnt by the algorithm.

The SUPERFAMILY database incorporates:

a library of profile HMMs representing all proteins of known structure
assignments to predicted proteins from all completely sequenced genomes
search and alignment services

Models and domain assignments are freely available.

Based on the SCOP classification of protein domains.

The SAM HMM iterative procedure is used for model building and sequence alignment.

In Superfamily:

Each SCOP superfamily is represented as an HMM model.

Models are built using the SAM procedure, based on 4 variants:
accurate structure-based alignments
hand-labeled alignments
automatic alignments using ClustalW
sequence members used separately as seeds

Assignment of superfamilies:
for a given sequence, every model is scored across the whole sequence using Viterbi scoring
the model which scores highest has its superfamily assigned to the region

Online Demonstrations

http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html

References

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., "Biological Sequence Analysis", Cambridge University Press, 2002

Baldi, P. and Brunak, S., "Bioinformatics, the Machine Learning Approach", The MIT Press, Cambridge, 1998

Eddy, S., "Profile Hidden Markov Models", Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998

Karplus, K., Barrett, C., and Hughey, R., "Hidden Markov models for detecting remote homologies", Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998

Madera, M. and Gough, J., "A comparison of profile hidden Markov model procedures for remote homology detection", Nucleic Acids Research, vol. 30, no. 19, pp. 4321-4328, 2002

Gough, J., Karplus, K., Hughey, R., and Chothia, C., "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that represent all Proteins of known structure", J. Mol. Biol., vol. 313, pp. 903-919, 2001

References (continued)

Yona, G. and Levitt, M., "Within the Twilight Zone: A sensitive Profile-Profile comparison tool based on Information Theory", J. Mol. Biol., vol. 315, pp. 1257-1275, 2002

Madera, M., Vogel, C., Kummerfeld, K., Chothia, C., and Gough, J., "The SUPERFAMILY database in 2004: additions and improvements", Nucleic Acids Research, vol. 32, Database Issue, pp. D235-D239, 2004

Bateman, A., Birney, E., Durbin, R., Eddy, S., Finn, R., and Sonnhammer, E., "Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins", Nucleic Acids Research, vol. 27, no. 1, 1999

Andreeva, A., et al., "SCOP database in 2004: refinements integrate structure and sequence family data", Nucleic Acids Research, vol. 32, Database Issue, pp. D226-D229, 2004

Many other online resources and tutorials
