Post on 18-Dec-2015
Profile Hidden Markov Models
Bioinformatics, Fall 2004
Dr. Webb Miller and Dr. Claude dePamphilis
Dhiraj Joshi
Department of Computer Science and Engineering, The Pennsylvania State University
Outline
- Introduction to HMMs
- Profile HMMs
- Available resources for Profile HMMs
- Some online demonstrations
Introduction to HMMs
Hidden Markov Models: Formalism
- Statistical techniques for modeling patterns in data
- First-order Markov property (memorylessness): the next state depends only on the current state
  $$P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_0) = P(x_t \mid x_{t-1})$$
- A state is generally a hidden entity that emits symbols or features
- The same symbol can be emitted by several states
- An HMM is characterized by its transition probabilities and its emission distribution $e_{s_t}(a)$
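The formalism above can be made concrete with a small generator. This is a minimal sketch only: the two states, the DNA alphabet, and every probability in it are illustrative assumptions, not values from the slides. It shows a hidden state sequence evolving under the first-order Markov property while each state emits symbols from its own distribution, so the same symbol can come from either state.

```python
import random

# Hypothetical two-state HMM over a DNA alphabet (all numbers are illustrative).
TRANSITIONS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMISSIONS = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(length, start_state, rng):
    """Generate (hidden states, emitted symbols) of the given length."""
    states, symbols = [], []
    state = start_state
    for _ in range(length):
        states.append(state)
        symbols.append(draw(EMISSIONS[state], rng))
        # First-order Markov property: the next state depends only on `state`.
        state = draw(TRANSITIONS[state], rng)
    return states, symbols


states, symbols = sample_hmm(20, "AT_rich", random.Random(0))
print("".join(symbols))
```

Only `symbols` would be visible in practice; `states` is the hidden part that the estimation and decoding algorithms try to recover.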
Introduction to HMMs
Hidden Markov Models: Parameter Estimation
- Parameters: transition probabilities and emission probabilities
- Iterative computational algorithms are used, e.g. the EM algorithm and the Viterbi algorithm
- The algorithms are based on dynamic programming to save computational cost
- The iterations usually involve variants of the following two steps:
  - estimate the state sequence that maximizes the likelihood under the current parameter set
  - update the parameter set based on the estimated state sequence
- The algorithms may converge to a local optimum rather than the global one
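The two-step iteration can be sketched as Viterbi training, one common variant of the scheme described above (the full EM algorithm uses expected counts instead of a single best path). This is a sketch under stated assumptions: the function names, the flat pseudocount of 1, the fixed start distribution, and the toy sequence are all hypothetical choices made for illustration.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for `obs`, by log-space dynamic programming."""
    score = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        prev, row, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            row[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][x])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def normalize(row):
    total = sum(row.values())
    return {k: v / total for k, v in row.items()}


def reestimate(obs, path, states, alphabet, pseudo=1.0):
    """Step 2: update parameters from counts along the decoded path.

    The flat pseudocount (an assumption) keeps every probability above zero.
    """
    t = {r: {s: pseudo for s in states} for r in states}
    e = {s: {a: pseudo for a in alphabet} for s in states}
    for r, s in zip(path, path[1:]):
        t[r][s] += 1
    for s, x in zip(path, obs):
        e[s][x] += 1
    return {r: normalize(t[r]) for r in states}, {s: normalize(e[s]) for s in states}


def viterbi_training(obs, states, alphabet, start, trans, emit, max_iter=20):
    """Alternate decoding (step 1) and re-estimation (step 2) until the path is stable."""
    path = None
    for _ in range(max_iter):
        new_path = viterbi(obs, states, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(obs, path, states, alphabet)
    return path, trans, emit


obs = "AATTATATGCGCGGCCGCATATTA"
states = ("AT_rich", "GC_rich")
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans0 = {"AT_rich": {"AT_rich": 0.7, "GC_rich": 0.3},
          "GC_rich": {"AT_rich": 0.3, "GC_rich": 0.7}}
emit0 = {"AT_rich": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
         "GC_rich": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}
path, trans, emit = viterbi_training(obs, states, "ACGT", start, trans0, emit0)
```

The result depends on the initial guess `trans0` and `emit0`, which is one way the local-optimum behaviour noted above shows up in practice.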
Profile Hidden Markov Models
- Stochastic methods for modeling multiple sequence alignments of protein and DNA sequences
- Potential application domains:
  - Constructing a profile HMM: a protein family can be modeled as an HMM or a group of HMMs
  - Aligning a sequence with a stored profile HMM: new protein sequences can be aligned with stored models to detect remote homology
  - Finding statistical similarities between two profile HMMs: two or more protein family profile HMMs can be aligned to detect homology
Profile Hidden Markov Models
Constructing a profile HMM
- A multiple sequence alignment is assumed to be given
- Each consensus column can exist in one of 3 states: match, insert, or delete
- The number of states depends on the length of the alignment
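How the number of states follows from the alignment can be sketched as below. The slides do not say how consensus columns are chosen, so this sketch assumes a common heuristic (a column is a consensus column when at most half of its entries are gaps) and the usual architecture with one more insert state than match states. The alignment, the function names, and the 0.5 threshold are illustrative assumptions.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as consensus (match) columns."""
    n = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n <= max_gap_fraction:
            cols.append(j)
    return cols


def count_states(num_match):
    """State counts, assuming an insert state before the first match state as well."""
    return {"match": num_match, "insert": num_match + 1, "delete": num_match}


# Toy protein alignment: 7 rows, 10 columns, "-" marks a gap.
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]
cols = match_columns(alignment)
print(cols, count_states(len(cols)))
```

Columns 3 and 4 are gaps in six of the seven rows, so they are left to an insert state rather than given match states of their own.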
Profile Hidden Markov Models
A typical profile HMM architecture
- Squares represent match states
- Diamonds represent insert states
- Circles represent delete states
- Arrows represent transitions
Profile Hidden Markov Models
A typical profile HMM architecture
- Transition between match states: $a_{M_j M_{j+1}}$
- Transition from match state to insert state: $a_{M_j I_j}$
- Transition within insert state: $a_{I_j I_j}$
- Transition from match state to delete state: $a_{M_j D_{j+1}}$
- Transition between delete states: $a_{D_j D_{j+1}}$
- Emission of symbol $a$ at state $s$: $e_s(a)$
Profile Hidden Markov Models
Estimation of parameters
- Transition probabilities are estimated as the frequency of each transition in a given alignment
- Emission probabilities are estimated as the frequency of each emission in a given alignment
- Pseudocounts are usually introduced to account for transitions and emissions that were not present in the alignment
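Frequency-based estimation of transitions can be sketched as follows. Each aligned row is read as a path through match (M), insert (I) and delete (D) states, and transitions are counted along those paths. The alignment, the pre-chosen set of consensus columns, the function names, and the convention of writing the begin and end states as M0 and M(L+1) are assumptions made for this illustration. No pseudocounts are applied here.

```python
from collections import Counter


def state_path(seq, match_cols):
    """Translate one aligned row into a list of (state type, position) pairs."""
    path = [("M", 0)]  # begin state, written as M0 (a convention assumed here)
    pos = 0
    for j, ch in enumerate(seq):
        if j in match_cols:
            pos += 1
            path.append(("D" if ch == "-" else "M", pos))
        elif ch != "-":
            path.append(("I", pos))  # residue in a non-consensus column
    path.append(("M", pos + 1))  # end state
    return path


def transition_frequencies(alignment, match_cols):
    """a_kl = A_kl / sum over l' of A_kl', using raw counts from the alignment."""
    counts = Counter()
    for seq in alignment:
        p = state_path(seq, match_cols)
        counts.update(zip(p, p[1:]))
    totals = Counter()
    for (src, _dst), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}


alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]
match_cols = {0, 1, 2, 5, 6, 7, 8, 9}  # consensus columns, assumed already chosen
freqs = transition_frequencies(alignment, match_cols)
print(freqs[(("M", 3), ("M", 4))])
```

Any transition that never occurs in the alignment is simply absent from `freqs`, i.e. it has probability zero, which is the problem pseudocounts are meant to address.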
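Profile Hidden Markov Models
Estimation of parameters
- From the observed counts $A_{kl}$ (transitions from state $k$ to state $l$) and $E_k(a)$ (emissions of symbol $a$ from state $k$):
  $$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$
- With pseudocounts:
  $$e_k(a) = \frac{E_k(a) + A\,q_a}{\sum_{a'} E_k(a') + A}$$
  where $q_a$ is the background frequency of symbol $a$ and $A$ (without subscripts) is the total pseudocount weight
- A Dirichlet prior distribution is used to determine the pseudocounts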
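The emission formulas can be checked numerically with a short sketch. The function name, the uniform background, and the example column are assumptions; with a uniform background and a weight equal to the alphabet size, the pseudocount formula reduces to adding one count per symbol.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probs(column, alphabet=AMINO_ACIDS, background=None, weight=0.0):
    """e_k(a) = (E_k(a) + A*q_a) / (sum_a' E_k(a') + A).

    `weight` plays the role of A and `background` the role of q.
    weight=0 gives the raw frequencies (the column must then contain a residue).
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values())
    if background is None:
        background = {a: 1 / len(alphabet) for a in alphabet}
    return {a: (counts[a] + weight * background[a]) / (total + weight)
            for a in alphabet}


column = "VVVVVFI"  # residues observed in one consensus column of a toy alignment
raw = emission_probs(column)
smoothed = emission_probs(column, weight=20.0)
print(raw["V"], smoothed["V"], raw["A"], smoothed["A"])
```

Without pseudocounts, any residue not seen in the column gets probability zero and would make every sequence containing it impossible at that position; with them, it keeps a small non-zero probability.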
Profile Hidden Markov Models
Scoring a sequence against a profile HMM
- The Viterbi algorithm is used to find the best state path
- Simulated annealing based methods are also used
- Maximization criterion: log likelihood or log odds
- The log-likelihood score depends strongly on the length of the sequence and is therefore generally not preferred
- If an alignment is not given initially, it can be learned iteratively using Viterbi
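The difference between the two scoring criteria can be illustrated with a deliberately simplified sketch. It assumes an ungapped profile (match states only, so the best path is trivially the match path and no Viterbi search is needed), a uniform background as the null model, and toy emission values. The helper names are hypothetical.

```python
import math

BACKGROUND = {a: 0.25 for a in "ACGT"}  # null model (an assumption)


def consensus_profile(consensus, p_match=0.7):
    """Toy match-state emissions: the consensus base gets p_match, the rest share the remainder."""
    rest = (1 - p_match) / 3
    return [{a: (p_match if a == c else rest) for a in "ACGT"} for c in consensus]


def log_likelihood(seq, profile):
    """log P(seq | profile) along the all-match path."""
    assert len(seq) == len(profile)
    return sum(math.log(e[x]) for e, x in zip(profile, seq))


def log_odds(seq, profile, background=BACKGROUND):
    """log of P(seq | profile) / P(seq | null model)."""
    assert len(seq) == len(profile)
    return sum(math.log(e[x] / background[x]) for e, x in zip(profile, seq))


short = consensus_profile("ACGT")
longer = consensus_profile("ACGTACGT")
print(log_likelihood("ACGT", short), log_likelihood("ACGTACGT", longer))
print(log_odds("ACGT", short), log_odds("TTTT", short))
```

Both sequences in the first `print` match their profile perfectly, yet the longer one has the lower log likelihood, purely because it is longer. The log-odds score is positive for the matching sequence and negative for the poorly matching one, which makes it easier to compare across lengths.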
Profile Hidden Markov Models
Comparing two profile HMMs
- Profile-profile comparison tool based on information theory
- Based on the Kullback-Leibler divergence, a criterion for comparing two statistical distributions
- Dynamic programming is used to compare entire profiles
- Detects weak similarities between models
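As an illustration of the divergence criterion only (not of the comparison tool's actual scoring scheme, which the slides do not spell out), the Kullback-Leibler divergence between two emission distributions can be computed as below. The two example distributions and the symmetrized variant are assumptions added for this sketch.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum over a of p(a) * log(p(a) / q(a)), in nats.

    Requires q(a) > 0 wherever p(a) > 0.
    """
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


def symmetric_kl(p, q):
    """Average of the two directions; one simple way to obtain a symmetric score."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Two hypothetical emission distributions, e.g. one column from each profile.
col_p = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(kl_divergence(col_p, col_q), kl_divergence(col_q, col_p), symmetric_kl(col_p, col_q))
```

The divergence is zero for identical distributions and is not symmetric, which is why comparison schemes typically use a symmetrized form as a column-to-column score inside the dynamic programming step.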
Available resources for Profile HMMs
- HMMER and SAM were among the first available programs for profile HMMs
  - HMMER: S. Eddy at Washington University
  - SAM (Sequence Alignment and Modeling System): R. Hughey at University of California, Santa Cruz
- Available free for research
- SAM has online servers to perform sequence comparisons
  http://www.cse.ucsc.edu/research/compbio/sam.html
Available resources for Profile HMMs
- The InterPro consortium in Europe has many resources for protein data
  - Database of protein families and domains
  - Brings together several different databases under one umbrella
- Pfam and SUPERFAMILY are profile HMM libraries associated with InterPro
- Pfam is based on HMMER search; SUPERFAMILY is based on SAM search and modeling
Available resources for Profile HMMs
- SAM's iterative approach for building an HMM:
  - find a set of close homologs using BLASTP
  - learn the alignment and build a model using the close homologs
  - use BLASTP to get more remote homologs using the first set of sequences (relax the E-value)
  - iteratively refine the HMM
- SAM uses Dirichlet priors as pseudocounts for parameters
- Unlike with HMMER, hand-tuned seed alignments are not required, because the alignments are learned by the algorithm
Available resources for Profile HMMs
- The SUPERFAMILY database incorporates:
  - a library of profile HMMs representing all proteins of known structure
  - assignments to predicted proteins from all completely sequenced genomes
  - search and alignment services
- Models and domain assignments are freely available
- Based on the SCOP classification of protein domains
- The SAM HMM iterative procedure is used for model building and sequence alignment
Available resources for Profile HMMs
- In SUPERFAMILY:
  - Each SCOP superfamily is represented as an HMM
  - Models are built using the SAM procedure, based on 4 variants:
    - accurate structure-based alignments
    - hand-labeled alignments
    - automatic alignments using ClustalW
    - sequence members used separately as seeds
- Assignment of superfamilies:
  - for a given sequence, every model is scored across the whole sequence using Viterbi scoring
  - the model that scores highest has its superfamily assigned to the region
Online Demonstrations
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html