CONTENTS
• A Brief History of Hidden Markov Model
• Relationship Between DNA, RNA And
Proteins
• Coin Tossing with HMM
• HMM Architecture
• HMMs in Biology
2
A Brief History of Hidden
Markov Model
• Hidden Markov Models (HMMs) are
learnable finite stochastic automates.
Nowadays, they are considered as a
specific form of dynamic Bayesian
networks. Dynamic Bayesian networks
are based on the theory of Bayes
(Bayes & Price, 1763).
3
4
Relationship Between DNA, RNA
And Proteins
Protein
mRNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
21
HMM Architecture
• Markov Chains
• What is a Hidden Markov Model(HMM)?
• Components of HMM
• Problems of HMMs
22
Markov Chains
Sunny
Rain
Cloudy
State transition matrix : The probability of
the weather given the previous day's
weather.
Initial Distribution : Defining the probability of the
system being in each of the states at time 0.
States : Three states - sunny,
cloudy, rainy.
23
Hidden Markov Models
Hidden states : the (TRUE) states of a system that
may be described by a Markov process (e.g., the
weather).
Observable states : the states of the process that
are `visible' (e.g., seaweed dampness).
24
Components Of HMM
Output matrix : containing the probability of observing a
particular observable state given that the hidden model is in a
particular hidden state.
Initial Distribution : contains the probability of the (hidden)
model being in a particular hidden state at time t = 1.
State transition matrix : holding the probability of a hidden
state given the previous hidden state.
25
Example-HMM
Scoring a Sequence with an HMM:
The probability of ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 *.01 * 1 = 1.76x10-6.
Transition
Prob.
Output Prob.
26
Problems With HMM
Scoring problem:
Given an existing HMM and observed sequence , what is the probability
that the HMM can generate the sequence
27
Problems With HMM
Alignment Problem
Given a sequence, what is the optimal state sequence that the HMM would use to generate it
28
Problems With HMM
Training ProblemGiven a large amount of data how can we estimate the structure
and the parameters of the HMM that best accounts for the data
29
HMMs in Biology
• Gene finding and prediction
• Protein-Profile Analysis
• Secondary Structure prediction
• Advantages
• Limitations
30
Finding genes in DNA sequence
This is one of the most challenging and interesting problems in
computational biology at the moment. With so many genomes
being sequenced so rapidly, it remains important to begin by
identifying genes computationally.
31
What is a (protein-coding) gene?
Protein
mRNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
33
Gene Finding HMMs
• Our Objective:
– To find the coding and non-coding regions of an unlabeled string of DNA nucleotides
• Our Motivation:
– Assist in the annotation of genomic data produced by genome sequencing methods
– Gain insight into the mechanisms involved in transcription, splicing and other processes
34
Why HMMs
• Classification: Classifying observations within a
sequence
• Order: A DNA sequence is a set of ordered observations
• Grammar : Our grammatical structure (and the
beginnings of our architecture) is right here:
• Success measure: # of complete exons correctly
labeled
• Training data: Available from various genome
annotation projects
HMMs for gene finding
An HMM for unspliced genes.
x : non-coding DNA
c : coding state
• Training - Expectation Maximization (EM)
• Parsing – Viterbi algorithm
36
Genefinders- a comparison
Method Sn Sp AC Sn Sp(Sn+Sp)/
2ME WE
GENSCAN 0.93 0.93 0.91 0.78 0.81 0.8 0.09 0.05
FGENEH 0.77 0.85 0.78 0.61 0.61 0.61 0.15 0.11
GeneID 0.63 0.81 0.67 0.44 0.45 0.45 0.28 0.24
GeneParser2 0.66 0.79 0.66 0.35 0.39 0.37 0.29 0.17
GenLang 0.72 0.75 0.69 0.5 0.49 0.5 0.21 0.21
GRAILII 0.72 0.84 0.75 0.36 0.41 0.38 0.25 0.1
SORFIND 0.71 0.85 0.73 0.42 0.47 0.45 0.24 0.14
Xpound 0.61 0.82 0.68 0.15 0.17 0.16 0.32 0.13
Accuracy per nucleotide Accuracy per exon
Sn = Sensitivity
Sp = Specificity
Ac = Approximate Correlation
ME = Missing Exons
WE = Wrong Exons
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
37
Protein Profile HMMs
• Motivation
– Given a single amino acid target sequence of
unknown structure, we want to infer the structure
of the resulting protein. Use Profile Similarity
• What is a Profile?– Proteins families of related sequences and structures
– Same function
– Clear evolutionary relationship
– Patterns of conservation, some positions are more
conserved than the others
38
Aligned Sequences
Build a Profile HMM (Training)
Database
search
Multiple
alignments
(Viterbi)
Query against Profile
HMM database
(Forward)
An Overview
A HMM model for a DNA motif alignments, The transitions are
shown with arrows whose thickness indicate their probability. In
each state, the histogram shows the probabilities of the four
bases.
ACA - - - ATG
TCA ACT ATC
ACA C - - AGC
AGA - - - ATC
ACC G - - ATC
Building – from an existing alignment
Transition probabilities
Output Probabilities
insertion
40
Matching states
Insertion states
Deletion states
No of matching states = average sequence length in the family
PFAM Database - of Protein families
(http://pfam.wustl.edu)
Building – Final Topology
41
• Given HMM, M, for a sequence family, find all members of the family in data base.
• LL – score LL(x) = log P(x|M)
(LL score is length dependent – must normalize or use Z-score)
Database Searching
Consensus sequence:
P (ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x
0.8x1 x 0.8 = 4.7 x 10 -2
Suppose I have a query protein sequence, and I am interested
in which family it belongs to? There can be many paths
leading to the generation of this sequence. Need to find all
these paths and sum the probabilities.
ACAC - - ATC
Query a new sequence
43
Multiple Alignments
• Try every possible path through the model that would produce the target sequences
– Keep the best one and its probability.
– Output : Sequence of match, insert and delete states
• Viterbi alg. Dynamic Programming
44
Building – unaligned sequences
• Baum-Welch Expectation-maximization method– Start with a model whose length matches the
average length of the sequences and with random output and transition probabilities.
– Align all the sequences to the model.
– Use the alignment to alter the output and transition probabilities
– Repeat. Continue until the model stops changing
• By-product: It produced a multiple alignment
45
PHMM Example
An alignment of 30 short amino acid sequences chopped out
of a alignment of the SH3 domain. The shaded area are the
most conserved and were represented by the main states in
the HMM. The unshaded area was represented by an insert
state.
46
Prediction of Protein
Secondary structures• Prediction of secondary structures is needed
for the prediction of protein function.
• Analyze the amino-acid sequences of proteins
• Learn secondary structures – helix, sheet and turn
• Predict the secondary structures of sequences
47
Advantages
• Characterize an entire family of sequences.
• Position-dependent character distributions and
position-dependent insertion and deletion gap
penalties.
• Built on a formal probabilistic basis
• Can make libraries of hundreds of profile HMMs and
apply them on a large scale (whole genome)
48
Limitations
Markov Chains
Probabilities of states are supposed to
be independent
P(y) must be independent of P(x), and
vice versa
This usually isn’t true
P(x) … P(y)
49
Limitations - contd
• Standard Machine Learning Problems
• Watch out for local maxima– Model may not converge to a truly optimal
parameter set for a given training set
• Avoid over-fitting– You’re only as good as your training set
– More training is not always good
50
REFERENCES
Hidden Markov Models in Computational Biology , www.evl.uic.edu/shalini/hmm/
• Wikipedia1, http://en.wikipedia.org/wiki/Andrey_Markov
• Wikipedia2, http://en.wikipedia.org/wiki/Hidden_Markov_model
• Wikipedia3, http://en.wikipedia.org/wiki/Markov_chain
• Wikipedia4, http://en.wikipedia.org/wiki/Markov_process
REFERENCES
• Hidden Markov Models-
http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf
• Protein oluşumu -
http://evrimci.freeservers.com/protein_olusumu.html
• Application of Hidden Markov Models in Bioinformatics-
http://www.bioss.ac.uk/associates/dirk/talks/tutorial_hmm_bioinf.
• Hidden Markov Models in Bioinformatic-
http://benthamscience.com/cbio/Samples/cbio2-1/0005CBIO.pdf
51