+ All Categories
Home > Documents > HIDDEN MARKOV MODELS - EEMB DERSLER · A Brief History of Hidden Markov Model •Hidden Markov...

HIDDEN MARKOV MODELS - EEMB DERSLER · A Brief History of Hidden Markov Model •Hidden Markov...

Date post: 15-Apr-2018
Category:
Upload: vuongmien
View: 227 times
Download: 2 times
Share this document with a friend
51
HIDDEN MARKOV MODELS BIL-438 Hesaplanabilir Biyoloji ve İleri Konular Ramazan AY 2007638024
Transcript

HIDDEN MARKOV MODELS

BIL-438 Hesaplanabilir Biyoloji

ve İleri Konular

Ramazan AY

2007638024

CONTENTS

• A Brief History of Hidden Markov Model

• Relationship Between DNA, RNA And

Proteins

• Coin Tossing with HMM

• HMM Architecture

• HMMs in Biology

2

A Brief History of Hidden

Markov Model

• Hidden Markov Models (HMMs) are

learnable finite stochastic automates.

Nowadays, they are considered as a

specific form of dynamic Bayesian

networks. Dynamic Bayesian networks

are based on the theory of Bayes

(Bayes & Price, 1763).

3

4

Relationship Between DNA, RNA

And Proteins

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

HMM Architecture

• Markov Chains

• What is a Hidden Markov Model(HMM)?

• Components of HMM

• Problems of HMMs

22

Markov Chains

Sunny

Rain

Cloudy

State transition matrix : The probability of

the weather given the previous day's

weather.

Initial Distribution : Defining the probability of the

system being in each of the states at time 0.

States : Three states - sunny,

cloudy, rainy.

23

Hidden Markov Models

Hidden states : the (TRUE) states of a system that

may be described by a Markov process (e.g., the

weather).

Observable states : the states of the process that

are `visible' (e.g., seaweed dampness).

24

Components Of HMM

Output matrix : containing the probability of observing a

particular observable state given that the hidden model is in a

particular hidden state.

Initial Distribution : contains the probability of the (hidden)

model being in a particular hidden state at time t = 1.

State transition matrix : holding the probability of a hidden

state given the previous hidden state.

25

Example-HMM

Scoring a Sequence with an HMM:

The probability of ACCY along this path is

.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 *.01 * 1 = 1.76x10-6.

Transition

Prob.

Output Prob.

26

Problems With HMM

Scoring problem:

Given an existing HMM and observed sequence , what is the probability

that the HMM can generate the sequence

27

Problems With HMM

Alignment Problem

Given a sequence, what is the optimal state sequence that the HMM would use to generate it

28

Problems With HMM

Training ProblemGiven a large amount of data how can we estimate the structure

and the parameters of the HMM that best accounts for the data

29

HMMs in Biology

• Gene finding and prediction

• Protein-Profile Analysis

• Secondary Structure prediction

• Advantages

• Limitations

30

Finding genes in DNA sequence

This is one of the most challenging and interesting problems in

computational biology at the moment. With so many genomes

being sequenced so rapidly, it remains important to begin by

identifying genes computationally.

31

What is a (protein-coding) gene?

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

32

In more detail

(color ~state)

(Removed)

(Left)

33

Gene Finding HMMs

• Our Objective:

– To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

• Our Motivation:

– Assist in the annotation of genomic data produced by genome sequencing methods

– Gain insight into the mechanisms involved in transcription, splicing and other processes

34

Why HMMs

• Classification: Classifying observations within a

sequence

• Order: A DNA sequence is a set of ordered observations

• Grammar : Our grammatical structure (and the

beginnings of our architecture) is right here:

• Success measure: # of complete exons correctly

labeled

• Training data: Available from various genome

annotation projects

HMMs for gene finding

An HMM for unspliced genes.

x : non-coding DNA

c : coding state

• Training - Expectation Maximization (EM)

• Parsing – Viterbi algorithm

36

Genefinders- a comparison

Method Sn Sp AC Sn Sp(Sn+Sp)/

2ME WE

GENSCAN 0.93 0.93 0.91 0.78 0.81 0.8 0.09 0.05

FGENEH 0.77 0.85 0.78 0.61 0.61 0.61 0.15 0.11

GeneID 0.63 0.81 0.67 0.44 0.45 0.45 0.28 0.24

GeneParser2 0.66 0.79 0.66 0.35 0.39 0.37 0.29 0.17

GenLang 0.72 0.75 0.69 0.5 0.49 0.5 0.21 0.21

GRAILII 0.72 0.84 0.75 0.36 0.41 0.38 0.25 0.1

SORFIND 0.71 0.85 0.73 0.42 0.47 0.45 0.24 0.14

Xpound 0.61 0.82 0.68 0.15 0.17 0.16 0.32 0.13

Accuracy per nucleotide Accuracy per exon

Sn = Sensitivity

Sp = Specificity

Ac = Approximate Correlation

ME = Missing Exons

WE = Wrong Exons

GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html

37

Protein Profile HMMs

• Motivation

– Given a single amino acid target sequence of

unknown structure, we want to infer the structure

of the resulting protein. Use Profile Similarity

• What is a Profile?– Proteins families of related sequences and structures

– Same function

– Clear evolutionary relationship

– Patterns of conservation, some positions are more

conserved than the others

38

Aligned Sequences

Build a Profile HMM (Training)

Database

search

Multiple

alignments

(Viterbi)

Query against Profile

HMM database

(Forward)

An Overview

A HMM model for a DNA motif alignments, The transitions are

shown with arrows whose thickness indicate their probability. In

each state, the histogram shows the probabilities of the four

bases.

ACA - - - ATG

TCA ACT ATC

ACA C - - AGC

AGA - - - ATC

ACC G - - ATC

Building – from an existing alignment

Transition probabilities

Output Probabilities

insertion

40

Matching states

Insertion states

Deletion states

No of matching states = average sequence length in the family

PFAM Database - of Protein families

(http://pfam.wustl.edu)

Building – Final Topology

41

• Given HMM, M, for a sequence family, find all members of the family in data base.

• LL – score LL(x) = log P(x|M)

(LL score is length dependent – must normalize or use Z-score)

Database Searching

Consensus sequence:

P (ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x

0.8x1 x 0.8 = 4.7 x 10 -2

Suppose I have a query protein sequence, and I am interested

in which family it belongs to? There can be many paths

leading to the generation of this sequence. Need to find all

these paths and sum the probabilities.

ACAC - - ATC

Query a new sequence

43

Multiple Alignments

• Try every possible path through the model that would produce the target sequences

– Keep the best one and its probability.

– Output : Sequence of match, insert and delete states

• Viterbi alg. Dynamic Programming

44

Building – unaligned sequences

• Baum-Welch Expectation-maximization method– Start with a model whose length matches the

average length of the sequences and with random output and transition probabilities.

– Align all the sequences to the model.

– Use the alignment to alter the output and transition probabilities

– Repeat. Continue until the model stops changing

• By-product: It produced a multiple alignment

45

PHMM Example

An alignment of 30 short amino acid sequences chopped out

of a alignment of the SH3 domain. The shaded area are the

most conserved and were represented by the main states in

the HMM. The unshaded area was represented by an insert

state.

46

Prediction of Protein

Secondary structures• Prediction of secondary structures is needed

for the prediction of protein function.

• Analyze the amino-acid sequences of proteins

• Learn secondary structures – helix, sheet and turn

• Predict the secondary structures of sequences

47

Advantages

• Characterize an entire family of sequences.

• Position-dependent character distributions and

position-dependent insertion and deletion gap

penalties.

• Built on a formal probabilistic basis

• Can make libraries of hundreds of profile HMMs and

apply them on a large scale (whole genome)

48

Limitations

Markov Chains

Probabilities of states are supposed to

be independent

P(y) must be independent of P(x), and

vice versa

This usually isn’t true

P(x) … P(y)

49

Limitations - contd

• Standard Machine Learning Problems

• Watch out for local maxima– Model may not converge to a truly optimal

parameter set for a given training set

• Avoid over-fitting– You’re only as good as your training set

– More training is not always good

50

REFERENCES

Hidden Markov Models in Computational Biology , www.evl.uic.edu/shalini/hmm/

• Wikipedia1, http://en.wikipedia.org/wiki/Andrey_Markov

• Wikipedia2, http://en.wikipedia.org/wiki/Hidden_Markov_model

• Wikipedia3, http://en.wikipedia.org/wiki/Markov_chain

• Wikipedia4, http://en.wikipedia.org/wiki/Markov_process

REFERENCES

• Hidden Markov Models-

http://digital.cs.usu.edu/~cyan/CS7960/hmm-tutorial.pdf

• Protein oluşumu -

http://evrimci.freeservers.com/protein_olusumu.html

• Application of Hidden Markov Models in Bioinformatics-

http://www.bioss.ac.uk/associates/dirk/talks/tutorial_hmm_bioinf.

pdf

• Hidden Markov Models in Bioinformatic-

http://benthamscience.com/cbio/Samples/cbio2-1/0005CBIO.pdf

51


Recommended