
Hidden Markov Models in Bioinformatics

Date post: 23-Jan-2016
Upload: aimee
Transcript
Page 1: Hidden Markov Models in Bioinformatics

Hidden Markov Models in Bioinformatics

Example Domains: Gene Finding & Protein Family Modeling

Page 2: Hidden Markov Models in Bioinformatics

5 Second Overview

Today’s goal: Introduce HMMs as general tools in bioinformatics

I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain

I will use the problem of Protein Family Modeling as an example of a clever way to fit HMMs to a problem

Page 3: Hidden Markov Models in Bioinformatics

Learning Objectives

When I’m done you should know:

1. When is an HMM a good fit for a problem space?

2. What materials are needed before work can begin with an HMM?

3. What are the advantages and disadvantages of using HMMs?

Page 4: Hidden Markov Models in Bioinformatics

Outline

HMMs as Statistical Models
The example tasks at a glance
Good problems for HMMs
HMM Advantages
HMM Disadvantages
Gene Finding Examples

Page 5: Hidden Markov Models in Bioinformatics

Statistical Models

Definition: Any mathematical construct that attempts to parameterize a random process

Example: A normal distribution
  Assumptions
  Parameters
  Estimation
  Usage
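
The four aspects listed for the normal distribution can be sketched in a few lines. This is a minimal illustration only; the sample values are made up, and `normal_pdf` is a helper written for this example.

```python
import math
import statistics

# Assumption: the (hypothetical) samples are independent draws from one normal distribution.
samples = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7]

# Parameters and estimation: fit the mean and standard deviation to the data.
mu = statistics.fmean(samples)
sigma = statistics.pstdev(samples)  # maximum-likelihood (population) estimate

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with the given parameters."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Usage: score new observations under the fitted model.
typical = normal_pdf(5.0, mu, sigma)
unusual = normal_pdf(7.0, mu, sigma)
print(f"mu={mu:.3f} sigma={sigma:.3f} typical={typical:.4f} unusual={unusual:.2e}")
```

An HMM follows the same pattern, only with more parameters and a more involved estimation step.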

HMMs are just a little more complicated…

Page 6: Hidden Markov Models in Bioinformatics

HMM Assumptions

Observations are ordered

Random process can be represented by a stochastic finite state machine with emitting states.

Page 7: Hidden Markov Models in Bioinformatics

HMM Parameters

Using weather example: modeling daily weather for a year
  Ra Ra Su Su Su Ra ...

Lots of parameters
  One for each table entry

Represented in two tables
  One for emissions
  One for transitions
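
The two tables might look like the sketch below. The hidden state names ("wet", "dry") and every number are illustrative assumptions, not values from the slides; only the observed symbols Ra and Su come from the weather example.

```python
# Hypothetical two-state weather HMM (state names and numbers are assumptions).
states = ["wet", "dry"]
symbols = ["Ra", "Su"]  # observed daily weather: rain or sun

initial = {"wet": 0.5, "dry": 0.5}

# Transition table: one row per current state, one column per next state.
transitions = {
    "wet": {"wet": 0.8, "dry": 0.2},
    "dry": {"wet": 0.3, "dry": 0.7},
}

# Emission table: one row per state, one column per observable symbol.
emissions = {
    "wet": {"Ra": 0.7, "Su": 0.3},
    "dry": {"Ra": 0.1, "Su": 0.9},
}

def is_stochastic(table):
    """Check that every row of a probability table sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())

def count_free_parameters(n_states, n_symbols):
    """Each row sums to 1, so it has one fewer free parameter than entries."""
    initial_params = n_states - 1
    transition_params = n_states * (n_states - 1)
    emission_params = n_states * (n_symbols - 1)
    return initial_params + transition_params + emission_params

print(is_stochastic(transitions), is_stochastic(emissions))
print(count_free_parameters(len(states), len(symbols)))
```

Because every row must sum to 1, the number of free parameters is slightly smaller than the number of table entries, but it still grows quickly with the number of states.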

Page 8: Hidden Markov Models in Bioinformatics

HMM Estimation

Called training, it falls under machine learning

Feed an architecture (given in advance) a set of observation sequences

The training process will iteratively alter the model's parameters to fit the training set

The trained model will assign the training sequences high probability
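
The slides do not name a training algorithm. One common choice for unlabeled sequences is Baum-Welch (an expectation-maximization procedure), sketched below under several assumptions: a fixed two-state architecture, observations encoded as integers (0 = Ra, 1 = Su), made-up starting parameters, and sequences short enough that no numerical scaling is needed.

```python
import math

def forward(obs, init, trans, emit):
    """alpha[t][i] = P(obs[0..t], state at t is i)."""
    n = len(init)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                      for j in range(n)])
    return alpha

def backward(obs, trans, emit):
    """beta[t][i] = P(obs[t+1..] | state at t is i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for t in range(len(obs) - 1, 0, -1):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][obs[t]] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def total_log_likelihood(sequences, init, trans, emit):
    return sum(math.log(sum(forward(obs, init, trans, emit)[-1])) for obs in sequences)

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def baum_welch_step(sequences, init, trans, emit):
    """One re-estimation pass; assumes every sequence has length >= 2."""
    n, m = len(init), len(emit[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    for obs in sequences:
        alpha = forward(obs, init, trans, emit)
        beta = backward(obs, trans, emit)
        total = sum(alpha[-1])
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / total  # P(state i at t | obs)
                if t == 0:
                    init_acc[i] += gamma
                emit_acc[i][obs[t]] += gamma
                if t + 1 < len(obs):
                    for j in range(n):
                        # Expected number of i -> j transitions at time t.
                        trans_acc[i][j] += (alpha[t][i] * trans[i][j]
                                            * emit[j][obs[t + 1]] * beta[t + 1][j] / total)
    return (normalize(init_acc),
            [normalize(row) for row in trans_acc],
            [normalize(row) for row in emit_acc])

# Hypothetical training set and starting parameters.
sequences = [[0, 0, 1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0, 1, 1]]
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.6, 0.4], [0.3, 0.7]]

history = [total_log_likelihood(sequences, init, trans, emit)]
for _ in range(10):
    init, trans, emit = baum_welch_step(sequences, init, trans, emit)
    history.append(total_log_likelihood(sequences, init, trans, emit))
print(history[0], history[-1])
```

Each pass should leave the log-likelihood of the training set no lower than before, which is the sense in which the trained model assigns the training sequences high probability.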

Page 9: Hidden Markov Models in Bioinformatics

HMM Usage

Two major tasks
  Evaluate the probability of an observation sequence given the model (Forward)
  Find the most likely path through the model for a given observation sequence (Viterbi)
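
Both tasks can be sketched for a small model. The weather model below (state names "wet" and "dry", all probabilities) is a hypothetical example, and the functions use plain probabilities rather than logs, which is only safe for short sequences.

```python
# Hypothetical two-state weather model (all names and numbers are assumptions).
states = ["wet", "dry"]
init = {"wet": 0.5, "dry": 0.5}
trans = {"wet": {"wet": 0.8, "dry": 0.2}, "dry": {"wet": 0.3, "dry": 0.7}}
emit = {"wet": {"Ra": 0.7, "Su": 0.3}, "dry": {"Ra": 0.1, "Su": 0.9}}

def forward_probability(obs, states, init, trans, emit):
    """P(obs | model), summing over every path through the model."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return sum(alpha.values())

def viterbi_path(obs, states, init, trans, emit):
    """Most likely state path for obs, and the joint probability of that path."""
    best = {s: init[s] * emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at time t + 1
    for symbol in obs[1:]:
        pointers, scores = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] * trans[r][s])
            pointers[s] = prev
            scores[s] = best[prev] * trans[prev][s] * emit[s][symbol]
        back.append(pointers)
        best = scores
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

observations = ["Ra", "Ra", "Su", "Su"]
path, path_prob = viterbi_path(observations, states, init, trans, emit)
total_prob = forward_probability(observations, states, init, trans, emit)
print(path, path_prob, total_prob)
```

Forward gives a score for the whole sequence, while Viterbi gives a label for each observation; the single best path can never be more probable than the sum over all paths.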

Page 10: Hidden Markov Models in Bioinformatics

Gene Finding (An Ideal HMM Domain)

Our Objective:
  To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

Our Motivation:
  Assist in the annotation of genomic data produced by genome sequencing methods
  Gain insight into the mechanisms involved in transcription, splicing and other processes

Page 11: Hidden Markov Models in Bioinformatics

Gene Finding Terminology

A string of DNA nucleotides containing a gene will have separate regions (lines):
  Introns – non-coding regions within a gene
  Exons – coding regions

Separated by functional sites (boxes)
  Start and stop codons
  Splice sites – acceptors and donors

Page 12: Hidden Markov Models in Bioinformatics

Gene Finding Challenges

Need the correct reading frame
  Introns can interrupt an exon in mid-codon

There is no hard and fast rule for identifying donor and acceptor splice sites
  Signals are very weak

Page 13: Hidden Markov Models in Bioinformatics

Protein Family Modeling (A clever fit of HMMs)

I have a protein sequence.

What family is it in?

Can you give me a quick alignment to the other members of the family?

These amino acids here, do they match the family's consensus positions, or are they inserts?

Page 14: Hidden Markov Models in Bioinformatics

Profile HMM

Square: Match (consensus) state
Diamond: Insert state – notice the loops
Circle: Delete state – allows you to “jump” a match
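
The match / insert / delete layout can be written down as a table of allowed transitions. The sketch below assumes the commonly used profile HMM topology (each column has one match, one insert and one delete state, plus Begin and End); the original diagram is not part of this transcript, so the state naming here is hypothetical.

```python
def profile_hmm_topology(length):
    """Return {state: [allowed successors]} for a profile with `length` match columns."""
    def successors(k):
        # From any state in column k: stay in the insert state of column k,
        # or advance to the match or delete state of column k + 1.
        nxt = [f"I{k}"]
        if k < length:
            nxt += [f"M{k + 1}", f"D{k + 1}"]
        else:
            nxt.append("End")
        return nxt

    topology = {"Begin": successors(0), "I0": successors(0)}
    for k in range(1, length + 1):
        for kind in ("M", "I", "D"):
            topology[f"{kind}{k}"] = successors(k)
    topology["End"] = []
    return topology

topology = profile_hmm_topology(3)
for state, nxt in topology.items():
    print(state, "->", nxt)
```

Insert states list themselves as successors, which is the loop in the diagram, and a path such as M1, D2, M3 shows how a delete state skips match column 2 without emitting a symbol.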

Page 15: Hidden Markov Models in Bioinformatics

What makes a good HMM problem space?

Characteristics: Classification problems

There are two main types of output from an HMM:
  Scoring of sequences (Protein family modeling)
  Labeling of observations within a sequence (Gene Finding)

Page 16: Hidden Markov Models in Bioinformatics

HMM Problem Characteristics Continued

The observations in a sequence should have a clear and meaningful order
  Unordered observations will not map easily to states

It's beneficial, but not necessary, for the observations to follow some sort of grammar
  Makes it easier to design an architecture
  Gene Finding
  Protein Family Modeling

Page 17: Hidden Markov Models in Bioinformatics

HMM Requirements

So you’ve decided you want to build an HMM. Here’s what you need:

An architecture
  Probably the hardest part
  Should be biologically sound & easy to interpret

A well-defined success measure
  Necessary for any form of machine learning

Page 18: Hidden Markov Models in Bioinformatics

HMM Requirements Continued

Training data
  Labeled or unlabeled – it depends

You do not always need a labeled training set to do observation labeling, but it helps

Amount of training data needed is:
  Directly proportional to the number of free parameters in the model
  Inversely proportional to the size of the training sequences

Page 19: Hidden Markov Models in Bioinformatics

Why HMMs might be a good fit for Gene Finding

Classification: Classifying observations within a sequence

Order: A DNA sequence is a set of ordered observations

Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here:

Success measure: # of complete exons correctly labeled

Training data: Available from various genome annotation projects
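
The success measure above, the number of complete exons correctly labeled, could be computed along the following lines. The one-letter labels ("E" for exon, "N" for everything else) and the example strings are assumptions made for this sketch; an exon counts as correct only if both of its boundaries match exactly.

```python
def exon_intervals(labels, exon_label="E"):
    """Return (start, end) index pairs, end exclusive, for each run of exon labels."""
    intervals, start = [], None
    for i, label in enumerate(labels):
        if label == exon_label and start is None:
            start = i
        elif label != exon_label and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(labels)))
    return intervals

def exon_level_score(true_labels, predicted_labels):
    """Count exons whose start and end both match, with recall and precision."""
    true_exons = set(exon_intervals(true_labels))
    predicted_exons = set(exon_intervals(predicted_labels))
    correct = len(true_exons & predicted_exons)
    recall = correct / len(true_exons) if true_exons else 0.0
    precision = correct / len(predicted_exons) if predicted_exons else 0.0
    return correct, recall, precision

# Hypothetical annotation and prediction for a 14-nucleotide stretch.
truth = "NNEEEENNNEEENN"
predicted = "NNEEEENNNNEENN"
print(exon_level_score(truth, predicted))
```

Scoring whole exons is stricter than scoring individual nucleotides: the second predicted exon overlaps the true one but misses its start, so it earns no credit.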

Page 20: Hidden Markov Models in Bioinformatics

Why HMMs can be made to fit Protein Family Modeling

Classification: What model fits a sequence best?

Order: An amino acid sequence is well ordered

Grammar: Any two matches can be separated by a series of inserts and deletes… okay, maybe the word “grammar” is a bit of a stretch

Success Measure: How many sequences can we correctly label after training?

Page 21: Hidden Markov Models in Bioinformatics

HMM Advantages

Statistical Grounding
  Statisticians are comfortable with the theory behind hidden Markov models
  Freedom to manipulate the training and verification processes
  Mathematical / theoretical analysis of the results and processes

HMMs are still very powerful modeling tools – far more powerful than many statistical methods

Page 22: Hidden Markov Models in Bioinformatics

HMM Advantages continued

Modularity
  HMMs can be combined into larger HMMs

Transparency of the Model
  Assuming an architecture with a good design
  People can read the model and make sense of it
  The model itself can help increase understanding
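
As a rough illustration of modularity, two separately built transition tables can be joined by adding one linking transition. The helper `chain_models`, the state names and the probabilities are all hypothetical, and the toy exon model only lets an intron begin after a complete codon, which real gene finders cannot assume.

```python
def chain_models(first, second, exit_state, entry_state, link_prob):
    """Join two transition tables with disjoint state names into one table.

    A new transition exit_state -> entry_state gets probability link_prob; the
    existing transitions out of exit_state are scaled down so the row still sums to 1.
    """
    combined = {}
    for state, row in first.items():
        scale = 1.0 - link_prob if state == exit_state else 1.0
        combined[state] = {nxt: p * scale for nxt, p in row.items()}
    combined[exit_state][entry_state] = link_prob
    for state, row in second.items():
        combined[state] = dict(row)
    return combined

# Toy sub-models: a three-state codon cycle and a single looping intron state.
exon_model = {"E1": {"E2": 1.0}, "E2": {"E3": 1.0}, "E3": {"E1": 1.0}}
intron_model = {"I": {"I": 1.0}}

gene_model = chain_models(exon_model, intron_model, "E3", "I", 0.1)
for state, row in gene_model.items():
    print(state, row)
```

Each sub-model stays readable on its own, which is also what keeps the combined model transparent.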

Page 23: Hidden Markov Models in Bioinformatics

HMM Advantages continued

Incorporation of Prior Knowledge

Incorporate prior knowledge into the architecture

Initialize the model close to something believed to be correct

Use prior knowledge to constrain training process
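
One simple way to let prior knowledge constrain training is to start every count at a pseudocount before adding what is seen in labeled data. The sketch below estimates an emission table this way; the state labels ("C" for coding, "N" for non-coding), the pseudocount values and the tiny training set are all hypothetical.

```python
def estimate_emissions(labeled_pairs, states, symbols, pseudocounts):
    """Estimate P(symbol | state) from (symbol, state) pairs plus prior pseudocounts."""
    counts = {s: dict(pseudocounts[s]) for s in states}
    for symbol, state in labeled_pairs:
        counts[state][symbol] += 1
    table = {}
    for s in states:
        total = sum(counts[s].values())
        table[s] = {x: counts[s][x] / total for x in symbols}
    return table

states = ["C", "N"]
symbols = ["A", "C", "G", "T"]

# Prior belief expressed as counts that are "seen" before any data arrives.
pseudocounts = {
    "C": {"A": 1, "C": 2, "G": 2, "T": 1},
    "N": {"A": 2, "C": 1, "G": 1, "T": 2},
}

# Tiny labeled training set: six coding and four non-coding nucleotides.
labeled_pairs = list(zip("ATGGCC", "CCCCCC")) + list(zip("ATAT", "NNNN"))

emissions = estimate_emissions(labeled_pairs, states, symbols, pseudocounts)
print(emissions)
```

Symbols that never appear in the training data for a state still receive a non-zero probability, so a small training set cannot rule them out entirely.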

Page 24: Hidden Markov Models in Bioinformatics

How does Gene Finding make use of HMM advantages?

Statistics:
  Many systems alter the training process to better suit their success measure

Modularity:
  Almost all systems use a combination of models, each individually trained for each gene region

Prior Knowledge:
  A fair amount of prior biological knowledge is built into each architecture

Page 25: Hidden Markov Models in Bioinformatics

HMM Disadvantages

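Markov Chains
  States are supposed to be independent
  P(y) must be independent of P(x), and vice versa: what happens at one position should not depend on a distant position
  This usually isn’t true
  Can get around it when relationships are local
  Not good for RNA folding problems, where distant positions pair with each other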

P(x) … P(y)

Page 26: Hidden Markov Models in Bioinformatics

HMM Disadvantages continued

…and then there are the standard Machine Learning Problems

Watch out for local maxima
  Model may not converge to a truly optimal parameter set for a given training set

Avoid over-fitting
  You’re only as good as your training set
  More training is not always good

Page 27: Hidden Markov Models in Bioinformatics

HMM Disadvantages continued

Speed!!!

Almost everything one does in an HMM involves: “enumerating all possible paths through the model”

There are efficient ways to do this

Still slow in comparison to other methods
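
The difference between literally enumerating paths and the efficient alternative can be seen on a small model. The sketch below compares a brute-force sum over every state path with the forward recursion; the two-state weather model and its numbers are hypothetical.

```python
from itertools import product

# Hypothetical two-state weather model (all names and numbers are assumptions).
states = ["wet", "dry"]
init = {"wet": 0.5, "dry": 0.5}
trans = {"wet": {"wet": 0.8, "dry": 0.2}, "dry": {"wet": 0.3, "dry": 0.7}}
emit = {"wet": {"Ra": 0.7, "Su": 0.3}, "dry": {"Ra": 0.1, "Su": 0.9}}

def brute_force_probability(obs):
    """Sum the joint probability of obs over every possible state path."""
    total = 0.0
    for path in product(states, repeat=len(obs)):  # len(states) ** len(obs) paths
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

def forward_probability(obs):
    """Same quantity by dynamic programming: about len(obs) * len(states) ** 2 steps."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return sum(alpha.values())

observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra", "Su", "Ra"]
slow = brute_force_probability(observations)
fast = forward_probability(observations)
print(len(states) ** len(observations), "paths;", slow, fast)
```

Both functions return the same probability, but the number of paths doubles with every extra observation in this two-state model, while the forward recursion grows only linearly with sequence length. Even so, the per-position cost grows with the square of the number of states, which is where large models become slow.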

Page 28: Hidden Markov Models in Bioinformatics

HMM Gene Finders: VEIL

A straight HMM Gene Finder

Takes advantage of grammatical structure and modular design

Uses many states that can only emit one symbol to get around state independence

Page 29: Hidden Markov Models in Bioinformatics

HMM Gene Finders: HMMGene

Uses an extended HMM called a CHMM
  CHMM = HMM with classes

Takes full advantage of being able to modify the statistical algorithms

Uses high-order states

Trains everything at once

Page 30: Hidden Markov Models in Bioinformatics

HMM Gene Finders: Genie

Uses a generalized HMM (GHMM)
  Edges in model are complete HMMs
  States can be any arbitrary program
  States are actually neural networks specially designed for signal finding

Page 31: Hidden Markov Models in Bioinformatics

Conclusions

HMMs have problems where they excel, and problems where they do not

You should consider using one if:
  Problem can be phrased as classification
  Observations are ordered
  The observations follow some sort of grammatical structure (optional)

Page 32: Hidden Markov Models in Bioinformatics

Conclusions

Advantages:
  Statistics
  Modularity
  Transparency
  Prior Knowledge

Disadvantages:
  State independence
  Over-fitting
  Local maxima
  Speed

Page 33: Hidden Markov Models in Bioinformatics

Some final words…

Lots of problems can be phrased as classification problems

Homology search
  Build a model of the sequence with a few close homologs, and use the model to search for more distant homologs

Sequence alignment
  Align all of these sequences to the model that represents their family

Page 34: Hidden Markov Models in Bioinformatics

Questions

Any Questions?

