Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Date post: 26-Mar-2015
Transcript
Page 1: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Sequence motifs, information content, logos, and HMMs

Morten Nielsen, CBS, BioCentrum, DTU

Page 2: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Outline

Multiple alignments and sequence motifs
Weight matrices and consensus sequence
– Sequence weighting
– Low (pseudo) counts
Information content
– Sequence logos
– Mutual information
Example from the real world
HMMs and profile HMMs
– TMHMM (trans-membrane protein)
– Gene finding
Links to HMM packages

Page 3: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Multiple alignment and sequence motifs

Core
Consensus sequence
Weight matrices
Problems
– Sequence weights
– Low counts

----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********

Consensus: FVVEFDLPG

Page 4: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Sequence weighting 1 - Clustering

----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********

Homologous sequences (the first three rows): weight = 1/n (1/3)

Consensus sequence: YRQELDPLV

Previous: FVVEFDLPG
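The effect of cluster weights on the consensus can be sketched as follows. This is a minimal illustration, not the author's implementation: the 9-mer cores are read off the alignment, the names `CORES`, `WEIGHTS`, `weighted_counts`, and `consensus` are hypothetical, and ties are broken by first occurrence, which is an assumption. Several later columns contain ties, so the full consensus string can depend on the tie-breaking rule; only the first three positions are unambiguous.

```python
from collections import defaultdict

# Core 9-mers from the alignment; the first three come from homologous sequences.
CORES = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]
# One cluster of three sequences (weight 1/3 each); all others are singletons.
WEIGHTS = [1 / 3] * 3 + [1.0] * 11


def weighted_counts(seqs, weights, position):
    """Sum the sequence weights of each amino acid in one alignment column."""
    counts = defaultdict(float)
    for seq, w in zip(seqs, weights):
        counts[seq[position]] += w
    return dict(counts)


def consensus(seqs, weights):
    """Most heavily weighted amino acid per column (ties: first occurrence)."""
    result = []
    for pos in range(len(seqs[0])):
        counts = weighted_counts(seqs, weights, pos)
        result.append(max(counts, key=counts.get))
    return "".join(result)


print(consensus(CORES, WEIGHTS))
```

With the cluster down-weighted, F no longer dominates the first column: its three homologous copies together count as one sequence, while Y is supported by three independent sequences.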

Page 5: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Sequence weighting 2 - (Henikoff & Henikoff)

Sequence    W
FVVEADLPG   0.37
FVVEFALPG   0.43
FVVEFDLPG   0.32
YLQDSDPDS   0.59
MKQFINMWQ   0.90
LMQNYDPNL   0.68
PAERDSDVV   0.75
LKLTLTNLI   0.85
VWIEMQWCD   0.84
YRLRWDPRD   0.51
WRPDIVLEN   0.71
VLENNVDGV   0.59
YCNVLVSPD   0.71
FRSACSISV   0.75

• $w'_{aa} = 1/(r s)$
• r: number of different amino acids in a column
• s: number of occurrences of the amino acid in that column
• Normalize so that $\sum_{aa} w_{aa} = 1$ for each column
• Sequence weight is the sum of $w_{aa}$ over its columns

Example, first column: r = 7 (F, Y, M, L, P, V, W)
F: s=4, w'=1/28, w = 0.055
Y: s=3, w'=1/21, w = 0.073
M, P, W: s=1, w'=1/7, w = 0.218
L, V: s=2, w'=1/14, w = 0.109
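The per-column arithmetic of the worked example can be reproduced with a short sketch. The function names `column_weights` and `sequence_weights` are hypothetical, and the normalization (rescaling so the weights of the distinct amino acids in a column sum to 1) is inferred from the numbers in the example rather than stated in full on the slide. The per-sequence totals are not checked against the table here.

```python
from collections import Counter


def column_weights(column):
    """Position-based weights for one alignment column.

    w'_aa = 1 / (r * s), then rescaled so that the weights of the r distinct
    amino acids sum to 1 (normalization inferred from the worked example).
    """
    counts = Counter(column)
    r = len(counts)
    raw = {aa: 1.0 / (r * s) for aa, s in counts.items()}
    total = sum(raw.values())
    return {aa: w / total for aa, w in raw.items()}


def sequence_weights(seqs):
    """Sum each sequence's per-column weights over all columns."""
    weights = [0.0] * len(seqs)
    for pos in range(len(seqs[0])):
        col_w = column_weights([s[pos] for s in seqs])
        for k, s in enumerate(seqs):
            weights[k] += col_w[s[pos]]
    return weights


# First column of the 14 core peptides: F x4, Y x3, L x2, V x2, M, P, W x1.
first_column = "FFFYMLPLVYWVYF"
for aa, w in sorted(column_weights(first_column).items()):
    print(aa, round(w, 3))
```

Rare amino acids in a column receive large weights and frequent ones small weights, so sequences that repeat what many others already show contribute less to the final motif.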

Page 6: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Low count correction

--------MLEFVVEADLPGIKA--------
--------MLEFVVEFALPGIKA--------
--------MLEFVVEFDLPGIAA--------
-----------YLQDSDPDSFQD--------
-GSDTITLPCRMKQFINMWQE----------
-RNQEERLLADLMQNYDPNLR----------
-----YDPNLRPAERDSDVVNVSLK------
--------NVSLKLTLTNLISLNEREEA---
--EREEALTTNVWIEMQWCDYR---------
--------WCDYRLRWDPRDYEGLWVLR---
LWVLRVPSTMVWRPDIVLEN-----------
----------IVLENNVDGVFEVALYCNVL-
-----------YCNVLVSPDGCIYWLPPAIF
-------PPAIFRSACSISVTYFPFDW----
           *********
           P1

Limited number of data

Poor sampling of sequence space

I is not found at position P1. Does this mean that I is forbidden?

No! Use Blosum matrix to estimate pseudo frequency of I

Page 7: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Low count correction using Blosum matrices

# I L V

L 0.1154 0.3755 0.0962

V 0.1646 0.1303 0.2689

Blosum62 substitution frequencies

Every time, for instance, L or V is observed, I is also likely to occur

Estimate low (pseudo) count correction using this approach

As more data are included, the pseudo count correction becomes less important

$N_L = 2$, $N_V = 2$, $N_{eff} = 12$ => $f_I = (2 \cdot 0.1154 + 2 \cdot 0.1646)/12 = 0.05$

$p_I^* = (N_{eff} \cdot p_I + \beta \cdot f_I)/(N_{eff} + \beta) = (12 \cdot 0 + 10 \cdot 0.05)/(12 + 10) = 0.02$, where $\beta = 10$ is the weight on the pseudo counts
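The two-step calculation above can be written out as a small sketch. The function names and the dictionary `Q_I_GIVEN` are hypothetical; the numbers are the ones on the slide, and as on the slide only the L and V terms are included in the pseudo frequency.

```python
# Blosum62 substitution frequencies for I given L or V, from the table above.
Q_I_GIVEN = {"L": 0.1154, "V": 0.1646}


def pseudo_frequency(observed_counts, q_given, n_eff):
    """f_I = (sum over observed amino acids of count * q(I | aa)) / N_eff."""
    return sum(n * q_given[aa] for aa, n in observed_counts.items()) / n_eff


def corrected_frequency(p_obs, f_pseudo, n_eff, beta):
    """p* = (N_eff * p + beta * f) / (N_eff + beta)."""
    return (n_eff * p_obs + beta * f_pseudo) / (n_eff + beta)


f_I = pseudo_frequency({"L": 2, "V": 2}, Q_I_GIVEN, n_eff=12)
p_I = corrected_frequency(p_obs=0.0, f_pseudo=f_I, n_eff=12, beta=10)
print(round(f_I, 2), round(p_I, 2))  # 0.05 0.02
```

As `n_eff` grows relative to `beta`, the corrected frequency approaches the observed one, which is the sense in which the correction becomes less important with more data.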

Page 8: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Information content

Information and entropy
– Conserved amino acid regions contain a high degree of information (high order == low entropy)
– Variable amino acid regions contain a low degree of information (low order == high entropy)

Shannon information

$D = \log_2(N) + \sum_i p_i \log_2 p_i$ (for proteins N = 20, for DNA N = 4)

Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, $D = \log_2(N)$ (= 4.3 for proteins)

Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., $D = 0$
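The two limiting cases can be checked numerically with a few lines. This is a sketch; the function name `information_content` is hypothetical, and terms with $p_i = 0$ are taken to contribute zero, which is the usual convention.

```python
import math


def information_content(probs, alphabet_size=20):
    """D = log2(N) + sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0."""
    entropy_term = sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(alphabet_size) + entropy_term


conserved = [1.0] + [0.0] * 19   # one amino acid always present
variable = [0.05] * 20           # all amino acids equally likely

print(round(information_content(conserved), 1))  # 4.3
print(abs(information_content(variable)) < 1e-9)  # True
```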

Page 9: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Sequence logo

Height of a column is equal to D

Relative height of a letter is $p_A$

Highly useful tool to visualize sequence motifs

High information position

MHC class II

Logo from 10 sequences

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Page 10: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Frequency matrix

Frequencies x 100

A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
2  1  1  1  1  1  1  1  1  4  16 1  6  15 7  1  2  7  18 13
8  19 1  1  7  2  2  2  1  3  15 13 6  2  1  2  2  7  1  8
3  2  7  2  1  17 13 2  1  8  14 3  1  1  7  7  2  0  1  8
8  13 13 14 1  2  13 2  1  2  3  3  1  7  1  3  7  0  1  7
4  1  7  7  7  1  2  2  1  13 15 2  6  6  1  7  2  7  7  4
5  2  8  23 1  6  3  2  1  3  3  2  1  1  1  13 8  0  1  18
2  1  7  13 1  1  2  2  1  8  14 2  6  1  20 7  2  7  1  3
3  7  7  8  7  1  7  8  1  2  8  2  1  1  13 7  2  7  1  7
3  2  7  19 1  6  2  8  1  9  9  2  1  1  1  7  2  0  1  18
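To connect the frequency matrix to a logo, the sketch below turns the first row of the matrix into one logo column: the column height is D and each letter takes the share $p_{aa} \cdot D$. The names `ROW_1` and `logo_heights` are hypothetical, and renormalizing the row (because rounded frequencies need not sum to exactly 100) is an assumption.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# First row of the frequency matrix (frequencies x 100).
ROW_1 = [2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 16, 1, 6, 15, 7, 1, 2, 7, 18, 13]


def logo_heights(freqs_x100):
    """Return (D, {aa: height}) with height = p_aa * D for one logo column."""
    total = sum(freqs_x100)
    probs = [f / total for f in freqs_x100]  # renormalize against rounding
    d = math.log2(len(probs)) + sum(p * math.log2(p) for p in probs if p > 0)
    return d, {aa: p * d for aa, p in zip(AMINO_ACIDS, probs)}


d, heights = logo_heights(ROW_1)
tallest = max(heights, key=heights.get)
print(round(d, 2), tallest)
```

The letter heights always add up to the column height, so a logo shows both how conserved a position is and which amino acids account for that conservation.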

Page 11: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

More on Logos

Information content

$D = \sum_i p_i \log_2 (p_i/q_i)$

Shannon, $q_i = 1/N = 0.05$

$D = \sum_i p_i \log_2 (p_i) - \sum_i p_i \log_2 (1/N)$

$= \log_2 N + \sum_i p_i \log_2 (p_i)$

Kullback-Leibler, $q_i$ = background frequency
– V/L/A are more frequent than, for instance, C/H/W
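That the Shannon form is the special case of a uniform background can be confirmed numerically. This is a sketch with hypothetical function names and a toy four-letter distribution chosen only so the arithmetic is easy to follow.

```python
import math


def relative_entropy(p, q):
    """D = sum_i p_i * log2(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def shannon_information(p):
    """Special case q_i = 1/N: D = log2(N) + sum_i p_i * log2(p_i)."""
    n = len(p)
    return math.log2(n) + sum(pi * math.log2(pi) for pi in p if pi > 0)


p = [0.5, 0.25, 0.125, 0.125]   # toy column over a 4-letter alphabet
uniform = [0.25] * 4

print(relative_entropy(p, uniform), shannon_information(p))
print(relative_entropy(p, p))   # a column equal to the background carries no information
```

With a non-uniform background, a column that merely reflects the background composition scores zero, so frequent amino acids such as V, L, or A no longer look informative just because they are common.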

Page 12: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Mutual information

$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log [ P(aa_i, aa_j) / (P(aa_i) P(aa_j)) ]$

P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44, ..
P(G1,V6) = 2/9 = 0.22
P(G1)*P(V6) = 8/81 = 0.10

log(0.22/0.10) > 0

P1   P6
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
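The counts in the example, and the full sum over all amino acid pairs, can be reproduced from the nine peptides. This is a sketch: the function name `mutual_information` is hypothetical, probabilities are raw frequencies without weighting or pseudo counts, and base-2 logarithms are an assumption since the slide only writes "log".

```python
import math
from collections import Counter

PEPTIDES = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]


def mutual_information(seqs, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log2(P(a,b) / (P(a) * P(b))).

    i and j are 0-based column indices.
    """
    n = len(seqs)
    p_i = Counter(s[i] for s in seqs)
    p_j = Counter(s[j] for s in seqs)
    p_ij = Counter((s[i], s[j]) for s in seqs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


# P1 and P6 are columns 0 and 5 in 0-based indexing.
print(mutual_information(PEPTIDES, 0, 5))
```

With only nine sequences almost any pair of columns shows some apparent mutual information, so values from such a small sample should be read with caution.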

Page 13: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Mutual information 

[Figure: mutual information, two panels: 313 binding peptides; 313 random peptides]

Page 14: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Weight matrices

Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts

Now a weight matrix is given as

$W_{ij} = \log(p_{ij}/q_j)$

Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.

W is an L x 20 matrix, where L is the motif length

Score a sequence against the weight matrix by looking up and adding L values from the matrix
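Building and applying a weight matrix can be sketched as below. The function names are hypothetical, and the two-letter alphabet with made-up frequencies is a toy stand-in for the 20 amino acids; the frequencies are assumed to be strictly positive, as they would be after pseudo count correction.

```python
import math


def weight_matrix(freqs, background):
    """W[i][aa] = log(p_i(aa) / q(aa)) for each motif position i."""
    return [
        {aa: math.log(p / background[aa]) for aa, p in column.items()}
        for column in freqs
    ]


def score(peptide, matrix):
    """Look up one value per position and add them up."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))


# Toy two-letter alphabet and a motif of length 2 (illustrative numbers only).
background = {"A": 0.5, "L": 0.5}
freqs = [{"A": 0.9, "L": 0.1}, {"A": 0.2, "L": 0.8}]
W = weight_matrix(freqs, background)

print(score("AL", W), score("LA", W))
```

A positive score means the peptide is more likely under the motif than under the background, a negative score the opposite.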

Page 15: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Example from real life

10 peptides from MHCpep database

Bind to the MHC complex

Relevant for immune system recognition

Estimate sequence motif and weight matrix

Evaluate on 528 peptides

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Page 16: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Example from real life (cont.)

Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45

Sequence weighting
– No pseudo count
– Prediction accuracy 0.5

Page 17: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Example from real life (cont.)

Sequence weighting and pseudo count

– Prediction accuracy 0.60

Motif found on all data (485)

– Prediction accuracy 0.79

Page 18: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Hidden Markov Models

Weight matrices do not deal with insertions and deletions

In alignments, this is done in an ad hoc manner by optimizing the two gap penalties, for the first gap (gap opening) and for gap extension

An HMM is a natural framework where insertions and deletions are dealt with explicitly

Page 19: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

HMM (a simple example)

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Example from A. Krogh

Core region defines the number of states in the HMM (red)

Insertion and deletion statistics are derived from the non-core part of the alignment (blue)

Core of alignment

Page 20: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

HMM construction

[Figure: HMM state diagram with emission probabilities over A, C, G, T and transition probabilities]

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

• 5 matches: A, 2xC, T, G
• 5 transitions in gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5

ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2

Page 21: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Align sequence to HMM

ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2

TCAACTATC: 0.2 x 1 x 0.8 x 1 x 0.8 x 0.6 x 0.2 x 0.4 x 0.4 x 0.4 x 0.2 x 0.6 x 1 x 1 x 0.8 x 1 x 0.8 = 0.0075 x 10^-2

ACAC--AGC = 1.2 x 10^-2

AGA---ATC = 3.3 x 10^-2

ACCG--ATC = 0.59 x 10^-2

Consensus:

ACAC--ATC = 4.7 x 10^-2
ACA---ATC = 13.1 x 10^-2

Exceptional:

TGCT--AGG = 0.0023 x 10^-2
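The probabilities above can be reproduced by estimating the model directly from the five aligned sequences and multiplying emission and transition probabilities along each sequence's path. This is a minimal sketch under stated assumptions: columns 1-3 and 7-9 are treated as match states, columns 4-6 as a single insert state between the third and fourth match state, no pseudo counts are used, and delete states are left out because the core has no gaps. The names `estimate` and `path_probability` are hypothetical.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
MATCH_COLS = [0, 1, 2, 6, 7, 8]   # core columns (0-based)
INSERT_COLS = [3, 4, 5]           # gap region between match states 3 and 4


def inserted(aligned):
    """Residues a sequence places in the insert region (gaps removed)."""
    return "".join(aligned[c] for c in INSERT_COLS).replace("-", "")


def estimate(alignment):
    """Estimate emissions and transitions by plain counting (no pseudo counts)."""
    n = len(alignment)
    match_emit = []
    for col in MATCH_COLS:
        counts = Counter(seq[col] for seq in alignment)
        match_emit.append({base: c / n for base, c in counts.items()})
    inserts = [inserted(seq) for seq in alignment]
    ins_counts = Counter("".join(inserts))
    total = sum(ins_counts.values())               # 5 inserted residues
    insert_emit = {base: c / total for base, c in ins_counts.items()}
    enter = sum(1 for s in inserts if s) / n       # 3 of 5 sequences insert
    stay = sum(len(s) - 1 for s in inserts if s) / total  # 2 of 5 transitions
    return match_emit, insert_emit, enter, stay


def path_probability(aligned, model):
    """Product of emissions and transitions; transitions not listed are 1."""
    match_emit, insert_emit, enter, stay = model
    ins = inserted(aligned)
    prob = 1.0
    for k, col in enumerate(MATCH_COLS):
        prob *= match_emit[k].get(aligned[col], 0.0)
        if k == 2:  # leaving the third match state
            if not ins:
                prob *= 1 - enter
                continue
            prob *= enter
            for i, base in enumerate(ins):
                prob *= insert_emit.get(base, 0.0)
                prob *= stay if i < len(ins) - 1 else 1 - stay
    return prob


model = estimate(ALIGNMENT)
for seq in ALIGNMENT + ["ACAC--ATC", "ACA---ATC", "TGCT--AGG"]:
    print(seq, path_probability(seq, model))
```

Because no pseudo counts are applied, any residue never seen in a given state gets probability zero, which is exactly the low count problem discussed for weight matrices.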

Page 22: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Align sequence to HMM - Null model

Score depends strongly on length

Null model is a random model. For length L the score is

$0.25^L$

Log-odds score for sequence S

$\log( P(S)/0.25^L )$

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6

Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3

Exceptional:
TGCT--AGG = -0.97

Note!
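The length correction is a one-line calculation. In this sketch the function name `log_odds` is hypothetical, the path probabilities are the unrounded products from the previous slide, and the natural logarithm is used because it reproduces the listed scores.

```python
import math


def log_odds(prob, length, null_per_base=0.25):
    """ln(P(S) / 0.25**L), where L counts the residues in S (gaps excluded)."""
    return math.log(prob / null_per_base ** length)


print(round(log_odds(0.032768, 6), 1))   # ACA---ATG, L = 6 -> 4.9
print(round(log_odds(0.04718592, 7), 1))  # ACAC--ATC, L = 7 -> 6.7
print(round(log_odds(2.304e-5, 7), 2))   # TGCT--AGG, L = 7 -> -0.97
```

After the correction the longer consensus ACAC--ATC scores above ACA---ATC, although its raw probability is lower, and the exceptional sequence falls below zero, meaning the random model explains it better than the HMM.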

Page 23: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

HMMs and weight matrices

Note: for un-gapped alignments, HMMs reduce to simple weight matrices

It might still be useful to use an HMM tool package to estimate a weight matrix
– Sequence weighting
– Pseudo counts

Page 24: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Profile HMMs

EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------
NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF

EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE

EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R

EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
KAPB_MOUSE EKCGKEFCEF---------------------
NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR

Insertion

Deletion

Page 25: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Profile HMMs

All M/D pairs must be visited once

Page 26: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

TMHMM (trans-membrane HMM)
(Sonnhammer, von Heijne, and Krogh)

Model TM length distribution.
Power of HMM.
Difficult in alignment.

Page 27: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Combination of HMMs - Gene finding

x cccxxxxxxxxATGccc cccTAAxxxxxxxx

Inter-genic region
Region around start codon
Coding region
Region around stop codon
Start codon
Stop codon

Page 28: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

HMM packages

HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.

SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.

META-MEME (http://metameme.sdsc.edu/)
– William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.

NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.

Page 29: Sequence motifs, information content, logos, and HMMs Morten Nielsen, CBS, BioCentrum, DTU.

Simple HMMER command

hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.2g (August 2001)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: A2.fsa
File format: a2m
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: Fast/ad hoc (gapmax 0.0)
Null model used: (default)
Sequence weighting method: G/S/C tree weights
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment: #1
Number of sequences: 232
Number of columns: 9
Determining effective sequence number ... done. [192]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [A2.fasta]
Constructed a profile HMM (length 9)
Average score: -6.42 bits
Minimum score: -15.47 bits
Maximum score: -0.84 bits
Std. deviation: 2.72 bits

>HLA-A.0201 16 Example_for_Ligand
SLLPAIVEL
>HLA-A.0201 16 Example_for_Ligand
YLLPAIVHI
>HLA-A.0201 16 Example_for_Ligand
TLWVDPYEV
>HLA-A.0201 16 Example_for_Ligand
SXPSGGXGV
>HLA-A.0201 16 Example_for_Ligand
GLVPFLVSV

