Error Statistics of Hidden Markov Model and Hidden Boltzmann Model Results

Date post: 17-Jun-2015
Author: yaser-sulaiman
Description:
A presentation about the paper titled "Error statistics of hidden Markov model and hidden Boltzmann model results" by Lee A Newberg. The paper is available at http://www.biomedcentral.com/1471-2105/10/212
Transcript
  • 1. Error Statistics of Hidden Markov Model and Hidden Boltzmann Model Results
    A paper by Lee A Newberg
    Presented by Yaser Sulaiman
    1

2. I'm a computer scientist
2
3. who recently got interested in bioinformatics
3
4. a different flavor of probability theory & stochastic processes
4
5. HMMs in computer science
5
6. temporal pattern recognition
6
7. speech recognition
7
8. handwriting recognition
8
9. bioinformatics
9
10. 10
photo by John A Burnett
11. bioinformatics in 5 minutes
11
12. 12
13. biological sequences
13
14. DNA
{A,T,C,G}

14
15. 15
stolen from Iowa State University
16. RNA
{A,U,C,G}

16
17. proteins
{A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}

17
18. 18
stolen from Wikipedia
19. sequence comparison
19
20. @ the heart of bioinformatics
20
21. why?
21
22. 22
23. not to mention evolution
23
24. sequence alignment
24
25. find optimal alignment
25
26. according to a scoring function
26
27. align AACGT and AACT
to max. identities
27
28. AACGT
||  |
AA-CT
28
29. AACGT
||| |
AAC-T
29
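The comparison between the alignments on slides 28 and 29 can be sketched in code. This is a minimal illustration, assuming a scoring function that gives 1 for each identical column and 0 for mismatches and gaps; the function name `max_identity_alignment` is made up for this example.

```python
def max_identity_alignment(a, b):
    """Global alignment of a and b that maximizes the number of identical columns."""
    n, m = len(a), len(b)
    # best[i][j] = maximum number of identities when aligning a[:i] with b[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = best[i - 1][j - 1] + (a[i - 1] == b[j - 1])
            best[i][j] = max(diagonal, best[i - 1][j], best[i][j - 1])

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + (a[i - 1] == b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j]:
            top.append(a[i - 1])      # gap in b
            bottom.append("-")
            i -= 1
        else:
            top.append("-")           # gap in a
            bottom.append(b[j - 1])
            j -= 1
    return best[n][m], "".join(reversed(top)), "".join(reversed(bottom))


identities, top_row, bottom_row = max_identity_alignment("AACGT", "AACT")
print(identities)
print(top_row)
print(bottom_row)
```

Under this scoring, the alignment on slide 29 (4 identities) beats the one on slide 28 (3 identities).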
30. it's not always that easy!
30
31. 31
photo by JohnGoode
32. there's more to bioinformatics
than can fit into this presentation
32
33. back to the paper
33
34. Error statistics of HMM & hidden Boltzmann model results
34
35. Error statistics of HMM & hidden Boltzmann model results
35
36. how to interpret a score
36
37. 1. is it strong enough to indicate signal?
37
38. 2. is it weak enough to indicate noise?
38
39. false positive & true positive rates
39
40. false positive rate (fpr) for s0
$\Pr(\text{score of noise} \ge s_0)$

40
41. true positive rate (tpr) for s0
$\Pr(\text{score of signal} \ge s_0)$

41
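Both rates are tail probabilities of a score distribution at the threshold s0. As a rough illustration, they can be computed empirically from two lists of scores; the numbers below are made up, and `rate_at_threshold` is a hypothetical helper name.

```python
def rate_at_threshold(scores, s0):
    """Fraction of the given scores that are at least s0."""
    return sum(s >= s0 for s in scores) / len(scores)


# Made-up scores, for illustration only.
noise_scores = [0.1, 0.4, 1.2, 2.5, 0.3]
signal_scores = [2.0, 3.1, 1.8, 4.2]
s0 = 2.0

fpr = rate_at_threshold(noise_scores, s0)    # noise scoring at least s0
tpr = rate_at_threshold(signal_scores, s0)   # signal scoring at least s0
print(fpr, tpr)
```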
42. a faster, more general approach to estimating fpr/tpr
42
43. we assume that we're given:
43
44. a hidden Boltzmann model
44
45. a simple background model describing noise
45
46. a computable foreground model describing signal
46
47. Error statistics of HMM & hidden Boltzmann model results
47
48. a Markov process with unobserved states
48
49. transition probabilities
+
emission probabilities
49
50. Error statistics of HMM & hidden Boltzmann model results
50
51. generalization of HMM
51
52. scores rather than probabilities
52
53. states
(including start & terminal)
53
54. transitions
54
55. emitters
55
56. emissions
56
57. alphabet
57
58. each state, transition, & emission has a real-valued score
58
59. emission path
59
60. sequence
60
61. score of emission path
$\sum(\text{encountered scores})$

61
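The ingredients on slides 53 to 61 can be put together in a small sketch. The model below is a hypothetical toy, not one from the paper: every state name and score is made up, and a path is represented as a list of (state, emitted symbol) pairs.

```python
# Hypothetical toy hidden Boltzmann model; every name and score is made up.
STATE_SCORE = {"start": 0.0, "M": 0.5, "I": -0.25, "end": 0.0}
TRANSITION_SCORE = {
    ("start", "M"): 0.0,
    ("M", "M"): 0.25,
    ("M", "I"): -1.0,
    ("I", "M"): -0.5,
    ("M", "end"): 0.0,
}
EMISSION_SCORE = {
    ("M", "A"): 1.0, ("M", "C"): 0.5, ("M", "G"): 0.5, ("M", "T"): 0.0,
    ("I", "A"): 0.0, ("I", "C"): 0.0, ("I", "G"): 0.0, ("I", "T"): 0.0,
}


def path_score(path):
    """Sum every state, transition, and emission score met along the path.

    `path` is a list of (state, emitted symbol or None) pairs.
    """
    total = 0.0
    for state, symbol in path:
        total += STATE_SCORE[state]
        if symbol is not None:
            total += EMISSION_SCORE[state, symbol]
    for (prev, _), (cur, _) in zip(path, path[1:]):
        total += TRANSITION_SCORE[prev, cur]
    return total


def emitted_sequence(path):
    """The sequence obtained by reading off the emissions along the path."""
    return "".join(symbol for _, symbol in path if symbol is not None)


path_a = [("start", None), ("M", "A"), ("I", "C"), ("M", "G"), ("end", None)]
path_b = [("start", None), ("M", "A"), ("M", "C"), ("M", "G"), ("end", None)]
print(emitted_sequence(path_a), path_score(path_a))
print(emitted_sequence(path_b), path_score(path_b))
```

The two paths emit the same sequence with different scores, which is the sense in which the path is hidden.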
62. 62
63. hidden?
63
64. an emission path can't be uniquely determined from its sequence
64
65. a sequence can be emitted by any of several emission paths
65
66. 66
67. how to score a given sequence
67
68. maximum score
$s_{\max}(D) = \max_{\pi \in D} s(\pi)$

68
69. forward score
an HMM interpretation of the hidden Boltzmann model
69
70. for any $s$, $\exp(s)$ is treated as if it were an HMM probability

70
71. $\exp\big(s_{\mathrm{fw}}(D)\big) = \sum_{\pi \in D} \exp\big(s(\pi)\big)$

71
72. free score
definition of free energy from thermodynamics
72
73. temperature
$T \in (0, +\infty)$

73
74. $Z(D,T) = \exp\!\left(\frac{s_{\mathrm{free}}(D,T)}{T}\right) = \sum_{\pi \in D} \exp\!\left(\frac{s(\pi)}{T}\right)$

74
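The three ways of scoring a sequence (slides 68 to 74) can be compared on a toy model. This is a brute-force sketch under stated assumptions: the two states and all scores are made up, every state path of the right length is assumed to emit the sequence, and real implementations would use dynamic programming instead of enumerating paths.

```python
import itertools
import math

# Hypothetical two-state toy model; all scores are made up.
STATES = ("M", "I")
TRANSITION = {("M", "M"): 0.0, ("M", "I"): -1.0, ("I", "M"): -0.5, ("I", "I"): -0.5}
EMISSION = {
    "M": {"A": 1.0, "C": 0.5, "G": 0.5, "T": 0.0},
    "I": {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
}


def path_scores(seq):
    """Scores of every state path that emits seq (brute-force enumeration)."""
    scores = []
    for path in itertools.product(STATES, repeat=len(seq)):
        s = sum(EMISSION[state][symbol] for state, symbol in zip(path, seq))
        s += sum(TRANSITION[prev, cur] for prev, cur in zip(path, path[1:]))
        scores.append(s)
    return scores


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def max_score(seq):
    return max(path_scores(seq))


def forward_score(seq):
    return log_sum_exp(path_scores(seq))


def free_score(seq, T):
    # T * log Z(D, T), where Z(D, T) sums exp(s / T) over the paths.
    return T * log_sum_exp([s / T for s in path_scores(seq)])


print(max_score("ACG"), forward_score("ACG"), free_score("ACG", 0.05))
```

In this sketch the free score at T = 1 coincides with the forward score, and it approaches the maximum score as T goes to 0.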
75. background model
75
76. simple model: sequence positions are i.i.d.
76
77. $\Pr(D \mid B) = \prod_{i=1}^{L} \Pr(d_i \mid B)$

77
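Because positions are i.i.d., the background probability of a sequence is a product over its positions. A minimal sketch, assuming a uniform DNA background (the frequencies are an assumption, not taken from the paper):

```python
import math

# Assumed uniform background over the DNA alphabet.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def background_prob(seq, background=BACKGROUND):
    """Pr(D | B) for a sequence whose positions are i.i.d."""
    return math.prod(background[symbol] for symbol in seq)


def background_log_prob(seq, background=BACKGROUND):
    """log Pr(D | B); preferable for long sequences to avoid underflow."""
    return sum(math.log(background[symbol]) for symbol in seq)


print(background_prob("ACGT"))
```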
78. mathematical problem statement
78
79. $\mathrm{fpr}(s_0) = \sum_{D \in \mathcal{D}_L} \Pr(D \mid B)\, \mathbf{1}[s(D) \ge s_0]$

79
80. algorithm
80
81. fpr(s0) can be estimated via naive sampling

81
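Naive sampling draws sequences from the background model and counts how often the score reaches s0. The sketch below assumes a uniform background and uses a stand-in score (the number of A's in the sequence) in place of a real model score; both are assumptions made so the example is self-contained.

```python
import random

ALPHABET = "ACGT"
BACKGROUND = [0.25, 0.25, 0.25, 0.25]  # assumed uniform i.i.d. background


def toy_score(seq):
    """Stand-in for the model score: simply the number of A's (an assumption)."""
    return seq.count("A")


def naive_fpr(s0, length, n_samples, seed=0):
    """Fraction of background sequences whose score is at least s0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        seq = "".join(rng.choices(ALPHABET, weights=BACKGROUND, k=length))
        hits += toy_score(seq) >= s0
    return hits / n_samples


estimate = naive_fpr(s0=5, length=10, n_samples=20000)
print(estimate)
```

For this toy score the exact answer is a binomial tail of about 0.078, so naive sampling works well here; it breaks down when the rate is so small that almost no sample reaches the threshold.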
82. alternatively, fpr(s0) can be estimated via importance sampling

82
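83. $\mathrm{fpr}(s_0) = \sum_{D \in \mathcal{D}_L} \Pr(D \mid T)\, f(D, s_0)$
where
$f(D, s_0) = \frac{\Pr(D \mid B)\, \mathbf{1}[s(D) \ge s_0]}{\Pr(D \mid T)}$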

83
84. importance sampling is more efficient
84
85. importance sampling distribution
85
86. $\Pr(D \mid T) = \frac{\Pr(D \mid B)\, Z(D,T)}{Z(T)}$

86
87. $f(D, s_0) = \frac{Z(T)\, \mathbf{1}[s(D) \ge s_0]}{Z(D,T)}$

87
88. sampling of sequences in a nutshell
88
89. draw sample sequences according to Pr(D|T)

89
90. compute f(D,s0) for each sample

90
91. use the average as an estimate for fpr(s0)

91
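The three steps above can be sketched on a deliberately trivial model. The assumptions are strong and made only for illustration: each sequence has a single path, its score is a sum of made-up per-symbol scores, so Z(D,T) = exp(s(D)/T) and the sampling distribution factorizes over positions. In the general hidden case, drawing from Pr(D|T) requires more machinery than shown here.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed background
SYMBOL_SCORE = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}     # made-up scores


def importance_fpr(s0, length, T, n_samples, seed=0):
    """Importance-sampling estimate of fpr(s0) for the toy additive score."""
    rng = random.Random(seed)
    # Per-position sampling weights: Pr(d | B) * exp(score(d) / T).
    weights = [BACKGROUND[d] * math.exp(SYMBOL_SCORE[d] / T) for d in ALPHABET]
    log_z_T = length * math.log(sum(weights))   # log Z(T) for this toy model
    total = 0.0
    for _ in range(n_samples):
        seq = rng.choices(ALPHABET, weights=weights, k=length)  # D ~ Pr(D | T)
        s = sum(SYMBOL_SCORE[d] for d in seq)
        if s >= s0:
            # f(D, s0) = Z(T) / Z(D, T), with Z(D, T) = exp(s(D) / T) here.
            total += math.exp(log_z_T - s / T)
    return total / n_samples


def exact_fpr(s0, length, p=0.25):
    """Exact binomial tail, available only because the toy score counts A's."""
    return sum(math.comb(length, k) * p**k * (1 - p) ** (length - k)
               for k in range(s0, length + 1))


T = 1 / math.log(9)   # chosen so that sampled sequences average about 30 A's
estimate = importance_fpr(s0=30, length=40, T=T, n_samples=20000)
truth = exact_fpr(30, 40)
print(estimate, truth)
```

The target rate here is on the order of 1e-11, far out of reach for 20000 naive samples, yet the importance-sampling estimate lands close to the exact value.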
92. estimation of fpr
92
93. $\mathrm{fpr}_1(s_0) = \frac{Z(T)}{N} \sum_{i=1}^{N} \frac{\mathbf{1}[s(D_i) \ge s_0]}{Z(D_i,T)} = 1 - \mathrm{tnr}_1(s_0)$

93
94. $\mathrm{tnr}_2(s_0) = \frac{Z(T)}{N} \sum_{i=1}^{N} \frac{\mathbf{1}[s(D_i) < s_0]}{Z(D_i,T)} = 1 - \mathrm{fpr}_2(s_0)$

94
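95. $\mathrm{fpr}_3(s_0) = \begin{cases} \mathrm{fpr}_1(s_0), & \text{if } \mathrm{fpr}_1(s_0) \le \mathrm{tnr}_2(s_0) \\ \mathrm{fpr}_2(s_0), & \text{otherwise} \end{cases}$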

95
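The three estimators can be written as one small function over the sampled sequences. This is a sketch: the inputs are the scores s(D_i), the values Z(D_i,T), and Z(T); the numbers in the usage example are made up and not consistent with any real model, and the "less than or equal" comparison in fpr3 is a reconstruction of a symbol lost from the slide text.

```python
def fpr_estimators(scores, z_values, z_T, s0):
    """Return (fpr1, fpr2, fpr3) from samples drawn according to Pr(D | T).

    scores[i]   : score of the i-th sampled sequence
    z_values[i] : Z(D_i, T) for the i-th sampled sequence
    z_T         : Z(T)
    """
    n = len(scores)
    pairs = list(zip(scores, z_values))
    # fpr1 sums over samples at or above the threshold.
    fpr1 = z_T / n * sum(1.0 / z for s, z in pairs if s >= s0)
    # tnr2 sums over samples below the threshold; fpr2 is its complement.
    tnr2 = z_T / n * sum(1.0 / z for s, z in pairs if s < s0)
    fpr2 = 1.0 - tnr2
    # fpr3 picks whichever quantity was estimated directly and is smaller.
    fpr3 = fpr1 if fpr1 <= tnr2 else fpr2
    return fpr1, fpr2, fpr3


# Made-up inputs, for illustration only.
fpr1, fpr2, fpr3 = fpr_estimators(
    scores=[1, 3, 2, 5], z_values=[2.0, 8.0, 4.0, 16.0], z_T=4.0, s0=3
)
print(fpr1, fpr2, fpr3)
```

With a finite sample, fpr1 and fpr2 generally disagree, which is why a rule for choosing between them is needed.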
96. which estimator is the best?
96
97. based on the results, fpr3

97
98. choice depends on efficiency of the estimators
98
99. estimation of tpr
99
100. by extending the technique for estimating fpr
100
101. choice of T

101
102. which T will be efficient for a given s0?

102
103. the relation between T and s0 isn't straightforward

103
104. build a calibration curve
104
105. we have empirically observed lower variances for error statistic estimation when the fraction of sampled sequences exceeding the given score threshold is 20-60%.
105
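One possible reading of the calibration idea is to try a grid of temperatures and keep one for which 20-60% of sampled sequences reach s0. The sketch below is an assumption about how that could look, not the paper's procedure; it reuses a trivial single-path model with made-up per-symbol scores, and the function names and grid are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed background
SYMBOL_SCORE = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}     # made-up scores


def fraction_exceeding(s0, length, T, n_samples, rng):
    """Fraction of sequences sampled at temperature T whose score is at least s0."""
    weights = [BACKGROUND[d] * math.exp(SYMBOL_SCORE[d] / T) for d in ALPHABET]
    hits = 0
    for _ in range(n_samples):
        seq = rng.choices(ALPHABET, weights=weights, k=length)
        hits += sum(SYMBOL_SCORE[d] for d in seq) >= s0
    return hits / n_samples


def pick_temperature(s0, length, grid, n_samples=2000, seed=0):
    """First T in the grid whose exceeding fraction lies in [0.2, 0.6], else None."""
    rng = random.Random(seed)
    for T in grid:
        fraction = fraction_exceeding(s0, length, T, n_samples, rng)
        if 0.2 <= fraction <= 0.6:
            return T, fraction
    return None


result = pick_temperature(s0=30, length=40, grid=[2.0, 1.0, 0.6, 0.5, 0.4])
print(result)
```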
106. results
106
107. HMMER 3.0
107
108. randomly generated a Plan7 profile-HMM of length M=100

108
109. estimated its error statistics using polypeptide sequences of length L=200

109
110. time to calculate error statistics for s0 is 4.2-6.3 seconds

110
111. runtime for naive sampling would be much larger
111
112. an error statistic less than $10^{-20}$ would require a runtime longer than the present age of the universe.

112
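A rough back-of-the-envelope version of this claim can be done in a few lines. The per-sequence scoring time behind the claim is not given in these slides, so the sketch only computes the throughput naive sampling would need; the age of the universe is taken as about 13.8 billion years, and the 10% accuracy target is an arbitrary illustrative choice.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR   # about 4.35e17 seconds

fpr = 1e-20

# Naive sampling needs about 1 / fpr draws to expect a single hit.
samples_for_one_hit = 1 / fpr
rate_for_one_hit = samples_for_one_hit / AGE_OF_UNIVERSE_S

# For a relative standard error of 10%, about (1 - fpr) / (fpr * 0.1**2) draws.
samples_for_10_percent = (1 - fpr) / (fpr * 0.1**2)
rate_for_10_percent = samples_for_10_percent / AGE_OF_UNIVERSE_S

print(f"{rate_for_one_hit:.0f} sequences per second for one expected hit")
print(f"{rate_for_10_percent:.0f} sequences per second for 10% relative error")
```

Even a single expected hit requires scoring a few hundred sequences per second for the entire age of the universe, and a usable estimate needs about a hundred times more.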
113. a quick check using Wolfram|Alpha
113
114. 114
115. discussion
115
116. 116
117. future directions
117
118. real problem instances
118
119. scaling to different problem instances
119
120. re-use of simulations
120
121. other scoring functions
121
122. complex background models
122
123. stochastic context-free grammars
123
124. to summarize
124
125. error statistic estimation for hidden Boltzmann models
125
126. applied to HMM
126
127. faster than naive sampling
127
128. more general than other approaches
128
129.

129

