Date posted: 17-Jun-2015

Category: Education

Author: yaser-sulaiman

Description:

A presentation about the paper titled "Error statistics of hidden Markov model and hidden Boltzmann model results" by Lee A Newberg. The paper is available at http://www.biomedcentral.com/1471-2105/10/212

Transcript

1. Error Statistics of Hidden Markov Model and Hidden Boltzmann Model Results

A paper by Lee A Newberg

Presented by Yaser Sulaiman

1

2. I'm a computer scientist

2

3. who recently got interested in bioinformatics

3

4. a different flavor of probability theory & stochastic
processes

4

5. HMMs in computer science

5

6. temporal pattern recognition

6

7. speech recognition

7

8. handwriting recognition

8

9. bioinformatics

9

10. 10

photo by John A Burnett

11. bioinformatics in 5 minutes

11

12. 12

13. biological sequences

13

14. DNA

{A,T,C,G}

14

15. 15

stolen from Iowa State University

16. RNA

{A,U,C,G}

16

17. proteins

{A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}

17

18. 18

stolen from Wikipedia

19. sequence comparison

19

20. @ the heart of bioinformatics

20

21. why?

21

22. 22

23. not to mention evolution

23

24. sequence alignment

24

25. find optimal alignment

25

26. according to a scoring function

26

27. align AACGT and AACT

to max. identities

27

28. AACGT

|||

AA-CT

28

29. AACGT

||| |

AAC-T

29
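The alignment on slide 29 can be reproduced with a small dynamic program. This is a minimal sketch, assuming the scoring function is simply the number of identical columns with no gap penalty; the function name `max_identity_alignment` is illustrative and does not come from the paper.

```python
def max_identity_alignment(a: str, b: str):
    """Global alignment of a and b that maximizes the number of identical columns."""
    n, m = len(a), len(b)
    # best[i][j] = maximum number of identities when aligning a[:i] with b[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = best[i - 1][j - 1] + (a[i - 1] == b[j - 1])
            best[i][j] = max(diag, best[i - 1][j], best[i][j - 1])

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + (a[i - 1] == b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j]:
            row_a.append(a[i - 1])   # gap in b
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")        # gap in a
            row_b.append(b[j - 1])
            j -= 1
    return best[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


identities, top, bottom = max_identity_alignment("AACGT", "AACT")
print(identities)
print(top)
print(bottom)
```

For AACGT and AACT this recovers the four-identity alignment from slide 29 rather than the three-identity one from slide 28.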

30. it's not always that easy!

30

31. 31

photo by JohnGoode

32. there's more to bioinformatics

than can fit into this presentation

32

33. back to the paper

33

34. Error statistics of HMM & hidden Boltzmann model
results

34

35. Error statistics of HMM & hidden Boltzmann model
results

35

36. how to interpret a score

36

37. 1. is it strong enough to indicate signal?

37

38. 2. is it weak enough to indicate noise?

38

39. false positive & true positive rates

39

40. false positive rate (fpr) for s0

$\Pr(\text{score of noise} \ge s_0)$

40

41. true positive rate (tpr) for s0

$\Pr(\text{score of signal} \ge s_0)$

41

42. a faster, more general approach to estimating fpr/tpr

42

43. we assume that we're given:

43

44. a hidden Boltzmann model

44

45. a simple background model describing noise

45

46. a computable foreground model describing signal

46

47. Error statistics of HMM & hidden Boltzmann model
results

47

48. a Markov process with unobserved states

48

49. transition probabilities

+

emission probabilities

49

50. Error statistics of HMM & hidden Boltzmann model
results

50

51. generalization of HMM

51

52. scores rather than probabilities

52

53. states

(including start & terminal)

53

54. transitions

54

55. emitters

55

56. emissions

56

57. alphabet

57

58. each state, transition, & emission has a real-valued
score

58

59. emission path

59

60. sequence

60

61. score of emission path

$\sum(\text{encountered scores})$

61

62. 62

63. hidden?

63

64. an emission path can't be uniquely determined from its
sequence

64

65. a sequence can be emitted by any of several emission
paths

65

66. 66

67. how to score a given sequence

67

68. maximum score

$s_{\max}(D) = \max_{\pi \in D} s(\pi)$, where $\pi$ ranges over the emission paths that emit $D$

68

69. forward score

an HMM interpretation of the hidden Boltzmann model

69

70. for any score $s$, $\exp(s)$ is treated as if it were an HMM
probability

70

71. $\exp(s_{\mathrm{fw}}(D)) = \sum_{\pi \in D} \exp(s(\pi))$

71

72. free score

definition of free energy from thermodynamics

72

73. temperature

$T \in (0, +\infty)$

73

74. $Z(D,T) = \exp\left(\frac{s_{\mathrm{free}}(D,T)}{T}\right) = \sum_{\pi \in D} \exp\left(\frac{s(\pi)}{T}\right)$

74
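The three sequence scores on slides 68 to 74 differ only in how they combine the scores of the emission paths that emit a sequence. Below is a minimal sketch, assuming the path scores for one sequence have already been enumerated into a list; the list `scores` is made up for illustration, and a real model would use dynamic programming instead of enumeration.

```python
import math


def max_score(path_scores):
    """s_max(D): the score of the best emission path."""
    return max(path_scores)


def free_score(path_scores, T):
    """s_free(D, T) = T * log(sum over paths of exp(s / T)).

    The largest score is factored out before exponentiating to avoid overflow.
    """
    top = max(path_scores)
    return top + T * math.log(sum(math.exp((s - top) / T) for s in path_scores))


def forward_score(path_scores):
    """s_fw(D) = log(sum over paths of exp(s)), i.e. the free score at T = 1."""
    return free_score(path_scores, 1.0)


scores = [-1.0, -2.5, -4.0]  # hypothetical path scores for a single sequence
print(max_score(scores), forward_score(scores), free_score(scores, 0.01))
```

Read this way, the formulas suggest that the forward score is the free score at T = 1, and that the free score approaches the maximum score as T shrinks toward 0.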

75. background model

75

76. simple model: sequence positions are i.i.d.

76

77. $\Pr(D \mid B) = \prod_{i=1}^{L} \Pr(d_i \mid B)$

77
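Under the i.i.d. background model, the probability of a sequence is a product of per-letter probabilities. A short sketch, assuming a uniform DNA background purely for illustration (the names `log_background_prob` and `uniform_dna` are hypothetical); working in log space avoids underflow for long sequences.

```python
import math


def log_background_prob(seq, letter_probs):
    """log Pr(D | B) for i.i.d. positions: the sum of per-letter log probabilities."""
    return sum(math.log(letter_probs[d]) for d in seq)


uniform_dna = {d: 0.25 for d in "ACGT"}  # assumed background, not from the paper
print(math.exp(log_background_prob("AACT", uniform_dna)))
```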

78. mathematical problem statement

78

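79. $\mathrm{fpr}(s_0) = \sum_{D \in \mathcal{D}_L} \Pr(D \mid B)\,\Theta(s(D) \ge s_0)$, where $\mathcal{D}_L$ is the set of sequences of length $L$ and $\Theta(\cdot)$ is 1 if its condition holds and 0 otherwise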

79

80. algorithm

80

81. fpr(s0) can be estimated via naïve sampling

81
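Naïve sampling draws sequences from the background model and counts how often the score reaches the threshold. A minimal sketch, assuming a uniform DNA background and a stand-in scoring function (`toy_score`, the number of A's) in place of a real model score; both are illustrative assumptions.

```python
import random

ALPHABET = "ACGT"


def toy_score(seq):
    """Hypothetical stand-in for s(D): the number of A's in the sequence."""
    return seq.count("A")


def naive_fpr(s0, L, n_samples, rng):
    """Fraction of background sequences of length L whose score is at least s0."""
    hits = 0
    for _ in range(n_samples):
        seq = rng.choices(ALPHABET, k=L)  # uniform i.i.d. background
        if toy_score(seq) >= s0:
            hits += 1
    return hits / n_samples


rng = random.Random(0)
print(naive_fpr(3, 8, 10_000, rng))
```

The weakness shows up for high thresholds: when fpr(s0) is far smaller than 1/n_samples, almost every run returns exactly 0.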

82. alternatively, fpr(s0) can be estimated via importance
sampling

82

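83. $\mathrm{fpr}(s_0) = \sum_{D \in \mathcal{D}_L} \Pr(D \mid T)\, f(D, s_0)$

where

$f(D, s_0) = \frac{\Pr(D \mid B)\,\Theta(s(D) \ge s_0)}{\Pr(D \mid T)}$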

83

84. importance sampling is more efficient

84

85. importance sampling distribution

85

86. $\Pr(D \mid T) = \frac{\Pr(D \mid B)\, Z(D,T)}{Z(T)}$

86

87. $f(D, s_0) = \frac{Z(T)\,\Theta(s(D) \ge s_0)}{Z(D,T)}$

87
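The step from the sampling distribution on slide 86 to the weight on slide 87 is a direct substitution. It is written out here as a sketch; $\Theta$ denotes an indicator that is 1 when its condition holds and 0 otherwise, and the expression for $Z(T)$ is an inference from requiring $\Pr(D \mid T)$ to sum to 1 over all sequences of length $L$.

```latex
\begin{align*}
f(D, s_0)
  &= \frac{\Pr(D \mid B)\,\Theta(s(D) \ge s_0)}{\Pr(D \mid T)}
   = \frac{\Pr(D \mid B)\,\Theta(s(D) \ge s_0)}{\Pr(D \mid B)\,Z(D,T)/Z(T)}
   = \frac{Z(T)\,\Theta(s(D) \ge s_0)}{Z(D,T)}, \\
Z(T)
  &= \sum_{D \in \mathcal{D}_L} \Pr(D \mid B)\,Z(D,T).
\end{align*}
```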

88. sampling of sequences in a nutshell

88

89. draw sample sequences according to Pr(D|T)

89

90. compute f(D,s0) for each sample

90

91. use the average as an estimate for fpr(s0)

91
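The three steps above can be sketched on a toy model. Everything model-specific here is an assumption made for illustration and is not the paper's HMMER setup: at each position the hidden path independently picks one of two states ("match" scores +1 for A and -1 otherwise, "skip" scores 0), the background is uniform, and s(D) is the maximum score. Because positions are independent in this toy, Z(D,T), Z(T), and the sampling distribution Pr(D|T) all factor per letter.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = {d: 0.25 for d in ALPHABET}  # assumed i.i.d. background Pr(d | B)
STATE_SCORES = [                           # assumed toy emission scores per hidden state
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},  # "match"
    {d: 0.0 for d in ALPHABET},                    # "skip"
]


def max_score(seq):
    """s(D): best state choice at every position."""
    return sum(max(state[d] for state in STATE_SCORES) for d in seq)


def letter_weight(d, T):
    """Per-position factor of Z(D, T): sum over hidden states of exp(score / T)."""
    return sum(math.exp(state[d] / T) for state in STATE_SCORES)


def estimate_fpr(s0, L, T, n_samples, rng):
    w = {d: letter_weight(d, T) for d in ALPHABET}
    z_letter = sum(BACKGROUND[d] * w[d] for d in ALPHABET)
    log_z_T = L * math.log(z_letter)                      # log Z(T)
    tilted = [BACKGROUND[d] * w[d] / z_letter for d in ALPHABET]
    total = 0.0
    for _ in range(n_samples):
        seq = rng.choices(ALPHABET, weights=tilted, k=L)  # draw D from Pr(D | T)
        if max_score(seq) >= s0:
            log_z_D_T = sum(math.log(w[d]) for d in seq)  # log Z(D, T)
            total += math.exp(log_z_T - log_z_D_T)        # f(D, s0)
    return total / n_samples                              # average of f over the samples


estimate = estimate_fpr(s0=8, L=8, T=0.3, n_samples=5000, rng=random.Random(0))
print(estimate, 0.25 ** 8)
```

In this toy, a score of 8 at L = 8 requires all A's, so the exact answer is 0.25^8 (about 1.5e-5). A naïve sampler with 5000 draws would usually see no hits at all, while the tilted sampler sees the event in a large share of its draws.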

92. estimation of fpr

92

93. $\mathrm{fpr}_1(s_0) = \frac{Z(T)}{N} \sum_{i=1}^{N} \frac{\Theta(s(D_i) \ge s_0)}{Z(D_i, T)} = 1 - \mathrm{tnr}_1(s_0)$

93

94. $\mathrm{tnr}_2(s_0) = \frac{Z(T)}{N} \sum_{i=1}^{N} \frac{\Theta(s(D_i) < s_0)}{Z(D_i, T)} = 1 - \mathrm{fpr}_2(s_0)$

94

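95. $\mathrm{fpr}_3(s_0) = \begin{cases} \mathrm{fpr}_1(s_0), & \text{if } \mathrm{fpr}_1(s_0) \le \mathrm{tnr}_2(s_0) \\ \mathrm{fpr}_2(s_0), & \text{otherwise} \end{cases}$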

95
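The three estimators can be computed from the same set of samples. A minimal sketch, assuming the per-sample scores s(D_i) and partition values Z(D_i,T) are already available as lists; the comparison used to switch between estimators (less than or equal) is a reading of the slide, whose operator did not survive extraction.

```python
def fpr_estimators(scores, z_values, z_T, s0):
    """Return (fpr1, fpr2, fpr3) from samples drawn according to Pr(D | T).

    scores[i] is s(D_i), z_values[i] is Z(D_i, T), and z_T is Z(T).
    """
    n = len(scores)
    pairs = list(zip(scores, z_values))
    # fpr1 sums the weights of samples at or above the threshold.
    fpr1 = z_T / n * sum(1.0 / z for s, z in pairs if s >= s0)
    # tnr2 sums the weights of samples below the threshold; fpr2 is its complement.
    tnr2 = z_T / n * sum(1.0 / z for s, z in pairs if s < s0)
    fpr2 = 1.0 - tnr2
    # fpr3 uses whichever direct estimate targets the smaller quantity.
    fpr3 = fpr1 if fpr1 <= tnr2 else fpr2
    return fpr1, fpr2, fpr3


# Made-up numbers chosen so the arithmetic is easy to follow by hand.
print(fpr_estimators([1, 2, 3, 4], [2.0, 2.0, 4.0, 4.0], z_T=1.0, s0=3))
```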

96. which estimator is the best?

96

97. based on the results, fpr3

97

98. choice depends on efficiency of the estimators

98

99. estimation of tpr

99

100. by extending the technique used for estimating fpr

100

101. choice of T

101

102. which T will be efficient for a given s0?

102

103. the relation between T and s0 isn't straightforward

103

104. build a calibration curve

104

105. we have empirically observed lower variances for error
statistic estimation when the fraction of sampled sequences
exceeding the given score threshold is 20-60%.

105
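One way to picture a calibration curve is as a table of the fraction of sampled sequences that reach the threshold at each candidate temperature, keeping the temperatures that land in the 20-60% band quoted above. The sketch below is an illustration on an assumed toy model, not the paper's procedure: the score is the number of A's in a sequence of length L, the background is uniform, and the tilted per-letter probability follows from two hypothetical hidden states ("match": +1 for A, -1 otherwise; "skip": 0). In this toy the exceed fraction has a closed binomial form, so no sampling is needed.

```python
import math


def tilted_prob_A(T):
    """Probability of drawing A at one position under the toy Pr(D | T)."""
    w_A = math.exp(1.0 / T) + 1.0        # match + skip weights for A
    w_other = math.exp(-1.0 / T) + 1.0   # match + skip weights for C, G, T
    return 0.25 * w_A / (0.25 * w_A + 0.75 * w_other)


def exceed_fraction(s0, L, T):
    """Fraction of sequences drawn from Pr(D | T) with at least s0 A's."""
    q = tilted_prob_A(T)
    return sum(math.comb(L, k) * q**k * (1 - q) ** (L - k) for k in range(s0, L + 1))


def pick_temperatures(s0, L, candidates):
    """Keep candidate temperatures whose exceed fraction falls in the 20-60% band."""
    return [T for T in candidates if 0.2 <= exceed_fraction(s0, L, T) <= 0.6]


for T in (0.2, 0.3, 0.5, 1.0):
    print(T, round(exceed_fraction(8, 8, T), 4))
```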

106. results

106

107. HMMER 3.0

107

108. randomly generated a length M=100, Plan7 profile-HMM

108

109. estimated its error statistics using polypeptide sequences of
length L=200

109

110. time to calculate error statistics for s0 is 4.2-6.3
seconds

110

111. runtime for naïve sampling would be much larger

111

112. an error statistic less than $10^{-20}$ would require a runtime
longer than the present age of the universe.

112

113. a quick check using Wolfram|Alpha

113

114. 114
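The original Wolfram|Alpha check is not preserved in this transcript. A back-of-the-envelope version can be sketched as follows, with loudly hypothetical inputs: naïve sampling needs roughly hits_needed / p samples to observe an event of probability p that many times, and both the per-sample cost (1 millisecond) and the number of hits wanted (100) are assumptions chosen for illustration, not figures from the paper.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def naive_runtime_years(error_statistic, hits_needed, seconds_per_sample):
    """Approximate runtime of naïve sampling, in years."""
    samples = hits_needed / error_statistic
    return samples * seconds_per_sample / SECONDS_PER_YEAR


# Hypothetical inputs: 100 hits wanted, 1 ms to score each sampled sequence.
years = naive_runtime_years(1e-20, hits_needed=100, seconds_per_sample=1e-3)
print(years, years > AGE_OF_UNIVERSE_YEARS)
```

Under these assumptions the runtime comes out more than an order of magnitude beyond the age of the universe; a faster per-sample cost or a looser accuracy target would change the figure.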

115. discussion

115

116. 116

117. future directions

117

118. real problem instances

118

119. scaling to different problem instances

119

120. re-use of simulations

120

121. other scoring functions

121

122. complex background models

122

123. stochastic context-free grammars

123

124. to summarize

124

125. error statistic estimation for hidden Boltzmann models

125

126. applied to HMM

126

127. faster than naïve sampling

127

128. more general than other approaches

128

129.

129
