Date post: | 17-Jun-2015 |
Upload: | yaser-sulaiman |
1
Error Statistics of Hidden Markov Model and Hidden Boltzmann Model
Results
A paper by Lee A Newberg
Presented by Yaser Sulaiman
2
I’m a computer scientist
3
who recently got interested in bioinformatics
4
a different “flavor” of probability theory & stochastic
processes
5
HMMs in computer science
6
temporal pattern recognition
7
speech recognition
8
handwriting recognition
9
bioinformatics
11
bioinformatics in 5 minutes
12
Molecular
Biology
Computer
Science
Statistics
13
biological sequences
14
DNA
16
RNA
17
proteins
19
sequence comparison
20
@ the heart of bioinformatics
21
why?
22
sequence similarity
structural similarity
functional similarity
23
not to mention evolution
24
sequence alignment
25
find optimal alignment
26
according to a scoring function
27
align AACGT and AACT
to max. identities
28
AACGT
||  |
AA-CT
29
AACGT
||| |
AAC-T
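The two alignments above have 3 and 4 identities, so the second one is the better one under the "max. identities" scoring function. A minimal sketch of how the best count could be found by dynamic programming, assuming matches score 1 and mismatches and gaps score 0 (the function name `max_identities` is made up for this illustration):

```python
def max_identities(a: str, b: str) -> int:
    """Largest number of identical aligned pairs over all gapped alignments."""
    # dp[i][j] = best identity count when aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1]: adds 1 only if the letters are identical.
            paired = dp[i - 1][j - 1] + (a[i - 1] == b[j - 1])
            # Or put a gap in one of the two sequences (scores 0).
            dp[i][j] = max(paired, dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


best = max_identities("AACGT", "AACT")
print(best)
```

Real scoring functions usually penalize gaps and mismatches, which is one reason alignment is not always this easy.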
30
it’s not always that easy!
32
there’s more to bioinformatics than can fit into this presentation
33
back to the paper
34
Error statistics of HMM & hidden Boltzmann model
results
36
how to interpret a score
37
1. is it strong enough to indicate signal?
38
2. is it weak enough to indicate noise?
39
false positive & true positive rates
40
false positive rate (fpr) for a score threshold $s_0$
41
true positive rate (tpr) for a score threshold $s_0$
42
a faster, more general approach to estimating fpr/tpr
43
we assume that we’re given:
44
a hidden Boltzmann model
45
a simple background model describing noise
46
a computable foreground model describing signal
47
Error statistics of HMM & hidden Boltzmann model
results
48
a Markov process with unobserved states
49
transition probabilities + emission probabilities
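As a small illustration of how transition and emission probabilities combine, the sketch below computes the joint probability of one state path and the sequence it emits. The two states `H` and `L` and all the numbers are invented for the example and do not come from the paper.

```python
# Hypothetical two-state HMM over the DNA alphabet (toy numbers).
TRANS = {
    "start": {"H": 0.5, "L": 0.5},
    "H": {"H": 0.7, "L": 0.3},
    "L": {"H": 0.4, "L": 0.6},
}
EMIT = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_probability(states, symbols):
    """Pr(state path, sequence): product of transition and emission terms."""
    p = 1.0
    prev = "start"
    for state, symbol in zip(states, symbols):
        p *= TRANS[prev][state] * EMIT[state][symbol]
        prev = state
    return p


p = joint_probability(["H", "H", "L"], "GCA")
print(p)
```

In practice only the sequence is observed, so the state path has to be inferred or summed out.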
50
Error statistics of HMM & hidden Boltzmann model
results
51
generalization of HMM
52
scores rather than probabilities
53
states (including start & terminal)
54
transitions
55
emitters
56
emissions
57
alphabet
58
each state, transition, & emission has a real-valued
score
59
emission path
60
sequence
61
score of emission path
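A minimal sketch of these definitions, assuming the score of an emission path is the sum of the scores of the states, transitions, and emissions it uses. The states `M` and `I`, the two-letter alphabet, and every score value are hypothetical.

```python
# Hypothetical real-valued scores for a toy hidden Boltzmann model.
STATE_SCORE = {"start": 0.0, "M": 0.5, "I": -0.5, "end": 0.0}
TRANSITION_SCORE = {
    ("start", "M"): 0.0, ("M", "M"): 0.2, ("M", "I"): -1.0,
    ("I", "M"): -0.3, ("I", "I"): -0.8, ("M", "end"): 0.0, ("I", "end"): -0.2,
}
EMISSION_SCORE = {("M", "A"): 1.0, ("M", "C"): -1.0, ("I", "A"): 0.0, ("I", "C"): 0.0}


def emission_path_score(path):
    """path: (state, emitted symbol or None) pairs from start to terminal."""
    total = 0.0
    prev = None
    for state, symbol in path:
        total += STATE_SCORE[state]
        if prev is not None:
            total += TRANSITION_SCORE[(prev, state)]
        if symbol is not None:  # start and terminal states emit nothing here
            total += EMISSION_SCORE[(state, symbol)]
        prev = state
    return total


path = [("start", None), ("M", "A"), ("I", "C"), ("M", "A"), ("end", None)]
sequence = "".join(symbol for _, symbol in path if symbol is not None)
s = emission_path_score(path)
print(sequence, s)
```

If every score is the logarithm of a probability, the exponential of the path score is an ordinary HMM path probability, which is the sense in which the model generalizes an HMM.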
62
63
hidden?
64
an emission path can’t be uniquely determined from its
sequence
65
a sequence can be emitted by any of several emission paths
66
67
how to score a given sequence
68
maximum score
69
forward score
an HMM interpretation of the hidden Boltzmann model
70
for any emission path $\pi$, $\exp(s(\pi))$ is treated as if it were an HMM probability
71
$$\exp\left(s_{\mathrm{fw}}(D)\right) = \sum_{\pi \in \pi_D} \exp\left(s(\pi)\right)$$
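To make the two scores concrete, the sketch below enumerates every emission path of a toy model by brute force. It takes the maximum score to be the largest path score and the forward score to be the log of the summed exponentials. The model and its numbers are hypothetical, and the scores are chosen as log probabilities so the result can be checked.

```python
import itertools
import math

STATES = ("H", "L")
# Toy probabilities; their logs serve as transition and emission scores.
TRANS = {"start": {"H": 0.5, "L": 0.5}, "H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def path_score(path, seq):
    total, prev = 0.0, "start"
    for state, symbol in zip(path, seq):
        total += math.log(TRANS[prev][state]) + math.log(EMIT[state][symbol])
        prev = state
    return total


def all_path_scores(seq):
    # Every state path of the right length can emit seq in this toy model.
    return [path_score(p, seq) for p in itertools.product(STATES, repeat=len(seq))]


def max_score(seq):
    return max(all_path_scores(seq))


def forward_score(seq):
    scores = all_path_scores(seq)
    m = max(scores)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))


print(max_score("GCA"), forward_score("GCA"))
```

Brute-force enumeration grows exponentially with sequence length; dynamic programming (Viterbi and forward recursions) gives the same values efficiently.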
72
free score
definition of free energy from thermodynamics
73
temperature
74
$$Z(D, T) = \exp\left(s_{\mathrm{free}}(D, T) / T\right)$$
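A short sketch of the free score, assuming the usual partition function $Z(D, T) = \sum_{\pi \in \pi_D} \exp(s(\pi)/T)$, so that $s_{\mathrm{free}}(D, T) = T \log Z(D, T)$. The list of path scores is made up. Under this assumption, $T = 1$ gives the forward score and $T \to 0$ approaches the maximum score.

```python
import math


def free_score(path_scores, temperature):
    """T * log(sum over paths of exp(score / T)), computed stably."""
    scaled = [s / temperature for s in path_scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))  # log Z(D, T)
    return temperature * log_z


scores = [-1.0, -2.0, -4.0]  # hypothetical scores of the paths emitting one sequence
print(free_score(scores, 1.0))   # same as the forward score
print(free_score(scores, 0.01))  # close to the maximum score
```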
75
background model
76
simple model: sequence positions are i.i.d.
77
$$\Pr(D \mid B) = \prod_{i=1}^{L} \Pr(d_i \mid B)$$
78
mathematical problem statement
79
$$\mathrm{fpr}(s_0) = \sum_{D \in \mathcal{D}_L} \Pr(D \mid B)\, \Theta\left(s(D) \geq s_0\right)$$
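For very short sequences the sum can be evaluated exactly by listing every sequence of length $L$. The sketch below does that with a made-up i.i.d. background and a stand-in scoring function (the number of G or C letters); neither comes from the paper.

```python
import itertools
import math

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical Pr(d | B)


def score(seq):
    """Stand-in for s(D): counts G and C letters."""
    return sum(ch in "GC" for ch in seq)


def background_prob(seq):
    """i.i.d. background: product of per-position probabilities."""
    return math.prod(BACKGROUND[ch] for ch in seq)


def exact_fpr(s0, length):
    total = 0.0
    for letters in itertools.product(BACKGROUND, repeat=length):
        seq = "".join(letters)
        if score(seq) >= s0:  # the indicator Theta(s(D) >= s0)
            total += background_prob(seq)
    return total


print(exact_fpr(3, 3))
```

The number of terms is $4^L$ for DNA, so exact enumeration is hopeless at realistic lengths, which motivates sampling.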
80
algorithm
81
$\mathrm{fpr}(s_0)$ can be estimated via naïve sampling
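A minimal sketch of naïve sampling: draw sequences from the background model and count the fraction whose score reaches the threshold. The background frequencies and the stand-in scoring function (the number of G or C letters) are hypothetical.

```python
import random

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical Pr(d | B)


def score(seq):
    """Stand-in for s(D): counts G and C letters."""
    return sum(ch in "GC" for ch in seq)


def naive_fpr(s0, length, n_samples, seed=0):
    rng = random.Random(seed)
    letters = list(BACKGROUND)
    weights = list(BACKGROUND.values())
    hits = 0
    for _ in range(n_samples):
        seq = "".join(rng.choices(letters, weights=weights, k=length))
        hits += score(seq) >= s0
    return hits / n_samples


estimate = naive_fpr(s0=2, length=3, n_samples=20000)
print(estimate)
```

Seeing even one hit takes on the order of 1/fpr samples, so this approach becomes impractical for the very small false positive rates that matter in practice.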
82
alternatively, $\mathrm{fpr}(s_0)$ can be estimated via importance sampling
83
where
84
importance sampling is more efficient
85
importance sampling distribution
86
$$\Pr(D \mid T) = \frac{\Pr(D \mid B)\, Z(D, T)}{Z(T)}$$
87
$$f(D, s_0) = \frac{Z(T)\, \Theta\left(s(D) \geq s_0\right)}{Z(D, T)}$$
88
sampling of sequences in a nutshell
89
draw sample sequences according to $\Pr(D \mid T)$
90
compute $f(D, s_0)$ for each sample
91
use the average as an estimate for $\mathrm{fpr}(s_0)$
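The three steps can be sketched on a toy problem. Two simplifications are assumed here and are not from the paper: each sequence has a single emission path, so $Z(D, T) = \exp(s(D)/T)$, and sequences are drawn from $\Pr(D \mid T)$ by enumerating all of them, which only works at toy lengths. The background and scoring function are hypothetical.

```python
import itertools
import math
import random

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical Pr(d | B)


def score(seq):
    """Stand-in for s(D): +1 per G or C, -1 per A or T."""
    return sum(1.0 if ch in "GC" else -1.0 for ch in seq)


def background_prob(seq):
    return math.prod(BACKGROUND[ch] for ch in seq)


def importance_fpr(s0, length, temperature, n_samples, seed=0):
    seqs = ["".join(t) for t in itertools.product(BACKGROUND, repeat=length)]
    z_d = [math.exp(score(s) / temperature) for s in seqs]      # Z(D, T)
    weights = [background_prob(s) * z for s, z in zip(seqs, z_d)]
    z_total = sum(weights)                                       # Z(T)
    rng = random.Random(seed)
    # Step 1: draw samples with probability Pr(D|B) Z(D,T) / Z(T).
    picks = rng.choices(range(len(seqs)), weights=weights, k=n_samples)
    # Step 2: compute f(D, s0) = Z(T) Theta(s(D) >= s0) / Z(D, T) per sample.
    f_values = [z_total / z_d[i] if score(seqs[i]) >= s0 else 0.0 for i in picks]
    # Step 3: the sample average estimates fpr(s0).
    return sum(f_values) / n_samples


estimate = importance_fpr(s0=6, length=6, temperature=1.0, n_samples=5000)
print(estimate, 0.4 ** 6)
```

In this toy case about a third of the sampled sequences reach the threshold, whereas under the background model only about 0.4% would, which is why far fewer samples are needed.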
92
estimation of fpr
93
94
95
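$$\widehat{\mathrm{fpr}}_3(s_0) = \begin{cases} \widehat{\mathrm{fpr}}_1(s_0), & \text{if } \widehat{\mathrm{fpr}}_1(s_0) \leq \widehat{\mathrm{tnr}}_2(s_0) \\ \widehat{\mathrm{fpr}}_2(s_0), & \text{otherwise} \end{cases}$$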
96
which estimator is the best?
97
based on the results,
98
choice depends on efficiency of the estimators
99
estimation of tpr
100
by extending the technique for estimating fpr
101
choice of $T$
102
which $T$ will be efficient for a given $s_0$?
103
the relation between $T$ and $s_0$ isn’t straightforward
104
build a calibration curve
105
“we have empirically observed lower variances for error
statistic estimation when the fraction of sampled sequences
exceeding the given score threshold is 20-60%.”
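One way to picture a calibration step is to scan candidate temperatures and keep those for which the fraction of sequences exceeding the threshold under the sampling distribution falls in the quoted 20-60% band. The sketch below does this exactly, by enumeration, on a toy problem. It assumes one emission path per sequence, so that $Z(D, T) = \exp(s(D)/T)$, and the background and scoring function are hypothetical. A real calibration curve would be built from simulations rather than enumeration.

```python
import itertools
import math

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical Pr(d | B)


def score(seq):
    """Stand-in for s(D): +1 per G or C, -1 per A or T."""
    return sum(1.0 if ch in "GC" else -1.0 for ch in seq)


def exceed_fraction(s0, length, temperature):
    """Probability under the sampling distribution that s(D) >= s0."""
    total = hit = 0.0
    for letters in itertools.product(BACKGROUND, repeat=length):
        seq = "".join(letters)
        weight = math.prod(BACKGROUND[ch] for ch in seq) * math.exp(score(seq) / temperature)
        total += weight
        if score(seq) >= s0:
            hit += weight
    return hit / total


def calibrate(s0, length, temperatures, low=0.2, high=0.6):
    """Temperatures whose exceedance fraction lies in [low, high]."""
    return [t for t in temperatures if low <= exceed_fraction(s0, length, t) <= high]


good = calibrate(s0=6, length=6, temperatures=[0.5, 1.0, 2.0])
print(good)
```

Lower temperatures push the sampler toward high-scoring sequences and higher temperatures pull it back toward the background, so the useful range of $T$ shifts with $s_0$.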
106
results
108
randomly generated a length , Plan7 profile-HMM
109
estimated its error statistics using polypeptide sequences of
length
110
time to calculate error statistics for is 4.2-6.3 seconds
111
runtime for naïve sampling would be much larger
112
“an error statistic less than would require a runtime longer
than the present age of the universe.”
114
115
discussion
116
117
future directions
118
real problem instances
119
scaling to different problem instances
120
re-use of simulations
121
other scoring functions
122
complex background models
123
stochastic context-free grammars
124
to summarize
125
error statistic estimation for hidden Boltzmann models
126
applied to HMM
127
faster than naïve sampling
128
more general than other approaches
129
…</presentation><questions>…