DEPARTMENT OF INFORMATION TECHNOLOGY
Conversion of Hidden Markov Model computation to C#
Gustaf Pettersson, Jakob Tysk, Henrik Vallgren
Report: Project in Computational Science, January 2013
Abstract

The project objective was to implement a mathematical model in C#, in order for Schlumberger Ltd to be able to classify oil wells automatically. The theoretical model used was a hidden Markov model with continuous observation densities and a Gaussian mixture model. Three sets of training data were given, as well as a number of unclassified data sets for classification. All data contained a number of measurement points, at different depths and with different self potential. The resulting classifications showed that the hidden Markov model is a fairly good tool for classifying oil wells.
Contents

Introduction
Theory
    Discrete Markov processes
    Hidden Markov models
    Forward-Backward procedure
    Training of a hidden Markov model
    Scaling
    Training the model with multiple observation sequences
    Continuous observation densities and Gaussian mixture model
    Classification using hidden Markov models
Method
    Data
    Process and structure
    Training
    Classification
Results
    Training data from one well
    Training data from two wells
    Training data from three wells
    Performance comparison with the MATLAB implementation
        Training
        Classification
    Validation
Discussion and Conclusion
References
Introduction

To analyze and process data it is often important to be able to classify it. Classification offers several important advantages; for example, it can reduce the amount of expensive and time-consuming measurements. At Schlumberger Ltd, the world's leading oil field services company, the classification of oil wells is done by hand. As a result, the classification often becomes subjective and the uncertainty is usually high. A method that can be used to address this problem is the so-called hidden Markov model (HMM), a mathematical model used for, among other things, pattern recognition in data sequences. By using such a model, Schlumberger Ltd could automate the classification of oil wells and save considerable working time.
The assignment for this project was to implement an HMM in C#, a multi-paradigm programming language developed by Microsoft. The implemented model should be able to classify oil wells. To begin with, the HMM must be trained on pre-classified wells, so that the model learns to recognize patterns in the wells. The sequences, or input, to be classified consist of values of depth and self potential (SP) for different measurement points in the well.
To support the work, an already working MATLAB implementation was provided. Schlumberger Ltd required an implementation in C# so that the program would be compatible with software used at the company: their own system, called Ocean, is directly compatible with C#. An intermediate goal of this project was to improve performance relative to the MATLAB implementation.
Theory

The mathematical model that has been used for the classification is the hidden Markov model (HMM), a stochastic model based on Markov processes. Much work in this area has been done by Lawrence R. Rabiner, and the following theoretical background is based on:

- A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition (Rabiner, 1989)
- An Erratum for "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Rahimi, 2000)
- Gaussian Mixture Models (Reynolds, 2009)
- Training Hidden Markov Models with Multiple Observations – A Combinatorial Method (Li, et al., 2000).
Discrete Markov processes

A Markov process is a stochastic model in which a system changes states according to a given model, where the transition between the states $S_i$ and $S_j$ is defined by a transition probability $a_{ij}$ according to:

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \qquad 1 \le i, j \le N. \tag{1}$$

Here $N$ is the number of states in the system, $q_t$ is the state at time $t$, and $a_{ij}$ is a probability satisfying:

$$a_{ij} \ge 0, \tag{2}$$

$$\sum_{j=1}^{N} a_{ij} = 1. \tag{3}$$

Each state is observable, i.e. if the system is in state $S_i$ an observer can immediately determine this from the system's output. The transition probabilities vary depending on the current state, and for a system with $N$ states it is convenient to collect them in a transition probability matrix $A$ such that:

$$A = \{a_{ij}\} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}. \tag{5}$$

The rows of the transition probability matrix correspond to the state the system is currently in and the columns to the state it will change to.
Assume a system with states $S = \{S_1, S_2, S_3\}$, equipped with the transition probability matrix:

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}. \tag{6}$$

Each matrix entry $a_{ij}$ represents the probability for the system to go from state $S_i$ to state $S_j$. The system may then be illustrated with the following graph:
Figure 1: An illustration of a discrete Markov process with 3 states.
The system attains different states with different probabilities, determined by $A$ and the initial state probability distribution $\pi = \{\pi_i\}$:

$$\pi_i = P(q_1 = S_i), \qquad 1 \le i \le N, \tag{7}$$

$$\sum_{i=1}^{N} \pi_i = 1. \tag{8}$$

Using $\pi$ and $A$ it is now possible, for our example system (6), to determine the probability of a sequence of observations such as $O = \{S_1, S_2, S_3, S_2\}$:

$$P(O \mid A, \pi) = \pi_1 \, a_{12} \, a_{23} \, a_{32}. \tag{9}$$
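As a quick illustration, the probability in (9) can be computed directly from $\pi$ and $A$. The following C# sketch is not part of the report's code, and the matrix values are hypothetical:

```csharp
using System;

class MarkovExample
{
    static void Main()
    {
        // Hypothetical 3-state transition matrix A
        // (rows: current state, columns: next state).
        double[,] A =
        {
            { 0.4, 0.3, 0.3 },
            { 0.2, 0.6, 0.2 },
            { 0.1, 0.1, 0.8 }
        };
        // Hypothetical initial state distribution pi.
        double[] pi = { 0.5, 0.3, 0.2 };

        // Observed state sequence O = { S1, S2, S3, S2 } as 0-based indices.
        int[] sequence = { 0, 1, 2, 1 };

        // P(O | A, pi) = pi_1 * a_12 * a_23 * a_32, as in equation (9).
        double p = pi[sequence[0]];
        for (int t = 1; t < sequence.Length; t++)
            p *= A[sequence[t - 1], sequence[t]];

        Console.WriteLine($"P(O | A, pi) = {p}");
    }
}
```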
Hidden Markov models

The discrete Markov process can be expanded to a hidden Markov model (HMM). The principal difference between the two models is that in the HMM the states of the system cannot be observed directly. What is observable is instead a stochastic function that depends on the states of the system.
A hidden Markov model is characterized by five principal parts:

- a set of hidden states $S = \{S_1, S_2, \ldots, S_N\}$, where $N$ is the total number of states in the system,
- a transition probability matrix $A = \{a_{ij}\}$ as in the discrete Markov process,
- a set of observation symbols $V = \{v_1, v_2, \ldots, v_M\}$, where $M$ is the total number of possible observations for each state,
- an observation symbol probability distribution $B = \{b_j(k)\}$, where $b_j(k) = P(v_k \text{ at } t \mid q_t = S_j)$,
- an initial state probability distribution $\pi = \{\pi_i\}$ as in the discrete Markov process.
In the same way as for the discrete Markov process, the HMM can be
illustrated as:
Figure 2: An illustration of a hidden Markov model, with observable
output and “hidden” states.
The HMM is usually denoted compactly as:

$$\lambda = (A, B, \pi). \tag{10}$$
Forward-Backward procedure

Given a model $\lambda$ and an observation sequence $O = O_1 O_2 \cdots O_T$, the forward (α) and backward (β) variables can be defined as

$$\alpha_t(i) = P(O_1 O_2 \cdots O_t, \, q_t = S_i \mid \lambda), \tag{11}$$

$$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda), \tag{12}$$

and can be solved using induction:

1. Initialization:

$$\alpha_1(i) = \pi_i \, b_i(O_1), \qquad 1 \le i \le N, \tag{13}$$

$$\beta_T(i) = 1, \qquad 1 \le i \le N. \tag{14}$$

2. Induction:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1}), \qquad 1 \le t \le T-1, \; 1 \le j \le N, \tag{15}$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1, \; 1 \le i \le N. \tag{16}$$

3. Termination:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \tag{17}$$

Using (14), this can be rewritten as:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \, \beta_T(i). \tag{18}$$
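As a sketch of how these recursions translate to C#, the following method computes the unscaled α and β variables. It assumes the observation probabilities $b_j(O_t)$ have been precomputed into a matrix b[j, t]; this is an illustration of equations (13)–(16), not the report's actual HMMmath code, and in practice the scaled variants described below are used instead:

```csharp
// Minimal sketch of the unscaled forward-backward recursions (11)-(16).
// N = number of states, T = number of observations; b[j, t] holds b_j(O_t).
static (double[,] alpha, double[,] beta) ForwardBackward(
    double[,] A, double[] pi, double[,] b, int N, int T)
{
    var alpha = new double[T, N];
    var beta = new double[T, N];

    // Initialization, equations (13) and (14).
    for (int i = 0; i < N; i++)
    {
        alpha[0, i] = pi[i] * b[i, 0];
        beta[T - 1, i] = 1.0;
    }

    // Forward induction, equation (15).
    for (int t = 0; t < T - 1; t++)
        for (int j = 0; j < N; j++)
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += alpha[t, i] * A[i, j];
            alpha[t + 1, j] = sum * b[j, t + 1];
        }

    // Backward induction, equation (16).
    for (int t = T - 2; t >= 0; t--)
        for (int i = 0; i < N; i++)
        {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[i, j] * b[j, t + 1] * beta[t + 1, j];
            beta[t, i] = sum;
        }

    return (alpha, beta);
}
```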
Training of a hidden Markov model

The basic problem when training an HMM is to find the model parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$ for a given observation sequence $O = O_1 O_2 \cdots O_T$ (Rabiner, 1989). This problem can be solved using the iterative Baum-Welch algorithm. Start by defining the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t + 1$ as

$$\xi_t(i, j) = P(q_t = S_i, \, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}, \tag{19}$$

and the probability of being in state $S_i$ at time $t$, given the observation sequence and the model, as:

$$\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j). \tag{20}$$

For each iteration of the algorithm, the re-estimated model $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$ is updated in the following way:

$$\bar{\pi}_i = \gamma_1(i), \tag{21}$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \tag{22}$$

$$\bar{b}_j(k) = \frac{\sum_{t=1, \, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}. \tag{23}$$
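A sketch of how the re-estimation of A, equations (19)–(22), could look in C# is given below; alpha, beta and probO are assumed to come from the forward-backward procedure above:

```csharp
// Sketch of one Baum-Welch re-estimation step for A, equations (19)-(22).
// probO is P(O | lambda) from equation (17).
static double[,] ReestimateA(
    double[,] A, double[,] b, double[,] alpha, double[,] beta,
    double probO, int N, int T)
{
    var newA = new double[N, N];
    for (int i = 0; i < N; i++)
    {
        double gammaSum = 0.0;     // denominator: sum over t of gamma_t(i)
        var xiSum = new double[N]; // numerator: sum over t of xi_t(i, j)

        for (int t = 0; t < T - 1; t++)
            for (int j = 0; j < N; j++)
            {
                // Equation (19).
                double xi = alpha[t, i] * A[i, j] * b[j, t + 1]
                            * beta[t + 1, j] / probO;
                xiSum[j] += xi;
                gammaSum += xi; // gamma_t(i) = sum_j xi_t(i, j), equation (20)
            }

        // Equation (22).
        for (int j = 0; j < N; j++)
            newA[i, j] = xiSum[j] / gammaSum;
    }
    return newA;
}
```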
Scaling

The forward variable $\alpha_t(i)$ consists of a sum of terms that are products of many $a$ and $b$ factors. Since both $a$ and $b$ are (in most cases significantly) less than 1, $\alpha_t(i)$ approaches zero exponentially for large values of $t$. The values become so small that a computer cannot handle them correctly because of the finite precision of floating-point arithmetic. Therefore, some kind of scaling is needed in order to keep the values within a computationally usable range.

Introduce a scaling coefficient $c_t$ such that:

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}, \qquad \hat{\alpha}_t(i) = c_t \, \alpha_t(i). \tag{24}$$

The scaled induction formulas for $\alpha$ and $\beta$, equations (15) and (16), can then be written as:

$$\hat{\alpha}_{t+1}(j) = c_{t+1} \left[ \sum_{i=1}^{N} \hat{\alpha}_t(i) \, a_{ij} \right] b_j(O_{t+1}), \tag{25}$$

$$\hat{\beta}_t(i) = c_t \sum_{j=1}^{N} a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j). \tag{26}$$

Even with these scaled variables, the probability in equation (17) itself would still fall outside the typical computational range. In order to still be able to compare probabilities, equation (17) is therefore replaced by its logarithm, which follows directly from the scaling coefficients:

$$\log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t. \tag{27}$$
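The following C# sketch shows the scaled forward pass with the log-likelihood accumulation of equation (27); as above, b[j, t] is assumed to hold the precomputed observation probabilities:

```csharp
// Sketch of the scaled forward pass, equations (24)-(25), accumulating the
// log-likelihood of equation (27) instead of the underflowing probability (17).
static double ScaledForwardLogLikelihood(
    double[,] A, double[] pi, double[,] b, int N, int T)
{
    var alphaHat = new double[N];
    double logProb = 0.0;

    // Initialization, equation (13), followed by scaling, equation (24).
    double scale = 0.0;
    for (int i = 0; i < N; i++)
    {
        alphaHat[i] = pi[i] * b[i, 0];
        scale += alphaHat[i];
    }
    double c = 1.0 / scale;
    for (int i = 0; i < N; i++) alphaHat[i] *= c;
    logProb -= Math.Log(c);

    // Scaled induction, equation (25).
    for (int t = 1; t < T; t++)
    {
        var next = new double[N];
        scale = 0.0;
        for (int j = 0; j < N; j++)
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += alphaHat[i] * A[i, j];
            next[j] = sum * b[j, t];
            scale += next[j];
        }
        c = 1.0 / scale;
        for (int j = 0; j < N; j++) next[j] *= c;
        logProb -= Math.Log(c); // log P(O | lambda) = -sum_t log c_t, eq. (27)
        alphaHat = next;
    }
    return logProb;
}
```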
Training the model with multiple observation sequences

The training formulas described above apply to a single observation sequence. In many, if not most, practical applications there are multiple observation sequences available, and the model should be able to handle this properly. Let:

$$O = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}] \tag{28}$$

denote a set of $K$ observation sequences and let:

$$O^{(k)} = [O_1^{(k)} O_2^{(k)} \cdots O_{T_k}^{(k)}] \tag{29}$$

be the $k$:th observation sequence. Since the re-estimation formulas, equations (22) and (23), are based on the frequency of occurrence of the various outputs and transitions, the modified re-estimation formulas may be written as:

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)}, \tag{30}$$

$$\bar{b}_j(\ell) = \frac{\sum_{k=1}^{K} \sum_{t=1, \, O_t^{(k)} = v_\ell}^{T_k} \gamma_t^{(k)}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(j)}. \tag{31}$$
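A sketch of the pooling in equation (30): each sequence's ξ and γ sums are accumulated separately, and the division is performed only once, over the pooled sums. The container types are illustrative, and System.Collections.Generic is assumed:

```csharp
// Sketch of pooling Baum-Welch statistics over K observation sequences,
// equation (30).
static double[,] ReestimateAMultiple(
    List<double[,]> xiSums,   // per-sequence sum_t xi_t(i, j), each N x N
    List<double[]> gammaSums, // per-sequence sum_t gamma_t(i), each length N
    int N)
{
    var num = new double[N, N];
    var den = new double[N];

    for (int k = 0; k < xiSums.Count; k++)
        for (int i = 0; i < N; i++)
        {
            den[i] += gammaSums[k][i];
            for (int j = 0; j < N; j++)
                num[i, j] += xiSums[k][i, j];
        }

    var newA = new double[N, N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            newA[i, j] = num[i, j] / den[i];
    return newA;
}
```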
Continuous observation densities and Gaussian mixture model

The theory above concerns the case where the observations can be characterized by a discrete and finite number of observation symbols. In this application, however, the output is a continuous observation density, which demands that some restrictions be placed on the model for the parameters to be re-estimated consistently.

This can be done by using a Gaussian mixture model with the re-estimation formulas:

$$\bar{c}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j, m)}, \tag{33}$$

$$\bar{\mu}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m) \, O_t}{\sum_{t=1}^{T} \gamma_t(j, m)}, \tag{34}$$

$$\bar{U}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m) \, (O_t - \mu_{jm})(O_t - \mu_{jm})^{\top}}{\sum_{t=1}^{T} \gamma_t(j, m)}, \tag{35}$$

where $c_{jm}$ is a mixture coefficient, $\mu_{jm}$ is a mean vector, $U_{jm}$ is a covariance matrix and $\gamma_t(j, m)$ is the probability of being in state $S_j$ at time $t$ with the $m$:th mixture component accounting for $O_t$. Using these, the probability density function $b_j$ becomes:

$$b_j(O) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(O; \mu_{jm}, U_{jm}), \tag{36}$$

with:

$$\sum_{m=1}^{M} c_{jm} = 1, \qquad c_{jm} \ge 0, \tag{37}$$

$$\mathcal{N}(O; \mu, U) = \frac{1}{(2\pi)^{d/2} \, |U|^{1/2}} \exp\!\left( -\tfrac{1}{2} (O - \mu)^{\top} U^{-1} (O - \mu) \right), \tag{38}$$

where $d$ is the dimension of the observation vector.
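A C# sketch of evaluating the mixture density (36)–(38) for one state is shown below. The covariance inverses and determinants are assumed to be precomputed elsewhere (the report mentions that the HMMmath class provides, for example, determinant calculation); all parameter names are illustrative:

```csharp
// Sketch of the Gaussian mixture density of equations (36)-(38) for one state.
static double MixtureDensity(
    double[] o,              // observation vector, length d
    double[] weights,        // mixture coefficients c_m, summing to 1
    double[][] means,        // mean vectors mu_m
    double[][,] covInverses, // precomputed inverses of U_m
    double[] covDets)        // precomputed determinants |U_m|
{
    int d = o.Length;
    double density = 0.0;

    for (int m = 0; m < weights.Length; m++)
    {
        // Quadratic form (o - mu)^T U^{-1} (o - mu) from equation (38).
        double quad = 0.0;
        for (int i = 0; i < d; i++)
            for (int j = 0; j < d; j++)
                quad += (o[i] - means[m][i]) * covInverses[m][i, j]
                        * (o[j] - means[m][j]);

        double norm = Math.Pow(2.0 * Math.PI, d / 2.0) * Math.Sqrt(covDets[m]);
        density += weights[m] * Math.Exp(-0.5 * quad) / norm;
    }
    return density;
}
```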
Classification using hidden Markov models

The basic problem when classifying a set of data using an HMM is how to choose a state sequence $Q = q_1 q_2 \cdots q_T$, given a set of observations $O = O_1 O_2 \cdots O_T$ and a model $\lambda$ (Rabiner, 1989). This problem can be solved by finding the state sequence that corresponds to the maximum value of equation (27).
Method

Data

Classified data were given for three different wells:

- Tourm
- Halita
- Jasper

Unclassified data were given for nine wells:

- Agate
- Albite
- Amethyst
- Barite
- Basalt
- Bauxite
- Beryl
- Chaclopyrite
- Halite
The data from each well contained the measured SP value every 0.5 meters. An example of the given data for a classified well is shown in Table 1. Each pre-classified well used for training contained between 25 000 and 30 000 measurement points, while the unclassified wells contained approximately 13 000 measurement points. Data that specified the boundaries of each sequence for the unclassified wells were also given.
Depth         SP value        Class
11394.500000  -1.323600e+010  0.0000000000
11395.000000  -1.347600e+010  0.0000000000
11395.500000  -1.371700e+010  0.0000000000
11396.000000  -1.395700e+010  0.0000000000
11396.500000  -1.440900e+010  0.0000000000
11397.000000  -1.492500e+010  4.0000000000
11397.500000  -1.544200e+010  4.0000000000
11398.000000  -1.572800e+010  4.0000000000
11398.500000  -1.590200e+010  4.0000000000

Table 1: Data structure of each classified well.
All SP values used are normalized to zero mean and unit standard deviation. Furthermore, because there were too few sequences of class 6 and it is similar to class 5, all sequences of class 6 were reclassified as class 5.
Process and structure
For the C# implementation of the hidden Markov model, the following
process chart was designed:
Figure 3: Process chart for the HMM implementation.
By analyzing the process chart, a number of classes were deemed necessary. The classes used in the implementation are:

- HMM – a class designed to start and control the process. This class contains the main method of the program.
- Well – contains methods to read, store and manipulate data from a well. Each physical well is created as an object of Well before it is loaded as pre-classified training data or as unclassified data for analysis.
- Training – contains methods for training the model.
- Classification – contains methods to classify data using the model and methods for evaluation against possible pre-classified data.
- HMMmath – contains necessary mathematical methods, both common functions such as determinant calculation and methods specific to the HMM.
Training

In order to train the model, the pre-classified wells are split into sequences, where a sequence is a part of the well with uniformly classified data. These sequences are then split into subsequences based on local maxima of the SP values, with each subsequence containing at least 30 measurement points (corresponding to 15 meters in a well). A minimum of 30 measurement points was chosen because it is large enough for analysis yet small enough to cover most measurement points and avoid unclassified points in the result; the exact value of 30 was determined by repeated trial and error. In other words, a part of the well is handled as a subsequence if it consists of at least 30 measurement points between two local maxima, see Figure 4 and the sketch below. Each subsequence is handled as one observation used for HMM classification.
Figure 4: A plot of one sequence, containing multiple subsequences.
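A sketch of the splitting step, under the simplifying assumption that a local maximum is any point larger than both of its neighbors (the report does not specify how plateaus or noise are handled); the array range syntax requires C# 8:

```csharp
// Sketch of splitting a sequence into subsequences at local maxima of the
// SP values, keeping at least minPoints points per subsequence. This is an
// illustration of the idea, not the report's actual Well/Training code.
static List<double[]> SplitAtLocalMaxima(double[] sp, int minPoints = 30)
{
    var subsequences = new List<double[]>();
    int start = 0;

    for (int i = 1; i < sp.Length - 1; i++)
    {
        bool isLocalMax = sp[i] > sp[i - 1] && sp[i] > sp[i + 1];
        // Cut only if the current subsequence already has enough points.
        if (isLocalMax && i - start >= minPoints)
        {
            subsequences.Add(sp[start..i]);
            start = i;
        }
    }
    subsequences.Add(sp[start..]); // remainder becomes the final subsequence
    return subsequences;
}
```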
Four features are calculated for each subsequence (a sketch follows the list):

- mean value,
- difference between the maximum and minimum values,
- minimum value,
- median of the derivative of a least squares approximation.

These features are used in the calculation of (33) – (38) as the observed values needed to determine the probability density function.
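A sketch of the feature computation is given below. For simplicity the least squares approximation is taken to be a straight line, so the median of its derivative reduces to the slope; the actual implementation uses a configurable polynomial degree. System.Linq is assumed for Average/Min/Max:

```csharp
// Sketch of the four features for one subsequence.
static double[] ComputeFeatures(double[] sp)
{
    int n = sp.Length;
    double mean = sp.Average();
    double min = sp.Min();
    double range = sp.Max() - min;

    // Least squares line fit over x = 0, 1, ..., n-1 (closed form).
    double xMean = (n - 1) / 2.0;
    double num = 0.0, den = 0.0;
    for (int x = 0; x < n; x++)
    {
        num += (x - xMean) * (sp[x] - mean);
        den += (x - xMean) * (x - xMean);
    }
    double slope = num / den; // derivative of the fitted line is constant

    return new[] { mean, range, min, slope };
}
```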
The Baum-Welch algorithm is initialized by making a suitable initial guess of the covariance matrix, computed from the covariance of the different features, and by assuming that all state transitions are equally probable (i.e. the matrix A is uniformly distributed) and that all states are equally probable as starting positions (i.e. the vector π is uniformly distributed).

The Baum-Welch algorithm is then iterated until the logarithmic probability, equation (27), changes by less than a given tolerance between two iterations, or until the maximum number of iterations is reached.
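A sketch of this initialization and iteration logic, where TrainIteration is a hypothetical method standing in for one full Baum-Welch pass, and tolerance and maxIterations are illustrative parameters:

```csharp
static void Train(int N, int maxIterations, double tolerance)
{
    // Uniform initialization: all transitions and starting states
    // equally probable.
    var A = new double[N, N];
    var pi = new double[N];
    for (int i = 0; i < N; i++)
    {
        pi[i] = 1.0 / N;
        for (int j = 0; j < N; j++)
            A[i, j] = 1.0 / N;
    }

    // Iterate Baum-Welch until the log-likelihood, equation (27), changes
    // by less than the tolerance, or the iteration cap is reached.
    double previous = double.NegativeInfinity;
    for (int iter = 0; iter < maxIterations; iter++)
    {
        double logProb = TrainIteration(A, pi); // hypothetical Baum-Welch pass
        if (Math.Abs(logProb - previous) < tolerance)
            break;
        previous = logProb;
    }
}
```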
Finally, the A matrix, the covariance matrix, the means matrix, the mixture coefficient (weight) matrix and the π vector for each class are saved for further use in the classification. The number of features, the degree of the least squares polynomial used in the feature calculation, the number of Gaussian mixture coefficients and the number of states for each class are also saved.
Classification

In order to classify unknown data, the data have to be split into sequences and normalized in the same way as when training the model. The data are then used to calculate a probability matrix containing the logarithmic likelihood value, equation (27), for each subsequence and class. Each subsequence is then assigned to the class that gives the highest logarithmic likelihood value, and a file containing depths, SP values and the classification result is saved.
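A sketch of this final step, where logLikelihood[s, c] is assumed to hold the value of equation (27) for subsequence s under the model of class c:

```csharp
// Sketch of assigning each subsequence to the class with the highest
// log-likelihood.
static int[] Classify(double[,] logLikelihood)
{
    int numSubsequences = logLikelihood.GetLength(0);
    int numClasses = logLikelihood.GetLength(1);
    var result = new int[numSubsequences];

    for (int s = 0; s < numSubsequences; s++)
    {
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (logLikelihood[s, c] > logLikelihood[s, best])
                best = c;
        result[s] = best;
    }
    return result;
}
```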
Results

It is difficult to display results for the classification of unknown wells: the output of the classification is a matrix with depth, self potential and class, and there is nothing to compare these values to, so the accuracy of the classification cannot be determined. The results are therefore shown for the pre-classified wells instead, where they can be validated and the percentage of correct classifications can be reported.

It is important to note that the results shown in the figures are not absolute. The calculations contain stochastic variables, so the number of correct classifications differs from one training run to another, by as much as a couple of percentage points. Thus, if the percentage of correct classifications is about equal for two wells, their order could differ in another training run.
Training data from one well

To begin with, the classification was performed with training data from one well at a time. The result is shown in Figure 5.

Figure 5: Classification with training data from one well (percent correctly classified per well).

The result shows that each well is best at classifying itself: if the training data come from, for example, Tourm, the model is best at classifying Tourm, and so on. As seen in Figure 5, Tourm is best at classifying itself, with a success rate of approximately 85 %. Halita and Jasper are about equal with a rate of 70 %. Furthermore, Halita is clearly not as good at classifying the other two wells (40-50 %) as Tourm and Jasper are (60 %).
Training data from two wells
The result of the classification performed with training data from two wells at a time is shown in Figure 6.
Figure 6: Classification with training data from two wells at a time
The result shows that, as in the example with training data from one well, the wells are best at classifying themselves. The rate of correct classifications in these cases lies in the interval 70-80 %. For classification of the third, remaining well the success rate is 60-70 %. This is a clear improvement compared with the previous example with training data from one well.
Training data from three wells

Finally, the classification was performed with training data from all three wells. The result is shown in Figure 7.
Figure 7: Classification with training data from all three wells
The result shows that the rate of correct classification lies in the interval 70-80 %. The best result is obtained for Jasper with almost 80 %, followed by Halita with 75 % and finally Tourm with just over 70 %. In this configuration the wells are only classifying themselves, so these results are not really an improvement over the previous configuration with training data from two wells at a time, which also gave results in the interval 70-80 %. What was gained, however, was a high and even result for all three wells.
Performance comparison with the MATLAB implementation

As already mentioned, a working MATLAB implementation was given to help solve the problem, and it is of interest to compare the performance of the two implementations. One of the advantages of C# compared with MATLAB is that, as a compiled language, it generally offers faster execution.

It is important to note that the following comparisons were made with some of the built-in text and warning outputs disabled, since it was the calculation time that was of interest.
Training

To display the performance difference for the training phase, two configurations were used: training with two wells and training with all three wells. Only one two-well case is shown because the calculation times are very similar for all cases where the training data come from two wells. The configuration with training data from one well was not used, since it would have required rewriting the MATLAB implementation and it is less relevant than the other two configurations. Figure 8 shows the performance difference between C# and MATLAB for the training phase.
Figure 8: Performance comparison for the training phase
The result shows that the C# implementation is considerably faster than the MATLAB implementation for training of the model; in both tested cases it is more than eight times faster.
Classification

To display the performance difference for the classification phase, only the unclassified wells were used. There were more unclassified wells than pre-classified wells, and the calculations are very similar in both cases, so the performance results for the unclassified wells should be representative for the pre-classified wells as well. In Figure 9
the performance difference between C# and MATLAB for the
classification phase is displayed.
Figure 9: Performance comparison for the classification phase
The result shows that the C# implementation, in all cases, is considerably
faster than the MATLAB implementation at classifying the unknown
wells. The average speedup is approximately 2.7 times.
Validation

To validate the model, the results of the C# implementation were compared to the results of the MATLAB implementation. To do this, the training of the model was performed without stochastic variables before the classification of the wells. In that way it was possible to compare the two implementations and check whether they produced the same results. See Table 2 and Table 3 for the comparison.
          C#                      MATLAB
Tourm:    66.3 % (80.8 %, top 2)  66.3 % (80.8 %, top 2)
Halita:   70.2 % (83.1 %, top 2)  69.4 % (83.1 %, top 2)
Jasper:   60.7 % (81.1 %, top 2)  60.7 % (81.1 %, top 2)

Table 2: Comparison of the classification without stochastic variables.

          C#                      MATLAB
Tourm:    68.0 % (82.0 %, top 2)  68.6 % (82.0 %, top 2)
Halita:   71.0 % (84.7 %, top 2)  71.0 % (84.7 %, top 2)
Jasper:   65.6 % (85.2 %, top 2)  66.4 % (85.2 %, top 2)

Table 3: Another comparison of the classification without stochastic variables.
The results of the comparison show that the two implementations produce essentially the same output, with only a small difference. The reason for this difference is the least squares fit approximation. As mentioned earlier, all the functions in the C# implementation were written from scratch, and the method chosen for the least squares fit approximation differs somewhat from the method used in MATLAB, called polyfit. Hence the small difference in the output. Table 4 shows a comparison of the least squares fit approximation for a random sequence.
C#                      MATLAB
10663.9000619288        10663.90750010427
-6.3842254380315        -6.384229888206242
0.000955388371274855    0.0009553890368708431

Table 4: Example of the difference in the least squares fit approximation.
The small difference shown in Table 4 affects the calculations in that classes with almost the same probability can be classified differently by the two implementations. More specifically, this can happen when the logarithmic probability values for the most probable classes are too close to each other. It is hard to tell which implementation is more "correct": as seen in Tables 2 and 3, there are examples where each implementation yields a more correct classification than the other.

Note that the values and probabilities used for the validation can only be used for comparison. They cannot be used as a measure of how good the model is at classifying wells, since the stochastic variables were disabled.
Discussion and Conclusion

The assignment was to implement a hidden Markov model in C#. This was done successfully, and the implemented model is a fairly good tool for classifying oil wells. An intermediate goal was to achieve better performance than the given MATLAB implementation. This goal was met: the C# implementation was considerably faster, more than eight times faster for training and on average about three times faster for classification.
Three pre-classified wells were given as reference for training. This is a rather small sample, and to really determine how good the implementation is at classifying data, more wells would have been needed. For example, it would be of interest to see how well the model would classify a fourth well given training data from three wells, and so on. If more pre-classified wells had been given, an intermediate goal could have been to determine the optimal number of wells to use in the training.
A limitation is that the MATLAB implementation was assumed to be correct, since it was used as the reference for checking the C# code. If the MATLAB implementation is in some way incorrect, then the C# implementation is probably incorrect as well. It should be noted, however, that this assignment was first and foremost solved using the given articles, so the implemented model should be working properly.
References

Li, X., Parizeau, M. & Plamondon, R., 2000. Training Hidden Markov Models with Multiple Observations – A Combinatorial Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4).

Rabiner, L. R., 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2).

Rahimi, A., 2000. An Erratum for 'A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition'. [Online] Available at: http://alumni.media.mit.edu/~rahimi/rabiner/rabiner-errata/rabiner-errata.html [Accessed 7 December 2012].

Reynolds, D., 2009. Gaussian Mixture Models. In: Encyclopedia of Biometrics. s.l.:s.n., pp. 659-663.