
DEPARTMENT OF INFORMATION TECHNOLOGY

Conversion of Hidden Markov Model computation to C#

Gustaf Pettersson Jakob Tysk Henrik Vallgren

Report: Project in Computational Science January 2013


Abstract

The project objective was to implement a mathematical model in C#, in order for Schlumberger Ltd to be able to automatically classify oil wells. The theoretical model used was a hidden Markov model with continuous observation densities and a Gaussian mixture model. Three sets of training data were given, as well as a number of unclassified data sets for classification. All data contained a number of measure points, at different depths and with different self potential. The resulting classifications showed that the hidden Markov model is a fairly good tool for classifying oil wells.


Index

Introduction
Theory
  Discrete Markov processes
  Hidden Markov models
  Forward-Backward procedure
  Training of a hidden Markov model
  Scaling
  Training the model with multiple observation sequences
  Continuous observation densities and Gaussian mixture model
  Classification using hidden Markov models
Method
  Data
  Process and structure
  Training
  Classification
Results
  Training data from one well
  Training data from two wells
  Training data from three wells
  Performance comparison with the MATLAB implementation
    Training
    Classification
  Validation
Discussion and Conclusion
References


Introduction

To be able to analyze and process data it is often important to be able to classify the data. Classification brings several important advantages; for example, it can reduce the amount of expensive and time-consuming measurements. At Schlumberger Ltd, the world-leading oil field services company, the classification of oil wells is done by hand. As a result, the classification often becomes subjective and the uncertainty is usually high. A method that can be used to address this problem is the so-called hidden Markov model (HMM). A hidden Markov model is a mathematical model used for, among other things, pattern recognition in data sequences. By using such a model, Schlumberger Ltd could automate the classification of oil wells and save a lot of working time.

The assignment for this project was to implement a HMM in C#, a multi-paradigm programming language developed by Microsoft. The implemented model should be able to classify oil wells. To begin with, the HMM must be trained on pre-classified wells. By doing this, the model will be able to recognize patterns inside the wells. The sequences, or the input, to be classified consist of values of depth and self potential (SP) for different measure points in the well.

To solve the problem, an already working MATLAB implementation was provided. The reason Schlumberger Ltd required an implementation in C# was that the program should be compatible with software used at the company. They use their own system, called Ocean, which is directly compatible with C#. An intermediate goal of this project was to improve performance relative to the MATLAB implementation.

Theory

The mathematical model used for the classification is the hidden Markov model (HMM), a stochastic model based on Markov processes. Much work in this area has been done by Lawrence R. Rabiner, and the following theoretical background is drawn from these papers:

A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition (Rabiner, 1989)

An Erratum for "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Rahimi, 2000)

Gaussian Mixture Models (Reynolds, 2009)

Training Hidden Markov Models with Multiple Observations - A Combinatorial Method (Li, et al., 2000)

Discrete Markov processes

A Markov process is a stochastic model in which a system changes states according to a given model, where the transition between the states S_i and S_j is defined by a transition probability a_{ij} according to:

a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N    (1)

N is the number of states in the system, q_t is the state at time t and a_{ij} is a probabilistic variable such that:

a_{ij} \ge 0    (2)

\sum_{j=1}^{N} a_{ij} = 1    (3)

Each state is observable, i.e. if the system is in state S_i an observer can immediately determine this from the system's output. The transition probabilities between the different states vary depending on the current state, and for a system with N states it is good practice to collect the transition probabilities in a transition probability matrix A such that:

A = \{a_{ij}\}, \quad 1 \le i, j \le N    (5)

The rows of the transition probability matrix describe the state the system is currently in and the columns describe the state the system will change to.

Assume a system with states S = \{S_1, S_2, S_3\}, equipped with the transition probability matrix:


(6)

Each matrix entry a_{ij} represents the probability for the system to go from state S_i to state S_j. The system may then be illustrated with the following graph:

Figure 1: An illustration of a discrete Markov process with 3 states.

The system attains different states with different probabilities, determined by the initial state probability distribution \pi = \{\pi_i\} and:

\pi_i = P(q_1 = S_i), \quad 1 \le i \le N    (7)

\sum_{i=1}^{N} \pi_i = 1    (8)

Using \pi and A it is now possible for our example system (6) to determine the probability of a sequence of observed states q_1 q_2 \cdots q_T:

P(O \mid \text{Model}) = \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}}    (9)
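Equation (9) is a direct product of probabilities, which can be sketched in a few lines. The following Python illustration uses an invented 3-state transition matrix and a uniform initial distribution, since the report's matrix (6) is not preserved in this transcript (the report's own implementation is in C#):

```python
import numpy as np

# Illustrative transition matrix A and initial distribution pi for a
# 3-state Markov chain. The numeric values are made up for this example.
A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
pi = np.array([1 / 3, 1 / 3, 1 / 3])

def sequence_probability(states, A, pi):
    """P(q1,...,qT) = pi[q1] * prod_t a[q_t, q_{t+1}], as in equation (9)."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

# Probability of the state sequence S1 -> S2 -> S3 -> S2 (0-indexed here):
p = sequence_probability([0, 1, 2, 1], A, pi)
```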

Hidden Markov models

The discrete Markov process can be expanded to a hidden Markov model

(HMM). The principal difference between these two models is that for the

HMM it isn’t possible to directly observe the states of the system. What is

observable is instead a stochastic function, which depends on the states of

the system.

A hidden Markov model is characterized by five principal parts:

a set of hidden states S = \{S_1, S_2, \ldots, S_N\}, where N is the total number of states in the system,

a transition probability matrix A = \{a_{ij}\} as in the discrete Markov process,

a set of observation symbols V = \{v_1, v_2, \ldots, v_M\}, where M is the total number of possible observations for each state,

an observation symbol probability distribution B = \{b_j(k)\}, where b_j(k) = P(v_k \text{ at } t \mid q_t = S_j),

an initial state probability distribution \pi = \{\pi_i\} as in the discrete Markov process.

In the same way as for the discrete Markov process, the HMM can be

illustrated as:

Figure 2: An illustration of a hidden Markov model, with observable

output and “hidden” states.

The HMM is usually denoted:

\lambda = (A, B, \pi)    (10)

Forward-Backward procedure

Given a model \lambda and a set of observations O = O_1 O_2 \cdots O_T, the forward (\alpha) and backward (\beta) variables can be defined as:

\alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i \mid \lambda)    (11)

\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda)    (12)


and can be solved using induction:

1. Initialization:

\alpha_1(i) = \pi_i b_i(O_1), \quad 1 \le i \le N    (13)

\beta_T(i) = 1, \quad 1 \le i \le N    (14)

2. Induction:

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T - 1, \; 1 \le j \le N    (15)

\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), \quad t = T - 1, \ldots, 1, \; 1 \le i \le N    (16)

3. Termination:

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)    (17)

Using (14), this can be rewritten as:

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \beta_T(i)    (18)
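The recursions (13)-(18) can be written out compactly. The following is a minimal Python sketch (the report's implementation is in C#); the two-state model and observation sequence are invented for illustration:

```python
import numpy as np

# Unscaled forward-backward pass for a discrete-output HMM,
# following equations (11)-(18). All model values are illustrative.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, k] = b_i(v_k)
pi = np.array([0.6, 0.4])                 # initial distribution
O  = [0, 1, 0]                            # observation symbol indices

T, N = len(O), A.shape[0]
alpha = np.zeros((T, N))
beta  = np.zeros((T, N))

alpha[0] = pi * B[:, O[0]]                # initialization (13)
for t in range(T - 1):                    # induction (15)
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]

beta[T - 1] = 1.0                         # initialization (14)
for t in range(T - 2, -1, -1):            # induction (16)
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

prob = alpha[T - 1].sum()                 # termination (17)
# Per equation (18), sum_i alpha_t(i) * beta_t(i) = P(O|lambda) at any t.
```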

Training of a hidden Markov model

The basic problem when trying to train a HMM is how to find the model parameter vector \lambda that maximizes P(O \mid \lambda) given a set of observations O = O_1 O_2 \cdots O_T (Rabiner, 1989). This problem can be solved using the iterative Baum-Welch algorithm. Start by defining the probability of being in state S_i at time t and state S_j at time t + 1 as:

\xi_t(i, j) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}    (19)

and the probability of being in state S_i at time t, given the observation sequence and the model, as:

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)    (20)


For each iteration of the algorithm, the re-estimated model \bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi}) will be updated in the following way:

\bar{\pi}_i = \gamma_1(i)    (21)

\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}    (22)

\bar{b}_j(k) = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}    (23)
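One reestimation step, equations (19)-(23), can be sketched as follows in Python (illustrative model values; unscaled variables, which is acceptable for a short sequence; the report's implementation is in C#):

```python
import numpy as np

# One Baum-Welch reestimation step for a discrete-output HMM.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 0, 0]
T, N = len(O), 2

# Forward and backward variables (equations (13)-(16), unscaled).
alpha = np.zeros((T, N)); beta = np.zeros((T, N))
alpha[0] = pi * B[:, O[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
prob = alpha[-1].sum()

# xi_t(i,j), equation (19), and gamma_t(i), equation (20).
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1] / prob
gamma = (alpha * beta) / prob

# Re-estimated model, equations (21)-(23).
pi_new = gamma[0]
A_new  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new  = np.zeros_like(B)
for k in range(B.shape[1]):
    mask = np.array(O) == k
    B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
```

Each re-estimated quantity remains a proper probability distribution, which is a useful sanity check on the formulas.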

Scaling

\alpha_t(i) is a sum of a number of terms that are products of the a and b probabilities. As both a and b are (in most cases significantly) less than 1, \alpha_t(i) will approach zero exponentially for large values of t. The values become so small that a computer cannot represent them correctly within its floating-point precision. Therefore, some kind of scaling is needed to keep the values within a computationally usable range.

Introduce a scaling coefficient c_t such that:

c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}    (24)

The scaled induction formulas for \alpha and \beta, equations (15) and (16), can then be written as:

\hat{\alpha}_{t+1}(j) = c_{t+1} \left[ \sum_{i=1}^{N} \hat{\alpha}_t(i) a_{ij} \right] b_j(O_{t+1})    (25)

\hat{\beta}_t(i) = c_t \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \hat{\beta}_{t+1}(j)    (26)

With these scaled variables the probability (17) can no longer be read off directly, and P(O \mid \lambda) itself would typically be outside the representable range. In order to still be able to compare probabilities, the probability equation is replaced with the log-probability:

\log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t    (27)
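A scaled forward pass accumulating the log-likelihood of equation (27) can be sketched as below. The model is the same invented two-state example used earlier; the long repeated observation sequence is chosen so that an unscaled pass would underflow (the report's implementation is in C#):

```python
import numpy as np

# Scaled forward pass, equations (24)-(27): each alpha_t is normalized by
# c_t = 1 / sum_i alpha_t(i), and the log-likelihood is recovered as
# log P(O|lambda) = -sum_t log c_t. Model values are illustrative.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
O  = [0, 1, 0] * 200      # long sequence: unscaled alpha would underflow

log_prob = 0.0
alpha = pi * B[:, O[0]]
for t in range(len(O)):
    if t > 0:
        alpha = (alpha @ A) * B[:, O[t]]   # induction (15)
    c = 1.0 / alpha.sum()                  # scaling coefficient (24)
    alpha = alpha * c                      # scaled variable (25)
    log_prob -= np.log(c)                  # log-likelihood (27)
```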

Training the model with multiple observation sequences

The training formulas described above apply to one observation sequence. In many, if not most, practical applications there are multiple observation sequences available, and the model should be able to handle this properly. Let:

O = \{ O^{(1)}, O^{(2)}, \ldots, O^{(K)} \}    (28)

denote a set of K observation sequences and let:

O^{(k)} = O_1^{(k)} O_2^{(k)} \cdots O_{T_k}^{(k)}    (29)

be the k:th observation sequence. Since the reestimation formulas, equations (21) and (22), are based on the frequency of occurrence of the various outputs and transitions, the modified reestimation formulas may be written as:

\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)}    (30)

\bar{b}_j(\ell) = \frac{\sum_{k=1}^{K} \sum_{t=1,\, O_t^{(k)} = v_\ell}^{T_k} \gamma_t^{(k)}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(j)}    (31)
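The essence of (30)-(31) is that the numerators and denominators of the single-sequence formulas are summed over all K sequences before dividing. A Python sketch of the pooled update for A, using synthetic stand-ins for the per-sequence xi and gamma arrays (the report's implementation is in C#):

```python
import numpy as np

def reestimate_A(xis, gammas):
    """Pooled reestimation (30).
    xis[k]: (T_k - 1, N, N) array of xi_t(i, j) for sequence k.
    gammas[k]: (T_k, N) array of gamma_t(i) for sequence k."""
    num = sum(x.sum(axis=0) for x in xis)           # sum over k and t
    den = sum(g[:-1].sum(axis=0) for g in gammas)   # sum over k and t
    return num / den[:, None]

# Synthetic per-sequence quantities consistent with equation (20),
# i.e. sum_j xi_t(i, j) = gamma_t(i).
rng = np.random.default_rng(1)
N = 3
gammas, xis = [], []
for T in (5, 7):                         # two sequences of different length
    g = rng.random((T, N))
    w = rng.random((T - 1, N, N))
    w /= w.sum(axis=2, keepdims=True)    # row-stochastic split of gamma
    gammas.append(g)
    xis.append(g[:-1, :, None] * w)

A_new = reestimate_A(xis, gammas)
```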

Continuous observation densities and Gaussian mixture model

The theory above concerns cases where the observations can be characterized by a discrete and finite number of observation symbols. In this application, however, the output is a continuous observation density, which requires some restrictions on the model for the parameters to be re-estimated consistently.

This can be done by using a Gaussian mixture model and the reestimation formulas:

\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j, k)}    (33)

\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k) \, O_t}{\sum_{t=1}^{T} \gamma_t(j, k)}    (34)

\bar{U}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k) \, (O_t - \mu_{jk})(O_t - \mu_{jk})^{\mathsf{T}}}{\sum_{t=1}^{T} \gamma_t(j, k)}    (35)

where c_{jk} is a mixture coefficient, \mu_{jk} is a mean vector and U_{jk} is a covariance matrix. Using these, the probability density function b becomes:

b_j(O) = \sum_{k=1}^{M} c_{jk} \, \mathcal{N}(O, \mu_{jk}, U_{jk})    (36)

with:

\mathcal{N}(O, \mu_{jk}, U_{jk}) = \frac{1}{\sqrt{(2\pi)^d \, |U_{jk}|}} \exp\!\left( -\tfrac{1}{2} (O - \mu_{jk})^{\mathsf{T}} U_{jk}^{-1} (O - \mu_{jk}) \right)    (37)

\gamma_t(j, k) = \gamma_t(j) \, \frac{c_{jk} \, \mathcal{N}(O_t, \mu_{jk}, U_{jk})}{\sum_{m=1}^{M} c_{jm} \, \mathcal{N}(O_t, \mu_{jm}, U_{jm})}    (38)
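Evaluating the mixture density (36)-(37) for one state is straightforward. A Python sketch with an invented two-component mixture over a 4-dimensional feature vector, matching the four features used later in the report (the report's implementation is in C#):

```python
import numpy as np

def gaussian(o, mu, cov):
    """Multivariate normal density N(o; mu, cov), equation (37)."""
    d = len(mu)
    diff = o - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def mixture_density(o, weights, means, covs):
    """b_j(O) = sum_k c_jk * N(O; mu_jk, U_jk), equation (36)."""
    return sum(c * gaussian(o, mu, cov)
               for c, mu, cov in zip(weights, means, covs))

# Illustrative parameters for one state: mixture coefficients c_jk,
# mean vectors mu_jk and covariance matrices U_jk.
weights = np.array([0.6, 0.4])
means   = [np.zeros(4), np.ones(4)]
covs    = [np.eye(4), 2.0 * np.eye(4)]
b = mixture_density(np.zeros(4), weights, means, covs)
```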

Classification using hidden Markov models

The basic problem when trying to classify a set of data using a HMM is how to choose a state sequence Q = q_1 q_2 \cdots q_T, given a set of observations O = O_1 O_2 \cdots O_T and a model \lambda (Rabiner, 1989). This problem can be solved by finding the state sequence that corresponds to the maximum of the log-probability, equation (27).

Method

Data

Classified data were given for three different wells:

Tourm

Halita

Jasper

Unclassified data were given for nine wells:

Agate

Albite

Amethyst

Barite

Basalt


Bauxite

Beryl

Chalcopyrite

Halite

The data from each well contained the measured SP value every 0.5

meters. An example of the given data for a classified well is shown in

Table 1. Each pre-classified well used for training contained between

25 000 and 30 000 measure points, while the unclassified wells contained

approximately 13 000 measure points. Data that specified the boundaries

of each sequence for the unclassified wells were also given.

Depth         SP value         Class
11394.500000  -1.323600e+010   0.0000000000
11395.000000  -1.347600e+010   0.0000000000
11395.500000  -1.371700e+010   0.0000000000
11396.000000  -1.395700e+010   0.0000000000
11396.500000  -1.440900e+010   0.0000000000
11397.000000  -1.492500e+010   4.0000000000
11397.500000  -1.544200e+010   4.0000000000
11398.000000  -1.572800e+010   4.0000000000
11398.500000  -1.590200e+010   4.0000000000

Table 1: Data structure of each classified well.

All SP values used are normalized with 0 mean and standard deviation 1.

Furthermore, as a consequence of too few sequences of class 6 and its

similarity to class 5, all sequences of class 6 were reclassified as class 5.
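The normalization step is a standard z-score. A minimal Python illustration with arbitrary SP values of the magnitude shown in Table 1 (the report's implementation is in C#):

```python
import numpy as np

# Normalize the SP values of one well to zero mean and unit standard
# deviation, as described above. The values are arbitrary examples.
sp = np.array([-1.3236e10, -1.3476e10, -1.3717e10, -1.3957e10, -1.4409e10])
sp_norm = (sp - sp.mean()) / sp.std()
```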

Process and structure

For the C# implementation of the hidden Markov model, the following

process chart was designed:


Figure 3: Process chart for the HMM implementation.

By analyzing the process chart a number of classes were deemed

necessary. The classes used in the implementation are:

HMM – a class designed to start and control the process. This class

contains the main method of the program.

Well – contains methods to read, store and manipulate data from a

well. Each physical well is created as an object of Well before it’s

loaded as pre-classified training data or unclassified data for

analysis.

Training – contains methods for training the model.

Classification – contains methods to classify data using the model

and methods for evaluation against possible pre-classified data.

HMMmath – contains necessary mathematical methods, both

common functions such as determinant calculation and specific

methods for the HMM.

Training

In order to train the model, the pre-classified wells are split into sequences, each of which is a part of the well with uniformly classified data. These sequences are then split into subsequences at local maxima of the SP values, with each subsequence containing at least 30 measure points (corresponding to 15 meters in a well). The reason for requiring at least 30 measure points is that this gives a data set large enough for analysis, yet small enough to cover most measure points and avoid unclassified points in the result. The choice of exactly 30 measure points was made by repeated trial-and-error tests. In other words, a part of the well is handled as a subsequence if it consists of at least 30 measure points between two local maxima, see Figure 4. Each subsequence is handled as one observation used for HMM classification.
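The split described above can be sketched as follows. This is a hypothetical Python helper, not the report's actual code: it cuts at a local maximum only once at least 30 points have accumulated since the previous cut:

```python
import numpy as np

MIN_POINTS = 30   # at 0.5 m spacing this corresponds to 15 m of well

def split_subsequences(sp):
    """Cut a sequence of SP values at local maxima, accepting a cut only
    when at least MIN_POINTS points lie between consecutive cuts."""
    cuts, last = [], 0
    for i in range(1, len(sp) - 1):
        is_local_max = sp[i - 1] < sp[i] > sp[i + 1]
        if is_local_max and i - last >= MIN_POINTS:
            cuts.append(i)
            last = i
    bounds = [0] + cuts + [len(sp)]
    return [sp[a:b] for a, b in zip(bounds, bounds[1:])]

# Example: a noisy oscillating signal standing in for SP data.
rng = np.random.default_rng(0)
sp = np.sin(np.linspace(0, 20, 400)) + 0.05 * rng.standard_normal(400)
subs = split_subsequences(sp)
```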

Figure 4: A plot of one sequence, containing multiple subsequences.

Four features are calculated for each subsequence:

mean value,

difference between maximum and minimum values,

minimum value,

median of the derivative of a least square approximation.

These features are used in the calculation of (33) – (38) as the observed

values needed to determine the probability density function.
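The four features can be computed as below. This is a Python sketch (the report's implementation is in C#); the polynomial degree is an assumption here, since the report saves the degree as a parameter but does not state its value:

```python
import numpy as np

def features(sp, degree=3):
    """The four features of one subsequence, as listed above.
    The degree of the least-squares polynomial is an assumed value."""
    x = np.arange(len(sp))
    coeffs = np.polyfit(x, sp, degree)           # least-squares fit
    deriv = np.polyval(np.polyder(coeffs), x)    # derivative of the fit
    return np.array([
        sp.mean(),               # mean value
        sp.max() - sp.min(),     # difference between max and min
        sp.min(),                # minimum value
        np.median(deriv),        # median of the derivative of the LS fit
    ])

f = features(np.linspace(-1.0, 1.0, 40))         # a linear toy subsequence
```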

The Baum-Welch algorithm is initialized by making a suitable initial guess of the covariance matrix, obtained by calculating the covariance of the different features, and by assuming that all states are equally probable (i.e. the matrix A is uniform) and that all states are equally probable as starting position (i.e. the vector \pi is uniform).
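The initial guess amounts to uniform A and \pi plus an empirical feature covariance. A short Python illustration (the number of states and the feature matrix are invented; the report's implementation is in C#):

```python
import numpy as np

# Initial guess for Baum-Welch as described above. N and the feature
# matrix (one row per subsequence, four features) are illustrative.
N = 4
rng = np.random.default_rng(2)
feats = rng.standard_normal((100, 4))

A0   = np.full((N, N), 1.0 / N)        # all transitions equally probable
pi0  = np.full(N, 1.0 / N)             # all starting states equally probable
cov0 = np.cov(feats, rowvar=False)     # covariance of the four features
```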

The Baum-Welch algorithm is then iterated until the logarithmic probability, equation (27), changes by less than a set tolerance between two iterations, or until the maximum number of iterations is reached.

Finally, the A matrix, the covariance matrix, the means matrix, the mixture

coefficients (weights) matrix and π vector for each class, are saved for

further use in the classification. The number of features, the degree of the

least squares polynomial used in the features calculation, the number of

Gaussian mixture coefficients and the number of states for each class are

also saved.

Classification

In order to classify unknown data the data have to be split into sequences

and normalized in the same way as when training the model. The data are

then used to calculate a probability matrix containing the logarithmic

likelihood value, equation (27), for each subsequence and class.

Using this, each subsequence is classified to the class that gives the highest

logarithmic likelihood value and a file containing depths, SP values and

result of classification is saved.
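The classification step above reduces to a row-wise argmax over the matrix of logarithmic likelihoods. A minimal Python sketch with invented values (the report's implementation is in C#):

```python
import numpy as np

# Each row is one subsequence, each column one class; entries are the
# log-likelihoods of equation (27). All numbers are illustrative.
log_lik = np.array([
    # class 0  class 1  class 2
    [-120.3,  -118.9,  -131.2],   # subsequence 0
    [-210.5,  -205.1,  -203.8],   # subsequence 1
    [ -95.0,  -101.7,   -99.9],   # subsequence 2
])

# Assign each subsequence to the class with the highest log-likelihood.
predicted = log_lik.argmax(axis=1)
```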

Results

It is difficult to display results for the classification of unknown wells. The

reason for this is that the output of the classification is a matrix with depth,

self potential and class, and there is nothing to compare these values to.

Thus, it is difficult to determine the accuracy of the classification. To

display the results the pre-classified wells were chosen instead. In that way

the results can be validated and the percentage of correct classifications

can be shown.

It is important to note that the results shown in the figures are not absolute.

The reason for this is that the calculations contain stochastic variables.

Because of this the amount of correct classifications differs from one

training run to another. The difference can be as large as a couple of

percent. Thus, if the percentage of correct classifications is about equal for

two wells the order could be different for a different training run.


Training data from one well

To begin with the classification was performed with training data from one

well at a time. The result is shown in Figure 5.

Figure 5: Classification with training data from one well

The result shows that each well is best at classifying itself. That is, if the training data come from, for example, Tourm, the model is best at classifying Tourm, and so on. As seen in Figure 5, Tourm is best at classifying itself, with a success rate of approximately 85 %. Halita and Jasper are about equal, with a rate of 70 %. Furthermore, Halita is clearly not as good at classifying the other two wells (40-50 %) as Tourm and Jasper are (60 %).

Training data from two wells

The result of the classification, performed with training data from two

wells at a time is shown in Figure 6.


Figure 6: Classification with training data from two wells at a time

The result shows that, as in the example with training data from one well, the wells are best at classifying themselves. The rate of correct classifications for these cases lies in the interval 70-80 %. For classification of the third, remaining well the success rate is 60-70 %. This is a clear improvement compared to the previous example with training data from one well.

Training data from three wells

Finally the classification was performed with training data from all three

wells. The result is shown in Figure 7.


Figure 7: Classification with training data from all three wells

The result shows that the rate of correct classification is located in the

interval 70-80 %. The best result is acquired for Jasper with almost 80 %,

and then comes Halita with 75 %, and finally Tourm with just over 70 %.

For this configuration the wells are only classifying themselves. Thus, these results are not really an improvement over the previous configuration with training data from two wells at a time, which also gave results in the interval 70-80 %. What was gained, however, is a high and even result across all three wells.

Performance comparison with the MATLAB implementation

As already mentioned a working MATLAB implementation was given to

help solve the problem. It is of interest to compare the performance of the

two implementations. One of the advantages of C# compared to MATLAB is that, as a compiled language, it can generally offer faster performance.

It is important to note that the following comparisons were made without some of the built-in text and warning outputs, since it was the calculation time that was of interest.


Training

To display the performance difference for the training phase two

configurations were used. The first is an example of training with two wells and the second of training with all three wells.

only using these two cases is that the calculation times are very similar for

the cases where the training data come from two wells. The configuration

with training data from one well was not used because it would need

rewriting of the MATLAB implementation and because it is not as

relevant as the other two configurations. In Figure 8 the performance

difference between C# and MATLAB for the training phase is displayed.

Figure 8: Performance comparison for the training phase

The result shows that the C# implementation is considerably faster than

the MATLAB implementation for training of the model. In both the tested

cases the C# implementation is more than eight times faster than the

MATLAB implementation.

Classification

To display the performance difference for the classification phase, only the

unclassified wells were used. The reason for this is that there were more

unclassified wells than pre-classified wells and the calculations are very

similar for both cases. Thus, the performance results for the unclassified

wells should also be representative for the pre-classified wells. In Figure 9


the performance difference between C# and MATLAB for the

classification phase is displayed.

Figure 9: Performance comparison for the classification phase

The result shows that the C# implementation, in all cases, is considerably

faster than the MATLAB implementation at classifying the unknown

wells. The average speedup is approximately 2.7 times.

Validation

To validate the model, the results of the C# implementation were compared

to the results of the MATLAB implementation. To do this the training of

the model was performed without stochastic variables before the

classification of the wells. In that way it was possible to compare the two

implementations and check if they produced the same result. See Table 2

and Table 3 for the comparison.


C#                               MATLAB
Tourm:  66.3 % (80.8 %, top 2)   Tourm:  66.3 % (80.8 %, top 2)
Halita: 70.2 % (83.1 %, top 2)   Halita: 69.4 % (83.1 %, top 2)
Jasper: 60.7 % (81.1 %, top 2)   Jasper: 60.7 % (81.1 %, top 2)

Table 2: Comparison of the classification without stochastic variables

C#                               MATLAB
Tourm:  68.0 % (82.0 %, top 2)   Tourm:  68.6 % (82.0 %, top 2)
Halita: 71.0 % (84.7 %, top 2)   Halita: 71.0 % (84.7 %, top 2)
Jasper: 65.6 % (85.2 %, top 2)   Jasper: 66.4 % (85.2 %, top 2)

Table 3: Another comparison of the classification without stochastic variables

The results of the comparison show that the two implementations produce essentially the same output, with only a small difference. The reason for this difference is the least squares fit approximation. As mentioned earlier, all the functions in the C# implementation were written from scratch. The method chosen for the least squares fit approximation differs somewhat from the method used in MATLAB, called polyfit. Thus, there is a small difference in the output. In Table 4 a comparison of the least squares fit approximation for a random sequence is shown.


C#                       MATLAB
10663.9000619288         10663.90750010427
-6.3842254380315         -6.384229888206242
0.000955388371274855     0.0009553890368708431

Table 4: Example of the difference in the least squares fit approximation

The small difference shown in Table 4 affects the calculations in that classes with almost the same probability can be classified differently by the two implementations. More specifically, this can happen when the logarithmic probability values for the most probable classes are too close to each other. It is hard to tell which implementation is more "correct". As seen in Tables 2 and 3, there are examples where each implementation yields a more correct classification than the other.

Observe that the values and probabilities used for validation can only be used for comparison. They cannot be used as a measure of how good the model is at classifying wells, since the stochastic variables are absent.

Discussion and Conclusion

The assignment was to implement a hidden Markov model in C#. This was done successfully, and the implemented model is a fairly good tool for classifying oil wells. An intermediate goal was to achieve better performance than the given MATLAB implementation. This goal was met, since the C# implementation was considerably faster: more than eight times faster for training and on average about three times faster for classification.

Three pre-classified wells were given as reference for training. This is a

quite small sample and to really determine how good the implementation

is at classifying data more wells would have been needed. For example, it

would be of interest to see how well the model would classify a fourth well

with training data from three wells and so on. If more pre-classified wells

were given, an intermediate goal could have been to determine the optimal number of wells to use in the training.

A limitation could be that the MATLAB implementation was assumed to be correct. It was used as a reference to check whether the C# code was correct. If the MATLAB implementation is in some way incorrect, then the C# implementation is probably incorrect as well. It should be noted, though, that this assignment was first and foremost solved using the given articles. Thus, the implemented model should be working properly.

References

Li, X., Parizeau, M. & Plamondon, R., 2000. Training Hidden Markov Models with Multiple Observations - A Combinatorial Method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4).

Rabiner, L. R., 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2).

Rahimi, A., 2000. An Erratum for 'A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition'. [Online] Available at: http://alumni.media.mit.edu/~rahimi/rabiner/rabiner-errata/rabiner-errata.html [Accessed 07 12 2012].

Reynolds, D., 2009. Gaussian Mixture Models. In: Encyclopedia of Biometrics. s.l.:s.n., pp. 659-663.
