
Kingdom of Saudi Arabia Ministry of Higher Education King Abdul Aziz University

Faculty of Computing & Information Technology Department of Computer Science

Connected Word Speech Recognition

B.S. Project Report

Performed by

Zahidur Rahman Mohammed Abul Basher

First Semester of the year 1428/1429 A.H. - 2007/2008 A.D.


ACCEPTANCE The report prepared by the student Zahidur Rahman Mohammed Abul Basher, ID: 0457588, has been reviewed. We therefore recommend accepting it and submitting it to the discussion committee to grant him the degree of Bachelor of Science.

Supervisor Dr. Reda Al-Kheraiby


TESTIMONY

This research project has been conducted through my own efforts, with the help of the references specified herein.

Zahidur Rahman Mohammed I.D: 0457588 Mobile: 0541112494 [email protected] Signature:


DEDICATION

To my parents for their support throughout my life


ACKNOWLEDGEMENT I would like to thank Dr. Reda Al-Kheraiby for his invaluable support, encouragement and guidance, without which this project would have been an exercise in futility.


ABSTRACT This report takes a brief look at the basic building blocks of a speech recognition engine and describes the implementation of its different modules. The report covers the following topics:

1) Reading and saving sound in .wav format files.
2) Detecting the presence and absence of words in sound signals.
3) Extracting features from sound signals using Mel Frequency Cepstrum Coefficients and several other techniques.
4) Modeling the features using a Continuous Hidden Markov Model.

Results of the experiments that were conducted are provided in the Appendix.


Table of Contents

CHAPTER ONE .......................................................... 9
Introduction .......................................................... 9
1.1 Project Identification ............................................ 9
1.2 Speech Recognition: definition and issues ........................ 11
1.3 Project Beneficiaries ............................................ 12
1.4 Tools Used ....................................................... 12
1.5 Future Work ...................................................... 13
1.6 Difficulties ..................................................... 13
1.7 Literature Review ................................................ 13
CHAPTER TWO ......................................................... 14
Speech Recognition - Basic Techniques ............................... 14
2.1 Signal Representation & Modeling ................................ 14
2.2 Sound Recording and Word Detection .............................. 16
2.2.1 Microphone .................................................... 16
2.2.2 Word Detector ................................................. 16
2.3 Feature Extractor ............................................... 17
2.3.1 Zero Mean ..................................................... 18
2.3.2 Pre-emphasis .................................................. 19
2.3.3 Framing ....................................................... 19
2.3.4 Windowing ..................................................... 20
2.3.5 End Point Detection ........................................... 21
2.3.6 Spectral Analysis ............................................. 22
2.4 Knowledge Models ................................................ 24
2.4.1 Acoustic Model ................................................ 24
2.4.2 Language Model ................................................ 26
2.5 HMM Recognition and Training .................................... 27
2.5.1 HMM and Speech Recognition .................................... 28
2.5.2 Recognition using HMM ......................................... 28
2.5.3 Forward Procedure ............................................. 29
2.5.4 Backward Procedure ............................................ 30
2.5.5 Baum-Welch Re-estimation Procedure ............................ 31
2.5.6 Occurrence Probability ........................................ 33
2.5.7 Training the Model ............................................ 34
CHAPTER THREE ....................................................... 35
System Analysis and Design .......................................... 35
3.1 Use Case Model .................................................. 35
3.2 Analysis Model .................................................. 35
3.3 Design Model .................................................... 36
3.4 Class Diagram Model ............................................. 36
3.5 Interaction Diagram ............................................. 37
3.6 State Chart & State Transition Graph ............................ 38
3.7 Interface Design ................................................ 40
3.8 Design of Some Methods .......................................... 44
REFERENCES .......................................................... 57
CHAPTER FOUR ........................................................ 59
Appendix ............................................................ 59
8.1 WAV File Format ................................................. 59
8.1.1 RIFF WAVE Chunk ............................................... 59
8.1.2 FMT SubChunk .................................................. 59
8.1.3 Data SubChunk ................................................. 59
8.2 Results & Conclusions ........................................... 60


CHAPTER ONE

Introduction We, as human beings, have five senses, all of which give us a picture of what the world and life are like. Talking is what people need, and it is the best communication tool found since time immemorial. Of course, talking is worth little if people cannot understand one another; therefore, we have ears as well as tongues. All of this makes speech and its recognition seem very easy: nobody teaches babies how to talk, yet they learn with ease. It is true that talking and recognizing speech is very easy for us, but how hard it is to teach a machine to produce meaningful speech, and to recognize such speech, has long been underestimated. Many developments in technology have led people to revisit this question and come up with a number of solutions.

1.1 Project Identification In science fiction stories we see people talking with robots and computers, and we may well face such scenes in the near future. Although speech recognition is a very difficult task, it can already be used in many applications to some degree. Automatic speech recognition systems can be implemented in computers for dictation and for controlling programs, which may be very helpful for handicapped people. Automated telephone services may also use speech recognition capabilities to make life easier.

Figure 1.1: AI Applications (a hierarchy from Computer Science to Artificial Intelligence to Pattern Recognition, whose branches include Speech Recognition and Image Recognition, among others)

Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report) and content-based spoken audio search (e.g., finding a podcast where particular words were spoken).

In this project a connected word speech recognition system will be implemented. This part of artificial intelligence, and many others that depend on the human voice, falls under "speech-based pattern recognition" applications. Other applications of this type are:

1. Speech Recognition: aims to know the contents of the speech.

What is being said?

Figure 1.2: Speech Recognition

2. Speaker Recognition: aims to know the person who is talking.

Who is speaking?

Figure 1.3: Speaker Recognition

3. Language Identification: aims to know the spoken language.

What language is being spoken? 

Figure 1.4: Language Identification


1.2 Speech Recognition: definition and issues
Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or, erroneously, voice recognition) is the process of converting a speech signal to a sequence of words, in the form of digital data, by means of an algorithm implemented as a computer program. The recognized words can be the final result, as in applications such as command & control, data entry and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding. Recognition systems may be designed in many modes to achieve specific objectives or performance criteria:
1. Speaker Dependent / Independent Systems: In speaker dependent systems, the user is asked to utter predefined words or sentences. These acoustic signals form the training data, which are used for recognition of the input speech. Since such a system is used by only a predefined speaker, its performance is higher than that of a speaker independent system.
2. Isolated Word Recognition: This is also called a discrete recognition system. In this mode there has to be a pause between uttered words, so the system does not have to find boundaries between words.
3. Continuous Speech Recognition: These systems are the ultimate goal of a recognition process. No matter how or when a word is uttered, it is recognized in real time and an action is performed accordingly. Changes in speaking rate, careless pronunciation, detecting word boundaries and real-time constraints are the main problems in this recognition mode.
4. Vocabulary Size: The smaller the vocabulary of a recognition system, the higher the recognition performance. Specific tasks may use small vocabularies; however, a natural system should perform speaker independent continuous recognition over a large vocabulary, which is the most difficult case.
5. Keyword Spotting: These systems are used to detect a word in continuous speech. They can therefore be as accurate as isolated word recognition while still handling continuous speech.


Figure 1.5: Classification of speech recognition systems: by speaking mode (isolated word, connected word), by speaking style (read speech, spontaneous speech), by vocabulary size (small, medium, big) and by enrollment (speaker dependent, speaker independent)

This project covers "connected word, speaker independent, small vocabulary" speech recognition. The project is composed of two phases:

1) Training phase: a number of words are trained in order to build a model for each word.

2) Recognition phase: a sequence of connected words is entered via microphone or from an input file, and the system tries to recognize these words.

1.3 Project Beneficiaries 1) Telecommunications companies. 2) Cellular phone manufacturers. 3) Broadcasting stations.

1.4 Tools Used 1) Programming language: C#.NET. 2) Operating system: Windows XP. 3) Additional devices: microphone. 4) Libraries: Microsoft DirectX, to pick up the voice from the microphone. 5) Assisting software: Microsoft Office.


1.5 Future Work 1) Improving the accuracy of the system using other available features. 2) Using level building pattern matching in recognition to make the system more reliable. 3) Developing the system into a more complex speech recognition system based on phones rather than words. 4) Combining a neural network with the Continuous Hidden Markov Model to classify the features of each frame into phonetic-based categories. 5) Integrating the system into practical speech recognition applications (e.g. automatic translation, dictation, hands-free computing, home automation, interactive voice response, medical transcription, pronunciation, etc.).

1.6 Difficulties 1) Understanding how a typical automatic speech recognition system works. 2) Finding the feature extraction algorithm that gives the best results. 3) Understanding statistical modeling using the Continuous Hidden Markov Model. 4) Finding the best initial parameters for the Continuous Hidden Markov Model.

1.7 Literature Review 1) Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice Hall. The book briefly covers the fundamentals of speech analysis, feature extraction and modeling. 2) Lawrence Rabiner, "A Tutorial on Hidden Markov Models", http://www.cs.ubc.ca/~murphyk/bayes/rabiner.pdf. A guide to all the mathematical equations needed for implementing Hidden Markov Models. 3) Emad Yousif Al-Bassam, "Speaker Segmentation", ID: 0353232, B.Sc. graduation project, KAAU, Saudi Arabia. The project includes the basic foundations of speaker recognition, with an implementation in C#. 4) Sakhr Awad Mohammad Saleh, "Sound Analysis for Recitation of the Holy Koran", ID: 0353232, B.Sc. graduation project, KAAU, Saudi Arabia. The project describes some basic steps of any sound analysis system.


CHAPTER TWO

Speech Recognition - Basic Techniques In this chapter I give some basic ideas about how a typical speech recognition system works, step by step. In the later chapters some of its analysis and design models will be shown. The recognition process consists of several steps; depending on the recognition mode, some of the steps may be more important while others may simply be skipped.

2.1 Signal Representation & Modeling
a. The analog signal is first sampled, making it discrete in time. The sampling frequency determines the quality of the sampled signal, since more bandwidth is preserved when the sampling frequency is high. The discrete signal is then quantized to a range of numbers, so the signal becomes digital and can be used in digital systems such as computers.
b. Signal processing techniques are applied to the speech signal in order to extract the properties that make it distinctive. The raw speech signal cannot be used directly in recognition systems, because most of the data in the signal is meaningless from the recognition point of view. It also becomes very large in a training database, requiring many computations, so processing such a huge amount of data is not feasible. We therefore have to preserve the properties of the signal that are useful for recognition and discard irrelevant data such as speaker differences or the speaker's emotions, while also minimizing the amount of data for computational reasons. This is called feature parameter extraction. The parameters are extracted from the signal by splitting the speech into frames. The reason is that speech changes its properties slowly in time, so over a short period the signal can be treated as quasi-stationary. However, windowing has to be applied while framing the signal, because the frame boundaries should be smooth.


Figure 2.1: Speech Recognition systems

c. After extracting the feature parameters from the speech, we fit a statistical model to those parameters. This is called acoustic modeling. Acoustic models of the speech signals uttered by speakers make up the training database. Recognition is performed by searching this database and selecting the best matching vocabulary entry for a given signal. HMMs have been used successfully for speech recognition problems for more than fifteen years. Neural networks have also been combined with HMMs in what are called hybrid systems. The following equation has to be solved in the recognition process:

W* = argmax_W P(W|A) = argmax_W [ P(A|W) P(W) / P(A) ]    (2.1)

That is, we have to find the most likely word W given the acoustic signal A. P(W) is the probability of the word and is related to the language, so language models are developed for finding P(W). P(A|W) is the conditional probability of the acoustic signal given the word W. P(A) is the probability of the recorded signal; once the recording is made it is the same for all candidate words. The problem is then to find the maximum value of the product of P(W) and P(A|W).


2.2 Sound Recording and Word Detection The microphone.cs class is responsible for accepting input from a microphone and forwarding it to the feature extraction module. Before converting the signal into a suitable or desired form, it is important to identify the segments of the sound that contain words. The audio.cs class deals with all tasks needed for converting a wave file into a stream of samples and vice versa. It also provides for saving the sound into WAV files.

2.2.1 Microphone Features The microphone.cs class takes input from the microphone and saves it or forwards it, depending on which function is invoked. The default sampling rate of microphone.cs is 44100 samples per second, with 16 bits per sample and two channels. Design Internally, it is the job of the microphone.cs class to take the input from the user. The microphone.cs class takes the sampling rate, sample size and number of channels as parameters. The Audio.cs class takes care of converting the raw audio signal into WAV format and vice versa. The format of the WAV file is given in the appendix for reference.

2.2.2 Word Detector The Principle: In speech recognition it is important to detect when a word is spoken. The system detects the regions of silence; anything other than silence is considered a spoken word. The system uses the energy pattern present in the sound signal together with the zero crossing rate to detect the silent regions. Using both is important, as energy alone tends to miss some parts of sounds that matter. This technique is described in [7]. The Method: For word detection a window is taken every 10 milliseconds, and the energy and zero crossing rate for this duration are calculated. The energy is calculated by summing the square of the waveform value at each instant and dividing by the number of instants in the window. The zero crossing rate is the number of times the value of the wave goes from negative to positive or vice versa. The word detector assumes that the first 100 milliseconds are silence, and uses the average energy and average zero crossing rate obtained during this time to characterize the background noise. The upper thresholds for energy and zero crossing are set to 2 times the average background values, and the lower thresholds are set to 0.75 times the upper thresholds. While detecting the presence of a word in the sound, if the energy or the zero crossing rate goes above the upper threshold and stays above it for three consecutive windows, a word is assumed to be present and recording is started. Recording continues until both the energy and the zero crossing rate fall below the lower thresholds and stay there for at least 30 milliseconds.
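The rule above can be sketched in C# as follows. This is only an illustration of the described thresholding, with assumed names and a per-window boolean output; it is not the project's microphone.cs or word-detector code.

// Per 10 ms window: short-time energy and zero-crossing rate; thresholds are
// derived from the first 100 ms, which is assumed to be background noise.
static class WordDetectorSketch
{
    static void WindowStats(float[] s, int start, int length, out double energy, out int zcr)
    {
        energy = 0.0; zcr = 0;
        for (int i = start; i < start + length && i < s.Length; i++)
        {
            energy += s[i] * s[i];
            if (i > start && (s[i] >= 0) != (s[i - 1] >= 0)) zcr++;   // sign change
        }
        energy /= length;
    }

    public static bool[] MarkSpeechWindows(float[] samples, int sampleRate)
    {
        int win = sampleRate / 100;                    // 10 ms per window
        int nWin = samples.Length / win;
        double bgEnergy = 0.0, bgZcr = 0.0, e; int z;
        for (int w = 0; w < 10 && w < nWin; w++)       // first 100 ms = background
        {
            WindowStats(samples, w * win, win, out e, out z);
            bgEnergy += e / 10.0; bgZcr += z / 10.0;
        }
        double upperE = 2 * bgEnergy, lowerE = 0.75 * upperE;
        double upperZ = 2 * bgZcr, lowerZ = 0.75 * upperZ;

        var speech = new bool[nWin];
        bool inWord = false; int run = 0;
        for (int w = 0; w < nWin; w++)
        {
            WindowStats(samples, w * win, win, out e, out z);
            if (!inWord)
            {
                run = (e > upperE || z > upperZ) ? run + 1 : 0;
                if (run >= 3) { inWord = true; run = 0; }     // 3 windows above the upper threshold
            }
            else
            {
                run = (e < lowerE && z < lowerZ) ? run + 1 : 0;
                if (run >= 3) { inWord = false; run = 0; }    // about 30 ms below the lower thresholds
            }
            speech[w] = inWord;
        }
        return speech;
    }
}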

2.3 Feature Extractor Humans have the capacity to identify different types of sounds (phones). Phones put together in a particular order constitute a word. If we want a machine to identify a spoken word, it has to differentiate between different kinds of sound the way humans perceive them. The point to note about humans is that, although one word spoken by different people produces different sound waves, humans are able to identify these sound waves as the same word.

Figure(2.2) Pre-Processing of speech signal

On the other hand, two sounds which are different are perceived as different by humans. The reason is that even when the same phones or sounds are produced by different speakers, they share common features. A good feature extractor should extract these features and use them for further analysis and processing.


Figure(2.3) Feature extraction of speech signal

2.3.1 Zero Mean Zero mean normalization is used to shift all the data in each profile so that its average becomes zero. Implementation: 1) Initialize three double variables sum, y and mean. 2) sum = the summation of the elements of the array samples[].

Figure(2.4) Zero mean normalization

3) mean = sum / samplesLength. 4) for (int i = 0; i < samplesLength; i++) { y = samples[i] - mean; samples[i] = (y > 0.0) ? y + 0.5 : y - 0.5; }


2.3.2 Pre-emphasis The second stage in feature extraction is to boost the amount of energy in the high frequencies. If we look at the spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high-frequency energy makes information from the higher formants more available to the acoustic model and improves phone detection accuracy. Implementation: 1. Set the integer variable i to the last index of the array of samples. 2. while (i != 0) { samples[i] = samples[i] - 0.9375 * samples[i - 1]; i = i - 1; }

2.3.3 Framing The speech signal is a non-stationary signal, but we can assume it is stationary over 10-30 ms. Framing is used to cut the long speech signal into short-time segments in order to get relatively stable frequency characteristics. Features are extracted periodically. The time span over which the signal is considered for processing is called a window, and the data acquired in a window is called a frame. Typically features are extracted once every 10 ms, which is called the frame rate, and the window duration is typically 20 ms, so two consecutive frames overlap. Implementation:

1) Initialize a two-dimensional jagged float array frames[samplesLength / (frameSize - overlappedSamplesSize)][frameSize].

2) Initialize two integer variables frameIndex and offset to zero. 3) for (int i = 0; i < samplesLength; i++) { if (offset == frameSize) { i -= overlappedSamplesSize - 1; frameIndex++; offset = 0; } frames[frameIndex][offset] = samples[i]; offset++; }


(Figure 2.5) Framing of speech signal
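For comparison, a compact framing helper is sketched below. It assumes frameSize and overlap are given in samples and drops any trailing incomplete frame; it is an illustration, not the project's frames code.

using System;

// Split the signal into frames of frameSize samples, advancing by
// (frameSize - overlap) samples so that consecutive frames overlap.
static class FramingSketch
{
    public static float[][] Frame(float[] samples, int frameSize, int overlap)
    {
        int step = frameSize - overlap;
        int frameCount = Math.Max(0, (samples.Length - overlap) / step);
        var frames = new float[frameCount][];
        for (int f = 0; f < frameCount; f++)
        {
            frames[f] = new float[frameSize];
            Array.Copy(samples, f * step, frames[f], 0, frameSize);
        }
        return frames;
    }
}

With a 44100 Hz signal, a 20 ms window and a 10 ms frame rate correspond to frameSize = 882 and overlap = 441 samples.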

2.3.4 Windowing Windowing is used mainly to reduce the spectral distortion (leakage) introduced when the long signal is cut into short-time frames. There are different types of windows, for example: • Rectangular window • Bartlett window • Hamming window. The most widely used is the Hamming window, and this project uses it because it introduces the least distortion. The impulse response of the Hamming window is a raised cosine; its transfer function is:

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1    (2.2)

where N is the frame size. Features are then extracted from each frame. Implementation: 1. Initialize a variable wn. 2. foreach frame in the jagged array frames[][] { for (int i = 0; i < frameSize; i++) { wn = 0.54 - 0.46 * Math.Cos((2 * Math.PI * i) / (frameSize - 1)); frame[i] = frame[i] * wn; } }


2.3.5 End Point Detection Speech recognition is based on the premise that the signal in a prescribed recording interval consists of words preceded and followed by silence or other background noise. Thus, when a word is actually spoken, it is assumed that the speech segments can be reliably separated from the non-speech segments. The process of separating the speech segments of an utterance from the background, i.e. from the non-speech segments obtained during the recording process, is called endpoint detection. In speech recognition systems, accurate detection of the endpoints of a spoken word is important for two reasons: 1) reliable word recognition is critically dependent on accurate endpoint detection, and 2) the computation required for processing the speech is minimal when the endpoints are accurately located. Two widely accepted methods, Short Time Energy (STE) and Zero Crossing Rate (ZCR), have been used for this purpose. STE uses the fact that the energy in a voiced segment is greater than in a silent or unvoiced segment. ZCR, on the other hand, uses a demarcation rule: if the ZCR of a portion of speech exceeds some rate (50, for example), this portion is labeled as unvoiced or background noise, whereas a segment with a ZCR of about 12 (for example) is considered voiced.

Figure(2.5) End Point Detection Implementation: Endpoint detection using Short Time Energy (STE), Zero Crossing Rate (ZCR) and the sample mean (SM) has been implemented for this project; an endpoint detection method based on paper [4] is also implemented. See the flowchart of the endpoint detection based on paper [4] in the section Design of Some Methods.


2.3.6 Spectral Analysis Spectral analysis gives us quite a lot of information about the spoken phone. Time-domain data is converted to the frequency domain by applying the Fourier transform to it. This process gives us the spectral information, i.e. the energy levels at the different frequencies in a given window. Features such as the frequency with maximum energy, or the distance between the frequencies of maximum and minimum energy, can then be extracted. Mel frequency cepstrum computation: Mel Frequency Cepstrum Coefficients (MFCC) are considered the best available approximation of the human ear, whose frequency resolution is finer at low frequencies than at high ones. The spectral information is converted to MFCC by passing the spectrum through a bank of band-pass filters spaced according to this perceptual (mel) scale and then taking an inverse transform (a discrete cosine transform) of the logarithm of the filter outputs.

Figure (2.6) Steps of MFCC

Implementation: The FFT is used to convert the speech signal from the time domain to the frequency domain. Frequency scaling is then used to map linear frequency onto human perception. The mel-frequency scale is such a perceptually motivated scale: it is linear below 1 kHz and logarithmic above, and a 1 kHz tone is defined to have a pitch of 1000 mels. Mel-scale frequency analysis has been widely used in speech recognition systems. It can be approximated by equation (2.3):

B(f) = 1125 ln(1 + f/700)    (2.3)

where B is the mel-frequency scale and f is the linear frequency.


Given the FFT X(k) of the input signal frame x(n), as in equation (2.4):

X(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πnk/N),  k = 0, 1, ..., N-1    (2.4)

we define a mel-frequency filterbank with p filters H_j(k), j = 1, 2, ..., p, where each filter is the triangular filter shown in figure (2.7). Each FFT magnitude coefficient is multiplied by the corresponding filter gain and the results are accumulated, as in equation (2.5):

S(j) = Σ_{k=0}^{N-1} |X(k)|^2 H_j(k),  j = 1, 2, ..., p    (2.5)

Figure(2.7) MFCC

where H_j(k) is the transfer function of filter j. The mel-frequency cepstrum is then the discrete cosine transform of the logarithm of the p filter outputs, as in equation (2.6):

c(i) = Σ_{j=1}^{p} log(S(j)) cos(π i (j - 0.5) / p),  i = 1, 2, ..., L    (2.6)
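The filterbank and DCT stages of equations (2.3)-(2.6) can be sketched as follows, assuming the power spectrum of one frame is already available from the FFT. The method and parameter names are placeholders, not the project's implementation.

using System;

// Mel filterbank + log + DCT: p triangular filters with edges equally spaced on
// the mel scale, filter energies accumulated as in (2.5), then the DCT as in (2.6).
static class MelCepstrumSketch
{
    static double HzToMel(double f) { return 1125.0 * Math.Log(1.0 + f / 700.0); }   // eq. (2.3)
    static double MelToHz(double m) { return 700.0 * (Math.Exp(m / 1125.0) - 1.0); }

    public static double[] Compute(double[] powerSpectrum, int sampleRate, int p, int nCoeffs)
    {
        int nBins = powerSpectrum.Length;               // bins of the half spectrum, 0..fMax
        double fMax = sampleRate / 2.0;

        var edges = new double[p + 2];                   // p filters need p + 2 edge frequencies
        for (int i = 0; i < p + 2; i++)
            edges[i] = MelToHz(HzToMel(fMax) * i / (p + 1));

        var logEnergy = new double[p];
        for (int j = 0; j < p; j++)
        {
            double s = 0.0;
            for (int k = 0; k < nBins; k++)
            {
                double f = k * fMax / (nBins - 1);
                double h = 0.0;                          // triangular filter gain H_j at frequency f
                if (f >= edges[j] && f <= edges[j + 1])
                    h = (f - edges[j]) / (edges[j + 1] - edges[j]);
                else if (f > edges[j + 1] && f <= edges[j + 2])
                    h = (edges[j + 2] - f) / (edges[j + 2] - edges[j + 1]);
                s += powerSpectrum[k] * h;               // eq. (2.5)
            }
            logEnergy[j] = Math.Log(s + 1e-10);          // small floor avoids log(0)
        }

        var mfcc = new double[nCoeffs];                  // eq. (2.6): DCT of the log outputs
        for (int i = 0; i < nCoeffs; i++)
            for (int j = 0; j < p; j++)
                mfcc[i] += logEnergy[j] * Math.Cos(Math.PI * i * (j + 0.5) / p);
        return mfcc;
    }
}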


2.4 Knowledge Models For speech recognition, the system needs to know how the words sound, and for this we need to train it. During training, using the data given by the user, the system generates an acoustic model and a language model. These models are later used by the system to map a sound to a word or a phrase.

2.4.1 Acoustic Model The features extracted by the feature extraction module need to be compared against a model in order to identify the sound that was produced as the word that was spoken. This model is called the acoustic model. There are two kinds of acoustic models: • Word Model • Phone Model. Word Model: Word models are generally used for small vocabulary systems. In this model each word is modeled as a whole, so each word needs to be modeled separately. If we need to add support for recognizing a new word, we have to train the system for that word. In the recognition process, the sound is matched against each of the models to find the best match, and this best match is assumed to be the spoken word. Building a model for a word requires collecting sound files of the word from various users; these sound files are then used to train an HMM model. Figure 2.8 shows a diagrammatic representation of the word-based acoustic model.

Figure (2.8): Word Acoustic Model


Phone Model In the phone model, instead of modeling the whole word, we model only parts of words, generally phones, and the word itself is modeled as a sequence of phones. The heard sound is matched against these parts, the parts are recognized, and the recognized parts are put together to form a word. For example, the word "six" can be generated as the combination of the two sounds "see" and "x". This approach is useful when we need a large vocabulary system. Adding a new word to the vocabulary is easy: since the sounds of the phones are already known, only the possible phone sequence for the word, with its probability, needs to be added to the system. Figure 2.9 shows a diagrammatic representation of the phone-based acoustic model. Phone models can be further classified into:

Figure (2.9): Phone Acoustic Model

• Context-Independent Phone Model • Context-Dependent Phone Model. Context-Independent Phone Model: In this model individual phones are modeled; the context in which they occur is not modeled. The advantage of this model is that the number of phones to be modeled is small, so the complexity of the system is low.


Context-Dependent Phone Model: While modeling a phone, its neighbors are also considered. That means an /r/ surrounded by /a/ and /e/ is a separate entity from an /r/ occurring in another context. This results in growth of the number of modeled phones, which increases the complexity. In both the word acoustic model and the phone acoustic model we also need to model silence and filler sounds; filler sounds are the sounds that humans produce between two words. Both these models can be implemented using either a Hidden Markov Model or a Neural Network; the HMM is the more widely used technique in automatic speech recognition systems.

2.4.2 Language Model Although there are words that contain similar-sounding phones, humans generally do not find it difficult to recognize the correct word. This is mainly because they know the context and have a fairly good idea of which words or phrases can occur in it. Providing this context to a speech recognition system is the purpose of the language model. The language model specifies which words are valid in the language and in what sequence they can occur. Classification: Language models can be classified into several categories: a) Uniform models: each word has an equal probability of occurrence. b) Stochastic models: the probability of occurrence of a word depends on the words preceding it. c) Finite state languages: the language uses a finite state network to define the allowed word sequences. d) Context free grammar: a context free grammar can be used to encode which kinds of sentences are allowed. Implementation: We have implemented a word acoustic model. The system has a model for each word that it can recognize. While recognizing, the system needs to know where to locate the model for each word and which word each model corresponds to. This information is stored in a flat file called models in a directory called hmms. When a sound is given to the system to recognize, it compares each model with the word and finds the model that most closely matches it. The word corresponding to that HMM model is given as the output. Details about the HMM models, their training and recognition are given in section 3.8.


2.5 HMM Recognition and Training A Hidden Markov Model (HMM) is a state machine. The states of the model are represented as nodes and the transitions are represented as edges. The difference in the case of an HMM is that a symbol does not uniquely identify a state: the new state is determined by the symbol and by the transition probabilities from the current state to a candidate state. [2] is a tutorial on HMMs which shows how they can be used. Figure (2.10) shows a diagrammatic representation of an HMM. The nodes drawn as circles are states, and O1 to O5 are observations. Observation O1 takes us to state S1. a_ij defines the transition probability between states S_i and S_j. It can be observed that the states also have self transitions. If we are at state S1 and observation O2 is observed, we can either decide to go to state S2 or stay in state S1; the decision is made depending on the probability of the observation at both states and on the transition probability.

Figure 2.10: Diagrammatic representation of HMM.

Thus an HMM model is defined as: λ = (Q, O, A, B, π)    (2.7)
where:
Q = {S_1, S_2, ..., S_N} is the set of all possible states.
O = {v_1, v_2, ..., v_K} is the set of all possible observations.
A = {a_ij}, where a_ij = P(q_{t+1} = S_j | q_t = S_i), are the transition probabilities.
B = {b_i(k)}, where b_i(k) = P(o_t = v_k | q_t = S_i), are the observation probabilities of observation k at state i.
π = {π_i}, where π_i = P(q_1 = S_i), are the initial state probabilities.

q_t denotes the state at time t and o_t denotes the observation at time t.


2.5.1 HMM and Speech Recognition HMMs can be classified according to various criteria: • Values of occurrences: discrete or continuous. • Dimension: one-dimensional or multi-dimensional. • Probability density function: continuous density (Gaussian distribution) based or discrete density (vector quantization) based. When using an HMM for recognition, we provide the occurrences to the model and it returns a number: the probability with which the model could have produced that output (the occurrences). In speech recognition the occurrences are feature vectors rather than single symbols, so each occurrence is a vector of real numbers. Thus, what we need for speech recognition is a continuous, multi-dimensional HMM. Implementation: A continuous HMM class, which supports vectors as observations, has been implemented in the project. It uses the Gaussian probability density function. Each state has transition probabilities, Gaussian mixture weights, and a mean and variance associated with each mixture. The mean is a vector of N real numbers, where N is the size of the observation, while the variance is a matrix of size N x N.
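As an illustration only, the parameters of such a continuous HMM could be laid out as in the sketch below; the class and field names are assumptions, not the project's code.

// N states, M Gaussian mixtures per state, feature vectors of dimension D.
class ContinuousHmmSketch
{
    public int N;                 // number of states
    public int M;                 // mixtures per state
    public int D;                 // feature-vector dimension
    public double[] Pi;           // [N]          initial state probabilities
    public double[,] A;           // [N, N]       transition probabilities a(i, j)
    public double[,] C;           // [N, M]       mixture weights c(j, m)
    public double[][][] Mean;     // [N][M][D]    mean vector per state and mixture
    public double[][][,] Var;     // [N][M][D, D] covariance matrix per state and mixture

    public ContinuousHmmSketch(int n, int m, int d)
    {
        N = n; M = m; D = d;
        Pi = new double[n];
        A = new double[n, n];
        C = new double[n, m];
        Mean = new double[n][][];
        Var = new double[n][][,];
        for (int i = 0; i < n; i++)
        {
            Mean[i] = new double[m][];
            Var[i] = new double[m][,];
            for (int k = 0; k < m; k++)
            {
                Mean[i][k] = new double[d];
                Var[i][k] = new double[d, d];
            }
        }
    }
}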

2.5.2 Recognition using HMM We need to recognize a word using the existing word models. The sound recorder records the sound when it detects the presence of a word. This recorded sound is then passed through the feature extraction module, whose output is a list of feature vectors. These features are then passed to the recognition module. The list of all the words the system has been trained for, and their corresponding models, is given in a file called models present in the bin directory. All models corresponding to the words are then loaded into memory. The feature vectors generated by the feature extraction module act as the list of observations for the recognition module.


The probability of generating the observations given a model, P(O|λ), is calculated for each model using the recognition function. The word corresponding to the HMM that gives the highest probability, provided this probability is above a threshold, is considered to be the spoken word.
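A minimal sketch of this scoring loop is shown below. The model type and the scoring function are left generic; the scorer stands for something like the forward procedure of section 2.5.3, and none of the names are the project's.

using System;
using System.Collections.Generic;

// Illustrative recognition loop: score the observation sequence against every
// loaded word model and return the best word if its score exceeds a threshold.
static class RecognizerSketch
{
    public static string Recognize<TModel>(float[][] observations,
                                           IDictionary<string, TModel> models,
                                           Func<TModel, float[][], double> score,
                                           double threshold)
    {
        string bestWord = null;
        double bestScore = double.NegativeInfinity;
        foreach (KeyValuePair<string, TModel> pair in models)
        {
            double s = score(pair.Value, observations);   // P(O | lambda) for this word's model
            if (s > bestScore) { bestScore = s; bestWord = pair.Key; }
        }
        return bestScore > threshold ? bestWord : null;   // null means nothing was recognized
    }
}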

2.5.3 Forward Procedure Consider a forward probability variable α_t(i), defined at instant t and state i as:

α_t(i) = P(o_1, o_2, ..., o_t, q_t = S_i | λ)    (2.8)

This probability function can be computed for N states and T observations iteratively:

1 - Initialization:
α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N    (2.9)

2 - Induction:
α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N    (2.10)

Fig. (2.11) shows the induction step graphically. It is clear from this figure how state S_j at instant t+1 is reached from the N possible states at instant t.

3 - Termination:
P(O|λ) = Σ_{i=1}^{N} α_T(i)    (2.11)

This last stage is just the sum of the values of the probability function α_T(i) over all states at instant T. This sum represents the probability that the given observations are generated by the given model, i.e. how likely the given model is to produce the given observations.

Figure (2.11) Forward Procedure


Implementation: See the flowchart of the algorithm in the section Design of Some Methods. A self-contained sketch of the procedure is shown below.
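For illustration, a sketch of the scaled forward procedure follows. The observation probability b_j(o_t) is passed in as a delegate so the sketch stands alone; in the project it would come from the Gaussian mixtures (compB/Gauss in section 3.8). All names here are illustrative, not the project's.

using System;

// Scaled forward procedure (equations 2.8-2.11): alpha is renormalized at every
// frame and the log scale factors are accumulated, so the return value is
// log P(O|lambda) without numerical underflow.
static class ForwardSketch
{
    public static double ForwardLogProbability(double[] pi, double[,] a,
                                               Func<int, float[], double> b, float[][] o)
    {
        int N = pi.Length, T = o.Length;
        var alpha = new double[N];
        var next = new double[N];
        double logProb = 0.0;

        // Initialization (2.9): alpha_1(i) = pi_i * b_i(o_1)
        for (int i = 0; i < N; i++) alpha[i] = pi[i] * b(i, o[0]);
        logProb += ScaleAndLog(alpha);

        // Induction (2.10): alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
        for (int t = 1; t < T; t++)
        {
            for (int j = 0; j < N; j++)
            {
                double sum = 0.0;
                for (int i = 0; i < N; i++) sum += alpha[i] * a[i, j];
                next[j] = sum * b(j, o[t]);
            }
            Array.Copy(next, alpha, N);
            logProb += ScaleAndLog(alpha);
        }

        // Termination (2.11): with scaling, log P(O|lambda) is the accumulated sum
        // of the per-frame log scale factors.
        return logProb;
    }

    // Normalize alpha in place to sum to 1 and return the log of the scale factor.
    static double ScaleAndLog(double[] alpha)
    {
        double s = 0.0;
        for (int i = 0; i < alpha.Length; i++) s += alpha[i];
        if (s <= 0.0) return double.NegativeInfinity;
        for (int i = 0; i < alpha.Length; i++) alpha[i] /= s;
        return Math.Log(s);
    }
}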

2.5.4 Backward Procedure This procedure is similar to the forward procedure, but it considers the state flow in the backward direction, from the last observation at instant T back to the first one at instant 1. That means each state is reached from the states that come just after it in time. To formulate this approach, consider the backward probability function β_t(i), defined as:

β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = S_i, λ)    (2.12)

In analogy with the forward procedure, β_t(i) can be computed in two steps:

1 - Initialization:
β_T(i) = 1,  1 ≤ i ≤ N    (2.13)

These initial values of β for all states at instant T are chosen arbitrarily.

2 - Induction:
β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, T-2, ..., 1    (2.14)

Equation (2.14) can be understood with the help of Fig. (2.12). We still read from left to right when calculating the partial probability function β (from t to T); however, at each instant we assume that β is known at t+1 and we calculate it at time t, as if moving backward in time. Implementation: See the flowchart of the algorithm in the section Design of Some Methods.


Figure (2.12) Backward Procedure

2.5.5 Baum-Welch Re-estimation Procedure To adjust the model parameters (A, B, π) so as to maximize the probability of the observation sequence, we use an iterative procedure based on the classic work of Baum and his colleagues. In order to use this procedure we need ξ_t(i, j), the probability of being in state S_i at time t and in state S_j at time t+1, given the model and the observation sequence:

ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)    (2.15)

We also need γ_t(i), the probability of being in state S_i at time t, given the observation sequence and the model:

γ_t(i) = P(q_t = S_i | O, λ)    (2.16)
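As an illustration of how these quantities are obtained from the forward and backward variables, a sketch follows (0-based indices, whereas the project's arrays are 1-based; with scaled alpha and beta the common factors cancel in the normalizations). The names are assumptions, not the project's FindXi and Gamma methods.

// xi_t(i,j) (eq. 2.15) and gamma_t(i) (eq. 2.16) from alpha, beta, the transition
// matrix a and the per-frame observation probabilities b[t, j] = b_j(o_t).
static class XiGammaSketch
{
    public static void XiAndGamma(double[,] alpha, double[,] beta, double[,] a, double[,] b,
                                  out double[,,] xi, out double[,] gamma)
    {
        int T = alpha.GetLength(0), N = alpha.GetLength(1);
        xi = new double[T - 1, N, N];
        gamma = new double[T, N];

        for (int t = 0; t < T - 1; t++)
        {
            double norm = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                {
                    xi[t, i, j] = alpha[t, i] * a[i, j] * b[t + 1, j] * beta[t + 1, j];
                    norm += xi[t, i, j];
                }
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (norm > 0.0) xi[t, i, j] /= norm;
        }

        for (int t = 0; t < T; t++)
        {
            double norm = 0.0;
            for (int i = 0; i < N; i++) norm += alpha[t, i] * beta[t, i];
            for (int i = 0; i < N; i++)
                gamma[t, i] = norm > 0.0 ? alpha[t, i] * beta[t, i] / norm : 0.0;
        }
    }
}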


A set of reasonable re-estimation formulas for π, A, C, μ and U is given in equations (2.17-2.21):

π̄_i = γ_1(i)    (2.17)

ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)    (2.18)

c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)    (2.19)

μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)    (2.20)

Ū_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k)    (2.21)

where γ_t(j, k) is the probability of being in state j at time t with the k-th mixture accounting for o_t.

Implementation:
1. Define a 3-dimensional double array xi[t+1, noOfStates+1, noOfStates+1].
2. Call FindXi(observations, alpha, beta, b, out xi), where the alpha array is returned from ForwardProcedureWithScale(), the beta array from BackwardProcedureWithScale() and the b array from compB().
3. Define a 2-dimensional double array First_Gamma[t+1, noOfStates+1].
4. Call Gamma(alpha, beta, out First_Gamma).
5. Define a 3-dimensional double array Second_Gamma[t+1, noOfStates+1, noOfMixtures+1].
6. Call Gamma(observations, First_Gamma, b, cb, out Second_Gamma), where the cb array is returned from compB().
7. Initialize a 2-dimensional double array tempA[noOfStates+1, noOfStates+1].
8. Initialize a 2-dimensional double array tempC[noOfStates+1, noOfMixtures+1].
9. Initialize a 2-dimensional jagged double array tempMeu[noOfStates+1, noOfMixtures+1][noOfCepstrum+1].
10. Initialize a 2-dimensional jagged double array tempSigma[noOfStates+1, noOfMixtures+1][noOfCepstrum+1, noOfCepstrum+1].
11. Apply the equations (2.17-2.21):


for (int i = 1; i <= noOfStates; i++) pi[i] = First_Gamma[1, i];
for (int i = 1; i <= noOfStates; i++)
    for (int j = 1; j <= noOfStates; j++)
    {
        for (int t = 1; t < time; t++) tempA[i, j] += xi[t, i, j];
    }
for (int j = 1; j <= noOfStates; j++)
    for (int k = 1; k <= noOfMixtures; k++)
        // accumulate the numerator sums of equations (2.19-2.21)
        for (int t = 1; t <= time; t++)
        {
            tempC[j, k] += Second_Gamma[t, j, k];
            tempMeu[j, k] += Second_Gamma[t, j, k] * o[t - 1];              // vector scaled by a scalar
            double[] subtract = o[t - 1] - meu[j, k];                        // element-wise vector difference
            tempSigma[j, k] += Second_Gamma[t, j, k] * subtract * subtract;  // outer product scaled by a scalar
        }
12. Divide every element of tempA by the sum of the elements of the row it belongs to, by calling NormalizeMatrixA(tempA); the new tempA becomes the re-estimated global array a.
13. Divide every element of tempC by the sum of the elements of the row it belongs to; the new tempC becomes the re-estimated global array c.
14. Divide every element of tempMeu and tempSigma by the corresponding element of tempC; the new tempMeu and tempSigma become the re-estimated global arrays meu and covarianceMatrix.
15. Fix the covariance matrix so that every diagonal element (r, r) >= 1e-50.

2.5.6 Occurrence Probability For the forward variable to work we need to find b_i(X), the probability of a given occurrence (feature vector) for a particular state. This value can be calculated with the multivariate normal distribution formula. The probability of observation X occurring in state i is given as:

b_i(X) = (1 / ((2π)^(D/2) |V_i|^(1/2))) exp( -(1/2) (X - μ_i)^T V_i^(-1) (X - μ_i) )    (2.22)

where D is the dimension of the vector, μ_i is the mean vector, V_i is the covariance matrix, |V_i| is the determinant of V_i, and V_i^(-1) is the inverse of V_i. The mean vector μ_i is obtained from the training vectors X_t assigned to state i by:

μ_i = (1/T) Σ_{t=1}^{T} X_t    (2.23)

The covariance matrix V_i can be obtained by:

V_i = (1/T) Σ_{t=1}^{T} (X_t - μ_i)(X_t - μ_i)^T    (2.24)

In equation (2.24) the variance is calculated by finding the distance vector between an observation and the mean; the distance vector is multiplied by its own transpose, which gives a D x D matrix, where D is the dimension of the system.

2.5.7 Training the Model Before we can recognize a word we need to train the system. The Train command is used to train the system for a new word. The command takes 4 parameters: • the number of states N the HMM model should have; • the number of mixture coefficients M the HMM model should have; • the size of the feature vector D; • a jagged array of single or multiple observation sequences. We use random numbers to generate an initial HMM, taking into consideration the following constraints:
1. Σ_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N, with a_ij ≥ 0 for 1 ≤ i ≤ N, 1 ≤ j ≤ N.
2. Σ_{m=1}^{M} c_jm = 1 for 1 ≤ j ≤ N, with c_jm ≥ 0 for 1 ≤ j ≤ N, 1 ≤ m ≤ M.
Here a_ij is the transition probability from state i to state j and c_jm is the mixture coefficient for the m-th mixture in state j. Implementation: See the flowchart of the algorithm in the section Design of Some Methods; a minimal initialization sketch is also given below.
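The sketch below shows one way to perform such a constrained random initialization. It only assumes the row layout used above (rows of A indexed by the source state, rows of C indexed by the state); the helper names are illustrative, not the project's.

using System;

// Fill A and C with positive random numbers and normalize each row so that it
// sums to 1, satisfying constraints 1 and 2 above.
static class HmmInitSketch
{
    public static void RandomInit(double[,] a, double[,] c, Random rng)
    {
        FillPositive(a, rng); NormalizeRows(a);
        FillPositive(c, rng); NormalizeRows(c);
    }

    static void FillPositive(double[,] m, Random rng)
    {
        for (int i = 0; i < m.GetLength(0); i++)
            for (int j = 0; j < m.GetLength(1); j++)
                m[i, j] = rng.NextDouble() + 1e-6;     // strictly positive entries
    }

    static void NormalizeRows(double[,] m)
    {
        for (int i = 0; i < m.GetLength(0); i++)
        {
            double sum = 0.0;
            for (int j = 0; j < m.GetLength(1); j++) sum += m[i, j];
            for (int j = 0; j < m.GetLength(1); j++) m[i, j] /= sum;
        }
    }
}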


CHAPTER THREE

System Analysis and Design: 3.1 Use Case Model:

Figure (3.1) Use case Model

3.2 Analysis Model:

Figure (3.2) Analysis Model


3.3 Design Model:

Figure (3.3) Design Model

3.4 Class Diagram Model:

Figure (3.4) Class Diagram


3.5 Interaction Diagram: Train Use Case:

Figure (3.5) Training Interaction Diagram Recognize Use Case:

Figure (3.6) Recognizing Interaction Diagram


3.6 State Chart & State Transition Graph:

Figure (3.7) HMM Transition Diagram

Figure (3.8) End Point Detection


Figure (3.8) Continued


3.7 Interface Design:

Figure (3.9) Interface Design

1) If you want to make a model for a certain word, first choose the Train radio button from the first group box; then you have two choices: select the voice from a file or pick it up from the microphone.

a) If you want to select the voice from file(s), select the FromFile radio button of the second group box and then press the Action button:


A dialog will appear; enter the word and then press OK:

An open file dialog will appear; select a file and then press Open. Repeat the process if you want to select more than one file. After selecting all the files, press Cancel:

Wait until the training process completes.

b) If you want to pick up the voice from the microphone, select the FromMic radio button of the second group box and then press the Action button:


After capturing the voice of a word, a dialog will appear and ask whether you wish to capture the same word again; press Yes if you wish to, No if you don't.

A dialog will appear; enter the word and then press OK:

Wait until the training process completes.

2) If you want to recognize a certain word (or words), first choose the Recognize radio button from the first group box; then you have two choices: select the voice from a file or pick it up from the microphone.

a) If you want to select the voice from file(s), select the FromFile radio button of the second group box and then press the Action button:

An open file dialog will appear; select a file and then press Open:


The recognized word(s) will appear in the text box:

b) If you want to pick up the voice from the microphone, select the FromMic radio button of the second group box and then press the Action button:

The recognized word(s) will appear in the text box:

Press the Action button again to stop capturing.


3.8 Design of Some Methods: End Point Detection:

Figure (3.10) End point Detection


Hidden Markov Model Methods:

Figure (3.11) void Gamma(double[,] alpha, double[,] beta, out double[,] result) Method Design


Figure (3.12) void FindXi(float[][] o, double[,] alpha, double[,] beta,double [,]b, out double[,,]xi)

Method Design


Figure (3.13) ForwardProcedureWithScale Method Design


Figure (3.14) BackwardProcedureWithScale Method Design


Figure (3.15) NormalizeMatrixA(double[,] matrix) Design


Figure (3.16) void NormalizeMatrixC_Meu_Sigma(double[,] tempC, double[,][] tempMeu, double[,][,] tempSigma) Method Design


(Flowchart: compute v = x - meu, accumulate sum += v[i]^2 / sigma[i, i] over the vector dimensions, evaluate the Gaussian as exponent / ((2π)^(noOfCepstrum/2) * sqrt(determinant)), flooring a zero determinant to eps and an underflowed exponent to Math.Exp(-745), and capping the result at 1e100.)

Figure (3.17) double Gauss(float[] x, double[] meu, double[,] sigma) Method Design
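Figure (3.17) can be read as the following C# sketch. It is a reconstruction from the flowchart, not the project's source: only the diagonal of sigma is used, so the determinant is taken here as the product of the diagonal entries (the original design calls a general Determinant() helper); a zero determinant and an underflowed exponent are floored, and the result is capped at 1e100.

using System;

// Diagonal-covariance Gaussian density of a feature vector x for one state,
// following the design of Figure (3.17).
static class GaussSketch
{
    public static double Gauss(float[] x, double[] meu, double[,] sigma)
    {
        int d = meu.Length;
        double sum = 0.0, determinant = 1.0;
        for (int i = 0; i < d; i++)
        {
            double v = x[i] - meu[i];
            sum += v * v / sigma[i, i];          // only diagonal variances are used
            determinant *= sigma[i, i];          // determinant of a diagonal matrix
        }

        if (determinant == 0.0) determinant = double.Epsilon;   // floor a zero determinant
        double exponent = Math.Exp(-sum / 2.0);
        if (exponent == 0.0) exponent = Math.Exp(-745);         // floor an underflowed exponent

        double result = exponent / (Math.Pow(2 * Math.PI, d / 2.0) * Math.Sqrt(determinant));
        return result > 1e100 ? 1e100 : result;                 // cap the result
    }
}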


Figure (3.18) void CompB(float[][] o, out double[,] b, out double[, ,] cb)

Method Design


Train & Recognize Methods:

Figure (3.19) double Train(float[][][] multipleSequences, int numberOfStates, int numberOfMixtures) Method Design


Figure (3.19) Continued


(Flowchart continued: for every training sequence, accumulate Second_Gamma into tempC, tempMeu and tempSigma (using the element-wise difference between the observation and meu[j, k]), normalize with NormalizeMatrixA and NormalizeMatrixC_Meu_Sigma, and finally sum the per-sequence forward probabilities into forwardResult.)

Figure (3.19) Continued


Figure (3.19) Continued


REFERENCES:
[1] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice Hall.
[2] Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 1989, pages 256-286. http://www.cs.ubc.ca/~murphyk/bayes/rabiner.pdf.
[3] Li Tan and Montri Karnjanadecha, "Modified Mel-Frequency Cepstrum Coefficient", Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Hat Yai, Songkhla, Thailand.
[4] G. Saha, Sandipan Chakroborty, Suman Senapati, "A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications", Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur-721 302, India.
[5] Rakesh Dugad and U. B. Desai, "A Tutorial on Hidden Markov Models", http://uirvli.ai.uiuc.edu/dugad/hmm_tut.html.
[6] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances", Bell System Technical Journal, Vol. 54, pages 297-315, 1975.
[7] Waleed H. Abdulla and Nikola K. Kasabov, "The Concepts of Hidden Markov Model in Speech Recognition", Knowledge Engineering Lab, Department of Information Science, University of Otago, New Zealand.
[8] Matthew Nicholas Stuttle, "A Gaussian Mixture Model Spectral Representation for Speech Recognition", Hughes Hall and Cambridge University Engineering Department, July 2003.
[9] Keiichi Tokuda, Takao Kobayashi, and Satoshi Imai, "Recursive Calculation of Mel-Cepstrum from LP Coefficients", Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555 Japan; Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, 227 Japan, 1 April 1994.
[10] Feature Extraction - Outline, NTNU.
[11] Ripul Gupta, "Speech Recognition for Hindi", M.Tech. project report, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India.
[12] Daniel Jurafsky, James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition".
[13] The HTK Book (for HTK Version 3.1), Cambridge University Engineering Department.
[14] Emad Yousif Al-Bassam, "Speaker Segmentation", B.Sc. graduation project, KAAU, Saudi Arabia.
[15] Sakhr Awad Mohammad Saleh, "Sound Analysis for Recitation of the Holy Koran", B.Sc. graduation project, KAAU, Saudi Arabia.
[16] Kaustubh R. Kale, research assistant, Department of ECE, "Isolated Word Speech Recognition Using Dynamic Time Warping Towards Smart Appliances", http://www.cnel.ufl.edu/~kkale/6825Project.html.
[17] Lori F. Lamel, Lawrence R. Rabiner, Aaron E. Rosenberg, Jay G. Wilpon, "An Improved Endpoint Detector for Isolated Word Recognition".
[18] Goutam Saha, Ulla S. Yadhunandan, "Modified Mel-Frequency Cepstral Coefficient", Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, India.
[19] John-Paul Hosom, "Hidden Markov Models for Speech Recognition", lecture notes, Oregon Health & Science University, OGI School of Science & Engineering, Spring 2006.
[20] Several websites: http://www.codeproject.com


CHAPTER FOUR

Appendix

8.1 WAV file format A WAV file stores data in little-endian format and is written in the Resource Interchange File Format (RIFF). In RIFF, the file is divided into chunks; each chunk has headers which give information about the data that follows. A WAV file requires at least two chunks after the RIFF header: a Format chunk and a Data chunk. Figure 8.1 shows a graphical representation of a minimal WAV file.

Figure 8.1: Format of a Wave file

8.1.1 RIFF WAVE Chunk The RIFF WAVE chunk contains nothing but the headers of the WAV file. This chunk has three headers. The first header is the string "RIFF", indicating that the file follows the RIFF format. The second header is an integer specifying the size of the content that follows. The third header is the string "WAVE", identifying the file type.

8.1.2 FMT SubChunk The FMT subchunk describes the format in which the sound information is stored. Like the RIFF chunk, this chunk has a SubChunkID, which is "fmt ". It also holds information such as the sampling rate, bits per sample, number of channels, audio format, etc. The size of this header is at least 24 bytes, and it can be followed by extra headers.

8.1.3 Data SubChunk The Data subchunk has only two headers, ID and Size, followed by the actual sound data. Figure 8.2 shows part of a hex dump of a wave file; a minimal writer for this layout is sketched below.
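As an illustration of the chunk layout described above, the sketch below writes a minimal 16-bit PCM WAV file. It follows the standard RIFF / fmt / data layout; the method name and the use of BinaryWriter are assumptions, not the project's audio.cs code.

using System.IO;

// Minimal PCM WAV writer: RIFF header, 16-byte fmt subchunk, then the data
// subchunk with the raw little-endian samples.
static class WavWriterSketch
{
    public static void WriteWav(string path, short[] samples, int sampleRate, short channels)
    {
        short bitsPerSample = 16;
        int byteRate = sampleRate * channels * bitsPerSample / 8;
        short blockAlign = (short)(channels * bitsPerSample / 8);
        int dataSize = samples.Length * 2;

        using (var bw = new BinaryWriter(File.Create(path)))
        {
            bw.Write("RIFF".ToCharArray());      // RIFF chunk ID
            bw.Write(36 + dataSize);              // size of everything that follows
            bw.Write("WAVE".ToCharArray());      // file type

            bw.Write("fmt ".ToCharArray());      // FMT subchunk ID (note the trailing space)
            bw.Write(16);                         // size of the fmt content
            bw.Write((short)1);                   // audio format: 1 = PCM
            bw.Write(channels);
            bw.Write(sampleRate);
            bw.Write(byteRate);
            bw.Write(blockAlign);
            bw.Write(bitsPerSample);

            bw.Write("data".ToCharArray());      // data subchunk ID
            bw.Write(dataSize);                   // number of sound-data bytes
            foreach (short s in samples) bw.Write(s);
        }
    }
}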


Figure 8.2: Hex Dump of a wave file

8.2 Results & Conclusions Experimental Results Training: Ten HMM models were made for the first ten English digits (0-9). For this I used about 15 to 20 recorded sounds for each digit; the voices of ten people were used to train the system. Recognition: Recognition was tried on 10 recorded sounds per digit from speakers other than those used in the training phase, and the results showed about 80% to 85% performance.

Digit:          0   1   2   3   4   5   6   7   8   9
Recognized:     8  10   6   7   8   9   7   6   8   9
Un-recognized:  2   0   4   3   2   1   3   4   2   1

Conclusion: The experiments showed poorer performance when recognizing the voices of speakers other than those used in the training phase; I hope future work will overcome this problem.

