Speech Recognition Using Hybrid System of Neural Networks and Knowledge Sources
©Hisham Darjazini
A thesis submitted to the School of Engineering in fulfillment of the requirements for the degree of Doctor of Philosophy
School of Engineering College of Health and Science University of Western Sydney
July 2006
Statement of Authentication
The work presented in this thesis is, to the best of my knowledge and
belief, original except as acknowledged in the text. I hereby declare that I
have not submitted this material, either in full or in part, for a degree at this
or any other institution.
________________________
Signature
ABSTRACT
In this thesis, a novel hybrid Speech Recognition (SR) system called RUST (Recognition
Using Syntactical Tree) is developed. RUST combines Artificial Neural Networks (ANN)
with a Statistical Knowledge Source (SKS) for a small topic-focused database. The
hypothesis of this research work was that the inclusion of syntactic knowledge
represented in the form of probability of occurrence of phones in words and sentences
improves the performance of an ANN-based SR system.
The lexicon of the first version of RUST (RUST-I) was developed with 1357 words of
which 549 were unique. These words were extracted from three topics (finance, physics
and general reading material), and could be expanded or reduced (specialised). The
results of experiments carried out on RUST showed that by including basic statistical
phonemic/syntactic knowledge with an ANN phone recognisor, the phone recognition
rate was increased to 87% and word recognition rate to 78%.
The first implementation of RUST was not optimal. Therefore, a second version of
RUST (RUST-II) was implemented with an incremental learning algorithm and it has
been shown to improve the phone recognition rate to 94%. The introduction of
incremental learning to ANN-based speech recognition can be considered as the most
innovative feature of this research.
In conclusion, this work has proved the hypothesis that the inclusion of phonemic/syntactic
knowledge of a probabilistic nature and topic-related statistical data, using an adaptive
phone recognisor based on neural networks, has the potential to improve the performance
of a speech recognition system.
Acknowledgements
This work would not have been completed without the continuous support of Dr. Qi
Cheng. I would like to sincerely thank him for his valuable advice and unlimited
support that he provided me in the course of producing this thesis. Dr. Cheng dedicated
numerous hours over many months to help me in producing something that I am proud
of.
Special thanks are also due to Dr. Ranjith Liyana-pathirana for his support, advice and
helpful comments. His revision of the draft of this thesis gave me valuable feedback. I
am grateful to Professor Jann Conroy, Professor Steven Riley, and Mrs. Mary Kron for
their support at various stages of this work.
I will always remember the efforts, advice and endless support provided to me by Dr.
Jo Tibbitts, Professor Godfrey Lucas and Associate Professor Mahmood Nagrial in
producing the first version of this thesis.
A very warm and special thanks are due to my family, who provided me with all their
support and patience during the course of this work. Without their continuous support I
would not have been able to carry on the very long process of finishing this research.
Thank you all; I specially mention my late father, Mahmood, my mother, Samira and
my wife, Shaheenaz, for their patience during the long hours I spent in bringing this
work to fruition.
Many thanks are also due to all those who participated in the acquisition of the UWS
speech database, which I used throughout the practical part of this thesis. I also greatly
appreciate all the help I received from the academic, administrative and technical staff
of the School of Engineering and other departments of the University of Western Sydney.
Last but not least, I wish to acknowledge the moral support of my friends and col-
leagues. I especially remember the support that I have received from Dr. Jamal Rizk.
Contents
Page
Abstract III
Acknowledgements IV
List of Figures VIII
List of Tables X
Chapter 1: Introduction 1
1.0 System Description 1
1.1 Thesis Outline 2
1.2 Publications 3
Chapter 2: Fundamental Concepts 4
2.0 Introduction 4
2.1 RUST-I Fundamentals 4
2.2 Feature Extraction 7
2.2.0 Review of Feature Extraction Techniques Used in
Speech Processing 10
2.2.1 Speech Modelling and MFCC 12
2.2.2 Mel-scale Cepstral Coefficients (MFCC) for RUST-I 16
2.3 Features of Australian English 23
2.4 UWS Speech Database Acquisition 26
2.5 Techniques Used in Speech Recognition 28
2.5.0 Pattern Recognition (PR) 28
2.5.1 Hidden Markov Model (HMM) 29
2.5.2 Artificial Neural Networks (ANN) 33
2.5.3 Advantages of ANN 41
2.5.4 Artificial Intelligence (AI) 42
2.5.5 Hybrid ANN/HMM Systems 44
2.6 Conclusion 47
Chapter 3: Phonemic/Syntactic Knowledge and Adaptive Phone Recognisor
– Design and Implementation 48
3.0 Introduction 48
3.1 Adaptive Phone Recognisor (APR) 49
3.2 Syntactic Knowledge Estimator 50
3.2.0 Syntactic Knowledge Database 51
3.2.1 RUST-I Lexicon 53
3.2.2 Categorisation 54
3.2.3 Data Organisation in the Syntactic Database 59
3.3 Determination of RUST-I Syntactic Knowledge: Example 62
3.4 Code Activator and Accumulator 67
3.5 Sub-recognisor: Structure 73
3.6 Conclusions 78
Chapter 4: Experimental Procedures 79
4.0 Introduction 79
4.1 Selection of Parameters and Initial Conditions 80
4.1.0 Further Results on Training and Testing 81
4.1.1 Confusion Matrix 85
4.2 Training the Adaptive Phone Recognisor 86
4.3 Experiment One: Operation of Each Sub-Recognisor Without the
Syntactical Knowledge 88
4.3.1 Input Stimuli 88
4.3.2 Experimental Method 88
4.3.3 Results 90
4.3.4 Experiment One: Conclusion 109
4.4 Experiment Two: Operation of Each Sub-Recognisor With the
Syntactical Knowledge 110
4.4.1 Input Stimuli 111
4.4.2 Experimental Method 111
4.4.3 Results 112
4.4.4 Experiment Two: Conclusion 113
4.5 Experiment Three: Verification of the System as IWR 113
4.5.1 Input Stimuli 113
4.5.2 Experimental Method 114
4.5.3 Representation of the Results 115
4.5.4 Analytical Procedure 115
4.5.5 Results 116
4.5.6 Experiment Three: Conclusion 129
Chapter 5: Implementation of Incremental Learning Neural Networks
(RUST-II) 131
5.0 Introduction 131
5.1 The Speech Corpus 132
5.1.0 Background 132
5.1.1 TIMIT Database 132
5.1.2 Corpus Selection 133
5.1.3 Phone Segmentation and Feature Extraction 137
5.1.4 Preparation of the Data for the Neural Networks Input 138
5.2 Modification of the APR to Include Incremental Learning Neural
Networks 139
5.2.0 Weight Selection Algorithm 140
5.3 Experiment and Results 145
5.4 Discussion 148
5.5 Conclusion 150
Chapter 6: Conclusion and Future Work 152
6.1 Conclusion 152
6.2 Future Work 154
References 157
Appendix
Probabilistic Values of the Second Level of the Syntactic Knowledge 164
Glossary of Commonly Used Abbreviations 183
List of Figures
Page
Figure 2.1 Simplified schematic block diagram of RUST-I. 4
Figure 2.2 Detailed schematic diagram of RUST-I. 7
Figure 2.3 Discrete time model of speech. 13
Figure 2.4 Cepstrum computation procedure. 14
Figure 2.5 Different spacing of band-pass filters. 15
Figure 2.6 MFCC extraction block diagram. 17
Figure 2.7 Simulation of Mel-scale filters frequency bands. 18
Figure 2.8 Algorithm for program to compute MFCC. 21
Figure 2.9 Formant frequency plot of Australian English (general - male). 23
Figure 2.10 Spectrogram of the words ‘bard’ /bad/ and ‘bud’ /bʌd/
pronounced with an Australian accent. 24
Figure 2.11 Time delay computational element. 39
Figure 2.12 Hidden control neural network. 40
Figure 3.1 Adaptive phone recognisor. 49
Figure 3.2 The syntactic knowledge estimator. 50
Figure 3.3 Graphical representation of data clusters. 58
Figure 3.4 Example of a data cluster for the front edge phonemic class /t/
(phones are represented by their identification codes). 59
Figure 3.5 Bubble diagram of cluster number 3 of front edge phonemic
class /æ/. 60
Figure 3.6 Portion of the syntactic database that represents cluster 4. 61
Figure 3.7 Probabilities of Phones in set O. 66
Figure 3.8 Self-information of phones in set O. 66
Figure 3.9 Algorithm of code activator in pseudo-code form. 69
Figure 3.10 Block diagram of the accumulator. 72
Figure 3.11 Algorithm of the accumulator. 72
Figure 3.12 Structure of the sub-recognisor. 74
Figure 3.13 Architecture of one neuro-slice. 75
Figure 4.1 RMS error curve for training with adjusted parameters. 81
Figure 4.2 Format of the data input training file. 84
Figure 4.3(a) 3-D representation of the full confusion matrix of speaker 9
(right side view). 91
Figure 4.3(b) 3-D representation of the full confusion matrix of speaker 9
(left side view). 92
Figure 4.4 2-D representation of the confusion matrix of speaker 9. 93
Figure 4.5 Block diagram of Experiment Two. 111
Figure 4.6 Block diagram of the system as configured for Experiment Three. 114
Figure 5.1 Feature extraction from the phone /s/. 139
Figure 5.2 The modified structure of the APR. 141
Figure 5.3 Selection of the weight set for incremental learning. 143
Figure 5.4 Structure of new sub-recognisor. 143
Figure 5.5 The sub-recognisor performance in the initial session. 148
Figure 5.6(a) Recognition experiments of the phone /s/. 149
Figure 5.6(b) Recognition experiments of the phone /s/. 150
List of Tables
Page
Table 2.1 Minimum, maximum and average values of frame number (N). 9
Table 2.2 Mel-scale frequency bands. 17
Table 2.3 Equations to compute Mel-scale filter outputs for each of the
17 Mel-scale filters. 19
Table 2.4 Example of output from the program that computes
MFCCs of one frame of speech signal representing vowel /a/
acquired from the word 'last' spoken by speaker 11. 22
Table 2.5 International phonetic alphabet symbols for use in Australian
English. 25
Table 2.6 Classical studies in SR applying ANN to pre-segmented speech. 35
Table 2.7 Summary of studies which employed ANN for speech
signal processing. 36
Table 2.8 Percentage of correct recognition for various topologies
related to the number of iterations. 38
Table 2.9 Results of open tests for ANN trained on full vowel
and steady-state vowel. 40
Table 3.1 Phone IDs of RUST-I. 52
Table 3.2 Phonemic classes and their associated levels represented
in the front edge level of the syntactic knowledge. 56
Table 3.3 Phonemic classes which are not represented in the front edge
layer of the syntactic knowledge. 57
Table 3.4 Syntactical knowledge front edge phones set, their frequencies,
probabilities and self-information. 65
Table 3.5 Localised probabilistic values of phonemic subclasses in level
2 of the phonemic set Oð. 67
Table 3.6 Simulation of seven architectures of MLP. 77
Table 4.1 Optimum learning rates and momentum terms for all layers
during training and testing. 81
Table 4.2 Number of training and testing tokens used for each
sub-recognisor. 82
Table 4.3 Example of the sequential order of presentation in terms of the
phone id (P), example number (E), frame number (F), speaker
number (S) and word number (W). 83
Table 4.4 Example of the confusion matrix. 86
Table 4.5 Summary of the primary training session of the APR. 86
Table 4.6 Summary of the most remarkable IASCs. 87
Table 4.7(a) Responses of the sub-recognisors for expected input
stimulus - Speaker 6. 94
Table 4.7(b) Responses of the sub-recognisors for expected input
stimulus - Speaker 7. 95
Table 4.7(c) Responses of the sub-recognisors for expected input
stimulus - Speaker 8. 96
Table 4.7(d) Responses of the sub-recognisors for expected input
stimulus - Speaker 9. 97
Table 4.7(e) Responses of the sub-recognisors for expected input
stimulus - Speaker 10. 98
Table 4.8(a) Vowels confusion matrix - Stimuli presented versus
sub-recognisor responses. 100
Table 4.8(b) Three most common confusions across speakers for the vowel
subgroup. 101
Table 4.9(a) Diphthong confusion matrix (average values over all speakers). 102
Table 4.9(b) Three most common confusions across speakers for the
diphthongs subgroup. 103
Table 4.10(a) Stops confusion matrix (average values over all speakers). 104
Table 4.10(b) Three most common confusions across speakers for the
stops subgroup. 104
Table 4.11(a) Nasals confusion matrix (average values over all speakers). 105
Table 4.11(b) Three highest confusions of the nasal subgroup. 105
Table 4.12(a) Fricatives confusion matrix (average values over
all speakers). 106
Table 4.12(b) Three most common confusions across speakers for the
fricatives subgroup. 107
Table 4.13(a) Affricatives confusion matrix (average values over
all speakers). 107
Table 4.13(b) Three main confusions for the affricative subgroup
confusion matrix. 108
Table 4.14(a) Semivowels intra confusion matrix. 108
Table 4.14(b) Semivowels inter confusion matrix. 109
Table 4.15 Average of SRS across subgroup. 109
Table 4.16 Average SRS scores for all phones across all speakers. 110
Table 4.17 Summary of SRS < 0.60 and recognition rate across all speakers. 112
Table 4.18 Overall results of 100 words recognition. 116
Table 4.19 Comparison of two-word recognition results over all speakers. 117
Table 4.20 Recognition results of words used in Experiment Three. 119
Table 4.21 Summary of error types. 130
Table 5.1 Abstracted information on the chosen speakers. 133
Table 5.2 Updated phonemic symbols code. 135
Table 5.3 Phone set used in the learning session and their
relevant numbers. 145
Chapter 1: Introduction
It has been established in the field of Speech Recognition (SR) that any level of linguistic
knowledge applied above the level of phone recognition will enhance the performance
of a word recognition system (Furui, 1989). Speech recognition is defined as "the process
of automatically extracting and determining linguistic information conveyed by a speech
wave using computers or electronic circuits" (Furui, 1989). This definition implies that, in
order to achieve tangible results in speech recognition, the problem has to be approached
from a linguistic perspective. This thesis describes the work on the implementation of an
Isolated Word Recognition (IWR) system that combines linguistic knowledge with
Artificial Neural Network (ANN) techniques, both without and with incremental learning.
1.0 System Description
The work described in this thesis comprises two parts: (1) the system that was studied and
implemented between 1992 and 1997 by the author, which is referred to as ‘RUST-I’
(Recognition Using Syntactical Tree), and (2) an incremental learning update of the original
system with variable weight vectors in the neural networks, referred to as ‘RUST-II’. The
novel concepts of the proposed system described in this thesis can be summarised as
follows:
• The adaptive phone recognisor in the system uses a parallel structure, which can now
be implemented using affordable IC chips and offers advantages in processing speed.
• The phone composition of the vocabulary is expressed as a statistically labeled
syntactic / phonemic tree.
• The phone recognition process is controlled by syntactical knowledge in a potentially
adaptive way.
• The system explores and tests the incremental learning algorithm in neural networks
for phone recognition.
RUST-I incorporates two basic levels of statistical knowledge in speech. The first is
phonemic knowledge, in the form of the probability of occurrence of phones in the lexicon
words, and the second is primary syntactic knowledge, in the form of the probability of
occurrence of phones in sentences or sequences of words. A focus on phonemic knowledge
allows RUST-I to operate as a continuous recognisor. The phonemic knowledge source is
used in the overall structure of an Adaptive Phone Recognisor (APR). The phonemic and
statistical knowledge is followed by the syntactic knowledge estimator.
The lexicon of RUST-I was developed using 1357 words, of which 549 were unique. The
words were extracted from three topics, namely finance, physics and general reading
material. The lexicon is not restricted to those topics: it can be updated or specialised
to other topics, and it can be expanded or reduced. A later version of the lexicon was
developed from the TIMIT speech database, with 75 sentences (637 words); due to time
constraints, this later version was not used in RUST-II.
1.1 Thesis Outline
Chapter 2 presents a general introduction to the field of research and the theoretical
background and fundamentals necessary to understand the system aspects. An in-depth
description of the syntactical / phonemic knowledge and the phone recognisor is presented
in Chapter 3. It presents the basics of the language model and the lexicon, and the
formation of the syntactical database and its code activator. This chapter also shows
how the syntactical knowledge and the phonemic knowledge interact within the system's
functionality. In Chapter 4, RUST-I is trained and examined for validity and efficiency as an
isolated phone and isolated word recognisor. The overall performance of the system as an
IWR was found to be dependent on the performances of both the APR and the syntactic
knowledge estimator. Chapter 5 presents a novel technique in the area of incremental
learning neural networks. The purpose of applying the incremental learning technique to
RUST-I is to demonstrate that incremental learning neural networks can contribute to the
development of more robust speech recognition systems. The effort in Chapter 5 is
focused on the development of the incremental learning algorithm.
1.2 Publications
This work has led to the publication of the following two peer-reviewed papers at
international conferences:
DARJAZINI, H. and TIBBITTS, J., 1994. The construction of phonemic knowledge using
clustering methodology. Proceedings of the 5th Australian International Conference on
Speech Science and Technology (SST-94), Perth, December 1994, Vol. 1, 202-207.
DARJAZINI, H., CHENG, Q. and LIYANA-PATHIRANA, R., 2006. Incremental learning
algorithm for speech recognition. Paper accepted by the 8th International Conference on
Signal and Image Processing (SIP-06), August 14-16, 2006.
Chapter 2: Fundamental Concepts
2.0 Introduction
This chapter describes RUST-I, the acquisition of the UWS speech database, and some
fundamental concepts of the signal processing techniques and neural networks that were
used in this work.
2.1 RUST-I Fundamentals
A simplified schematic block diagram of RUST-I is shown in Figure 2.1. RUST-I has a
hybrid structure, combining a low-level ANN-based phonemic knowledge recognisor with
higher-level syntactic knowledge.
Figure 2.1 Simplified schematic block diagram of RUST-I.
RUST-I consists of three main blocks as follows:
• Signal processing (feature extraction) block - which preprocesses the digitised
speech signal and extracts features from it to be used in the recognition process.
• ANN-based phone recognisor block - which performs the phone recognition task and
represents the phonemic knowledge of the system.
• Syntactic knowledge block - which represents the syntactic reference of the system
and contains the lexicon and phonemic database parts. The phonemic statistical
likelihood of occurrence is used in this block and is integrated with the low-level
phonemic knowledge in the ANN-based phone recognisor block to form the complete
RUST-I system.
Figure 2.2 shows a detailed block schematic diagram of RUST-I. The system in the figure
performs multi-speaker, large vocabulary IWR. Digital speech is passed through the
segmentor, which separates the speech into Hamming-windowed frames of 256 points each.
The windowed speech is passed into the feature extractor to derive 12 Mel-frequency
cepstral coefficients (to be discussed later in this chapter) per window. These 12 MFCCs
are passed into the adaptive phone recognisor (APR), which is composed of a bank of 46
sub-recognisors. The sub-recognisors are spatially aligned to respond to the 45 phones of
Australian English plus silence. The output from the APR is passed to the syntactic
knowledge estimator, which, in response, generates activation signals, each of which
selects the most appropriate phone sub-recognisor of the APR to be activated. The
activation signal enables the output of the sub-recognisor of the next phone with the
highest probability among all the phones, based on the pattern of the previously
recognised phone sequence. The output is collected by the accumulator of the syntactic
knowledge estimator block corresponding to the recognised phone, to indicate whether or
not a match occurs between the input data and an estimated phone. The syntactical
knowledge estimator detects the end of a word and releases an End of Word Identifier
(EOWI) signal to the accumulator. This indicates that the recognition process has been
completed, and the accumulator is prompted to supply the final output.
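As an illustration only, the control flow described above can be sketched as follows. The names (`recognise_word`, `score_phone`, `successors`) are hypothetical, not the thesis code: the APR is reduced to a scoring callable, and the syntactic knowledge estimator to a successor table whose entries play the role of the activation signals.

```python
# Hypothetical sketch of the RUST-I recognition loop (not the thesis code).
# score_phone stands in for an APR sub-recognisor response; successors stands
# in for the syntactic knowledge estimator's activation signals.
def recognise_word(frame_groups, score_phone, successors, eowi="#"):
    recognised = []                          # plays the role of the accumulator
    active = successors["<start>"]           # front-edge phone set
    for frames in frame_groups:
        # only the activated sub-recognisors are evaluated
        phone = max(active, key=lambda p: score_phone(frames, p))
        recognised.append(phone)
        active = successors.get(phone, [eowi])
        if active == [eowi]:                 # End of Word Identifier (EOWI)
            break
    return recognised
```

With a successor table describing a word such as 'pit' and a scorer that favours the matching phone, the loop walks the syntactic tree one phone at a time and stops when the EOWI is reached.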
To implement the syntactic knowledge, a lexicon of 1357 words was chosen. This number
can be increased or decreased as necessary to suit a particular application. Therefore,
RUST-I can be regarded as a large vocabulary SR system.
The speech database, which was used to train and test the system, was derived from words
uttered by 15 native Australian English speakers, and is called the “UWS speech database”.
The acquired phone set forms the basic speech units that are the building blocks of the
phonemic knowledge. The functions of the phonemic knowledge and the syntactic knowledge
are integrated so that any phone missing or misrecognised at the phonemic level can be
predicted and compensated for at the syntactic level.
The training data provided to the system is different from the testing data, and both are
acquired from multiple speakers. The training and testing data were acquired in a natural
room environment; therefore, the proposed system is meant to perform in a low-level
ambient noise environment.
Figure 2.2 Detailed schematic diagram of RUST-I.
Blocks in the schematic diagram of Figure 2.2 can be divided into four main parts: the first
part deals with the feature extraction from speech data; the second part is devoted to the
phone recognition within the adaptive phone recognisor; the third part deals with the
derivation of the syntactic knowledge within the syntactic knowledge estimator; and the
fourth part is the accumulation of the phones of the word within the accumulator.
2.2 Feature Extraction
The process of feature extraction includes segmentation, windowing and computation of
12 MFCCs for a sequence of M frames of 256 speech samples.
In RUST-I, the segmentation was carried out manually. In RUST-II (described in
Chapter 5), the segmentation was performed using the phone boundaries provided in the
TIMIT database. The duration of each phone of the Australian phonemic set was segmented
into N frames of 256 samples each. As the phone boundaries were not distinct, an
overlapping segmentation was used, with an overlap of 22% between adjacent frames, to
maintain continuity across phone boundaries.
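The segmentation step described above can be sketched as follows; this is a minimal illustration assuming NumPy, and the function name is hypothetical.

```python
import numpy as np

def segment_phone(signal, frame_len=256, overlap=0.22):
    """Split a phone-length signal into 256-sample frames with a 22%
    overlap between adjacent frames (Section 2.2)."""
    hop = int(frame_len * (1 - overlap))  # samples between frame starts
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] for s in starts])
```

For example, a 903-sample signal (the shortest /I/ quoted below) yields four overlapping frames.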
For example, the phone represented by the phonetic symbol /I/ was isolated from the
words 'pit', 'sing' and 'thin', and measurements showed different time durations for the
signal representing this phone. The shortest duration occurred for speaker 1 in the word
'pit', where the phone takes 903 sampling points (57 ms ≈ 4 frames). The longest duration
for the same phone occurred for speaker 2 in the word 'sing', where the phone takes 1682
sampling points (140 ms ≈ 7 frames). The maximum and minimum values of N (the number of
frames) are passed to the 'frame count parameter estimator'. The frame count parameter M
is calculated for each phone by averaging the maximum and minimum values of N over all
the speakers, as shown in Table 2.1.
Table 2.1 Minimum, maximum and average values of frame number (N).

ID  PH   Nmin  Nmax  Mi  |  ID  PH   Nmin  Nmax  Mi
 1  I      4     7    6  |  24  t      2    20   11
 2  i      5    11    8  |  25  d      2     7    5
 3  ε      5     9    7  |  26  k      4     8    6
 4  æ      6    10    8  |  27  g      2     7    5
 5  a     10    21   16  |  28  f      9    11   10
 6  Þ      6     9    8  |  29  v      3     8    6
 7  Ď      5     6    6  |  30  θ      5    13    9
 8  ɔ     10    12   11  |  31  ð      6    11    9
 9  ʊ      5     6    6  |  32  s      7    15   11
10  u      7    20   14  |  33  z      6    11    9
11  ɜ     11    13   12  |  34  ʃ      9    11   10
12  ə      4    11    6  |  35  ʒ      4    14   19
13  ʌ      4     8    6  |  36  h      3    14    9
14  aI    12    23   18  |  37  r      4     5    5
15  eI     8    23    6  |  38  tʃ     5     7    6
16  ɔI    17    19   18  |  39  dʒ     3     5    4
17  aʊ    15    23   19  |  40  m      4    12   18
18  oʊ     8    22   15  |  41  n      3    13    8
19  Iə    13    19   16  |  42  ŋ      9    15   12
20  εə    11    21   16  |  43  j      3    14    9
21  ʊə    11    17   14  |  44  w      4    11    8
22  p      3     8    6  |  45  l      2    16    9
23  b      1     9    5  |  46  sln    -     -   24

Note: PH = phone, ID = identifier, sln = silence.
The frame count parameter M is calculated in the segmentation block and indicates the
number of windows, or the time duration, of the presented phone. This parameter
determines the number of neuro-slices used in the adaptive phone recognisor.
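The calculation of M described above can be sketched as follows (an illustrative reconstruction; the function name is hypothetical):

```python
def frame_count_parameter(frame_counts):
    """M for one phone: the average of the minimum and maximum frame
    counts N observed for that phone across all speakers (Table 2.1)."""
    return round((min(frame_counts) + max(frame_counts)) / 2)

# Phone /I/ was observed between 4 frames ('pit') and 7 frames ('sing'),
# giving M = 6, the value listed for /I/ in Table 2.1.
```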
To minimise the effects of frame truncation, a windowing function is required. This function
is expected to reduce the discontinuities at the frame boundaries, while maintaining the signal
integrity over most of the frame. The improvement produced by windowing is at the expense
of the transition width (ramping from zero to maximum).
A 330-point Hamming window, w(n), was chosen; this resulted in an almost 20% increase in
the signal intensity at both boundaries (factors of 0.168 and 0.184). This window size
was used to ensure that the assumptions made in the derivation of the cepstrum
coefficients (Section 2.2.1) were valid. A narrower window increases the bandwidth in
the frequency domain and could degrade results (Davis and Mermelstein, 1980).
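The windowing step can be illustrated with NumPy's built-in Hamming window; this sketch windows each frame with a window of the frame's own length rather than the 330-point window described above.

```python
import numpy as np

def window_frame(frame):
    """Multiply one frame by a Hamming window, tapering the frame ends to
    reduce discontinuities at the frame boundaries."""
    return frame * np.hamming(len(frame))  # 0.54 - 0.46*cos(2*pi*n/(N-1))
```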
2.2.0 Review of Feature Extraction Techniques Used in Speech Processing
The selection of an appropriate feature vector representation for a speech signal depends
on the required accuracy of the recognisor, the size of the vocabulary to be recognised
and the structure of the target speech signal (i.e., phone, syllable, word, phrase or
sentence). Researchers (De Mori, 1983; Furui, 1989; Dermody et al., 1986; Fant, 1960;
Flanagan, 1983) have shown that there are inherent perceptually important acoustic
features within the speech waveform. For speech recognition, an adequate vector
representation is needed to extract those perceptually important features. Three main
techniques have been studied over time: filter-bank, Linear Predictive Coding (LPC) and
cepstral representations. This section provides a comparative summary of these three
representations. The comparison is made only at the level of words and phones.
Rabiner and Juang (1993) summarised comparative studies between the filter-bank and the
LPC analysis model representations and showed that the LPC analysis model generally
resulted in improved performance for speech recognition tasks. This work was performed on
telephone-quality speech (sampled at 8 kHz), and so was band-limited to under 4 kHz.
Dermody et al. (1986) showed that some dynamic sounds, such as stop consonants, carry
high-frequency information in excess of 4 kHz, which is lost in telephone-quality speech.
On the other hand, Davis and Mermelstein (1980) and Hunt (1988) showed that the
performance of LPC deteriorates beyond usefulness when it is used with unvoiced sounds or
sounds with spectral zeros (e.g. nasals). However, it was found that altering the type of
LPC analysis, the window size or the order of the filter overcame most of these
difficulties (Deller et al., 1993). Hunt (1988) nevertheless argued that the relative
superiority of an LPC representation over the filter-bank was still in dispute. This was
supported by Markel and Gray (cited in Rabiner and Juang, 1993), who showed that LPC
performance deteriorates in the presence of noise.
Davis and Mermelstein (1980) compared the performance of several types of vector
representation, including MFCC, SCR, the LPC spectrum and reflection coefficients. The
results showed that the LPC spectrum achieved a recognition rate of about 85% and the
reflection coefficients between 77% and 83%.
Love and Kinsner (1992) presented LPC coefficients to a multi-layer perceptron neural
network. The correct recognition performance was from 42% to 68% for vowels; for
consonants (in consonant-vowel form), it was around 33% to 57% with the vowel /a/ and
around 40% to 53% with the vowel /e/. The average false recognition score for the vowels
was 51%.
Creekmore et al. (1991) carried out another comparative study on five spectral
representations as input to a feed-forward neural network. The five representations included
the DFT, autocorrelation based LPC, LPC spectral intensities, LPC cepstral coefficients and
the cepstral coefficients derived from Perceptual Linear Predictive (PLP) analysis. The
recognition rate for all but PLP was around 40% to 41% on an open phone data set. The PLP
analysis method scored 45%.
It was found that the spectral representation derived using the LPC coefficients is
highly speaker dependent (Waibel, 1981); given this speaker dependency and the
performance of the LPC, the technique is not suitable for the purpose of this work.
Ultimately, a method is needed that can extract speaker-independent information from the
spectrum to produce an efficient vector representation for speech recognition, removing
as much of the redundancy associated with speaker identity as possible while retaining
the perceptually important acoustic features. These limitations of the LPC and
filter-bank models led to the decision to exclude both representations from this research.
Davis and Mermelstein (1980) compared the recognition performance of three cepstral
representations, Mel-frequency cepstral coefficients (MFCC), smoothed cepstrum or linear
frequency cepstral coefficients (SCR) and LPC cepstral coefficients (LCC) using template
matching on the phone level. The recognition rates ranged from 86% to 96%. Mel-cepstrum
coefficients produced an improved performance of between 95% and 97% over the other two
cepstral representations. The success of the MFCC has been attributed to the accurate
modelling of the critical band frequencies of the auditory system (Waibel and Yegnanarayana,
1981).
In conclusion, these studies show that the recognition performance of cepstral
representations was higher than that of either the LPC or filter-bank representations. A
perceptually based cepstral representation resulted in a marginally higher performance
score than any of the linear cepstral representations. Hence, the MFCC parameters will be
used in this research.
2.2.1 Speech Modelling and MFCC
An understanding of speech production and speech acoustic features is crucial to speech
modelling. Speech is the result of exciting the vocal tract system with an excitation
which consists of either quasi-periodic impulses or random noise (Flanagan, 1983).
Assuming that the vocal tract system and the excitation are independent, the discrete
time model of speech production is shown in Figure 2.3.
Figure 2.3 Discrete time model of speech (Oppenheim and Schafer, 1989).
To minimise the truncation effect of the segmentation, each frame should be multiplied by
a window function. The Fourier transform of an M-point windowing function w[k] can be
obtained as

    W(\omega) = \sum_{k=0}^{M-1} w[k] \, e^{-j\omega k}.    (2.1)
For a short period (e.g., a frame), the vocal tract can be regarded as a linear
time-invariant system, and the superposition principle applies. Assuming that the impulse
response of the vocal tract is h(n), the speech samples can be modelled as
x(n) = h(n) * e(n), where e(n) = p(n) or e(n) = r(n) and * denotes convolution. Denote by
X(\omega), H(\omega), E(\omega) the Fourier transforms of x(n), h(n), e(n) respectively,
so that X(\omega) = H(\omega) E(\omega). By taking the logarithm of the Fourier
transform, the multiplication is turned into an addition:
\log[X(\omega)] = \log[H(\omega)] + \log[E(\omega)]. The cepstrum transform can then be
obtained from

    C_s(q) = F^{-1}[\log X(\omega)] = F^{-1}[\log H(\omega)] + F^{-1}[\log E(\omega)].

There are two types of cepstra: the complex cepstrum (CC) and the real cepstrum (RC). The
basic difference between these types is that the RC discards phase information whereas
the CC retains it (Deller et al., 1993; Oppenheim and Schafer, 1989). The complex
cepstrum is given by

    CC_s[q] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( \log|X(\omega)| + j \angle X(\omega) \right) e^{j\omega q} \, d\omega

and the real cepstrum by

    RC_s[q] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|X(\omega)| \, e^{j\omega q} \, d\omega.

In order to make \log X(\omega) unique, the argument of X(\omega), \angle X(\omega), must
be an odd continuous function of \omega (Oppenheim and Schafer, 1989). This can be done
by adding multiples of 2\pi to the phase (unwrapping) to meet this requirement;
consequently, the discontinuities associated with computation of the phase modulo 2\pi
are removed.
Figure 2.4 Cepstrum computation procedure.
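The cepstrum computation of Figure 2.4 can be sketched in Python (a minimal sketch assuming NumPy; the random windowed frame and function names are illustrative, not part of the original program):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum (phase discarded)."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)       # guard against log(0)
    return np.real(np.fft.ifft(log_mag))

def complex_cepstrum(frame):
    """Complex cepstrum: the phase is retained and must be unwrapped first."""
    spectrum = np.fft.fft(frame)
    log_spec = (np.log(np.abs(spectrum) + 1e-12)
                + 1j * np.unwrap(np.angle(spectrum)))  # phase unwrapping
    return np.real(np.fft.ifft(log_spec))

frame = np.hamming(256) * np.random.randn(256)       # an illustrative windowed frame
c = real_cepstrum(frame)
```

Note that the real cepstrum needs no unwrapping step, since the phase is discarded before the inverse transform.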
According to human perception, a logarithmic-scale Fourier transform is preferred. This
logarithmic-scale (also called Mel-scale) transform can be obtained by passing X(ω) through
a set of band-pass filters with center frequencies and bandwidths as shown in Figure 2.5(a).
Figure 2.5 Different spacing of band-pass filters:
(a) logarithmic (Oppenheim and Schafer, 1989);
(b) linear (Mihelic et al., 1991).
The MFCC is calculated as

MFCC_i = Σ_{k=1}^{Na} X_k cos[i(k − 0.5)π/Na], (2.2)

where Na is the number of Mel-scale filters and X_k, k = 1, 2, ..., Na, represents the log
energy output of the kth filter. The cosine transform provides an approximation to a set of
triangular band-pass filters. Equation 2.2 applies the cosine transform to the log power of a
Mel-scale filter bank to derive the Mel-scale cepstrum. The low-order terms in the cepstral
magnitude correspond to smooth features in the spectrum, while the higher-order terms
represent the spectral fine features, and are therefore filtered out by any approximation of the
cosine series (Davis and Mermelstein, 1980).
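Equation 2.2 can be sketched directly in Python (assuming NumPy; the function name and the 17-filter random example are illustrative):

```python
import numpy as np

def mfcc_from_log_energies(X, n_coeffs=12):
    """Equation 2.2: MFCC_i = sum_k X_k cos[i (k - 0.5) pi / Na].

    X : log-energy outputs of the Na Mel-scale filters.
    """
    Na = len(X)
    k = np.arange(1, Na + 1)
    return np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / Na))
                     for i in range(1, n_coeffs + 1)])

X = np.log(np.abs(np.random.randn(17)) + 1.0)   # illustrative 17 filter log energies
coeffs = mfcc_from_log_energies(X)              # the 12 coefficients used in RUST-I
```

This is the (unscaled) discrete cosine transform of the log filter energies; a flat log spectrum yields all-zero coefficients, reflecting the smoothing property described above.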
2.2.2 Mel-scale Cepstral Coefficients (MFCC) for RUST-I
It was shown above that cepstral representations produce higher recognition accuracy than
other representations. Hence, the MFCC is chosen as an appropriate
vector representation of speech features for RUST-I. Figure 2.6 shows a block diagram of the
portion of the feature extractor that derives the MFCC vectors. The first block is the power
spectra estimator, calculated from a 512 point Discrete Fourier Transform (DFT).
The second block is the Mel output summer, which determines the Mel-scale filter outputs.
The number of Mel-scale filters required for a signal with maximum frequency 6 kHz is
found from Table 2.2 (Mel-scale frequencies) to be 17. Figure 2.7 shows a simulation of the
spacing of these 17 filters in the frequency range from 0 to 6 kHz.
The third block of Figure 2.6 is the bank which calculates the log of the outputs, m(k), from
each of the 17 Mel-scale filters in dB, as shown in the following equation:

Xk = 10 log10 m(k) (2.3)

where k = 1, 2, ..., 17.
The fourth block of Figure 2.6 is the MFCC Vector Estimator, which calculates the Mel-
frequency cepstral coefficients, DI(12). These are determined by applying the cosine
transformation of Equation 2.2 to Xk, the real logarithm of the short-term power spectrum
expressed on a Mel-frequency scale. A program was written to automatically
compute the MFCCs for all frames of each phone, for all tokens in the speech database. An
algorithm of this program is given in Figure 2.8. The MFCC vectors were normalised within
the range of 0 to +1 for input to the neural network.
Figure 2.6 MFCC extraction block diagram.
Table 2.2 Mel-scale frequency bands.
Index Frequency band[Hz]
1 0-117
2 117-281
3 281-445
4 445-609
5 609-773
6 773-914
7 914-1101
8 1101-1312
9 1312-1570
10 1570-1875
11 1875-2203
12 2203-2625
13 2625-3117
14 3117-3679
15 3679-4359
16 4359-5156
17 5156-6000
Figure 2.7 Simulation of Mel-scale filter frequency bands.
The frequency values were derived and used to calculate the Mel-scale filter outputs, each of
which is the linear sum of the intensities of all line spectra within that frequency band. A
component closest to a band boundary is an exception: it is halved and shared between the
two adjacent bands. For example, the first filter covers the frequency band from 0 to about
117 Hz. The Mel-scale filter output, m(1), is calculated by summing the magnitudes of the
first four values of the line spectrum, s(1) to s(4), plus half of the fifth, s(5), because they fall
within the range 0 to 117 Hz:

m(1) = s(1) + s(2) + s(3) + s(4) + 0.5s(5) (2.4)
The equations to compute Mel-scale output for all 17 Mel-scale filters can be found in
Table 2.3 along with the range of the filter and the number of spectral magnitudes used in
its computation. Table 2.4 shows an example of the output of the program, which
computes MFCCs for one frame of speech signal.
Table 2.3 Equations to compute Mel-scale filter outputs for each of the 17 Mel-scale filters.
#  Range          Mel-scale filter output equation               # of LS
1  0-117 Hz       m(1) = s(1) + s(2) + s(3) + s(4) + 0.5s(5)     4 + 1@0.5
2  117-281 Hz     m(2) = 0.5s(5) + s(6) + ... + 0.5s(12)         6 + 2@0.5
3  281-445 Hz     m(3) = 0.5s(12) + s(13) + ... + 0.5s(19)       6 + 2@0.5
4  445-609 Hz     m(4) = 0.5s(19) + s(20) + ... + 0.5s(26)       6 + 2@0.5
5  609-773 Hz     m(5) = 0.5s(26) + s(27) + ... + 0.5s(33)       6 + 2@0.5
6  773-914 Hz     m(6) = 0.5s(33) + s(34) + ... + 0.5s(40)       6 + 2@0.5
7  914-1101 Hz    m(7) = 0.5s(40) + s(41) + ... + 0.5s(47)       6 + 2@0.5
8  1101-1312 Hz   m(8) = 0.5s(47) + s(48) + ... + 0.5s(56)       8 + 2@0.5
9  1312-1570 Hz   m(9) = 0.5s(56) + s(57) + ... + 0.5s(67)       10 + 2@0.5
10 1570-1875 Hz   m(10) = 0.5s(67) + s(68) + ... + 0.5s(80)      12 + 2@0.5
11 1875-2203 Hz   m(11) = 0.5s(80) + s(81) + ... + 0.5s(94)      13 + 2@0.5
12 2203-2625 Hz   m(12) = 0.5s(94) + s(95) + ... + 0.5s(112)     17 + 2@0.5
13 2625-3117 Hz   m(13) = 0.5s(112) + s(113) + ... + 0.5s(133)   20 + 2@0.5
14 3117-3679 Hz   m(14) = 0.5s(133) + s(134) + ... + 0.5s(157)   23 + 2@0.5
15 3679-4359 Hz   m(15) = 0.5s(157) + s(158) + ... + 0.5s(186)   28 + 2@0.5
16 4359-5156 Hz   m(16) = 0.5s(186) + s(187) + ... + 0.5s(220)   33 + 2@0.5
17 5156-6000 Hz   m(17) = 0.5s(220) + s(221) + ... + 0.5s(256)   35 + 2@0.5

("# of LS" gives the number of full-weight line spectra plus those shared at half weight.)
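The halved-boundary summation of Table 2.3 can be sketched in Python (a sketch assuming NumPy; the boundary bins are read directly off the table, and the function name and random spectrum are illustrative):

```python
import numpy as np

# Upper boundary bins of the 17 filters, read off Table 2.3
# (1-based indices into the 256-point line spectrum).
BOUNDS = [5, 12, 19, 26, 33, 40, 47, 56, 67, 80, 94, 112, 133, 157, 186, 220, 256]

def mel_filter_outputs(s):
    """Sum the line-spectrum magnitudes per filter, halving each shared boundary bin."""
    outs = [s[0:4].sum() + 0.5 * s[BOUNDS[0] - 1]]   # filter 1: s(1)..s(4) + 0.5 s(5)
    for lo, hi in zip(BOUNDS[:-1], BOUNDS[1:]):      # filters 2..17
        outs.append(0.5 * s[lo - 1] + s[lo:hi - 1].sum() + 0.5 * s[hi - 1])
    return np.array(outs)

spectrum = np.abs(np.random.randn(256))   # illustrative 256-point line spectrum
m = mel_filter_outputs(spectrum)          # the 17 filter outputs m(1)..m(17)
```

Because each interior boundary bin contributes half its magnitude to the filters on either side, the total energy of the spectrum is preserved across the filter bank (apart from the half-weighted end bins).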
ALGORITHM FOR COMPUTING MFCC VECTORS OF THE SPEECH SIGNAL SEGMENTS:
clear memory;
open phone file for reading;
open file for writing spectral data;
read phone data as a matrix of 256 columns x n rows;
create Hamming window of 330 points length; (Equation 2.1)
for var1 = 1 to n
    expand frame(var1) to 330 points by setting points outside the frame boundary to 1;
    apply the Hamming window to row(var1);
    compute the order-9 (512-point) FFT of the windowed frame;
    compute the power spectrum of the resulting FFT vector;
    write results into the spectral data file;
end for
open file for writing logarithmic data;
open file for writing cepstral data;
form matrix spec containing the spectral data (n x 330);
for var2 = 1 to n
    form the vector of 330 elements for frame var2;
    compute the log outputs of the 17 Mel-scale filters; (Equation 2.3)
    write the output into the log file;
end for
%% compute the 12 Mel-frequency cepstrum coefficients per frame
for loop1 = 1 to 12
    for loop2 = 1 to 17
        accumulate the MFCC vector element; (Equation 2.2)
    end loop2
    write results into the mfcc file;
end loop1
form vectors of 12 MFCC elements;
initialise max to 0;
scan all vectors for coef > max;
normalised coef = coef / max;
temporally unfold the MFCC vectors according to their frame indices;
write results into ASCII format files;
close all files;
end;
Figure 2.8 Algorithm for program to compute MFCC.
Table 2.4 Example of output from the program that computes MFCCs of one frame of
speech signal representing vowel /a/ acquired from the word 'last' spoken by speaker 11.
Filter  Frequency range [Hz]  Filter output [mV]  Coefficient  Unnormalised value of MFCCi
m(1) 0-117 -1.826000 MFCC1 18.782129
m(2) 117-281 4.581117 MFCC2 -0.639120
m(3) 281-445 2.356047 MFCC3 -11.526380
m(4) 445-609 2.960689 MFCC4 -4.390971
m(5) 609-773 3.711937 MFCC5 -0.735596
m(6) 773-914 3.145611 MFCC6 -7.190951
m(7) 914-1101 1.520497 MFCC7 -1.433654
m(8) 1101-1312 1.402426 MFCC8 -0.499446
m(9) 1312-1570 1.142529 MFCC9 -6.208356
m(10) 1570-1875 -0.163876 MFCC10 -6.149410
m(11) 1875-2203 -2.377023 MFCC11 -3.856123
m(12) 2203-2625 -2.584206 MFCC12 -5.575328
m(13) 2625-3117 -1.600273
m(14) 3117-3679 -0.062703
m(15) 3679-4359 -1.651505
m(16) 4359-5156 -0.316149
m(17) 5156-6000 -1.605814
2.3 Features of Australian English
Australian English differs from other forms of English in the position of vowels and
diphthongs within the vowel triangle; also, it differs in vowel length (Bernard et al., 1989).
Vowels vary amongst talkers in timbre, local duration and the emotional dynamics
incorporated into the sound. Spectrographic analysis of Australian English vowels shows
formants that convey the timbre of the vowels, as illustrated in Figure 2.9.
Figure 2.9 Formant frequency plot of Australian English (general - male) (Bernard et al.,
1989).
Spectrograms in Figure 2.10 reveal that the Australian pronunciations of 'bard' /bad/ and 'bud'
/bʌd/ have a similar formant pattern. The spectrograms also show that the explicit difference
between the two vowels is in their duration, the vowel /a/ being about twice as long as the
vowel /ʌ/. Australian English shows the same pattern in /i/ and /ɪ/, /ʊ/ and /u/, and /æ/ and /e/.
This particular sound-duration pattern is very much a part of the Australian accent and differs
from length patterning observable in other English accents.
Australian vowels are also more pronounced than vowels in other English accents. For
example, the word 'station' is pronounced /'steɪʃən/ in Australian English; note how the vowel
is emphasised by taking on the form of the diphthong /eɪ/.
Figure 2.10 Spectrogram of the words 'bard' /bad/ and 'bud' /bʌd/ pronounced in an
Australian accent (Bernard et al., 1989).
It has been reported (Bernard et al., 1989) that Australian English tends to display
distinctive intonation patterns, within certain characteristic ranges of utterance rate
adopted by the average Australian speaker. Australian speakers can be classified into three
main categories: Broad (30% of Australians), General (60% of Australians) and Cultivated
(almost 10% of Australians). Pronunciation of vowels varies depending on the particular
category of the Australian speaker. For example, the word 'seat' could have pronunciations
ranging from /seɪt/ (Broad) through /sɪit/ (General) to /sit/ (Cultivated). A similar grading
applies to 'say' (/saɪ/ - Broad, /seɪ/ - General and /seɪ/ - Cultivated).
Another significant differentiator in Australian speech is the pronunciation of the centering
diphthongs heard in words such as 'beer' and 'bear'. Cultivated speakers tend to say
/bɪə/ and /bεə/ with a pronounced second element. General speakers have a slight glide
towards the central vowel /ə/. Broad speakers tend to say /bɪ:/ and /bε:/ with hardly any
second element. The effect in the last case is almost to create a lengthened pure vowel.
Table 2.5 shows the International Phonetic Alphabet Symbols for use in Australian English
(Macquarie Library Dictionary, 1998). Symbols in the table have been used throughout this
thesis in relation to the construction of the phonemic and the syntactic knowledge. In addition,
the words in the table have been used to construct the speech database which has been used in
this study (UWS speech database).
Table 2.5 International phonetic alphabet symbols used in Australian English.
(Macquarie Library Dictionary, 1998).
Sound Type        Phonetic Symbol   Example      Phonetic alphabet of the example
Vowels            i                 peat         pit
                  ɪ                 pit          pɪt
                  ε                 pet          pεt
                  æ                 pat          pæt
                  a                 part         pat
                  ɒ                 pot          pɒt
                  ʌ                 but          bʌt
                  ɔ                 port         pɔt
                  ʊ                 put          pʊt
                  u                 pool         pul
                  ɜ                 pert         pɜt
                  ə                 apart        ə'pat
                  ɒ̃                 bon voyage   bɒ̃vwa'jaʒ
Diphthongs        aɪ                buy          baɪ
                  eɪ                bay          beɪ
                  ɔɪ                boy          bɔɪ
                  aʊ                how          haʊ
                  oʊ                hoe          hoʊ
                  ɪə                here         hɪə
                  εə                hair         hεə
                  ʊə                tour         tʊə
Consonants
Plosives (stops)  p                 pet          pεt
                  b                 bet          bεt
                  t                 tale         teɪl
                  d                 dale         deɪl
                  k                 came         keɪm
                  g                 game         geɪm
Affricates        tʃ                choke        tʃoʊk
                  dʒ                joke         dʒoʊk
Nasals            m                 mile         maɪl
                  n                 neat         nit
                  ŋ                 sing         sɪŋ
Fricatives        f                 fine         faɪn
                  v                 vine         vaɪn
                  θ                 thin         θɪn
                  ð                 then         ðεn
                  s                 seal         sil
                  z                 zeal         zil
                  ʃ                 show         ʃoʊ
                  ʒ                 measure      mεʒə
                  h                 heat         hit
Semi-vowels       j                 you          ju
                  w                 woo          wu
Laterals          l                 last         last
                  r                 rain         reɪn
2.4 UWS Speech Database Acquisition
This section describes the acquisition of the non-standard speech database used in this
research. This database is referred to throughout this thesis as the “UWS speech database”.
The focus of the UWS speech database is solely on Australian English.
The UWS database consists of words chosen to cover the full Australian English phonemic
set. The speech data have been segmented and labelled at the phonemic level. The database
consists of 45 words spoken in Australian English by 15 adult speakers (10 male, 5 female)
with native Australian English (at least second generation in Australia) and an average age of
26. Each of the 45 words contains at least one of the phones of Australian English (Macquarie
Dictionary, 1994), and the words are listed in Table 2.5. This word set allows for multiple
representations (from 1 to 16) of most phones in different positions within the word (initial,
central and final). The duration of the phones ranged from 12 ms to 487 ms.
The speakers had a general to broad Australian accent. Each speaker read the set of words,
one word at a time. The words were stored in files classified into subdirectories labelled by
speaker. A MATLAB program was written to process each file separately. The program
opens each file and blocks it into frames of 21.3 ms duration (256 points).
A sampling rate of 12 kHz was chosen as a compromise between accuracy and processing
time/complexity. This may limit the cues available for dynamic speech such as the stops,
where the maximum frequency in the signal can extend up to 8 or 9 kHz (Dermody et al.,
1986). The data were prefiltered at fs/2 using a digital tracking anti-alias filter.
Recording was done in the naturally noisy environment of a computer room with an
approximate signal-to-noise ratio of 30 dB. The recorded speech was then phonemically
segmented and labelled by a manual process. Each phone of the Australian phonemic set was
segmented into N frames of 256 points each. The database contained 45 different phones. An
overlap-and-add method was used for segmentation, where an overlap of 22% (256 data
points extracted every 200 data points) was found to maintain continuity across frame
boundaries.
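The overlapping segmentation described above can be sketched as follows (a Python sketch with illustrative names; the thesis used a MATLAB program for this step):

```python
import numpy as np

FRAME_LEN = 256   # 21.3 ms at the 12 kHz sampling rate
HOP = 200         # 256 points taken every 200 points -> 56-point (~22%) overlap

def segment(signal):
    """Slice a labelled phone into overlapping frames of 256 points."""
    n_frames = 1 + max(0, (len(signal) - FRAME_LEN) // HOP)
    return np.stack([signal[i * HOP:i * HOP + FRAME_LEN]
                     for i in range(n_frames)])

frames = segment(np.arange(1000.0))  # a 1000-sample phone yields 4 frames
```

The last 56 points of each frame coincide with the first 56 points of the next, which is the 22% overlap quoted above.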
The Hypersignal Acoustic™ software package was used for manual segmentation and
labelling at the phonemic level. This process cannot be explicitly defined within the whole
word, as phones do not have clear boundaries but instead run into each other
(co-articulation). For example, the phone /ð/ as in 'the' overlapped into the next phone,
whereas the phone /θ/ could be isolated easily. In editing the segments, consonants were
considered to end at the point where the speech signal showed a significant shift in amplitude
and/or at the onset of regularity and periodicity; this was verified by perceptual judgement.
Diphthongs were segmented and labelled as distinct phone units to minimise the possibility
of confusing them with the other vowels.
2.5 Techniques Used in Speech Recognition
In this section, a brief description of the current techniques used in speech recognition is
presented. This covers the most popular techniques used to date, including Pattern
Recognition (PR), Hidden Markov Models (HMM), Artificial Neural Networks (ANN),
Artificial Intelligence (AI) and hybrid ANN/HMM systems.
2.5.0 Pattern Recognition (PR)
Pattern Recognition (PR) is a well-known technique in the field of image recognition as
well as in speech recognition. Pattern recognition means the identification of the ideal
(pattern) which represents a given object. In speech recognition, PR uses the speech pattern
directly, without explicit feature determination and segmentation. The technique has two
steps. The first is to find the ideal speech pattern (training); the second is the recognition
of patterns via a comparison process. The concept is that if enough versions of a pattern
are included in the training set provided to the algorithm, the training procedure should
be able to adequately characterise the acoustic properties of the ideal pattern. Then, by
direct comparison between the ideal and an unknown speech pattern, the system should be
able to classify the input as one of the patterns known to the system.
Some researchers (e.g., Rabiner and Juang, 1993) observed advantages of this technique,
such as:
1. Simplicity.
2. Robustness and invariance to different speech vocabularies, users, feature sets,
pattern comparison algorithms and decision rules.
3. Acceptable performance for some speech recognition tasks.
However, such pattern recognition systems could achieve comparatively better rates only for
speaker-dependent templates and for a limited vocabulary.
2.5.1 Hidden Markov Model (HMM)
The HMM approach is a statistical method of characterising the spectral properties of the
frames of a pattern. The key assumption of HMM is that the speech signal can be well
characterised as a parametric random process, and that the parameters of the random process
can be estimated. This technique showed better recognition results when compared with
PR. In applications of the HMM technique to Isolated Word Recognition (IWR) research, a
statistical model of each word in the vocabulary was constructed. Each input word was
recognised as the word in the vocabulary whose model assigns the greatest likelihood to
the occurrence of the observed input pattern. HMM integrates both syntax¹ and semantics²
well into systems (Rabiner and Juang, 1993). Thus, when constructing the statistical
model of the HMM for selected problems in SR, there are three key issues that have to be
addressed:
1. Evaluation of the probability (or likelihood) of a sequence of observations given a
specific HMM. This represents the efficiency of computing the probability of an
observation, P(O|λ), which is denoted the probability of the observation sequence
or state sequence.
2. Determination of the best sequence of model states, which produces the optimal
model for that application (i.e., which best explains the observation).
3. Adjustment of the model parameters so as to best account for the observed signal,
i.e., adjustment of λ to maximise P(O|λ).
In HMM, each word is represented by a set of states (including initial and final states) with
the probabilities of transitions from state to state. Each state has an associated random
variable whose value is a vector of acoustic parameters. The variability of each spoken
word is therefore modeled by N distinct random variables, where N is the number of states
in the model. Many HMM recognisors have an HMM model based on phones (Grant,
1991). The final stage of recognition then combines lexical knowledge with phonemic
knowledge by concatenating the phone HMMs into words.

¹ Syntax: Grammar; the patterns of formation of sentences and phrases from words in a
particular language (Macquarie Library Dictionary, 3rd ed., 1998).
² Semantics: Relating to meaning (Macquarie Library Dictionary, 3rd ed., 1998).
Recognition rates in systems employing HMM varied depending on the type of recognition
task the system was required to perform. An example is the system tested by Pepper and
Clements (1992), who described experiments on phonemic recognition using a large HMM;
the system achieved recognition rates ranging between 52.2% and 53.3% depending on the
size of the system used in the experiment. Other experiments employed HMM with
temporal cues to recognise nonsense consonant-vowel (CV) syllables (a consonant with the
vowel /e/) (Flaherty and Poe, 1993). They reported an HMM system that achieved a
recognition accuracy of 74% using time-varying information, compared with 50% without
that information.
When employing pure HMM in IWR research, a variety of acoustic cues has been employed
to construct the statistical model of the HMM. For instance, Gupta et al. (1991) reported
improvements in recognition accuracy when employing temporal cues combined with
energy-contour information of phones to construct the HMM. By applying minimum
duration and energy thresholds, the accuracy improved from 23.1% to 27.3% in the case of
acoustic-cue recognition, and from 8.8% to 14.3% with the language model. The system was
built as a speaker-dependent system for a large vocabulary. It can be noted here that the
results of this system are consistent with the work of Flaherty and Poe (1993).
The results from the various systems discussed above showed wide variability in the
performance of the HMM when the states of the model represent phones, syllables or
words. This is closely related to the first essential issue in HMM design mentioned above.
Hence, before using an HMM, one must answer the following question: what do the states
in the model correspond to? One must then decide how many states should be in the model
and identify the initial state. Generally, states are interconnected in such a way that any
state can be reached from any other state. This increases the computational cost massively,
even for a few states. The reason can be found in the observational nature of the HMM
representation: the probabilistic function of the states. That is, the HMM is a doubly
embedded stochastic process with an underlying stochastic process that is not directly
observable, but which can be observed only through another set of stochastic processes
that produces the sequence of observations (Rabiner and Juang, 1993).
HMM modeling necessarily computes the variability of the spectra at different parts of each
word. It also has variable time-distortion penalties, and it relates these penalties to the
spectral distortion penalties in a theoretically defensible way. On the other hand, its timing
model is unrealistic, in that the probability of staying in a given hidden state decays
exponentially with time (Hunt, 1988).
When using HMM, the recognition problem is usually formulated as one of finding the
sequence of states in the hidden Markov chain whose a posteriori probability is maximum.
The easiest way of doing this is by means of the Viterbi algorithm (Kenny, 1993). However,
this algorithm suffers from several drawbacks:
1. It is an exhaustive search. For phone-based recognisors with large vocabularies,
the speech model can be very large and the search requires expensive computational
time. Although the Viterbi algorithm is the search strategy usually used in
medium-vocabulary applications (around 1,000 words), it is not clear how it can be
extended to very large vocabulary applications (around 100,000 words). It can be
observed that recognition rates declined as the vocabulary increased in such systems
(Kenny, 1993).
2. It generates only one recognition hypothesis. Although it can be modified to
generate the N best hypotheses, the amount of computation increases proportionately.
3. The simple device of imposing context-dependent minimum duration constraints
on phone segments in recognition has been found to lead to major improvements
in recognition performance (Gupta et al., 1991). Because of their non-Markovian
nature, these constraints cannot be accommodated by the Viterbi algorithm without
changing the topology of the model. It is possible to modify the Viterbi algorithm
so that this can be done, but there is a substantial price to be paid (Kenny, 1993).
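For reference, the Viterbi search discussed above can be sketched as follows (a minimal log-domain implementation assuming NumPy; the variable names and two-state toy model are illustrative, and no pruning or duration constraints are included):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence for one observation sequence.

    log_pi : (N,)   initial state log probabilities
    log_A  : (N, N) state transition log probabilities
    log_B  : (T, N) per-frame observation log likelihoods
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # score of every predecessor state
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack through best predecessors
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Two-state toy model: start in state 0, observations then favour state 1.
NEG = -1e9  # stands in for log(0)
path, score = viterbi(
    np.array([0.0, NEG]),
    np.log([[0.5, 0.5], [0.001, 0.999]]),
    np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]),
)  # path is [0, 1, 1]
```

The exhaustive nature of the search is visible directly: every state considers every predecessor at every frame, which is the O(T·N²) cost criticised in point 1 above.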
Most recent research in HMM has concentrated on improving the recognition rate of
systems that employ HMM. This can be seen as an attempt to compensate for the price paid
because of the drawbacks mentioned above, and is the core idea of the system reported by
Gupta et al. (1991). Another example in this context comes from Smith et al. (1995), who
reported an experiment aimed at optimising HMM performance. They described a system
which used two kinds of inter-frame dependent observation structures, both built on
observation densities of a first-order dependent form, which accounts for the statistical
dependence between successive frames. In the first model, the dependency relation among
the frames was determined (optimally) by maximising the likelihood of the observations in
both training and testing. In the second model, the dependency structure associated with
each frame was described by a weighted sum of the conditional densities of the frame given
individual previous frames. To estimate the parameters of the two models, the system was
implemented with the segmental K-means and forward-backward algorithms, respectively.
The system was then tested on an IWR task, and achieved better performance than both the
standard continuous HMM and the paradigm-constrained HMM. However, this report is
similar to other HMM reports in that it did not provide details of the computational price
paid for the improvement. In summary, the following points can be extracted:
1. To construct a recognition system based on phonemic recognition, a large HMM is
required (refer to Pepper and Clements, 1992).
2. To achieve reasonable accuracy and recognition rate in a syllable-based recognition
system, temporal cues must be incorporated into the system (refer to Flaherty and Poe,
1993; Gupta et al., 1991).
3. The previous two points cause a significant increase in the computational price of
the system, and this price will be higher in the case of larger-vocabulary IWR or
continuous SR systems.
4. The duration of the acoustic events associated with each state is inadequately
modeled (Hunt, 1988). This is especially critical for RUST-I, as the durations of the
phones in the associated phonemic knowledge inherit a temporal tolerance margin,
which requires more flexible techniques such as neural networks.
Therefore, developing a system that comes closer to the ultimate goal of SR using the pure
HMM technique is a much harder option.
2.5.2 Artificial Neural Network (ANN)
The use of artificial neural networks is the main technical discipline of neurocomputing
technology, which is concerned with information processing systems that autonomously
develop operational capabilities in adaptive response to an information environment
(Hecht-Nielsen, 1990).

Technically, an ANN can be defined as a parallel, distributed information-processing
structure consisting of processing elements, which can possess a local memory and can
carry out localised information processing operations. All elements in the structure are
interconnected via unidirectional signal channels called connections. All connections have
associated adjustable weights, which perform the learning process in the structure.

Researchers in the field of SR realised that ANN can work as well as HMM, or even better,
when dealing with speech patterns (Deller et al., 1993). The initial search was for an
alternative system that could handle the highly variable nature of speech patterns. The
required system should be able to generalise the problem of pattern recognition; it should
also be non-algorithmic in nature and able to adapt. The system is fed examples of speech
patterns so that it can learn the general features of speech. Consequently, it is expected to
be capable of recognising any similar patterns.
ANNs are known for their adaptive, self-organising and fault-tolerant functions and their
non-linear capabilities. This makes them particularly applicable to the problem of SR. ANNs
are often used in speech processing to implement pattern recognition, i.e., to associate input
patterns with classes (classification) (Deller et al., 1993). Within this function, at least three
subtypes of classifier can be delineated. In the first, an output pattern results which
identifies the class membership of the input pattern. The second is a vector quantisation
function, in which vector input patterns are quantised into a class index by the network; this
application is reserved for a particular type of ANN architecture that is trained differently
from the more general types of pattern-associator network. The third subtype of classifier is
called the associative memory network. This type of network is used to produce a
memorised pattern or class exemplar as output in response to an input, which might be a
noisy or incomplete pattern from a given class.
In addition to pattern recognisors, a second general type of ANN is the feature extractor. The
basic function of such an ANN is the reduction of large input vectors to small output vectors
that effectively characterise the classes represented by the input patterns. The feature
extractor reduces the dimensions of the representation space by removing redundant
information. It is also sometimes the case that feature representations appear as patterns of
activation internal to the network rather than at the output. An example of this is given by
Waibel et al. (1989).
The classical application of ANN to SR has focused on the fundamental problem of
classifying static, pre-segmented speech, predominantly employing either Multi-Layer
Perceptron (MLP) or Learning Vector Quantiser (LVQ) topologies. A list of classical studies
in SR using ANN can be found in Table 2.6.

It can be noticed from Table 2.6 that the architectures are either MLP or LVQ, except in the
case of the Feature Map Classifier (FMC) of Huang and Lippman (1988), which is a
hierarchical network consisting of an LVQ-like layer followed by a perceptron-like layer. All
MLPs were trained by the back-propagation (BP) learning algorithm.
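A minimal sketch of an MLP trained by back-propagation, of the kind used in these studies (assuming NumPy; the toy data standing in for MFCC frames, the layer sizes and the learning rate are all illustrative, not taken from any system in Table 2.6):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for frame classification: 12-dim vectors (like MFCC frames), 3 classes.
means = 3.0 * rng.normal(size=(3, 12))
X = rng.normal(size=(90, 12)) + np.repeat(means, 30, axis=0)
y = np.repeat(np.arange(3), 30)

W1 = rng.normal(scale=0.1, size=(12, 16)); b1 = np.zeros(16)   # hidden layer
W2 = rng.normal(scale=0.1, size=(16, 3));  b2 = np.zeros(3)    # output layer

def forward(X):
    h = np.tanh(X @ W1 + b1)
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)     # softmax class probabilities

targets = np.eye(3)[y]
lr = 0.1
for _ in range(300):                 # back-propagation: gradient descent on cross-entropy
    h, p = forward(X)
    d2 = (p - targets) / len(X)      # error at the output layer
    d1 = (d2 @ W2.T) * (1 - h ** 2)  # error propagated back through the tanh layer
    W2 -= lr * (h.T @ d2); b2 -= lr * d2.sum(axis=0)
    W1 -= lr * (X.T @ d1); b1 -= lr * d1.sum(axis=0)

acc = float((forward(X)[1].argmax(axis=1) == y).mean())
```

The two update lines are the essence of BP: the output-layer error is propagated backwards through the connection weights, and every adjustable weight is moved against its error gradient.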
Table 2.6 Classical studies in SR using ANN applied to pre-segmented speech.
Study                        Approach/Problem
Elman and Zipser (1987)      MLP – Consonant and vowel recognition
Huang and Lippman (1988)     MLP and FMC – Vowel discrimination
Kammerer and Kupper (1988)   MLP and single layer of perceptrons – Speaker-dependent and speaker-independent word recognition
Kohonen (1988)               LVQ – Labelled Finnish speech
Lippman and Gold (1987)      MLP – Digit recognition
Peeling and Moore (1987)     MLP – Digit recognition
Ahalt et al. (1991)          MLP and LVQ – Vowel discrimination, gender discrimination, speaker recognition
These classical studies, which applied the ANN technique to relatively simple SR problems,
triggered hundreds of related studies. Many possible ANN architectures were tested for SR to
assess topology, training time and recognition rate. Many of the known vectorial input
representations were also applied, and the recognition rates were monitored. Table 2.7 is a
summary of studies with their variations in input parameters.

Inspired by the classical work on ANN applications, most of the experiments in Table 2.7
retained the pure MLP topology; in some cases it was combined with self-organising
networks. The input vectors for the networks varied in each particular study to explore its
possibilities. The overall performance of the ANN was compared to HMM techniques
applied to similar tasks. Gramss (1992) showed that the use of ANN achieves faster results
than HMM.
Table 2.7 shows that, in each case, the resultant accuracy was related to the type of input
presented to the network. From the table, the results reported by Shim et al. (1991) were
produced using an MLP/BP network with LPC input vectors, Davenport and Garudari (1991)
used a Receptive Field network with wavelet input vectors, and Escande et al. (1991) used a
GP network with time-frequency spectral input vectors. All these methods showed lower
overall recognition accuracy. It should also be noted that there is a relation between the
topology of the ANN and its accuracy; the Receptive Field and GP topologies achieved
lower accuracy too.
Table 2.7 Summary of studies, which employed ANN for speech signal processing.
Study | ANN topology/learning algorithm | Input type | Dependency & recognition type | Vocabulary/database | Speakers | Accuracy (max.)
Shim et al. (1991) | MLP/BP | LPC | MSD/CV | 16/unknown | 3 | 70%
Davenport & Garudari (1991) | Receptive field/supervised | Wavelet | Speaker-independent/feature extractor & recognisor of phones | 795/TIMIT | 48 | 81%
Escande et al. (1991) | GP | Time-frequency spectral representation | IWR | Digits/RSG10 NATO | 4 | Accuracy less than that for other systems
Gramss (1992) | FFNN | Contrasted spectrograms | Speaker-independent/IWR | RSRE & DPI digit databases (German) | unknown | 97.1%, 94.5% (faster than HMM)
Kuang & Kuh (1992) | Combination of self-organising feature map and MLP | Various parameters | MSD/IWR | 20 words (10 digits & 10 control words)/TI20 | 4 | 99.5%
Kitamura et al. (1992) | CombNet: self-organising & 4-layer MLP | TDMC | Speaker-dependent IWR | 100/Japanese cities | 9 | 96.8%
Kitamura et al. (1992) | CombNet: self-organising & 3-layer MLP | TDMC | Speaker-dependent IWR | 100/Japanese cities | 9 | 99.1%
Elvira and Carrasco (1991) carried out a study comparing the most popular topologies and various input parameters. They concluded that the most common topologies are:
1. Adaline.
2. Monolayer Perceptron.
3. Back-propagation with the Sigmoid function.
4. Back-propagation with the hyperbolic tangent function.
5. Radial Basis functions (RBF) network with the Gaussian function.
6. Volterra connectionist model.
The following input parameters were used:
1. 12-parameter PARCOR linear predictive coding (LPC) using the Durbin method.
2. 12 frequency band coefficients calculated using Mel-scale distribution from
256 frequency coefficients obtained using the Fast Fourier Transform
(FFT).
3. 12 frequency band coefficients calculated on a linear scale distribution from
the same 256 frequency coefficient FFT.
4. 12 Mel cepstrum coefficients.
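Input parameter 2 above (Mel-scale frequency band coefficients derived from a 256-coefficient FFT) can be sketched as follows. This is a minimal illustration, not the implementation of Elvira and Carrasco; the sample rate, triangular filter shapes and normalisation are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_energies(power_spectrum, sample_rate=8000, n_bands=12):
    """Pool a one-sided power spectrum into n_bands triangular filters
    spaced equally on the Mel scale between 0 Hz and Nyquist."""
    n_bins = len(power_spectrum)
    edges_mel = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2)
    edges_bin = np.floor(
        mel_to_hz(edges_mel) / (sample_rate / 2.0) * (n_bins - 1)).astype(int)
    out = np.zeros(n_bands)
    for b in range(n_bands):
        lo, mid, hi = edges_bin[b], edges_bin[b + 1], edges_bin[b + 2]
        for k in range(lo, hi):
            # Triangular weight: rises lo -> mid, falls mid -> hi.
            w = (k - lo) / max(mid - lo, 1) if k < mid else (hi - k) / max(hi - mid, 1)
            out[b] += w * power_spectrum[k]
    return out

# Toy frame: a low-frequency sinusoid analysed with a 512-point FFT,
# keeping 256 power-spectrum coefficients as in the study above.
frame = np.sin(2 * np.pi * 0.05 * np.arange(512))
spectrum = np.abs(np.fft.rfft(frame, 512))[:256] ** 2
bands = mel_band_energies(spectrum)   # 12 Mel band coefficients
```

The linear-scale variant (input parameter 3) differs only in spacing the band edges linearly in Hz rather than on the Mel scale.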
In the above-mentioned study, two databases were used for training and another two for testing. The ANN was tested on vowels. The results showed that BP-tanh gave the best performance for any number of training iterations when Mel-scale FFT coefficients were used as inputs. For digit recognition, BP-sigmoid achieved the best performance relative to the number of iterations (Table 2.8). These results are consistent with the research details presented in Tables 2.6 and 2.7.
Table 2.8 Percentage of correct recognition for various topologies related to the number of
iterations (Elvira and Carrasco, 1991).
Iterations 50 100 150 200 250 300
Adaline 53.64 54.77 57.05 54.28 53.15 56.58
Perceptron 51.55 55.66 56.24 53.31 55.20 56.02
BP-Sigm 64.53 66.99 67.83 68.18 68.93 68.10
BP-Tanh 62.33 69.21 63.49 63.32 66.14 62.08
RBF 57.67 59.06 60.77 57.85 - -
Volterra 60.27 62.03 60.44 61.60 - -
These findings have generated confidence in using the MLP topology and its derivatives for SR. Neural networks are also used in speech perception. Cassidy and Harrington (1992) carried out a study using one of the MLP derivatives with a sigmoid output function. Their aim was to investigate the validity and the importance of the dynamic structure of vowels in vowel perception. Vowels were represented by Bark spectra and applied to the input of the ANN. The performance of the ANN confirmed the importance of the dynamic structure of vowels.
Generally speaking, a standard ANN is structured to work with static patterns. When applied to speech, which is dynamic in nature, the ANN structure needs to be modified, and researchers have employed various architectures to accommodate this dynamic requirement. For instance, Zhang et al. (1995) employed a high-order fully recurrent ANN for this purpose. The proposed system provides effective processing of temporal information within speech signals by implementing an ANN with a self-organising input layer followed by a fully recurrent hidden layer and an output layer.
The most popular option which has been applied to speech signals is the Time Delay Neural Network (TDNN) (Waibel et al., 1989). Figure 2.11 shows a simplified architecture of this network. The structure of the TDNN extends the input of each computational element to include N speech frames represented by spectral vectors over a duration of NΔ seconds, where Δ is the time interval between adjacent speech frames.
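The TDNN idea amounts to a one-dimensional convolution over time: each computational element sees N consecutive spectral frames rather than a single static vector. A minimal NumPy sketch follows; the frame dimension, number of units and delay depth are arbitrary illustrative choices, not values from Waibel et al.

```python
import numpy as np

def tdnn_layer(frames, weights, bias):
    """One TDNN layer.  Each output unit sees N consecutive input
    frames (the time delays), i.e. a 1-D convolution over time.
    frames:  (T, d)        one spectral vector per frame
    weights: (units, N, d) one weight block per delay
    bias:    (units,)
    Returns (T - N + 1, units) activations."""
    T, d = frames.shape
    units, N, _ = weights.shape
    out = np.empty((T - N + 1, units))
    for t in range(T - N + 1):
        window = frames[t:t + N]       # N frames spanning N*delta seconds
        out[t] = np.tanh(
            np.tensordot(weights, window, axes=([1, 2], [0, 1])) + bias)
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 16))      # 20 frames, 16 spectral coefficients
w = rng.standard_normal((8, 3, 16)) * 0.1   # 8 units, 3-frame delay window
acts = tdnn_layer(frames, w, np.zeros(8))
```

Because the same weight block is reused at every time step, the layer detects acoustic events regardless of their exact position in time, which is the property that made the TDNN attractive for speech.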
Other ANN structures have been proposed but did not prove to be as useful as the TDNN. An example is the Hidden Control Neural Network (HCNN) (Rabiner and Juang, 1993) shown in Figure 2.12. This network uses a time-varying control input, c, as a supplement to the standard input, x, allowing the network properties, or input-output relations, to change over time in a prescribed manner.
Figure 2.11 Time delay computational element (Sugiyama, et al., 1991).
Figure 2.12 Hidden control neural network (Rabiner and Juang, 1993).
In a similar way to the TDNN, architectures that convert the time dimension of the speech signal into a distributed spatial structure have been used. An example of this architecture is the network reported by Cassidy and Harrington (1992). That ANN maps the temporal dimension onto a spatial dimension of an MLP consisting of four layers of units connected by links of varying delays. The model performance was measured using two types of test sets; Table 2.9 summarises the performance of that ANN.
Table 2.9 Results of open tests for ANN trained on full vowel and steady-state vowel
(Cassidy and Harrington, 1992).
Training Set Test Set Correct (%) Rejected (%) Error (%)
Full Vowel Full 90.0 5.0 5.0
Steady-State S-S 73.2 7.5 19.3
2.5.3 Advantages of ANN
The advantages of applying artificial neural networks to the problem of speech recognition are:
• The parallel nature of ANN. The parallel distributed processing of an ANN gives it the ability to adapt, which is at the very centre of ANN operation. Adaptation takes the form of adjusting the connection weights to achieve the desired mapping of inputs to outputs. Furthermore, an ANN can continue to adapt and learn (incremental learning), which is extremely useful in processing and recognising speech. Adaptation (learning) algorithms continue to be a major focus of research in the ANN field (Hecht-Nielsen, 1990).
• Robustness and fault tolerance. ANNs tend to be robust and fault-tolerant because the network is composed of many interconnected Processing Elements (PEs), all computing simple mathematical functions in parallel. The failure of a PE is compensated by redundancy in the network. Similarly, an ANN can often “generalise” a reasonable result from incomplete or noisy data. Finally, in direct contrast to HMM, when an ANN is used as a classifier it does not require a strong statistical characterisation of the data (Deller et al., 1993).
Since information, or a relationship, is embedded in the ANN and spread amongst the PEs within the network, this structure has low sensitivity to noise or defects within the structure (Laurene, 1994). Robust speech recognition is still a major research topic, and the robustness of ANNs has shown some promising results. For example, Sorensen (1991) showed an improvement of 65% in the average recognition rate when a noise reduction neural network was added to the system under evaluation. In this case the network was provided at its input with cepstral coefficient vectors derived from isolated words corrupted by noise from a non-stationary source.
Another advantage of ANNs comes from the variability of the connection weights, which allows ANNs to adapt in real time and improve the overall performance of the system. Adaptive learning is the most important advantage of ANNs and results from the non-linearity of their activation functions. This means that large ANNs can approximate a non-linear dynamical system (Rabiner and Juang, 1993), which conveniently accommodates the dynamic nature of speech.
2.5.4 Artificial Intelligence (AI)
This review of the AI technique briefly presents the basic use of AI within SR. The AI technique applied to SR is a hybrid technique that integrates acoustic-phonetic phenomena with pattern-recognition concepts.
As a basic concept in AI, Expert Systems (ES) have achieved remarkable success in many domains (business, robotics, biomedical engineering) (Hunt, 1988), which has led to intense interest in their application to SR. ESs are intended to model human conscious reasoning. An ES attempts to decode speech information at the phonetic level by modeling the behaviour of a phonetician reading spectrograms; this can be described as simulating human intelligence in visualising, analysing and finally making a decision on the extracted acoustic features.
Studies which apply ES to the SR task can be divided into two broad categories:
• Systems of rules embodying human knowledge of what characterises speech sounds as they appear in spectrograms, so that the ES models a skilled spectrogram reader. It transpires that such systems of rules are much less effective at decoding speech than a human listener with normal hearing ability.
• Systems that use statistical properties of training material to compare patterns on continuous scales.
Rabiner and Juang (1993) defined the knowledge sources as:
Acoustic knowledge: evidence of which sounds (predefined phonetic units) are spoken, based on spectral measurements and the presence or absence of certain acoustic features.
Lexical knowledge: the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).
Syntactic knowledge: the combination of words to form grammatically correct strings (according to a language model) such as sentences or phrases.
Semantic knowledge: understanding of the task domain of the speech so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or with the meaning of previously decoded sentences.
Pragmatic knowledge: the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.
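These knowledge sources are typically combined as probabilistic scores. The following toy sketch (hypothetical words and numbers, not taken from any cited system) shows how a higher-level syntactic prior can resolve two acoustically near-identical word candidates:

```python
import math

# Hypothetical log-likelihoods from the acoustic level and word
# probabilities from the syntactic/lexical level (toy numbers).
acoustic = {"meet": -12.0, "meat": -12.1}   # log P(acoustics | word)
syntactic = {"meet": 0.02, "meat": 0.10}    # P(word | preceding words)

def best_word(acoustic, syntactic):
    """Bayes-style combination: argmax over words of
    log P(acoustics | w) + log P(w | context)."""
    return max(acoustic, key=lambda w: acoustic[w] + math.log(syntactic[w]))

# The acoustic level alone slightly prefers "meet"; the syntactic
# knowledge source overturns it in favour of "meat".
choice = best_word(acoustic, syntactic)
```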
The incorporation of such levels of knowledge within a system enhances its ability to recover corrupted speech. This was studied by De Mori (1983), who carried out experiments on speech sounds that were selectively masked by noise. The results of the experiments showed that listeners used semantic, syntactic, prosodic (rhythm or intonation), pragmatic, phonetic (the body of facts about speech and its production) and acoustic knowledge to understand corrupted or uncorrupted speech. These experimental results support the use of a language model in which high-level syntactic knowledge supports the acquisition and retrieval of lower-level phonemic knowledge.
The ES approach may be appropriate for the organisation of higher-level syntactic and particularly semantic information, which is susceptible to conscious analysis. The effective use of such higher-level information will be necessary to achieve sophisticated SR (Hunt, 1988). An ES cannot be successfully used for segmentation and labeling (Rabiner and Juang, 1993). In particular, methods that integrate phonemic, lexical, syntactic, semantic and even pragmatic knowledge into the ES have been proposed and studied. The main requirement in such methods is that the learning should adapt to the dynamic component of the data. For example, the expert system approach to segmentation and labeling would augment the generally used acoustic knowledge with phonemic, lexical, syntactic, semantic and even pragmatic knowledge (Rabiner and Juang, 1993).
The main advantage of integrating a higher-level knowledge source into a recognition system is a significant improvement in the word-correction capability of the system (Rabiner and Juang, 1993).
It can be concluded that a variety of knowledge sources need to be established in AI. Two key concepts of AI must therefore be addressed: automatic knowledge acquisition (learning) and adaptation (learning on the run). One of the ways in which these concepts can be implemented is by using ANN. This idea was tested by Shuping and Millar (1992), who highlighted the importance of using speech knowledge in SR systems to achieve results closer to the ultimate objectives of SR. They emphasised that to achieve measurable advances in SR, the recognition problem should be approached using phonetically based knowledge techniques, where this knowledge is encoded into the system structure.
2.5.5 Hybrid ANN/HMM Systems
Despite the relative success of the HMM technique in specific recognition tasks, the inherent drawbacks outlined in Section 2.5.1 place some limitations on its functionality in more advanced SR tasks (Gupta et al., 1991). Consequently, there is a tendency among the SR research community to employ other techniques, such as ANN, for large recognition tasks. However, the success of HMM in some particular systems has encouraged researchers to explore the potential of modified forms of HMM. Several researchers have extended the core of HMM to overcome the conditional-independence limitation, which is one of the HMM drawbacks. An example of this tendency is the research carried out by Chan and Chan (1992), in which a proposed Static Model (SM), in the form of a vector, is used to represent the temporal properties of a sequence of speech feature vectors. The system captures the average joint probabilities of state transitions of consecutive observations over time, instead of the conditional probabilities captured by HMM. The system was tested using an artificial vocabulary of ten words. The results of the test were not encouraging, as the system exhibits limitations in handling certain types of vocabulary.
In merging the two techniques, it was noticed that ANNs, in particular the MLP, are fundamentally similar to HMMs in that both have the ability to learn from training data (Deller and Proakis, 1993). The process by which HMMs are trained may likewise be considered a form of supervised learning. However, the learned material in each case differs in content and in methodology, even if both models are applied to the same problem. The HMM learns the statistical nature of the observation sequences presented to it, while the ANN may learn any number of things, such as the classes (e.g., words) to which such sequences are assigned. It was from this point that researchers started integrating the two techniques, with the emphasis on training the ANN to learn the statistical sequences of events previously handled by the HMM (as a pre-processor).
Although influenced by the statistical nature of the observations, the internal structure that the ANN learns is not statistical. In its basic form, an ANN requires fixed-length input, whereas this is not necessary in an HMM because of its time normalisation property.
Although different in philosophy, HMMs and ANNs do have important similarities. Both can be robust to noise, to missing data in the observations and to missing exemplars in the training. However, there is a fundamental difference between the two techniques in the nature of the input, the learned material and the learning mechanisms. Although both systems perform mappings (the HMM from an observation string to a likelihood, the ANN from an input to an output), the dynamics of the HMM are fundamentally linear, whereas the ANN is a nonlinear system, which makes the ANN perform better than the HMM for some SR tasks (Deller, 1993 and Hunt, 1988).
A comparison between the limitations and advantages of HMM (Section 2.5.1) and those of ANN (Section 2.5.3) scores in favour of ANN, especially in a system that incorporates knowledge representation, as in the current system, where the ANN optimises the functionality of the recognition system. This is the idea that has led current SR systems to incorporate ANN and HMM in one hybrid system. However, the incorporation of an ANN with multi-level language knowledge, as in this system, is regarded as a novel investigation. By adopting an ANN as the recognisor at the phonemic level, the three problems of HMM design, especially those of finding the number of states, the probabilities and the optimum sequence, are avoided by using the natural sequence of the language based on a natural language model. This particularly eases the second problem of HMM design, i.e. the attempt to uncover the hidden part of the model and find a correct state sequence. The proposed system maintains the statistical structure in the relation between the phonemic and the syntactic levels, as an HMM does.
The integration of HMM with ANN started with experiments on ANNs trained to act as expanded HMMs. For instance, in research by Rigoll (1991), an MLP is used as a Vector Quantiser (VQ) in an HMM-based SR system. The system can use a variety of speech features, such as cepstral coefficients, differential cepstral coefficients and energy, as joint input to the VQ. This avoids the use of multiple codebooks, so the system simulates multiple HMMs in order to achieve a more robust system. It should be noted that this system transfers the computing complexity of the expanded HMM system to the ANN. However, this study did not show improved recognition rates compared with pure HMM techniques.
Other efforts using the same idea produced similar results. Cheng et al. (1992) carried out an assessment of the possibility of modeling phone trajectories to accomplish SR. The assessment was performed using a hybrid Segmental Learning Vector Quantisation/Hidden Markov Model (SLVQ/HMM) system. The results obtained from that system showed a significant difference from the results of the SLVQ system alone; however, the difference from pure HMM techniques was small.
Zavaliagkos et al. (1994) reported improved performance of a hybrid system when compared with the pure HMM technique. This system was used to perform large-vocabulary continuous SR, and demonstrated consistent improvement in performance over the baseline HMM system. It used an N-best paradigm, connecting a segmental ANN and modeling all the frames of a phonetic segment simultaneously, thus overcoming the conditional-independence limitation of the HMM.
Various hybrid systems were then developed by assigning relatively different functions to the ANN. Reichl and Ruske (1995) reported a hybrid system, consisting of an ANN with Radial Basis Functions (RBF) and an HMM, that achieved a reasonable recognition rate. The RBF-ANN was trained to approximate the a posteriori probabilities of single HMM states. These probabilities are used by the Viterbi algorithm to compute the total scores of the individual hybrid phone models.
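In hybrid systems of this kind, the ANN outputs enter a standard Viterbi search. The sketch below assumes the common scaling of ANN posteriors by state priors to obtain pseudo-likelihoods; the three-state left-to-right model and all numbers are illustrative, not the actual configuration of Reichl and Ruske.

```python
import numpy as np

def viterbi(scaled_lik, trans, init):
    """Viterbi decoding of an HMM state path from ANN outputs.
    scaled_lik[t, s] plays the role of P(state s | frame t) divided by
    the state prior, used in place of the HMM emission likelihood.
    Returns the most likely state sequence."""
    T, S = scaled_lik.shape
    log_trans = np.log(trans + 1e-12)          # avoid log(0) for forbidden jumps
    delta = np.log(init) + np.log(scaled_lik[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans      # (from-state, to-state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + np.log(scaled_lik[t])
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Three-state left-to-right phone model over five frames (toy numbers).
lik = np.array([[.8, .1, .1], [.6, .3, .1], [.1, .8, .1],
                [.1, .3, .6], [.1, .1, .8]])
A = np.array([[.6, .4, 0.], [0., .6, .4], [0., 0., 1.]])
init = np.array([.9, .05, .05])
path = viterbi(lik, A, init)   # traverses the states left to right
```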
The literature in this area tends to compare pure HMM and hybrid ANN/HMM systems. No comparison between pure ANN and hybrid ANN/HMM systems, or between ANN/ES and pure ANN systems, has been reported. This thesis develops such a comparison, and presents the performance of two types of artificial neural networks: conventional BP networks and ANNs with incremental learning ability. This is one of the novel points of the current work.
2.6 Conclusion
This chapter describes the overall system and the acquisition of the UWS speech database which was used in the study of the first version of the system, RUST-I. The speech model and the MFCC parameter calculation have also been presented, and a survey of the techniques used in speech recognition has been conducted.
Based on the literature survey, it was decided that this work would implement a hybrid system of ANNs and knowledge sources for speech recognition, in particular an ANN with incremental learning.
Chapter 3: Phonemic/Syntactic Knowledge and Adaptive Phone
Recognisor - Design and Implementation
3.0 Introduction
This chapter presents an in-depth description of phonemic/syntactic knowledge and the
Adaptive Phone Recognisor (APR) and their implementation. Section 3.1 describes the
design and implementation of the APR. Section 3.2 describes in detail the syntactic
knowledge of RUST-I. The basics of the language model and the set of words that form the
lexicon are discussed in Section 3.2.1. The method of categorisation of the phones within
these words is then described in Section 3.2.2. Section 3.3 describes the method to derive the
statistical probabilities of patterns of words. Section 3.4 is dedicated to the description of the
code activator and the accumulator. Section 3.5 is concerned with the sub-recognisor architecture, the structure of the neuro-slices in a sub-recognisor, and the initial conditions and parameters for the sub-recognisor.
It was shown in Chapter 2 that the functional relationship between the adaptive phone recognisor and the syntactic knowledge estimator produces the syntactic knowledge of RUST-I. The syntactic knowledge takes the form of an associative procedure that links phonemic events with a primitive, syntactically correct language model.
The phonemic knowledge is represented by the ANN parameters (weights) of the 45
sub-recognisors. This knowledge includes the training of the 45 sub-recognisors. (This
knowledge is fault tolerant to some extent.) The syntactical knowledge is represented
as the probabilities of occurrences of phones in the formation of words (in specific
texts).
Combining the syntactic level with the phonemic level produces an IWR system. Additional syntactic or semantic functions could be provided to detect syntax errors and extract information about sentence structure and grammar, but this is beyond the scope of the present thesis. Altering the syntactic functions, such as the vocabulary size and topic focus, or adding other functions, will alter the performance of RUST-I as well.
3.1 Adaptive Phone Recognisor (APR)
The function of the APR is to find the match for an input phone. The adaptive phone recognisor consists of a bank of sub-recognisors that map the speech input, represented by MFCC vectors, to the classified output, represented by the phone identification responses PIR1 to PIR46, for all pertinent frames.
A block diagram of the APR is shown in Figure 3.1. The length of the input phone was also used in recognition as an additional parameter. With this knowledge, only a small number of sub-recognisors will be activated, according to the syntactical knowledge (probability of occurrence).
Figure 3.1 Adaptive phone recognisor.
For an activated ith sub-recognisor, only Mi sets of 12 MFCC parameters are used to calculate PIRi. Each set is retrieved from one frame of the speech signal. If Mi is greater than the number of frames in the speech signal, zero-valued sets are used for the remainder. These MFCC coefficient sets, D1(12), …, DMi(12), are presented to all the activated sub-recognisors simultaneously. The output of only one sub-recognisor is activated at any one time, using the ACLi signal. The order of activation of the sub-recognisors for any one phone is controlled by the syntactic knowledge estimator.
3.2 Syntactic Knowledge Estimator
The syntactic knowledge estimator shown in Figure 3.2 consists of two modules: the syntactic knowledge database and the code activator. The syntactic knowledge database provides the probabilities of the patterns of phones that occur in the words stored in it. These probabilities are utilised by the code activator, which arranges the outputs of the adaptive phone recognisor for the best match, based on the probabilities of the phone sequences. Thus the syntactic knowledge estimator provides the activation control patterns to the adaptive phone recognisor and informs the accumulator of a word boundary.
Figure 3.2 The syntactic knowledge estimator.
The input to the syntactic knowledge estimator is the phone identification responses, PIR1 to PIR46, of the previous phone in the word to be recognised. There are two outputs from this block: the activation control lines, ACL1 to ACL46, and the End of Word Identifier (EOWI). The largest value of PIR1 to PIR46 (above a certain threshold) indicates the matched phone for the input phone. The code activator of the syntactic knowledge estimator then checks the syntactic knowledge database for the list of phones most likely to follow the current pattern and passes this estimate to the adaptive phone recognisor via the states of the activation control lines ACL1 to ACL46. The code activator continues checking likely phones at one level until a match is made for the phonemic identity of the speech input. If no match is made, a message is generated to indicate that the word does not exist in the lexicon. If a silence is detected at any level other than the first, a word boundary is identified and passed on to the accumulator via the EOWI signal.
3.2.0 Syntactic Knowledge Database
Within the syntactic knowledge database, the syntactical data is represented in the form of clusters. In the clusters, the data units are linked to each other using pointers, in a similar manner to a linked list, where the data units of the syntactical knowledge are the phone IDs of Table 3.1.
Table 3.1 Phone ID’s of RUST-I.
Phone ID Phone ID Phone ID Phone ID
I 1 Λ 13 d 25 r 37
i 2 aI 14 k 26 t∫ 38
ε 3 eI 15 g 27 dξ 39
æ 4 ЭI 16 f 28 m 40
a 5 aΩ 17 v 29 n 41
Þ 6 OΩ 18 し 30 さ 42
Ď 7 I∂ 19 ð 31 j 43
Э 8 ε∂ 20 s 32 w 44
Ω 9 Ω∂ 21 z 33 L 45
u 10 p 22 ∫ 34 sln 46
έ 11 b 23 ξ 35
∂ 12 t 24 h 36
The database contains the probability of occurrence of a phone as the first in a word, or within any pattern of phones, up to a maximum depth of 14 levels. For example, the fricative /ð/ has the highest probability (0.1149594) among all phones in the analysed textual material (the lexicon) of being first in a word. The vowel /∂/ has the second highest probability of being first in a word, at 0.0781134. The phone with the lowest probability of being first in a word is the diphthong /eI/, at 0.0007369. The phone /r/ has the highest probability (0.403) of following the plosive /p/. The database contains information on the depth into the word (that is, levels 1, 2, down to 14) and a statistically aligned list indicating the order of the most likely phones at a given level.
Phonemic units are distributed in the knowledge space statistically according to their probabilities of occurrence, which depend on the focus of the knowledge source (KS). The file that contains the probabilities of the linked, clustered data units is referred to as the syntactic database.
The syntactic knowledge database is constructed as clustered linked-lists. The front edge list, which contains the phones ordered by their probabilities of occurring first in a word, acts as the syntactic knowledge interface. Both the linked-lists and the clusters are constructed using the statistical order distribution derived from the probability values of Tables A.1 through A.34. The first search cycle navigates the front edge list until a match is found. Each subsequent search cycle moves through the levels of the linked lists selected from the front edge list.
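The clustered linked-list organisation can be illustrated by a small sketch that builds such a database from a toy list of phonemic words. The dictionary-based node representation and the three-word lexicon are assumptions for illustration; the thesis implementation uses pointer-linked clusters over the full lexicon.

```python
def build_syntactic_db(phonemic_words):
    """Build nested levels of (phone, probability, next-level) nodes,
    each level sorted in descending probability so that a search
    visits the most likely phones first."""
    def insert(level, phones):
        if not phones:
            return
        node = next((n for n in level if n["phone"] == phones[0]), None)
        if node is None:
            node = {"phone": phones[0], "count": 0, "next": []}
            level.append(node)
        node["count"] += 1
        insert(node["next"], phones[1:])

    front_edge = []                      # level 1: onset phones of all words
    for word in phonemic_words:
        insert(front_edge, word)

    def finalise(level, total):          # counts -> probabilities, sorted
        level.sort(key=lambda n: -n["count"])
        for n in level:
            n["prob"] = n["count"] / total
            finalise(n["next"], n["count"])
    finalise(front_edge, len(phonemic_words))
    return front_edge

# Toy lexicon of three phonemic words; /ð/ heads two of them, so it
# leads the front edge list with probability 2/3.
db = build_syntactic_db([["ð", "ə"], ["ð", "ɪ", "s"], ["æ", "t"]])
```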
3.2.1 RUST-I Lexicon
RUST-I will work with a database of any size (theoretically an infinite one), limited only by the access time through a large database. A limited-size database was used in RUST-I to illustrate its operation.
Generally, the number and type of words in a lexicon are logically related to the area of knowledge, or topic, from which those words come. Particular words may occur frequently in only one area (e.g., the word ‘budget’ will be repeated in the financial area of knowledge, whereas the word ‘child’ may only occur in a general area of knowledge). Other vocabulary can be described as general-use vocabulary, necessary in any English speech; the same words may appear in many areas at the same time (e.g., the word ‘the’ will occur in all areas). To demonstrate the concept of the syntactical knowledge and the system lexicon, a limited number of speech areas were used to create the RUST-I lexicon.
Two approaches to lexicon development were considered. The first approach selects words pertinent to a particular area, such that a meaningful conversation can be carried out in that subject area. The second approach is to select all words from a word reference text (Macquarie Dictionary, 1998). The second approach uses an alphabetic classification and does not comply with the syntactic knowledge concept; therefore only the first approach was used in this research. In this approach, a much smaller but more effective mixture of words was obtained by selecting three areas of speech. The three chosen areas were (1) general community topics, (2) accounting and (3) physics. The general community extracts came from a local community newspaper (The Torch), the accounting extracts came from Costinett (1997) and the physics extracts came from Hall (1977). Words extracted from these areas were combined in the lexicon, then analysed and prepared for representation in the syntactical knowledge. At this stage, the lexicon of RUST-I comprised a set of 1357 words, of which 541 are unique, but the lexicon could be expanded or reduced according to the system application.
3.2.2 Categorisation
Categorisation is the process of dividing the lexicon word set (1357 words) into phonemic classes depending on their onset phone and subsequent phonemic structure. The onset phone of each word determines the phonemic class with which that word is associated. The onset phones of all the words form the front edge phonemic level of the syntactic knowledge database.
In the first stage of categorisation, all of the words in the lexicon are classed according to the first phone in each word. This produces the phonemic classes of the syntactic knowledge in RUST-I. The phonemic structure of each word in the lexicon is specified in brackets beside its textual representation, e.g., the word ‘with’ is represented phonemically as [wIð]. The conversion process is based on the International Phonetic Alphabet for Australian English shown in Table 2.5.
In the second stage of categorisation, the phonemic classes were placed in descending order according to their probability of occurrence. For example, words that start with the fricative /ð/ have the highest probability of occurrence at the beginning of a sentence. Hence, the phonemic class /ð/ is located at the beginning of the lexicon and at the start of the front edge level in the syntactic database; the fricative /ð/ therefore has the highest probability in the syntactical knowledge. Table 3.2 shows the phonemic classes included in the syntactic knowledge, the number of tokens in each phonemic class and the phonemic sub-classes which follow the front edge phonemic class. The raw phonemic data was categorised by this process to extract statistical information for the syntactic database in the syntactical knowledge estimator.
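The two categorisation stages can be sketched as follows, using a hypothetical five-word mini-lexicon rather than the real 1357-word set (the phonemic forms are illustrative approximations):

```python
from collections import Counter

# Hypothetical mini-lexicon: (word, phonemic form) pairs in the
# bracketed notation of the text, e.g. 'with' -> [wIð].
lexicon = [("with", "wIð"), ("the", "ðə"), ("this", "ðɪs"),
           ("at", "æt"), ("then", "ðɛn")]

def categorise(lexicon):
    """Stage 1: class each word by its onset phone.  Stage 2: order
    the phonemic classes by descending probability of occurrence."""
    classes = {}
    for word, phones in lexicon:
        classes.setdefault(phones[0], []).append(word)
    counts = Counter({p: len(ws) for p, ws in classes.items()})
    return [p for p, _ in counts.most_common()], classes

order, classes = categorise(lexicon)   # /ð/ heads the front edge level
```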
Not all phones in the Australian English phonemic set are represented in the front edge level of the syntactic database. Table 3.3 shows the phonemic classes that are excluded from representation in the front edge level of the syntactic knowledge. The reasons for the exclusion of these phonemic classes are:
• The RUST-I lexicon does not contain words starting with that phone (for example the phones z, ∫, ξ of Table 3.3).
• The phone is syntactically impossible at the beginning of an English word (the other phones of Table 3.3).
Table 3.2 shows that all of the words in the lexicon can be categorised phonemically into 35 front edge phonemic classes.
Table 3.2 Phonemic classes and their associated levels represented in the front edge level
of the syntactic knowledge.
Phonemic class   Associated levels (max)   Second-level phonemic sub-classes
ð        3     ∂ - I - æ - eI - ε - i - OΩ - Λ
∂        12    sln - L - k - w - t - r - b - m - v - p - d - g - f
æ        12    n - t - z - d - L
i        9     n - z - t - f - m
h        9     i - a - æ - έ - I - OΩ - ε - aI - Э - Λ - u
w        7     I - Þ - Λ - i - ε - έ - Ω - eI - aI - ε∂ - Э
þ        9     v - n - f - p
f        10    Э - r - i - I - ε - ∂ - IƏ - L - aΩ - a - aI - æ - eI - u - έ - Λ
p        8     r - L - ∂ - Ω - i - Þ - a - Λ - æ - eI - I∂ - I - έ - ε
t        10    u - r - eI - w - ∂ - έ - aI
s        12    ε - i - I - m - p - eI - OΩ - έ - æ - t - L - ∂ - k - aI
b        8     i - I - Ω - aI - ∂ - Λ - æ - r - L - ε - aΩ - Э - Þ - eI
k        11    æ - Э - aI - Λ - Þ - Ω - L - OΩ - ∂ - m - s - ε - I - I
d        11    r - ∂ - Λ - I - u - eI - Þ - æ - Э - OΩ - ε - IƏ - i - j
m        10    Ə - Λ - Э - æ - Þ - ε - eI - a - aΩ - OΩ
n        7     Λ - j - Þ - ε - OΩ - ЭI - eI - i - aI
i        8     L - v - t∫ - k - z - s - t - n
ε        11    L - n - v - ∂ - k - g - dξ - b
ε∂       1     sln
g        9     eI - r - ε - L - Λ - Þ - aI - I - i - OΩ - Ω
r        11    ε - i - ∂ - u - OΩ - Λ - æ
a        4     sln - I - t∫ - f - m - s - t
э        6     sln - b - d - g - L
j        8     u - Э - ∂ - I∂
t∫       4     æ - a - έ - ε∂ - aI
L        6     aI - Ω - ε - eI - OΩ - I - æ - a
Λ        9     ð - p - n - s
し       7     r - I - æ - ∂ - Ω
∂Ω       2     t - ∂
dξ       6     Þ - n - ε - Λ - ЭI
OΩ       4     L - d - v - m
aI       3     sln - ∂
v        7     ε - ∂
έ        4     L - し
eI       4     dξ
Table 3.3 Phonemic classes which are not represented in the front edge layer of the syntactic
knowledge.
Category     Phone   Word example   Phonetic form
Vowels       Ω       Put            pΩt
Vowels       u       Pool           pul
Vowels       Ď       Bon voyage     bĎvw’jaξ
Diphthongs   ЭI      Boy            bЭI
Diphthongs   I∂      Here           hI∂
Diphthongs   ε∂      Hair           hε∂
Diphthongs   Ω∂      Tour           tΩ∂
Nasals       さ      Sing           siさ
Fricatives   z       Zeal           zil
Fricatives   ∫       Show           ∫OΩ
Fricatives   ξ       Measure        mεξ∂
The numerical representation of the Australian phone set specified in Table 3.1 is used to
facilitate the manipulation of the phonemic data in the syntactic knowledge estimator. This
numerical code is used to represent the phonemic units in the syntactic knowledge database
and the accumulator (in Fig. 2.2). In Table 3.1, the silent period between words is treated as
a separate code and can be identified by its duration, which is chosen to be longer than the
longest duration of any of the phonemic units. The longest duration measured was 487 ms, for
the diphthong /eI/ (ID = 15). Therefore, a speech segment is considered to be silence
if its duration exceeds 500 ms.
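The duration test above is a simple threshold; a minimal sketch, assuming duration is measured in milliseconds:

```python
# Sketch of the duration-based silence test: a segment is treated as silence
# when it lasts longer than 500 ms, which exceeds the longest measured phone
# duration (487 ms for the diphthong /eI/).

SILENCE_THRESHOLD_MS = 500

def is_silence(duration_ms: float) -> bool:
    return duration_ms > SILENCE_THRESHOLD_MS

assert not is_silence(487)   # longest measured phone, /eI/: not silence
assert is_silence(520)       # longer than any phone: silence
```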
To integrate the syntactic knowledge within the isolated word recognisor, a procedure
combining 'bottom-up' and 'top-down' processes is used. The lowest level of knowledge is the
phonemic knowledge, or knowledge of basic phonemic units, where the phone identification
numbers of Table 3.1 are used to represent the phonemic units in the syntactical knowledge.
Categorisation of the phonemic knowledge into front edge layer classes provides an efficient
structure for organising syntactic information. The phones extracted from words in the
lexicon are categorised first into front edge phonemic classes and then into phonemic
subclasses. Every phonemic cluster contains one phone from the front edge phonemic class
and at least one phone in a phonemic subclass. This generates a hierarchical structure for
the phonemic clusters, which has the advantage that once the front edge phone has been
classified, it automatically inherits all statistical information (probabilities) about its
subclasses from that classification.
Figure 3.3 shows how this classification scheme is implemented. Assume that the front edge
phonemic class G0 is one of the possible 35 front edge phonemic classes that were generated
from the first stage classification. The first subclass of G0 is classified as C011 for subclass 1
in level 1 in class 0. The subclass C012 refers to the subclass 2 of level 1 of class 0. The term
SC refers to the subclass of phones at levels deeper than the first subclass in the word. The
number following the mnemonic SC refers to the depth of the phone into the cluster. For
example, SC041 penetrates four levels away from the original front edge class G0. It refers to
subclass 1 of level 4 of class 0.
Figure 3.3 Graphical representation of data clusters.
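The labelling convention above (G0, C011, SC041) can be sketched with a small parser. This is an illustrative reading of the scheme, not code from the thesis; the label format assumed is a kind prefix (G, C or SC), a class digit, then optional level and subclass digits.

```python
# Sketch of the cluster-naming convention: G<class>, C<class><level><index>,
# SC<class><level><index>. The parser is hypothetical, for illustration only.
import re

def parse_label(label: str):
    """Return (kind, class, level, subclass) for a cluster label."""
    m = re.fullmatch(r"(G|C|SC)(\d)(\d)?(\d)?", label)
    kind, cls, level, index = m.groups()
    return kind, int(cls), int(level or 0), int(index or 0)

assert parse_label("G0") == ("G", 0, 0, 0)      # front edge class 0
assert parse_label("C011") == ("C", 0, 1, 1)    # subclass 1, level 1, class 0
assert parse_label("SC041") == ("SC", 0, 4, 1)  # subclass 1, level 4, class 0
```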
All the clusters of phonemic classes and subclasses are formed by this process. Phonemic
classes and subclasses are represented using their numerical IDs. Figure 3.4 shows an
example of the data cluster for the front edge phonemic class /t/. Thirteen words make up that
data cluster. The linkages show the sequence of phones in each word from left to right. The
words are ranked from the first subclass to the last in an order derived from the probabilities
of occurrence in the lexicon. The description of how the statistical probabilities are derived
can be found in Section 3.2.3.
59
Figure 3.4 Example of a data cluster for the front edge phonemic class /t/ (phones are
represented by their identification codes).
3.2.3 Data Organisation in the Syntactic Database
To construct the data file that contains the phonemic units of the syntactic database, the
phonemic classes were given priorities of sequenced appearance according to their
probabilities. As shown in Fig. 3.5, the order of the phonemic clusters reflects the degree of
priority for each phonemic class in the front edge layer, as well as for each phonemic subclass
or individual phone within the syntactic clusters. Once the order of clusters is obtained from
the probability values, those values are no longer required in the recognition process, as the
recall mechanism follows the order of positioning in the syntactical knowledge.
For example, the phonemic class /æ/ is located at the third level of priority among the front
edge phonemic classes. The system will access all of the phonemic clusters related to this
class when the set is activated for a recognition cycle. This cluster is shown in
Figure 3.5 below:
Figure 3.5 Bubble diagram of cluster number 3 of front edge phonemic class /æ/.
For each front edge phonemic class, the first level of subclasses was also ordered by
decreasing subclass probability. For example, as shown in Figure 3.5, the first level of
subclasses (the second phone in words beginning with the front edge phone /æ/) is in the
probability order 41 24 33 25 45, or /n t z d L/.
In all cases, the priority order at levels deeper than 2 is less significant than at the level 0 and
level 1 classes, and the overall probability values of phones occurring in the lexicon are used
to define the order. So the order at the higher levels of subclasses (at or above 2) is
determined by the number of occurrences of a phone within the lexicon. For
example, in Figure 3.5 the subclass 12 in level 2 leads to two subclasses in level 3: 29
and 45. Both of these subclasses lead to two different words, but subclass 45 is given lower
priority than subclass 29 because phone 45 has the lower overall priority in the context of
the lexicon. The localised probability of the two classes 29 and 45 is 0.5, as both are
represented by the same number of words in this cluster.
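The ordering rule described above can be sketched as a two-key sort: order first by localised word count, then break ties using the phone's overall frequency in the lexicon. The counts below are illustrative, not thesis data.

```python
# Sketch of the deep-level ordering rule: ties in localised probability are
# broken by the phone's overall lexicon frequency (higher frequency first).
local_word_count = {29: 2, 45: 2}          # both subclasses cover two words
overall_lexicon_count = {29: 140, 45: 95}  # assumed: phone 29 more frequent

order = sorted(local_word_count,
               key=lambda p: (-local_word_count[p], -overall_lexicon_count[p]))

assert order == [29, 45]  # phone 45 gets the lower priority
```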
Using this analysis method, all of the phonemic classes were scanned horizontally in level
steps, and the phonemic IDs were collected vertically in order and used to create the
syntactic knowledge database file. For example, Figure 3.6 shows a portion of that file, which
represents the front edge phonemic class and its related lines.
Figure 3.6 Portion of the syntactic database that represents cluster 4.
In Figure 3.6, the data units are organised in lines, with the lines in decreasing order of
priority. Each line contains a number of fields separated by spaces and terminated by a
semicolon. The first field contains the line ID or pointer, which is derived from the degree of
depth in the cluster. The ID of a phonemic class is represented by two symbols (from 00 to
0Z) for the 34 classes of the syntactical knowledge. A subsequent symbol (from 0 to 9)
indicates the depth of the phonemic level in the cluster. For example, the phonemic class /eI/
(with the least probability of occurrence) was assigned the address 0A0, and 017 means the
seventh level in the phonemic class 01.
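The addressing scheme above can be sketched as a small decoder, assuming the first two symbols of a line ID name the phonemic class and the third gives the depth. The helper name is hypothetical.

```python
# Sketch of the line-ID addressing scheme: two class symbols plus one depth
# symbol. Base-36 decoding of the depth symbol is an assumption made so that
# both digit and letter symbols can be handled.

def split_line_id(line_id: str):
    """Split a 3-symbol line ID into (phonemic class, depth)."""
    return line_id[:2], int(line_id[2], 36)

assert split_line_id("0A0") == ("0A", 0)  # class /eI/, front edge level
assert split_line_id("017") == ("01", 7)  # seventh level of class 01
```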
During the recognition process, the identification of any phonemic unit at any level triggers
the activation of the next level in the cluster. This does not apply to the data unit 46: this
phonemic unit activates the End of Word (EOW) signal and therefore does not lead to any
further phonemic level. In some levels there is only the end-of-process unit 46, the silence ID.
The data lines of the cluster are:
011 46 45 26 44 24 37 23 40 29 22 25 27 28 ;
012 3 17 11 32 15 6 37 14 19 18 40 20 1 ;
013 26 46 41 22 40 24 6 29 1 34 45 35 ;
014 24 1 3 25 46 32 34 5 41 12 34 ;
015 37 26 46 12 37 24 12 1 15 45 ;
016 6 1 41 12 46 32 26 2 45 ;
017 41 26 24 40 24 46 ;
018 46 37 ;
019 15 ;
01A 34 ;
01B 12 ;
01C 46 ;
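The line format above can be sketched with a small parser: a pointer field, space-separated phone IDs, and a terminating semicolon. Field names and the EOW constant below are illustrative.

```python
# Sketch of parsing one syntactic-database line as described in the text.
# The function name is hypothetical; 46 is the silence/end-of-word unit.

EOW_ID = 46

def parse_db_line(line: str):
    """Split one database line into its pointer and its phone-ID fields."""
    fields = line.strip().rstrip(";").split()
    return fields[0], [int(f) for f in fields[1:]]

pointer, ids = parse_db_line("017 41 26 24 40 24 46 ;")
assert pointer == "017"
assert ids == [41, 26, 24, 40, 24, 46]
assert EOW_ID in ids  # this level can terminate a word
```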
3.3 Determination of RUST-I Syntactic Knowledge: Example
In RUST-I, phones and words are treated as arbitrary items of data, so they are units of an
information source. From information theory, the self-information Ij conveyed by a
phone J in a contextual lexicon depends on the probability P(J) of occurrence of that
phone. If the occurrence of the phone J depends upon a finite number m of preceding levels
or phones, the information source is called an mth-order Markov source.
In RUST-I, m is taken to be 1, as probabilities at levels higher than 1 are too high to
contribute significantly to the optimisation of the knowledge representation. This is expected
in a database of 1357 words; higher orders could be useful with significantly larger
vocabularies. Therefore RUST-I is represented by a 1st-order Markov source. To represent
the system mathematically, consider the Australian English phone set (45 phones and
silence) forming a universal set A, where:
A = {I, i, …, sln}
n(A) = 46
where n(A) is the number of elements in A.
From Table 3.2, the number of possible front edge phones for the current knowledge
database is 35, and they form a sample space represented by the set O, where each phone J ∈ O
is a member of some words in the lexicon:
O = {ð, ∂, æ, I, h, w, Þ, f, p, t, s, b, k, d, m, n, i, ε, ε∂, g, r, a, Э, j, t∫, L, Λ, し, aΩ, dξ, OΩ,
aI, v, έ, eI}
n(O) = 35
It should be noted that:
O ⊂ A
The detection of a front edge phone by the adaptive phone recognisor initiates action by the
syntactic knowledge estimator to recall the cluster of phones related to that front edge phone
with its statistically related phonemic units. This is referred to as an event in the knowledge
and will initiate a specific set of linked lists which represent a cluster of words that are all
initiated from the same front edge phone and are part of the same phonemic class. If these
events can be regarded as independent sources of information, there is no relationship between
their probabilities; the detection of any front edge phone is an independent process. Let all
words in the lexicon be members of the set W, where
n(W) = 1357
The set O represents the front edge events of the set W members. Therefore, the probability
P(J) of a phone J ∈ O can be found using the relation:
P(J) = nc(J) / n(W)
where nc(J) is the number of occurrences of J at the front edge in the lexicon.
Table 3.4 depicts all probability values of the front edge phonemic classes of the set O along
with their frequency.
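The relation above can be checked numerically against Table 3.4. A minimal sketch, using the lexicon size n(W) = 1357 and two front-edge counts from the table:

```python
# Front edge probability P(J) = nc(J) / n(W), checked against two of the
# values reported in Table 3.4.

def front_edge_probability(nc: int, n_w: int = 1357) -> float:
    return nc / n_w

assert abs(front_edge_probability(156) - 0.1149594) < 1e-6  # P(ð), 156 words
assert abs(front_edge_probability(1) - 0.0007369) < 1e-6    # P(eI), 1 word
```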
The phones at the front edge of the set W can be considered sources of information; therefore
each class conveys a quantity called the amount of information. This quantity measures the
information conveyed by an event at the time of its detection, and has a nonlinear relationship
to the event's probability of occurrence: an event conveys a higher amount of information
when its probability of occurrence is lower. This quantity can be an indicator of the
independence of probability amongst members of the set O.
Consider a front edge phone, J ∈ O, that has a probability value of P(J), and all phones of the
set O are independent (each of them forms an independent source of information). Then the
amount of information of that phonemic class is obtained from the self-information that is
associated with that phone. From information theory, the self-information associated with this
phone can be obtained as follows:
Ij = −log2 P(J)   [bit]
Ij = −log2(10) · log10 P(J) ≈ −3.32 log10 P(J)
The values of Ij for each phone of the set O are shown in the last column of Table 3.4. The
items in Table 3.4 are organised in descending order of probability. It can be noted from the
table that when the probability of a front edge class is lower, the self-information associated
with that class becomes higher; this is represented graphically in Figures 3.7 and 3.8. So,
phonemic classes with a high probability of occurrence do not convey a high amount of
information to the system. On the other hand, phonemic classes with lower probabilities are
associated with a higher amount of information, indicating the uncertainty associated with
those phonemic classes. For example, the phonemic class of the diphthong /eI/ conveys
self-information of IeI = 10.4 bit, as it has the lowest probability among the front edge
phonemic classes.
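The self-information formula can be verified against the extremes of Table 3.4, a minimal sketch:

```python
# Self-information I_j = -log2 P(J), checked against the most and least
# probable front edge classes in Table 3.4.
import math

def self_information(p: float) -> float:
    return -math.log2(p)

assert abs(self_information(0.1149594) - 3.12) < 0.01   # I(ð)
assert abs(self_information(0.0007369) - 10.4) < 0.01   # I(eI)
```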
Table 3.4 Syntactic-knowledge front-edge phones set, their frequencies, probabilities and
self-information.
Phone Frequency Probability Self-information [bit]
ð 156 P(ð) = 0.1149594 I(ð) = 3.12
∂ 106 P(∂) = 0.0781134 I(∂) = 3.68
æ 82 P(æ) = 0.0604274 I(æ) = 4.05
I 81 P(I) = 0.0596904 I(I) = 4.06
h 79 P(h) = 0.0582166 I(h) = 4.1
w 74 P(w) = 0.054532 I(w) = 4.19
Þ 65 P(Þ) = 0.0478997 I(Þ) = 4.38
f 63 P(f) = 0.0464259 I(f) = 4.43
p 62 P(p) = 0.045689 I(p) = 4.45
t 61 P(t) = 0.0449521 I(t) = 4.47
s 57 P(s) = 0.0420044 I(s) = 4.57
b 56 P(b) = 0.0412675 I(b) = 4.6
k 48 P(k) = 0.0353721 I(k) = 4.82
d 44 P(d) = 0.0324244 I(d) = 4.94
m 38 P(m) = 0.0280029 I(m) = 5.15
n 34 P(n) = 0.0250552 I(n) = 5.32
i 28 P(i) = 0.0206337 I(i) = 5.59
ε 27 P(ε) = 0.0198968 I(ε) = 5.65
g 26 P(g) = 0.0191599 I(g) = 5.7
r 24 P(r) = 0.017686 I(r) = 5.82
a 19 P(a) = 0.0140014 I(a) = 6.15
Э 19 P(Э) = 0.0140014 I(Э) = 6.15
j 19 P(j) = 0.0140014 I(j) = 6.15
t∫ 16 P(t∫) = 0.0117907 I(t∫) = 6.4
L 16 P(L) = 0.0117907 I(L) = 6.4
Λ 14 P(Λ) = 0.0103168 I(Λ) = 6.59
し 11 P(し) = 0.0081061 I(し) = 6.94
aΩ 8 P(aΩ) = 0.0058953 I(aΩ) = 7.4
dξ 8 P(dξ) = 0.0058953 I(dξ) = 7.4
OΩ 4 P(OΩ) = 0.0029476 I(OΩ) = 8.4
aI 4 P(aI) = 0.0029476 I(aI) = 8.4
v 4 P(v) = 0.0029476 I(v) = 8.4
έ 3 P(έ) = 0.0022107 I(έ) = 8.8
eI 1 P(eI) = 0.0007369 I(eI) = 10.4
Figure 3.7 Probabilities of Phones in set O.
Figure 3.8 Self-information of Phones in set O.
At the second level of the syntactic knowledge (where m = 1), localised probabilities and self-
information are applied within each phonemic class to derive the statistical data associated
with the phonemic subclasses which are clustered within each front edge phonemic class. The
same calculations and formulae that are used at the first level (front edge level) are applied to
this first phonemic subclass.
The probabilistic values of the links between the phonemic classes on the front edge level and
their phonemic subclasses address the sequential distribution of the clusters in the syntactic
knowledge. Those values are computed as described above, where each phonemic class is
treated as a universal set that contains specific phonemic subclasses.
An example of the second-level localised probabilistic values is illustrated in Table 3.5. This
level contains a set of 156 words starting with the phone /ð/; call this set Oð, where n(ð) =
156. The first phonemic subclass E(ðI) in the table achieved the highest localised
probability of P(ðI) = 0.75, and so forth.
Table 3.5 Localised probabilistic values of phonemic subclasses in level 2 of the phonemic
set Oð.
Phonemic set Oð, n(ð) = 156
Sequence   Phonemic subclass and number of occurrences   Localised probability   Self-information [bit]
1 E(ðI) = 117 P(ðI) = 0.750 I(ðI) = 0.415
2 E(ð∂) = 110 P(ð∂) = 0.705128 I(ð∂) = 0.504
3 E(ðæ) = 13 P(ðæ) = 0.0833 I(ðæ) = 3.583
4 E(ðε) = 11 P(ðε) = 0.0705512 I(ðε) = 3.824
5 E(ðeI) = 9 P(ðeI) = 0.057692 I(ðeI) = 4.113
6 E(ði) = 3 P(ði) = 0.019230 I(ði) = 5.697
7 E(ðOΩ) = 2 P(ðOΩ) = 0.01282 I(ðOΩ) = 6.281
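The localised values in Table 3.5 follow from the same two formulas applied within the class, a minimal sketch reproducing the first rows:

```python
# Localised probabilities within the class Oð: P is the subclass count
# divided by n(ð) = 156, and the self-information is -log2 P.
import math

n_class = 156
subclass_counts = {"ðI": 117, "ð∂": 110, "ðæ": 13}

local_p = {k: v / n_class for k, v in subclass_counts.items()}
local_i = {k: -math.log2(p) for k, p in local_p.items()}

assert abs(local_p["ðI"] - 0.750) < 1e-3   # P(ðI) in Table 3.5
assert abs(local_i["ðI"] - 0.415) < 1e-3   # I(ðI)
assert abs(local_i["ð∂"] - 0.504) < 1e-3   # I(ð∂)
```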
3.4 Code Activator and Accumulator
The code activator is the controller of the syntactic knowledge estimator and is the link
between the basic phonemic knowledge of the adaptive phone recognisor and the syntactic
knowledge in the syntactic database. It has three main functions. The first function is to
browse the syntactic knowledge database and derive an estimate of the most likely phone to
occur first in a sentence or first in a word, given the pattern of phones that has been
collected in the accumulator. The second function is to monitor the PIRi outputs from the
adaptive phone recognisor and determine which phone sub-recognisor's output is the largest
that exceeds the threshold of 0.6 (where the maximum response value is 1, a complete
match); the code activator then feeds the ID code for that phone to the accumulator. The
third function is to determine the end of word (EOW) from a silence and to signal the
accumulator to release the identified word and start a new word.
Figure 3.9 shows the algorithm that implements the three functions of the code activator. The
code activator goes through an initialisation routine on power up, which involves the
following:
• zeroing the identified word in the accumulator;
• setting internal registers to predefined values;
• setting the pointer value to the beginning of the front edge level of the
syntactic database;
• setting up the activation function to enable the output from the most common
phone found first in a sentence.
Every time a phone is detected, the code activator moves further into the syntactic
knowledge database to find the next level of activation. Every search cycle uses the same
mechanism when accessing the data units. In this process the code activator operates as a
database engine: the initialisation routine, which loads the front edge level phone IDs, is
instigated, and the syntactic knowledge interface is then initiated to search for and find the
correct level of phonemic units. The idea of the front edge level significantly reduces the time
required for the code activator to browse through the syntactic database, as it has fewer data
units (Darjazini and Tibbitts, 1994).
Once the first word is found, as indicated by a silence being detected, the code activator
writes the phone IDs out to the accumulator, and the accumulator is subsequently instructed
to release the identified word.
The code activator starts navigating the syntactic database from the front edge level (the
highest probability). The IDs of the front edge phones are applied directly to the appropriate
Activation Control Lines ACLi; for example, ACL42 is high when the ID is 42. The code
activator then waits for the adaptive phone recognisor responses, which are represented by
the signal set PIR. A process is instigated to read the PIR signals and then check for any
above the threshold of 0.6. The maximum response is then selected from these phones; if the
ID indicates a silence, the pointer to the syntactic database is reset and the main process is
started again. The code activator therefore performs a sequential search through the
statistically ordered phones at the front edge level until a match is found (response > 0.6). In
the case of confusion, i.e. when more than one response occurs, the code activator selects the
phone that has the highest level of response from the PIR. If the search ends without a match,
an error message is delivered indicating an out-of-lexicon input.
ALGORITHM FOR CODE ACTIVATOR
% Initialisation routine
%   tells the accumulator to zero the identified word
%   sets internal registers
%   resets the database pointer
%   sets up the activation control lines to identify the front edge level
allocate memory;
open syntactic database file;
set database pointer to 1;
initialise I/O buffers;
initialise the accumulator ACC = 0;
EOWI = 0;
set counter = 1;
set found = 0;
% Search for the first phone in the sentence after initialisation
while not end of file
    read front edge - discard first field;
    read front edge pointers - discard first field;
    while not end of front edge list
        get ID(I) and its pointer;
        activate the relevant ACL(I);
        read PIR(I) from the adaptive phone recognisor;
        if PIR(I) > 0.6 then set found = 1;
        counter++;
    if found = 1
        find the maximum PIR(I);
        send ID(I) to the accumulator;
        get ID(I)'s associated pointer;
        move the control pointer to the value pointed to by ID(I)'s pointer;
        found = 0;
    else
        message "out of lexicon";
        go to the start of the routine;
% Search for the other phones
repeat until pointer = 5 or counter >= 13
    read level - discard first field;
    if content = 46 only
        EOWI = 1;
        go to the start of the routine;
    read level pointers - discard first field;
    while not end of level
        get ID(I) and its pointer;
        activate the relevant ACL(I);
        read PIR(I) from the adaptive phone recognisor;
        if PIR(I) > 0.6 then set found = 1;
    if found = 1
        counter++;
        find the maximum PIR(I);
        send ID(I) to the accumulator;
        get ID(I)'s associated pointer;
        move the control pointer to the value pointed to by ID(I)'s pointer;
        found = 0;
Figure 3.9 Algorithm of the code activator in pseudo-code form.
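The front edge search step of the algorithm can be sketched as runnable code. The PIR values below are simulated, not real recogniser outputs, and the function name is illustrative.

```python
# Sketch of the code activator's front edge search: walk the statistically
# ordered phone IDs, read the (simulated) recogniser responses, and pick the
# strongest response above the 0.6 threshold.

THRESHOLD = 0.6

def select_front_edge(ordered_ids, pir):
    """Return the phone ID with the largest PIR above threshold, else None."""
    candidates = [(pir.get(i, 0.0), i) for i in ordered_ids
                  if pir.get(i, 0.0) > THRESHOLD]
    if not candidates:
        return None                 # out-of-lexicon input
    return max(candidates)[1]       # highest response wins on confusion

pir = {37: 0.72, 12: 0.81, 5: 0.40}                 # simulated responses
assert select_front_edge([37, 12, 5], pir) == 12    # largest above threshold
assert select_front_edge([5], {5: 0.40}) is None    # no match: out of lexicon
```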
Figure 3.10 shows a block diagram of the accumulator, and Figure 3.11 illustrates the
algorithm for the accumulator. The data inputs to the accumulator are the phone identification
responses, PIR1 to PIR46, from the adaptive phone recognisor. These responses are
sequentially stored in the phone sequence stack, which operates as a serial to parallel register
of identified phones. The control input to the accumulator is the end of word identifier, EOWI
that informs the accumulator that a word boundary has reached and that the word can be
released onto the output. The output is the identified word (IW) (from 1 to 14 characters) in
the form of the numbers relating to the phones (sub-recognisors) identified. For example,
identification of the word 'please' (phonetically - /pliz/) would result in the following set of
numbers released from the phone sequence stack (22, 45, 2, 33). (See Table 3.1 for list of
numerical identification (ID) associated with each phone.)
Figure 3.10 Block diagram of the accumulator.
ALGORITHM OF ACCUMULATOR
do
    get ID(I);
    PIRi identified = ID(I);
until EOWI
IW = (PIRiID1 to PIRiID12);
Figure 3.11 Algorithm of the accumulator.
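The accumulator's behaviour can be sketched as runnable code: phone IDs are collected until the end-of-word marker arrives, then the identified word is released. The 'please' example uses the IDs (22, 45, 2, 33) given in the text.

```python
# Sketch of the accumulator: consume phone IDs until the end-of-word marker
# (the silence unit, ID 46), then release the word as a phone-ID sequence.

EOW_ID = 46

def accumulate(id_stream):
    """Collect phone IDs until the EOW marker and release the word."""
    word = []
    for phone_id in id_stream:
        if phone_id == EOW_ID:   # word boundary reached
            break
        word.append(phone_id)
    return tuple(word)

assert accumulate([22, 45, 2, 33, 46]) == (22, 45, 2, 33)  # 'please' /pliz/
```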
The functions of the Neuro-Slice Response collector (NSR) and the output selector as
shown in Figure 3.12 are combined in the same algorithm and hence program. The response
from the neuro-slices, NSRij, is a continuous variable between 0 and 1 that represents the
degree of match for the jth frame of the ith phone, and is stored in an ASCII file, with one
value of the output per line. These responses are inputs to the neuro-slice response collector
and are available simultaneously. The file is read and an average of the outputs is found and
stored as IPIRi. If ACLi for that sub-recognisor is zero, then the final output from output
selector is zero. Alternatively, if ACLi for that sub-recognisor is one, then the final output from
the output selector is equal to IPIRi and is stored in the final output file as PIRi. The algorithm
was implemented using MATLAB script.
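The collector-and-selector step above can be sketched in a few lines (the thesis used a MATLAB script; this Python version is an illustrative equivalent):

```python
# Sketch of the NSR collector and output selector: average the neuro-slice
# responses into IPIRi, then gate the final output PIRi with ACLi.

def collect_and_select(nsr_values, acl: int) -> float:
    ipir = sum(nsr_values) / len(nsr_values)  # average slice responses
    return ipir if acl == 1 else 0.0          # ACL gates the output

assert abs(collect_and_select([0.8, 0.7, 0.9], acl=1) - 0.8) < 1e-9
assert collect_and_select([0.8, 0.7, 0.9], acl=0) == 0.0
```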
3.5 Sub-recognisor: Structure
The structure of the sub-recognisors chosen for RUST-I is illustrated in Figure 3.12. It
consists of slices of smaller neural networks (referred to as the neuro-slices as opposed to
one large neural network). This type of architecture was chosen for two reasons. The first
reason was that using neuro-slices reduced the number of outputs per ANN and hence
reduced the number of PEs in each of the hidden layers of the ANNs. This effect is called
scaling, and is known to increase network accuracy and decrease network training time. The
second reason was that the development of this architecture was inherently linked to the
development of the syntactic knowledge and its effective use, and localising phone
recognition to one sub-recognisor assisted in this process.
Figure 3.12 Structure of the sub-recognisor.
The total number of neuro-slices for each sub-recognisor actually depends on the phone
duration, which varies from phone to phone and from time to time even for the same phone.
To overcome this hurdle, this number is set to the average value Mi (shown in Table 2.1) in
RUST-I as a compromise between implementation and performance. The output of each
neuro-slice is called the Neuro-Slice Response, NSRij. It measures the degree to which the
input frame data, DIj(12), j=1,2,…,Mi, matches the jth frame of the phone that the ith sub-
recognisor was trained on. In all the cases, i represents the sub-recognisor and j represents
the order of the neuro-slice within that sub-recognisor. Using a number of frames in the
recognition process to define the number of active neuro-slices for any one sub-recognisor is
advantageous. This results from the temporal allocation of the neuro-slices, as they
provide temporal cues of the phone, especially duration information. This has been found
beneficial for the recognition of speech sounds that are perceived using mainly
temporal cues and some spectral cues (Tibbitts, 1989; Lee and Dermody, 1992);
this technique achieves that by providing a mixture of both cues. The distribution
of the frames of a phone through the neuro-slices of a sub-recognisor is referred to as
temporal unfolding. Time is therefore an additional dimension within the structure of the
APR, as the number of frames presented to each sub-recognisor, Mi, varies across
sub-recognisors.
Whenever a sub-recognisor is activated by ACLi, its output will be enabled. The NSR
collector adds the NSRi outputs from each neuro-slice and generates the IPIR signal which is
used by the output selector activated by ACLi to produce the phone response signal PIRi.
Figure 3.13 shows the basic architecture of one neuro-slice of the APR. It is a fully
interconnected feed-forward network with 12 inputs, three hidden layers (24 - 12 - 6 PEs) and
one output in the output layer. The input layer takes each of the 12 elements of the MFCC
vector. The output layer contains one PE representing a measure of the match between the
input speech and the phone. This output, PIR, is a continuous variable between 0 and 1. A
match is considered to occur if this value is greater than 0.6.
Figure 3.13 Architecture of one neuro-slice.
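The neuro-slice computation can be sketched as a plain feed-forward pass. The weights below are random, so the output only demonstrates the shape of the computation (12 inputs, hidden layers of 24, 12 and 6 PEs, one sigmoid output), not actual phone recognition.

```python
# Sketch of one neuro-slice: a 12-24-12-6-1 MLP with sigmoid PEs, whose
# single output is compared against the 0.6 match threshold.
import math, random

random.seed(0)
LAYERS = [12, 24, 12, 6, 1]

def sigmoid(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))

# random weight matrices for each pair of adjacent layers (illustrative)
weights = [[[random.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_out)]
           for n_in, n_out in zip(LAYERS, LAYERS[1:])]

def forward(x):
    for w in weights:
        x = [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return x[0]  # single output PE: degree of match in [0, 1]

out = forward([0.5] * 12)   # a dummy 12-element MFCC vector
is_match = out > 0.6        # match threshold from the text
assert 0.0 <= out <= 1.0
```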
The structure in Figure 3.13 is called a multi-layer perceptron (MLP). The number of
layers and the number of processing elements in each hidden layer affect the performance of
the network. To determine the optimal structure of the network, a trial and error method was
followed, in addition to the recommendations for a starting point suggested by McCord and
Illingworth (1991). Seven different structures of MLPs were investigated before the structure
of 24/12/6 PEs per hidden layer was derived.
The error function at the output layer in the initial run is computed as
ε = (T − PEout)² ,    (3.1)
where T is the target and PEout is the net output. In order to ensure the global convergence
of the back-propagation algorithm, the following assumptions are needed (Magoulas and
Varhatis, 1999): (1) the error function ε is a real-valued function defined and continuous
everywhere in R^n; (2) for any two points ω and υ ∈ R^n, ∇ε satisfies the Lipschitz
condition
||∇ε(ω) − ∇ε(υ)|| ≤ L ||ω − υ|| ,    (3.2)
where L > 0 denotes the Lipschitz constant. If these assumptions are satisfied, the back-
propagation algorithm can converge globally by determining the learning rate in the
direction of minimising the error in each iteration.
The trials for the selection of the best structure started by testing a neural network with
three hidden layers, similar to the Lippmann and Gold model (Lippmann, 1987); the
investigation resulted in the proposed MLP structure of 12 - 36 - 50 - 25 - 1. The momentum
was ρ = 0.99, the threshold value of the PE was μ = 0.35, the number of iterations was set at
5000, and the weights were initially set to small, normally distributed random values. The
MLP was then trained using the back-propagation learning algorithm (Laurene, 1994). The
training set contained 20 stimuli consisting of the vowel /a/: five of the fifteen speakers and
all 4 words that contained /a/ were used. The vowel /a/ was chosen as it is known to contain
explicit formants, which makes it easier to recognise.
The MLP was then tested on 20 different stimuli of the vowel /a/ with five different
speakers saying the same 4 words. As shown in Table 3.6, this architecture achieved a
recognition rate of 40%.
Table 3.6 Simulation of seven architectures of MLP.
Series   Structure        Accuracy (%)
1        12-36-50-25-1    40
2        12-36-24-6-1     49
3        12-48-24-12-1    45
4        12-24-24-12-1    66
5        12-18-20-10-1    55
6        12-24-12-3-1     80
7        12-24-12-6-1     100
To observe the effect of altering the structure on the recognition performance of the MLP,
the number of PEs in the second and third layers was decreased and the number of PEs per
layer was made a multiple of six, to derive the structure 12 - 36 - 24 - 6 - 1. All other
parameters and training and testing conditions remained the same; this structure achieved a
slightly improved accuracy of 49% during training. The results of the other trials are shown
in Table 3.6. As more trials were performed, it was noted that the accuracy improved
markedly with the manipulation of the second and third layers only; therefore, only the
number of PEs in the third layer was increased, to derive a structure of 12 - 24 - 12 - 6 - 1.
All other parameters and training and testing conditions remained the same. This structure
produced an optimal accuracy of 100%, as shown in Table 3.6. The fast back-propagation
(FBP) algorithm (Technical Publications Class, 1993) was used to train the neuro-slices.
3.6 Conclusion
The language model described in this chapter, together with the lexicon words in their
contextual presence, forms the syntactical knowledge of the system. This syntactical
knowledge interacts with the neural networks to form the phonemic recognition block of the
system. The structure of the neuro-slice was also presented.
Chapter 4: Experimental Procedure
4.0 Introduction
In this chapter the performance of RUST-I is investigated. This work was part of the
original research conducted on non-standard speech samples. It will be shown in
later sections of this chapter that there is a need to carry out further testing on standard
speech samples, as explained in Chapter 5. Section 4.2 describes the training of all
the 46 sub-recognisors of the APR. Section 4.3 describes the testing of the sub-recognisors
using isolated phones, and Section 4.4 deals with the testing using isolated phones with the
isolated phone identification factor included. The whole system is tested on isolated word
recognition in Section 4.5.
In the testing procedures used in this chapter, there were two scores of interest:
1. The Self-Recognition Score (SRS) is the score of a sub-recognisor output, PIRi,
when the sub-recognisor is presented with the phone it was trained to
recognise.
2. The Misrecognition Score (MRS) is the score of a sub-recognisor output when
presented with any phone other than the one it was trained to recognise.
A confusion is defined as any MRS that is greater than 0.1.
All 45 phonemes were divided into seven subgroups:
1. vowels (i, I, ε, æ, a, Þ, Ď, Э, Ω, u, έ, ∂, Λ)
2. diphthongs (aI, eI, ЭI, aΩ, OΩ, I∂, ε∂, Ω∂)
3. stops (p, b, t, d, k, g)
4. nasals (m, n, さ)
5. fricatives (f, v, し, ð, s, z, ∫, ξ)
6. affricatives (t∫, dξ)
7. semi-vowels (h, r, j, w, L).
An Intra Subgroup Confusion (IASC) is confusion within a subgroup. An Inter Subgroup
Confusion (IRSC) is confusion across subgroups.
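The distinction between the two confusion categories reduces to a subgroup membership test. The sketch below illustrates it with an ASCII transliteration of a few phones; the mapping table is an assumption for illustration, not the thesis phone set:

```python
# Illustrative subgroup map (a small subset of the phone set,
# transliterated to ASCII labels for this sketch).
SUBGROUP = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop",
    "m": "nasal", "n": "nasal", "ng": "nasal",
    "f": "fricative", "s": "fricative", "z": "fricative",
}

def confusion_type(stimulus, response):
    """IASC if both phones share a subgroup, IRSC otherwise."""
    return "IASC" if SUBGROUP[stimulus] == SUBGROUP[response] else "IRSC"

print(confusion_type("t", "d"))   # IASC (both stops)
print(confusion_type("t", "s"))   # IRSC (stop vs fricative)
```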
4.1 Selection of Parameters and Initial Conditions
The transfer function of the sigmoid is given by

Y = f(s) = 1 / (1 + e^(-βs)),
where β is a constant in the range 0 to 1. This function was applied as the firing
function throughout the network. If the net stimulus to a processing element (PE)
exceeds the range of its transfer function, that PE is said to be saturated. The sigmoid
function applied to the RUST-I sub-recognisors accepted values between +6 and -6;
saturation occurs when a PE's net stimulus exceeds this range.
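A minimal sketch of this firing function and the saturation test, assuming the ±6 range described above (the function names are illustrative):

```python
import math

def sigmoid(s, beta=1.0):
    """Firing function Y = 1 / (1 + exp(-beta * s))."""
    return 1.0 / (1.0 + math.exp(-beta * s))

def is_saturated(net_stimulus, limit=6.0):
    """A PE is saturated when its net stimulus leaves the accepted range."""
    return abs(net_stimulus) > limit

print(round(sigmoid(0.0), 2))    # 0.5
print(is_saturated(7.2))         # True
print(is_saturated(5.0))         # False
```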
The connection weights to each of the 45 PEs within each sub-recognisor are initialised to
small random values. The fast back-propagation (FBP) algorithm (Technical
Publications Group, 1993) was used to adjust the weights and minimise the global error.
Table 4.1 shows the learning rates and momentum terms for all layers and for training
and testing. These values were derived on a trial and error basis by monitoring the RMS
error and weight saturation. The choice of learning rate and momentum term was shown
to affect the training speed of the network, the stability of the RMS error curve and/or
saturation of the PEs. For example, using the default values of learning rate (input 0.5,
1st hidden 0.25, 2nd hidden 0.2, 3rd hidden 0.15) and a momentum term of 0.4 with 10000
iterations, the RMS error jumped to a normalised value of 1 and subsequently changed very
little. The saturation levels of the PEs in all hidden layers showed that these PEs
reached saturation (the first hidden layer after only 100 iterations) and the weights did
not change after that, leading to no decrease in error and no further learning. With the
values of learning rate defined in Table 4.1 and 2000 iterations, the RMS error initially
jumped to a normalised value of 1 and subsequently dropped to near zero as shown in
Figure 3.14.
Table 4.1 Optimum learning rates and momentum terms for all layers
during training and testing.
              Training                        Testing
Layer         Learning Rate  Momentum term    Learning Rate  Momentum term
Input         0.25           0.5              0.15           0.9
1st hidden    0.125          0.25             0.075          0.25
2nd hidden    0.0313         0.0625           0.0188         0.0625
3rd hidden    0.0019         0.0039           0.0012         0.0039
Figure 4.1 RMS error curves for training with adjusted parameters.
4.1.0 Further Results on Training and Testing
Two different data sets were used with the MLP within the neuro-slice: one for training
and the other for testing. Both data sets contained the same phones spoken by different
speakers, extracted from different words or from different positions in the same word.
For example, the phone /m/ and the phone /L/ were extracted from two different positions
in the word 'multimillionaire'.
Speakers labelled as 1, 2, 3, 4 and 5 were used for training. The first three were male and
the last two female. Speakers labelled as 6, 7, 8, 9 and 10 were used for testing. The first
three were male and the last two female. Table 4.2 shows the number of training and
testing tokens (phone samples) used for each sub-recognisor. For example, column 1, row
1 of Table 4.2 shows that the phone /I/ sub-recognisor has identifier 1 and is represented
in 3 different words spoken by all speakers, so there are 15 different examples of this
phone for both training and testing in columns 3 and 4. The second column of Table 4.2
contains both the phone identifier and the sub-recognisor identifier (separated by a slash
"/").
Table 4.2 Number of training and testing tokens used for each sub-recognisor.
Phone ID Train Test Phone ID Train Test
I 1/I1 15 15 t 24/T 5 5
i 2/I2 25 25 d 25/D 10 10
ε 3/A1 20 20 k 26/K 5 5
æ 4/A2 10 10 g 27/G 5 5
a 5/A3 20 20 f 28/F 10 10
Þ 6/A4 5 5 v 29/V 5 5
Ď 7/A5 5 5 し 30/THE 5 5
Э 8/O1 5 5 ð 31/THI 15 15
Ω 9/O2 5 5 s 32/S 5 5
u 10/O3 15 15 z 33/Z 5 5
έ 11/A6 5 5 ∫ 34/SH 10 10
∂ 12/A7 15 15 ξ 35/JH 25 25
Λ 13/A8 10 10 h 36/H 5 5
aI 14/AI 20 20 r 37/R 5 5
eI 15/EI 30 30 t∫ 38/TCH 5 5
ЭI 16/OI 5 5 dξ 39/DJH 30 30
aΩ 17/AU 5 5 m 40/M 35 35
OΩ 18/OU 15 15 n 41/N 5 5
I∂ 19/IA 5 5 ŋ 42/MNG 10 10
ε∂ 20/EA 10 10 j 43/J 10 10
Ω∂ 21/UA 5 5 w 44/W 35 35
p 22/P 55 55 L 45/L 80 80
b 23/B 30 30 silence 46/slc 20 20
The procedure used to prepare training files was to separate each frame of MFCC
coefficients for all the examples of training tokens and place them in separate files so that
each neuro-slice was trained on its appropriate frame independently. Table 4.3 shows an
example of the sequential order of presentation in terms of the phone id (P), example
number (E), frame number (F), speaker number (S) and word number (W). For example,
phone 1 is extracted from words 1, 2 and 3 from each of speakers 1, 2, 3, 4 and 5. The
number of examples differs for each phone as shown in Table 4.2 and is referred to as j
for the training files.
Table 4.3 Example of the sequential order of presentation in terms of the phone ID (P),
example number (E), frame number (F), speaker number (S) and word number (W).
[Table body not fully recoverable from the source. For each example E of a phone P, the
table lists the frame numbers F = 1 to 6 against the word number W and the speaker
numbers S = 1 to 5; for instance, example 1 of phone 1 comprises frames 1 to 6 of word 1
for each of speakers 1 to 5.]
The data input file used for training has the format shown in Figure 4.2, where every
example of each frame for each phone is placed in a separate file. Each file has j tokens.
MFCC for frame 1, phone 1, word 1, speaker 1
MFCC for frame 1, phone 1, word 1, speaker 2
MFCC for frame 1, phone 1, word 1, speaker 3
MFCC for frame 1, phone 1, word 1, speaker 4
MFCC for frame 1, phone 1, word 1, speaker 5
MFCC for frame 1, phone 1, word 2, speaker 1
MFCC for frame 1, phone 1, word 2, speaker 2
MFCC for frame 1, phone 1, word 2, speaker 3
MFCC for frame 1, phone 1, word 2, speaker 4
MFCC for frame 1, phone 1, word 2, speaker 5
Figure 4.2 Format of the data input training file.
An example of an input data training file is shown below. The file is in ASCII format.
Each record consists of twelve normalised real numbers representing the MFCC vector,
followed by the required target separated by an ampersand. The values in the input fields
are separated by a space. For example:
0.615297 0.124238 0.095474 0.055436 0.084756 0.191427 0.037501 0.083012
0.183048 0.094391 0.045078 0.094451 & 1.0000
0.599774 0.006463 0.102975 0.048325 0.196358 0.143246 0.081053 0.022370
0.148832 0.083293 0.064328 0.021072 & 1.0000
The first record is the first frame of the vowel /a/ spoken by speaker 1 from word 1,
followed by the desired output. The second record is the input and desired output for the
first frame of the vowel /a/ spoken by speaker 2 from word 1. The pattern continues, with
the subsequent records giving the input and desired output for the first frame of the
vowel /a/ spoken by the other speakers.
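Under the format just described, one record can be parsed into its MFCC vector and target as in the sketch below (the function name is hypothetical):

```python
def parse_training_line(line):
    """Split one record into its 12 MFCC values and the target,
    using the ampersand as the separator."""
    features, target = line.split("&")
    mfcc = [float(v) for v in features.split()]
    return mfcc, float(target)

# First record from the example above.
line = ("0.615297 0.124238 0.095474 0.055436 0.084756 0.191427 "
        "0.037501 0.083012 0.183048 0.094391 0.045078 0.094451 & 1.0000")
mfcc, target = parse_training_line(line)
print(len(mfcc), target)   # 12 1.0
```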
The MLP utilises supervised learning, so the desired outputs are presented with the
inputs to the network in the training file only; the desired output is not present in the
testing file. The testing data file is also in ASCII format and consists of records of
twelve normalised real numbers representing the MFCC vector. The order of the test file
is random over phone, speaker and word. In testing, a match between the testing input set
and the training set was assumed if the output was greater than 0.60. Thus, an output
between 0.60 and 1.00 represented a "correct response", while any other response was
considered to be a "no match".
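This decision rule can be sketched as follows (assuming the boundary value 0.60 itself counts as a match; the function name is illustrative):

```python
def classify_output(output, threshold=0.60):
    """Map a sub-recognisor output to a recognition decision."""
    return "correct response" if threshold <= output <= 1.00 else "no match"

print(classify_output(0.82))   # correct response
print(classify_output(0.41))   # no match
```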
The exit condition from the fast back-propagation (FBP) algorithm during training was
the number of iterations, which was set at 2000. The exit condition during testing was
the error falling below the default minimum. The output from the testing of each
neuro-slice was stored in a separate file, with one response per line. This output is the
Neural Net Output, NNOij, for the ith PE and jth neuro-slice.
4.1.1 Confusion Matrix
A confusion matrix records sub-recognisor scores against the presented stimuli. It is a
grid in which a number on the diagonal indicates a correct response to the input
stimulus, and a number either side of the diagonal indicates the degree to which the
sub-recognisors identified other phones. The numbers in each square are the outputs from
each sub-recognisor. An error occurs if an off-diagonal number is greater than 0.6. This
representation was used because it shows the error obtained in phone recognition and how
this is influenced by syntactic knowledge. Table 4.4 shows an
example of a confusion matrix as used to record the outputs from the sub-recognisors.
The y axis is the stimulus and the x axis is the response (PIRi). Any score on the diagonal
represents correct response to the applied stimulus. Any non-zero off-diagonal score
represents an error.
Table 4.4 Example of the confusion matrix.
Response (PIRi)
Stimulus I i ε æ a
I 0.82 0.1 0.3171 0.05 0
i 0.1 0.81 0.15 0.09 0
ε 0.15 0.2 1.0 0.4812 0.08
æ 0 0 0.19 0.9998 0
a 0 0 0.1 0 1.00
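The off-diagonal error rule can be expressed as a short sketch. The matrix below contains hypothetical scores, not the values of Table 4.4:

```python
def confusion_errors(matrix):
    """Return (stimulus, response) index pairs where an off-diagonal
    score exceeds the 0.6 error threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, score in enumerate(row)
            if i != j and score > 0.6]

# Hypothetical 3x3 confusion matrix: rows = stimuli, columns = responses.
m = [[0.82, 0.10, 0.31],
     [0.10, 0.81, 0.15],
     [0.15, 0.70, 0.90]]
print(confusion_errors(m))   # [(2, 1)]
```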
4.2 Training the Adaptive Phone Recognisor
This section describes the results from the primary training of the APR on individual
phones. At the beginning, each sub-recognisor was trained on the relevant correct phone
extracted from five speakers (IDs 1, 2, 3, 4 and 5). The data set for one sub-recognisor
consisted of all representations of the one phone from all five speakers.
Table 4.5 summarises the SRS results of the primary training session of the APR; the
table shows the maximum and minimum values of the responses for all types of phones. The
confusion matrices for all training speakers were obtained. It can be seen from the table
that the vowels and the semivowels achieved the best results; this is expected because of
the explicit spectral nature of those phones.
Table 4.5 Summary of the primary training session of the APR.
Phone Group    Vowels &      Stops, Fricatives    Nasals    Semivowels    Silence
               Diphthongs    & Affricatives
SRS min        0.92          0.81                 0.79      0.80          0.70
SRS max        1.00          0.99                 0.98      1.00          0.70
All MRSs for the training set were in the range 0.07 to 0.6, below the lowest SRS of 0.70
(for silence), meaning that there will not be any confusion between any phone and
silence; the system can therefore distinguish between a sound and silence. Table 4.6
summarises the most notable MRSs, i.e. the highest Intra Subgroup Confusion (IASC)
values within the phone subgroups.
Table 4.6 Summary of the most remarkable IASCs.
Phone Group       VWL      DPH        STP     FR      AFR       NS      SVWL
Phone-to-Phone    Ω to έ   aΩ to OΩ   t to k  ð to θ  t∫ to dξ  ŋ to m  -
IASC              0.35     0.35       0.60    0.49    0.22      0.28    0.00
VWL: Vowels, DPH: Diphthongs, STP: Stops, FR: Fricatives, AFR: Affricatives, NS: Nasals, SVWL: Semi-vowels
The maximum IRSC for vowels occurred with the semivowel subgroup, when applying the
semivowel /r/ to the sub-recognisor of the phone /a/, which achieved an MRS of 0.28. In
the case of diphthongs, the only IRSC greater than 0.0 occurred when applying the vowel
/∂/ to the sub-recognisor of the phone /ε∂/, which resulted in an MRS of 0.13. The
maximum IRSC for the stops occurred with the affricatives, when applying the affricative
/t∫/ to the sub-recognisor /t/, which resulted in an MRS of 0.38. The IRSCs for
fricatives were low (<= 0.1); the maximum occurred when applying the semivowel /h/ to the
sub-recognisor /s/, which achieved an MRS of 0.10. The nasals had an IRSC of zero with
every other subgroup. The maximum IRSC for affricatives occurred when applying the stops
/t/ and /d/ to the sub-recognisor /t∫/, which resulted in an MRS of 0.34. The two highest
IRSCs for semivowels occurred when applying the fricative /s/ to the semivowel /h/, and
the vowel /Ω/ to the semivowel /w/, both of which resulted in an MRS of 0.15.
In conclusion:
• At the end of the training session for the APR, the results show that SRSs are
higher than MRSs, which allows the module to pass Experiment One.
• A potential problem area is that some sub-recognisors achieved MRSs close to
their SRSs.
4.3 Experiment One: Operation of Each Sub-recognisor without the
Syntactical Knowledge
Experiment One was designed to measure the performance of each of the sub-recognisors
on isolated phones before syntactic knowledge is included. The overall performance of
RUST-I as an IWR is dependent on its ability to recognise individual phones. RUST-I
requires that the SRS for the correct phone be above the minimum threshold to be
considered for syntactic knowledge evaluation. During the experiment the ACLs to the APR
were deactivated so that there was no input from the syntactic knowledge estimator.
The aim of this experiment was firstly to determine the level of confusion that occurred for
phones without syntactic knowledge and secondly to determine the required threshold of
output for recognition of the correct response (self-recognition score - SRS) from the sub-
recognisors. Unique test data not used in training (from speakers 6, 7, 8, 9 and 10) was
provided for this experiment.
4.3.1 Input Stimuli
The stimulus data set presented to the adaptive phone recognisor in this experiment
contains the same phone set applied in training the neural nets, but now spoken by
different speakers. The testing set contains one token of each of the 45 unique phones
derived from speakers 6, 7, 8, 9 and 10. The new speaker set used in testing thus ensures
speaker independence for the system.
4.3.2 Experimental Method
The inputs to the sub-recognisor are the 12 Mel-scale frequency cepstral coefficients
(MFCC) for each of the Mi frames of the ith sub-recognisor. The number of frames, Mi,
determines the number of neuro-slices in each sub-recognisor. For example, the phone /p/
is the 22nd sub-recognisor and has 6 frames associated with it, so 6 neuro-slices are
required in this sub-recognisor. For the phone /p/, six sets of 12 MFCCs were applied to
the neuro-slices simultaneously and collected in the neuro-slice collector. The output
from the neuro-slice collector, IPIR22, was a value from 0.00 to 1.00 that measured the
degree of matching for that sub-recognisor.
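The role of the neuro-slice collector can be sketched as below. The combination rule shown (averaging the neuro-slice outputs into one matching score) is an assumption for illustration, as are the example values:

```python
# Sketch of a sub-recognisor evaluation: one score per neuro-slice
# (frame), combined by the collector into a single matching score.
# Averaging is assumed here for illustration only.
def sub_recognisor_score(frame_scores):
    return sum(frame_scores) / len(frame_scores)

# e.g. the /p/ sub-recognisor with 6 hypothetical neuro-slice outputs:
print(round(sub_recognisor_score([0.7, 0.8, 0.75, 0.9, 0.6, 0.65]), 2))   # 0.73
```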
Representation of Raw Data: All 225 tokens (45 phonemes by 5 speakers) were applied
to each of the 46 sub-recognisors and the IPIRi outputs were measured. The IPIRi output
from each sub-recognisor was then stored in an ASCII file and represented graphically as
a 3-D confusion matrix. Both the 2-D and 3-D confusion matrices are available for the
speakers from the test set, showing the maximum output responses (IPIRi) only.
Representation of Significant Confusions: Tables were created to represent the output of
a sub-recognisor for its correct stimulus (the SRS) for all speakers in the test set. These
tables summarised the IPIRi output for each sub-recognisor when presented with its true
stimuli, i.e. the SRS.
Tables were also created showing the IASC averaged over all speakers in the test set and
for each of the six subgroups. These tables were derived to look at the influence of the
place of articulation on phone confusion, and to be used to determine IASCs of the APR
and also assist in the derivation of the appropriate threshold level for that subgroup. The
tables contain the average output from all sub-recognisors in a subgroup and in response to
input stimuli from that subgroup.
It is expected that confusions may occur across similar subgroups (IRSC), such as
between vowels and diphthongs, semivowels and vowels or diphthongs, stops and
affricatives, or between fricatives and affricatives. Tables were also created showing
the main confusions for each phone input and for all speakers over all phones in
subgroups.
These tables are used to determine the main confusions and so identify possible errors in
the system, investigate the speaker independence of the phone recognisor, and assist in the
derivation of an appropriate level of threshold for the system.
4.3.3 Results
The results are presented in two formats as described in Section 4.3.2 and are presented in
Section 4.3.3 respectively. The recognition decision was evaluated by the matching scores
collected at the NSR collector end for every sub-recognisor. Results were represented by
two forms. The first is the self-recognition score (SRS) which is the immediate phone
identification response (IPIRi) appeared at the output of the neuro-slices response collector
of the ith sub-recognisor when presented at its input to the phones ith. The second is the
misrecognition score (MRS) which is the immediate phone identification response (IPIRi)
appearing at the output of the neuro-slices response collector of the ith sub-recognisor when
presented at its input to the jth phone.
Confusion Matrix: The responses of all sub-recognisors for all input stimuli of the five
testing speakers (numbers 6 to 10) are represented in the form of confusion matrices,
presented as tables. All speakers showed similar trends, which are verified in the
following tables. The majority of confusions occurred within a subgroup (intra) rather
than across subgroups (inter), meaning that place of articulation was confused rather
than manner of articulation. Some exceptions occurred consistently for all speakers.
These were low-level confusions (from 0.05 to 0.35) of the vowels /I/ and /i/ with the
semivowel /j/, the vowels /Ω/ and /u/ with the semivowel /w/, the vowel /a/ with the
semivowel /r/, the diphthongs /aΩ/ and /ε∂/ with the vowels /Ď/ and /έ/, between the
affricatives and some of the stops, and of silence with low-intensity consonants (stops,
fricatives and affricatives).
Figures 4.3a and 4.3b show the 3-D graphical representation of the full confusion
matrix for speaker 9 from the right and left side of the diagonal respectively. The
X-axis represents the identification of the presented stimuli, the Y-axis the
sub-recognisor that responded, and the vertical Z-axis the intensity or amplitude of the
response. The highest scores are shown to be centred on the diagonal (> 0.60);
off-diagonal scores tend to be between 0.20 and 0.60. There is evidence of clustering of
data such that confusions occur mainly within subgroups, i.e., IASC.
Figure 4.3(a) 3-D representation of the full confusion matrix of speaker 9 (right side
view).
Figure 4.3(b) 3-D representation of the full confusion matrix of speaker 9 (left side view).
To complement the views of Figures 4.3a and 4.3b, Figure 4.4 illustrates the 2-D
graphical representation (top view) of the full confusion matrix for speaker 9. In this
diagram it is easier to see the symmetry of stimulus with response. For example, if a
stimulus X was partially recognised by sub-recognisor Y, then stimulus Y was also
partially recognised by sub-recognisor X. The scores may differ, but the similarity in
the two signals will be coded into both sub-recognisors. The regions of the graphs in
Figures 4.3 and 4.4 are segmented into subgroups, i.e., vowels, diphthongs and so on. The
confusion is shown to appear mainly within those subgroups, indicating that the place of
articulation is the main source of confusion for the APR for this speaker, as it is for
human listeners.
Figure 4.4 2-D representation of the confusion matrix of speaker 9.
Self-Recognition Scores (SRS) for All Speakers: Tables 4.7a-e contain the SRS or actual
output (IPIRi) from each sub-recognisor for all the speakers in the test set (Speakers 6 to
10) when stimulated only with the correct stimulus for that sub-recognisor. These tables
therefore contain the values from the diagonals of the confusion matrices as shown below.
Table 4.7(a) Responses of the sub-recognisors for expected input stimulus - speaker 6.
Phone Response Phone Response Phone Response
I 0.80 ЭI 0.83 ð 0.68
i 0.81 aΩ 0.89 s 0.59
ε 0.95 OΩ 0.58 z 0.70
æ 0.92 I∂ 0.78 ∫ 0.59
a 0.92 ε∂ 0.56 ξ 0.52
Þ 0.91 Ω∂ 0.51 t∫ 0.50
Ď 0.52 p 0.56 dξ 0.62
Э 0.71 b 0.50 m 0.79
Ω 0.59 t 0.61 n 0.58
u 0.80 d 0.85 ŋ 0.55
έ 0.90 k 0.83 h 0.51
∂ 0.92 g 0.55 r 0.81
Λ 0.91 f 0.59 j 0.90
aI 0.86 v 0.71 w 0.59
eI 0.90 θ 0.55 L 0.50
Table 4.7(b) Responses of the sub-recognisors for expected input stimulus – speaker 7.
Phone Response Phone Response Phone Response
I 0.81 ЭI 0.80 ð 0.55
i 0.81 aΩ 0.58 s 0.58
ε 1.00 OΩ 0.53 z 0.80
æ 0.95 I∂ 0.81 ∫ 0.53
a 0.99 ε∂ 0.59 ξ 0.83
Þ 0.98 Ω∂ 0.80 t∫ 0.78
Ď 0.59 p 0.72 dξ 0.55
Э 0.80 b 0.59 m 0.83
Ω 0.83 t 0.65 n 0.54
u 0.87 d 0.75 ŋ 0.56
έ 0.92 k 0.77 h 0.74
∂ 1.00 g 0.55 r 0.58
Λ 0.95 f 0.59 j 0.94
aI 0.86 v 0.74 w 0.85
eI 0.83 θ 0.71 L 0.82
Table 4.7(c) Responses of the sub-recognisors for expected input stimulus – speaker 8.
Phone Response Phone Response Phone Response
I 0.81 ЭI 0.82 ð 0.73
i 0.80 aΩ 0.55 s 0.84
ε 0.99 OΩ 0.54 z 0.79
æ 0.98 I∂ 0.91 ∫ 0.78
a 0.99 ε∂ 0.88 ξ 0.53
Þ 0.98 Ω∂ 0.91 t∫ 0.59
Ď 0.98 p 0.79 dξ 0.79
Э 0.78 b 0.84 m 0.90
Ω 0.79 t 0.52 n 0.55
u 0.88 d 0.89 ŋ 0.60
έ 0.98 k 0.79 h 0.59
∂ 0.99 g 0.55 r 0.55
Λ 0.96 f 0.74 j 0.92
aI 0.79 v 0.59 w 0.90
eI 0.87 θ 0.55 L 0.82
Table 4.7(d) Responses of the sub-recognisors for expected input stimulus – speaker 9.
Phone Response Phone Response Phone Response
I 0.82 ЭI 0.86 ð 0.74
i 0.81 aΩ 0.88 s 0.85
ε 1.00 OΩ 0.92 z 0.80
æ 0.99 I∂ 0.89 ∫ 0.79
a 1.00 ε∂ 0.93 ξ 0.83
Þ 0.99 Ω∂ 0.95 t∫ 0.80
Ď 0.99 p 0.62 dξ 0.70
Э 0.79 b 0.85 m 0.91
Ω 0.80 t 0.69 n 0.69
u 0.89 d 0.85 ŋ 0.61
έ 0.99 k 0.80 h 0.70
∂ 1.00 g 0.79 r 0.92
Λ 0.97 f 0.75 j 0.96
aI 0.89 v 0.80 w 0.91
eI 0.91 θ 0.76 L 0.81
Table 4.7(e) Responses of the sub-recognisors for expected input stimulus - speaker 10.
Phone Response Phone Response Phone Response
I 0.73 ЭI 0.81 ð 0.70
i 0.70 aΩ 0.52 s 0.68
ε 0.85 OΩ 0.63 z 0.79
æ 0.80 I∂ 0.65 ∫ 0.55
a 0.95 ε∂ 0.61 ξ 0.57
Þ 0.65 Ω∂ 0.54 t∫ 0.75
Ď 0.62 p 0.75 dξ 0.55
Э 0.70 b 0.65 m 0.85
Ω 0.75 t 0.60 n 0.65
u 0.85 d 0.80 ŋ 0.53
έ 0.89 k 0.54 h 0.70
∂ 0.82 g 0.74 r 0.85
Λ 0.97 f 0.70 j 0.90
aI 0.63 v 0.80 w 0.88
eI 0.70 θ 0.54 L 0.54
The minimum SRS for these five speakers varied from 0.50 for speaker 6 to 0.61 for
speaker 9. The average vowel SRS per speaker varied from 0.79 for speaker 10 to 0.93 for
speaker 9. The overall average SRS for all vowels over all speakers was 0.87. Vowels
generally produced the highest SRS values of all the subgroups, but the vowel that
obtained the lowest values varied across speakers. The vowels /a/ and /Λ/ had
consistently high SRSs (0.90 to 1.00) across all speakers. These results are unique to
Australian vowels (Section 2.3).
The average diphthong SRS per speaker varied from 0.63 for speaker 10 to 0.95 for
speaker 9. The overall average SRS for all diphthongs over all speakers was 0.77. No
diphthongs consistently obtained lower SRS values but the diphthong /ЭI/ had a
consistently high SRS (above 0.80) across all speakers. Speaker 9 had a much higher
average diphthong score (0.95) than any other speaker.
The average stop SRS per speaker varied from 0.65 for speakers 6 and 7 to 0.83 for
speaker 9. The overall average SRS for all stops over all speakers was 0.71. The stop /t/
obtained lower SRS (0.52 to 0.69) for all speakers. The stop /d/ had a consistently high
SRS (above 0.75) across all speakers. Speaker 9 had a much higher average stop score
(0.83) than any other speaker.
The average nasal SRS per speaker varied from 0.64 for speakers 6 and 7 to 0.77 for
speaker 9. The overall average SRS for all nasals over all speakers was 0.68. The nasals
/n/ and /ŋ/ consistently obtained lower SRSs (0.53 to 0.65) for all speakers. The nasal
/m/ had a consistently high SRS (above 0.79) across all of the speakers. Again speaker 9
had a much
higher average nasal SRS (0.77) than any other speaker.
The average fricative SRS per speaker varied from 0.62 for speaker 6 to 0.79 for speaker
9. The overall average SRS for all fricatives over all speakers was 0.69. The fricative
/z/ had a consistently high SRS (above 0.70) across all speakers. All other fricatives
obtained varied SRSs, which generally tended to be good (above 0.69) except in some cases
for the fricative /ξ/. The highest average SRS for the fricatives was obtained by speaker
9 (0.79).
The average affricative SRS per speaker varied from 0.56 for speaker 6 to 0.75 for
speaker 9. The overall average SRS for all affricatives over all speakers was 0.65. The
affricative /t∫/ obtained a higher SRS for three speakers (0.75 to 0.80). The affricative
/dξ/ had a higher SRS (0.7 to 0.79) for two speakers. For both affricatives, speaker 9
had a much higher average SRS (0.75) than any other speaker.
The average semivowel SRS per speaker varied from 0.66 for speaker 6 to 0.82 for
speaker 9. The overall average SRS for all semivowels over all speakers was 0.74. No
semivowel consistently obtained lower values but the semivowel /j/ had consistently
higher SRSs (above 0.81) across all speakers. Speaker 9 had a much higher average
semivowel SRS (0.87) than any other speaker.
Average Confusion Response for Subgroups: Table 4.8a shows the vowel stimuli presented
versus the average MRS across all speakers and across all vowel sub-recognisors. For all
members of this subgroup the SRS was always higher than the MRS achieved by any other
sub-recognisor. Table 4.8b shows the three most common confusions and their associated
MRS across speakers for just the members of the vowel subgroup.
Table 4.8(a) Vowels confusion matrix - Stimuli presented versus sub-recognisor
responses.
I i ε æ a Þ Ď Э Ω u έ ∂ Λ
I .79 .14 .24 .14 .05 .00 .03 .01 .02 .04 .57 .00 .00
i .13 .78 .32 .13 .02 .00 .00 .00 .00 .00 .14 .00 .00
ε .24 .23 .96 .46 .13 .10 .09 .04 .01 .06 .23 .03 .05
æ .00 .07 .23 .91 .00 .15 .19 .18 .13 .00 .15 .00 .26
a .00 .00 .12 .00 .97 .00 .00 .41 .00 .00 .13 .14 .26
Þ .00 .00 .00 .02 .00 .90 .00 .00 .00 .00 .00 .00 .00
Ď .00 .01 .00 .05 .00 .02 .74 .00 .00 .00 .00 .02 .00
Э .00 .00 .00 .22 .19 .28 .15 .76 .22 .24 .12 .23 .38
Ω .00 .01 .02 .00 .00 .00 .00 .25 .75 .57 .63 .04 .33
u .00 .00 .00 .00 .00 .09 .09 .20 .42 .86 .09 .00 .00
έ .08 .00 .45 .32 .00 .00 .00 .00 .27 .00 .94 .00 .25
∂ .00 .00 .00 .53 .06 .00 .00 .00 .00 .00 .00 .95 .00
Λ .00 .00 .00 .35 .22 .07 .18 .24 .19 .00 .01 .00 .95
Table 4.8(b) Three most common confusions across speakers for the vowel subgroup.
    Speaker 6                  Speaker 7                  Speaker 8                  Speaker 9                  Speaker 10
I:  έ-0.45, æ-0.33, a-0.24   | έ-0.60, i-0.21, æ-0.12   | έ-0.60, ε-0.50, i-0.20   | έ-0.60, j-0.40, ε-0.31   | έ-0.61, j-0.39, ε-0.31
i:  ε-0.35, j-0.33, æ-0.25   | j-0.59, ε-0.20, I-0.18   | j-0.60, I-0.25, ε-0.20   | j-0.60, ε-0.15           | ε-0.70, j-0.55, æ-0.20
ε:  æ-0.60, I-0.44, i-0.41   | æ-0.40, έ-0.29, i-0.11   | æ-0.35, I-0.30, i-0.22   | æ-0.48, έ/i-0.20         | æ-0.45, i-0.22, έ-0.21
æ:  Э-0.38, i/Þ-0.36, Λ-0.31 | ε-0.22, Ď-0.20           | ε-0.30, Λ-0.23, έ/Ď-0.15 | Λ-0.30, έ-0.20, ε-0.19   | Λ-0.25, ε-0.20, έ-0.19
a:  Λ-0.61, Э-0.29, ε-0.18   | Э-0.41, r-0.25, Λ-0.23   | Э-0.45, ∂/ε-0.20         | Э-0.40, r-0.23, Λ-0.20   | Э-0.50, ∂-0.30, έ-0.25
Þ:  none                     | none                     | none                     | none                     | none
Ď:  aΩ-0.41, æ-0.25, ∂-0.12  | none                     | none                     | none                     | none
Э:  ∂-0.40, æ-0.36, u-0.33   | Λ-0.41, Þ-0.31, u-0.25   | Λ-0.41, æ/Ω-0.30         | Λ-0.43, a/Þ-0.30         | Λ-0.45, Þ-0.38, a-0.25
Ω:  u-0.41, έ-0.36, Λ-0.36   | έ-0.70, u-0.61, Λ-0.29   | έ-0.70, u-0.61, Э-0.22   | έ-0.70, u-0.61, Λ-0.39   | έ-0.70, u-0.61, Λ-0.40
u:  Э-0.29, Ω-0.27, έ-0.18   | Ω-0.30, Э-0.20, Þ-0.11   | Ω-0.50, Э/w-0.20         | Ω-0.48                   | Ω-0.54, w-0.35, Э-0.22
έ:  Ω-0.38, æ-0.36, ε-0.35   | ε-0.45, Λ-0.30, Ω/ε-0.26 | ε-0.42, ε∂-0.35, Λ-0.24  | ε-0.50, ε∂-0.40, Ω-0.30  | ε-0.52, ε∂-0.46, Λ-0.28
∂:  æ-0.62                   | æ-0.60                   | æ-0.40                   | æ-0.49                   | æ-0.50, a-0.30
Λ:  æ-0.40, Þ-0.38, Э-0.11   | Ω-0.39, Э-0.35, æ-0.26   | Э-0.29, æ-0.20, a-0.23   | æ-0.40, Ω-0.24, a/Э-0.20 | æ-0.50, a-0.30, Э-0.25
Table 4.9a shows the diphthong stimuli versus the average MRS for all diphthong
sub-recognisors. For all members of this subgroup the SRS was always higher than the MRS.
Table 4.9b shows the three most common confusions across speakers and their associated
MRS across members of the diphthong subgroup (IASC).
Table 4.9(a) Diphthong confusion matrix (average values over all speakers).
aI eI ЭI aΩ OΩ I∂ ε∂ Ω∂
aI 0.81 0.22 0.18 0.00 0.00 0.00 0.28 0.00
eI 0.68 0.84 0.34 0.00 0.00 0.00 0.31 0.00
ЭI 0.27 0.20 0.82 0.43 0.22 0.00 0.30 0.00
aΩ 0.00 0.00 0.29 0.68 0.64 0.00 0.23 0.00
OΩ 0.00 0.00 0.32 0.43 0.67 0.02 0.22 0.00
I∂ 0.00 0.00 0.00 0.00 0.00 0.81 0.27 0.33
ε∂ 0.30 0.35 0.43 0.39 0.27 0.28 0.71 0.00
Ω∂ 0.00 0.00 0.00 0.00 0.00 0.26 0.30 0.74
Table 4.10a shows the stop stimuli presented versus the average MRS for each speaker.
In all cases the SRS was higher than the MRSs achieved by the other stop
sub-recognisors. Table 4.10b shows the three most common confusions and their associated
MRS across speakers for just the members of the stop subgroup (IASC).
Table 4.9(b) Three most common confusions across speakers for the diphthong subgroup.
     Speaker 6                    Speaker 7                    Speaker 8                    Speaker 9                    Speaker 10
aI:  eI-0.39, ε∂-0.25, ЭI-0.20  | ЭI-0.31, ε∂-0.25, eI-0.12  | ε∂-0.25, ЭI-0.20, eI-0.18  | ε∂-0.32, eI-0.20           | ε∂-0.32, eI-0.19
eI:  ε∂-0.39, ЭI-0.28, aI-0.26  | aI-0.70, ЭI-0.50, ε∂-0.25  | aI-0.68, ЭI-0.48, ε∂-0.35  | aI-0.69, ε∂-0.40, ЭI-0.20  | aI-0.66, ЭI-0.23, ε∂-0.21
ЭI:  aΩ-0.45, ε∂-0.36, aI-0.35  | aΩ-0.41, OΩ-0.35, Э-0.25   | aΩ-0.45, aI-0.41, ε∂-0.31  | aΩ-0.40, ε∂-0.30, aI/OΩ-0.22 | aΩ-0.45, ε∂-0.28, aI-0.23
aΩ:  OΩ-0.61, ЭI-0.26, ε∂-0.22  | OΩ-0.68, ЭI-0.43, ε∂-0.27  | OΩ-0.66, ε∂-0.25, ЭI-0.22  | OΩ-0.60, ε∂/ЭI-0.30        | OΩ-0.60, ЭI-0.23, ε∂-0.12
OΩ:  ЭI-0.31, ε∂-0.28, aΩ-0.22  | aΩ-0.48, ЭI-0.21, ε∂/w-0.15 | aΩ-0.42, ЭI/w-0.31        | aΩ-0.50, ЭI-0.30, ε∂/w-0.20 | ЭI-0.45, ε∂-0.26, w-0.30
I∂:  Ω∂-0.38, j-0.25, ε∂-0.22   | Ω∂-0.31, ε∂-0.24, j-0.20   | Ω∂-0.31, ε∂-0.25, j-0.21   | ε∂-0.40, Ω∂-0.30           | Ω∂-0.35, ε∂-0.23, j-0.25
ε∂:  aΩ-0.39, I∂-0.35, OΩ-0.32  | ЭI-0.51, aI-0.41, I∂-0.36  | aΩ-0.60, eI-0.36, έ-0.34   | aΩ-0.50, eI-0.50, aI/ЭI-0.40 | aΩ-0.46, OΩ-0.38, έ-0.33
Ω∂:  w-0.36, ε∂-0.28, I∂-0.25   | ε∂-0.50, w-0.20            | I∂-0.31, w-0.20, ε∂-0.16   | I∂-0.40, ε∂-0.20           | ε∂-0.38, w-0.33, I∂-0.25
Table 4.10(a) Stops confusion matrix (average values over all speakers).
p b t d k g
p 0.69 0.03 0.31 0.17 0.30 0.13
b 0.05 0.69 0.07 0.28 0.05 0.21
t 0.42 0.10 0.60 0.06 0.60 0.22
d 0.16 0.21 0.05 0.83 0.17 0.28
k 0.39 0.02 0.20 0.06 0.73 0.20
g 0.17 0.30 0.16 0.33 0.20 0.64
Table 4.10(b) Three most common confusions across speakers for the stops subgroup.
    Speaker 6                    Speaker 7                    Speaker 8                    Speaker 9                    Speaker 10
p:  k-0.36, t∫-0.28, t-0.22    | k-0.25, t-0.22, t∫-0.28    | t-0.38, dξ-0.31, k-0.28    | t-0.40, k-0.35, t∫-0.20    | t-0.44, k-0.25, t∫-0.22
b:  d-0.32, g-0.28, t∫-0.21    | g/dξ-0.15, t∫-0.13         | t∫-0.50, d-0.35, dξ-0.30   | dξ-0.50, d-0.40, t∫-0.30   | dξ-0.50, t∫-0.33, g-0.30
t:  dξ-0.70, t∫/k-0.60, p-0.41 | dξ-0.70, t∫/k-0.60, g-0.26 | dξ-0.70, t∫/k-0.60, p-0.40 | dξ-0.70, t∫/k-0.60, p-0.50 | dξ-0.70, t∫/k-0.60, p-0.54
d:  dξ-0.70, b-0.25, t∫-0.21   | dξ-0.70, g-0.36, t∫-0.27   | dξ-0.70, g-0.30, t∫-0.25   | dξ-0.70, g-0.40, t∫-0.30   | dξ-0.70, t∫-0.43, b-0.22
k:  p/t∫-0.33, g-0.25, t-0.13  | p-0.45, g-0.15, t∫-0.14    | p-0.35, t∫-0.23, t-0.15    | t-0.50, p-0.40, g-0.20     | p-0.40, g-0.33, t-0.31
g:  dξ-0.60, d-0.30, k-0.22    | dξ-0.60, b-0.45, d-0.27    | dξ-0.60, d-0.40, p-0.35    | dξ-0.70, d-0.40, t-0.35    | dξ-0.60, d-0.41, b-0.31
Table 4.11a shows the nasal stimuli presented versus the average MRS for each speaker
and across all nasal sub-recognisors. For the nasal /ŋ/, the SRS is lower than the MRS of
the nasal /m/. For the nasals /m/ and /n/, the SRS is higher than the MRS achieved by the
IASC of the other nasal sub-recognisors. Table 4.11b shows the three highest confusions
for the nasal subgroup.
Table 4.11(a) Nasals confusion matrix (average values over all speakers).
m n ŋ
m 0.86 0.21 0.18
n 0.34 0.60 0.02
ŋ 0.60 0.34 0.57
Table 4.11(b) Three highest confusions of the nasal subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
m ŋ - 0.25
n - 0.17
n - 0.30
ŋ - 0.20
ŋ - 0.23
n - 0.20
n - 0.19 n - 0.20
ŋ - 0.12
n m - 0.38 m - 0.30 m - 0.33 m - 0.40 m - 0.25
ŋ m - 0.61
n - 0.38
m - 0.60
n - 0.41
m - 0.60
n - 0.40
n - 0.28 m - 0.60
n - 0.30
Table 4.12a shows the fricative stimuli presented versus the average MRS for all speakers
and across all fricative sub-recognisors. Two members of this subgroup, the fricatives /ð/
and /z/, show SRSs that were lower than the average MRS for the other fricatives in this
subgroup. Table 4.12b shows the first three confusions with their associated MRS for the
fricative subgroup.
Table 4.12(a) Fricatives confusion matrix (average values over all speakers).
f v θ ð s z ʃ ʒ
f 0.67 0.34 0.60 0.30 0.37 0.27 0.40 0.19
v 0.34 0.73 0.28 0.36 0.28 0.30 0.22 0.27
θ 0.38 0.18 0.62 0.60 0.42 0.35 0.17 0.18
ð 0.24 0.30 0.76 0.68 0.60 0.80 0.41 0.60
s 0.60 0.24 0.31 0.70 0.71 0.60 0.39 0.33
z 0.23 0.33 0.38 0.75 0.70 0.65 0.28 0.33
ʃ 0.60 0.04 0.28 0.33 0.31 0.26 0.65 0.40
ʒ 0.14 0.28 0.31 0.70 0.27 0.28 0.38 0.66
Table 4.13a shows the affricative stimuli presented versus the average MRS for all
speakers and across all affricative sub-recognisors. Both members of this subgroup, the
affricatives /tʃ/ and /dʒ/, achieved average SRSs exceeding the MRS. Table 4.13b shows
the first three confusions with their associated MRS for the affricative subgroup.
Table 4.12(b) Three most common confusions across speakers for the fricative subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
f θ - 0.60
s - 0.48
ð - 0.42
θ - 0.60
ð - 0.45
ʒ - 0.41
θ - 0.60
ð - 0.41
z - 0.40
θ - 0.60
s/ʃ - 0.48
v/h - 0.40
θ - 0.60 ʃ - 0.52
s - 0.51
v f - 0.45
h - 0.41
s/ʒ - 0.36
θ - 0.41
s - 0.32
ð - 0.24
z - 0.35
s - 0.31
f - 0.30
ð - 0.50
f/z - 0.40
ʒ - 0.30
ð - 0.51
f/z - 0.41
ʒ - 0.30
θ ð - 0.60
s - 0.41
v/z - 0.31
θ - 0.60
f - 0.46
s - 0.37
ð - 0.64
f - 0.46
s - 0.35
ð - 0.49
f/s - 0.50
ʒ - 0.40
ð - 0.60
s - 0.49
f - 0.48
ð θ - 0.70
z - 0.80
s/ʒ - 0.60
θ - 0.65
s/ʒ - 0.60
z - 0.79
θ - 0.61
s/ʒ - 0.60
h - 0.46
z - 0.80
θ - 0.80
ʒ - 0.60
z - 0.80
θ - 0.71
s/ʒ - 0.60
s ð - 0.70
f/z - 0.60
θ - 0.48
ʒ/f - 0.60 ʃ - 0.40
ð - 0.70
θ - 0.35
f/z - 0.20
ð - 0.70
f/z - 0.60 ʃ - 0.50
ð - 0.70
f/z - 0.60 ʃ - 0.48
z ð - 0.75
s - 0.70
v/ʒ - 0.40
ð - 0.75
v - 0.46
ð - 0.75
s - 0.70
ʒ - 0.47
ð - 0.75
ʒ - 0.50
ð - 0.75
s - 0.70
ʒ - 0.48
ʃ f - 0.60
θ - 0.40
s - 0.37
f - 0.60
z - 0.41
s - 0.35
f - 0.60
ð - 0.52
ʒ - 0.46
f - 0.60
ʒ - 0.50
ð/θ/s - 0.40
f - 0.60
θ - 0.40
ʒ ð - 0.70 ʃ - 0.42
θ - 0.33
ð - 0.70
z - 0.40
s - 0.40
ð - 0.70 ʃ - 0.51
s/θ - 0.25
f - 0.70 ʃ - 0.50
v/z - 0.30
ð - 0.70
θ - 0.40
z - 0.36
Table 4.13(a) Affricatives confusion matrix (average values over all speakers).
tʃ dʒ
tʃ 0.68 0.41
dʒ 0.32 0.64
Table 4.13(b) Three main confusions for the affricative subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
tʃ t - 0.75
dʒ - 0.53
d - 0.36
t - 0.58
p - 0.23
d/dʒ - 0.21
t - 0.75
dʒ - 0.47
p - 0.35
t - 0.55
dʒ - 0.54
d - 0.40
t - 0.75
dʒ - 0.50
d - 0.35
dʒ d - 0.70
g - 0.60
t - 0.50
d - 0.70
g - 0.60
tʃ - 0.31
d - 0.70
g - 0.60
tʃ - 0.39
d - 0.70
g - 0.60
tʃ - 0.40
d - 0.70
g - 0.60
silence - 0.3
Table 4.14a shows the semivowel stimuli presented versus the average MRS for all
speakers and across all semivowel sub-recognisors. For all members of this subgroup the
SRS was always higher than the MRSs achieved by any other sub-recognisors. Table
4.14b shows the three most common confusions and their associated MRSs across
speakers for the members of the semivowel subgroup.
Table 4.14(a) Semivowels intra confusion matrix.
h r j w L
h 0.65 0.00 0.00 0.00 0.00
r 0.00 0.74 0.00 0.00 0.16
j 0.00 0.00 0.93 0.00 0.00
w 0.00 0.00 0.00 0.83 0.00
L 0.00 0.19 0.00 0.00 0.53
Table 4.14(b) Semivowels inter confusion matrix.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
h θ/s - 0.31
z - 0.21
ð - 0.20
f - 0.32
θ/ʒ - 0.31
ð - 0.30
θ/ʃ - 0.36
z - 0.35
v - 0.32
f - 0.50
s/ʃ - 0.40
s - 0.50
f - 0.45 ʃ - 0.36
r a - 0.25 a - 0.18 L - 0.23
a - 0.10
a - 0.40 a - 0.25
L - 0.22
j i - 0.34
I∂- 0.31
i - 0.36
I∂ - 0.22
I- 0.12
I∂- 0.40
i - 0.23
I - 0.15
i - 0.40
I∂- 0.20
i - 0.25
I∂ - 0.23
w u - 0.31
Ω - 0.28
Ω∂ - 0.21
Ω - 0.31
OΩ - 0.25
u - 0.12
Ω∂ - 0.25
u - 0.22
Ω - 0.19
Ω - 0.40
u - 0.20
OΩ- 0.25
Ω∂ - 0.30
L r - 0.09 r - 0.23 r - 0.24 r - 0.15 r - 0.23
4.3.4 Experiment One: Conclusion
The results of this experiment showed that there was variation in recognition
performance across phones, subgroups and speakers. The descending order of average
SRS across subgroups was vowels, followed by diphthongs, semivowels, stops, nasals,
fricatives and affricatives. Table 4.15 summarises the average SRS for all speakers across
each subgroup.
Table 4.15 Average SRS across subgroups.
Subgroup Vowels Diphthongs Stops Fricatives Affricatives Nasals Semivowels
Avg. SRS 0.87 0.76 0.69 0.67 0.66 0.68 0.74
The best average SRS was for the vowel subgroup, which achieved an average over all
speakers of 0.87; the lowest was for the affricative subgroup, at 0.66. Variations were
also observed across speakers, but the general trends were consistent. Average SRS
scores for all phones across all speakers are shown in Table 4.16.
Table 4.16 Average SRS scores for all phones across all speakers.
Speaker # 6 7 8 9 10
Avg. SRS 0.70 0.75 0.78 0.85 0.71
Overall, the descending order of SRS across speakers was 9, 8, 7, 10 and 6. This
ordering was used in selecting the threshold value: in this experiment, the threshold was
chosen from the lowest SRS output. Referring to Table 4.7a, the lowest SRS (0.50)
occurred for the sub-recognisors of the phones /b/, /L/ and /tʃ/ when presented with the
input data set of speaker 6, so this value was chosen as the threshold. This choice ensures
that the syntactic knowledge estimator will select all sub-recognisors when presented
with the correct phone, as no SRS value was less than 0.5 across all speakers. The main
disadvantage of a threshold of 0.5 is that, in the worst case, all sub-recognisors need to
be checked to find the correct solution, which means longer processing time.
Three values of the minimum threshold were tested (0.5, 0.6 and 0.7). As the threshold
was increased, the number of sub-recognisors achieving an SRS above the threshold
decreased, and hence processing time decreased; as it was decreased, the number of
sub-recognisors achieving an MRS above the threshold increased. A threshold was
therefore selected that balanced adequate SRS, minimal MRS and reasonable processing
time. Evaluation of the system performance at these threshold values showed that a
threshold of 0.60 achieved reasonable results: the recognition rate was 76% and the
confusion rate was 6.6%.
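The trade-off behind this selection can be sketched as follows; the function simply counts scores above a candidate threshold, and the SRS/MRS values used are hypothetical placeholders rather than the thesis data.

```python
# Sketch of the threshold trade-off: raising the threshold prunes both
# correct sub-recognisors (SRS) and confusable ones (MRS).
# The score lists below are hypothetical illustrations, not thesis data.

def count_above(scores, threshold):
    """Count sub-recognisor scores exceeding the threshold."""
    return sum(1 for s in scores if s > threshold)

# Hypothetical self-recognition (SRS) and misrecognition (MRS) scores.
srs = [0.92, 0.70, 0.66, 0.55, 0.81]
mrs = [0.65, 0.40, 0.62, 0.30, 0.20]

for threshold in (0.5, 0.6, 0.7):
    active = count_above(srs, threshold)      # correct phones kept
    confusable = count_above(mrs, threshold)  # possible confusions kept
    print(threshold, active, confusable)
```

Raising the threshold reduces both counts, which is the balance between recognition coverage and processing time described above.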
4.4 Experiment Two: Operation of Each Sub-recognisor with the
Syntactical Knowledge
Experiment Two was designed to test the functionality of the APR when controlled by the
ACL signals, as shown in Figure 4.5. Each sub-recognisor of the APR was tested by
applying all Di(12) inputs together with the activation control signal and measuring the
output PIRi, where PIR is the phone identification response from the activated
sub-recognisor. To simplify the experiment, the threshold was not applied, because the
experiment was meant to test the ACL lines only.
This experiment verifies the operation of the APR under the control of the ACL signals
and so predicts the performance of RUST-I as an IWR system, assuming an ideal
syntactic knowledge estimator.
Figure 4.5 Block diagram of Experiment Two.
4.4.1 Input Stimuli
The stimuli data set presented to the adaptive phone recognisor in this experiment is the
same input data set as in Experiment One (Section 4.3). The ACL signals were binary
control lines.
4.4.2 Experimental Method
The block diagram of the setup for this experiment follows Figure 4.5. The inputs are
DI1(12) to DIMi(12), the 12 Mel-frequency cepstral coefficients (MFCC) for each of the
Mi frames presented to the ith sub-recognisor. The ACL signal was activated only for the
sub-recognisor of the correct phone presented at the input, so PIR indicated SRS only.
All 225 tokens (45 phones by 5 test speakers) were applied to each of the 46
sub-recognisors and the PIRi outputs obtained. The activation control lines ACLi of the
appropriate sub-recognisors were activated under pseudo-simulation conditions. Under
ideal conditions, only the activation control line of the sub-recognisor representing the
expected phone was activated; this part of the experiment simulates operation of the
syntactic knowledge estimator assuming that it correctly identifies the word pattern.
Under non-ideal conditions, the activation control lines of all sub-recognisors were
activated individually; this part simulates operation of the syntactic knowledge estimator
without any assumptions about the word pattern. The PIRi output from each
sub-recognisor was then stored in an ASCII file and placed into a graphical confusion
matrix and appropriate tables.
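Under ideal conditions, the ACL effectively acts as a mask over the sub-recognisor outputs. A minimal sketch of this gating, with hypothetical PIR values and phone labels (not the RUST-I implementation):

```python
# Sketch of ACL gating of sub-recognisor outputs: only sub-recognisors
# whose activation control line is high contribute a PIR value.
# The phone labels and PIR scores are illustrative, not thesis data.

def gate_outputs(pir, acl):
    """Return PIR values only for sub-recognisors whose ACL is active."""
    return {phone: score for phone, score in pir.items() if acl.get(phone, 0) == 1}

# Hypothetical PIR outputs of three sub-recognisors for one input token.
pir = {"b": 0.69, "d": 0.28, "g": 0.21}

# Ideal conditions: only the expected phone's ACL is activated, so all
# IASC/IRSC confusions are suppressed and only the SRS remains.
ideal_acl = {"b": 1, "d": 0, "g": 0}
print(gate_outputs(pir, ideal_acl))  # only the /b/ response survives
```

This is why the ideal-condition results below reproduce the SRS values of Experiment One with the confusions removed.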
4.4.3 Results
The responses of all sub-recognisors for all input stimuli from the five speakers (#6-10)
in the test set are identical to the values of Tables 4.7a to 4.7e. This occurred because
applying the appropriate ACL control signals to the output selector suppressed all the
IASC and IRSC confusions that occurred in Experiment One. The SRS values from
Experiment One were maintained to within two decimal places. The effect of applying
ideal ACLi signals is thus to remove confusions and misrecognitions of incorrect phones
completely. The confusion matrix of speaker 6 was chosen to represent the results of this
experiment.
Table 4.17 summarises the SRS under ideal conditions for the five speakers (#6-10). The
local recognition rate is the recognition rate for each speaker, i.e., the number of phones
with SRS > 0.60 divided by 45.
Table 4.17 Summary of SRS < 0.60 and recognition rate across all speakers.
Speaker 6 7 8 9 10
# SRS < 0.60 19 14 11 0 9
Local Recognition 57.77% 66.60% 75.50% 100% 77.8%
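As a sketch of this definition, using speaker 6's count from Table 4.17 (the table's percentages are rounded, so the last digit may differ slightly):

```python
# Local recognition rate as defined above: the number of phones with
# SRS > 0.60, divided by the 45 phones in the stimuli set.
TOTAL_PHONES = 45

def local_recognition_rate(n_below_threshold):
    """Rate (%) given the count of phones whose SRS fell below 0.60."""
    return (TOTAL_PHONES - n_below_threshold) / TOTAL_PHONES * 100

# Speaker 6 had 19 phones with SRS < 0.60 (Table 4.17).
print(round(local_recognition_rate(19), 2))  # 57.78
```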
The highest number of problematic SRSs occurred for tokens from the test set of speaker
6. The lowest number of problematic SRSs occurred for tokens from the test set of speaker
9. The local recognition rates under ideal conditions were between 57.77% and 100%.
4.4.4 Experiment Two: Conclusion
The results obtained in this experiment showed that all the sub-recognisors of the APR
responded as expected to the ACL signals. However, activating the ACL signal of a
sub-recognisor whose output is below the system threshold will not allow correct
recognition of that phone. The activation control of the APR is intended to reduce the
number of MRS responses; therefore, even if the system is completely protected against
MRS confusions, some failures are still expected wherever the minimum threshold
exceeds the SRS.
4.5 Experiment Three: Verification of the System as an IWR
Experiment Three was designed to investigate the operation of the system as an IWR,
with the syntactic knowledge estimator combined with the adaptive phone recognisor.
The aim of this experiment was to verify the word recognition efficiency of RUST-I and
analyse its performance. A comprehensive analysis of all correct and incorrect results is
provided, with reference back to the first and second experiments.
4.5.1 Input Stimuli
In this experiment, RUST-I took as input a data set of one hundred words chosen
arbitrarily from the system lexicon. The words were spoken by the same speakers as in
the test set, i.e., speakers 6, 7, 8, 9 and 10. Each word was processed and presented to the
system as a temporal sequence of vectors in the form of MFCC, DIi(12). The following is
the list of the 100 words used in this experiment:
their - the - this - these - there - three - think - thank - to - table - time - trying -
transaction - today - and - at - occur - a - away - across - arrive - ago - are - after - ask -
august - almost - or - order - agent - any - air - of - on - often - off - until - other - up - old -
over - only - out - hour - one - it - indeed - in - into - its - isn't - inside - introduce - if -
industry - he - heard - her - head - heavy - stone - school - chair - child - church - receive -
real - room - before - earth - earlier - must - market - mouth - number - noise - nature - got
- glass - give - good - gate - general - light - lamp - large - lay - year - your - you - perform -
permit - pay - do - destruction - describe - defined - discount - duty - floor
4.5.2 Experimental Method
Figure 4.6 Block diagram of the system as configured for Experiment Three.
Figure 4.6 shows the block diagram of the system configuration used in this
experiment. At the beginning, the adaptive phone recognisor outputs and the
accumulator contents were initialised to zero. All the ACLi signals and the EOW
control signal were set to inactive (low). The syntactic knowledge estimator was
reset to the top of the lexicon database whenever a new word was to be processed. The
words were presented to the system from data files stored in ASCII format, as in
previous experiments. The words were presented one at a time, with a silent period
between each pair of words long enough to ensure the silence was detected. The
silence period was set to be greater than 1 s, which was found to eliminate the
possibility of word boundary confusions.
The response of the system was examined by checking the contents of the accumulator
before the presentation of the next word in the specified word test set. The words were
presented to the system randomly and the ID output values from the accumulator were
stored in ASCII text files.
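The per-word procedure above can be sketched as a loop over phone matches that fills the accumulator and stops with '*' when a phone falls below the threshold. The (phone ID, score) pairs below are illustrative stubs, not RUST-I data:

```python
# Sketch of the Experiment Three word cycle: the accumulator is cleared
# for each word, phones are matched one at a time, and recognition stops
# with '*' when a phone's score falls below the 0.60 system threshold.
# The (phone ID, score) pairs below are illustrative, not thesis data.

THRESHOLD = 0.60

def recognise_word(phone_scores):
    """Return the accumulator ID stream for one word."""
    accumulator = []
    for phone_id, srs in phone_scores:
        if srs <= THRESHOLD:
            accumulator.append("*")  # process terminated at this phone
            break
        accumulator.append(phone_id)
    return accumulator

word_more = [(40, 0.85), (8, 0.75)]    # both phones above threshold
word_agent = [(15, 0.80), (39, 0.55)]  # second phone fails the threshold

print(recognise_word(word_more))   # [40, 8]
print(recognise_word(word_agent))  # [15, '*']
```

The terminated case corresponds to the asterisk entries that appear in the 'Actual Result' column of Table 4.20.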
4.5.3 Representation of Results
Table 4.20 contains the word recognition results of this experiment. The first column
lists the test words; the second, their phonemic equivalents; the third, the speaker test set
they were derived from; the fourth, the recognition decision; the fifth, the expected ID
codes in the accumulator if the word was correctly recognised; and the last, the actual ID
codes obtained from the accumulator at the end of the recognition cycle. The words in
the table were categorised according to the first phone in their phonemic stream, to ease
analysis of system errors related back to the performance of the syntactic knowledge
estimator.
The binary decision from the recognition process is indicated by 'U' for 'Unrecognised'
words and by 'R' for 'Recognised' words. The expected ID was derived from the
phonemic ID representation of Table 3.1. If a word was correctly recognised, the
expected accumulator ID stream is identical to the actual accumulator ID stream. Any
difference between these ID streams indicates that an error occurred in the recognition
process. If no suitable match could be found during the process for any reason, an
asterisk '*' appears in the ID stream, indicating the termination of the recognition
process.
4.5.4 Analytical Procedure
The recognition outcome for any word can be analysed by tracking the recognition
process through the syntactic knowledge estimator and applying the corresponding SRS
and MRS results for the activated phones, as outlined in Experiments One and Two.
Tables 4.7a-e provide the SRSs for each phone and Tables 4.8b to 4.13b provide the
MRSs. Any word that was unrecognised or terminated during recognition was analysed
to derive possible causes. The syntactic bubbles diagram and the syntactic database were
used to track through the syntactic knowledge estimator and hence analyse the behaviour
of the system during the recognition process.
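The tracking procedure can be sketched as a walk over a priority-ordered tree. The tree contents, IDs and scores below are illustrative (only the /d/ = 25 versus /dʒ/ = 39 priority follows the analysis in this section), not the actual syntactic database:

```python
# Sketch of tracking through a syntactic knowledge tree: each phone-ID
# prefix maps to its candidate next phones in browsing-priority order,
# and the first candidate scoring above threshold is taken.
# The tree and scores are illustrative, not the thesis database.

TREE = {
    (): [25, 39],   # front edge: /d/ (25) is checked before /dʒ/ (39)
    (25,): [3],     # hypothetical continuation after /d/
    (39,): [3],     # hypothetical continuation after /dʒ/
}

def next_phone(prefix, scores, threshold=0.60):
    """First candidate above threshold, or '*' if the process terminates."""
    for cand in TREE.get(tuple(prefix), []):
        if scores.get(cand, 0.0) > threshold:
            return cand
    return "*"

# The incoming phone may be /dʒ/, but when the confusable /d/ also
# scores above threshold and has priority, the wrong branch is taken.
print(next_phone([], {25: 0.65, 39: 0.70}))  # 25, not the expected 39
```

This priority effect is what produces the 'general' misrecognition analysed later in this section.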
4.5.5 Results
Table 4.18 summarises the results of this experiment. It contains the percentages of
words from each of the five speakers that were recognised and unrecognised. For
instance, speaker 6 contributed 17% of the total number of words used in this
experiment, and 64.71% of the words contributed by speaker 6 were recognised. Speaker
6 accounted for 15.27% of the correct word recognitions and 21.42% of the incorrect
word recognitions.
Table 4.19 shows the variation of recognition patterns for two words across all speakers.
The two words used were 'agent' and 'more'. The word 'agent' was unrecognised by the
system for four out of the five speakers. The system was able to recognise this word when
spoken only by speaker 9. In contrast, the word 'more' was recognised by the system for all
speakers.
Table 4.18 Overall results of the 100-word recognition test.
Speaker  % of Recognised Words  % of Unrecognised Words  % of Words From the Speaker
6  15.27%  21.42%  17%
7  12.50%  25%  16%
8  11.11%  28.57%  16%
9  41.66%  10.71%  33%
10  19.44%  14.28%  18%
Table 4.19 Comparison of two-word recognition results over all speakers.
Threshold = 0.60
Word Speaker # Accumulator Result
‘Agent’ 6 15-39-12-* U
7 15-* U
8 15-39-12-* U
9 15-39-12-41-24 R
10 15-* U
‘more’ 6 40-8 R
7 40-8 R
8 40-8 R
9 40-8 R
10 40-8 R
The following paragraphs track the recognition of the word 'agent' to illustrate the
tracking procedure. The word 'agent' starts with the diphthong /eI/, which is the last
phone in the list to be checked, as it is least likely to be the front edge phone. Table 4.9b
shows that there are no significant IASC or IRSC scores with any of the phones that are
checked before /eI/. The diphthong /eI/ has an SRS from 0.70 to 0.91, which exceeds any
MRS for this phone. Therefore, the diphthong /eI/ was successfully recognised for all
five speakers and the ID of /eI/ (15) was consistently found as the first phone in the
accumulator.
Subsequent IDs in the accumulator required that the syntactic knowledge estimator
branch into the phonemic subclass (Level 1), which points to two locations: the
phonemic subclass /dʒ/ = 39 and the end-of-process signal /46/. As /dʒ/ is the next phone
in the word 'agent', it is the only phone checked, and there was no possibility of further
confusion given the previous pattern of phones.
The SRSs of /dʒ/ shown in Tables 4.7a-e indicate that a match occurred between the
incoming phone /dʒ/ and the estimated sub-recognisor for the three speakers 6, 8 and 9
only, as their SRSs (0.62, 0.79 and 0.70, respectively) are greater than 0.60. The other
two test sets, from speakers 7 and 10, resulted in an error for the recognition of this
phone, as their SRSs were below 0.6 at only 0.55. As expected, the recognition process
was terminated at this point for these two speakers. The ID of the phone /dʒ/ (39) for
speakers 6, 8 and 9 was found in the second position in the accumulator. In the case of
speakers 7 and 10, the system generated an error at this point in the recognition process,
as indicated in Table 4.19 by the asterisk in the position of the second phone. Once an
error occurs, the current system configuration stops the recognition process for that
word.
The recognition process continued in the cases of speakers 6, 8 and 9. The next level
(Level 2) of the syntactic database points to the phonemic subclass /∂ = 12/ from the
previous pattern of /eI/ then /dʒ/, which is the only possibility. The sub-recognisor for
the vowel /∂/ was activated and checked for a match with the incoming data. For the
three remaining speakers (6, 8 and 9) the SRSs were from 0.92 to 1.00 without any
significant confusion, which explains the presence of the phone /∂/ ID in the
accumulator.
The recognition process again continued for speakers 6, 8 and 9. The next level (Level 3)
points to the phonemic subclass /n = 41/, the only subclass following the previous
pattern of /eI/, /dʒ/ then /∂/. The sub-recognisor for the nasal /n/ was activated and
checked for a match with the incoming data. For two of the three remaining speakers (6
and 8) the SRS scores were 0.55 and 0.58, which are below the threshold, so the
recognition process was terminated for these two speakers and the system generated an
error at this point, as shown in Table 4.19 by the asterisk in the position of the fourth
phone.
Only speaker 9 continued through the recognition process, with an SRS of 0.69. The ID
of the phone /n/ for speaker 9 was found in the fourth position in the accumulator. For
speaker 9, the subclass of the phone /n/ in Level 3 points to the phonemic subclass of the
stop /t = 24/ in Level 4 of the syntactic knowledge database; branching into this subclass
from the previous pattern of /eI/, /dʒ/, /∂/ then /n/ had only one possibility, the stop /t/.
The sub-recognisor /t/ was activated and checked for a match with the incoming data.
For the one remaining speaker (9) the SRS score for /t/ was 0.69, so it was considered
recognised, and the ID code of the phone /t/ was found in the fifth position of the
accumulator. Therefore the word 'agent' was fully recognised when spoken by speaker 9,
but not when spoken by any other speaker in the test set. The word 'more' was
recognised successfully for all speakers.
The description of the recognition process for 'more' is similar for all speakers. The
accumulator contained the ID codes 40-8 for all speakers, representing the codes for the
two phones /m = 40/ and /Э = 8/. The response from the sub-recognisor /m/ was received
and recorded in the accumulator. The phone /m/ had no MRS greater than 0.60 for any
speaker, and the SRS of /m/ for all speakers was in the range 0.79 to 0.91, greater than
any MRS. Therefore, the ID of /m/ (40) is found in the first position in the accumulator
for all speakers.
Branching into the first phonemic subclass (Level 1) from the phone /m/ resulted in ten
possibilities: the phones /tʃ/, /t/, /s/, /ŋ/, /m/, /t/, /n/, /d/, /z/ and /k/. The vowel /Э/ has no
misrecognition score greater than 0.60 for any of the phone sets. It also has an SRS in
the range 0.71 to 0.80 over all speakers, which is greater than any MRS of this phone.
Therefore, the ID of /Э/ (8) was found in the second position of the accumulator for all
speakers. Recognition results for the 100 words used in Experiment Three are presented
in Table 4.20.
Table 4.20 (part 1) Recognition results for words used in Experiment Three.
(Spk# = speaker number, D = decision, U = unrecognised, R = recognised, and * =
process stopped)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class ð = 31
their
ðε∂
6
R
31-20
31-20
the
ð∂
6
R
31-12
31-12
this
ðIs
8
R
31-1-32
31-1-32
these
ðiz
9
R
31-2-33
31-2-33
there
ðε∂
10
R
31-20
31-20
Phone Class θ = 30
three
θri
9
R
30-37-2
30-37-2
think
θIŋk
6
U
30-1-42-26
31-1-*
thank
θæŋk
8
U
30-4-42-26
31-4-*
Phone Class t = 24
to
tu
9
R
24-10
24-10
table
teIb∂L
9
R
24-15-23-12-45
24-15-23-12-45
time
taIm
6
R
24-14-40
24-14-40
trying
traIIŋ
7
R
24-37-14-1-42
24-37-14-1-42
transaction
trænzækʃ∂n
10
U
24-37-4-41-33-4-
26-34-12-41
24-37-4-41-33-4-*
today
t∂deI
8
U
24-12-25-15
*
Phone Class æ = 4
and
ænd
10
R
4-41-25
4-41-25
at
æt
9
R
4-24
4-24
Table 4-20 (part 2)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class ∂ = 12
occur
∂kέ
7
R
12-26-11
12-26-11
a
∂
8
R
12
12
away
∂'weI
7
R
12-44-15
12-44-15
across
∂krÞs
9
R
12-26-37-6-32
12-26-37-6-32
arrive
∂raIv
8
U
12-37-14-29
12-*
ago
∂gOΩ
8
U
12-27-18
12-*
Phone Class a = 5
are
a
6
R
5
5
after
aft∂
9
R
5-28-24-12
5-28-32-24-12
ask
ask
8
R
5-32-26
5-32-26
Phone Class Э = 8
august
ЭgΛst
9
R
8-27-13-32-24
8-27-13-32-24
almost
ЭLmOΩst
7
U
8-45-40-18-32-24
8-45-40-*
or
Э
7
U
8
8-*
order
Эd∂
8
U
8-25-12
8-*
Phone Class eI = 15
agent
eIdξent
9
R
15-39-12-41-24
15-39-12-41-24
Phone Class ε = 3
any
εni
10
R
3-41-2
3-41-2
air
ε∂
8
R
3-12
3-12
Phone Class Þ = 6
of
Þv
6
R
6-29
6-29
on
Þn
10
R
6-41
6-41
often
Þfen
7
U
6-28-12-41
6-*
Table 4.20 (part 3)
Word
Phone
Spk #
D
Expected Result
Actual Result
off
Þf
9
R
6-28
6-28
Phone Class Λ = 13
until
ΛntiL
7
U
13-41-24-2-45
13-*
other
Λðe
6
R
13-31-12
13-31-12
up
Λp
8
U
13-22
13-*
Phone Class OΩ = 18
old
OΩLd
9
U
18-45-25
17-*
over
OΩv∂
10
U
18-29-12
17-*
only
OΩnLi
6
U
18-41-45-2
17-*
Phone Class aΩ = 17
out
aΩt
9
R
17-24
17-24
hour
aΩ∂
6
R
17-12
17-12
Phone Class w = 44
one
wΛn
10
R
44-13-41
44-13-41
Phone Class I = 1
it
It
9
R
1-24
1-24
indeed
Indid
10
R
1-41-25-2-25
1-41-25-2-25
in
In
10
R
1-41
1-41
into Intu 9 R 1-41-24-10 1-41-24-10
its
Its
9
R
1-24-32
1-24-32
isn't
Iz∂nt
6
U
1-33-12-41-24
1-33-12-*
inside
InsaId
10
R
1-41-32-14-25
1-41-32-14-25
introduce
Intr∂djus
9
R
1-41-24-37-12-
25-43-10-32
1-41-24-37-
12-25-43-10-
32
if
If
10
R
1-28
1-28
Table 4.20 (part 4)
Word
Phone
Spk #
D
Expected Result
Actual Result
industry
IndΛstri
9
R
1-41-25-13-32-
24-37-2
1-41-25-13-
32-24-37-2
Phone Class h = 36
he
hi
7
R
36-2
36-2
heard
hέd
6
U
36-11-25
*
her
hέ
7
U
36-11
36-1
head
hεd
9
R
36-3-25
36-3-25
heavy
hεvi
10
R
36-3-29-2
36-3-29-2
Phone Class s = 32
stone
stOΩn
9
R
32-24-18-41
32-24-18-41
school
skuL
8
U
32-26-10-45
31-*
Phone Class tʃ = 38
chair
tʃε∂
10
U
38-20
24-*
child
tʃaILd
7
R
38-14-45-25
38-14-45-25
church
tʃέtʃ
9
R
38-11-38
38-11-38
Phone Class r = 37
receive
r∂siv
10
R
37-12-32-2-29
37-12-32-2-29
real
riL
9
R
37-2-45
37-2-45
room
rum
6
R
37-10-40
37-10-40
Phone Class b = 23
before
bifЭ
8
R
23-2-28-8
23-2-28-8
Phone Class έ = 11
earth
έθ
7
R
11-30
11-30
earlier
έLi∂
6
R
11-45-2-12
11-45-2-12
Table 4.20 (part 5)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class m = 40
must
mΛst
9
R
40-13-32-24
40-13-32-24
market
mak∂t
7
R
40-5-26-12-24
40-5-26-12-24
mouth
maΩθ
9
R
40-17-30
40-17-30
Phone Class n = 41
number
nΛmb∂
10
R
41-13-40-23-12
41-13-40-23-12
noise
nЭIz
9
R
41-16-33
41-16-33
nature
neItʃ∂
6
U
41-15-38-12
*
Phone Class g = 27
got
gÞt
9
R
27-6-24
27-6-24
glass
gLas
6
U
27-45-5-32
*
give
gIv
9
R
27-1-29
27-1-29
good
gΩd
10
R
27-9-25
27-9-25
gate
geIt
9
R
27-15-24
27-15-24
Phone Class dʒ = 39
general
dʒεnr∂L
9
U
39-3-41-37-12-45
25-3-41-37-*
Phone Class L = 45
light
LaIt
7
R
45 - 14 - 24
45-14-24
lamp
Læmp
8
R
45 - 4 - 40 - 22
45 - 4 - 40 - 22
large
Ladʒ
9
R
45 - 5 - 39
45-5-39
lay
LeI
7
R
45 - 15
45 - 15
Phone Class j = 43
year
jI∂
9
R
43 - 19
43-19
your
jЭ
8
R
43 - 8
43-8
you
ju
6
R
43 - 10
43-10
Table 4.20 (part 6)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class p = 22
perform
p∂fЭm
8
R
22-12-28-8-40
22-12-28-
8-40
permit
p∂mIt
7
R
22-12-40-1-24
22-12-40-
1-24
pay
peI
10
R
22-15
22-15
Phone Class d = 25
do
du
6
R
25-10
25-10
destruction
d∂strΛkʃ∂n
9
R
25-12-32-24-37-13-26-
34-12-41
25-12-32-
24-37-13-
26-34-12-
41
describe
d∂skraIb
10
U
25-12-32-26-37-14-23
25-12-32-*
defined
d∂faInd
8
U
25-12-28-14-41-25
25-12-28-
14-*
discount
dIskaΩnt
7
U
25-1-32-26-17-41-24
25-1-*
duty
djuti
9
U
25-43-10-24-2
25-2-*
Phone Class f = 28
floor
fLЭ
9
R
28-45-8
28-45-8
From a total of 100 words, 73 were recognised correctly; hence their expected and actual
ID codes in Table 4.20 are identical. Words were correctly recognised for one of two
reasons:
• Eight of the 73 words (10.95%) contained phones that achieved SRS values
higher than the system threshold and had no IASC or IRSC (i.e. no MRS > 0.60).
• The majority of the recognised words (65 words, or 89.05%) required the
assistance of the syntactic knowledge estimator to be correctly identified, as the
relevant sub-recognisors had confusions > 0.60 with other phones.
In the first case (10.95% of the recognised words), the ACL of the syntactic knowledge
estimator played a minor role in the recognition process, so these words would have
been adequately identified by the APR by itself. For example, the word 'on' contains two
phones, /Þ/ and /n/; neither had an MRS greater than 0.60, so the syntactic knowledge
estimator faced no competition in their recognition. The eight words in this category are:
on, of, one, he, lamp, year, your and you.
In the second case (89.05% of the recognised words), the ACL of the syntactic
knowledge estimator played a major role in the recognition process, as these words
depended on the ACLs for their recognition. For example, the word 'their' contains the
phone string /ðε∂/. The phone /ð/ has an MRS with the phone /θ/ that is higher than the
system threshold (0.60), so there was a possibility of confusion. The ACLs of the
syntactic knowledge estimator eliminated that possibility, so the first phone of the word
'their', /ð/, was recognised correctly. The syntactic knowledge estimator therefore
competed effectively in recognising these words. The words in this category are: their,
the, this, these, there, three, to, table, time, trying, and, at, occur, a, away, across, are,
after, ask, august, agent, any, air, off, other, out, hour, it, indeed, in, into, its, inside,
introduce, if, industry, head, heavy, stone, child, church, receive, real, room, before,
earth, earlier, must, market, mouth, number, noise, got, give, good, gate, light, large, lay,
perform, permit, pay, do, destruction, floor.
There were 27 words in Table 4.20 classified as unrecognised. These incorrectly
recognised words were spoken by various speakers; the unrecognition cases are
therefore described as a word-dependent problem. The following paragraphs describe
why these errors occurred, using the analytical tracking procedure previously described
for the words 'agent' and 'more'. The 27 errors were categorised into three main types of
word misrecognition:
• The first type occurred when the APR was unable to recognise the front edge phone
because the phone's SRS value was less than the system threshold (0.60) and it did
not have any significant MRS with other phones.
• The second type usually occurs at the front edge level, but can occur at lower levels.
It was observed when the syntactic knowledge estimator identified phones whose
MRS exceeded the threshold (0.60) and which were checked before the correct
phone.
• The third type occurred in levels below the front edge level, when the recognition
sequence was broken because one of the phones under processing had an SRS less
than the threshold, with the MRS unlikely to affect performance. This type of error is
related to the APR performance rather than the syntactic knowledge estimator.
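To a first approximation, the three error types can be distinguished from where the actual ID stream diverges from the expected one. A rough sketch follows; the rules simplify the analysis above and would not, for example, catch a second-type error occurring below the front edge:

```python
# Rough classifier for the three misrecognition types described above,
# based only on the first point of divergence between the expected and
# actual ID streams. The rules are a simplification of the analysis.

def error_type(expected, actual):
    """Classify an unrecognised word by its first point of divergence."""
    if actual and actual[0] == "*":
        return 1  # front edge phone below threshold, no substitution
    if actual and actual[0] != expected[0]:
        return 2  # a confusable phone was checked (and matched) first
    return 3      # sequence broke at a lower level (SRS below threshold)

# Examples drawn from Table 4.20:
print(error_type([24, 12, 25, 15], ["*"]))        # 'today'  -> 1
print(error_type([30, 1, 42, 26], [31, 1, "*"]))  # 'think'  -> 2
print(error_type([12, 37, 14, 29], [12, "*"]))    # 'arrive' -> 3
```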
It was found that 4 words, or 14.81% of the 27 misrecognised words, were due to errors
of the first type; that is, the front edge phone was unrecognised because it achieved an
SRS value less than the system threshold (0.60). These words were 'today' from speaker
8, and 'heard', 'nature' and 'glass' from speaker 6.
Ten words, or 37.03% of the 27 misrecognised words, were due to errors of the second
type. These words were 'think' from speaker 6, 'thank' from speaker 8, 'school' from
speaker 8, 'general' from speaker 9, 'old' from speaker 9, 'over' from speaker 10, 'only'
from speaker 6, 'duty' from speaker 9, 'her' from speaker 7 and 'chair' from speaker 10.
For the words ‘think’ and ‘thank’, the confusion occurred at the first level, for the phonemic
class /θ/. Both words resulted in the same incorrect ID (31) being placed in the
accumulator, which led to an unknown path in the syntactic database. By following the
error tracking procedure described above, it was found that both fricatives /θ/ and /ð/ had
SRS and MRS values higher than the system threshold (0.60). The phone /ð/ has a higher
browsing priority than the phone /θ/, so it is always checked first. The syntactic
knowledge estimator therefore followed an incorrect branch into the second level of the
syntactic database and found no match. Hence, an error message was produced.
For the word ‘general’, the expected ID string is 39-3-41-37-12-45. The resulting ID string
in the accumulator is 25-3-41-37-*. The first ID indicates that confusion occurred
between the affricative /dʒ/ and the stop /d/, both of which have an SRS and MRS greater
than the system threshold (0.60). The phone /d/ (ID 25) was recognised instead of the
phone /dʒ/ (ID 39) because /d/ has priority over /dʒ/ at the front edge level of the
syntactic database. As these two phones have similar subsequent branches into their phone
subclasses, the system continued the recognition process until the fifth level. Similar
results occurred for the other words; therefore, the message ‘Unrecognised’ was generated.
The failure in the case of the words ‘old’, ‘over’ and ‘only’ was due to the IASC between
the phones /oʊ/ and /aʊ/; the system therefore produced the incorrect ID (17) instead of
the expected ID (18).
The second error type also occurred in the word ‘duty’, but at the second level rather than
the first. The corresponding ID string was 25-2-*. The adaptive phone recognisor
confused the vowel /i/ (ID 2) with the semivowel /j/ (ID 43), and because the vowel /i/ is
checked before the semivowel /j/ in the syntactic knowledge database, the sub-recognisor
for the vowel /i/ responded. The recognition process then failed at the third level because
no match was found for /t/ after the pattern /di/.
Another type of error occurred for the word ‘her’. No error message was generated for this
word, as the recognition process appeared to finalise successfully, but the second
ID was incorrect. The ID string found in the accumulator was 36-1, whereas 36-11 was
expected. The first ID is correct, but the second ID represents the vowel /ɪ/, which is often
confused with the vowel /ɜ/. An option at the third level after /hɪ/ is silence or end of
word, so the system assumed a correct match and no error message was generated.
There were 13 words, or 48.16%, in the third category of errors. Words in this category
were: ‘transaction’, ‘arrive’, ‘ago’, ‘almost’, ‘or’, ‘order’, ‘often’, ‘until’, ‘up’, ‘isn’t’,
‘describe’, ‘define’ and ‘discount’. For example, the word ‘arrive’ /əraɪv/, spoken by
speaker 8, passed the recognition process at the front edge level for the phone /ə/, but the
recognition process terminated at level 1 for the phone /r/, as this phone had an SRS less
than the system threshold and did not have any MRS with any preceding phone in level 1
of the syntactic database.
4.5.6 Experiment Three: Conclusion
This experiment showed that 73% of the set of words were correctly identified using
RUST-I. The overall performance of the system as an IWR was found to depend on
the performances of both the APR and the syntactic knowledge estimator. The APR
determined the ability of RUST-I to recognise the correct phone from its ACL signals.
This ability was affected by the value of the threshold, set for this experiment to 0.60.
The syntactic knowledge estimator defined the most likely order of phones occurring first
in a word, and also the most likely phone following a given pattern of phones. The
syntactic database of the syntactic knowledge estimator was defined by the method of
clustering the phonemic data, which in RUST-I was designed to originate from the most
likely first phone in a word and then follow the most likely phone given a pattern of
phones.
At the front edge of the recognition, i.e. the first phone in a word, the syntactic knowledge
estimator defines the order of likelihood as the statistical likelihood of a phone being first
in a word, which is a function of the database. The recognition rate at this level increases
with the size and applicability of the database; the syntactic knowledge estimator is not
operating optimally at this level. Once the syntactic knowledge estimator has correctly
identified the phone at this level, its usefulness comes into full effect, as it then has a
predefined set of phones, ranked in order of likelihood, through which it must browse,
including the 'correct' phone. In many cases misrecognition did not occur because phones
whose PIR scores exceeded the threshold were either not in the list to be checked or were
to be checked after the 'correct' phone. The deeper the system delves into the recognition
process, the fewer the options available for checking and the greater the likelihood of
recognition.
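The ranked browsing described above can be sketched in Python; the names and scores below are hypothetical illustrations, not taken from RUST:

```python
# Hypothetical sketch of ranked browsing with the 0.60 system threshold;
# names and scores are illustrative, not taken from RUST.

THRESHOLD = 0.60  # system threshold used in Experiment Three

def browse(candidates, scores, threshold=THRESHOLD):
    """Return the first candidate phone, in likelihood order, whose
    recognition score reaches the threshold, or None if the path breaks."""
    for phone in candidates:                 # candidates are pre-ranked
        if scores.get(phone, 0.0) >= threshold:
            return phone
    return None

# The 'think'/'thank' confusion: /dh/ is ranked before /th/ and both
# exceed the threshold, so the wrong branch is followed first.
ranked = ["dh", "th"]
scores = {"dh": 0.72, "th": 0.81}
assert browse(ranked, scores) == "dh"
```

This reproduces the second error type: a higher-priority phone above threshold is checked before the correct one.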
Table 4.21 Summary of error types.

              1st Type   2nd Type   3rd Type   Total
# of words        4         10         13        27
Error rate     14.81%     37.03%     48.16%     100%
As shown in Table 4.21, three types of errors occurred. The first type of error was due to
the low SRS value of the front edge phone; it could be reduced by improving the
performance of the APR and hence the SRS values. The second type of error was due to
some 'incorrect' phones having a higher MRS than the SRS achieved by the correct phone.
These two types of error could be reduced by improving the performance of the APR
and/or improving the mechanism for selecting and using the threshold in the recognition
process. Together they were responsible for 51.84% of the errors produced by the system.
The third category of error involved an SRS below the threshold at lower levels, which
caused the recognition sequence to be terminated at some stage in the process. This
type of failure was responsible for 48.16% of the total errors.
Chapter 5: Implementation of Incremental Learning Neural
Networks (RUST-II)
5.0 Introduction
This chapter deals with the development of RUST-I to incorporate incremental
learning into the standard back-propagation network used so far. Adding incremental
learning to the standard back-propagation neural network of the APR (of RUST-II) is an
attempt to investigate the performance of incremental learning for speech recognition. It
will be shown that incremental learning improves the system's capability and
performance. It is also expected that the system will be able to adapt more readily to new
speech input without the need to run additional training sessions.
In this chapter an incremental learning algorithm is presented and tested based on
a modified version of the previous APR. To allow a fair performance comparison, the
standard speech database TIMIT has been used, and some minor changes were made to
the structure of the input representation to the adaptive phone recognisor and to the
syntactical knowledge.
Section 5.1 presents the TIMIT speech data file structure, corpus selection, speech
segmentation, feature extraction and input data preparation. Section 5.2 describes the
modifications made to the APR structure to fit the new speech database and the new
speech feature vector, and details the incorporation of incremental learning in the
back-propagation network, including the weight selection method of the MLWA. The
experimental procedure and the results are presented in Section 5.3. Discussion of the
new APR experiments is given in Section 5.4, and Section 5.5 concludes the chapter.
5.1 Speech Corpus
5.1.0 Background
RUST-I was built around a non-standard speech database. The use of a non-standard
speech database had side effects on the system, particularly on its reliability and
performance. Among these were the limited pool of speakers, speaker factors, the number
of speakers, the number of intakes from each speaker and speaker dialect. In particular,
the number of intakes from each speaker and the number of available speakers limited
the system's functionality, making it appear to perform multi-speaker recognition rather
than speaker-independent recognition. This can be seen from the results of Experiment
Two in Section 4.4, where the system achieved better SRS results for speaker 9 compared
with the results achieved by the other four speakers of the test set (see Table 4.12).
Among the many speech databases available for speech processing research, the TIMIT
speech corpus was chosen as a standard speech database for its wide variety of speakers,
dialects, genders, vocabularies and sentences.
5.1.1 TIMIT Database
TIMIT provides speech data for the acquisition of acoustic-phonetic knowledge. There
are 6300 sentences in TIMIT, spoken by 630 male and female speakers from 8 major
dialect regions of the United States. The dialect region refers to the geographical area
where the speaker lived during their childhood years.
The text material in TIMIT contains 2 sentences designed to reveal the identity of
the dialect, 450 phonetically compact sentences and 1890 phonetically diverse
sentences. Additional information can be found in the printed documentation that
accompanies the database CD.
5.1.2 Corpus Selection
25 speakers from TIMIT were chosen to form the core training and testing set for the
system: 5 females and 20 males. All contributing speakers were drawn from three of the
main dialect regions of American English. 3 of them are from the dialect region of
New England (referred to as DR1); 19 speakers are from the western region (referred to
as DR7), where the dialect boundaries are not known with any confidence (TIMIT
documentation); and the remaining speakers belong to the dialect region DR8, comprising
speakers who moved around a lot during their childhood. This coverage of dialects
ensures a wider diversity of the phone patterns introduced to the system. Table 5.1
summarises information on the chosen speakers and their contribution to the system
lexicon, syntactic knowledge and language model.
Table 5.1 Abstracted information on the chosen speakers.

Dialect Region   Number of speakers   Number of sentences   Number of words   Number of phones
DR1                      3                   30                  247                933
DR7                     19                   31                  284               1057
DR8                      3                   14                  106                450
Total                   25                   75                  637               2440
15 of the chosen sentences contain repeated utterances of the 2 distinctive sentences
referred to in TIMIT as "shibboleth" sentences, which are designed to reflect the dialect
of the speaker. Many of the speakers involved in this system were chosen to produce
these two sentences, in order to reveal the colour of the speaker's dialect for building up
accent knowledge. 35 sentences of the set are, according to TIMIT, phonetically compact,
in that they were designed to provide coverage of pairs of phones with extra occurrences
of phonetic contexts. The last 25 sentences are classified as phonetically diverse (TIMIT
documentation); they are meant to add diversity in sentence types and phonetic contexts
so as to maximise the variety of allophonic contexts.
As TIMIT is acquired from American speakers, the phonemic set used in Table 3.1 is no
longer valid, and the system was updated to accommodate the phonemic and phonetic
symbols used in the TIMIT lexicon. These include two stress marks and the closure
intervals of stops, which are distinguished from the stop release by adding 'cl' to the stop
symbol, e.g. the stop / t / has the closure phone / tcl /. By testing these phones
perceptually, it was found that some of the closures were temporally too short; therefore
many of them were integrated with the original stop to form a whole phone segment to
be introduced to the system. Some phones are dependent on the speaker, dialect,
speaking rate and phonemic context. These phones had a lower number of occurrences
and therefore a lower number of samples in the system lexicon. They are:
• Flap / dx / such as in the word “dirty”.
• Nasal flap / nx / as in “winner”.
• Glottal stop / q /, which may be an allophone of / t /, or may mark an initial vowel or
a vowel-vowel boundary.
• Fronted / u /, i.e. / ux /.
• Very short devoiced vowel / ax-h /, typically occurring in reduced vowels
surrounded by voiceless consonants.
• Other symbols include two types of silence: / pau / (pause) and / epi /, denoting
epenthetic silence, which is often found between a fricative and a semivowel or
nasal. / h# / is used to mark the silence and/or non-speech events found at the
beginning and end of the signal.
TIMIT is a large database; therefore, when searching for a specific piece of data for a
quick match or extraction, it was more convenient to produce a search engine to carry
out the search accurately and efficiently. The search engine was coded in C++, but kept
simple to migrate by relying on many features of the C language. The program offers
three search options:
1. Speaker details inquiry.
2. Speaker-dependent search.
3. Lexical-dependent search.
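The three search options can be illustrated with a minimal Python sketch; the data layout, speaker IDs and function names are assumptions for illustration only (the thesis tool was written in C++/C):

```python
# Illustrative sketch of the three search options; the data layout,
# speaker IDs and function names are assumptions, not the thesis code.

corpus = {
    "spk01": {"dialect": "DR1", "sex": "F",
              "sentences": {"sa1": ["she", "had", "your", "dark", "suit"]}},
}

def speaker_details(spk):                    # option 1: speaker details inquiry
    return corpus[spk]["dialect"], corpus[spk]["sex"]

def speaker_sentences(spk):                  # option 2: speaker-dependent search
    return sorted(corpus[spk]["sentences"])

def find_word(word):                         # option 3: lexical-dependent search
    return sorted(spk for spk, rec in corpus.items()
                  if any(word in ws for ws in rec["sentences"].values()))

assert find_word("dark") == ["spk01"]
```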
Table 5.2 Updated phonemic symbol code.

Phone type     Phone symbol (numeric representation)
Vowels         iy (1), ih (2), eh (3), ey (4), ae (5), aa (6), aw (7), ay (8), ah (9),
               ao (10), oy (11), ow (12), uh (13), uw (14), ux (15), er (16), ax (17),
               ix (18), axr (19), ax-h (20)
Semivowels     l (21), r (22), w (23), y (24), hh (25), hv (26), el (27)
Nasals         m (28), n (29), ng (30), em (31), en (32), eng (33), nx (34)
Fricatives     s (34), sh (35), z (36), zh (37), f (38), th (39), v (40), dh (41)
Affricatives   ch (42), jh (43)
Stops          b (44), d (45), g (46), p (47), t (48), k (49), q (50), dx (51)
Silence        pau (52), epi (53), h# (54)
5.1.3 Phone Segmentation and Feature Extraction
Speech data provided by TIMIT is recorded in .wav files in the SPHERE-headed format.
To be able to process the waveform files using MATLAB® they must be in Windows®
.wav format. Therefore, the SPHERE files were converted to WAV format, which also
made them playable in Windows Media Player for the perceptual tests. A MATLAB
script was written for this purpose and run successfully, and all the selected data files
were converted.
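The conversion step can be sketched as below, assuming the usual TIMIT layout of a fixed 1024-byte NIST_1A ASCII header followed by 16-bit mono PCM at 16 kHz. The thesis used a MATLAB script, so this Python version is purely illustrative and omits header parsing and error handling:

```python
# Illustrative SPHERE -> WAV conversion, assuming the usual TIMIT layout:
# a fixed 1024-byte NIST_1A ASCII header followed by 16-bit mono PCM at
# 16 kHz. The thesis used a MATLAB script; error handling is omitted here.
import wave

def sphere_to_wav(sph_bytes, wav_path, rate=16000):
    assert sph_bytes.startswith(b"NIST_1A"), "not a SPHERE file"
    pcm = sph_bytes[1024:]           # skip the fixed-size ASCII header
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)            # TIMIT is mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
```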
TIMIT provides phone boundary information, and phone segmentation was first
performed based on this information. The boundary between some phones in the samples
is not distinctive from the signal point of view, leading to overlapping periods around the
boundaries. Hence, a second stage called phonemic amalgamation was applied to some
phones, taking one phone set as a unique cluster within a larger phone set to create a
larger learning set. For example, the sets / b / and / bcl / were merged to produce a new
learning set called / b /. This is aimed at achieving a wider variety of forms of the
particular phone in the phonemic knowledge. The amalgamation process resulted in more
learning sessions to run and more complex work to be performed on the MLWA side, but
it was rewarding in reducing the overall size of the APR. (The number of distinctive
phones in the phonemic knowledge was chosen to be 54.)
A MATLAB script was developed to perform automatic phonemic segmentation along
with the phonemic amalgamation for each sentence. A total of 2440 samples (of 51
phones) were extracted and saved individually in text files. The segmentation process
was completed based on the phone boundary information provided by TIMIT. Once
extracted, each phone was subjected to an individual perceptual test to verify its identity.
Samples of the same phone vary with their occurrences (in sentences) and with the
speakers' genders and dialects.
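The two-stage procedure above (boundary-based segmentation followed by closure amalgamation) can be sketched as follows; the 'start end label' line format follows TIMIT .phn transcription files, while the function name and the merge rule shown are illustrative:

```python
# Sketch of boundary-based segmentation with closure amalgamation: a closure
# such as /bcl/ is merged with the following release /b/ into one segment.
# The 'start end label' format follows TIMIT .phn files; the function name
# and the merge table are illustrative.

CLOSURES = {"bcl": "b", "dcl": "d", "gcl": "g",
            "pcl": "p", "tcl": "t", "kcl": "k"}

def segment(phn_lines):
    """Parse 'start end label' lines and merge closure+release pairs."""
    segs = [(int(s), int(e), lab)
            for s, e, lab in (line.split() for line in phn_lines)]
    merged, i = [], 0
    while i < len(segs):
        s, e, lab = segs[i]
        if lab in CLOSURES and i + 1 < len(segs) and segs[i + 1][2] == CLOSURES[lab]:
            merged.append((s, segs[i + 1][1], CLOSURES[lab]))  # one whole stop
            i += 2
        else:
            merged.append((s, e, lab))
            i += 1
    return merged

assert segment(["0 100 bcl", "100 180 b"]) == [(0, 180, "b")]
```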
As in the previous version of the system, the speech features used as inputs to the neural
networks were the Mel-scale spectrum coefficients discussed in Chapter 2. In this
chapter, the number of coefficients was increased from 12 to 17 to improve the accuracy
of the Mel filters. To simplify the MFCC vector while not losing any information, the
sample number at each of the peaks in the spectrum was used to represent each MFCC
coefficient. The aim is to produce a meaningful representation of information related to
the vocal tract. The filter model of the vocal tract provides such information in its
response shape and transfer function. Cepstrum analysis provides such a representation
near its origin; the MFCCs therefore provide this critical information by smoothing the
spectrum envelope and revealing the first four formants of the signal spectrum.
All the segmented data were saved in text files and passed to the MATLAB feature
extraction scripts, where the speech data were preconditioned and cepstrally analysed,
and the MFCCs were produced and saved in text files. Figure 5.1 shows an example for
the phone / s /, where the feature extraction script displays the phone under processing in
various domains, showing graphically the sequence of the feature extraction process.
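A minimal MFCC computation of the kind described can be sketched in Python with NumPy; the exact windowing, filter shapes and peak-based coefficient selection used in the thesis' MATLAB scripts are not reproduced, so this is a generic 17-coefficient sketch:

```python
# Generic 17-coefficient MFCC sketch (Hamming window, triangular mel
# filterbank, log energies, DCT-II). Parameter choices are illustrative
# and do not reproduce the thesis' MATLAB implementation.
import numpy as np

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, rate=16000, n_filters=17, n_fft=512):
    """Return n_filters MFCCs for one pre-segmented phone frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    # triangular mel filterbank between 0 Hz and the Nyquist frequency
    pts = imel(np.linspace(mel(0.0), mel(rate / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / rate).astype(int)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, hi):
            w = (k - lo) / max(c - lo, 1) if k < c else (hi - k) / max(hi - c, 1)
            energies[i] += w * spec[k]
    logE = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    return np.array([np.sum(logE * np.cos(np.pi * q * (2 * n + 1) /
                                          (2 * n_filters)))
                     for q in range(n_filters)])
```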
5.1.4 Preparation of Data for Neural Networks Input
To prepare the .mfc file data, some processing procedures are needed. Firstly, the data
in the .mfc file have to be organised in an n x m matrix, where each of the n rows
contains m = 17 columns, which are the MFCC elements of the particular phone. By
observing the data resulting from the data extraction script, it was found that numerous
numerical values were significantly larger than 1. Because the output of the network is a
decision represented as a numerical value in the range 0 to 1, the input to the networks
had to be normalised to avoid constant saturation at the network output. This process was
carried out over all the input data.
In addition, the input contained many negative values, which usually produce incorrect
misfiring cases in the network. Therefore a process of mirror reflection was carried out
over all the input data vectors in order to promote the firing of the network's PEs.
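One possible reading of these two preconditioning steps is sketched below; the interpretation of "mirror reflection" as reflecting negative values about zero is an assumption:

```python
# One possible reading of the preconditioning: scale so |x| <= 1 to avoid
# output saturation, then reflect negative values about zero ('mirror
# reflection') to promote firing. This interpretation is an assumption.
import numpy as np

def precondition(X):
    X = np.asarray(X, dtype=float)
    X = X / np.max(np.abs(X))    # normalise into [-1, 1]
    return np.abs(X)             # mirror-reflect negatives to positives

M = precondition([[3.0, -6.0], [1.5, 0.0]])
assert M.max() <= 1.0 and M.min() >= 0.0
```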
Figure 5.1 Feature extraction from the phone /s/.
The processed data were all saved in files identified by their phone contents, where each
input data file contains all the data produced for that particular phone. These files were
ready to be presented to the standard neural networks for training and testing, and also to
the network with the incremental learning method for experiment.
5.2 Modification of the APR to Include Incremental Learning Neural
Networks
Incremental learning is suitable for speech recognition because the signal changes from
speaker to speaker and from time to time, even for the same speaker. Although
incremental learning has been applied to speech enhancement (Deng et al. 2003), very
little research has been reported in the literature on the use of the incremental learning
technique for speech recognition.
In this section, we propose to implement a feed-forward incremental learning algorithm
(Darjazini, Cheng and Liyana-pathirana, 2006) based on the hybrid knowledge method
developed by Darjazini and Tibbitts (1994). This approach is novel in that it develops
and applies a modified method of the incremental learning algorithm to the problem of
speech recognition. Previously, incremental learning was mostly designed and tested for
pattern recognition problems (Chakraborty and Pal, 2003; Polikar et al. 2001; Vo 1994;
Wang and Yuwono, 1996).
It was shown in Figures 3.12 and 3.13 (Chapter 3) that the APR is based on a comb of
feed-forward neural networks with a back-propagation learning algorithm. Each of these
neural nets is referred to as a sub-recognisor, and each is specialised in the recognition of
an individual phone. The incremental learning algorithm picks up new information from
unknown input data and uses it to adapt the sub-recognisor to new changes in the input
without further training.
5.2.0 Weight Selection Algorithm
The weight-selection algorithm is based on a method for speech recognition that employs
a comb of phone sub-recognisors (Darjazini and Tibbitts 1994). As shown in Figure 5.2,
the method employs 55 sub-recognisors: 54 for the recognition of the 54 phones and one
dedicated to the silence period. All the sub-recognisors are implemented using an
identical feed-forward neural network (FF-NN) structure. Each sub-recognisor has an
output referred to as the Phone Identification Response (PIR), a continuous variable
between 0 and 1.
A sub-recognisor indicates that the input speech contains a specific phone when the
value of its PIR is close to 1. The tolerance in the network is set to 0.05; therefore, each
PIR with a value greater than or equal to 0.95 is taken as an indication of a potential
match.
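The comb decision can be sketched as follows, using the 0.95 PIR criterion stated above; the function name and the toy PIR values are illustrative:

```python
# Sketch of the comb decision: each sub-recognisor emits a PIR in [0, 1]
# and a phone is proposed when PIR >= 0.95 (tolerance 0.05). The function
# name and the example PIR values are illustrative.

PIR_MATCH = 0.95

def comb_decision(pirs):
    """pirs: dict phone -> PIR. Return the best matching phone, or None."""
    matches = {p: v for p, v in pirs.items() if v >= PIR_MATCH}
    return max(matches, key=matches.get) if matches else None

assert comb_decision({"s": 0.97, "sh": 0.91, "z": 0.96}) == "s"
assert comb_decision({"s": 0.90}) is None
```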
141
Figure 5.2 The modified structure of the APR.
In contrast with the previous back-propagation learning algorithm, the incremental
learning algorithm extracts a new weight matrix (WM) from a new data set during
recognition. In this algorithm, the updated sub-recognisor contains two phases of
back-propagation instead of one. At the initial run, the network behaves as a normal
back-propagation network, the same as the previous sub-recognisor. At subsequent runs
the network performs the incremental learning process: it first runs using the previous
weight matrix, and a measure is applied at the output. If the error is greater than the
maximum allowed error, the process is terminated and the phone flagged as
unrecognised. If the error is less than the maximum allowed error but the output is lower
than the minimum acceptable output, the incremental learning phase starts. The goal of
the incremental learning phase is to achieve an acceptable value at the output layer by
adjusting the weight matrix using an adaptive learning rate. When this is achieved, the
new weight matrix is saved and sent to the MLWA (see Figure 5.3) for later reference.
This procedure can be outlined as follows:
1. Initial run: normal back-propagation learning algorithm.
2. Subsequent runs: input presented.
3. Previous weight matrix used: if the output is acceptable, then the recognition flag is
set and the process stops.
4. Else, if the resulting error ≥ the maximum allowed error, then a mis-recognition
message is flagged.
5. Else, if the resulting error ≤ the maximum allowed error and the resulting error ≥ the
minimum allowed error, then the error is back-propagated locally at the output layer
and globally over the net to adjust the weights, making use of an adaptive learning
rate.
6. When convergence is achieved, the new weight matrix is saved and sent to the
MLWA.
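Steps 1-6 reduce to a small piece of control flow. In the sketch below the threshold values are illustrative stand-ins (not the tuned system values), and the function name is hypothetical:

```python
# Control-flow sketch of steps 1-6. Thresholds are illustrative stand-ins,
# not the tuned system values; classify_step is a hypothetical name.

MAX_ERROR = 0.20   # maximum allowed error
MIN_ERROR = 0.01   # errors below this are accepted outright

def classify_step(error, max_err=MAX_ERROR, min_err=MIN_ERROR):
    """Decide what to do with the error measured at the output (step 3)."""
    if error < min_err:
        return "recognised"      # output acceptable: set flag, stop (step 3)
    if error >= max_err:
        return "unrecognised"    # flag a mis-recognition message (step 4)
    return "incremental"         # back-propagate and adapt weights (step 5)

assert classify_step(0.005) == "recognised"
assert classify_step(0.50) == "unrecognised"
assert classify_step(0.10) == "incremental"
```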
In subsequent recognition, the new set, as well as all the existing sets, is tested as a
potential weight matrix candidate in the FF-NN. The weight matrix that produces the
highest value of PIR is selected as the most recent updated weight matrix. This function
is performed by the Most Likelihood Weight Activator (MLWA) unit, shown in
Figure 5.3.
Figure 5.3 Selection of the weight matrix for incremental learning.
In Figure 5.3, WM1 is obtained from the initial training session, i.e. in the early stages of
incremental learning. Subsequent WMs, along with WM1, are kept in statistical order in
the MLWA, and the most probable WM is the weight matrix that is used most often; the
other sets come into use more frequently later on.
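The MLWA selection can be sketched as an argmax over the candidate weight matrices; `run_net` and the toy one-weight "network" below are stand-ins for the FF-NN:

```python
# Sketch of the MLWA: every stored weight matrix is tried in the FF-NN and
# the one yielding the highest PIR is selected. run_net and the toy
# dot-product 'network' are stand-ins for the real sub-recognisor.

def mlwa_select(weight_matrices, run_net, x):
    """Return (index, pir) of the weight matrix giving the largest PIR."""
    pirs = [run_net(wm, x) for wm in weight_matrices]
    best = max(range(len(pirs)), key=pirs.__getitem__)
    return best, pirs[best]

# toy stand-in: 'network' output is a dot product squashed into (0, 1)
squash = lambda v: 1.0 / (1.0 + 2.718281828459045 ** (-v))
run = lambda wm, x: squash(sum(w * xi for w, xi in zip(wm, x)))
idx, pir = mlwa_select([[0.1, 0.1], [2.0, 2.0]], run, [1.0, 1.0])
assert idx == 1
```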
The original sub-recognisor was also adjusted to fit the new TIMIT speech data, and the
dimension of the feature vector extracted from the speech samples was therefore updated
to 17 (the elements of the MFCC). Figure 5.4 shows the new multi-layer neural network
topology of the sub-recognisor. The input layer contains 17 processing elements (PEs)
used to receive the 17 input elements, which represent the Mel-scale frequency
coefficients (MFCC) of the corresponding phone. In this structure, the input layer acts as
a buffer to the subsequent hidden layers. There are three hidden layers, H1, H2 and H3,
containing 34, 51 and 34 PEs respectively. The output layer contains one PE representing
a measure of the match between the input speech (stimulus) and a particular phone.
Figure 5.4 Structure of new sub-recognisor.
In the first, feed-forward phase, all the current output values are computed and the final
output is compared with the target value. At this point the network performs learning
using a constant learning rate. In the backward phase, the error is propagated through the
network and the weights are adjusted; the new weight of the output layer is computed
such that the change in the weight accelerates the convergence towards the lowest
possible error.
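The 17-34-51-34-1 topology can be sketched as a plain NumPy forward pass; the random weights and the choice of a sigmoid at every layer are assumptions for illustration:

```python
# NumPy sketch of the 17-34-51-34-1 sub-recognisor. Random weights and
# the use of a sigmoid at every layer are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
sizes = [17, 34, 51, 34, 1]                      # input, H1, H2, H3, output
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Propagate one 17-element MFCC vector; return a scalar PIR in (0, 1)."""
    a = np.asarray(x, dtype=float)               # input layer acts as a buffer
    for W in weights:
        a = sigmoid(a @ W)
    return float(a[0])

pir = forward(np.ones(17))
assert 0.0 < pir < 1.0
```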
At the output layer, the adjustment of the output PE weight can be formulated as
follows:

ΔW_{3.out} = −η_{3.out} ∂ε²/∂W_{3.out}
           = −η_{3.out} (∂ε²/∂φ_{out}) (∂φ_{out}/∂I_{3.out}) (∂I_{3.out}/∂W_{3.out})
           = η_{3.out} · 2(T − φ_{out}) φ_{out} (1 − φ_{out}) φ_{3j}
           = η_{3.out} δ_{3.out} φ_{3j}                                      (5.3)
where j is the order of the PE in the third hidden layer, φ_{3j} is the weighted input of the
jth PE, and δ_{3.out} = 2(T − φ_{out}) φ_{out} (1 − φ_{out}). Therefore, the new weight at
the output layer can be determined from

W_{3.out}(N+1) = W_{3.out}(N) + η_{3.out} δ_{3.out} φ_{3j}                   (5.4)

where W_{3.out}(N+1) and W_{3.out}(N) are the weight vectors in the (N+1)th and Nth
iterations and η_{3.out} is the fixed learning rate used in the first phase.
After achieving final convergence in the first phase, the network performs the
incremental learning phase for any subsequent input. In this stage, the learning rate
η_{3.out} in (5.4) is made adaptive to achieve fast convergence. It increases if the
successive changes in the weight are in the same direction and have a positive value, and
decreases otherwise. This adaptation ensures that the largest decrease in error is obtained
in each iteration. The learning rate adaptation is formulated as:

η_{N+1} = η_N + Δ(ε²)/ΔW                                                     (5.5)

where η_{N+1} and η_N are the learning rates in the (N+1)th and Nth iterations.
By substituting η_{3.out} with η_{N+1} from (5.5) into (5.4), the weight adjustment for the
incremental learning at the output end can be formulated as

W_{3.out}(N+1) = W_{3.out}(N) + (η_N + Δ(ε²)/ΔW) δ_{3.out} φ_{3j}            (5.6)
5.3 Experiment and Results
The input data were extracted from 75 spoken sentences of the TIMIT speech database,
as described in Section 5.1. The sentences were spoken by 25 speakers (5 female and 20
male). Each speaker possesses one of three main dialects of American English, and the
dialects were chosen arbitrarily. The data were mixed to produce as much variety as
possible for every phone, so that each sub-recognisor was exposed to, and had to deal
with, the most varied forms of the same phone. In the primitive representation of the
input data, 54 distinct phones appeared in 2440 samples, which were segmented from
637 words. Table 5.3 shows these phones and their number of occurrences in the
sentences.
Table 5.3 Phone set used in the learning session and the number of samples of each.

Phone  Samples   Phone  Samples   Phone  Samples   Phone  Samples
ch       12      eI       18      q        64      eh       57
jh       15      hh       15      t        85      er       37
dh       48      hv       24      tcl      22      ey       46
f        33      L        82      aa       64      ih       91
s       126      r        87      ae       69      ix      136
sh       38      w        43      ah       34      iy      112
th       10      y        24      ao       41      ow       38
v        40      b        43      aw       15      oy       11
z        42      bcl       3      ax       75      uh        9
em        3      d        59      axh       7      uw        6
en       15      dcl      13      axr      41      ux       28
eng       1      dx       44      ay       31
m        73      g        23
n       137      gcl       3
ng       23      k        87
nx       13      kcl      13
epi      21      p        51
h#        1
pau      22
Experiments were performed by first initiating (first run) the sub-recognisors using
the back-propagation learning algorithm and applying the Delta rule. The exit condition
of this session was the number of iterations, which was set at 500 (as, at the beginning,
the number of iterations required to achieve convergence was unknown), and the
learning rates at the hidden layers and the output layer were all initialised to 0.5. The
weights were initialised to random, normally distributed values, and the learning set
contained non-clustered stimuli. The maximum accepted error was 0.01 and the
incremental learning width was 0.2, i.e. the range was from 0.97 to 0.77.
The initial session provides the first weight matrix (WM1) for the MLWA and
determines the first cluster of the input data. The number of phone samples for the initial
session was in this case 15. In this trial, the sub-recognisor converged towards the
target (1) after 50 epochs. In each epoch, the network manipulated the inner weights of
the hidden layers. An error monitor was set to measure the value of the Mean Squared
Error (MSE) at each hidden layer and at the output, to monitor the effect of a particular
PE's performance on the overall result of the entire network. The accuracy of the PIR
was within an error value of 0.01, which is below the tolerance value of 0.05. The overall
performance on the initial learning set was 94.44% accuracy.
Figure 5.5 illustrates the performance of the sub-recognisor in the initial session: Figure
5.5A shows the mean square error (MSE) graph and Figure 5.5B shows the PIR values at
the end of the initial session. It can be noted that the network converged successfully
within a short time, measured at about 50 epochs. The rest of the samples were presented
to the network in the incremental learning stage, where the performance was close to
perfect at 99.20%. The failed cases were samples that produced PIR values outside the
incremental learning range.
Figure 5.5 The Sub-recognisor performance in the initial session.
5.4. Discussion
In the initiation session, some of the phone sets required up to 13 trials to achieve
convergence. This was partially due to the wide range of phone types used in the input
data; the diversity of the input data resulted in wide distances between some of the
stimuli presented to the network.
Figure 5.6 illustrates two example trials on the phone / s /. Figure 5.6(a) shows the
MSE and the PIR for one of the non-converged trials. When this occurred, the trial was
restarted (based on a new randomly generated weight matrix), and eventually
convergence was achieved, as shown in Figure 5.6(b).
(a)
Figure 5.6 Recognition experiments of the phone /s/.
(b)
Figure 5.6 Recognition experiments of the phone /s/.
5.5 Conclusion
The proposed incremental learning algorithm allows the phonemic knowledge of the
APR of RUST-II and its sub-recognisors to be updated without the system losing its
original phonemic knowledge or suffering from catastrophic forgetting. RUST-II has
demonstrated excellent performance. It is worth noting that the incremental learning
range is a critical parameter for the system performance and has to be predetermined. A
wrong range could result in false recognition of phonemically adjacent phones, and may
lead to a situation of catastrophic forgetting.
Experiments under similar conditions on the two versions of the system showed that
RUST achieved a significant improvement in performance: the earlier version achieved
an accuracy of 76%, and that accuracy was speaker dependent (see Table 4.13), whereas
the recognition accuracy improved significantly to 94.44% in the incremental learning
version.
New syntactical knowledge can be obtained from the TIMIT database. The use of such
knowledge is known to improve the performance of speech recognition. Its incorporation
into the system developed in this chapter has not been addressed, due to time constraints,
and can be a topic of future research.
Chapter 6: Conclusion and Future Work
6.1 Conclusion
In this thesis, a hybrid Speech Recognition (SR) system called RUST (Recognition Using
Syntactical Tree) was developed. The system combined Artificial Neural Networks (ANN)
with a Statistical Knowledge Source (SKS) for a small topic-focused database.
RUST has the capacity to implement two basic levels of speech knowledge represented
statistically. The first is phonemic knowledge, in the form of the likelihood of occurrence of
phones in words. The second is primary syntactic knowledge, in the form of the likelihood of
occurrence of phones in sentences or sequences of words. The syntactic knowledge is
primitive in that it only addresses the probability of a phone in a series of topic-related words,
and the key to the process is the probability and recognition of the onset phones in a
sentence. RUST has two versions. In the first version (RUST-I), the lexicon was developed
with 1357 words, of which 541 are unique. These words were extracted from three topics
(finance, physics and general reading material), and could be expanded or reduced
(specialised). The second version (RUST-II) has a modified APR to suit speech data
extracted from the TIMIT speech database, and its lexicon consists of 673 words.
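The phonemic statistic described above (the likelihood of occurrence of phones in words) can be estimated from any phone-transcribed lexicon. The function name and the toy two-word lexicon below are assumptions made purely for illustration; the thesis derives its values from the UWS and TIMIT databases.

```python
from collections import Counter

def phone_probabilities(lexicon):
    """Relative frequency of each phone across all words in the lexicon."""
    counts = Counter(p for phones in lexicon.values() for p in phones)
    total = sum(counts.values())
    return {phone: n / total for phone, n in counts.items()}

# Toy lexicon (made up, not from the RUST databases).
lexicon = {"cat": ["k", "ae", "t"], "kit": ["k", "I", "t"]}
probs = phone_probabilities(lexicon)   # e.g. probs["k"] == 2/6
```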
Three experiments have been carried out on RUST-I. The first two experiments examined the
operation of the system as an isolated phone recognisor and the third experiment tested the
operation of the system as an isolated word recognisor.
The first experiment showed that the average Self-Recognition Score (SRS) across subgroups
was highest for vowels and lowest for affricatives. The SRS also varied across speakers within
the testing set, with the highest average SRS occurring for speaker 9 and the lowest for speaker
6. The system consistently recognised all the phones of all the speakers. This experiment
showed that the adaptive phone recognisor performed reasonably well as an isolated phone
recognisor.
The second experiment showed that, over all speakers and phones (totalling 225 tokens), the
numbers of SRSs greater than the three thresholds (0.5, 0.6 and 0.7) were 225, 172 and 160
tokens respectively. A threshold of 0.60 was selected from these results as it balances low
mis-recognition against high self-recognition.
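The threshold choice can be pictured as a simple count over the candidate values. The scoring rule below (self-recognitions kept minus mis-recognitions let through, equally weighted) is an assumption for illustration, not the selection criterion used in the thesis.

```python
def choose_threshold(srs, mrs, candidates):
    """Pick the candidate threshold that keeps the most self-recognition
    scores (SRS) above it while letting the fewest mis-recognition
    scores (MRS) through."""
    def score(t):
        kept = sum(s > t for s in srs)      # self-recognitions retained
        leaked = sum(m > t for m in mrs)    # mis-recognitions passing
        return kept - leaked
    return max(candidates, key=score)
```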
Of the 100 words applied in the third experiment, 73% were successfully recognised. 91%
of the front-edge phones were recognised successfully, and at the next level of the syntactic
knowledge database an 82% phone recognition rate was achieved. Inclusion of the syntactic
knowledge estimator was shown to eliminate 89.05% of the APR mis-recognitions; that is,
89.05% of the recognised words required the support of the syntactic knowledge estimator
in their recognition.
An analysis of the 27% mis-recognised words identified three reasons for failure in the
recognition process. The first category of errors occurred when the SRS for some
sub-recognisors was below the threshold; this accounted for 14.81% of all the mis-recognised
words. The second category occurred when the MRS for some sub-recognisors at the
front-edge level was higher than the system threshold and the probability of occurrence of the
mis-recognised phone within the ordering structure of the syntactic knowledge was greater
than that of the correct phone; this accounted for 37.03% of all the mis-recognised words. The
third category occurred when the SRS for the phone at levels below the front-edge level of the
syntactic knowledge was lower than the MRS for other phones at the same level; this
accounted for 48.16% of all the mis-recognised words. Together, the three experiments
demonstrated a road map for achieving better recognition using an ANN in combination with
the appropriate knowledge.
In RUST-I, the speech database used was a non-standard Australian speech database (the
UWS speech database). The other speech corpus used in RUST is the TIMIT speech
database, which was trialled with RUST-II in Chapter 5. In applying TIMIT, some
adjustments were required to the syntactic knowledge estimator and the APR to
accommodate a standard speech database with a different phonemic set and pronunciation.
C++ code was developed to browse the TIMIT database and extract the information
required for the implementation of RUST-II.
RUST-II demonstrated excellent recognition results with its APR updated to use the
incremental learning algorithm. The application of the incremental learning algorithm to
the APR led to a significant improvement in the system: experiments showed recognition
rates of up to 94.44% at the phone level.
6.2 Future Work
RUST-I represents phonemic knowledge both in the overall structure of the APR and in the
statistical knowledge source of the syntactic knowledge estimator. The performance of
RUST (I and II) depended firstly on the accuracy of the browsing system, and hence on the
probabilistic order of priority in the syntactic knowledge, and secondly on the performance
of the syntactic knowledge estimator (SKE). It is essential that the probabilistic
representation be as accurate as possible. Greater success could also be achieved in RUST
(I and II) by optimising the APR's ability to achieve a higher SRS relative to the MRS.
RUST has the potential to be upgraded to recognise continuous speech by accommodating
higher syntactic (sentence level), semantic and pragmatic knowledge sources. One way
RUST can be expanded to continuous speech is to include probabilistic forms of words given
the patterns of occurrences of other words within a sentence. Additional sources of
knowledge such as intonation patterns, common co-articulation patterns and common rules
of grammar can also be included to improve RUST’s performance.
One aspect of the sentence structure incorporated into RUST is the probability of a word
being first in a sentence. This probability was calculated as part of the low level syntactic
knowledge representation and is used to assist in the recognition of words presented to the
system.
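This onset statistic can be estimated by counting sentence-initial words in a corpus. The function name and the toy tokenised sentences below are illustrative assumptions, not the thesis implementation.

```python
from collections import Counter

def sentence_onset_probabilities(sentences):
    """Probability of each word occurring first in a sentence,
    estimated from a list of tokenised sentences."""
    first = Counter(s[0] for s in sentences if s)   # first word of each sentence
    total = sum(first.values())
    return {word: n / total for word, n in first.items()}
```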
The performance and usefulness of RUST could also be improved by providing the syntactic
knowledge estimator with a mechanism for the self-learning of new words as they are added
to the system. The self-learning can use a high level of linguistic knowledge to determine if a
sequence of phones is grammatically, colloquially and semantically possible.
The APR must be efficient enough to include the correct phone in the list of possible
phones; the syntactic knowledge estimator then needs to determine the most likely phone
from that list. Presently the system selects the first phone that exceeds the system threshold
as the "correct phone". This technique has a limitation that leads to some avoidable errors
in performance. The problem can be solved by applying different threshold values for
different sub-recognisors or phonemic subgroups, and by using a selector that determines
the maximum response from the list of syntactically likely phones and works through them
in descending order.
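The proposed selector might look like the sketch below, which ranks the syntactically likely phones and works through them in descending order under per-phone thresholds. The ranking rule (network response weighted by syntactic likelihood) and all names are assumptions made for the example.

```python
def select_phone(responses, priors, thresholds, default_threshold=0.6):
    """Return the highest-ranked phone whose sub-recognisor response
    exceeds its own threshold, or None if no candidate qualifies.

    responses:  phone -> sub-recognisor output
    priors:     phone -> syntactic likelihood of the phone
    thresholds: phone -> per-phone (or per-subgroup) threshold
    """
    ranked = sorted(responses,
                    key=lambda p: responses[p] * priors.get(p, 0.0),
                    reverse=True)
    for phone in ranked:
        if responses[phone] >= thresholds.get(phone, default_threshold):
            return phone
    return None
```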
Further improvement to the syntactic knowledge estimator of RUST could include a
mechanism for backtracking through the recognition process when a mistake results in the
previously defined path not being followed. If a mistake is made at either the phonemic
level or the branch level, the syntactic knowledge estimator needs to go back through the
word's browsing history and alter some of the decisions it has made, checking for an
overall better phone match for the word. This requires the implementation of an algorithm
of far greater intelligence and complexity than that provided in the current versions of
RUST.
The performance of RUST could also be improved by using a continuous activation input
into each sub-recognisor rather than the current binary value. This continuous input would
represent the probability of occurrence of the current phone and would combine with the
SRS to derive a decision on the "correct" phone.
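A continuous activation input could enter the decision as a weight on each SRS. The product rule below is one possible combination, assumed purely for illustration.

```python
def decide_phone(srs_scores, occurrence_probs):
    """Choose the phone whose SRS, weighted by its continuous probability
    of occurrence, is largest (instead of gating on a binary input)."""
    return max(srs_scores,
               key=lambda p: srs_scores[p] * occurrence_probs.get(p, 0.0))
```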
REFERENCES
BERNARD, J., 1989. Australian at talk. Canberra: documentation of video
exploration program prepared for the Curriculum Development Centre, Canberra.
CASSIDY, S. AND HARRINGTON, J., 1992. Investigating the dynamic nature of
vowels using neural network. Proceedings of the 4th Australian international
conference on speech science and technology, December 1992 Brisbane. 495-500.
CHAKRABORTY, D. AND PAL, N., 2003. A novel learning scheme for multilayered
perceptrons to realize proper generalization and incremental learning. IEEE transactions
on neural networks, 14(1), January 2003, 1-14.
CHAN, C. AND CHAN, TAT-CHUNG, 1992. A controlled study of the suitability and
limitations of static modelling of speech. Proceedings of IEEE region 10
international conference on technology enabling tomorrow: computers,
communications and automation towards the 21st century TENCON-92, 1992. Vol. 1,
272-276.
CHENG, Y.M., O’SHAUGHNESSY, D., GUPTA, V., KENNY, P., MERMELSTEIN,
P., AND PARTHASARATHY, S., 1992. Hybrid segmental-LVQ/HMM for large
vocabulary speech recognition. Proceedings of IEEE international conference on
acoustics speech and signal processing, 1992, Vol. 1, 593-596.
COSTINETT, S., 1997. The language of accounting in English. New York: Regents
publishing company.
CREEKMORE, J.W., FANTY, M. AND COLE, R.A., 1991. A comparative study of
five spectral representations for speaker-independent phonetic recognition. The 25th
Asilomar conference on signals systems and computer, 1991. 330-334.
DARJAZINI, H., CHENG, Q., AND LIYANA-PATHIRANA, R., 2006. Incremental
learning algorithm for speech recognition, (unpublished).
DARJAZINI, H. AND TIBBITTS, J., 1994. The construction of phonemic knowledge
using clustering methodology. Proceedings of the 5th Australian international
conference on speech science and technology SST-94, December 1994 Perth, Vol. 1,
202-207.
DAVENPORT, M. AND GARUDADRI, H., 1991. A neural network acoustics
phonetic feature extractor based on wavelet. Proceedings of IEEE Pacific rim
international conference on communication computers and signal processing, 1991.
Vol. 2, 449-452.
DAVIS, S.B. AND MERMELSTEIN, P., 1980. Comparison of parametric
representations for monosyllabic word recognition in continuously spoken sentences.
IEEE transactions on acoustics, speech, and signal processing, 28 (4), 357-366.
DE MORI, R., 1983. Computer models of speech using fuzzy algorithm. USA: Plenum
Press.
DELLER, J.R. JR. PROAKIS, J.G. AND HANSEN, J.H., 1993. Discrete-time
processing of speech signals. USA: Macmillan publishing co.
DENG, L., DROPPO, J., AND ACERO, A., 2003. Incremental Bayes learning with prior
evolution for tracking nonstationary noise statistics from noisy speech data, Proceedings of
IEEE international conference on acoustics speech and signal processing, April.
2003,Vol. 1, 6-10.
DERMODY, P., MACKIE, K. AND KATSCH, R., 1986. Initial speech sound
processing in spoken word recognition. Proceedings of the 1st Australian conference on
speech science and technology SST-86, 1986 Canberra.
ELVIRA, J. AND CARRASCO, R., 1991. Neural network architectures for speech
processing. IEE colloquium on systems and applications of man-machine interaction
using speech I/O, 1991. Digest No. 066, 4(1-5).
ESCANDE, P., BEROULE, D. AND BLANCHAT, P., 1991. Speech recognition
experiments with guided propagation. Proceedings of IEEE conference on neural
network, 1991. 765-768.
FANT, G., 1960. Acoustic theory of speech production. 's-Gravenhage: Mouton
and Co.
FLAHERTY, M.J. AND POE, D.B., 1993. Orthogonal transformations of stacked
feature vectors applied to HMM speech recognition. IEE proceedings - 1, 140 (2).
FLANAGAN, J.L., 1983. Speech analysis synthesis and perception. 3rd ed. Berlin:
Springer-Verlag.
FURUI, S., 1989. Digital speech processing synthesis and recognition. New York:
Marcel Dekker Inc.
GRAMSS, T., 1992. Fast learning algorithms for a self-optimizing neural network with
an application to isolated word recognition. IEE proceedings – F, 139 (6), 391-396.
GRANT, P.M., 1991. Speech recognition techniques. IEEE electronics and
communication engineering journal, (2), 37- 48.
GUPTA, V.N., LENNIG, M., MERMELSTEIN, P., KENNY, P., SEITZ, F., AND
O’SHAUGHNESSY, D., 1991. Using phoneme duration and energy contour
information to improve large vocabulary isolated word recognition. Proceedings of
IEEE international conference on acoustics speech and signal processing ICASSP-91,
1991. Vol. 1, 341-344.
HALL, E., 1977. The language of electrical and electronic engineering in English.
N.Y.: Regents publishing company.
HECHT-NIELSEN, R., 1990. Neurocomputing. USA: Addison-Wesley publishing
company.
HUNT, M.J., 1988. An overview of technology for spoken interaction with machines.
Ottawa: National aeronautical establishment. (Report - Feb 1988).
KENNY, P., 1993. A*-admissible heuristics for rapid lexical access. IEEE transactions
on speech and audio processing, 1 (1), 49-58.
KITAMURA, T., NISHIOKA, K., ITO, A. AND HAYAHARA, E., 1992. Speaker
dependent 100 word recognition using dynamic spectral features of speech and neural
network. Proceedings of the 34th Midwest symposium on circuits and systems, 1992.
Vol. 1, 533-536.
KITAMURA, T., NISHIOKA, K., IWATA, A. AND HAYAHARA, E., 1992. Speaker
dependent recognition using CombNET dynamic spectral features of speech.
Proceedings of the 34th Midwest symposium on circuits and systems, 1992. Vol. 1, 83-
86.
KUANG, Z. AND KUH, A., 1992. A combined self-organizing feature map and multi-
layer perceptron for isolated word recognition. IEEE transactions on signal processing,
40 (11), 2651-2657.
FAUSETT, L., 1994. Fundamentals of neural networks. USA: Prentice Hall.
LEE, K. AND DERMODY, P., 1992. The relationship between perceptual and
acoustics analysis of speech sounds. Proceedings of the 4th Australian international
conference on speech science and technology SST-92, December 1992 Brisbane. 14-19.
LIPPMANN, R.P., 1987. An introduction to computing with neural nets. IEEE
Acoustics speech and signal processing magazine, 3(4), 4-22.
LOVE, C. AND KINSNER, W., 1992. A speech recognition system using a neural
network model for vocal shaping. Canada: University of Manitoba, Department of
electrical and computer Engineering (report).
MACQUARIE LIBRARY, 1998. The budget Macquarie dictionary. NSW: Macquarie
University, 3rd ed.
MAGOULAS, G. D. AND VRAHATIS, M. N., 1999. Improving the convergence of the
back-propagation algorithm using learning rate adaptation methods. Neural computation
magazine, 11, Massachusetts institute of technology, pp. 1769-1796.
McCORD NELSON, M. AND ILLINGWORTH, W.T., 1991. A practical guide to
neural nets. USA: Addison-Wesley.
MIHELIČ, F., GYERGYEK, L. AND PAVEŠIĆ, N., 1991. Selection of features and
classification rules for Slovene phoneme. Proceedings of 6th Mediterranean electro-
technical conference, 1991. Vol. 2, 1180-1183.
OPPENHEIM, A.V. AND SCHAFER, R.W., 1989. Discrete-time signal processing.
USA: Prentice Hall, Signal processing series.
PEPPER, D.J. AND CLEMENTS, M.A., 1992. Phonemic recognition using a large
hidden Markov model. IEEE transactions on signal processing, 40 (6), 1590-1595.
POLIKAR, R., UDPA, L., UDPA, S., AND HONAVAR, V., 2001. Learn++: An
incremental learning algorithm for supervised neural networks. IEEE transactions on
systems, man, and cybernetics - Part C: Applications and reviews, 31(4), Nov. 2001,
497-508.
REICHL, W. AND RUSKE, G., 1995. A hybrid RBF-HMM system for continuous
speech recognition. Proceedings of IEEE international conference on acoustics speech
and signal processing ICASSP–95, 1995. Vol. 5, 3335-3338.
RABINER, L. AND JUANG, B-H., 1993. Fundamentals of speech recognition. USA:
Prentice Hall.
RIGOLL, G., 1991. A new unsupervised learning algorithm for multi-layer perceptrons
based on information theory principles. Proceedings of IEEE international joint
conference on neural networks, 1991. 1764-1769.
SHIM, C., ESPINOZA-VARSA, B. AND CHEUNG, J., 1991. Difficult syllables
recognition with LPC coefficients differences and PC-based neural network.
Proceedings of 33rd IEEE Midwest symposium on circuits and systems, 1991. Vol. 2,
783-786.
SHUPING, R. AND MILLAR, B., 1992. Phonetic feature extraction using artificial
neural networks. Proceedings of the 4th Australian international conference on speech
science and technology SST-92, December 1992 Brisbane. 22-27.
SMITH, F.J., MING, J., O’BOYLE, P. AND IRVINE, A.D., 1995. A hidden Markov
model with optimized inter-frame dependence. Proceedings of IEEE International
conference on acoustics speech and signal processing ICASSP-95, 1995. Vol. 1, 209-
212.
SORENSEN, H., 1991. A cepstral noise reduction multi-layer neural network.
Proceedings of IEEE international conference on acoustics, speech and signal
processing ICASSP-91, 1991. Vol. 2, 933-936.
SUGIYAMA, M., SAWAI, H. AND WAIBEL, A., 1991. Review of TDNN
architectures for speech recognition. IEEE international symposium on circuits and
systems, 1991. Vol. 1, 582-585.
TECHNICAL PUBLICATIONS GROUP, 1993. Neural computing, A technology
handbook for professional II/Plus and NeuralWorks explorer. USA: NeuralWare Inc.
TIBBITTS, J., 1996. Utilisation of perceptually acoustics cues in NNT for speech
recognition. Sydney: report to ARC small grant.
TIBBITTS, J., 1989. A digital signal processing technique to improve the intelligibility
of speech for the hearing impaired in quiet. Thesis (PhD). Sydney University.
VO, M.T., 1994. Incremental learning using the time delay neural network, Proceedings of
IEEE international conference on acoustics speech and signal processing ICASSP-94, Vol.
2, April 1994, 629-632.
WAIBEL, A., HANAZAWA, T., HINTON, G., SHIKANO, K. AND LANG, K.J.,
1989. Phoneme recognition using time-delay neural networks. IEEE transactions on
acoustics, speech and signal processing, 37(3), 328-339.
WANG, D., AND YUWONO, B., 1996. Incremental learning of complex temporal
patterns, IEEE transactions on neural networks, 7(6), Nov. 1996, 1465-1481.
ZAVALIAGKOS, G., ZHAO, Y., SCHWARTZ, R., AND MAKHOUL J., 1994. A
hybrid segmental neural net / hidden Markov model system for continuous speech
recognition. IEEE transactions on speech and audio processing, 2 (1), Part 2, 151-160.
ZHANG, Q.J., WANG, F. AND NAKHLA, M.S., 1995. A high-order temporal neural
network for word recognition. Proceedings on international conference on acoustics,
speech and signal processing ICASSP-95, 1995. Vol. 5, 3343-3346.
APPENDIX
Probabilistic Values of the Second Level of the Syntactic Knowledge
The probabilistic values of the links between the phonemic groups on the onset level and their
phonemic subgroups are shown in the tables below. These values address the sequential
distribution of the clusters in the syntactic knowledge.
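The two computed columns in the tables below follow directly from the occurrence counts: the localised probability is P = E / n(set) and the self-information is I = -log2(P) bits. The sketch below reproduces the first row of Table A.1, where E(ðI) = 117 out of n(ð) = 156 gives P = 0.750 and I = 0.415 bit; the function name is an assumption for illustration.

```python
import math

def localised_stats(subgroup_counts, n_set):
    """Localised probability and self-information (in bits) per subgroup:
    P = count / n_set and I = -log2(P), as tabulated below."""
    return {k: (c / n_set, -math.log2(c / n_set))
            for k, c in subgroup_counts.items()}

p, i = localised_stats({"ðI": 117}, 156)["ðI"]   # P = 0.750, I = 0.415
```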
Table A.1 Probabilistic values of phonemic subgroups of the phonemic set Oð.
Phonemic set Oð , n(ð) = 156
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(ðI) = 117 P(ðI) = 0.750 I(ðI) = 0.415
2 E(ð∂) = 110 P(ð∂) = 0.705128 I(ð∂) = 0.504
3 E(ðæ) = 13 P(ðæ) = 0.0833 I(ðæ) = 3.583
4 E(ðε) = 11 P(ðε) = 0.0705512 I(ðε) = 3.824
5 E(ðeI) = 9 P(ðeI) = 0.057692 I(ðeI) = 4.113
6 E(ði) = 3 P(ði) = 0.019230 I(ði) = 5.697
7 E(ðOΩ) = 2 P(ðOΩ) = 0.01282 I(ðOΩ) = 6.281
Table A.2 Probabilistic values of phonemic subgroups of the phonemic set O∂.
Phonemic set O∂ , n(∂) = 106
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(∂sln) = 36 P(∂sln) = 0.339622 I(∂sln) =1.557
2 E(∂L) = 23 P(∂L) = 0.216981 I(∂L) = 2.2
3 E(∂k) =18 P(∂k) = 0.169811 I(∂k) = 2.556
4 E(∂t) = 6 P(∂t) = 0.056603 I(∂t) = 4.14
5 E(∂r) = 6 P(∂r) = 0.056603 I(∂r) = 4.14
6 E(∂w) = 5 P(∂w) = 0.047169 I(∂w) = 4.4
7 E(∂b) = 3 P(∂b) = 0.0283018 I(∂b) = 5.139
8 E(∂f) = 3 P(∂f) = 0.0283018 I(∂f) = 5.139
9 E(∂m) = 2 P(∂m) = 0.0188679 I(∂m) = 5.724
10 E(∂g) = 2 P(∂g) = 0.0188679 I(∂g) = 5.724
11 E(∂p) = 1 P(∂p) = 0.0094339 I(∂p) = 6.724
12 E(∂d) = 1 P(∂d) = 0.0094339 I(∂d) = 6.724
Table A.3 Probabilistic values of phonemic subgroups of the phonemic set Oæ.
Phonemic set Oæ , n(æ) = 82
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(æn) = 56 P(æn) = 0.682926 I(æn) = 0.549
2 E(æt) = 18 P(æt) = 0.219512 I(æt) = 2.186
3 E(æz) = 8 P(æz) = 0.097560 I(æz) = 3.355
4 E(æd) = 1 P(æd) = 0.012195 I(æd) = 6.353
5 E(æL) = 1 P(æL) = 0.012195 I(æL) = 6.353
Table A.4 Probabilistic values of phonemic subgroups of the phonemic set OI.
Phonemic set OI , n(I) = 81
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(In) = 48 P(In) = 0.592592 I(In) = 0.754
2 E(Iz) = 16 P(Iz) = 0.197530 I(Iz) = 2.338
3 E(It) = 12 P(It) = 0.148148 I(It) = 2.75
4 E(If) = 4 P(If) = 0.049382 I(If) = 4.337
5 E(Im) = 1 P(Im) = 0.012345 I(Im) = 6.336
Table A.5 Probabilistic values of phonemic subgroups of the phonemic set Oh.
Phonemic set Oh , n(h) = 79
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(hi) = 16 P(hi) = 0.202531 I(hi) = 2.3
2 E(hæ) = 15 P(hæ) = 0.189873 I(hæ) = 2.39
3 E(ha) = 12 P(ha) = 0.151898 I(ha) = 2.71
4 E(hέ) = 10 P(hέ) = 0.126582 I(hέ) = 2.98
5 E(hI) = 10 P(hI) = 0.126582 I(hI) = 2.98
6 E(hOΩ) = 6 P(hOΩ) = 0.075949 I(hOΩ) = 3.72
7 E(hε) = 3 P(hε) = 0.0379746 I(hε) = 4.72
8 E(haI) = 2 P(haI) = 0.0253164 I(haI) = 5.3
9 E(hu) = 2 P(hu) = 0.0253164 I(hu) = 5.3
10 E(hI∂) = 1 P(hI∂) = 0.0126582 I(hI∂) = 6.3
11 E(hЭ) = 1 P(hЭ) = 0.0126582 I(hЭ) = 6.3
12 E(hΛ) = 1 P(hΛ) = 0.0126582 I(hΛ) = 6.3
Table A.6 Probabilistic values of phonemic subgroups of the phonemic set Ow.
Phonemic set Ow , n(w) = 74
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(wI) = 17 P(wI) = 0.229729 I(wI) = 2.12
2 E(wÞ) = 11 P(wÞ) = 0.148648 I(wÞ) = 3.32
3 E(wi) = 8 P(wi) = 0.108108 I(wi) = 3.36
4 E(wέ) = 8 P(wέ) = 0.108108 I(wέ) = 3.36
5 E(wε) = 7 P(wε) = 0.0945945 I(wε) = 3.4
6 E(wΛ) = 6 P(wΛ) = 0.0810810 I(wΛ) = 3.62
7 E(weI) = 5 P(weI) = 0.0675675 I(weI) = 3.88
8 E(wЭ) = 3 P(wЭ) = 0.0405405 I(wЭ) = 4.62
9 E(wΩ) = 2 P(wΩ) = 0.027027 I(wΩ) = 5.2
10 E(waI) = 2 P(waI) = 0.027027 I(waI) =5.98
11 E(wε∂) = 1 P(wε∂) = 0.013513 I(wε∂) = 5.98
Table A.7 Probabilistic values of phonemic subgroups of the phonemic set OÞ.
Phonemic set OÞ , n(Þ) = 65
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(Þv) = 51 P(Þv) = 0.784615 I(Þv) = 0.349
2 E(Þn) = 8 P(Þn) = 0.1230789 I(Þn) = 3.02
3 E(Þf) = 4 P(Þf) = 0.0615384 I(Þf) = 4.02
4 E(Þp) = 2 P(Þp) = 0.030769 I(Þp) = 5.02
Table A.8 Probabilistic values of phonemic subgroups of the phonemic set Of.
Phonemic set Of , n(f) = 63
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(fr) = 18 P(fr) = 0.285714 I(fr) = 1.8
2 E(fЭ) = 14 P(fЭ) = 0.222222 I(fЭ) = 2.6
3 E(fI) = 6 P(fI) = 0.095238 I(fI) = 3.39
4 E(fi) = 3 P(fi) = 0.047619 I(fi) = 4.39
5 E(fε) = 3 P(fε) = 0.047619 I(fε) = 4.39
6 E(fa) = 3 P(fa) = 0.047619 I(fa) = 4.39
7 E(feI) = 3 P(feI) = 0.047619 I(feI) = 4.39
8 E(f∂) = 2 P(f∂) = 0.031746 I(f∂) = 4.97
9 E(faI) = 2 P(faI) = 0.031746 I(faI) = 4.97
10 E(fæ) = 2 P(fæ) = 0.031746 I(fæ) = 4.97
11 E(fΛ) = 2 P(fΛ) = 0.031746 I(fΛ) = 4.97
12 E(fI∂) = 1 P(fI∂) = 0.015873 I(fI∂) = 5.97
13 E(fL) = 1 P(fL) = 0.015873 I(fL) = 5.97
14 E(faΩ) = 1 P(faΩ) = 0.015873 I(faΩ) = 5.97
15 E(fu) = 1 P(fu) = 0.015873 I(fu) = 5.97
Table A.9 Probabilistic values of phonemic subgroups of the phonemic set Op.
Phonemic set Op , n(p) = 62
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(pr) = 25 P(pr) = 0.403 I(pr) = 1.31
2 E(pL) = 6 P(pL) = 0.0967 I(pL) = 3.36
3 E(pa) = 6 P(pa) = 0.0967 I(pa) = 3.36
4 E(p∂) = 4 P(p∂) = 0.0645 I(p∂) = 3.95
5 E(pÞ) = 4 P(pÞ) = 0.0645 I(pÞ) = 3.95
6 E(pΩ) = 3 P(pΩ) = 0.0483 I(pΩ) = 4.37
7 E(pi) = 3 P(pi) = 0.0483 I(pi) = 4.37
8 E(pΛ) = 2 P(pΛ) = 0.0322 I(pΛ) = 4.95
9 E(pæ) = 2 P(pæ) = 0.0322 I(pæ) = 4.95
10 E(peI) = 2 P(peI) = 0.0322 I(peI) = 4.95
11 E(pέ) = 2 P(pέ) = 0.0322 I(pέ) = 4.95
12 E(pI∂) = 1 P(pI∂) = 0.0161 I(pI∂) = 5.95
13 E(pI) = 1 P(pI) = 0.0161 I(pI) = 5.95
14 E(pε) = 1 P(pε) = 0.0161 I(pε) = 5.95
Table A.10 Probabilistic values of phonemic subgroups of the phonemic set Ot.
Phonemic set Ot , n(t) = 61
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(tu) = 39 P(tu) = 0.639344 I(tu) = 0.644
2 E(tr) = 8 P(tr) = 0.131147 I(tr) = 2.92
3 E(teI) =6 P(teI) = 0.098360 I(teI) = 3.34
4 E(t∂) = 3 P(t∂) = 0.049180 I(t∂) = 4.34
5 E(tw) = 2 P(tw) = 0.032786 I(tw) = 4.92
6 E(tέ) = 2 P(tέ) = 0.032786 I(tέ) = 4.92
7 E(taI) = 1 P(taI) = 0.016393 I(taI) = 5.93
Table A.11 Probabilistic values of phonemic subgroups of the phonemic set Os.
Phonemic set Os , n(s) = 57
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(sε) = 14 P(sε) = 0.245614 I(sε) = 2.02
2 E(sI) = 9 P(sI) = 0.157894 I(sI) = 2.66
3 E(st) = 7 P(st) = 0.122807 I(st) = 3.02
4 E(sp) = 4 P(sp) = 0.105263 I(sp) = 3.24
5 E(si) = 3 P(si) = 0.052631 I(si) = 4.24
6 E(sm) = 3 P(sm) = 0.052631 I(sm) = 4.24
7 E(s∂) = 3 P(s∂) = 0.052631 I(s∂) = 4.24
8 E(seI) = 2 P(seI) = 0.035087 I(seI) = 4.83
9 E(sOΩ) = 2 P(sOΩ) = 0.035087 I(sOΩ) = 4.83
10 E(sέ) = 2 P(sέ) = 0.035087 I(sέ) = 4.83
11 E(sæ) = 2 P(sæ) = 0.035087 I(sæ) = 4.83
12 E(sk) = 2 P(sk) = 0.035087 I(sk) = 4.83
13 E(sL) = 1 P(sL) = 0.017543 I(sL) = 5.83
14 E(saI) = 1 P(saI) = 0.017543 I(saI) = 5.83
Table A.12 Probabilistic values of phonemic subgroups of the phonemic set Ob.
Phonemic set Ob , n(b) = 56
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(bi) = 19 P(bi) = 0.339285 I(bi) = 1.56
2 E(bΩ) = 10 P(bΩ) = 0.178571 I(bΩ) = 2.48
3 E(bI) = 7 P(bI) = 0.125 I(bI) = 2.3
4 E(baI) = 4 P(baI) = 0.0714285 I(baI) = 3.8
5 E(b∂) = 3 P(b∂) = 0.0535714 I(b∂) = 4.21
6 E(br) = 3 P(br) = 0.0535714 I(br) = 4.21
7 E(bL) = 3 P(bL) = 0.0535714 I(bL) = 4.21
8 E(bΛ) = 2 P(bΛ) = 0.035714 I(bΛ) = 4.8
9 E(bæ) = 2 P(bæ) = 0.035714 I(bæ) = 4.8
10 E(bε) = 1 P(bε) = 0.017857 I(bε) = 5.8
11 E(baΩ) = 1 P(baΩ) = 0.017857 I(baΩ) = 5.8
12 E(bЭ) = 1 P(bЭ) = 0.017857 I(bЭ) = 5.8
13 E(bÞ) = 1 P(bÞ) = 0.017857 I(bÞ) = 5.8
14 E(beI) = 1 P(beI) = 0.017857 I(beI) = 5.8
Table A.13 Probabilistic values of phonemic subgroups of the phonemic set Ok .
Phonemic set Ok , n(k) = 48
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(k∂) = 11 P(k∂) = 0.22916 I(k∂) = 2.12
2 E(kæ) = 10 P(kæ) = 0.20833 I(kæ) = 2.26
3 E(kÞ) = 9 P(kÞ) = 0.1875 I(kÞ) = 2.41
4 E(kΛ) = 4 P(kΛ) = 0.083333 I(kΛ) = 3.58
5 E(kaI) = 3 P(kaI) = 0.0625 I(kaI) = 3.99
6 E(keI) = 2 P(keI) = 0.041666 I(keI) = 4.58
7 E(kε) = 2 P(kε) = 0.041666 I(kε) = 4.58
8 E(kЭ) = 2 P(kЭ) = 0.041666 I(kЭ) = 4.58
9 E(kΩ) = 1 P(kΩ) = 0.020833 I(kΩ) = 5.58
10 E(kL) = 1 P(kL) = 0.020833 I(kL) = 5.58
11 E(kOΩ) = 1 P(kOΩ) = 0.020833 I(kOΩ) = 5.58
12 E(kI) = 1 P(kI) = 0.020833 I(kI) = 5.58
13 E(ki) = 1 P(ki) = 0.020833 I(ki) = 5.58
Table A.14 Probabilistic values of phonemic subgroups of the phonemic set Od.
Phonemic set Od , n(d) = 44
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(d∂) = 12 P(d∂) = 0.272727 I(d∂) = 1.87
2 E(dr) = 10 P(dr) = 0.2272727 I(dr) = 2.14
3 E(dI) = 6 P(dI) = 0.1363636 I(dI) = 2.87
4 E(dΛ) = 2 P(dΛ) = 0.0454545 I(dΛ) = 4.48
5 E(du) = 2 P(du) = 0.0454545 I(du) = 4.48
6 E(deI) = 2 P(deI) = 0.0454545 I(deI) = 4.48
7 E(dε) = 2 P(dε) = 0.0454545 I(dε) = 4.48
8 E(di) = 2 P(di) = 0.0454545 I(di) = 4.48
9 E(dÞ) = 1 P(dÞ) = 0.0227272 I(dÞ) = 5.46
10 E(dæ) = 1 P(dæ) = 0.0227272 I(dæ) = 5.46
11 E(dЭ) = 1 P(dЭ) = 0.0227272 I(dЭ) = 5.46
12 E(dOΩ) = 1 P(dOΩ) = 0.0227272 I(dOΩ) = 5.46
13 E(dI∂) = 1 P(dI∂) = 0.0227272 I(dI∂) = 5.46
14 E(dj) = 1 P(dj) = 0.0227272 I(dj) = 5.46
Table A.15 Probabilistic values of phonemic subgroups of the phonemic set Om.
Phonemic set Om , n(m) = 38
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(mΛ) = 7 P(mΛ) = 0.159090 I(mΛ) = 2.65
2 E(mæ) = 6 P(mæ) = 0.157894 I(mæ) = 2.66
3 E(mÞ) = 6 P(mÞ) = 0.157894 I(mÞ) = 2.66
4 E(meI) = 6 P(meI) = 0.157894 I(meI) = 2.66
5 E(m∂) = 4 P(m∂) = 0.105263 I(m∂) = 3.25
6 E(mЭ) = 3 P(mЭ) = 0.0789473 I(mЭ) = 3.66
7 E(mOΩ) = 2 P(mOΩ) = 0.052631 I(mOΩ) = 4.24
8 E(ma) = 1 P(ma) = 0.0263157 I(ma) = 5.24
9 E(maΩ) = 1 P(maΩ) = 0.026315 I(maΩ) = 5.24
Table A.16 Probabilistic values of phonemic subgroups of the phonemic set On .
Phonemic set On , n(n) = 34
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(nj) = 12 P(nj) = 0.352941 I(nj) = 1.5
2 E(nΛ) = 7 P(nΛ) = 0.205882 I(nΛ) = 2.27
3 E(nε) = 4 P(nε) = 0.117647 I(nε) = 3.1
4 E(nOΩ) = 4 P(nOΩ) = 0.117647 I(nOΩ) = 3.1
5 E(nÞ) = 2 P(nÞ) = 0.058823 I(nÞ) = 4.08
6 E(naI) = 2 P(naI) = 0.058823 I(naI) = 4.08
7 E(nЭI) = 1 P(nЭI) = 0.029411 I(nЭI) = 5.08
8 E(neI) = 1 P(neI) = 0.029411 I(neI) = 5.08
9 E(ni) = 1 P(ni) = 0.029411 I(ni) = 5.08
Table A.17 Probabilistic values of phonemic subgroups of the phonemic set Oi .
Phonemic set Oi , n(i) = 28
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(iL) = 17 P(iL) = 0.607142 I(iL) = 0.72
2 E(iv) = 2 P(iv) = 0.071428 I(iv) = 3.8
3 E(it∫) = 2 P(it∫) = 0.071428 I(it∫) = 3.8
4 E(ik) = 2 P(ik) = 0.071428 I(ik) = 3.8
5 E(iz) = 2 P(iz) = 0.071428 I(iz) = 3.8
6 E(is) = 1 P(is) = 0.035714 I(is) = 4.8
7 E(it) = 1 P(it) = 0.035714 I(it) = 4.8
8 E(in) = 1 P(in) = 0.035714 I(in) = 4.8
Table A.18 Probabilistic values of phonemic subgroups of the phonemic set Oε.
Phonemic set Oε , n(ε) = 28
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(εn) = 10 P(εn) = 0.370370 I(εn) = 1.43
2 E(εL) = 6 P(εL) = 0.222222 I(εL) = 2.17
3 E(εk) = 3 P(εk) = 0.111111 I(εk) = 3.17
4 E(εv) = 2 P(εv) = 0.074074 I(εv) = 3.75
5 E(ε∂) = 2 P(ε∂) = 0.074074 I(ε∂) = 3.75
6 E(εg) = 2 P(εg) = 0.074074 I(εg) = 3.75
7 E(εd) = 1 P(εd) = 0.037037 I(εd) = 4.75
8 E(εb) = 1 P(εb) = 0.037037 I(εb) = 4.75
Table A.19 Probabilistic values of phonemic subgroups of the phonemic set Og.
Phonemic set Og , n(g) = 26
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(gr) = 11 P(gr) = 0.423076 I(gr) = 1.24
2 E(geI) = 3 P(geI) = 0.115384 I(geI) = 3.11
3 E(gε) = 2 P(gε) = 0.0769230 I(gε) = 3.7
4 E(gL) = 2 P(gL) = 0.0769230 I(gL) = 3.7
5 E(gΛ) = 2 P(gΛ) = 0.0769230 I(gΛ) = 3.7
6 E(gÞ) = 1 P(gÞ) = 0.038461 I(g Þ) = 4.69
7 E(gaI) = 1 P(gaI) = 0.038461 I(gaI) = 4.69
8 E(gI) = 1 P(gI) = 0.038461 I(gI) = 4.69
9 E(gi) = 1 P(gi) = 0.038461 I(gi) = 4.69
10 E(gOΩ) = 1 P(gOΩ) = 0.038461 I(gOΩ) = 4.69
11 E(gΩ) = 1 P(gΩ) = 0.038461 I(gΩ) = 4.69
Table A.20 Probabilistic values of phonemic subgroups of the phonemic set Or.
Phonemic set Or , n(r) = 24
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(r∂) = 11 P(r∂) = 0.458333 I(r∂) = 1.13
2 E(rε) = 4 P(rε) = 0.166666 I(rε) = 2.58
3 E(ri) = 4 P(ri) = 0.166666 I(ri) = 2.58
4 E(ru) = 2 P(ru) = 0.08333 I(ru) = 3.58
5 E(rOΩ) = 1 P(rOΩ) = 0.041666 I(rOΩ) = 4.58
6 E(rΛ) = 1 P(rΛ) = 0.0416666 I(rΛ) = 4.58
7 E(ræ) = 1 P(ræ) = 0.0416666 I(ræ) = 4.58
Table A.21 Probabilistic values of phonemic subgroups of the phonemic set Oa.
Phonemic set Oa , n(a) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(a-) = 12 P(a-) = 0.631578 I(a-) = 0.662
2 E(aI) = 2 P(aI) = 0.105263 I(aI) = 3.24
3 E(at∫) = 1 P(at∫) = 0.052631 I(at∫) = 4.24
4 E(af) = 1 P(af) = 0.052631 I(af) = 4.24
5 E(am) = 1 P(am) = 0.052631 I(am) = 4.24
6 E(as) = 1 P(as) = 0.052631 I(as) = 4.24
7 E(at) = 1 P(at) = 0.052631 I(at) = 4.24
Table A.22 Probabilistic values of phonemic subgroups of the phonemic set OЭ.
Phonemic set OЭ , n(Э) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(Э-) = 6 P(Э-) = 0.315789 I(Э-) = 1.66
2 E(Эb) = 4 P(Эb) = 0.210526 I(Эb) = 2.24
3 E(Эg) = 4 P(Эg) = 0.210526 I(Эg) = 2.24
4 E(Эd) = 3 P(Эd) = 0.157894 I(Эd) = 2.66
5 E(ЭL) = 2 P(ЭL) = 0.105263 I(ЭL) = 3.24
Table A.23 Probabilistic values of phonemic subgroups of the phonemic set Oj.
Phonemic set Oj , n(j) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(ju) = 14 P(ju) = 0.736842 I(ju) = 0.44
2 E(jЭ) = 3 P(jЭ) = 0.157894 I(jЭ) = 2.66
3 E(j∂) = 1 P(j∂) = 0.052631 I(j∂) = 4.24
4 E(jI∂) = 1 P(jI∂) = 0.052631 I(jI∂) = 4.24
Table A.24 Probabilistic values of phonemic subgroups of the phonemic set Ot∫.
Phonemic set Ot∫ , n(t∫) = 16
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(t∫æ) = 7 P(t∫æ) = 0.4375 I(t∫æ) = 1.19
2 E(t∫a) = 6 P(t∫a) = 0.375 I(t∫a) = 1.41
3 E(t∫έ) = 1 P(t∫έ) = 0.0625 I(t∫έ) = 4
4 E(t∫ε) = 1 P(t∫ε) = 0.0625 I(t∫ε) = 4
5 E(t∫aI) = 1 P(t∫aI) = 0.0625 I(t∫aI) = 4
Table A.25 Probabilistic values of phonemic subgroups of the phonemic set OL.
Phonemic set OL , n(L) = 16
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(LaI) = 5 P(LaI) = 0.3125 I(LaI) = 1.67
2 E(LΩ) = 3 P(LΩ) = 0.1875 I(LΩ) = 2.41
3 E(Li) = 3 P(Li) = 0.1875 I(Li) = 2.41
4 E(Lε) = 1 P(Lε) = 0.0625 I(Lε) = 4
5 E(LeI) = 1 P(LeI) = 0.0625 I(LeI) = 4
6 E(LOΩ) = 1 P(LOΩ) = 0.0625 I(LOΩ) = 4
7 E(Læ) = 1 P(Læ) = 0.0625 I(Læ) = 4
8 E(La) = 1 P(La) = 0.0625 I(La) = 4
Table A.26 Probabilistic values of phonemic subgroups of the phonemic set OΛ.
Phonemic set OΛ , n(Λ) = 14
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(Λð) = 4 P(Λð) = 0.285714 I(Λð) = 1.8
2 E(Λp) = 4 P(Λp) = 0.285714 I(Λp) = 1.8
3 E(Λn) = 4 P(Λn) = 0.285714 I(Λn) = 1.8
4 E(Λs) = 1 P(Λs) = 0.071428 I(Λs) = 3.8
Table A.27 Probabilistic values of phonemic subgroups of the phonemic set Oθ.
Phonemic set Oθ , n(θ) = 11
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(θr) = 5 P(θr) = 0.454545 I(θr) = 1.14
2 E(θI) = 2 P(θI) = 0.181818 I(θI) = 2.64
3 E(θæ) = 2 P(θæ) = 0.181818 I(θæ) = 2.64
4 E(θaΩ) = 1 P(θaΩ) = 0.090909 I(θaΩ) = 3.64
Table A.28 Probabilistic values of phonemic subgroups of the phonemic set OaΩ.
Phonemic set OaΩ , n(aΩ) = 8
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(aΩt) = 4 P(aΩt) = 0.5 I(aΩt) = 1
2 E(aΩ∂) = 4 P(aΩ∂) = 0.5 I(aΩ∂) = 1
Table A.29 Probabilistic values of phonemic subgroups of the phonemic set Odξ.
Phonemic set Odξ , n(dξ) = 8
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(dξÞ) = 2 P(dξÞ) = 0.25 I(dξÞ) = 2
2 E(dξΛ) = 2 P(dξΛ) = 0.25 I(dξΛ) = 2
3 E(dξЭI) = 2 P(dξЭI) = 0.25 I(dξЭI) = 2
4 E(dξn) = 1 P(dξn) = 0.125 I(dξn) = 3
5 E(dξε) = 1 P(dξε) = 0.125 I(dξε) = 3
Table A.30 Probabilistic values of phonemic subgroups of the phonemic set OOΩ.
Phonemic set OOΩ , n(OΩ) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(OΩL) = 1 P(OΩL) = 0.250 I(OΩL) = 2
2 E(OΩd) = 1 P(OΩd) = 0.250 I(OΩd) = 2
3 E(OΩv) = 1 P(OΩv) = 0.250 I(OΩv) = 2
4 E(OΩn) = 1 P(OΩn) = 0.250 I(OΩn) = 2
Table A.31 Probabilistic values of phonemic subgroups of the phonemic set OaI.
Phonemic set OaI , n(aI) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(aI-) = 3 P(aI-) = 0.750 I(aI-) = 0.415
2 E(aI∂) = 1 P(aI∂) = 0.250 I(aI∂) = 2
Table A.32 Probabilistic values of phonemic subgroups of the phonemic set Ov.
Phonemic set Ov , n(v) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(vε) = 3 P(vε) = 0.75 I(vε) = 0.415
2 E(v∂) = 1 P(v∂) = 0.25 I(v∂) = 2
Table A.33 Probabilistic values of phonemic subgroups of the phonemic set Oέ.
Phonemic set Oέ , n(έ) = 3
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(έL) = 2 P(έL) = 0.666666 I(έL) = 0.58
2 E(έθ) = 1 P(έθ) = 0.333333 I(έθ) = 1.58
Table A.34 Probabilistic values of phonemic subgroups of the phonemic set OeI.
Phonemic set OeI , n(eI) = 1
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(eIdξ) = 1 P(eIdξ) = 1 I(eIdξ) = 0