Speech Recognition Using Hybrid System of Neural Networks and Knowledge Sources
©Hisham Darjazini
A thesis submitted to the School of Engineering in fulfillment of the requirements for the degree of Doctor of Philosophy
School of Engineering College of Health and Science University of Western Sydney
July 2006
Statement of Authentication
The work presented in this thesis is, to the best of my knowledge and
belief, original except as acknowledged in the text. I hereby declare that I
have not submitted this material, either in full or in part, for a degree at this
or any other institution.
________________________
Signature
ABSTRACT
In this thesis, a novel hybrid Speech Recognition (SR) system called RUST (Recognition
Using Syntactical Tree) is developed. RUST combines Artificial Neural Networks (ANN)
with a Statistical Knowledge Source (SKS) for a small topic-focused database. The
hypothesis of this research work was that the inclusion of syntactic knowledge
represented in the form of probability of occurrence of phones in words and sentences
improves the performance of an ANN-based SR system.
The lexicon of the first version of RUST (RUST-I) was developed with 1357 words of
which 549 were unique. These words were extracted from three topics (finance, physics
and general reading material), and could be expanded or reduced (specialised). The
results of experiments carried out on RUST showed that by including basic statistical
phonemic/syntactic knowledge with an ANN phone recognisor, the phone recognition
rate was increased to 87% and word recognition rate to 78%.
The first implementation of RUST was not optimal. Therefore, a second version of
RUST (RUST-II) was implemented with an incremental learning algorithm and it has
been shown to improve the phone recognition rate to 94%. The introduction of
incremental learning to ANN-based speech recognition can be considered as the most
innovative feature of this research.
In conclusion, this work has proved the hypothesis that the inclusion of phonemic/syntactic
knowledge of a probabilistic nature and topic-related statistical data, using an adaptive
phone recognisor based on neural networks, has the potential to improve the performance
of a speech recognition system.
Acknowledgements
This work would not have been completed without the continuous support of Dr. Qi
Cheng. I would like to sincerely thank him for his valuable advice and unlimited
support that he provided me in the course of producing this thesis. Dr. Cheng dedicated
numerous hours over many months to help me in producing something that I am proud
of.
Special thanks are also due to Dr. Ranjith Liyana-pathirana for his support, advice and
helpful comments. His revision of the draft of this thesis gave me valuable feedback. I
am grateful to Professor Jann Conroy, Professor Steven Riley, and Mrs. Mary Kron for
their support at various stages of this work.
I will always remember the efforts, advice and endless support provided to me by Dr.
Jo Tibbitts, Professor Godfrey Lucas and Associate Professor Mahmood Nagrial in
producing the first version of this thesis.
A very warm and special thanks are due to my family, who provided me with all their
support and patience during the course of this work. Without their continuous support I
would not have been able to carry on the very long process of finishing this research.
Thank you all; I specially mention my late father, Mahmood, my mother, Samira and
my wife, Shaheenaz, for their patience during the long hours I spent in bringing this
work to fruition.
Many thanks are also due to all those who participated in the acquisition of the UWS
speech database, which I used throughout the practical part of this thesis. I also greatly
appreciate all the help I received from the academic, administrative and technical staff
of the School of Engineering and other departments of the University of Western Sydney.
Last but not least, I wish to acknowledge the moral support of my friends and col-
leagues. I especially remember the support that I have received from Dr. Jamal Rizk.
Contents
Page
Abstract III
Acknowledgements IV
List of Figures VIII
List of Tables X
Chapter 1: Introduction 1
1.0 System Description 1
1.1 Thesis Outline 2
1.2 Publications 3
Chapter 2: Fundamental Concepts 4
2.0 Introduction 4
2.1 RUST-I Fundamentals 4
2.2 Feature Extraction 7
2.2.0 Review of Feature Extraction Techniques Used in
Speech Processing 10
2.2.1 Speech Modelling and MFCC 12
2.2.2 Mel-scale Cepstral Coefficients (MFCC) for RUST-I 16
2.3 Features of Australian English 23
2.4 UWS Speech Database Acquisition 26
2.5 Techniques Used in Speech Recognition 28
2.5.0 Pattern Recognition (PR) 28
2.5.1 Hidden Markov Model (HMM) 29
2.5.2 Artificial Neural Networks (ANN) 33
2.5.3 Advantages of ANN 41
2.5.4 Artificial Intelligence (AI) 42
2.5.5 Hybrid ANN/HMM Systems 44
2.6 Conclusion 47
Chapter 3: Phonemic/Syntactic Knowledge and Adaptive Phone Recognisor
– Design and Implementation 48
3.0 Introduction 48
3.1 Adaptive Phone Recognisor (APR) 49
3.2 Syntactic Knowledge Estimator 50
3.2.0 Syntactic Knowledge Database 51
3.2.1 RUST-I Lexicon 53
3.2.2 Categorisation 54
3.2.3 Data Organisation in the Syntactic Database 59
3.3 Determination of RUST-I Syntactic Knowledge: Example 62
3.4 Code Activator and Accumulator 67
3.5 Sub-recognisor: Structure 73
3.6 Conclusions 78
Chapter 4: Experimental Procedures 79
4.0 Introduction 79
4.1 Selection of Parameters and Initial Conditions 80
4.1.0 Further Results on Training and Testing 81
4.1.1 Confusion Matrix 85
4.2 Training the Adaptive Phone Recognisor 86
4.3 Experiment One: Operation of Each Sub-Recognisor Without the
Syntactical Knowledge 88
4.3.1 Input Stimuli 88
4.3.2 Experimental Method 88
4.3.3 Results 90
4.3.4 Experiment One: Conclusion 109
4.4 Experiment Two: Operation of Each Sub-Recognisor With the
Syntactical Knowledge 110
4.4.1 Input Stimuli 111
4.4.2 Experimental Method 111
4.4.3 Results 112
4.4.4 Experiment Two: Conclusion 113
4.5 Experiment Three: Verification of the System as IWR 113
4.5.1 Input Stimuli 113
4.5.2 Experimental Method 114
4.5.3 Representation of the Results 115
4.5.4 Analytical Procedure 115
4.5.5 Results 116
4.5.6 Experiment Three: Conclusion 129
Chapter 5: Implementation of Incremental Learning Neural Networks
(RUST-II) 131
5.0 Introduction 131
5.1 The Speech Corpus 132
5.1.0 Background 132
5.1.1 TIMIT Database 132
5.1.2 Corpus Selection 133
5.1.3 Phone Segmentation and Feature Extraction 137
5.1.4 Preparation of the Data for the Neural Networks Input 138
5.2 Modification of the APR to Include Incremental Learning Neural
Networks 139
5.2.0 Weight Selection Algorithm 140
5.3 Experiment and Results 145
5.4 Discussion 148
5.5 Conclusion 150
Chapter 6: Conclusion and Future Work 152
6.1 Conclusion 152
6.2 Future Work 154
References 157
Appendix
Probabilistic Values of the Second Level of the Syntactic Knowledge 164
Glossary of Commonly Used Abbreviations 183
List of Figures
Page
Figure 2.1 Simplified schematic block diagram of RUST-I. 4
Figure 2.2 Detailed schematic diagram of RUST-I. 7
Figure 2.3 Discrete time model of speech. 13
Figure 2.4 Cepstrum computation procedure. 14
Figure 2.5 Different spacing of band-pass filters. 15
Figure 2.6 MFCC extraction block diagram. 17
Figure 2.7 Simulation of Mel-scale filters frequency bands. 18
Figure 2.8 Algorithm for program to compute MFCC. 21
Figure 2.9 Formant frequency plot of Australian English (general - male). 23
Figure 2.10 Spectrogram of the words ‘bard’ /bad/ and ‘bud’ /bʌd/
pronounced with an Australian accent. 24
Figure 2.11 Time delay computational element. 39
Figure 2.12 Hidden control neural network. 40
Figure 3.1 Adaptive phone recognisor. 49
Figure 3.2 The syntactic knowledge estimator. 50
Figure 3.3 Graphical representation of data clusters. 58
Figure 3.4 Example of a data cluster for the front edge phonemic class /t/
(phones are represented by their identification codes). 59
Figure 3.5 Bubble diagram of cluster number 3 of front edge phonemic
class /æ/. 60
Figure 3.6 Portion of the syntactic database that represents cluster 4. 61
Figure 3.7 Probabilities of Phones in set O. 66
Figure 3.8 Self-information of phones in set O. 66
Figure 3.9 Algorithm of code activator in pseudo-code form. 69
Figure 3.10 Block diagram of the accumulator. 72
Figure 3.11 Algorithm of the accumulator. 72
Figure 3.12 Structure of the sub-recognisor. 74
Figure 3.13 Architecture of one neuro-slice. 75
Figure 4.1 RMS error curve for training with adjusted parameters. 81
Figure 4.2 Format of the data input training file. 84
Figure 4.3(a) 3-D representation of the full confusion matrix of speaker 9
(right side view). 91
Figure 4.3(b) 3-D representation of the full confusion matrix of speaker 9
(left side view). 92
Figure 4.4 2-D representation of the confusion matrix of speaker 9. 93
Figure 4.5 Block diagram of Experiment Two. 111
Figure 4.6 Block diagram of the system as configured for Experiment Three. 114
Figure 5.1 Feature extraction from the phone /s/. 139
Figure 5.2 The modified structure of the APR. 141
Figure 5.3 Selection of the weight set for incremental learning. 143
Figure 5.4 Structure of new sub-recognisor. 143
Figure 5.5 The sub-recognisor performance in the initial session. 148
Figure 5.6(a) Recognition experiments of the phone /s/. 149
Figure 5.6(b) Recognition experiments of the phone /s/. 150
List of Tables
Page
Table 2.1 Minimum, maximum and average values of frame number (N). 9
Table 2.2 Mel-scale frequency bands. 17
Table 2.3 Equations to compute Mel-scale filter outputs for each of the
17 Mel-scale filters. 19
Table 2.4 Example of output from the program that computes
MFCCs of one frame of speech signal representing vowel /a/
acquired from the word 'last' spoken by speaker 11. 22
Table 2.5 International phonetic alphabet symbols for use in Australian
English. 25
Table 2.6 Classical studies in SR applying ANN to pre-segmented speech. 35
Table 2.7 Summary of studies which employed ANN for speech
signal processing. 36
Table 2.8 Percentage of correct recognition for various topologies
related to the number of iterations. 38
Table 2.9 Results of open tests for ANN trained on full vowel
and steady-state vowel. 40
Table 3.1 Phone IDs of RUST-I. 52
Table 3.2 Phonemic classes and their associated levels represented
in the front edge level of the syntactic knowledge. 56
Table 3.3 Phonemic classes which are not represented in the front edge
layer of the syntactic knowledge. 57
Table 3.4 Syntactical knowledge front edge phones set, their frequencies,
probabilities and self-information. 65
Table 3.5 Localised probabilistic values of phonemic subclasses in level
2 of the phonemic set Oð. 67
Table 3.6 Simulation of seven architectures of MLP. 77
Table 4.1 Optimum learning rates and momentum terms for all layers
during training and testing. 81
Table 4.2 Number of training and testing tokens used for each
sub-recognisor. 82
Table 4.3 Example of the sequential order of presentation in terms of the
phone id (P), example number (E), frame number (F), speaker
number (S) and word number (W). 83
Table 4.4 Example of the confusion matrix. 86
Table 4.5 Summary of the primary training session of the APR. 86
Table 4.6 Summary of the most remarkable IASCs. 87
Table 4.7(a) Responses of the sub-recognisors for expected input
stimulus - Speaker 6. 94
Table 4.7(b) Responses of the sub-recognisors for expected input
stimulus - Speaker 7. 95
Table 4.7(c) Responses of the sub-recognisors for expected input
stimulus - Speaker 8. 96
Table 4.7(d) Responses of the sub-recognisors for expected input
stimulus - Speaker 9. 97
Table 4.7(e) Responses of the sub-recognisors for expected input
stimulus - Speaker 10. 98
Table 4.8(a) Vowels confusion matrix - Stimuli presented versus
sub-recognisor responses. 100
Table 4.8(b) Three most common confusions across speakers for the vowel
subgroup. 101
Table 4.9(a) Diphthong confusion matrix (average values over all speakers). 102
Table 4.9(b) Three most common confusions across speakers for the
diphthongs subgroup. 103
Table 4.10(a) Stops confusion matrix (average values over all speakers). 104
Table 4.10(b) Three most common confusions across speakers for the
stops subgroup. 104
Table 4.11(a) Nasals confusion matrix (average values over all speakers). 105
Table 4.11(b) Three highest confusions of the nasal subgroup. 105
Table 4.12(a) Fricatives confusion matrix (average values over
all speakers). 106
Table 4.12(b) Three most common confusions across speakers for the
fricatives subgroup. 107
Table 4.13(a) Affricatives confusion matrix (average values over
all speakers). 107
Table 4.13(b) Three main confusions for the affricative subgroup
confusion matrix. 108
Table 4.14(a) Semivowels intra confusion matrix. 108
Table 4.14(b) Semivowels inter confusion matrix. 109
Table 4.15 Average of SRS across subgroup. 109
Table 4.16 Average SRS scores for all phones across all speakers. 110
Table 4.17 Summary of SRS < 0.60 and recognition rate across all speakers. 112
Table 4.18 Overall results of 100 words recognition. 116
Table 4.19 Comparison of two-word recognition results over all speakers. 117
Table 4.20 Recognition results of words used in Experiment Three. 119
Table 4.21 Summary of error types. 130
Table 5.1 Abstracted information on the chosen speakers. 133
Table 5.2 Updated phonemic symbols code. 135
Table 5.3 Phone set used in the learning session and their
relevant numbers. 145
Chapter 1: Introduction
It has been established in the field of Speech Recognition (SR) that any level of linguistic
knowledge applied above the level of phone recognition will enhance the performance
of a word recognition system (Furui, 1989). Speech recognition is defined as "the process
of automatically extracting and determining linguistic information conveyed by a speech
wave using computers or electronic circuits" (Furui, 1989). This definition implies that, in
order to achieve tangible results in speech recognition, the problem has to be approached
from a linguistic perspective. This thesis describes the work on the implementation of an
Isolated Word Recognition (IWR) system that combines linguistic knowledge with
Artificial Neural Network (ANN) techniques, both without and with incremental learning.
1.0 System Description
The work described in this thesis comprises two parts: (1) the system that was studied and
implemented between 1992 and 1997 by the author, which is referred to as ‘RUST-I’
(Recognition Using Syntactical Tree), and (2) an incremental learning update of the original
system with variable weight vectors in the neural networks, referred to as ‘RUST-II’. The
novel concepts of the proposed system described in this thesis can be summarised as
follows:
• The adaptive phone recognisor in the system uses a parallel structure, which can now
be implemented using affordable IC chips and offers advantages in processing speed.
• The phone composition of the vocabulary is expressed as a statistically labeled
syntactic / phonemic tree.
• The phone recognition process is controlled by syntactical knowledge in a potentially
adaptive way.
• The system explores and tests the incremental learning algorithm in neural networks
for phone recognition.
RUST-I incorporates two basic levels of statistical knowledge in speech. The first is
phonemic knowledge, in the form of the probability of occurrence of phones in the lexicon
words, and the second is primary syntactic knowledge, in the form of the probability of
occurrence of phones in sentences or sequences of words. A focus on phonemic knowledge
allows RUST-I to operate as a continuous recognisor. The phonemic knowledge source is
used in the overall structure of an Adaptive Phone Recognisor (APR). The phonemic and
statistical knowledge is followed by the syntactic knowledge estimator.
The lexicon of RUST-I was developed using 1357 words, of which 549 were unique. The
words were extracted from three topics, namely finance, physics and general reading
material. The lexicon is not restricted to those topics: it can be updated or specialised
to other topics, and it can be expanded or reduced. A later version of the lexicon was
developed from the TIMIT speech database, with 75 sentences (637 words); due to time
constraints, this later version was not used in RUST-II.
1.1 Thesis Outline
Chapter 2 presents a general introduction to the field of research and the theoretical
background and fundamentals necessary to understand the system aspects. An in-depth
description of the syntactical / phonemic knowledge and the phone recognisor is presented
in Chapter 3. It presents the basics of the language model and the lexicon, and the
formation of the syntactical database and its code activator. This chapter also shows
how the syntactical knowledge and the phonemic knowledge interact within the system's
functionality. In Chapter 4, RUST-I is trained and examined for validity and efficiency as an
isolated phone and isolated word recognisor. The overall performance of the system as an
IWR was found to be dependent on the performances of both the APR and the syntactic
knowledge estimator. Chapter 5 presents a novel technique in the area of incremental
learning neural networks. The purpose of applying the incremental learning technique to
RUST-I is to demonstrate that incremental learning neural networks can contribute to the
development of more robust speech recognition systems. The effort in Chapter 5 is
focused on the development of the incremental learning algorithm.
1.2 Publications
This work has led to the publication of the following two peer-reviewed papers at
international conferences:
DARJAZINI, H. and TIBBITTS, J., 1994. The construction of phonemic knowledge using
clustering methodology. Proceedings of the 5th Australian International Conference on
Speech Science and Technology (SST-94), Perth, December 1994, Vol. 1, 202-207.
DARJAZINI, H., CHENG, Q. and LIYANA-PATHIRANA, R., 2006. Incremental learning
algorithm for speech recognition. Paper accepted by the 8th International Conference on
Signal and Image Processing (SIP-06), August 14-16, 2006.
Chapter 2: Fundamental Concepts
2.0 Introduction
This chapter describes RUST-I, the acquisition of the UWS speech database, and some
fundamental concepts of the signal processing techniques and neural networks that were
used in this work.
2.1 RUST-I Fundamentals
A simplified schematic block diagram of RUST-I is shown in Figure 2.1. RUST-I has a
hybrid structure, combining a low-level ANN-based phonemic knowledge recognisor with
higher-level syntactic knowledge.
Figure 2.1 Simplified schematic block diagram of RUST-I.
RUST-I consists of three main blocks as follows:
• Signal processing (feature extraction) block - which preprocesses the digitised
speech signal and extracts features from it to be used in the recognition process.
• ANN-based phone recognisor block - which performs the phone recognition task and
represents the phonemic knowledge of the system.
• Syntactic knowledge block - which represents the syntactic reference of the system
and contains the lexicon and phonemic database parts. The phonemic statistical
likelihood of occurrence is used in this block and is integrated with the low-level
phonemic knowledge in the ANN-based phone recognisor block to form the complete
RUST-I system.
Figure 2.2 shows a detailed block schematic diagram of RUST-I. The system in the figure
performs multi-speaker, large vocabulary IWR. Digital speech is passed through the
segmentor, which separates the speech into Hamming-windowed frames of 256 points each.
The windowed speech is passed into the feature extractor to derive 12 Mel-frequency
cepstral coefficients (to be discussed later in this chapter) per window. These 12 MFCCs
are passed into the adaptive phone recognisor (APR), which is composed of a bank of 46
sub-recognisors. The sub-recognisors are spatially aligned to respond to the 45 phones of
Australian English plus silence. The output from the APR is passed to the syntactic
knowledge estimator, which, in response, generates activation signals, each of which
selects the most appropriate phone sub-recognisor of the APR to be activated. The
activation signal enables the output of the sub-recognisor of the next phone with the
highest probability among all the phones, based on the pattern of the previously
recognised phone sequence. The output is collected by the accumulator of the syntactic
knowledge estimator block corresponding to the recognised phone, to indicate whether or
not a match occurs between the input data and an estimated phone. The syntactical
knowledge estimator detects the end of a word and releases an End of Word Identifier
(EOWI) signal to the accumulator. This indicates that the recognition process has been
completed, and the accumulator is prompted to supply the final output.
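As an illustration only, the control flow described above can be sketched as follows. The names (`recognise_word`, `score_phone`, `successors`) are hypothetical, not the thesis code: the APR is reduced to a scoring callable, and the syntactic knowledge estimator to a successor table whose entries play the role of the activation signals.

```python
# Hypothetical sketch of the RUST-I recognition loop (not the thesis code).
# score_phone stands in for an APR sub-recognisor response; successors stands
# in for the syntactic knowledge estimator's activation signals.
def recognise_word(frame_groups, score_phone, successors, eowi="#"):
    recognised = []                          # plays the role of the accumulator
    active = successors["<start>"]           # front-edge phone set
    for frames in frame_groups:
        # only the activated sub-recognisors are evaluated
        phone = max(active, key=lambda p: score_phone(frames, p))
        recognised.append(phone)
        active = successors.get(phone, [eowi])
        if active == [eowi]:                 # End of Word Identifier (EOWI)
            break
    return recognised
```

With a successor table describing a word such as 'pit' and a scorer that favours the matching phone, the loop walks the syntactic tree one phone at a time and stops when the EOWI is reached.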
To implement the syntactic knowledge, a lexicon of 1357 words was chosen. This number
can be increased or decreased as necessary to suit a particular application. Therefore,
RUST-I can be regarded as a large vocabulary SR system.
The speech database, which was used to train and test the system, was derived from words
uttered by 15 native Australian English speakers, and is called the “UWS speech database”.
The acquired phone set forms the basic speech units that are the building blocks of the
phonemic knowledge. The functions of the phonemic knowledge and the syntactic knowledge
are integrated so that any phone missing or misrecognised at the phonemic level can be
predicted and compensated for at the syntactic level.
The training data provided to the system is different from the testing data, and both are
acquired from multiple speakers. The training and testing data were acquired in a natural
room environment; therefore, the proposed system is meant to perform in a low-level
ambient noise environment.
Figure 2.2 Detailed schematic diagram of RUST-I.
Blocks in the schematic diagram of Figure 2.2 can be divided into four main parts: the first
part deals with the feature extraction from speech data; the second part is devoted to the
phone recognition within the adaptive phone recognisor; the third part deals with the
derivation of the syntactic knowledge within the syntactic knowledge estimator; and the
fourth part is the accumulation of the phones of the word within the accumulator.
2.2 Feature Extraction
The process of feature extraction includes segmentation, windowing and computation of
12 MFCCs for a sequence of M frames of 256 speech samples.
In RUST-I, the segmentation was carried out manually. In RUST-II (described in
Chapter 5), the segmentation was performed using the phone boundaries provided in the
TIMIT database. The duration of each phone of the Australian phonemic set was segmented
into N frames of 256 samples each. As the phone boundaries were not distinct, an
overlapping segmentation was used, with an overlap of 22% between adjacent frames, to
maintain continuity across phone boundaries.
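The segmentation step described above can be sketched as follows; this is a minimal illustration assuming NumPy, and the function name is hypothetical.

```python
import numpy as np

def segment_phone(signal, frame_len=256, overlap=0.22):
    """Split a phone-length signal into 256-sample frames with a 22%
    overlap between adjacent frames (Section 2.2)."""
    hop = int(frame_len * (1 - overlap))  # samples between frame starts
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] for s in starts])
```

For example, a 903-sample signal (the shortest /I/ quoted below) yields four overlapping frames.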
For example, the phone represented by the phonetic symbol /I/ was isolated from the
words 'pit', 'sing' and 'thin', and measurements showed different time durations for the
signal representing this phone. The shortest duration occurred for speaker 1 in the word
'pit', where the phone takes 903 sampling points (57 ms ≈ 4 frames). The longest duration
for the same phone occurred for speaker 2 in the word 'sing', where the phone takes 1682
sampling points (140 ms ≈ 7 frames). The maximum and minimum values of N (the number of
frames) are passed to the 'frame count parameter estimator'. The frame count parameter M
is calculated for each phone by averaging the maximum and minimum values of N over all
the speakers, as shown in Table 2.1.
Table 2.1 Minimum, maximum and average values of frame number (N).

ID  PH   Nmin  Nmax  Mi  |  ID  PH   Nmin  Nmax  Mi
 1  I      4     7    6  |  24  t      2    20   11
 2  i      5    11    8  |  25  d      2     7    5
 3  ε      5     9    7  |  26  k      4     8    6
 4  æ      6    10    8  |  27  g      2     7    5
 5  a     10    21   16  |  28  f      9    11   10
 6  Þ      6     9    8  |  29  v      3     8    6
 7  Ď      5     6    6  |  30  θ      5    13    9
 8  ɔ     10    12   11  |  31  ð      6    11    9
 9  ʊ      5     6    6  |  32  s      7    15   11
10  u      7    20   14  |  33  z      6    11    9
11  ɜ     11    13   12  |  34  ʃ      9    11   10
12  ə      4    11    6  |  35  ʒ      4    14   19
13  ʌ      4     8    6  |  36  h      3    14    9
14  aI    12    23   18  |  37  r      4     5    5
15  eI     8    23    6  |  38  tʃ     5     7    6
16  ɔI    17    19   18  |  39  dʒ     3     5    4
17  aʊ    15    23   19  |  40  m      4    12   18
18  oʊ     8    22   15  |  41  n      3    13    8
19  Iə    13    19   16  |  42  ŋ      9    15   12
20  εə    11    21   16  |  43  j      3    14    9
21  ʊə    11    17   14  |  44  w      4    11    8
22  p      3     8    6  |  45  l      2    16    9
23  b      1     9    5  |  46  sln    -     -   24

Note: PH = phone, ID = identifier, sln = silence.
The frame count parameter M is calculated in the segmentation block and indicates the
number of windows, or the time duration, of the presented phone. This parameter
determines the number of neuro-slices used in the adaptive phone recognisor.
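The calculation of M described above can be sketched as follows (an illustrative reconstruction; the function name is hypothetical):

```python
def frame_count_parameter(frame_counts):
    """M for one phone: the average of the minimum and maximum frame
    counts N observed for that phone across all speakers (Table 2.1)."""
    return round((min(frame_counts) + max(frame_counts)) / 2)

# Phone /I/ was observed between 4 frames ('pit') and 7 frames ('sing'),
# giving M = 6, the value listed for /I/ in Table 2.1.
```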
To minimise the effects of frame truncation, a windowing function is required. This function
is expected to reduce the discontinuities at the frame boundaries, while maintaining the signal
integrity over most of the frame. The improvement produced by windowing is at the expense
of the transition width (ramping from zero to maximum).
A 330-point Hamming window, w(n), was chosen; this resulted in an almost 20% increase in
the signal intensity at both boundaries (factors of 0.168 and 0.184). This window size
was used to ensure that the assumptions made in the derivation of the cepstrum
coefficients (Section 2.2.1) were valid. A narrower window increases the bandwidth in
the frequency domain and could degrade results (Davis and Mermelstein, 1980).
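The windowing step can be illustrated with NumPy's built-in Hamming window; this sketch windows each frame with a window of the frame's own length rather than the 330-point window described above.

```python
import numpy as np

def window_frame(frame):
    """Multiply one frame by a Hamming window, tapering the frame ends to
    reduce discontinuities at the frame boundaries."""
    return frame * np.hamming(len(frame))  # 0.54 - 0.46*cos(2*pi*n/(N-1))
```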
2.2.0 Review of Feature Extraction Techniques Used in Speech Processing
The selection of an appropriate feature vector representation for a speech signal depends
on the required accuracy of the recognisor, the size of the vocabulary to be recognised
and the structure of the target speech signal (i.e., phone, syllable, word, phrase or
sentence). Researchers (De Mori, 1983; Furui, 1989; Dermody et al., 1986; Fant, 1960;
Flanagan, 1983) have shown that there are inherent perceptually important acoustic
features within the speech waveform. For speech recognition, an adequate vector
representation is needed to extract those perceptually important features. Three main
techniques have been studied over time: filter-bank, Linear Predictive Coding (LPC) and
cepstral representations. This section provides a comparative summary of these three
representations. The comparison is made only at the level of words and phones.
Rabiner and Juang (1993) summarised comparative studies between the filter-bank and the
LPC analysis model representations and showed that the LPC analysis model generally
resulted in improved performance for speech recognition tasks. This work was performed on
telephone-quality speech (sampled at 8 kHz), and so was band-limited to under 4 kHz.
Dermody et al. (1986) showed that some dynamic sounds, such as stop consonants, carry
high-frequency information in excess of 4 kHz, which is lost in telephone-quality speech.
On the other hand, Davis and Mermelstein (1980) and Hunt (1988) showed that the
performance of LPC deteriorates beyond usefulness when it is used with unvoiced sounds or
sounds with spectral zeros (e.g. nasals). However, it was found that altering the type of
LPC analysis, the window size or the order of the filter overcame most of these
difficulties (Deller et al., 1993). Hunt (1988) nevertheless argued that the relative
superiority of an LPC representation over the filter-bank was still in dispute. This was
supported by Markel and Gray (cited in Rabiner and Juang, 1993), who showed that LPC
performance deteriorates in the presence of noise.
Davis and Mermelstein (1980) compared the performance of several types of vector
representation, including MFCC, SCR, the LPC spectrum and reflection coefficients. The
results showed that the LPC spectrum achieved a recognition rate of about 85% and the
reflection coefficients between 77% and 83%.
Love and Kinsner (1992) presented LPC coefficients to a multi-layer perceptron neural
network. The correct recognition performance was from 42% to 68% for vowels; for
consonants (in consonant-vowel form), it was around 33% to 57% with the vowel /a/ and
around 40% to 53% with the vowel /e/. The average false recognition score for the vowels
was 51%.
Creekmore et al. (1991) carried out another comparative study on five spectral
representations as input to a feed-forward neural network. The five representations included
the DFT, autocorrelation based LPC, LPC spectral intensities, LPC cepstral coefficients and
the cepstral coefficients derived from Perceptual Linear Predictive (PLP) analysis. The
recognition rate for all but PLP was around 40% to 41% on an open phone data set. The PLP
analysis method scored 45%.
It was found that the spectral representation derived using the LPC coefficients is
highly speaker dependent (Waibel, 1981); given this speaker dependency and the
performance of the LPC, the technique is not suitable for the purpose of this work.
Ultimately, a method is needed that can extract speaker-independent information from the
spectrum to produce an efficient vector representation for speech recognition, removing
as much of the redundancy associated with speaker identity as possible while retaining
the perceptually important acoustic features. These limitations of the LPC and
filter-bank models led to the decision to exclude both representations from this research.
Davis and Mermelstein (1980) compared the recognition performance of three cepstral
representations, Mel-frequency cepstral coefficients (MFCC), smoothed cepstrum or linear
frequency cepstral coefficients (SCR) and LPC cepstral coefficients (LCC) using template
matching on the phone level. The recognition rates ranged from 86% to 96%. Mel-cepstrum
coefficients produced an improved performance of between 95% and 97% over the other two
cepstral representations. The success of the MFCC has been attributed to the accurate
modelling of the critical band frequencies of the auditory system (Waibel and Yegnanarayana,
1981).
In conclusion, these studies show that the recognition performance of cepstral
representations was higher than that of either the LPC or filter-bank representations. A
perceptually based cepstral representation resulted in a marginally higher performance
score than any of the linear cepstral representations. Hence, the MFCC parameters will be
used in this research.
2.2.1 Speech Modelling and MFCC
An understanding of speech production and speech acoustic features is crucial to speech
modelling. Speech is the result of exciting the vocal tract system with an excitation
which consists of either quasi-periodic impulses or random noise (Flanagan, 1983).
Assuming that the vocal tract system and the excitation are independent, the discrete
time model of speech production is shown in Figure 2.3.
Figure 2.3 Discrete time model of speech (Oppenheim and Schafer, 1989).
To minimise the truncation effect of the segmentation, each frame should be multiplied by
a window function. The Fourier transform of an M-point windowing function w[k] can be
obtained as

    W(\omega) = \sum_{k=0}^{M-1} w[k] \, e^{-j\omega k}.    (2.1)
For a short period (e.g., a frame), the vocal tract can be regarded as a linear
time-invariant system, and the superposition principle applies. Assuming that the impulse
response of the vocal tract is h(n), the speech samples can be modelled as
x(n) = h(n) * e(n), where e(n) = p(n) or e(n) = r(n) and * denotes convolution. Denote by
X(\omega), H(\omega), E(\omega) the Fourier transforms of x(n), h(n), e(n) respectively,
so that X(\omega) = H(\omega) E(\omega). By taking the logarithm of the Fourier
transform, the multiplication is turned into an addition:
\log[X(\omega)] = \log[H(\omega)] + \log[E(\omega)]. The cepstrum transform can then be
obtained from

    C_s(q) = F^{-1}[\log X(\omega)] = F^{-1}[\log H(\omega)] + F^{-1}[\log E(\omega)].

There are two types of cepstra: the complex cepstrum (CC) and the real cepstrum (RC). The
basic difference between these types is that the RC discards phase information whereas
the CC retains it (Deller et al., 1993; Oppenheim and Schafer, 1989). The complex
cepstrum is given by

    CC_s[q] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( \log|X(\omega)| + j \angle X(\omega) \right) e^{j\omega q} \, d\omega

and the real cepstrum by

    RC_s[q] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|X(\omega)| \, e^{j\omega q} \, d\omega.

In order to make \log X(\omega) unique, the argument of X(\omega), \angle X(\omega), must
be an odd continuous function of \omega (Oppenheim and Schafer, 1989). This can be done
by adding multiples of 2\pi to the phase (unwrapping) to meet this requirement;
consequently, the discontinuities associated with computation of the phase modulo 2\pi
are removed.
Figure 2.4 Cepstrum computation procedure.
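The cepstrum computation of Figure 2.4 can be sketched in Python (a minimal sketch assuming NumPy; the random windowed frame and function names are illustrative, not part of the original program):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum (phase discarded)."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)       # guard against log(0)
    return np.real(np.fft.ifft(log_mag))

def complex_cepstrum(frame):
    """Complex cepstrum: the phase is retained and must be unwrapped first."""
    spectrum = np.fft.fft(frame)
    log_spec = (np.log(np.abs(spectrum) + 1e-12)
                + 1j * np.unwrap(np.angle(spectrum)))  # phase unwrapping
    return np.real(np.fft.ifft(log_spec))

frame = np.hamming(256) * np.random.randn(256)       # an illustrative windowed frame
c = real_cepstrum(frame)
```

Note that the real cepstrum needs no unwrapping step, since the phase is discarded before the inverse transform.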
According to human perception, a logarithmic-scale Fourier transform is preferred. This
logarithmic-scale (also called Mel-scale) transform can be obtained by passing X(ω) through
a set of band-pass filters with center frequencies and bandwidths as shown in Figure 2.5(a).
Figure 2.5 Different spacing of band-pass filters:
(a) logarithmic (Oppenheim and Schafer, 1989);
(b) linear (Mihelic et al., 1991).
The MFCC is calculated as

MFCC_i = Σ_{k=1}^{Na} X_k cos[i(k − 0.5)π/Na], (2.2)

where Na is the number of Mel-scale filters and X_k, k = 1, 2, ..., Na, represents the log
energy output of the kth filter. The cosine transform provides an approximation to a set of
triangular band-pass filters. Equation 2.2 applies the cosine transform to the log power of a
Mel-scale filter bank to derive the Mel-scale cepstrum. The low-order terms in the cepstral
magnitude correspond to smooth features in the spectrum, while the higher-order terms
represent the spectral fine features, and are therefore filtered out by any approximation of the
cosine series (Davis and Mermelstein, 1980).
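Equation 2.2 can be sketched directly in Python (assuming NumPy; the function name and the 17-filter random example are illustrative):

```python
import numpy as np

def mfcc_from_log_energies(X, n_coeffs=12):
    """Equation 2.2: MFCC_i = sum_k X_k cos[i (k - 0.5) pi / Na].

    X : log-energy outputs of the Na Mel-scale filters.
    """
    Na = len(X)
    k = np.arange(1, Na + 1)
    return np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / Na))
                     for i in range(1, n_coeffs + 1)])

X = np.log(np.abs(np.random.randn(17)) + 1.0)   # illustrative 17 filter log energies
coeffs = mfcc_from_log_energies(X)              # the 12 coefficients used in RUST-I
```

This is the (unscaled) discrete cosine transform of the log filter energies; a flat log spectrum yields all-zero coefficients, reflecting the smoothing property described above.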
2.2.2 Mel-scale Cepstral Coefficients (MFCC) for RUST-I
It was shown above that cepstral representations produce higher recognition accuracy than
other representations. Hence, the MFCC is chosen as an appropriate
vector representation of speech features for RUST-I. Figure 2.6 shows a block diagram of the
portion of the feature extractor that derives the MFCC vectors. The first block is the power
spectra estimator, calculated from a 512 point Discrete Fourier Transform (DFT).
The second block is the Mel output summer, which determines the Mel-scale filter outputs.
The number of Mel-scale filters required for a signal with maximum frequency 6 kHz is
found from Table 2.2 (Mel-scale frequencies) to be 17. Figure 2.7 shows a simulation of the
spacing of these 17 filters in the frequency range from 0 to 6 kHz.
The third block of Figure 2.6 is the bank which calculates the log of the outputs, m(k), from
each of the 17 Mel-scale filters in dB, as shown in the following equation:

Xk = 10 log10 m(k) (2.3)

where k = 1, 2, ..., 17.
The fourth block of Figure 2.6 is the MFCC Vector Estimator, which calculates the Mel-
frequency cepstral coefficients, DI(12). These are determined by applying the cosine
transformation of Equation 2.2 to Xk, the real logarithm of the short-term power spectrum
expressed on a Mel-frequency scale. A program was written to automatically
compute the MFCCs for all frames of each phone, for all tokens in the speech database. An
algorithm of this program is given in Figure 2.8. The MFCC vectors were normalised within
the range of 0 to +1 for input to the neural network.
Figure 2.6 MFCC extraction block diagram.
Table 2.2 Mel-scale frequency bands.
Index Frequency band[Hz]
1 0-117
2 117-281
3 281-445
4 445-609
5 609-773
6 773-914
7 914-1101
8 1101-1312
9 1312-1570
10 1570-1875
11 1875-2203
12 2203-2625
13 2625-3117
14 3117-3679
15 3679-4359
16 4359-5156
17 5156-6000
Figure 2.7 Simulation of Mel-scale filter frequency bands.
The frequency values were derived and used to calculate the Mel-scale filter outputs, each of
which is the linear sum of the intensities of all line spectra within that frequency band. A
component closest to a band boundary is an exception: it is halved and shared between the
two adjacent bands. For example, the first filter covers the frequency band from 0 to about
117 Hz. The Mel-scale filter output, m(1), is calculated by summing the magnitudes of the
first four values of the line spectrum, s(1) to s(4), plus half of the fifth, s(5), because they fall
within the range 0 to 117 Hz:

m(1) = s(1) + s(2) + s(3) + s(4) + 0.5s(5) (2.4)
The equations to compute Mel-scale output for all 17 Mel-scale filters can be found in
Table 2.3 along with the range of the filter and the number of spectral magnitudes used in
its computation. Table 2.4 shows an example of the output of the program, which
computes MFCCs for one frame of speech signal.
Table 2.3 Equations to compute Mel-scale filter outputs for each of the 17 Mel-scale filters.
#  Range          Mel-scale filter output equation               # of LS
1  0-117 Hz       m(1) = s(1) + s(2) + s(3) + s(4) + 0.5s(5)     4 + 1@0.5
2  117-281 Hz     m(2) = 0.5s(5) + s(6) + ... + 0.5s(12)         6 + 2@0.5
3  281-445 Hz     m(3) = 0.5s(12) + s(13) + ... + 0.5s(19)       6 + 2@0.5
4  445-609 Hz     m(4) = 0.5s(19) + s(20) + ... + 0.5s(26)       6 + 2@0.5
5  609-773 Hz     m(5) = 0.5s(26) + s(27) + ... + 0.5s(33)       6 + 2@0.5
6  773-914 Hz     m(6) = 0.5s(33) + s(34) + ... + 0.5s(40)       6 + 2@0.5
7  914-1101 Hz    m(7) = 0.5s(40) + s(41) + ... + 0.5s(47)       6 + 2@0.5
8  1101-1312 Hz   m(8) = 0.5s(47) + s(48) + ... + 0.5s(56)       8 + 2@0.5
9  1312-1570 Hz   m(9) = 0.5s(56) + s(57) + ... + 0.5s(67)       10 + 2@0.5
10 1570-1875 Hz   m(10) = 0.5s(67) + s(68) + ... + 0.5s(80)      12 + 2@0.5
11 1875-2203 Hz   m(11) = 0.5s(80) + s(81) + ... + 0.5s(94)      13 + 2@0.5
12 2203-2625 Hz   m(12) = 0.5s(94) + s(95) + ... + 0.5s(112)     17 + 2@0.5
13 2625-3117 Hz   m(13) = 0.5s(112) + s(113) + ... + 0.5s(133)   20 + 2@0.5
14 3117-3679 Hz   m(14) = 0.5s(133) + s(134) + ... + 0.5s(157)   23 + 2@0.5
15 3679-4359 Hz   m(15) = 0.5s(157) + s(158) + ... + 0.5s(186)   28 + 2@0.5
16 4359-5156 Hz   m(16) = 0.5s(186) + s(187) + ... + 0.5s(220)   33 + 2@0.5
17 5156-6000 Hz   m(17) = 0.5s(220) + s(221) + ... + 0.5s(256)   35 + 2@0.5

("# of LS" gives the number of full-weight line spectra plus those shared at half weight.)
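The halved-boundary summation of Table 2.3 can be sketched in Python (a sketch assuming NumPy; the boundary bins are read directly off the table, and the function name and random spectrum are illustrative):

```python
import numpy as np

# Upper boundary bins of the 17 filters, read off Table 2.3
# (1-based indices into the 256-point line spectrum).
BOUNDS = [5, 12, 19, 26, 33, 40, 47, 56, 67, 80, 94, 112, 133, 157, 186, 220, 256]

def mel_filter_outputs(s):
    """Sum the line-spectrum magnitudes per filter, halving each shared boundary bin."""
    outs = [s[0:4].sum() + 0.5 * s[BOUNDS[0] - 1]]   # filter 1: s(1)..s(4) + 0.5 s(5)
    for lo, hi in zip(BOUNDS[:-1], BOUNDS[1:]):      # filters 2..17
        outs.append(0.5 * s[lo - 1] + s[lo:hi - 1].sum() + 0.5 * s[hi - 1])
    return np.array(outs)

spectrum = np.abs(np.random.randn(256))   # illustrative 256-point line spectrum
m = mel_filter_outputs(spectrum)          # the 17 filter outputs m(1)..m(17)
```

Because each interior boundary bin contributes half its magnitude to the filters on either side, the total energy of the spectrum is preserved across the filter bank (apart from the half-weighted end bins).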
ALGORITHM FOR COMPUTING MFCC VECTORS OF THE SPEECH SIGNAL SEGMENTS:
clear memory;
open phone file for reading;
open file for writing spectral data;
read phone data as a matrix of 256 columns x n rows;
create Hamming window of 330 points length; (Equation 2.1)
for var1 = 1 to n
    expand frame(var1) to 330 points by setting points outside the frame boundary to 1;
    apply the Hamming window to row(var1);
    compute the order-9 (512-point) FFT of the windowed frame;
    compute the power spectrum of the resulting FFT vector;
    write results into the spectral data file;
end for
open file for writing logarithmic data;
open file for writing cepstral data;
form matrix spec containing the spectral data (n x 330);
for var2 = 1 to n
    form the vector of 330 elements for frame var2;
    compute the log outputs of the 17 Mel-scale filters; (Equation 2.3)
    write the output into the log file;
end for
%% compute the 12 Mel-frequency cepstrum coefficients per frame
for loop1 = 1 to 12
    for loop2 = 1 to 17
        accumulate the MFCC vector element; (Equation 2.2)
    end loop2
    write results into the mfcc file;
end loop1
form vectors of 12 MFCC elements;
initialise max to 0;
scan all vectors for coef > max;
normalised coef = coef / max;
temporally unfold the MFCC vectors according to their frame indices;
write results into ASCII format files;
close all files;
end;
Figure 2.8 Algorithm for program to compute MFCC.
Table 2.4 Example of output from the program that computes MFCCs of one frame of
speech signal representing vowel /a/ acquired from the word 'last' spoken by speaker 11.
Filter  Frequency range [Hz]  Filter output [mV]  Coefficient  Unnormalised value of MFCCi
m(1) 0-117 -1.826000 MFCC1 18.782129
m(2) 117-281 4.581117 MFCC2 -0.639120
m(3) 281-445 2.356047 MFCC3 -11.526380
m(4) 445-609 2.960689 MFCC4 -4.390971
m(5) 609-773 3.711937 MFCC5 -0.735596
m(6) 773-914 3.145611 MFCC6 -7.190951
m(7) 914-1101 1.520497 MFCC7 -1.433654
m(8) 1101-1312 1.402426 MFCC8 -0.499446
m(9) 1312-1570 1.142529 MFCC9 -6.208356
m(10) 1570-1875 -0.163876 MFCC10 -6.149410
m(11) 1875-2203 -2.377023 MFCC11 -3.856123
m(12) 2203-2625 -2.584206 MFCC12 -5.575328
m(13) 2625-3117 -1.600273
m(14) 3117-3679 -0.062703
m(15) 3679-4359 -1.651505
m(16) 4359-5156 -0.316149
m(17) 5156-6000 -1.605814
2.3 Features of Australian English
Australian English differs from other forms of English in the position of vowels and
diphthongs within the vowel triangle; also, it differs in vowel length (Bernard et al., 1989).
Vowels vary amongst talkers in timbre, local duration and the emotional dynamics
incorporated into the sound. Spectrographic analysis of Australian English vowels shows
formants that convey the timbre of the vowels, as illustrated in Figure 2.9.
Figure 2.9 Formant frequency plot of Australian English (general - male) (Bernard et al.,
1989).
Spectrograms in Figure 2.10 reveal that the Australian pronunciations of 'bard' /bad/ and 'bud'
/bʌd/ have a similar formant pattern. The spectrograms also show that the explicit difference
between the two vowels is in their duration, the vowel /a/ being about twice as long as the
vowel /ʌ/. Australian English shows the same pattern in /i/ and /ɪ/, /ʊ/ and /u/, and /æ/ and /e/.
This particular sound-duration pattern is very much a part of the Australian accent and differs
from length patterning observable in other English accents.
Australian vowels are also more pronounced than vowels in other English accents. For
example, the word 'station' is pronounced /'steɪʃən/ in Australian English; note how the vowel
is emphasised by taking on the form of the diphthong /eɪ/.
Figure 2.10 Spectrogram of the words 'bard' /bad/ and 'bud' /bʌd/ pronounced in an
Australian accent (Bernard et al., 1989).
It has been reported (Bernard et al., 1989) that Australian English tends to display
distinctive intonation patterns, within certain characteristic ranges of utterance rate
adopted by the average Australian speaker. Australian speakers can be classified into three
main categories: Broad (30% of Australians), General (60% of Australians) and Cultivated
(almost 10% of Australians). Pronunciation of vowels varies depending on the particular
category of the Australian speaker. For example, the word 'seat' could have pronunciations
ranging from /seɪt/ (Broad) through /sɪit/ (General) to /sit/ (Cultivated). A similar grading
applies to 'say' (/saɪ/ - Broad, /seɪ/ - General and /seɪ/ - Cultivated).
Another significant differentiator in Australian speech is the pronunciation of the centering
diphthongs heard in words such as 'beer' and 'bear'. Cultivated speakers tend to say
/bɪə/ and /bεə/ with a pronounced second element. General speakers have a slight glide
towards the central vowel /ə/. Broad speakers tend to say /bɪ:/ and /bε:/ with hardly any
second element. The effect in the last case is almost to create a lengthened pure vowel.
Table 2.5 shows the International Phonetic Alphabet Symbols for use in Australian English
(Macquarie Library Dictionary, 1998). Symbols in the table have been used throughout this
thesis in relation to the construction of the phonemic and the syntactic knowledge. In addition,
the words in the table have been used to construct the speech database which has been used in
this study (UWS speech database).
Table 2.5 International phonetic alphabet symbols used in Australian English.
(Macquarie Library Dictionary, 1998).
Sound Type        Phonetic Symbol   Example      Phonetic alphabet of the example
Vowels            i                 peat         pit
                  ɪ                 pit          pɪt
                  ε                 pet          pεt
                  æ                 pat          pæt
                  a                 part         pat
                  ɒ                 pot          pɒt
                  ʌ                 but          bʌt
                  ɔ                 port         pɔt
                  ʊ                 put          pʊt
                  u                 pool         pul
                  ɜ                 pert         pɜt
                  ə                 apart        ə'pat
                  ɒ̃                 bon voyage   bɒ̃vwa'jaʒ
Diphthongs        aɪ                buy          baɪ
                  eɪ                bay          beɪ
                  ɔɪ                boy          bɔɪ
                  aʊ                how          haʊ
                  oʊ                hoe          hoʊ
                  ɪə                here         hɪə
                  εə                hair         hεə
                  ʊə                tour         tʊə
Consonants
Plosives (stops)  p                 pet          pεt
                  b                 bet          bεt
                  t                 tale         teɪl
                  d                 dale         deɪl
                  k                 came         keɪm
                  g                 game         geɪm
Affricates        tʃ                choke        tʃoʊk
                  dʒ                joke         dʒoʊk
Nasals            m                 mile         maɪl
                  n                 neat         nit
                  ŋ                 sing         sɪŋ
Fricatives        f                 fine         faɪn
                  v                 vine         vaɪn
                  θ                 thin         θɪn
                  ð                 then         ðεn
                  s                 seal         sil
                  z                 zeal         zil
                  ʃ                 show         ʃoʊ
                  ʒ                 measure      mεʒə
                  h                 heat         hit
Semi-vowels       j                 you          ju
                  w                 woo          wu
Laterals          l                 last         last
                  r                 rain         reɪn
2.4 UWS Speech Database Acquisition
This section describes the acquisition of the non-standard speech database used in this
research. This database is referred to throughout this thesis as the “UWS speech database”.
The focus of the UWS speech database is solely on Australian English.
The UWS database consists of words chosen to cover the full Australian English phonemic
set. The speech data have been segmented and labelled at the phonemic level. The database
consists of 45 words spoken in Australian English by 15 adult speakers (10 male, 5 female)
with native Australian English (at least second generation in Australia) and an average age of
26. Each of the 45 words contains at least one of the phones of Australian English (Macquarie
Dictionary, 1994), and the words are listed in Table 2.5. This word set allows for multiple
representations (from 1 to 16) of most phones in different positions within the word (initial,
central and final). The duration of the phones ranged from 12 ms to 487 ms.
The speakers had a general to broad Australian accent. Each speaker read the set of words,
one word at a time. The words were stored in files classified into subdirectories labelled by
speaker. A MATLAB program was written to process each file separately. The program
opens each file and blocks it into frames of 21.3 ms duration (256 points).
A sampling rate of 12 kHz was chosen as a compromise between accuracy and processing
time/complexity. This may limit the cues available for dynamic speech such as the stops,
where the maximum frequency in the signal can extend up to 8 or 9 kHz (Dermody et al.,
1986). The data were prefiltered at fs/2 using a digital tracking anti-alias filter.
Recording was done in the naturally noisy environment of a computer room with an
approximate signal-to-noise ratio of 30 dB. The recorded speech was then phonemically
segmented and labelled by a manual process. Each phone of the Australian phonemic set was
segmented into N frames of 256 points each. The database contained 45 different phones. An
overlap-and-add method was used for segmentation, where an overlap of 22% (256 data
points extracted every 200 data points) was found to maintain continuity across frame
boundaries.
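The overlapping segmentation described above can be sketched as follows (a Python sketch with illustrative names; the thesis used a MATLAB program for this step):

```python
import numpy as np

FRAME_LEN = 256   # 21.3 ms at the 12 kHz sampling rate
HOP = 200         # 256 points taken every 200 points -> 56-point (~22%) overlap

def segment(signal):
    """Slice a labelled phone into overlapping frames of 256 points."""
    n_frames = 1 + max(0, (len(signal) - FRAME_LEN) // HOP)
    return np.stack([signal[i * HOP:i * HOP + FRAME_LEN]
                     for i in range(n_frames)])

frames = segment(np.arange(1000.0))  # a 1000-sample phone yields 4 frames
```

The last 56 points of each frame coincide with the first 56 points of the next, which is the 22% overlap quoted above.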
The Hypersignal Acoustic™ software package was used for manual segmentation and
labelling at the phonemic level. This process cannot be explicitly defined within the whole
word, as phones do not have clear boundaries but instead run into each other
(co-articulation). For example, the phone /ð/ as in 'the' overlapped into the next phone,
whereas the phone /θ/ could be isolated easily. In editing the segments, consonants were
considered to end at the point where the speech signal showed a significant shift in amplitude
and/or at the onset of regularity and periodicity; this was verified by perceptual judgement.
Diphthongs were segmented and labelled as distinct phone units to minimise the possibility
of confusing them with the other vowels.
2.5 Techniques Used in Speech Recognition
In this section, a brief description of the current techniques used in speech recognition is
presented. This covers the most popular techniques used to date, including Pattern
Recognition (PR), Hidden Markov Models (HMM), Artificial Neural Networks (ANN),
Artificial Intelligence (AI) and hybrid ANN/HMM systems.
2.5.0 Pattern Recognition (PR)
Pattern Recognition (PR) is a well-known technique in the field of image recognition as
well as in speech recognition. Pattern recognition means the identification of the ideal
(pattern) which represents a given object. In speech recognition, PR uses the speech pattern
directly, without explicit feature determination and segmentation. The technique has two
steps. The first is to find the ideal speech pattern (training); the second is the recognition
of patterns via a comparison process. The concept is that if enough versions of a pattern
are included in the training set provided to the algorithm, the training procedure should
be able to adequately characterise the acoustic properties of the ideal pattern. Then, by
direct comparison between the ideal and an unknown speech pattern, the system should be
able to classify the input as one of the patterns known to the system.
Some researchers (e.g., Rabiner and Juang, 1993) observed advantages of this technique,
such as:
1. Simplicity.
2. Robustness and invariance to different speech vocabularies, users, feature sets,
pattern comparison algorithms and decision rules.
3. Acceptable performance for some speech recognition tasks.
However, such pattern recognition systems could achieve comparatively better rates only for
speaker-dependent templates and for a limited vocabulary.
2.5.1 Hidden Markov Model (HMM)
The HMM approach is a statistical method of characterising the spectral properties of the
frames of a pattern. The key assumption of HMM is that the speech signal can be well
characterised as a parametric random process, and that the parameters of the random process
can be estimated. This technique showed better recognition results when compared with
PR. In applications of the HMM technique to Isolated Word Recognition (IWR) research, a
statistical model of each word in the vocabulary was constructed. Each input word was
recognised as the word in the vocabulary whose model assigns the greatest likelihood to
the occurrence of the observed input pattern. HMM integrates both syntax¹ and semantics²
well into systems (Rabiner and Juang, 1993). Thus, when constructing the statistical
model of the HMM for selected problems in SR, there are three key issues that have to be
addressed:
1. Evaluation of the probability (or likelihood) of a sequence of observations given a
specific HMM. This represents the efficiency of computing the probability of an
observation, P(O|λ), which is denoted the probability of the observation sequence
or state sequence.
2. Determination of the best sequence of model states, which produces the optimal
model for that application (i.e., which best explains the observation).
3. Adjustment of the model parameters so as to best account for the observed signal,
i.e., adjustment of λ to maximise P(O|λ).
In HMM, each word is represented by a set of states (including initial and final states) with
the probabilities of transitions from state to state. Each state has an associated random
variable whose value is a vector of acoustic parameters. The variability of each spoken
word is therefore modeled by N distinct random variables, where N is the number of states
in the model. Many HMM recognisors have an HMM model based on phones (Grant,
1991). The final stage of recognition then combines lexical knowledge with phonemic
knowledge by concatenating the phone HMMs into words.

¹ Syntax: Grammar; the patterns of formation of sentences and phrases from words in a
particular language (Macquarie Library Dictionary, 3rd ed., 1998).
² Semantics: Relating to meaning (Macquarie Library Dictionary, 3rd ed., 1998).
Recognition rates in systems employing HMM varied depending on the type of recognition
task the system was required to perform. An example is the system tested by Pepper and
Clements (1992), who described experiments on phonemic recognition using a large HMM;
the system achieved recognition rates ranging between 52.2% and 53.3% depending on the
size of the system used in the experiment. Other experiments employed HMM with
temporal cues to recognise nonsense consonant-vowel (CV) syllables (a consonant with the
vowel /e/) (Flaherty and Poe, 1993). They reported an HMM system that achieved a
recognition accuracy of 74% using time-varying information, compared with 50% without
that information.
When employing pure HMM in IWR research, a variety of acoustic cues has been employed
to construct the statistical model of the HMM. For instance, Gupta et al. (1991) reported
improvements in recognition accuracy when employing temporal cues combined with
energy-contour information of phones to construct the HMM. By applying minimum
duration and energy thresholds, the accuracy improved from 23.1% to 27.3% in the case of
acoustic-cue recognition, and from 8.8% to 14.3% with the language model. The system was
built as a speaker-dependent system for a large vocabulary. It can be noted here that the
results of this system are consistent with the work of Flaherty and Poe (1993).
The results from the various systems discussed above showed wide variability in the
performance of the HMM when the states of the model represent phones, syllables or
words. This is closely related to the first essential issue in HMM design mentioned above.
Hence, before using an HMM, one must answer the following question: what do the states
in the model correspond to? One must then decide how many states should be in the model
and identify the initial state. Generally, states are interconnected in such a way that any
state can be reached from any other state. This increases the computational cost massively,
even for a few states. The reason can be found in the observational nature of the HMM
representation: the probabilistic function of the states. That is, the HMM is a doubly
embedded stochastic process with an underlying stochastic process that is not directly
observable, but which can be observed only through another set of stochastic processes
that produces the sequence of observations (Rabiner and Juang, 1993).
HMM modeling necessarily computes the variability of the spectra at different parts of each
word. It also has variable time-distortion penalties, and it relates these penalties to the
spectral distortion penalties in a theoretically defensible way. On the other hand, its timing
model is unrealistic, in that the probability of staying in a given hidden state decays
exponentially with time (Hunt, 1988).
When using HMM, the recognition problem is usually formulated as one of finding the
sequence of states in the hidden Markov chain whose a posteriori probability is maximum.
The easiest way of doing this is by means of the Viterbi algorithm (Kenny, 1993). However,
this algorithm suffers from several drawbacks:
1. It is an exhaustive search. For phone-based recognisors with large vocabularies,
the speech model can be very large and the search requires expensive computational
time. Although the Viterbi algorithm is the search strategy usually used in
medium-vocabulary applications (around 1,000 words), it is not clear how it can be
extended to very large vocabulary applications (around 100,000 words). It can be
observed that recognition rates declined as the vocabulary increased in such systems
(Kenny, 1993).
2. It generates only one recognition hypothesis. Although it can be modified to
generate the N best hypotheses, the amount of computation increases proportionately.
3. The simple device of imposing context-dependent minimum duration constraints
on phone segments in recognition has been found to lead to major improvements
in recognition performance (Gupta et al., 1991). Because of their non-Markovian
nature, these constraints cannot be accommodated by the Viterbi algorithm without
changing the topology of the model. It is possible to modify the Viterbi algorithm
so that this can be done, but there is a substantial price to be paid (Kenny, 1993).
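For reference, the Viterbi search discussed above can be sketched as follows (a minimal log-domain implementation assuming NumPy; the variable names and two-state toy model are illustrative, and no pruning or duration constraints are included):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence for one observation sequence.

    log_pi : (N,)   initial state log probabilities
    log_A  : (N, N) state transition log probabilities
    log_B  : (T, N) per-frame observation log likelihoods
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # score of every predecessor state
        psi[t] = scores.argmax(axis=0)       # best predecessor for each state
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack through best predecessors
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Two-state toy model: start in state 0, observations then favour state 1.
NEG = -1e9  # stands in for log(0)
path, score = viterbi(
    np.array([0.0, NEG]),
    np.log([[0.5, 0.5], [0.001, 0.999]]),
    np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]),
)  # path is [0, 1, 1]
```

The exhaustive nature of the search is visible directly: every state considers every predecessor at every frame, which is the O(T·N²) cost criticised in point 1 above.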
Most recent research in HMM has concentrated on improving the recognition rate of
systems that employ HMM. This can be seen as an attempt to compensate for the price paid
because of the drawbacks mentioned above, and is the core idea of the system reported by
Gupta et al. (1991). Another example in this context comes from Smith et al. (1995), who
reported an experiment aimed at optimising HMM performance. They described a system
which used two kinds of inter-frame dependent observation structures, both built on
observation densities of a first-order dependent form, which accounts for the statistical
dependence between successive frames. In the first model, the dependency relation among
the frames was determined (optimally) by maximising the likelihood of the observations in
both training and testing. In the second model, the dependency structure associated with
each frame was described by a weighted sum of the conditional densities of the frame given
individual previous frames. To estimate the parameters of the two models, the system was
implemented with the segmental K-means and forward-backward algorithms, respectively.
The system was then tested on an IWR task, and achieved better performance than both the
standard continuous HMM and the paradigm-constrained HMM. However, this report is
similar to other HMM reports in that it did not provide details of the computational price
paid for the improvement. In summary, the following points can be extracted:
1. To construct a recognition system based on phonemic recognition, a large HMM is
required (refer to Pepper and Clements, 1992).
2. To achieve reasonable accuracy and recognition rate in a syllable-based recognition
system, temporal cues must be incorporated into the system (refer to Flaherty and Poe,
1993; Gupta et al., 1991).
3. The previous two points cause a significant increase in the computational price of
the system, and this price will be higher in the case of larger-vocabulary IWR or
continuous SR systems.
4. The duration of the acoustic events associated with each state is inadequately
modeled (Hunt, 1988). This is especially critical for RUST-I, as the durations of the
phones in the associated phonemic knowledge inherit a temporal tolerance margin,
which requires more flexible techniques such as neural networks.
Therefore, developing a system that comes closer to the ultimate goal of SR using the pure
HMM technique is a much harder option.
2.5.2 Artificial Neural Network (ANN)
The use of artificial neural networks is the main technical discipline of neurocomputing
technology, which is concerned with information processing systems that autonomously
develop operational capabilities in adaptive response to an information environment
(Hecht-Nielsen, 1990).

Technically, an ANN can be defined as a parallel, distributed information-processing
structure consisting of processing elements, which can possess a local memory and can
carry out localised information processing operations. All elements in the structure are
interconnected via unidirectional signal channels called connections. All connections have
associated adjustable weights, which perform the learning process in the structure.

Researchers in the field of SR realised that ANN can work as well as HMM, or even better,
when dealing with speech patterns (Deller et al., 1993). The initial search was for an
alternative system that could handle the highly variable nature of speech patterns. The
required system should be able to generalise the problem of pattern recognition; it should
also be non-algorithmic in nature and able to adapt. The system is fed examples of speech
patterns so that it can learn the general features of speech. Consequently, it is expected to
be capable of recognising any similar patterns.
ANNs are known for their adaptive, self-organising and fault-tolerant functions and their
non-linear capabilities. This makes them particularly applicable to the problem of SR. ANNs
are often used in speech processing to implement pattern recognition, i.e., to associate input
patterns with classes (classification) (Deller et al., 1993). Within this function, at least three
subtypes of classifier can be delineated. In the first, an output pattern results which
identifies the class membership of the input pattern. The second is a vector quantisation
function, in which vector input patterns are quantised into a class index by the network; this
application is reserved for a particular type of ANN architecture that is trained differently
from the more general types of pattern-associator network. The third subtype of classifier is
called the associative memory network. This type of network is used to produce a
memorised pattern or class exemplar as output in response to an input, which might be a
noisy or incomplete pattern from a given class.
In addition to pattern recognisors, a second general type of ANN is the feature extractor. The
basic function of such an ANN is the reduction of large input vectors to small output vectors
that effectively characterise the classes represented by the input patterns. The feature
extractor reduces the dimensions of the representation space by removing redundant
information. It is also sometimes the case that feature representations appear as patterns of
activation internal to the network rather than at the output. An example of this is given by
Waibel et al. (1989).
The classical application of ANN to SR has focused on the fundamental problem of
classifying static, pre-segmented speech, predominantly employing either Multi-Layer
Perceptron (MLP) or Learning Vector Quantiser (LVQ) topologies. A list of classical studies
in SR using ANN can be found in Table 2.6.

It can be noticed from Table 2.6 that the architectures are either MLP or LVQ, except in the
case of the Feature Map Classifier (FMC) of Huang and Lippman (1988), which is a
hierarchical network consisting of an LVQ-like layer followed by a perceptron-like layer. All
MLPs were trained by the back-propagation (BP) learning algorithm.
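A minimal sketch of an MLP trained by back-propagation, of the kind used in these studies (assuming NumPy; the toy data standing in for MFCC frames, the layer sizes and the learning rate are all illustrative, not taken from any system in Table 2.6):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for frame classification: 12-dim vectors (like MFCC frames), 3 classes.
means = 3.0 * rng.normal(size=(3, 12))
X = rng.normal(size=(90, 12)) + np.repeat(means, 30, axis=0)
y = np.repeat(np.arange(3), 30)

W1 = rng.normal(scale=0.1, size=(12, 16)); b1 = np.zeros(16)   # hidden layer
W2 = rng.normal(scale=0.1, size=(16, 3));  b2 = np.zeros(3)    # output layer

def forward(X):
    h = np.tanh(X @ W1 + b1)
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)     # softmax class probabilities

targets = np.eye(3)[y]
lr = 0.1
for _ in range(300):                 # back-propagation: gradient descent on cross-entropy
    h, p = forward(X)
    d2 = (p - targets) / len(X)      # error at the output layer
    d1 = (d2 @ W2.T) * (1 - h ** 2)  # error propagated back through the tanh layer
    W2 -= lr * (h.T @ d2); b2 -= lr * d2.sum(axis=0)
    W1 -= lr * (X.T @ d1); b1 -= lr * d1.sum(axis=0)

acc = float((forward(X)[1].argmax(axis=1) == y).mean())
```

The two update lines are the essence of BP: the output-layer error is propagated backwards through the connection weights, and every adjustable weight is moved against its error gradient.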
Table 2.6 Classical studies in SR using ANN applied to pre-segmented speech.
Study                        Approach/Problem
Elman and Zipser (1987)      MLP – Consonant and vowel recognition
Huang and Lippman (1988)     MLP and FMC – Vowel discrimination
Kammerer and Kupper (1988)   MLP and single layer of perceptrons – Speaker-dependent and speaker-independent word recognition
Kohonen (1988)               LVQ – Labelled Finnish speech
Lippman and Gold (1987)      MLP – Digit recognition
Peeling and Moore (1987)     MLP – Digit recognition
Ahalt et al. (1991)          MLP and LVQ – Vowel discrimination, gender discrimination, speaker recognition
These classical studies, which applied the ANN technique to relatively simple SR problems,
triggered hundreds of related studies. Many possible ANN architectures were tested for SR to
assess topology, training time and recognition rate. Many of the known vectorial input
representations were also applied, and the recognition rates were monitored. Table 2.7 is a
summary of studies with their variations in input parameters.

Inspired by the classical work on ANN applications, most of the experiments in Table 2.7
retained the pure MLP topology; in some cases it was combined with self-organising
networks. The input vectors for the networks varied in each particular study to explore its
possibilities. The overall performance of the ANN was compared to HMM techniques
applied to similar tasks. Gramss (1992) showed that the use of ANN achieves faster results
than HMM.
Table 2.7 shows that, in each case, the resultant accuracy was related to the type of input
presented to the network. From the table, the results reported by Shim et al. (1991) were
produced using an MLP/BP network with LPC input vectors, Davenport and Garudari (1991)
used a Receptive Field network with wavelet input vectors, and Escande et al. (1991) used a
GP network with time-frequency spectral input vectors. All these methods showed lower
overall recognition accuracy. It should also be noted that there is a relation between the
topology of the ANN and its accuracy; the Receptive Field and GP topologies achieved
lower accuracy too.
Table 2.7 Summary of studies, which employed ANN for speech signal processing.
Study | ANN topology/learning algorithm | Input type | Dependency & recognition type | Vocabulary/database | Speakers | Accuracy (max.)
Shim et al. (1991) | MLP/BP | LPC | MSD/CV | 16/unknown | 3 | 70%
Davenport & Garudari (1991) | Receptive field/supervised | Wavelet | Speaker-independent/feature extractor & recognisor of phones | 795/TIMIT | 48 | 81%
Escande et al. (1991) | GP | Time-frequency spectral representation | IWR | Digits/RSG10 NATO | 4 | Accuracy less than that for other systems
Gramss (1992) | FFNN | Contrasted spectrograms | Speaker-independent/IWR | RSRE & DPI digit databases (German) | unknown | 97.1%, 94.5% (faster than HMM)
Kuang & Kuh (1992) | Combination of self-organising feature map and MLP | Various parameters | MSD/IWR | 20 words (10 digits & 10 control words)/TI20 | 4 | 99.5%
Kitamura et al. (1992) | CombNet: self-organising & 4-layer MLP | TDMC | Speaker-dependent IWR | 100/Japanese cities | 9 | 96.8%
Kitamura et al. (1992) | CombNet: self-organising & 3-layer MLP | TDMC | Speaker-dependent IWR | 100/Japanese cities | 9 | 99.1%
Elvira and Carrasco (1991) carried out a study comparing the most popular topologies and various input parameters. They concluded that the most common topologies are:
1. Adaline.
2. Monolayer Perceptron.
3. Back-propagation with the Sigmoid function.
4. Back-propagation with the hyperbolic tangent function.
5. Radial Basis functions (RBF) network with the Gaussian function.
6. Volterra connectionist model.
The following input parameters were used:
1. 12-parameter PARCOR linear predictive coding (LPC) using the Durbin method.
2. 12 frequency band coefficients calculated using Mel-scale distribution from
256 frequency coefficients obtained using the Fast Fourier Transform
(FFT).
3. 12 frequency band coefficients calculated on a linear scale distribution from
the same 256 frequency coefficient FFT.
4. 12 Mel cepstrum coefficients.
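Input parameter 2 above (Mel-scale frequency band coefficients derived from a 256-coefficient FFT) can be sketched as follows. This is a minimal illustration, not the implementation of Elvira and Carrasco; the sample rate, triangular filter shapes and normalisation are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_energies(power_spectrum, sample_rate=8000, n_bands=12):
    """Pool a one-sided power spectrum into n_bands triangular filters
    spaced equally on the Mel scale between 0 Hz and Nyquist."""
    n_bins = len(power_spectrum)
    edges_mel = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2)
    edges_bin = np.floor(
        mel_to_hz(edges_mel) / (sample_rate / 2.0) * (n_bins - 1)).astype(int)
    out = np.zeros(n_bands)
    for b in range(n_bands):
        lo, mid, hi = edges_bin[b], edges_bin[b + 1], edges_bin[b + 2]
        for k in range(lo, hi):
            # Triangular weight: rises lo -> mid, falls mid -> hi.
            w = (k - lo) / max(mid - lo, 1) if k < mid else (hi - k) / max(hi - mid, 1)
            out[b] += w * power_spectrum[k]
    return out

# Toy frame: a low-frequency sinusoid analysed with a 512-point FFT,
# keeping 256 power-spectrum coefficients as in the study above.
frame = np.sin(2 * np.pi * 0.05 * np.arange(512))
spectrum = np.abs(np.fft.rfft(frame, 512))[:256] ** 2
bands = mel_band_energies(spectrum)   # 12 Mel band coefficients
```

The linear-scale variant (input parameter 3) differs only in spacing the band edges linearly in Hz rather than on the Mel scale.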
In the above-mentioned study, two databases were used for training and another two for testing. The ANN was tested on vowels. The results showed that BP-tanh gave the best performance for any number of training iterations when Mel-scale FFT coefficients were used as inputs. For digit recognition, BP-sigmoid achieved the best performance relative to the number of iterations (Table 2.8). These results are consistent with the research details presented in Tables 2.6 and 2.7.
Table 2.8 Percentage of correct recognition for various topologies related to the number of
iterations (Elvira and Carrasco, 1991).
Iterations 50 100 150 200 250 300
Adaline 53.64 54.77 57.05 54.28 53.15 56.58
Perceptron 51.55 55.66 56.24 53.31 55.20 56.02
BP-Sigm 64.53 66.99 67.83 68.18 68.93 68.10
BP-Tanh 62.33 69.21 63.49 63.32 66.14 62.08
RBF 57.67 59.06 60.77 57.85 - -
Volterra 60.27 62.03 60.44 61.60 - -
These findings have generated confidence in using the MLP topology and its derivatives for SR. Neural networks are also used in speech perception. Cassidy and Harrington (1992) carried out a study using one of the MLP derivatives with a sigmoid output function. Their aim was to investigate the validity and the importance of the dynamic structure of vowels in vowel perception. Vowels were represented by Bark spectra and applied to the input of the ANN. The performance of the ANN confirmed the importance of the dynamic structure of vowels.
Generally speaking, a standard ANN is structured to work with static patterns. When applied to speech, which is dynamic in nature, the ANN structure needs to be modified, and researchers have employed various architectures to accommodate this dynamic requirement. For instance, Zhang et al. (1995) employed a high-order fully recurrent ANN for this purpose. The proposed system provides effective processing of temporal information within speech signals by implementing an ANN with a self-organising input layer followed by a fully recurrent hidden layer and an output layer.
The most popular option which has been applied to speech signals is the Time Delay Neural Network (TDNN) (Waibel et al., 1989). Figure 2.11 shows a simplified architecture of this network. The structure of the TDNN extends the input of each computational element to include N speech frames represented by spectral vectors over a duration of NΔ seconds, where Δ is the time interval between adjacent speech frames.
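The TDNN idea amounts to a one-dimensional convolution over time: each computational element sees N consecutive spectral frames rather than a single static vector. A minimal NumPy sketch follows; the frame dimension, number of units and delay depth are arbitrary illustrative choices, not values from Waibel et al.

```python
import numpy as np

def tdnn_layer(frames, weights, bias):
    """One TDNN layer.  Each output unit sees N consecutive input
    frames (the time delays), i.e. a 1-D convolution over time.
    frames:  (T, d)        one spectral vector per frame
    weights: (units, N, d) one weight block per delay
    bias:    (units,)
    Returns (T - N + 1, units) activations."""
    T, d = frames.shape
    units, N, _ = weights.shape
    out = np.empty((T - N + 1, units))
    for t in range(T - N + 1):
        window = frames[t:t + N]       # N frames spanning N*delta seconds
        out[t] = np.tanh(
            np.tensordot(weights, window, axes=([1, 2], [0, 1])) + bias)
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 16))      # 20 frames, 16 spectral coefficients
w = rng.standard_normal((8, 3, 16)) * 0.1   # 8 units, 3-frame delay window
acts = tdnn_layer(frames, w, np.zeros(8))
```

Because the same weight block is reused at every time step, the layer detects acoustic events regardless of their exact position in time, which is the property that made the TDNN attractive for speech.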
Other ANN structures have been proposed but did not prove to be as useful as the TDNN. An example is the Hidden Control Neural Network (HCNN) (Rabiner and Juang, 1993) shown in Figure 2.12. This network uses a time-varying control input, c, as a supplement to the standard input, x, allowing the network properties, or input-output relations, to change over time in a prescribed manner.
Figure 2.11 Time delay computational element (Sugiyama, et al., 1991).
Figure 2.12 Hidden control neural network (Rabiner and Juang, 1993).
In a similar way to the TDNN, architectures that convert the time dimension of the speech signal into a distributed spatial structure have been used. An example of this architecture is the network reported by Cassidy and Harrington (1992). That ANN maps the temporal dimension onto a spatial dimension of an MLP consisting of four layers of units connected by links of varying delays. The model performance was measured using two types of test sets; Table 2.9 summarises the performance of that ANN.
Table 2.9 Results of open tests for ANN trained on full vowel and steady-state vowel
(Cassidy and Harrington, 1992).
Training Set Test Set Correct (%) Rejected (%) Error (%)
Full Vowel Full 90.0 5.0 5.0
Steady-State S-S 73.2 7.5 19.3
2.5.3 Advantages of ANN
The advantages of applying artificial neural networks to the problem of speech recognition are:
• The parallel nature of ANN. The parallel distributed processing of an ANN gives it the ability to adapt, which is at the very centre of ANN operation. Adaptation takes the form of adjusting the connection weights to achieve the desired mapping of inputs to outputs. Furthermore, an ANN can continue to adapt and learn (incremental learning), which is extremely useful in processing and recognising speech. Adaptation (learning) algorithms continue to be a major focus of research in the ANN field (Hecht-Nielsen, 1990).
• Robustness and fault tolerance. ANNs tend to be robust and fault-tolerant because the network is composed of many interconnected Processing Elements (PEs), all computing simple mathematical functions in parallel. The failure of a PE is compensated by redundancy in the network. Similarly, an ANN can often “generalise” a reasonable result from incomplete or noisy data. Finally, in direct contrast to HMM, when an ANN is used as a classifier it does not require a strong statistical characterisation of the data (Deller et al., 1993).
Since information, or a relationship, is embedded in the ANN and spread amongst the PEs within the network, this structure has low sensitivity to noise or defects within the structure (Laurene, 1994). Robust speech recognition is still a major research topic, and the robustness of ANNs has shown some promising results. For example, Sorensen (1991) showed an improvement of 65% in the average recognition rate when a noise reduction neural network was added to the system under evaluation. In this case the network was provided at its input with cepstral coefficient vectors derived from isolated words corrupted by noise from a non-stationary source.
Another advantage of ANNs comes from the variability of the connection weights, which allows ANNs to adapt in real time and improve the overall performance of the system. Adaptive learning is the most important advantage of ANNs and results from the non-linearity of their activation functions. This means that large ANNs can approximate a non-linear dynamical system (Rabiner and Juang, 1993), which conveniently accommodates the dynamic nature of speech.
2.5.4 Artificial Intelligence (AI)
This review of the AI technique briefly presents the basic use of AI within SR. The AI technique applied to SR is a hybrid technique that integrates acoustic-phonetic phenomena with pattern-recognition concepts.
As a basic concept in AI, Expert Systems (ES) have achieved remarkable success in many domains (business, robotics, biomedical engineering) (Hunt, 1988), which has led to intense interest in their application to SR. ESs are intended to model human conscious reasoning. An ES attempts to decode speech information at the phonetic level by modeling the behaviour of a phonetician reading spectrograms; this can be described as simulating human intelligence in visualising, analysing and finally making a decision on the extracted acoustic features.
Studies which apply ES to the SR task can be divided into two broad categories:
• Systems of rules embodying human knowledge of what characterises speech sounds as they appear in spectrograms, so that the ES models a skilled spectrogram reader. It transpires that such systems of rules are much less effective at decoding speech than a human listener with normal hearing ability.
• Systems that use statistical properties of training material to compare patterns on continuous scales.
Rabiner and Juang (1993) defined the knowledge sources as:
Acoustic knowledge: evidence of which sounds (predefined phonetic units) are spoken, based on spectral measurements and the presence or absence of certain acoustic features.
Lexical knowledge: the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).
Syntactic knowledge: the combination of words to form grammatically correct strings (according to a language model) such as sentences or phrases.
Semantic knowledge: understanding of the task domain of the speech so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or with the meaning of previously decoded sentences.
Pragmatic knowledge: the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.
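These knowledge sources are typically combined as probabilistic scores. The following toy sketch (hypothetical words and numbers, not taken from any cited system) shows how a higher-level syntactic prior can resolve two acoustically near-identical word candidates:

```python
import math

# Hypothetical log-likelihoods from the acoustic level and word
# probabilities from the syntactic/lexical level (toy numbers).
acoustic = {"meet": -12.0, "meat": -12.1}   # log P(acoustics | word)
syntactic = {"meet": 0.02, "meat": 0.10}    # P(word | preceding words)

def best_word(acoustic, syntactic):
    """Bayes-style combination: argmax over words of
    log P(acoustics | w) + log P(w | context)."""
    return max(acoustic, key=lambda w: acoustic[w] + math.log(syntactic[w]))

# The acoustic level alone slightly prefers "meet"; the syntactic
# knowledge source overturns it in favour of "meat".
choice = best_word(acoustic, syntactic)
```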
The incorporation of such levels of knowledge within a system enhances its ability to recover corrupted speech. This was studied by De Mori (1983), who carried out experiments on speech sounds that were selectively masked by noise. The results of the experiments showed that listeners used semantic, syntactic, prosodic (rhythm or intonation), pragmatic, phonetic (the body of facts about speech and its production) and acoustic knowledge to understand corrupted or uncorrupted speech. These experimental results support the use of a language model in which high-level syntactic knowledge supports the acquisition and retrieval of lower-level phonemic knowledge.
The ES approach may be appropriate for the organisation of higher-level syntactic and particularly semantic information, which is susceptible to conscious analysis. The effective use of such higher-level information will be necessary to achieve sophisticated SR (Hunt, 1988). An ES cannot be successfully used for segmentation and labeling (Rabiner and Juang, 1993). In particular, methods that integrate phonemic, lexical, syntactic, semantic and even pragmatic knowledge into the ES have been proposed and studied. The main requirement in such methods is that the learning should adapt to the dynamic component of the data. For example, the expert system approach to segmentation and labeling would augment the generally used acoustic knowledge with phonemic, lexical, syntactic, semantic and even pragmatic knowledge (Rabiner and Juang, 1993).
The main advantage of integrating a higher-level knowledge source into a recognition system is a significant improvement in the word-correction capability of the system (Rabiner and Juang, 1993).
It can be concluded that a variety of knowledge sources need to be established in AI. Two key concepts of AI must therefore be addressed: automatic knowledge acquisition (learning) and adaptation (learning on the run). One of the ways in which these concepts can be implemented is by using ANN. This idea was tested by Shuping and Millar (1992), who highlighted the importance of using speech knowledge in SR systems to achieve results closer to the ultimate objectives of SR. They emphasised that to achieve measurable advances in SR, the recognition problem should be approached using phonetically based knowledge techniques, where this knowledge is encoded into the system structure.
2.5.5 Hybrid ANN/HMM Systems
Despite the relative success of the HMM technique in specific recognition tasks, the inherent drawbacks outlined in Section 2.5.1 place some limitations on its functionality in more advanced SR tasks (Gupta et al., 1991). Consequently, there is a tendency among the SR research community to employ other techniques, such as ANN, for large recognition tasks. However, the success of HMM in some particular systems has encouraged researchers to explore the potential of modified forms of HMM. Several researchers have extended the core of HMM to overcome the conditional-independence limitation, which is one of the HMM drawbacks. An example of this tendency is the research carried out by Chan and Chan (1992), in which a proposed Static Model (SM), in the form of a vector, is used to represent the temporal properties of a sequence of speech feature vectors. The system captures the average joint probabilities of state transitions of consecutive observations over time, instead of the conditional probabilities captured by HMM. The system was tested using an artificial vocabulary of ten words. The results of the test were not encouraging, as the system exhibits limitations in handling certain types of vocabulary.
In merging the two techniques, it was noticed that ANNs, in particular the MLP, are fundamentally similar to HMMs in that both have the ability to learn from training data (Deller and Proakis, 1993). The process by which HMMs are trained may likewise be considered a form of supervised learning. However, the learned material in each case differs in content and in methodology, even if both models are applied to the same problem. The HMM learns the statistical nature of the observation sequences presented to it, while the ANN may learn any number of things, such as the classes (e.g., words) to which such sequences are assigned. It was from this point that researchers started integrating the two techniques, with the emphasis on training the ANN to learn the statistical sequences of events previously handled by the HMM (as a pre-processor).
Although influenced by the statistical nature of the observations, the internal structure that the ANN learns is not statistical. In its basic form, an ANN requires fixed-length input, whereas this is not necessary in an HMM because of its time normalisation property.
Although different in philosophy, HMMs and ANNs do have important similarities. Both can be robust to noise, to missing data in the observations and to missing exemplars in the training. However, there is a fundamental difference between the two techniques in the nature of the input, the learned material and the learning mechanisms. Although both systems perform mappings (the HMM from an observation string to a likelihood, the ANN from an input to an output), the dynamics of the HMM are fundamentally linear, whereas the ANN is a nonlinear system, which makes the ANN perform better than the HMM for some SR tasks (Deller, 1993 and Hunt, 1988).
A comparison between the limitations and advantages of HMM (Section 2.5.1) and those of ANN (Section 2.5.3) scores in favour of ANN, especially in a system that incorporates knowledge representation, as in the current system, where the ANN optimises the functionality of the recognition system. This is the idea that has led current SR systems to incorporate ANN and HMM in one hybrid system. However, the incorporation of an ANN with multi-level language knowledge, as in this system, is regarded as a novel investigation. By adopting an ANN as the recognisor at the phonemic level, the three problems of HMM design, especially those of finding the number of states, the probabilities and the optimum sequence, are avoided by using the natural sequence of the language based on a natural language model. This particularly eases the second problem of HMM design, i.e. the attempt to uncover the hidden part of the model and find a correct state sequence. The proposed system maintains the statistical structure in the relation between the phonemic and the syntactic levels, as an HMM does.
The integration of HMM with ANN started with experiments on ANNs trained to act as expanded HMMs. For instance, in research by Rigoll (1991), an MLP is used as a Vector Quantiser (VQ) in an HMM-based SR system. The system can use a variety of speech features, such as cepstral coefficients, differential cepstral coefficients and energy, as joint input to the VQ. This avoids the use of multiple codebooks, so the system simulates multiple HMMs in order to achieve a more robust system. It should be noted that this system transfers the computing complexity of the expanded HMM system to the ANN. However, this study did not show improved recognition rates compared with pure HMM techniques.
Other efforts using the same idea produced similar results. Cheng et al. (1992) carried out an assessment of the possibility of modeling phone trajectories to accomplish SR. The assessment was performed using a hybrid Segmental Learning Vector Quantisation/Hidden Markov Model (SLVQ/HMM) system. The results obtained from that system showed a significant difference from the results of the SLVQ system alone; however, the difference from pure HMM techniques was small.
Zavaliagkos et al. (1994) reported improved performance of a hybrid system when compared with the pure HMM technique. This system was used to perform large-vocabulary continuous SR, and demonstrated consistent improvement in performance over the baseline HMM system. It used an N-best paradigm, connecting a segmental ANN and modeling all the frames of a phonetic segment simultaneously, thus overcoming the conditional-independence limitation of the HMM.
Various hybrid systems were then developed by assigning relatively different functions to the ANN. Reichl and Ruske (1995) reported a hybrid system, consisting of an ANN with Radial Basis Functions (RBF) and an HMM, that achieved a reasonable recognition rate. The RBF-ANN was trained to approximate the a posteriori probabilities of single HMM states. These probabilities are used by the Viterbi algorithm to compute the total scores of the individual hybrid phone models.
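In hybrid systems of this kind, the ANN outputs enter a standard Viterbi search. The sketch below assumes the common scaling of ANN posteriors by state priors to obtain pseudo-likelihoods; the three-state left-to-right model and all numbers are illustrative, not the actual configuration of Reichl and Ruske.

```python
import numpy as np

def viterbi(scaled_lik, trans, init):
    """Viterbi decoding of an HMM state path from ANN outputs.
    scaled_lik[t, s] plays the role of P(state s | frame t) divided by
    the state prior, used in place of the HMM emission likelihood.
    Returns the most likely state sequence."""
    T, S = scaled_lik.shape
    log_trans = np.log(trans + 1e-12)          # avoid log(0) for forbidden jumps
    delta = np.log(init) + np.log(scaled_lik[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans      # (from-state, to-state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + np.log(scaled_lik[t])
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Three-state left-to-right phone model over five frames (toy numbers).
lik = np.array([[.8, .1, .1], [.6, .3, .1], [.1, .8, .1],
                [.1, .3, .6], [.1, .1, .8]])
A = np.array([[.6, .4, 0.], [0., .6, .4], [0., 0., 1.]])
init = np.array([.9, .05, .05])
path = viterbi(lik, A, init)   # traverses the states left to right
```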
The literature in this area tends to compare pure HMM and hybrid ANN/HMM systems. No comparison between pure ANN and hybrid ANN/HMM systems, or between ANN/ES and pure ANN systems, has been reported. This thesis develops such a comparison, and presents the performance of two types of artificial neural networks: conventional BP networks and ANNs with incremental learning ability. This is one of the novel points of the current work.
2.6 Conclusion
This chapter describes the overall system and the acquisition of the UWS speech database which was used in the study of the first version of the system, RUST-I. The speech model and the MFCC parameter calculation have also been presented, and a survey of the techniques used in speech recognition has been conducted.
Based on the literature survey, it was decided that this work would implement a hybrid system of ANNs and knowledge sources for speech recognition, in particular an ANN with incremental learning.
Chapter 3: Phonemic/Syntactic Knowledge and Adaptive Phone
Recognisor - Design and Implementation
3.0 Introduction
This chapter presents an in-depth description of phonemic/syntactic knowledge and the
Adaptive Phone Recognisor (APR) and their implementation. Section 3.1 describes the
design and implementation of the APR. Section 3.2 describes in detail the syntactic
knowledge of RUST-I. The basics of the language model and the set of words that form the
lexicon are discussed in Section 3.2.1. The method of categorisation of the phones within
these words is then described in Section 3.2.2. Section 3.3 describes the method to derive the
statistical probabilities of patterns of words. Section 3.4 is dedicated to the description of the
code activator and the accumulator. Section 3.5 is concerned with the sub-recognisor architecture, the structure of the neuro-slices in a sub-recognisor, and the initial conditions and parameters for the sub-recognisor.
It was shown in Chapter 2 that the functional relationship between the adaptive phone recognisor and the syntactic knowledge estimator produces the syntactic knowledge of RUST-I. The syntactic knowledge takes the form of an associative procedure that links phonemic events with a primitive, syntactically correct language model.
The phonemic knowledge is represented by the ANN parameters (weights) of the 45
sub-recognisors. This knowledge includes the training of the 45 sub-recognisors. (This
knowledge is fault tolerant to some extent.) The syntactical knowledge is represented
as the probabilities of occurrences of phones in the formation of words (in specific
texts).
Combining the syntactic level with the phonemic level produces an IWR system. Additional syntactic or semantic functions could be provided to detect syntax errors and extract information about sentence structure and grammar, but this is beyond the scope of the present thesis. Altering the syntactic functions, such as the vocabulary size and topic focus, or adding other functions, will alter the performance of RUST-I as well.
3.1 Adaptive Phone Recognisor (APR)
The function of the APR is to find the match for an input phone. The adaptive phone recognisor consists of a bank of sub-recognisors that map the speech input, represented by MFCC vectors, to the classified output, represented by the phone identification responses PIR1 to PIR46, for all pertinent frames.
A block diagram of the APR is shown in Figure 3.1. The length of the input phone was also used in recognition as an additional parameter. With this knowledge, only a small number of sub-recognisors will be activated, according to the syntactical knowledge (probability of occurrence).
Figure 3.1 Adaptive phone recognisor.
For an activated ith sub-recognisor, only Mi sets of 12 MFCC parameters are used to calculate PIRi. Each set is retrieved from one frame of the speech signal. If Mi is greater than the number of frames in the speech signal, zero-valued sets are used for the remainder. These MFCC coefficient sets, D1(12), …, DMi(12), are presented to all the activated sub-recognisors simultaneously. The output of only one sub-recognisor is activated at any one time, using the ACLi signal. The order of activation of the sub-recognisors for any one phone is controlled by the syntactic knowledge estimator.
3.2 Syntactic Knowledge Estimator
The syntactic knowledge estimator shown in Figure 3.2 consists of two modules: the syntactic knowledge database and the code activator. The syntactic knowledge database provides the probabilities of the patterns of phones that occur in the words stored in it. These probabilities are utilised by the code activator, which arranges the outputs of the adaptive phone recognisor for the best match, based on the probabilities of the phone sequences. Thus the syntactic knowledge estimator provides the activation control patterns to the adaptive phone recognisor and informs the accumulator of a word boundary.
Figure 3.2 The syntactic knowledge estimator.
The input to the syntactic knowledge estimator is the phone identification responses, PIR1 to PIR46, of the previous phone in the word to be recognised. There are two outputs from this block: the activation control lines, ACL1 to ACL46, and the End of Word Identifier (EOWI). The largest value of PIR1 to PIR46 (above a certain threshold) indicates the matched phone for the input phone. The code activator of the syntactic knowledge estimator then checks the syntactic knowledge database for the list of phones most likely to follow the current pattern and passes this estimate to the adaptive phone recognisor via the states of the activation control lines ACL1 to ACL46. The code activator continues checking likely phones at one level until a match is made for the phonemic identity of the speech input. If no match is made, a message is generated to indicate that the word does not exist in the lexicon. If a silence is detected at any level other than the first, a word boundary is identified and passed on to the accumulator via the EOWI signal.
3.2.0 Syntactic Knowledge Database
Within the syntactic knowledge database, the syntactical data is represented in the form of clusters. In the clusters, the data units are linked to each other using pointers, in a similar manner to a linked list, where the data units of the syntactical knowledge are the phone IDs of Table 3.1.
Table 3.1 Phone ID’s of RUST-I.
Phone ID Phone ID Phone ID Phone ID
I 1 Λ 13 d 25 r 37
i 2 aI 14 k 26 t∫ 38
ε 3 eI 15 g 27 dξ 39
æ 4 ЭI 16 f 28 m 40
a 5 aΩ 17 v 29 n 41
Þ 6 OΩ 18 し 30 さ 42
Ď 7 I∂ 19 ð 31 j 43
Э 8 ε∂ 20 s 32 w 44
Ω 9 Ω∂ 21 z 33 L 45
u 10 p 22 ∫ 34 sln 46
έ 11 b 23 ξ 35
∂ 12 t 24 h 36
The database contains the probability of occurrence of a phone as the first in a word, or within any pattern of phones, up to a maximum depth of 14 levels. For example, the fricative /ð/ has the highest probability (0.1149594) among all phones in the analysed textual material (the lexicon) of being first in a word. The vowel /∂/ has the second highest probability of being first in a word, at 0.0781134. The phone with the lowest probability of being first in a word is the diphthong /eI/, at 0.0007369. The phone /r/ has the highest probability (0.403) of following the plosive /p/. The database contains information on the depth into the word (that is, levels 1, 2, down to 14) and a statistically aligned list indicating the order of the most likely phones at a given level.
Phonemic units are distributed in the knowledge space statistically according to their probabilities of occurrence, which depend on the focus of the knowledge source (KS). The file that contains the probabilities of the linked, clustered data units is referred to as the syntactic database.
The syntactic knowledge database is constructed as clustered linked-lists. The front edge list, which contains the phones ordered by their probabilities of occurring first in a word, acts as the syntactic knowledge interface. Both the linked-lists and the clusters are constructed using the statistical order distribution derived from the probability values of Tables A.1 through A.34. The first search cycle navigates the front edge list until a match is found. Each subsequent search cycle moves through the levels of the linked lists selected from the front edge list.
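The clustered linked-list organisation can be illustrated by a small sketch that builds such a database from a toy list of phonemic words. The dictionary-based node representation and the three-word lexicon are assumptions for illustration; the thesis implementation uses pointer-linked clusters over the full lexicon.

```python
def build_syntactic_db(phonemic_words):
    """Build nested levels of (phone, probability, next-level) nodes,
    each level sorted in descending probability so that a search
    visits the most likely phones first."""
    def insert(level, phones):
        if not phones:
            return
        node = next((n for n in level if n["phone"] == phones[0]), None)
        if node is None:
            node = {"phone": phones[0], "count": 0, "next": []}
            level.append(node)
        node["count"] += 1
        insert(node["next"], phones[1:])

    front_edge = []                      # level 1: onset phones of all words
    for word in phonemic_words:
        insert(front_edge, word)

    def finalise(level, total):          # counts -> probabilities, sorted
        level.sort(key=lambda n: -n["count"])
        for n in level:
            n["prob"] = n["count"] / total
            finalise(n["next"], n["count"])
    finalise(front_edge, len(phonemic_words))
    return front_edge

# Toy lexicon of three phonemic words; /ð/ heads two of them, so it
# leads the front edge list with probability 2/3.
db = build_syntactic_db([["ð", "ə"], ["ð", "ɪ", "s"], ["æ", "t"]])
```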
3.2.1 RUST-I Lexicon
RUST-I will work with a database of any size (theoretically an infinite one), limited only by the access time through a large database. A limited-size database was used in RUST-I to illustrate its operation.
Generally, the number and type of words in a lexicon are logically related to the area of knowledge, or topic, from which those words come. Particular words may occur frequently in only one area (e.g., the word ‘budget’ will be repeated in the financial area of knowledge, whereas the word ‘child’ may only occur in a general area of knowledge). Other vocabulary can be described as general-use vocabulary, necessary in any English speech; the same words may appear in many areas at the same time (e.g., the word ‘the’ will occur in all areas). To demonstrate the concept of the syntactical knowledge and the system lexicon, a limited number of speech areas were used to create the RUST-I lexicon.
Two approaches to lexicon development were considered. The first approach selects words pertinent to a particular area, such that a meaningful conversation can be carried out in that subject area. The second approach is to select all words from a word reference text (Macquarie Dictionary, 1998). The second approach uses an alphabetic classification and does not comply with the syntactic knowledge concept; therefore only the first approach was used in this research. In this approach, a much smaller but more effective mixture of words was obtained by selecting three areas of speech. The three chosen areas were (1) general community topics, (2) accounting and (3) physics. The general community extracts came from a local community newspaper (The Torch), the accounting extracts came from Costinett (1997) and the physics extracts came from Hall (1977). Words extracted from these areas were combined in the lexicon, then analysed and prepared for representation in the syntactical knowledge. At this stage, the lexicon of RUST-I comprised a set of 1357 words, of which 541 are unique, but the lexicon could be expanded or reduced according to the system application.
3.2.2 Categorisation
Categorisation is the process of dividing the lexicon word set (1357 words) into phonemic classes depending on their onset phone and subsequent phonemic structure. The onset phone of each word determines the phonemic class with which that word is associated. The onset phones of all the words form the front edge phonemic level of the syntactic knowledge database.
In the first stage of categorisation, all of the words in the lexicon are classed according to the first phone in each word. This produces the phonemic classes of the syntactic knowledge in RUST-I. The phonemic structure of each word in the lexicon is specified in brackets beside its textual representation, e.g., the word ‘with’ is represented phonemically as [wIð]. The conversion process is based on the International Phonetic Alphabet for Australian English shown in Table 2.5.
In the second stage of categorisation, the phonemic classes were placed in descending order according to their probability of occurrence. For example, words that start with the fricative /ð/ have the highest probability of occurrence at the beginning of a sentence. Hence, the phonemic class /ð/ is located at the beginning of the lexicon and at the start of the front edge level in the syntactic database; the fricative /ð/ therefore has the highest probability in the syntactical knowledge. Table 3.2 shows the phonemic classes included in the syntactic knowledge, the number of tokens in each phonemic class and the phonemic sub-classes which follow the front edge phonemic class. The raw phonemic data was categorised by this process to extract statistical information for the syntactic database in the syntactical knowledge estimator.
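The two categorisation stages can be sketched as follows, using a hypothetical five-word mini-lexicon rather than the real 1357-word set (the phonemic forms are illustrative approximations):

```python
from collections import Counter

# Hypothetical mini-lexicon: (word, phonemic form) pairs in the
# bracketed notation of the text, e.g. 'with' -> [wIð].
lexicon = [("with", "wIð"), ("the", "ðə"), ("this", "ðɪs"),
           ("at", "æt"), ("then", "ðɛn")]

def categorise(lexicon):
    """Stage 1: class each word by its onset phone.  Stage 2: order
    the phonemic classes by descending probability of occurrence."""
    classes = {}
    for word, phones in lexicon:
        classes.setdefault(phones[0], []).append(word)
    counts = Counter({p: len(ws) for p, ws in classes.items()})
    return [p for p, _ in counts.most_common()], classes

order, classes = categorise(lexicon)   # /ð/ heads the front edge level
```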
Not all phones in the Australian English phonemic set are represented in the front edge level of the syntactic database. Table 3.3 shows the phonemic classes that are excluded from representation in the front edge level of the syntactic knowledge. The reasons for the exclusion of these phonemic classes are:
• The RUST-I lexicon does not contain words starting with that phone (for example the phones z, ∫, ξ of Table 3.3).
• The phone is syntactically impossible at the beginning of an English word (the other phones of Table 3.3).
Table 3.2 shows that all of the words in the lexicon can be categorised phonemically into 35 front edge phonemic classes.
Table 3.2 Phonemic classes and their associated levels represented in the front edge level
of the syntactic knowledge.
Phonemic class   Associated levels (max)   Second-level phonemic sub-classes
ð        3     ∂ - I - æ - eI - ε - i - OΩ - Λ
∂        12    sln - L - k - w - t - r - b - m - v - p - d - g - f
æ        12    n - t - z - d - L
i        9     n - z - t - f - m
h        9     i - a - æ - έ - I - OΩ - ε - aI - Э - Λ - u
w        7     I - Þ - Λ - i - ε - έ - Ω - eI - aI - ε∂ - Э
þ        9     v - n - f - p
f        10    Э - r - i - I - ε - ∂ - IƏ - L - aΩ - a - aI - æ - eI - u - έ - Λ
p        8     r - L - ∂ - Ω - i - Þ - a - Λ - æ - eI - I∂ - I - έ - ε
t        10    u - r - eI - w - ∂ - έ - aI
s        12    ε - i - I - m - p - eI - OΩ - έ - æ - t - L - ∂ - k - aI
b        8     i - I - Ω - aI - ∂ - Λ - æ - r - L - ε - aΩ - Э - Þ - eI
k        11    æ - Э - aI - Λ - Þ - Ω - L - OΩ - ∂ - m - s - ε - I - I
d        11    r - ∂ - Λ - I - u - eI - Þ - æ - Э - OΩ - ε - IƏ - i - j
m        10    Ə - Λ - Э - æ - Þ - ε - eI - a - aΩ - OΩ
n        7     Λ - j - Þ - ε - OΩ - ЭI - eI - i - aI
i        8     L - v - t∫ - k - z - s - t - n
ε        11    L - n - v - ∂ - k - g - dξ - b
ε∂       1     sln
g        9     eI - r - ε - L - Λ - Þ - aI - I - i - OΩ - Ω
r        11    ε - i - ∂ - u - OΩ - Λ - æ
a        4     sln - I - t∫ - f - m - s - t
э        6     sln - b - d - g - L
j        8     u - Э - ∂ - I∂
t∫       4     æ - a - έ - ε∂ - aI
L        6     aI - Ω - ε - eI - OΩ - I - æ - a
Λ        9     ð - p - n - s
し       7     r - I - æ - ∂ - Ω
∂Ω       2     t - ∂
dξ       6     Þ - n - ε - Λ - ЭI
OΩ       4     L - d - v - m
aI       3     sln - ∂
v        7     ε - ∂
έ        4     L - し
eI       4     dξ
Table 3.3 Phonemic classes which are not represented in the front edge layer of the syntactic
knowledge.
Category     Phone   Word example   Phonetic form
Vowels       Ω       Put            pΩt
Vowels       u       Pool           pul
Vowels       Ď       Bon voyage     bĎvw’jaξ
Diphthongs   ЭI      Boy            bЭI
Diphthongs   I∂      Here           hI∂
Diphthongs   ε∂      Hair           hε∂
Diphthongs   Ω∂      Tour           tΩ∂
Nasals       さ      Sing           siさ
Fricatives   z       Zeal           zil
Fricatives   ∫       Show           ∫OΩ
Fricatives   ξ       Measure        mεξ∂
The numerical representation of the Australian phone set specified in Table 3.1 is used to
facilitate the manipulation of the phonemic data in the syntactic knowledge estimator. This
numerical code is used to represent the phonemic units in the syntactic knowledge database
and the accumulator (in Fig. 2.2). In Table 3.1, the silent period between words is treated as
a separate code and can be identified by its duration, which is chosen to be longer than the
longest duration of any of the phonemic units. The longest duration measured was 487 ms, for
the diphthong /eI/ (ID = 15). Therefore, a speech segment is considered to be silence
if its duration exceeds 500 ms.
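The duration test above is a simple threshold; a minimal sketch, assuming duration is measured in milliseconds:

```python
# Sketch of the duration-based silence test: a segment is treated as silence
# when it lasts longer than 500 ms, which exceeds the longest measured phone
# duration (487 ms for the diphthong /eI/).

SILENCE_THRESHOLD_MS = 500

def is_silence(duration_ms: float) -> bool:
    return duration_ms > SILENCE_THRESHOLD_MS

assert not is_silence(487)   # longest measured phone, /eI/: not silence
assert is_silence(520)       # longer than any phone: silence
```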
To integrate the syntactic knowledge within the isolated word recognisor, a procedure
combining 'bottom-up' and 'top-down' processes is used. The lowest level of knowledge is the
phonemic knowledge, or knowledge of basic phonemic units, where the phone identification
numbers of Table 3.1 are used to represent the phonemic units in the syntactical knowledge.
Categorisation of the phonemic knowledge into front edge layer classes provides an efficient
structure for organising syntactic information. The phones extracted from words in the
lexicon are categorised first into front edge phonemic classes and then into phonemic
subclasses. Every phonemic cluster contains one phone from the front edge phonemic class
and at least one phone in a phonemic subclass. This generates a hierarchical structure for
the phonemic clusters, which has the advantage that once the front edge phone has been
classified, it automatically inherits all statistical information (probabilities) about its
subclasses from that classification.
Figure 3.3 shows how this classification scheme is implemented. Assume that the front edge
phonemic class G0 is one of the possible 35 front edge phonemic classes that were generated
from the first stage classification. The first subclass of G0 is classified as C011 for subclass 1
in level 1 in class 0. The subclass C012 refers to the subclass 2 of level 1 of class 0. The term
SC refers to the subclass of phones at levels deeper than the first subclass in the word. The
number following the mnemonic SC refers to the depth of the phone into the cluster. For
example, SC041 penetrates four levels away from the original front edge class G0. It refers to
subclass 1 of level 4 of class 0.
Figure 3.3 Graphical representation of data clusters.
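The labelling convention above (G0, C011, SC041) can be sketched with a small parser. This is an illustrative reading of the scheme, not code from the thesis; the label format assumed is a kind prefix (G, C or SC), a class digit, then optional level and subclass digits.

```python
# Sketch of the cluster-naming convention: G<class>, C<class><level><index>,
# SC<class><level><index>. The parser is hypothetical, for illustration only.
import re

def parse_label(label: str):
    """Return (kind, class, level, subclass) for a cluster label."""
    m = re.fullmatch(r"(G|C|SC)(\d)(\d)?(\d)?", label)
    kind, cls, level, index = m.groups()
    return kind, int(cls), int(level or 0), int(index or 0)

assert parse_label("G0") == ("G", 0, 0, 0)      # front edge class 0
assert parse_label("C011") == ("C", 0, 1, 1)    # subclass 1, level 1, class 0
assert parse_label("SC041") == ("SC", 0, 4, 1)  # subclass 1, level 4, class 0
```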
All the clusters of phonemic classes and subclasses are formed by this process. Phonemic
classes and subclasses are represented using their numerical IDs. Figure 3.4 shows an
example of the data cluster for the front edge phonemic class /t/. Thirteen words make up that
data cluster. The linkages show the sequence of phones in each word from left to right. The
words are ranked from the first subclass to the last in an order derived from the probabilities
of occurrence in the lexicon. The description of how the statistical probabilities are derived
can be found in Section 3.2.3.
59
Figure 3.4 Example of a data cluster for the front edge phonemic class /t/ (phones are
represented by their identification codes).
3.2.3 Data Organisation in the Syntactic Database
To construct the data file that contains the phonemic units of the syntactic database, the
phonemic classes were given priorities of sequenced appearance according to their
probabilities. As shown in Fig. 3.5, the order of the phonemic clusters reflects the degree of
priority for each phonemic class in the front edge layer, as well as for each phonemic subclass
or individual phone within the syntactic clusters. Once the order of clusters is obtained from
the probability values, those values are no longer required in the recognition process, as the
recall mechanism follows the order of positioning in the syntactical knowledge.
For example, the phonemic class /æ/ is located at the third level of priority among the front
edge phonemic classes. The system will access all of the phonemic clusters related to this
class when the set is activated for a recognition cycle. This cluster is shown in
Figure 3.5 below:
Figure 3.5 Bubble diagram of cluster number 3 of front edge phonemic class /æ/.
For each front edge phonemic class, the first level of subclasses was also ordered by
decreasing subclass probability. For example, as shown in Figure 3.5, the first level of
subclasses (the second phone in words beginning with the front edge phone /æ/) is in the
probability order 41 24 33 25 45, or /n t z d L/.
In all cases, the priority order at levels deeper than 2 is less significant than at the level 0 and
level 1 classes, and the overall probability values of phones occurring in the lexicon are used
to define the order. So the order at the higher levels of subclasses (at or above 2) is
determined by the number of occurrences of a phone within the lexicon. For
example, in Figure 3.5 the subclass 12 in level 2 leads to two subclasses in level 3: 29
and 45. Both of these subclasses lead to two different words, but subclass 45 is given lower
priority than subclass 29 because phone 45 has the lower overall priority in the context of
the lexicon. The localised probability of the two classes 29 and 45 is 0.5, as both are
represented by the same number of words in this cluster.
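The ordering rule described above can be sketched as a two-key sort: order first by localised word count, then break ties using the phone's overall frequency in the lexicon. The counts below are illustrative, not thesis data.

```python
# Sketch of the deep-level ordering rule: ties in localised probability are
# broken by the phone's overall lexicon frequency (higher frequency first).
local_word_count = {29: 2, 45: 2}          # both subclasses cover two words
overall_lexicon_count = {29: 140, 45: 95}  # assumed: phone 29 more frequent

order = sorted(local_word_count,
               key=lambda p: (-local_word_count[p], -overall_lexicon_count[p]))

assert order == [29, 45]  # phone 45 gets the lower priority
```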
Using this analysis method, all of the phonemic classes were scanned horizontally in level
steps, and the phonemic IDs were collected vertically in order and used to create the
syntactic knowledge database file. For example, Figure 3.6 shows a portion of that file, which
represents the front edge phonemic class and its related lines.
Figure 3.6 Portion of the syntactic database that represents cluster 4.
In Figure 3.6, the data units are organised in lines, with the lines in decreasing order of
priority. Each line contains a number of fields separated by spaces and terminated by a
semicolon. The first field contains the line ID or pointer, which is derived from the degree of
depth in the cluster. The ID of a phonemic class is represented by two symbols (from 00 to
0Z) for the 34 classes of the syntactical knowledge. A subsequent symbol (from 0 to 9)
indicates the depth of the phonemic level in the cluster. For example, the phonemic class /eI/
(with the least probability of occurrence) was assigned the address 0A0, and 017 means the
seventh level in the phonemic class 01.
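The addressing scheme above can be sketched as a small decoder, assuming the first two symbols of a line ID name the phonemic class and the third gives the depth. The helper name is hypothetical.

```python
# Sketch of the line-ID addressing scheme: two class symbols plus one depth
# symbol. Base-36 decoding of the depth symbol is an assumption made so that
# both digit and letter symbols can be handled.

def split_line_id(line_id: str):
    """Split a 3-symbol line ID into (phonemic class, depth)."""
    return line_id[:2], int(line_id[2], 36)

assert split_line_id("0A0") == ("0A", 0)  # class /eI/, front edge level
assert split_line_id("017") == ("01", 7)  # seventh level of class 01
```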
During the recognition process, the identification of any phonemic unit at any level triggers
the activation of the next level in the cluster. This does not apply to the data unit 46: this
phonemic unit activates the End of Word (EOW) signal and therefore does not lead to any
further phonemic level. In some levels there is only the end-of-process unit 46, the silence ID.
The data lines of the cluster are:
011 46 45 26 44 24 37 23 40 29 22 25 27 28 ;
012 3 17 11 32 15 6 37 14 19 18 40 20 1 ;
013 26 46 41 22 40 24 6 29 1 34 45 35 ;
014 24 1 3 25 46 32 34 5 41 12 34 ;
015 37 26 46 12 37 24 12 1 15 45 ;
016 6 1 41 12 46 32 26 2 45 ;
017 41 26 24 40 24 46 ;
018 46 37 ;
019 15 ;
01A 34 ;
01B 12 ;
01C 46 ;
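The line format above can be sketched with a small parser: a pointer field, space-separated phone IDs, and a terminating semicolon. Field names and the EOW constant below are illustrative.

```python
# Sketch of parsing one syntactic-database line as described in the text.
# The function name is hypothetical; 46 is the silence/end-of-word unit.

EOW_ID = 46

def parse_db_line(line: str):
    """Split one database line into its pointer and its phone-ID fields."""
    fields = line.strip().rstrip(";").split()
    return fields[0], [int(f) for f in fields[1:]]

pointer, ids = parse_db_line("017 41 26 24 40 24 46 ;")
assert pointer == "017"
assert ids == [41, 26, 24, 40, 24, 46]
assert EOW_ID in ids  # this level can terminate a word
```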
3.3 Determination of RUST-I Syntactic Knowledge: Example
In RUST-I, phones and words are treated as arbitrary items of data, so they are units of an
information source. From information theory, the self-information Ij conveyed by a
phone J in a contextual lexicon depends on the probability P(J) of occurrence of that
phone. If the occurrence of the phone J depends upon a finite number m of preceding levels
or phones, the information source is called an mth-order Markov source.
In RUST-I, m is taken to be 1, as probabilities at levels higher than 1 are too high to
contribute significantly to the optimisation of the knowledge representation. This is expected
in a database of 1357 words; higher orders could be useful with significantly larger
vocabularies. Therefore RUST-I is represented by a 1st-order Markov source. To represent
the system mathematically, consider the Australian English phone set (45 phones and
silence) forming a universal set A, where:
A = {I, i, …, sln}
n(A) = 46
where n(A) is the number of elements in A.
From Table 3.2, the number of possible front edge phones for the current knowledge
database is 35, and they form a sample space represented by the set O, where each phone J ∈ O
is a member of some words in the lexicon:
O = {ð, ∂, æ, I, h, w, Þ, f, p, t, s, b, k, d, m, n, i, ε, ε∂, g, r, a, Э, j, t∫, L, Λ, し, aΩ, dξ, OΩ,
aI, v, έ, eI}
n(O) = 35
It should be noted that:
O ⊂ A
The detection of a front edge phone by the adaptive phone recognisor initiates action by the
syntactic knowledge estimator to recall the cluster of phones related to that front edge phone
with its statistically related phonemic units. This is referred to as an event in the knowledge
and will initiate a specific set of linked lists which represent a cluster of words that are all
initiated from the same front edge phone and are part of the same phonemic class. If these
events can be regarded as independent sources of information, there is no relationship between
their probabilities; the detection of any front edge phone is an independent process. Let all
words in the lexicon be members of the set W, where
n(W) = 1357
The set O represents the front edge events of the set W members. Therefore, the probability
P(J) of a phone J ∈ O can be found using the relation:
P(J) = nc(J) / n(W)
where nc(J) is the number of occurrences of J at the front edge in the lexicon.
Table 3.4 depicts all probability values of the front edge phonemic classes of the set O along
with their frequency.
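The relation above can be checked numerically against Table 3.4. A minimal sketch, using the lexicon size n(W) = 1357 and two front-edge counts from the table:

```python
# Front edge probability P(J) = nc(J) / n(W), checked against two of the
# values reported in Table 3.4.

def front_edge_probability(nc: int, n_w: int = 1357) -> float:
    return nc / n_w

assert abs(front_edge_probability(156) - 0.1149594) < 1e-6  # P(ð), 156 words
assert abs(front_edge_probability(1) - 0.0007369) < 1e-6    # P(eI), 1 word
```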
The phones at the front edge of the set W can be considered sources of information; therefore
each class conveys a quantity called the amount of information. This quantity measures the
information conveyed by an event at the time of its detection, and has a nonlinear relationship
to the event's probability of occurrence: an event conveys a higher amount of information
when its probability of occurrence is lower. This quantity can be an indicator of the
independence of probability amongst members of the set O.
Consider a front edge phone, J ∈ O, that has a probability value of P(J), and all phones of the
set O are independent (each of them forms an independent source of information). Then the
amount of information of that phonemic class is obtained from the self-information that is
associated with that phone. From information theory, the self-information associated with this
phone can be obtained as follows:
Ij = −log2 P(J)   [bit]
Ij = −log2(10) · log10 P(J) ≈ −3.32 log10 P(J)
The values of Ij for each phone of the set O are shown in the last column of Table 3.4. The
items in Table 3.4 are organised in descending order of probability. It can be noted from the
table that when the probability of a front edge class is lower, the self-information associated
with that class becomes higher; this is represented graphically in Figures 3.7 and 3.8. So,
phonemic classes with a high probability of occurrence do not convey a high amount of
information to the system. On the other hand, phonemic classes with lower probabilities are
associated with a higher amount of information, indicating the uncertainty associated with
those phonemic classes. For example, the phonemic class of the diphthong /eI/ conveys
self-information of IeI = 10.4 bit, as it has the lowest probability among the front edge
phonemic classes.
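The self-information formula can be verified against the extremes of Table 3.4, a minimal sketch:

```python
# Self-information I_j = -log2 P(J), checked against the most and least
# probable front edge classes in Table 3.4.
import math

def self_information(p: float) -> float:
    return -math.log2(p)

assert abs(self_information(0.1149594) - 3.12) < 0.01   # I(ð)
assert abs(self_information(0.0007369) - 10.4) < 0.01   # I(eI)
```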
Table 3.4 Syntactic-knowledge front-edge phones set, their frequencies, probabilities and
self-information.
Phone Frequency Probability Self-information [bit]
ð 156 P(ð) = 0.1149594 I(ð) = 3.12
∂ 106 P(∂) = 0.0781134 I(∂) = 3.68
æ 82 P(æ) = 0.0604274 I(æ) = 4.05
I 81 P(I) = 0.0596904 I(I) = 4.06
h 79 P(h) = 0.0582166 I(h) = 4.1
w 74 P(w) = 0.054532 I(w) = 4.19
Þ 65 P(Þ) = 0.0478997 I(Þ) = 4.38
f 63 P(f) = 0.0464259 I(f) = 4.43
p 62 P(p) = 0.045689 I(p) = 4.45
t 61 P(t) = 0.0449521 I(t) = 4.47
s 57 P(s) = 0.0420044 I(s) = 4.57
b 56 P(b) = 0.0412675 I(b) = 4.6
k 48 P(k) = 0.0353721 I(k) = 4.82
d 44 P(d) = 0.0324244 I(d) = 4.94
m 38 P(m) = 0.0280029 I(m) = 5.15
n 34 P(n) = 0.0250552 I(n) = 5.32
i 28 P(i) = 0.0206337 I(i) = 5.59
ε 27 P(ε) = 0.0198968 I(ε) = 5.65
g 26 P(g) = 0.0191599 I(g) = 5.7
r 24 P(r) = 0.017686 I(r) = 5.82
a 19 P(a) = 0.0140014 I(a) = 6.15
Э 19 P(Э) = 0.0140014 I(Э) = 6.15
j 19 P(j) = 0.0140014 I(j) = 6.15
t∫ 16 P(t∫) = 0.0117907 I(t∫) = 6.4
L 16 P(L) = 0.0117907 I(L) = 6.4
Λ 14 P(Λ) = 0.0103168 I(Λ) = 6.59
し 11 P(し) = 0.0081061 I(し) = 6.94
aΩ 8 P(aΩ) = 0.0058953 I(aΩ) = 7.4
dξ 8 P(dξ) = 0.0058953 I(dξ) = 7.4
OΩ 4 P(OΩ) = 0.0029476 I(OΩ) = 8.4
aI 4 P(aI) = 0.0029476 I(aI) = 8.4
v 4 P(v) = 0.0029476 I(v) = 8.4
έ 3 P(έ) = 0.0022107 I(έ) = 8.8
eI 1 P(eI) = 0.0007369 I(eI) = 10.4
Figure 3.7 Probabilities of Phones in set O.
Figure 3.8 Self-information of Phones in set O.
At the second level of the syntactic knowledge (where m = 1), localised probabilities and self-
information are applied within each phonemic class to derive the statistical data associated
with the phonemic subclasses which are clustered within each front edge phonemic class. The
same calculations and formulae that are used at the first level (front edge level) are applied to
this first phonemic subclass.
The probabilistic values of the links between the phonemic classes on the front edge level and
their phonemic subclasses address the sequential distribution of the clusters in the syntactic
knowledge. Those values are computed as described above, where each phonemic class is
treated as a universal set that contains specific phonemic subclasses.
An example of the second-level localised probabilistic values is illustrated in Table 3.5. This
level contains a set of 156 words starting with the phone /ð/; call this set Oð, where n(ð) =
156. The first phonemic subclass E(ðI) in the table achieved the highest localised
probability of P(ðI) = 0.75, and so forth.
Table 3.5 Localised probabilistic values of phonemic subclasses in level 2 of the phonemic
set Oð.
Phonemic set Oð, n(ð) = 156
Sequence   Phonemic subclass and number of occurrences   Localised probability   Self-information [bit]
1 E(ðI) = 117 P(ðI) = 0.750 I(ðI) = 0.415
2 E(ð∂) = 110 P(ð∂) = 0.705128 I(ð∂) = 0.504
3 E(ðæ) = 13 P(ðæ) = 0.0833 I(ðæ) = 3.583
4 E(ðε) = 11 P(ðε) = 0.0705512 I(ðε) = 3.824
5 E(ðeI) = 9 P(ðeI) = 0.057692 I(ðeI) = 4.113
6 E(ði) = 3 P(ði) = 0.019230 I(ði) = 5.697
7 E(ðOΩ) = 2 P(ðOΩ) = 0.01282 I(ðOΩ) = 6.281
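The localised values in Table 3.5 follow from the same two formulas applied within the class, a minimal sketch reproducing the first rows:

```python
# Localised probabilities within the class Oð: P is the subclass count
# divided by n(ð) = 156, and the self-information is -log2 P.
import math

n_class = 156
subclass_counts = {"ðI": 117, "ð∂": 110, "ðæ": 13}

local_p = {k: v / n_class for k, v in subclass_counts.items()}
local_i = {k: -math.log2(p) for k, p in local_p.items()}

assert abs(local_p["ðI"] - 0.750) < 1e-3   # P(ðI) in Table 3.5
assert abs(local_i["ðI"] - 0.415) < 1e-3   # I(ðI)
assert abs(local_i["ð∂"] - 0.504) < 1e-3   # I(ð∂)
```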
3.4 Code Activator and Accumulator
The code activator is the controller of the syntactic knowledge estimator and is the link
between the basic phonemic knowledge of the adaptive phone recognisor and the syntactic
knowledge in the syntactic database. It has three main functions. The first function is to
browse the syntactic knowledge database and derive an estimate of the most likely phone to
occur first in a sentence or first in a word, given the pattern of phones that has been
collected in the accumulator. The second function is to monitor the PIRi outputs from the
adaptive phone recognisor and determine which phone sub-recognisor's output is the largest
that exceeds the threshold of 0.6 (where the maximum response value is 1, a complete
match); the code activator then feeds the ID code for that phone to the accumulator. The
third function is to determine the end of word (EOW) from a silence and to signal the
accumulator to release the identified word and start a new word.
Figure 3.9 shows the algorithm that implements the three functions of the code activator. The
code activator goes through an initialisation routine on power up, which involves the
following:
• zeroing the identified word in the accumulator;
• setting internal registers to predefined values;
• setting the pointer value to the beginning of the front edge level of the
syntactic database;
• setting up the activation function to enable the output from the most common
phone found first in a sentence.
Every time a phone is detected, the code activator moves further into the syntactic
knowledge database to find the next level of activation. Every search cycle uses the same
mechanism when accessing the data units. In this process the code activator operates as a
database engine: the initialisation routine, which loads the front edge level phone IDs, is
instigated, and the syntactic knowledge interface is then initiated to search for and find the
correct level of phonemic units. The idea of the front edge level significantly reduces the time
required for the code activator to browse through the syntactic database, as it has fewer data
units (Darjazini and Tibbitts, 1994).
Once the first word is found, as indicated by a silence being detected, the code activator
writes the phone IDs out to the accumulator, and the accumulator is subsequently instructed
to release the identified word.
The code activator starts navigating the syntactic database from the front edge level (the
highest probability). The IDs of the front edge phones are applied directly to the appropriate
Activation Control Lines ACLi; for example, ACL42 is high when the ID is 42. The code
activator then waits for the adaptive phone recognisor responses, which are represented by
the signal set PIR. A process is instigated to read the PIR signals and then check for any
above the threshold of 0.6. The maximum response is then selected from these phones; if the
ID indicates a silence, the pointer to the syntactic database is reset and the main process is
started again. The code activator therefore performs a sequential search through the
statistically ordered phones at the front edge level until a match is found (response > 0.6). In
the case of confusion, i.e. when more than one response occurs, the code activator selects the
phone that has the highest level of response from the PIR. If the search ends without a match,
an error message is delivered indicating an out-of-lexicon input.
ALGORITHM FOR CODE ACTIVATOR
% Initialisation routine
%   tells the accumulator to zero the identified word
%   sets internal registers
%   resets the database pointer
%   sets up the activation control lines to identify the front edge level
allocate memory;
open syntactic database file;
set database pointer to 1;
initialise I/O buffers;
initialise the accumulator ACC = 0;
EOWI = 0;
set counter = 1;
set found = 0;
% Search for the first phone in the sentence after initialisation
while not end of file
    read front edge - discard first field;
    read front edge pointers - discard first field;
    while not end of front edge list
        get ID(I) and its pointer;
        activate the relevant ACL(I);
        read PIR(I) from the adaptive phone recognisor;
        if PIR(I) > 0.6 then set found = 1;
        counter++;
    if found = 1
        find the maximum PIR(I);
        send ID(I) to the accumulator;
        get ID(I)'s associated pointer;
        move the control pointer to the value pointed to by ID(I)'s pointer;
        found = 0;
    else
        message "out of lexicon";
        go to the start of the routine;
% Search for the other phones
repeat until pointer = 5 or counter >= 13
    read level - discard first field;
    if content = 46 only
        EOWI = 1;
        go to the start of the routine;
    read level pointers - discard first field;
    while not end of level
        get ID(I) and its pointer;
        activate the relevant ACL(I);
        read PIR(I) from the adaptive phone recognisor;
        if PIR(I) > 0.6 then set found = 1;
    if found = 1
        counter++;
        find the maximum PIR(I);
        send ID(I) to the accumulator;
        get ID(I)'s associated pointer;
        move the control pointer to the value pointed to by ID(I)'s pointer;
        found = 0;
Figure 3.9 Algorithm of the code activator in pseudo-code form.
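The front edge search step of the algorithm can be sketched as runnable code. The PIR values below are simulated, not real recogniser outputs, and the function name is illustrative.

```python
# Sketch of the code activator's front edge search: walk the statistically
# ordered phone IDs, read the (simulated) recogniser responses, and pick the
# strongest response above the 0.6 threshold.

THRESHOLD = 0.6

def select_front_edge(ordered_ids, pir):
    """Return the phone ID with the largest PIR above threshold, else None."""
    candidates = [(pir.get(i, 0.0), i) for i in ordered_ids
                  if pir.get(i, 0.0) > THRESHOLD]
    if not candidates:
        return None                 # out-of-lexicon input
    return max(candidates)[1]       # highest response wins on confusion

pir = {37: 0.72, 12: 0.81, 5: 0.40}                 # simulated responses
assert select_front_edge([37, 12, 5], pir) == 12    # largest above threshold
assert select_front_edge([5], {5: 0.40}) is None    # no match: out of lexicon
```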
Figure 3.10 shows a block diagram of the accumulator, and Figure 3.11 illustrates the
algorithm for the accumulator. The data inputs to the accumulator are the phone identification
responses, PIR1 to PIR46, from the adaptive phone recognisor. These responses are
sequentially stored in the phone sequence stack, which operates as a serial to parallel register
of identified phones. The control input to the accumulator is the end of word identifier, EOWI
that informs the accumulator that a word boundary has reached and that the word can be
released onto the output. The output is the identified word (IW) (from 1 to 14 characters) in
the form of the numbers relating to the phones (sub-recognisors) identified. For example,
identification of the word 'please' (phonetically - /pliz/) would result in the following set of
numbers released from the phone sequence stack (22, 45, 2, 33). (See Table 3.1 for list of
numerical identification (ID) associated with each phone.)
Figure 3.10 Block diagram of the accumulator.
ALGORITHM OF ACCUMULATOR
do
    get ID(I);
    PIRi identified = ID(I);
until EOWI
IW = (PIRiID1 to PIRiID12);
Figure 3.11 Algorithm of the accumulator.
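The accumulator's behaviour can be sketched as runnable code: phone IDs are collected until the end-of-word marker arrives, then the identified word is released. The 'please' example uses the IDs (22, 45, 2, 33) given in the text.

```python
# Sketch of the accumulator: consume phone IDs until the end-of-word marker
# (the silence unit, ID 46), then release the word as a phone-ID sequence.

EOW_ID = 46

def accumulate(id_stream):
    """Collect phone IDs until the EOW marker and release the word."""
    word = []
    for phone_id in id_stream:
        if phone_id == EOW_ID:   # word boundary reached
            break
        word.append(phone_id)
    return tuple(word)

assert accumulate([22, 45, 2, 33, 46]) == (22, 45, 2, 33)  # 'please' /pliz/
```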
The functions of the Neuro-Slice Response collector (NSR) and the output selector as
shown in Figure 3.12 are combined in the same algorithm and hence program. The response
from the neuro-slices, NSRij, is a continuous variable between 0 and 1 that represents the
degree of match for the jth frame of the ith phone, and is stored in an ASCII file, with one
value of the output per line. These responses are inputs to the neuro-slice response collector
and are available simultaneously. The file is read and an average of the outputs is found and
stored as IPIRi. If ACLi for that sub-recognisor is zero, then the final output from output
selector is zero. Alternatively, if ACLi for that sub-recognisor is one, then the final output from
the output selector is equal to IPIRi and is stored in the final output file as PIRi. The algorithm
was implemented using MATLAB script.
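The collector-and-selector step above can be sketched in a few lines (the thesis used a MATLAB script; this Python version is an illustrative equivalent):

```python
# Sketch of the NSR collector and output selector: average the neuro-slice
# responses into IPIRi, then gate the final output PIRi with ACLi.

def collect_and_select(nsr_values, acl: int) -> float:
    ipir = sum(nsr_values) / len(nsr_values)  # average slice responses
    return ipir if acl == 1 else 0.0          # ACL gates the output

assert abs(collect_and_select([0.8, 0.7, 0.9], acl=1) - 0.8) < 1e-9
assert collect_and_select([0.8, 0.7, 0.9], acl=0) == 0.0
```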
3.5 Sub-recognisor: Structure
The structure of the sub-recognisors chosen for RUST-I is illustrated in Figure 3.12. It
consists of slices of smaller neural networks (referred to as the neuro-slices as opposed to
one large neural network). This type of architecture was chosen for two reasons. The first
reason was that using neuro-slices reduced the number of outputs per ANN and hence
reduced the number of PEs in each of the hidden layers of the ANNs. This effect is called
scaling, and is known to increase network accuracy and decrease network training time. The
second reason was that the development of this architecture was inherently linked to the
development of the syntactic knowledge and its effective use, and localising phone
recognition to one sub-recognisor assisted in this process.
Figure 3.12 Structure of the sub-recognisor.
The total number of neuro-slices for each sub-recognisor actually depends on the phone
duration, which varies from phone to phone and from time to time even for the same phone.
To overcome this hurdle, this number is set to the average value Mi (shown in Table 2.1) in
RUST-I as a compromise between implementation and performance. The output of each
neuro-slice is called the Neuro-Slice Response, NSRij. It measures the degree to which the
input frame data, DIj(12), j=1,2,…,Mi, matches the jth frame of the phone that the ith sub-
recognisor was trained on. In all the cases, i represents the sub-recognisor and j represents
the order of the neuro-slice within that sub-recognisor. Using a number of frames in the
recognition process to define the number of active neuro-slices for any one sub-recognisor is
advantageous. This results from the temporal allocation of the neuro-slices, as they
provide temporal cues of the phone, especially duration information. This has been found
beneficial for the recognition of speech sounds that are perceived using mainly
temporal cues and some spectral cues (Tibbitts, 1989; Lee and Dermody, 1992);
this technique achieves that by providing a mixture of both cues. The distribution
of the frames of a phone through the neuro-slices of a sub-recognisor is referred to as
temporal unfolding. Time is therefore an additional dimension within the structure of the
APR, as the number of frames presented to each sub-recognisor, Mi, varies across
sub-recognisors.
Whenever a sub-recognisor is activated by ACLi, its output will be enabled. The NSR
collector adds the NSRi outputs from each neuro-slice and generates the IPIR signal which is
used by the output selector activated by ACLi to produce the phone response signal PIRi.
Figure 3.13 shows the basic architecture of one neuro-slice of the APR. It is a fully
interconnected feed-forward network with 12 inputs, three hidden layers (24 - 12 - 6 PEs) and
one output in the output layer. The input layer takes each of the 12 elements of the MFCC
vector. The output layer contains one PE representing a measure of the match between the
input speech and the phone. This output, PIR, is a continuous variable between 0 and 1. A
match is considered to occur if this value is greater than 0.6.
Figure 3.13 Architecture of one neuro-slice.
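The neuro-slice computation can be sketched as a plain feed-forward pass. The weights below are random, so the output only demonstrates the shape of the computation (12 inputs, hidden layers of 24, 12 and 6 PEs, one sigmoid output), not actual phone recognition.

```python
# Sketch of one neuro-slice: a 12-24-12-6-1 MLP with sigmoid PEs, whose
# single output is compared against the 0.6 match threshold.
import math, random

random.seed(0)
LAYERS = [12, 24, 12, 6, 1]

def sigmoid(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))

# random weight matrices for each pair of adjacent layers (illustrative)
weights = [[[random.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_out)]
           for n_in, n_out in zip(LAYERS, LAYERS[1:])]

def forward(x):
    for w in weights:
        x = [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return x[0]  # single output PE: degree of match in [0, 1]

out = forward([0.5] * 12)   # a dummy 12-element MFCC vector
is_match = out > 0.6        # match threshold from the text
assert 0.0 <= out <= 1.0
```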
The structure in Figure 3.13 is called a multi-layer perceptron (MLP). The number of
layers and the number of processing elements in each hidden layer affect the performance of
the network. To determine the optimal structure of the network, a trial and error method was
followed, in addition to the recommendations for a starting point suggested by McCord and
Illingworth (1991). Seven different structures of MLPs were investigated before the structure
of 24/12/6 PEs per hidden layer was derived.
The error function at the output layer in the initial run is computed as
ε = (T − PEout)² ,    (3.1)
where T is the target and PEout is the net output. In order to ensure the global convergence
of the back-propagation algorithm, the following assumptions are needed (Magoulas and
Varhatis, 1999): (1) the error function ε is a real-valued function defined and continuous
everywhere in R^n; (2) for any two points ω and υ ∈ R^n, ∇ε satisfies the Lipschitz
condition
||∇ε(ω) − ∇ε(υ)|| ≤ L ||ω − υ|| ,    (3.2)
where L > 0 denotes the Lipschitz constant. If these assumptions are satisfied, the back-
propagation algorithm can converge globally by determining the learning rate in the
direction of minimising the error in each iteration.
The trials for the selection of the best structure started by testing a neural network with
three hidden layers, similar to the Lippmann and Gold model (Lippmann, 1987); the
investigation resulted in the proposed MLP structure of 12 - 36 - 50 - 25 - 1. The momentum
was ρ = 0.99, the threshold value of the PE was μ = 0.35, the number of iterations was set at
5000, and the weights were initially set to small, normally distributed random values. The
MLP was then trained using the back-propagation learning algorithm (Laurene, 1994). The
training set contained 20 stimuli consisting of the vowel /a/: five of the fifteen speakers and
all 4 words that contained /a/ were used. The vowel /a/ was chosen as it is known to contain
explicit formants, which makes it easier to recognise.
The MLP was then tested on 20 different stimuli of the vowel /a/ with five different
speakers saying the same 4 words. As shown in Table 3.6, this architecture achieved a
recognition rate of 40%.
Table 3.6 Simulation of seven architectures of MLP.
Series   Structure        Accuracy (%)
1        12-36-50-25-1    40
2        12-36-24-6-1     49
3        12-48-24-12-1    45
4        12-24-24-12-1    66
5        12-18-20-10-1    55
6        12-24-12-3-1     80
7        12-24-12-6-1     100
To observe the effect of altering the structure on the recognition performance of the MLP,
the number of PEs in the second and third layers was decreased and the number of PEs per
layer was made a multiple of six, to derive the structure 12 - 36 - 24 - 6 - 1. All other
parameters and training and testing conditions remained the same; this structure achieved a
slightly improved accuracy of 49% during training. The results of the other trials are shown
in Table 3.6. As more trials were performed, it was noted that the accuracy improved
markedly with the manipulation of the second and third layers only; therefore, only the
number of PEs in the third layer was increased, to derive a structure of 12 - 24 - 12 - 6 - 1.
All other parameters and training and testing conditions remained the same. This structure
produced an optimal accuracy of 100%, as shown in Table 3.6. The fast back-propagation
(FBP) algorithm (Technical Publications Class, 1993) was used to train the neuro-slices.
3.6 Conclusion
The language model described in this chapter, together with the lexicon words in their
contextual presence, forms the syntactical knowledge of the system. This syntactical
knowledge interacts with the neural networks to form the phonemic recognition block of the
system. The structure of the neuro-slice was also presented.
Chapter 4: Experimental Procedure
4.0 Introduction
In this chapter the performance of RUST-I is investigated. This work was part of the
original research conducted on non-standard speech samples. It will be shown in
later sections of this chapter that there is a need to carry out further testing on standard
speech samples, as explained in Chapter 5. Section 4.2 describes the training of all
the 46 sub-recognisors of the APR. Section 4.3 describes the testing of the sub-recognisors
using isolated phones, and Section 4.4 deals with the testing using isolated phones with the
isolated phone identification factor included. The whole system is tested on isolated word
recognition in Section 4.5.
In the testing procedures used in this chapter, there were two scores of interest:
1. The Self-Recognition Score (SRS) is the score of a sub-recognisor output, PIRi,
when the sub-recognisor is presented with the phone it was trained to
recognise.
2. The Misrecognition Score (MRS) is the score of a sub-recognisor output when
presented with any phone other than the one it was trained to recognise.
A confusion is defined as any MRS that is greater than 0.1.
All 45 phonemes were divided into seven subgroups:
1. vowels (i, I, ε, æ, a, Þ, Ď, Э, Ω, u, έ, ∂, Λ)
2. diphthongs (aI, eI, ЭI, aΩ, OΩ, I∂, ε∂, Ω∂)
3. stops (p, b, t, d, k, g)
4. nasals (m, n, さ)
5. fricatives (f, v, し, ð, s, z, ∫, ξ)
6. affricatives (t∫, dξ)
7. semi-vowels (h, r, j, w, L).
An Intra Subgroup Confusion (IASC) is confusion within a subgroup. An Inter Subgroup
Confusion (IRSC) is confusion across subgroups.
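The distinction between the two confusion categories reduces to a subgroup membership test. The sketch below illustrates it with an ASCII transliteration of a few phones; the mapping table is an assumption for illustration, not the thesis phone set:

```python
# Illustrative subgroup map (a small subset of the phone set,
# transliterated to ASCII labels for this sketch).
SUBGROUP = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop",
    "m": "nasal", "n": "nasal", "ng": "nasal",
    "f": "fricative", "s": "fricative", "z": "fricative",
}

def confusion_type(stimulus, response):
    """IASC if both phones share a subgroup, IRSC otherwise."""
    return "IASC" if SUBGROUP[stimulus] == SUBGROUP[response] else "IRSC"

print(confusion_type("t", "d"))   # IASC (both stops)
print(confusion_type("t", "s"))   # IRSC (stop vs fricative)
```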
4.1 Selection of Parameters and Initial Conditions
The transfer function of the sigmoid is given by

Y = f(s) = 1 / (1 + e^(-βs)),
where β is a constant in the range 0 to 1. This function was applied as the firing
function throughout the network. If the net stimulus to a processing element (PE)
exceeds the range of its transfer function, that PE is said to be saturated. The sigmoid
function applied to the RUST-I sub-recognisors accepted values between +6 and -6;
saturation occurs when a PE's net stimulus exceeds this range.
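A minimal sketch of this firing function and the saturation test, assuming the ±6 range described above (the function names are illustrative):

```python
import math

def sigmoid(s, beta=1.0):
    """Firing function Y = 1 / (1 + exp(-beta * s))."""
    return 1.0 / (1.0 + math.exp(-beta * s))

def is_saturated(net_stimulus, limit=6.0):
    """A PE is saturated when its net stimulus leaves the accepted range."""
    return abs(net_stimulus) > limit

print(round(sigmoid(0.0), 2))    # 0.5
print(is_saturated(7.2))         # True
print(is_saturated(5.0))         # False
```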
The connection weights to each of the 45 PEs within each sub-recognisor are initialised to
small random values. The fast back-propagation (FBP) algorithm (Technical
Publications Group, 1993) was used to adjust the weights and minimise the global error.
Table 4.1 shows the learning rates and momentum terms for all layers and for training
and testing. These values were derived on a trial and error basis by monitoring the RMS
error and weight saturation. The choice of learning rate and momentum term was shown
to affect the training speed of the network, the stability of the RMS error curve and/or
saturation of the PEs. For example, using the default values of learning rate (input 0.5,
1st hidden 0.25, 2nd hidden 0.2, 3rd hidden 0.15) and a momentum term of 0.4 with 10000
iterations, the RMS error jumped to a normalised value of 1 and subsequently changed very
little. The saturation levels of the PEs in all hidden layers showed that these PEs
reached saturation (the first hidden layer after only 100 iterations) and the weights did
not change after that, leading to no decrease in error and no further learning. With the
values of learning rate defined in Table 4.1 and 2000 iterations, the RMS error initially
jumped to a normalised value of 1 and subsequently dropped to near zero as shown in
Figure 3.14.
Table 4.1 Optimum learning rates and momentum terms for all layers
during training and testing.
              Training                        Testing
Layer         Learning Rate  Momentum term    Learning Rate  Momentum term
Input         0.25           0.5              0.15           0.9
1st hidden    0.125          0.25             0.075          0.25
2nd hidden    0.0313         0.0625           0.0188         0.0625
3rd hidden    0.0019         0.0039           0.0012         0.0039
Figure 4.1 RMS error curves for training with adjusted parameters.
4.1.0 Further Results on Training and Testing
Two different data sets were used with the MLP within the neuro-slice: one for training
and the other for testing. Both data sets contained the same phones spoken by different
speakers, extracted from different words or from different positions in the same word.
For example, the phone /m/ and the phone /L/ were extracted from two different positions
in the word 'multimillionaire'.
Speakers labelled as 1, 2, 3, 4 and 5 were used for training. The first three were male and
the last two female. Speakers labelled as 6, 7, 8, 9 and 10 were used for testing. The first
three were male and the last two female. Table 4.2 shows the number of training and
testing tokens (phone samples) used for each sub-recognisor. For example, column 1, row
1 of Table 4.2 shows that the phone /I/ sub-recognisor has identifier 1 and is represented
in 3 different words spoken by all speakers, so there are 15 different examples of this
phone for both training and testing in columns 3 and 4. The second column of Table 4.2
contains both the phone identifier and the sub-recognisor identifier (separated by a slash
"/").
Table 4.2 Number of training and testing tokens used for each sub-recognisor.
Phone ID Train Test Phone ID Train Test
I 1/I1 15 15 t 24/T 5 5
i 2/I2 25 25 d 25/D 10 10
ε 3/A1 20 20 k 26/K 5 5
æ 4/A2 10 10 g 27/G 5 5
a 5/A3 20 20 f 28/F 10 10
Þ 6/A4 5 5 v 29/V 5 5
Ď 7/A5 5 5 し 30/THE 5 5
Э 8/O1 5 5 ð 31/THI 15 15
Ω 9/O2 5 5 s 32/S 5 5
u 10/O3 15 15 z 33/Z 5 5
έ 11/A6 5 5 ∫ 34/SH 10 10
∂ 12/A7 15 15 ξ 35/JH 25 25
Λ 13/A8 10 10 h 36/H 5 5
aI 14/AI 20 20 r 37/R 5 5
eI 15/EI 30 30 t∫ 38/TCH 5 5
ЭI 16/OI 5 5 dξ 39/DJH 30 30
aΩ 17/AU 5 5 m 40/M 35 35
OΩ 18/OU 15 15 n 41/N 5 5
I∂ 19/IA 5 5 ŋ 42/MNG 10 10
ε∂ 20/EA 10 10 j 43/J 10 10
Ω∂ 21/UA 5 5 w 44/W 35 35
p 22/P 55 55 L 45/L 80 80
b 23/B 30 30 silence 46/slc 20 20
The procedure used to prepare training files was to separate each frame of MFCC
coefficients for all the examples of training tokens and place them in separate files so that
each neuro-slice was trained on its appropriate frame independently. Table 4.3 shows an
example of the sequential order of presentation in terms of the phone id (P), example
number (E), frame number (F), speaker number (S) and word number (W). For example,
phone 1 is extracted from words 1, 2 and 3 from each of speakers 1, 2, 3, 4 and 5. The
number of examples differs for each phone as shown in Table 4.2 and is referred to as j
for the training files.
Table 4.3 Example of the sequential order of presentation in terms of the phone ID (P),
example number (E), frame number (F), speaker number (S) and word number (W).
[Table body not fully recoverable from the source. For each example E of a phone P, the
table lists the frame numbers F = 1 to 6 against the word number W and the speaker
numbers S = 1 to 5; for instance, example 1 of phone 1 comprises frames 1 to 6 of word 1
for each of speakers 1 to 5.]
The data input file used for training has the format shown in Figure 4.2, where every
example of each frame for each phone is placed in a separate file. Each file has j tokens.
MFCC for frame 1, phone 1, word 1, speaker 1
MFCC for frame 1, phone 1, word 1, speaker 2
MFCC for frame 1, phone 1, word 1, speaker 3
MFCC for frame 1, phone 1, word 1, speaker 4
MFCC for frame 1, phone 1, word 1, speaker 5
MFCC for frame 1, phone 1, word 2, speaker 1
MFCC for frame 1, phone 1, word 2, speaker 2
MFCC for frame 1, phone 1, word 2, speaker 3
MFCC for frame 1, phone 1, word 2, speaker 4
MFCC for frame 1, phone 1, word 2, speaker 5
Figure 4.2 Format of the data input training file.
An example of an input data training file is shown below. The file is in ASCII format.
Each record consists of twelve normalised real numbers representing the MFCC vector,
followed by the required target separated by an ampersand. The values in the input fields
are separated by a space. For example:
0.615297 0.124238 0.095474 0.055436 0.084756 0.191427 0.037501 0.083012
0.183048 0.094391 0.045078 0.094451 & 1.0000
0.599774 0.006463 0.102975 0.048325 0.196358 0.143246 0.081053 0.022370
0.148832 0.083293 0.064328 0.021072 & 1.0000
The first record is the first frame of the vowel /a/ spoken by speaker 1 from word 1,
followed by the desired output. The second record is the input and desired output for the
first frame of the vowel /a/ spoken by speaker 2 from word 1. The pattern continues, with
the subsequent records giving the input and desired output for the first frame of the
vowel /a/ spoken by the other speakers.
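Under the format just described, one record can be parsed into its MFCC vector and target as in the sketch below (the function name is hypothetical):

```python
def parse_training_line(line):
    """Split one record into its 12 MFCC values and the target,
    using the ampersand as the separator."""
    features, target = line.split("&")
    mfcc = [float(v) for v in features.split()]
    return mfcc, float(target)

# First record from the example above.
line = ("0.615297 0.124238 0.095474 0.055436 0.084756 0.191427 "
        "0.037501 0.083012 0.183048 0.094391 0.045078 0.094451 & 1.0000")
mfcc, target = parse_training_line(line)
print(len(mfcc), target)   # 12 1.0
```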
The MLP utilises supervised learning, so the desired outputs are presented with the
inputs to the network in the training file only; the desired output is not present in the
testing file. The testing data file is also in ASCII format and consists of records of
twelve normalised real numbers representing the MFCC vector. The order of the test file
is random over phone, speaker and word. In testing, a match between the testing input set
and the training set was assumed if the output was greater than 0.60. Thus, an output
between 0.60 and 1.00 represented a "correct response", while any other response was
considered to be a "no match".
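This decision rule can be sketched as follows (assuming the boundary value 0.60 itself counts as a match; the function name is illustrative):

```python
def classify_output(output, threshold=0.60):
    """Map a sub-recognisor output to a recognition decision."""
    return "correct response" if threshold <= output <= 1.00 else "no match"

print(classify_output(0.82))   # correct response
print(classify_output(0.41))   # no match
```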
The exit condition from the fast back-propagation (FBP) algorithm during training was
the number of iterations, which was set at 2000. The exit condition during testing was
the error falling below the default minimum. The output from the testing of each
neuro-slice was stored in a separate file, with one response per line. This output is the
Neural Net Output, NNOij, for the ith PE and jth neuro-slice.
4.1.1 Confusion Matrix
A confusion matrix records sub-recognisor scores against the presented stimuli. It is a
grid in which a number on the diagonal indicates a correct response to the input
stimulus, and a number either side of the diagonal indicates the degree to which the
sub-recognisors identified other phones. The numbers in each square are the outputs from
each sub-recognisor. An error occurs if an off-diagonal number is greater than 0.6. This
representation was used because it shows the error obtained in phone recognition and how
this is influenced by syntactic knowledge. Table 4.4 shows an
example of a confusion matrix as used to record the outputs from the sub-recognisors.
The y axis is the stimulus and the x axis is the response (PIRi). Any score on the diagonal
represents correct response to the applied stimulus. Any non-zero off-diagonal score
represents an error.
Table 4.4 Example of the confusion matrix.
Response (PIRi)
Stimulus I i ε æ a
I 0.82 0.1 0.3171 0.05 0
i 0.1 0.81 0.15 0.09 0
ε 0.15 0.2 1.0 0.4812 0.08
æ 0 0 0.19 0.9998 0
a 0 0 0.1 0 1.00
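The off-diagonal error rule can be expressed as a short sketch. The matrix below contains hypothetical scores, not the values of Table 4.4:

```python
def confusion_errors(matrix):
    """Return (stimulus, response) index pairs where an off-diagonal
    score exceeds the 0.6 error threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, score in enumerate(row)
            if i != j and score > 0.6]

# Hypothetical 3x3 confusion matrix: rows = stimuli, columns = responses.
m = [[0.82, 0.10, 0.31],
     [0.10, 0.81, 0.15],
     [0.15, 0.70, 0.90]]
print(confusion_errors(m))   # [(2, 1)]
```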
4.2 Training the Adaptive Phone Recognisor
This section describes the results from the primary training of the APR on individual
phones. At the beginning, each sub-recognisor was trained on the relevant correct phone
extracted from five speakers (IDs 1, 2, 3, 4 and 5). The data set for one sub-recognisor
consisted of all representations of the one phone from all five speakers.
Table 4.5 summarises the SRS results of the primary training session of the APR; the
table shows the maximum and minimum values of the responses for all types of phones. The
confusion matrices for all training speakers were obtained. It can be seen from the table
that the vowels and the semivowels achieved the best results; this is expected because of
the explicit spectral nature of those phones.
Table 4.5 Summary of the primary training session of the APR.
Phone Group    Vowels &      Stops, Fricatives    Nasals    Semivowels    Silence
               Diphthongs    & Affricatives
SRS min        0.92          0.81                 0.79      0.80          0.70
SRS max        1.00          0.99                 0.98      1.00          0.70
All MRSs for the training set were in the range 0.07 to 0.6, below the lowest SRS of 0.70
(for silence), meaning that there will not be any confusion between any phone and
silence; the system can therefore distinguish between a sound and silence. Table 4.6
summarises the most notable MRSs, i.e. the highest Intra Subgroup Confusion (IASC)
values within the phone subgroups.
Table 4.6 Summary of the most remarkable IASCs.
Phone Group       VWL      DPH        STP     FR      AFR       NS      SVWL
Phone-to-Phone    Ω to έ   aΩ to OΩ   t to k  ð to θ  t∫ to dξ  ŋ to m  -
IASC              0.35     0.35       0.60    0.49    0.22      0.28    0.00
VWL: Vowels, DPH: Diphthongs, STP: Stops, FR: Fricatives, AFR: Affricatives, NS: Nasals, SVWL: Semi-vowels
The maximum IRSC for vowels occurred with the semivowel subgroup, when applying the
semivowel /r/ to the sub-recognisor of the phone /a/, which achieved an MRS of 0.28. In
the case of diphthongs, the only IRSC greater than 0.0 occurred when applying the vowel
/∂/ to the sub-recognisor of the phone /ε∂/, which resulted in an MRS of 0.13. The
maximum IRSC for the stops occurred with the affricatives, when applying the affricative
/t∫/ to the sub-recognisor /t/, which resulted in an MRS of 0.38. The IRSCs for
fricatives were low (<= 0.1); the maximum occurred when applying the semivowel /h/ to the
sub-recognisor /s/, which achieved an MRS of 0.10. The nasals had an IRSC of zero with
every other subgroup. The maximum IRSC for affricatives occurred when applying the stops
/t/ and /d/ to the sub-recognisor /t∫/, which resulted in an MRS of 0.34. The two highest
IRSCs for semivowels occurred when applying the fricative /s/ to the semivowel /h/, and
the vowel /Ω/ to the semivowel /w/, both of which resulted in an MRS of 0.15.
In conclusion:
• At the end of the training session for the APR, the results show that SRSs are
higher than MRSs, which allows the module to pass Experiment One.
• A potential problem area is that some sub-recognisors achieved MRSs close to
their SRSs.
4.3 Experiment One: Operation of Each Sub-recognisor without the
Syntactical Knowledge
Experiment One was designed to measure the performance of each of the sub-recognisors
on isolated phones before syntactic knowledge is included. The overall performance of
RUST-I as an IWR is dependent on its ability to recognise individual phones. RUST-I
requires that the SRS for the correct phone be above the minimum threshold to be
considered for syntactic knowledge evaluation. During the experiment the ACLs to the APR
were deactivated so that there was no input from the syntactic knowledge estimator.
The aim of this experiment was firstly to determine the level of confusion that occurred for
phones without syntactic knowledge and secondly to determine the required threshold of
output for recognition of the correct response (self-recognition score - SRS) from the sub-
recognisors. Unique test data not used in training (from speakers 6, 7, 8, 9 and 10) was
provided for this experiment.
4.3.1 Input Stimuli
The stimulus data set presented to the adaptive phone recognisor in this experiment
contains the same phone set applied in training the neural nets, but now spoken by
different speakers. The testing set contains one token of each of the 45 unique phones
derived from speakers 6, 7, 8, 9 and 10. The new speaker set used in testing thus ensures
speaker independence for the system.
4.3.2 Experimental Method
The inputs to the sub-recognisor are the 12 Mel-scale frequency cepstral coefficients
(MFCC) for each of the Mi frames of the ith sub-recognisor. The number of frames, Mi,
determines the number of neuro-slices in each sub-recognisor. For example, the phone /p/
is the 22nd sub-recognisor and has 6 frames associated with it, so 6 neuro-slices are
required in this sub-recognisor. For the phone /p/, six sets of 12 MFCCs were applied to
the neuro-slices simultaneously and collected in the neuro-slice collector. The output
from the neuro-slice collector, IPIR22, was a value from 0.00 to 1.00 that measured the
degree of matching for that sub-recognisor.
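The role of the neuro-slice collector can be sketched as below. The combination rule shown (averaging the neuro-slice outputs into one matching score) is an assumption for illustration, as are the example values:

```python
# Sketch of a sub-recognisor evaluation: one score per neuro-slice
# (frame), combined by the collector into a single matching score.
# Averaging is assumed here for illustration only.
def sub_recognisor_score(frame_scores):
    return sum(frame_scores) / len(frame_scores)

# e.g. the /p/ sub-recognisor with 6 hypothetical neuro-slice outputs:
print(round(sub_recognisor_score([0.7, 0.8, 0.75, 0.9, 0.6, 0.65]), 2))   # 0.73
```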
Representation of Raw Data: All 225 tokens (45 phonemes by 5 speakers) were applied
to each of the 46 sub-recognisors and the IPIRi outputs were measured. The IPIRi output
from each sub-recognisor was then stored in an ASCII file and represented graphically as
a 3-D confusion matrix. Both the 2-D and 3-D confusion matrices are available for the
speakers from the test set, showing the maximum output responses (IPIRi) only.
Representation of Significant Confusions: Tables were created to represent the output of
a sub-recognisor for its correct stimulus (the SRS) for all speakers in the test set. These
tables summarised the IPIRi output for each sub-recognisor when presented with its true
stimuli, i.e. the SRS.
Tables were also created showing the IASC averaged over all speakers in the test set and
for each of the six subgroups. These tables were derived to look at the influence of the
place of articulation on phone confusion, and to be used to determine IASCs of the APR
and also assist in the derivation of the appropriate threshold level for that subgroup. The
tables contain the average output from all sub-recognisors in a subgroup and in response to
input stimuli from that subgroup.
It is expected that confusions may occur across similar subgroups (IRSC), such as
between vowels and diphthongs, semivowels and vowels or diphthongs, stops and
affricatives, or between fricatives and affricatives. Tables were also created showing
the main confusions for each phone input and for all speakers over all phones in
subgroups.
These tables are used to determine the main confusions and so identify possible errors in
the system, investigate the speaker independence of the phone recognisor, and assist in the
derivation of an appropriate level of threshold for the system.
4.3.3 Results
The results are presented in two formats as described in Section 4.3.2 and are presented in
Section 4.3.3 respectively. The recognition decision was evaluated by the matching scores
collected at the NSR collector end for every sub-recognisor. Results were represented by
two forms. The first is the self-recognition score (SRS) which is the immediate phone
identification response (IPIRi) appeared at the output of the neuro-slices response collector
of the ith sub-recognisor when presented at its input to the phones ith. The second is the
misrecognition score (MRS) which is the immediate phone identification response (IPIRi)
appearing at the output of the neuro-slices response collector of the ith sub-recognisor when
presented at its input to the jth phone.
Confusion Matrix: The responses of all sub-recognisors for all input stimuli of the five
testing speakers (numbers 6 to 10) are represented in the form of confusion matrices,
presented as tables. All speakers showed similar trends, which are verified in the
following tables. The majority of confusions occurred within a subgroup (intra) rather
than across subgroups (inter), meaning that place of articulation was confused rather
than manner of articulation. Some exceptions occurred consistently for all speakers.
These were low-level confusions (from 0.05 to 0.35) of the vowels /I/ and /i/ with the
semivowel /j/, the vowels /Ω/ and /u/ with the semivowel /w/, the vowel /a/ with the
semivowel /r/, the diphthongs /aΩ/ and /ε∂/ with the vowels /Ď/ and /έ/, between the
affricatives and some of the stops, and of silence with low-intensity consonants (stops,
fricatives and affricatives).
Figures 4.3a and 4.3b show the 3-D graphical representation of the full confusion
matrix for speaker 9 from the right and left side of the diagonal respectively. The
X-axis represents the identification of the presented stimuli, the Y-axis the
sub-recognisor that responded, and the vertical Z-axis the intensity or amplitude of the
response. The highest scores are shown to be centred on the diagonal (> 0.60);
off-diagonal scores tend to be between 0.20 and 0.60. There is evidence of clustering of
data such that confusions occur mainly within subgroups, i.e., IASC.
Figure 4.3(a) 3-D representation of the full confusion matrix of speaker 9 (right side
view).
Figure 4.3(b) 3-D representation of the full confusion matrix of speaker 9 (left side view).
To complement the views of Figures 4.3a and 4.3b, Figure 4.4 illustrates the 2-D
graphical representation (top view) of the full confusion matrix for speaker 9. In this
diagram it is easier to see the symmetry of stimulus with response. For example, if a
stimulus X was partially recognised by sub-recognisor Y, then stimulus Y was also
partially recognised by sub-recognisor X. The scores may differ, but the similarity in
the two signals will be coded into both sub-recognisors. The regions of the graphs in
Figures 4.3 and 4.4 are segmented into subgroups, i.e., vowels, diphthongs and so on. The
confusion is shown to appear mainly within those subgroups, indicating that the place of
articulation is the main source of confusion for the APR for this speaker, as it is for
human listeners.
Figure 4.4 2-D representation of the confusion matrix of speaker 9.
Self-Recognition Scores (SRS) for All Speakers: Tables 4.7a-e contain the SRS or actual
output (IPIRi) from each sub-recognisor for all the speakers in the test set (Speakers 6 to
10) when stimulated only with the correct stimulus for that sub-recognisor. These tables
therefore contain the values from the diagonals of the confusion matrices as shown below.
Table 4.7(a) Responses of the sub-recognisors for expected input stimulus - speaker 6.
Phone Response Phone Response Phone Response
I 0.80 ЭI 0.83 ð 0.68
i 0.81 aΩ 0.89 s 0.59
ε 0.95 OΩ 0.58 z 0.70
æ 0.92 I∂ 0.78 ∫ 0.59
a 0.92 ε∂ 0.56 ξ 0.52
Þ 0.91 Ω∂ 0.51 t∫ 0.50
Ď 0.52 p 0.56 dξ 0.62
Э 0.71 b 0.50 m 0.79
Ω 0.59 t 0.61 n 0.58
u 0.80 d 0.85 ŋ 0.55
έ 0.90 k 0.83 h 0.51
∂ 0.92 g 0.55 r 0.81
Λ 0.91 f 0.59 j 0.90
aI 0.86 v 0.71 w 0.59
eI 0.90 θ 0.55 L 0.50
Table 4.7(b) Responses of the sub-recognisors for expected input stimulus – speaker 7.
Phone Response Phone Response Phone Response
I 0.81 ЭI 0.80 ð 0.55
i 0.81 aΩ 0.58 s 0.58
ε 1.00 OΩ 0.53 z 0.80
æ 0.95 I∂ 0.81 ∫ 0.53
a 0.99 ε∂ 0.59 ξ 0.83
Þ 0.98 Ω∂ 0.80 t∫ 0.78
Ď 0.59 p 0.72 dξ 0.55
Э 0.80 b 0.59 m 0.83
Ω 0.83 t 0.65 n 0.54
u 0.87 d 0.75 ŋ 0.56
έ 0.92 k 0.77 h 0.74
∂ 1.00 g 0.55 r 0.58
Λ 0.95 f 0.59 j 0.94
aI 0.86 v 0.74 w 0.85
eI 0.83 θ 0.71 L 0.82
Table 4.7(c) Responses of the sub-recognisors for expected input stimulus – speaker 8.
Phone Response Phone Response Phone Response
I 0.81 ЭI 0.82 ð 0.73
i 0.80 aΩ 0.55 s 0.84
ε 0.99 OΩ 0.54 z 0.79
æ 0.98 I∂ 0.91 ∫ 0.78
a 0.99 ε∂ 0.88 ξ 0.53
Þ 0.98 Ω∂ 0.91 t∫ 0.59
Ď 0.98 p 0.79 dξ 0.79
Э 0.78 b 0.84 m 0.90
Ω 0.79 t 0.52 n 0.55
u 0.88 d 0.89 ŋ 0.60
έ 0.98 k 0.79 h 0.59
∂ 0.99 g 0.55 r 0.55
Λ 0.96 f 0.74 j 0.92
aI 0.79 v 0.59 w 0.90
eI 0.87 θ 0.55 L 0.82
Table 4.7(d) Responses of the sub-recognisors for expected input stimulus – speaker 9.
Phone Response Phone Response Phone Response
I 0.82 ЭI 0.86 ð 0.74
i 0.81 aΩ 0.88 s 0.85
ε 1.00 OΩ 0.92 z 0.80
æ 0.99 I∂ 0.89 ∫ 0.79
a 1.00 ε∂ 0.93 ξ 0.83
Þ 0.99 Ω∂ 0.95 t∫ 0.80
Ď 0.99 p 0.62 dξ 0.70
Э 0.79 b 0.85 m 0.91
Ω 0.80 t 0.69 n 0.69
u 0.89 d 0.85 ŋ 0.61
έ 0.99 k 0.80 h 0.70
∂ 1.00 g 0.79 r 0.92
Λ 0.97 f 0.75 j 0.96
aI 0.89 v 0.80 w 0.91
eI 0.91 θ 0.76 L 0.81
Table 4.7(e) Responses of the sub-recognisors for expected input stimulus - speaker 10.
Phone Response Phone Response Phone Response
I 0.73 ЭI 0.81 ð 0.70
i 0.70 aΩ 0.52 s 0.68
ε 0.85 OΩ 0.63 z 0.79
æ 0.80 I∂ 0.65 ∫ 0.55
a 0.95 ε∂ 0.61 ξ 0.57
Þ 0.65 Ω∂ 0.54 t∫ 0.75
Ď 0.62 p 0.75 dξ 0.55
Э 0.70 b 0.65 m 0.85
Ω 0.75 t 0.60 n 0.65
u 0.85 d 0.80 ŋ 0.53
έ 0.89 k 0.54 h 0.70
∂ 0.82 g 0.74 r 0.85
Λ 0.97 f 0.70 j 0.90
aI 0.63 v 0.80 w 0.88
eI 0.70 θ 0.54 L 0.54
The minimum SRS for these five speakers varied from 0.50 for speaker 6 to 0.61 for
speaker 9. The average vowel SRS per speaker varied from 0.79 for speaker 10 to 0.93 for
speaker 9. The overall average SRS for all vowels over all speakers was 0.87. Vowels
generally produced the highest SRS values of all the subgroups, but the vowel that
obtained the lowest values varied across speakers. The vowels /a/ and /Λ/ had
consistently high SRSs (0.90 to 1.00) across all speakers. These results are unique to
Australian vowels (Section 2.3).
The average diphthong SRS per speaker varied from 0.63 for speaker 10 to 0.95 for
speaker 9. The overall average SRS for all diphthongs over all speakers was 0.77. No
diphthongs consistently obtained lower SRS values but the diphthong /ЭI/ had a
consistently high SRS (above 0.80) across all speakers. Speaker 9 had a much higher
average diphthong score (0.95) than any other speaker.
The average stop SRS per speaker varied from 0.65 for speakers 6 and 7 to 0.83 for
speaker 9. The overall average SRS for all stops over all speakers was 0.71. The stop /t/
obtained lower SRS (0.52 to 0.69) for all speakers. The stop /d/ had a consistently high
SRS (above 0.75) across all speakers. Speaker 9 had a much higher average stop score
(0.83) than any other speaker.
The average nasal SRS per speaker varied from 0.64 for speakers 6 and 7 to 0.77 for
speaker 9. The overall average SRS for all nasals over all speakers was 0.68. The nasals
/n/ and /ŋ/ consistently obtained lower SRSs (0.53 to 0.65) for all speakers. The nasal
/m/ had a consistently high SRS (above 0.79) across all of the speakers. Again speaker 9
had a much
higher average nasal SRS (0.77) than any other speaker.
The average fricative SRS per speaker varied from 0.62 for speaker 6 to 0.79 for speaker
9. The overall average SRS for all fricatives over all speakers was 0.69. The fricative
/z/ had a consistently high SRS (above 0.70) across all speakers. All other fricatives
obtained varied SRSs, which generally tended to be good (above 0.69) except in some cases
for the fricative /ξ/. The highest average SRS for the fricatives was obtained by speaker
9 (0.79).
The average affricative SRS per speaker varied from 0.56 for speaker 6 to 0.75 for
speaker 9. The overall average SRS for all affricatives over all speakers was 0.65. The
affricative /t∫/ obtained a higher SRS for three speakers (0.75 to 0.80). The affricative
/dξ/ had a higher SRS (0.7 to 0.79) for two speakers. For both affricatives, speaker 9
had a much higher average SRS (0.75) than any other speaker.
The average semivowel SRS per speaker varied from 0.66 for speaker 6 to 0.82 for
speaker 9. The overall average SRS for all semivowels over all speakers was 0.74. No
semivowel consistently obtained lower values but the semivowel /j/ had consistently
higher SRSs (above 0.81) across all speakers. Speaker 9 had a much higher average
semivowel SRS (0.87) than any other speaker.
Average Confusion Response for Subgroups: Table 4.8a shows the vowel stimuli presented
versus the average MRS across all speakers and across all vowel sub-recognisors. For all
members of this subgroup the SRS was always higher than the MRS achieved by any other
sub-recognisor. Table 4.8b shows the three most common confusions and their associated
MRS across speakers for just the members of the vowel subgroup.
Table 4.8(a) Vowels confusion matrix - Stimuli presented versus sub-recognisor
responses.
I i ε æ a Þ Ď Э Ω u έ ∂ Λ
I .79 .14 .24 .14 .05 .00 .03 .01 .02 .04 .57 .00 .00
i .13 .78 .32 .13 .02 .00 .00 .00 .00 .00 .14 .00 .00
ε .24 .23 .96 .46 .13 .10 .09 .04 .01 .06 .23 .03 .05
æ .00 .07 .23 .91 .00 .15 .19 .18 .13 .00 .15 .00 .26
a .00 .00 .12 .00 .97 .00 .00 .41 .00 .00 .13 .14 .26
Þ .00 .00 .00 .02 .00 .90 .00 .00 .00 .00 .00 .00 .00
Ď .00 .01 .00 .05 .00 .02 .74 .00 .00 .00 .00 .02 .00
Э .00 .00 .00 .22 .19 .28 .15 .76 .22 .24 .12 .23 .38
Ω .00 .01 .02 .00 .00 .00 .00 .25 .75 .57 .63 .04 .33
u .00 .00 .00 .00 .00 .09 .09 .20 .42 .86 .09 .00 .00
έ .08 .00 .45 .32 .00 .00 .00 .00 .27 .00 .94 .00 .25
∂ .00 .00 .00 .53 .06 .00 .00 .00 .00 .00 .00 .95 .00
Λ .00 .00 .00 .35 .22 .07 .18 .24 .19 .00 .01 .00 .95
Table 4.8(b) Three most common confusions across speakers for the vowel subgroup.
    Speaker 6                  Speaker 7                  Speaker 8                  Speaker 9                  Speaker 10
I:  έ-0.45, æ-0.33, a-0.24   | έ-0.60, i-0.21, æ-0.12   | έ-0.60, ε-0.50, i-0.20   | έ-0.60, j-0.40, ε-0.31   | έ-0.61, j-0.39, ε-0.31
i:  ε-0.35, j-0.33, æ-0.25   | j-0.59, ε-0.20, I-0.18   | j-0.60, I-0.25, ε-0.20   | j-0.60, ε-0.15           | ε-0.70, j-0.55, æ-0.20
ε:  æ-0.60, I-0.44, i-0.41   | æ-0.40, έ-0.29, i-0.11   | æ-0.35, I-0.30, i-0.22   | æ-0.48, έ/i-0.20         | æ-0.45, i-0.22, έ-0.21
æ:  Э-0.38, i/Þ-0.36, Λ-0.31 | ε-0.22, Ď-0.20           | ε-0.30, Λ-0.23, έ/Ď-0.15 | Λ-0.30, έ-0.20, ε-0.19   | Λ-0.25, ε-0.20, έ-0.19
a:  Λ-0.61, Э-0.29, ε-0.18   | Э-0.41, r-0.25, Λ-0.23   | Э-0.45, ∂/ε-0.20         | Э-0.40, r-0.23, Λ-0.20   | Э-0.50, ∂-0.30, έ-0.25
Þ:  none                     | none                     | none                     | none                     | none
Ď:  aΩ-0.41, æ-0.25, ∂-0.12  | none                     | none                     | none                     | none
Э:  ∂-0.40, æ-0.36, u-0.33   | Λ-0.41, Þ-0.31, u-0.25   | Λ-0.41, æ/Ω-0.30         | Λ-0.43, a/Þ-0.30         | Λ-0.45, Þ-0.38, a-0.25
Ω:  u-0.41, έ-0.36, Λ-0.36   | έ-0.70, u-0.61, Λ-0.29   | έ-0.70, u-0.61, Э-0.22   | έ-0.70, u-0.61, Λ-0.39   | έ-0.70, u-0.61, Λ-0.40
u:  Э-0.29, Ω-0.27, έ-0.18   | Ω-0.30, Э-0.20, Þ-0.11   | Ω-0.50, Э/w-0.20         | Ω-0.48                   | Ω-0.54, w-0.35, Э-0.22
έ:  Ω-0.38, æ-0.36, ε-0.35   | ε-0.45, Λ-0.30, Ω/ε-0.26 | ε-0.42, ε∂-0.35, Λ-0.24  | ε-0.50, ε∂-0.40, Ω-0.30  | ε-0.52, ε∂-0.46, Λ-0.28
∂:  æ-0.62                   | æ-0.60                   | æ-0.40                   | æ-0.49                   | æ-0.50, a-0.30
Λ:  æ-0.40, Þ-0.38, Э-0.11   | Ω-0.39, Э-0.35, æ-0.26   | Э-0.29, æ-0.20, a-0.23   | æ-0.40, Ω-0.24, a/Э-0.20 | æ-0.50, a-0.30, Э-0.25
Table 4.9a shows the diphthong stimuli versus the average MRS for all diphthong
sub-recognisors. For all members of this subgroup the SRS was always higher than the MRS.
Table 4.9b shows the three most common confusions across speakers and their associated
MRS across members of the diphthong subgroup (IASC).
Table 4.9(a) Diphthong confusion matrix (average values over all speakers).
aI eI ЭI aΩ OΩ I∂ ε∂ Ω∂
aI 0.81 0.22 0.18 0.00 0.00 0.00 0.28 0.00
eI 0.68 0.84 0.34 0.00 0.00 0.00 0.31 0.00
ЭI 0.27 0.20 0.82 0.43 0.22 0.00 0.30 0.00
aΩ 0.00 0.00 0.29 0.68 0.64 0.00 0.23 0.00
OΩ 0.00 0.00 0.32 0.43 0.67 0.02 0.22 0.00
I∂ 0.00 0.00 0.00 0.00 0.00 0.81 0.27 0.33
ε∂ 0.30 0.35 0.43 0.39 0.27 0.28 0.71 0.00
Ω∂ 0.00 0.00 0.00 0.00 0.00 0.26 0.30 0.74
Table 4.10a shows the stop stimuli presented versus the average MRS for each speaker.
In all cases the SRS was higher than the MRSs achieved by the other stop
sub-recognisors. Table 4.10b shows the three most common confusions and their associated
MRS across speakers for just the members of the stop subgroup (IASC).
Table 4.9(b) Three most common confusions across speakers for the diphthong subgroup.
     Speaker 6                    Speaker 7                    Speaker 8                    Speaker 9                    Speaker 10
aI:  eI-0.39, ε∂-0.25, ЭI-0.20  | ЭI-0.31, ε∂-0.25, eI-0.12  | ε∂-0.25, ЭI-0.20, eI-0.18  | ε∂-0.32, eI-0.20           | ε∂-0.32, eI-0.19
eI:  ε∂-0.39, ЭI-0.28, aI-0.26  | aI-0.70, ЭI-0.50, ε∂-0.25  | aI-0.68, ЭI-0.48, ε∂-0.35  | aI-0.69, ε∂-0.40, ЭI-0.20  | aI-0.66, ЭI-0.23, ε∂-0.21
ЭI:  aΩ-0.45, ε∂-0.36, aI-0.35  | aΩ-0.41, OΩ-0.35, Э-0.25   | aΩ-0.45, aI-0.41, ε∂-0.31  | aΩ-0.40, ε∂-0.30, aI/OΩ-0.22 | aΩ-0.45, ε∂-0.28, aI-0.23
aΩ:  OΩ-0.61, ЭI-0.26, ε∂-0.22  | OΩ-0.68, ЭI-0.43, ε∂-0.27  | OΩ-0.66, ε∂-0.25, ЭI-0.22  | OΩ-0.60, ε∂/ЭI-0.30        | OΩ-0.60, ЭI-0.23, ε∂-0.12
OΩ:  ЭI-0.31, ε∂-0.28, aΩ-0.22  | aΩ-0.48, ЭI-0.21, ε∂/w-0.15 | aΩ-0.42, ЭI/w-0.31        | aΩ-0.50, ЭI-0.30, ε∂/w-0.20 | ЭI-0.45, ε∂-0.26, w-0.30
I∂:  Ω∂-0.38, j-0.25, ε∂-0.22   | Ω∂-0.31, ε∂-0.24, j-0.20   | Ω∂-0.31, ε∂-0.25, j-0.21   | ε∂-0.40, Ω∂-0.30           | Ω∂-0.35, ε∂-0.23, j-0.25
ε∂:  aΩ-0.39, I∂-0.35, OΩ-0.32  | ЭI-0.51, aI-0.41, I∂-0.36  | aΩ-0.60, eI-0.36, έ-0.34   | aΩ-0.50, eI-0.50, aI/ЭI-0.40 | aΩ-0.46, OΩ-0.38, έ-0.33
Ω∂:  w-0.36, ε∂-0.28, I∂-0.25   | ε∂-0.50, w-0.20            | I∂-0.31, w-0.20, ε∂-0.16   | I∂-0.40, ε∂-0.20           | ε∂-0.38, w-0.33, I∂-0.25
Table 4.10(a) Stops confusion matrix (average values over all speakers).
p b t d k g
p 0.69 0.03 0.31 0.17 0.30 0.13
b 0.05 0.69 0.07 0.28 0.05 0.21
t 0.42 0.10 0.60 0.06 0.60 0.22
d 0.16 0.21 0.05 0.83 0.17 0.28
k 0.39 0.02 0.20 0.06 0.73 0.20
g 0.17 0.30 0.16 0.33 0.20 0.64
Table 4.10(b) Three most common confusions across speakers for the stops subgroup.
    Speaker 6                    Speaker 7                    Speaker 8                    Speaker 9                    Speaker 10
p:  k-0.36, t∫-0.28, t-0.22    | k-0.25, t-0.22, t∫-0.28    | t-0.38, dξ-0.31, k-0.28    | t-0.40, k-0.35, t∫-0.20    | t-0.44, k-0.25, t∫-0.22
b:  d-0.32, g-0.28, t∫-0.21    | g/dξ-0.15, t∫-0.13         | t∫-0.50, d-0.35, dξ-0.30   | dξ-0.50, d-0.40, t∫-0.30   | dξ-0.50, t∫-0.33, g-0.30
t:  dξ-0.70, t∫/k-0.60, p-0.41 | dξ-0.70, t∫/k-0.60, g-0.26 | dξ-0.70, t∫/k-0.60, p-0.40 | dξ-0.70, t∫/k-0.60, p-0.50 | dξ-0.70, t∫/k-0.60, p-0.54
d:  dξ-0.70, b-0.25, t∫-0.21   | dξ-0.70, g-0.36, t∫-0.27   | dξ-0.70, g-0.30, t∫-0.25   | dξ-0.70, g-0.40, t∫-0.30   | dξ-0.70, t∫-0.43, b-0.22
k:  p/t∫-0.33, g-0.25, t-0.13  | p-0.45, g-0.15, t∫-0.14    | p-0.35, t∫-0.23, t-0.15    | t-0.50, p-0.40, g-0.20     | p-0.40, g-0.33, t-0.31
g:  dξ-0.60, d-0.30, k-0.22    | dξ-0.60, b-0.45, d-0.27    | dξ-0.60, d-0.40, p-0.35    | dξ-0.70, d-0.40, t-0.35    | dξ-0.60, d-0.41, b-0.31
Table 4.11a shows the nasal stimuli presented versus the average MRS for each speaker
and across all nasal sub-recognisors. For the nasal /ŋ/, the SRS is lower than the MRS of
the nasal /m/. For the nasals /m/ and /n/, the SRS is higher than the MRS achieved by the
IASC of the other nasal sub-recognisors. Table 4.11b shows the three highest confusions
for the nasal subgroup.
Table 4.11(a) Nasals confusion matrix (average values over all speakers).
m n ŋ
m 0.86 0.21 0.18
n 0.34 0.60 0.02
ŋ 0.60 0.34 0.57
Table 4.11(b) Three highest confusions of the nasal subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
m ŋ - 0.25
n - 0.17
n - 0.30
ŋ - 0.20
ŋ - 0.23
n - 0.20
n - 0.19 n - 0.20
ŋ - 0.12
n m - 0.38 m - 0.30 m - 0.33 m - 0.40 m - 0.25
ŋ m - 0.61
n - 0.38
m - 0.60
n - 0.41
m - 0.60
n - 0.40
n - 0.28 m - 0.60
n - 0.30
Table 4.12a shows the fricative stimuli presented versus the average MRS for all speakers
and across all fricative sub-recognisors. Two members of this subgroup, the fricatives /ð/
and /z/, show SRSs that were lower than the average MRS for the other fricatives in this
subgroup. Table 4.12b shows the first three confusions with their associated MRS for the
fricative subgroup.
Table 4.12(a) Fricatives confusion matrix (average values over all speakers).
f v θ ð s z ʃ ʒ
f 0.67 0.34 0.60 0.30 0.37 0.27 0.40 0.19
v 0.34 0.73 0.28 0.36 0.28 0.30 0.22 0.27
θ 0.38 0.18 0.62 0.60 0.42 0.35 0.17 0.18
ð 0.24 0.30 0.76 0.68 0.60 0.80 0.41 0.60
s 0.60 0.24 0.31 0.70 0.71 0.60 0.39 0.33
z 0.23 0.33 0.38 0.75 0.70 0.65 0.28 0.33
ʃ 0.60 0.04 0.28 0.33 0.31 0.26 0.65 0.40
ʒ 0.14 0.28 0.31 0.70 0.27 0.28 0.38 0.66
Table 4.13a shows the affricative stimuli presented versus the average MRS for all
speakers and across all affricative sub-recognisors. Both members of this subgroup, the
affricatives /tʃ/ and /dʒ/, achieved average SRSs exceeding the MRS. Table 4.13b shows
the first three confusions with their associated MRS for the affricative subgroup.
Table 4.12(b) Three most common confusions across speakers for the fricative subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
f θ - 0.60
s - 0.48
ð - 0.42
θ - 0.60
ð - 0.45
ʒ - 0.41
θ - 0.60
ð - 0.41
z - 0.40
θ - 0.60
s/ʃ - 0.48
v/h - 0.40
θ - 0.60 ʃ - 0.52
s - 0.51
v f - 0.45
h - 0.41
s/ʒ - 0.36
θ - 0.41
s - 0.32
ð - 0.24
z - 0.35
s - 0.31
f - 0.30
ð - 0.50
f/z - 0.40
ʒ - 0.30
ð - 0.51
f/z - 0.41
ʒ - 0.30
θ ð - 0.60
s - 0.41
v/z - 0.31
θ - 0.60
f - 0.46
s - 0.37
ð - 0.64
f - 0.46
s - 0.35
ð - 0.49
f/s - 0.50
ʒ - 0.40
ð - 0.60
s - 0.49
f - 0.48
ð θ - 0.70
z - 0.80
s/ʒ - 0.60
θ - 0.65
s/ʒ - 0.60
z - 0.79
θ - 0.61
s/ʒ - 0.60
h - 0.46
z - 0.80
θ - 0.80
ʒ - 0.60
z - 0.80
θ - 0.71
s/ʒ - 0.60
s ð - 0.70
f/z - 0.60
θ - 0.48
ʒ/f - 0.60 ʃ - 0.40
ð - 0.70
θ - 0.35
f/z - 0.20
ð - 0.70
f/z - 0.60 ʃ - 0.50
ð - 0.70
f/z - 0.60 ʃ - 0.48
z ð - 0.75
s - 0.70
v/ʒ - 0.40
ð - 0.75
v - 0.46
ð - 0.75
s - 0.70
ʒ - 0.47
ð - 0.75
ʒ - 0.50
ð - 0.75
s - 0.70
ʒ - 0.48
ʃ f - 0.60
θ - 0.40
s - 0.37
f - 0.60
z - 0.41
s - 0.35
f - 0.60
ð - 0.52
ʒ - 0.46
f - 0.60
ʒ - 0.50
ð/θ/s - 0.40
f - 0.60
θ - 0.40
ʒ ð - 0.70 ʃ - 0.42
θ - 0.33
ð - 0.70
z - 0.40
s - 0.40
ð - 0.70 ʃ - 0.51
s/θ - 0.25
f - 0.70 ʃ - 0.50
v/z - 0.30
ð - 0.70
θ - 0.40
z - 0.36
Table 4.13(a) Affricatives confusion matrix (average values over all speakers).
tʃ dʒ
tʃ 0.68 0.41
dʒ 0.32 0.64
Table 4.13(b) Three main confusions for the affricative subgroup.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
tʃ t - 0.75
dʒ - 0.53
d - 0.36
t - 0.58
p - 0.23
d/dʒ - 0.21
t - 0.75
dʒ - 0.47
p - 0.35
t - 0.55
dʒ - 0.54
d - 0.40
t - 0.75
dʒ - 0.50
d - 0.35
dʒ d - 0.70
g - 0.60
t - 0.50
d - 0.70
g - 0.60
tʃ - 0.31
d - 0.70
g - 0.60
tʃ - 0.39
d - 0.70
g - 0.60
tʃ - 0.40
d - 0.70
g - 0.60
silence - 0.3
Table 4.14a shows the semivowel stimuli presented versus the average MRS for all
speakers and across all semivowel sub-recognisors. For all members of this subgroup the
SRS was always higher than the MRSs achieved by any other sub-recognisors. Table
4.14b shows the three most common confusions and their associated MRSs across
speakers for the members of the semivowel subgroup.
Table 4.14(a) Semivowels intra confusion matrix.
h r j w L
h 0.65 0.00 0.00 0.00 0.00
r 0.00 0.74 0.00 0.00 0.16
j 0.00 0.00 0.93 0.00 0.00
w 0.00 0.00 0.00 0.83 0.00
L 0.00 0.19 0.00 0.00 0.53
Table 4.14(b) Semivowels inter confusion matrix.
Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10
h θ/s - 0.31
z - 0.21
ð - 0.20
f - 0.32
θ/ʒ - 0.31
ð - 0.30
θ/ʃ - 0.36
z - 0.35
v - 0.32
f - 0.50
s/ʃ - 0.40
s - 0.50
f - 0.45 ʃ - 0.36
r a - 0.25 a - 0.18 L - 0.23
a - 0.10
a - 0.40 a - 0.25
L - 0.22
j i - 0.34
I∂- 0.31
i - 0.36
I∂ - 0.22
I- 0.12
I∂- 0.40
i - 0.23
I - 0.15
i - 0.40
I∂- 0.20
i - 0.25
I∂ - 0.23
w u - 0.31
Ω - 0.28
Ω∂ - 0.21
Ω - 0.31
OΩ - 0.25
u - 0.12
Ω∂ - 0.25
u - 0.22
Ω - 0.19
Ω - 0.40
u - 0.20
OΩ- 0.25
Ω∂ - 0.30
L r - 0.09 r - 0.23 r - 0.24 r - 0.15 r - 0.23
4.3.4 Experiment One: Conclusion
The results of this experiment showed that there was variation in recognition
performance across phones, subgroups and speakers. The descending order of average
SRS across subgroups was vowels, followed by diphthongs, semivowels, stops, nasals,
fricatives and affricatives. Table 4.15 summarises the average SRS for all speakers across
each subgroup.
Table 4.15 Average SRS across subgroups.
Subgroup Vowels Diphthongs Stops Fricatives Affricatives Nasals Semivowels
Avg. SRS 0.87 0.76 0.69 0.67 0.66 0.68 0.74
The best average SRS was for the vowel subgroup, which achieved an average over all
speakers of 0.87; the lowest was for the affricative subgroup, at 0.66. Variations were
also observed across speakers, but the general trends were consistent. Average SRS
scores for all phones across all speakers are shown in Table 4.16.
Table 4.16 Average SRS scores for all phones across all speakers.
Speaker # 6 7 8 9 10
Avg. SRS 0.70 0.75 0.78 0.85 0.71
Overall, the descending order of SRS across speakers was 9, 8, 7, 10 and 6. This
ordering was used in selecting the threshold value: in this experiment, the threshold was
chosen from the lowest SRS output. Referring to Table 4.7a, the lowest SRS (0.50)
occurred for the sub-recognisors of the phones /b/, /L/ and /tʃ/ when presented with the
input data set of speaker 6, so this value was chosen as the threshold. This choice ensures
that the syntactic knowledge estimator will select all sub-recognisors when presented
with the correct phone, as no SRS value was less than 0.5 across all speakers. The main
disadvantage of a threshold of 0.5 is that, in the worst case, all sub-recognisors need to
be checked to find the correct solution, which means longer processing time.
Three values of the minimum threshold were tested (0.5, 0.6 and 0.7). As the threshold
was increased, the number of sub-recognisors achieving an SRS above the threshold
decreased, and hence processing time decreased; as it was decreased, the number of
sub-recognisors achieving an MRS above the threshold increased. A threshold was
therefore selected that balanced adequate SRS, minimal MRS and reasonable processing
time. Evaluation of the system performance at these threshold values showed that a
threshold of 0.60 achieved reasonable results: the recognition rate was 76% and the
confusion rate was 6.6%.
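The trade-off behind this selection can be sketched as follows; the function simply counts scores above a candidate threshold, and the SRS/MRS values used are hypothetical placeholders rather than the thesis data.

```python
# Sketch of the threshold trade-off: raising the threshold prunes both
# correct sub-recognisors (SRS) and confusable ones (MRS).
# The score lists below are hypothetical illustrations, not thesis data.

def count_above(scores, threshold):
    """Count sub-recognisor scores exceeding the threshold."""
    return sum(1 for s in scores if s > threshold)

# Hypothetical self-recognition (SRS) and misrecognition (MRS) scores.
srs = [0.92, 0.70, 0.66, 0.55, 0.81]
mrs = [0.65, 0.40, 0.62, 0.30, 0.20]

for threshold in (0.5, 0.6, 0.7):
    active = count_above(srs, threshold)      # correct phones kept
    confusable = count_above(mrs, threshold)  # possible confusions kept
    print(threshold, active, confusable)
```

Raising the threshold reduces both counts, which is the balance between recognition coverage and processing time described above.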
4.4 Experiment Two: Operation of Each Sub-recognisor with the
Syntactical Knowledge
Experiment Two was designed to test the functionality of the APR when controlled by the
ACL signals, as shown in Figure 4.5. Each sub-recognisor of the APR was tested by
applying all Di(12) inputs together with the activation control signal and measuring the
output PIRi, where PIR is the phone identification response from the activated
sub-recognisor. To simplify the experiment, the threshold was not applied, because the
experiment was meant to test the ACL lines only.
This experiment verifies the operation of the APR under the control of the ACL signals
and so predicts the performance of RUST-I as an IWR system, assuming an ideal
syntactic knowledge estimator.
Figure 4.5 Block diagram of Experiment Two.
4.4.1 Input Stimuli
The stimuli data set presented to the adaptive phone recognisor in this experiment is the
same input data set as in Experiment One (Section 4.3). The ACL signals were binary
control lines.
4.4.2 Experimental Method
The block diagram of the setup for this experiment follows Figure 4.5. The inputs are
DI1(12) to DIMi(12), the 12 Mel-frequency cepstral coefficients (MFCC) for each of the
Mi frames presented to the ith sub-recognisor. The ACL signal was activated only for the
sub-recognisor of the correct phone presented at the input, so PIR indicated SRS only.
All 225 tokens (45 phones by 5 test speakers) were applied to each of the 46
sub-recognisors and the PIRi outputs obtained. The activation control lines ACLi of the
appropriate sub-recognisors were activated under pseudo-simulation conditions. Under
ideal conditions, only the activation control line of the sub-recognisor representing the
expected phone was activated; this part of the experiment simulates operation of the
syntactic knowledge estimator assuming that it correctly identifies the word pattern.
Under non-ideal conditions, the activation control lines of all sub-recognisors were
activated individually; this part simulates operation of the syntactic knowledge estimator
without any assumptions about the word pattern. The PIRi output from each
sub-recognisor was then stored in an ASCII file and placed into a graphical confusion
matrix and appropriate tables.
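Under ideal conditions, the ACL effectively acts as a mask over the sub-recognisor outputs. A minimal sketch of this gating, with hypothetical PIR values and phone labels (not the RUST-I implementation):

```python
# Sketch of ACL gating of sub-recognisor outputs: only sub-recognisors
# whose activation control line is high contribute a PIR value.
# The phone labels and PIR scores are illustrative, not thesis data.

def gate_outputs(pir, acl):
    """Return PIR values only for sub-recognisors whose ACL is active."""
    return {phone: score for phone, score in pir.items() if acl.get(phone, 0) == 1}

# Hypothetical PIR outputs of three sub-recognisors for one input token.
pir = {"b": 0.69, "d": 0.28, "g": 0.21}

# Ideal conditions: only the expected phone's ACL is activated, so all
# IASC/IRSC confusions are suppressed and only the SRS remains.
ideal_acl = {"b": 1, "d": 0, "g": 0}
print(gate_outputs(pir, ideal_acl))  # only the /b/ response survives
```

This is why the ideal-condition results below reproduce the SRS values of Experiment One with the confusions removed.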
4.4.3 Results
The responses of all sub-recognisors for all input stimuli from the five speakers (#6-10)
in the test set are identical to the values of Tables 4.7a to 4.7e. This occurred because
applying the appropriate ACL control signals to the output selector suppressed all the
IASC and IRSC confusions that occurred in Experiment One. The SRS values from
Experiment One were maintained to within two decimal places. The effect of applying
ideal ACLi signals is thus to remove confusions and misrecognitions of incorrect phones
completely. The confusion matrix of speaker 6 was chosen to represent the results of this
experiment.
Table 4.17 summarises the SRS under ideal conditions for the five speakers (#6-10). The
local recognition rate is the recognition rate for each speaker, i.e., the number of phones
with SRS > 0.60 divided by 45.
Table 4.17 Summary of SRS < 0.60 and recognition rate across all speakers.
Speaker 6 7 8 9 10
# SRS < 0.60 19 14 11 0 9
Local Recognition 57.77% 66.60% 75.50% 100% 77.8%
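As a sketch of this definition, using speaker 6's count from Table 4.17 (the table's percentages are rounded, so the last digit may differ slightly):

```python
# Local recognition rate as defined above: the number of phones with
# SRS > 0.60, divided by the 45 phones in the stimuli set.
TOTAL_PHONES = 45

def local_recognition_rate(n_below_threshold):
    """Rate (%) given the count of phones whose SRS fell below 0.60."""
    return (TOTAL_PHONES - n_below_threshold) / TOTAL_PHONES * 100

# Speaker 6 had 19 phones with SRS < 0.60 (Table 4.17).
print(round(local_recognition_rate(19), 2))  # 57.78
```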
The highest number of problematic SRSs occurred for tokens from the test set of speaker
6. The lowest number of problematic SRSs occurred for tokens from the test set of speaker
9. The local recognition rates under ideal conditions were between 57.77% and 100%.
4.4.4 Experiment Two: Conclusion
The results obtained in this experiment showed that all the sub-recognisors of the APR
responded as expected to the ACL signals. However, activating the ACL signal of a
sub-recognisor whose output is below the system threshold will not allow correct
recognition of that phone. The activation control of the APR is intended to reduce the
number of MRS responses; therefore, even if the system is completely protected against
MRS confusions, some failures are still expected wherever the minimum threshold
exceeds the SRS.
4.5 Experiment Three: Verification of the System as an IWR
Experiment Three was designed to investigate the operation of the system as an IWR,
with the syntactic knowledge estimator combined with the adaptive phone recognisor.
The aim of this experiment was to verify the word recognition efficiency of RUST-I and
analyse its performance. A comprehensive analysis of all correct and incorrect results is
provided, with reference back to the first and second experiments.
4.5.1 Input Stimuli
In this experiment, RUST-I took as input a data set of one hundred words chosen
arbitrarily from the system lexicon. The words were spoken by the same speakers as in
the test set, i.e., speakers 6, 7, 8, 9 and 10. Each word was processed and presented to the
system as a temporal sequence of vectors in the form of MFCC, DIi(12). The following is
the list of the 100 words used in this experiment:
their - the - this - these - there - three - think - thank - to - table - time - trying -
transaction - today - and - at - occur - a - away - across - arrive - ago - are - after - ask -
august - almost - or - order - agent - any - air - of - on - often - off - until - other - up - old -
over - only - out - hour - one - it - indeed - in - into - its - isn't - inside - introduce - if -
industry - he - heard - her - head - heavy - stone - school - chair - child - church - receive -
real - room - before - earth - earlier - must - market - mouth - number - noise - nature - got
- glass - give - good - gate - general - light - lamp - large - lay - year - your - you - perform -
permit - pay - do - destruction - describe - defined - discount - duty - floor
4.5.2 Experimental Method
Figure 4.6 Block diagram of the system as configured for Experiment Three.
Figure 4.6 shows the block diagram of the system configuration used in this
experiment. At the beginning, the adaptive phone recognisor outputs and the
accumulator contents were initialised to zero. All the ACLi signals and the EOW
control signal were set to inactive (low). The syntactic knowledge estimator was
reset to the top of the lexicon database whenever a new word was to be processed. The
words were presented to the system from data files stored in ASCII format, as in
previous experiments. The words were presented one at a time, with a silent period
between each pair of words long enough to ensure the silence was detected. The
silence period was set to be greater than 1 s, which was found to eliminate the
possibility of word boundary confusions.
The response of the system was examined by checking the contents of the accumulator
before the presentation of the next word in the specified word test set. The words were
presented to the system randomly and the ID output values from the accumulator were
stored in ASCII text files.
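The per-word procedure above can be sketched as a loop over phone matches that fills the accumulator and stops with '*' when a phone falls below the threshold. The (phone ID, score) pairs below are illustrative stubs, not RUST-I data:

```python
# Sketch of the Experiment Three word cycle: the accumulator is cleared
# for each word, phones are matched one at a time, and recognition stops
# with '*' when a phone's score falls below the 0.60 system threshold.
# The (phone ID, score) pairs below are illustrative, not thesis data.

THRESHOLD = 0.60

def recognise_word(phone_scores):
    """Return the accumulator ID stream for one word."""
    accumulator = []
    for phone_id, srs in phone_scores:
        if srs <= THRESHOLD:
            accumulator.append("*")  # process terminated at this phone
            break
        accumulator.append(phone_id)
    return accumulator

word_more = [(40, 0.85), (8, 0.75)]    # both phones above threshold
word_agent = [(15, 0.80), (39, 0.55)]  # second phone fails the threshold

print(recognise_word(word_more))   # [40, 8]
print(recognise_word(word_agent))  # [15, '*']
```

The terminated case corresponds to the asterisk entries that appear in the 'Actual Result' column of Table 4.20.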
4.5.3 Representation of Results
Table 4.20 contains the word recognition results of this experiment. The first column
lists the test words; the second, their phonemic equivalents; the third, the speaker test set
they were derived from; the fourth, the recognition decision; the fifth, the expected ID
codes in the accumulator if the word was correctly recognised; and the last, the actual ID
codes obtained from the accumulator at the end of the recognition cycle. The words in
the table were categorised according to the first phone in their phonemic stream, to ease
analysis of system errors related back to the performance of the syntactic knowledge
estimator.
The binary decision from the recognition process is indicated by 'U' for 'Unrecognised'
words and by 'R' for 'Recognised' words. The expected ID was derived from the
phonemic ID representation of Table 3.1. If a word was correctly recognised, the
expected accumulator ID stream is identical to the actual accumulator ID stream. Any
difference between these ID streams indicates that an error occurred in the recognition
process. If no suitable match could be found during the process for any reason, an
asterisk '*' appears in the ID stream, indicating the termination of the recognition
process.
4.5.4 Analytical Procedure
The recognition outcome for any word can be analysed by tracking the recognition
process through the syntactic knowledge estimator and applying the corresponding SRS
and MRS results for the activated phones, as outlined in Experiments One and Two.
Tables 4.7a-e provide the SRSs for each phone and Tables 4.8b to 4.13b provide the
MRSs. Any word that was unrecognised or terminated during recognition was analysed
to derive possible causes. The syntactic bubbles diagram and the syntactic database were
used to track through the syntactic knowledge estimator and hence analyse the behaviour
of the system during the recognition process.
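The tracking procedure can be sketched as a walk over a priority-ordered tree. The tree contents, IDs and scores below are illustrative (only the /d/ = 25 versus /dʒ/ = 39 priority follows the analysis in this section), not the actual syntactic database:

```python
# Sketch of tracking through a syntactic knowledge tree: each phone-ID
# prefix maps to its candidate next phones in browsing-priority order,
# and the first candidate scoring above threshold is taken.
# The tree and scores are illustrative, not the thesis database.

TREE = {
    (): [25, 39],   # front edge: /d/ (25) is checked before /dʒ/ (39)
    (25,): [3],     # hypothetical continuation after /d/
    (39,): [3],     # hypothetical continuation after /dʒ/
}

def next_phone(prefix, scores, threshold=0.60):
    """First candidate above threshold, or '*' if the process terminates."""
    for cand in TREE.get(tuple(prefix), []):
        if scores.get(cand, 0.0) > threshold:
            return cand
    return "*"

# The incoming phone may be /dʒ/, but when the confusable /d/ also
# scores above threshold and has priority, the wrong branch is taken.
print(next_phone([], {25: 0.65, 39: 0.70}))  # 25, not the expected 39
```

This priority effect is what produces the 'general' misrecognition analysed later in this section.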
4.5.5 Results
Table 4.18 summarises the results of this experiment. It contains the percentages of
words from each of the five speakers that were recognised and unrecognised. For
instance, speaker 6 contributed 17% of the total number of words used in this
experiment, and 64.71% of the words contributed by speaker 6 were recognised. Speaker
6 accounted for 15.27% of the correct word recognitions and 21.42% of the incorrect
word recognitions.
Table 4.19 shows the variation of recognition patterns for two words across all speakers.
The two words used were 'agent' and 'more'. The word 'agent' was unrecognised by the
system for four out of the five speakers. The system was able to recognise this word when
spoken only by speaker 9. In contrast, the word 'more' was recognised by the system for all
speakers.
Table 4.18 Overall results of the 100-word recognition test.
Speaker  % of Recognised Words  % of Unrecognised Words  % of Words From the Speaker
6  15.27%  21.42%  17%
7  12.50%  25%  16%
8  11.11%  28.57%  16%
9  41.66%  10.71%  33%
10  19.44%  14.28%  18%
Table 4.19 Comparison of two-word recognition results over all speakers.
Threshold = 0.60
Word Speaker # Accumulator Result
‘Agent’ 6 15-39-12-* U
7 15-* U
8 15-39-12-* U
9 15-39-12-41-24 R
10 15-* U
‘more’ 6 40-8 R
7 40-8 R
8 40-8 R
9 40-8 R
10 40-8 R
The following paragraphs track the recognition of the word 'agent' to illustrate the
tracking procedure. The word 'agent' starts with the diphthong /eI/, which is the last
phone in the list to be checked, as it is least likely to be the front edge phone. Table 4.9b
shows that there are no significant IASC or IRSC scores with any of the phones that are
checked before /eI/. The diphthong /eI/ has an SRS from 0.70 to 0.91, which exceeds any
MRS for this phone. Therefore, the diphthong /eI/ was successfully recognised for all
five speakers and the ID of /eI/ (15) was consistently found as the first phone in the
accumulator.
Subsequent IDs in the accumulator required that the syntactic knowledge estimator
branch into the phonemic subclass (Level 1), which points to two locations: the
phonemic subclass /dʒ/ = 39 and the end-of-process signal /46/. As /dʒ/ is the next phone
in the word 'agent', it is the only phone checked, and there was no possibility of further
confusion given the previous pattern of phones.
The SRSs of /dʒ/ shown in Tables 4.7a-e indicate that a match occurred between the
incoming phone /dʒ/ and the estimated sub-recognisor for the three speakers 6, 8 and 9
only, as their SRSs (0.62, 0.79 and 0.70, respectively) are greater than 0.60. The other
two test sets, from speakers 7 and 10, resulted in an error for the recognition of this
phone, as their SRSs were below 0.6 at only 0.55. As expected, the recognition process
was terminated at this point for these two speakers. The ID of the phone /dʒ/ (39) for
speakers 6, 8 and 9 was found in the second position in the accumulator. In the case of
speakers 7 and 10, the system generated an error at this point in the recognition process,
as indicated in Table 4.19 by the asterisk in the position of the second phone. Once an
error occurs, the current system configuration stops the recognition process for that
word.
The recognition process continued in the cases of speakers 6, 8 and 9. The next level
(Level 2) of the syntactic database points to the phonemic subclass /∂ = 12/ from the
previous pattern of /eI/ then /dʒ/, which is the only possibility. The sub-recognisor for
the vowel /∂/ was activated and checked for a match with the incoming data. For the
three remaining speakers (6, 8 and 9) the SRSs were from 0.92 to 1.00 without any
significant confusion, which explains the presence of the phone /∂/ ID in the
accumulator.
The recognition process again continued for speakers 6, 8 and 9. The next level (Level 3)
points to the phonemic subclass /n = 41/, the only subclass following the previous
pattern of /eI/, /dʒ/ then /∂/. The sub-recognisor for the nasal /n/ was activated and
checked for a match with the incoming data. For two of the three remaining speakers (6
and 8) the SRS scores were 0.55 and 0.58, which are below the threshold, so the
recognition process was terminated for these two speakers and the system generated an
error at this point, as shown in Table 4.19 by the asterisk in the position of the fourth
phone.
Only speaker 9 continued through the recognition process, with an SRS of 0.69. The ID
of the phone /n/ for speaker 9 was found in the fourth position in the accumulator. For
speaker 9, the subclass of the phone /n/ in Level 3 points to the phonemic subclass of the
stop /t = 24/ in Level 4 of the syntactic knowledge database; branching into this subclass
from the previous pattern of /eI/, /dʒ/, /∂/ then /n/ had only one possibility, the stop /t/.
The sub-recognisor /t/ was activated and checked for a match with the incoming data.
For the one remaining speaker (9) the SRS score for /t/ was 0.69, so it was considered
recognised, and the ID code of the phone /t/ was found in the fifth position of the
accumulator. Therefore the word 'agent' was fully recognised when spoken by speaker 9,
but not when spoken by any other speaker in the test set. The word 'more' was
recognised successfully for all speakers.
The description of the recognition process for 'more' is similar for all speakers. The
accumulator contained the ID codes 40-8 for all speakers, representing the codes for the
two phones /m = 40/ and /Э = 8/. The response from the sub-recognisor /m/ was received
and recorded in the accumulator. The phone /m/ had no MRS greater than 0.60 for any
speaker, and the SRS of /m/ for all speakers was in the range 0.79 to 0.91, greater than
any MRS. Therefore, the ID of /m/ (40) is found in the first position in the accumulator
for all speakers.
Branching into the first phonemic subclass (Level 1) from the phone /m/ resulted in ten
possibilities: the phones /tʃ/, /t/, /s/, /ŋ/, /m/, /t/, /n/, /d/, /z/ and /k/. The vowel /Э/ has no
misrecognition score greater than 0.60 for any of the phone sets. It also has an SRS in
the range 0.71 to 0.80 over all speakers, which is greater than any MRS of this phone.
Therefore, the ID of /Э/ (8) was found in the second position of the accumulator for all
speakers. Recognition results for the 100 words used in Experiment Three are presented
in Table 4.20.
Table 4.20 (part 1) Recognition results for words used in Experiment Three.
(Spk# = speaker number, D = decision, U = unrecognised, R = recognised, and * =
process stopped)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class ð = 31
their
ðε∂
6
R
31-20
31-20
the
ð∂
6
R
31-12
31-12
this
ðIs
8
R
31-1-32
31-1-32
these
ðiz
9
R
31-2-33
31-2-33
there
ðε∂
10
R
31-20
31-20
Phone Class θ = 30
three
θri
9
R
30-37-2
30-37-2
think
θIŋk
6
U
30-1-42-26
31-1-*
thank
θæŋk
8
U
30-4-42-26
31-4-*
Phone Class t = 24
to
tu
9
R
24-10
24-10
table
teIb∂L
9
R
24-15-23-12-45
24-15-23-12-45
time
taIm
6
R
24-14-40
24-14-40
trying
traIIŋ
7
R
24-37-14-1-42
24-37-14-1-42
transaction
trænzækʃ∂n
10
U
24-37-4-41-33-4-
26-34-12-41
24-37-4-41-33-4-*
today
t∂deI
8
U
24-12-25-15
*
Phone Class æ = 4
and
ænd
10
R
4-41-25
4-41-25
at
æt
9
R
4-24
4-24
Table 4-20 (part 2)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class ∂ = 12
occur
∂kέ
7
R
12-26-11
12-26-11
a
∂
8
R
12
12
away
∂'weI
7
R
12-44-15
12-44-15
across
∂krÞs
9
R
12-26-37-6-32
12-26-37-6-32
arrive
∂raIv
8
U
12-37-14-29
12-*
ago
∂gOΩ
8
U
12-27-18
12-*
Phone Class a = 5
are
a
6
R
5
5
after
aft∂
9
R
5-28-24-12
5-28-32-24-12
ask
ask
8
R
5-32-26
5-32-26
Phone Class Э = 8
august
ЭgΛst
9
R
8-27-13-32-24
8-27-13-32-24
almost
ЭLmOΩst
7
U
8-45-40-18-32-24
8-45-40-*
or
Э
7
U
8
8-*
order
Эd∂
8
U
8-25-12
8-*
Phone Class eI = 15
agent
eIdξent
9
R
15-39-12-41-24
15-39-12-41-24
Phone Class ε = 3
any
εni
10
R
3-41-2
3-41-2
air
ε∂
8
R
3-12
3-12
Phone Class Þ = 6
of
Þv
6
R
6-29
6-29
on
Þn
10
R
6-41
6-41
often
Þfen
7
U
6-28-12-41
6-*
Table 4.20 (part 3)
Word
Phone
Spk #
D
Expected Result
Actual Result
off
Þf
9
R
6-28
6-28
Phone Class Λ = 13
until
ΛntiL
7
U
13-41-24-2-45
13-*
other
Λðe
6
R
13-31-12
13-31-12
up
Λp
8
U
13-22
13-*
Phone Class OΩ = 18
old
OΩLd
9
U
18-45-25
17-*
over
OΩv∂
10
U
18-29-12
17-*
only
OΩnLi
6
U
18-41-45-2
17-*
Phone Class aΩ = 17
out
aΩt
9
R
17-24
17-24
hour
aΩ∂
6
R
17-12
17-12
Phone Class w = 44
one
wΛn
10
R
44-13-41
44-13-41
Phone Class I = 1
it
It
9
R
1-24
1-24
indeed
Indid
10
R
1-41-25-2-25
1-41-25-2-25
in
In
10
R
1-41
1-41
into Intu 9 R 1-41-24-10 1-41-24-10
its
Its
9
R
1-24-32
1-24-32
isn't
Iz∂nt
6
U
1-33-12-41-24
1-33-12-*
inside
InsaId
10
R
1-41-32-14-25
1-41-32-14-25
introduce
Intr∂djus
9
R
1-41-24-37-12-
25-43-10-32
1-41-24-37-
12-25-43-10-
32
if
If
10
R
1-28
1-28
Table 4.20 (part 4)
Word
Phone
Spk #
D
Expected Result
Actual Result
industry
IndΛstri
9
R
1-41-25-13-32-
24-37-2
1-41-25-13-
32-24-37-2
Phone Class h = 36
he
hi
7
R
36-2
36-2
heard
hέd
6
U
36-11-25
*
her
hέ
7
U
36-11
36-1
head
hεd
9
R
36-3-25
36-3-25
heavy
hεvi
10
R
36-3-29-2
36-3-29-2
Phone Class s = 32
stone
stOΩn
9
R
32-24-18-41
32-24-18-41
school
skuL
8
U
32-26-10-45
31-*
Phone Class tʃ = 38
chair
tʃε∂
10
U
38-20
24-*
child
tʃaILd
7
R
38-14-45-25
38-14-45-25
church
tʃέtʃ
9
R
38-11-38
38-11-38
Phone Class r = 37
receive
r∂siv
10
R
37-12-32-2-29
37-12-32-2-29
real
riL
9
R
37-2-45
37-2-45
room
rum
6
R
37-10-40
37-10-40
Phone Class b = 23
before
bifЭ
8
R
23-2-28-8
23-2-28-8
Phone Class έ = 11
earth
έθ
7
R
11-30
11-30
earlier
έLi∂
6
R
11-45-2-12
11-45-2-12
Table 4.20 (part 5)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class m = 40
must
mΛst
9
R
40-13-32-24
40-13-32-24
market
mak∂t
7
R
40-5-26-12-24
40-5-26-12-24
mouth
maΩθ
9
R
40-17-30
40-17-30
Phone Class n = 41
number
nΛmb∂
10
R
41-13-40-23-12
41-13-40-23-12
noise
nЭIz
9
R
41-16-33
41-16-33
nature
neItʃ∂
6
U
41-15-38-12
*
Phone Class g = 27
got
gÞt
9
R
27-6-24
27-6-24
glass
gLas
6
U
27-45-5-32
*
give
gIv
9
R
27-1-29
27-1-29
good
gΩd
10
R
27-9-25
27-9-25
gate
geIt
9
R
27-15-24
27-15-24
Phone Class dʒ = 39
general
dʒεnr∂L
9
U
39-3-41-37-12-45
25-3-41-37-*
Phone Class L = 45
light
LaIt
7
R
45 - 14 - 24
45-14-24
lamp
Læmp
8
R
45 - 4 - 40 - 22
45 - 4 - 40 - 22
large
Ladʒ
9
R
45 - 5 - 39
45-5-39
lay
LeI
7
R
45 - 15
45 - 15
Phone Class j = 43
year
jI∂
9
R
43 - 19
43-19
your
jЭ
8
R
43 - 8
43-8
you
ju
6
R
43 - 10
43-10
Table 4.20 (part 6)
Word
Phone
Spk #
D
Expected Result
Actual Result
Phone Class p = 22
perform
p∂fЭm
8
R
22-12-28-8-40
22-12-28-
8-40
permit
p∂mIt
7
R
22-12-40-1-24
22-12-40-
1-24
pay
peI
10
R
22-15
22-15
Phone Class d = 25
do
du
6
R
25-10
25-10
destruction
d∂strΛkʃ∂n
9
R
25-12-32-24-37-13-26-
34-12-41
25-12-32-
24-37-13-
26-34-12-
41
describe
d∂skraIb
10
U
25-12-32-26-37-14-23
25-12-32-*
defined
d∂faInd
8
U
25-12-28-14-41-25
25-12-28-
14-*
discount
dIskaΩnt
7
U
25-1-32-26-17-41-24
25-1-*
duty
djuti
9
U
25-43-10-24-2
25-2-*
Phone Class f = 28
floor
fLЭ
9
R
28-45-8
28-45-8
From a total of 100 words, 73 were recognised correctly; hence their expected and actual
ID codes in Table 4.20 are identical. Words were correctly recognised for one of two
reasons:
• Eight of the 73 words (10.95%) contained phones that achieved SRS values
higher than the system threshold and had no IASC or IRSC (i.e. no MRS > 0.60).
• The majority of the recognised words (65 words, or 89.05%) required the
assistance of the syntactic knowledge estimator to be correctly identified, as the
relevant sub-recognisors had confusions > 0.60 with other phones.
In the first case (10.95% of the recognised words), the ACL of the syntactic knowledge
estimator played a minor role in the recognition process, so these words would have
been adequately identified by the APR by itself. For example, the word 'on' contains two
phones, /Þ/ and /n/; neither had an MRS greater than 0.60, so the syntactic knowledge
estimator faced no competition in their recognition. The eight words in this category are:
on, of, one, he, lamp, year, your and you.
In the second case (89.05% of the recognised words), the ACL of the syntactic
knowledge estimator played a major role in the recognition process, as these words
depended on the ACLs for their recognition. For example, the word 'their' contains the
phone string /ðε∂/. The phone /ð/ has an MRS with the phone /θ/ that is higher than the
system threshold (0.60), so there was a possibility of confusion. The ACLs of the
syntactic knowledge estimator eliminated that possibility, so the first phone of the word
'their', /ð/, was recognised correctly. The syntactic knowledge estimator therefore
competed effectively in recognising these words. The words in this category are: their,
the, this, these, there, three, to, table, time, trying, and, at, occur, a, away, across, are,
after, ask, august, agent, any, air, off, other, out, hour, it, indeed, in, into, its, inside,
introduce, if, industry, head, heavy, stone, child, church, receive, real, room, before,
earth, earlier, must, market, mouth, number, noise, got, give, good, gate, light, large, lay,
perform, permit, pay, do, destruction, floor.
There were 27 words in Table 4.20 classified as unrecognised. These incorrectly
recognised words were spoken by various speakers; the unrecognition cases are
therefore described as a word-dependent problem. The following paragraphs describe
why these errors occurred, using the analytical tracking procedure previously described
for the words 'agent' and 'more'. The 27 errors were categorised into three main types of
word misrecognition:
• The first type occurred when the APR was unable to recognise the front edge phone
because the phone's SRS value was less than the system threshold (0.60) and it did
not have any significant MRS with other phones.
• The second type usually occurs at the front edge level, but can occur at lower levels.
It was observed when the syntactic knowledge estimator identified phones whose
MRS exceeded the threshold (0.60) and which were checked before the correct
phone.
• The third type occurred in levels below the front edge level, when the recognition
sequence was broken because one of the phones under processing had an SRS less
than the threshold, with the MRS unlikely to affect performance. This type of error is
related to the APR performance rather than the syntactic knowledge estimator.
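To a first approximation, the three error types can be distinguished from where the actual ID stream diverges from the expected one. A rough sketch follows; the rules simplify the analysis above and would not, for example, catch a second-type error occurring below the front edge:

```python
# Rough classifier for the three misrecognition types described above,
# based only on the first point of divergence between the expected and
# actual ID streams. The rules are a simplification of the analysis.

def error_type(expected, actual):
    """Classify an unrecognised word by its first point of divergence."""
    if actual and actual[0] == "*":
        return 1  # front edge phone below threshold, no substitution
    if actual and actual[0] != expected[0]:
        return 2  # a confusable phone was checked (and matched) first
    return 3      # sequence broke at a lower level (SRS below threshold)

# Examples drawn from Table 4.20:
print(error_type([24, 12, 25, 15], ["*"]))        # 'today'  -> 1
print(error_type([30, 1, 42, 26], [31, 1, "*"]))  # 'think'  -> 2
print(error_type([12, 37, 14, 29], [12, "*"]))    # 'arrive' -> 3
```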
It was found that 4 words, or 14.81% of the 27 misrecognised words, were due to errors
of the first type; that is, the front edge phone was unrecognised because it achieved an
SRS value less than the system threshold (0.60). These words were 'today' from speaker
8, and 'heard', 'nature' and 'glass' from speaker 6.
Ten words, or 37.03% of the 27 misrecognised words, were due to errors of the second
type. These words were 'think' from speaker 6, 'thank' from speaker 8, 'school' from
speaker 8, 'general' from speaker 9, 'old' from speaker 9, 'over' from speaker 10, 'only'
from speaker 6, 'duty' from speaker 9, 'her' from speaker 7 and 'chair' from speaker 10.
For the words ‘think’ and ‘thank’, the confusion occurred at the first level, for the phonemic
class /θ/. Both words resulted in the same incorrect ID (31) being placed in the
accumulator, which led to an unknown path in the syntactic database. By following the
error tracking procedure described above, it was found that both fricatives /θ/ and /ð/ had
SRS and MRS values higher than the system threshold (0.60). The phone /ð/ has a higher
browsing priority than the phone /θ/, so it is always checked first. The syntactic
knowledge estimator therefore followed an incorrect branch into the second level of the
syntactic database and found no match. Hence, an error message was produced.
For the word ‘general’, the expected ID string is 39-3-41-37-12-45. The resulting ID string
in the accumulator is 25-3-41-37-*. The first ID indicates that confusion occurred
between the affricative /dʒ/ and the stop /d/, both of which have an SRS and MRS greater
than the system threshold (0.60). The phone /d/ (ID 25) was recognised instead of the
phone /dʒ/ (ID 39) because /d/ has priority over /dʒ/ at the front edge level of the
syntactic database. As these two phones have similar subsequent branches into their phone
subclasses, the system continued the recognition process until the fifth level. Similar
results occurred for the other words; therefore, the message ‘Unrecognised’ was generated.
The failure in the case of the words ‘old’, ‘over’ and ‘only’ was due to the IASC between
the phones /oʊ/ and /aʊ/; the system therefore produced the incorrect ID (17) instead of
the expected ID (18).
The second error type also occurred in the word ‘duty’, but at the second level rather than
the first. The corresponding ID string was 25-2-*. The adaptive phone recognisor
confused the vowel /i/ (ID 2) with the semivowel /j/ (ID 43), and because the vowel /i/ is
checked before the semivowel /j/ in the syntactic knowledge database, the sub-recognisor
for the vowel /i/ responded. The recognition process then failed at the third level because
no match was found for /t/ after the pattern /di/.
Another type of error occurred for the word ‘her’. No error message was generated for this
word, as the recognition process appeared to finalise successfully, but the second
ID was incorrect. The ID string found in the accumulator was 36-1, whereas 36-11 was
expected. The first ID is correct, but the second ID represents the vowel /ɪ/, which is often
confused with the vowel /ɜ/. An option at the third level after /hɪ/ is silence or end of
word, so the system assumed a correct match and no error message was generated.
There were 13 words, or 48.16%, in the third category of errors. Words in this category
were: ‘transaction’, ‘arrive’, ‘ago’, ‘almost’, ‘or’, ‘order’, ‘often’, ‘until’, ‘up’, ‘isn’t’,
‘describe’, ‘define’ and ‘discount’. For example, the word ‘arrive’ /əraɪv/, spoken by
speaker 8, passed the recognition process at the front edge level for the phone /ə/, but the
recognition process terminated at level 1 for the phone /r/, as this phone had an SRS less
than the system threshold and did not have any MRS with any preceding phone in level 1
of the syntactic database.
4.5.6 Experiment Three: Conclusion
This experiment showed that 73% of the set of words were correctly identified using
RUST-I. The overall performance of the system as an IWR was found to depend on
the performances of both the APR and the syntactic knowledge estimator. The APR
determined the ability of RUST-I to recognise the correct phone from its ACL signals.
This ability was affected by the value of the threshold, set for this experiment to 0.60.
The syntactic knowledge estimator defined the most likely order of phones occurring first
in a word, and also the most likely phone following a given pattern of phones. The
syntactic database of the syntactic knowledge estimator was defined by the method of
clustering the phonemic data, which in RUST-I was designed to originate from the most
likely first phone in a word and then follow the most likely phone given a pattern of
phones.
At the front edge of the recognition, i.e. the first phone in a word, the syntactic knowledge
estimator defines the order of likelihood as the statistical likelihood of a phone being first
in a word, which is a function of the database. The recognition rate at this level increases
with the size and applicability of the database; the syntactic knowledge estimator is not
operating optimally at this level. Once the syntactic knowledge estimator has correctly
identified the phone at this level, its usefulness comes into full effect, as it then has a
predefined set of phones, ranked in order of likelihood, through which it must browse,
including the 'correct' phone. In many cases misrecognition did not occur because phones
whose PIR scores exceeded the threshold were either not in the list to be checked or were
to be checked after the 'correct' phone. The deeper the system delves into the recognition
process, the fewer the options available for checking and the greater the likelihood of
recognition.
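The ranked browsing described above can be sketched in Python; the names and scores below are hypothetical illustrations, not taken from RUST:

```python
# Hypothetical sketch of ranked browsing with the 0.60 system threshold;
# names and scores are illustrative, not taken from RUST.

THRESHOLD = 0.60  # system threshold used in Experiment Three

def browse(candidates, scores, threshold=THRESHOLD):
    """Return the first candidate phone, in likelihood order, whose
    recognition score reaches the threshold, or None if the path breaks."""
    for phone in candidates:                 # candidates are pre-ranked
        if scores.get(phone, 0.0) >= threshold:
            return phone
    return None

# The 'think'/'thank' confusion: /dh/ is ranked before /th/ and both
# exceed the threshold, so the wrong branch is followed first.
ranked = ["dh", "th"]
scores = {"dh": 0.72, "th": 0.81}
assert browse(ranked, scores) == "dh"
```

This reproduces the second error type: a higher-priority phone above threshold is checked before the correct one.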
Table 4.21 Summary of error types.

              1st Type   2nd Type   3rd Type   Total
# of words        4         10         13        27
Error rate     14.81%     37.03%     48.16%     100%
As shown in Table 4.21, three types of errors occurred. The first type of error was due to
the low SRS value of the front edge phone; it could be reduced by improving the
performance of the APR and hence the SRS values. The second type of error was due to
some 'incorrect' phones having a higher MRS than the SRS achieved by the correct phone.
These two types of error could be reduced by improving the performance of the APR
and/or improving the mechanism for selecting and using the threshold in the recognition
process. Together they were responsible for 51.84% of the errors produced by the system.
The third category of error involved an SRS below the threshold at lower levels, which
caused the recognition sequence to be terminated at some stage in the process. This
type of failure was responsible for 48.16% of the total errors.
Chapter 5: Implementation of Incremental Learning Neural
Networks (RUST-II)
5.0 Introduction
This chapter deals with the development of RUST-I to incorporate incremental
learning into the standard back-propagation network used so far. Adding incremental
learning to the standard back-propagation neural network of the APR (of RUST-II) is an
attempt to investigate the performance of incremental learning for speech recognition. It
will be shown that incremental learning improves the system's capability and
performance. It is also expected that the system will be able to adapt more readily to new
speech input without the need to run additional training sessions.
In this chapter an incremental learning algorithm is presented and tested based on
a modified version of the previous APR. To allow a fair performance comparison, the
standard speech database TIMIT has been used, and some minor changes were made to
the structure of the input representation to the adaptive phone recognisor and to the
syntactical knowledge.
Section 5.1 presents the TIMIT speech data file structure, corpus selection, speech
segmentation, feature extraction and input data preparation. Section 5.2 describes the
modifications made to the APR structure to fit the new speech database and the new
speech feature vector, and details the incorporation of incremental learning in the
back-propagation network, including the weight selection method of the MLWA. The
experimental procedure and the results are presented in Section 5.3. Discussion of the
new APR experiments is given in Section 5.4, and Section 5.5 concludes the chapter.
5.1 Speech Corpus
5.1.0 Background
RUST-I was built around a non-standard speech database. The use of a non-standard
speech database had side effects on the system, particularly on its reliability and
performance. Among these were the limited pool of speakers, speaker factors, the number
of speakers, the number of intakes from each speaker and speaker dialect. In particular,
the number of intakes from each speaker and the number of available speakers limited
the system's functionality, making it appear to perform multi-speaker recognition rather
than speaker-independent recognition. This can be seen from the results of Experiment
Two in Section 4.4, where the system achieved better SRS results for speaker 9 compared
with the results achieved by the other four speakers of the test set (see Table 4.12).
Among the many speech databases available for speech processing research, the TIMIT
speech corpus was chosen as a standard speech database for its wide variety of speakers,
dialects, genders, vocabularies and sentences.
5.1.1 TIMIT Database
TIMIT provides speech data for the acquisition of acoustic-phonetic knowledge. There
are 6300 sentences in TIMIT, spoken by 630 male and female speakers from 8 major
dialect regions of the United States. The dialect region refers to the geographical area
where the speaker lived during their childhood years.
The text material in TIMIT contains 2 sentences designed to reveal the identity of
the dialect, 450 phonetically compact sentences and 1890 phonetically diverse
sentences. Additional information can be found in the printed documentation that
accompanies the database CD.
5.1.2 Corpus Selection
25 speakers from TIMIT were chosen to form the core training and testing set for the
system: 5 females and 20 males. All contributing speakers were drawn from three of the
main dialect regions of American English. 3 of them are from the dialect region of
New England (referred to as DR1); 19 speakers are from the western region (referred to
as DR7), where the dialect boundaries are not known with any confidence (TIMIT
documentation); and the remaining speakers belong to the dialect region DR8, comprising
speakers who moved around a lot during their childhood. This coverage of dialects
ensures a wider diversity of the phone patterns introduced to the system. Table 5.1
summarises information on the chosen speakers and their contribution to the system
lexicon, syntactic knowledge and language model.
Table 5.1 Abstracted information on the chosen speakers.

Dialect Region   Number of speakers   Number of sentences   Number of words   Number of phones
DR1                      3                   30                  247                933
DR7                     19                   31                  284               1057
DR8                      3                   14                  106                450
Total                   25                   75                  637               2440
15 of the chosen sentences contain repeated utterances of the 2 distinctive sentences
referred to in TIMIT as "shibboleth" sentences, which are designed to reflect the dialect
of the speaker. Many of the speakers involved in this system were chosen to produce
these two sentences, in order to reveal the colour of the speaker's dialect for building up
accent knowledge. 35 sentences of the set are, according to TIMIT, phonetically compact,
in that they were designed to provide coverage of pairs of phones with extra occurrences
of phonetic contexts. The last 25 sentences are classified as phonetically diverse (TIMIT
documentation); they are meant to add diversity in sentence types and phonetic contexts
so as to maximise the variety of allophonic contexts.
As TIMIT is acquired from American speakers, the phonemic set used in Table 3.1 is no
longer valid, and the system was updated to accommodate the phonemic and phonetic
symbols used in the TIMIT lexicon. These include two stress marks and the closure
intervals of stops, which are distinguished from the stop release by adding 'cl' to the stop
symbol, e.g. the stop / t / has the closure phone / tcl /. By testing these phones
perceptually, it was found that some of the closures were temporally too short; therefore
many of them were integrated with the original stop to form a whole phone segment to
be introduced to the system. Some phones are dependent on the speaker, dialect,
speaking rate and phonemic context. These phones had a lower number of occurrences
and therefore a lower number of samples in the system lexicon. They are:
• Flap / dx / such as in the word “dirty”.
• Nasal flap / nx / as in “winner”.
• Glottal stop / q /, which may be an allophone of / t /, or may mark an initial vowel or
a vowel-vowel boundary.
• Fronted / u /, i.e. / ux /.
• Very short devoiced vowel / ax-h /, typically occurring in reduced vowels
surrounded by voiceless consonants.
• Other symbols include two types of silence: / pau / (pause) and / epi /, denoting
epenthetic silence, which is often found between a fricative and a semivowel or
nasal. / h# / is used to mark the silence and/or non-speech events found at the
beginning and end of the signal.
TIMIT is a large database; therefore, when searching for a specific piece of data for a
quick match or extraction, it was more convenient to produce a search engine to carry
out the search accurately and efficiently. The search engine was coded in C++, but kept
simple to migrate by relying on many features of the C language. The program offers
three search options:
1. Speaker details inquiry.
2. Speaker-dependent search.
3. Lexical-dependent search.
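The three search options can be illustrated with a minimal Python sketch; the data layout, speaker IDs and function names are assumptions for illustration only (the thesis tool was written in C++/C):

```python
# Illustrative sketch of the three search options; the data layout,
# speaker IDs and function names are assumptions, not the thesis code.

corpus = {
    "spk01": {"dialect": "DR1", "sex": "F",
              "sentences": {"sa1": ["she", "had", "your", "dark", "suit"]}},
}

def speaker_details(spk):                    # option 1: speaker details inquiry
    return corpus[spk]["dialect"], corpus[spk]["sex"]

def speaker_sentences(spk):                  # option 2: speaker-dependent search
    return sorted(corpus[spk]["sentences"])

def find_word(word):                         # option 3: lexical-dependent search
    return sorted(spk for spk, rec in corpus.items()
                  if any(word in ws for ws in rec["sentences"].values()))

assert find_word("dark") == ["spk01"]
```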
Table 5.2 Updated phonemic symbol code.

Phone type     Phone symbol (numeric representation)
Vowels         iy (1), ih (2), eh (3), ey (4), ae (5), aa (6), aw (7), ay (8), ah (9),
               ao (10), oy (11), ow (12), uh (13), uw (14), ux (15), er (16), ax (17),
               ix (18), axr (19), ax-h (20)
Semivowels     l (21), r (22), w (23), y (24), hh (25), hv (26), el (27)
Nasals         m (28), n (29), ng (30), em (31), en (32), eng (33), nx (34)
Fricatives     s (34), sh (35), z (36), zh (37), f (38), th (39), v (40), dh (41)
Affricatives   ch (42), jh (43)
Stops          b (44), d (45), g (46), p (47), t (48), k (49), q (50), dx (51)
Silence        pau (52), epi (53), h# (54)
5.1.3 Phone Segmentation and Feature Extraction
Speech data provided by TIMIT is recorded in .wav files in the SPHERE-headed format.
To be able to process the waveform files using MATLAB® they must be in Windows®
.wav format. Therefore, the SPHERE files were converted to WAV format, which also
made them playable in Windows Media Player for the perceptual tests. A MATLAB
script was written for this purpose and run successfully, and all the selected data files
were converted.
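The conversion step can be sketched as below, assuming the usual TIMIT layout of a fixed 1024-byte NIST_1A ASCII header followed by 16-bit mono PCM at 16 kHz. The thesis used a MATLAB script, so this Python version is purely illustrative and omits header parsing and error handling:

```python
# Illustrative SPHERE -> WAV conversion, assuming the usual TIMIT layout:
# a fixed 1024-byte NIST_1A ASCII header followed by 16-bit mono PCM at
# 16 kHz. The thesis used a MATLAB script; error handling is omitted here.
import wave

def sphere_to_wav(sph_bytes, wav_path, rate=16000):
    assert sph_bytes.startswith(b"NIST_1A"), "not a SPHERE file"
    pcm = sph_bytes[1024:]           # skip the fixed-size ASCII header
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)            # TIMIT is mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
```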
TIMIT provides phone boundary information, and phone segmentation was first
performed based on this information. The boundary between some phones in the samples
is not distinctive from the signal point of view, leading to overlapping periods around the
boundaries. Hence, a second stage called phonemic amalgamation was applied to some
phones, taking one phone set as a unique cluster within a larger phone set to create a
larger learning set. For example, the sets / b / and / bcl / were merged to produce a new
learning set called / b /. This is aimed at achieving a wider variety of forms of the
particular phone in the phonemic knowledge. The amalgamation process resulted in more
learning sessions to run and more complex work to be performed on the MLWA side, but
it was rewarding in reducing the overall size of the APR. (The number of distinctive
phones in the phonemic knowledge was chosen to be 54.)
A MATLAB script was developed to perform automatic phonemic segmentation along
with the phonemic amalgamation for each sentence. A total of 2440 samples (of 51
phones) were extracted and saved individually in text files. The segmentation process
was completed based on the phone boundary information provided by TIMIT. Once
extracted, each phone was subjected to an individual perceptual test to verify its identity.
Samples of the same phone vary with their occurrences (in sentences) and with the
speakers' genders and dialects.
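The two-stage procedure above (boundary-based segmentation followed by closure amalgamation) can be sketched as follows; the 'start end label' line format follows TIMIT .phn transcription files, while the function name and the merge rule shown are illustrative:

```python
# Sketch of boundary-based segmentation with closure amalgamation: a closure
# such as /bcl/ is merged with the following release /b/ into one segment.
# The 'start end label' format follows TIMIT .phn files; the function name
# and the merge table are illustrative.

CLOSURES = {"bcl": "b", "dcl": "d", "gcl": "g",
            "pcl": "p", "tcl": "t", "kcl": "k"}

def segment(phn_lines):
    """Parse 'start end label' lines and merge closure+release pairs."""
    segs = [(int(s), int(e), lab)
            for s, e, lab in (line.split() for line in phn_lines)]
    merged, i = [], 0
    while i < len(segs):
        s, e, lab = segs[i]
        if lab in CLOSURES and i + 1 < len(segs) and segs[i + 1][2] == CLOSURES[lab]:
            merged.append((s, segs[i + 1][1], CLOSURES[lab]))  # one whole stop
            i += 2
        else:
            merged.append((s, e, lab))
            i += 1
    return merged

assert segment(["0 100 bcl", "100 180 b"]) == [(0, 180, "b")]
```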
As in the previous version of the system, the speech features used as inputs to the neural
networks were the Mel-scale spectrum coefficients discussed in Chapter 2. In this
chapter, the number of coefficients was increased from 12 to 17 to improve the accuracy
of the Mel filters. To simplify the MFCC vector while not losing any information, the
sample number at each of the peaks in the spectrum was used to represent each MFCC
coefficient. The aim is to produce a meaningful representation of information related to
the vocal tract. The filter model of the vocal tract provides such information in its
response shape and transfer function. Cepstrum analysis provides such a representation
near its origin; the MFCCs therefore provide this critical information by smoothing the
spectrum envelope and revealing the first four formants of the signal spectrum.
All the segmented data were saved in text files and passed to the MATLAB feature
extraction scripts, where the speech data were preconditioned and cepstrally analysed,
and the MFCCs were produced and saved in text files. Figure 5.1 shows an example for
the phone / s /, where the feature extraction script displays the phone under processing in
various domains, showing graphically the sequence of the feature extraction process.
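A minimal MFCC computation of the kind described can be sketched in Python with NumPy; the exact windowing, filter shapes and peak-based coefficient selection used in the thesis' MATLAB scripts are not reproduced, so this is a generic 17-coefficient sketch:

```python
# Generic 17-coefficient MFCC sketch (Hamming window, triangular mel
# filterbank, log energies, DCT-II). Parameter choices are illustrative
# and do not reproduce the thesis' MATLAB implementation.
import numpy as np

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, rate=16000, n_filters=17, n_fft=512):
    """Return n_filters MFCCs for one pre-segmented phone frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    # triangular mel filterbank between 0 Hz and the Nyquist frequency
    pts = imel(np.linspace(mel(0.0), mel(rate / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / rate).astype(int)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, hi):
            w = (k - lo) / max(c - lo, 1) if k < c else (hi - k) / max(hi - c, 1)
            energies[i] += w * spec[k]
    logE = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    return np.array([np.sum(logE * np.cos(np.pi * q * (2 * n + 1) /
                                          (2 * n_filters)))
                     for q in range(n_filters)])
```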
5.1.4 Preparation of Data for Neural Networks Input
To prepare the .mfc file data, some processing procedures are needed. Firstly, the data
in the .mfc file have to be organised in an n x m matrix, where each of the n rows
contains m = 17 columns, which are the MFCC elements of the particular phone. By
observing the data resulting from the data extraction script, it was found that numerous
numerical values were significantly larger than 1. Because the output of the network is a
decision represented as a numerical value in the range 0 to 1, the input to the networks
had to be normalised to avoid constant saturation at the network output. This process was
carried out over all the input data.
In addition, the input contained many negative values, which usually produce incorrect
misfiring cases in the network. Therefore a process of mirror reflection was carried out
over all the input data vectors in order to promote the firing of the network's PEs.
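One possible reading of these two preconditioning steps is sketched below; the interpretation of "mirror reflection" as reflecting negative values about zero is an assumption:

```python
# One possible reading of the preconditioning: scale so |x| <= 1 to avoid
# output saturation, then reflect negative values about zero ('mirror
# reflection') to promote firing. This interpretation is an assumption.
import numpy as np

def precondition(X):
    X = np.asarray(X, dtype=float)
    X = X / np.max(np.abs(X))    # normalise into [-1, 1]
    return np.abs(X)             # mirror-reflect negatives to positives

M = precondition([[3.0, -6.0], [1.5, 0.0]])
assert M.max() <= 1.0 and M.min() >= 0.0
```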
Figure 5.1 Feature extraction from the phone /s/.
The processed data were all saved in files identified by their phone contents, where each
input data file contains all the data produced for that particular phone. These files were
ready to be presented to the standard neural networks for training and testing, and also to
the network with the incremental learning method for experiment.
5.2 Modification of the APR to Include Incremental Learning Neural
Networks
Incremental learning is suitable for speech recognition because the signal changes from
speaker to speaker and from time to time, even for the same speaker. Although
incremental learning has been applied to speech enhancement (Deng et al. 2003), very
little research has been reported in the literature on the use of the incremental learning
technique for speech recognition.
In this section, we propose to implement a feed-forward incremental learning algorithm
(Darjazini, Cheng and Liyana-pathirana, 2006) based on the hybrid knowledge method
developed by Darjazini and Tibbitts (1994). This approach is novel in that it develops
and applies a modified method of the incremental learning algorithm to the problem of
speech recognition. Previously, incremental learning was mostly designed and tested for
pattern recognition problems (Chakraborty and Pal, 2003; Polikar et al. 2001; Vo 1994;
Wang and Yuwono, 1996).
It was shown in Figures 3.12 and 3.13 (Chapter 3) that the APR is based on a comb of
feed-forward neural networks with a back-propagation learning algorithm. Each of these
neural nets is referred to as a sub-recognisor, and each is specialised in the recognition of
an individual phone. The incremental learning algorithm picks up new information from
unknown input data and uses it to adapt the sub-recognisor to new changes in the input
without further training.
5.2.0 Weight Selection Algorithm
The weight-selection algorithm is based on a method for speech recognition that employs
a comb of phone sub-recognisors (Darjazini and Tibbitts 1994). As shown in Figure 5.2,
the method employs 55 sub-recognisors: 54 for the recognition of the 54 phones and one
dedicated to the silence period. All the sub-recognisors are implemented using an
identical feed-forward neural network (FF-NN) structure. Each sub-recognisor has an
output referred to as the Phone Identification Response (PIR), a continuous variable
between 0 and 1.
A sub-recognisor indicates that the input speech contains a specific phone when the
value of its PIR is close to 1. The tolerance in the network is set to 0.05; therefore, each
PIR with a value greater than or equal to 0.95 is taken as an indication of a potential
match.
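The comb decision can be sketched as follows, using the 0.95 PIR criterion stated above; the function name and the toy PIR values are illustrative:

```python
# Sketch of the comb decision: each sub-recognisor emits a PIR in [0, 1]
# and a phone is proposed when PIR >= 0.95 (tolerance 0.05). The function
# name and the example PIR values are illustrative.

PIR_MATCH = 0.95

def comb_decision(pirs):
    """pirs: dict phone -> PIR. Return the best matching phone, or None."""
    matches = {p: v for p, v in pirs.items() if v >= PIR_MATCH}
    return max(matches, key=matches.get) if matches else None

assert comb_decision({"s": 0.97, "sh": 0.91, "z": 0.96}) == "s"
assert comb_decision({"s": 0.90}) is None
```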
141
Figure 5.2 The modified structure of the APR.
In contrast with the previous back-propagation learning algorithm, the incremental
learning algorithm extracts a new weight matrix (WM) from a new data set during
recognition. In this algorithm, the updated sub-recognisor contains two phases of
back-propagation instead of one. At the initial run, the network behaves as a normal
back-propagation network, the same as the previous sub-recognisor. At subsequent runs
the network performs the incremental learning process: it first runs using the previous
weight matrix, and a measure is applied at the output. If the error is greater than the
maximum allowed error, the process is terminated and the phone flagged as
unrecognised. If the error is less than the maximum allowed error but the output is lower
than the minimum acceptable output, the incremental learning phase starts. The goal of
the incremental learning phase is to achieve an acceptable value at the output layer by
adjusting the weight matrix using an adaptive learning rate. When this is achieved, the
new weight matrix is saved and sent to the MLWA (see Figure 5.3) for later reference.
This procedure can be outlined as follows:
1. Initial run: normal back-propagation learning algorithm.
2. Subsequent runs: input presented.
3. Previous weight matrix used: if the output is acceptable, then the recognition flag is
set and the process stops.
4. Else, if the resulting error ≥ the maximum allowed error, then a mis-recognition
message is flagged.
5. Else, if the resulting error ≤ the maximum allowed error and the resulting error ≥ the
minimum allowed error, then the error is back-propagated locally at the output layer
and globally over the net to adjust the weights, making use of an adaptive learning
rate.
6. When convergence is achieved, the new weight matrix is saved and sent to the
MLWA.
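Steps 1-6 reduce to a small piece of control flow. In the sketch below the threshold values are illustrative stand-ins (not the tuned system values), and the function name is hypothetical:

```python
# Control-flow sketch of steps 1-6. Thresholds are illustrative stand-ins,
# not the tuned system values; classify_step is a hypothetical name.

MAX_ERROR = 0.20   # maximum allowed error
MIN_ERROR = 0.01   # errors below this are accepted outright

def classify_step(error, max_err=MAX_ERROR, min_err=MIN_ERROR):
    """Decide what to do with the error measured at the output (step 3)."""
    if error < min_err:
        return "recognised"      # output acceptable: set flag, stop (step 3)
    if error >= max_err:
        return "unrecognised"    # flag a mis-recognition message (step 4)
    return "incremental"         # back-propagate and adapt weights (step 5)

assert classify_step(0.005) == "recognised"
assert classify_step(0.50) == "unrecognised"
assert classify_step(0.10) == "incremental"
```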
In subsequent recognition, the new set, as well as all the existing sets, is tested as a
potential weight matrix candidate in the FF-NN. The weight matrix that produces the
highest value of PIR is selected as the most recent updated weight matrix. This function
is performed by the Most Likelihood Weight Activator (MLWA) unit, shown in
Figure 5.3.
Figure 5.3 Selection of the weight matrix for incremental learning.
In Figure 5.3, WM1 is obtained from the initial training session, i.e. in the early stages of
incremental learning. Subsequent WMs, along with WM1, are kept in statistical order in
the MLWA, and the most probable WM is the weight matrix that is used most often; the
other sets come into use more frequently later on.
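The MLWA selection can be sketched as an argmax over the candidate weight matrices; `run_net` and the toy one-weight "network" below are stand-ins for the FF-NN:

```python
# Sketch of the MLWA: every stored weight matrix is tried in the FF-NN and
# the one yielding the highest PIR is selected. run_net and the toy
# dot-product 'network' are stand-ins for the real sub-recognisor.

def mlwa_select(weight_matrices, run_net, x):
    """Return (index, pir) of the weight matrix giving the largest PIR."""
    pirs = [run_net(wm, x) for wm in weight_matrices]
    best = max(range(len(pirs)), key=pirs.__getitem__)
    return best, pirs[best]

# toy stand-in: 'network' output is a dot product squashed into (0, 1)
squash = lambda v: 1.0 / (1.0 + 2.718281828459045 ** (-v))
run = lambda wm, x: squash(sum(w * xi for w, xi in zip(wm, x)))
idx, pir = mlwa_select([[0.1, 0.1], [2.0, 2.0]], run, [1.0, 1.0])
assert idx == 1
```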
The original sub-recognisor was also adjusted to fit the new TIMIT speech data, and the
dimension of the feature vector extracted from the speech samples was therefore updated
to 17 (the elements of the MFCC). Figure 5.4 shows the new multi-layer neural network
topology of the sub-recognisor. The input layer contains 17 processing elements (PEs)
used to receive the 17 input elements, which represent the Mel-scale frequency
coefficients (MFCC) of the corresponding phone. In this structure, the input layer acts as
a buffer to the subsequent hidden layers. There are three hidden layers, H1, H2 and H3,
containing 34, 51 and 34 PEs respectively. The output layer contains one PE representing
a measure of the match between the input speech (stimulus) and a particular phone.
Figure 5.4 Structure of new sub-recognisor.
In the first, feed-forward phase, all the current output values are computed and the final
output is compared with the target value. At this point the network performs learning
using a constant learning rate. In the backward phase, the error is propagated through the
network and the weights are adjusted; the new weight of the output layer is computed
such that the change in the weight accelerates the convergence towards the lowest
possible error.
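The 17-34-51-34-1 topology can be sketched as a plain NumPy forward pass; the random weights and the choice of a sigmoid at every layer are assumptions for illustration:

```python
# NumPy sketch of the 17-34-51-34-1 sub-recognisor. Random weights and
# the use of a sigmoid at every layer are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
sizes = [17, 34, 51, 34, 1]                      # input, H1, H2, H3, output
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Propagate one 17-element MFCC vector; return a scalar PIR in (0, 1)."""
    a = np.asarray(x, dtype=float)               # input layer acts as a buffer
    for W in weights:
        a = sigmoid(a @ W)
    return float(a[0])

pir = forward(np.ones(17))
assert 0.0 < pir < 1.0
```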
At the output layer, the adjustment of the output PE weight can be formulated as
follows:

ΔW_{3.out} = −η_{3.out} ∂ε²/∂W_{3.out}
           = −η_{3.out} (∂ε²/∂φ_{out}) (∂φ_{out}/∂I_{3.out}) (∂I_{3.out}/∂W_{3.out})
           = η_{3.out} · 2(T − φ_{out}) φ_{out} (1 − φ_{out}) φ_{3j}
           = η_{3.out} δ_{3.out} φ_{3j}                                      (5.3)
where j is the order of the PE in the third hidden layer, φ_{3j} is the weighted input of the
jth PE, and δ_{3.out} = 2(T − φ_{out}) φ_{out} (1 − φ_{out}). Therefore, the new weight at
the output layer can be determined from

W_{3.out}(N+1) = W_{3.out}(N) + η_{3.out} δ_{3.out} φ_{3j}                   (5.4)

where W_{3.out}(N+1) and W_{3.out}(N) are the weight vectors in the (N+1)th and Nth
iterations and η_{3.out} is the fixed learning rate used in the first phase.
After achieving final convergence in the first phase, the network performs the
incremental learning phase for any subsequent input. In this stage, the learning rate
η_{3.out} in (5.4) is made adaptive to achieve fast convergence. It increases if the
successive changes in the weight are in the same direction and have a positive value, and
decreases otherwise. This adaptation ensures that the largest decrease in error is obtained
in each iteration. The learning rate adaptation is formulated as:

η_{N+1} = η_N + Δ(ε²)/ΔW                                                     (5.5)

where η_{N+1} and η_N are the learning rates in the (N+1)th and Nth iterations.
By substituting η_{3.out} with η_{N+1} from (5.5) into (5.4), the weight adjustment for the
incremental learning at the output end can be formulated as

W_{3.out}(N+1) = W_{3.out}(N) + (η_N + Δ(ε²)/ΔW) δ_{3.out} φ_{3j}            (5.6)
5.3 Experiment and Results
The input data were extracted from 75 spoken sentences of the TIMIT speech database,
as described in Section 5.1. The sentences were spoken by 25 speakers (5 female and 20
male). Each speaker possesses one of three main dialects of American English, and the
dialects were chosen arbitrarily. The data were mixed to produce as much variety as
possible for every phone, so that each sub-recognisor was exposed to, and had to deal
with, the most varied forms of the same phone. In the primitive representation of the
input data, 54 distinct phones appeared in 2440 samples, which were segmented from
637 words. Table 5.3 shows these phones and their number of occurrences in the
sentences.
Table 5.3 Phone set used in the learning session and the number of samples of each.

Phone  Samples   Phone  Samples   Phone  Samples   Phone  Samples
ch       12      eI       18      q        64      eh       57
jh       15      hh       15      t        85      er       37
dh       48      hv       24      tcl      22      ey       46
f        33      L        82      aa       64      ih       91
s       126      r        87      ae       69      ix      136
sh       38      w        43      ah       34      iy      112
th       10      y        24      ao       41      ow       38
v        40      b        43      aw       15      oy       11
z        42      bcl       3      ax       75      uh        9
em        3      d        59      axh       7      uw        6
en       15      dcl      13      axr      41      ux       28
eng       1      dx       44      ay       31
m        73      g        23
n       137      gcl       3
ng       23      k        87
nx       13      kcl      13
epi      21      p        51
h#        1
pau      22
Experiments were performed by first initiating (first run) the sub-recognisors using
the back-propagation learning algorithm and applying the Delta rule. The exit condition
of this session was the number of iterations, which was set at 500 (as, at the beginning,
the number of iterations required to achieve convergence was unknown), and the
learning rates at the hidden layers and the output layer were all initialised to 0.5. The
weights were initialised to random, normally distributed values, and the learning set
contained non-clustered stimuli. The maximum accepted error was 0.01 and the
incremental learning width was 0.2, i.e. the range was from 0.97 to 0.77.
The initial session provides the first weight matrix (WM1) for the MLWA and
determines the first cluster of the input data. The number of phone samples for the initial
session was in this case 15. In this trial, the sub-recognisor converged towards the
target (1) after 50 epochs. In each epoch, the network manipulated the inner weights of
the hidden layers. An error monitor was set to measure the value of the Mean Squared
Error (MSE) at each hidden layer and at the output, to monitor the effect of a particular
PE's performance on the overall result of the entire network. The accuracy of the PIR
was within an error value of 0.01, which is below the tolerance value of 0.05. The overall
performance on the initial learning set was 94.44% accuracy.
Figure 5.5 illustrates the performance of the sub-recognisor in the initial session: Figure
5.5A shows the mean square error (MSE) graph and Figure 5.5B shows the PIR values at
the end of the initial session. It can be noted that the network converged successfully
within a short time, measured at about 50 epochs. The rest of the samples were presented
to the network in the incremental learning stage, where the performance was close to
perfect at 99.20%. The failed cases were samples that produced PIR values outside the
incremental learning range.
Figure 5.5 The Sub-recognisor performance in the initial session.
5.4. Discussion
In the initiation session, some of the phone sets required up to 13 trials to achieve
convergence. This was partially due to the wide range of phone types used in the input
data; the diversity of the input data resulted in wide distances between some of the
stimuli presented to the network.
Figure 5.6 illustrates two example trials on the phone / s /. Figure 5.6(a) shows the
MSE and the PIR for one of the non-converged trials. When this occurred, the trial was
restarted (based on a new randomly generated weight matrix), and eventually
convergence was achieved, as shown in Figure 5.6(b).
(a)
Figure 5.6 Recognition experiments of the phone /s/.
(b)
Figure 5.6 Recognition experiments of the phone /s/.
5.5 Conclusion
The proposed incremental learning algorithm allows the phonemic knowledge of the
APR of RUST-II and its sub-recognisors to be updated without the system losing its
original phonemic knowledge or suffering from catastrophic forgetting. RUST-II has
demonstrated excellent performance. It is worth noting that the incremental learning
range is a critical parameter for the system performance and has to be predetermined. A
wrong range could result in false recognition of phonemically adjacent phones, and may
lead to a situation of catastrophic forgetting.
Experiments under similar conditions on the two versions of the system showed that
RUST achieved a significant improvement in performance: the earlier version achieved
an accuracy of 76%, and that accuracy was speaker dependent (see Table 4.13), whereas
the recognition accuracy improved significantly to 94.44% in the incremental learning
version.
New syntactical knowledge can be obtained from the TIMIT database. The use of such
knowledge is known to improve the performance of speech recognition. Its incorporation
into the system developed in this chapter has not been addressed, due to time constraints,
and can be a topic of future research.
Chapter 6: Conclusion and Future Work
6.1 Conclusion
In this thesis, a hybrid Speech Recognition (SR) system called RUST (Recognition Using
Syntactical Tree) was developed. The system combined Artificial Neural Networks (ANN)
with a Statistical Knowledge Source (SKS) for a small topic-focused database.
RUST has the capacity to implement two basic levels of speech knowledge represented
statistically. The first is phonemic knowledge, in the form of the likelihood of occurrence of
phones in words. The second is primary syntactic knowledge, in the form of the likelihood of
occurrence of phones in sentences or sequences of words. The syntactic knowledge is
primitive in that it only addresses the probability of a phone in a series of topic-related words,
and the key to the process is the probability and recognition of the onset phones in a
sentence. RUST has two versions. In the first version (RUST-I), the lexicon was developed
with 1357 words, of which 541 are unique. These words were extracted from three topics
(finance, physics and general reading material), and could be expanded or reduced
(specialised). The second version (RUST-II) has a modified APR to suit speech data
extracted from the TIMIT speech database, and its lexicon consists of 673 words.
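The phonemic statistic described above (the likelihood of occurrence of phones in words) can be estimated from any phone-transcribed lexicon. The function name and the toy two-word lexicon below are assumptions made purely for illustration; the thesis derives its values from the UWS and TIMIT databases.

```python
from collections import Counter

def phone_probabilities(lexicon):
    """Relative frequency of each phone across all words in the lexicon."""
    counts = Counter(p for phones in lexicon.values() for p in phones)
    total = sum(counts.values())
    return {phone: n / total for phone, n in counts.items()}

# Toy lexicon (made up, not from the RUST databases).
lexicon = {"cat": ["k", "ae", "t"], "kit": ["k", "I", "t"]}
probs = phone_probabilities(lexicon)   # e.g. probs["k"] == 2/6
```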
Three experiments have been carried out on RUST-I. The first two experiments examined the
operation of the system as an isolated phone recognisor and the third experiment tested the
operation of the system as an isolated word recognisor.
The first experiment showed that the average Self-Recognition Score (SRS) across subgroups
was highest for vowels and lowest for affricatives. The SRS also varied across speakers within
the testing set, with the highest average SRS occurring for speaker 9 and the lowest for speaker
6. The system consistently recognised all the phones of all the speakers. This experiment
showed that the adaptive phone recognisor performed reasonably well as an isolated phone
recognisor.
The second experiment showed that, over all speakers and phones (totalling 225 tokens), the
numbers of SRSs greater than the three thresholds (0.5, 0.6 and 0.7) were 225, 172 and 160
tokens respectively. A threshold of 0.60 was selected from these results as it balances low
mis-recognition against high self-recognition.
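The threshold choice can be pictured as a simple count over the candidate values. The scoring rule below (self-recognitions kept minus mis-recognitions let through, equally weighted) is an assumption for illustration, not the selection criterion used in the thesis.

```python
def choose_threshold(srs, mrs, candidates):
    """Pick the candidate threshold that keeps the most self-recognition
    scores (SRS) above it while letting the fewest mis-recognition
    scores (MRS) through."""
    def score(t):
        kept = sum(s > t for s in srs)      # self-recognitions retained
        leaked = sum(m > t for m in mrs)    # mis-recognitions passing
        return kept - leaked
    return max(candidates, key=score)
```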
Of the 100 words applied in the third experiment, 73% were successfully recognised. 91%
of the front-edge phones were recognised successfully, and at the next level of the syntactic
knowledge database an 82% phone recognition rate was achieved. Inclusion of the syntactic
knowledge estimator was shown to eliminate 89.05% of the APR mis-recognitions; that is,
89.05% of the recognised words required the support of the syntactic knowledge estimator
in their recognition.
An analysis of the 27% mis-recognised words identified three reasons for failure in the
recognition process. The first category of errors occurred when the SRS for some
sub-recognisors was below the threshold; this accounted for 14.81% of all the mis-recognised
words. The second category occurred when the MRS for some sub-recognisors at the
front-edge level was higher than the system threshold and the probability of occurrence of the
mis-recognised phone within the ordering structure of the syntactic knowledge was greater
than that of the correct phone; this accounted for 37.03% of all the mis-recognised words. The
third category occurred when the SRS for the phone at levels below the front-edge level of the
syntactic knowledge was lower than the MRS for other phones at the same level; this
accounted for 48.16% of all the mis-recognised words. Together, the three experiments
demonstrated a road map for achieving better recognition using an ANN in combination with
the appropriate knowledge.
In RUST-I, the speech database used was a non-standard Australian speech database (the
UWS speech database). The other speech corpus used in RUST is the TIMIT speech
database, which was trialled with RUST-II in Chapter 5. In applying TIMIT, some
adjustments were required to the syntactic knowledge estimator and the APR to
accommodate a standard speech database with a different phonemic set and pronunciation.
C++ code was developed to browse the TIMIT database and extract the information
required for the implementation of RUST-II.
RUST-II demonstrated excellent recognition results with its APR updated to use the
incremental learning algorithm. The application of the incremental learning algorithm to
the APR led to a significant improvement in the system: experiments showed recognition
rates of up to 94.44% at the phone level.
6.2 Future Work
RUST-I represents phonemic knowledge both in the overall structure of the APR and in the
statistical knowledge source of the syntactic knowledge estimator. The performance of
RUST (I and II) depended firstly on the accuracy of the browsing system, and hence on the
probabilistic order of priority in the syntactic knowledge, and secondly on the performance
of the syntactic knowledge estimator (SKE). It is essential that the probabilistic
representation be as accurate as possible. Greater success could also be achieved in RUST
(I and II) by optimising the APR's ability to achieve a higher SRS relative to the MRS.
RUST has the potential to be upgraded to recognise continuous speech by accommodating
higher syntactic (sentence level), semantic and pragmatic knowledge sources. One way
RUST can be expanded to continuous speech is to include probabilistic forms of words given
the patterns of occurrences of other words within a sentence. Additional sources of
knowledge such as intonation patterns, common co-articulation patterns and common rules
of grammar can also be included to improve RUST’s performance.
One aspect of the sentence structure incorporated into RUST is the probability of a word
being first in a sentence. This probability was calculated as part of the low level syntactic
knowledge representation and is used to assist in the recognition of words presented to the
system.
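This onset statistic can be estimated by counting sentence-initial words in a corpus. The function name and the toy tokenised sentences below are illustrative assumptions, not the thesis implementation.

```python
from collections import Counter

def sentence_onset_probabilities(sentences):
    """Probability of each word occurring first in a sentence,
    estimated from a list of tokenised sentences."""
    first = Counter(s[0] for s in sentences if s)   # first word of each sentence
    total = sum(first.values())
    return {word: n / total for word, n in first.items()}
```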
The performance and usefulness of RUST could also be improved by providing the syntactic
knowledge estimator with a mechanism for the self-learning of new words as they are added
to the system. The self-learning can use a high level of linguistic knowledge to determine if a
sequence of phones is grammatically, colloquially and semantically possible.
The APR must be efficient enough to include the correct phone in the list of possible
phones; the syntactic knowledge estimator then needs to determine the most likely phone
from that list. Presently the system selects the first phone that exceeds the system threshold
as the "correct phone". This technique has a limitation that leads to some avoidable errors
in performance. The problem can be solved by applying different threshold values for
different sub-recognisors or phonemic subgroups, and by using a selector that determines
the maximum response from the list of syntactically likely phones and works through them
in descending order.
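The proposed selector might look like the sketch below, which ranks the syntactically likely phones and works through them in descending order under per-phone thresholds. The ranking rule (network response weighted by syntactic likelihood) and all names are assumptions made for the example.

```python
def select_phone(responses, priors, thresholds, default_threshold=0.6):
    """Return the highest-ranked phone whose sub-recognisor response
    exceeds its own threshold, or None if no candidate qualifies.

    responses:  phone -> sub-recognisor output
    priors:     phone -> syntactic likelihood of the phone
    thresholds: phone -> per-phone (or per-subgroup) threshold
    """
    ranked = sorted(responses,
                    key=lambda p: responses[p] * priors.get(p, 0.0),
                    reverse=True)
    for phone in ranked:
        if responses[phone] >= thresholds.get(phone, default_threshold):
            return phone
    return None
```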
Further improvement to the syntactic knowledge estimator of RUST could include a
mechanism for backtracking through the recognition process when a mistake results in the
previously defined path not being followed. If a mistake is made at either the phonemic
level or the branch level, the syntactic knowledge estimator needs to go back through the
word's browsing history and alter some of the decisions it has made, checking for an
overall better phone match for the word. This requires the implementation of an algorithm
of far greater intelligence and complexity than that provided in the current versions of
RUST.
The performance of RUST could also be improved by using a continuous activation input
into each sub-recognisor rather than the current binary value. This continuous input would
represent the probability of occurrence of the current phone and would combine with the
SRS to derive a decision on the "correct" phone.
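A continuous activation input could enter the decision as a weight on each SRS. The product rule below is one possible combination, assumed purely for illustration.

```python
def decide_phone(srs_scores, occurrence_probs):
    """Choose the phone whose SRS, weighted by its continuous probability
    of occurrence, is largest (instead of gating on a binary input)."""
    return max(srs_scores,
               key=lambda p: srs_scores[p] * occurrence_probs.get(p, 0.0))
```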
REFERENCES
BERNARD, J., 1989. Australian at talk. Canberra: documentation of video
exploration program prepared for the Curriculum Development Centre, Canberra.
CASSIDY, S. AND HARRINGTON, J., 1992. Investigating the dynamic nature of
vowels using neural network. Proceedings of the 4th Australian international
conference on speech science and technology, December 1992 Brisbane. 495-500.
CHAKRABORTY, D. AND PAL, N., 2003. A novel learning scheme for multilayered
perceptrons to realize proper generalization and incremental learning. IEEE transactions
on neural networks, 14(1), January 2003, 1-14.
CHAN, C. AND CHAN, TAT-CHUNG, 1992. A controlled study of the suitability and
limitations of static modelling of speech. Proceedings of IEEE region 10
international conference on technology enabling tomorrow: computers,
communications and automation towards the 21st century TENCON-92, 1992. Vol. 1,
272-276.
CHENG, Y.M., O’SHAUGHNESSY, D., GUPTA, V., KENNY, P., MERMELSTEIN,
P., AND PARTHASARATHY, S., 1992. Hybrid segmental-LVQ/HMM for large
vocabulary speech recognition. Proceedings of IEEE international conference on
acoustics speech and signal processing, 1992, Vol. 1, 593-596.
COSTINETT, S., 1997. The language of accounting in English. New York: Regents
publishing company.
CREEKMORE, J.W., FANTY, M. AND COLE, R.A., 1991. A comparative study of
five spectral representations for speaker-independent phonetic recognition. The 25th
Asilomar conference on signals systems and computer, 1991. 330-334.
DARJAZINI, H., CHENG, Q., AND LIYANA-PATHIRANA, R., 2006. Incremental
learning algorithm for speech recognition, (unpublished).
DARJAZINI, H. AND TIBBITTS, J., 1994. The construction of phonemic knowledge
using clustering methodology. Proceedings of the 5th Australian international
conference on speech science and technology SST-94, December 1994 Perth, Vol. 1,
202-207.
DAVENPORT, M. AND GARUDADRI, H., 1991. A neural network acoustics
phonetic feature extractor based on wavelet. Proceedings of IEEE Pacific rim
international conference on communication computers and signal processing, 1991.
Vol. 2, 449-452.
DAVIS, S.B. AND MERMELSTEIN, P., 1980. Comparison of parametric
representations for monosyllabic word recognition in continuously spoken sentences.
IEEE transactions on acoustics, speech, and signal processing, 28 (4), 357-366.
DE MORI, R., 1983. Computer models of speech using fuzzy algorithm. USA: Plenum
Press.
DELLER, J.R. JR. PROAKIS, J.G. AND HANSEN, J.H., 1993. Discrete-time
processing of speech signals. USA: Macmillan publishing co.
DENG, L., DROPPO, J., AND ACERO, A., 2003. Incremental Bayes learning with prior
evolution for tracking nonstationary noise statistics from noisy speech data, Proceedings of
IEEE international conference on acoustics speech and signal processing, April.
2003,Vol. 1, 6-10.
DERMODY, P., MACKIE, K. AND KATSCH, R., 1986. Initial speech sound
processing in spoken word recognition. Proceedings of the 1st Australian conference on
speech science and technology SST-86, 1986 Canberra.
ELVIRA, J. AND CARRASCO, R., 1991. Neural network architectures for speech
processing. IEE colloquium on systems and applications of man-machine interaction
using speech I/O, 1991. Digest No. 066, 4(1-5).
ESCANDE, P., BEROULE, D. AND BLANCHAT, P., 1991. Speech recognition
experiments with guided propagation. Proceedings of IEEE conference on neural
network, 1991. 765-768.
FANT, G., 1960. Acoustic theory of speech production. 's-Gravenhage: Mouton
and Co.
FLAHERTY, M.J. AND POE, D.B., 1993. Orthogonal transformations of stacked
feature vectors applied to HMM speech recognition. IEE proceedings - 1, 140 (2).
FLANAGAN, J.L., 1983. Speech analysis synthesis and perception. 3rd ed. Berlin:
Springer-Verlag.
FURUI, S., 1989. Digital speech processing synthesis and recognition. New York:
Marcel Dekker Inc.
GRAMSS, T., 1992. Fast learning algorithms for a self-optimizing neural network with
an application to isolated word recognition. IEE proceedings – F, 139 (6), 391-396.
GRANT, P.M., 1991. Speech recognition techniques. IEEE electronics and
communication engineering journal, (2), 37- 48.
GUPTA, V.N., LENNIG, M., MERMELSTEIN, P., KENNY, P., SEITZ, F., AND
O’SHAUGHNESSY, D., 1991. Using phoneme duration and energy contour
information to improve large vocabulary isolated word recognition. Proceedings of
IEEE international conference on acoustics speech and signal processing ICASSP-91,
1991. Vol. 1, 341-344.
HALL, E., 1977. The language of electrical and electronic engineering in English.
N.Y.: Regents publishing company.
HECHT-NIELSEN, R., 1990. Neurocomputing. USA: Addison-Wesley publishing
company.
HUNT, M.J., 1988. An overview of technology for spoken interaction with machines.
Ottawa: National aeronautical establishment. (Report - Feb 1988).
KENNY, P., 1993. A*-admissible heuristics for rapid lexical access. IEEE transactions
on speech and audio processing, 1 (1), 49-58.
KITAMURA, T., NISHIOKA, K., ITO, A. AND HAYAHARA, E., 1992. Speaker
dependent 100 word recognition using dynamic spectral features of speech and neural
network. Proceedings of the 34th Midwest symposium on circuits and systems, 1992.
Vol. 1, 533-536.
KITAMURA, T., NISHIOKA, K., IWATA, A. AND HAYAHARA, E., 1992. Speaker
dependent recognition using CombNET dynamic spectral features of speech.
Proceedings of the 34th Midwest symposium on circuits and systems, 1992. Vol. 1, 83-
86.
KUANG, Z. AND KUH, A., 1992. A combined self-organizing feature map and multi-
layer perceptron for isolated word recognition. IEEE transactions on signal processing,
40 (11), 2651-2657.
FAUSETT, L., 1994. Fundamentals of neural networks. USA: Prentice Hall.
LEE, K. AND DERMODY, P., 1992. The relationship between perceptual and
acoustics analysis of speech sounds. Proceedings of the 4th Australian international
conference on speech science and technology SST-92, December 1992 Brisbane. 14-19.
LIPPMANN, R.P., 1987. An introduction to computing with neural nets. IEEE
Acoustics speech and signal processing magazine, 3(4), 4-22.
LOVE, C. AND KINSNER, W., 1992. A speech recognition system using a neural
network model for vocal shaping. Canada: University of Manitoba, Department of
electrical and computer Engineering (report).
MACQUARIE LIBRARY, 1998. The budget Macquarie dictionary. NSW: Macquarie
University, 3rd ed.
MAGOULAS, G. D. AND VRAHATIS, M. N., 1999. Improving the convergence of the
back-propagation algorithm using learning rate adaptation methods. Neural computation
magazine, 11, Massachusetts institute of technology, pp. 1769-1796.
McCORD NELSON, M. AND ILLINGWORTH, W.T., 1991. A practical guide to
neural nets. USA: Addison-Wesley.
MIHELIČ, F., GYERGYEK, L. AND PAVEŠIĆ, N., 1991. Selection of features and
classification rules for Slovene phoneme. Proceedings of 6th Mediterranean electro-
technical conference, 1991. Vol. 2, 1180-1183.
OPPENHEIM, A.V. AND SCHAFER, R.W., 1989. Discrete-time signal processing.
USA: Prentice Hall, Signal processing series.
PEPPER, D.J. AND CLEMENTS, M.A., 1992. Phonemic recognition using a large
hidden Markov model. IEEE transactions on signal processing, 40 (6), 1590-1595.
POLIKAR, R., UDPA, L., UDPA, S., AND HONAVAR, V., 2001. Learn++: An
incremental learning algorithm for supervised neural networks. IEEE transactions on
systems, man, and cybernetics - Part C: Applications and reviews, 31(4), Nov. 2001,
497-508.
REICHL, W. AND RUSKE, G., 1995. A hybrid RBF-HMM system for continuous
speech recognition. Proceedings of IEEE international conference on acoustics speech
and signal processing ICASSP–95, 1995. Vol. 5, 3335-3338.
RABINER, L. AND JUANG, B-H., 1993. Fundamentals of speech recognition. USA:
Prentice Hall.
RIGOLL, G., 1991. A new unsupervised learning algorithm for multi-layer perceptrons
based on information theory principles. Proceedings of IEEE international joint
conference on neural networks, 1991. 1764-1769.
SHIM, C., ESPINOZA-VARSA, B. AND CHEUNG, J., 1991. Difficult syllables
recognition with LPC coefficients differences and PC-based neural network.
Proceedings of 33rd IEEE Midwest symposium on circuits and systems, 1991. Vol. 2,
783-786.
SHUPING, R. AND MILLAR, B., 1992. Phonetic feature extraction using artificial
neural networks. Proceedings of the 4th Australian international conference on speech
science and technology SST-92, December 1992 Brisbane. 22-27.
SMITH, F.J., MING, J., O’BOYLE, P. AND IRVINE, A.D., 1995. A hidden Markov
model with optimized inter-frame dependence. Proceedings of IEEE International
conference on acoustics speech and signal processing ICASSP-95, 1995. Vol. 1, 209-
212.
SORENSEN, H., 1991. A cepstral noise reduction multi-layer neural network.
Proceedings of IEEE international conference on acoustics, speech and signal
processing ICASSP-91, 1991. Vol. 2, 933-936.
SUGIYAMA, M., SAWAI, H. AND WAIBEL, A., 1991. Review of TDNN
architectures for speech recognition. IEEE international symposium on circuits and
systems, 1991. Vol. 1, 582-585.
TECHNICAL PUBLICATIONS GROUP, 1993. Neural computing, A technology
handbook for professional II/Plus and NeuralWorks explorer. USA: NeuralWare Inc.
TIBBITTS, J., 1996. Utilisation of perceptually acoustics cues in NNT for speech
recognition. Sydney: report to ARC small grant.
TIBBITTS, J., 1989. A digital signal processing technique to improve the intelligibility
of speech for the hearing impaired in quiet. Thesis (PhD). Sydney University.
VO, M.T., 1994. Incremental learning using the time delay neural network, Proceedings of
IEEE international conference on acoustics speech and signal processing ICASSP-94, Vol.
2, April 1994, 629-632.
WAIBEL, A., HANAZAWA, T., HINTON, G., SHIKANO, K. AND LANG, K.J.,
1989. Phoneme recognition using time-delay neural networks. IEEE transactions on
acoustics, speech and signal processing, 37(3), 328-339.
WANG, D., AND YUWONO, B., 1996. Incremental learning of complex temporal
patterns, IEEE transactions on neural networks, 7(6), Nov. 1996, 1465-1481.
ZAVALIAGKOS, G., ZHAO, Y., SCHWARTZ, R., AND MAKHOUL J., 1994. A
hybrid segmental neural net / hidden Markov model system for continuous speech
recognition. IEEE transactions on speech and audio processing, 2 (1), Part 2, 151-160.
ZHANG, Q.J., WANG, F. AND NAKHLA, M.S., 1995. A high-order temporal neural
network for word recognition. Proceedings on international conference on acoustics,
speech and signal processing ICASSP-95, 1995. Vol. 5, 3343-3346.
APPENDIX
Probabilistic Values of the Second Level of the Syntactic Knowledge
The probabilistic values of the links between the phonemic groups on the onset level and their
phonemic subgroups are shown in the tables below. These values address the sequential
distribution of the clusters in the syntactic knowledge.
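The two computed columns in the tables below follow directly from the occurrence counts: the localised probability is P = E / n(set) and the self-information is I = -log2(P) bits. The sketch below reproduces the first row of Table A.1, where E(ðI) = 117 out of n(ð) = 156 gives P = 0.750 and I = 0.415 bit; the function name is an assumption for illustration.

```python
import math

def localised_stats(subgroup_counts, n_set):
    """Localised probability and self-information (in bits) per subgroup:
    P = count / n_set and I = -log2(P), as tabulated below."""
    return {k: (c / n_set, -math.log2(c / n_set))
            for k, c in subgroup_counts.items()}

p, i = localised_stats({"ðI": 117}, 156)["ðI"]   # P = 0.750, I = 0.415
```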
Table A.1 Probabilistic values of phonemic subgroups of the phonemic set Oð.
Phonemic set Oð , n(ð) = 156
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(ðI) = 117 P(ðI) = 0.750 I(ðI) = 0.415
2 E(ð∂) = 110 P(ð∂) = 0.705128 I(ð∂) = 0.504
3 E(ðæ) = 13 P(ðæ) = 0.0833 I(ðæ) = 3.583
4 E(ðε) = 11 P(ðε) = 0.0705512 I(ðε) = 3.824
5 E(ðeI) = 9 P(ðeI) = 0.057692 I(ðeI) = 4.113
6 E(ði) = 3 P(ði) = 0.019230 I(ði) = 5.697
7 E(ðOΩ) = 2 P(ðOΩ) = 0.01282 I(ðOΩ) = 6.281
Table A.2 Probabilistic values of phonemic subgroups of the phonemic set O∂.
Phonemic set O∂ , n(∂) = 106
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(∂sln) = 36 P(∂sln) = 0.339622 I(∂sln) =1.557
2 E(∂L) = 23 P(∂L) = 0.216981 I(∂L) = 2.2
3 E(∂k) =18 P(∂k) = 0.169811 I(∂k) = 2.556
4 E(∂t) = 6 P(∂t) = 0.056603 I(∂t) = 4.14
5 E(∂r) = 6 P(∂r) = 0.056603 I(∂r) = 4.14
6 E(∂w) = 5 P(∂w) = 0.047169 I(∂w) = 4.4
7 E(∂b) = 3 P(∂b) = 0.0283018 I(∂b) = 5.139
8 E(∂f) = 3 P(∂f) = 0.0283018 I(∂f) = 5.139
9 E(∂m) = 2 P(∂m) = 0.0188679 I(∂m) = 5.724
10 E(∂g) = 2 P(∂g) = 0.0188679 I(∂g) = 5.724
11 E(∂p) = 1 P(∂p) = 0.0094339 I(∂p) = 6.724
12 E(∂d) = 1 P(∂d) = 0.0094339 I(∂d) = 6.724
Table A.3 Probabilistic values of phonemic subgroups of the phonemic set Oæ.
Phonemic set Oæ , n(æ) = 82
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(æn) = 56 P(æn) = 0.682926 I(æn) = 0.549
2 E(æt) = 18 P(æt) = 0.219512 I(æt) = 2.186
3 E(æz) = 8 P(æz) = 0.097560 I(æz) = 3.355
4 E(æd) = 1 P(æd) = 0.012195 I(æd) = 6.353
5 E(æL) = 1 P(æL) = 0.012195 I(æL) = 6.353
Table A.4 Probabilistic values of phonemic subgroups of the phonemic set OI.
Phonemic set OI , n(I) = 81
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(In) = 48 P(In) = 0.592592 I(In) = 0.754
2 E(Iz) = 16 P(Iz) = 0.197530 I(Iz) = 2.338
3 E(It) = 12 P(It) = 0.148148 I(It) = 2.75
4 E(If) = 4 P(If) = 0.049382 I(If) = 4.337
5 E(Im) = 1 P(Im) = 0.012345 I(Im) = 6.336
Table A.5 Probabilistic values of phonemic subgroups of the phonemic set Oh.
Phonemic set Oh , n(h) = 79
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(hi) = 16 P(hi) = 0.202531 I(hi) = 2.3
2 E(hæ) = 15 P(hæ) = 0.189873 I(hæ) = 2.39
3 E(ha) = 12 P(ha) = 0.151898 I(ha) = 2.71
4 E(hέ) = 10 P(hέ) = 0.126582 I(hέ) = 2.98
5 E(hI) = 10 P(hI) = 0.126582 I(hI) = 2.98
6 E(hOΩ) = 6 P(hOΩ) = 0.075949 I(hOΩ) = 3.72
7 E(hε) = 3 P(hε) = 0.0379746 I(hε) = 4.72
8 E(haI) = 2 P(haI) = 0.0253164 I(haI) = 5.3
9 E(hu) = 2 P(hu) = 0.0253164 I(hu) = 5.3
10 E(hI∂) = 1 P(hI∂) = 0.0126582 I(hI∂) = 6.3
11 E(hЭ) = 1 P(hЭ) = 0.0126582 I(hЭ) = 6.3
12 E(hΛ) = 1 P(hΛ) = 0.0126582 I(hΛ) = 6.3
Table A.6 Probabilistic values of phonemic subgroups of the phonemic set Ow.
Phonemic set Ow , n(w) = 74
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(wI) = 17 P(wI) = 0.229729 I(wI) = 2.12
2 E(wÞ) = 11 P(wÞ) = 0.148648 I(wÞ) = 3.32
3 E(wi) = 8 P(wi) = 0.108108 I(wi) = 3.36
4 E(wέ) = 8 P(wέ) = 0.108108 I(wέ) = 3.36
5 E(wε) = 7 P(wε) = 0.0945945 I(wε) = 3.4
6 E(wΛ) = 6 P(wΛ) = 0.0810810 I(wΛ) = 3.62
7 E(weI) = 5 P(weI) = 0.0675675 I(weI) = 3.88
8 E(wЭ) = 3 P(wЭ) = 0.0405405 I(wЭ) = 4.62
9 E(wΩ) = 2 P(wΩ) = 0.027027 I(wΩ) = 5.2
10 E(waI) = 2 P(waI) = 0.027027 I(waI) =5.98
11 E(wε∂) = 1 P(wε∂) = 0.013513 I(wε∂) = 5.98
Table A.7 Probabilistic values of phonemic subgroups of the phonemic set OÞ.
Phonemic set OÞ , n(Þ) = 65
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(Þv) = 51 P(Þv) = 0.784615 I(Þv) = 0.349
2 E(Þn) = 8 P(Þn) = 0.1230789 I(Þn) = 3.02
3 E(Þf) = 4 P(Þf) = 0.0615384 I(Þf) = 4.02
4 E(Þp) = 2 P(Þp) = 0.030769 I(Þp) = 5.02
Table A.8 Probabilistic values of phonemic subgroups of the phonemic set Of.
Phonemic set Of , n(f) = 63
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(fr) = 18 P(fr) = 0.285714 I(fr) = 1.8
2 E(fЭ) = 14 P(fЭ) = 0.222222 I(fЭ) = 2.6
3 E(fI) = 6 P(fI) = 0.095238 I(fI) = 3.39
4 E(fi) = 3 P(fi) = 0.047619 I(fi) = 4.39
5 E(fε) = 3 P(fε) = 0.047619 I(fε) = 4.39
6 E(fa) = 3 P(fa) = 0.047619 I(fa) = 4.39
7 E(feI) = 3 P(feI) = 0.047619 I(feI) = 4.39
8 E(f∂) = 2 P(f∂) = 0.031746 I(f∂) = 4.97
9 E(faI) = 2 P(faI) = 0.031746 I(faI) = 4.97
10 E(fæ) = 2 P(fæ) = 0.031746 I(fæ) = 4.97
11 E(fΛ) = 2 P(fΛ) = 0.031746 I(fΛ) = 4.97
12 E(fI∂) = 1 P(fI∂) = 0.015873 I(fI∂) = 5.97
13 E(fL) = 1 P(fL) = 0.015873 I(fL) = 5.97
14 E(faΩ) = 1 P(faΩ) = 0.015873 I(faΩ) = 5.97
15 E(fu) = 1 P(fu) = 0.015873 I(fu) = 5.97
Table A.9 Probabilistic values of phonemic subgroups of the phonemic set Op.
Phonemic set Op , n(p) = 62
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(pr) = 25 P(pr) = 0.403 I(pr) = 1.31
2 E(pL) = 6 P(pL) = 0.0967 I(pL) = 3.36
3 E(pa) = 6 P(pa) = 0.0967 I(pa) = 3.36
4 E(p∂) = 4 P(p∂) = 0.0645 I(p∂) = 3.95
5 E(pÞ) = 4 P(pÞ) = 0.0645 I(pÞ) = 3.95
6 E(pΩ) = 3 P(pΩ) = 0.0483 I(pΩ) = 4.37
7 E(pi) = 3 P(pi) = 0.0483 I(pi) = 4.37
8 E(pΛ) = 2 P(pΛ) = 0.0322 I(pΛ) = 4.95
9 E(pæ) = 2 P(pæ) = 0.0322 I(pæ) = 4.95
10 E(peI) = 2 P(peI) = 0.0322 I(peI) = 4.95
11 E(pέ) = 2 P(pέ) = 0.0322 I(pέ) = 4.95
12 E(pI∂) = 1 P(pI∂) = 0.0161 I(pI∂) = 5.95
13 E(pI) = 1 P(pI) = 0.0161 I(pI) = 5.95
14 E(pε) = 1 P(pε) = 0.0161 I(pε) = 5.95
Table A.10 Probabilistic values of phonemic subgroups of the phonemic set Ot.
Phonemic set Ot , n(t) = 61
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(tu) = 39 P(tu) = 0.639344 I(tu) = 0.644
2 E(tr) = 8 P(tr) = 0.131147 I(tr) = 2.92
3 E(teI) =6 P(teI) = 0.098360 I(teI) = 3.34
4 E(t∂) = 3 P(t∂) = 0.049180 I(t∂) = 4.34
5 E(tw) = 2 P(tw) = 0.032786 I(tw) = 4.92
6 E(tέ) = 2 P(tέ) = 0.032786 I(tέ) = 4.92
7 E(taI) = 1 P(taI) = 0.016393 I(taI) = 5.93
Table A.11 Probabilistic values of phonemic subgroups of the phonemic set Os.
Phonemic set Os , n(s) = 57
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(sε) = 14 P(sε) = 0.245614 I(sε) = 2.02
2 E(sI) = 9 P(sI) = 0.157894 I(sI) = 2.66
3 E(st) = 7 P(st) = 0.122807 I(st) = 3.02
4 E(sp) = 4 P(sp) = 0.105263 I(sp) = 3.24
5 E(si) = 3 P(si) = 0.052631 I(si) = 4.24
6 E(sm) = 3 P(sm) = 0.052631 I(sm) = 4.24
7 E(s∂) = 3 P(s∂) = 0.052631 I(s∂) = 4.24
8 E(seI) = 2 P(seI) = 0.035087 I(seI) = 4.83
9 E(sOΩ) = 2 P(sOΩ) = 0.035087 I(sOΩ) = 4.83
10 E(sέ) = 2 P(sέ) = 0.035087 I(sέ) = 4.83
11 E(sæ) = 2 P(sæ) = 0.035087 I(sæ) = 4.83
12 E(sk) = 2 P(sk) = 0.035087 I(sk) = 4.83
13 E(sL) = 1 P(sL) = 0.017543 I(sL) = 5.83
14 E(saI) = 1 P(saI) = 0.017543 I(saI) = 5.83
Table A.12 Probabilistic values of phonemic subgroups of the phonemic set Ob.
Phonemic set Ob , n(b) = 56
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(bi) = 19 P(bi) = 0.339285 I(bi) = 1.56
2 E(bΩ) = 10 P(bΩ) = 0.178571 I(bΩ) = 2.48
3 E(bI) = 7 P(bI) = 0.125 I(bI) = 2.3
4 E(baI) = 4 P(baI) = 0.0714285 I(baI) = 3.8
5 E(b∂) = 3 P(b∂) = 0.0535714 I(b∂) = 4.21
6 E(br) = 3 P(br) = 0.0535714 I(br) = 4.21
7 E(bL) = 3 P(bL) = 0.0535714 I(bL) = 4.21
8 E(bΛ) = 2 P(bΛ) = 0.035714 I(bΛ) = 4.8
9 E(bæ) = 2 P(bæ) = 0.035714 I(bæ) = 4.8
10 E(bε) = 1 P(bε) = 0.017857 I(bε) = 5.8
11 E(baΩ) = 1 P(baΩ) = 0.017857 I(baΩ) = 5.8
12 E(bЭ) = 1 P(bЭ) = 0.017857 I(bЭ) = 5.8
13 E(bÞ) = 1 P(bÞ) = 0.017857 I(bÞ) = 5.8
14 E(beI) = 1 P(beI) = 0.017857 I(beI) = 5.8
Table A.13 Probabilistic values of phonemic subgroups of the phonemic set Ok .
Phonemic set Ok , n(k) = 48
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(k∂) = 11 P(k∂) = 0.22916 I(k∂) = 2.12
2 E(kæ) = 10 P(kæ) = 0.20833 I(kæ) = 2.26
3 E(kÞ) = 9 P(kÞ) = 0.1875 I(kÞ) = 2.41
4 E(kΛ) = 4 P(kΛ) = 0.083333 I(kΛ) = 3.58
5 E(kaI) = 3 P(kaI) = 0.0625 I(kaI) = 3.99
6 E(keI) = 2 P(keI) = 0.041666 I(keI) = 4.58
7 E(kε) = 2 P(kε) = 0.041666 I(kε) = 4.58
8 E(kЭ) = 2 P(kЭ) = 0.041666 I(kЭ) = 4.58
9 E(kΩ) = 1 P(kΩ) = 0.020833 I(kΩ) = 5.58
10 E(kL) = 1 P(kL) = 0.020833 I(kL) = 5.58
11 E(kOΩ) = 1 P(kOΩ) = 0.020833 I(kOΩ) = 5.58
12 E(kI) = 1 P(kI) = 0.020833 I(kI) = 5.58
13 E(ki) = 1 P(ki) = 0.020833 I(ki) = 5.58
Table A.14 Probabilistic values of phonemic subgroups of the phonemic set Od.
Phonemic set Od , n(d) = 44
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(d∂) = 12 P(d∂) = 0.272727 I(d∂) = 1.87
2 E(dr) = 10 P(dr) = 0.2272727 I(dr) = 2.14
3 E(dI) = 6 P(dI) = 0.1363636 I(dI) = 2.87
4 E(dΛ) = 2 P(dΛ) = 0.0454545 I(dΛ) = 4.48
5 E(du) = 2 P(du) = 0.0454545 I(du) = 4.48
6 E(deI) = 2 P(deI) = 0.0454545 I(deI) = 4.48
7 E(dε) = 2 P(dε) = 0.0454545 I(dε) = 4.48
8 E(di) = 2 P(di) = 0.0454545 I(di) = 4.48
9 E(dÞ) = 1 P(dÞ) = 0.0227272 I(dÞ) = 5.46
10 E(dæ) = 1 P(dæ) = 0.0227272 I(dæ) = 5.46
11 E(dЭ) = 1 P(dЭ) = 0.0227272 I(dЭ) = 5.46
12 E(dOΩ) = 1 P(dOΩ) = 0.0227272 I(dOΩ) = 5.46
13 E(dI∂) = 1 P(dI∂) = 0.0227272 I(dI∂) = 5.46
14 E(dj) = 1 P(dj) = 0.0227272 I(dj) = 5.46
Table A.15 Probabilistic values of phonemic subgroups of the phonemic set Om.
Phonemic set Om , n(m) = 38
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(mΛ) = 7 P(mΛ) = 0.159090 I(mΛ) = 2.65
2 E(mæ) = 6 P(mæ) = 0.157894 I(mæ) = 2.66
3 E(mÞ) = 6 P(mÞ) = 0.157894 I(mÞ) = 2.66
4 E(meI) = 6 P(meI) = 0.157894 I(meI) = 2.66
5 E(m∂) = 4 P(m∂) = 0.105263 I(m∂) = 3.25
6 E(mЭ) = 3 P(mЭ) = 0.0789473 I(mЭ) = 3.66
7 E(mOΩ) = 2 P(mOΩ) = 0.052631 I(mOΩ) = 4.24
8 E(ma) = 1 P(ma) = 0.0263157 I(ma) = 5.24
9 E(maΩ) = 1 P(maΩ) = 0.026315 I(maΩ) = 5.24
Table A.16 Probabilistic values of phonemic subgroups of the phonemic set On .
Phonemic set On , n(n) = 34
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(nj) = 12 P(nj) = 0.352941 I(nj) = 1.5
2 E(nΛ) = 7 P(nΛ) = 0.205882 I(nΛ) = 2.27
3 E(nε) = 4 P(nε) = 0.117647 I(nε) = 3.1
4 E(nOΩ) = 4 P(nOΩ) = 0.117647 I(nOΩ) = 3.1
5 E(nÞ) = 2 P(nÞ) = 0.058823 I(nÞ) = 4.08
6 E(naI) = 2 P(naI) = 0.058823 I(naI) = 4.08
7 E(nЭI) = 1 P(nЭI) = 0.029411 I(nЭI) = 5.08
8 E(neI) = 1 P(neI) = 0.029411 I(neI) = 5.08
9 E(ni) = 1 P(ni) = 0.029411 I(ni) = 5.08
Table A.17 Probabilistic values of phonemic subgroups of the phonemic set Oi .
Phonemic set Oi , n(i) = 28
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(iL) = 17 P(iL) = 0.607142 I(iL) = 0.72
2 E(iv) = 2 P(iv) = 0.071428 I(iv) = 3.8
3 E(it∫) = 2 P(it∫) = 0.071428 I(it∫) = 3.8
4 E(ik) = 2 P(ik) = 0.071428 I(ik) = 3.8
5 E(iz) = 2 P(iz) = 0.071428 I(iz) = 3.8
6 E(is) = 1 P(is) = 0.035714 I(is) = 4.8
7 E(it) = 1 P(it) = 0.035714 I(it) = 4.8
8 E(in) = 1 P(in) = 0.035714 I(in) = 4.8
Table A.18 Probabilistic values of phonemic subgroups of the phonemic set Oε.
Phonemic set Oε , n(ε) = 28
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(εn) = 10 P(εn) = 0.370370 I(εn) = 1.43
2 E(εL) = 6 P(εL) = 0.222222 I(εL) = 2.17
3 E(εk) = 3 P(εk) = 0.111111 I(εk) = 3.17
4 E(εv) = 2 P(εv) = 0.074074 I(εv) = 3.75
5 E(ε∂) = 2 P(ε∂) = 0.074074 I(ε∂) = 3.75
6 E(εg) = 2 P(εg) = 0.074074 I(εg) = 3.75
7 E(εd) = 1 P(εd) = 0.037037 I(εd) = 4.75
8 E(εb) = 1 P(εb) = 0.037037 I(εb) = 4.75
Table A.19 Probabilistic values of phonemic subgroups of the phonemic set Og.
Phonemic set Og , n(g) = 26
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(gr) = 11 P(gr) = 0.423076 I(gr) = 1.24
2 E(geI) = 3 P(geI) = 0.115384 I(geI) = 3.11
3 E(gε) = 2 P(gε) = 0.0769230 I(gε) = 3.7
4 E(gL) = 2 P(gL) = 0.0769230 I(gL) = 3.7
5 E(gΛ) = 2 P(gΛ) = 0.0769230 I(gΛ) = 3.7
6 E(gÞ) = 1 P(gÞ) = 0.038461 I(g Þ) = 4.69
7 E(gaI) = 1 P(gaI) = 0.038461 I(gaI) = 4.69
8 E(gI) = 1 P(gI) = 0.038461 I(gI) = 4.69
9 E(gi) = 1 P(gi) = 0.038461 I(gi) = 4.69
10 E(gOΩ) = 1 P(gOΩ) = 0.038461 I(gOΩ) = 4.69
11 E(gΩ) = 1 P(gΩ) = 0.038461 I(gΩ) = 4.69
Table A.20 Probabilistic values of phonemic subgroups of the phonemic set Or.
Phonemic set Or , n(r) = 24
Sequence | Phonemic subgroup & number of its occurrence | Localised probability | Self-information [bit]
1 E(r∂) = 11 P(r∂) = 0.458333 I(r∂) = 1.13
2 E(rε) = 4 P(rε) = 0.166666 I(rε) = 2.58
3 E(ri) = 4 P(ri) = 0.166666 I(ri) = 2.58
4 E(ru) = 2 P(ru) = 0.08333 I(ru) = 3.58
5 E(rOΩ) = 1 P(rOΩ) = 0.041666 I(rOΩ) = 4.58
6 E(rΛ) = 1 P(rΛ) = 0.0416666 I(rΛ) = 4.58
7 E(ræ) = 1 P(ræ) = 0.0416666 I(ræ) = 4.58
Table A.21 Probabilistic values of phonemic subgroups of the phonemic set Oa.
Phonemic set Oa , n(a) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(a-) = 12 P(a-) = 0.631578 I(a-) = 0.662
2 E(aI) = 2 P(aI) = 0.105263 I(aI) = 3.24
3 E(at∫) = 1 P(at∫) = 0.052631 I(at∫) = 4.24
4 E(af) = 1 P(af) = 0.052631 I(af) = 4.24
5 E(am) = 1 P(am) = 0.052631 I(am) = 4.24
6 E(as) = 1 P(as) = 0.052631 I(as) = 4.24
7 E(at) = 1 P(at) = 0.052631 I(at) = 4.24
Table A.22 Probabilistic values of phonemic subgroups of the phonemic set OЭ.
Phonemic set OЭ , n(Э) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(Э-) = 6 P(Э-) = 0.315789 I(Э-) = 1.66
2 E(Эb) = 4 P(Эb) = 0.210526 I(Эb) = 2.24
3 E(Эg) = 4 P(Эg) = 0.210526 I(Эg) = 2.24
4 E(Эd) = 3 P(Эd) = 0.157894 I(Эd) = 2.66
5 E(ЭL) = 2 P(ЭL) = 0.105263 I(ЭL) = 3.24
Table A.23 Probabilistic values of phonemic subgroups of the phonemic set Oj.
Phonemic set Oj , n(j) = 19
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(ju) = 14 P(ju) = 0.736842 I(ju) = 0.44
2 E(jЭ) = 3 P(jЭ) = 0.157894 I(jЭ) = 2.66
3 E(j∂) = 1 P(j∂) = 0.052631 I(j∂) = 4.24
4 E(jI∂) = 1 P(jI∂) = 0.052631 I(jI∂) = 4.24
Table A.24 Probabilistic values of phonemic subgroups of the phonemic set Ot∫.
Phonemic set Ot∫ , n(t∫) = 16
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(t∫æ) = 7 P(t∫æ) = 0.4375 I(t∫æ) = 1.19
2 E(t∫a) = 6 P(t∫a) = 0.375 I(t∫a) = 1.41
3 E(t∫έ) = 1 P(t∫έ) = 0.0625 I(t∫έ) = 4
4 E(t∫ε) = 1 P(t∫ε) = 0.0625 I(t∫ε) = 4
5 E(t∫aI) = 1 P(t∫aI) = 0.0625 I(t∫aI) = 4
Table A.25 Probabilistic values of phonemic subgroups of the phonemic set OL.
Phonemic set OL , n(L) = 16
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(LaI) = 5 P(LaI) = 0.3125 I(LaI) = 1.67
2 E(LΩ) = 3 P(LΩ) = 0.1875 I(LΩ) = 2.41
3 E(Li) = 3 P(Li) = 0.1875 I(Li) = 2.41
4 E(Lε) = 1 P(Lε) = 0.0625 I(Lε) = 4
5 E(LeI) = 1 P(LeI) = 0.0625 I(LeI) = 4
6 E(LOΩ) = 1 P(LOΩ) = 0.0625 I(LOΩ) = 4
7 E(Læ) = 1 P(Læ) = 0.0625 I(Læ) = 4
8 E(La) = 1 P(La) = 0.0625 I(La) = 4
Table A.26 Probabilistic values of phonemic subgroups of the phonemic set OΛ.
Phonemic set OΛ , n(Λ) = 14
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(Λð) = 4 P(Λð) = 0.285714 I(Λð) = 1.8
2 E(Λp) = 4 P(Λp) = 0.285714 I(Λp) = 1.8
3 E(Λn) = 4 P(Λn) = 0.285714 I(Λn) = 1.8
4 E(Λs) = 1 P(Λs) = 0.071428 I(Λs) = 3.8
Table A.27 Probabilistic values of phonemic subgroups of the phonemic set Oθ.
Phonemic set Oθ , n(θ) = 11
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(θr) = 5 P(θr) = 0.454545 I(θr) = 1.14
2 E(θI) = 2 P(θI) = 0.181818 I(θI) = 2.64
3 E(θæ) = 2 P(θæ) = 0.181818 I(θæ) = 2.64
4 E(θaΩ) = 1 P(θaΩ) = 0.090909 I(θaΩ) = 3.64
Table A.28 Probabilistic values of phonemic subgroups of the phonemic set OaΩ.
Phonemic set OaΩ , n(aΩ) = 8
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(aΩt) = 4 P(aΩt) = 0.5 I(aΩt) = 1
2 E(aΩ∂) = 4 P(aΩ∂) = 0.5 I(aΩ∂) = 1
Table A.29 Probabilistic values of phonemic subgroups of the phonemic set Odξ.
Phonemic set Odξ , n(dξ) = 8
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(dξÞ) = 2 P(dξÞ) = 0.25 I(dξÞ) = 2
2 E(dξΛ) = 2 P(dξΛ) = 0.25 I(dξΛ) = 2
3 E(dξЭI) = 2 P(dξЭI) = 0.25 I(dξЭI) = 2
4 E(dξn) = 1 P(dξn) = 0.125 I(dξn) = 3
5 E(dξε) = 1 P(dξε) = 0.125 I(dξε) = 3
Table A.30 Probabilistic values of phonemic subgroups of the phonemic set OOΩ.
Phonemic set OOΩ , n(OΩ) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(OΩL) = 1 P(OΩL) = 0.250 I(OΩL) = 2
2 E(OΩd) = 1 P(OΩd) = 0.250 I(OΩd) = 2
3 E(OΩv) = 1 P(OΩv) = 0.250 I(OΩv) = 2
4 E(OΩn) = 1 P(OΩn) = 0.250 I(OΩn) = 2
Table A.31 Probabilistic values of phonemic subgroups of the phonemic set OaI.
Phonemic set OaI , n(aI) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(aI-) = 3 P(aI-) = 0.750 I(aI-) = 0.415
2 E(aI∂) = 1 P(aI∂) = 0.250 I(aI∂) = 2
Table A.32 Probabilistic values of phonemic subgroups of the phonemic set Ov.
Phonemic set Ov , n(v) = 4
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(vε) = 3 P(vε) = 0.75 I(vε) = 0.415
2 E(v∂) = 1 P(v∂) = 0.25 I(v∂) = 2
Table A.33 Probabilistic values of phonemic subgroups of the phonemic set Oέ.
Phonemic set Oέ , n(έ) = 3
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(έL) = 2 P(έL) = 0.666666 I(έL) = 0.58
2 E(έθ) = 1 P(έθ) = 0.333333 I(έθ) = 1.58
Table A.34 Probabilistic values of phonemic subgroups of the phonemic set OeI.
Phonemic set OeI , n(eI) = 1
Seq.  Phonemic subgroup & occurrence count E  Localised probability P  Self-information I [bit]
1 E(eIdξ) = 1 P(eIdξ) = 1 I(eIdξ) = 0