
National Library of Canada
Bibliothèque nationale du Canada

Acquisitions and Bibliographic Services Branch
Direction des acquisitions et des services bibliographiques

395 Wellington Street / 395, rue Wellington
Ottawa, Ontario K1A 0N4

NOTICE

The quality of this microform is heavily dependent upon the quality of the original thesis submitted for microfilming. Every effort has been made to ensure the highest quality of reproduction possible.

If pages are missing, contact the university which granted the degree.

Some pages may have indistinct print especially if the original pages were typed with a poor typewriter ribbon or if the university sent us an inferior photocopy.

Reproduction in full or in part of this microform is governed by the Canadian Copyright Act, R.S.C. 1970, c. C-30, and subsequent amendments.

AVIS

La qualité de cette microforme dépend grandement de la qualité de la thèse soumise au microfilmage. Nous avons tout fait pour assurer une qualité supérieure de reproduction.

S'il manque des pages, veuillez communiquer avec l'université qui a conféré le grade.

La qualité d'impression de certaines pages peut laisser à désirer, surtout si les pages originales ont été dactylographiées à l'aide d'un ruban usé ou si l'université nous a fait parvenir une photocopie de qualité inférieure.

La reproduction, même partielle, de cette microforme est soumise à la Loi canadienne sur le droit d'auteur, SRC 1970, c. C-30, et ses amendements subséquents.

Canada


Improvement of Acoustic Models in Automatic Speech Recognition Systems

Rafah Aboul-Hosn
School of Computer Science
McGill University, Montreal, Canada

August 23, 1995

A Thesis Submitted to the Faculty of Graduate Studies and Research in Partial Fulfillment of the Requirements for the Degree of Masters in Computer Science.

© 1995 Rafah Aboul-Hosn

National Library of Canada
Bibliothèque nationale du Canada

Acquisitions and Bibliographic Services Branch
Direction des acquisitions et des services bibliographiques

395 Wellington Street / 395, rue Wellington
Ottawa, Ontario K1A 0N4

The author has granted an irrevocable non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of his/her thesis by any means and in any form or format, making this thesis available to interested persons.

The author retains ownership of the copyright in his/her thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without his/her permission.

L'auteur a accordé une licence irrévocable et non exclusive permettant à la Bibliothèque nationale du Canada de reproduire, prêter, distribuer ou vendre des copies de sa thèse de quelque manière et sous quelque forme que ce soit pour mettre des exemplaires de cette thèse à la disposition des personnes intéressées.

L'auteur conserve la propriété du droit d'auteur qui protège sa thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation.

ISBN 0-612-12153-4

Canada

Abstract

This thesis explores the use of efficient acoustic modeling techniques to improve the performance of automatic speech recognition (ASR) systems. The principal idea behind this study is that the pronunciation of a word is not only affected by the variability of the speaker and the environment, but largely by the words that precede and follow it. Hence, to accurately represent a language, one should model the sounds in context with other sounds. Furthermore, due to the large number of models produced when every sound is represented in every context it can appear in, one needs to use a clustering technique by which sounds of similar properties are grouped together so as to limit the number of models needed. The aim of this research is twofold: the first is to explore the effects of using context dependent models on the performance of an ASR system, the second is to combine the context dependent models pertaining to a specific sound in a complex structure to produce a context independent model containing contextual information. Two such complex structures are designed and their performance is tested.

Résumé

Le sujet de cette thèse est l'amélioration de la performance d'un système automatique de reconnaissance de la parole (RAP) par l'utilisation de techniques efficaces pour la modélisation acoustique. L'idée principale de cette étude est que la prononciation d'un mot n'est pas seulement influencée par le locuteur et l'environnement, mais surtout par les mots qui le précèdent et ceux qui le suivent. Par conséquent, pour pouvoir bien représenter une langue, on doit représenter les sons dans leurs contextes. Malheureusement, le nombre de modèles qu'on obtient si on associe un modèle à chaque contexte est trop grand, ce qui rend le système inefficace. Afin de réduire le nombre de modèles requis, on utilise des techniques de regroupement des modèles contextuels. Le but de cette recherche est premièrement d'étudier l'effet des modèles contextuels sur la performance d'un système automatique pour la reconnaissance de la parole et deuxièmement d'intégrer les modèles contextuels appartenant à un son dans des structures complexes. Deux structures sont ainsi développées et leur effet sur la performance d'un système est évalué.

Acknowledgement

I would like to thank my advisor Dr. Renato DeMori, whose help and guidance helped me greatly throughout my research period.

I would also like to extend my appreciation and gratitude to all the members of the Speech Group at the School of Computer Science at McGill for all their helpful hints and comments.

Special thanks to Michael Galler and Charles Snow, my two mentors and friends, for all their patience and guidance.


Contents

1 Introduction
  1.1 Applications of Speech Recognition
    1.1.1 Telecommunication
      1.1.1.1 Automating Some Operator Services
      1.1.1.2 Accessing Information Over the Telephone
    1.1.2 Consumer Products
  1.2 Motivation and Outline

2 Speech Generation and Phonetics
  2.1 Overview of Human Speech Generation
  2.2 Speech Production and Acoustic-Phonetics
    2.2.1 Physiology of the Speech System
    2.2.2 Vowels, Consonants and Glides
    2.2.3 Manner of Articulation
    2.2.4 Place of Articulation
      2.2.4.1 Consonants
      2.2.4.2 Vowels
    2.2.5 Acoustic-Phonetics
      2.2.5.1 Acoustic Properties of Phonemes

3 Architecture of an ASR System
  3.1 Introduction
  3.2 Signal Modeling Techniques
    3.2.1 Sampling and Spectral Shaping
    3.2.2 Feature Extraction
      3.2.2.1 Fast Fourier Transform
    3.2.3 Discrete vs Continuous Models
  3.3 Statistical Approach to Recognition
    3.3.1 Acoustic Modeling Using HMMs
      3.3.1.1 Markov Chain
      3.3.1.2 Hidden Markov Models
      3.3.1.3 Parameters of an HMM
      3.3.1.4 Structure of an HMM
      3.3.1.5 Types of HMMs
    3.3.2 Using HMMs for Training and Recognition
      3.3.2.1 Overview of Training
      3.3.2.2 Overview of Recognition

4 Training and Recognition Algorithms
  4.1 Introduction
  4.2 The Fundamental Problems for HMM Design
  4.3 Problem 1: Calculating P(O | λ)
    4.3.1 Basic Computation
    4.3.2 The Forward-Backward Algorithm
  4.4 Problem 2: Finding an Optimal Path
    4.4.1 The Viterbi Algorithm
    4.4.2 Recognition Using the Viterbi Algorithm
    4.4.3 The Viterbi Beam Search Algorithm
  4.5 Problem 3: Estimating the Parameters of an HMM
    4.5.1 Maximum Likelihood Estimation Method
    4.5.2 MLE Method with Multiple Sentences
    4.5.3 Estimating the Output Distributions of a CDHMM
  4.6 Implementation Considerations
    4.6.1 Initializing the HMM Models
    4.6.2 Insufficient Training Data
    4.6.3 Underflow Problems

5 State of the Art in Speech Recognition
  5.1 Availability of Large Training Data Sets
  5.2 Channel Noise Reduction
  5.3 Speaker Adaptation for Speaker Independent Systems
  5.4 Language Models
  5.5 Acoustic Modeling
    5.5.1 Modeling Non-Speech Sounds
    5.5.2 Using HMMs to Recognize Non-linguistic Features
    5.5.3 Using Context Dependent Models

6 Experiments with CD Models
  6.1 Overview
  6.2 An Overview of Roger
  6.3 The TIMIT Corpus
  6.4 Designing Context Independent Models
    6.4.1 Optimizing the Topology
    6.4.2 Training and Recognition with CI Models
      6.4.2.1 Initialization
      6.4.2.2 Recognition Results
      6.4.2.3 Effect of Phoneme Bigram Weights
  6.5 Designing Context Dependent Models
    6.5.1 Clustering Techniques
    6.5.2 Creating and Clustering the Allophones
      6.5.2.1 Assembling and Pruning the Allophones
      6.5.2.2 Clustering the Allophones
    6.5.3 Training and Recognition Using CD Models
      6.5.3.1 Initialization
      6.5.3.2 Building the Phoneme Bigram Model
      6.5.3.3 Recognition Results
      6.5.3.4 Effect of Using Phoneme Bigram Weights
  6.6 Merging CD Models to Form CI Models
    6.6.1 CD Models in Parallel
      6.6.1.1 Results
    6.6.2 A Form of State Clustering
      6.6.2.1 Results

7 Conclusion and Future Work

List of Figures

2.1 Overview of the speech organs (adapted from [OGrady 87])

3.1 Example of a simple ASR system
3.2 Example of a Markov Model
3.3 A five state, left-to-right HMM model

4.1 Example of a trellis, adapted from [DeMori]
4.2 Training with multiple observations

6.1 Topology used in [Schwa 85]
6.2 Topology used in [Lee 89]
6.3 HMM topologies used
6.4 Parallel structure for the central phoneme "aa"
6.5 Tied state structure for the central phoneme "aa"

List of Tables

2.1 List of the English Phonemes
2.2 English Phonemes and their corresponding features

6.1 Recognition using CI models
6.2 Effect of phoneme bigram weights on CI models
6.3 Clusters used for the CD models
6.4 Recognition using CD models
6.5 Improvement in recognition using CD models
6.6 Effect of phoneme bigram weights on CD models
6.7 Improvement in recognition using CD models with bigram weights
6.8 Recognition using allophones combined in a parallel manner
6.9 Effect of bigram weights on the parallel structured models
6.10 Recognition using state tying between allophones
6.11 Effect of bigram weights on the tied state models
6.12 Overall results using a phoneme bigram weight of 4

Chapter 1

Introduction

Although research in voice processing has been carried out for decades, beginning in 1990 the combination of powerful, inexpensive workstations and improved algorithms for speech decoding stimulated the use of speech technology in a variety of applications such as telecommunication, multimedia, and a wide range of consumer products.

Nowadays, research in voice processing covers four main domains: voice synthesis, in which the machine transforms text into a synthesized voice message and transmits it; speech recognition, in which the machine is capable of "understanding" the human voice and can thus act upon the speech it understood; speaker recognition, in which the machine identifies a person from his/her voice; and finally natural language processing, in which the machine can understand the message uttered and can then translate it to another language.

The applications of automatic speech recognition or ASR systems are numerous, but by far, the telephone industry remains the principal test bed and implementation source of such systems (for example BNR, AT&T, NYNEX). The next section will review the main applications of speech recognition and especially its usage in the telecommunication area.

1.1 Applications of Speech Recognition

1.1.1 Telecommunication

As the telephone industry evolves in the coming years to provide easy to use and efficient products to its customers, several technologies will become more and more valuable. One of the principal technologies is speech recognition. In fact, in 1994, the projected voice processing market was over $1.5 billion, and its estimated growth is around 30% per year [Wilpon 94]. Indeed, nowadays, the principal telecommunication companies around the world are using some form of automatic speech recognition in their products. Following are a few samples of what is currently available on the market.

1.1.1.1 Automating Some Operator Services

The task of automating part of the telephone conversation usually destined to an operator, such as billing functions (collect calls, calling cards, person-to-person and bill-to-third-party), was first investigated by AT&T in 1985. The driving force, at that time, was to reduce the workload of operators by providing a simple ASR system capable of accurately distinguishing words from a small vocabulary and acting upon them. Early results in 1986 and 1987 of such systems proved quite promising [Wilpon 88].

The first commercial product, called Automated Alternate Billing Services or AABS, was put on the market in 1989. It was developed by Bell Northern Research (BNR) and it consisted of a very simple speech recognizer capable of very accurately recognizing the words yes/no in different pronunciations [Lennig 90]. Combined with the Touch Tone service, the ASR system automated the answers of customers when asked about accepting collect calls, or when charging calls to a third number.

However, it was only in 1992 that a system capable of recognizing more words was put on the market by AT&T. The system was called Voice Recognition Call Processing or VRCP and it fully automated the billing functions described previously. This product used a technique called word spotting that enables the system to recognize key words in a sentence. This meant that the system could accurately recognize phrases such as "Oh, please, could I possibly make a collect call to Mr Doe", or "Hi, I would like to make a collect call please", by keying on the word collect and ignoring the rest [LeeC 90b]. This technique proved to be very successful and according to [LeeC 90b] it accurately recognizes 95% of all calls that can be automated.

This year, 1995, BNR released their Automated Directory Assistance Service ADAS, which uses yet another technique called Flexible Vocabulary Recognition or FVR [Lennig 92]. This method relies on entering the words uttered by the customer as a sequence of subword units (like phonemes) and then using pattern matching techniques to find the sequence of units that matches the uttered sequence in a pronunciation dictionary. This way, theoretically, thousands of words can be recognized. This service allows a person to obtain telephone numbers via an ASR system using the FVR technique: the caller first states the language he/she would like to converse in, then the system asks the customer (in the selected language) to give the city name, the system recognizes the city name and asks the caller which listing category (residential or commercial) she/he needs, and the listing is also automatically recognized. In the case where the listing is local, the system can further be used to recognize a selection of frequently asked listings. The information gathered by the ADAS is then transmitted to the computer terminal of a human operator who handles the final stages of the call.

1.1.1.2 Accessing Information Over the Telephone

In 1981, NTT developed an Automatic Answer Network System for Electrical Requests, ANSER, that is used to gather banking information (account statements, balance, etc.) via a voice processing system that combines both speech recognition and voice synthesis [Wilpon 94]. The system is composed of a 16 word lexicon¹ and 10 Japanese digits and permits the customer to ask questions about 600 Japanese banks spread across 70 Japanese cities. The system is also speaker independent and uses isolated word detection. It is fully interactive, recognizing the customer's request and replying back. One of the key advantages of the service is its ability to fully interact with rotary dial phones as well as Touch Tone.

¹A lexicon is used to contain the phonetic transcription of the words in the vocabulary.

Recently, BNR released another product, called StockTalk, that allows customers to inquire about the stocks of companies listed on the NASDAQ, Toronto and New York Stock Exchanges. The ASR system used is speaker independent, and uses subword detection. The caller is first asked to say which stock exchange she/he requires, the system recognizes the name and then asks the person to say which stocks she/he wants to inquire about. Then the information acquired is passed to Telerate (the computerized stock quotation service) and the system gets the information needed; the logic module of StockTalk then parses the information and transforms it into English text, which is synthesized and transmitted to the caller.

1.1.2 Consumer Products

Along with the telecommunication market, many other consumer products are taking advantage of ASR systems. In [Obert 94], it is suggested that the speech recognition consumer market has an average growth of 40% per year, and that an estimated $2 billion will be invested in speech technology by the end of the decade. Already, numerous computer companies have incorporated some form of ASR systems in their applications. Others have produced voice activated home appliances such as VCR and TV remote controls. In other areas, researchers are integrating speaker independent ASR systems in flight simulators and in air traffic control systems [Gall 92].

1.2 Motivation and Outline


The system described in this thesis is a continuous, speaker independent, automatic speech recognition system whose "ultimate" goal is to be capable of understanding continuous speech from a speaker irrespective of his/her age, gender, and the environment in which he/she is speaking (quiet or noisy). This area of research has captured a lot of interest because of its vast applications in industry, and although the "ultimate" goal is still far fetched, technological advances and more efficient techniques in speech are constantly reducing the gap between the perfect system and the current state of the art in ASR systems.

The motivation behind the research conducted in this thesis is the development of more efficient techniques to model the sounds of the language; this is called acoustic modeling and it will be fully described in the following chapters. Nowadays, the important improvements in the performance of ASR systems are delivered by improved acoustic modeling. The idea that the pronunciation of a word is not only affected by the variability of the speaker and the environment, but largely by the words that precede and follow it, has led researchers to model sounds in context with other sounds [Schwa 85] [Lee 90b]. Furthermore, due to the large number of models produced when every sound is represented in every context it can appear in, researchers developed clustering techniques by which sounds of similar properties are grouped together so as to limit the number of models needed [Ljol 94] [Young 94] [DeMori 95]. These two ideas form the basis of the experiments performed in this thesis, in which new approaches to acoustic modeling and context clustering are investigated. These are fully described in chapter 6.

Building efficient ASR systems is a complex task because of the interdisciplinary nature of the speech problem. One can think of speech processing by machine as an amalgamation of many different fields, ranging from anatomy, which provides insights on which organs humans use to communicate, to linguistics, which describes the properties of the sounds created, to engineering, with which one can represent the acoustic properties of sounds and determine methods by which these properties can be extracted from the signal, to statistical analysis techniques, which provide the essence of recognition, and computer science, with which all the previous principles are combined in efficient algorithms to produce what is called automatic speech recognition systems.

The material in this thesis is organized in 7 chapters. The first three chapters describe the main principles involved in the implementation of ASR systems: chapter 2 gives an overview of speech generation in humans and some linguistic background; chapter 3 is divided into two main parts: the first part describes how linguistic knowledge is combined with signal processing techniques to extract perceptually meaningful parameters from the signal, and the second part describes the statistical approach to recognition in which stochastic processes are used to model the sounds of a language. Chapter 4 gives a detailed description of the algorithms that implement recognition and training using stochastic processes. Chapter 5 presents the main factors which lead to the increase in performance of ASR systems. Chapter 6 describes the experiments conducted and finally chapter 7 concludes the thesis work.

Chapter 2

Overview of Speech Generation and Phonetics

Understanding how humans communicate with each other and the properties of the sounds they produce provides researchers in this area with ideas on how to simulate the human speech process by machine.

This chapter attempts to give some of the background theory necessary for ASR implementations. It is divided into two main parts: the first part gives an overview of the stages the speech signal goes through until it is pronounced by the speaker, the second part describes some basic principles in linguistics, mainly the physiological aspects of speech production or articulatory phonetics and the physical properties of sounds or acoustic phonetics.


2.1 Overview of Human Speech Generation

As automatic speech recognition systems try to mimic speech production and perception of humans, in order to understand such systems, one needs to first understand how humans use their brains and speech organs to communicate with each other. During a conversation between two people, the speaker first decides on what he/she wants to say in his/her brain, then chooses the words he/she would like to express his/her thoughts in, along with the loudness and pitch of his/her voice. Next, the speaker's neurological system responsible for the muscle movements tells the vocal cords when to vibrate and informs the rest of the speech organs of the positions they have to assume in order to produce the sequence of words. Finally, the sentence is uttered; the speech produced is in the form of air waves that travel to the listener's ear where the inverse process takes place: first the ear performs some spectral analysis on the incoming signal, then the neurological system "extracts" the features out of the signal coming from the ear, the brain then interprets these features and finally the listener understands the words.

2.2 Speech Production and Acoustic-Phonetics

Although humans can produce an infinite number of speech sounds or phones, each language can be characterized by a finite set of abstract linguistic units called phonemes; table 2.1 gives an example of the English phonemes. Phonemes provide a language with an alphabet of sounds from which all words pertaining to this language can be uniquely described.

Phoneme  Example   Phoneme  Example   Phoneme  Example
iy       heed      l        led       t        lot
ih       bit       r        race      k        kick
eh       bet       y        yet       z        zebra
ae       had       w        wet       v        very
ix       roses     er       bird      f        five
ax       the       en       mutton    th       thing
ah       mud       m        mom       s        sis
uw       boot      n        noon      sh       shoe
uh       hood      ng       sing      hh       help
oy       boy       d        dad       zh       measure
aw       bough     g        go        dx       butter
ow       hoed      p        pop       el       bottle
ao       bought    ch       church    sil      -
aa       hod       jh       judge     epi      -
ey       bait      dh       then      -        -
ay       hide      b        bob       -        -

Table 2.1: List of the English Phonemes

Allophones describe a class of phones pertaining to a specific variant of a phoneme. Due to the non discrete nature of the vocal tract, and its ability to vary in many ways, an infinite number of phones can correspond to a specific phoneme. There are numerous sources of variability: different people have different pronunciations for the same phoneme, repeated pronunciations of the same phoneme by the same speaker produce different phones, and finally phonemes vary depending on the context in which they appear. The pronunciation of a phoneme is affected by the phoneme that precedes it and the one that follows it in a word; this effect is called coarticulation. Coarticulation is due to the fact that the articulatory organs do not shift from one position to the other abruptly; rather the transition is quite gradual and the signal slowly changes from the characteristics of the previous sound to the new one.

2.2.1 Physiology of the Speech System

Before reviewing the different classes of sound, it is important to know how

and where speech is generated in the human body. Fig 2.1 displays the speech

organs.

Figure 2.1: Overview of the speech organs (adapted from [OGrady 87])

As was described previously, speech consists of air waves that travel from the speaker's mouth to the listener's ear. In order to produce such air waves one needs an air supply (represented by the lungs), a sound source (represented by the larynx) and a variety of filters that shape the air waves into different sounds (represented by the pharynx and the oral and nasal cavities). The larynx contains the vocal cords (also called the vocal folds) and as air flows from the lungs to the trachea, it passes through the space between the vocal cords called the glottis.

Depending on the state of the vocal cords, the glottis can assume different shapes, thus resulting in different sounds. There are three main glottal states that produce distinctive classes of sounds:

Unvoiced Sounds (such as house, frog): these occur when the vocal cords are pulled apart so there is no constriction as the air flows from the lungs to the trachea. In this case the speech signal consists of noise and is aperiodic.

Whisper Sounds (such as house): these are also unvoiced, and they occur when the front portion of the vocal cords are brought together and the back portion are pulled apart.

Voiced Sounds (all vowels are voiced, as are voiced consonants such as vow): these occur when the vocal cords are brought close together but are not completely closed. As the air from the lungs passes through the narrow glottis, it causes the vocal cords to vibrate periodically; the rate of vibration is referred to as the fundamental frequency (F0). However, since both F0 and the vocal tract shape change often, the signal is not considered periodic but rather quasi-periodic.

Along with the classes described above, phonemes can be classified into three additional classes: vowels, consonants and glides; manner of articulation; and place of articulation. Each of these classes will be described in the following three sections.

2.2.2 Vowels, Consonants and Glides

One can distinguish between vowels and consonants based on articulation and acoustic properties. Glides (such as wet, you), on the other hand, have common features with both vowels and consonants¹.

The first distinction that can be made between vowels and consonants is the shape of the vocal tract during their pronunciation: vowels are all voiced which, as we saw, means that the vocal folds are close together but not constricted; consonants can be voiced and unvoiced, and some of them are produced when the vocal tract is momentarily blocked and then reopened (such as pop). Vowels are also more sonorant than consonants, that is we perceive them as louder and longer; this is a result of the difference in articulation.

Vowels are further divided into two classes: simple vowels, in which the vowel doesn't show a noticeable change in quality when pronounced as in sit, dad, mug, and diphthongs, which are vowels that exhibit a change due to the movement of the tongue away from the initial vowel towards a glide position as in boy, may.

¹Refer to table 2.2 for the list of vowels, glides and consonants.

Glides fall somewhere in between the two other classes: they are pronounced as vowels but they either move quickly to another articulation as in yet or wet, or stop abruptly as in boy and now. Although glides are perceived by the auditory system as quickly articulated vowels, they act as consonants. Glides are sometimes referred to as semi-vowels or semi-consonants.

2.2.3 Manner of Articulation

Manner of articulation refers to the position of the glottis, lips, tongue, and velum during phoneme production (refer to fig 2.1 of the speech organs). For example, when the velum is lowered, air flows through the nostrils producing nasal sounds such as none or maim; stops or plosive sounds, such as pop and bib, come about when the vocal tract is completely blocked for a moment and then reopened so that the constricted air bursts out, creating this "explosive" sound; liquids, such as lama and roar, are like vowels, however, in this case, the tongue is used as an obstruction in the oral tract which causes air to deflect around the tip; fricatives, such as frog and van, are characterized by a continuous airflow through the mouth, but the vocal cords are so close together that during their production, continuous noise is produced; if the noise has a high amplitude, these sounds are called strident fricatives; when a stop precedes a fricative, the sound is called affricative as in church and jump.

Table 2.2 shows the English phonemes with their manner and place of articulation.

2.2.4 Place of Articulation

The place of articulation is considered one of the most important classifications for phonemes because it enables a finer distinction between the different sounds. Although languages may share common voicing and manner of articulation, the place of articulation varies largely. Place of articulation is mostly associated with consonants because they use a relatively narrow constriction; however vowels can also be subdivided into classes based on the tongue position, as will be seen in subsequent sections.

2.2.4.1 Consonants

Eight regions in the vocal tract are associated with consonant production; refer to fig 2.1.

Labial: constriction occurs at the lips. If both lips are constricted, the sound is called bilabial; if the sound involves the lower lip and the upper teeth it is referred to as labiodental.

Dental: tip of the tongue touches the back of the incisor. If the tip protrudes between the teeth the sound is called interdental.

Alveolar: tip of the tongue approaches or touches the alveolar ridge (a small ridge protruding from behind the upper front teeth).

Palatal: the tongue blade constricts with the hard palate.

Velar: the tongue is close to the velum (soft area towards the rear of the roof of the mouth).

Uvular: tongue approaches or touches the uvula (fleshy flap of tissue that hangs from the velum).

Pharyngeal: the pharynx is constricted.

Glottal: vocal tract is either constricted or it is completely closed.

The place of articulation for the English consonants is shown in table 2.2.

2.2.4.2 Vowels

In vowels, variation in place of articulation is represented by different positions of the tongue and lips. The tongue can assume a combination of heights and positions: low, mid, high and front, central, back. The first three represent the height of the tongue while the last three represent its position. The lips can be either rounded or unrounded (place of articulation for the different vowels and diphthongs is presented in table 2.2).

Phoneme  Voiced  Manner of Articulation  Place of Articulation
iy       yes     vowel                   high front tense
ih       yes     vowel                   high front lax
ey       yes     vowel                   mid front tense
eh       yes     vowel                   mid front lax
ae       yes     vowel                   low front tense
aa       yes     vowel                   low back lax
ao       yes     vowel                   mid back lax rounded
ow       yes     vowel                   mid back tense rounded
uh       yes     vowel                   high back lax rounded
uw       yes     vowel                   high back tense rounded
er       yes     vowel                   mid tense (retroflex)
ah       yes     vowel                   mid back lax
ax       yes     vowel                   mid lax (schwa)
ay       yes     diphthong               low back to high front
aw       yes     diphthong               low back to high back
oy       yes     diphthong               mid back to high front
y        yes     glide                   front unrounded
w        yes     glide                   back rounded
l        yes     liquid                  alveolar
r        yes     liquid                  retroflex
m        yes     nasal                   labial
n        yes     nasal                   alveolar
f        no      fricative               labiodental
v        yes     fricative               labiodental
th       no      fricative               dental
dh       yes     fricative               dental
s        no      fricative               alveolar strident
z        yes     fricative               alveolar strident
sh       no      fricative               palatal strident
zh       yes     fricative               palatal strident
hh       no      fricative               glottal
p        no      stop                    labial
b        yes     stop                    labial
t        no      stop                    alveolar
d        yes     stop                    alveolar
k        no      stop                    velar
g        yes     stop                    velar
ch       no      affricative             alveopalatal
jh       yes     affricative             alveopalatal

Table 2.2: English Phonemes and their corresponding features


2.2.5 Acoustic-Phonetics


In the previous sections, phonemes were classified on an articulatory basis; in this section, they are differentiated based on their acoustic properties. The aim in this section is to investigate the waveform and spectral properties of each phoneme, and to assign to each one some common acoustic aspects.

A signal can be represented in both time and frequency domains [Opp 89]. Although the time domain representation encodes all the information needed, it is often too hard to interpret because two sounds that may appear identical to the auditory system might have two different time plots. Most acoustic features of speech sounds are more apparent in the frequency domain, thus the use of a wideband spectrogram for analysis. A spectrogram transforms the time domain representation of a signal into its frequency domain, and plots it in a three dimensional way (time vs frequency vs amplitude). It is mostly used in speech to examine formant frequencies, duration of acoustic segments and their periodicity.

Following is a brief overview of the main acoustic properties of phonemes. These properties are believed to be the cues upon which the human auditory system distinguishes between sounds [OShaug 87]; however, although necessary, these properties are not sufficient to map the signal to a phonemic string due to speaker and environment variabilities and the context in which the phonemes appear.



2.2.5.1 Acoustic Properties of Phonemes


Vowels (simple and diphthongs) usually have the largest amplitude and longest duration compared to other phonemes. As was discussed earlier, vowels cause the vocal tract to vibrate in a quasi-periodic manner; this results in the energy being concentrated in spectral lines at multiples of the fundamental frequency F0. Vowels are primarily distinguished by the location of the first three formant frequencies, F1, F2, and F3. Usually, front vowels have high F2 and F3, while mid vowels tend to show well separated and balanced locations of formants, and finally the back vowels seem to have low F1 and F2.

Glides and liquids are very similar to vowels in that they are also sonorant and produce periodic signals. Glides tend to be transient, with a steady spectrum that has a shorter duration than vowels. Liquids also have very similar spectra to the vowels, but they normally have lower amplitudes.

Nasals show a sharp change in the intensity and spectral features of a vowel, due to the entry of air into the nasal cavity. They are characterized by resonances that are more highly damped than those of vowels.

Fricatives (and stops) have a very different spectrum than the sonorants mentioned above: they are aperiodic, less intense (because the constriction of the vocal tract causes energy loss), and most of their energy is generally concentrated in the high frequencies. Unvoiced fricatives are produced by exciting the vocal tract by a steady air flow which becomes turbulent at the point of constriction. They exhibit a highpass spectrum and are shorter in duration than fricative sounds. Voiced fricatives use two acoustic sources, a periodic glottal source and the usual noise generated when the vocal tract is constricted. The noise amplitude varies between different voiced fricatives: the non-strident fricatives (low noise component) show an almost periodic signal and a spectrum similar to a weak version of glides; strident fricatives, on the other hand, show a large noise energy concentrated at high frequencies.

Stops are highly influenced by the vowel that follows them, so they are more difficult to distinguish. Unlike all other classes of phonemes described so far, stops are a transient rather than a steady-state phenomenon. They are usually characterized by a long period of silence (during the constriction of airflow) followed by a sudden increase in amplitude (when the vocal tract is reopened and air flows out). When air is released, turbulent noise (referred to as frication) continues at the opening of the constriction for about 10-40 ms. On average, unvoiced stops have a longer frication than voiced stops.

One has to mention, of course, that none of these observations holds true all the time. Spectral analysis shows a large variation among different speakers and there is generally an overlap between formants across different pronunciations [Rabi 93]; these factors, accompanied by the coarticulation effect², complicate the task of automatically identifying phonemes from their spectral properties.

²Coarticulation causes allophones to have different spectra from the phone they represent.

Chapter 3

Architecture of a Speech Recognition System

3.1 Introduction

This chapter explores the foundations of automatic speech recognition (ASR) systems. As was discussed previously, implementing such systems goes beyond computer science; it involves principles from a variety of fields such as anatomy, linguistics, signal processing, pattern recognition, etc. The following sections describe some of the building blocks of ASR systems and the main principles underlying their implementation.

Speech recognition by machine undergoes two main phases, as can be seen from fig 3.1: the signal modeling phase, in which the analogue signal is converted into a digital form, then fed to the feature extractor that uses spectral analysis techniques to produce a parametric representation of the signal, and the recognition phase, in which statistical modeling and search techniques are used to hypothesize the most likely word that was initially pronounced.

The aim in the first phase is to extract from the input signal features that are similar to those used by the human auditory system to distinguish between different sounds. In the second phase, the aim is, given these features, to determine the most likely sequence of words that was spoken. Hence, recognition relies heavily on the feature vector produced during the first phase: the more perceptually meaningful the features, the better the recognition.

Chapter 2 reviewed the first building block of ASR design, which is speech generation by humans and the acoustical properties of sounds. The first part of this chapter explores techniques by which this linguistic knowledge can be combined with spectral analysis algorithms to produce a meaningful feature vector. The second part reviews the statistical modeling approach to speech recognition. It is important to note here that the sections that follow describe the techniques used in Roger, our ASR system at McGill University; however there are alternative methods both in extracting the features (as in the use of Linear Prediction Coding [OShaug 87]) and in recognizing the words (as in the use of Artificial Intelligence strategies or different pattern classification techniques [Rabi 93]).

Figure 3.1: Example of a simple ASR system

3.2 Signal Modeling Techniques

Signal modeling represents the front end of all speech recognition systems and it plays a major role in determining the efficiency and robustness of recognition. This phase can be divided into three main parts: sampling, spectral shaping, and feature extraction. The first two operations are simple signal processing techniques [Opp 89]; however, the third represents a critical point in ASR system design.

This section is divided into two parts: a first part describing the procedure by which a speech signal is acquired, digitized, and conditioned before processing by the feature extractor, followed by a second part describing the feature extraction phase.

3.2.1 Sampling and Spectral Shaping

The first step in speech recognition is converting the incoming analog signal into a digital signal. There are two critical parts in this phase: the A/D conversion (the conversion from analog to digital) and the digital filtering (emphasizing important frequency components in the signal).

The job of the A/D converter is to take a continuous signal and digitize it by sampling it at regular intervals and assigning signed integer values to the samples. The samples are then grouped into frames and fed to the feature extractor.

In order to avoid aliasing, the sampling frequency has to satisfy Nyquist's Sampling Theorem [Opp 89, chap. 3]: given a bandlimited analogue signal x_c(t), then x_c(t) is uniquely determined by its samples x[n] = x_c(nT), n = 0, ±1, ±2, ..., if:

    Ω_s ≥ 2Ω_N    (3.1)

where Ω_s represents the sampling frequency and Ω_N the bandwidth of the input signal; the frequency 2Ω_N is referred to as the Nyquist rate.

Next, the discrete (digitized) signal is pre-emphasized by:

    s_i = s_i − α · s_{i−1}    (3.2)

where s_i denotes sample i, i ranges from 0 to the total number of samples in a particular frame, and α is the pre-emphasis coefficient, typically set to 0.95. The pre-emphasis is used to amplify spectral components above 1 kHz, where human hearing is more sensitive [OShaug 87], thus accentuating certain aspects of the signal that are known to be perceptually significant.
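As an illustration of eq. 3.2, the following sketch applies the pre-emphasis filter to an array of samples. This is a minimal example, not the thesis implementation; the 0.95 coefficient follows the text, while the function and variable names are hypothetical.

```python
import numpy as np

def pre_emphasize(samples, alpha=0.95):
    """First-order pre-emphasis of eq. 3.2: s[i] <- s[i] - alpha * s[i-1],
    leaving the first sample unchanged."""
    samples = np.asarray(samples, dtype=float)
    emphasized = np.copy(samples)
    emphasized[1:] = samples[1:] - alpha * samples[:-1]
    return emphasized
```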

Finally, the samples are grouped into frames which are later processed with sliding windows to extract the features from the signal. Successive window positions overlap by typically 20% to 60% of the frame duration. Determining a window's duration requires making a tradeoff between short windows that provide better time resolution (good for detecting rapid spectral changes) and long windows which allow a better accuracy in the evaluation of spectral features but smooth rapid changes.
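A small sketch of the framing step just described. It is illustrative only: the 25 ms window and 50% overlap are assumed values within the 20-60% range mentioned in the text, not parameters taken from the thesis.

```python
import numpy as np

def frame_signal(samples, sample_rate, frame_ms=25.0, overlap=0.5):
    """Split a sampled signal into overlapping frames.

    frame_ms -- window duration in milliseconds (assumed value)
    overlap  -- fraction of overlap between successive windows
    """
    frame_len = int(sample_rate * frame_ms / 1000.0)
    step = max(1, int(frame_len * (1.0 - overlap)))
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, step)]
    return np.array(frames)
```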

3.2.2 Feature Extraction

There are three driving forces behind the design of efficient feature extractors: the first is to be able to extract parameters that contain as much information as possible about the linguistic content of the acoustical signal; the second is that these features should be robust to variations in speakers (accent, age, gender) and to background and channel noise; and finally the parameters should be able to capture the changes of the signal spectrum with time.

As ASR systems tend to mimic the human speech production/perception mechanism, the first trend in feature extraction techniques involved some form of auditory modeling. The most apparent form of these features are mel-scaled parameters, which are commonly used nowadays [DeMori 95], [Young 94] and [Ljol 94].

The mel scale maps an acoustic frequency f to a "perceptual" frequency scale such as:

    mel_freq = 2595 · log10(1 + f / 700.0)    (3.3)

The mel scale is often approximated as a linear scale from 0 to 1000 Hz and as a logarithmic scale beyond 1000 Hz. One can thus think of eq 3.3 as a transformation of the acoustic frequency scale into a perceptually meaningful linear scale.
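For concreteness, eq. 3.3 can be computed directly; this is a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def hertz_to_mel(f_hz):
    """Map an acoustic frequency in Hz to the mel scale of eq. 3.3."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The mapping is roughly linear below 1000 Hz and logarithmic above.
print(hertz_to_mel(1000.0))   # approximately 1000 mel
```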

Another important set of parameters are dynamic features, introduced by [Furu 86], which lead to very good performance, especially in speaker independent recognition.

Other techniques have also exhibited positive effects on recognition: linear discriminant analysis [Haeb 92], [Haeb 93] and cepstral transformation; these are used to decorrelate parameters and then concentrate useful information into a small set of features.

Finally, in [DeMori 95] two new acoustic features are introduced by considering measurements in the time domain and in broad frequency bands; both these features increased the accuracy of recognition.

The technique used for feature extraction is the Fast Fourier Transform (FFT) and it will be presented in the following section. The feature vector is composed of 26 parameters¹: 12 mel coefficients, 12 first derivatives of the mel coefficients, the energy (per frame) and finally the first derivative of the energy.

¹These are the parameters used in our speech recognition system.

3.2.2.1 Fast Fourier Transform

The Discrete Fourier Transform of a signal is given by:

    S(f) = Σ_{n=0}^{N_s − 1} s(n) · e^{−j 2π f n / f_s}    (3.4)

where f is the frequency of the input signal, f_s is the sampling frequency and N_s denotes the number of samples per window. The spectrum of the signal is defined as |S(f)|.

Usually in real time implementations of ASR systems a Fast Fourier Transform (FFT) is used to compute the spectrum of the signal. An FFT is a more efficient implementation (in terms of speed) of the DFT with the added constraint that the spectrum has to be evaluated at a discrete set of frequencies that are multiples of f_s / N_s. These frequencies are called orthogonal frequencies.
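The following sketch computes the magnitude spectrum |S(f)| of one windowed frame with an FFT, as in eq. 3.4. It is a minimal illustration using NumPy's FFT, not the thesis implementation; the Hamming taper is an assumed choice of the edge-weighting function mentioned later in the text.

```python
import numpy as np

def magnitude_spectrum(frame):
    """Return |S(f)| at the orthogonal frequencies k * fs / Ns, k = 0 .. Ns/2."""
    windowed = frame * np.hamming(len(frame))   # taper the frame edges
    return np.abs(np.fft.rfft(windowed))
```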

The feature vector used in our system is composed of the following:

Energy of window

The energy is calculated for each sliding window. In order to reduce the side effects of samples at the edge of a window, a weighting function is applied to all the samples inside a window such that samples near the edge contribute less to the calculation than those near the center. On windows of such weighted values f(i), the energy of the window is calculated as:

    E_w = Σ_{i=1}^{N_s} f²(i)    (3.5)

First derivative of the energy

The first derivative of the energy can be computed by a simple backward difference method of the form:

    ΔE_w(i) = E_w(i) − E_w(i − 1)    (3.6)

where E_w(i) represents the energy computed for window i. An alternative method is to perform a linear regression:

    ΔE_w(i) = [ Σ_{k=−N_f}^{+N_f} k · (E_w(i + k) − E_w(i − k)) ] / [ 2 · Σ_{k=−N_f}^{+N_f} k² ]    (3.7)

where N_f is the number of frames over which the computation is done. This calculation yields a smoother first order derivative. Note that higher order derivatives (such as second) can be computed and added to the feature vector by applying the previous equations to the lower order (such as first) derivatives.
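A sketch of the energy and delta-energy features of eqs. 3.5-3.7 (illustrative only; the Hamming weighting is one possible edge-weighting function, N_f = 2 is an assumed regression span, and the delta is written over k = 1..N_f, which is algebraically equivalent to the symmetric sum of eq. 3.7):

```python
import numpy as np

def frame_energy(frame):
    """Eq. 3.5: energy of a window after edge weighting."""
    weighted = frame * np.hamming(len(frame))
    return np.sum(weighted ** 2)

def delta(values, n_f=2):
    """Eq. 3.7: linear-regression estimate of the first derivative."""
    values = np.asarray(values, dtype=float)
    padded = np.pad(values, n_f, mode='edge')      # repeat edge values at the ends
    ks = np.arange(1, n_f + 1)
    denom = 2.0 * np.sum(ks ** 2)
    return np.array([np.sum(ks * (padded[i + n_f + ks] - padded[i + n_f - ks])) / denom
                     for i in range(len(values))])
```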

Cepstral Values

Cepstral calculation is part of the homomorphic signal processing techniques introduced in [Opp 89]. The importance of these non linear systems lies in the fact that by using them, one can separate the excitation signal from the vocal tract shape, thus providing a means by which the vocal tract characteristics can be extracted. As was discussed in the first part of this chapter, a speech signal s(n) is produced when air excites the vocal tract; physically this can be represented as a convolution of the vocal tract's impulse response v(n) with the excitation signal g(n):

    s(n) = g(n) ∗ v(n)    (3.8)

Since the two signals have very different spectral characteristics they need to be separated. If one can represent the signal in the log domain, the two signals will be superimposed and thus can be separated using conventional signal processing [Pic 93]. This is how one can proceed. First represent the signal in the frequency domain (i.e. by performing a Fourier transformation on both sides of the equation):

    S(f) = G(f) · V(f)    (3.9)

Then take the complex logarithm of each side:

    log(S(f)) = log(G(f) · V(f)) = log(G(f)) + log(V(f))    (3.10)

The cepstrum is defined to be the inverse transform of the logarithm of the speech spectrum. Since perceptual frequency resolution is approximately linear up to 1 kHz and logarithmic at higher frequencies, in examining the distribution of energy across frequencies for relevant speech cues, the mel scale is used because it exhibits this kind of frequency resolution. The mel cepstral coefficients are given by:

    c(n) = Σ_{k=1}^{F} log10(S_k) · cos[ n (k − 1/2) π / F ],    n = 1, 2, ..., L    (3.11)

where F is the number of filters and L the length of the cepstral vector. Usually, 24 filters are used to extract the first 12 mel cepstral values.

First Derivative of Cepstral Values

The first derivative of the cepstral values can be either calculated by a simple backward difference method as in eq. 3.6 or using a linear regression over several frames as in eq. 3.7.
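Putting eqs. 3.4 and 3.11 together, the sketch below shows how mel cepstral coefficients could be computed from the magnitude spectrum of one frame. It is a simplified illustration, not the thesis front end: rectangular mel bands stand in for the usual triangular filterbank, the filter count F = 24 and L = 12 follow the text, and all names are hypothetical.

```python
import numpy as np

def mel_cepstrum(spectrum, sample_rate, n_filters=24, n_ceps=12):
    """Compute c(n), n = 1..L, per eq. 3.11 from one frame's magnitude spectrum."""
    n_bins = len(spectrum)
    # Band edges equally spaced on the mel scale (eq. 3.3), mapped back to FFT bins.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    edges_hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_filters + 1) / 2595.0) - 1.0)
    edges_bin = np.clip((edges_hz / (sample_rate / 2.0) * (n_bins - 1)).astype(int), 0, n_bins - 1)

    # Filterbank outputs S_k (k = 1..F), floored to avoid taking the log of zero.
    s_k = np.array([np.sum(spectrum[edges_bin[k]:max(edges_bin[k + 1], edges_bin[k] + 1)])
                    for k in range(n_filters)])
    log_s = np.log10(np.maximum(s_k, 1e-10))

    # Eq. 3.11: c(n) = sum_k log10(S_k) * cos[n (k - 1/2) pi / F]
    k = np.arange(1, n_filters + 1)
    return np.array([np.sum(log_s * np.cos(n * (k - 0.5) * np.pi / n_filters))
                     for n in range(1, n_ceps + 1)])
```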

3.2.3 Discrete vs Continuous Models

The feature vectors can be categorized as discrete or continuous. Continuous parameters are the end result of the feature extraction module. Discrete parameters can only take a finite number of values from some symbol alphabet; they are normally generated by applying Vector Quantization [Gray 84] on the continuous parameters.

The aim of vector quantization is to reduce the amount of data coming from the feature extractor, by constructing a codebook (or multiple codebooks) containing a distinct set of feature vectors (or codewords) that are representative of the training data set.

During recognition, when a feature vector is produced by the feature extractor module, the distance of that vector to the nearest codeword in the codebook is calculated and the codeword that has the smallest distance to the produced vector is fed to the statistical modelling module.
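A minimal sketch of the codeword lookup just described. The Euclidean distance and all names are assumptions made for illustration; codebook training (e.g. with the algorithms surveyed in [Gray 84]) is not shown.

```python
import numpy as np

def quantize(feature_vector, codebook):
    """Return the index of the codeword closest to the feature vector.

    codebook -- array of shape (num_codewords, dim) holding the codewords
    """
    distances = np.linalg.norm(codebook - feature_vector, axis=1)
    return int(np.argmin(distances))
```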

To increase the efficiency of the vector quantizer, multiple codebooks are sometimes used, one for each group of features extracted (so in our case one would have a codebook for the energy, another for its derivative, a third for the cepstral coefficients and a fourth for their derivatives).

There are numerous considerations to be taken when designing a vector quantizer, such as the number of codewords, the number of codebooks, and the methods used to initialize and train the codebooks. A more comprehensive discussion of vector quantization can be found in [Gray 84] and [Rabi 93, chap. 3].

The main advantage of vector quantization is reducing the computational complexity and thus improving the speed of the system, and its main disadvantage is the distortion it creates, which may result in poor recognition. This distortion is due to the fact that since there is only a finite set of codewords, choosing the "best" one² to represent the produced feature vector always carries a certain level of quantization error: the greater the distance, the higher the error. This error can be somewhat reduced by having a larger number of codewords, however it cannot be eliminated as long as there is a finite set of codewords. This problem often leads to a tradeoff between a large set of codewords per codebook (decreasing the level of error, but increasing complexity) and a smaller set (increasing speed).

²the one with the shortest distance to the feature vector


3.3 Statistical Approach to Recognition


Speech decoding (recognition) can be regarded as a transformation from a set of parameters (represented by the vector O) to a sound s (where s can be a word, a sub-word or just a phoneme). If the transformation yields a sound s ≠ s_orig (where s_orig is the sound that was originally uttered) then the decoder made an error.

In statistical pattern recognition, the aim is to minimize the probability of that error, so an efficient decoder is one that chooses s such that:

    ŝ = argmax_{1 ≤ s ≤ N_s} P(s | O)    (3.12)

Following Bayes' Rule [Komo 87], eq. 3.12 can be rewritten as:

    ŝ = argmax_{1 ≤ s ≤ N_s} [ P(O | s) · P(s) ] / P(O)    (3.13)

where N_s is the total number of models representing the sounds of the language. P(s | O) is called the a posteriori probability of the model s given the feature vector O, P(s) is the probability of the model representing s, and P(O | s) is the probability of observing the feature vector given the model s.
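In code, the decision rule of eqs. 3.12-3.13 amounts to an argmax over the models; P(O) can be dropped since it does not depend on s. The sketch below is a hypothetical illustration: the log_likelihood method stands in for whatever evaluates P(O | s), for example the forward algorithm described in chapter 4.

```python
import math

def decode(observations, models, prior):
    """Pick the model s maximizing P(O|s) * P(s), per eqs. 3.12-3.13.

    models -- dict mapping a sound label s to an object with a
              log_likelihood(observations) method (hypothetical interface)
    prior  -- dict mapping s to its prior probability P(s)
    """
    best_label, best_score = None, -math.inf
    for label, model in models.items():
        score = model.log_likelihood(observations) + math.log(prior[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```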

The decoding problem thus reduces to solving for the unknown parameters of eq 3.13. This can be achieved by having a family of probabilistic functions capable of containing as much information as possible about the process being modeled (speech sounds in this case); thus the use of Hidden Markov Models (HMMs).


3.3.1 Acoustic Modeling Using HMMs

34

Hidden Markov Models (HMM) are stochastic processes that model events in sequence. The underlying assumption in the use of these models is that the speech signal can be represented as a parametric random process and that the parameters of the stochastic process can be estimated by a precise method [Rabi 88].

HMMs should be viewed as a means of computing the similarity between a speech signal and a recognition pattern in a statistical manner. There are two main advantages to using these processes: first, their structure (as will be seen in the following sections) allows them to efficiently model the variability of the speech signal and its spectrum with time; second, the parameters that define HMMs can be re-estimated so as best to account for the acoustical properties of the sounds they represent³.

The success of statistical pattern recognition techniques, especially HMMs, has led to their employment in most contemporary ASR systems, as in the AT&T system [LeeC 89], the SPHINX system at Carnegie Mellon [Lee 90a], the France Telecom system [Jouv 94b], and Roger at McGill University [DeMori 95].

Following is a description of the structure of HMMs and the parameters that define them. Before exploring HMMs one first needs to explain the simple discrete Markov Models and how these can be extended to form Hidden Markov Models.

³Chapter 4 is dedicated to the detailed description of recognition and re-estimation methods using HMMs

Figure 3.2: Example of a Markov Model

3.3.1.1 Markov Chain

A Markov chain consists of a number of states and a number of transitions between them, as can be seen from fig. 3.2. Every state represents a fixed symbol k, and with every transition between a pair of states (S_i, S_j) is associated a transition probability a_ij. When the process starts at t = 1, every state S_i has an initial probability π_i. Each time a transition to a state is taken, an output symbol is generated. For example, from fig 3.2, if one goes from state 1 to 2 at time t, the output symbol "blue" is generated. By the same token, if one observes the sequence of colors O = blue red red white blue white, then one can trace back the sequence of states that produced it; in this case, it will be S2 S1 S1 S3 S2 S3.

In these types of models, each observation is considered to be independent of all the previous observations except for the one that immediately precedes it, so that a_ij denotes that the system was in state S_i at time t and made a transition to state S_j at time t+1.
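To make the colored-symbol example concrete, the short sketch below samples such a chain; the three states, their attached colors and the probability values are assumptions for illustration only and do not reproduce the actual numbers of fig 3.2:

import numpy as np

rng = np.random.default_rng(0)

colors = ["blue", "red", "white"]          # fixed symbol attached to each state
pi = np.array([0.4, 0.3, 0.3])             # initial state probabilities (assumed values)
A = np.array([[0.2, 0.5, 0.3],             # a_ij = P(state j at t+1 | state i at t)
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])

state = rng.choice(3, p=pi)
observed = [colors[state]]
for _ in range(5):
    state = rng.choice(3, p=A[state])      # take a transition...
    observed.append(colors[state])         # ...and emit the symbol of the state reached
print(observed)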

The model described above is constrained by the fact that each state represents a predetermined symbol, so if one wants the Markov chain to generate 100 symbols, one needs a model with 100 states; this is clearly unrealistic in applications where numerous observation symbols are needed, as in the case of speech recognition.

3.3.1.2 Hidden Markov Models

A practical solution to the problem stated above is to make the observation symbol a probabilistic function of the state. In this fashion, all symbols are possible at every state, each with its own probability. This method means adding one more parameter to the Markov model described above: an N×M matrix where N denotes the number of states in the model and M the number of observation symbols. This matrix is referred to as an observation symbol probability matrix B and it can represent any number of observations. Each element of the matrix, say b_12, represents the output observation probability of the 2nd symbol in state 1 (this can also be denoted as b_{q_1}(O_2)).

Such models are called Hidden Markov Models (HMM) because they are

doubly embedded stochastic processes in the sense that not only are transi­

tions between states probabilistic, but the output symbol observed at each

state is also determined by a probability output function. Using the example

of fig 3.2, if an HMM is used to represent the colors, then each state would be able to represent all three colors, rather than having one color for each state. In this context, by going from S_1 to S_2, one could produce red, blue or white depending on the highest output probability of state S_2.

3.3.1.3 Parameters of an HMM

Discrete⁴ HMMs (in which the output distribution function represents a discrete symbol k) can be described by a set of N states (a state that is reached at time t is denoted by q_t), and a set of M output observation symbols associated with every state (these can be represented by V = [v_1, v_2, ..., v_M]). The observations correspond to the physical output of the system being modeled. An observation vector is sometimes denoted as O = O_1 O_2 ... O_T, where O_t is the symbol generated at time t and is one of the symbols of the set V. T represents the total number of observations generated.

An HMM can be defined by its three parameters:

1 The initial probability associated with each state. This can be described as the probability that the system is in state S_i at time t = 1:

π_i = P[q_1 = S_i],  1 ≤ i ≤ N    (3.14)

⁴Continuous HMMs have a different output distribution function and are discussed later

2 An N×N state transition matrix representing the transition probability for every pair of states; it is given by:

A = [a_ij],  1 ≤ i, j ≤ N    (3.15)

A transition a_ij corresponds to the probability of making a transition to state S_j at time t+1 given that the system was in state S_i at time t; this can be represented as:

a_ij = P[q_{t+1} = S_j | q_t = S_i]    (3.16)

3 An N×M observation symbol probability distribution matrix B such as:

B = [b_j(k)],  1 ≤ j ≤ N, 1 ≤ k ≤ M    (3.17)

The output observation symbol probability b_j(k) is the probability that the system is in state S_j and that symbol k is observed; this can be represented as:

b_j(k) = P[v_k at time t | q_t = S_j]    (3.18)

1 ≤ j ≤ N,  1 ≤ k ≤ M    (3.19)
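The three parameter sets can be collected into a single object; the sketch below does this for a small discrete HMM whose numbers are illustrative assumptions chosen only to satisfy the normalization constraints on π, A and B:

import numpy as np

class DiscreteHMM:
    """A discrete HMM lambda = (pi, A, B) with N states and M output symbols."""

    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi)   # (N,)   initial state probabilities
        self.A = np.asarray(A)     # (N, N) state transition probabilities a_ij
        self.B = np.asarray(B)     # (N, M) observation probabilities b_j(k)
        # Each probability distribution must sum to one.
        assert np.allclose(self.pi.sum(), 1.0)
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)

# A 2-state, 3-symbol example.
hmm = DiscreteHMM(pi=[0.6, 0.4],
                  A=[[0.7, 0.3],
                     [0.4, 0.6]],
                  B=[[0.5, 0.4, 0.1],
                     [0.1, 0.3, 0.6]])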

3.3.1.4 Structure of an HMM

Another important consideration is the structure of HMMs used in speech

recognition. The model described in fig 3.2 is called an ergodic model be­

cause every state can be reached from every other state; however, for HMMs

to properly model the time varying speech signal, transitions from states with

higher indices to states with lower indices should not be allowed (i.e. moving backward in time), thus the use of left-to-right models as described in fig 3.3.

Figure 3.3: A five state, left-to-right HMM model

In left-to-right models states are connected in a sequential manner, and each state, except the last, is connected to itself to reflect variability in time (since different instantiations of phonemes and words can register at different times); the last state is called a sink state and it denotes the end of the model.

The number of states in a model is usually chosen to denote the duration

in time of the process being modeled, for example in phonemic HMMs it is

common to choose 3 state models, state 1 representing the left part of the

phoneme, state 2 the middle part and state 3 the last part. Of course there is

no set rule as to how many states should be used; however, an increase in the number of states means an increase in the complexity of the computations.

Usually, there is one HMM for every process being modeled, so in phonemic HMMs, there will be one HMM for each phoneme used. However, when phonemes in context are modeled, clustering⁵ techniques are used to reduce the number of phonemes needed [Jouv 94a] [Ljol 94] [DeMori 95].

⁵Clustering will be discussed in chapter 6

3.3.1.5 Types of HMMs

Finally, there are different types of HMMs that can be used for recognition, depending on the type of feature vector presented by the feature extractor. The main difference between the different types of HMMs lies in their output distribution function b().

Discrete HMMs using one codebook
The observation symbol probability b(k) must satisfy:

Σ_{k=1}^{M} b(k) = 1    (3.20)

where k is a symbol of the alphabet and M is the total number of symbols. The output discrete distribution function can thus be expressed as:

b(O) = P(O | k),  k ∈ V    (3.21)

Discrete HMMs using multiple codebooks
The output discrete distribution function is the product of the distribution functions associated with every codebook. So if there are N_c codebooks, and for each codebook c we have an output observation vector O^c, then:

b(O) = Π_{c=1}^{N_c} P_c(O^c | k^c)    (3.22)

where O^c is the c-th component of O and k^c ∈ {0, 1, ..., K_c − 1}, where K_c is the size of codebook c.

Continuous HMMs using multivariate Gaussian density functions
A continuous output probability has to satisfy:

∫ b(O) dO = 1    (3.23)

A multivariate Gaussian distribution function is given by:

b(O) = (2π)^{−N/2} |Σ|^{−1/2} exp[ −(1/2)(O − μ)^T Σ^{−1} (O − μ) ]    (3.24)

where N is the dimension of the feature vector (26 in our case), μ is the mean vector, and Σ is the covariance matrix.

However, phonemes are poorly estimated by HMMs with one pdf per transition, thus the use of a finite mixture of Gaussian densities [Ney 88]. The mixture distribution is a weighted sum of N_k distributions:

P_mix(O) = Σ_{k=1}^{N_k} w_k·P_k(O),  such that Σ_{k=1}^{N_k} w_k = 1    (3.25)

These mixtures are usually implemented by having several parallel transitions between two states, each transition having a Gaussian distribution function.
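As an illustration, the sketch below evaluates such a mixture output density for diagonal covariance matrices (a common simplifying assumption, not something imposed by the formulas above); the toy values at the end are assumed purely for demonstration:

import numpy as np

def gaussian_mixture_density(o, weights, means, variances):
    """Evaluate a mixture of diagonal-covariance Gaussians at observation o.

    o         : (D,) feature vector
    weights   : (K,) mixture weights w_k, summing to one
    means     : (K, D) mean vectors mu_k
    variances : (K, D) diagonal covariance terms
    """
    D = o.shape[0]
    # Per-component multivariate Gaussian density (eq. 3.24 with diagonal covariance).
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.prod(variances, axis=1))
    expo = np.exp(-0.5 * (((o - means) ** 2) / variances).sum(axis=1))
    return float((weights * expo / norm).sum())      # weighted sum, eq. 3.25

o = np.zeros(26)                                      # a 26-dimensional feature vector
weights = np.array([0.5, 0.5])
means = np.zeros((2, 26))
variances = np.ones((2, 26))
print(gaussian_mixture_density(o, weights, means, variances))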

Semi-Continuous HMMs
Semi-continuous HMMs are hybrid models that integrate discrete probabilities with continuous densities in an effort to combine speed with accuracy respectively. In these models, a set of continuous densities is shared by all discrete output distributions [Huang 89].

3.3.2 Using HMMs for Training and Recognition

3.3.2.1 Overview of Training

Before performing recognition one first needs to use some training algorithm that can estimate the parameters of the HMM models from a training data set, which should normally contain a large number of sentences that are representative of the words the system will be expected to recognize.

The first step in training is listening to the training sentences and writing the sequence of words that form these sentences. This is called transcribing. Each distinct word is then placed in a lexicon with its phonemic spelling. Usually, for more efficient training, multiple pronunciations of the same words are also put in the lexicon to provide as much information as possible about the word.

Once the transcription and phoneme labelling is done, the training sentences along with the transcribed text and lexicon are fed to a training algorithm that iteratively re-estimates the parameters of the HMM models, each time maximizing the likelihood that indeed the training speech was produced by these models; this estimation method is based on the Baum-Welch algorithm [Bau 72] and is fully described in chapter 4.

Sometimes the data is labelled more precisely, such that a sentence is segmented into time sequences (i.e. a set of samples) each representing a phoneme.

When the training algorithm uses this time alignment, the training is called segmented training and it is usually used in the first iterations of the training procedure to properly initialize the parameters of the models.

3.3.2.2 Overview of Recognition

Once the parameters of the HMMs are properly estimated, the recognition can be done using the test sentences. However, one first needs to build a grammar that describes how words are connected. The role of the grammar, as in the case of human speech perception, is to impose a set of constraints on the sequences of words. In HMMs, statistical grammars, called n-gram grammars, are provided that define the probability of occurrence of phonemes. There are different types of grammars, such as bigrams that give the probability of all pairs of phonemes (or words), and trigrams that give the probability of all triplets of phonemes (or words).

Once the grammar is built, the recognition process can begin. This process, in its most simplistic form, is a large search among all the phoneme (word) models for the "best" word (phoneme) sequences that can describe the observed feature vector. However, such a method would require an excessive amount of computation. More efficient techniques have been developed, such as the Viterbi algorithm [Vite 67], which is also fully described in the following chapter.

Chapter 4

Training and Recognition

Algorithms for HMMs

4.1 Introduction

The previous chapter explained the architecture of automatic speech recognition, and Hidden Markov Models (HMM) were mentioned without any details about the mathematical foundation underlying their use.
This chapter presents an overview of the fundamental problems for HMM design, and focuses on the search algorithms that allow us to use these models during the recognition phase, and on the training algorithms during which the parameters of the HMMs are re-estimated so as to best account for the observed training data set.

In the following sections it is assumed that the reader has some prior knowledge of HMMs and their use in speech; however, if this is not the case, some introductory references are recommended such as [Rabi 88], [Rabi 93, chap. 6], [VAl 91], and [Pic 90]¹. This chapter is divided into five main sections: first the three main problems of HMM design are described, next the solution for each of the problems is given in a separate section, and finally some implementation issues concerning HMM design are presented.

4.2 The Fundamental Problems for HMM Design

Given a left-to-right HMM such as the one in fig 3.3, there are three basic problems that have to be solved in order for these models to become useful for real time applications like speech recognition.
Suppose an observation sequence is given by:

O = O_1 O_2 ... O_T²    (4.1)

And suppose an HMM model λ is represented by:

λ = (π, A, B)³    (4.2)

Then one can define the three problems as:

Problem 1- Given the observation sequence O and the model λ, what is the probability P(O | λ), i.e. what is the probability that the sequence O is observed given the model λ?

Problem 2- Given the observation sequence O and the model λ, and supposing that the state sequence of the model λ is defined as Q = q_1 q_2 ... q_T, what is the optimal state sequence given the observation sequence?

Problem 3- How does one adjust the parameters λ = (π, A, B) of an HMM so as to maximize P(O | λ)?

Finding the solution to problems one and two means identifying how recognition can be done using HMMs, whereas finding the solution of problem three permits one to train the models so that they can best represent the observed data set [Rabi 88].

¹The theory and algorithms presented in these references are summarized in this chapter.
²where T is the length of this output sequence
³As discussed in chapter 3, this representation of an HMM means that the model λ has an initial probability matrix π, a transition probability matrix A, and an observation probability matrix B.

4.3 Problem 1: Calculating P(O | λ)

4.3.1 Basic computation

Given a model λ = (π, A, B), with a fixed state sequence of length T, Q = q_1 q_2 ... q_T, and an output observation sequence O = O_1 O_2 ... O_T, one can compute P(O | λ) by summing, over all possible paths Q in the model λ, the probability P(O | Q, λ) (which is the probability of observing the sequence O given the state sequence Q in the model λ) multiplied by the a priori probability P(Q | λ). Thus,

P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)    (4.3)

However, we already know that the probability of an observation sequence O, given a state sequence Q and a model λ, can be represented as:

P(O | Q, λ) = Π_{t=1}^{T} b_{q_t}(O_t)    (4.4)

The probability of the state sequence Q is given by:

P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T−1} q_T}    (4.5)

Thus, we can rewrite P(O | λ) as:

P(O | λ) = Σ_{q_1 q_2 ... q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T−1} q_T} b_{q_T}(O_T)    (4.6)

Since eq. 4.6 is a sum over all paths in a model, and since the number of paths increases exponentially with the length of the observation sequence, then if the model has N possible states that can be reached and the observation sequence is of length T, the order of eq. 4.6 becomes 2T·N^T. So even for very short observation sequences like 50, and with only 4 states, this procedure would need 2·50·4^50, or around 10^32, computations; this is clearly unfeasible in real time applications.

Fortunately, recursive algorithms have been developed which make the calculation of P(O | λ) both simple and efficient. One such algorithm is called the forward-backward algorithm [Bau 72].

4.3.2 The Forward-Backward Algorithm

Consider the forward variable α_t(i) and the backward variable β_t(i):

α_t(i) = P[O_1 O_2 ... O_t, q_t = S_i | λ]    (4.7)

β_t(i) = P[O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ]    (4.8)

α_t(i) represents the probability that the model λ produced the partial output observation sequence O_1 O_2 ... O_t up to time t, using a transition sequence that ends at state S_i. β_t(i) represents the probability that the model λ produced the partial output observation sequence O_{t+1} O_{t+2} ... O_T, given that the first transition in the generated sequence started from state S_i. To state it differently, α_t(i) is the joint probability that the output observation sequence O_1 O_2 ... O_t is generated and we stop at state S_i at time t, and β_t(i) is the probability that the output observation sequence O_{t+1} O_{t+2} ... O_T is generated given that we start at state S_i at time t.
Both of these quantities are normally calculated by creating a trellis (refer to fig 4.1) in which the t-th column corresponds to time t and the i-th row corresponds to state S_i in the HMM model. They are both computed recursively, column by column, α_t(i) starting from the first column and moving forward in the trellis and β_t(i) starting from column T and moving backward in the trellis.
The recursive algorithm for α and β is given by:

1. Initialization:

α_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N    (4.9)

β_T(i) = 1,  1 ≤ i ≤ N    (4.10)

2. Recursion:

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}),  1 ≤ t ≤ T−1, 1 ≤ j ≤ N    (4.11)

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),  t = T−1, T−2, ..., 1, 1 ≤ i ≤ N    (4.12)

• CHAPTER 4. TRAINING AND RECOGNITION ALGORITHMS 49

In step 1, α_1(i) is initialized as the joint probability that we are in state S_i and observing O_1, while β_T(i) is arbitrarily set to 1 for all N states in the HMM model.
In the recursive step for the forward probability, since α_t(i) is the joint probability that the output sequence O_1 O_2 ... O_t is observed up to time t and that we are in state S_i, then by multiplying α_t(i) by the transition probability a_ij, we get the joint probability that the output sequence up to time t is observed and that we have made a transition to state S_j at time t+1 from state S_i. By summing this product over all N states S_i, we get the probability of being in state S_j at time t+1 with all the previous output observations up to time t. To obtain α_{t+1}(j), we then multiply this sum by the output observation probability at time t+1 for that state, which is nothing but b_j(O_{t+1}).
Note that in order to solve for P(O | λ), all that needs to be done is to sum all terminal forward variables α_T(i). Hence,

P(O | λ) = Σ_{i=1}^{N} α_T(i)    (4.13)

Calculating P(O | λ) using eq. 4.13 is of the order N²T, so going back to the previous example, if there are 4 active states and 50 observations, this procedure needs 4²·50, or 800, computations; comparing this to the 10^32 obtained using the straightforward calculation of P(O | λ), we have saved around 29 orders of magnitude.
Similarly, the recursive step of the backward probability shows that in order to have been in state S_i at time t and to account for all output observations starting at t+1 onward, we have to take into consideration all possible states S_j at time t+1, and account for all transitions from S_i to state S_j (using the a_ij term), along with the observation at time t+1 in state S_j (thus the b_j(O_{t+1}) term), and finally multiply by all the remaining output observations from state S_j (thus the β_{t+1}(j) term).
Both the forward and backward variables play a key role in the search and training algorithms, as will be seen in the following sections.
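A minimal sketch of the forward and backward recursions (eqs. 4.9-4.13) for a discrete HMM is given below; the small two-state model used at the end is assumed purely for demonstration:

import numpy as np

def forward_backward(pi, A, B, obs):
    """Compute alpha, beta and P(O | lambda) for a discrete HMM (eqs. 4.9-4.13)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization, eq. 4.9
    beta[T - 1] = 1.0                                # initialization, eq. 4.10
    for t in range(T - 1):                           # forward recursion, eq. 4.11
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(T - 2, -1, -1):                   # backward recursion, eq. 4.12
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[T - 1].sum()           # P(O | lambda), eq. 4.13

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                                      # an observed symbol sequence
alpha, beta, p = forward_backward(pi, A, B, obs)
print(p)                                             # P(O | lambda)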

Figure 4.1: Example of a trellis (states vs. observation times), adapted from [DeMori]

4.4 Problem 2: Finding an Optimal Path

There are many ways one can interpret finding the optimal path given an output observation sequence and a model λ. One can, for example, choose the optimal path to be the one that maximizes the expected number of correct individual states. However, this cannot be applied to speech recognition because in maximizing individual states, one might end up with a state sequence in which one or more states are not connected.
What is needed in speech recognition is the ability to determine the best state sequence given an observation sequence O and a model λ, in other words a way of maximizing P(Q | O, λ).
Such a technique was developed in 1967 by Viterbi [Vite 67], and is referred to as the Viterbi Algorithm. Following is an overview of this algorithm.

4.4.1 The Viterbi Algorithm

Given a trellis such as the one described in fig 4.1, the Viterbi algorithm computes the lowest-cost path, where the cost of a path at a certain node in the trellis t_i is given by the sum of the cost at the previous node t_{i−1} and the cost of going from t_{i−1} to t_i.
In order to view how this algorithm is structured, let us define

γ_t(i) = α_t(i) β_t(i) / P(O | λ)    (4.14)

as the probability of being in state S_i at time t while observing the output sequence O. Note that the P(O | λ) in eq. 4.14 is a normalization factor that makes γ_t(i) a conditional probability, such that the sum over all N states of the γ_t(i)'s at time t is 1. Note also that the value of P(O | λ) can now be calculated using the forward variable α_t(i) as stated in eq. 4.13.
To find the lowest-cost path (or best state sequence q_1 q_2 ... q_T) in the trellis, given an output observation sequence O_1 O_2 ... O_T, we define δ_t(i) to be the best score along a certain path in the trellis, at a given time t, that can account for the output observation sequence up to time t and that ends at state S_i:

δ_t(i) = max_{q_1 q_2 ... q_{t−1}} P[q_1 q_2 ... q_{t−1}, q_t = S_i, O_1 O_2 ... O_t | λ]    (4.15)

From eq. 4.15, one can calculate the lowest-cost path recursively by:

δ_{t+1}(j) = [ max_{1 ≤ i ≤ N} δ_t(i) a_ij ] b_j(O_{t+1})    (4.16)

4.4.2 Recognition Using the Viterbi Algorithm

During recognition, one also needs to keep track of the state S_i that had the maximum δ_t(i) at time t. Once the last observation is reached at time T, recognition is performed by backtracking through the trellis, extracting those states that maximized δ_t(i). So for recognition using the recursive Viterbi algorithm one needs to define two arrays, one to hold the maximum δ_t(i) and one to hold the corresponding state S_i. The algorithm is described as:

1. Initialization:

δ_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N    (4.17)

ψ_1(i) = 0    (4.18)

2. Recursion:

δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij] b_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N    (4.19)

ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij],  2 ≤ t ≤ T, 1 ≤ j ≤ N    (4.20)

3. Termination:

P* = max_{1 ≤ i ≤ N} [δ_T(i)]    (4.21)

q*_T = argmax_{1 ≤ i ≤ N} [δ_T(i)]    (4.22)

4. Backtracking:

q*_t = ψ_{t+1}(q*_{t+1}),  t = T−1, T−2, ..., 1    (4.23)

The array ψ_t(j) holds the index i of the δ_{t−1}(i) that maximizes δ_t(j) according to eq. 4.19; it is basically a pointer to the best preceding state S_i. After the last output observation at time T, P* and q*_T will contain the highest value of δ_T(i) and the state that produced this maximum, respectively.
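The recursion, termination and backtracking steps (eqs. 4.17-4.23) can be written compactly as in the sketch below; the toy model at the bottom is an assumption for illustration, and no language model weight is applied:

import numpy as np

def viterbi(pi, A, B, obs):
    """Return the best state sequence and its probability (eqs. 4.17-4.23)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))                  # best path scores delta_t(j)
    psi = np.zeros((T, N), dtype=int)         # back-pointers psi_t(j)
    delta[0] = pi * B[:, obs[0]]              # initialization, eq. 4.17
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # delta_{t-1}(i) * a_ij for all i, j
        psi[t] = scores.argmax(axis=0)        # best predecessor, eq. 4.20
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # eq. 4.19
    path = np.zeros(T, dtype=int)
    path[T - 1] = delta[T - 1].argmax()       # termination, eq. 4.22
    for t in range(T - 2, -1, -1):            # backtracking, eq. 4.23
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[T - 1].max()           # best path and P*, eq. 4.21

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2]))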

Sometimes, weights can be imposed on transition and observation probabilities to increase their contribution during the search process. These weights are referred to in the literature as Language Model Weights. For example, if one wishes to increase the transition probability contribution, then instead of using a_ij in eq. 4.19, one would use (a_ij)^W, where W is a pre-specified weight. In some recognition systems⁴, language model weights are used on transition probabilities from one phoneme model to another, rather than on transition probabilities between the states of a model. Language model weights and their effects on recognition will be reviewed more closely in chapter 6.

⁴As the one developed in the speech lab at McGill University

The essential calculation in the Viterbi algorithm lies in eq. 4.19: the only

path that gets propagated is the one that has the highest probability among

all the paths that can make a transition to the current state at time t. However, although the optimal state sequence is the most likely path through the models, the sequence of models that corresponds to this path may not be the optimal one. This is due to the fact that the probability of a model sequence must be summed over all paths in the sequence, and not only the most likely path. Nonetheless, in most cases, the most likely path does provide a good and efficient approximation.

The last point to make here is that although this algorithm reduces the

search space by propagating only the most likely path at a particular time

t, it still imposes a considerable amount of computation on the recognition

process resulting in large response times in real time applications. However,

some adjustments can be made to the algorithm to increase its speed; this leads to the next topic: the Viterbi Beam Search.

4.4.3 The Viterbi Beam Search Algorithm

In most real time applications, the response time plays a key role in measuring the efficiency of the system. In automatic speech recognition systems, due to the complexity of computations, speed becomes a critical point in the design strategy.
One way of improving the speed of the system is to limit the search space of the Viterbi algorithm. This can be achieved by restricting the search to those trellis nodes that have a likelihood (or probability) greater than some fraction of the maximum likelihood in the given column of the trellis. This technique is called Beam Search, and following is a brief description of how it is implemented; more details can be found in [LeeC 89].
Given the trellis as in fig 4.1, each time a trellis column t is computed, the value P_t^max of the highest probability of any node in the column is found. Then, only those nodes with a probability greater than P_t^max − δ(t) will be kept in the list of active nodes; the rest of the nodes are pruned or disregarded. δ(t) is a preset threshold, referred to as the beam width. As the computation time in the Viterbi algorithm is proportional to the number of active nodes in the trellis, it is clear that the width of the beam will have an effect on the speed of the algorithm; needless to say, a smaller width means fewer active nodes and thus higher speed. However, there is no general relationship between the beam width and the computation time: in some experiments it was reported that computation time increased exponentially with δ(t), while others reported an almost linear increase for large vocabularies [LeeC 90].
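The pruning step itself is simple to express; the sketch below assumes the column scores are kept as log-likelihoods, so that subtracting the beam width δ(t) from the column maximum gives the pruning threshold (the exact form of the threshold is an implementation choice, not something fixed by the description above):

import numpy as np

def prune_column(log_scores, beam_width):
    """Return the indices of trellis nodes that survive beam pruning in one column.

    log_scores : (N,) log-likelihood of every node in the current trellis column
                 (-inf marks nodes that are already inactive)
    beam_width : the beam width delta(t); larger values keep more nodes
    """
    threshold = log_scores.max() - beam_width     # P_t^max - delta(t) in the log domain
    return np.flatnonzero(log_scores > threshold)

column = np.log(np.array([1e-4, 5e-3, 2e-3, 1e-7]))
print(prune_column(column, beam_width=3.0))       # only nodes close to the best survive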

4.5 Problem 3: Estimating the Parameters of an HMM

The previous sections described solutions to the problem of recognizing a speech signal by finding the most likely set of models that could have produced the observation sequence generated by that signal; however, nothing was said on how the parameters of these models are initialized and estimated so as best to account for the observed input speech signals. This section reviews how HMM models are created and how their parameters are iteratively re-estimated; this is referred to as training the HMM models.
There are many training techniques; the most commonly used is the Maximum Likelihood Estimation (MLE) method, also referred to as the Baum-Welch or Forward-Backward re-estimation method [Bau 72], [Lip 82]. Other techniques have been developed such as segmental k-means training [LeeC 90], Maximum Mutual Information (MMI) estimation [Bahl 86], [Chow 90], Minimum Discrimination Information (MDI) estimation [Eph 89], and Corrective Training [App 89]. Following is a description of the MLE method.

4.5.1 Maximum Likelihood Estimation Method

In MLE training, one tries to adjust the parameters (A, B, π) of an HMM model λ so as to maximize the probability of the observation sequence generated by the training data. The following estimates of the parameters are proposed:

π̄_i = Expected number of times in state S_i at t = 1    (4.24)

ā_ij = (Expected number of transitions from state S_i to state S_j) / (Expected number of transitions from state S_i)    (4.25)

b̄_j(k) = (Expected number of times in state S_j and observing symbol k) / (Expected number of times in state S_j)    (4.26)

Next, one has to define the joint probability ξ_t(i,j) of observing the sequence O, being in state S_i at time t and making a transition from S_i to S_j at time t+1. The joint probability of observing the output sequence O and being in state S_i has already been calculated in eq. 4.14, so one can calculate ξ_t(i,j) by combining the forward probability of being in state S_i at time t with the transition probability a_ij, the output observation probability b_j(O_{t+1}) at time t+1 in state S_j, and the backward probability from state S_j; this gives:

ξ_t(i,j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)    (4.27)

Hence, the expected number of transitions from state S_i to S_j is nothing but the sum over t of ξ_t(i,j). Moreover, if one sums γ_t(i) of eq. 4.14 over time t, one gets the expected number of times the state S_i is visited or, in other words, the expected number of transitions made from S_i. We can thus

write:

Σ_{t=1}^{T−1} ξ_t(i,j) = Expected number of transitions from state S_i to S_j    (4.28)

Σ_{t=1}^{T−1} γ_t(i) = Expected number of transitions from state S_i    (4.29)

The estimation formulae for the parameters of the HMM model can be rewritten as:

π̄_i = γ_1(i)    (4.30)

ā_ij = Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)    (4.31)

b̄_j(k) = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)    (4.32)

Thus the MLE method can be implemented as a recursive procedure such as:

1. Initialization:
Consider a model λ with initial values, λ = (π, A, B).

2. Recursion:
Use the previous values of π, A and B in the right hand side of equations 4.30, 4.31 and 4.32 to compute new parameters λ̄ = (π̄, Ā, B̄) as determined from the left hand side of equations 4.30, 4.31 and 4.32.

If P(O | λ̄) ≥ P(O | λ), then the new model λ̄ is better than the old one, so reset the previous values to those obtained in this iteration and repeat step 2. Else stop.

3. Termination:
Model λ̄ defines a critical likelihood such that λ̄ = λ. At this stage one has reached the maximum likelihood estimate of the model λ.
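For a single observation sequence of a discrete HMM, one re-estimation pass can be sketched as below (eqs. 4.30-4.32, with the forward-backward quantities computed on the fly); the model and symbol sequence at the end are assumptions for illustration only:

import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One MLE re-estimation pass (eqs. 4.30-4.32) for a single observation sequence."""
    N, T = len(pi), len(obs)
    # Forward and backward passes (eqs. 4.9-4.12).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]; beta[T - 1] = 1.0
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[T - 1].sum()                                  # P(O | lambda)

    gamma = alpha * beta / p_obs                                # gamma_t(i), eq. 4.14
    # xi_t(i,j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda), eq. 4.27
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= p_obs

    new_pi = gamma[0]                                           # eq. 4.30
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # eq. 4.31
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                                 # eq. 4.32
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, p_obs

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(baum_welch_step(pi, A, B, [0, 1, 2, 1, 0]))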

4.5.2 MLE Method with Multiple Sentences

As was discussed in chapter 2, left-to-right models are mostly used in speech recognition because of their ability to reflect the temporal changes in an incoming speech signal. However, the major drawback of these models is that their structure makes it very hard to accurately re-estimate the parameters with only one observation sequence (or one sentence). This is due to the fact that only a small number of observations is usually associated with each state. To overcome this problem, the training algorithm discussed in the previous section is extended to include multiple training sentences.
In this case the observation vector O becomes:

O = [O^(1), O^(2), ..., O^(L)]    (4.33)

where L is the total number of sentences and each sentence v has its own observation vector O^(v) = O_1^(v) O_2^(v) ... O_{T_v}^(v).
It is assumed that each observation sequence is independent of all the others. In this case, the parameters of the HMM model are only estimated after all the sentences have been processed; this leads to the following estimation formulae:

π̄_i = (1/L) Σ_{v=1}^{L} γ_1^(v)(i)    (4.34)

ā_ij = Σ_{v=1}^{L} Σ_{t=1}^{T_v−1} ξ_t^(v)(i,j) / Σ_{v=1}^{L} Σ_{t=1}^{T_v−1} γ_t^(v)(i)    (4.35)

b̄_j(k) = Σ_{v=1}^{L} Σ_{t=1, O_t^(v) = v_k}^{T_v} γ_t^(v)(j) / Σ_{v=1}^{L} Σ_{t=1}^{T_v} γ_t^(v)(j)    (4.36)

Fig 4.2 presents the flow diagram of the training procedure (adapted from [VAl 91]). Usually, to see if the models have reached their maximum likelihood, a recognition test is done using the newly estimated parameters. If the error rate drops, it means the new models have improved over the last ones. Training is repeated until the recognition with the newly estimated models no longer improves.

Figure 4.2: Training with multiple observations (flow diagram: an initial model λ = (A, B, π) is re-estimated from the multiple independent observations to give a new model λ' = (A', B', π'), and the loop repeats until the maximum likelihood is reached)

4.5.3 Estimating the Output Distributions of a CDHMM

So far, the discussion on parameter re-estimation has revolved around discrete HMMs. The estimation formula for the output distribution vector B (eq. 4.32) relies on observing a discrete symbol k from a finite alphabet, thus the use of a discrete probability density at each state of the model.
However, as was mentioned in chapter 3, there are other types of HMMs, mainly those that use continuous output distributions rather than discrete ones (continuous density Hidden Markov Models or CDHMM). As was stated in the previous chapter, CDHMMs have the advantage of being able to model the continuous speech signal more precisely, especially when one associates with each state a weighted sum, or mixture, of Gaussians. This section examines the estimation formulae for the parameters of a CDHMM that uses mixtures of multivariate Gaussian distributions.
Let us first represent an M-mixture Gaussian output distribution by

b_j(O) = Σ_{k=1}^{M} c_jk N(O, μ_jk, U_jk)    (4.37)

where O is the observation vector extracted from the input signal and N is a Gaussian density function with mean vector μ and covariance matrix U. The term c_jk represents the gain of the k-th mixture of state S_j. The sum over all k's is used to represent all the mixtures associated with state S_j. The mixture gain in eq. 4.37 has to satisfy the constraint that the sum of all mixture gains for a state S_j is one, so that the probability density function is

properly normalized:

∫_{−∞}^{+∞} b_j(O) dO = 1,  1 ≤ j ≤ N    (4.38)

Suppose there is a CDHMM composed of an M-mixture Gaussian density function with M = 6. It was shown in [Rabi 93] that the mixture gains c_ik for a state i can be interpreted as transitions to substates i_1, i_2, i_3, i_4, i_5 and i_6 with probabilities c_i1, c_i2, c_i3, c_i4, c_i5 and c_i6 respectively, with each transition having its own mean μ_ik and covariance U_ik. Then each substate makes a transition to a state i_0, called a wait state, with probability 1. It was proven that this composite set of substates, each having a single density function associated with it, is mathematically equivalent to the mixture density function associated with a single state. Thus, the estimation formulae for the parameters of the mixture Gaussian density become:

c̄_jk = Σ_{t=1}^{T} γ_t(j,k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j,k)    (4.39)

μ̄_jk = Σ_{t=1}^{T} γ_t(j,k)·O_t / Σ_{t=1}^{T} γ_t(j,k)    (4.40)

Ū_jk = Σ_{t=1}^{T} γ_t(j,k)·(O_t − μ̄_jk)(O_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j,k)    (4.41)

The T in (O_t − μ̄_jk)^T denotes the transpose of the matrix. Note that here, γ_t(j,k) represents the probability of being in state S_j at time t with the k-th mixture component accounting for the observation vector O_t.
As far as the estimation of π_i and a_ij is concerned, the same formulae derived for the discrete HMM (eq. 4.30 and eq. 4.31) can be used for the continuous HMM.

4.6 Implementation Considerations

4.6.1 Initializing the HMM models

There is no set rule as to how to determine the initial values of the HMM parameters (π, A, B); however, our experiments along with others [Rabi 89] have shown that these initial values play a major role in determining the performance of the models during recognition. As was discussed in the previous section, training is an iterative process that should eventually converge to a local maximum. The initial values of (π, A, B) determine which maximum is reached after training.
In [Rabi 93], it is also suggested that the initial values for the transition probability matrix A and the initial probability matrix π can be safely chosen randomly, but the values of the output distribution matrix B, especially when using continuous density CDHMMs, have to be selected with care. There are many techniques developed to initialize the output distribution matrix, such as the use of hand segmented data to bootstrap the models, linearly segmenting the training data into their corresponding distribution sequence and then using all frames that correspond to a given distribution to estimate the initial values [LeeC 90], the use of segmental k-means segmentation with clustering, etc.

4.6.2 Insufficient Training Data

Insufficient training data is considered one of the most challenging problems in real time realization of speech recognition systems, especially in recent years when the usefulness of Context Dependent Models emerged⁵ and researchers were faced with the problem of insufficient training data, and limited machine resources to train models in context.
Experiments clearly show that the larger the training set, the better the models and the higher the recognition. Another important issue is the representativeness of the given training set: the more varied the set (in terms of variations in acoustic features such as gender, accents, age, context, etc.), the more robust the models are.
In cases where the training data is too small, the output observation matrix becomes mostly filled with zero probabilities. As was shown in previous sections, the estimation of b_j(k) relies on the joint probability of observing a symbol k and the expected number of times we are in state S_j, so if there is no occurrence of symbol k, b_j(k) is set to zero and will always remain zero. This problem is more evident in CDHMMs using multiple mixtures because of their complex structure and usually large number of distributions. In most cases, researchers tie mixtures together so as to reduce the computation complexity and provide more training data to estimate the parameters [DeMori 95].
Experiments show that the more training data is provided, and the more features (i.e. more observation sequences) are included in the training procedure, the more robust the system becomes. Poor performance is observed when the data used during recognition produce features that were never encountered during training.

⁵Context Dependent Models will be discussed in detail in chapters 5 & 6


4.6.3 Underflow Problems

The problem with using HMMs is that often the probabilities calculated during recognition and training tend to approach zero exponentially with time; this causes these parameters to attain values that fall outside the precision range of any machine, resulting in an underflow problem. A remedy to this problem is the use of scaling and logarithmic calculation.
Calculating the logarithm of the probabilities leads to a more efficient computation, both in solving the underflow problem and in eliminating the use of multiplication and division (which are expensive in terms of speed). As we recall:

log(x·y) = log(x) + log(y)    (4.42)

log(x/y) = log(x) − log(y)    (4.43)

So by considering the logarithm of probabilities one can increase computation speed considerably. However, one cannot compute the logarithm of probabilities when their computation involves summations, such as in the forward and backward calculations (eq. 4.12 and eq. 4.13). In these cases, scaling is applied to the variables [Rabi 93].
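One common way of handling those summations while staying in the log domain is the log-sum-exp trick sketched below; this is an alternative illustration of the idea, alongside the scaling procedure of [Rabi 93], and the toy model at the end is assumed for demonstration only:

import numpy as np
from scipy.special import logsumexp

def log_forward(log_pi, log_A, log_B, obs):
    """Forward recursion computed entirely in the log domain to avoid underflow."""
    N, T = len(log_pi), len(obs)
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # log sum_i exp(log_alpha_{t-1}(i) + log a_ij), then add log b_j(O_t)
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return logsumexp(log_alpha[T - 1])        # log P(O | lambda)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2] * 100                          # a long sequence that would underflow otherwise
print(log_forward(np.log(pi), np.log(A), np.log(B), obs))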

Chapter 5

State of Art in Speech

Recognition

In this chapter, some of the recent developments that have played a major role in promoting the use of continuous, speaker independent ASR systems are explored. The following sections describe some of the new ideas and methods that proved to have a positive effect on the accuracy and robustness of these systems; needless to say, these ideas wouldn't have had this impact had it not been for the constant improvements in hardware technology and the low cost of memory, which make these systems run in real time and provide the proper environment for their implementation.

5.1 Availability of Large Training Data Sets

As was discussed in chapter 3, one of the critical points in the training proce­

dure is to have a large and representative set of sentences. The more features

the HMMs are exposed to, the better the parameters are and the higher the

recognition accuracy.

Nowadays, large (such as the Wall Street Journal corpus, 20,000 words), medium (such as the Air Travel Information Service, 1800 words) and small (such as the TI connected-digit corpus, 10 sentences) speech corpora are available for use by all researchers, mainly due to the Advanced Research Projects Agency (ARPA) efforts [Makh 94].
Another common medium speech corpus, and the one used in the experiments in this thesis, is the DARPA TIMIT Acoustic Phonetic Speech Corpus. The TIMIT corpus was a combined effort between Texas Instruments (TI), the Massachusetts Institute of Technology (MIT) and Stanford University.

The availability of these speech corpora meant that performance of different

ASR systems could be compared using common test beds.

5.2 Channel Noise Reduction

Channel noise, especially over telephone lines, alters the features of the speech

signal thus causing performance to drop. One of the new methods developed

at France Telecom [Mokb 94] increases the robustness of ASR systems by

reducing the distortion caused by the telephone lines.

In [Mokb 94] it is noted that the telephone line acts as a linear convolution filter h(t), such that the signal y(t) that enters the ASR system is actually a convolution of the original speech s(t) (distortion due to the environment is an additive noise to the original signal and can thus be ignored in the equations without loss of generality) and the impulse response of the linear filter:

y(t) = s(t) ⊗ h(t)    (5.1)

So, in fact, channel noise can be eliminated if one can separate h(t) from s(t). This can be achieved if one represents y(t) in the log domain; thus the convolution of the two signals in eq. 5.1 becomes an addition of their respective logs. By projecting the previous equation onto the feature space, mainly the cepstral features¹, one can easily see that:

C_y(t) = C_s(t) + C_h(t)    (5.2)

So the cepstral vector produced is equal to the original cepstral features plus the cepstrum of the channel. It is then proven that the cepstrum of the channel C_h(t) is equal to the average of the cepstral features over a time interval when the telephone channel transfer function is constant. Thus, channel noise can be eliminated either by cepstral subtraction (calculating the average of the cepstral features and then subtracting it from each coefficient) or by applying a high pass filter to the cepstral coefficients to suppress the low frequencies of the cepstrals, which represent the cepstrum of the channel.

¹Recall from chap. 3 that these features are the inverse log of the power spectrum

Experimental results using both the cepstral subtraction and the high pass filter showed an 11% and a 13% drop in the error rate, respectively, when

the ASR system was using data coming from a voice server in real use over actual telephone lines, and a 29% and 25% reduction when speakers in the laboratory were asked to repeat a list of predefined words over the telephone network.
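The cepstral-subtraction variant is particularly simple to express: the channel estimate is the time average of the cepstral vectors over the interval, and it is removed from every frame. The sketch below assumes a matrix of cepstral coefficients extracted from one utterance (or from any interval over which the channel can be considered constant):

import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the (constant) channel cepstrum by subtracting the time average.

    cepstra : (T, D) matrix of cepstral coefficient vectors, one row per frame
    returns : (T, D) channel-compensated cepstra
    """
    channel_estimate = cepstra.mean(axis=0)      # average over the interval ~ C_h
    return cepstra - channel_estimate

# Toy example: clean cepstra distorted by a constant additive channel cepstrum.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 12))
channel = np.full(12, 0.5)
compensated = cepstral_mean_subtraction(clean + channel)
print(np.abs(compensated.mean(axis=0)).max())    # close to zero after subtraction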

5.3 Speaker Adaptation for Speaker Independent Systems

Speaker adaptation attempts to improve the performance of speaker independent systems by adapting the parameters of the HMMs to the acoustical properties of the speaker's voice. This is useful, for example, when non-natives of a language talk to a speaker independent ASR system trained on mostly native speakers: the large variation in the accents usually produces very different features that the system has never encountered and thus cannot recognize accurately.

One solution is the incremental mean adaptation [Doug 94] in which all feature vectors that cause a particular Gaussian distribution to be activated during recognition are cached. Then, at the end of the sentence, the mean of these feature vectors is calculated, and the mean of the Gaussian activated due to these vectors is re-adjusted according to:

mean_new = (1 − α)·mean_old + α·mean_features    (5.3)

where α is a purely experimental factor and is normally very small (it is set to 0.015 in our system). On Roger, speaker adaptation decreases the error rate by 30% when tested on a set of sentences spoken by one person, using speaker-independent models that are trained on the TIMIT corpus.
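A sketch of the update of eq. 5.3 is given below; the per-Gaussian caching of feature vectors and the 0.015 factor follow the description above, while the dictionary-based bookkeeping and the names used are assumptions of this illustration:

import numpy as np

def adapt_means(gaussian_means, activations, factor=0.015):
    """Incremental mean adaptation (eq. 5.3), applied at the end of a sentence.

    gaussian_means : dict mapping a Gaussian id to its current mean vector
    activations    : dict mapping a Gaussian id to the list of feature vectors
                     that activated it during recognition of the sentence
    factor         : the small experimental interpolation factor
    """
    for gid, vectors in activations.items():
        if not vectors:
            continue
        feature_mean = np.mean(vectors, axis=0)                 # mean of cached vectors
        old = gaussian_means[gid]
        gaussian_means[gid] = (1.0 - factor) * old + factor * feature_mean
    return gaussian_means

means = {0: np.zeros(26), 1: np.ones(26)}
cache = {0: [np.full(26, 2.0), np.full(26, 4.0)], 1: []}
print(adapt_means(means, cache)[0][:3])    # the mean of Gaussian 0 moves slightly toward 3.0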

5.4 Language Models

Language models have a big effect on the performance of ASR systems. By imposing more constraints on the allowed sequences of words (or phonemes), the recognition procedure has fewer choices and thus the recognition is improved. Without any grammar, all sequences are equally likely and the search becomes more difficult.
Many researchers have reported good performance with the use of trigrams (where the probability of three consecutive words is given) [Ljol 94]. In [Place 93], some techniques are developed to provide a robust estimation of trigram probabilities.
The key to building good language models is, of course, to have a very large data set in order to incorporate all possible sequences of words.

5.5 Acoustic Modeling

5.5.1 Modeling Non Speech Sounds

One of the major obstacles in recognizing continuous sentences is the use of non-speech words by speakers, such as Humm, Aha, Oh, etc. One solution to the problem would be to add to the existing set of word/phoneme models non-speech models whose parameters are re-estimated according to feature vectors produced from pronouncing these sounds. This strategy increases the robustness of the system without adding too much complexity to the search algorithm since, in most cases, all non-speech sounds are grouped into one HMM model.

5.5.2 Using HMMs to Recognize Non-linguistic Features

In this experiment, conducted by Lamel L. and Gauvain J.L. [Gauv 95], phone-based acoustic likelihood is used to identify non-linguistic features such as the gender and the identity of the speaker, along with the accent or even the language spoken.
The innovative part of this experiment is that the implementation of the recognition process is identical to that of normal phoneme recognition, except that in this case the recognizer uses gender, speaker and language dependent models. Maximum likelihood estimators are used to derive the language specific models, while maximum a posteriori estimators are used to derive the gender and speaker models.
The experiments were conducted on five different corpora: the BDSONS or Base de Données des Sons du Français, the BREF which is a large read-speech corpus containing over 10 hours of French speech material from 120 speakers, the TIMIT corpus, the WSJ or Wall Street Journal corpus, and the 10-language OGI-TS corpus which is a multi-lingual telephone speech corpus.
The models constructed for each non-linguistic feature are tested on one or more corpora and the results are very promising for all three features: the lowest error rate is around 1% for gender identification using TIMIT after 1 sec of speech, that of speaker identification is 0.8% at the end of the sentence using BREF, and 0.17% after 2.5 sec of speech using TIMIT; for language identification the overall error rate on all corpora is 0.4%, 2.4 sec into the sentence. This means that, in the future, this kind of non-linguistic modeling can be used to transcribe speech sentences instead of relying on manual transcription.

5.5.3 Using Context Dependent Models

A recent study has shown that the largest phonetic variation in the TIMIT corpus is due to the coarticulation factor [Sun 95]. In their study, Sun D. and Deng L. developed a technique to assess the effects of various factors (such as the phoneme unit, its class, its context with other phonemes, the speaker's gender, identity, accent, etc.) on the TIMIT database, and they found that among all the factors analyzed (nine in total), the context of the phonemes had the highest effect. Indeed, earlier research [Schwa 85] [Lee 90b] has shown that a higher recognition accuracy can be achieved using context dependent models. Nowadays, many continuous, speaker independent ASR systems incorporate context in their modeling strategy.
The principal difference between context dependent and context independent models is that in the former, a phoneme's pronunciation and thus its acoustic properties are considered to be a function of the phonemes that precede and follow it, while in the latter, every sound is considered to be independent of the sounds that appear on its left and right respectively. If one can successfully model every realization of a phoneme (i.e. in all contexts it can appear in within a language) then all the coarticulation effects will be represented and the accuracy should be near perfect. However, this is unfeasible due to two main reasons: first, no matter how large the training set is, it won't usually contain all the contexts every phoneme can appear in; second, being able to use all the context dependent models would require a machine with very large and powerful resources in terms of storage and processing capacity. Even with all the advances of technology, this would still be a very expensive procedure.
To remedy the problem, researchers try to balance between resources and accuracy by grouping phonemes with similar properties into clusters, so instead of having a left and right phoneme, the central phoneme would have a left and right cluster. This clustering strategy is the one almost exclusively used with context dependent models.
Building context dependent models using clustering forms the essence of this thesis. In the following chapter, the ideas and strategies used in the experiments are described and, when appropriate, compared to what has already been implemented in other ASR systems.

Chapter 6

Experiments With Context

Dependent Models

6.1 Overview

The aim of the experiments conducted was twofold: determining the effect of using context dependent (CD) vs context independent (CI) models on the performance of the system, and exploring new merging techniques in which allophones pertaining to a specific phoneme are combined to form a complex CI phoneme model. The reason behind the second set of experiments was to reduce the complexity of the computations by reducing the total number of models used, yet maintain a good accuracy by incorporating into the new CI phoneme model some contextual information.
This research was conducted in three main stages. In the first stage, CI models were designed, trained and tested. In the second stage, CD models were produced using the CI models of the first stage as seed models. The CD models were in turn trained and tested. Finally, two strategies for combining allophone models into one complex phoneme structure are explored. The performance of these models is measured for each of the structures.

This chapter is divided into five main sections: the first two present the ASR system and speech corpus used for this study, and the third, fourth and fifth sections describe the three different stages of the research mentioned above.

6.2 An Overview of Roger

Roger is the McGill University speech lab's speaker independent ASR system. It is composed of two main modules, the feature extractor and the recognizer. It can perform both word and phoneme recognition. Roger has a friendly interface for real-time applications; it can also process pre-recorded sentences, which is what we used it for in these experiments.
In this system, sampling is done at 16 kHz, the digitized signal is pre-emphasized with an α factor of 0.95, and the samples are then grouped into frames of duration 20 ms. Every 10 ms, a 512 point FFT is calculated, and 24 filters are used to compute the first 12 mel cepstral coefficients. The feature vector is composed of the energy of a window, its first derivative, 12 mel cepstral coefficients and their first derivatives, 26 features in total.
The recognizer uses the Maximum Likelihood method to re-estimate the parameters of the HMMs, and the Viterbi beam search algorithm for recognition. The HMM models are continuous, composed of mixtures of multivariate Gaussian distribution densities. The topology of the HMMs used in the experiments will be discussed in subsequent sections.

6.3 The TIMIT Corpus

In all the stages of this research, the TIMIT speech corpus was used so as to provide a means of comparison between the different results obtained.
The sentences, as was discussed in chapter 5, come from three different sources: 2 dialect sentences to expose accent variability (produced at Stanford), 450 phonetically-compact sentences designed to cover a large number of phone pairs as well as certain phonetic contexts (produced at MIT), and 1890 phonetically-diverse sentences selected from existing text aimed at covering allophonic contexts (from TI) [Gall 92]. In all, the TIMIT corpus has 6300 sentences, 10 sentences spoken by each of the 630 speakers, who came from 8 major dialect regions of the United States. The speakers were of both genders.
Along with the digitized sound waves, the TIMIT database contains time-aligned sequences of phonetic labels for each of the sentences. There are 64 different labels which, in our experiments, are mapped to 53 phonemes: 46 adapted from [Lee 89], and the extra 7 phonemes modeling the plosive-specific closures bcl, dcl, gcl, kcl, qcl, pcl, tcl adapted from [Gall 92]. These extra models were shown by Galler M., in his experiments, to reduce the errors due to the misclassification of both closures and plosives vs non-plosive phonemes. The list of 46 phonemes along with examples of their pronunciations can be seen in table 2.1.

6.4 Designing Context Independent Models

In order to initialize the CD models, CI models had to be built. The following sections describe the structure and the performance of these models.

6.4.1 Optimizing the Topology

The first step in designing HMMs is to decide on the number of states each model should have, the number of transitions, the number of mixtures on each transition, and which mixtures to tie together. Unfortunately, there is no set rule as to how the structure should be defined; different studies use different topologies that provide good results. For example, in [Schwa 85] a simple 5-state HMM model is used, with an initial and a final state where no self-transitions are allowed, and three inner states representing the left, middle and right parts of the phoneme. A transition from the "left" state to the "right" state was allowed to model phonemes when they are quickly articulated. In [Lee 89] a more complex topology was used: the HMM contained 7 states, 12 transitions, and three output probability density functions; these topologies are presented in figures 6.1 and 6.2. In [Taki 92], a successive state splitting algorithm is used to simultaneously optimize the structure of the HMM model, the distribution of its probability densities, and the phoneme clusters.


Figure 6.1: Topology used in [Schwa 85]

Figure 6.2: Topology used in [Lee 89]


For the purpose of this research, simple HMM structures were used, as can be seen in fig 6.3. However, instead of having a uniform structure across all phonemes as in [Lee 89], there were three different topologies, one for each of the silence, consonant and vowel classes.


Figure 6.3: HMM topologies used (one each for silence, consonants, and vowels)



The number of states used reflects the duration in time of each of the classes: since vowels have the longest duration, they are modeled with 5 states, while silences, which are the shortest, are modeled with only three states. Each transition in all three models consists of a mixture of 18 multivariate Gaussian probability distribution functions (pdfs). To reduce the computational complexity, mixtures going into each state are tied together.
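As an illustration of this per-class topology, the sketch below builds a plain left-to-right transition matrix and one tied mixture set per state. The state counts for silence and vowels come from the text; the consonant state count, the transition probabilities, and the data layout are assumptions made only for the example.

    import numpy as np

    # Assumed state counts per class; 3 (silence) and 5 (vowels) are from the text.
    N_STATES = {"silence": 3, "consonant": 4, "vowel": 5}

    def left_to_right_hmm(n_states, n_mixtures=18, n_features=26):
        """A simple left-to-right topology: each state has a self-loop and a forward arc."""
        trans = np.zeros((n_states, n_states))
        for s in range(n_states - 1):
            trans[s, s] = 0.5          # self-loop
            trans[s, s + 1] = 0.5      # forward transition
        trans[-1, -1] = 1.0
        # All transitions entering a given state share (are tied to) one mixture set
        # of 18 multivariate Gaussians over the 26-dimensional feature vectors.
        tied_mixtures = [
            {"weights": np.full(n_mixtures, 1.0 / n_mixtures),
             "means": np.zeros((n_mixtures, n_features)),
             "variances": np.ones((n_mixtures, n_features))}
            for _ in range(n_states)
        ]
        return trans, tied_mixtures

    vowel_hmm = left_to_right_hmm(N_STATES["vowel"])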

6.4.2 Training and Recognition with CI Models

6.4.2.1 Initialization

The CI models were first initialized such that the sum of the mixture probabilities coming out of each state is one; thus all paths in the model were equi-probable. The means and variances of each of the pdfs were set to random values in the first stage of the experiment, and training was performed on a selected 512-sentence subset of the TIMIT training database. As the means and variances were randomly chosen, some of the distributions proved to be unsuitable, and thus the mixture probabilities tied to these distributions were set to zero by the training procedure. To remedy this situation, well-estimated initial means and variances were chosen and then perturbed to produce slightly different values; these new parameters replaced the unsuitable ones, and retraining was performed. In the first few initial iterations, segmented training was performed to improve the initial parameter estimates. In segmented training, both the time-alignments and the phonetic labeling are used so that the algorithm has a better idea of where each phoneme occurs in the training sentence. However, once the parameters are properly

initialized, the training algorithm should not be restricted to use the time-alignment specified in the sentences, because it might not be accurate, and the constraints it imposes on the training algorithm might result in poor parameter estimation. The reason why time-alignment hinders the training process lies in the continuous nature of the speech signal, which makes it very hard to accurately distinguish the end points of every phoneme pronounced.
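The remedy described above, replacing degenerate mixtures with perturbed copies of well-estimated ones, can be pictured by the following sketch. It is not the exact procedure used in the experiments; the perturbation scale and the small restart weight are illustrative assumptions.

    import numpy as np

    def revive_dead_mixtures(weights, means, variances, scale=0.01, seed=0):
        """Replace mixtures whose weight was driven to zero during training with
        slightly perturbed copies of surviving, well-estimated Gaussians."""
        rng = np.random.default_rng(seed)
        good = np.flatnonzero(weights > 0.0)
        for m in np.flatnonzero(weights == 0.0):
            src = rng.choice(good)
            means[m] = means[src] + scale * rng.standard_normal(means.shape[1])
            variances[m] = variances[src] * (1.0 + scale * rng.standard_normal(variances.shape[1]))
            weights[m] = 1e-3                       # small non-zero weight before retraining
        weights /= weights.sum()                    # mixture weights must sum to one again
        return weights, means, variances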

6.4.2.2 Recognition Results

As the models were being trained on a subset of the 3679-sentence training set of TIMIT, it took 12 iterations to reach the maximum likelihood. After each iteration, recognition was performed using 192 selected sentences from TIMIT's test database.

Since the aim was to measure phoneme recognition rather than word recognition, the phoneme model consisted of a bigram in which the probability of pairs of phonemes is given. The finite state network thus formed had one entry and one exit state, represented by the silence model, and 20.j1 transitions. The performance of the system is measured by its unit accuracy or UA and by its percent correct or PC. The unit accuracy is defined as:

UA = 100 × (1 − (#insertions + #deletions + #substitutions) / (#units in the sentence))    (6.1)

The PC is the same as the UA, but it does not take insertions into consideration; a small sketch of both measures is given below. Table 6.1 presents the results after each training iteration.
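As a small illustration, the two measures can be computed directly from the error counts obtained by aligning the recognized phoneme string against the reference labels (the alignment step itself is not shown here).

    def unit_accuracy(n_ins, n_del, n_sub, n_units):
        """Unit accuracy as defined in equation (6.1)."""
        return 100.0 * (1.0 - (n_ins + n_del + n_sub) / n_units)

    def percent_correct(n_del, n_sub, n_units):
        """Percent correct: identical to UA except that insertions are not counted."""
        return 100.0 * (1.0 - (n_del + n_sub) / n_units)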

As one can see from table 6.1, at some point in the training process (specifically at iteration 13), the estimated parameters become too dependent on the data they are trained on, so that when tested with a different set of sentences, the accuracy rate drops. This phenomenon is called over-training. Usually, training is stopped after the first sign of over-training occurs, so in this case training was stopped at the 13th iteration, and the models obtained at the 12th round are used for subsequent CI recognition.

Iteration   With Segm.   UA(%)   PC(%)
1           yes          57.34   60.29
2           yes          57.81   60.63
3           yes          57.98   60.89
4           yes          58.52   61.50
5           yes          58.58   61.79
6           yes          58.69   61.87
7           yes          58.67   61.87
8           no           59.63   63.75
9           no           59.85   64.23
10          no           60.17   64.46
11          no           60.25   64.61
12          no           60.33   64.63
13          no           60.29   63.24

Table 6.1: Recognition using CI models

6.4.2.3 Effect of Phoneme Bigram Weights

As was discussed in chapter 4, recognition is done by essentially multiplying the probability of the phoneme sequence with the probability of the output observation sequence, P(ph).L(obs). However, these two probabilities are mathematically unrelated; in fact the bigram probability P(ph) is usually smaller than L(obs), so it needs to be weighted in order to become large enough to affect the probability computation in the Viterbi measure. This is the role of the bigram weights: they increase the importance of the transitional probabilities, so the Viterbi measure becomes P^w(ph).L(obs).

Weight   UA(%)   PC(%)   #Ins   #Del
1        60.33   64.63   315     593
4        62.03   64.00   144     887
6        60.78   62.17   102    1039
8        59.38   60.28    66    1212

Table 6.2: Effect of phoneme bigram weights on CI models

Four different weights were tested on the best performing CI models obtained previously. From the results of table 6.2, one can see that, as the bigram weight increases, the number of insertions decreases while the number of deletions increases. This is due to the fact that a larger weight imposes more constraints on the a priori phoneme sequence probability. This ultimately prevents unseen phoneme sequences from being produced, resulting in a lower number of insertions, but it also prevents, in some instances, correct phonemes from appearing, resulting in a higher number of deletions. A good tradeoff between the two was found using a weight of 4, which results in a 1.7% increase in UA. It is important to note here that, since the PC depends only on the number of deletions and substitutions, it normally decreases (due to the increase in deletions) as the weight increases; indeed, with a weight of 4, the PC goes down by 0.63%.
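In practice the weighting is most conveniently applied in the log domain, where P^w(ph).L(obs) becomes w·log P(ph) + log L(obs). The short sketch below shows this combination for one candidate transition; the function and argument names are illustrative, not taken from Roger.

    import math

    def weighted_path_score(log_acoustic_likelihood, bigram_probability, weight=4.0):
        """Score used to rank hypotheses: log L(obs) + w * log P(ph),
        i.e. the log of P(ph)^w * L(obs). A weight of 1 gives the unweighted case."""
        return log_acoustic_likelihood + weight * math.log(bigram_probability)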


6.5 Designing Context Dependent Models

6.5.1 Clustering Techniques


The first step in designing CD models is to decide on the clustering strategy. Many strategies are available in the literature. In [Lee 90b], two methods are used and compared: the first is based on an agglomerative clustering technique, the second on decision trees. In the first method, a CD HMM is produced for every single context, so initially each cluster contains only one allophone. Then an entropy distance measure is used to test the similarity between each pair of clusters pertaining to a phone, and clusters that are "closest" to each other are merged together. The procedure is repeated until a certain convergence criterion is met. Although this method minimizes the entropy, its main disadvantage is that if the training and test sentences are different, then during recognition the new allophones encountered have no CD models associated with them, so CI models had to also be used, which decreased the performance of the system. In the second method, clusters are generated by using a decision tree in which the root node contains all the allophones pertaining to a phoneme; the tree is then traversed top to bottom, and at each level node splitting is done using a binary question about some context of the allophone. The splitting method is based on the same entropy distance measure used in the first algorithm, and the questions are chosen by an expert to capture the different contextual classes. The leaves of the tree contain the generalized allophones. This method eliminated the problem of the agglomerative clustering, because if a new allophone is encountered

during recognition, then the tree is traversed and the cluster to which this allophone belongs is found. [Bahl 91] also uses binary decision trees to determine the clusters by asking a question about the context at every level. However, in his experiments, the context of a phoneme is not only defined by the adjacent left and right phonemes but by several other phonemes preceding and following the central phone. In [LeeC 91] a unit reduction rule is used to create context dependent units. The method is based on the number of tokens of a particular unit that appear in the training data set. [Taki 92] uses the successive state splitting algorithm discussed earlier to optimize phoneme classes. In that approach a simple HMM model, consisting initially of one state and two pdfs, grows iteratively into a more complex model in which contexts are clustered and integrated. Other researchers avoid the clustering problem by integrating all left and right contexts of a certain phoneme inside the model structure of the phoneme in question [Jouv 94a] [Young 94]; this is equivalent to tying the states of different allophones pertaining to the same phone.

In the experiments conducted in this thesis, a form of unit reduction rule is first applied to prune the allophones gathered from the TIMIT training database; acoustic-phonetic reasoning is then used to cluster the remaining allophones. The following steps describe in detail how the CD models were produced.


6.5.2 Creating and Clustering the Allophones

6.5.2.1 Assembling and Pruning the Allophones


The first step in this experiment was to gather all the possible allophones from

the training database of the TIMIT corpus. Since all the training sentences

of TIMIT are phonetically labelled, this made the task very simple. Once

all the allophones are gathered, the unit reduction rule is used to count the

number of times each allophone is encountered in the training set. This is an

important step because if there aren't enough samples for a certain allophone,

the CD model representing it will be poorly estimated, and this will hinder

the performance of the system. The threshold for the unit reduction rule was

set to 10, so any allophone that didn't appear at least 10 times in the training

set was eliminated. From the 21444 different allophones encountered in the

3679 training sentences, only 582 allophones were thus kept; these formed

the set of CD models. However, because of the pruning, CI models had

to be added to the set of CD models to replace those allophones that were

eliminated. The total number of models used was 635: 582 CD models and

53 CI models.
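The pruning step can be summarized by the sketch below, which counts every triphone in the phonetically labelled training sentences and keeps only those seen at least 10 times. The representation of a sentence as a plain list of phoneme labels is an assumption made for the example.

    from collections import Counter

    def select_allophones(training_sentences, threshold=10):
        """training_sentences: list of phoneme-label sequences, one per sentence.
        Returns the set of (left, centre, right) triphones kept as CD models."""
        counts = Counter()
        for phones in training_sentences:
            padded = ["sil"] + list(phones) + ["sil"]     # pad the sentence edges with silence
            for i in range(1, len(padded) - 1):
                counts[(padded[i - 1], padded[i], padded[i + 1])] += 1
        return {tri for tri, n in counts.items() if n >= threshold}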

6.5.2.2 Clustering the Allophones

The set of clusters used is shown in table 6.3. The first 5 clusters, which represent the vowels, were adapted from [Gall 92]; the consonant clusters were formed by hand, using the similarities between the acoustic properties of certain consonants to group them together. The clusters were used for both the left and right contexts of the 582 allophones.


Cluster No.   Phonemes
1             ao aa ay aw ax ah
2             ix ih iy ey
3             ae er
4             uw uh oy ow
5             eh
6             v f hh jh m b p
7             dh th ch n d t
8             z s ng g k
9             zh sh
10            r
11            l
12            w
13            y
14            bcl dcl gcl kcl pcl qcl tcl sil epi
15            el
16            dx
17            en

Table 6.3: Clusters used for the CD models



6.5.3 Training and Recognition using CD Models

6.5.3.1 Initialization


The CD models were initialized using the CI models produced in the first stage. Each allophone was initially a duplicate of its central phoneme. The models were then trained using the 3679 sentences from the TIMIT training database. Since an HMM now represented a triphone, the labelling of the phrases had to change: every sequence of three phone segments was grouped together to form a single segment representing a triphone. The first and last phones were padded with silences.
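The relabelling can be pictured as follows: each time-aligned phone segment keeps its boundaries but receives a triphone label built from its neighbours, with silence used at the sentence edges. The segment representation is an assumption of the sketch.

    def relabel_as_triphones(segments, silence="sil"):
        """segments: list of (start, end, phone) tuples in sentence order.
        Returns the same segments labelled with (left, centre, right) triphones."""
        labels = [phone for _, _, phone in segments]
        padded = [silence] + labels + [silence]           # pad first and last phones with silence
        return [(start, end, (padded[i], padded[i + 1], padded[i + 2]))
                for i, (start, end, _) in enumerate(segments)]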

6.5.3.2 Building the Phoneme Bigram Model

In order to enhance the performance of the CD models, a phoneme bigram model incorporating the 582 CD and 53 CI models had to be designed. Four criteria are used to interconnect the models; however, as the grammar is quite involved, it will be explained by following an example:

Suppose an allophone model A is represented by cl8-aa-cl6, where cl8 refers to cluster 8, to which the left context of A belongs, cl6 refers to cluster 6, to which the right context of A belongs, and aa is the central phoneme; suppose it belongs to cluster 1, represented by cl1. Suppose also that cl8=(z,s), cl6=(v,f) and cl1=(ao,ah). Then:

Criterion #1: Connect A to all allophones whose left context belongs to cl1 and whose central phoneme belongs to cl6, iff such allophone model(s) exist(s). So in this example connections would be made as follows:

    cl8-aa-cl6 → cl1-v-X with probability P(v | aa)   (X is any cluster)
    cl8-aa-cl6 → cl1-f-X with probability P(f | aa)

Criterion #2: Connect A to all CI models that belong to its right cluster cl6. So in this example connections would be made as follows:

    cl8-aa-cl6 → v with probability P(v | aa)
    cl8-aa-cl6 → f with probability P(f | aa)

Criterion #3: Connect to A all CI models that belong to its left cluster cl8. So in this example connections would be made as follows:

    z → cl8-aa-cl6 with probability P(aa | z)
    s → cl8-aa-cl6 with probability P(aa | s)

Criterion #4: Connect all CI models together using the bigram probabilities used for the CI models.

The finite state network thus formed contained 14220 transitions in total. The chain had one entry state, through the silence CI model, and multiple exit states, represented by the silence CI model and all CD models that had the silence as their central phoneme and cluster 14 (to which the silence belongs) as their right context. A sketch of how these connection criteria translate into network arcs is given below.
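The sketch assumes a dictionary of CD models keyed by name and holding (left cluster, central phoneme, right cluster) triples, a mapping from each phoneme to its cluster, and a table of bigram probabilities; all of these names are illustrative rather than taken from the actual system.

    def build_network_arcs(cd_models, ci_models, cluster_of, bigram):
        """cd_models: {name: (left_cluster, centre_phone, right_cluster)};
        ci_models: list of phoneme names; cluster_of: {phoneme: cluster id};
        bigram[(a, b)] = P(b | a). Returns a list of (from, to, probability) arcs."""
        arcs = []
        for name, (left, centre, right) in cd_models.items():
            for name2, (left2, centre2, _) in cd_models.items():
                # Criterion 1: successor's left cluster is the cluster of our centre phone,
                # and its centre phone falls in our right cluster.
                if left2 == cluster_of[centre] and cluster_of[centre2] == right:
                    arcs.append((name, name2, bigram[(centre, centre2)]))
            for ph in ci_models:
                if cluster_of[ph] == right:           # Criterion 2: CD model -> CI model
                    arcs.append((name, ph, bigram[(centre, ph)]))
                if cluster_of[ph] == left:            # Criterion 3: CI model -> CD model
                    arcs.append((ph, name, bigram[(ph, centre)]))
        for a in ci_models:                           # Criterion 4: CI models fully connected
            for b in ci_models:
                arcs.append((a, b, bigram[(a, b)]))
        return arcs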


6.5.3.3 Recognition Results

Recognition was performed after each round of training. As the CI models were already well estimated, it only took 3 iterations for the CD models to reach the maximum likelihood; the results are given in table 6.4.

Iteration   UA(%)   PC(%)
1           60.51   68.01
2           61.04   68.99
3           61.01   69.17

Table 6.4: Recognition using CD models

         CI Models   CD Models   Improvement
UA(%)    60.33       61.01       0.68
PC(%)    64.63       69.17       4.54

Table 6.5: Improvement in recognition using CD models

As the results of table 6.5 indicate, recognition using CD models produced considerably fewer deletions and substitutions (hence the 4.54% increase in the PC); however, the number of insertions remained relatively the same, so the UA only increased by 0.68%. These results lead one to believe that the difference in order between the a priori phoneme sequence probability and the output observation sequence probability was somewhat large, so in order to improve the UA one has to impose a weight on the language probability. The results of this test are shown in the next section.


6.5.3.4 Effect of Using Phoneme Bigram Weights

The same bigram weights were used on the best performing CD models (those of iteration 3) and indeed, the UA improved by 2.83% when a bigram weight of 4 is used, and by 3.48% when the weight was set to 6. These results are presented in tables 6.6 and 6.7.

Weight   UA(%)   PC(%)   #Ins   #Del
1        60.01   69.17   598     330
4        64.86   68.47   265     589
6        64.26   66.97   199     737
8        63.70   65.73   149     894

Table 6.6: Effect of phoneme bigram weights on CD models

         Weight   CI Models   CD Models   Improvement
UA(%)    4        62.03       64.86       2.83
PC(%)    4        64.00       68.47       4.47
UA(%)    6        60.78       64.26       3.48
PC(%)    6        62.17       66.97       4.80

Table 6.7: Improvement in recognition using CD models with bigram weights


6.6 Merging CD Models to Form CI Models

In an attempt to reduce the number of models used for recognition, two ideas are explored to combine the allophones pertaining to a single phone into one complex structure. In the first, all allophones are combined in parallel to form a single structure consisting of one entry state, one exit state, and multiple paths in between, each path representing one context of the phoneme in question. In the second approach, the parallel structure formed in the first experiment is kept; however, states in the parallel paths are connected to states in subsequent paths so that the search algorithm can begin with one context and go to another within the same model.

In the initial stages of the two experiments, the intention was to combine all allophones of a single phone into one model; however, the resulting CI models proved to be too inefficient due to a large increase in computational complexity during the training phase. Since every CD model contained 18 mixtures, each representing a multivariate Gaussian distribution, it meant that for those phones with more than 10 allophones, the parallel structure formed contained close to 1000 transitions, each representing a Gaussian distribution with 26 parameters. Even by tying the mixtures, the number of parameters was still too high.

In order to decrease the number of transitions in the CD models, a form of pruning was performed. For each model and for each mixture, if the probability was less than a certain threshold, the transition was eliminated from the model. The threshold was set to 1.0e-10. This resulted in CD models with varying numbers of mixtures. Recognition, using a bigram weight of 4, was then performed to see the effect on the performance, and it was observed that these new pruned models produced an accuracy rate of UA = 64.79% and PC = 68.44%, which, compared to the original CD models (UA = 64.86%, PC = 68.47%), meant only a 0.07% decrease in UA and a 0.03% decrease in PC, which is negligible. These pruned models are the ones used in the rest of the study.
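The pruning can be sketched as follows; the array layout is an assumption, while the 1.0e-10 threshold is the one quoted above.

    import numpy as np

    def prune_mixtures(weights, means, variances, threshold=1e-10):
        """Remove mixture components whose weight is below the threshold and
        renormalize the remaining weights so they still sum to one."""
        keep = weights >= threshold
        kept = weights[keep]
        return kept / kept.sum(), means[keep], variances[keep]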

However, even when using these models, the complexity of the structures produced was still too high, so the idea of combining all the allophones was discarded and only three allophones for each phone were combined to form the complex model. Needless to say, these models were not expected to perform as well as the CD models, due to the loss of some contextual information; however, the merging techniques proved to be quite promising, as will be seen in the following sections.

6.6.1 CD Models in Parallel

To reduce the amount of contextual information lost, the three allophones picked were those that appeared most often in the training sentences. The CD models that were chosen were assembled as in fig 6.4. The numbers appearing on the transitions denote the tying of the mixtures. The entry state of the model consisted of 3 different sets of mixture probabilities, each leading to one of the contexts in parallel. During the beam search, one of the paths in the model is chosen and followed all the way to the exit state. The 53 models thus obtained were trained and tested on the 192 test sentences from the TIMIT test database. The results obtained are presented in the following section.
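The parallel assembly can be sketched as below, assuming each allophone HMM is given as a left-to-right transition matrix; the entry-state probabilities and the exit arcs are illustrative choices, not the trained values.

    import numpy as np

    def merge_in_parallel(allophone_hmms):
        """Combine the transition matrices of a few allophone HMMs of one phone into a
        single model with a shared entry state, one independent path per context,
        and a shared exit state."""
        sizes = [h.shape[0] for h in allophone_hmms]
        n = 2 + sum(sizes)                                  # entry + all inner states + exit
        trans = np.zeros((n, n))
        offset = 1
        for h, size in zip(allophone_hmms, sizes):
            trans[0, offset] = 1.0 / len(allophone_hmms)    # entry state picks one context path
            trans[offset:offset + size, offset:offset + size] = h
            last = offset + size - 1
            trans[last, last] = 0.5                         # self-loop on the path's last state
            trans[last, n - 1] = 0.5                        # arc from the path's last state to exit
            offset += size
        trans[n - 1, n - 1] = 1.0
        return trans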


Figure 6.4: Parallel structure for the central phoneme "aa"

6.6.1.1 Results


Recognition was tested after each iteration using a bigram weight of 4. When the parameters reached their maximum likelihood, different weights were used on the best performing models (those of iteration 3 in this experiment) and the performance was recorded. The results can be seen in tables 6.8 and 6.9.


Iteration   UA(%)   PC(%)
1           63.53   65.55
2           63.92   65.96
3           63.97   66.23
4           63.81   66.07

Table 6.8: Recognition using allophones combined in a parallel manner

Weight   UA(%)   PC(%)   #Ins   #Del
1        62.01   66.96   356     508
2        63.40   67.03   266     613
4        63.97   66.23   166     776

Table 6.9: Effect of bigram weights on the parallel structured models

6.6.2 A Form of State Clustering

In another experiment, transitions from an inner state in a context C1 to the inner states in the two other contexts C2 and C3 are allowed. This structure imposes fewer constraints on the search procedure since now, if a certain path is chosen at the entry to the model, it does not have to be followed all the way to the exit state. This technique is similar to the state clustering method proposed by [Young 94], in which an agglomerative clustering algorithm is used to cluster and tie together similar states in allophones pertaining to a phoneme. An example of the model designed can be seen in


fig 6.5. Transitions from the first inner state of the model are represented by dotted lines to explain how mixtures are tied: those that have the same pattern are tied together. Mixtures between the states belonging to the same context are tied as in fig 6.4.
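The extra freedom can be sketched as follows: starting from the parallel structure, arcs are added from each inner state to the following inner states of the other context paths, and each row is then renormalized. The cross-arc probability and the exact pattern of added arcs are assumptions of the sketch, not the trained values.

    import numpy as np

    def add_cross_context_arcs(trans, path_ranges, cross_prob=0.1):
        """trans: transition matrix of the merged parallel model;
        path_ranges: list of (first_state, last_state_exclusive) for each context path.
        Adds arcs from state k of one path to state k+1 of every other path."""
        for i, (s1, e1) in enumerate(path_ranges):
            for j, (s2, e2) in enumerate(path_ranges):
                if i == j:
                    continue
                depth = min(e1 - s1, e2 - s2)
                for k in range(depth - 1):
                    trans[s1 + k, s2 + k + 1] = cross_prob
        row_sums = trans.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0.0] = 1.0          # leave all-zero rows untouched
        return trans / row_sums                  # renormalize outgoing probabilities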

Figure 6.5: Tied state structure for the central phoneme "aa"

6.6.2.1 Results

Tying the states of allophones together gives the search algorithm more freedom in choosing the highest-scoring sequence of states, so as a result the performance of the system is better than with the first merging technique. The results of the recognition after each iteration round are given in table 6.10. The best performing models are then tested using the same bigram weights used in table 6.9; the results are given in table 6.11. In table 6.12, the results of the best performing models using a weight of 4 are given.

Iteration   UA(%)   PC(%)
1           63.98   65.74
2           64.20   66.07
3           64.46   66.43
4           64.26   66.34

Table 6.10: Recognition using state tying between allophones

Weight   UA(%)   PC(%)   #Ins   #Del
1        62.66   66.86   308     593
2        63.96   66.96   220     685
4        64.46   66.43   144     825

Table 6.11: Effect of bigram weights on the tied state models


         CI Models   CD Models   CD Models (Pruned)   Parallel Struct.   Tied State Struct.
UA(%)    62.03       64.86       64.79                63.97              64.46
PC(%)    64.00       68.47       68.44                66.23              66.43

Table 6.12: Overall results using a phoneme bigram weight of 4

The results of table 6.12 suggest that by trading context dependent models for context independent models containing contextual information, the performance of the system does not depreciate significantly. In fact, the performance comes very close to that of the CD models, with only a 0.33% difference in unit accuracy between the tied state models and the pruned CD models. In addition, the new CI models are formed with only a small subset of all the allophones; one could then conclude that if all, or most of, the allophones can be efficiently included in the CI structure, the gap between the two performances should be significantly smaller.

The parallel structure formed from the CD models did not prove to be as efficient as the tied state structure. The performance of the system degraded by 0.83% in unit accuracy when these models were used. One explanation would be that the number of allophones was too low and not representative of all the contexts. Furthermore, since the search algorithm was forced to choose one of the three paths and follow it all the way until the end, if the wrong context was chosen at the beginning, the algorithm was not given a chance to move to a "closer" context at a later state, hence the decrease in performance.


Chapter 7

Conclusion and Future Work

As the popularity of ASR systems continues to expand, the demand for a high level of accuracy increases. This thesis explored one of the important factors in improved ASR design, which is the use of context dependent models to represent the phonemes of the language. Previous work (as in [Schwa 85] and [Lee 90b]) has already demonstrated that the use of CD models provides better accuracy rates. Indeed, our experiments have shown that the CD models give approximately a 4% increase in performance compared to the CI models. The key point in designing such models is allophone clustering. There are many methods already developed to cluster the phonemes; they range from implementing iterative optimization algorithms such as [Taki 92], to using some form of agglomerative clustering technique as in [Lee 90b] and [Young 94], to building decision trees as in [Bahl 91], and finally to using phonetic reasoning as was done in this thesis. Another consideration is the trainability of these models: building robust CD models means training on



as many context-specific words as possible. However, there is always a limitation on the number of training sets available. Perhaps one can count the number of training samples associated with each Gaussian (per model) and disregard distributions which have a count below a certain threshold; this would guarantee that the available training data can properly re-estimate the parameters of the CD models. Finally, the CD models should be general enough so as to produce good recognition rates even for words that are not present in the training database.

This thesis also explored some innovative techniques to merge CD models into complex CI structures. The aim of this study was to reduce both the number of models needed and the complexity of the grammar used to connect them together. By having CI models that contain some contextual information, one can decrease the computational complexity while maintaining good accuracy. In the initial stages of this study, it was demonstrated that merging all the allophones pertaining to one phoneme did not reduce the complexity; rather, it increased it. But when a small subset is used, and the Viterbi algorithm is allowed to go from one context to another at different stages of the search, within the same models, the results were very promising. In fact, the tied structure models were only 0.33% less accurate than the pruned CD models. In future work, one could perhaps gradually increase the number of allophones in each complex structure until a certain computational complexity threshold is reached.

Bibliography

[App 89] Applebaum T.H., Hanson B.A., Enhancing the Discrimination of Speaker Independent Hidden Markov Models with Corrective Training, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989, pp. 302-305.

[Bahl 86] Bahl L.R., Brown P.F., de Souza P.V., Mercer R.L., Nahamoo D., Maximum Mutual Information Estimation of Hidden Markov Parameters for Speech Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 49-52.

[Bahl 91] Bahl L.R., de Souza P.V., Gopalakrishnan P.S., Nahamoo D., Picheny M.A., Decision Trees for Phonological Rules in Continuous Speech, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 185-188.

[Bau 72] Baum L.E. and associates, An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes, Inequalities, 1972, pp. 1-8.

[Casa 90] Casacuberta F., Vidal E., Mas B., Rulot H., Learning the Structure of HMM's through Grammatical Inference Techniques, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 717-720.

[Chow 90] Chow Y.L., Maximum Mutual Information Estimation of HMM Parameters for Continuous Speech Recognition using the N-Best Algorithm, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 701-704.

[DeMori 93] De Mori R., Flammia G., Speaker-Independent Consonant Classification in Continuous Speech with Distinctive Features and Neural Networks, Acoustical Society of America, Dec 1993.

[DeMori 94] De Mori R., Brugnara F., Giuliani D., Parallel Hidden Markov Models for Speech Recognition, Istituto per la Ricerca Scientifica e Tecnologica, Povo, Trento, Italy, Apr 1994.

[DeMori] De Mori R., Snow C., Galler M., Speech Recognition and Understanding, School of Computer Science, McGill University.

[DeMori 95] De Mori R., Brugnara F., Galler M., Search and Learning Strategies for Improving Hidden Markov Models, Computer Speech and Language, Vol 9, Apr 1995, pp. 107-121.

[Doug 94] Douglas B.P., Incremental Speaker Adaptation, ARPA SLS Technology Workshop, March 1994.

[Eph 89] Ephraim Y., Dembo A., Rabiner L.R., A Minimum Discrimination Information Approach for Hidden Markov Models, IEEE Transactions on Information Theory, Vol 35, No. 5, Sept 1989, pp. 1001-1013.

[Furu 86] Furui S., Speaker Independent Isolated Word Recognition Using Dynamic Features of Speech Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol 34, No. 1, Feb 1986, pp. 52-59.

[Gall 92] Galler M., Improving Phoneme Models for Speaker-Independent Automatic Speech Recognition, Master's Thesis, Faculty of Science, McGill University, 1992.

[Gauv 91] Gauvain J.L., Haton J.P., Pierrel J-M., Perennou G., Caelen J., Reconnaissance Automatique de la Parole, DUNOD Informatique, Bordas, Paris, 1991.

[Gauv 95] Gauvain J.L., Lamel L., A Phone-Based Approach To Non-Linguistic Speech Feature Identification, Computer Speech and Language, Vol 9, Jan 1995, pp. 87-103.

[Gray 84] Gray R.M., Vector Quantization, IEEE ASSP Magazine, April 1984, pp. 4-29.

[Haeb 92] Haeb-Umbach R., Ney H., Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition, IEEE Transactions, 1992, Vol 1, pp. 13-16.

[Haeb 93] Haeb-Umbach R., Geller D., Ney H., Improvements in Connected Digit Recognition Using Linear Discriminant Analysis and Mixture Densities, IEEE Transactions, 1993, Vol 2, pp. 239-242.

[Huang 89] Huang X.D., Jack M.A., Semi-Continuous Markov Models for Speech Signals, Readings in Speech Recognition, Academic Press, 1989.

[Huang 90] Huang X.D., Ariki Y., Jack M.A., Hidden Markov Models for Speech Recognition, Edinburgh University Press, Edinburgh, 1990.

[Jouv 94a] Jouvet D., Dautremont M., Gossart A., Comparaison des Multimodeles et des Densites Multigaussiennes pour la Reconnaissance de la Parole par Modeles de Markov, ICSLP 1994, Yokohama, pp. 153-158.

[Jouv 94b] Jouvet D., Bartkova K., Stouff A., Structure of Allophonic Models and Reliable Estimation of the Contextual Parameters, ICSLP 1994, Yokohama, pp. 147-150.

[Komo 87] Komo J.J., Random Signal Analysis in Engineering Systems, Academic Press, 1987.

[Lee 89] Lee K.F., Hon H.W., Speaker Independent Phone Recognition Using Hidden Markov Models, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol 37, No. 11, Nov 1989, pp. 1641-1646.

[Lee 90a] Lee K.F., Hon H.W., Reddy R., An Overview of the SPHINX Speech Recognition System, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol 38, No. 1, Jan 1990, pp. 35-44.

[Lee 90b] Lee K.F., Hayamizu S., Hon H.W., Huang C., Swartz J., Weide R., Allophone Clustering for Continuous Speech Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 749-752.

[LeeC 89] Lee C.H., Rabiner L.R., A Frame Synchronous Network Search Algorithm For Connected Word Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol 37, No. 11, Nov 1989, pp. 1649-1658.

[LeeC 90] Lee C.H., Rabiner L.R., Pieraccini R., Wilpon J.G., Acoustic Modeling for Large Vocabulary Speech Recognition, Computer Speech and Language, Vol 4, No. 2, April 1990, pp. 127-165.

[LeeC 90b] Lee C.H., Rabiner L.R., Goldman E.R., Wilpon J.G., Automatic Recognition of Keywords in Unconstrained Speech using Hidden Markov Models, IEEE Transactions on Acoustics, Speech and Signal Processing, Nov 1990, pp. 1870-1878.

[LeeC 91] Lee C.H., Giachin E., Rabiner L.R., Pieraccini R., Rosenberg A.E., Improved Acoustic Modelling for Speaker Independent Large Vocabulary Continuous Speech Recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991, pp. 161-164.

[Lennig 90] Lennig M., Putting Speech Recognition to Work in the Telephone Network, Computer, August 1990, pp. 35-41.

[Lennig 92] Lennig M., Automated Bilingual Directory Assistance Trial in Bell Canada, Proceedings of the 1st IEEE Workshop on Interactive Voice Technology for Telecommunication Applications, N.J., Oct 1992.

[Lip 82] Liporace L.A., Maximum Likelihood Estimation for Multivariate Observations of Markov Sources, IEEE Transactions on Information Theory, Vol IT-28, No. 5, Sept 1982, pp. 729-734.

[Ljol 94] Ljolje A., High Accuracy Phone Recognition Using Context Clustering and Quasi-Triphone Models, Computer Speech and Language, Vol 8, Academic Press, 1994, pp. 129-151.

[Makh 94] Makhoul J., Schwartz R., State of the Art in Continuous Speech Recognition, Voice Communication Between Humans and Machines, National Academy Press, Washington D.C., 1994, pp. 165-198.


[Mokb 94] Mokbel C., Pachès-Leal, Jouvet D., Monné J., Compensation of Telephone Line Effects For Robust Speech Recognition, ICSLP 1994, Yokohama, pp. 161-164.

[Ney 88] Ney H., Noll A., Phoneme Modeling Using Continuous Mixture Densities, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1988, pp. 437-440.

[Norm 91] Normandin Y., Hidden Markov Models, Maximum Mutual Information Estimation, and the Speech Recognition Problem, PhD Thesis, Department of Electrical Engineering, McGill University, 1991.

[Obert 94] Oberteuffer J.A., Commercial Applications of Speech Interface Technology: An Industry at the Threshold, Voice Communication between Humans and Machines, National Academy Press, 1994, pp. 347-369.

[OGrady 87] O'Grady, Dobrovolsky, Contemporary Linguistic Analysis, An Introduction, Copp Clark Pitman, 1987.

[Opp 89] Oppenheim A.V., Schafer R.W., Discrete Time Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[OShaug 87] O'Shaughnessy D., Speech Communication, Human and Machine, Addison Wesley, 1987.

[Place 93] Placeway P.R., Schwartz P., Fung P., Nguyen L., The Estimation of Powerful Language Models from Small and Large Corpora, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, April 1993, pp. 33-36.

[Pic 90] Picone J.W., Continuous Speech Recognition Using Hidden Markov Models, IEEE ASSP Magazine, July 1990, pp. 26-41.

[Pic 93] Picone J.W., Signal Modeling Techniques in Speech Recognition, Proceedings of the IEEE, Vol 81, No. 9, Sept 1993, pp. 1215-1247.

[Roe 94] Roe B.D., Wilpon J.G., Voice Communication Between Humans and Machines, National Academy Press, Washington D.C., 1994.

[Rabi 78] Rabiner L., Schafer R.W., Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[Rabi 88] Rabiner L., Mathematical Foundations of Hidden Markov Models, NATO ASI Series, Vol F46, Berlin Heidelberg, 1988, pp. 183-205.

[Rabi 89] Rabiner L., A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol 77, No. 2, Feb 1989, pp. 257-285.

[Rabi 93] Rabiner L., Juang B.H., Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, 1993.

[Schwa 85] Schwartz R., Chow Y., Kimball O., Roucos S., Krasner M., Makhoul J., Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1985, pp. 1205-1208.

[Sun 95] Sun D., Deng L., Analysis of Acoustic-Phonetic Variations In Fluent Speech Using TIMIT, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, pp. 201-204.

[Taki 92] Takami J., Sagayama S., A Successive State Splitting Algorithm for Efficient Allophone Modeling, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp. 1573-1576.

[VAl 88] Van Alphen P., Pols L.C.W., A Fast Algorithm for FIR Filterbank, Speech 88, 7th FASE Symposium, Edinburgh, Book 2, 1988, pp. 677-682.

[VAl 89] Van Alphen P., Pols L.C.W., A Real-Time FIR-Based Filterbank, Proceedings Eurospeech, Paris, 1989, pp. 621-624.

[VAl 91] Van Alphen P., Van Bergem D.R., Hidden Markov Models and Their Application in Speech Recognition, IEEE Proceedings, Vol 79, No. 1, April 1991, pp. 1-25.

[Vite 67] Viterbi A.J., Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, IEEE Transactions on Information Theory, Vol 13, No. 2, April 1967, pp. 260-269.

[Wilpon 88] Wilpon J.G., DeMarco D., Mikkilinemi P.R., Isolated Word Recognition over the DDD Telephone Network, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, April 1988, pp. 55-58.

[Wilpon 94] Wilpon J.G., Application of Voice Processing Technology in Telecommunication, Voice Communication between Humans and Machines, National Academy Press, 1994, pp. 280-309.

[Young 92] Young S.J., The General Use of Tying in Phoneme-Based HMM Speech Recognizers, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp. I-569-572.

[Young 94] Young S.J., Woodland P.C., State Clustering in Hidden Markov Model-Based Continuous Speech Recognition, Computer Speech and Language, Vol 8, Oct 1994, pp. 369-383.

