
Nonlinear Analysis 71 (2009) e2772–e2789

Contents lists available at ScienceDirect

Nonlinear Analysis

journal homepage: www.elsevier.com/locate/na

A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation

V.Z. Këpuska*, T.B. Klein

Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL 32901, USA

Article info

MSC: 68T10

Keywords: Wake-Up-Word; Speech recognition; Hidden Markov Models; Support Vector Machines; Mel-scale cepstral coefficients; Linear prediction spectrum; Enhanced spectrum; HTK; Microsoft SAPI

Abstract

Wake-Up-Word (WUW) is a new paradigm in speech recognition (SR) that is not yet widely recognized. This paper defines and investigates WUW speech recognition, and describes the details of this novel solution and the technology that implements it. WUW SR is defined as detection of a single word or phrase when spoken in the alerting context of requesting attention, while rejecting all other words, phrases, sounds, noises and other acoustic events, as well as the same word or phrase spoken in a non-alerting context, with virtually 100% accuracy. In order to achieve this accuracy, the following innovations were accomplished: (1) Hidden Markov Model triple scoring with Support Vector Machine classification, (2) combination of multiple speech feature streams: Mel-scale Filtered Cepstral Coefficients (MFCCs), Linear Prediction Coefficients (LPC)-smoothed MFCCs, and Enhanced MFCCs, and (3) an improved Voice Activity Detector with Support Vector Machines.

WUW detection and recognition performance is 2514%, or 26 times, better than HTK for the same training and testing data, and 2271%, or 24 times, better than the Microsoft SAPI 5.1 recognizer. The out-of-vocabulary rejection performance is over 65,233%, or 653 times, better than HTK, and 5900% to 42,900%, or 60 to 430 times, better than the Microsoft SAPI 5.1 recognizer.

This solution, which utilizes a new recognition paradigm, applies not only to the WUW task but also to general Speech Recognition tasks.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Automatic Speech Recognition (ASR) is the ability of a computer to convert a speech audio signal into its textual transcription [1]. Some motivations for building ASR systems are, presented in order of difficulty, to improve human–computer interaction through spoken language interfaces, to solve difficult problems such as speech-to-speech translation, and to build intelligent systems that can process spoken language as proficiently as humans [2].

Speech as a computer interface has numerous benefits over traditional interfaces such as a GUI with mouse and keyboard: speech is natural for humans, requires no special training, improves multitasking by leaving the hands and eyes free, and is often faster and more efficient than conventional input methods. Even though many tasks are solved with visual, pointing interfaces, and/or keyboards, speech has the potential to be a better interface for a number of tasks where full natural language communication is useful [2] and the recognition performance of the Speech Recognition (SR) system is sufficient to perform the tasks accurately [3,4]. This includes hands-busy or eyes-busy applications, such as where the user has objects to manipulate or equipment/devices to control. Those are circumstances where the user wants to multitask while asking for assistance from the computer.

* Corresponding address: Electrical & Computer Engineering Department, Florida Institute of Technology, 150 W. University Blvd., Olin Engineering Bldg., 353, Melbourne, FL 32901, USA. Tel.: +1 (321) 674 7183. E-mail addresses: [email protected] (V.Z. Këpuska), [email protected] (T.B. Klein).

0362-546X/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.na.2009.06.089


Novel SR technology named Wake-Up-Word (WUW) [5,6] bridges the gap between natural-language and other voice recognition tasks [7]. WUW SR is a highly efficient and accurate recognizer specializing in the detection of a single word or phrase when spoken in the alerting or WUW context [8] of requesting attention, while rejecting all other words, phrases, sounds, noises and other acoustic events with virtually 100% accuracy, including the same word or phrase uttered in a non-alerting (referential) context.

The WUW speech recognition task is similar to key-word spotting. However, WUW-SR is different in one important aspect: it is able to discriminate the specific word/phrase used only in the alerting context (and not in other, e.g. referential, contexts). Specifically, the sentence "Computer, begin PowerPoint presentation" exemplifies the use of the word "Computer" in the alerting context. On the other hand, in the sentence "My computer has dual Intel 64 bit processors each with quad cores" the word computer is used in a referential (non-alerting) context. Traditional key-word spotters are not able to discriminate between the two cases. The discrimination would only be possible by deploying a higher-level natural language processing subsystem. However, for applications deploying such solutions it is very difficult (practically impossible) to determine in real time whether the user is speaking to the computer or about the computer.

Traditional approaches to key-word spotting are usually based on large vocabulary word recognizers [9], phone recognizers [9], or whole-word recognizers that use either HMMs or word templates [10]. Word recognizers require tens of hours of word-level transcriptions as well as a pronunciation dictionary. Phone recognizers require phone-marked transcriptions, and whole-word recognizers require word markings for each of the keywords [11]. Word-based key-word recognizers are not able to find words that are out-of-vocabulary (OOV). To solve this problem of OOV keywords, spotting approaches based on phone-level recognition have been applied to supplement or replace systems based upon large vocabulary recognition [12]. In contrast, our approach to WUW-SR is independent of the basic unit of recognition (word, phone or any other sub-word unit). Furthermore, it is capable of discriminating whether the user is speaking to the recognizer or not without deploying language modeling and natural language understanding methods. This feature makes the WUW task distinct from key-word spotting.

To develop natural and spontaneous speech recognition applications, an ASR system must:

(1) be able to determine if the user is speaking to it or not, and
(2) have orders of magnitude higher speech recognition accuracy and robustness.

The current WUW-SR system demonstrates that problem (1) is solved with sufficient accuracy. The generality of the solution has also been successfully demonstrated on the tasks of event detection and recognition applied to seismic sensor signals (i.e., footsteps, hammer hits to the ground, hammer hits to a metal plate, and a bowling ball drop). Further investigations are underway to demonstrate the power of this approach to Large Vocabulary Speech Recognition (LVSR), solving problem (2) above. This effort is done as part of NIST's Rich Transcription Evaluation program (http://www.nist.gov/speech/tests/rt/).

The presented work defines, in Section 2, the concept of Wake-Up-Word (WUW), a novel paradigm in speech recognition. In Section 3, our solution to the WUW task is presented. The implementation details of the WUW are depicted in Section 4. The WUW speech recognition evaluation experiments are presented in Section 5. A comparative study between our WUW recognizer, HTK, and Microsoft's SAPI 5.1 is presented in Section 6. Concluding remarks and future work are provided in Sections 7 and 8.

2. Wake-Up-Word paradigm

A WUW speech recognizer is a highly efficient and accurate recognizer specializing in the detection of a single word or phrase when spoken in the context of requesting attention (e.g., alerting context), while rejecting, with virtually 100% accuracy, the same word or phrase spoken in a referential (non-alerting) context as well as all other words, phrases, sounds, noises and other acoustic events. This high accuracy enables the development of speech recognition driven interfaces that utilize dialogs using speech only.

2.1. Creating 100% speech-enabled interfaces

One of the goals of speech recognition is to allow natural communication between humans and computers via speech [2], where natural implies similarity to the ways humans interact with each other on a daily basis. A major obstacle to this is the fact that most systems today still rely to a large extent on non-speech input, such as pushing buttons or mouse clicking. However, much like a human assistant, a natural speech interface must be continuously listening and must be robust enough to recover from any communication errors without non-speech intervention.

Speech recognizers deployed in continuously listening mode are continuously monitoring acoustic input and do not necessarily require non-speech activation. This is in contrast to the push-to-talk model, in which speech recognition is only activated when the user pushes a button. Unfortunately, today's continuously listening speech recognizers are not reliable enough due to their insufficient accuracy, especially in the area of correct rejection. For example, such systems often respond erratically, even when no speech is present. They sometimes interpret background noise as speech, and they sometimes incorrectly assume that certain speech is addressed to the speech recognizer when in fact it is targeted elsewhere (context misunderstanding). These problems have traditionally been solved by the push-to-talk model: requesting the user to push a button immediately before or during talking, or a similar prompting paradigm.


Another problem with traditional speech recognizers is that they cannot recover from errors gracefully, and often require non-speech intervention. Any speech-enabled human–machine interface based on natural language relies on carefully crafted dialogues. When the dialogue fails, currently there is no good mechanism to resynchronize the communication, and typically the transaction between human and machine fails by termination. A typical example is an SR system which is in a dictation state when in fact the human is attempting to use command-and-control to correct a previous dictation error. Often the user is forced to intervene by pushing a button or keyboard key to resynchronize the system. Current SR systems that do not deploy the push-to-talk paradigm use implicit context switching. For example, a system that has the ability to switch from "dictation mode" to "command mode" does so by trying to infer whether the user is uttering a command rather than dictating text. This task is rather difficult to perform with high accuracy, even for humans. The push-to-talk model uses explicit context switching, meaning that the action of pushing the button explicitly sets the context of the speech recognizer to a specific state.

To achieve the goal of developing a natural speech interface, it is first useful to consider human-to-human communication.

Upon hearing an utterance of speech, a human listener must quickly decide whether or not the speech is directed towards him or her. This decision determines whether the listener will make an effort to "process" and understand the utterance. Humans can make this decision quickly and robustly by utilizing visual, auditory, semantic, and/or contextual clues.

Visual clues might be gestures such as waving of hands or other facial expressions. Auditory clues are attention-grabbing words or phrases such as the listener's name (e.g. John), interjections such as "hey", "excuse me", and so forth. Additionally, the listener may make use of prosodic information such as pitch and intonation, as well as identification of the speaker's voice.

Semantic and/or contextual clues are inferred from interpretation of the words in the sentence being spoken, visual input, prior experience, and customs dictated by the culture. For example, if one knows that a nearby speaker is in the process of talking on the phone with someone else, he/she will ignore the speaker altogether, knowing that the speech is not targeting him/her. Humans are very robust in determining when speech is targeted towards them, and should a computer SR system be able to make the same decisions, its robustness would increase significantly.

Wake-Up-Word is proposed as a method to explicitly request the attention of a computer using a spoken word or phrase.

The WUW must be spoken in the context of requesting attention and should not be recognized in any other context. After successful detection of the WUW, the speech recognizer may safely interpret the following utterance as a command. The WUW is analogous to the button in push-to-talk, but the interaction is completely based on speech. Therefore it is proposed to use explicit context switching via WUW. This is similar to how context switching occurs in human-to-human communication.

2.2. WUW-SR differences from other SR tasks

Wake-Up-Word is often mistakenly compared to other speech recognition tasks such as key-word spotting or command & control, but WUW is different from the previously mentioned tasks in several significant ways. The most important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW and of the same words uttered in a non-alerting context, while maintaining a correct acceptance rate over 99%. This requirement sets WUW-SR apart from other speech recognition systems because no existing system can guarantee 100% reliability by any measure without significantly lowering the correct recognition rate. It is this guarantee that allows WUW-SR to be used in novel applications that previously have not been possible. Second, a WUW-SR system should be context dependent; that is, it should detect only words uttered in the alerting context. Unlike key-word spotting, which tries to find a specific keyword in any context, the WUW should only be recognized when spoken in the specific context of grabbing the attention of the computer in real time. Thus, WUW recognition can be viewed as a refined key-word spotting task, albeit a significantly more difficult one. Finally, WUW-SR should maintain high recognition rates in speaker-independent or speaker-dependent mode, and in various acoustic environments.

2.3. Significance of WUW-SR

The reliability of WUW-SR opens up the world of speech recognition to applications that were previously impossible. Today's speech-enabled human–machine interfaces are still regarded with skepticism, and people are hesitant to entrust any significant or accuracy-critical tasks to a speech recognizer.

Despite the fact that SR is becoming almost ubiquitous in the modern world, widely deployed in mobile phones, automobiles, desktop, laptop, and palm computers, many handheld devices, telephone systems, etc., the majority of the public pays little attention to speech recognition. Moreover, the majority of speech recognizers use the push-to-talk paradigm rather than continuously listening, simply because they are not robust enough against false positives. One can imagine that the driver of a speech-enabled automobile would be quite unhappy if his or her headlights suddenly turned off because the continuously listening speech recognizer misunderstood a phrase in the conversation between driver and passenger.

The accuracy of speech recognizers is often measured by word error rate (WER), which uses three measures [1]:

Insertion (INS) – an extra word was inserted in the recognized sentence.
Deletion (DEL) – a correct word was omitted in the recognized sentence.
Substitution (SUB) – an incorrect word was substituted for a correct word.


WER is defined as:

    WER = (INS + DEL + SUB) / (true number of words in sentence) × 100%.    (1)
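As a concrete illustration of Eq. (1), the following sketch computes WER via a Levenshtein alignment of a reference transcription and a recognizer hypothesis; the minimum edit distance is exactly INS + DEL + SUB. This is a generic Python illustration, not code from the system described here.

```python
# Minimal WER sketch: the edit distance between reference and hypothesis
# word sequences counts insertions, deletions, and substitutions, per Eq. (1).

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("begin power point presentation",
                      "begin power point presentation now"))  # one INS -> 25.0
```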

Substitution errors, or equivalently correct recognitions in the WUW context, represent the accuracy of the recognition, while insertion + deletion errors are caused by other factors, typically erroneous segmentation.

However, WER as an accuracy measurement is of limited usefulness for a continuously listening command and control system. To understand why, consider a struggling movie director who cannot afford a camera crew and decides to install a robotic camera system instead. The system is controlled by a speech recognizer that understands four commands: "lights", "camera", "action", and "cut". The programmers of the speech recognizer claim that it has 99% accuracy, and the director is eager to try it out. When the director sets up the scene and utters "lights", "camera", "action", the recognizer makes no mistake, and the robotic lights and cameras spring into action. However, as the actors begin performing the dialogue in their scene, the computer misrecognizes "cut" and stops the film, ruining the scene and costing everyone their time and money. The actors could only speak 100 words before the recognizer, which truly had 99% accuracy, triggered a false acceptance.

This anecdote illustrates two important ideas. First, judging the accuracy of a continuously listening system requires using a measure of "rejection", that is, the ability of the system to correctly reject out-of-vocabulary utterances. The WER formula incorrectly assumes that all speech utterances are targeted at the recognizer and that all speech arrives in simple, consecutive sentences. Consequently, performance of WUW-SR is measured in terms of correct acceptances (CA) and correct rejections (CR). Because it is difficult to quantify and compare the number of CRs, rejection performance is also given in terms of "false acceptances per hour".

The second note of interest is that 99% rejection accuracy is actually a very poor performance level for a continuously listening command and control system. In fact, the 99% accuracy claim is a misleading figure. While 99% acceptance accuracy is impressive in terms of recognition performance, 99% rejection implies one false acceptance per 100 words of speech. It is not uncommon for humans to speak hundreds of words per minute, and such a system would trigger multiple false acceptances per minute. That is the reason why today's speech recognizers (a) primarily use a "push-to-talk" design, and (b) are limited to performing simple convenience functions and are never relied on for critical tasks. On the other hand, experiments have shown the WUW-SR implementation presented in this work to reach up to 99.99% rejection accuracy, which translates to one false acceptance every 2.2 h.
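The arithmetic behind these false-acceptance figures is easy to check. The sketch below assumes a speaking rate of 75 words per minute; that rate is our assumption (chosen because it reproduces the 2.2 h figure), as the paper does not state the rate it used.

```python
# Back-of-the-envelope check: rejection accuracy -> time between false accepts.
words_per_minute = 75  # assumed speaking rate
for rejection_accuracy in (0.99, 0.9999):
    words_per_false_accept = 1.0 / (1.0 - rejection_accuracy)
    hours = words_per_false_accept / (words_per_minute * 60)
    print(f"{rejection_accuracy:.2%} rejection -> one FA every {hours:.2f} h")
# 99.00% rejection -> one FA every 0.02 h (about 1.3 min)
# 99.99% rejection -> one FA every 2.22 h
```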

3. Wake-Up-Word problem identification and solution

Wake-Up-Word recognition requires solving three problems: (1) detection of the WUW (e.g., alerting) context, (2) correct identification of the WUW, and (3) correct rejection of non-WUW.

3.1. Detecting WUW context

The Wake-Up-Word is proposed as a means to grab the computer's attention with extremely high accuracy. Unlike key-word spotting, the recognizer should not trigger on every instance of the word, but rather only in certain alerting contexts. For example, if the Wake-Up-Word is "computer", the WUW should not trigger if spoken in a sentence such as "I am now going to use my computer".

Some previous attempts at WUW-SR avoided context detection by choosing a word that is unique and unlikely to be spoken. An example of this is WildFire Communication's "wildfire" word introduced in the early 1990s. However, even though the solution was customized for the specific word "wildfire", it was neither accurate enough nor robust.

The implementation of WUW presented in this work selects context based on leading and trailing silence. Speech is segmented from background noise by the Voice Activity Detector (VAD), which makes use of three features for its decisions: energy difference from the long-term average, spectral difference from the long-term average, and Mel-scale Filtered Cepstral Coefficient (MFCC) [13] difference from the long-term average. In the future, prosody information (pitch, energy, intonation of speech) will be investigated as an additional means to determine the context as well as to set the VAD state.
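A minimal sketch of a VAD decision rule in this spirit is given below: each incoming frame is scored by its distance from running long-term averages of energy, spectrum, and MFCCs, and the background averages adapt only on non-speech frames. The threshold, update rate, and normalization here are illustrative assumptions, not the parameters of the detector described above.

```python
import numpy as np

class SimpleVAD:
    """Toy VAD: compares each frame against long-term background averages."""

    def __init__(self, alpha=0.995, threshold=3.0):
        self.alpha = alpha          # long-term average update rate (assumed)
        self.threshold = threshold  # combined-distance threshold (assumed)
        self.avg = None             # running background average per feature

    def is_speech(self, energy, spectrum, mfcc):
        feats = [np.atleast_1d(energy), np.asarray(spectrum), np.asarray(mfcc)]
        if self.avg is None:        # first frame initializes the background
            self.avg = [f.astype(float).copy() for f in feats]
            return False
        # Normalized distance of the current frame from each long-term average.
        dists = [np.linalg.norm(f - a) / (np.linalg.norm(a) + 1e-9)
                 for f, a in zip(feats, self.avg)]
        speech = sum(dists) > self.threshold
        if not speech:              # adapt the background only on non-speech
            self.avg = [self.alpha * a + (1 - self.alpha) * f
                        for f, a in zip(feats, self.avg)]
        return speech
```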

3.2. Identifying WUW

The VAD is responsible for finding utterances spoken in the correct context and segmenting them from the rest of the audio stream. After the VAD makes a decision, the next task of the system is to identify whether or not the segmented utterance is a WUW.

The front end extracts the following features from the audio stream: MFCC [13], LPC-smoothed MFCC [14], and enhanced MFCC, a proprietary technology. The MFCC and LPC features are standard features deployed by the vast majority of recognizers. The enhanced MFCCs are introduced to validate their expected superior performance over the standard features, specifically exhibited under noisy conditions. Finally, although all three features individually may provide comparable performance, it is hypothesized that the combination of all three features will provide significant performance gains, as demonstrated in this paper. This in turn indicates that each feature captures slightly different short-time characteristics of the speech signal segment from which it is produced.

Fig. 4.1. WUW-SR overall architecture. The signal processing module accepts the raw samples of the audio signal and produces a spectral representation of a short-time signal. The feature extraction module generates features from the spectral representation of the signal. Those features are decoded with corresponding HMMs. The individual feature scores are classified using an SVM.

Acoustic modeling is performed with Hidden Markov Models (HMMs) [15,16] with triple scoring [4], an approach further developed throughout this work. Finally, classification of those scores is performed with Support Vector Machines (SVMs) [17–21].

3.3. Correct rejection of Non-WUW

The final and critical task of a WUW system is correctly rejecting Out-Of-Vocabulary (OOV) speech. The WUW-SR system should have a correct rejection rate of virtually 100% while maintaining high correct recognition rates of over 99%. The task of rejecting OOV segments is difficult, and there are many papers in the literature discussing this topic. Typical approaches use garbage models, filler models, and/or noise models to capture extraneous words or sounds that are out of vocabulary. For example, in [22], a large number of garbage models is used to capture as much of the OOV space as possible. In [23], OOV words are modeled by creating a generic word model which allows for arbitrary phone sequences during recognition, such as the set of all phonetic units in the language. Their approach yields a correct acceptance rate of 99.2% and a false acceptance rate of 71% on data collected from the Jupiter weather information system.

The WUW-SR system presented in this work uses triple scoring, a novel scoring technique introduced by the author [4] and further developed throughout this work, to assign three scores to every incoming utterance. The three scores produced by the triple scoring approach are highly correlated, and furthermore the correlation between scores of INV utterances is distinctly different from that of OOV utterances. The difference in correlations makes INV and OOV utterances highly separable in the score space, and an SVM classifier is used to produce the optimal separating surface.

4. Implementation of WUW-SR

The WUW Speech Recognizer is a complex software system comprising three major parts: Front End, VAD, and Back End. The Front End is responsible for extracting features from an input audio signal. The VAD (Voice Activity Detector) segments the signal into speech and non-speech regions. Finally, the Back End performs recognition and scoring. The diagram in Fig. 4.1 shows the overall procedure.

The WUW recognizer has been implemented entirely in C++, and is capable of running live and performing recognitions in real time. Functionally, it is broken into three major units: (1) the Front End, responsible for feature extraction and VAD classification of each frame; (2) the Back End, performing word segmentation and classification of those segments for each feature stream; and (3) INV/OOV classification using individual HMM segmental scores with an SVM.

The Front End Processor and the HMM software can also be run and scripted independently, and they are accompanied by numerous MATLAB scripts to assist with the batch processing necessary for training the models.


[Fig. 4.2 block diagram: input samples pass through DC filtering, pre-emphasis, windowing, and autocorrelation into a common spectrogram computation (LPC and FFT); a feature-based VAD (energy, spectrogram, and MFCC features with decision logic) runs alongside estimation of the stationary background spectrum; MFCC, LPC-MFCC, and Enhanced MFCC (ENH-MFCC) features are produced via Mel-scale filtering, DCT, and cepstral mean normalization (CMN), together with the VAD state.]

Fig. 4.2. Details of the signal processing of the Front End. It contains a common spectrogram computation module, a feature-based VAD, and a module that computes 3 features (MFCC, LPC-MFCC, and enhanced MFCC) for each analysis frame. The VAD classifier determines the state (speech or no speech) of each frame, indicating if a segment contains speech. This information is used by the Back End of the recognizer.

4.1. Front end

The front end is responsible for extracting features out of the input signal. Three sets of features are extracted: Mel Filtered Cepstral Coefficients (MFCC) [13], LPC (Linear Predictive Coding) [14] smoothed MFCCs, and Enhanced MFCCs. The diagram in Fig. 4.2 illustrates the architecture of the Front End and VAD. Fig. 4.3 shows an example of the waveform superimposed with its VAD segmentation, its spectrogram, and its enhanced spectrogram.
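For reference, the sketch below implements the textbook MFCC pipeline (pre-emphasis, framing, Hamming window, FFT power spectrum, Mel filterbank, log compression, DCT, and cepstral mean normalization), assuming 8 kHz telephone audio and common frame sizes. It is a generic stand-in, not the authors' front end, and in particular not the proprietary Enhanced MFCC variant.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=512, sr=8000):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):       # triangular filters
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_fft=512, n_ceps=13):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + (len(signal) - frame_len) // hop   # assumes >= one frame
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_fft=n_fft, sr=sr)
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logmel = np.log(fb @ power + 1e-10)
        feats.append(dct(logmel, type=2, norm='ortho')[:n_ceps])
    feats = np.array(feats)
    return feats - feats.mean(axis=0)       # cepstral mean normalization (CMN)

audio = np.random.default_rng(0).standard_normal(8000)  # 1 s of noise stand-in
print(mfcc(audio).shape)                                # (98, 13)
```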

4.2. Back end

The Back End is responsible for classifying observation sequences as In-Vocabulary (INV), i.e. the sequence of frames hypothesized as a segment is a Wake-Up-Word, or Out-Of-Vocabulary (OOV), i.e. the sequence is not a Wake-Up-Word. The WUW-SR system uses a combination of Hidden Markov Models and Support Vector Machines for acoustic modeling, and as a result the Back End consists of an HMM recognizer and an SVM classifier. Prior to recognition, HMM and SVM models must be created and trained for the word or phrase which is to be the Wake-Up-Word.

When the VAD state changes from VAD_OFF to VAD_ON, the HMM recognizer resets and prepares for a new observation sequence. As long as the VAD state remains VAD_ON, feature vectors are continuously passed to the HMM recognizer, where they are scored using the novel triple scoring method. If multiple feature streams are used (a process explained in a later section), recognition is performed for each stream in parallel. When the VAD state changes from VAD_ON to VAD_OFF, multiple scores are obtained from the HMM recognizer and are sent to the SVM classifier. The SVM produces a classification score which is compared against a threshold to make the final classification decision of INV or OOV.
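The control flow just described can be summarized in a few lines. In this sketch the VAD, the per-stream recognizers, and the SVM are placeholder objects with hypothetical method names (is_speech_frame, reset, observe, triple_score, decision_function); the authors' actual C++ interfaces are not given in the paper.

```python
def run_back_end(frames, vad, recognizers, svm, threshold=0.0):
    """frames: iterable of per-frame tuples, one feature vector per stream."""
    in_utterance = False
    decisions = []
    for frame in frames:
        if vad.is_speech_frame(frame):
            if not in_utterance:                 # VAD_OFF -> VAD_ON
                for rec in recognizers:
                    rec.reset()                  # prepare for a new sequence
                in_utterance = True
            for rec, stream_vec in zip(recognizers, frame):
                rec.observe(stream_vec)          # incremental scoring update
        elif in_utterance:                       # VAD_ON -> VAD_OFF
            scores = []
            for rec in recognizers:
                scores.extend(rec.triple_score())  # three scores per stream
            u = svm.decision_function([scores])[0]
            decisions.append('INV' if u > threshold else 'OOV')
            in_utterance = False
    return decisions
```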

5. Wake-Up-Word speech recognition experiments

For HMM training and recognition we used a number of corpora as described next.



Fig. 4.3. Example of the speech signal (first plot, in blue) with VAD segmentation (first plot, in red), spectrogram (second plot), and enhanced spectrogram (third plot) generated by the Front End module. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.1. Speech corpora

Speech recognition experiments made use of the following corpora: CCW17, WUW-I, WUW-II, Phonebook, and Callhome. The properties of each corpus are summarized in Table 5.1.

5.2. Selecting INV and OOV utterances

The HMMs used throughout this work are predominantly whole-word models, and HMM training requires many examples of the same word spoken in isolation or segmented from continuous speech. As such, training utterances were only chosen from CCW17, WUW-I, and WUW-II, but OOV utterances were chosen from all corpora. The majority of the preliminary tests were performed using the CCW17 corpus, which is well organized and very easy to work with. Of the ten words available in CCW17, the word "operator" has been chosen for most experiments as the INV word. This word is not especially unique in the English vocabulary and is moderately confusable (e.g., "operate", "moderator", "narrator", etc.), making it a good candidate for demonstrating the power of WUW speech recognition.

5.3. INV/OOV decisions & cumulative error rates

The WUW speech recognizer Back End is essentially a binary classifier that, given an arbitrary-length input sequence of observations, classifies it as In-Vocabulary (INV) or Out-Of-Vocabulary (OOV). Although the internal workings are very complex and can involve a hierarchy of Hidden Markov Models, Support Vector Machines, and other classifiers, the end result is that any given input sequence produces a final score.


Table 5.1. Corpora used for training and testing. CCW17, WUW-I and WUW-II were corpora privately collected by ThinkEngine Networks. Phonebook and Callhome are publicly available corpora from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.

CCW17
- Speech type: Isolated words
- Vocabulary size: 10
- Unique speakers: More than 200
- Total duration: 7.9 h
- Total words: 4234
- Channel type: Telephone
- Channel quality: Variety of noise conditions
- Description: The CCW17 corpus contains 10 isolated words recorded over the telephone from multiple speakers. The calls were made from land lines, cellular phones, and speakerphones, under a variety of noise conditions. The data is grouped by telephone call, and each call contains one speaker uttering up to 10 words. The speakers have varying backgrounds, but are predominantly American from the New England region and Indian (many of whom carry a heavy Indian accent).

WUW-I & WUW-II
- Speech type: Isolated words & sentences
- Description: WUW-I and WUW-II are similar to CCW17. They consist of isolated words and additionally contain sentences that use those words in various ways.

Phonebook
- Speech type: Isolated words
- Vocabulary size: 8002
- Total words: 93,267
- Unique speakers: 1358
- Length: 37.6 h
- Channel type: Telephone
- Channel quality: Quiet to noisy
- Description: The Phonebook corpus consists of 93,267 files, each containing a single utterance surrounded by two small periods of silence. The utterances were recorded by 1358 unique individuals. The corpus has been split into a training set of 48,598 words (7945 unique) spanning 19.6 h of speech, and a testing set of 44,669 words (7944 unique) spanning 18 h of speech.

Callhome
- Speech type: Free & spontaneous
- Vocabulary size: 8577
- Total words: 165,546
- Unique speakers: More than 160
- Total duration: 40 h
- Channel type: Telephone
- Channel quality: Quiet to noisy
- Description: The Callhome corpus contains 80 natural telephone conversations in English, each 30 min long. For each conversation both channels were recorded and included separately.

The score space is divided into two regions, R1 and R−1, based on a chosen decision boundary, and an input sequence can be marked as INV or OOV depending on which region it falls in. The score space can be partitioned at any threshold depending on the application requirements, and the error rates can be computed for that threshold. An INV utterance that falls in the OOV region is a False Rejection error, while an OOV utterance that falls in the INV region is considered a False Acceptance error.

The OOV error rate (False Acceptances or False Positives) and the INV error rate (False Rejections or False Negatives) can be plotted as functions of the threshold over the range of scores. The point where they intersect is known as the Equal Error Rate (EER), and is often used as a simple measure of classification performance. The EER defines a score threshold for which the False Rejection rate is equal to the False Acceptance rate. However, the EER is often not the minimum error, especially when the scores do not have well-formed Gaussian or uniform distributions, nor is it the optimal threshold for most applications. For instance, in Wake-Up-Word the primary concern is minimizing OOV error, so one would pick a biased threshold that minimizes the False Acceptance rate at the expense of increasing the False Rejection rate.

For the purpose of comparison and evaluation, the following experiments include plots of the INV and OOV recognition score distributions, the OOV and INV error rate curves, as well as the point of equal error rate. Where applicable, the WUW optimal threshold and error rates at that threshold are also shown.
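The error curves and EER described above can be computed from two pooled score lists by sweeping the threshold, as in the sketch below; the Gaussian score distributions are synthetic stand-ins for the recognizer's outputs.

```python
import numpy as np

def error_curves(inv_scores, oov_scores):
    """FR/FA rates as functions of the decision threshold (score >= t -> INV)."""
    thresholds = np.sort(np.concatenate([inv_scores, oov_scores]))
    fr = np.array([(inv_scores < t).mean() for t in thresholds])   # false rejections
    fa = np.array([(oov_scores >= t).mean() for t in thresholds])  # false acceptances
    return thresholds, fr, fa

def equal_error_rate(inv_scores, oov_scores):
    t, fr, fa = error_curves(inv_scores, oov_scores)
    i = np.argmin(np.abs(fr - fa))      # point where the two curves cross
    return t[i], (fr[i] + fa[i]) / 2

rng = np.random.default_rng(0)
inv = rng.normal(-80, 8, 500)           # synthetic INV scores
oov = rng.normal(-100, 8, 4000)         # synthetic OOV scores
thr, eer = equal_error_rate(inv, oov)
print(f"EER {eer:.1%} at threshold {thr:.1f}")
```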

5.4. Plain HMM scores

For the first tests on speech data, an HMM was trained on the word "operator". The training sequences were taken from the CCW17 and WUW-II corpora, for a total of 573 sequences from over 200 different speakers. After features were extracted, some of the erroneous VAD segments were manually removed. The INV testing sequences were the same as the training sequences, while the OOV testing sequences included the rest of the CCW17 corpus (3833 utterances, 9 different words, over 200 different speakers). The HMM was a left-to-right model with no skips, 30 states, and 6 mixtures per state, and was trained with two iterations of the Baum–Welch algorithm [16,15].

The score is the result of the Viterbi algorithm over the input sequence. Recall that the Viterbi algorithm finds the state sequence that has the highest probability of being taken while generating the observation sequence. The final score is that probability normalized by the number of input observations, T. Fig. 5.1 shows the results.

The distributions look Gaussian, but there is a significant overlap between them. The equal error rate of 15.5% essentially means that at that threshold, 15.5% of the OOV words would be classified as INV, and 15.5% of the INV words would be classified as OOV. Obviously, no practical applications can be developed based on the performance of this recognizer.
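A minimal log-domain version of this scoring is sketched below for a toy left-to-right model: the utterance score is the best state path's log-likelihood divided by the number of frames T. The 3-state transition matrix and flat observation likelihoods are toy stand-ins, not the trained 30-state, 6-mixture model.

```python
import numpy as np

def viterbi_score(log_trans, log_obs):
    """log_trans: (N, N) log transition matrix; log_obs: (T, N) per-frame
    log-likelihoods. Returns the max-path log-likelihood normalized by T."""
    T, N = log_obs.shape
    delta = np.full(N, -np.inf)
    delta[0] = log_obs[0, 0]             # left-to-right model starts in state 0
    for t in range(1, T):
        # best predecessor for each state j: max_i (delta[i] + log_trans[i, j])
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_obs[t]
    return delta[N - 1] / T              # path must end in the final state

log_trans = np.log(np.array([[0.6, 0.4, 0.0],     # toy 3-state model, no skips
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-300)
log_obs = np.log(np.full((10, 3), 1.0 / 3.0))     # uninformative observations
print(viterbi_score(log_trans, log_obs))
```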


Fig. 5.1. Score distributions for plain HMM on CCW17 corpus data for the WUW ‘‘operator’’.


Fig. 5.2. Scatter plots of score 1 vs. score 2 and their corresponding independently constructed histograms.


Fig. 5.3. Scatter plots of score 1 vs. score 3 and their corresponding independently constructed histograms.

5.5. Triple scoring method

In addition to the Viterbi score, the algorithm developed in this work produces two additional scores for any given observation sequence. When considering the three scores as features in a three-dimensional space, the separation between INV and OOV distributions increases significantly. The next experiment runs recognition on the same data as above, but this time the recognizer uses the triple scoring algorithm to output three scores. Figs. 5.2 and 5.3 show two-dimensional scatter plots of Score 1 vs. Score 2, and Score 1 vs. Score 3, for each observation sequence. In addition, a histogram on the horizontal axis shows the distribution of Score 1 independently, and a similar histogram on the vertical axis shows the distributions of Score 2 and Score 3 independently. The histograms are hollowed out so that the overlap between distributions can be seen clearly. The distribution for Score 1 is exactly the same as in the previous section, as the data and model have not changed.

Page 10: A novel Wake-Up-Word speech recognition system, Wake-Up …my.fit.edu/~vkepuska/share/MyPapers/WUWPaper.pdf · 2009-11-17 · AnovelWake-Up-Wordspeechrecognitionsystem,Wake-Up-Word

V.Z. Këpuska, T.B. Klein / Nonlinear Analysis 71 (2009) e2772–e2789 e2781

Fig. 5.4. Linear SVM decision boundary on the scatter plot of scores 1–2.

Fig. 5.5. Linear SVM decision boundary on the scatter plot of scores 1–3.

No individual score produces a good separation between classes; in fact, the Score 2 distributions have almost complete overlap. However, the two-dimensional separation in either case is remarkable.

5.6. Classification with Support Vector Machines

In order to automatically classify an input sequence as INV or OOV, the triple-score feature space, R^3, can be partitioned by a binary classifier into two regions, R^3_1 and R^3_{-1}. Support Vector Machines have been selected for this task for the following reasons: they can produce various kinds of decision surfaces, including radial basis function, polynomial, and linear; and they employ Structural Risk Minimization (SRM) [18] to maximize the margin, which has been shown empirically to give good generalization performance.

Two types of SVMs have been considered for this task: linear and RBF. The linear SVM uses a dot product kernel function, K(x, y) = x · y, and separates the feature space with a hyperplane. It is very computationally efficient because no matter how many support vectors are found, evaluation requires only a single dot product. Figs. 5.4 and 5.5 show that the separation between distributions based on Score 1 and Score 3 is almost linear, so a linear SVM would likely give good results. However, in the Score 1 / Score 2 space, the distributions have a curvature, so the linear SVM is unlikely to generalize well for unseen data. Figs. 5.4 and 5.5 show the decision boundary found by a linear SVM trained on Scores 1+2 and Scores 1+3, respectively.

Using 0 as the threshold, the accuracy of Scores 1–2 is 99.7% Correct Rejection (CR) and 98.6% Correct Acceptance (CA), whilefor Scores 1–3 it is 99.5% CR and 95.5% CA. If considering only two features, Scores 1 and 2 seem to have better classificationability. However, combining the three scores produces the plane shown below, in Figs. 5.6 and 5.7 from two differentangles.



Fig. 5.6. Linear SVM: three-dimensional discriminating surface (view 1) as a function of scores 1–3.


Fig. 5.7. Linear SVM: three-dimensional discriminating surface (view 2) as a function of scores 1–3.


Fig. 5.8. Histogram of score distributions for triple scoring using the linear SVM discriminating surface.

The plane split the feature space with an accuracy of 99.9% CR and 99.5% CA (just 6 of 4499 total sequences were misclassified). The accuracy was better than in any of the two-dimensional cases, indicating that Score 3 contains additional information not found in Score 2 and vice versa. The classification error rate of the linear SVM is shown in Fig. 5.8.

The conclusion is that using the triple scoring method combined with a linear SVM decreased the equal error rate on this particular data set from 15.5% to 0.2%; in other words, it increased accuracy by over 77.5 times (i.e., an error rate reduction of 7650%).

In the next experiment, a Radial Basis Function (RBF) kernel was used for the SVM. The RBF function, K(x, y) = e^(−γ|x−y|²), maps feature vectors into a potentially infinite-dimensional Hilbert space and is able to achieve complete separation between classes in most cases. However, the γ parameter must be chosen carefully in order to avoid overtraining. As there is no way to determine it automatically, a grid search may be used to find a good value. For most experiments γ = 0.008 gave good results. Shown below in Figs. 5.9 and 5.10 are the RBF SVM contours for both two-dimensional cases.
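A simple held-out grid search of the kind suggested above can be written as follows; the candidate values and the 70/30 split are illustrative assumptions. It could be applied directly to the synthetic X, y from the earlier SVM sketch.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def pick_gamma(X, y, grid=(0.001, 0.002, 0.004, 0.008, 0.016, 0.032)):
    """Return the gamma from `grid` with the best held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    return max(grid, key=lambda g: SVC(kernel='rbf', gamma=g, C=15)
                                   .fit(X_tr, y_tr).score(X_te, y_te))
```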


Fig. 5.9. RBF SVM kernel decision boundary on the scatter plot of scores 1–2.

Fig. 5.10. RBF SVM kernel decision boundary on the scatter plot of scores 1–3.

At the u = 0 threshold, the classification accuracy was 99.8% CR, 98.6% CA for Scores 1–2, and 99.4% CR, 95.6% CA for Scores 1–3. In both cases the RBF kernel formed a closed decision region around the INV points (recall that the SVM decision function at u = 0 is shown by the green line).

Some interesting observations can be made from these plots. First, the INV outlier in the bottom left corner of the first plot caused a region to form around it. SVM function output values inside the region were somewhere between −1 and 0, not high enough to cross into the INV class. However, it is apparent that the RBF kernel is sensitive to outliers, so the γ parameter must be chosen carefully to prevent overtraining. Had the γ parameter been a little higher, the SVM function output inside the circular region would have increased beyond 0, and that region would have been considered INV.

Second, the RBF kernel's classification accuracy showed almost no improvement over the linear SVM. However, a number of factors should be considered when making the final decision about which kernel to use. If both models have similar performance, it is usually the simplest one that is eventually chosen (according to Occam's Razor). RBF kernels feature an extra parameter (γ) and are therefore considered more complex than linear kernels. However, the criterion that matters the most in a real-time application such as this is not necessarily the ability to form curved decision boundaries but the number of support vectors utilized by each model; the one with the fewest would be the simplest one and the one finally chosen, whether RBF or linear kernel based. From the plots, the two distributions look approximately Gaussian with a similar covariance structure, whose principal directions roughly coincide for both populations. This may be another reason why a roughly linear decision boundary between the two classes provides good classification results. On the other hand, it is expected that, due to the RBF kernel's ability to create arbitrary curved decision surfaces, it will have better generalization performance than the linear SVM's hyperplane (as demonstrated by the results described below) if the specific score distribution does not follow a Gaussian distribution.

Figs. 5.11 and 5.12 show an RBF kernel SVM trained on all three scores. The RBF kernel created a closed three-dimensional surface around the INV points and had a classification accuracy of

99.95% CR, 99.30% CA. If considering u = 0 as the threshold, the triple-score SVM with RBF kernel function shows only little improvement over the linear SVM for this data set.



Fig. 5.11. RBF SVM kernel: three-dimensional discriminating surface (view 1) as a function of scores 1–3.


Fig. 5.12. RBF SVM kernel: three-dimensional discriminating surface (view 2) as a function of scores 1–3.

Fig. 5.13. Histogram of OOV (blue) and INV (red) score distributions for triple scoring using the RBF SVM kernel.

However, as shown in Fig. 5.13, the SVM score distributions are significantly more separated, and the equal error rate is lower than that of the linear SVM: from 0.2% to 0.1%. Table 5.2 summarizes all of the results up to this point.

5.7. Front end features

The Front End Processor produces three sets of features: MFCC (Mel Filtered Cepstral Coefficients), LPC (Linear Predictive Coding) smoothed MFCC, and Enhanced MFCC (E-MFCC). The E-MFCC features have a higher dynamic range than regular MFCC, so they have been chosen for most of the experiments. However, the next tests attempt to compare the performance of E-MFCC, MFCC, and LPC features.


Table 5.2. Score comparisons of conventional single scoring and the presented triple scoring using linear and RBF SVM kernels.

Method                                   Score 1–2              Score 1–3              Triple score           EER
Single scoring                           N/A                    N/A                    N/A                    15.5%
Triple scoring, linear kernel            CR 99.7%, CA 98.6%     CR 99.5%, CA 95.4%     CR 99.9%, CA 99.5%     0.2%
Triple scoring, RBF kernel (γ = 0.008)   CR 99.8%, CA 98.6%     CR 99.3%, CA 95.6%     CR 99.9%, CA 99.3%     0.1%


Fig. 5.14. Histogram of OOV (blue) and INV (red) score distributions for E-MFCC features.


Fig. 5.15. Histogram of OOV (blue) and INV (red) score distributions for MFCC features.

In addition, a method of combining all three features has been developed and has shown excellent results.

The first experiment verifies the classification performance of each feature stream independently. Three HMMs were generated with the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures, 1 skip transition. The HMMs were trained on utterances of the word "operator" taken from the CCW17 and WUW-II corpora. One HMM was trained with E-MFCC features, one with MFCC features, and one with LPC features.

For OOV rejection testing, the Phonebook corpus, which contains over 8000 unique words, was divided into two sets, Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every unique word, so both sets covered the same range of vocabulary.

Recognition was then performed on the training data set, the Phonebook 1 data, and the Phonebook 2 data. SVM models were trained on the same training set and the Phonebook 1 set, with parameters γ = 0.001, C = 15. Classification tests were performed on the training set and the Phonebook 2 set. Figs. 5.14–5.16 show the score distributions and error rates for each set of front end features.

In this experiment LPC had an equal error rate slightly better than E-MFCC. However, the difference is negligible, and the distributions and their separation look almost the same. MFCC had the worst equal error rate.

The next experiment attempts to combine information from all three sets of features. It is hypothesized that each feature set contains some unique information, and combining all three of them would extract the maximum amount of information from the speech signal. To combine them, the features are considered as three independent streams, and recognition and triple scoring is performed on each stream independently. However, the scores from the three streams are aggregated and classified via SVM in a nine-dimensional space.



Fig. 5.16. Histogram of OOV (blue) and INV (red) score distributions for LPC features.


Fig. 5.17. Histogram of OOV (blue) and INV (red) score distributions for combined features.

Fig. 5.17 shows the results when combining the same scores from the previous section using this method.

Combining the three sets of features produced a remarkable separation between the two distributions. The Phonebook 2 set used for OOV testing contains 49,029 utterances previously unseen by the recognizer, of which over 8000 are unique words. At an operating threshold of −0.5, just 16 of 49,029 utterances were misclassified, yielding an FA error of 0.03%. For INV, just 1 of 573 utterances was misclassified.
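The stream-combination step can be summarized as below: the three scores from each of the three per-stream recognizers are concatenated into one nine-dimensional vector per utterance, and a single SVM is trained on those vectors. The scorer objects and their triple_score method are placeholders for the trained per-stream HMMs, not the authors' API.

```python
import numpy as np
from sklearn.svm import SVC

def nine_dim_vector(frames_by_stream, scorers):
    """Concatenate (score1, score2, score3) from the MFCC, LPC-MFCC, and
    E-MFCC recognizers into one 9-dimensional feature vector."""
    vec = []
    for frames, scorer in zip(frames_by_stream, scorers):
        vec.extend(scorer.triple_score(frames))   # placeholder interface
    return np.asarray(vec)                        # shape (9,)

def train_combined_svm(score_vectors, labels):
    # gamma and C follow the values reported for the final system (Section 6.4).
    return SVC(kernel='rbf', gamma=0.001, C=15).fit(score_vectors, labels)
```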

6. Baseline and final results

This section establishes a baseline by examining the performance of HTK and of Microsoft SAPI 5.1, a commercial speech recognition engine. It then presents the final performance results of the WUW-SR system.

6.1. Corpora setup

Because Wake-Up-Word Speech Recognition is a new concept, it is difficult to compare its performance with existing systems. In order to perform a fair, nonbiased analysis of performance, all attempts were made to configure the test systems to achieve the highest possible performance for WUW recognition. For the final experiments, once again the word "operator" was chosen for testing. The first step entails choosing a good set of training and testing data.

The ideal INV data for building a speaker-independent word model would be a large number of isolated utterances from a wide variety of speakers. As such, the INV training data was once again constructed from isolated utterances taken from the CCW17 and WUW-II corpora, for a total of 573 examples.

Because the WUW-SR system is expected to listen continuously to all forms of speech and non-speech and never activate falsely, it must be able to reject OOV utterances in both continuous free speech and isolated cases. The following two scenarios have therefore been defined for testing OOV rejection capability: continuous natural speech (Callhome corpus), and isolated words (Phonebook corpus).


Fig. 6.1. Histogram of the overall recognition scores for the Callhome 2 test (plot axes: recognition score vs. cumulative error percent).

As described in Section 5, the Callhome corpus contains natural speech from phone call conversations in 30-min-long segments. The corpus spans 40 h of speech from over 160 different speakers, and contains over 8500 unique words. The corpus has been split into two sets of 20 h each, Callhome 1 and Callhome 2.

The Phonebook corpus contains isolated utterances from over 1000 speakers and over 8000 unique words. The fact that every utterance is isolated actually presents a worst case scenario to the WUW recognizer. In free speech, the signal is often segmented by the VAD into long sentences, indicative of non-WUW context and easily classified as OOV. However, short isolated words are much more likely to be confusable with the Wake-Up-Word. The Phonebook corpus was also divided into two sets, Phonebook 1 and Phonebook 2. Each set was assigned roughly half of the utterances of every word. Phonebook 1 contained 48,598 words (7945 unique) and Phonebook 2 contained 44,669 words (7944 unique).
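A sketch of how such a per-word split can be done, assuming a simple list of (word, utterance id) pairs rather than the actual Phonebook file layout:

    from collections import defaultdict

    def split_by_word(utterances):
        # Assign roughly half of the utterances of every unique word to each
        # set, so both halves cover the same vocabulary.
        by_word = defaultdict(list)
        for word, utt_id in utterances:
            by_word[word].append(utt_id)

        set1, set2 = [], []
        for word, ids in by_word.items():
            half = (len(ids) + 1) // 2  # round up so singletons land in set 1
            set1.extend((word, u) for u in ids[:half])
            set2.extend((word, u) for u in ids[half:])
        return set1, set2

    # Toy example; the real split produced 48,598 and 44,669 words.
    demo = [("operator", 1), ("operator", 2), ("zebra", 3), ("zebra", 4)]
    pb1, pb2 = split_by_word(demo)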

6.2. HTK

The first baseline comparison was against the Hidden Markov Model Toolkit (HTK) version 3.0 from Cambridge University [24]. First, an HMM word model with similar parameters (number of states, mixtures, topology) as the WUW model was generated. Next, the model was trained on the INV speech utterances. Note that while the utterances were the same in both cases (i.e., HTK and e-WUW), the speech features (MFCCs) were extracted with the HTK feature extractor, not the WUW front end. Finally, recognition was performed by computing the Viterbi score of test sequences against the model. After examining the scores of all sequences, a threshold was manually chosen to optimize the FA and FR error rates, as sketched below.
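The manual threshold choice amounts to a one-dimensional sweep over candidate score values. A sketch under the assumption that the objective is minimizing the combined FA + FR rate (the paper does not state the exact criterion used), with synthetic placeholder scores:

    import numpy as np

    def best_threshold(oov_scores, inv_scores):
        # Try every observed score as a candidate threshold and keep the one
        # that minimizes the combined FA + FR error rate.
        candidates = np.unique(np.concatenate([oov_scores, inv_scores]))
        best_t, best_err = None, np.inf
        for t in candidates:
            fa = np.mean(oov_scores > t)    # OOV accepted in error
            fr = np.mean(inv_scores <= t)   # INV rejected in error
            if fa + fr < best_err:
                best_t, best_err = t, fa + fr
        return best_t

    rng = np.random.default_rng(2)
    t = best_threshold(rng.normal(-4, 1, 1000), rng.normal(0, 1, 100))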

6.3. Commercial SR engine: Microsoft SAPI 5.1

The second baseline comparison evaluates the performance of a commercial speech recognition system. The commercial system was designed for dictation and command & control modes, and has no concept of "Wake-Up-Word". In order to simulate a wake-up-word scenario, a command & control grammar was created with the wake-up-word as the only command (e.g. "operator"). There is little control of the engine parameters other than a generic "sensitivity" setting. Preliminary tests were executed and the sensitivity of the engine was adjusted to the point where it gave the best results. Speaker adaptation was also disabled for the system.

6.4. WUW-SR

For the WUW-SR system, three HMMs were trained: one with E-MFCC features, one with MFCC features, and one with LPC smoothed MFCC features. All three used the following parameters: 30 states, 2 entry and exit silence states, 6 mixtures, 1 skip transition. SVMs were trained on the nine-dimensional score space obtained by HMM triple scoring on three feature streams, with parameters γ = 0.001, C = 15. A sketch of the corresponding HMM topology is given below.
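For clarity, a sketch of the left-to-right topology implied by these parameters; the transition probabilities below are illustrative placeholders rather than trained values, and the silence states are omitted:

    import numpy as np

    def left_to_right_transitions(n_states, p_stay=0.6, p_next=0.3, p_skip=0.1):
        # Left-to-right HMM transition matrix in which each state may
        # self-loop, advance one state, or skip one state (1 skip transition).
        A = np.zeros((n_states, n_states))
        for i in range(n_states):
            A[i, i] = p_stay
            if i + 1 < n_states:
                A[i, i + 1] = p_next
            if i + 2 < n_states:
                A[i, i + 2] = p_skip
            A[i] /= A[i].sum()  # renormalize rows truncated at the final state
        return A

    # 30 emitting states, as in the WUW word model.
    A = left_to_right_transitions(30)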

Note that unlike the baseline recognizers, the WUW Speech Recognizer uses OOV data for training the SVM classifier. The OOV training set consisted of Phonebook 1 and Callhome 1, while the testing set consisted of Phonebook 2 and Callhome 2. Although it is possible to train a classifier using only INV examples, such as a one-class SVM, this has not yet been explored in this work.

The plots in Figs. 6.1 and 6.2 show the combined feature results on the Callhome 2 and Phonebook 2 testing sets. Table 6.1 summarizes the results of HTK, the commercial speech recognizer, and WUW-SR. In Table 6.1, Relative Error Rate Reduction is computed as:

RERR = (B − N) / N × 100% (2)

where B is the baseline error rate and N is the new error rate. The reduction factor is computed as B/N.
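As a quick check, plugging the Phonebook 2 FA rates from Table 6.1 into Eq. (2):

    def rerr(baseline, new):
        # Relative Error Rate Reduction, Eq. (2).
        return (baseline - new) / new * 100.0

    # Phonebook 2 FA rates from Table 6.1: HTK 19.6%, WUW-SR 0.03%.
    print(rerr(19.6, 0.03))  # ~65,233% relative reduction
    print(19.6 / 0.03)       # reduction factor ~653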


Fig. 6.2. Histogram of the overall recognition scores for the Phonebook 2 test (plot axes: recognition score vs. cumulative error percent).

Table 6.1
Summary of final results comparing the False Rejection (FR) and False Acceptance (FA) rates of WUW-SR vs. HTK and Microsoft SAPI 5.1, reported as Relative Error Rate Reduction and as error reduction factor.

Recognizer                       CCW17+WUW-II (INV)   Phonebook 2 (OOV)          Callhome 2 (OOV)
                                 FR      FR (%)       FA      FA (%)    FA/h     FA      FA (%)   FA/h
HTK                              105     18.3         8793    19.6      488.5    2421    2.9      121.1
Microsoft SAPI 5.1 SR            102     16.6         5801    12.9      322.3    467     0.6      23.35
WUW-SR                           4       0.70         15      0.03      0.40     9       0.01     0.45

Reduction of Rel. Error Rate / Factor
WUW vs. HTK                      2514% / 26           65,233% / 653              28,900% / 290
WUW vs. Microsoft SAPI 5.1 SR    2271% / 24           42,900% / 430              5900% / 60

Table 6.1 demonstrates WUW-SR's improvement over current state of the art recognizers. WUW-SR with three feature streams was several orders of magnitude superior to the baseline recognizers in INV recognition and particularly in OOV rejection. Of the two OOV corpora, Phonebook 2 was significantly more difficult to recognize for all three systems due to the fact that every utterance was a single isolated word. Isolated words have similar durations to the WUW and differ just in pronunciation, putting to the test the raw recognition power of a SR system. HTK had 8793 false acceptance errors on Phonebook 2 and the commercial SR system had 5801 errors. WUW-SR demonstrated its OOV rejection capabilities by committing just 15 false acceptance errors on the same corpus, an error rate reduction of 58,520% relative to HTK and 38,573% relative to the commercial system.

On the Callhome 2 corpus, which contains lengthy telephone conversations, recognizers are helped by the VAD in the sense that most utterances are segmented as long phrases instead of short words. Long phrases differ from the WUW in both content and duration, and are more easily classified as OOV. Nevertheless, HTK had 2421 false acceptance errors and the commercial system had 467 errors, compared to WUW-SR's 9 false acceptance errors in 20 h of unconstrained speech.

7. Conclusion

In this paper we have defined and investigated WUW Speech Recognition, and presented a novel solution. The most important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW while maintaining a correct acceptance rate of 99% or higher. This requirement sets WUW-SR apart from other speech recognition systems, because no existing system can guarantee 100% reliability by any measure, and allows WUW-SR to be used in novel applications that previously have not been possible.

The WUW-SR system developed in this work provides efficient and highly accurate speaker-independent recognition at performance levels not achievable by current state of the art recognizers.

Extensive testing demonstrates accuracy improvements superior by several orders of magnitude over the best known academic speech recognition system, HTK, as well as a leading commercial speech recognition system. Specifically, the WUW-SR system correctly detects the WUW with 99.3% accuracy. It correctly rejects non-WUW with 99.97% accuracy for cases where they are uttered in isolation. In continuous and spontaneous free speech the WUW system makes 0.45 false acceptance errors per hour, or one false acceptance in 2.22 h.

WUW detection performance is 2514%, or 26 times, better than HTK for the same training and testing data, and 2271%, or 24 times, better than Microsoft SAPI 5.1, a leading commercial recognizer. The non-WUW rejection performance is over 65,233%, or 653 times, better than HTK, and 5900% to 42,900%, or 60 to 430 times, better than Microsoft SAPI 5.1.


8. Future work

In spite of the significant advancement of SR technologies, true natural language interaction with machines is not yet possible with state-of-the-art systems. WUW-SR technology has demonstrated, through this work and in two actual prototype software applications (a voice-only activated PowerPoint Commander and an Elevator simulator), the ability to enable spontaneous and truly natural interaction with a computer using only voice. For further information on the software applications please contact the author ([email protected]). Currently there are no research or commercial technologies that are capable of discriminating the same word or phrase with such accuracy while maintaining a practically 100% correct rejection rate. Furthermore, the authors are not aware of any work or technology that discriminates the same word based on its context using only acoustic features, as is done in WUW-SR.

Work to further enhance the discriminability of a WUW used in alerting context from other referential contexts using additional prosodic features is presently being conducted. Intuitive and empirical evidence suggests that alerting context tends to be characterized by the additional emphasis needed to get the desired attention. This investigation is leading to the development of prosody-based features that will capture this heightened emphasis. Initial results using energy-based features show promise. Using appropriate data is paramount to the accuracy of this study; an innovative procedure for collecting appropriate data from video DVDs has been developed.

Although the current system performs admirably, there is room for improvement in several areas. Making SVM classification independent of OOV data is one of the primary tasks; one candidate, a one-class SVM trained only on INV examples, is sketched below. Development of such a solution will enable various adaptation techniques to be deployed in real or near real time. The most important aspect of the presented solution is its generic nature and its applicability to any basic speech unit being modeled, such as phonemes (e.g., context independent, or context dependent such as tri-phones), which is necessary for large vocabulary continuous speech recognition (LVCSR) applications. If even a fraction of the accuracy reported here carries over to those models, as expected, this method has the potential to revolutionize speech recognition technology. In order to demonstrate its expected superiority, the WUW-SR technology will be added to an existing and publicly available recognition system such as HTK (http://htk.eng.cam.ac.uk/) or SPHINX (http://cmusphinx.sourceforge.net/html/cmusphinx.php).
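A minimal sketch of such a one-class SVM with scikit-learn; the score arrays and the ν parameter are placeholder assumptions, not results from this work:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Train only on INV triple-score vectors (9-D, as in the combined system).
    rng = np.random.default_rng(3)
    inv_scores = rng.normal(1.0, 0.5, (573, 9))  # placeholder INV training scores

    ocsvm = OneClassSVM(kernel="rbf", gamma=0.001, nu=0.01)
    ocsvm.fit(inv_scores)

    # +1 = accepted as in-vocabulary, -1 = rejected as out-of-vocabulary.
    test = rng.normal(-2.0, 0.5, (5, 9))         # placeholder OOV-like scores
    print(ocsvm.predict(test))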

Acknowledgements

The authors acknowledge the partial support of the AMALTHEA REU NSF Grant No. 0647120 and Grant No. 0647018, and the members of the AMALTHEA 2007 team, Alex Stills, Brandon Schmitt, and Tad Gertz, for their dedication to the project and their immense assistance with the experiments.

References

[1] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall PTR, 2001.
[2] Ron Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Batista Varile, Annie Zaenen, Antonio Zampolli, Victor Zue (Eds.), Survey of the State of the Art in Human Language Technology, Cambridge University Press and Giardini, 1997.
[3] V. Këpuska, Wake-Up-Word Application for First Responder Communication Enhancement, SPIE, Orlando, 2006.
[4] T. Klein, Triple scoring of hidden Markov models in Wake-Up-Word speech recognition, Thesis, Florida Institute of Technology.
[5] V. Këpuska, Dynamic time warping (DTW) using frequency distributed distance measures, US Patent 6983246, January 3, 2006.
[6] V. Këpuska, Scoring and rescoring dynamic time warping of speech, US Patent 7085717, April 1, 2006.
[7] V. Këpuska, T. Klein, On Wake-Up-Word speech recognition task, technology, and evaluation results against HTK and Microsoft SDK 5.1, invited paper, World Congress on Nonlinear Analysts, Orlando, 2008.
[8] V. Këpuska, D.S. Carstens, R. Wallace, Leading and trailing silence in Wake-Up-Word speech recognition, in: Proceedings of the International Conference: Industry, Engineering & Management Systems 2006, Cocoa Beach, FL, pp. 259–266.
[9] J.R. Rohlicek, W. Russell, S. Roukos, H. Gish, Continuous hidden Markov modeling for speaker-independent word spotting, vol. 1, 23–26 May 1989, pp. 627–630.
[10] C. Myers, L. Rabiner, A. Rosenberg, An investigation of the use of dynamic time warping for word spotting and connected speech recognition, in: ICASSP '80, vol. 5, April 1980, pp. 173–177.
[11] A. Garcia, H. Gish, Keyword spotting of arbitrary words using minimal speech resources, in: ICASSP 2006, vol. 1, 14–19 May 2006.
[12] D.A. James, S.J. Young, A fast lattice-based approach to vocabulary independent wordspotting, in: Proc. ICASSP '94, vol. 1, 1994, pp. 377–380.
[13] S.B. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. ASSP 28 (1980) 357–366.
[14] John Makhoul, Linear prediction: A tutorial review, Proc. IEEE 63 (4) (1975).
[15] Frederick Jelinek, Statistical Methods for Speech Recognition, The MIT Press, 1998.
[16] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989).
[17] John C. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Microsoft Research Technical Report, 1998.
[18] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 1998.
[19] Chih-Chung Chang, Chih-Jen Lin, LIBSVM: A library for support vector machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[20] Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[21] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, A practical guide to support vector classification, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2007.
[22] C.C. Broun, W.M. Campbell, Robust out-of-vocabulary rejection for low-complexity speaker-independent speech recognition, Proc. IEEE Acoustics Speech Signal Process. 3 (2000) 1811–1814.
[23] T. Hazen, I. Bazzi, A comparison and combination of methods for OOV word detection and word confidence scoring, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, Salt Lake City, Utah, May 2001.
[24] Cambridge University Engineering Department, HTK 3.0. [Online] http://htk.eng.cam.ac.uk.

