APPLICATIONS OF SPEECH RECOGNITION IN THE AREA OF TELECOMMUNICATIONS

Lawrence R. Rabiner, AT&T Labs, Florham Park, New Jersey 07932

Abstract - Advances in speech recognition technology over the past 4 decades have enabled a wide range of telecommunications and desktop services to become voice-enabled. Early applications were driven by the need to automate, and thereby reduce the cost of, attendant services, or by the need to create revenue-generating new services which were previously unavailable because of cost, or the inability to adequately provide such a service with the available work force. As we move towards the future we see a new generation of voice-enabled service offerings emerging, including intelligent agents, customer care wizards, call center automated attendants, voice access to universal directories and registries, unconstrained dictation capability, and finally unconstrained language translation capability. In this paper we review the current capabilities of speech recognition systems, show how they have been exploited in today's services and applications, and show how they will evolve over time to the next generation of voice-enabled services.

1. INTRODUCTION

Speech recognition technology has evolved for more than 40 years, spurred on by advances in signal processing, algorithms, architectures, and hardware. During that time it has gone from a laboratory curiosity, to an art, and eventually to a full-fledged technology that is practiced and understood by a wide range of engineers, scientists, linguists, psychologists, and systems designers. Over those 4 decades the technology of speech recognition has evolved, leading to a steady stream of increasingly more difficult tasks which have been tackled and solved. The hierarchy of speech recognition problems which have been attacked, and the resulting application tasks which became viable as a result, includes the following [1]:

• Isolated word recognition - both speaker trained and speaker independent. This technology opened up a class of applications called command-and-control applications in which the system was capable of recognizing a single word command (from a small vocabulary of single word commands), and appropriately responding to the recognized command. One key problem with this technology was the sensitivity to background noises (which were often recognized as spurious spoken words) and extraneous speech which was inadvertently spoken along with the command word. Various types of keyword spotting algorithms evolved to solve these types of problems. (A small template-matching sketch follows this list.)

• Connected word recognition - both speaker trained and speaker independent. This technology was built on top of word recognition technology, choosing to exploit the word models that were successful in isolated word recognition, and extend the modeling to recognize a concatenated sequence (a string) of such word models as a word string. This technology opened up a class of applications based on recognizing digit strings and alphanumeric strings, and led to a variety of systems for voice dialing, credit card authorization, directory assistance lookups, and catalog ordering.

• Continuous or fluent speech recognition - both speaker trained and speaker independent. This technology led to the first large vocabulary recognition systems which were used to access databases (the DARPA Resource Management Task), to do constrained dialogue access to information (the DARPA ATIS Task), to handle very large vocabulary read speech or dictation (the DARPA NAB Task), and eventually were used for desktop dictation systems for PC environments [2].

• Speech understanding systems (so-called unconstrained dialogue systems) which are capable of determining the underlying message embedded within the speech, rather than just recognizing the spoken words [3]. Such systems, which are only beginning to appear recently, enable services like customer care (the AT&T How May I Help You system), and intelligent agent systems which provide access to information sources by voice dialogues (the AT&T Maxwell Task).

• Spontaneous conversation systems which are able to both recognize the spoken material accurately and understand the meaning of the spoken material. Such systems, which are currently beyond the limits of the existing technology, will enable new services such as Conversation Summarization, Business Meeting Notes, Topic Spotting in fluent speech (e.g., from radio or TV broadcasts), and ultimately even language translation services between any pair of existing languages.
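To make the isolated word recognition case concrete, the sketch below shows one classical template-matching formulation, dynamic time warping (DTW) of whole-word feature templates, which was one of several techniques used in early command-and-control recognizers; the feature matrices and the stored vocabulary are illustrative placeholders, not taken from any specific system described here.

```python
import numpy as np

def dtw_distance(features, template):
    """Dynamic time warping distance between two feature sequences.

    features, template: arrays of shape (num_frames, num_coeffs), e.g.
    cepstral vectors for the input utterance and a stored word template.
    """
    n, m = len(features), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(features[i - 1] - template[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return cost[n, m]

def recognize_isolated_word(utterance, templates):
    """templates: dict mapping a vocabulary word to its stored feature template."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```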

1.1 Generic Speech Recognition System [4]

Figure 1 shows a block diagram of a typical integrated continuous speech recognition system. Interestingly enough, this generic block diagram can be made to work on virtually any speech recognition task that has been devised in the past 40 years, i.e., isolated word recognition, connected word recognition, continuous speech recognition, etc.

The feature analysis module provides the acoustic feature vectors used to characterize the spectral properties of the time varying speech signal. The word-level acoustic match module evaluates the similarity between the input feature vector sequence (corresponding to a portion of the input speech) and a set of acoustic word models for all words in the recognition task vocabulary to determine which words were most likely spoken. The sentence-level match module uses a language model (i.e., a model of syntax and semantics) to determine the most likely sequence of words. Syntactic and semantic rules can be specified, either manually, based on task constraints, or with statistical models such as word and class N-gram probabilities. Search and recognition decisions are made by considering all likely word sequences and choosing the one with the best matching score as the recognized sentence.
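The decision rule described above can be summarized in a few lines: each hypothesized word string is scored by adding its word-level acoustic log-likelihoods to the log-probability assigned by the language model, and the best-scoring string wins. The sketch below assumes the acoustic scores and a bigram log-probability function are supplied by the earlier modules; both are placeholders rather than any particular system's interfaces.

```python
def sentence_score(words, acoustic_logprobs, bigram_logprob, lm_weight=1.0):
    """Combine word-level acoustic scores with a bigram language model.

    words:             hypothesized word string, e.g. ["call", "home"]
    acoustic_logprobs: one log P(observations | word) per word, produced by
                       the word-level acoustic match module (placeholder here)
    bigram_logprob:    function (prev_word, word) -> log P(word | prev_word),
                       standing in for the sentence-level language model
    """
    score = sum(acoustic_logprobs)
    prev = "<s>"                     # sentence-start token
    for word in words:
        score += lm_weight * bigram_logprob(prev, word)
        prev = word
    return score

def best_sentence(hypotheses, bigram_logprob):
    """hypotheses: list of (words, acoustic_logprobs) pairs, e.g. an N-best list."""
    return max(hypotheses, key=lambda h: sentence_score(h[0], h[1], bigram_logprob))
```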

Figure 1: Block diagram of a typical integrated continuous speech recognizer (input speech, feature analysis, word-level acoustic match, sentence-level match, and recognized sentence).

Almost every aspect of the continuous speech recognizer of Figure 1 has been studied and optimized over the years. As a result, we have obtained a great deal of knowledge about how to design the feature analysis module, how to choose appropriate recognition units, how to populate the word lexicon, how to build acoustic word models, how to model language syntax and semantics, how to decode word matches against word models, how to efficiently determine a sentence match, and finally how to eventually choose the best recognized sentence. Among the things we have learned are the following:

• The best spectral features to use are LPC-based cepstral coefficients (either on a linear or a mel frequency scale) and their first and second order derivatives, along with log energies and their derivatives.

• The continuous density hidden Markov model (HMM) with state mixture densities is the best model for the statistical properties of the spectral features over time.

• The most robust set of speech units is a set of context dependent triphone units for modeling both intraword and interword linguistic phenomena.

• Although maximum likelihood training of unit models is effective for many speech vocabularies, the use of discriminative training methods (e.g., MMI training or Global Probabilistic Descent (GPD) methods) is more effective for most tasks.

• The most effective technique for making the unit models robust to varying speakers, microphones, backgrounds, and transmission environments is through the use of signal conditioning methods such as Cepstral Mean Subtraction (CMS) or some type of Signal Bias Removal (SBR); a minimal CMS sketch follows this list.

• The use of adaptive learning increases performance for new talkers, new backgrounds, and new transmission systems.

• The use of utterance verification provides improved rejection of improper speech or background sounds.

• HMMs can be made very efficient in terms of computation speed, memory size, and performance through the use of subspace and parameter tying methods.

• Efficient word and sentence matches can be obtained through the use of efficient beam searches, tree-trellis coding methods, and through proper determinization of the Finite State Network (FSN) that is being searched and decoded. Such procedures also lead to efficient methods for obtaining the N-best sentence matches to the spoken input.

• The ideas of concept spotting can be used to implement semantic constraints of a task in an automatically trainable manner.
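As a concrete illustration of the signal conditioning item above: cepstral mean subtraction simply removes the per-utterance mean of each cepstral coefficient, which cancels any fixed (convolutional) channel or microphone coloration. A minimal sketch, assuming features are stored as a frames-by-coefficients array:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: array of shape (num_frames, num_coeffs).

    A fixed channel or microphone acts multiplicatively on the spectrum and
    therefore additively in the cepstral domain, so subtracting the
    utterance-level mean of each coefficient removes that constant bias.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```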

1.2 Building Good Speech-Based Applications [5]

In addition to having good speech recognition technology, effective speech-based applications heavily depend on several factors, including:

• good user interfaces which make the application easy-to-use and robust to the kinds of confusion that arise in human-machine communications by voice.

• good models of dialogue that keep the conversation moving forward, even in periods of great uncertainty on the parts of either the user or the machine.

• matching the task to the technology.

We now expand somewhat on each of these factors.

User Interface Design - In order to make a speech interface as simple and as effective as Graphical User Interfaces (GUI), 3 key design principles should be followed as closely as possible, namely:

• provide a continuous representation of the objects and actions of interest.

• provide a mechanism for rapid, incremental, and reversible operations whose impact on the object of interest is immediately visible.

• use physical actions or labeled button presses instead of text commands, whenever possible.

For Speech Interfaces (SI), these GUI principles are preserved in the following user design principles:

• remind/teach users what can be said at any point in the interaction.

• maintain consistency across features using a vocabulary that is almost always available.

• design for error.

• provide the ability to barge-in over prompts.

• use implicit confirmation of voice input.


• rely on earcons to orient users as to where they are in an interaction with the machine.

• avoid information overload by aggregation or pre-selection of a subset of the material to be presented.

These user interface design principles are applied in different ways in the applications described later in this paper.

Dialogue Design Principles - For many interactions between a person and a machine, a dialogue is needed to establish a complete interaction with the machine. The ideal dialogue allows either the user or the machine to initiate queries, or to choose to respond to queries initiated by the other side. (Such systems are called mixed initiative systems.) A complete set of design principles for dialogue systems has not yet evolved (it is far too early yet). However, much as we have learned good speech interface design principles, many of the same or similar principles are evolving for dialogue management. The key principles that have evolved are the following:

• summarize actions to be taken, whenever possible.

• provide real-time, low delay responses from the machine and allow the user to barge-in at any time.

• orient users to their location in task space as often as possible.

• use flexible grammars to provide incrementality of the dialogue.

• whenever possible, customize and personalize the dialogue (novice/expert modes).

In addition to these design principles, an objective performance measure is needed that combines task-based success measures (e.g., information elements that are correctly obtained) and a variety of dialogue-based cost measures (e.g., number of error correction turns, time to task completion, success rate, etc.). Such a performance measure for dialogues does not yet exist but is under investigation.
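One way such a combined measure has been formulated (for example, in the PARADISE framework proposed by Walker and colleagues around this time) is as task success minus a weighted sum of dialogue costs. The sketch below is a simplified illustration of that idea; the cost names and weights are placeholders rather than a standard definition.

```python
def dialogue_performance(task_success, costs, success_weight=1.0, cost_weights=None):
    """Toy combined dialogue measure: reward task success, penalize costs.

    task_success: e.g. the fraction of information elements correctly obtained.
    costs:        dict of dialogue cost measures, e.g.
                  {"error_correction_turns": 3, "elapsed_seconds": 95.0}
                  (names are illustrative only).
    """
    cost_weights = cost_weights or {}
    penalty = sum(cost_weights.get(name, 1.0) * value for name, value in costs.items())
    return success_weight * task_success - penalty
```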

Match Task to the Technology - It is essential that any application of speech recognition be realistic about the capabilities of the technology, and build in failure correction modes. Hence building a credit card recognition application before digit error rates fell below 0.5% per digit is a formula for failure, since for a 16-digit credit card, the string error rate will be at the 10% level or higher (a worked sketch of this calculation follows the list below), thereby frustrating customers who speak clearly and distinctly, and making the system totally unusable for customers who slur their speech or otherwise make it difficult to understand their spoken inputs. Utilizing this principle, the following successful applications have been built:

• telecommunications: command-and-control, agents, call center automation, customer care, voice calling.

• office/desktop: voice navigation of the desktop, voice browser for the Internet, voice dialer, dictation.

• manufacturing/business: package sorting, data entry, form filling.

• medical/legal: creation of stylized reports.


• games/aids-to-the-handicapped: voice control of selective features of the game, the wheel chair, the environment (climate control).
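The credit-card arithmetic referred to above is easy to check: if per-digit errors are treated as independent, a per-digit error rate p gives a 16-digit string error rate of 1 - (1 - p)^16, which is roughly 7.7% at p = 0.5% and above 10% once p exceeds about 0.7%. A minimal sketch (the independence assumption and the sample rates are illustrative):

```python
def string_error_rate(per_digit_error, num_digits=16):
    # A string is misrecognized if any of its digits is misrecognized,
    # assuming independent per-digit errors.
    return 1.0 - (1.0 - per_digit_error) ** num_digits

for p in (0.005, 0.007, 0.01):
    print(f"per-digit {p:.1%} -> 16-digit string error {string_error_rate(p):.1%}")
# per-digit 0.5% -> about 7.7%; 0.7% -> about 10.6%; 1.0% -> about 14.9%
```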

1.3 Current Capabilities of Speech Recognizers

Table 1 provides a summary of the performance of modern speech recognition and natural language understanding systems. Shown in the table are the Task or Corpus, the Type of speech input, the Vocabulary Size, and the resulting Word Error Rate. It can be seen that the technology is more than suitable for connected digit recognition tasks, for simple data retrieval tasks (like the Airline Travel Information System), and, with a well-designed user interface, can even be used for dictation like the Wall Street Journal task. However, the word error rates rapidly become prohibitive for tasks like recognizing speech from a radio broadcast (with all of the cross-announcer banter, commercials, etc.), from listening in on conversational telephone calls off a switchboard, or even in the case of family members calling each other over a switched telephone line.

Table 1: Word Error Rates for Speech Recognition and Natural Language Understanding Tasks (Courtesy: John Makhoul, BBN)
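The word error rates reported in Table 1 are conventionally computed as the word-level edit distance (substitutions plus deletions plus insertions) between the recognized string and the reference transcription, divided by the number of reference words. A minimal dynamic-programming sketch of that computation:

```python
def word_error_rate(reference, hypothesis):
    """reference, hypothesis: lists of words; returns (S + D + I) / len(reference).

    Assumes a non-empty reference transcription.
    """
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # i deletions
    for j in range(m + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m] / n
```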

1.4 Instantiations of Speech Recognition Technology

Speech recognition technology used to be available only on special purpose boards with special purpose DSP chips or ASICs. Today high quality speech recognition technology packages are available in the form of inexpensive software-only desktop packages (IBM ViaVoice, Dragon NaturallySpeaking, Kurzweil, etc.); technology engines that run on either the desktop or a workstation and are often embedded in third party vendor applications, such as the BBN Hark System, the SRI Nuance System, the AT&T Watson System, and the Altech System; and finally they are also available as proprietary engines running on commercially available speech processing boards such as the Lucent Speech Processing System (LSPS), the TI board, the Nortel board, etc.

2. The Telecommunications Need for Speech Recognition [6]

The telecommunications network is evolving as the traditional POTS (Plain Old Telephony Services) network comes together with the dynamically evolving Packet network, in a structure which we believe will look something like the one shown in Figure 2.

Figure 2: The telecommunications network of tomorrow.

Intelligence in this evolving network is distributed at the desktop (the local intelligence), at the terminal device (the telephone, screen phone, PC, etc.), and in the network. In order to provide universal services, there need to be interfaces which operate effectively for all terminal devices. Since the most ubiquitous terminal device is still the ordinary telephone handset, the evolving network must rely on the availability of speech interfaces to all services. Hence the growing need for speech recognition for command-and-control applications, and natural language understanding for maintaining dialogue with the machine.

3. Telecommunication Applications of Speech Recognition [7]

Speech recognition was introduced into the telecommunications network in the early 1990s for two reasons, namely to reduce costs via automation of attendant functions, and to provide new revenue generating services that were previously impractical because of the associated costs of using attendants.


Examples of telecommunications services which were created to achieve cost reduction include the following:

• Automation of Operator Services. Systems like the Voice Recognition Call Processing (VRCP) system introduced by AT&T or the Automated Alternate Billing System (AABS) introduced by Nortel enabled operator functions to be handled by speech recognition systems. The VRCP system handled so-called operator assisted calls such as Collect, Third Party Billing, Person-to-Person, Operator Assisted Calling, and Calling Card calls. The AABS system automated the acceptance (or rejection) of billing charges for reverse calls by recognizing simple variants of the two word vocabulary Yes and No.

• Automation of Directory Assistance. Systems were created for assisting operators with the task of determining telephone numbers in response to customer queries by voice. Both NYNEX and Nortel introduced systems that did front end city name recognition so as to reduce the operator search space for the desired listing, and several experimental systems were created to complete the directory assistance task by attempting to recognize individual names in a directory of as many as 1 million names. Such systems are not yet practical (because of the confusability among names) but for small directories, such systems have been widely used (e.g., in corporate environments).

• Voice Dialing. Systems have been created for voice dialing by name (so-called alias dialing such as Call Home, Call Office) from AT&T, NYNEX, and Bell Atlantic, and by number (AT&T SDN/NRA) to enable customers to complete calls without having to push buttons associated with the telephone number being called.

Examples of telecommunications services which were created to generate new revenue include the following:

• Voice Banking Services. A system for providing access to customer accounts, account balances, customer transactions, etc. was first created in Japan by NTT (the ANSER System) more than 10 years ago in order to provide a service that was previously unavailable. Equivalent services have been introduced in banks worldwide over the last several years.

• Voice Prompter. A system for providing voice replacement of touch-tone input for so-called Interactive Voice Response (IVR) systems was introduced by AT&T in the early 1990s (initially in Spain because of the lack of touch-tone phones in that country). This system initially enabled the customer to speak the touch-tone position (i.e., speak or press the digit one); over time systems have evolved so that customers can speak the service associated with the touch-tone position (e.g., say reservations or push the 1-key, say schedule or push the 2-key, etc.).

• Directory Assistance Call Completion. This system was introduced by both AT&T and NYNEX to handle completion of calls made via requests for Directory Assistance. Since Directory Assistance numbers are provided by an independent system, using Text-to-Speech synthesis to speak out the listing, speech recognition can be used to reliably recognize the listing and dial the associated number. This highly unusual use of a speech recognizer to interface with a speech synthesizer is one of the unusual outgrowths of the fractionation of the telephone network into local and long distance carriers in the United States.

• Reverse Directory Assistance. This system was created by NYNEX, Bellcore, and Ameritech to provide name and address information associated with a spoken telephone number.

• Information Services. These types of systems enable customers to access information lines to retrieve information about scores of sporting events, traffic reports, weather reports, theatre bookings, restaurant reservations, etc.

As we move to the future, the intelligent network of Figure 2, along with advances in speech recognition technology, will support a new range of services of the following types:

• Agent Technology. Systems like Wildfire and Maxwell (AT&T) enable customers to interact with intelligent agents via voice dialogues in order to manage calls (both in-coming and out-going calls), manage messages (both voice and email), get information from the Web (e.g., movie reviews, calling directories), customize services (e.g., first thing each morning the agent provides the traffic and weather reports), personalize services (via the agent personality, speed, helpfulness), and adapt to user preferences (e.g., learn how the user likes to do things and react appropriately).

• Customer Care. The goal of customer care systems is to replace Interactive Voice Response systems with a dialogue type of interaction to make it easier for the user to get the desired help without having to navigate complicated menus or understand the terminology of the place being called for help. The How May I Help You (HMIHY) customer care system of AT&T is an excellent example of this type of system.

• Computer-Telephony Integration. Since the telecommunications network of the future will integrate the telephony (POTS) and computer (Packet) networks, a range of new applications will arise which exploit this integration more fully. One prime example is registry services, where the network locates the user and determines the most appropriate way to communicate with them. Another example is providing a user cache of the most frequently accessed people in order to provide a rapid access mechanism for these frequently called numbers.

• Voice Dictation. Although the desktop already supports voice dictation of documents, a prime telecommunications application of speech recognition would be for generating voice responses to email queries so that the resulting message becomes an email message back to the sender (rather than a voice mail response to an email message).


4. Summary

The world of telecommunications is rapidly changing and evolving. The world of speech recognition is rapidly changing and evolving. Early applications of the technology have achieved varying degrees of success. The promise for the future is significantly higher performance for almost every speech recognition technology area, with more robustness to speakers, background noises, etc. This will ultimately lead to reliable, robust voice interfaces to every telecommunications service that is offered, thereby making them universally available.

References

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.

[2] J. Makhoul and R. Schwartz, "State of the Art in Continuous Speech Recognition," in Voice Communications Between Humans and Machines, D. Roe and J. Wilpon, Eds., pp. 165-198, 1994.

[3] R. Pieraccini and E. Levin, "Stochastic Representation of Semantic Structure for Speech Understanding," Speech Communication, Vol. 11, pp. 283-288, 1992.

[4] L. R. Rabiner, B. H. Juang, and C. H. Lee, "An Overview of Automatic Speech Recognition," in Automatic Speech and Speaker Recognition, C. H. Lee, F. K. Soong, and K. K. Paliwal, Eds., pp. 1-30, 1996.

[5] C. A. Kamm, M. Walker, and L. R. Rabiner, "The Role of Speech Processing in Human-Computer Intelligent Communication," Proc. CI Workshop, Washington, DC, pp. 169-190, Feb. 1997.

[6] R. V. Cox, B. G. Haskell, Y. LeCun, B. Shahraray, and L. R. Rabiner, "On the Applications of Multimedia Processing to Communications," submitted to Proc. IEEE.

[7] L. R. Rabiner, "Applications of Voice Processing to Telecommunications," Proc. IEEE, Vol. 82, No. 4, pp. 199-228, Feb. 1994.
