VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES
Roberto Pieraccini, CTO, Tell-Eureka Corporation
535 West 34th Street
New York, NY 10001
+1 646 792 2744
http://www.telleureka.com
The vision
Recreating the Speech Chain
[Figure: the speech chain. Levels: DIALOG, SEMANTICS, SYNTAX, LEXICON, MORPHOLOGY, PHONETICS, VOCAL-TRACT ARTICULATORS, INNER EAR, ACOUSTIC NERVE. System components: SPEECH RECOGNITION, SPOKEN LANGUAGE UNDERSTANDING, DIALOG MANAGEMENT, SPEECH SYNTHESIS]
The technology
Talking Machines: First Steps into Spoken Language Technology
Joseph Faber (1835)
Von Kempelen (1791)
Homer Dudley, Bell Labs (1939)
Speech Recognition: the Early Years
1952 – Automatic Digit Recognition (AUDREY): Davis, Biddulph, Balashek (Bell Laboratories)
1960s – Speech Processing and Digital Computers
AD/DA converters and digital computers start appearing in the labs
James Flanagan, Bell Laboratories
The Illusion of Segmentation... or...
Why Speech Recognition is so Difficult
[Figure: one utterance shown at four levels]
Phonetic: m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r
Words: MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR
Syntax: NP NP VP
Meaning: (user:Roberto (attribute:telephone-num value:7360474))
[Figure annotations: the labels "errors" and "rules" each appear four times in the figure]
Intra-speaker variability
Noise/reverberation
Coarticulation
Context-dependency
Word confusability
Word variations
Speaker Dependency
Multiple Interpretations
Limited vocabulary
Ellipses and Anaphors
1969 – Whither Speech Recognition?
[…] General-purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish.
[…] It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour.
[…] Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).
J. R. Pierce, Executive Director, Bell Laboratories
The Journal of the Acoustical Society of America, June 1969
1971-1976: The ARPA SUR project
In spite of the anti-speech-recognition campaign headed by the Pierce Commission, ARPA launches a 5-year program on Speech Understanding Research (SUR)
REQUIREMENTS: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
4 systems built by the end of the program:
SDC (24%)
BBN’s HWIM (44%)
CMU’s Hearsay II (74%)
CMU’s HARPY (95% -- 80 times real time!)
HARPY was based on an engineering approach: search on a network representing all the possible utterances
Lack of a scientific evaluation approach
Speech Understanding: too early for its time
The project was not extended.
Raj Reddy -- CMU
LESSONS LEARNED:
Hand-built knowledge does not scale up
Need of a global “optimization” criterion
Vintage Speech Recognition
1970s – Dynamic Time Warping
The Brute Force of the Engineering Approach
[Figure: alignment grid with axes TEMPLATE (WORD 7) and UNKNOWN WORD]
T.K. Vyntsyuk (1969)
H. Sakoe, S. Chiba (1970)
Isolated Words, Speaker Dependent
Connected Words, Speaker Independent
Sub-Word Units
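As a rough illustration of the template-matching idea behind dynamic time warping, the sketch below aligns an unknown word against stored word templates and picks the closest one. It is a minimal version under stated assumptions: features are single numbers rather than spectral vectors, the local distance is an absolute difference, and the names `dtw_distance`, `recognize`, and the toy templates are hypothetical, not taken from the cited systems.

```python
import math

def dtw_distance(template, unknown):
    """Cumulative cost of the best time alignment between two sequences."""
    n, m = len(template), len(unknown)
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - unknown[j - 1])
            # Allowed moves: stretch the template, stretch the unknown, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(templates, unknown):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_distance(templates[word], unknown))

# Toy one-dimensional "feature tracks" (illustrative values only).
templates = {"seven": [1, 3, 4, 3, 1], "nine": [5, 5, 2, 2]}
unknown = [1, 1, 3, 4, 4, 3, 1]  # "seven" spoken more slowly
print(recognize(templates, unknown))
```

Every template is compared against the unknown word in full, which is why the approach is speaker dependent and scales poorly with vocabulary size.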
1980s -- The Statistical Approach
Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J. Watson Research
Foundations of modern speech recognition engines
$$\hat{W} = \arg\max_{W} P(A \mid W)\, P(W)$$
Fred Jelinek
[Figure: Acoustic HMMs: states S1, S2, S3 with transition probabilities a11, a12, a22, a23, a33. Word Tri-grams: $P(w_t \mid w_{t-1}, w_{t-2})$]
No Data Like More Data
Whenever I fire a linguist, our system performance improves (1988)
Some of my best friends are linguists (2004)
Jim Baker
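The decision rule of the statistical approach, choosing the word sequence W that maximizes P(A|W) P(W), can be sketched with a toy example. This is only an illustration: the acoustic log-likelihoods, the trigram table, the probability floor, and the names `lm_logprob` and `decode` are all hypothetical, and a real engine searches a large hypothesis space instead of scoring two candidates.

```python
import math

# Hypothetical trigram probabilities P(w_t | w_{t-2}, w_{t-1}); "<s>" pads the start.
TRIGRAMS = {
    ("<s>", "<s>", "recognize"): 0.01,
    ("<s>", "recognize", "speech"): 0.5,
    ("<s>", "<s>", "wreck"): 0.001,
    ("<s>", "wreck", "a"): 0.2,
    ("wreck", "a", "nice"): 0.1,
    ("a", "nice", "beach"): 0.05,
}
FLOOR = 1e-6  # assumed probability for unseen trigrams

def lm_logprob(words):
    """log P(W) as a sum of trigram log-probabilities."""
    history = ("<s>", "<s>")
    total = 0.0
    for w in words:
        total += math.log(TRIGRAMS.get((history[0], history[1], w), FLOOR))
        history = (history[1], w)
    return total

def decode(acoustic_loglik):
    """Pick W maximizing log P(A|W) + log P(W) among the given hypotheses."""
    return max(acoustic_loglik, key=lambda W: acoustic_loglik[W] + lm_logprob(W))

# Made-up acoustic scores log P(A|W): the acoustics slightly prefer the wrong string.
hypotheses = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.0,
}
best = decode(hypotheses)
print(" ".join(best))
```

In this toy case the language model term outweighs the small acoustic advantage of the competing hypothesis, which is the point of combining the two models in one criterion.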
1980-1990 – The statistical approach becomes ubiquitous
Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
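For readers unfamiliar with HMMs, the following sketch computes the likelihood of an observation sequence with the forward algorithm on a three-state left-to-right model like the one drawn earlier (transitions a11, a12, a22, a23, a33). The numeric values, the two-symbol alphabet, and the names `TRANS`, `EMIT`, and `forward_likelihood` are assumptions made for illustration; real acoustic models use continuous feature vectors.

```python
# Transition matrix of a left-to-right HMM: only self-loops and moves to the next state.
TRANS = [
    [0.6, 0.4, 0.0],  # a11, a12
    [0.0, 0.7, 0.3],  # a22, a23
    [0.0, 0.0, 1.0],  # a33
]
# Discrete emission probabilities per state (illustrative values).
EMIT = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.2, "b": 0.8},
    {"a": 0.5, "b": 0.5},
]

def forward_likelihood(obs, trans, emit, start_state=0):
    """P(observations | model), summed over all state paths, starting in start_state."""
    n = len(trans)
    alpha = [0.0] * n
    alpha[start_state] = emit[start_state][obs[0]]
    for symbol in obs[1:]:
        # alpha_j(t) = (sum_i alpha_i(t-1) * a_ij) * b_j(symbol)
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)

print(forward_likelihood(["a", "b", "b"], TRANS, EMIT))
```

This version sums over all end states; a word model would typically require the path to finish in the last state.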
1980s-1990s – The Power of Evaluation
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
[Figure: timeline of the SPOKEN DIALOG INDUSTRY, 1995 to 2005. Labels: MIT, SRI; SPEECHWORKS, NUANCE; TECHNOLOGY VENDORS; PLATFORM INTEGRATORS; APPLICATION DEVELOPERS; HOSTING; TOOLS; STANDARDS]
The business of speech
Voice User Interface (VUI) Design—the Quantum Leap in Dialog Systems
1995 -- The WildFire Effect
Change of perspective: From technology driven to user centered
Research: natural language, free form
Commercial: task completion and usability
Persona: the personality of the application (TTS vs. Recording)
Speech recognition accuracy is important, but success is determined by the VUI.
The importance of a repeatable, streamlined, teachable, development process
The Speech Application Lifecycle
[Figure: ten numbered lifecycle stages]
Stages: requirements, VUI design, usability, VUI development, speech science, high level system design, system engineering, integration, partial deployment, full deployment
Roles: Analyst and VUI Designer; Speech Scientist and VUI Designer; Architect and App Developer; Engineer; Project Manager
[Figure: call flow for a funds transfer]
Get Origin Account
Get Destination Account
Get Amount
amount > origin account? YES: Play Wrong Amount Message. NO: Play Confirmation
confirmed? YES: Enter Transfer, Go to Main Menu. NO: What is wrong? (amount, destination account, origin account)
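One way to read the call flow above is as a small state machine. The sketch below is an interpretation, not the original design: it assumes that the wrong-amount message leads back to Get Amount, and that the answer to "What is wrong?" sends the caller back to the step collecting that item. The function `transfer_dialog`, the slot names, and the scripted inputs are hypothetical.

```python
SLOT_NODES = {
    "origin": "Get Origin Account",
    "destination": "Get Destination Account",
    "amount": "Get Amount",
}

def transfer_dialog(user_inputs, balances):
    """Run the flow on scripted user inputs; return (filled slots, visited nodes)."""
    inputs = iter(user_inputs)
    slots, trace = {}, []
    while True:
        # Ask for the first slot that is still empty, in call-flow order.
        missing = [s for s in SLOT_NODES if s not in slots]
        if missing:
            slot = missing[0]
            trace.append(SLOT_NODES[slot])
            slots[slot] = next(inputs)
            continue
        # Decision node: amount > balance of the origin account?
        if slots["amount"] > balances[slots["origin"]]:
            trace.append("Play Wrong Amount Message")
            del slots["amount"]  # assumption: the caller is asked for the amount again
            continue
        trace.append("Play Confirmation")
        if next(inputs) == "yes":
            trace += ["Enter Transfer", "Go to Main Menu"]
            return slots, trace
        # Not confirmed: ask which item is wrong and collect it again.
        trace.append("What is wrong?")
        del slots[next(inputs)]

script = ["checking", "savings", 500, 100, "no", "amount", 50, "yes"]
slots, trace = transfer_dialog(script, {"checking": 200})
print(trace)
```

Keeping the flow as data plus a small loop makes it easy to replay scripted conversations during usability review.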
Voice User Interface Design
Get Amount Interaction Module
PROMPTS
Type | Wording | Source
Initial | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 1 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 2 | Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_R_2_1.wav
Timeout 1 | I'm sorry, I didn't hear you. | get_amount_T_1_1.wav
 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
Timeout 2 | I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_T_2_1.wav
Help | Please say how much do you wish to transfer. You can say the amount in dollars and cents, like, for instance, one hundred dollars and fifty cents. | get_amount_H.wav
ACTIONS
CONDITION | ACTION
if amount greater than amount in <origin-account> | Go to "Play Wrong Amount Message"
else | Go to "Play Confirmation"
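The prompt table and action rule of the Get Amount module could be encoded roughly as follows. This is a sketch under assumptions: the `PROMPTS` structure, the `TTS:` prefix used to mark synthesized segments, and the function names `playlist` and `next_node` are hypothetical; only the file names, segment order, and the condition come from the specification above.

```python
# Segment lists per prompt type: ("wav", file) is a recording, ("tts", slot) is synthesized.
INITIAL = [
    ("wav", "get_amount_I_1.wav"), ("tts", "origin"),
    ("wav", "get_amount_I_2.wav"), ("tts", "destination"),
    ("wav", "get_amount_I_3.wav"),
]
PROMPTS = {
    "initial": INITIAL,
    "retry1": INITIAL,
    "retry2": [("wav", "get_amount_R_2_1.wav")],
    "timeout1": [("wav", "get_amount_T_1_1.wav")] + INITIAL[:4],
    "timeout2": [("wav", "get_amount_T_2_1.wav")],
    "help": [("wav", "get_amount_H.wav")],
}

def playlist(prompt_type, origin, destination):
    """Expand a prompt type into the ordered list of items to play."""
    values = {"origin": origin, "destination": destination}
    return [
        item if kind == "wav" else "TTS:" + values[item]
        for kind, item in PROMPTS[prompt_type]
    ]

def next_node(amount, origin_balance):
    """Action rule of the module: compare the amount with the origin account balance."""
    if amount > origin_balance:
        return "Play Wrong Amount Message"
    return "Play Confirmation"

print(playlist("timeout1", "checking", "savings"))
print(next_node(500, 200))
```

Separating the prompt table from the playback logic lets a VUI designer change wording or recordings without touching application code.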
Speech Science: Tuning for performance
Recognition
In Vocabulary
  Correctly Recognize
    Accept: Correct acceptance
    Confirm: Correct confirmation
  Mis-recognize
    Accept: False acceptance - in
    Confirm: False confirmation
  Falsely Reject: False rejection
Out of Vocabulary
  Falsely Accept: False acceptance - out
  Correctly Reject: Correct rejection
Speech Science: Tuning for performance
DM                    utt#  sub-err%  fa-err%  fr-err%  rej%   OOV%   fa-oov%
WaitPowerBothUp-2     17    5.88      0        0        5.88   5.88   0
WaitHowMuchSnow       17    5.88      11.76    5.88     23.53  29.41  40
MissingOneChannel     22    4.55      0        0        9.09   9.09   0
WPAllChannels         23    4.35      0        4.35     8.7    4.35   0
PictureBack           27    3.7       3.7      3.7      7.41   7.41   50
WaitFindInputSource   29    3.45      0        0        13.79  13.79  0
PictureProb           33    3.03      12.12    0        0      12.12  100
Utt# = Number of utterances
Sub-err% = percent of in-voc utterances wrongly recognized
Fa-err% = percent of utterances wrongly accepted
Fr-err% = percent of utterances wrongly rejected
Rej% = total percent of all utterances rejected
OOV% = percent of out-voc utterances
Fa-oov% = percent of out-voc utterances wrongly accepted
ACTION
- Prioritize grammars that need improvement
- Use transcriptions to improve grammars
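The statistics in the table can be computed from transcribed utterances along these lines. The sketch makes assumptions that should be checked against the actual tuning tools: the record fields (`in_vocab`, `accepted`, `correct`) and the function `tuning_stats` are hypothetical, and the denominators (all utterances for every column except fa-oov%, which divides by out-of-vocabulary utterances) are inferred from the figures in the table, not stated in the legend.

```python
def tuning_stats(utterances):
    """Compute per-module tuning percentages from annotated utterances."""
    n = len(utterances)
    n_oov = sum(not u["in_vocab"] for u in utterances)
    sub = sum(u["in_vocab"] and u["accepted"] and not u["correct"] for u in utterances)
    fa = sum(not u["in_vocab"] and u["accepted"] for u in utterances)
    fr = sum(u["in_vocab"] and not u["accepted"] for u in utterances)
    rej = sum(not u["accepted"] for u in utterances)

    def pct(count, total):
        return round(100.0 * count / total, 2) if total else 0.0

    return {
        "utt#": n,
        "sub-err%": pct(sub, n),
        "fa-err%": pct(fa, n),
        "fr-err%": pct(fr, n),
        "rej%": pct(rej, n),
        "OOV%": pct(n_oov, n),
        "fa-oov%": pct(fa, n_oov),
    }

def utt(in_vocab, accepted, correct):
    return {"in_vocab": in_vocab, "accepted": accepted, "correct": correct}

# A made-up set of 17 utterances chosen to match the WaitHowMuchSnow row.
sample = (
    [utt(True, True, True)] * 10      # correctly recognized
    + [utt(True, True, False)]        # substitution error
    + [utt(True, False, False)]       # false rejection
    + [utt(False, True, False)] * 2   # out-of-vocabulary, falsely accepted
    + [utt(False, False, False)] * 3  # out-of-vocabulary, correctly rejected
)
stats = tuning_stats(sample)
print(stats)
```

Sorting modules by these percentages is one simple way to prioritize which grammars to improve first.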
The Architectural Evolution of Spoken Dialog
[Figure: timeline 1994, 1998, 2000, 2005. Stages: Native Code; Proprietary IVR Systems; Standard Clients (VoiceXML); Standard Application Servers. Base layer: Telephony Platform]
The Voice Web
[Figure: Telephone, Voice Browser, ASR, TTS, Internet, Web Server. Standards shown: VoiceXML/SALT, MRCP, SSML, SRGF, CCXML, EMMA?, SCXML?]
The Evolution of the Interface and the Research-Industry Chasm
[Figure: timeline 1994 to 2006, with a scale from Directed Dialog to Natural Language]
Research Systems a-la DARPA Communicator
Small Vocabulary Menu Based
Large Vocabulary, Dialog Modules
SLU: Statistical Language Understanding
Spoken dialog as an anthropomorphic system
Spoken dialog as a tool
The evolution of the market and the industry
TECHNOLOGY VENDORS: SPEECH RECOGNITION, TTS
PLATFORM INTEGRATORS: IVR, VoiceXML, CTI,…
TOOLS: AUTHORING, TUNING, PREPACKAGED APPLICATIONS
APPLICATION DEVELOPERS: PROFESSIONAL SERVICES
HOSTING
600 to 1,000 M$ revenue
> 8000 apps worldwide
New evolving standards guarantee interoperability of engines and platforms.
Third generation dialog systems
[Figure: applications placed by COMPLEXITY (LOW, MEDIUM, HIGH): FLIGHT STATUS, STOCK TRADING, PACKAGE TRACKING, FLIGHT/TRAIN RESERVATION, BANKING, CUSTOMER CARE, TECHNICAL SUPPORT]
1st Generation: INFORMATIONAL
2nd Generation: TRANSACTIONAL
3rd Generation: PROBLEM SOLVING
2005 -- Spoken Dialog goes to Saturday Night Live