VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES

Roberto Pieraccini, CTO, Tell-Eureka Corporation

535 West 34th Street, New York, NY 10001

+1 646 792 2744

[email protected]

http://www.telleureka.com/

The vision

Recreating the Speech Chain

Human speech chain: DIALOG, SEMANTICS, SYNTAX, LEXICON, MORPHOLOGY, PHONETICS, VOCAL-TRACT ARTICULATORS, INNER EAR, ACOUSTIC NERVE

Machine counterparts: SPEECH RECOGNITION, SPOKEN LANGUAGE UNDERSTANDING, DIALOG MANAGEMENT, SPEECH SYNTHESIS

The technology

Talking Machines: First Steps into Spoken Language Technology

Von Kempelen (1791)

Joseph Faber (1835)

Homer Dudley, Bell Labs (1939)

Speech Recognition: the Early Years

1952 – Automatic Digit Recognition (AUDREY)

Davis, Biddulph, Balashek (Bell Laboratories)

1960’s – Speech Processing and Digital Computers

AD/DA converters and digital computers start appearing in the labs

James Flanagan, Bell Laboratories

The Illusion of Segmentation... or...

Why Speech Recognition is so Difficult

Phonemes: m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r

Words: MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR

Syntax: NP, NP, VP

Meaning: (user:Roberto (attribute:telephone-num value:7360474))

errors

- Intra-speaker variability

- Noise/reverberation

- Coarticulation

- Context-dependency

- Word confusability

- Word variations

- Speaker Dependency

- Multiple Interpretations

- Limited vocabulary

- Ellipses and Anaphors

rules

1969 – Whither Speech Recognition?

[…] General purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish.

[…] It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour.

[…] Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).

The Journal of the Acoustical Society of America, June 1969

J. R. Pierce, Executive Director, Bell Laboratories

1971-1976: The ARPA SUR project

In spite of the anti-speech recognition campaign headed by the Pierce Commission, ARPA launches a 5-year program on Speech Understanding Research (SUR)

REQUIREMENTS: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine

4 systems built by the end of the program:

- SDC (24%)

- BBN’s HWIM (44%)

- CMU’s Hearsay II (74%)

- CMU’s HARPY (95% -- 80 times real time!)

HARPY was based on an engineering approach: search on a network representing all the possible utterances

Lack of a scientific evaluation approach

Speech Understanding: too early for its time

The project was not extended.

Raj Reddy -- CMU

LESSON LEARNED:

- Hand-built knowledge does not scale up

- Need for a global “optimization” criterion

Vintage Speech Recognition

1970’s – Dynamic Time Warping

The Brute Force of the Engineering Approach

Alignment grid: TEMPLATE (WORD 7) on one axis, UNKNOWN WORD on the other

T.K. Vyntsyuk (1969)

H. Sakoe, S. Chiba (1970)

Isolated Words, Speaker Dependent

Connected Words, Speaker Independent

Sub-Word Units
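The template matching behind dynamic time warping can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the formulation of the papers cited above: feature sequences are plain lists of numbers, the local distance is an absolute difference, and the names dtw_distance and recognize are invented for this example.

```python
from math import inf


def dtw_distance(template, unknown):
    """Cumulative cost of the best time alignment between two sequences."""
    n, m = len(template), len(unknown)
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - unknown[j - 1])
            # Allowed moves: stretch the template, stretch the input, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(templates, unknown):
    """Return the label of the stored template closest to the unknown word."""
    return min(templates, key=lambda label: dtw_distance(templates[label], unknown))


# Toy one-dimensional "feature" contours (made-up values).
templates = {
    "seven": [1.0, 3.0, 4.0, 3.0, 1.0],
    "four": [2.0, 2.0, 5.0, 5.0],
}
# Same contour as "seven", spoken more slowly (some frames repeated).
unknown = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(recognize(templates, unknown))
```

Every stored template is compared against the whole input, which is why the approach suits isolated words from a single speaker and becomes expensive as the vocabulary grows.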

1980s -- The Statistical Approach

Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s

Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J. Watson Research

Foundations of modern speech recognition engines

$\hat{W} = \arg\max_{W} P(A \mid W)\, P(W)$

Fred Jelinek

Acoustic HMMs: states S1, S2, S3 with transition probabilities a11, a12, a22, a23, a33

Word Tri-grams: $P(w_t \mid w_{t-1}, w_{t-2})$

No Data Like More Data

Whenever I fire a linguist, our system performance improves (1988)

Some of my best friends are linguists (2004)

Jim Baker
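The decoding rule above, choosing the word sequence W that maximizes P(A|W) P(W), can be sketched with a toy trigram language model. Everything here is an assumption made for illustration: the three-sentence corpus, the add-one smoothing, the made-up acoustic log-likelihoods, and the function names. A real recognizer searches over HMM state sequences instead of rescoring a short list of hypotheses.

```python
from collections import Counter
from math import log


def train_trigrams(sentences):
    """Count trigrams and their two-word histories over padded sentences."""
    tri, hist, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            hist[tuple(padded[i - 2:i])] += 1
    return tri, hist, len(vocab)


def log_p_words(words, model):
    """log P(W) under an add-one smoothed trigram model."""
    tri, hist, vocab_size = model
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        h = tuple(padded[i - 2:i])
        total += log((tri[h + (padded[i],)] + 1) / (hist[h] + vocab_size))
    return total


def decode(hypotheses, model):
    """Pick the W maximizing log P(A|W) + log P(W)."""
    return max(hypotheses, key=lambda h: h[1] + log_p_words(h[0], model))[0]


corpus = [
    "my number is seven three nine".split(),
    "my number is four two seven".split(),
    "my number is zero".split(),
]
model = train_trigrams(corpus)
# (word sequence, acoustic log-likelihood log P(A|W)); the scores are made up.
hypotheses = [
    ("my number is seven".split(), -10.0),
    ("mine umber is seven".split(), -9.5),
]
print(" ".join(decode(hypotheses, model)))
```

The second hypothesis has the slightly better acoustic score, but the language model has never seen its word sequence, so the combined score favors the first.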

1980-1990 – The statistical approach becomes ubiquitous

Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

1980s-1990s – The Power of Evaluation

Pros and Cons of DARPA programs

+ Continuous incremental improvement

- Loss of “bio-diversity”

SPOKEN DIALOG INDUSTRY (timeline, 1995 to 2005)

MIT, SRI

SPEECHWORKS, NUANCE

TECHNOLOGY VENDORS

PLATFORM INTEGRATORS

APPLICATION DEVELOPERS

HOSTING

TOOLS

STANDARDS

The business of speech

Voice User Interface (VUI) Design—the Quantum Leap in Dialog Systems

1995 -- The WildFire Effect

Change of perspective: From technology driven to user centered

Research: natural language, free form

Commercial: task completion and usability

Persona: the personality of the application (TTS vs. Recording)

Speech recognition accuracy is important, but success is determined by the VUI.

The importance of a repeatable, streamlined, teachable development process

The Speech Application Lifecycle

Steps (numbered 1 to 10 in the original diagram): requirements, VUI design, usability, VUI development, speech science, high-level system design, system engineering, integration, partial deployment, full deployment

Roles: Analyst, VUI Designer; Speech Scientist, VUI Designer; Architect, App Developer; Engineer; Project Manager

Call flow:

Get Origin Account -> Get Destination Account -> Get Amount

amount > origin account? YES: Play Wrong Amount Message. NO: Play Confirmation

confirmed? YES: Enter Transfer, Go to Main Menu. NO: What is wrong? (amount, destination account, origin account)

Voice User Interface Design

Get Amount Interaction Module

PROMPTS

Type | Wording | Source
Initial | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 1 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 2 | Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_R_2_1.wav
Timeout 1 | I'm sorry, I didn't hear you. | get_amount_T_1_1.wav
 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | [...] |
Timeout 2 | I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_T_2_1.wav
Help | Please say how much do you wish to transfer. You can say the amount in dollars and cents, like, for instance, one hundred dollars and fifty cents. | get_amount_H.wav

ACTIONS

CONDITION | ACTION
if amount greater than amount in <origin-account> | Go to "Play Wrong Amount Message"
else | Go to "Play Confirmation"
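The prompt escalation and the ACTIONS rule of this module could be expressed roughly as below. This is a sketch under assumptions, not the specification's implementation: the event names ("initial", "retry", "timeout", "help"), the function names, and the policy of reusing the level-2 prompt after the second failure are all hypothetical.

```python
def next_prompt(event, count=1):
    """Map a dialog event and its occurrence count to a prompt type in the table."""
    if event == "initial":
        return "Initial"
    if event == "help":
        return "Help"
    if event in ("retry", "timeout"):
        # Assumed policy: from the second occurrence on, keep playing the level-2 prompt.
        return f"{event.capitalize()} {min(count, 2)}"
    raise ValueError(f"unknown event: {event}")


def next_state(amount, origin_balance):
    """ACTIONS table: compare the requested amount with the origin account balance."""
    if amount > origin_balance:
        return "Play Wrong Amount Message"
    return "Play Confirmation"


print(next_prompt("timeout", 1))   # first silence from the caller
print(next_state(150.50, 100.00))  # caller asks for more than the account holds
```

Keeping the prompt table and the transition rule as data and small functions is what makes a design like this reviewable by a VUI designer before any platform code is written.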

Speech Science: Tuning for performance

Recognition outcomes

In Vocabulary

- Correctly Recognize, Accept: Correct acceptance

- Correctly Recognize, Confirm: Correct confirmation

- Mis-recognize, Accept: False acceptance - in

- Mis-recognize, Confirm: False confirmation

- Falsely Reject: False rejection

Out of Vocabulary

- Falsely Accept: False acceptance - out

- Correctly Reject: Correct rejection
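The outcome categories above can be written as a small labeling function, which is handy when annotating transcribed calls. The function name and argument encoding are assumptions for this sketch. Two cases not shown in the diagram are also decided by assumption: an out-of-vocabulary utterance that is confirmed counts as "False acceptance - out", and an in-vocabulary utterance that is rejected counts as "False rejection" whether or not the hypothesis was correct.

```python
def classify(in_vocabulary, recognized_correctly, decision):
    """Name the outcome for one utterance.

    decision is the recognizer's action: "accept", "confirm", or "reject".
    recognized_correctly is ignored for out-of-vocabulary utterances.
    """
    if decision not in ("accept", "confirm", "reject"):
        raise ValueError(f"unknown decision: {decision}")
    if not in_vocabulary:
        return "Correct rejection" if decision == "reject" else "False acceptance - out"
    if decision == "reject":
        return "False rejection"
    if recognized_correctly:
        return "Correct acceptance" if decision == "accept" else "Correct confirmation"
    return "False acceptance - in" if decision == "accept" else "False confirmation"


print(classify(True, False, "accept"))
print(classify(False, False, "reject"))
```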

Speech Science: Tuning for performance

DM | utt# | sub-err% | fa-err% | fr-err% | rej% | OOV% | fa-oov%
WaitPowerBothUp-2 | 17 | 5.88 | 0 | 0 | 5.88 | 5.88 | 0
WaitHowMuchSnow | 17 | 5.88 | 11.76 | 5.88 | 23.53 | 29.41 | 40
MissingOneChannel | 22 | 4.55 | 0 | 0 | 9.09 | 9.09 | 0
WPAllChannels | 23 | 4.35 | 0 | 4.35 | 8.7 | 4.35 | 0
PictureBack | 27 | 3.7 | 3.7 | 3.7 | 7.41 | 7.41 | 50
WaitFindInputSource | 29 | 3.45 | 0 | 0 | 13.79 | 13.79 | 0
PictureProb | 33 | 3.03 | 12.12 | 0 | 0 | 12.12 | 100

Utt# = Number of utterances

Sub-err% = percent of in-voc utterances wrongly recognized

Fa-err% = percent of utterances wrongly accepted

Fr-err% = percent of utterances wrongly rejected

Rej% = total percent of all utterances rejected

OOV% = percent of out-voc utterances

Fa-oov% = percent of out-voc utterances wrongly accepted

ACTION

- Prioritize grammars that need improvement

- Use transcriptions to improve grammars
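The percentages in the table can be reproduced from raw counts, and the result used to rank modules for tuning. This is a sketch under assumptions: the counts below are back-computed from the WaitHowMuchSnow and PictureProb rows and are not given in the source, every rate except fa-oov% is taken over all utterances (which is what matches the published figures), and the ranking by summed error rates is a hypothetical choice.

```python
def rates(total, sub, fa, fr, rejected, oov, oov_accepted):
    """Percentages for one dialog module, rounded to two decimals."""

    def pct(count, base):
        return round(100.0 * count / base, 2) if base else 0.0

    return {
        "sub-err%": pct(sub, total),
        "fa-err%": pct(fa, total),
        "fr-err%": pct(fr, total),
        "rej%": pct(rejected, total),
        "OOV%": pct(oov, total),
        # Only this rate is relative to the out-of-vocabulary utterances.
        "fa-oov%": pct(oov_accepted, oov),
    }


modules = {
    "WaitHowMuchSnow": rates(total=17, sub=1, fa=2, fr=1, rejected=4, oov=5, oov_accepted=2),
    "PictureProb": rates(total=33, sub=1, fa=4, fr=0, rejected=0, oov=4, oov_accepted=4),
}


def error_load(name):
    """Hypothetical priority score: sum of the three error rates."""
    row = modules[name]
    return row["sub-err%"] + row["fa-err%"] + row["fr-err%"]


worst_first = sorted(modules, key=error_load, reverse=True)
print(worst_first)
```

A module with a high fa-oov% such as PictureProb points to callers saying things its grammar does not cover, which is where the transcriptions become the input for grammar changes.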

The Architectural Evolution of Spoken Dialog

Native Code

Proprietary IVR Systems

Standard Clients (VoiceXML)

Standard Application Servers

1994 1998 2000 2005

Telephony Platform

The Voice Web

Web Server

Voice Browser

Internet

ASR TTS

Telephone

VoiceXML/SALT

MRCP

SSML, SRGF

EMMA?

SCXML?

CCXML

The Evolution of the Interface and the Research-Industry Chasm

Natural Language

Directed Dialog

1994 1996 1998 2000 2002 2004 2006

Research Systems a-la DARPA Communicator

Small Vocabulary Menu Based

Large Vocabulary, Dialog Modules

SLU: Statistical Language Understanding

Spoken dialog as an anthropomorphic system

Spoken dialog as a tool

The evolution of the market and the industry

TECHNOLOGY VENDORS: SPEECH RECOGNITION, TTS

PLATFORM INTEGRATORS: IVR, VoiceXML, CTI, …

TOOLS: AUTHORING, TUNING, PREPACKAGED APPLICATIONS

APPLICATION DEVELOPERS: PROFESSIONAL SERVICES

HOSTING

600 to 1,000 M$ revenue

> 8000 apps worldwide

New evolving standards guarantee interoperability of engines and platforms.

Third generation dialog systems

COMPLEXITY: LOW, MEDIUM, HIGH

Applications: FLIGHT STATUS, STOCK TRADING, PACKAGE TRACKING, FLIGHT/TRAIN RESERVATION, BANKING, CUSTOMER CARE, TECHNICAL SUPPORT

1st Generation: INFORMATIONAL

2nd Generation: TRANSACTIONAL

3rd Generation: PROBLEM SOLVING

2005 -- Spoken Dialog goes to Saturday Night Live
