SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS


SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS

Intelligent Software Lab., POSTECH. Prof. Gary Geunbae Lee

This Tutorial

Introduction to Spoken Dialog System (SDS) for Human-Robot Interaction (HRI); a brief introduction to SDS

Language-processing oriented, not signal-processing oriented

Mainly based on papers at ACL, NAACL, HLT, ICASSP, INTERSPEECH, ASRU, SLT, SIGDIAL, CSL, SPECOM, and IEEE TASLP

OUTLINE

INTRODUCTION

AUTOMATIC SPEECH RECOGNITION

SPOKEN LANGUAGE UNDERSTANDING

DIALOG MANAGEMENT

CHALLENGES & ISSUES: MULTI-MODAL DIALOG SYSTEM, DIALOG SIMULATOR

DEMOS

REFERENCES

INTRODUCTION

Human-Robot Interaction (in Movie)

Human-Robot Interaction (in Real World)

Wikipedia (http://en.wikipedia.org/wiki/Human_robot_interaction)

What is HRI?

Human-robot interaction (HRI) is the study of interactions between people and robots. HRI is multidisciplinary, with contributions from the fields of human-computer interaction, artificial intelligence, robotics, natural language understanding, and social science.

The basic goal of HRI is to develop principles and algorithms that allow more natural and effective communication and interaction between humans and robots.

Areas of HRI: Vision, Speech, Haptics, Emotion, Learning

The speech area covers signal processing, speech recognition, speech understanding, dialog management, and speech synthesis.

SPOKEN DIALOG SYSTEM (SDS)

SDS APPLICATIONS

Tele-service, car navigation, home networking, robot interface

Talk, Listen and Interact

AUTOMATIC SPEECH RECOGNITION

SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)

AUTOMATIC SPEECH RECOGNITION

[Figure: supervised learning setup with input x (speech), output y (words), training examples (x, y), and a learning algorithm]

A process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993]

NOISY CHANNEL MODEL GOAL

Find the most likely sequence W of "words" in language L given the sequence of acoustic observation vectors O

Treat acoustic input O as sequence of individual observations O = o1,o2,o3,…,ot

Define a sentence as a sequence of words: W = w1,w2,w3,…,wn

Ŵ = argmax_{W ∈ L} P(W|O)

By Bayes' rule:

Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O)

Since P(O) does not depend on W, the "golden rule" of ASR is:

Ŵ = argmax_{W ∈ L} P(O|W) P(W)

TRADITIONAL ARCHITECTURE

[Figure: traditional ASR architecture. Speech signals O pass through feature extraction and decoding to produce the word sequence W, e.g., "버스 정류장이 어디에 있나요?" ("Where is the bus stop?"). Decoding combines an acoustic model (HMM estimation from a speech DB), a pronunciation model (G2P), and a language model (LM estimation from text corpora) through network construction, computing Ŵ = argmax_{W ∈ L} P(O|W) P(W).]

TRADITIONAL PROCESSES

FEATURE EXTRACTION

The Mel-Frequency Cepstrum Coefficients (MFCC) are a popular choice [Paliwal, 1992]

Frame size: 25 ms / Frame rate: 10 ms

39 features per 10 ms frame
- Absolute: log frame energy (1) and MFCCs (12)
- Delta: first-order derivatives of the 13 absolute coefficients
- Delta-Delta: second-order derivatives of the 13 absolute coefficients

[Figure: MFCC pipeline: x(n) → preemphasis / Hamming window (25 ms frames, 10 ms rate) → FFT (Fast Fourier Transform) → Mel-scale filter bank and log|·| → DCT (Discrete Cosine Transform) → 12-dimensional MFCC]
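As a rough illustration only (not part of the original tutorial), MFCC-style features along these lines can be computed with an off-the-shelf library such as librosa; the 25 ms / 10 ms framing and the delta stacking follow the slide above, and the file path is hypothetical.

import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    # Load audio and compute 13 cepstral coefficients on 25 ms frames every 10 ms
    # (librosa's first coefficient c0 stands in for the log-energy term on the slide)
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    delta = librosa.feature.delta(mfcc)            # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T      # shape (frames, 39)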

ACOUSTIC MODEL

Provides P(O|Q) = P(features|phone)

Modeling units [Bahl et al., 1986]
- Context-independent: phoneme
- Context-dependent: diphone, triphone, quinphone (pL-p+pR: left-right context triphone)

Typical acoustic model [Juang et al., 1986]
- Continuous-density Hidden Markov Model; distribution: Gaussian mixture
- HMM topology: 3-state left-to-right model for each phone, 1 state for silence or pause

HMM parameters: λ = (A, B, π)

Output distribution of state j (Gaussian mixture codebook):

b_j(x_t) = Σ_{k=1..K} c_{jk} N(x_t; μ_{jk}, Σ_{jk})
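A minimal numpy sketch of this mixture output probability, assuming diagonal covariances (a common practical simplification, not something the slide specifies):

import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    # log b_j(x) = log sum_k c_jk * N(x; mu_jk, diag(var_jk)) for one HMM state j
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    m = np.max(log_components)                      # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_components - m)))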

PRONUNCIATION MODEL

Provides P(Q|W) = P(phone|word)

Word lexicon [Hazen et al., 2002]
- Maps legal phone sequences into words according to phonotactic rules
- G2P (grapheme-to-phoneme): generates a word lexicon automatically
- A word may have multiple pronunciations

Example: tomato
P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1
P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4

[Figure: pronunciation network for "tomato": [t] branches to [ow] (0.2) or [ah] (0.8), then [m] (1.0), which branches to [ey] (0.5) or [aa] (0.5), followed by [t] (1.0) and [ow] (1.0)]
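To make the lexicon concrete, here is a toy probabilistic pronunciation dictionary as a data structure; the phone strings and probabilities come from the tomato example above, and the lookup function is purely illustrative.

LEXICON = {
    "tomato": [                      # word -> list of (phone sequence, probability)
        (["t", "ow", "m", "ey", "t", "ow"], 0.1),
        (["t", "ow", "m", "aa", "t", "ow"], 0.1),
        (["t", "ah", "m", "ey", "t", "ow"], 0.4),
        (["t", "ah", "m", "aa", "t", "ow"], 0.4),
    ],
}

def pronunciation_prob(phones, word):
    # P(Q|W): probability of a phone sequence given the word, 0 if not listed
    for seq, p in LEXICON.get(word, []):
        if seq == phones:
            return p
    return 0.0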

LANGUAGE MODEL

Provides P(W): the probability of the sentence [Beaujard et al., 1999]

This is also used in the decoding process as the probability of transitioning from one word to another.

Word sequence: W = w1, w2, w3, …, wn

The problem is that we cannot reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language.

n-gram Language Model
- n-gram language models use the previous n-1 words to represent the history
- Bigrams are easily incorporated into a Viterbi search

P(w_1 … w_n) = Π_{i=1..n} P(w_i | w_1 … w_{i-1})

n-gram approximation of the history:

P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-(n-1)} … w_{i-1})
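A small sketch of maximum-likelihood bigram estimation from counts; real systems add smoothing (e.g., back-off), which is omitted here.

from collections import Counter

def train_bigram(sentences):
    # MLE bigram model: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

# e.g., train_bigram([["서울", "에서", "세시", "출발"], ["부산", "에서", "네시", "출발"]])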

LANGUAGE MODEL Example

Finite State Network (FSN)
[Figure: word network over 서울/부산 (Seoul/Busan), 에서 (from), 세시/네시 (three/four o'clock), 출발 (departure), 대구/대전 (Daegu/Daejeon), 도착 (arrival), 출발하는 (departing), 기차/버스 (train/bus)]

Context Free Grammar (CFG)
$time = 세시 | 네시; $city = 서울 | 부산 | 대구 | 대전; $trans = 기차 | 버스;
$sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans

Bigram
P(에서|서울)=0.2, P(세시|에서)=0.5, P(출발|세시)=1.0, P(하는|출발)=0.5, P(출발|서울)=0.5, P(도착|대구)=0.9, …

Expanding every word to state level, we get a search network [Demuynck et al., 1997]

NETWORK CONSTRUCTION

[Figure: search network construction for the Korean digits 일, 이, 삼, 사. The acoustic model provides HMM states for the phones I, L, S, A, M; the pronunciation model maps phone sequences (I-L, I, S-A-M, S-A) to words; the language model supplies word-transition probabilities P(일|x), P(이|x), P(삼|x), P(사|x), applied at between-word transitions, while intra-word transitions come from the lexicon, giving a single search network from start to end]

DECODING

Find Ŵ = argmax_{W ∈ L} P(W|O)

Viterbi search: dynamic programming

Token Passing Algorithm [Young et al., 1989]
- Initialize all states with a token carrying a null history and the likelihood that the state is a start state
- For each frame a_k:
  - For each token t in state s with probability P(t) and history H, and for each state r:
    - Add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H
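For illustration, a compact numpy sketch of the Viterbi dynamic program that token passing implements; it operates on a generic HMM in the log domain and is not the HTK implementation.

import numpy as np

def viterbi(log_init, log_trans, log_obs):
    # log_init: (S,), log_trans: (S, S), log_obs: (T, S) with log P(o_t | state)
    T, S = log_obs.shape
    delta = log_init + log_obs[0]                 # best log score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # (from_state, to_state)
        backptr[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]                             # most likely state sequence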

HTK

Hidden Markov Model Toolkit (HTK)

A portable toolkit for building and manipulating hidden Markov models [Young et al., 1996]
- HShell: user I/O & interaction with the OS
- HLabel: label files
- HLM: language model
- HNet: networks and lattices
- HDic: dictionaries
- HVQ: VQ codebooks
- HModel: HMM definitions
- HMem: memory management
- HGrf: graphics
- HAdapt: adaptation
- HRec: main recognition processing functions

SUMMARY

[Figure: ASR summary. Training examples (x, y) of speech and words feed a learning algorithm that estimates the acoustic, pronunciation, and language models; these are compiled by search network construction and used in decoding]

Speech Understanding = Spoken Language Understanding (SLU)

SPEECH UNDERSTANDING (in general)

[Figure: a computer program maps a speech segment ("Open the doors.") to various meaning representations: speaker ID / language ID (Dave / English), sentiment / opinion (nervous), named entities / relations (LOC = pod bay, OBJ = door), topic / intent (control the spaceship), syntactic / semantic roles (Open = Verb, the = Det, ...), a summary, or SQL (select * from DOORS where ...)]

SPEECH UNDERSTANDING (in SDS)

[Figure: learning setup with input x (speech or words), output y (intentions), training examples (x, y), and a learning algorithm]

A process by which natural language speech is mapped to a frame structure encoding its meaning [De Mori et al., 2008]

What's the difference between NLU and SLU?
- Robustness: noise and ungrammatical spoken language
- Domain dependence: deeper, domain-specific semantics (e.g., Person vs. Cast)
- Dialog: dialog-history dependent, with utterance-by-utterance analysis

Traditional approaches: natural language to SQL conversion

[Figure: a typical ATIS system: Speech → ASR → Text → SLU → Semantic Frame → SQL Generation → SQL → Database → Response (from [Wang et al., 2005])]

LANGUAGE UNDERSTANDING

REPRESENTATION

Semantic frame (slot/value structure) [Gildea and Jurafsky, 2002]

An intermediate semantic representation to serve as the interface between user and dialog system

Each frame contains several typed components called slots. The type of a slot specifies what kind of fillers it is expecting.

Example: "Show me flights from Seattle to Boston"

Hierarchical representation: ShowFlight has Subject = FLIGHT and a Flight component with Departure_City = SEA and Arrival_City = BOS

<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>

Semantic representation on the ATIS task in XML and hierarchical form [Wang et al., 2005]

Meaning representations for spoken dialog systems

Slot type 1: Intent, Subject Goal, Dialog Act (DA). The meaning (intention) of an utterance at the discourse level.

Slot type 2: Component Slot, Named Entity (NE). The identifier of an entity such as a person, location, organization, or time; in SLU, it represents the domain-specific meaning of a word (or word group).

SEMANTIC FRAME

<frame domain='RestaurantGuide'>
  <slot type='DA' name='SEARCH_RESTAURANT'/>
  <slot type='NE' name='CITY'>Pohang</slot>
  <slot type='NE' name='ADDRESS'>Daeyidong</slot>
  <slot type='NE' name='FOOD_TYPE'>Korean</slot>
</frame>

Ex) Find Korean restaurants in Daeyidong, Pohang
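A hypothetical in-memory equivalent of this frame, just to make the slot/value structure concrete; the field names mirror the XML above.

from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    domain: str
    dialog_act: str                                   # slot type 1: DA / intent
    entities: dict = field(default_factory=dict)      # slot type 2: NE name -> value

frame = SemanticFrame(domain="RestaurantGuide",
                      dialog_act="SEARCH_RESTAURANT",
                      entities={"CITY": "Pohang", "ADDRESS": "Daeyidong",
                                "FOOD_TYPE": "Korean"})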

HOW TO SOLVE

Two classification problems

Dialog Act Identification
Input: Find Korean restaurants in Daeyidong, Pohang
Output: SEARCH_RESTAURANT

Named Entity Recognition
Input: Find Korean restaurants in Daeyidong, Pohang
Output: FOOD_TYPE = Korean, ADDRESS = Daeyidong, CITY = Pohang

PROBLEM FORMALIZATION

Encoding: x is an input (word), y is an output (NE), and z is another output (DA).

Vector x = {x1, x2, x3, …, xT}; vector y = {y1, y2, y3, …, yT}; scalar z

Goal: model the functions y = f(x) and z = g(x)

x: Find | Korean | restaurants | in | Daeyidong | , | Pohang | .
y: O | FOOD_TYPE-B | O | O | ADDRESS-B | O | CITY-B | O
z: SEARCH_RESTAURANT
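The same example as Python data, with a small helper that turns the BIO labels back into slot/value pairs; decode_entities is a made-up utility, not part of any particular toolkit.

x = ["Find", "Korean", "restaurants", "in", "Daeyidong", ",", "Pohang", "."]
y = ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "O", "CITY-B", "O"]
z = "SEARCH_RESTAURANT"

def decode_entities(tokens, labels):
    # Collect (type, text) pairs from B/I-suffixed labels
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.endswith("-B"):
            current = [lab[:-2], [tok]]
            entities.append(current)
        elif lab.endswith("-I") and current is not None:
            current[1].append(tok)
        else:
            current = None
    return [(t, " ".join(ws)) for t, ws in entities]

# decode_entities(x, y) -> [('FOOD_TYPE', 'Korean'), ('ADDRESS', 'Daeyidong'), ('CITY', 'Pohang')]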

CASCADE APPROACH I

[Figure: Automatic Speech Recognition → x → Sequential Labeling model (e.g., HMM, CRFs) for Named Entity / Frame Slot → (x, y) → Classification model (e.g., MaxEnt, SVM) for Dialog Act / Intent → (x, y, z) → Dialog Management. Slide note: Named Entity / Dialog Act ordering; improves NE, but not DA.]

CASCADE APPROACH II

[Figure: Automatic Speech Recognition → x → Classification model (e.g., MaxEnt, SVM) for Dialog Act / Intent → (x, z) → Multiple Sequential Labeling models (e.g., intent-dependent) for Named Entity / Frame Slot, selected by z → (x, y, z) → Dialog Management]

JOINT APPROACH

Named Entity ↔ Dialog Act: joint inference

[Figure: Automatic Speech Recognition → x → Joint Model (e.g., TriCRFs) performing Classification (Dialog Act / Intent) and Sequential Labeling (Named Entity / Frame Slot) together → (x, y, z) → Dialog Management]

[Jeong and Lee, 2006]

MACHINE LEARNING FOR SLU

Relational Learning (RL) or Structured Prediction (SP) [Dietterich, 2002; Lafferty et al., 2004; Sutton and McCallum, 2006]

Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classifier

Argmax search (e.g., sum-max, belief propagation, Viterbi, etc.)

Basically, RL for language processing uses a left-to-right structure (a.k.a. linear-chain or sequence structure)

Algorithms: CRFs, Max-Margin Markov Networks (M3N), SVM for Independent and Structured Output (SVM-ISO), Structured Perceptron, etc.

MACHINE LEARNING FOR SLU

Background: Maximum Entropy (a.k.a. logistic regression)
- Conditional and discriminative
- Unstructured (no dependency in y)
- Used for the dialog act classification problem

Conditional Random Fields [Lafferty et al., 2001]
- Structured version of MaxEnt (argmax search in inference)
- Undirected graphical models, popular in language and text processing
- Linear-chain structure for practical implementation
- Used for the named entity recognition problem

[Figure: linear-chain CRF with observations x_{t-1}, x_t, x_{t+1}, labels y_{t-1}, y_t, y_{t+1}, dialog act variable z, and feature functions f_k, g_k, h_k]
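One hypothetical way to train the two models with off-the-shelf libraries (scikit-learn for the MaxEnt dialog-act classifier, sklearn-crfsuite for the linear-chain CRF tagger); the tiny training set and feature templates are placeholders only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression     # MaxEnt for dialog acts
import sklearn_crfsuite                                  # linear-chain CRF for NEs

utterances = ["Find Korean restaurants in Daeyidong , Pohang .",
              "Where is the bus stop ?"]
dialog_acts = ["SEARCH_RESTAURANT", "WH-QUESTION"]
ne_labels = [["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "O", "CITY-B", "O"],
             ["O", "O", "O", "O", "O", "O"]]

# Dialog act classification: unstructured z = g(x)
vec = CountVectorizer()
maxent = LogisticRegression(max_iter=1000).fit(vec.fit_transform(utterances), dialog_acts)

# Named entity recognition: structured y = f(x)
def word_feats(sent, i):
    return {"word": sent[i], "prev": sent[i - 1] if i > 0 else "<s>"}

X = [[word_feats(u.split(), i) for i in range(len(u.split()))] for u in utterances]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50).fit(X, ne_labels)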

SUMMARY

Dialog Act Identification: solved with an isolated (independent) classifier such as Naïve Bayes or MaxEnt
Input: Find Korean restaurants in Daeyidong, Pohang
Output: SEARCH_RESTAURANT

Named Entity Recognition: solved with a structured (relational) classifier such as an HMM or CRFs
Input: Find Korean restaurants in Daeyidong, Pohang
Output: FOOD_TYPE = Korean, ADDRESS = Daeyidong, CITY = Pohang

Coffee Break

DIALOG MANAGEMENT

DIALOG MANAGEMENT

[Figure: learning setup with input x (words or intentions), output y (system response), training examples (x, y), and a learning algorithm]

A central component of a dialog system that produces system responses using external knowledge sources [McTear, 2004]

DIALOG MANAGEMENT GOAL

Answer the user's query (e.g., a question or an order) given the task domain. This includes:
- Providing query results
- Asking for further slot information
- Confirming user utterances
- Notifying the user of an invalid query
- Suggesting alternatives

Related to dialog complexity and task complexity.

In practice: find the best system action a given the dialog state s

DESIGN ISSUES: Task complexity

How hard is the task? How much domain knowledge does the system have?

[Figure: task complexity spectrum from simple to complex, spanning Call Routing, Weather Information, Automatic Banking, Conversational English Tutoring, and Collaborative Planning]

DESIGN ISSUES: Dialog complexity

What dialog phenomena are allowed?
- Initiative strategies: e.g., system-initiative vs. user-initiative vs. mixed-initiative
- Meta-dialogs (dialog about the dialog itself): e.g., "Could you hold on for a minute?"
- Subdialogs (clarification/confirmation): e.g., "You selected KE airlines, is that right?"
- Multiple dialog threads: e.g., domain switching

DIALOG EXAMPLES

Example 1
S: Where are you going?
U: Gangnam, Seoul
S: What kind of food would you like in Gangnam, Seoul?
U: African food, please
S: Sorry, there are no African restaurants. Please select another food.

System-initiative; implicit confirmation; merely informs the user of a failed query; rigid dialog flow

DIALOG EXAMPLES

Example 2
S: May I help you?
U: I'd like to have African food in Gangnam, Seoul
S: Sorry, there are no African restaurants.
S: There are Korean, Japanese, Indian, and American restaurants in Gangnam.
U: Korean restaurants, please.
S: Would you like to find Korean restaurants?
U: Yes, please

Mixed-initiative; implicit/explicit confirmation; suggests an alternative when the query fails; more natural dialog flow

DIALOG EXAMPLES

Example 3
U: I'd like to have African food in Gangnam, Seoul
S: Sorry, there are no African restaurants.
S: How about American restaurants in Gangnam, Seoul?
U: No, I don't like them.
S: What is your favorite food?
U: I like grilled and seasoned beef
S: Then how about Korean restaurants?
U: Good.

Mixed-initiative; implicit/explicit confirmation; recommends an alternative when the query fails; the most natural dialog flow

DIALOG CONTROL

Finite-state based approach
- Input: a single word or phrase
- State transition network (or graph)
- Allows only the legal dialog flows pre-defined in the state diagram

Frame-based approach (see the sketch after this list)
- Input: natural language with concept spotting
- Form-filling tasks that access an information source
- The questions do not have to be asked in a predetermined sequence

Plan-based approach
- Input: unrestricted natural language
- Models dialog as collaboration between intelligent agents to solve some problem or task
- For more complex tasks, such as negotiation and problem solving
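A toy sketch of frame-based (form-filling) control with hypothetical slot names: the manager asks for whichever required slots are still unfilled, in no fixed order, and queries the database once the frame is complete.

REQUIRED_SLOTS = ["CITY", "FOOD_TYPE"]        # hypothetical restaurant-guide frame

def next_action(frame):
    # Frame-based control: ask for a missing slot, or run the query when complete
    missing = [s for s in REQUIRED_SLOTS if s not in frame]
    if missing:
        return ("ASK_SLOT", missing[0])
    return ("QUERY_DB", dict(frame))

frame = {"CITY": "Pohang"}                    # filled from one turn's SLU result
print(next_action(frame))                     # ('ASK_SLOT', 'FOOD_TYPE')
frame["FOOD_TYPE"] = "Korean"
print(next_action(frame))                     # ('QUERY_DB', {'CITY': 'Pohang', 'FOOD_TYPE': 'Korean'})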

KNOWLEDGE-BASED DM (KBDM)

Rule-based approaches
- Early KBDMs were developed with handcrafted rules (e.g., information state update)
- Simple example [Larsson and Traum, 2003]

Agenda-based approaches
- Recent KBDMs were developed with domain-specific knowledge and a domain-independent dialog engine

VoiceXML

What is VoiceXML?
- The HTML (XML) of the voice web: the open standard markup language for voice applications
- VoiceXML resources: http://www.voicexml.org/

What it can do:
- Rapid implementation and management
- Integration with the World Wide Web
- Mixed-initiative dialogue
- A simple dialogue implementation solution

VoiceXML EXAMPLE

S: Say one of: Sports scores; Weather information; Log in.
U: Sports scores

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="http://www.example.com/sports.vxml">Sports scores</choice>
    <choice next="http://www.example.com/weather.vxml">Weather information</choice>
    <choice next="#login">Log in</choice>
  </menu>
</vxml>

AGENDA-BASED DM

RavenClaw DM (CMU)
- Uses hierarchical task decomposition: a set of all possible dialogs in the domain
- Tree of dialog agents; each agent handles the corresponding part of the dialog task

[Bohus and Rudnicky, 2003]

EXAMPLE-BASED DM (EBDM)

Example-based approaches

Dialog corpus, Turn #1 (Domain = Building_Guidance):
USER: 회의실이 어디지? (Where is the meeting room?)
[Dialog Act = WH-QUESTION] [Main Goal = SEARCH-LOC] [ROOM-TYPE = 회의실]
SYSTEM: 3층에 교수회의실, 2층에 대회의실, 소회의실이 있습니다. (The faculty meeting room is on the 3rd floor; the main and small meeting rooms are on the 2nd floor.) [System Action = inform(Floor)]

Each dialog example is indexed into the dialog state space using semantic and discourse features:
Domain = Building_Guidance; Dialog Act = WH-QUESTION; Main Goal = SEARCH-LOC;
ROOM-TYPE = 1 (filled), ROOM-NAME = 0 (unfilled), LOC-FLOOR = 0, PER-NAME = 0, PER-TITLE = 0;
Previous Dialog Act = <s>; Previous Main Goal = <s>; Discourse History Vector = [1,0,0,0,0];
Lexico-semantic Pattern = "ROOM_TYPE 이 어디 지 ?"; System Action = inform(Floor)

At run time, the example with the most similar state is retrieved.

e* = argmax_{e_i ∈ E} S(e_i, h)

[Lee et al., 2009]
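A minimal sketch of this retrieval step, scoring stored examples against the current dialog state h with a simple feature-overlap similarity; the real S(e_i, h) in [Lee et al., 2009] is more elaborate, and the example database here is invented.

def similarity(example_state, current_state):
    # S(e_i, h): here just the number of matching state features
    return sum(1 for k, v in current_state.items() if example_state.get(k) == v)

def best_example(example_db, current_state):
    # e* = argmax over e_i in E of S(e_i, h)
    return max(example_db, key=lambda e: similarity(e["state"], current_state))

example_db = [
    {"state": {"dialog_act": "WH-QUESTION", "main_goal": "SEARCH-LOC", "ROOM-TYPE": 1},
     "system_action": "inform(Floor)"},
    {"state": {"dialog_act": "REQUEST", "main_goal": "SEARCH-PERSON", "PER-NAME": 1},
     "system_action": "inform(Room)"},
]
h = {"dialog_act": "WH-QUESTION", "main_goal": "SEARCH-LOC", "ROOM-TYPE": 1}
print(best_example(example_db, h)["system_action"])      # inform(Floor)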

STOCHASTIC DM

Supervised approaches [Griol et al., 2008]
- Find the best system action that maximizes the conditional probability P(a|s) given the dialog state s
- Based on supervised learning algorithms

MDP/POMDP-based approaches [Williams and Young, 2007]
- Find the optimal system action that maximizes the expected reward R(a|s) given the belief state
- Based on reinforcement learning algorithms

In general, the dialog state space is too large, so generalizing the current dialog state is important.

Dialog as a Markov Decision Process [Williams and Young, 2007]

[Figure: the user, with goal s_u and dialog history s_d, produces a user dialog act a_u; speech understanding delivers a noisy estimate ã_u to the machine; a state estimator maintains the machine state s̃_m = (s̃_u, ã_u, s̃_d); the dialog policy maps s̃_m to a machine dialog act a_m, which is rendered by speech generation; reinforcement learning optimizes the policy against the reward r(s_m, a_m), maximizing the return R = Σ_k r_k]
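For illustration, a tabular Q-learning update as it might be used to optimize such a dialog policy; the action set, state encoding, and reward are placeholders, and practical systems operate over belief states rather than a small discrete table.

import random
from collections import defaultdict

ACTIONS = ["ask_slot", "confirm", "inform", "close"]     # hypothetical machine dialog acts
Q = defaultdict(float)                                   # Q[(state, action)]

def choose_action(state, epsilon=0.1):
    if random.random() < epsilon:                        # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.95):
    # One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])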

SUMMARY

[Figure: DM summary. Input x (words or intentions) maps to output y (system response); a dialog model is learned from a dialog corpus and consults an external DB; approaches include agenda-based, stochastic, and example-based]

Demo: building guidance dialog, TV program guide dialog, multi-domain dialog with chatting

CHALLENGES & ISSUES: MULTI-MODAL DIALOG SYSTEM

MULTI-MODAL DIALOG SYSTEM

[Figure: learning setup with inputs x (speech, gesture, face), output y (system response), training examples (x, y), and a learning algorithm]

MULTIMODAL DIALOG SYSTEM

A system that supports human-computer interaction over multiple different input and/or output modes.
- Input: voice, pen, gesture, facial expression, etc.
- Output: voice, graphical output, etc.

Applications: GPS, information guide systems, smart home control, etc.

[Figure: combined input. Voice: "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Tell me the fastest way to get from here to here."), accompanied by pen input indicating the locations]

MOTIVATION

Speech: the ultimate interface?
(+) Natural interaction style (free speech) with a natural repair process for error recovery
(+) Richer channel: carries the speaker's disposition and emotional state (if systems knew how to deal with that)
(-) Inconsistent input (high error rates) and errors are hard to correct; e.g., we may get a different result each time we speak the same words
(-) Slow, sequential output style when using TTS (text-to-speech)

How to overcome these weak points? A multimodal interface!

ADVANTAGES

Task performance and user preference

Migration of Human-Computer Interaction away from the desktop

Adaptation to the environment

Error recovery and handling

Special situations where mode choice helps

TASK PERFORMANCE AND USER PREFERENCE

Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]:
- 10% faster task completion
- 23% fewer words (shorter and simpler linguistic constructions)
- 36% fewer task errors
- 35% fewer spoken disfluencies
- 90-100% of users prefer to interact this way

Speech-only dialog system. Speech: "Bring the drink on the table to the side of the bed"

Multimodal dialog system. Speech: "Bring this to here" plus a pen gesture (an easy, simplified user utterance!)

MIGRATION OF HCI AWAY FROM THE DESKTOP

Small portable computing devices, such as PDAs, organizers, and smartphones:
- Limited screen real estate for graphical output
- Limited input: no keyboard/mouse (arrow keys, thumbwheel)
- Complex GUIs are not feasible
- Augment the limited GUI with natural modalities such as speech and pen: uses less space, rapid navigation over the menu hierarchy

Other devices: kiosks, car navigation systems, …
- No mouse or keyboard

Speech + pen gesture

ADAPTATION TO THE ENVIRONMENT

Multimodal interfaces enable rapid adaptation to changes in the environment:
- Allow the user to switch modes
- Mobile devices are used in multiple environments

Environmental conditions can be either physical or social.

Physical
- Noise: increases in ambient noise can degrade speech performance; switch to GUI or stylus/pen input
- Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display

Social
- Speech may be easiest for a password, account number, etc., but in public places users may be uncomfortable being overheard; switch to GUI or keypad input

ERROR RECOVERY AND HANDLING

Advantages for recovery and reduction of errors:
- Users intuitively pick the mode that is less error-prone
- Language is often simplified
- Users intuitively switch modes after an error, so the same problem is not repeated
- Multimodal error correction
- Cross-mode compensation (complementarity): combining inputs from multiple modalities can reduce the overall error rate

Multimodal interface has potentially

SPECIAL SITUATIONS WHERE MODE CHOICE HELPS

- Users with disabilities
- People with a strong accent or a cold
- People with RSI
- Young children or non-literate users
- Other users who have problems handling the standard devices (mouse and keyboard)

Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.

Demo: multimodal dialog in the smart home domain, English teaching dialog

CHALLENGES & ISSUES: DIALOG SIMULATOR

SYSTEM EVALUATION: Real User Evaluation

[Figure: real users interact with the spoken dialog system]
(+) Reflects the real world
(-) High cost
(-) Human factors: it loses objectivity

SYSTEM EVALUATION: Simulated User Evaluation

[Figure: a simulated user interacts with the spoken dialog system in a virtual environment]
(+) Low cost
(+) Consistent evaluation: it guarantees objectivity
(-) Not the real world

SYSTEM DEVELOPMENT

Exposing the system to diverse environments: different users, noise, unexpected focus shifts

USER SIMULATION

[Figure: simulated users send simulated user input to the spoken dialog system and receive system output]

PROBLEMS

User simulation for spoken dialog systems involves several essential problems [Jung et al., 2009]:
- User Intention Simulation
- User Utterance Simulation
- ASR Channel Simulation

[Figure: simulated users interacting with the spoken dialog system]

USER INTENTION SIMULATION

Goal: generating appropriate user intentions given the current dialog state

P(user_intention | dialog_state)

Example
U1: 근처에 중국집 가자 (Let's go to a Chinese restaurant nearby)
S1: 행당동에 북경, 아서원, 시온반점이 있고 홍익동에 정궁중화요리, 도선동에 양자강이 있습니다. (There are Bukgyeong, Aseowon, and Sion Banjeom in Haengdang-dong, Jeonggung Chinese Restaurant in Hongik-dong, and Yangjagang in Doseon-dong.)
U2: 삼성동에는 뭐가 있지? (What is there in Samseong-dong?)

Semantic frame (intention) for U2: Dialog Act = WH-QUESTION, Main Goal = SEARCH-LOC, Named Entity = LOC_ADDRESS
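A toy sketch of intention simulation as conditional sampling; the hand-built probability table stands in for a model estimated from a dialog corpus, and the state and intention labels are purely illustrative.

import random

INTENTION_MODEL = {                      # P(user_intention | dialog_state)
    "system_listed_restaurants": [("WH-QUESTION/SEARCH-LOC", 0.6),
                                  ("REQUEST/SELECT-RESTAURANT", 0.4)],
    "dialog_start":              [("REQUEST/SEARCH-RESTAURANT", 1.0)],
}

def simulate_intention(dialog_state):
    intentions, weights = zip(*INTENTION_MODEL[dialog_state])
    return random.choices(intentions, weights=weights, k=1)[0]

print(simulate_intention("system_listed_restaurants"))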

USER UTTERANCE SIMULATION

Goal: generating natural language given the user intentions

P(user_utterance | user_intention)

Semantic frame (intention): Dialog Act = WH-QUESTION, Main Goal = SEARCH-LOC, Named Entity = LOC_ADDRESS

Generated utterances:
- 삼성동에는 뭐가 있지? (What is there in Samseong-dong?)
- 삼성동 쪽에 뭐가 있지? (What is there around Samseong-dong?)
- 삼성동에 있는 것은 뭐니? (What are the things in Samseong-dong?)
- …

ASR CHANNEL SIMULATION

Goal: generating noisy utterances from a clean utterance at a given error rate

P(utter_noisy | utter_clean, error_rate)

Clean utterance: 삼성동에는 뭐가 있지? (What is there in Samseong-dong?)
Noisy utterances: 삼성동에 뭐 있니?, 삼정동에는 뭐가 있지?, 상성동 뭐 가니?, 삼성동에는 무엇이 있니?, …
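A crude sketch of the channel simulation interface: randomly substituting, deleting, or inserting words at a target error rate. A data-driven simulator would instead model real ASR confusions; the substitution vocabulary here is arbitrary.

import random

def simulate_asr_channel(words, error_rate=0.2, confusables=("삼정동", "상성동", "무엇이")):
    # Inject word-level substitution/deletion/insertion errors at roughly error_rate
    noisy = []
    for w in words:
        if random.random() < error_rate:
            op = random.choice(["sub", "del", "ins"])
            if op == "sub":
                noisy.append(random.choice(confusables))
            elif op == "ins":
                noisy.extend([w, random.choice(confusables)])
            # "del": drop the word entirely
        else:
            noisy.append(w)
    return noisy

print(simulate_asr_channel("삼성동 에는 뭐 가 있지".split(), error_rate=0.3))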

AUTOMATED DIALOG SYSTEM EVALUATION

[Jung et al., 2009]

Demo: self-learned dialog system, translating dialog system

REFERENCES

REFERENCES ASR (1/2)

L.R. Rabiner and B.H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.

L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of 1986 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.49–52.

K.K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol.2, pp.157–173.

B.H. Juang, S.E. Levinson, and M.M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol.32, no.2, pp.307–309.

T.J. Hazen, I.L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp.99–104.

REFERENCES ASR (2/2)

K. Demuynck, J. Duchateau, and D.V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp.143–146.

S.J. Young, N.H. Russell, and J.H.S Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.

S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK book. Entropics Cambridge Research Lab., Cambridge, UK.

HTK website: http://htk.eng.cam.ac.uk/

REFERENCES SLU

R. De Mori et al. Spoken Language Understanding for Conversational Systems. Signal Processing Magazine. 25(3):50-58. 2008.

Y. Wang, L. Deng, and A. Acero. September 2005, Spoken Language Understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5):16-31.

D. Gildea, and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.

M. Jeong and G.G. Lee, 2006. Jointly predicting dialog act and named entity for spoken language understanding, IEEE/ACL workshop on SLT.

T. G. Dietterich, 2002. Machine learning for sequential data: A review. Caelli(Ed.) Structural, Syntactic, and Statistical Pattern Recognition.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.

C. Sutton and A. McCallum, 2006. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning. L. Getoor and B. Taskar, Eds. MIT Press.

REFERENCES DM

M. F. McTear, Spoken Dialogue Technology - Toward the Conversational User Interface: Springer Verlag London, 2004.

S. Larsson, and D. R. Traum, “Information state and dialogue management in the TRINDI dialogue move engine toolkit,” Natural Language Engineering, vol. 6, pp. 323-340, 2006.

B. Bohus, and A. Rudnicky, “RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda,” in Proc. of the European Conference on Speech, Communication and Technology, 2003, pp. 597-600.

D. Griol, L. F. Hurtado, E. Segarra et al., “A statistical approach to spoken dialog systems design and evaluation,” Speech Communication, vol. 50, no. 8-9, pp. 666-682, 2008.

J. D. Williams, and S. Young, “Partially observable Markov decision processes for spoken dialog systems,” Computer Speech and Language, vol. 21, pp. 393-422, 2007.

C. Lee, S. Jung, S. Kim et al., “Example-based Dialog Modeling for Practical Multi-domain Dialog System,” Speech Communication, vol. 51, no. 5, pp. 466-484, 2009.

REFERENCES MULTI-MODAL DIALOG SYSTEM & DIALOG SIMULATOR

S. L. Oviatt, A. DeAngeli, and K. Kuhn, 1997. Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of the Conference on Human Factors in Computing Systems: CHI '97.

R. Lopez-Cozar, A. D. la Torre, J. C. Segura et al., “Assessment of dialogue systems by means of a new simulation technique,” Speech Communication, vol. 40, no. 3, pp. 387-407, 2003.

J. Schatzmann, B. Thomson, K. Weilhammer et al., “Agenda-based User Simulation for Bootstrapping a POMDP Dialogue System,” in Proc. of the Human Language Technology/North American Chapter of the Association for Computational Linguistics, 2007, pp. 149-152.

S. Jung, C. Lee, K. Kim et al., “Data-driven user simulation for automated evaluation of spoken dialog systems,” Computer Speech and Language, 2009.

Thank You & QA