Statistical Spoken Dialogue Systems and the Challenges for Machine Learning

Steve Young

Dialogue Systems Group, Machine Intelligence Laboratory, Cambridge University Engineering Department, Cambridge, UK

Dialog System Architecture

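[Figure: pipeline from the User through ASR, Semantic Decoder and Belief Tracker (Understanding: turn level, then dialogue level) to the Dialog Manager (Dialog Policy, connected to the Database/Application), and back through the Response Planner and Message Generator (Generation: dialogue level, then turn level) and TTS to the User. Data passed between stages: Recognition Hypotheses, Belief State, System Actions, System Response.]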

Understanding: ASR -> Beliefs

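[Figure: the ASR hypotheses (with probabilities p1, p2, …) and the last system act each pass through a word embedding (WE). A CNN encodes each hypothesis, the encodings are scaled by their probabilities (× p1, × p2) and summed (+), and LSTMs followed by a SoftMax output Ps(v). Panel labels: Per Turn Semantic Decoding; Per Utterance Belief Tracking. Repeated for each slot s.]

Belief State = Concatenation of Slot Probability Vectors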

Using a CNN to Extract Lexical Features

The CNN is the key component: it scans each utterance, applying convolution windows of 1, 2, 3, 4, … words.

Slide convolution filter k of length l over the utterance:

$c_i = \tanh(f_{kl} \cdot w_{i:i+l-1} + b)$

[Figure: the example utterance "I am looking for a cheap hotel near here" (words 1 to 9) is mapped to features c1 … c9. A grid of pooled features r_kl, indexed by filter number k (1 to 5) and window size l (1 to 4), is obtained by taking the max over positions (example filter: f43). The grid is combined by addition across window sizes into the sentence representation r1 … r5.]
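The convolution and pooling steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the `embed` function is a made-up toy embedding, the filter weights are arbitrary, no padding is applied (so a window of length l yields 9 − l + 1 features rather than the nine shown in the figure), and summing the pooled values across window sizes is an assumption read off the "+" signs in the figure.

```python
import math

def embed(word):
    """Toy 2-D word embedding (hypothetical; a real system learns these vectors)."""
    return [len(word) / 10.0, (ord(word[0].lower()) - 96) / 26.0]

def conv_features(word_vecs, filt, bias=0.0):
    """Compute c_i = tanh(f . w_{i:i+l-1} + b) for every window position i."""
    dim = len(word_vecs[0])
    l = len(filt) // dim  # window length in words
    feats = []
    for i in range(len(word_vecs) - l + 1):
        # Concatenate the l word vectors covered by the window.
        window = [x for vec in word_vecs[i:i + l] for x in vec]
        feats.append(math.tanh(sum(f * w for f, w in zip(filt, window)) + bias))
    return feats

def sentence_feature(word_vecs, filters_by_window, bias=0.0):
    """Max-pool each filter over positions, then sum across window sizes."""
    return sum(max(conv_features(word_vecs, f, bias)) for f in filters_by_window)

utterance = "I am looking for a cheap hotel near here".split()
vecs = [embed(w) for w in utterance]
bigram_filter = [0.5, -0.5, 0.25, 0.25]   # arbitrary weights, window of 2 words
unigram_filter = [0.3, 0.7]               # arbitrary weights, window of 1 word
r = sentence_feature(vecs, [unigram_filter, bigram_filter])
```

Each entry of the sentence representation would come from one such call with its own set of filters.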

Understanding: ASR -> Beliefs


Henderson, M., et al. (2014). Word-Based Dialog State Tracking with Recurrent Neural Networks. SigDial 2014, Philadelphia, PA.

Rojas-Barahona, L., et al. (2016). Exploiting Sentence and Context Representations in Deep Neural Models for Spoken Language Understanding. Coling, Osaka, Japan.

Mrksic, N., et al. (2016). Neural Belief Tracker: Data-Driven Dialogue State Tracking. arXiv:1606.03777

Generation: actions -> words

Need to convert abstract system actions to natural language, e.g.

inform(name="The Peking", food="chinese") -> "The Peking serves chinese food"

Solution: delexicalise the training data, and train a conditional LSTM (SC-LSTM).

inform(name=<name>, food=<food>) -> "<name> serves <food> food"

[Figure: the SC-LSTM is conditioned on the act inform(<name>, <food>). Starting from <s>, it takes each token as input and emits the next one: <name>, serves, <food>, food. Arrows between the lexicalised and delexicalised forms are labelled "training" and "running". A second example, request(<food>), survives only in part: the words "you" and "like?".]
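Delexicalisation itself is simple string substitution, and a short sketch may make the training and running directions concrete. The function names below are illustrative only, and the naive `str.replace` matching is an assumption that works for this example but would need proper tokenisation for slot values that occur inside other words.

```python
def delexicalise(slots, sentence):
    """Replace each slot value in the sentence with a <slot> placeholder."""
    for slot, value in slots.items():
        sentence = sentence.replace(value, f"<{slot}>")
    return sentence

def relexicalise(slots, template):
    """Fill the <slot> placeholders of a generated template with real values."""
    for slot, value in slots.items():
        template = template.replace(f"<{slot}>", value)
    return template

slots = {"name": "The Peking", "food": "chinese"}
template = delexicalise(slots, "The Peking serves chinese food")
surface = relexicalise(slots, template)
```

Here `template` is the delexicalised form used to train the generator, and `surface` is the sentence presented to the user after the generator has produced a template.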

Semantically constrained LSTM

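[Figure: SC-LSTM cell. The LSTM part (gates i, f, o and cell c) takes the word w_t and the previous hidden state h_{t−1} and outputs h_t, producing the word sequence. The semantic conditioning part takes the system dialog act and, through gate r, updates d_{t−1} to d_t.]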

Dialog Manager

[Figure: the belief state b holds distributions over Domain (Weather, Other), Location (Local, Maine) and Weather Condition (Temp, Rain, Wind). The policy π maps b to an action a. Actions: request, confirm, inform, execute, etc.]

1. Belief state b encodes the state of the dialog, including all relevant history.

2. Belief state is updated every turn of the dialog.

3. The policy determines the best action to make at each turn via a mapping from the belief state b to actions a.

4. Every dialog ends with a reward: +ve for success, -ve for failure. Plus a weak -ve reward for every turn to encourage brevity.

5. Reinforcement Learning is used to find the best policy.
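The reward scheme in point 4 can be written down directly. The sketch below is only an illustration: the slides give no numbers, so the values +20, −20 and −1 per turn are assumptions chosen to show the shape of the signal.

```python
def dialogue_return(num_turns, success,
                    success_reward=20.0,   # assumed value, not from the slides
                    failure_reward=-20.0,  # assumed value, not from the slides
                    turn_penalty=-1.0):    # assumed weak per-turn penalty
    """Total (undiscounted) reward for one dialogue."""
    final = success_reward if success else failure_reward
    return num_turns * turn_penalty + final

short_success = dialogue_return(4, True)
long_success = dialogue_return(7, True)
failure = dialogue_return(4, False)
```

With these values a successful dialogue always scores higher than a failed one of the same length, and among successful dialogues the shorter one scores higher, which is the brevity incentive described above.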


Reinforcement Learning

Policy: $\pi(b,a) : \mathbb{R}^n \times A \to [0,1]$

Reward: $R = \sum_{\tau=1}^{T} r(b_\tau, a_\tau)$ (NB: no discounting)

Problem: find $\pi^* = \arg\max_\pi \{ E[R \mid \pi] \}$

Policy Representation

• Gaussian Processes: data efficient, includes explicit confidence on Q-value. Can support large n, but action space |A| limited.

• Deep Neural Networks: scale well on both n and |A|, but no built-in confidence measure and poor convergence properties.

$\pi(b,a) : \mathbb{R}^n \times A \to [0,1]$, with n ~ 20 - 100 and |A| ~ 200+

Training Data

• Ideally, train directly on interactions with real users but
  ✦ training even a small domain may require around 5k dialogues (many in exploration mode)
  ✦ reward signal is hard to measure (see later)

• In practice, train in stages
  ✦ initialise with corpus data
  ✦ train/test on user simulator
  ✦ tune on real users

Optimisation Algorithms

• Policy Iteration
  ✦ GP Sarsa
  ✦ Deep Q-learning

• Policy Gradient
  ✦ Natural Actor Critic

• “Black box” methods
  ✦ Trust Regions

NAC trained Neural Net Policy vs GP Policy

1. NN policy: 1 common 32 node tanh hidden layer. Action outputs encoded via 2 softmax output partitions and 6 sigmoid partitions.

2. Pre-trained (using SL for NN and prior for GP) on 720 dialogs from Cambridge restaurant domain.

3. Optimised (using RL) on 5000 simulated dialogues.

Simulation Results: SL 94.5%, SL+RL 98.2%

Real User Results: NN Policy trained and tested on-line with real users.

Su, P-H, et al., Continuously Learning Neural Dialogue Management, arXiv:1606.02689
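The network shape in point 1 can be sketched as a forward pass in plain Python. This is a hedged illustration rather than the published model: the belief size (20), the sizes of the two softmax partitions (5 and 3), the choice of one unit per sigmoid partition, and the random initial weights are all assumptions, since the slide gives only the hidden layer width and the number of partitions.

```python
import math
import random

def linear(x, weights, biases):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class PolicyNet:
    """Shared 32-node tanh hidden layer feeding softmax and sigmoid output partitions."""

    def __init__(self, n_belief, softmax_sizes, n_sigmoid, hidden=32, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(n_belief, hidden, rng)
        self.softmax_heads = [init_layer(hidden, k, rng) for k in softmax_sizes]
        # Assumption: each sigmoid partition is modelled as a single unit.
        self.sigmoid_heads = [init_layer(hidden, 1, rng) for _ in range(n_sigmoid)]

    def forward(self, belief):
        h = [math.tanh(v) for v in linear(belief, *self.hidden)]
        soft = [softmax(linear(h, *head)) for head in self.softmax_heads]
        sig = [sigmoid(linear(h, *head)[0]) for head in self.sigmoid_heads]
        return soft, sig

net = PolicyNet(n_belief=20, softmax_sizes=[5, 3], n_sigmoid=6)
soft, sig = net.forward([0.05] * 20)
```

Each softmax partition yields a distribution over mutually exclusive choices, while each sigmoid output is an independent probability.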

Curse of Dimensionality

[Figure: the belief space grows along two axes, Domain Complexity and Multiple Domains. The four quadrants are listed below.]

• Single domain, simple types (Restaurant Domain)
  “I am looking for a cheap italian restaurant.”
  action=search venue=restaurant price=cheap food=italian

• Multi-domain, simple types (Restaurant Domain, Calendar Domain)
  “Book a table at Nando’s after my meeting with Bill.”
  action=book venue=restaurant name=Nando’s when=??
  action=lookup event=meeting attendee=Bill

• Single domain, complex types
  “Book a table at 7:45pm tomorrow.”
  action=book venue=restaurant when={time(19:45), date(today+1)}

• Multi-domain, complex types
  “Book a table at Nando’s for 7:45pm tomorrow and invite Bill and John”
  action=book venue=restaurant name=Nando’s when={time(19:45), date(today+1)}
  action=create event=meeting attendees={“Bill”, “John”}

Bayesian Committee Machines

Assume M independent policies and a common belief state.

[Figure: the common belief state b feeds the policies for Domain 1, Domain 2, …, Domain i, …, which output Q1, Q2, …, Qi, …. These are combined as $\hat{Q} = f(Q_1, \ldots, Q_i, \ldots)$ and the action is chosen as $\arg\max_a \{ \hat{Q}(b,a) \}$. The reward r(b,a) is distributed to all committee members, scaled by contribution to the actual selected action.]

$\hat{Q} \sim N(\bar{Q}, \Sigma^Q)$ where

$\bar{Q} = \Sigma^Q \sum_{i=1}^{M} (\Sigma_i^Q)^{-1} \bar{Q}_i$

$\Sigma^Q = \left[ \sum_{i=1}^{M} (\Sigma_i^Q)^{-1} - \text{const} \right]^{-1}$

Example using GP-RL:

Three domains trained from scratch on line both individually and in parallel:

• Hotel info
• Restaurant info
• Laptop product guide

[Figure: Reward against Number of Training Dialogues, comparing "Laptop domain trained in parallel with Hotels and Restaurants" with "Laptop domain trained in isolation".]

M. Gasic et al (2015). "Policy Committee for Adaptation in Multi-domain Spoken Dialogue Systems." IEEE ASRU 2015, Scottsdale, AZ.
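The committee combination is a precision-weighted average, which a scalar sketch makes easy to see. The code below assumes each committee member reports a mean and a variance for one (b, a) pair; the `const` term is left as a parameter defaulting to 0 because the slide does not give its value.

```python
def bcm_combine(means, variances, const=0.0):
    """Combine per-domain Q estimates into a committee mean and variance."""
    precision = sum(1.0 / v for v in variances) - const
    if precision <= 0.0:
        raise ValueError("combined precision must be positive")
    combined_var = 1.0 / precision
    combined_mean = combined_var * sum(m / v for m, v in zip(means, variances))
    return combined_mean, combined_var

# Two equally confident members: the committee mean is the plain average.
equal = bcm_combine([1.0, 3.0], [1.0, 1.0])

# The first member is more confident (smaller variance), so it dominates.
skewed = bcm_combine([1.0, 3.0], [1.0, 4.0])
```

Members with low variance dominate the combined estimate, so a well-trained domain policy can guide the committee while a new domain is still uncertain.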

Domain Complexity

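System: How can I help?
User: Fix a meeting
System: Who with?
User: Bill
System: What time?
User: 5.30
System: Was that 9.30?
User: No, 5.30
System: 5.30pm?
User: Yes
System: Ok meeting at 5.30pm with Bill?
User: Yes
System: Meeting is scheduled

[Figure: the policy π calendar runs over the seven system turns above, with belief states b1 … b7, actions a1 … a7 and rewards r1 … r7. The time sub-dialogue (b3, a3, b4, a4, b5, a5) is marked GetTime.]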

Hierarchical Reinforcement Learning

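[Figure: the same trajectory b1, a1, …, b7, a7 under π calendar. Turns 3 to 5 (b3, a3, b4, a4, b5, a5) form the GetTime sub-dialogue, handled by a separate policy π time. Rewards are grouped as r1 r2, r3 + r4 + r5, and r6 r7.]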

Hierarchical Deep Reinforcement Learning

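[Figure: two levels of DQN. At the top meta-level, DQNθ reads the belief state and selects the next subgoal g_t. At the subgoal level (e.g. GetTime), DQNλ takes b_t, b_{t+1}, …, b_{t+N} together with g_t and selects the actions a_t, a_{t+1}, …, a_{t+N}. At step t+N, DQNθ selects the next subgoal g_{t+N}.]

T. Kulkarni et al (2016). "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." arXiv:1604.06057.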
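The two-level control loop can be illustrated with a toy sketch. Everything here is hypothetical: the hand-written rules stand in for the learned DQNs, the belief state is reduced to one confidence value per slot, and `cooperative_user` is an invented stand-in for the user and belief tracker.

```python
def meta_policy(belief):
    """Top meta-level: pick the next subgoal (toy rule, not a learned DQN)."""
    for slot in ("person", "time"):
        if belief[slot] < 0.9:
            return slot
    return None  # every slot is confident enough; the meeting can be scheduled

def sub_policy(belief, goal):
    """Subgoal level: choose a primitive action for the current subgoal."""
    if belief[goal] < 0.5:
        return f"request({goal})"
    return f"confirm({goal})"

def cooperative_user(belief, action):
    """Invented user model: a request lifts confidence to 0.6, a confirm to 0.95."""
    slot = action[action.index("(") + 1:-1]
    updated = dict(belief)
    updated[slot] = 0.6 if action.startswith("request") else 0.95
    return updated

def run_dialogue(belief, user_model, max_turns=20):
    actions = []
    while len(actions) < max_turns:
        goal = meta_policy(belief)
        if goal is None:
            actions.append("execute()")
            break
        # The subgoal stays fixed until it is achieved, as g_t does in the figure.
        while belief[goal] < 0.9 and len(actions) < max_turns:
            action = sub_policy(belief, goal)
            actions.append(action)
            belief = user_model(belief, action)
    return actions

trace = run_dialogue({"person": 0.0, "time": 0.0}, cooperative_user)
```

The point of the structure is that the sub-policy only has to solve a short, self-contained problem, while the top level reasons over subgoals rather than individual turns.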

Measuring Success

Task success is not always obvious….

[Figure: the meeting-scheduling dialogue under π calendar (b1, a1, …, b7, a7 with rewards r1 … r7), from "How can I help?" through the misrecognised time ("Was that 9.30?" / "No, 5.30") to the confirmation "Ok meeting at 5.30pm with Bill?" / "Yes" and the final "Meeting is scheduled".]

….so probably ok ✔

Measuring Success

However, what about the problematic weather query?

System: How can I help?
User: How's the weather in Maine
System: It’ll be fine all day in the Bay area.
User: No, Maine
System: I know your name Steve, it’s “Steve”.
User: I want the weather in Maine!
System: I don't believe it’s raining right now.

[Figure: the dialogue above shown as a trajectory b1, a1, …, b4, a4 with rewards r1 … r4.]

On-line Reward Estimation

[Figure: the dialogue trajectory (b1, a1, …, b4, a4 with rewards r1 … r4) supplies Episodic Dialogue Features. An LSTM encoder maps them to a 64-D embedding, and a GP-based Reward Estimator outputs the Estimated Reward Signal. If low confidence, then prompt the User for feedback: “good” or “bad”.]
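The "ask only when unsure" rule can be sketched as follows. This is a simplification under stated assumptions: the estimator is reduced to a single success probability `p_success`, whereas a GP would supply a predictive variance, and the threshold of 0.85 and the `ask_user` callback are invented for illustration.

```python
def label_dialogue(p_success, ask_user, threshold=0.85):
    """Return (label, queried) for one finished dialogue.

    p_success: estimated probability that the dialogue succeeded.
    ask_user:  callback that prompts the user and returns "good" or "bad".
    """
    confidence = max(p_success, 1.0 - p_success)
    if confidence < threshold:
        # The estimator is unsure, so spend a user query on this dialogue.
        return ask_user(), True
    # The estimator is confident, so use its own prediction as the label.
    return ("good" if p_success >= 0.5 else "bad"), False

confident_good = label_dialogue(0.95, lambda: "bad")
uncertain = label_dialogue(0.60, lambda: "bad")
confident_bad = label_dialogue(0.05, lambda: "good")
```

Labels obtained from the user can then be fed back to retrain the estimator, so the number of prompts should fall as it improves.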

On-line Reward and Policy Learning


P-H. Su et al (2016). "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems." ACL 2016, Berlin.

Summary

• POMDPs and Reinforcement Learning provide a powerful mathematical framework for decision making in intelligent conversational agents.

• DNNs provide a flexible building block for all stages of the dialogue system pipeline, though training is often problematic.

• Unrestricted conversation is challenging but there are several promising approaches to managing complexity.

• For commercially deployed systems, the user is a tremendous untapped resource, and Reinforcement Learning provides the framework for exploiting it.

Credits

All members of the Cambridge Dialogue Systems Group, Past and Present:

Milica Gasic, Catherine Breslin, Pawel Budzianowski, Matt Henderson, Filip Jurcicek, Simon Keizer, Dongho Kim, Fabrice Lefevre, Francois Mairesse, Nikola Mrksic, Lina Rojas Barahona, Jost Schatzmann, Matt Stuttle, Martin Szummer, Eddy Su, Blaise Thomson, Pirros Tsiakoulis, Stefan Ultes, David Vandyke, Karl Weilhammer, Shawn Wen, Jason Williams, Hui Ye, Kai Yu
