Statistical Spoken Dialogue Systems and the Challenges for Machine Learning
Steve Young
Dialogue Systems Group, Machine Intelligence Laboratory, Cambridge University Engineering Department, Cambridge, UK
Dialog System Architecture
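[Figure: dialogue system pipeline. Understanding: ASR, Semantic Decoder, Belief Tracker. Dialog Manager: Dialog Policy, Database/Application. Generation: Response Planner, Message Generator, TTS. Understanding and Generation are each divided into a Turn Level and a Dialogue Level. Signals passed between stages: Recognition Hypotheses, Belief State, System Actions, System Response. The User closes the loop.]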
Understanding: ASR -> Beliefs
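[Figure: understanding network. Per-turn semantic decoding: ASR Hyp#1 and Hyp#2 (probabilities p1, p2) and the Last System Act pass through word embeddings (WE) and a CNN; the hypothesis representations are weighted by p1, p2 and summed. Per-utterance belief tracking: an LSTM followed by a SoftMax outputs P_s(v). Repeated for each slot s.]
Belief State = Concatenation of Slot Probability Vectors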
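As a minimal sketch of the last step in the figure, assume each slot's tracker emits a vector of raw scores over that slot's values. A softmax turns each score vector into P_s(v), and the belief state is the concatenation of those vectors. The slot names, values and scores below are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability vector that sums to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def belief_state(slot_scores):
    """Concatenate the per-slot probability vectors P_s(v) in a fixed slot order."""
    belief = []
    for slot in sorted(slot_scores):        # fixed order keeps vector positions stable
        belief.extend(softmax(slot_scores[slot]))
    return belief

# Hypothetical tracker outputs: one score per candidate value of each slot.
scores = {
    "food": [2.0, 0.5, 0.1],     # e.g. chinese, italian, none
    "price": [0.2, 1.7],         # e.g. cheap, expensive
}
b = belief_state(scores)
print(len(b), round(sum(b), 6))
```

Each slot's segment sums to 1, so the whole vector sums to the number of slots rather than to 1.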
Using a CNN to Extract Lexical Features
[Figure: features c_1 ... c_9 computed over the nine words of the utterance “I am looking for a cheap hotel near here”.]
Slide convolution filter k of length l over the utterance:
$c_i = \tanh\left(f_{kl} \cdot w_{i:i+l-1} + b\right)$
CNN is the key component: it scans each utterance applying convolution windows of 1, 2, 3, 4, … words
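The scan can be sketched in plain Python as follows. This is an illustration only: the embeddings and filter weights are made up, and no padding is applied, so a filter of length l yields n - l + 1 features for an n-word utterance. The max-pooling step is an assumption taken from the "max" label in the slide's figure.

```python
import math

def conv_features(embeddings, filt, bias):
    """Slide one filter of length l over the utterance.

    embeddings: list of word vectors w_1..w_n (each a list of floats)
    filt:       l weight vectors, one per window position
    Returns c_i = tanh(f . w_{i:i+l-1} + b) for every window start i.
    """
    l = len(filt)
    feats = []
    for i in range(len(embeddings) - l + 1):
        window = embeddings[i:i + l]
        dot = sum(fw * w
                  for f_vec, w_vec in zip(filt, window)
                  for fw, w in zip(f_vec, w_vec))
        feats.append(math.tanh(dot + bias))
    return feats

def max_pool(feats):
    """Keep the strongest response of the filter anywhere in the utterance."""
    return max(feats)

# Toy 2-D embeddings for a 9-word utterance (values are made up).
utterance = "I am looking for a cheap hotel near here".split()
emb = [[0.1 * (i + 1), -0.05 * i] for i in range(len(utterance))]
filt = [[0.5, -0.2], [0.3, 0.1], [-0.4, 0.2]]   # one filter with window size l = 3

c = conv_features(emb, filt, bias=0.0)
print(len(c), max_pool(c))
```

Repeating this for several filters and window sizes gives one pooled value per filter, which is the raw material for a fixed-length sentence representation.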
[Figure: pooled filter outputs r_{kl}, indexed by filter number k = 1..5 and window size l = 1..4 (f_{43} marks one filter of the CNN). Each output is a max over the utterance, and outputs are summed across window sizes to give the sentence representation r_1 ... r_5.]
Henderson, M., et al. (2014). Word-Based Dialog State Tracking with Recurrent Neural Networks. SigDial 2014, Philadelphia, PA.
Rojas-Barahona, L., et al. (2016). Exploiting Sentence and Context Representations in Deep Neural Models for Spoken Language Understanding. Coling, Osaka, Japan.
Mrksic, N., et al. (2016). Neural Belief Tracker: Data-Driven Dialogue State Tracking. arXiv:1606.03777
Generation: actions -> words
Need to convert abstract system actions to natural language, e.g.
inform(name=“The Peking”, food=“chinese”) -> “The Peking serves chinese food”
inform(name=<name>, food=<food>) -> “<name> serves <food> food”
[Figure: SC-LSTM conditioned on inform(<name>, <food>). Inputs <s>, <name>, serves, <food> produce outputs <name>, serves, <food>, food, one word per step. Panels are labelled “training” and “running”.]
Solution: delexicalise the training data, and train a conditional LSTM.
[Figure: SC-LSTM conditioned on request(<food>); the visible output words include “you” and “like?”.]
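Delexicalising the training data can be sketched with plain string substitution, as below. The helper names `delexicalise` and `relexicalise` are hypothetical, and a real system would need more careful matching than `str.replace` (for example when a slot value appears inside another word).

```python
def delexicalise(sentence, slots):
    """Replace each slot value in the sentence with a <slot> placeholder."""
    for name, value in slots.items():
        sentence = sentence.replace(value, f"<{name}>")
    return sentence

def relexicalise(template, slots):
    """Fill the placeholders of a generated template with real values."""
    for name, value in slots.items():
        template = template.replace(f"<{name}>", value)
    return template

slots = {"name": "The Peking", "food": "chinese"}
template = delexicalise("The Peking serves chinese food", slots)
print(template)
print(relexicalise(template, slots))
```

The generator is trained on the delexicalised pairs, and the placeholders are filled back in after generation.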
Semantically Conditioned LSTM (SC-LSTM)
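[Figure: SC-LSTM cell. An LSTM cell with gates i, f, o and cell c takes the word w_t and previous hidden state h_{t-1}, outputs h_t, and generates the word sequence. A reading gate r updates the dialogue-act vector d_{t-1} to d_t, which supplies the semantic conditioning from the system dialog act.]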
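For readers who want the update behind the diagram, the following is a sketch of the SC-LSTM cell as it is usually written in the literature; it is not taken from the slide and is slightly simplified. The weight matrices W are assumed names, σ is the logistic sigmoid, ⊙ is the element-wise product, and ĉ_t is the usual LSTM candidate cell value. The reading gate r_t controls how much of the dialogue-act vector d_{t-1} is still left to express, and d_t is added into the cell c_t.

```latex
\begin{align}
r_t &= \sigma(W_{wr} w_t + W_{hr} h_{t-1}) \\
d_t &= r_t \odot d_{t-1} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \hat{c}_t + \tanh(W_{dc} d_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```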
Dialog Manager
[Figure: belief state b shown as distributions over Domain (Weather, Other), Location (Local, Maine) and Weather Condition (Temp, Rain, Wind); the policy π maps b to an action a.]
Actions: request, confirm, inform, execute, etc.
1. Belief state b encodes the state of the dialog, including all relevant history.
2. Belief state is updated every turn of the dialog.
3. The policy determines the best action to make at each turn via a mapping from the belief state b to actions a.
4. Every dialog ends with a reward: +ve for success, -ve for failure. Plus a weak -ve reward for every turn to encourage brevity.
5. Reinforcement Learning is used to find the best policy.
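The five points above can be sketched as a loop plus a reward function. Everything concrete here is an assumption for illustration: the reward magnitudes (+20, -20, -1 per turn), the set-based belief and the two-action policy are hypothetical stand-ins, not values from the slides.

```python
def dialogue_return(turns, success, turn_penalty=-1.0,
                    success_reward=20.0, failure_reward=-20.0):
    """Total reward for one dialogue: a small penalty per turn plus a final
    positive or negative reward depending on task success."""
    final = success_reward if success else failure_reward
    return turns * turn_penalty + final

def run_dialogue(policy, update_belief, observations, initial_belief):
    """Update the belief every turn, then let the policy map it to an action."""
    belief, actions = initial_belief, []
    for obs in observations:
        belief = update_belief(belief, obs)   # point 2: per-turn belief update
        actions.append(policy(belief))        # point 3: policy maps b -> a
    return actions

# Hypothetical stand-ins: the belief is just the set of slots heard so far.
update = lambda b, obs: b | {obs}
policy = lambda b: "inform" if {"food", "area"} <= b else "request"

acts = run_dialogue(policy, update, ["food", "area"], frozenset())
print(acts, dialogue_return(turns=len(acts), success=True))
```

The per-turn penalty is what makes a shorter successful dialogue score higher than a longer one.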
Reinforcement Learning
Policy: $\pi(b,a): \mathbb{R}^n \times A \to [0,1]$
Reward: $R = \sum_{\tau=1}^{T} r(b_\tau, a_\tau)$ (NB: no discounting)
Problem: find $\pi^* = \arg\max_\pi \{ E[R \mid \pi] \}$
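A minimal sketch of this objective, assuming a finite set of candidate policies and a simulator we can sample from: estimate E[R | π] by Monte Carlo and take the argmax. The simulator, the reward values and the policy names are hypothetical; a real policy space is continuous and is searched with the algorithms discussed later.

```python
import random

def episode_return(rewards):
    """Undiscounted return R = sum over tau of r(b_tau, a_tau)."""
    return sum(rewards)

def expected_return(policy, simulate, episodes=200, seed=0):
    """Monte Carlo estimate of E[R | pi] from simulated dialogues."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        total += episode_return(simulate(policy, rng))
    return total / episodes

def best_policy(policies, simulate):
    """pi* = argmax over a finite candidate set of the estimated E[R | pi]."""
    return max(policies, key=lambda name: expected_return(policies[name], simulate))

# Hypothetical simulator: a policy is reduced to its per-dialogue success probability.
def simulate(success_prob, rng):
    turns = 5
    final = 20.0 if rng.random() < success_prob else -20.0
    return [-1.0] * turns + [final]

candidates = {"cautious": 0.9, "hasty": 0.4}
print(best_policy(candidates, simulate))
```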
Policy Representation
• Gaussian Processes: data efficient, includes explicit confidence on Q-value. Can support large n, but action space |A| limited.
• Deep Neural Networks: scale well on both n and |A|, but no built-in confidence measure and poor convergence properties.
$\pi(b,a): \mathbb{R}^n \times A \to [0,1]$, with n ~ 20-100 and |A| ~ 200+
Training Data
• Ideally, train directly on interactions with real users, but
  ✦ training even a small domain may require around 5k dialogues (many in exploration mode)
  ✦ reward signal is hard to measure (see later)
• In practice, train in stages
  ✦ initialise with corpus data
  ✦ train/test on user simulator
  ✦ tune on real users
Optimisation Algorithms
• Policy Iteration
  ✦ GP Sarsa
  ✦ Deep Q-learning
• Policy Gradient
  ✦ Natural Actor Critic
• “Black box” methods
  ✦ Trust Regions
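To make the Sarsa entry concrete, here is the basic on-policy Sarsa update in tabular form. This is a deliberately simplified stand-in: GP Sarsa replaces the table with a Gaussian process over (b, a), which is not shown. The discrete belief labels are hypothetical, and gamma = 1.0 reflects the undiscounted reward used in this talk.

```python
from collections import defaultdict

def sarsa_update(q, b, a, r, b_next, a_next, alpha=0.1, gamma=1.0):
    """One on-policy SARSA step: move Q(b, a) towards r + gamma * Q(b', a')."""
    target = r + gamma * q[(b_next, a_next)]
    q[(b, a)] += alpha * (target - q[(b, a)])
    return q[(b, a)]

q = defaultdict(float)                 # Q-values default to 0.0
# Hypothetical transition: discrete belief summaries stand in for b.
q[("slots_filled", "inform")] = 10.0
new_value = sarsa_update(q, "slot_missing", "request", -1.0,
                         "slots_filled", "inform")
print(new_value)
```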
NAC trained Neural Net Policy vs GP Policy
1. NN policy: one common 32-node tanh hidden layer. Action outputs are encoded via 2 softmax output partitions and 6 sigmoid partitions.
2. Pre-trained (using SL for the NN and a prior for the GP) on 720 dialogs from the Cambridge restaurant domain.
3. Optimised (using RL) on 5000 simulated dialogues.
SL 94.5%, SL+RL 98.2%
[Figure: Simulation Results]
[Figure: Real User Results. NN policy trained and tested on-line with real users.]
Su, P-H, et al., Continuously Learning Neural Dialogue Management, arXiv:1606.02689
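The network shape in point 1 can be sketched as a forward pass in plain Python. The slide gives only the hidden layer size and the number of output partitions, so the belief dimension (20), the partition sizes (5, 4 and 3) and the random weights are assumptions for illustration.

```python
import math
import random

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def policy_forward(belief, hidden, heads):
    """Shared tanh hidden layer followed by independent output partitions."""
    h = [math.tanh(z) for z in linear(*hidden, belief)]
    outputs = []
    for kind, layer in heads:
        z = linear(*layer, h)
        outputs.append(softmax(z) if kind == "softmax" else [sigmoid(v) for v in z])
    return outputs

rng = random.Random(0)
belief = [rng.random() for _ in range(20)]          # assumed n = 20 belief features
hidden = init_layer(32, len(belief), rng)           # 32-node hidden layer
# Partition sizes are hypothetical; the slide gives only the partition counts.
heads = [("softmax", init_layer(size, 32, rng)) for size in (5, 4)]
heads += [("sigmoid", init_layer(3, 32, rng)) for _ in range(6)]
out = policy_forward(belief, hidden, heads)
print(len(out), [len(p) for p in out])
```

Softmax partitions pick one option among mutually exclusive choices, while each sigmoid output is an independent on/off decision.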
Curse of Dimensionality
[Figure: the Belief Space grows along two axes, Domain Complexity and Multiple Domains.]
Single domain, simple types (Restaurant Domain):
“I am looking for a cheap italian restaurant.”
action=search venue=restaurant price=cheap food=italian
Multi-domain, simple types (Restaurant Domain, Calendar Domain):
“Book a table at Nando’s after my meeting with Bill.”
action=book venue=restaurant name=Nando’s when=??
action=lookup event=meeting attendee=Bill
Single domain, complex types:
“Book a table at 7:45pm tomorrow.”
action=book venue=restaurant when={time(19:45), date(today+1)}
Multi-domain, complex types:
“Book a table at Nando’s for 7:45pm tomorrow and invite Bill and John”
action=book venue=restaurant name=Nando’s when={time(19:45), date(today+1)}
action=create event=meeting attendees={“Bill”, “John”}
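One way to picture the difference between the four cases is as data structures. In this sketch, a simple-type frame is a flat dict of strings, while the multi-domain, complex-type case needs a list of frames whose values can themselves be structured. The parser and the dict layout are illustrative assumptions, not the representation used by the systems described here.

```python
def parse_frame(text):
    """Parse 'key=value key=value ...' into a dict; values here have no spaces."""
    return dict(pair.split("=", 1) for pair in text.split())

single = parse_frame("action=search venue=restaurant price=cheap food=italian")

# Multi-domain, complex types: one frame per domain, with structured values.
multi = [
    {"action": "book", "venue": "restaurant", "name": "Nando's",
     "when": {"time": "19:45", "date": "today+1"}},
    {"action": "create", "event": "meeting", "attendees": ["Bill", "John"]},
]
print(single["food"], len(multi), multi[0]["when"]["time"])
```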
Bayesian Committee Machines
Assume M independent policies and a common belief state
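[Figure: committee of domain policies. The common belief state b feeds Domain 1, Domain 2, ..., Domain i, which produce Q_1, Q_2, ..., Q_i. These are combined as $\hat{Q} = f(Q_1, \ldots, Q_i, \ldots)$ and the selected action is $\arg\max_a \{\hat{Q}(b,a)\}$.]
The reward r(b,a) is distributed to all committee members, scaled by their contribution to the action actually selected.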
Example using GP-RL:
M. Gasic et al (2015). "Policy Committee for Adaptation in Multi-domain Spoken Dialogue Systems." IEEE ASRU 2015, Scottsdale, AZ.
Three domains trained from scratch on-line, both individually and in parallel:
• Hotel info
• Restaurant info
• Laptop product guide
[Figure: Reward against Number of Training Dialogues, comparing the Laptop domain trained in parallel with Hotels and Restaurants against the Laptop domain trained in isolation.]
$\hat{Q} \sim N(Q, \Sigma^Q)$, where
$Q = \Sigma^Q \sum_{i=1}^{M} (\Sigma_i^Q)^{-1} Q_i$
$\Sigma^Q = \left[ \sum_{i=1}^{M} (\Sigma_i^Q)^{-1} - \text{const} \right]^{-1}$
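The combination rule can be checked numerically in the scalar case, where each committee member reports a mean Q_i and a variance Σ_i^Q for a single (b, a) pair. This is a sketch under assumptions: the value of the constant is not given on the slide, so `const` defaults to 0.0 as a placeholder, and action selection here uses the committee mean rather than a sample from the distribution.

```python
def committee_q(means, variances, const=0.0):
    """Combine M Gaussian Q estimates for one (b, a) pair, scalar case.

    means[i], variances[i]: Q_i and Sigma_i^Q from committee member i.
    const: the constant subtracted from the summed precisions; its value
           is not given on the slide, so 0.0 here is only a placeholder.
    Returns (Q, Sigma^Q).
    """
    precision = sum(1.0 / v for v in variances) - const
    if precision <= 0.0:
        raise ValueError("combined precision must be positive")
    sigma_q = 1.0 / precision
    q = sigma_q * sum(m / v for m, v in zip(means, variances))
    return q, sigma_q

def select_action(estimates, const=0.0):
    """argmax_a of the committee mean; estimates maps action -> (means, variances)."""
    return max(estimates, key=lambda a: committee_q(*estimates[a], const)[0])

# Hypothetical estimates from two committee members for two actions.
estimates = {
    "inform": ([1.0, 3.0], [1.0, 1.0]),
    "request": ([2.5, 0.5], [4.0, 0.25]),
}
print(committee_q(*estimates["inform"]), select_action(estimates))
```

Members with small variance dominate the combined estimate, which is how a confident domain policy can outvote an uncertain one.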
Domain Complexity
[Figure: the calendar policy π_calendar over turns (b1, a1) to (b7, a7) with rewards r1 to r7; turns 3 to 5 are marked as the GetTime sub-dialogue.]
System: How can I help?
User: Fix a meeting
System: Who with?
User: Bill
System: What time?
User: 5.30
System: Was that 9.30?
User: No, 5.30
System: 5.30pm?
User: Yes
System: Ok meeting at 5.30pm with Bill?
User: Yes
System: Meeting is scheduled
Hierarchical Reinforcement Learning
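[Figure: the same dialogue handled hierarchically. The top-level policy π_calendar covers turns 1, 2, 6 and 7 (rewards r1, r2, r6, r7). The GetTime subtask, turns 3 to 5, is delegated to a sub-policy π_time, and its rewards r3 + r4 + r5 are summed.]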
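From the top-level policy's point of view, the whole GetTime subtask behaves like a single step whose reward is the sum of the rewards earned inside it. A minimal sketch of that bookkeeping follows; the per-turn reward values are hypothetical.

```python
def top_level_rewards(rewards, subtask_span):
    """Collapse the rewards earned inside a subtask into one top-level reward.

    rewards:      per-turn rewards r1..rT
    subtask_span: (start, end) turn indices, 0-based and end-exclusive,
                  of the turns delegated to the sub-policy
    """
    start, end = subtask_span
    return rewards[:start] + [sum(rewards[start:end])] + rewards[end:]

# Seven turns; turns 3 to 5 (indices 2..4) belong to the GetTime subtask.
r = [-1, -1, -1, -1, -1, -1, 20]   # hypothetical per-turn rewards
print(top_level_rewards(r, (2, 5)))
```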
Hierarchical Deep Reinforcement Learning
T. Kulkarni et al (2016). "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." arXiv:1604.06057.
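[Figure: two-level DQN hierarchy. The top meta-level network DQN_θ reads b_t (and later b_{t+N}) and chooses the next subgoal g_t (later g_{t+N}). The subgoal-level network DQN_λ, e.g. GetTime, reads b_t, b_{t+1}, ..., b_{t+N} together with the current subgoal and outputs actions a_t, a_{t+1}, ..., a_{t+N}.]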
Measuring Success
Task success is not always obvious….
[Figure: the calendar dialogue from the Domain Complexity slide, with turns (b1, a1) to (b7, a7), rewards r1 to r7, and the final system turn “Meeting is scheduled”.]
…so probably ok ✔
However, what about the problematic weather query?
[Figure: policy π over turns (b1, a1) to (b4, a4) with rewards r1 to r4.]
System: How can I help?
User: How's the weather in Maine
System: It’ll be fine all day in the Bay area.
User: No, Maine
System: I know your name Steve, it’s “Steve”.
User: I want the weather in Maine!
System: I don't believe it’s raining right now.
On-line Reward Estimation
[Figure: on-line reward estimation. Episodic dialogue features from turns (b1, a1) to (b4, a4) are encoded by an LSTM into a 64-D embedding. A GP-based reward estimator maps the embedding to an estimated reward signal. If its confidence is low, the user is prompted for feedback: “good” or “bad”.]
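The decision rule in the figure can be sketched as a small function: trust the estimator when it is confident, and query the user otherwise. The variance threshold, the 0.5 decision boundary and the callback interface are assumptions for illustration; the actual estimator is a GP over the dialogue embedding, which is not implemented here.

```python
def estimate_reward(mean, variance, ask_user, threshold=0.1):
    """Use the model's reward estimate unless it is too uncertain.

    mean, variance: predicted success probability and its uncertainty,
                    e.g. from a GP over the dialogue embedding
    ask_user:       callback returning "good" or "bad"
    Returns (reward_label, queried_user).
    """
    if variance > threshold:                 # low confidence: ask for feedback
        return (1 if ask_user() == "good" else 0), True
    return (1 if mean >= 0.5 else 0), False

print(estimate_reward(0.9, 0.02, ask_user=lambda: "bad"))
print(estimate_reward(0.6, 0.30, ask_user=lambda: "bad"))
```

Only the uncertain dialogues cost the user a feedback prompt, which keeps the annotation burden low as the estimator improves.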
On-line Reward and Policy Learning
P-H. Su et al (2016). "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems." ACL 2016, Berlin.
Summary
• POMDPs and Reinforcement Learning provide a powerful mathematical framework for decision making in intelligent conversational agents.
• DNNs provide a flexible building block for all stages of the dialogue system pipeline, though training is often problematic.
• Unrestricted conversation is challenging but there are several promising approaches to managing complexity.
• For commercially deployed systems, the user is a tremendous untapped resource, and Reinforcement Learning provides the framework for exploiting it.
Credits
All members of the Cambridge Dialogue Systems Group, past and present:
Milica Gasic, Catherine Breslin, Pawel Budzianowski, Matt Henderson, Filip Jurcicek, Simon Keizer, Dongho Kim, Fabrice Lefevre, Francois Mairesse, Nikola Mrksic, Lina Rojas Barahona, Jost Schatzmann, Matt Stuttle, Martin Szummer, Eddy Su, Blaise Thomson, Pirros Tsiakoulis, Stefan Ultes, David Vandyke, Karl Weilhammer, Shawn Wen, Jason Williams, Hui Ye, Kai Yu