Reinforcement learning for spoken dialog systems:
Using POMDPs for Dialog Management
Cambridge University Engineering Department, Machine Intelligence Laboratory
Steve Young
© Steve Young, 2006
Outline of Talk
the promise of statistical dialog systems
Markov Decision Processes and their limitations
Partially Observable MDPs – an intractable solution?
the Hidden Information State system – a proof of concept.
Statistical Dialog Systems
A statistical approach to dialog system design offers the following potential advantages:
formalise dialog design criteria as objective reward functions
automatically learn dialog strategies from data
allow decision making to be optimised
increase robustness to recognition/understanding errors
enable on-line dialog policy adaptation to allow the system to learn from experience
Markov Decision Processes provide the framework to do this .....
Overall, increase robustness and reduce design, implementation and maintenance costs
Dialog as a Markov Decision Process
[Figure: Dialog as an MDP. The user, with goal s_u, produces a user dialog act a_u; Speech Understanding delivers a noisy estimate ã_u of the user dialog act. A State Estimator combines this with the dialog history s_d to form the machine state s̃_m = ⟨ã_u, s̃_u, s_d⟩. The Dialog Policy π maps the machine state to a machine dialog act a_m, which Speech Generation renders back to the user. Each turn yields a reward r(s_m, a_m); Reinforcement Learning optimises the policy to maximise the return R = Σ_k γ^k r_k.]

Levin, E. and R. Pieraccini (1997). "A Stochastic Model of Computer-Human Interaction for Learning Dialog Strategies." Proc Eurospeech, Rhodes, Greece.
Levin, E., R. Pieraccini, et al. (1998). "Using Markov Decision Processes for Learning Dialog Strategies." Proc Int Conf Acoustics, Speech and Signal Processing, Seattle, USA.
Training an MDP
Key idea is to associate a value function with each state:

V^π(s_m) = E{R | s_m, π},    Q^π(s_m, a_m) = E{R | s_m, a_m, π}

A popular algorithm for implementing this is Q-Learning.

Given V or Q, policy optimisation is straightforward since if

Q^π(s_m, π'(s_m)) > V^π(s_m),  where  Q^π(s_m, π(s_m)) = V^π(s_m),

then policy π' is better than policy π.
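The Q-Learning idea above can be sketched on a toy slot-filling dialog MDP. Everything here is invented for illustration (the states, actions, transition probabilities and rewards are not from the talk): the machine asks until its slots are filled, then submits.

```python
import random

# Toy dialog MDP (hypothetical): states count how many slots are filled.
# Actions: "ask" (try to fill the next slot) or "submit" (end the dialog).
N_SLOTS = 2
STATES = range(N_SLOTS + 2)          # 0..N_SLOTS filled, plus a terminal state
TERMINAL = N_SLOTS + 1
ACTIONS = ["ask", "submit"]
GAMMA = 0.95

def step(s, a):
    """Return (next_state, reward). 'ask' fills a slot with prob 0.8."""
    if a == "submit":
        # Reward +10 only if all slots were filled before submitting.
        return TERMINAL, (10.0 if s == N_SLOTS else -10.0)
    # 'ask' costs -1 per turn and may fill the next slot.
    s_next = min(s + 1, N_SLOTS) if random.random() < 0.8 else s
    return s_next, -1.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def q_learn(episodes=5000, alpha=0.1, eps=0.1):
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # Epsilon-greedy action choice, then the standard Q-Learning update.
            a = random.choice(ACTIONS) if random.random() < eps else \
                max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            target = r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

random.seed(0)
q_learn()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_SLOTS + 1)}
print(policy)  # learned: ask until both slots are filled, then submit
```

The learned greedy policy π'(s) = argmax_a Q(s, a) is exactly the policy-improvement step described above: it is at least as good as any policy whose Q-values it dominates.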
Limitations of MDP Framework
Modelling dialog as an MDP suffers from a variety of practical problems:
the state space is huge, hence propositional content and much of the relevant history are often ignored
dialogs are fragile because the user state s_u and user dialog act a_u are uncertain, hence the estimate of the machine state s̃_m is often incorrect
recovery strategies are difficult since no information is available for backtracking
there is no principled way to handle N-best ASR output
Dialog as a Partially Observable MDP
[Figure: Dialog as a POMDP. As before, the user (with goal s_u) produces a dialog act a_u, but Speech Understanding now outputs a distribution over possible dialog acts [ã_u^1, ..., ã_u^N] (e.g. an N-best list). A Belief Estimator maintains b(s̃_m), a distribution over all possible machine states s̃_m = ⟨ã_u, s̃_u, s_d⟩. The Dialog Policy now depends on the state distribution, not just the most likely state, and Reinforcement Learning optimises the expected reward Σ_{s_m} b(s_m) r(s_m, a_m) when choosing the machine act a_m.]
Roy, N., J. Pineau, et al. (2000). "Spoken Dialog Management Using Probabilistic Reasoning." Proceedings of the ACL 2000.
Williams, J., P. Poupart, et al. (2005). "Factored Partially Observable Markov Decision Processes for Dialog Management." 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, Edinburgh.
Belief Update Equation
Belief is updated every dialog turn as follows:

b'(s'_m) = k · P(o' | a'_u) · P(a'_u | s'_m, a_m) · Σ_{s_m} P(s'_m | a_m, s_m) · b(s_m)

where:
P(o' | a'_u) — Observation Model: the new evidence, i.e. the probability of the observed recogniser output o' (which can include multiple hypotheses) given the hypothesised user dialog act a'_u
P(a'_u | s'_m, a_m) — User Action Model: the prior of a'_u given s'_m and a_m
P(s'_m | a_m, s_m) — Transition Model: the state transition probability of s'_m
b(s_m) — the old belief
k — a normalisation constant
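The update can be sketched in a few lines on a hypothetical two-goal domain. The state names, act strings and every probability below are invented for illustration; the N-best observation is summed over hypothesised user acts, with each hypothesis weighted by its confidence P(o'|a'_u).

```python
# A minimal sketch of the belief update equation above, on a hypothetical
# 2-state domain. All model probabilities here are invented for illustration.

STATES = ["want_indian", "want_chinese"]

# Transition model P(s'_m | a_m, s_m): assume user goals persist across turns.
def p_trans(s_new, a_m, s_old):
    return 1.0 if s_new == s_old else 0.0

# User action model P(a'_u | s'_m, a_m): users mostly state their actual goal.
def p_user_act(a_u, s_m, a_m):
    truthful = {"want_indian": "inform(food=Indian)",
                "want_chinese": "inform(food=Chinese)"}
    return 0.9 if a_u == truthful[s_m] else 0.1

def belief_update(belief, a_m, observation):
    """observation: list of (hypothesised user act, P(o'|a'_u)) pairs,
    e.g. an N-best list with confidence scores."""
    new_belief = {}
    for s_new in STATES:
        total = 0.0
        for a_u, p_obs in observation:
            trans = sum(p_trans(s_new, a_m, s_old) * belief[s_old]
                        for s_old in STATES)
            total += p_obs * p_user_act(a_u, s_new, a_m) * trans
        new_belief[s_new] = total
    k = sum(new_belief.values())              # normalisation constant
    return {s: v / k for s, v in new_belief.items()}

b0 = {"want_indian": 0.5, "want_chinese": 0.5}
# 2-best recogniser output: "Indian" is the more confident hypothesis.
obs = [("inform(food=Indian)", 0.7), ("inform(food=Chinese)", 0.3)]
b1 = belief_update(b0, "request(food)", obs)
print(b1)  # belief shifts toward want_indian (0.66 vs 0.34)
```

Note how the less confident second hypothesis still contributes: belief moves toward want_indian but is not committed to it, which is exactly the robustness the POMDP framework is after.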
Robustness of POMDP vs. MDP
[Figure: Expected/average return plotted against ASR error rate p_err (0.00 to 0.65) for POMDP and MDP policies; the POMDP policy degrades much more gracefully as the error rate rises. Simulation of a simple 2-slot, 3-city travel problem. Williams, J., P. Poupart, et al. (2005).]
Summary of the POMDP Framework
the system maintains multiple dialog hypotheses, called the belief state
machine actions are based on the full belief state distribution, not just the most likely state
no backtracking is required when a misunderstanding is detected
speech understanding output is regarded as an observation
the belief distribution is re-computed each time a new observation is received, in a process called belief monitoring
N-best ASRU outputs are naturally incorporated into the belief monitoring framework via an observation model
the POMDP framework naturally includes a user model which gives the probability of each user act given each possible dialog hypothesis
However, there are some issues ....
Belief Monitoring
[Figure: Belief monitoring across turns. At time t the belief state b_t(s_m) is a distribution over states s_m^1, s_m^2, s_m^3, s_m^4, ...; the policy π maps it to machine dialog act a_{m,t}; the resulting observation o_t is a distribution P(a_u) over hypothesised user acts a_u^1, a_u^2, a_u^3, ...; Belief Update then produces b_{t+1}(s_m) at time t+1, the policy selects a_{m,t+1}, and the cycle repeats.]

But the representation of these distributions in a practical system is unclear.
POMDP Value Functions
Consider a system with just two states and three actions.

[Figure: The belief space b is the line running from P(s_1) = 1 to P(s_2) = 1. For each action a_i, Q_k^π(b, a_i) is a straight line joining Q_k^π(s_1, a_i) and Q_k^π(s_2, a_i); depending on where b lies, the best choice is a_1, a_2 or a_3.]

POMDP value functions are hyperplanes in belief space. The upper surface defines the value function V(b). Exact learning is iterative and effectively intractable.
Scaling to Real Systems
POMDPs provide an elegant mathematical framework for modelling spoken dialog systems but ....
State space will be huge – direct belief monitoring is impractical.
Exact POMDP optimisation is intractable - even approximate POMDP optimisation is limited to a few thousand states
A solution – the Hidden Information State Dialog Model
The Hidden Information State Model
Partition state space and compute partition beliefs not state beliefs
Represent user goals by branching-tree driven by ontology rules.
Maintain two state spaces: master space and summary space. Monitor beliefs in master space; apply and optimise policies in summary space.
Use grid-based approximations, hence a finite policy table.
The HIS model provides a scaleable POMDP framework for implementing practical spoken dialog systems.
Structure of a HIS Dialog Hypothesis
A single hypothesised information state h = ⟨{s_u}, a_u, s_d⟩, i.e. a set of user goal states s_u (a partition of s_u) with common user act a_u and dialog history s_d.

[Figure: Example hypothesis. User goal: a tree-structured set of entities (task → find → entity → venue, with name, type = restaurant, location = street, food = Indian, addr), built incrementally from rules and expanded on demand to accommodate user dialog acts. User action: a dialog type plus goal tree bindings, e.g. inform(food=Indian). Dialog history: the grounding status of each tree node, e.g. restaurant: UserRequested; food: UserInformed; area: Grounded; name: Initial.]
HIS Partitions
Each partition represents a group of user goal states.
Partitions are stored as tree structures, with nodes defined by a task ontology.
Partitions are split by incoming user dialog acts.
When a partition is split, its belief is shared between the splits.
[Figure: Example partition tree: entity → venue(name, type, area).]

Example ontology rules:

Structure rules with prior probs:
  entity -> venue(name, type, area)   1.0
  type -> bar(drinks, music)          0.4
  type -> restaurant(food, price)     0.3

Lexical/Dbase rules:
  area -> (central | east | west | ...)
  food -> (Italian | Chinese | ...)
Partition splitting

Incoming dialog acts cause partitions to be extended and split in order to match the items in the dialog act with the nodes in the tree.

[Figure: Example. Starting from the partition entity -> venue(name, type, area), the user act request(bar) applies the rule type -> bar(drinks, music) (prior 0.4), splitting the type node into two partitions: one in which the type is bar(drinks, music), carrying belief 0.4, and the remainder, carrying belief 0.6.]
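The splitting-with-belief-sharing mechanism can be sketched as follows. The partition descriptions and the helper names (`Partition`, `split`) are illustrative, not the HIS implementation; only the rule and prior (0.4) come from the example above.

```python
# Sketch of HIS partition splitting with belief sharing: when a rule with
# prior p splits a partition, the refined split takes belief*p and the
# remainder keeps belief*(1-p), so total belief mass is conserved.

class Partition:
    def __init__(self, desc, belief):
        self.desc = desc          # e.g. "venue(name, type, area)"
        self.belief = belief

def split(partitions, target_desc, refined_desc, prior):
    """Split the partition matching target_desc using an ontology rule
    with the given prior probability."""
    out = []
    for p in partitions:
        if p.desc == target_desc:
            out.append(Partition(refined_desc, p.belief * prior))        # refined
            out.append(Partition(p.desc + " \\ " + refined_desc,
                                 p.belief * (1 - prior)))                # remainder
        else:
            out.append(p)
    return out

# Initially a single partition covering all venues, with belief 1.0.
ps = [Partition("venue(name, type, area)", 1.0)]
# The user act request(bar) triggers the rule  type -> bar(drinks, music)  0.4
ps = split(ps, "venue(name, type, area)",
           "venue(name, bar(drinks, music), area)", 0.4)
for p in ps:
    print(p.desc, p.belief)   # beliefs 0.4 and 0.6, summing to 1.0
```

Because belief is shared according to the rule priors, partitions that the user never mentions retain mass implicitly, which is what lets the HIS model cover a huge goal space with only a handful of explicit partitions.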
Master <-> Summary State Mapping
Master space is mapped into a reduced summary space:

[Figure: Master-space hypotheses such as find(venue(hotel, area=east, near=Museum)), find(venue(bar, area=east, near=Museum)), find(venue(hotel, area=east)), find(venue(hotel, area=west)), find(venue(hotel)), etc., with belief b, are compressed into summary features (P(top), P(Nxt), T12Same, TPStatus, THStatus, TUserAct, LastSA). The policy π maps the summary belief b̂ to a summary act type (Greet, Bold Request, Tentative Request, Confirm, Offer, Inform, etc.), and a heuristic mapping expands this back into a full machine act a_m, e.g. confirm() → confirm(area=east).]
The POMDP Policy and Action Selection
The policy is a set of points in summary space and their associated actions: (b̂^1, â_m^1), (b̂^2, â_m^2), (b̂^3, â_m^3), (b̂^4, â_m^4), ....

Action selection at time t: find the belief point nearest to b̂_t and output its associated action â_{m,t}.
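Grid-based action selection is essentially a nearest-neighbour lookup. A minimal sketch, assuming hypothetical two-dimensional summary beliefs (say, (P(top), P(Nxt))) and invented point/action values:

```python
import math

# Sketch of grid-based action selection: the policy is a finite table of
# (belief point, action) pairs; at run time, pick the nearest point's action.
# The points and actions below are hypothetical.

policy = [((0.9, 0.05), "inform"),
          ((0.6, 0.30), "confirm"),
          ((0.3, 0.25), "tentative_request"),
          ((0.1, 0.08), "bold_request")]

def select_action(b):
    """Return the action of the stored belief point nearest to b."""
    point, action = min(policy, key=lambda pa: math.dist(pa[0], b))
    return action

print(select_action((0.85, 0.10)))   # near the first point -> "inform"
print(select_action((0.12, 0.05)))   # near the last point -> "bold_request"
```

This is why the next slide's optimisation scheme can get away with a finite policy table: new belief points are only added when no stored point is close enough.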
Policy Optimisation
Use Q-learning with a simulated user on belief points.
Start with a single belief point.
Add new points as they are encountered, up to some maximum.

[Figure: Learning curve of average reward against number of training dialogs (×1000).]
Summary of HIS Dialog Manager Operation
[Figure: HIS dialog manager operation. Inputs each turn are the observation from the user (an N-best list of hypothesised acts ã_u^1, ã_u^2, ..., ã_u^N) and the last system act a_m.]

1. Grow the forest, i.e. extend the partitioning of belief space, using the ontology rules.
2. Bind the user acts to partitions (p_u^1, p_u^2, p_u^3, ...).
3. Update the dialog history (s_d^1, s_d^2, s_d^3, ...), consulting the application database.
4. Form new hypotheses (h^1, h^2, h^3, ...), giving the updated belief state.
5. Map from master space into summary space (b → b̂).
6. Apply the POMDP policy in summary space to obtain a strategic action â_m.
7. Map back to master space via heuristic action refinement, yielding the machine act a_m.
Evaluation
The system was tested by human users in a two-day study conducted simultaneously at Edinburgh and Cambridge. Dialogues were deemed to be successfully completed when the system made a correct recommendation.

                              Cambridge   Edinburgh   Combined
  # subjects                      23          17          40
  # dialogues                     92          68         160
  % WER                         21.1        37.3        29.3
  % completion rate             95.7        83.8        90.6
  Avg. turns to completion       3.8         8.1         5.6
Work supported by the EU FP6 “TALK” Project
Conclusions

Partially observable MDPs provide a natural framework for modelling spoken dialog systems:
explicit representation of uncertainty
support for N-best ASR output
incorporation of user and observation models
simple error recovery by shifting belief to alternative hypotheses
potential for on-line adaptation

The Hidden Information State system demonstrates that POMDPs can be scaled to handle real-world tasks.

There are many issues to resolve, e.g. effective observation and user models, choice of summary state mapping, improved training procedures ...

... but overall, POMDPs provide an opportunity for making significant improvements to both the design and implementation of spoken dialog systems.