Page 1:

Reinforcement learning for spoken dialog systems:

Using POMDPs for Dialog Management

Cambridge University Engineering Department
Machine Intelligence Laboratory

Steve Young

Page 2:

Outline of Talk

the promise of statistical dialog systems

Markov Decision Processes and their limitations

Partially Observable MDPs – an intractable solution?

the Hidden Information State system – a proof of concept.

Page 3:

Statistical Dialog Systems

A statistical approach to dialog system design offers the following potential advantages:

formalise dialog design criteria as objective reward functions

automatically learn dialog strategies from data

allow decision making to be optimised

increase robustness to recognition/understanding errors

enable on-line dialog policy adaptation to allow the system to learn from experience

Markov Decision Processes provide the framework to do this .....

Overall, increase robustness and reduce design, implementation and maintenance costs

Page 4:

Dialog as a Markov Decision Process

[Figure: block diagram of dialog as an MDP. The User, with user goal $s_u$, produces a user dialog act $a_u$; Speech Understanding delivers a noisy estimate $\tilde{a}_u$ of the user dialog act; a State Estimator maintains the machine state $\tilde{s}_m = \langle \tilde{s}_u, \tilde{a}_u, s_d \rangle$, where $s_d$ is the dialog history; the Dialog Policy $\pi$ maps the machine state to a machine dialog act $a_m$, which is rendered by Speech Generation back to the User. Reinforcement Learning optimises the policy $\pi$ against the reward $r(s_m, a_m)$, maximising the discounted return $R = \sum_k \gamma^k r_k$.]

Levin, E. and R. Pieraccini (1997). "A Stochastic Model of Computer-Human Interaction for Learning Dialog Strategies." Proc. Eurospeech, Rhodes, Greece.

Levin, E., R. Pieraccini, et al. (1998). "Using Markov Decision Processes for Learning Dialog Strategies." Proc. Int. Conf. Acoustics, Speech and Signal Processing, Seattle, USA.
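To make the return concrete, here is a minimal sketch of computing $R = \sum_k \gamma^k r_k$ for one episode; the per-turn rewards and discount are invented for illustration:

```python
# Discounted return R = sum_k gamma^k * r_k for one dialog episode.
# Hypothetical rewards: -1 per turn to encourage short dialogs,
# +20 for a successful completion on the final turn.
gamma = 0.95
rewards = [-1, -1, -1, -1, 20]

R = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"discounted return R = {R:.2f}")
```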

Page 5:

Training an MDP

Key idea is to associate a value function with each state:

$V^\pi(s_m) = E\{R \mid s_m, \pi\} \qquad Q^\pi(s_m, a_m) = E\{R \mid s_m, a_m, \pi\}$

Given $V$ or $Q$, policy optimisation is straightforward, since if

$Q^\pi(s_m, \pi'(s_m)) > V^\pi(s_m)$

then policy $\pi'$ is better than policy $\pi$, where $Q^\pi(s_m, \pi(s_m)) = V^\pi(s_m)$.

A popular algorithm for implementing this is Q-Learning.
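A minimal tabular Q-learning sketch of this idea (generic textbook form, not the system's actual trainer; the toy environment below is an invented stand-in for a dialog simulator):

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, episodes=2000,
               alpha=0.1, gamma=0.95, epsilon=0.1, start_state=0):
    """Tabular Q-learning: estimate Q(s, a) = E{R | s, a} from sampled turns."""
    Q = defaultdict(float)                      # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env_step(s, a)
            # move Q(s,a) towards r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# toy 3-state chain: action 1 advances, reaching state 2 ends the
# episode with reward +10, and every other turn costs 1
def env_step(s, a):
    s2 = s + 1 if a == 1 else s
    return s2, (10.0 if s2 == 2 else -1.0), s2 == 2

Q = q_learning(env_step, actions=[0, 1])
print(max([0, 1], key=lambda a: Q[(0, a)]))   # greedy act in state 0 -> 1
```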

Page 6:

Limitations of MDP Framework

Modelling dialog as an MDP suffers from a variety of practical problems:

state space is huge, hence propositional content and much of the relevant history is often ignored

dialogs are fragile because the user state $s_u$ and user dialog act $a_u$ are uncertain, hence the estimate of the machine state $\tilde{s}_m$ is often incorrect

recovery strategies are difficult since no information is available for backtracking

no principled way to handle N-best ASR output

Page 7:

Dialog as a Partially Observable MDP

[Figure: block diagram of dialog as a POMDP. Speech Understanding now outputs a distribution over possible user dialog acts $[\tilde{a}_u^1, \ldots, \tilde{a}_u^N]$ (e.g. an N-best list). A Belief Estimator maintains a belief $b(\tilde{s}_m)$, a distribution over all possible machine states $\tilde{s}_m = \langle \tilde{s}_u, \tilde{a}_u, s_d \rangle$. The Dialog Policy now depends on the state distribution, not just the most likely state, and Reinforcement Learning optimises the expected reward $\sum_{s_m} b(s_m)\, r(s_m, a_m)$.]

Roy, N., J. Pineau, et al. (2000). "Spoken Dialog Management Using Probabilistic Reasoning." Proceedings of the ACL 2000.

Williams, J., P. Poupart, et al. (2005). "Factored Partially Observable Markov Decision Processes for Dialog Management." 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, Edinburgh.
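The expected-reward term is just a belief-weighted sum; a tiny sketch with invented states, act and reward values:

```python
# Expected immediate reward of machine act a_m under belief b:
#   sum_{s_m} b(s_m) * r(s_m, a_m)
# States, act, and rewards below are hypothetical.
belief = {"want_bar": 0.7, "want_restaurant": 0.3}
reward = {("want_bar", "offer_bar"): 10, ("want_restaurant", "offer_bar"): -5}

expected = sum(p * reward[(s, "offer_bar")] for s, p in belief.items())
print(expected)   # 0.7*10 + 0.3*(-5) = 5.5
```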

Page 8:

Belief Update Equation

Belief is updated every dialog turn as follows:

$b'(s_m') = k \cdot \underbrace{P(o' \mid a_u')}_{\text{Observation Model}} \, \underbrace{P(a_u' \mid s_m', a_m)}_{\text{User Action Model}} \sum_{s_m} \underbrace{P(s_m' \mid s_m, a_m)}_{\text{Transition Model}} \, b(s_m)$

where $b'(s_m')$ is the new belief and $b(s_m)$ the old belief; the Observation Model $P(o' \mid a_u')$ gives the probability of the observed recogniser output (which can include multiple hypotheses) given the hypothesised user dialog act; the User Action Model $P(a_u' \mid s_m', a_m)$ gives the prior of $a_u'$ given $s_m'$ and $a_m$; the Transition Model $P(s_m' \mid s_m, a_m)$ gives the transition probability of $s_m'$; and $k$ is a normalising constant.
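A direct transcription of the update as a sketch; the model tables P_obs, P_user and P_trans are hypothetical stand-ins for trained models, and for simplicity this version marginalises over the hypothesised user act rather than carrying it inside the state:

```python
def belief_update(belief, a_m, obs, states, user_acts,
                  P_obs, P_user, P_trans):
    """One turn of POMDP belief monitoring:
    b'(s') ~ P(o'|a_u') * P(a_u'|s', a_m) * sum_s P(s'|s, a_m) * b(s),
    here summed over the hypothesised user acts a_u'. The tables
    P_obs[(o, a_u)], P_user[(a_u, s2, a_m)] and P_trans[(s2, s, a_m)]
    are illustrative dicts, defaulting to 0 for missing entries."""
    new_belief = {}
    for s2 in states:
        predicted = sum(P_trans.get((s2, s, a_m), 0.0) * belief[s]
                        for s in states)
        evidence = sum(P_obs.get((obs, a_u), 0.0) *
                       P_user.get((a_u, s2, a_m), 0.0)
                       for a_u in user_acts)
        new_belief[s2] = evidence * predicted
    k = sum(new_belief.values()) or 1.0   # normalisation constant
    return {s: v / k for s, v in new_belief.items()}
```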

Page 9:

Robustness of POMDP vs. MDP

[Figure: expected or average return plotted against the error rate $p_{err}$ (0.00 to 0.65) for POMDP and MDP policies, from a simulation of a simple 2-slot, 3-city travel problem. Williams, J., P. Poupart, et al. (2005).]

Page 10:

Summary of the POMDP Framework

system maintains multiple dialog hypotheses called the belief state

machine actions are based on the full belief state distribution not just the most likely state

no backtracking is required when misunderstanding detected

speech understanding output is regarded as an observation

belief distribution is re-computed each time a new observation is received in a process called belief monitoring

N-best ASRU outputs naturally incorporated into belief monitoring framework via an observation model

POMDP framework naturally includes a user model which gives probability of each user act given each possible dialog hypothesis


However, there are some issues ....

Page 11:

Belief Monitoring

[Figure: belief monitoring across dialog turns. At time $t$ the observation $o_t$ defines a distribution $P(a_u)$ over user acts $a_u^1, a_u^2, a_u^3, \ldots$; the Belief Update step maps the belief state $b_t(s_m)$ over states $s_m^1, s_m^2, s_m^3, s_m^4, \ldots$ to $b_{t+1}(s_m)$; at each time the policy $\pi$ selects the machine dialog act $a_{m,t}$, then $a_{m,t+1}$.]

But the representation of these distributions in a practical system is unclear.
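A schematic turn loop for belief monitoring with a flat dictionary over states; the likelihood table is invented, and the point of the following slides is that such naive enumeration does not scale:

```python
# Naive belief monitoring: keep b_t(s_m) as an explicit dict, reweight by
# the new observation each turn, and renormalise.
def likelihood(s, a_m, obs):
    # hypothetical observation scores; a real system would use the
    # observation/user-action models from the belief update equation
    return {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}.get(s, 1.0)

def turn_update(belief, a_m, obs):
    scores = {s: belief[s] * likelihood(s, a_m, obs) for s in belief}
    k = sum(scores.values()) or 1.0
    return {s: v / k for s, v in scores.items()}

belief = {"s1": 0.25, "s2": 0.25, "s3": 0.25, "s4": 0.25}
for a_m, obs in [("greet", "o1"), ("request_area", "o2")]:
    belief = turn_update(belief, a_m, obs)
    # the policy would act here on the full distribution, not its argmax
print(belief)
```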

Page 12:

POMDP Value functions

Consider a system with just two states and three actions.

[Figure: belief space is the line segment between $s_1$ and $s_2$, with a point $b$ giving $(P(s_1), P(s_2))$. Each action $a_k$ has a value function $Q^{\pi_k}(b, a_k)$ that is linear in $b$, fixed by its corner values $Q^{\pi_k}(s_1, a_k)$ and $Q^{\pi_k}(s_2, a_k)$; belief space divides into regions where $a_1$, $a_2$ or $a_3$ is chosen.]

POMDP value functions are hyperplanes in belief space. The upper surface defines the value function $V(b)$. Exact learning is iterative and effectively intractable.
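The upper-surface construction can be read as a max over linear functions of the belief; a sketch for the two-state, three-action case with made-up corner values:

```python
import numpy as np

# Each action's Q-function is linear in the belief: Q(b, a) = q[a] . b,
# where q[a] holds hypothetical values at the corners s1 and s2.
q = {"a1": np.array([10.0, -2.0]),
     "a2": np.array([ 4.0,  4.0]),
     "a3": np.array([-2.0, 10.0])}

def value_and_action(p_s1):
    b = np.array([p_s1, 1.0 - p_s1])           # point in belief space
    best = max(q, key=lambda a: q[a] @ b)      # upper surface: V(b) = max_a q[a].b
    return q[best] @ b, best

for p in (0.1, 0.5, 0.9):
    print(p, value_and_action(p))   # different regions choose different acts
```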

Page 13:

Scaling to Real Systems

POMDPs provide an elegant mathematical framework for modelling spoken dialog systems but ....

State space will be huge – direct belief monitoring is impractical.

Exact POMDP optimisation is intractable - even approximate POMDP optimisation is limited to a few thousand states

A solution – the Hidden Information State Dialog Model

Page 14:

The Hidden Information State Model

Partition state space and compute partition beliefs not state beliefs

Represent user goals by a branching tree driven by ontology rules.

Maintain two state spaces: master space and summary space. Monitor beliefs in master space; apply and optimise policies in summary space

Use grid-based approximations, hence finite policy table

The HIS model provides a scalable POMDP framework for implementing practical spoken dialog systems.

Page 15:

Structure of a HIS Dialog Hypothesis

A single hypothesised information state: $h = \langle \{s_u\}, a_u, s_d \rangle$, i.e. a set of user goal states $s_u$ (a partition of $s_u$ space) with common $a_u$ and $s_d$.

User goal – a tree-structured set of entities, built incrementally from rules and expanded on demand to accommodate user dialog acts:

[Figure: example goal tree: task – find – entity – venue, with children name, type and location; type expands to restaurant with food = Indian, and location to street and addr.]

User action – a dialog act type plus goal tree bindings, e.g.

inform(food=Indian)

Dialog history – grounding status of each tree node, e.g.

restaurant : UserRequested
food : UserInformed
area : Grounded
name : Initial
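One way to render the hypothesis structure in code, as a sketch; the class and field names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A single HIS dialog hypothesis h = <{s_u}, a_u, s_d>: a partition of
    user goal space plus a hypothesised user act and grounding history."""
    goal_partition: dict            # partial goal tree, e.g. {"food": "Indian"}
    user_act: str                   # e.g. "inform(food=Indian)"
    history: dict = field(default_factory=dict)  # node -> grounding status
    belief: float = 0.0

h = Hypothesis(
    goal_partition={"entity": "venue", "type": "restaurant", "food": "Indian"},
    user_act="inform(food=Indian)",
    history={"restaurant": "UserRequested", "food": "UserInformed",
             "area": "Grounded", "name": "Initial"},
    belief=0.35,
)
print(h)
```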

Page 16:

HIS Partitions

Each partition represents a group of user goal states

Partitions are stored as tree structures, with nodes defined by a task ontology

Partitions are split by incoming user dialog acts

When a partition is split, its belief is shared between the splits

Example ontology rules:

Structure rules with prior probs:

entity -> venue(name,type,area)   1.0
type -> bar(drinks,music)         0.4
type -> restaurant(food,price)    0.3

Lexical/Dbase rules:

area -> (central | east | west | ....)
food -> (Italian | Chinese | ....)
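The ontology rules lend themselves to simple tables of expansions with priors; a sketch mirroring the rules above (the helper function is invented):

```python
# Structure rules as parent -> [(expansion, prior)], with the priors
# taken from the slide; the residual 0.3 for "type" belongs to rules
# not shown here.
STRUCTURE_RULES = {
    "entity": [("venue(name,type,area)", 1.0)],
    "type":   [("bar(drinks,music)", 0.4),
               ("restaurant(food,price)", 0.3)],
}
LEXICAL_RULES = {
    "area": ["central", "east", "west"],
    "food": ["Italian", "Chinese"],
}

def expansion_prior(node, expansion):
    """Prior probability that `node` expands to `expansion`."""
    return dict(STRUCTURE_RULES.get(node, [])).get(expansion, 0.0)

print(expansion_prior("type", "bar(drinks,music)"))   # 0.4
```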

Page 17:

Partition splitting

Incoming dialog acts cause partitions to be extended and split in order to match the items in the dialog act with the nodes in the tree.

[Figure: the user act request(bar) causes a partition rooted at entity -> venue(name,type,area) to be extended and split using the rule type -> bar(drinks,music), which has prior 0.4: the new partition with type expanded to bar(drinks, music) receives 0.4 of the original belief, and the unexpanded remainder keeps 0.6.]
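The belief-sharing arithmetic for such a split is straightforward; a sketch using the 0.4 prior from this example (the representation of partitions as rule lists is invented):

```python
def split_partition(partition, belief, rule, rule_prior):
    """Split a partition with an ontology rule: the expanded child gets
    belief * prior, the unexpanded remainder keeps the rest."""
    expanded = (partition + [rule], belief * rule_prior)
    remainder = (partition, belief * (1.0 - rule_prior))
    return expanded, remainder

# request(bar) triggers the split; the original partition held belief 1.0
expanded, remainder = split_partition(
    ["entity->venue(name,type,area)"], 1.0, "type->bar(drinks,music)", 0.4)
print(expanded[1], remainder[1])   # 0.4 0.6
```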

Page 18:

Master <-> Summary State Mapping

Master space is mapped into a reduced summary space.

Master space hypotheses, e.g.:

find(venue(hotel,area=east,near=Museum))
find(venue(bar,area=east,near=Museum))
find(venue(hotel,area=east))
find(venue(hotel,area=west))
find(venue(hotel))
....etc

The master belief $b$ maps to a summary belief $\hat{b}$ with features: P(top), P(Nxt), T12Same, TPStatus, THStatus, TUserAct, LastSA.

The policy $\pi$ maps $\hat{b}$ to a summary act type $\hat{a}_m$: Greet, Bold Request, Tentative Request, Confirm, Offer, Inform, .... etc

A heuristic mapping then expands the summary act into a full machine act $a_m$ in master space, e.g. confirm( ) -> confirm(area=east).
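A sketch of the master-to-summary mapping as a feature extractor; the feature names follow the slide, but the exact definitions used here are guesses:

```python
def summarise(hyps):
    """Map master-space hypotheses [(hypothesis, belief), ...] to a small
    summary feature dict. Feature definitions are assumptions."""
    ranked = sorted(hyps, key=lambda hb: hb[1], reverse=True)
    (top, p_top), (nxt, p_nxt) = ranked[0], ranked[1]
    return {
        "P(top)": p_top,                     # belief in the top hypothesis
        "P(Nxt)": p_nxt,                     # belief in the runner-up
        # assumed meaning: do the top two hypotheses share the same venue type?
        "T12Same": top.split(",")[0] == nxt.split(",")[0],
    }

b = [("find(venue(hotel,area=east))", 0.6),
     ("find(venue(hotel,area=west))", 0.3),
     ("find(venue(hotel))", 0.1)]
print(summarise(b))   # {'P(top)': 0.6, 'P(Nxt)': 0.3, 'T12Same': True}
```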

Page 19:

The POMDP Policy and Action Selection

A policy $\pi$ is a set of points in summary space and their associated actions:

$\hat{b}_1 \to \hat{a}_m^1, \quad \hat{b}_2 \to \hat{a}_m^2, \quad \hat{b}_3 \to \hat{a}_m^3, \quad \hat{b}_4 \to \hat{a}_m^4, \quad \ldots$

Action selection at time $t$: find the stored belief point nearest to $\hat{b}_t$ and take its associated action $\hat{a}_{m,t}$.
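Action selection then reduces to a nearest-neighbour lookup; a sketch with invented belief points and actions:

```python
import numpy as np

# Grid-based policy: (belief point, action) pairs in summary space;
# act by nearest-neighbour lookup. Points and actions are hypothetical.
policy = [(np.array([0.9, 0.05]), "inform"),
          (np.array([0.5, 0.3]),  "confirm"),
          (np.array([0.2, 0.15]), "request")]

def select_action(b_hat):
    """Return the action of the stored belief point nearest to b_hat."""
    dists = [np.linalg.norm(b_hat - p) for p, _ in policy]
    return policy[int(np.argmin(dists))][1]

print(select_action(np.array([0.85, 0.1])))   # -> "inform"
```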

Page 20:

Policy Optimisation

Use Q-learning with a simulated user on belief points

Start with a single belief point

Add new points as they are encountered, up to some maximum

[Figure: learning curve of average reward against training dialogs (x1000).]
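A sketch of growing the belief-point set during training; the distance threshold, capacity and Q update below are schematic, not the system's actual settings:

```python
import numpy as np

class GridPolicy:
    """Belief-point grid for summary-space Q-learning: start empty, add
    newly encountered points that are far from all stored ones, up to a
    maximum. Threshold and capacity values are hypothetical."""
    def __init__(self, actions, max_points=750, threshold=0.1):
        self.actions = actions
        self.max_points, self.threshold = max_points, threshold
        self.points, self.Q = [], []            # Q[i][a] for stored point i

    def index_of(self, b_hat):
        """Nearest stored point; add b_hat as a new point if none is close."""
        if self.points:
            d = [np.linalg.norm(b_hat - p) for p in self.points]
            i = int(np.argmin(d))
            if d[i] <= self.threshold or len(self.points) >= self.max_points:
                return i
        self.points.append(np.asarray(b_hat, dtype=float))
        self.Q.append({a: 0.0 for a in self.actions})
        return len(self.points) - 1

    def update(self, b_hat, a, target, alpha=0.1):
        """Schematic Q step towards a target return from simulated dialogs."""
        i = self.index_of(b_hat)
        self.Q[i][a] += alpha * (target - self.Q[i][a])

grid = GridPolicy(actions=["confirm", "inform"])
grid.update(np.array([0.8, 0.1]), "inform", target=5.0)
print(len(grid.points))   # 1: the grid grows as new points are seen
```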

Page 21:

Summary of HIS Dialog Manager Operation

[Figure: HIS dialog manager processing pipeline. The observation from the user is a set of hypothesised user acts $\tilde{a}_u^1, \tilde{a}_u^2, \ldots, \tilde{a}_u^N$; $a_m$ is the act from the system. Partitions carry dialog histories $s_d^1, s_d^2, \ldots$ and user act probabilities $p_u^1, p_u^2, \ldots$, combining into hypotheses $h_1, \ldots, h_5$ that form the belief state.]

1. Grow forest, i.e. extend the partitioning of belief space using the ontology rules
2. Bind user acts to partitions
3. Update dialog history
4. Form new hypotheses, consulting the application database
5. Map from master space -> summary space ($b \to \hat{b}$)
6. Apply the POMDP policy in summary space to choose a strategic action $\hat{a}_m$
7. Map back to master space via heuristic action refinement to give $a_m$
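The seven steps map naturally onto a per-turn function; a purely schematic sketch in which every helper is a no-op stub standing in for the component named above:

```python
# No-op stubs so the sketch runs; real implementations are the HIS components.
ontology_rules, app_database = {}, {}
def grow_forest(state, rules): pass                               # step 1
def bind_user_acts(state, acts): return [(a, None) for a in acts] # step 2
def update_dialog_history(state, bindings, a_m): pass             # step 3
def form_new_hypotheses(state, db): return state.setdefault("hyps", [])  # 4
def map_to_summary(hyps): return [0.0, 0.0]                       # step 5
def pomdp_policy(b_hat): return "confirm"                         # step 6
def refine_action(a_hat, hyps): return a_hat + "(area=east)"      # step 7

def dialog_turn(state, n_best_user_acts, a_m_prev):
    """One turn of the HIS dialog manager, following steps 1-7 above."""
    grow_forest(state, ontology_rules)
    bindings = bind_user_acts(state, n_best_user_acts)
    update_dialog_history(state, bindings, a_m_prev)
    hyps = form_new_hypotheses(state, app_database)
    b_hat = map_to_summary(hyps)
    a_hat = pomdp_policy(b_hat)
    return refine_action(a_hat, hyps)

print(dialog_turn({}, ["inform(area=east)", "inform(area=west)"],
                  "request(area)"))
```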

Page 22:

Evaluation

The system was tested by human users in a two-day study conducted simultaneously at Edinburgh and Cambridge.

Dialogues were deemed to be successfully completed when the system made a correct recommendation.

                             Cambridge   Edinburgh   Combined
# subjects                          23          17         40
# dialogues                         92          68        160
% WER                             21.1        37.3       29.3
% completion rate                 95.7        83.8       90.6
Average turns to completion        3.8         8.1        5.6

Work supported by the EU FP6 “TALK” Project

Page 23:

Conclusions

Partially observable MDPs provide a natural framework for modelling spoken dialog systems:

explicit representation of uncertainty

support for N-best ASR output

incorporates user and observation models

simple error recovery by shifting belief to alternative hypotheses

potential for on-line adaptation

The Hidden Information State system demonstrates that POMDPs can be scaled to handle real-world tasks

There are many issues to resolve, e.g. effective observation and user models, choice of summary state mapping, improved training procedures ...

... but overall POMDPs provide an opportunity for making significant improvements to both the design and implementation of spoken dialog systems.
