Reinforcement learning for spoken dialog systems:
Using POMDPs for Dialog Management
Cambridge University Engineering Department, Machine Intelligence Laboratory
Steve Young
© Steve Young, 2006
Outline of Talk
the promise of statistical dialog systems
Markov Decision Processes and their limitations
Partially Observable MDPs – an intractable solution?
the Hidden Information State system – a proof of concept.
Statistical Dialog Systems
A statistical approach to dialog system design offers the following potential advantages:
formalise dialog design criteria as objective reward functions
automatically learn dialog strategies from data
allow decision making to be optimised
increase robustness to recognition/understanding errors
enable on-line dialog policy adaptation to allow the system to learn from experience
Markov Decision Processes provide the framework to do this .....
Overall, increase robustness and reduce design, implementation and maintenance costs
Dialog as a Markov Decision Process
[Figure: Dialog as an MDP. The user, with goal s_u, produces a user dialog act a_u; Speech Understanding delivers a noisy estimate ã_u of the user dialog act. A State Estimator combines this with the dialog history s_d to form the machine state s̃_m = ⟨ã_u, s̃_u, s_d⟩. The Dialog Policy π maps the machine state to a machine dialog act a_m, which Speech Generation renders back to the user. Each turn yields a reward r(s_m, a_m); Reinforcement Learning optimises the policy to maximise the return R = Σ_k γ^k r_k.]

Levin, E. and R. Pieraccini (1997). "A Stochastic Model of Computer-Human Interaction for Learning Dialog Strategies." Proc Eurospeech, Rhodes, Greece.
Levin, E., R. Pieraccini, et al. (1998). "Using Markov Decision Processes for Learning Dialog Strategies." Proc Int Conf Acoustics, Speech and Signal Processing, Seattle, USA.
Training an MDP
Key idea is to associate a value function with each state:

V^π(s_m) = E{R | s_m, π},    Q^π(s_m, a_m) = E{R | s_m, a_m, π}

A popular algorithm for implementing this is Q-Learning.

Given V or Q, policy optimisation is straightforward since if

Q^π(s_m, π'(s_m)) > V^π(s_m),  where  Q^π(s_m, π(s_m)) = V^π(s_m),

then policy π' is better than policy π.
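The Q-Learning idea above can be sketched on a toy slot-filling dialog MDP. Everything here is invented for illustration (the states, actions, transition probabilities and rewards are not from the talk): the machine asks until its slots are filled, then submits.

```python
import random

# Toy dialog MDP (hypothetical): states count how many slots are filled.
# Actions: "ask" (try to fill the next slot) or "submit" (end the dialog).
N_SLOTS = 2
STATES = range(N_SLOTS + 2)          # 0..N_SLOTS filled, plus a terminal state
TERMINAL = N_SLOTS + 1
ACTIONS = ["ask", "submit"]
GAMMA = 0.95

def step(s, a):
    """Return (next_state, reward). 'ask' fills a slot with prob 0.8."""
    if a == "submit":
        # Reward +10 only if all slots were filled before submitting.
        return TERMINAL, (10.0 if s == N_SLOTS else -10.0)
    # 'ask' costs -1 per turn and may fill the next slot.
    s_next = min(s + 1, N_SLOTS) if random.random() < 0.8 else s
    return s_next, -1.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def q_learn(episodes=5000, alpha=0.1, eps=0.1):
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            # Epsilon-greedy action choice, then the standard Q-Learning update.
            a = random.choice(ACTIONS) if random.random() < eps else \
                max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r = step(s, a)
            target = r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2

random.seed(0)
q_learn()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_SLOTS + 1)}
print(policy)  # learned: ask until both slots are filled, then submit
```

The learned greedy policy π'(s) = argmax_a Q(s, a) is exactly the policy-improvement step described above: it is at least as good as any policy whose Q-values it dominates.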
Limitations of MDP Framework
Modelling dialog as an MDP suffers from a variety of practical problems:
the state space is huge, hence propositional content and much of the relevant history are often ignored
dialogs are fragile because the user state s_u and user dialog act a_u are uncertain, hence the estimate of the machine state s̃_m is often incorrect
recovery strategies are difficult since no information is available for backtracking
there is no principled way to handle N-best ASR output
Dialog as a Partially Observable MDP
[Figure: Dialog as a POMDP. As before, the user (with goal s_u) produces a dialog act a_u, but Speech Understanding now outputs a distribution over possible dialog acts [ã_u^1, ..., ã_u^N] (e.g. an N-best list). A Belief Estimator maintains b(s̃_m), a distribution over all possible machine states s̃_m = ⟨ã_u, s̃_u, s_d⟩. The Dialog Policy now depends on the state distribution, not just the most likely state, and Reinforcement Learning optimises the expected reward Σ_{s_m} b(s_m) r(s_m, a_m) when choosing the machine act a_m.]
Roy, N., J. Pineau, et al. (2000). "Spoken Dialog Management Using Probabilistic Reasoning." Proceedings of the ACL 2000.
Williams, J., P. Poupart, et al. (2005). "Factored Partially Observable Markov Decision Processes for Dialog Management." 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, Edinburgh.
Belief Update Equation
Belief is updated every dialog turn as follows:

b'(s'_m) = k · P(o' | a'_u) · P(a'_u | s'_m, a_m) · Σ_{s_m} P(s'_m | a_m, s_m) · b(s_m)

where:
P(o' | a'_u) — Observation Model: the new evidence, i.e. the probability of the observed recogniser output o' (which can include multiple hypotheses) given the hypothesised user dialog act a'_u
P(a'_u | s'_m, a_m) — User Action Model: the prior of a'_u given s'_m and a_m
P(s'_m | a_m, s_m) — Transition Model: the state transition probability of s'_m
b(s_m) — the old belief
k — a normalisation constant
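The update can be sketched in a few lines on a hypothetical two-goal domain. The state names, act strings and every probability below are invented for illustration; the N-best observation is summed over hypothesised user acts, with each hypothesis weighted by its confidence P(o'|a'_u).

```python
# A minimal sketch of the belief update equation above, on a hypothetical
# 2-state domain. All model probabilities here are invented for illustration.

STATES = ["want_indian", "want_chinese"]

# Transition model P(s'_m | a_m, s_m): assume user goals persist across turns.
def p_trans(s_new, a_m, s_old):
    return 1.0 if s_new == s_old else 0.0

# User action model P(a'_u | s'_m, a_m): users mostly state their actual goal.
def p_user_act(a_u, s_m, a_m):
    truthful = {"want_indian": "inform(food=Indian)",
                "want_chinese": "inform(food=Chinese)"}
    return 0.9 if a_u == truthful[s_m] else 0.1

def belief_update(belief, a_m, observation):
    """observation: list of (hypothesised user act, P(o'|a'_u)) pairs,
    e.g. an N-best list with confidence scores."""
    new_belief = {}
    for s_new in STATES:
        total = 0.0
        for a_u, p_obs in observation:
            trans = sum(p_trans(s_new, a_m, s_old) * belief[s_old]
                        for s_old in STATES)
            total += p_obs * p_user_act(a_u, s_new, a_m) * trans
        new_belief[s_new] = total
    k = sum(new_belief.values())              # normalisation constant
    return {s: v / k for s, v in new_belief.items()}

b0 = {"want_indian": 0.5, "want_chinese": 0.5}
# 2-best recogniser output: "Indian" is the more confident hypothesis.
obs = [("inform(food=Indian)", 0.7), ("inform(food=Chinese)", 0.3)]
b1 = belief_update(b0, "request(food)", obs)
print(b1)  # belief shifts toward want_indian (0.66 vs 0.34)
```

Note how the less confident second hypothesis still contributes: belief moves toward want_indian but is not committed to it, which is exactly the robustness the POMDP framework is after.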
Robustness of POMDP vs. MDP
[Figure: Expected/average return plotted against ASR error rate p_err (0.00 to 0.65) for POMDP and MDP policies; the POMDP policy degrades much more gracefully as the error rate rises. Simulation of a simple 2-slot, 3-city travel problem. Williams, J., P. Poupart, et al. (2005).]
Summary of the POMDP Framework
the system maintains multiple dialog hypotheses, called the belief state
machine actions are based on the full belief state distribution, not just the most likely state
no backtracking is required when a misunderstanding is detected
speech understanding output is regarded as an observation
the belief distribution is re-computed each time a new observation is received, in a process called belief monitoring
N-best ASRU outputs are naturally incorporated into the belief monitoring framework via an observation model
the POMDP framework naturally includes a user model which gives the probability of each user act given each possible dialog hypothesis
However, there are some issues ....
Belief Monitoring
[Figure: Belief monitoring across turns. At time t the belief state b_t(s_m) is a distribution over states s_m^1, s_m^2, s_m^3, s_m^4, ...; the policy π maps it to machine dialog act a_{m,t}; the resulting observation o_t is a distribution P(a_u) over hypothesised user acts a_u^1, a_u^2, a_u^3, ...; Belief Update then produces b_{t+1}(s_m) at time t+1, the policy selects a_{m,t+1}, and the cycle repeats.]

But the representation of these distributions in a practical system is unclear.
POMDP Value Functions
Consider a system with just two states and three actions.

[Figure: The belief space b is the line running from P(s_1) = 1 to P(s_2) = 1. For each action a_i, Q_k^π(b, a_i) is a straight line joining Q_k^π(s_1, a_i) and Q_k^π(s_2, a_i); depending on where b lies, the best choice is a_1, a_2 or a_3.]

POMDP value functions are hyperplanes in belief space. The upper surface defines the value function V(b). Exact learning is iterative and effectively intractable.
Scaling to Real Systems
POMDPs provide an elegant mathematical framework for modelling spoken dialog systems but ....
State space will be huge – direct belief monitoring is impractical.
Exact POMDP optimisation is intractable - even approximate POMDP optimisation is limited to a few thousand states
A solution – the Hidden Information State Dialog Model
The Hidden Information State Model
Partition state space and compute partition beliefs not state beliefs
Represent user goals by branching-tree driven by ontology rules.
Maintain two state spaces: master space and summary space. Monitor beliefs in master space; apply and optimise policies in summary space.
Use grid-based approximations, hence a finite policy table.
The HIS model provides a scaleable POMDP framework for implementing practical spoken dialog systems.
Structure of a HIS Dialog Hypothesis
A single hypothesised information state h = ⟨{s_u}, a_u, s_d⟩, i.e. a set of user goal states s_u (a partition of s_u) with common user act a_u and dialog history s_d.

[Figure: Example hypothesis. User goal: a tree-structured set of entities (task → find → entity → venue, with name, type = restaurant, location = street, food = Indian, addr), built incrementally from rules and expanded on demand to accommodate user dialog acts. User action: a dialog type plus goal tree bindings, e.g. inform(food=Indian). Dialog history: the grounding status of each tree node, e.g. restaurant: UserRequested; food: UserInformed; area: Grounded; name: Initial.]
HIS Partitions
Each partition represents a group of user goal states.
Partitions are stored as tree structures, with nodes defined by a task ontology.
Partitions are split by incoming user dialog acts.
When a partition is split, its belief is shared between the splits.
[Figure: Example partition tree: entity → venue(name, type, area).]

Example ontology rules:

Structure rules with prior probs:
  entity -> venue(name, type, area)   1.0
  type -> bar(drinks, music)          0.4
  type -> restaurant(food, price)     0.3

Lexical/Dbase rules:
  area -> (central | east | west | ...)
  food -> (Italian | Chinese | ...)
Partition splitting

Incoming dialog acts cause partitions to be extended and split in order to match the items in the dialog act with the nodes in the tree.

[Figure: Example. Starting from the partition entity -> venue(name, type, area), the user act request(bar) applies the rule type -> bar(drinks, music) (prior 0.4), splitting the type node into two partitions: one in which the type is bar(drinks, music), carrying belief 0.4, and the remainder, carrying belief 0.6.]
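The splitting-with-belief-sharing mechanism can be sketched as follows. The partition descriptions and the helper names (`Partition`, `split`) are illustrative, not the HIS implementation; only the rule and prior (0.4) come from the example above.

```python
# Sketch of HIS partition splitting with belief sharing: when a rule with
# prior p splits a partition, the refined split takes belief*p and the
# remainder keeps belief*(1-p), so total belief mass is conserved.

class Partition:
    def __init__(self, desc, belief):
        self.desc = desc          # e.g. "venue(name, type, area)"
        self.belief = belief

def split(partitions, target_desc, refined_desc, prior):
    """Split the partition matching target_desc using an ontology rule
    with the given prior probability."""
    out = []
    for p in partitions:
        if p.desc == target_desc:
            out.append(Partition(refined_desc, p.belief * prior))        # refined
            out.append(Partition(p.desc + " \\ " + refined_desc,
                                 p.belief * (1 - prior)))                # remainder
        else:
            out.append(p)
    return out

# Initially a single partition covering all venues, with belief 1.0.
ps = [Partition("venue(name, type, area)", 1.0)]
# The user act request(bar) triggers the rule  type -> bar(drinks, music)  0.4
ps = split(ps, "venue(name, type, area)",
           "venue(name, bar(drinks, music), area)", 0.4)
for p in ps:
    print(p.desc, p.belief)   # beliefs 0.4 and 0.6, summing to 1.0
```

Because belief is shared according to the rule priors, partitions that the user never mentions retain mass implicitly, which is what lets the HIS model cover a huge goal space with only a handful of explicit partitions.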
Master <-> Summary State Mapping
Master space is mapped into a reduced summary space:

[Figure: Master-space hypotheses such as find(venue(hotel, area=east, near=Museum)), find(venue(bar, area=east, near=Museum)), find(venue(hotel, area=east)), find(venue(hotel, area=west)), find(venue(hotel)), etc., with belief b, are compressed into summary features (P(top), P(Nxt), T12Same, TPStatus, THStatus, TUserAct, LastSA). The policy π maps the summary belief b̂ to a summary act type (Greet, Bold Request, Tentative Request, Confirm, Offer, Inform, etc.), and a heuristic mapping expands this back into a full machine act a_m, e.g. confirm() → confirm(area=east).]
The POMDP Policy and Action Selection
The policy is a set of points in summary space and their associated actions: (b̂^1, â_m^1), (b̂^2, â_m^2), (b̂^3, â_m^3), (b̂^4, â_m^4), ....

Action selection at time t: find the belief point nearest to b̂_t and output its associated action â_{m,t}.
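Grid-based action selection is essentially a nearest-neighbour lookup. A minimal sketch, assuming hypothetical two-dimensional summary beliefs (say, (P(top), P(Nxt))) and invented point/action values:

```python
import math

# Sketch of grid-based action selection: the policy is a finite table of
# (belief point, action) pairs; at run time, pick the nearest point's action.
# The points and actions below are hypothetical.

policy = [((0.9, 0.05), "inform"),
          ((0.6, 0.30), "confirm"),
          ((0.3, 0.25), "tentative_request"),
          ((0.1, 0.08), "bold_request")]

def select_action(b):
    """Return the action of the stored belief point nearest to b."""
    point, action = min(policy, key=lambda pa: math.dist(pa[0], b))
    return action

print(select_action((0.85, 0.10)))   # near the first point -> "inform"
print(select_action((0.12, 0.05)))   # near the last point -> "bold_request"
```

This is why the next slide's optimisation scheme can get away with a finite policy table: new belief points are only added when no stored point is close enough.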
Policy Optimisation
Use Q-learning with a simulated user on belief points.
Start with a single belief point.
Add new points as they are encountered, up to some maximum.

[Figure: Learning curve of average reward against number of training dialogs (×1000).]
Summary of HIS Dialog Manager Operation
[Figure: HIS dialog manager operation. Inputs each turn are the observation from the user (an N-best list of hypothesised acts ã_u^1, ã_u^2, ..., ã_u^N) and the last system act a_m.]

1. Grow the forest, i.e. extend the partitioning of belief space, using the ontology rules.
2. Bind the user acts to partitions (p_u^1, p_u^2, p_u^3, ...).
3. Update the dialog history (s_d^1, s_d^2, s_d^3, ...), consulting the application database.
4. Form new hypotheses (h^1, h^2, h^3, ...), giving the updated belief state.
5. Map from master space into summary space (b → b̂).
6. Apply the POMDP policy in summary space to obtain a strategic action â_m.
7. Map back to master space via heuristic action refinement, yielding the machine act a_m.
Evaluation
The system was tested by human users in a two-day study conducted simultaneously at Edinburgh and Cambridge. Dialogues were deemed to be successfully completed when the system made a correct recommendation.

                              Cambridge   Edinburgh   Combined
  # subjects                      23          17          40
  # dialogues                     92          68         160
  % WER                         21.1        37.3        29.3
  % completion rate             95.7        83.8        90.6
  Avg. turns to completion       3.8         8.1         5.6
Work supported by the EU FP6 “TALK” Project
Conclusions

Partially observable MDPs provide a natural framework for modelling spoken dialog systems:
explicit representation of uncertainty
support for N-best ASR output
incorporation of user and observation models
simple error recovery by shifting belief to alternative hypotheses
potential for on-line adaptation

The Hidden Information State system demonstrates that POMDPs can be scaled to handle real-world tasks.

There are many issues to resolve, e.g. effective observation and user models, choice of summary state mapping, improved training procedures ...

... but overall, POMDPs provide an opportunity for making significant improvements to both the design and implementation of spoken dialog systems.