Gautam Varma Mantena, Dr. Kishore Prahallad
Speech and Vision Lab, International Institute of Information Technology - Hyderabad
Spoken Dialogue Systems: A Tutorial
Introduction

An interface between a human and a machine for information access.

Some applications for information access:
- Search engines. Example: Google search, Bing, etc.
- IVR applications. Example: most customer care lines.
- Dialogue systems. Example: Google chat bots ([email protected])

Dialogue systems:
- Dialogue is a much freer mode of communication.
- Speech being the natural mode of communication, there is a need to build systems that communicate with users via speech.
Introduction (contd.)
Modes of communication in a dialogue system:
- Text
- Speech
- Multimodal features like gestures, touch screen, etc.
Dialogue

How is spoken dialogue different from other modes of communication?
- Input: the input utterance can contain disfluencies (like 'uh' and 'hmm'), word repetitions, etc.
- Barge-ins: interrupting the system before allowing it to finish.
- Turn taking: the system should know when to speak (i.e., when to take control of the conversation).
- Prosodic information.
- Grounding with the user.
Applications of Spoken Dialogue Systems
- ARISE: Automatic Railway Information System for Europe
- RoomLine: conference room scheduling and reservation
- Let's Go: bus information system
- TeamTalk: command and control interface to a team of robots
- Communicator: air travel planning
- Jupiter: automated weather service
- MIS: Mandi Information System
and many more.
Characteristics of a Spoken Dialogue System
- Ability to understand the user's goal and to reach an appropriate and satisfying solution.
- Ability to carry out sub-dialogues to achieve sub-goals.
- Ability to pass control from one sub-dialogue to another.
- Ability to vary the dialogue initiative mode from system initiative to user initiative.
- Use of a user model to anticipate the user's utterances and act aptly.
- Directing the user towards task completion.
Components of a Spoken Dialogue System
- Automatic Speech Recognition (ASR): converts user input (utterance/speech) to text.
- Natural Language Understanding (NLU): structures the text into a format the system can understand.
- Dialogue Manager (DM): decides the actions to be performed based on the structured text.
- Natural Language Generator (NLG): generates the text that is to be conveyed to the user.
- Text to Speech Synthesis (TTS): converts the text to speech.
Architecture of a Spoken Dialogue System
Approaches to build a Dialogue Manager

Classification can be done based on:

How the conversation is controlled:
- System initiative: the system has complete control of the conversation.
- User initiative: the user has complete dialogue control and the system only responds to user requests.
- Mixed initiative: both the user and the system take turns in controlling the conversation.

Functioning of the DM:
- Finite State
- Self Organizing
Finite State Dialogue Managers

The dialogue is structured as a sequence of predetermined utterances.
Finite State Dialogue Managers (contd.)
- The user is expected to answer only the system's queries. Providing more information would be redundant.
- FS-DM systems are system initiative, as they restrict user responses.
- FS-DM systems are very robust, as the system always knows what state it is in and what the possible user responses are.
- Useful when dealing with well-structured data like small enquiry systems, questionnaires, etc.
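The fixed-path behaviour of an FS-DM can be sketched as follows. This is a minimal illustration assuming a hypothetical three-slot enquiry task; the state names, prompts, and transition table are invented for the example:

```python
# Minimal sketch of a finite state dialogue manager for a hypothetical
# enquiry task with three slots. The dialogue path is fully
# predetermined: each state has one prompt and one fixed successor.

STATES = {
    "ask_commodity": ("What is the commodity name?", "ask_district"),
    "ask_district": ("What is the district name?", "ask_market"),
    "ask_market": ("What is the market name?", "done"),
}

def run_dialogue(get_user_input):
    """Walk the predetermined state sequence, collecting one answer per state."""
    answers = {}
    state = "ask_commodity"
    while state != "done":
        prompt, next_state = STATES[state]
        answers[state] = get_user_input(prompt)  # user may only answer this query
        state = next_state                       # fixed transition: no deviation
    return answers

# Usage with canned responses standing in for ASR output:
canned = iter(["orange", "Kurnool", "Adoni"])
result = run_dialogue(lambda prompt: next(canned))
# result == {"ask_commodity": "orange", "ask_district": "Kurnool", "ask_market": "Adoni"}
```

Because every state and every legal response is known in advance, the manager cannot be misled by extra information in the user's utterance, which is exactly why FS-DM systems are robust but restrictive.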
Self Organizing Models
- The dialogue path is not predetermined; it evolves along with the user's responses to the system.
- Mixed initiative.
- Provides a much freer form of communication, as the user is not restricted.

Types of self organizing models:
- Frame based
- Agent based
- Information State based
Frame Based Dialogue Manager
- Frame (or form) based models function like a form-filling application: the system keeps track of the information already present and the information still to be acquired.
- The system chooses the required question to ask at every instant; the dialogue path is not predetermined.
- The state of the DM is defined as the content of the frame.
- A frame is a data structure that holds the concepts and the necessary actions to be performed.
Frame Based Dialogue Manager (contd)
- Preconditions are the conditions under which the actions are to be performed.
- Given all the questions and preconditions, the system determines what actions to perform.
Example:

Concept       Question / Action                    Precondition
Person name   Prompt: What is the person name?     Person name empty
Lab name      Prompt: What is the lab name?        Lab name empty
Detail type   Prompt: What details do you want?    Detail type empty
Query type    Execute: update query type           Person name or Lab name not empty
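The table above can be sketched as a tiny frame based controller. The concepts, prompts, and preconditions mirror the example, while the rule-scanning loop itself is an illustrative assumption:

```python
# Minimal sketch of a frame based dialogue manager for the example
# frame above. The state of the DM is the content of the frame; the
# next action is whichever rule's precondition currently holds.

frame = {"person_name": None, "lab_name": None, "detail_type": None}

# Each rule: (precondition over the frame, action description)
rules = [
    (lambda f: f["person_name"] is None, "Prompt: What is the person name?"),
    (lambda f: f["lab_name"] is None, "Prompt: What is the lab name?"),
    (lambda f: f["detail_type"] is None, "Prompt: What details do you want?"),
    (lambda f: f["person_name"] or f["lab_name"], "Execute: update query type"),
]

def next_action(f):
    """Pick the first action whose precondition holds; the dialogue
    path is driven by the frame contents, not a fixed script."""
    for precondition, action in rules:
        if precondition(f):
            return action
    return "Done"

# The chosen question changes as slots are filled:
print(next_action(frame))          # Prompt: What is the person name?
frame["person_name"] = "Kishore"   # hypothetical user answer
print(next_action(frame))          # Prompt: What is the lab name?
```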
Agent based Systems
- An agent is one which perceives its environment and performs the necessary actions.
- The information an agent obtains from the environment is called percepts.
- An agent's behaviour depends on the sequence of percepts it receives from the environment.
- The goal of an intelligent agent is to effectively map percept sequences to corresponding action sequences via a program called the agent program.
- An example model is the plan based model.
Agent based Systems (contd)
- In a plan based model, given an initial state and a goal state, the system constructs a plan, i.e., a sequence of operations to be performed.
- The series of actions to be performed is defined by an action schema.
- An example action schema consists of the following parameters and constraints:
- Preconditions: necessary conditions for successful execution of the procedure.
Agent based Systems (contd)
- Effect: conditions that become true after execution of the procedure.
- Body: a set of goal states that are to be achieved in executing the procedure.
Information State based Dialogue Systems
- Dialogue management operations are all in terms of the information state.
- The information state of a dialogue represents:
  - the current information present,
  - information obtained from previous dialogue moves,
  - future actions to be performed.
- The information state approach to dialogue modeling consists of:
  - A description and a representation of the information components that constitute an information state. In general they can be sub-divided into static and dynamic components.
Information State based Dialogue Systems (contd)
  - Static components are those information components that do not change during a conversation with a user, e.g. domain knowledge, update rules, etc.
  - A set of dialogue moves that will update the information state.
  - A set of update rules that control the update process of the information state. The rules are governed by the current information state and the performed dialogue move.
  - An update strategy for deciding which update rule to execute from the set of applicable ones.
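These ingredients can be sketched concretely. The example below assumes a toy information state with a single slot and two invented dialogue moves (inform and confirm); the rule and strategy structure follows the description above, but all names are hypothetical:

```python
# Minimal sketch of information state update rules. Each rule pairs an
# applicability test (over state and dialogue move) with an update.

state = {"commodity": None, "to_confirm": None}

def applies_ask(s, move):
    return move["type"] == "inform" and s["commodity"] is None

def update_ask(s, move):
    s["to_confirm"] = move["value"]      # hold the value until confirmed

def applies_confirm(s, move):
    return move["type"] == "confirm" and s["to_confirm"] is not None

def update_confirm(s, move):
    s["commodity"] = s["to_confirm"]     # ground the confirmed value
    s["to_confirm"] = None

rules = [(applies_ask, update_ask), (applies_confirm, update_confirm)]

def apply_move(s, move):
    """Update strategy: execute the first applicable rule."""
    for test, update in rules:
        if test(s, move):
            update(s, move)
            return

apply_move(state, {"type": "inform", "value": "orange"})
apply_move(state, {"type": "confirm"})
# state["commodity"] is now "orange"
```

The update strategy here is the simplest possible one (first applicable rule wins); richer strategies are exactly where the complexity noted on the next slide comes from.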
Stochastic Methods for learning Dialogue Strategies
- Dialogue systems like information state based systems are complex to build; the difficulty lies in incorporating the update moves and strategies.
- One approach to overcome this complexity is to model the dialogue manager as a stochastic model and train it using a preliminary dialogue corpus.
- In the case of real-time systems, it is difficult to foresee all possible dialogue scenarios that could occur between a user and a system. To overcome such problems we can approximate a dialogue system with a stochastic model.
Stochastic Methods for learning Dialogue Strategies (contd.)
Some of the approaches used for modeling dialogue managers are:
- Markov Decision Process (MDP)
- Partially Observable Markov Decision Process (POMDP)
Markov Decision Process
A Markov decision process (MDP) is characterized by:
- A set of states S an agent can be in.
- A set of actions A the agent can take.
- A reward r(s,a) that the agent receives for taking an action.

The goal is to find a policy which specifies which action to take to receive the best reward.

The expected cumulative reward Q for a sequence of states is:

Q[s0,a0,s1,a1,…] = R(s0,a0) + αR(s1,a1) + α²R(s2,a2) + …
Markov Decision Process (contd.)
- α is called the discount factor and lies between 0 and 1. It gives more importance to the current reward than to future rewards: the further in the future a reward lies, the more discounted its value.
- Bellman equation:

Q(s,a) = R(s,a) + α Σ_s1 P(s1|s,a) max_a1 Q(s1,a1)

- Hand-labeled data is required to model R(s,a) and P(s1|s,a).
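The Bellman equation can be solved iteratively. Below is a minimal value-iteration sketch on a toy two-state dialogue MDP; the states, actions, rewards, and transition probabilities are illustrative assumptions, not taken from the tutorial:

```python
# Toy dialogue MDP: state "empty" (slot unfilled) and terminal state
# "filled". Action "ask" may fill the slot; "greet" never does.
# Rewards are negative per turn to encourage short dialogues.
R = {("empty", "ask"): -1.0, ("empty", "greet"): -2.0}
P = {  # P[(s, a)] = {s1: probability}
    ("empty", "ask"): {"filled": 0.8, "empty": 0.2},
    ("empty", "greet"): {"empty": 1.0},
}
alpha = 0.9  # discount factor

Q = {sa: 0.0 for sa in R}
for _ in range(200):
    # Q(s,a) = R(s,a) + α Σ_s1 P(s1|s,a) max_a1 Q(s1,a1)
    # (terminal state "filled" has no entries in Q, so its value is 0)
    Q = {
        (s, a): R[(s, a)] + alpha * sum(
            p * max((Q.get((s1, a1), 0.0) for a1 in ("ask", "greet")), default=0.0)
            for s1, p in P[(s, a)].items()
        )
        for (s, a) in R
    }

# The policy picks the action with the best Q value in each state.
policy = max(("ask", "greet"), key=lambda a: Q[("empty", a)])
# "ask" ends up with the higher Q value, so the learned policy asks.
```

At the fixed point, Q(empty, ask) = -1 + 0.18·Q(empty, ask), giving Q(empty, ask) = -1/0.82 ≈ -1.22, while Q(empty, greet) ≈ -3.10, so asking is preferred.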
Partially Observable Markov Decision Process (POMDP)
Problems with MDP:
- The system might never know what state the user is in.
- The system state might be misdirected by recognition errors.

POMDP is used to overcome the above problems:
- The environment is not completely observable by the agent.
- POMDP provides a framework to overcome such errors by explicitly modeling the recognition errors.
- It models the user output as an observed signal generated from a hidden variable.
Evaluation of Spoken Dialogue System
Dialogue metrics can be classified as:

Objective measures: calculated automatically by machine without any consideration of human judgment. Some objective measures:
- Percentage of correct answers given by a system to a user.
- Number of turns taken to complete the task.
- Mean user response time.
- Mean system response time.
- Percentage of errors recognized by the dialogue system.
- Time taken to complete the task.
Evaluation of Spoken Dialogue System (contd)
Subjective measures: require human analysis and a set of ground rules for user evaluation. Some subjective measures:
- System cooperation.
- User satisfaction.
- Percentage of correct and partially correct answers.
Issues in evaluating Spoken Dialogue Systems
- These metrics only work when the system's response is a single correct answer; they cannot compare systems whose outputs are summarized texts.
- The different metrics cannot be clearly demarcated: some of the given metrics can be correlated and hence redundant.
- Given all the metrics, how do we combine them for evaluation?
Mandi Information System
Mandi Information System
- A conversation system for accessing prices of agricultural commodities.
- Target users: farmers in rural and semi-urban areas.
- Commodities: vegetables, fruits, pulses, spices.
- Markets: Andhra Pradesh, India.
- Language: Telugu.
Issues/Challenges involved
Noisy environment:
- Target users are primarily farmers in rural and semi-urban areas.
- Quality of speech is affected by surrounding noises and also by the distance of the mobile/telephone handset or microphone.

Dialect/Pronunciation variation:
- Four distinct dialects of Telugu in Andhra Pradesh.
- Dialectal variation is a continuum.

Unstructured conversation:
- Target audience has no familiarity with computers.
- Conversation is unstructured, with disfluencies like repeats and false starts.

Personalization:
- Useful to guide the user in obtaining the required information as quickly as possible.
- Requires mining the user's previous calls to predict his/her preferred query during the next call.
Building a baseline system
Data collection:
- Users were asked to read out the names of the commodities, place names, etc. from the text provided to them.
- Users were shown a series of pictures of agricultural commodities and were asked to say the names of the commodities shown in the picture. This is to collect the dialectal variations of the commodity names.
- Users were asked a series of questions related to agriculture and the places around their locality. This is to record conversational speech.
Building a baseline system (contd)
Dialogue flow is as follows
Building a baseline system (contd)
A finite state dialogue manager was chosen because:
- The domain is limited and only three inputs are required.
- The data is well structured.
- It is a real-time application and the system is expected to be very robust.

A sample conversation with the system:

System: What is the name of the commodity?
User: Orange
System: Did you say orange?
User: Yes
Building a baseline system (contd)
Sample action schema for MIS: obtaining the commodity name

Preconditions:
  NOT(COMMODITY_IS_AVAILABLE)
Body:
  PROMPT('What is the commodity name?')
  RECORD_USER_UTT
  SPEECH_RECOGNITION
  GET_LIST_DISTRICTS_SELLING(COMMODITY)
Effect:
  # Commodity is obtained
  COMMODITY_IS_AVAILABLE
Building a baseline system (contd)
Obtaining the district name

Preconditions:
  COMMODITY_IS_AVAILABLE and NOT(DISTRICT_IS_AVAILABLE) and IS_AVAILABLE(LIST_DISTRICTS_SELLING(COMMODITY))
Body:
  PROMPT(LIST_DISTRICTS_SELLING(COMMODITY))
  PROMPT('What is the district name?')
  RECORD_USER_UTT
  SPEECH_RECOGNITION
  GET_LIST_DISTRICTS_SELLING(COMMODITY)
Effect:
  # District is obtained
  DISTRICT_IS_AVAILABLE
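The two schemas above can be sketched as data plus a small executor. The precondition/body/effect fields mirror the slides, while the planning loop and the stubbed body steps are illustrative assumptions:

```python
# Minimal sketch of executing the MIS action schemas. Body steps are
# recorded as strings rather than actually prompting or running ASR.

schemas = [
    {
        "name": "obtain_commodity",
        "preconditions": lambda s: not s["commodity_available"],
        "body": ["PROMPT('What is the commodity name?')",
                 "RECORD_USER_UTT", "SPEECH_RECOGNITION",
                 "GET_LIST_DISTRICTS_SELLING(COMMODITY)"],
        "effect": {"commodity_available": True},
    },
    {
        "name": "obtain_district",
        "preconditions": lambda s: s["commodity_available"]
                                   and not s["district_available"],
        "body": ["PROMPT(LIST_DISTRICTS_SELLING(COMMODITY))",
                 "PROMPT('What is the district name?')",
                 "RECORD_USER_UTT", "SPEECH_RECOGNITION"],
        "effect": {"district_available": True},
    },
]

def plan(state):
    """Repeatedly pick any schema whose preconditions hold, 'execute'
    its body, and apply its effect, until no schema is applicable."""
    executed = []
    progress = True
    while progress:
        progress = False
        for schema in schemas:
            if schema["preconditions"](state):
                executed.append(schema["name"])   # body steps would run here
                state.update(schema["effect"])
                progress = True
    return executed

state = {"commodity_available": False, "district_available": False}
print(plan(state))  # ['obtain_commodity', 'obtain_district']
```

Note how the effect of the first schema (COMMODITY_IS_AVAILABLE) is what makes the second schema's precondition true, so the plan orders itself without a hard-coded sequence.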
Characteristics of MIS baseline system
- MIS expects three concepts or inputs from the user: the commodity, district and market names.
- When a user provides some information to MIS, the goodness of the recognition hypothesis is checked via explicit confirmation. This makes sure the input query is right, as recognition is error prone.
- The strategy of explicit confirmation is hardly convenient, as the user has to confirm all the information provided in response to MIS queries.
Mandi Information System II
- A spoken dialogue system should provide accurate information to a user in fewer turns (or interactions).
- Speech recognition being error prone, it is difficult to avoid confirmations from users. However, the objective is to limit these confirmations.
- One approach is to associate a confidence score with the recognition output of the ASR.
- Based on this confidence measure we decide the dialogue flow of the system.
Multiple Decoders and Contextual Information
- One could build multiple ASR decoders, where each decoder tries to capture complementary information about the speech data. This can be done by training decoders using:
  - different training datasets,
  - different features, such as Mel-frequency cepstral coefficients or linear prediction cepstral coefficients.
- If a majority of these decoders agree on a hypothesis, the recognized output is considered reliable and the system can choose to avoid an explicit confirmation from the user.
Multiple Decoders and Contextual Information (contd)
Let {d1, d2} be a set of decoders and, for a given acoustic signal, let {h1, h2} be their corresponding hypotheses. Let C[i] be the contextual information for dialogue state i.

The possible cases are:

Case 1: h1 = h2 and h1 belongs to C[i]
Action: the recognition output is most likely correct and the system jumps to the subsequent dialogue state.

Case 2: h1 = h2 and h1, h2 do not belong to C[i]
Action: the recognition output is most likely correct, but the input is of no help as it is not present in the contextual information. The system prompts the user, saying that no information is available pertaining to the given input, and asks the user to provide some other query.
Multiple Decoders and Contextual Information (contd)
Case 3: h1 ≠ h2 and one of h1/h2 belongs to C[i]
Action: the system considers the hypothesis that is present in the contextual information and discards the other. To make sure the recognition is correct, the system asks for an explicit confirmation from the user.

Case 4: h1 ≠ h2 and h1, h2 do not belong to C[i]
Action: a mis-recognition might have occurred, and the system prompts the user to provide the information again.
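The four cases reduce to a small decision function. The sketch below is illustrative: the action labels are invented, not the system's actual interface, and C stands for the set of values valid in the current dialogue state:

```python
# Decision logic for two decoder hypotheses h1, h2 against the
# contextual information C of the current dialogue state.

def decide(h1, h2, C):
    if h1 == h2:
        if h1 in C:
            return ("accept", h1)          # Case 1: agree and in context
        return ("reprompt_no_info", None)  # Case 2: agree but out of context
    in_context = [h for h in (h1, h2) if h in C]
    if in_context:
        return ("confirm", in_context[0])  # Case 3: keep the in-context one
    return ("reprompt", None)              # Case 4: disagree, neither usable

C = {"Adoni", "Dhoni"}
print(decide("Dhoni", "Dhoni", C))   # ('accept', 'Dhoni')
print(decide("Dhoni", "Guti", C))    # ('confirm', 'Dhoni')
```

Only Case 3 costs the user an explicit confirmation; Cases 1 and 2 proceed without one, which is how MIS II reduces the number of confirmation turns.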
Multiple Decoders and Contextual Information (contd)
A sample conversation logged using MIS II:

System: What is the name of the district?
User: Karnul
  (CI: Karnul, Krishna, Nellur, …)
  (Recognition AM-1, AM-2: Karnul)
System: Red gram is sold in 6 markets in Karnul.
System: What is the market name?
User: Dhoni
  (CI: Adoni, Dhoni, …)
  (Recognition AM-1: Dhoni)
  (Recognition AM-2: Guti)
System: Did you say Dhoni?
User: Yes
MIS performance
- Performance was measured based on whether the user was able to retrieve the information or not.
- Evaluation was conducted with 8 users, where each user was asked to retrieve information for any 5 commodities. The performance of the MIS baseline system and MIS II is as follows:

System         Accuracy
MIS baseline   57.5% (23/40)
MIS II         77.5% (31/40)
Personalization
- A frequent caller expects his/her preferred query to be answered automatically.
- Such personalization requires identifying the user and mining the user's previous calls to predict his/her preferred query during the next call.
- Currently the system logs user calls based on the caller ID.
- If the user is a frequent caller, the system prompts a yes/no question about the most recent transaction they made. Example: Do you want information on Rice in Vishakhapatnam?
- The dialogue flow is user specific and based on user responses.
Number: 66150320 extension 2
MIS Demo
Thank You