SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent Software Lab. POSTECH Prof. Gary Geunbae Lee
Transcript
Slide 1
SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent
Software Lab. POSTECH Prof. Gary Geunbae Lee
Slide 2
Introduction to Spoken Dialog System (SDS) for Human-Robot
Interaction (HRI) Brief introduction to SDS Language processing
oriented But not signal processing oriented Mainly based on papers
at ACL, NAACL, HLT, ICASSP, INTERSPEECH, ASRU, SLT, SIGDIAL, CSL,
SPECOM, IEEE TASLP This Tutorial
Slide 3
OUTLINE INTRODUCTION AUTOMATIC SPEECH RECOGNITION SPOKEN
LANGUAGE UNDERSTANDING DIALOG MANAGEMENT CHALLENGES & ISSUES
MULTI-MODAL DIALOG SYSTEM DIALOG SIMULATOR DEMOS REFERENCES
Slide 4
INTRODUCTION
Slide 5
Human-Robot Interaction (in Movie)
Slide 6
Human-Robot Interaction (in Real World)
Slide 7
Wikipedia
(http://en.wikipedia.org/wiki/Human_robot_interaction) What is HRI?
Human-robot interaction (HRI) is the study of interactions between
people and robots. HRI is multidisciplinary, with contributions from
the fields of human-computer interaction, artificial intelligence,
robotics, natural language understanding, and social science. The
basic goal of HRI is to develop principles and algorithms to allow
more natural and effective communication and interaction between
humans and robots.
Slide 8
Area of HRI: Vision, Speech, Haptics, Emotion, Learning. Speech
pipeline: Signal Processing, Speech Recognition, Speech Understanding,
Dialog Management, Speech Synthesis
Slide 9
SPOKEN DIALOG SYSTEM (SDS)
Slide 10
SDS APPLICATIONS Tele-service, Car navigation, Home networking,
Robot interface
Slide 11
Talk, Listen and Interact
Slide 12
AUTOMATIC SPEECH RECOGNITION
Slide 13
SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)
Slide 14
AUTOMATIC SPEECH RECOGNITION A process by which an acoustic speech
signal is converted into a set of words [Rabiner et al., 1993].
[Diagram: Speech x mapped to Words y by a learning algorithm trained
from (x, y) training examples.]
Slide 15
NOISY CHANNEL MODEL GOAL: Find the most likely sequence of words W
in language L given the sequence of acoustic observation vectors O.
Treat the acoustic input O as a sequence of individual observations
O = o_1, o_2, o_3, ..., o_t. Define a sentence as a sequence of
words: W = w_1, w_2, w_3, ..., w_n. Bayes rule gives the golden rule:
W* = argmax_W P(W|O) = argmax_W [P(O|W) P(W) / P(O)]
   = argmax_W P(O|W) P(W)
Slide 16
TRADITIONAL ARCHITECTURE Feature Extraction Decoding Acoustic
Model Pronunciation Model Language Model ? Speech Signals Word
Sequence ? Network Construction Speech DB Text Corpora HMM
Estimation G2P LM Estimation
Slide 17
TRADITIONAL PROCESSES
Slide 18
FEATURE EXTRACTION The Mel-Frequency Cepstrum Coefficients (MFCC)
are a popular choice [Paliwal, 1992]. Frame size: 25 ms / frame
rate: 10 ms; 39 features per 10 ms frame. Absolute: log frame
energy (1) and MFCCs (12). Delta: first-order derivatives of the 13
absolute coefficients. Delta-Delta: second-order derivatives of the
13 absolute coefficients. Pipeline: x(n) → Preemphasis / Hamming
window (25 ms frames every 10 ms) → FFT (Fast Fourier Transform) →
Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) →
MFCC (12-dimension)
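The pipeline above can be sketched end to end for a single 25 ms frame. This is a minimal, simplified construction (in particular the triangular mel filter bank and the DCT matrix are built by hand); production systems use tools such as HTK's HCopy or librosa instead.

```python
import numpy as np

# Minimal sketch of the MFCC front end: pre-emphasis, Hamming window,
# FFT, mel-scale filter bank, log, DCT. Simplified for illustration.

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=12):
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))                      # Hamming window
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum

    # Triangular mel-scale filter bank (simplified construction)
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595) - 1)
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_points)
                    / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energies = np.log(fbank @ spectrum + 1e-10)             # log|.|
    # DCT-II, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_energies                                   # 12-dim MFCC

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # one 25 ms frame
coeffs = mfcc_frame(frame)
```

Stacking the 12 cepstra with log frame energy, then appending deltas and delta-deltas, yields the 39-dimensional vector per 10 ms frame described on the slide.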
Slide 19
ACOUSTIC MODEL Provide P(O|Q) = P(features|phone). Modeling units
[Bahl et al., 1986]: Context-independent: Phoneme. Context-dependent:
Diphone, Triphone, Quinphone; p_L-p+p_R: left-right context triphone.
Typical acoustic model [Juang et al., 1986]: Continuous-density
Hidden Markov Model. Distribution: Gaussian mixture (codebook,
b_j(x)). HMM topology: 3-state left-to-right model for each phone,
1-state for silence or pause
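The Gaussian-mixture emission density b_j(x) mentioned above can be written out directly. This is a one-dimensional sketch with made-up mixture weights and parameters; real acoustic models use 39-dimensional features and trained parameters.

```python
import math

# Sketch of a continuous-density emission probability b_j(x): a
# Gaussian mixture over a 1-D feature. Parameters are illustrative.

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_emission(x, mixture):
    """b_j(x) = sum_m c_m * N(x; mu_m, var_m)."""
    return sum(c * gaussian(x, mu, var) for c, mu, var in mixture)

# One HMM state's mixture: list of (weight, mean, variance) components.
state_mixture = [(0.6, 0.0, 1.0), (0.4, 3.0, 2.0)]
p = gmm_emission(0.0, state_mixture)
```

Each of the three emitting states in a phone's left-to-right HMM carries its own mixture like this, and the mixture weights sum to one.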
Slide 20
PRONUNCIATION MODEL Provide P(Q|W) = P(phone|word). Word Lexicon
[Hazen et al., 2002]: Map legal phone sequences into words according
to phonotactic rules. G2P (Grapheme to phoneme): Generate a word
lexicon automatically. Several words may have multiple pronunciations.
Example: Tomato. P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1;
P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4. [Diagram:
pronunciation network over phones [t] [ow]/[ah] [m] [ey]/[aa] [t]
[ow] with branch probabilities 0.2/0.8 and 0.5/0.5.]
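The tomato example can be encoded as a small probabilistic lexicon: each word maps to its legal phone sequences, with probabilities summing to one, and any other sequence gets probability zero.

```python
# Sketch of a word lexicon giving P(Q|W), using the "tomato" example
# from the slide. Pronunciations are phone-sequence tuples.

lexicon = {
    "tomato": {
        ("t", "ow", "m", "ey", "t", "ow"): 0.1,
        ("t", "ow", "m", "aa", "t", "ow"): 0.1,
        ("t", "ah", "m", "ey", "t", "ow"): 0.4,
        ("t", "ah", "m", "aa", "t", "ow"): 0.4,
    }
}

def pron_prob(phones, word):
    """P(Q|W): probability of a phone sequence given a word (0 if illegal)."""
    return lexicon.get(word, {}).get(tuple(phones), 0.0)

p = pron_prob(["t", "ah", "m", "ey", "t", "ow"], "tomato")
```

A G2P system fills in tables like this automatically for words missing from the hand-built lexicon.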
Slide 21
LANGUAGE MODEL Provide P(W), the probability of the sentence
[Beaujard et al., 1999]. We saw this was also used in the decoding
process as the probability of transitioning from one word to
another. Word sequence: W = w_1, w_2, w_3, ..., w_n. The problem is
that we cannot reliably estimate the conditional word probabilities
for all words and all sequence lengths in a given language. n-gram
Language Model: n-gram language models use the previous n-1 words to
represent the history. Bigrams are easily incorporated in a Viterbi
search
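A bigram model makes the estimation problem tractable by conditioning each word only on its predecessor. The sketch below estimates bigram probabilities by maximum likelihood from a toy corpus; it applies no smoothing, so unseen bigrams get probability zero (real systems smooth and work in log space).

```python
from collections import Counter

# Minimal bigram language model: P(W) ~ product of P(w_i | w_{i-1}),
# estimated by maximum likelihood from a toy corpus. No smoothing.

corpus = [["<s>", "show", "me", "flights", "</s>"],
          ["<s>", "show", "me", "restaurants", "</s>"],
          ["<s>", "find", "flights", "</s>"]]

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s[:-1])

def sentence_prob(words):
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

Because each factor depends only on the previous word, these probabilities attach naturally to word-to-word transitions in a Viterbi search network.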
NETWORK CONSTRUCTION Expanding every word to the state level, we
get a search network [Demuynck et al., 1997]. [Diagram: the words
IL and SAM expanded through the Language Model, the Pronunciation
Model (phones I, L, S, A, M), and the Acoustic Model layers.]
Slide 24
DECODING Find W* = argmax_W P(O|W) P(W). Viterbi Search: Dynamic
Programming. Token Passing Algorithm [Young et al., 1989]:
Initialize all states with a token with a null history and the
likelihood that it's a start state. For each frame a_k: for each
token t in state s with probability P(t) and history H, for each
state r, add a new token to r with probability
P(t) * P_{s,r} * P_r(a_k) and history s.H
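The token-passing loop above is the Viterbi dynamic program: at each frame, each state keeps only the best-scoring incoming token. The sketch below implements it over a toy two-state HMM with made-up parameters; a real decoder works in log space over the full search network.

```python
# Sketch of Viterbi decoding, the dynamic program that token passing
# implements: one surviving (probability, history) token per state.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for the observation sequence."""
    tokens = {s: (start_p[s] * emit_p[s][observations[0]], [s])
              for s in states}
    for obs in observations[1:]:
        new_tokens = {}
        for r in states:
            # pass tokens s -> r; keep the best P(t) * P_{s,r} * P_r(obs)
            new_tokens[r] = max((tokens[s][0] * trans_p[s][r] * emit_p[r][obs],
                                 tokens[s][1] + [r]) for s in states)
        tokens = new_tokens
    prob, path = max(tokens.values())
    return path, prob

# Toy two-state model with a two-symbol observation alphabet.
states = ["S1", "S2"]
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.3, "S2": 0.7}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}
path, prob = viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p)
```

Each token carries its own history, so reading off the best path needs no separate backtracking pass, which is the practical appeal of the token-passing formulation.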
Slide 25
HTK Hidden Markov Model Toolkit (HTK) A portable toolkit for
building and manipulating hidden Markov models [Young et al., 1996]
- HShell : User I/O & interaction with OS - HLabel : Label
files - HLM : Language model - HNet : Network and lattices - HDic :
Dictionaries - HVQ : VQ codebooks - HModel : HMM definitions - HMem
: Memory management - HGrf : Graphics - HAdapt : Adaptation - HRec
: Main recognition processing functions
Slide 26
SUMMARY Training examples (x, y) and a learning algorithm map
speech x to words y. Network construction combines the Acoustic
Model, Pronunciation Model, and Language Model into a search
network (e.g., words IL and SAM from phones I, L, S, A, M);
decoding searches this network.
Slide 27
Speech Understanding = Spoken Language Understanding (SLU)
Slide 28
SPEECH UNDERSTANDING (in general)
Slide 29
SPEECH UNDERSTANDING (in SDS) A process by which natural language
speech is mapped to a frame structure encoding of its meanings
[Mori et al., 2008]. [Diagram: Input x (speech or words) mapped to
Output y (intentions) by a learning algorithm trained from (x, y)
training examples.]
Slide 30
LANGUAGE UNDERSTANDING What's the difference between NLU and SLU?
Robustness: noise and ungrammatical spoken language.
Domain-dependent: further deep-level semantics (e.g., Person vs.
Cast). Dialog: dialog-history dependent and utterance-by-utterance
analysis. Traditional approaches: natural language to SQL
conversion. A typical ATIS system (from [Wang et al., 2005]):
Speech → ASR → Text → SLU → Semantic Frame → SQL Generate → SQL →
Database → Response
Slide 31
REPRESENTATION Semantic frame (slot/value structure) [Gildea and
Jurafsky, 2002]: an intermediate semantic representation to serve
as the interface between user and dialog system. Each frame
contains several typed components called slots; the type of a slot
specifies what kind of fillers it is expecting. Example: "Show me
flights from Seattle to Boston" → frame ShowFlight, subject FLIGHT,
with slots Departure_City = SEA and Arrival_City = BOS. Semantic
representation on the ATIS task: XML format (left) and hierarchical
representation (right) [Wang et al., 2005]
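A slot/value frame maps naturally onto a nested dictionary. The sketch below encodes the ShowFlight example; the slot names mirror those on the slide, but the exact schema and the `fill_slot` helper are illustrative, not the precise format of [Wang et al., 2005].

```python
# Sketch of the slot/value semantic frame for
# "Show me flights from Seattle to Boston". Schema is illustrative.

frame = {
    "intent": "ShowFlight",
    "subject": "FLIGHT",
    "Flight": {
        "Departure_City": "SEA",
        "Arrival_City": "BOS",
    },
}

def fill_slot(frame, path, value):
    """Fill a (possibly nested) typed slot, e.g. ["Flight", "Airline"]."""
    node = frame
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value
    return frame

fill_slot(frame, ["Flight", "Airline"], "UA")
```

The dialog manager reads and writes frames like this one, which is what makes the frame a convenient interface between the SLU module and the rest of the system.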
Slide 32
Meaning Representations for Spoken Dialog System SEMANTIC FRAME
Slot type 1: Intent, Subject Goal, Dialog Act (DA): the meaning
(intention) of an utterance at the discourse level. Slot type 2:
Component Slot, Named Entity (NE): the identifier of an entity such
as person, location, organization, or time; in SLU, it represents
the domain-specific meaning of a word (or word group). Ex) Find
Korean restaurants in Daeyidong, Pohang
Slide 33
HOW TO SOLVE Two Classification Problems. Dialog Act
Identification: Input: "Find Korean restaurants in Daeyidong,
Pohang" → Output: SEARCH_RESTAURANT. Named Entity Recognition:
Input: "Find Korean restaurants in Daeyidong, Pohang" → Output:
FOOD_TYPE (Korean), ADDRESS (Daeyidong), CITY (Pohang)
Slide 34
PROBLEM FORMALIZATION Encoding: x is an input (word), y is an
output (NE), and z is another output (DA). Vector x = {x_1, x_2,
x_3, ..., x_T}; vector y = {y_1, y_2, y_3, ..., y_T}; scalar z.
Goal: modeling the functions y = f(x) and z = g(x).
x: Find | Korean | restaurants | in | Daeyidong | , | Pohang | .
y: O | FOOD_TYPE-B | O | O | ADDRESS-B | O | CITY-B | O
z: SEARCH_RESTAURANT
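The two functions y = f(x) and z = g(x) can be sketched with trivial lookup-table "classifiers". The lexicon and keyword tables below are toy stand-ins for the trained sequence and utterance classifiers a real SLU module would use; only the input/output shapes match the formalization above.

```python
# Sketch of the SLU formalization: f tags each word with a BIO-style
# NE label, g assigns a dialog act to the whole utterance.
# The lookup tables are toy stand-ins for trained classifiers.

NE_LEXICON = {"korean": "FOOD_TYPE-B", "daeyidong": "ADDRESS-B",
              "pohang": "CITY-B"}
DA_KEYWORDS = {"find": "SEARCH_RESTAURANT", "search": "SEARCH_RESTAURANT"}

def f(words):                       # named entity recognition, y = f(x)
    return [NE_LEXICON.get(w.strip(",.").lower(), "O") for w in words]

def g(words):                       # dialog act identification, z = g(x)
    for w in words:
        if w.lower() in DA_KEYWORDS:
            return DA_KEYWORDS[w.lower()]
    return "UNKNOWN"

x = ["Find", "Korean", "restaurants", "in", "Daeyidong,", "Pohang", "."]
y = f(x)
z = g(x)
```

Note that y has one label per input token while z is a single scalar for the whole utterance, exactly the vector/scalar split in the encoding above; in practice f is a sequence labeler (e.g., a CRF) and g a sentence classifier.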