transcript
- Slide 1
- SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent
Software Lab. POSTECH Prof. Gary Geunbae Lee
- Slide 2
- Introduction to Spoken Dialog Systems (SDS) for Human-Robot
Interaction (HRI) This tutorial is a brief introduction to SDS:
language-processing oriented rather than signal-processing oriented,
and mainly based on papers at ACL, NAACL, HLT, ICASSP, INTERSPEECH,
ASRU, SLT, SIGDIAL, CSL, SPECOM, and IEEE TASLP
- Slide 3
- OUTLINE INTRODUCTION AUTOMATIC SPEECH RECOGNITION SPOKEN
LANGUAGE UNDERSTANDING DIALOG MANAGEMENT CHALLENGES & ISSUES
MULTI-MODAL DIALOG SYSTEM DIALOG SIMULATOR DEMOS REFERENCES
- Slide 4
- INTRODUCTION
- Slide 5
- Human-Robot Interaction (in Movie)
- Slide 6
- Human-Robot Interaction (in Real World)
- Slide 7
- Wikipedia
(http://en.wikipedia.org/wiki/Human_robot_interaction) What is HRI?
Human-robot interaction (HRI) is the study of interactions between
people and robots. HRI is multidisciplinary, with contributions from
the fields of human-computer interaction, artificial intelligence,
robotics, natural language understanding, and social science. The
basic goal of HRI is to develop principles and algorithms to allow
more natural and effective communication and interaction between
humans and robots.
- Slide 8
- Signal Processing Speech Recognition Speech Understanding
Dialog Management Speech Synthesis Area of HRI Vision Speech
Haptics Emotion Learning
- Slide 9
- SPOKEN DIALOG SYSTEM (SDS)
- Slide 10
- SDS APPLICATIONS Tele-service, Car-navigation, Home networking,
Robot interface
- Slide 11
- Talk, Listen and Interact
- Slide 12
- AUTOMATIC SPEECH RECOGNITION
- Slide 13
- SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)
- Slide 14
- AUTOMATIC SPEECH RECOGNITION [Diagram: speech input x, word
output y; training examples (x, y) feed a learning algorithm] A
process by which an acoustic speech signal is converted into a set
of words [Rabiner et al., 1993]
- Slide 15
- NOISY CHANNEL MODEL GOAL: Find the most likely sequence W of
words in language L given the sequence of acoustic observation
vectors O. Treat the acoustic input O as a sequence of individual
observations O = o_1, o_2, o_3, ..., o_t. Define a sentence as a
sequence of words W = w_1, w_2, w_3, ..., w_n. Bayes rule:
W* = argmax_{W in L} P(W|O) = argmax_{W in L} P(O|W)P(W) / P(O).
Golden rule: since P(O) does not depend on W,
W* = argmax_{W in L} P(O|W)P(W)
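The golden rule above can be sketched in a few lines: score each candidate word sequence by the product of its acoustic likelihood P(O|W) and language-model prior P(W), in log space, and pick the best. The hypotheses and probabilities below are made-up illustrations, not from a real recognizer.

```python
import math

# Toy noisy-channel decoding: pick the word sequence W maximizing
# P(O|W) * P(W). All scores here are illustrative assumptions.
def decode(candidates):
    """candidates: list of (words, acoustic_prob, lm_prob) tuples."""
    return max(candidates,
               key=lambda c: math.log(c[1]) + math.log(c[2]))[0]

hyps = [
    ("recognize speech", 0.20, 0.010),    # decent acoustics, likely LM
    ("wreck a nice beach", 0.30, 0.001),  # acoustically close, unlikely LM
]
print(decode(hyps))  # -> recognize speech
```

Note how the language model overrides the slightly better acoustic score, which is exactly the role P(W) plays in the golden rule.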
- Slide 16
- TRADITIONAL ARCHITECTURE [Diagram: Speech Signals → Feature
Extraction → Decoding → Word Sequence; Decoding uses a search
network built by Network Construction from an Acoustic Model (HMM
Estimation over a Speech DB), a Pronunciation Model (G2P), and a
Language Model (LM Estimation over Text Corpora)]
- Slide 17
- TRADITIONAL PROCESSES
- Slide 18
- FEATURE EXTRACTION Mel-Frequency Cepstral Coefficients (MFCC)
are a popular choice [Paliwal, 1992] Frame size: 25ms / Frame rate:
10ms. 39 features per 10ms frame. Absolute: Log Frame Energy (1)
and MFCCs (12). Delta: first-order derivatives of the 13 absolute
coefficients. Delta-Delta: second-order derivatives of the 13
absolute coefficients. [Pipeline: X(n), framed at 25ms every 10ms →
Preemphasis / Hamming Window → FFT (Fast Fourier Transform) →
Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) →
MFCC (12-dimension)]
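The pipeline above can be sketched end-to-end for one frame. This is a simplified illustration, not a production front-end: the sample rate, FFT size, and triangular mel filter-bank construction are common defaults assumed here, and liftering and energy terms are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=12):
    """MFCCs for one 25 ms frame (400 samples at an assumed 16 kHz)."""
    # Pre-emphasis, then Hamming window
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emph * np.hamming(len(emph))
    # Magnitude spectrum via FFT
    n_fft = 512
    spec = np.abs(np.fft.rfft(windowed, n_fft))
    # Triangular mel-scale filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then DCT to decorrelate
    log_e = np.log(fbank @ spec + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1),
                                  (2 * n + 1) / (2.0 * n_filters)))
    return dct @ log_e  # 12 cepstral coefficients

frame = np.random.randn(400)        # stand-in for 25 ms of audio
print(mfcc_frame(frame).shape)      # (12,)
```

Appending the log frame energy plus delta and delta-delta terms would give the 39-dimensional vector mentioned on the slide.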
- Slide 19
- ACOUSTIC MODEL Provides P(O|Q) = P(features|phone). Modeling
Units [Bahl et al., 1986] Context-independent: phoneme.
Context-dependent: diphone, triphone, quinphone; p_L - p + p_R is a
left-right context triphone. Typical acoustic model [Juang et al.,
1986] Continuous-density Hidden Markov Model. Distribution: Gaussian
mixture. HMM Topology: 3-state left-to-right model for each phone,
1-state for silence or pause. State output density: b_j(x)
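The state output density b_j(x) on the slide can be written out directly: a weighted sum of Gaussians evaluated at the feature vector x. The mixture parameters below (weights, means, diagonal covariances) are made-up values for illustration.

```python
import numpy as np

# b_j(x): Gaussian-mixture output density of one HMM state j,
# with diagonal covariances. Parameters are illustrative assumptions.
def gmm_density(x, weights, means, variances):
    d = len(x)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        total += w * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return total

x = np.array([0.1, -0.2])                      # a 2-D "feature vector"
weights = [0.6, 0.4]                           # mixture weights sum to 1
means = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(gmm_density(x, weights, means, variances))
```

In a real continuous-density HMM each of the 3 states per triphone carries its own mixture, trained with Baum-Welch over the speech database.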
- Slide 20
- PRONUNCIATION MODEL Provides P(Q|W) = P(phone|word). Word
Lexicon [Hazen et al., 2002] Maps legal phone sequences into words
according to phonotactic rules. G2P (grapheme-to-phoneme): generate
a word lexicon automatically. Several words may have multiple
pronunciations. Example: tomato P([towmeytow]|tomato) =
P([towmaatow]|tomato) = 0.1, P([tahmeytow]|tomato) =
P([tahmaatow]|tomato) = 0.4 [Pronunciation network: [t] → [ow]
(0.2) or [ah] (0.8) → [m] (1.0) → [ey] (0.5) or [aa] (0.5) → [t]
[ow] (1.0)]
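A word lexicon with pronunciation variants is just a table from words to weighted phone sequences; the sketch below mirrors the slide's tomato example, where the four variant probabilities come from multiplying the arc probabilities of the pronunciation network (e.g. 0.8 × 0.5 = 0.4).

```python
# Word lexicon sketch: each word maps to phone sequences with
# pronunciation probabilities P(Q|W), taken from the tomato example.
lexicon = {
    "tomato": [
        (("t", "ow", "m", "ey", "t", "ow"), 0.1),
        (("t", "ow", "m", "aa", "t", "ow"), 0.1),
        (("t", "ah", "m", "ey", "t", "ow"), 0.4),
        (("t", "ah", "m", "aa", "t", "ow"), 0.4),
    ],
}

def pron_prob(phones, word):
    """P(Q|W): probability of phone sequence Q given word W."""
    return dict(lexicon.get(word, [])).get(tuple(phones), 0.0)

print(pron_prob(["t", "ah", "m", "ey", "t", "ow"], "tomato"))  # 0.4
# Sanity check: variant probabilities for a word sum to 1
assert abs(sum(p for _, p in lexicon["tomato"]) - 1.0) < 1e-9
```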
- Slide 21
- LANGUAGE MODEL Provides P(W), the probability of the sentence
[Beaujard et al., 1999] We saw this was also used in the decoding
process as the probability of transitioning from one word to
another. Word sequence: W = w_1, w_2, w_3, ..., w_n. The problem is
that we cannot reliably estimate the conditional word probabilities
for all words and all sequence lengths in a given language. n-gram
Language Model: n-gram language models use the previous n-1 words
to represent the history. Bigrams are easily incorporated in a
Viterbi search
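A bigram model (n = 2) can be estimated by simple maximum likelihood over counts. The two-sentence corpus below is a made-up toy; a real model would also need smoothing for unseen bigrams, which is omitted here.

```python
from collections import Counter

# Minimal bigram LM with maximum-likelihood estimates; <s> and </s>
# mark sentence boundaries. Toy corpus, no smoothing.
corpus = [
    ["<s>", "show", "me", "flights", "</s>"],
    ["<s>", "show", "me", "restaurants", "</s>"],
]
history_counts = Counter(w for sent in corpus for w in sent[:-1])
bigram_counts = Counter((a, b) for sent in corpus
                        for a, b in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigram_counts[(prev, w)] / history_counts[prev]

def sentence_prob(sent):
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(p_bigram("me", "show"))                                   # 1.0
print(sentence_prob(["<s>", "show", "me", "flights", "</s>"]))  # 0.5
```

Because each word's score depends only on the previous word, these probabilities attach naturally to word-to-word transitions in a Viterbi search network.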
- Slide 22
- LANGUAGE MODEL Example Finite State Network (FSN) Context Free
Grammar (CFG) Bigram P( | )=0.2 P( | )=0.5 P( | )=1.0 P( | )=0.5 P(
| )=0.5 P( | )=0.9 $time = | ; $city = | | | ; $trans = | ; $sent =
$city ( $time | $city ) $trans
- Slide 23
- NETWORK CONSTRUCTION Expanding every word to state level, we get
a search network [Demuynck et al., 1997] [Diagram: language-model
word sequence (e.g. IL, SAM) expanded via the pronunciation model
into phones (I, L, S, A, M) and via the acoustic model into HMM
states]
- Slide 24
- DECODING Find W* = argmax_W P(O|W)P(W). Viterbi Search: dynamic
programming. Token Passing Algorithm [Young et al., 1989]
Initialize all states with a token with a null history and the
likelihood that it is a start state. For each frame a_k: for each
token t in state s with probability P(t) and history H, for each
state r, add a new token to r with probability P(t) · P_{s,r} ·
P_r(a_k) and history s.H
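The token-passing loop above can be sketched over a tiny two-state model: at each frame every token propagates along the outgoing transitions, accumulates the transition and emission scores, and only the best token per destination state survives. The transition and emission probabilities are made-up values for illustration.

```python
import math

# Token-passing Viterbi sketch. Each token is (log-prob, state history);
# model parameters below are illustrative assumptions.
log = math.log
trans = {("s1", "s1"): log(0.6), ("s1", "s2"): log(0.4),
         ("s2", "s2"): log(0.9)}
emit = {"s1": {"a": log(0.7), "b": log(0.3)},
        "s2": {"a": log(0.2), "b": log(0.8)}}

def token_passing(obs):
    tokens = {"s1": (0.0, ["s1"])}  # one start token in state s1
    for o in obs:
        new_tokens = {}
        for s, (lp, hist) in tokens.items():
            for (src, dst), tlp in trans.items():
                if src != s:
                    continue
                cand = (lp + tlp + emit[dst][o], hist + [dst])
                # Keep only the best-scoring token per destination state
                if dst not in new_tokens or cand[0] > new_tokens[dst][0]:
                    new_tokens[dst] = cand
        tokens = new_tokens
    return max(tokens.values())[1]  # history of the best final token

print(token_passing(["a", "b", "b"]))  # ['s1', 's1', 's2', 's2']
```

Keeping one winner per state at each frame is exactly the dynamic-programming step that makes Viterbi search tractable; real recognizers add beam pruning on top.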
- Slide 25
- HTK Hidden Markov Model Toolkit (HTK) A portable toolkit for
building and manipulating hidden Markov models [Young et al., 1996]
- HShell : User I/O & interaction with OS - HLabel : Label
files - HLM : Language model - HNet : Network and lattices - HDic :
Dictionaries - HVQ : VQ codebooks - HModel : HMM definitions - HMem
: Memory management - HGrf : Graphics - HAdapt : Adaptation - HRec
: Main recognition processing functions
- Slide 26
- SUMMARY [Recap diagram: speech x and words y as training
examples (x, y) for a learning algorithm; acoustic, pronunciation,
and language models combined by network construction into a search
network for decoding]
- Slide 27
- Speech Understanding = Spoken Language Understanding (SLU)
- Slide 28
- SPEECH UNDERSTANDING (in general)
- Slide 29
- SPEECH UNDERSTANDING (in SDS) [Diagram: input x (speech or
words), output y (intentions); training examples (x, y) feed a
learning algorithm] A process by which natural language speech is
mapped to a frame structure encoding its meanings [Mori et al.,
2008]
- Slide 30
- LANGUAGE UNDERSTANDING What's the difference between NLU and
SLU? Robustness: noise and ungrammatical spoken language.
Domain-dependent: further deep-level semantics (e.g. Person vs.
Cast). Dialog: dialog-history dependent and utterance-by-utterance
analysis. Traditional approaches: natural language to SQL
conversion. [Diagram of a typical ATIS system: Speech → ASR → Text
→ SLU → Semantic Frame → SQL Generate → SQL → Database → Response
(from [Wang et al., 2005])]
- Slide 31
- REPRESENTATION Semantic frame (slot/value structure) [Gildea
and Jurafsky, 2002] An intermediate semantic representation to
serve as the interface between user and dialog system. Each frame
contains several typed components called slots; the type of a slot
specifies what kind of fillers it is expecting. Example: "Show me
flights from Seattle to Boston" → frame ShowFlight with subject
FLIGHT, whose Departure_City slot is SEA and Arrival_City slot is
BOS. Semantic representation on the ATIS task: XML format (left)
and hierarchical representation (right) [Wang et al., 2005]
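The ShowFlight frame from the slide can be sketched as a nested slot/value structure; the dict layout below is one plausible encoding, not the exact ATIS schema.

```python
# Sketch of the slide's ATIS example as a slot/value structure:
# the ShowFlight frame holds a typed FLIGHT sub-frame whose slots
# are filled with city codes from the utterance.
frame = {
    "intent": "ShowFlight",
    "subject": {
        "type": "FLIGHT",
        "slots": {
            "Departure_City": "SEA",  # "from Seattle"
            "Arrival_City": "BOS",    # "to Boston"
        },
    },
}

def fill_slot(frame, slot, value):
    """SLU output incrementally fills typed slots of the frame."""
    frame["subject"]["slots"][slot] = value
    return frame

print(frame["subject"]["slots"]["Departure_City"])  # SEA
```

Because the frame is typed, the dialog manager can check which slots are still empty and ask a follow-up question for each, which is what makes this representation a convenient interface between user and system.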
- Slide 32
- SEMANTIC FRAME Meaning Representations for Spoken Dialog
Systems Slot type 1: Intent, Subject Goal, Dialog Act (DA); the
meaning (intention) of an utterance at the discourse level. Slot
type 2: Component Slot, Named Entity (NE); the identifier of an
entity such as a person, location, organization, or time. In SLU,
it represents the domain-specific meaning of a word (or word
group). Ex) Find Korean restaurants in Daeyidong, Pohang (NEs:
Korean, Daeyidong, Pohang)
- Slide 33
- HOW TO SOLVE Two Classification Problems Dialog Act
Identification: Input: "Find Korean restaurants in Daeyidong,
Pohang" → Output: SEARCH_RESTAURANT. Named Entity Recognition:
Input: "Find Korean restaurants in Daeyidong, Pohang" → Output:
FOOD_TYPE, ADDRESS, CITY
- Slide 34
- PROBLEM FORMALIZATION Encoding: x is an input (word), y is an
output (NE), and z is another output (DA). Vector x = {x_1, x_2,
x_3, ..., x_T}; vector y = {y_1, y_2, y_3, ..., y_T}; scalar z.
Goal: modeling the functions y = f(x) and z = g(x). Example: x =
Find Korean restaurants in Daeyidong , Pohang . y = O FOOD_TYPE-B
O O ADDRESS-B O CITY-B O z = SEARCH_RESTAURANT
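The x/y/z encoding above can be written out concretely: one label per token for the named entities (B marking the beginning of an entity, O marking tokens outside any entity) and a single dialog act for the whole utterance. The small decoder below, which collects entities back out of the label sequence, is an illustrative sketch.

```python
# Encoding of the slide's example: tokens x, per-token NE labels y
# (B/O notation), and one utterance-level dialog act z.
x = ["Find", "Korean", "restaurants", "in", "Daeyidong", ",", "Pohang", "."]
y = ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "O", "CITY-B", "O"]
z = "SEARCH_RESTAURANT"

def extract_entities(tokens, labels):
    """Collect (type, token) pairs from -B labels; this toy example
    has no multi-word entities, so -I continuations are not handled."""
    return [(lab.rsplit("-", 1)[0], tok)
            for tok, lab in zip(tokens, labels) if lab.endswith("-B")]

print(extract_entities(x, y))
# [('FOOD_TYPE', 'Korean'), ('ADDRESS', 'Daeyidong'), ('CITY', 'Pohang')]
```

With this encoding, NE recognition becomes a sequence-labeling problem (learn f: x → y) and dialog act identification a sentence-classification problem (learn g: x → z), the two classifiers named on the previous slide.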