SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent Software Lab. POSTECH Prof. Gary Geunbae Lee
Transcript
  • Slide 1
  • SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent Software Lab. POSTECH Prof. Gary Geunbae Lee
  • Slide 2
  • This Tutorial: an introduction to Spoken Dialog Systems (SDS) for Human-Robot Interaction (HRI). It gives a brief introduction to SDS, oriented toward language processing rather than signal processing, and is based mainly on papers from ACL, NAACL, HLT, ICASSP, INTERSPEECH, ASRU, SLT, SIGDIAL, CSL, SPECOM, and IEEE TASLP.
  • Slide 3
  • OUTLINE: INTRODUCTION, AUTOMATIC SPEECH RECOGNITION, SPOKEN LANGUAGE UNDERSTANDING, DIALOG MANAGEMENT, CHALLENGES & ISSUES, MULTI-MODAL DIALOG SYSTEM, DIALOG SIMULATOR, DEMOS, REFERENCES
  • Slide 4
  • INTRODUCTION
  • Slide 5
  • Human-Robot Interaction (in Movie)
  • Slide 6
  • Human-Robot Interaction (in Real World)
  • Slide 7
  • What is HRI? (Wikipedia, http://en.wikipedia.org/wiki/Human_robot_interaction) Human-robot interaction (HRI) is the study of interactions between people and robots. HRI is multidisciplinary, with contributions from the fields of human-computer interaction, artificial intelligence, robotics, natural language understanding, and social science. The basic goal of HRI is to develop principles and algorithms that allow more natural and effective communication and interaction between humans and robots.
  • Slide 8
  • Areas of HRI: Vision, Speech, Haptics, Emotion, Learning. The speech area comprises Signal Processing, Speech Recognition, Speech Understanding, Dialog Management, and Speech Synthesis.
  • Slide 9
  • SPOKEN DIALOG SYSTEM (SDS)
  • Slide 10
  • SDS APPLICATIONS: Tele-service, Car navigation, Home networking, Robot interface
  • Slide 11
  • Talk, Listen and Interact
  • Slide 12
  • AUTOMATIC SPEECH RECOGNITION
  • Slide 13
  • SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)
  • Slide 14
  • AUTOMATIC SPEECH RECOGNITION: a process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993]. (Diagram: training examples (x, y) of speech x and words y feed a learning algorithm.)
  • Slide 15
  • NOISY CHANNEL MODEL. Goal: find the most likely sequence W of words in language L given the sequence of acoustic observation vectors O. Treat the acoustic input as a sequence of individual observations O = o1, o2, o3, ..., ot, and define a sentence as a sequence of words W = w1, w2, w3, ..., wn. By Bayes' rule, P(W|O) = P(O|W)P(W)/P(O); since P(O) is constant over hypotheses, this gives the golden rule of ASR: W* = argmax_W P(W|O) = argmax_W P(O|W)P(W).
  • Slide 16
  • TRADITIONAL ARCHITECTURE (diagram): speech signals undergo Feature Extraction, then Decoding over a search network built (Network Construction) from three models: the Acoustic Model (HMMs estimated from a speech DB), the Pronunciation Model (a lexicon produced by G2P), and the Language Model (estimated from text corpora). The output is a word sequence.
  • Slide 17
  • TRADITIONAL PROCESSES
  • Slide 18
  • FEATURE EXTRACTION. The Mel-Frequency Cepstral Coefficients (MFCC) are a popular choice [Paliwal, 1992]. Frame size: 25 ms; frame rate: 10 ms; 39 features per 10 ms frame. Absolute: log frame energy (1) and MFCCs (12); Delta: first-order derivatives of the 13 absolute coefficients; Delta-Delta: second-order derivatives of the 13 absolute coefficients. Pipeline: Preemphasis / Hamming window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) → MFCC (12 dimensions).
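The pipeline above can be sketched for a single 25 ms frame. This is a simplified illustration (filter-bank size, FFT length, and the synthetic input signal are assumptions, and preemphasis is omitted), not a production feature extractor.

```python
# Simplified MFCC pipeline for one frame: Hamming window -> FFT power spectrum
# -> triangular mel filter bank -> log -> DCT -> 12 cepstral coefficients.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=12, n_fft=512):
    # Windowed FFT power spectrum
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energies = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the filter-bank energies; keep coefficients 1..12
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps + 1), (2 * n + 1)) / (2 * n_filters))
    return (dct @ log_energies)[1:]

sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(int(0.025 * sr)) / sr)  # one 25 ms frame
print(mfcc_frame(frame).shape)  # -> (12,)
```

In a real front end this function would be applied every 10 ms, and the delta and delta-delta derivatives stacked on top to reach the 39-dimensional vector described on the slide.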
  • Slide 19
  • ACOUSTIC MODEL. Provides P(O|Q) = P(features|phone). Modeling units [Bahl et al., 1986]: context-independent (phoneme) or context-dependent (diphone, triphone, quinphone); pL-p+pR denotes a left-right-context triphone. A typical acoustic model [Juang et al., 1986] is a continuous-density Hidden Markov Model whose output distributions b_j(x) are Gaussian mixtures; the HMM topology is a 3-state left-to-right model for each phone, and a 1-state model for silence or pause.
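The state output probability b_j(x) mentioned above can be sketched directly. The mixture weights, means, and (diagonal) variances below are invented for illustration; in practice they are estimated from a speech database.

```python
# Sketch of a continuous-density HMM state's output probability b_j(x):
# a diagonal-covariance Gaussian mixture over a feature vector x.
import numpy as np

def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum_m c_m * N(x; mu_m, diag(var_m))."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        d = len(x)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        total += c * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return total

# Hypothetical 2-component mixture over 2-dimensional features
x = np.array([0.1, -0.2])
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(gmm_likelihood(x, weights, means, variances))
```

During decoding, each frame's feature vector is scored by b_j(x) for every active state j, which is where the acoustic model enters the search.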
  • Slide 20
  • PRONUNCIATION MODEL. Provides P(Q|W) = P(phone|word). The word lexicon [Hazen et al., 2002] maps legal phone sequences into words according to phonotactic rules; G2P (grapheme-to-phoneme) conversion can generate a word lexicon automatically. A word may have multiple pronunciations. Example, tomato: P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1 and P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4. (Diagram: a pronunciation network over the phones [t], [ow]/[ah], [m], [ey]/[aa], [t], [ow] with arc probabilities 0.2, 0.8, 1.0, 0.5, 1.0.)
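The four pronunciation probabilities on the slide follow from multiplying the branch probabilities along each path of the network, which the following sketch reproduces.

```python
# Sketch of P(Q|W) for the slide's "tomato" example: the probability of a
# phone sequence is the product of its branch probabilities in the network.

branch1 = {"ow": 0.2, "ah": 0.8}   # first vowel choice: [t] -> {[ow], [ah]}
branch2 = {"ey": 0.5, "aa": 0.5}   # second vowel choice: [m] -> {[ey], [aa]}

def pronunciation_prob(v1, v2):
    """P(phone sequence | 'tomato') for vowel choices v1, v2."""
    return branch1[v1] * branch2[v2]

print(pronunciation_prob("ow", "ey"))  # [t ow m ey t ow] -> 0.1
print(pronunciation_prob("ah", "aa"))  # [t ah m aa t ow] -> 0.4
```

Note the four path probabilities (0.1, 0.1, 0.4, 0.4) sum to 1, as a proper distribution over pronunciations must.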
  • Slide 21
  • LANGUAGE MODEL. Provides P(W), the probability of the sentence [Beaujard et al., 1999]; this is also used in the decoding process as the probability of transitioning from one word to another. For a word sequence W = w1, w2, w3, ..., wn, the problem is that we cannot reliably estimate the conditional word probabilities P(wn|w1, ..., wn-1) for all words and all sequence lengths in a given language. n-gram language models therefore use only the previous n-1 words to represent the history. Bigrams are easily incorporated in a Viterbi search.
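A bigram model as described above can be estimated by simple counting. The toy corpus below is invented for illustration; real models are estimated from large text corpora and smoothed.

```python
# Minimal bigram language model: P(w | h) = count(h, w) / count(h),
# estimated from a tiny made-up corpus (no smoothing).
from collections import Counter

corpus = [
    "<s> show me flights </s>",
    "<s> show me restaurants </s>",
    "<s> find me flights </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words[:-1])             # history counts (</s> is never a history)
    bigrams.update(zip(words, words[1:]))   # bigram counts

def p_bigram(w, h):
    return bigrams[(h, w)] / unigrams[h]

print(p_bigram("show", "<s>"))  # 2 of 3 sentences start with "show" -> 0.666...
print(p_bigram("me", "show"))   # "show" is always followed by "me" -> 1.0
```

In decoding, these conditional probabilities become the word-to-word transition weights of the search network.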
  • Slide 22
  • LANGUAGE MODEL examples: Finite State Network (FSN), Context Free Grammar (CFG), and bigram. Bigram probabilities from the slide: 0.2, 0.5, 1.0, 0.5, 0.5, 0.9 (the conditioning word pairs did not survive extraction). CFG rules: $time = | ; $city = | | | ; $trans = | ; $sent = $city ( $time | $city ) $trans
  • Slide 23
  • NETWORK CONSTRUCTION. Expanding every word to the state level yields a search network [Demuynck et al., 1997]. (Diagram: the words IL and SAM are expanded through the language model, pronunciation model, and acoustic model into the phone sequences I-L and S-A-M and their HMM states.)
  • Slide 24
  • DECODING. Find W* = argmax_W P(O|W)P(W) by Viterbi search (dynamic programming) using the token passing algorithm [Young et al., 1989]: initialize all states with a token carrying a null history and the likelihood that it is a start state; then, for each frame a_k, for each token t in state s with probability P(t) and history H, and for each successor state r, add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H.
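The token passing loop above can be sketched on a tiny hand-built model. The two-state HMM, its transition and emission probabilities, and the observation symbols are all invented for illustration; a real decoder works over the compiled search network with Gaussian-mixture emission scores.

```python
# Sketch of Viterbi token passing: each token carries a log score and a state
# history; per frame, only the best-scoring token per state is kept.
import math

trans = {("s1", "s1"): 0.6, ("s1", "s2"): 0.4, ("s2", "s2"): 0.9}  # P_{s,r}
emit = {"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.2, "b": 0.8}}    # P_r(a_k)
start = {"s1": 1.0}

def token_passing(frames):
    tokens = {s: (math.log(p), [s]) for s, p in start.items()}
    for obs in frames:
        new_tokens = {}
        for s, (logp, hist) in tokens.items():
            for (a, r), p_tr in trans.items():
                if a != s:
                    continue
                score = logp + math.log(p_tr) + math.log(emit[r][obs])
                # Viterbi recombination: keep the best token reaching state r
                if r not in new_tokens or score > new_tokens[r][0]:
                    new_tokens[r] = (score, hist + [r])
        tokens = new_tokens
    return max(tokens.values())[1]  # history of the best final token

print(token_passing(["a", "b", "b"]))  # -> ['s1', 's1', 's2', 's2']
```

Working in log probabilities, as here, avoids the numerical underflow that multiplying many small probabilities would cause over long utterances.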
  • Slide 25
  • HTK Hidden Markov Model Toolkit (HTK) A portable toolkit for building and manipulating hidden Markov models [Young et al., 1996] - HShell : User I/O & interaction with OS - HLabel : Label files - HLM : Language model - HNet : Network and lattices - HDic : Dictionaries - HVQ : VQ codebooks - HModel : HMM definitions - HMem : Memory management - HGrf : Graphics - HAdapt : Adaptation - HRec : Main recognition processing functions
  • Slide 26
  • SUMMARY (diagram): training examples (x, y) of speech x and words y feed a learning algorithm that estimates the acoustic, pronunciation, and language models; network construction compiles these into a search network, over which decoding finds the word sequence.
  • Slide 27
  • Speech Understanding = Spoken Language Understanding (SLU)
  • Slide 28
  • SPEECH UNDERSTANDING (in general)
  • Slide 29
  • SPEECH UNDERSTANDING (in SDS): a process by which natural language speech is mapped to a frame structure encoding its meaning [Mori et al., 2008]. (Diagram: training examples (x, y) of input speech or words x and output intentions y feed a learning algorithm.)
  • Slide 30
  • LANGUAGE UNDERSTANDING. What's the difference between NLU and SLU? Robustness: SLU must handle noise and ungrammatical spoken language. Domain dependence: SLU needs deeper, domain-specific semantics (e.g. Person vs. Cast). Dialog: analysis is dialog-history dependent and proceeds utterance by utterance. Traditional approaches converted natural language to SQL. (Diagram of a typical ATIS system [Wang et al., 2005]: speech → ASR → text → SLU → semantic frame → SQL generation → database → response.)
  • Slide 31
  • REPRESENTATION. A semantic frame (slot/value structure) [Gildea and Jurafsky, 2002] is an intermediate semantic representation that serves as the interface between user and dialog system. Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it expects. Example: "Show me flights from Seattle to Boston" maps to a ShowFlight frame whose subject is a FLIGHT with Departure_City = SEA and Arrival_City = BOS. Semantic representations on the ATIS task can be written in XML or as a hierarchical frame [Wang et al., 2005].
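The slide's slot/value structure can be sketched as a small data structure; the frame and slot names follow the slide, while the Frame class itself is just one possible encoding.

```python
# Sketch of the semantic frame for "Show me flights from Seattle to Boston"
# (frame and slot names as on the slide; the class is an illustrative encoding).
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    slots: dict = field(default_factory=dict)

frame = Frame(
    name="ShowFlight",
    slots={"subject": "FLIGHT", "Departure_City": "SEA", "Arrival_City": "BOS"},
)

print(frame.name, frame.slots["Departure_City"], frame.slots["Arrival_City"])
# -> ShowFlight SEA BOS
```

Typed slots make the representation easy to validate and to translate into a database query, which is exactly its role as the user/system interface.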
  • Slide 32
  • SEMANTIC FRAME. Meaning representations for a spoken dialog system use two slot types. Slot type 1: intent, subject goal, or dialog act (DA), the meaning (intention) of an utterance at the discourse level. Slot type 2: component slot, or named entity (NE), the identifier of an entity such as a person, location, organization, or time; in SLU it represents the domain-specific meaning of a word (or word group). Example: "Find Korean restaurants in Daeyidong, Pohang" (entities: Korean, Daeyidong, Pohang).
  • Slide 33
  • HOW TO SOLVE: two classification problems. Dialog act identification: input "Find Korean restaurants in Daeyidong, Pohang", output SEARCH_RESTAURANT. Named entity recognition: same input, output FOOD_TYPE (Korean), ADDRESS (Daeyidong), CITY (Pohang).
  • Slide 34
  • PROBLEM FORMALIZATION. Encoding: x is an input (word), y is an output (NE), and z is another output (DA). Vectors x = {x1, x2, x3, ..., xT} and y = {y1, y2, y3, ..., yT}; scalar z. Goal: model the functions y = f(x) and z = g(x). Example: x = [Find, Korean, restaurants, in, Daeyidong, ",", Pohang, "."]; y = [O, FOOD_TYPE-B, O, O, ADDRESS-B, O, CITY-B, O]; z = SEARCH_RESTAURANT.
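The formalization above can be written out concretely: the utterance as a token vector x, per-token NE tags y in the slide's B/O encoding, and the utterance-level dialog act z. The small helper that reads entities back out of the tags is an illustrative addition, not part of the slide.

```python
# The slide's encoding for "Find Korean restaurants in Daeyidong, Pohang":
# token vector x, NE tag vector y (B/O scheme), and dialog act scalar z.

x = ["Find", "Korean", "restaurants", "in", "Daeyidong", ",", "Pohang", "."]
y = ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "O", "CITY-B", "O"]
z = "SEARCH_RESTAURANT"

def extract_entities(tokens, tags):
    """Recover (entity_type, word) pairs from the tag sequence (helper added
    for illustration)."""
    return [(tag.rsplit("-", 1)[0], tok)
            for tok, tag in zip(tokens, tags) if tag != "O"]

print(extract_entities(x, y))
# -> [('FOOD_TYPE', 'Korean'), ('ADDRESS', 'Daeyidong'), ('CITY', 'Pohang')]
print(z)  # -> SEARCH_RESTAURANT
```

With this encoding, NE recognition becomes sequence labeling (y = f(x)) and dialog act identification becomes utterance classification (z = g(x)).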
