transcript
- Slide 1
- SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent
Software Lab. POSTECH Prof. Gary Geunbae Lee
- Slide 2
- Introduction to Spoken Dialog Systems (SDS) for Human-Robot
Interaction (HRI) This tutorial is a brief introduction to SDS:
language-processing oriented rather than signal-processing oriented,
and mainly based on papers at ACL, NAACL, HLT, ICASSP, INTERSPEECH,
ASRU, SLT, SIGDIAL, CSL, SPECOM, and IEEE TASLP
- Slide 3
- OUTLINE INTRODUCTION AUTOMATIC SPEECH RECOGNITION SPOKEN
LANGUAGE UNDERSTANDING DIALOG MANAGEMENT CHALLENGES & ISSUES
MULTI-MODAL DIALOG SYSTEM DIALOG SIMULATOR DEMOS REFERENCES
- Slide 4
- INTRODUCTION
- Slide 5
- Human-Robot Interaction (in Movie)
- Slide 6
- Human-Robot Interaction (in Real World)
- Slide 7
- Wikipedia
(http://en.wikipedia.org/wiki/Human_robot_interaction) What is HRI?
Human-robot interaction (HRI) is the study of interactions between
people and robots. HRI is multidisciplinary, with contributions from
the fields of human-computer interaction, artificial intelligence,
robotics, natural language understanding, and social science. The
basic goal of HRI is to develop principles and algorithms to allow
more natural and effective communication and interaction between
humans and robots.
- Slide 8
- Signal Processing Speech Recognition Speech Understanding
Dialog Management Speech Synthesis Area of HRI Vision Speech
Haptics Emotion Learning
- Slide 9
- SPOKEN DIALOG SYSTEM (SDS)
- Slide 10
- SDS APPLICATIONS Tele-service, Car-navigation, Home networking,
Robot interface
- Slide 11
- Talk, Listen and Interact
- Slide 12
- AUTOMATIC SPEECH RECOGNITION
- Slide 13
- SCIENCE FICTION Eagle Eye (2008, D.J. Caruso)
- Slide 14
- AUTOMATIC SPEECH RECOGNITION [Diagram: speech input x, word
output y; training examples (x, y) feed a learning algorithm] A
process by which an acoustic speech signal is converted into a set
of words [Rabiner et al., 1993]
- Slide 15
- NOISY CHANNEL MODEL GOAL: Find the most likely sequence W of
words in language L given the sequence of acoustic observation
vectors O. Treat the acoustic input O as a sequence of individual
observations O = o_1, o_2, o_3, ..., o_t. Define a sentence as a
sequence of words W = w_1, w_2, w_3, ..., w_n. Bayes rule:
W* = argmax_{W in L} P(W|O) = argmax_{W in L} P(O|W)P(W) / P(O).
Golden rule: since P(O) does not depend on W,
W* = argmax_{W in L} P(O|W)P(W)
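The golden rule above can be sketched in a few lines: score each candidate word sequence by the product of its acoustic likelihood P(O|W) and language-model prior P(W), in log space, and pick the best. The hypotheses and probabilities below are made-up illustrations, not from a real recognizer.

```python
import math

# Toy noisy-channel decoding: pick the word sequence W maximizing
# P(O|W) * P(W). All scores here are illustrative assumptions.
def decode(candidates):
    """candidates: list of (words, acoustic_prob, lm_prob) tuples."""
    return max(candidates,
               key=lambda c: math.log(c[1]) + math.log(c[2]))[0]

hyps = [
    ("recognize speech", 0.20, 0.010),    # decent acoustics, likely LM
    ("wreck a nice beach", 0.30, 0.001),  # acoustically close, unlikely LM
]
print(decode(hyps))  # -> recognize speech
```

Note how the language model overrides the slightly better acoustic score, which is exactly the role P(W) plays in the golden rule.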
- Slide 16
- TRADITIONAL ARCHITECTURE [Diagram: Speech Signals → Feature
Extraction → Decoding → Word Sequence; Decoding uses a search
network built by Network Construction from an Acoustic Model (HMM
Estimation over a Speech DB), a Pronunciation Model (G2P), and a
Language Model (LM Estimation over Text Corpora)]
- Slide 17
- TRADITIONAL PROCESSES
- Slide 18
- FEATURE EXTRACTION Mel-Frequency Cepstral Coefficients (MFCC)
are a popular choice [Paliwal, 1992] Frame size: 25ms / Frame rate:
10ms. 39 features per 10ms frame. Absolute: Log Frame Energy (1)
and MFCCs (12). Delta: first-order derivatives of the 13 absolute
coefficients. Delta-Delta: second-order derivatives of the 13
absolute coefficients. [Pipeline: X(n), framed at 25ms every 10ms →
Preemphasis / Hamming Window → FFT (Fast Fourier Transform) →
Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) →
MFCC (12-dimension)]
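The pipeline above can be sketched end-to-end for one frame. This is a simplified illustration, not a production front-end: the sample rate, FFT size, and triangular mel filter-bank construction are common defaults assumed here, and liftering and energy terms are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=12):
    """MFCCs for one 25 ms frame (400 samples at an assumed 16 kHz)."""
    # Pre-emphasis, then Hamming window
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emph * np.hamming(len(emph))
    # Magnitude spectrum via FFT
    n_fft = 512
    spec = np.abs(np.fft.rfft(windowed, n_fft))
    # Triangular mel-scale filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then DCT to decorrelate
    log_e = np.log(fbank @ spec + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1),
                                  (2 * n + 1) / (2.0 * n_filters)))
    return dct @ log_e  # 12 cepstral coefficients

frame = np.random.randn(400)        # stand-in for 25 ms of audio
print(mfcc_frame(frame).shape)      # (12,)
```

Appending the log frame energy plus delta and delta-delta terms would give the 39-dimensional vector mentioned on the slide.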
- Slide 19
- ACOUSTIC MODEL Provides P(O|Q) = P(features|phone). Modeling
Units [Bahl et al., 1986] Context-independent: phoneme.
Context-dependent: diphone, triphone, quinphone; p_L - p + p_R is a
left-right context triphone. Typical acoustic model [Juang et al.,
1986] Continuous-density Hidden Markov Model. Distribution: Gaussian
mixture. HMM Topology: 3-state left-to-right model for each phone,
1-state for silence or pause. State output density: b_j(x)
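The state output density b_j(x) on the slide can be written out directly: a weighted sum of Gaussians evaluated at the feature vector x. The mixture parameters below (weights, means, diagonal covariances) are made-up values for illustration.

```python
import numpy as np

# b_j(x): Gaussian-mixture output density of one HMM state j,
# with diagonal covariances. Parameters are illustrative assumptions.
def gmm_density(x, weights, means, variances):
    d = len(x)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        total += w * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return total

x = np.array([0.1, -0.2])                      # a 2-D "feature vector"
weights = [0.6, 0.4]                           # mixture weights sum to 1
means = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(gmm_density(x, weights, means, variances))
```

In a real continuous-density HMM each of the 3 states per triphone carries its own mixture, trained with Baum-Welch over the speech database.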
- Slide 20
- PRONUNCIATION MODEL Provides P(Q|W) = P(phone|word). Word
Lexicon [Hazen et al., 2002] Maps legal phone sequences into words
according to phonotactic rules. G2P (grapheme-to-phoneme): generate
a word lexicon automatically. Several words may have multiple
pronunciations. Example: tomato P([towmeytow]|tomato) =
P([towmaatow]|tomato) = 0.1, P([tahmeytow]|tomato) =
P([tahmaatow]|tomato) = 0.4 [Pronunciation network: [t] → [ow]
(0.2) or [ah] (0.8) → [m] (1.0) → [ey] (0.5) or [aa] (0.5) → [t]
[ow] (1.0)]
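A word lexicon with pronunciation variants is just a table from words to weighted phone sequences; the sketch below mirrors the slide's tomato example, where the four variant probabilities come from multiplying the arc probabilities of the pronunciation network (e.g. 0.8 × 0.5 = 0.4).

```python
# Word lexicon sketch: each word maps to phone sequences with
# pronunciation probabilities P(Q|W), taken from the tomato example.
lexicon = {
    "tomato": [
        (("t", "ow", "m", "ey", "t", "ow"), 0.1),
        (("t", "ow", "m", "aa", "t", "ow"), 0.1),
        (("t", "ah", "m", "ey", "t", "ow"), 0.4),
        (("t", "ah", "m", "aa", "t", "ow"), 0.4),
    ],
}

def pron_prob(phones, word):
    """P(Q|W): probability of phone sequence Q given word W."""
    return dict(lexicon.get(word, [])).get(tuple(phones), 0.0)

print(pron_prob(["t", "ah", "m", "ey", "t", "ow"], "tomato"))  # 0.4
# Sanity check: variant probabilities for a word sum to 1
assert abs(sum(p for _, p in lexicon["tomato"]) - 1.0) < 1e-9
```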
- Slide 21
- LANGUAGE MODEL Provides P(W), the probability of the sentence
[Beaujard et al., 1999] We saw this was also used in the decoding
process as the probability of transitioning from one word to
another. Word sequence: W = w_1, w_2, w_3, ..., w_n. The problem is
that we cannot reliably estimate the conditional word probabilities
for all words and all sequence lengths in a given language. n-gram
Language Model: n-gram language models use the previous n-1 words
to represent the history. Bigrams are easily incorporated in a
Viterbi search
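A bigram model (n = 2) can be estimated by simple maximum likelihood over counts. The two-sentence corpus below is a made-up toy; a real model would also need smoothing for unseen bigrams, which is omitted here.

```python
from collections import Counter

# Minimal bigram LM with maximum-likelihood estimates; <s> and </s>
# mark sentence boundaries. Toy corpus, no smoothing.
corpus = [
    ["<s>", "show", "me", "flights", "</s>"],
    ["<s>", "show", "me", "restaurants", "</s>"],
]
history_counts = Counter(w for sent in corpus for w in sent[:-1])
bigram_counts = Counter((a, b) for sent in corpus
                        for a, b in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigram_counts[(prev, w)] / history_counts[prev]

def sentence_prob(sent):
    prob = 1.0
    for prev, w in zip(sent, sent[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(p_bigram("me", "show"))                                   # 1.0
print(sentence_prob(["<s>", "show", "me", "flights", "</s>"]))  # 0.5
```

Because each word's score depends only on the previous word, these probabilities attach naturally to word-to-word transitions in a Viterbi search network.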
- Slide 22
- LANGUAGE MODEL Example Finite State Network (FSN) Context Free
Grammar (CFG) Bigram P( | )=0.2 P( | )=0.5 P( | )=1.0 P( | )=0.5 P(
| )=0.5 P( | )=0.9 $time = | ; $city = | | | ; $trans = | ; $sent =
$city ( $time | $city ) $trans
- Slide 23
- NETWORK CONSTRUCTION Expanding every word to state level, we get
a search network [Demuynck et al., 1997] [Diagram: language-model
word sequence (e.g. IL, SAM) expanded via the pronunciation model
into phones (I, L, S, A, M) and via the acoustic model into HMM
states]
- Slide 24
- DECODING Find W* = argmax_W P(O|W)P(W). Viterbi Search: dynamic
programming. Token Passing Algorithm [Young et al., 1989]
Initialize all states with a token with a null history and the
likelihood that it is a start state. For each frame a_k: for each
token t in state s with probability P(t) and history H, for each
state r, add a new token to r with probability P(t) · P_{s,r} ·
P_r(a_k) and history s.H
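The token-passing loop above can be sketched over a tiny two-state model: at each frame every token propagates along the outgoing transitions, accumulates the transition and emission scores, and only the best token per destination state survives. The transition and emission probabilities are made-up values for illustration.

```python
import math

# Token-passing Viterbi sketch. Each token is (log-prob, state history);
# model parameters below are illustrative assumptions.
log = math.log
trans = {("s1", "s1"): log(0.6), ("s1", "s2"): log(0.4),
         ("s2", "s2"): log(0.9)}
emit = {"s1": {"a": log(0.7), "b": log(0.3)},
        "s2": {"a": log(0.2), "b": log(0.8)}}

def token_passing(obs):
    tokens = {"s1": (0.0, ["s1"])}  # one start token in state s1
    for o in obs:
        new_tokens = {}
        for s, (lp, hist) in tokens.items():
            for (src, dst), tlp in trans.items():
                if src != s:
                    continue
                cand = (lp + tlp + emit[dst][o], hist + [dst])
                # Keep only the best-scoring token per destination state
                if dst not in new_tokens or cand[0] > new_tokens[dst][0]:
                    new_tokens[dst] = cand
        tokens = new_tokens
    return max(tokens.values())[1]  # history of the best final token

print(token_passing(["a", "b", "b"]))  # ['s1', 's1', 's2', 's2']
```

Keeping one winner per state at each frame is exactly the dynamic-programming step that makes Viterbi search tractable; real recognizers add beam pruning on top.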
- Slide 25
- HTK Hidden Markov Model Toolkit (HTK) A portable toolkit for
building and manipulating hidden Markov models [Young et al., 1996]
- HShell : User I/O & interaction with OS - HLabel : Label
files - HLM : Language model - HNet : Network and lattices - HDic :
Dictionaries - HVQ : VQ codebooks - HModel : HMM definitions - HMem
: Memory management - HGrf : Graphics - HAdapt : Adaptation - HRec
: Main recognition processing functions
- Slide 26
- SUMMARY [Recap diagram: speech x and words y as training
examples (x, y) for a learning algorithm; acoustic, pronunciation,
and language models combined by network construction into a search
network for decoding]
- Slide 27
- Speech Understanding = Spoken Language Understanding (SLU)
- Slide 28
- SPEECH UNDERSTANDING (in general)
- Slide 29
- SPEECH UNDERSTANDING (in SDS) [Diagram: input x (speech or
words), output y (intentions); training examples (x, y) feed a
learning algorithm] A process by which natural language speech is
mapped to a frame structure encoding its meanings [Mori et al.,
2008]
- Slide 30
- LANGUAGE UNDERSTANDING What's the difference between NLU and
SLU? Robustness: noise and ungrammatical spoken language.
Domain-dependent: further deep-level semantics (e.g. Person vs.
Cast). Dialog: dialog-history dependent and utterance-by-utterance
analysis. Traditional approaches: natural language to SQL
conversion. [Diagram of a typical ATIS system: Speech → ASR → Text
→ SLU → Semantic Frame → SQL Generate → SQL → Database → Response
(from [Wang et al., 2005])]
- Slide 31
- REPRESENTATION Semantic frame (slot/value structure) [Gildea
and Jurafsky, 2002] An intermediate semantic representation to
serve as the interface between user and dialog system. Each frame
contains several typed components called slots; the type of a slot
specifies what kind of fillers it is expecting. Example: "Show me
flights from Seattle to Boston" → frame ShowFlight with subject
FLIGHT, whose Departure_City slot is SEA and Arrival_City slot is
BOS. Semantic representation on the ATIS task: XML format (left)
and hierarchical representation (right) [Wang et al., 2005]
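The ShowFlight frame from the slide can be sketched as a nested slot/value structure; the dict layout below is one plausible encoding, not the exact ATIS schema.

```python
# Sketch of the slide's ATIS example as a slot/value structure:
# the ShowFlight frame holds a typed FLIGHT sub-frame whose slots
# are filled with city codes from the utterance.
frame = {
    "intent": "ShowFlight",
    "subject": {
        "type": "FLIGHT",
        "slots": {
            "Departure_City": "SEA",  # "from Seattle"
            "Arrival_City": "BOS",    # "to Boston"
        },
    },
}

def fill_slot(frame, slot, value):
    """SLU output incrementally fills typed slots of the frame."""
    frame["subject"]["slots"][slot] = value
    return frame

print(frame["subject"]["slots"]["Departure_City"])  # SEA
```

Because the frame is typed, the dialog manager can check which slots are still empty and ask a follow-up question for each, which is what makes this representation a convenient interface between user and system.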
- Slide 32
- SEMANTIC FRAME Meaning Representations for Spoken Dialog
Systems Slot type 1: Intent, Subject Goal, Dialog Act (DA); the
meaning (intention) of an utterance at the discourse level. Slot
type 2: Component Slot, Named Entity (NE); the identifier of an
entity such as a person, location, organization, or time. In SLU,
it represents the domain-specific meaning of a word (or word
group). Ex) Find Korean restaurants in Daeyidong, Pohang (NEs:
Korean, Daeyidong, Pohang)
- Slide 33
- HOW TO SOLVE Two Classification Problems Dialog Act
Identification: Input: "Find Korean restaurants in Daeyidong,
Pohang" → Output: SEARCH_RESTAURANT. Named Entity Recognition:
Input: "Find Korean restaurants in Daeyidong, Pohang" → Output:
FOOD_TYPE, ADDRESS, CITY
- Slide 34
- PROBLEM FORMALIZATION Encoding: x is an input (word), y is an
output (NE), and z is another output (DA). Vector x = {x_1, x_2,
x_3, ..., x_T}; vector y = {y_1, y_2, y_3, ..., y_T}; scalar z.
Goal: modeling the functions y = f(x) and z = g(x). Example: x =
Find Korean restaurants in Daeyidong , Pohang . y = O FOOD_TYPE-B
O O ADDRESS-B O CITY-B O z = SEARCH_RESTAURANT
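The x/y/z encoding above can be written out concretely: one label per token for the named entities (B marking the beginning of an entity, O marking tokens outside any entity) and a single dialog act for the whole utterance. The small decoder below, which collects entities back out of the label sequence, is an illustrative sketch.

```python
# Encoding of the slide's example: tokens x, per-token NE labels y
# (B/O notation), and one utterance-level dialog act z.
x = ["Find", "Korean", "restaurants", "in", "Daeyidong", ",", "Pohang", "."]
y = ["O", "FOOD_TYPE-B", "O", "O", "ADDRESS-B", "O", "CITY-B", "O"]
z = "SEARCH_RESTAURANT"

def extract_entities(tokens, labels):
    """Collect (type, token) pairs from -B labels; this toy example
    has no multi-word entities, so -I continuations are not handled."""
    return [(lab.rsplit("-", 1)[0], tok)
            for tok, lab in zip(tokens, labels) if lab.endswith("-B")]

print(extract_entities(x, y))
# [('FOOD_TYPE', 'Korean'), ('ADDRESS', 'Daeyidong'), ('CITY', 'Pohang')]
```

With this encoding, NE recognition becomes a sequence-labeling problem (learn f: x → y) and dialog act identification a sentence-classification problem (learn g: x → z), the two classifiers named on the previous slide.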