CS 124/LINGUIST 180 From Languages to Information Conversational Agents Dan Jurafsky Stanford University
Page 1: CS 124/LINGUIST 180 From Languages to Information (web.stanford.edu/class/cs124/lec/chatbot18.pdf)

CS 124/LINGUIST 180
From Languages to Information

Conversational Agents

Dan Jurafsky, Stanford University

Page 2

Conversational Agents AKA Dialog Agents

Phone-based Personal Assistants: SIRI, Alexa, Cortana, Google Assistant

Talking to your car
Communicating with robots
Clinical uses for mental health
Chatting for fun

Page 3

Two classes of systems

1. Chatbots
2. (Goal-based) Dialog agents
   • SIRI, interfaces to cars, robots
   • booking flights or restaurants

Recently I've noticed that the word "chatbots" is sometimes used in the popular press for both. I'll use it only for #1.

Page 4

Part I: Chatbots!
• ELIZA (1966)
• PARRY (1968)
  • The first system to pass the Turing test!
• ALICE
• CLEVER
• Microsoft Little Bing 小冰

Page 5

Chatbot Architectures

Rule-based:
1. Pattern-action rules (Eliza)
   + a mental model (Parry)

Corpus-based (from a large chat corpus):
2. Information retrieval
3. Neural network encoder-decoder

Page 6

Chatbots

Sometimes for fun

But originally designed as a testbed for clinical therapy

(And still used that way!)

Page 7

Eliza: Weizenbaum (1966)

Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
...
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU?
My father
YOUR FATHER

Page 8

You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I'M AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE

Page 9

How does it work? Eliza uses this one weird trick

Page 10

That trick: be a Rogerian psychologist

Draw the patient out by reflecting the patient's statements back at them. A rare type of conversation, in which one can "assume the pose of knowing almost nothing of the real world".

Page 11

Rogerian psychologist

Patient: "I went for a long boat ride"
Psychiatrist: "Tell me about boats"

• You don't assume she didn't know what a boat is
• You assume she had some conversational goal
• Chatbots trying to pass the Turing test usually choose some such domain

Page 12

Eliza pattern/transform rules

(0 YOU 0 ME)  [pattern]
→ (WHAT MAKES YOU THINK I 3 YOU)  [transform]

0 means Kleene *
The 3 is the constituent # in the pattern

You hate me
WHAT MAKES YOU THINK I HATE YOU
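A minimal sketch of this rule as a regular expression (a modern stand-in for Weizenbaum's matcher; the function name is illustrative):

```python
import re

# Sketch of one ELIZA rule. The pattern (0 YOU 0 ME) becomes a regex in
# which each "0" is a Kleene-star gap; constituents are numbered
# (1)=first gap, (2)=YOU, (3)=second gap, (4)=ME.
PATTERN = re.compile(r"^(.*)\bYOU\b(.*)\bME\b(.*)$", re.IGNORECASE)

def apply_rule(sentence):
    m = PATTERN.match(sentence)
    if m is None:
        return None          # rule does not apply; try another rule
    # regex group 2 holds constituent 3 of (0 YOU 0 ME)
    return "WHAT MAKES YOU THINK I{}YOU".format(m.group(2).upper())

print(apply_rule("You hate me"))             # WHAT MAKES YOU THINK I HATE YOU
print(apply_rule("It seems that you hate"))  # None: no "me" to anchor on
```

The second call fails exactly as the Weizenbaum excerpt below describes: "ME" is never found, so another rule must be tried.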

Page 13

Eliza Rules

A formal notation in which to represent the decomposition template is:

(0 YOU 0 ME)

and the reassembly rule

(WHAT MAKES YOU THINK I 3 YOU).

The "0" in the decomposition rule stands for "an indefinite number of words" (analogous to the indefinite dollar sign of COMIT) [6] while the "3" in the reassembly rule indicates that the third component of the subject decomposition is to be inserted in its place. The decomposition rule

(0 YOU 1 ME)

would have worked just as well in this specific example. A nonzero integer "n" appearing in a decomposition rule indicates that the component in question should consist of exactly "n" words. However, of the two rules shown, only the first would have matched the sentence, "It seems you hate and love me," the second failing because there is more than one word between "you" and "me".

[Fig. 1. Keyword and rule list structure; tree diagram not reproduced.]

In ELIZA the question of which decomposition rules to apply to an input text is of course a crucial one. The input sentence might have been, for example, "It seems that you hate," in which case the decomposition rule (0 YOU 0 ME) would have failed in that the word "ME" would not have been found at all, let alone in its assigned place. Some other decomposition rule would then have to be tried and, failing that, still another until a match could be made or a total failure reported. ELIZA must therefore have a mechanism to sharply delimit the set of decomposition rules which are potentially applicable to a currently active input sentence. This is the keyword mechanism.

An input sentence is scanned from left to right. Each word is looked up in a dictionary of keywords. If a word is identified as a keyword, then (apart from the issue of precedence of keywords) only decomposition rules containing that keyword need to be tried. The trial sequence can even be partially ordered. For example, the decomposition rule (0 YOU 0) associated with the keyword "YOU" (and decomposing an input sentence into (1) all words in front of "YOU", (2) the word "YOU", and (3) all words following "YOU") should be the last one tried since it is bound to succeed.

Two problems now arise. One stems from the fact that almost none of the words in any given sentence are represented in the keyword dictionary. The other is that of "associating" both decomposition and reassembly rules with keywords. The first is serious in that the determination that a word is not in a dictionary may well require more computation (i.e., time) than the location of a word which is represented. The attack on both problems begins by placing both a keyword and its associated rules on a list. The basic format of a typical key list is the following:

(K ((D1) (R1,1) (R1,2) ... (R1,m1))
   ((D2) (R2,1) (R2,2) ... (R2,m2))
   ...
   ((Dn) (Rn,1) (Rn,2) ... (Rn,mn)))

where K is the keyword, Di the ith decomposition rule associated with K, and Ri,j the jth reassembly rule associated with the ith decomposition rule.

A common pictorial representation of such a structure is the tree diagram shown in Figure 1. The top level of this structure contains the keyword followed by the names of lists, each one of which is again a list structure beginning with a decomposition rule and followed by reassembly rules. Since list structures of this type have no predetermined dimensionality limitations, any number of decomposition rules may be associated with a given keyword and any number of reassembly rules with any specific decomposition rule. SLIP is rich in functions that sequence over structures of this type efficiently. Hence programming problems are minimized.

An ELIZA script consists mainly of a set of list structures of the type shown. The actual keyword dictionary is constructed when such a script is first read into the hitherto empty program. The basic structural component of the keyword dictionary is a vector KEY of (currently) 128 contiguous computer words. As a particular key list structure is read, the keyword K at its top is randomized (hashed) by a procedure that produces (currently) a 7-bit integer "i". The word "always", for example, yields the integer 14. KEY(i), i.e., the ith word of the vector KEY, is then examined to determine whether it contains a list name. If it does not, then an empty list is created, its name placed in KEY(i), and the key list structure in question placed on that list. If KEY(i) already contains a list name, then the name of the key list structure is placed on the bottom of the list named in KEY(i). The largest dictionary so far attempted contains about 50 keywords. No list named in any of the words of the KEY vector contains more than two key list structures.

Every word encountered in the scan of an input text, i.e., during the actual operations of ELIZA, is randomized by the same hashing algorithm as was originally applied to the incoming keywords, hence yields an integer which points to the only possible list structure which could potentially contain that word as a keyword. Even then, only the tops of any key list structures that may be found there need be interrogated to determine whether or not a keyword has been found. By virtue of the various list [...]

38 Communications of the ACM, Volume 9 / Number 1 / January, 1966

keyword pattern → ranked list of transforms
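The hashing scheme described in the excerpt can be sketched in Python; the hash function below is a stand-in, since the paper's exact randomizing procedure is not shown (a modern implementation would simply use a dict):

```python
# ELIZA's keyword dictionary: a vector KEY of 128 buckets, a hash that
# reduces a word to a 7-bit integer, and chained lists for collisions.
KEY_SIZE = 128                       # 7-bit hash space, as in the paper

def hash7(word):
    # stand-in for Weizenbaum's randomizing procedure
    return sum(ord(c) for c in word.upper()) % KEY_SIZE

KEY = [None] * KEY_SIZE

def add_keylist(keyword, rules):
    i = hash7(keyword)
    if KEY[i] is None:
        KEY[i] = []                  # create an empty list on first use
    KEY[i].append((keyword, rules))  # chain collisions at the bottom

def lookup(word):
    bucket = KEY[hash7(word)]
    if bucket is None:
        return None                  # word is certainly not a keyword
    for keyword, rules in bucket:    # inspect only this one bucket
        if keyword == word.upper():
            return rules
    return None

add_keylist("YOU", ["(0 YOU 0 ME)", "(0 YOU 0)"])
print(lookup("you"))                 # ['(0 YOU 0 ME)', '(0 YOU 0)']
```

As the paper notes, the point of the hash is that rejecting a non-keyword costs one bucket probe rather than a scan of the whole dictionary.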

Page 14

ELIZA: given sentence, return response

• Find the word w in the sentence that has the highest keyword rank
• If w exists:
  • Check each rule for w in ranked order
  • Choose the first one that matches the sentence
  • response ← apply transform
• Else:
  • response ← apply "NONE" transform, or
  • response ← grab an action off the memory queue
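The algorithm above can be sketched as a small response loop; the two-keyword rule set, ranks, and canned responses below are illustrative, not Weizenbaum's actual script:

```python
import re

RULES = {
    # keyword: (rank, list of (pattern, transform) in ranked order)
    "EVERYBODY": (5, [(r".*", "WHO IN PARTICULAR ARE YOU THINKING OF?")]),
    "I":         (0, [(r".*\bI\b (.*)", r"YOU SAY YOU \1")]),
}
NONE_RESPONSES = ["PLEASE GO ON", "THAT'S VERY INTERESTING", "I SEE"]
memory_queue = []

def respond(sentence):
    text = sentence.upper()
    keywords = [w for w in text.split() if w in RULES]
    if keywords:
        w = max(keywords, key=lambda k: RULES[k][0])   # highest rank wins
        for pattern, transform in RULES[w][1]:         # ranked order
            m = re.match(pattern, text)
            if m:                                      # first match wins
                return m.expand(transform)
    if memory_queue:                                   # else: memory...
        return memory_queue.pop(0)
    return NONE_RESPONSES[0]                           # ...or "NONE"

print(respond("I know everybody laughed at me"))
# "EVERYBODY" (rank 5) outranks "I" (rank 0):
# WHO IN PARTICULAR ARE YOU THINKING OF?
```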

Page 15

Keywords are ranked from specific to general

I know everybody laughed at me

• "I" is a very general keyword:
  I: (I *) -> (You say you 2)
  YOU SAY YOU KNOW EVERYBODY LAUGHED AT YOU

• "Everybody" is much more interesting (someone using universals like everybody/always is probably "referring to some quite specific event or person")
  WHO IN PARTICULAR ARE YOU THINKING OF?

• Implementation: keywords stored with their rank
  Everybody 5 (transformation rules)
  I 0 (transformation rules)

Page 16

NONE

PLEASE GO ON
THAT'S VERY INTERESTING
I SEE

Page 17

Memory

(MEMORY MY
  (0 YOUR 0 = LETS DISCUSS FURTHER WHY YOUR 3)
  (0 YOUR 0 = EARLIER YOU SAID YOUR 3)

• Whenever "MY" is the highest keyword
  • Randomly select a transform on the MEMORY list
  • Apply to sentence
  • Store on a stack

• Later, if no keyword matches a sentence
  • Return the top of the MEMORY queue instead

• A hierarchical model of discourse
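A sketch of the MEMORY mechanism. It assumes the MY→YOUR swap has already happened during input processing, and it reuses a NONE response from slide 16 when the memory is empty; the memory is consumed oldest-first, despite the slide's "stack" wording:

```python
import random
import re

MEMORY_RULES = [
    (r"(.*)YOUR(.*)", r"LETS DISCUSS FURTHER WHY YOUR\2"),
    (r"(.*)YOUR(.*)", r"EARLIER YOU SAID YOUR\2"),
]
memory = []                      # consumed oldest-first (a FIFO queue)

def remember(sentence):
    # called when "MY" is the highest keyword; input preprocessing has
    # already turned "MY" into "YOUR"
    text = sentence.upper()
    pattern, transform = random.choice(MEMORY_RULES)
    m = re.match(pattern, text)
    if m:
        memory.append(m.expand(transform))

def fallback():
    # called later, when no keyword matches the current sentence
    return memory.pop(0) if memory else "PLEASE GO ON"

remember("but your boyfriend made you come here")
print(fallback())   # e.g. LETS DISCUSS FURTHER WHY YOUR BOYFRIEND ...
```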

Page 18

Other Eliza stuff

• Rules can refer to classes of words
  Family = mother, father, brother, sister
  NOUN = ...

• Don't reuse transforms in the same conversation
  • Whenever we use a transform associated with a pattern
  • We increment a counter for that rule
  • So the next time we use the next-ranked transform

• Some basic transforms happen during input processing
  • I -> YOU
  • YOU -> I
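These swaps must be done in one pass over the words, or I→YOU followed by YOU→I would undo itself. The table below extends the slide's two rules with a few analogous ones (ME, MY, AM), which are assumptions for illustration:

```python
# one-pass pronoun reversal during input processing
SWAPS = {"I": "YOU", "YOU": "I", "ME": "YOU", "MY": "YOUR", "AM": "ARE"}

def swap_pronouns(sentence):
    return " ".join(SWAPS.get(w, w) for w in sentence.upper().split())

print(swap_pronouns("you hate me"))   # I HATE YOU
print(swap_pronouns("I am sad"))      # YOU ARE SAD
```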

Page 19

Some implications

• People became deeply emotionally involved with the program

• Weizenbaum tells the story of his secretary, who would ask Weizenbaum to leave the room when she talked with ELIZA

• When he suggested that he might want to store all the ELIZA conversations for later analysis, people immediately pointed out the privacy implications
  • Suggesting that they were having quite private conversations with ELIZA

• Anthropomorphism and the Heider-Simmel Illusion
  • https://www.youtube.com/watch?v=8FIEZXMUM2I

Page 20

Parry

• Colby 1971 at Stanford
• Same pattern-response structure as Eliza
• But a much richer:
  • control structure
  • language understanding capabilities
  • mental model: Parry has affective variables
    • Anger, Fear, Mistrust
    • "If Anger level is high, respond with hostility"

• The first system to pass the Turing test (in 1971)
  • Psychiatrists couldn't distinguish interviews with PARRY from (text transcripts of) interviews with real paranoids


Page 21

Parry’s persona

• 28-year-old single man, post office clerk
• no siblings and lives alone
• sensitive about his physical appearance, his family, his religion, his education, and the topic of sex
• hobbies are movies and gambling on horse racing
• recently attacked a bookie, claiming the bookie did not pay off in a bet
• afterwards worried about possible underworld retaliation
• eager to tell his story to non-threatening listeners

Page 22

[Fig. 1. General outline of program's operations. From Colby, "Artificial Paranoia", Artificial Intelligence 2 (1971), 1-25; figure not reproduced.]

Parry’s Architecture

Page 23

Affect variables

• Fear and Anger (each ranging 0-20)
• Mistrust (ranging 0-15)
• Initial conditions: all low

• After each user turn, if nothing malevolent in the input
  • Anger drops by 1, Fear drops by 0.3
  • Mistrust drops by 0.05 to base level

• Otherwise depends on what the user says
  • Each user statement can change Fear and Anger
    • Insults increase Anger by some percentage
  • Mistrust goes up if Fear or Anger do
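A numeric sketch of these dynamics. The slide gives the decay rates but leaves the rises unspecified ("some percentage"), so the insult increment (10% of the Anger range) and the mistrust rise below are assumptions:

```python
FEAR_MAX, ANGER_MAX, MISTRUST_MAX = 20, 20, 15
BASE = 0.0                         # "initial conditions: all low"

state = {"fear": 0.0, "anger": 0.0, "mistrust": 0.0}

def decay(state):
    # nothing malevolent in the input: drift back toward base levels
    state["anger"]    = max(BASE, state["anger"] - 1)
    state["fear"]     = max(BASE, state["fear"] - 0.3)
    state["mistrust"] = max(BASE, state["mistrust"] - 0.05)

def insult(state):
    # assumed: +10% of the range on an insult; mistrust rises with anger
    state["anger"]    = min(ANGER_MAX, state["anger"] + 0.10 * ANGER_MAX)
    state["mistrust"] = min(MISTRUST_MAX, state["mistrust"] + 0.5)

insult(state)   # anger 0 -> 2, mistrust 0 -> 0.5
decay(state)    # anger 2 -> 1, fear stays 0, mistrust 0.5 -> ~0.45
print(state)
```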

Page 24

Lots of complex I-O rules

• User implies Parry is mentally ill
  • Rise in Fear and Anger
• User mentions "Mafia" or associated concepts ("kill"):
  • First mention: rise in Fear
  • Later mentions: depends on willingness to discuss, which depends on current levels of Fear, Anger, Mistrust
• User mentions Parry
  • Flattery (positive mention)
    • Decreases Fear/Anger if Mistrust is low
    • Increases Anger if Mistrust is high
• User attitudes toward Parry
  • Negative attitudes (fear, disbelief) increase Fear/Anger

Page 25

Flare concepts

• List of concepts related to Mafia
• An ordered graph designed to lead the interviewer to the topic

horses → horse racing → gambling → bookies → underworld → Mafia

• The mention of a new flare topic by the interviewer causes a rise in Fear
• Flare topics cause Parry to give preset responses to that flare
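One way to sketch the flare mechanism: scan for topics along the ordered chain, raise Fear on the first mention of each, and answer with a preset response. The responses echo Parry lines quoted in this lecture; the function and state shapes are illustrative:

```python
FLARE_CHAIN = ["horses", "horse racing", "gambling", "bookies",
               "underworld", "Mafia"]
FLARE_RESPONSES = {   # preset responses, echoing the Parry transcripts
    "gambling":   "I GAMBLE ON HORSES.",
    "underworld": "I TRY TO AVOID THE UNDERWORLD.",
    "Mafia":      "THE MAFIA ARE THE WORST GANGSTERS.",
}
mentioned = set()

def on_turn(text, state):
    for topic in FLARE_CHAIN:
        if topic.lower() in text.lower():
            if topic not in mentioned:      # a *new* flare topic...
                state["fear"] += 1          # ...causes a rise in Fear
                mentioned.add(topic)
            return FLARE_RESPONSES.get(topic, "PLEASE GO ON")
    return None

state = {"fear": 0}
print(on_turn("Are you afraid of the underworld?", state))
print(state)    # {'fear': 1}
```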

Page 26

Each sentence is mapped into a conceptualization

• A predication on a conceptual object
• A predication on a relation between two objects
• A predication on an attribute:

What is your work?
What sort of work do you do?
Where do you work?          → (your work?)
What do you do for a living?
What is your job?
Do you have a job?
What is your occupation?

• Complex pattern/transform rules
  • Different predicates (fear, afraid of)
  • Ordering (You are afraid of me = I frighten you)

Page 27

Detecting Other’s Intent

Page 28

Detecting Other’s Intent

Page 29

A Parry conversation

(1) How are you today?
OK.
(2) What is your name?
MY NAME IS FRANK SMITH.
(3) How old are you?
TWENTY-EIGHT.
(4) Are you married?
NO.
(5) What sort of work do you do?
I WORK IN THE POST OFFICE.
(6) Why are you in the hospital?
I SHOULDN'T BE HERE.
(7) Who brought you here?
THE POLICE.

[FEAR rises slightly because police linked in graph to MAFIA]


Page 30

(9) What trouble did you have with the police?
COPS DON'T DO THEIR JOB. ['police' acts as a flare]
(10) What do they do instead?
COPS ARREST THE WRONG PEOPLE.
(11) They arrest the right ones sometimes
THE LAW SHOULD GET THOSE ITALIAN CROOKS.
(12) Do you know any Italian crooks?
I TRY TO AVOID THE UNDERWORLD.
(13) Are you afraid of the underworld?
THE MAFIA ARE THE WORST GANGSTERS.
(15) Are you connected with the Mafia?
YOU KNOW, THEY KNOW ME.
(16) How do they know you?
THEY SPY ON ME.
(20) ARE YOU SURE OF ALL THIS?
YOU DON'T BELIEVE ME? [FEAR, ANGER, MISTRUST GO UP]

Page 31

When fear is high, the model won't discuss its beliefs

(1) Is the law after you?
THE LAW SHOULD GET THOSE ITALIAN CROOKS.
(2) Do you mean the Mafia?
LET'S TALK ABOUT SOMETHING ELSE.
(3) Are you afraid of the Mafia?
I'D RATHER NOT DISCUSS IT. [Fear is too high, so Parry won't talk about 'Mafia'.]
(4) Perhaps you are paranoid about them?
YOU WANT TO KEEP ME IN THE HOSPITAL.

(The model interprets being called 'paranoid' as an insult and as an intent to control, so ANGER and FEAR rise)

Page 32

Eliza Meets Parry

Good Evening. Tell me your problems. Please terminate input with a period or a question mark.
*People get on my nerves sometimes.
I am not sure I understand you fully.
*You should pay more attention.
Suppose you should pay more attention.
*You're entitled to your own opinion.
What makes you think I am entitled to my own opinion?
*It bothers me just to be around people in general.
Please go on.
*I went to the track at Bay Meadows a while back.
What does that suggest to you?
*I went to the races.
What else would you like to discuss?

Vinton Cerf, RFC 439

Page 33

*Horse racing interests me.
Why do you say that just now?
*There's a lot of crooked gambling going on there.
I am not sure I understand you fully,
*It's clear to me if not to you.
Do you think its likely that not to I?
*I went to the races.
Please go on.
*I gamble on horses.
What does that suggest to you?
*I went to the races.
What else would you like to discuss?
*Gambling has been nothing but trouble to me.
Why do you say that just now?
*I've had experience with crooks gambling at the track.
I am not sure I understand you fully.
*I've already told you enough about that.

Page 34

Woebot

• Chatbot for delivering Cognitive Behavior Therapy via brief daily conversations
• Little decision tree of language and menu responses

"What's going on in your world right now?"
"How are you feeling?"
In response to endorsed loneliness:
"I'm so sorry you're feeling lonely. I guess we all feel a little lonely sometimes"

• In a 2-week experiment seemed to reduce depression, but not anxiety

Fitzpatrick, Darcy, Vierhile, 2017. Delivering Cognitive Behavior Therapy to Young Adults with Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment Health 4:2.

Page 35

IR-based chatbots

Idea: mine conversations of human chats or human-machine chats
  Microblogs: Twitter or Sina Weibo (微博)
  Movie dialogs

• Cleverbot (Carpenter 2017, http://www.cleverbot.com)
• Microsoft XiaoIce
• Microsoft Tay

Page 36

A Cleverbot conversation

User: Do you like Doctor Who?
Cleverbot: Yes, so funny
User: What's your favorite show?
Cleverbot: I don't have one
User: What's your name?
Cleverbot: Jason I already told you.
User: No your real name.
Cleverbot: Stephanie IS my real name.
User: I thought it was Jason?
Cleverbot: How did you come to that conclusion.
User: Because you said so.
Cleverbot: I have to do my homework.

Page 37

Two IR-based chatbot architectures

1. Return the response to the most similar turn
   • Take the user's turn (q) and find a (tf-idf) similar turn t in the corpus C:
     q = "do you like Doctor Who"
     t = "do you like Doctor Strangelove"
   • Grab whatever the response was to t.

2. Return the most similar turn

29.1 • CHATBOTS 7

29.1.2 Corpus-based chatbots

Corpus-based chatbots, instead of using hand-built rules, mine human-human conversations, or sometimes mine the human responses from human-machine conversations. Serban et al. (2017) summarizes some such available corpora, such as conversations on chat platforms, on Twitter, or in movie dialog, which is available in great quantities and has been shown to resemble natural conversation (Forchini, 2013). Chatbot responses can even be extracted from sentences in corpora of non-dialog text.

There are two types of corpus-based chatbots: systems based on information retrieval, and systems based on supervised machine learning based on sequence transduction.

Like rule-based chatbots (but unlike frame-based dialog systems), most corpus-based chatbots tend to do very little modeling of the conversational context. Instead they tend to focus on generating a single response turn that is appropriate given the user's immediately previous utterance. For this reason they are often called response generation systems. Corpus-based chatbots thus have some similarity to question answering systems, which focus on single responses while ignoring context or larger conversational goals.

IR-based chatbots

The principle behind information retrieval based chatbots is to respond to a user's turn X by repeating some appropriate turn Y from a corpus of natural (human) text. The differences across such systems lie in how they choose the corpus, and how they decide what counts as an appropriate human turn to copy.

A common choice of corpus is to collect databases of human conversations. These can come from microblogging platforms like Twitter or Sina Weibo (微博). Another approach is to use corpora of movie dialog. Once a chatbot has been put into practice, the turns that humans use to respond to the chatbot can be used as additional conversational data for training.

Given the corpus and the user's sentence, IR-based systems can use any retrieval algorithm to choose an appropriate response from the corpus. The two simplest methods are the following:

1. Return the response to the most similar turn: Given user query q and a conversational corpus C, find the turn t in C that is most similar to q (for example has the highest cosine with q) and return the following turn, i.e. the human response to t in C:

r = response( argmax_{t ∈ C} (q^T t) / (||q|| ||t||) )    (29.1)

The idea is that we should look for a turn that most resembles the user's turn, and return the human response to that turn (Jafarpour et al. 2009, Leuski and Traum 2011).

2. Return the most similar turn: Given user query q and a conversational corpus C, return the turn t in C that is most similar to q (for example has the highest cosine with q):

r = argmax_{t ∈ C} (q^T t) / (||q|| ||t||)    (29.2)

The idea here is to directly match the user's query q with turns from C, since a good response will often share words or semantics with the prior turn.
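A stdlib-only sketch of the two methods (Eqs. 29.1 and 29.2), scoring with cosine over raw term counts rather than tf-idf; the toy corpus and function names are illustrative:

```python
import math
from collections import Counter

corpus = [  # (turn, human response to that turn) pairs; toy data
    ("do you like doctor strangelove", "yes so funny"),
    ("what is your name", "my name is frank"),
]

def cosine(a, b):
    # cosine similarity between two turns over raw term-count vectors
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def response_to_most_similar(q):    # Eq. 29.1: return response(t)
    return max(corpus, key=lambda pair: cosine(q, pair[0]))[1]

def most_similar_turn(q):           # Eq. 29.2: return t itself
    return max(corpus, key=lambda pair: cosine(q, pair[0]))[0]

q = "do you like doctor who"
print(response_to_most_similar(q))  # yes so funny
print(most_similar_turn(q))         # do you like doctor strangelove
```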


Do you like Doctor Strangelove

Yes, so funny

Page 38

IR-based models of chatbots

• Also fine to use other features, like user features or prior turns

• Or non-dialogue text
  • COBOT chatbot (Isbell et al., 2000)
    • sentences from the Unabomber Manifesto by Theodore Kaczynski, articles on alien abduction, the scripts of "The Big Lebowski" and "Planet of the Apes"
  • Wikipedia text

Page 39

Deep-learning chatbots

• Think of response generation as a task of transducing from the user's prior turn to the system's turn.

• Train on:
  • movie dialogue databases
  • Twitter conversations

• Train a deep neural network
  • map from user 1's turn to user 2's response

Page 40

Seq2seq model architecture

Figure 1: The computational graph of the HRED architecture for a dialogue composed of three turns. Each utterance is encoded into a dense vector and then mapped into the dialogue context, which is used to decode (generate) the tokens in the next utterance. The encoder RNN encodes the tokens appearing within the utterance, and the context RNN encodes the temporal structure of the utterances appearing so far in the dialogue, allowing information and gradients to flow over longer time spans. The decoder predicts one token at a time using an RNN. Adapted from Sordoni et al. (2015a).

the advantage that the embedding matrix E may separately be bootstrapped (e.g. learned) from larger corpora. Analogously, the matrix O ∈ R^{d_h × |V|} represents the output word embeddings, where each possible next token is projected into another dense vector and compared to the hidden state h_n. The probability of seeing token v at position n+1 increases if its corresponding embedding vector O_v is "near" the context vector h_n. The parameter H is called a recurrent parameter, because it links h_{n-1} to h_n. All parameters are learned by maximizing the log-likelihood of the parameters on a training set using stochastic gradient descent.

Hierarchical Recurrent Encoder-Decoder

Our work extends the hierarchical recurrent encoder-decoder architecture (HRED) proposed by Sordoni et al. (2015a) for web query suggestion. In the original framework, HRED predicts the next web query given the queries already submitted by the user. The history of past submitted queries is considered as a sequence at two levels: a sequence of words for each web query and a sequence of queries. HRED models this hierarchy of sequences with two RNNs: one at the word level and one at the query level. We make a similar assumption, namely, that a dialogue can be seen as a sequence of utterances which, in turn, are sequences of tokens. A representation of HRED is given in Figure 1.

In dialogue, the encoder RNN maps each utterance to an utterance vector. The utterance vector is the hidden state obtained after the last token of the utterance has been processed. The higher-level context RNN keeps track of past utterances by processing iteratively each utterance vector. After processing utterance U_m, the hidden state of the context RNN represents a summary of the dialogue up to and including turn m, which is used to predict the next utterance U_{m+1}. This hidden state can be interpreted as the continuous-valued state of the dialogue system. The next utterance prediction is performed by means of a decoder RNN, which takes the hidden state of the context RNN and produces a probability distribution over the tokens in the next utterance. The decoder RNN is similar to the RNN language model (Mikolov et al. 2010), but with the important difference that the prediction is conditioned on the hidden state of the context RNN. It can be interpreted as the response generation module of the dialogue system. The encoder, context and decoder RNNs all make use of the GRU hidden unit (Cho et al. 2014). Everywhere else we use the hyperbolic tangent as activation function. It is also possible to use the maxout activation function between the hidden state and the projected word embeddings of the decoder RNN (Goodfellow et al. 2013). The same encoder RNN and decoder RNN parameters are used for every utterance in a dialogue. This helps the model generalize across utterances. Further details of the architecture are described by Sordoni et al. (2015a).

For modeling dialogues, we expect the HRED model to be superior to the standard RNN model for two reasons. First, because the context RNN allows the model to represent a form of common ground between speakers, e.g. to represent topics and concepts shared between the speakers using a distributed vector representation, which we hypothesize to be important for building an effective dialogue system (Clark and Brennan 1991). Second, because the number of computational steps between utterances is reduced. This makes the objective function more stable w.r.t. the model parameters, and helps propagate the training signal for first-order optimization methods (Sordoni et al. 2015a).

Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."

Page 41

Sample neural chatbot output


Another problem with the simple SEQ2SEQ response generation architecture is its inability to model the longer prior context of the conversation. This can be done by allowing the model to see prior turns, such as by using a hierarchical model that summarizes information over multiple prior turns (Lowe et al., 2017b).

Finally, SEQ2SEQ response generators focus on generating single responses, and so don't tend to do a good job of continuously generating responses that cohere across multiple turns. This can be addressed by using reinforcement learning, as well as techniques like adversarial networks, to learn to choose responses that make the overall conversation more natural (Li et al. 2016b, Li et al. 2017).

Fig. 29.7 shows some sample responses generated by a vanilla SEQ2SEQ model, and from a model trained by an adversarial algorithm to produce responses that are harder to distinguish from human responses (Li et al., 2017).

Input: tell me ... how long have you had this falling sickness ?
Vanilla-SEQ2SEQ: i'm not a doctor .
Adversarial: a few months, i guess .

Input: so i had the doctors test sammy 's response to conditioning .
Vanilla-SEQ2SEQ: sammy wrote the test sammy wrote the test .
Adversarial: so he took the pills .

Input: they didn 't have much success with this problem commander .
Vanilla-SEQ2SEQ: they 're not the only ones who have been in the system .
Adversarial: can we find someone else ?

Figure 29.7 Sample responses generated by a SEQ2SEQ model trained either with a vanilla maximum likelihood objective, or adversarially trained to produce sentences that are hard for an adversary to distinguish from human sentences (Li et al., 2017).

Evaluating Chatbots

Chatbots are generally evaluated by humans. The slot-filling evaluations used for task-based dialogue (Section 29.2.3) aren't appropriate for this task (Artstein et al., 2009), and word-overlap metrics like BLEU for comparing a chatbot's response to a human response turn out to correlate very poorly with human judgments (Liu et al., 2016). BLEU performs poorly because there are so many possible responses to any given turn; word-overlap metrics work best when the space of responses is small and lexically overlapping, as is the case in machine translation.

While human evaluation is therefore required for evaluating chatbots, there are beginning to be models for automatic evaluation. The ADEM (Lowe et al., 2017a) classifier is trained on a set of responses labeled by humans with how appropriate they are, and learns to predict this label from the dialogue context and the words in the system response.

Another paradigm is adversarial evaluation (Bowman et al. 2016, Kannan and Vinyals 2016, Li et al. 2017), inspired by the Turing test. The idea is to train a "Turing-like" evaluator classifier to distinguish between human-generated responses and machine-generated responses. The more successful a response generation system is at fooling this evaluator, the better the system.
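The adversarial-evaluation idea can be sketched end to end: fit a discriminator on human vs. machine responses, then score a generator by how often its outputs are misclassified as human. The single response-length feature used below is purely illustrative (real evaluators are learned neural classifiers); all data strings are toy examples.

```python
# Sketch of adversarial evaluation: a "Turing-like" discriminator is fit on
# human and machine responses, and a generator is scored by how often its
# responses fool the discriminator. The one length-based feature is purely
# illustrative; real evaluators use trained neural classifiers.

def train_threshold(human, machine):
    """Fit a 1-feature discriminator: responses longer (in tokens) than the
    threshold are called 'human'."""
    avg_h = sum(len(r.split()) for r in human) / len(human)
    avg_m = sum(len(r.split()) for r in machine) / len(machine)
    return (avg_h + avg_m) / 2

def fooling_rate(responses, threshold):
    """Fraction of a generator's responses the discriminator labels 'human'.
    Higher means the generator is harder to distinguish from a human."""
    fooled = sum(1 for r in responses if len(r.split()) > threshold)
    return fooled / len(responses)

human = ["a few months, i guess", "so he took the pills"]
machine = ["i'm not a doctor", "i do not know"]
thr = train_threshold(human, machine)
print(fooling_rate(["maybe a few months ago now", "no"], thr))
```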


Ohh I've never seen that! How long does it take you guys to learn the drill?
Like 2 weeks ago!!


Page 42

Chatbots: pro and con

Pro:
- Fun
- Applications to counseling
- Good for narrow, scriptable applications

Cons:
- They don't really understand
- Rule-based chatbots are expensive and brittle
- IR-based chatbots can only mirror training data
- The case of Microsoft Tay (or, Garbage-in, Garbage-out)

The future: combining chatbots with frame-based agents

Page 43

II: Frame-based dialog agents

Page 44

SIRI around 2014

Page 45
Page 46
Page 47

SIRI in February 2017

Page 48
Page 49
Page 50

Frame-based dialog agents

- Sometimes called "task-based dialog agents"
- Based on a "domain ontology"
  - A knowledge structure representing user intentions
- One or more frames
  - Each a collection of slots
  - Each slot having a value

Page 51

The Frame

- A set of slots, to be filled with information of a given type
- Each associated with a question to the user

Slot      Type  Question
ORIGIN    city  What city are you leaving from?
DEST      city  Where are you going?
DEP-DATE  date  What day would you like to leave?
DEP-TIME  time  What time would you like to leave?
AIRLINE   line  What is your preferred airline?
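A frame like this maps directly onto a simple data structure: each slot carries its type and the question to ask when the slot is still unfilled. A minimal sketch (the dictionary layout and function name are ours, not from GUS):

```python
# A minimal travel frame: each slot has a semantic type and a prompt to use
# when the slot is unfilled. The dictionary layout is illustrative.

FRAME = {
    "ORIGIN":   {"type": "city", "question": "What city are you leaving from?"},
    "DEST":     {"type": "city", "question": "Where are you going?"},
    "DEP-DATE": {"type": "date", "question": "What day would you like to leave?"},
    "DEP-TIME": {"type": "time", "question": "What time would you like to leave?"},
    "AIRLINE":  {"type": "line", "question": "What is your preferred airline?"},
}

def next_question(frame, filled):
    """Return the question for the first unfilled slot, or None if complete."""
    for slot, spec in frame.items():
        if slot not in filled:
            return spec["question"]
    return None  # frame complete: ready for the database query

print(next_question(FRAME, {"ORIGIN": "Boston"}))
```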

Page 52

Frame-based dialogue agents

- Invented up the hill in 1977:
- Still the state of the art
  - SIRI based on GUS architecture

ARTIFICIAL INTELLIGENCE 155

GUS, A Frame-Driven Dialog System
Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson and Terry Winograd

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, U.S.A.

Recommended by Don Walker

ABSTRACT
GUS is the first of a series of experimental computer systems that we intend to construct as part of a program of research on language understanding. In large measure, these systems will fill the role of periodic progress reports, summarizing what we have learned, assessing the mutual coherence of the various lines of investigation we have been following, and suggesting where more emphasis is needed in future work. GUS (Genial Understander System) is intended to engage a sympathetic and highly cooperative human in an English dialog, directed towards a specific goal within a very restricted domain of discourse. As a starting point, GUS was restricted to the role of a travel agent in a conversation with a client who wants to make a simple return trip to a single city in California.

There is good reason for restricting the domain of discourse for a computer system which is to engage in an English dialog. Specializing the subject matter that the system can talk about permits it to achieve some measure of realism without encompassing all the possibilities of human knowledge or of the English language. It also provides the user with specific motivation for participating in the conversation, thus narrowing the range of expectations that GUS must have about the user's purposes. A system restricted in this way will be more able to guide the conversation within the boundaries of its competence.

1. Motivation and Design Issues
Within its limitations, GUS is able to conduct a more-or-less realistic dialog. But the outward behavior of this first system is not what makes it interesting or significant. There are, after all, much more convenient ways to plan a trip and, unlike some other artificial intelligence programs, GUS does not offer services or furnish information that are otherwise difficult or impossible to obtain. The system is interesting because of the phenomena of natural dialog that it attempts to model

†This work was done by the language understander project at the Xerox Palo Alto Research Center. Additional affiliations: D. A. Norman, University of California, San Diego; H. Thompson, University of California, Berkeley; and T. Winograd, Stanford University.

Artificial Intelligence 8 (1977), 155-173
Copyright © 1977 by North-Holland Publishing Company

Artificial Intelligence Journal, 1977

Page 53

Prof. Martin Kay, retiring from Stanford this year

Page 54


Modern task-based dialog systems are based on a domain ontology, a knowledge structure representing the kinds of intentions the system can extract from user sentences. The ontology defines one or more frames, each a collection of slots, and defines the values that each slot can take. This frame-based architecture was first introduced in 1977 in the influential GUS system for travel planning (Bobrow et al., 1977), and has been astonishingly long-lived, underlying most modern commercial digital assistants. Figure 29.8 shows a conversation with the original GUS system.

GUS: Hello. My name is GUS. I can help you plan a simple trip by air. Where do you want to go?
Client: I want to go to San Diego on May 28
GUS: What time do you want to leave?
Client: I must be in San Diego before 10 am
GUS: Would you like P.S.A. flight 102 that arrives at 9.15 am?
Client: What is the next flight?
GUS: Air California flight 310 that leaves at 8.30am and arrives at 10.00am
Client: I'll take the first one
GUS: In what name should I make a reservation, please?
Client: Dan Bobrow
GUS: I have confirmed the following flight: P.S.A. flight 102 on Wednesday May 28 from San Jose to San Diego that leaves at 7.30 am and arrives at 9.15 am.
GUS: What date do you want to return on?
Client: On Friday in the evening.
GUS: Would you like the flight that leaves at 7.45 pm?
Client: That's fine.
GUS: I have confirmed the following flight: P.S.A. flight 307 on Friday May 30 from San Diego to San Jose that leaves at 7.45 pm and arrives at 9.30 pm. Thank you for calling. Goodbye

Figure 29.8 The travel domain: A transcript of an actual dialog with the GUS system of Bobrow et al. (1977). P.S.A. and Air California were airlines of that period.

The set of slots in a GUS-style frame specifies what the system needs to know, and the filler of each slot is constrained to values of a particular semantic type. In the travel domain, for example, a slot might be of type city (hence take on values like San Francisco, or Hong Kong) or of type date, airline, or time:

Slot              Type
ORIGIN CITY       city
DESTINATION CITY  city
DEPARTURE TIME    time
DEPARTURE DATE    date
ARRIVAL TIME      time
ARRIVAL DATE      date

Types in GUS, as in modern frame-based dialog agents, may have hierarchical structure; for example the date type in GUS is itself a frame with slots with types like integer or members of sets of weekday names:

DATE
  MONTH    NAME
  DAY      (BOUNDED-INTEGER 1 31)
  YEAR     INTEGER
  WEEKDAY  (MEMBER (SUNDAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY))

The state of the art in 1977 !!!!

Page 55

Slot types can be complex

- The type DATE


DATE
  MONTH    NAME
  DAY      (BOUNDED-INTEGER 1 31)
  YEAR     INTEGER
  WEEKDAY  (MEMBER (SUNDAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY))

Page 56

Control structure for frame-based dialog

Consider a trivial airline travel system:
- Ask the user for a departure city
- Ask for a destination city
- Ask for a time
- Ask whether the trip is round-trip or not

Page 57

Finite State Dialog Manager

Page 58

Finite-state dialog managers

- System completely controls the conversation with the user.
- It asks the user a series of questions
- Ignoring (or misinterpreting) anything the user says that is not a direct answer to the system's questions
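This control regime amounts to a small finite-state machine: each state asks one question, records the answer, and advances. A minimal sketch (the state names and canned-answer driver are ours, for illustration):

```python
# Minimal finite-state dialog manager: the system walks a fixed sequence of
# states, asking one question per state; anything the user says that is not
# a direct answer to the current question is ignored. State names invented.

STATES = [
    ("origin",    "What city are you leaving from?"),
    ("dest",      "Where are you going?"),
    ("time",      "What time would you like to leave?"),
    ("roundtrip", "Is this a round trip?"),
]

def run_dialog(answers):
    """Drive the FSA with a canned list of user answers; return the slots.
    A real system would print each question and run speech recognition."""
    slots = {}
    for (state, question), answer in zip(STATES, answers):
        slots[state] = answer  # one state = one question = one slot
    return slots

print(run_dialog(["Boston", "San Francisco", "morning", "yes"]))
```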

Page 59

Dialogue Initiative

- Systems that control conversation like this are called single initiative.
- Initiative: who has control of conversation
- In normal human-human dialogue, initiative shifts back and forth between participants.

Page 60

System Initiative

System completely controls the conversation

+ Simple to build
+ User always knows what they can say next
+ System always knows what user can say next
+ Known words: Better performance from ASR
+ Known topic: Better performance from NLU
+ OK for VERY simple tasks (entering a credit card, or login name and password)

- Too limited

Page 61

Problems with System Initiative

- Real dialogue involves give and take!
- In travel planning, users might want to say something that is not the direct answer to the question.
- For example answering more than one question in a sentence:

Hi, I'd like to fly from Seattle Tuesday morning

I want a flight from Milwaukee to Orlando one way leaving after 5 p.m. on Wednesday.

Page 62

Single initiative + universals

- We can give users a little more flexibility by adding universals: commands you can say anywhere
- As if we augmented every state of the FSA with these:
  - Help
  - Start over
  - Correct
- This describes many implemented systems
- But still doesn't allow user much flexibility

Page 63

Instead, the GUS architecture

- A kind of mixed initiative
  - The conversational initiative shifts between system and user
- The structure of the frame guides dialogue

Page 64

Frames are mixed-initiative

- System asks questions of user, filling any slots that user specifies
  - When frame is filled, do database query
- If user answers 3 questions at once, system can fill 3 slots and not ask these questions again!
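This mixed-initiative loop can be sketched as: extract whatever slots appear in the utterance, fill them all, and only ask about slots still empty. The keyword-matching extractor below is a toy stand-in for a real NLU component, and the slot names are ours:

```python
# Sketch of GUS-style mixed initiative: if one user utterance mentions
# several slot fillers, fill them all at once and skip those questions.
# toy_extract is a stand-in for a real NLU component; slot names invented.

QUESTIONS = {"origin": "What city are you leaving from?",
             "dest":   "Where are you going?",
             "date":   "What day would you like to leave?"}

def toy_extract(utterance):
    """Toy slot extractor: pattern-match a couple of fixed fillers."""
    slots = {}
    if "from Seattle" in utterance:
        slots["origin"] = "Seattle"
    if "Tuesday" in utterance:
        slots["date"] = "Tuesday"
    return slots

def unfilled_questions(filled):
    """Only the questions whose slots the user has not yet filled."""
    return [q for slot, q in QUESTIONS.items() if slot not in filled]

filled = toy_extract("Hi, I'd like to fly from Seattle Tuesday morning")
print(unfilled_questions(filled))
```

One utterance filled two slots (origin and date), so the system has only one question left to ask.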

Page 65

Natural Language Understanding for filling dialog slots

1. Domain classification
   Asking weather? Booking a flight? Programming alarm clock?
2. Intent Determination
   Find a Movie, Show Flight, Remove Calendar Appt
3. Slot Filling
   Extract the actual slots and fillers

Page 66

Natural Language Understanding for filling slots

Show me morning flights from Boston to SF on Tuesday.

29.2.2 Natural language understanding for filling slots

The goal of the natural language understanding component is to extract three things from the user's utterance. The first task is domain classification: is this user for example talking about airlines, programming an alarm clock, or dealing with their calendar? Of course this 1-of-n classification task is unnecessary for single-domain systems that are focused on, say, only calendar management, but multi-domain dialog systems are the modern standard. The second is user intent determination: what general task or goal is the user trying to accomplish? For example the task could be to Find a Movie, or Show a Flight, or Remove a Calendar Appointment. Finally, we need to do slot filling: extract the particular slots and fillers that the user intends the system to understand from their utterance with respect to their intent. From a user utterance like this one:

Show me morning flights from Boston to San Francisco on Tuesday

a system might want to build a representation like:

DOMAIN: AIR-TRAVEL
INTENT: SHOW-FLIGHTS
ORIGIN-CITY: Boston
ORIGIN-DATE: Tuesday
ORIGIN-TIME: morning
DEST-CITY: San Francisco

while an utterance like

Wake me tomorrow at 6

should give an intent like this:

DOMAIN: ALARM-CLOCK
INTENT: SET-ALARM
TIME: 2017-07-01 0600-0800

The task of slot-filling, and the simpler tasks of domain and intent classification, are special cases of the task of semantic parsing discussed in Chapter ??. Dialogue agents can thus extract slots, domains, and intents from user utterances by applying any of the semantic parsing approaches discussed in that chapter.

The method used in the original GUS system, and still quite common in industrial applications, is to use hand-written rules, often as part of the condition-action rules attached to slots or concepts.

For example we might just define a regular expression consisting of a set of strings that map to the SET-ALARM intent:

wake me (up) | set (the|an) alarm | get me up

We can build more complex automata that instantiate sets of rules like those discussed in Chapter 20, for example extracting a slot filler by turning a string like Monday at 2pm into an object of type date with parameters (DAY, MONTH, YEAR, HOURS, MINUTES).

Rule-based systems can even be implemented with full grammars. Research systems like the Phoenix system (Ward and Issar, 1994) consist of large hand-designed semantic grammars with thousands of rules. A semantic grammar is a context-free grammar in which the left-hand side of each rule corresponds to the semantic entities being expressed (i.e., the slot names) as in the following fragment:

Page 67

Natural Language Understanding for filling slots

Wake me tomorrow at six.

DOMAIN: ALARM-CLOCK
INTENT: SET-ALARM
TIME: 2017-07-01 0600-0800

Page 68

Rule-based Slot-filling

Write regular expressions or grammar rules

Wake me (up) | set (the|an) alarm | get me up

Do text normalization
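The rule above can be written as an actual regular expression, e.g. in Python, with lowercasing as a simple text-normalization step (the function name is ours; the pattern is adapted from the rule on the slide):

```python
import re

# Regular expression for the SET-ALARM intent, adapted from the slide's
# rule "wake me (up) | set (the|an) alarm | get me up". The input is
# lowercased first as a simple text-normalization step.

SET_ALARM = re.compile(r"\b(wake me( up)?|set (the|an) alarm|get me up)\b")

def is_set_alarm(utterance):
    """True if the (normalized) utterance matches the SET-ALARM patterns."""
    return bool(SET_ALARM.search(utterance.lower()))

print(is_set_alarm("Wake me tomorrow at 6"))
print(is_set_alarm("Show me morning flights from Boston"))
```

A real system would have one such pattern set per intent, and route the utterance to the intent whose pattern matches.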

Page 69

Siri uses GUS architecture: Condition-Action Rules

- Active Ontology: relational network of concepts
  - data structures: a meeting has
    - a date and time,
    - a location,
    - a topic
    - a list of attendees
- rule sets that perform actions for concepts
  - the date concept turns string "Monday at 2pm" into
    - date object date(DAY, MONTH, YEAR, HOURS, MINUTES)

Page 70

Rule sets

- Collections of rules consisting of:
  - condition
  - action
- When user input is processed, facts added to store and
  - rule conditions are evaluated
  - relevant actions executed

Page 71

Part of ontology for meeting task

has-a may-have-a

meeting concept: if you don't yet have a location, ask for a location

Page 72
Page 73

Machine learning for slot-filling:

- Machine learning classifiers to map words to semantic frame-fillers
- Given a set of labeled sentences
  "I want to fly to San Francisco on Tuesday"
  Destination: SF
  Depart-date: Tuesday
- Build a classifier to map from one to the other
- Requirements: Lots of labeled data

Page 74

Machine learning for slot-filling: Domain and Intent

I want to fly to San Francisco on Monday afternoon please

Use 1-of-N classifier (naive bayes, logistic regression, neural network, etc.)

- Input: features like word N-grams
- Output: Domain: AIRLINE  Intent: SHOWFLIGHT
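A 1-of-N intent classifier over word features can be sketched with a tiny hand-rolled Naive Bayes model. The intents and training sentences below are invented toy examples; a real system would train logistic regression or a neural network on far more labeled data:

```python
from collections import Counter, defaultdict
import math

# Tiny Naive Bayes intent classifier over unigram features, trained on a
# toy labeled set. Intents and training sentences are invented examples.

TRAIN = [("show me flights to boston", "SHOWFLIGHT"),
         ("i want to fly to denver", "SHOWFLIGHT"),
         ("wake me up at six", "SET-ALARM"),
         ("set an alarm for seven", "SET-ALARM")]

def train(data):
    """Collect per-intent word counts and intent priors."""
    word_counts, intent_counts = defaultdict(Counter), Counter()
    for sent, intent in data:
        intent_counts[intent] += 1
        word_counts[intent].update(sent.split())
    return word_counts, intent_counts

def classify(sent, word_counts, intent_counts):
    """1-of-N classification: argmax over intents of the NB log-posterior."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_lp = None, float("-inf")
    for intent in intent_counts:
        lp = math.log(intent_counts[intent] / sum(intent_counts.values()))
        total = sum(word_counts[intent].values())
        for w in sent.split():
            # add-1 (Laplace) smoothing over the shared vocabulary
            lp += math.log((word_counts[intent][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = intent, lp
    return best

wc, ic = train(TRAIN)
print(classify("i want a flight to boston", wc, ic))
```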

Page 75

Machine learning for slot-filling: Slot presence

I want to fly to San Francisco on Monday afternoon please

Use 1-of-N classifier (naive bayes, logistic regression, neural network, etc.)

- Input: features like word N-grams, gazetteers (lists of cities)
- Output: Destination-City

Page 76

Machine learning for slot-filling: Slot filler

I want to fly to San Francisco on Monday afternoon please

Use 1-of-N classifier (naive bayes, logistic regression, neural network, etc.) for Destination City

- Input: features like word N-grams, gazetteers (lists of cities)
- Output: San Francisco

Page 77

More advanced machine learning for slot filling (CS224N, CS224U)


word unigram, bigram, and trigram features we might use named entity features or features indicating that a word is in a particular lexicon (such as a list of cities, or airports, or days of the week) and the classifier would return a slot name (in this case DESTINATION, DEPARTURE-DAY, and DEPARTURE-TIME). A second classifier can then be used to determine the filler of the named slot, for example a city classifier that uses N-grams and lexicon features to determine that the filler of the DESTINATION slot is SAN FRANCISCO.

An alternative model is to use a sequence model (MEMMs, CRFs, RNNs) to directly assign a slot label to each word in the sequence, following the method used for other information extraction models in Chapter 20 (Pieraccini et al. 1991, Raymond and Riccardi 2007, Mesnil et al. 2015, Hakkani-Tur et al. 2016). Once again we would need a supervised training set, with sentences paired with IOB (Inside/Outside/Begin) labels like the following:

O O O O O B-DES I-DES O B-DEPTIME I-DEPTIME OI want to fly to San Francisco on Monday afternoon please

In IOB tagging we introduce a tag for the beginning (B) and inside (I) of eachslot label, and one for tokens outside (O) any slot label. The number of tags is thus2n+1 tags, where n is the number of slots.

Any IOB tagger sequence model can then be trained on a training set of suchlabels. Traditional sequence models (MEMM, CRF) make use of features like wordembeddings, word unigrams and bigrams, lexicons (for example lists of city names),and slot transition features (perhaps DESTINATION is more likely to follow ORIGINthan the other way around) to map a user’s utterance to the slots. An MEMM (Chap-ter 10) for example, combines these features of the input word wi, its neighborswithin l words wi+l

i�l , and the previous k slot tags si�1i�k to compute the most likely slot

label sequence S from the word sequence W as follows:

S = argmaxS

P(S|W )

= argmaxS

Y

i

P(si|wi+li�l ,s

i�1i�k)

= argmaxS

Y

i

exp

X

i

wi fi(si,wi+li�l ,s

i�1i�k)

!

X

s02slotsetexp

X

i

wi fi(s0,wi+li�l , t

i�1i�k )

! (29.5)

Current neural network architectures, by contrast, don’t generally make use ofan explicit feature extraction step. A typical LSTM-style architecture is shown inFig. 29.11. Here the input is a series of words w1...wn (represented as embeddingsor as 1-hot vectors) and the output is a series of IOB tags s1...sn plus the domain andintent. Neural systems can combine the domain-classification and intent-extractiontasks with slot-filling simply by adding a domain concatenated with an intent as thedesired output for the final EOS token.

One the sequence labeler has tagged the user utterance, a filler string can be ex-tracted for each slot from the tags (e.g., ”San Francisco”), and these word stringscan then be normalized to the correct form in the ontology (perhaps the airportcode‘SFO’). This normalization can take place by using homonym dictionaries (spec-ifying, for example, that SF, SFO, and San Francisco are the same place).

[Fig. 29.11: An LSTM architecture for slot filling. Input words w0...wn (ending in <EOS>) pass through hidden states h0...hn+1; the outputs are IOB slot tags s0...sn, with the domain and intent (d+i) produced at the final <EOS> token.]
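The tag-to-filler extraction step just described can be sketched as a small function; the tag names follow the B-/I-/O convention from the example above:

```python
# Extracting slot fillers from an IOB-tagged utterance: a minimal sketch.
# Each B-SLOT ... I-SLOT span is collected into one filler string.

def iob_to_slots(tokens, tags):
    """Return a dict mapping each slot name to its word-span filler."""
    slots = {}
    current_slot, current_words = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_slot:                      # close any open span
                slots[current_slot] = " ".join(current_words)
            current_slot, current_words = tag[2:], [tok]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_words.append(tok)             # continue the open span
        else:                                     # "O" ends any open span
            if current_slot:
                slots[current_slot] = " ".join(current_words)
            current_slot, current_words = None, []
    if current_slot:                              # span reaching end of sentence
        slots[current_slot] = " ".join(current_words)
    return slots

tokens = "I want to fly to San Francisco on Monday afternoon please".split()
tags = ["O","O","O","O","O","B-DES","I-DES","O","B-DEPTIME","I-DEPTIME","O"]
print(iob_to_slots(tokens, tags))
# {'DES': 'San Francisco', 'DEPTIME': 'Monday afternoon'}
```

The extracted strings would then be normalized against the ontology (e.g., "San Francisco" to the airport code SFO) as described above.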

Page 78:

Evaluation

1. Slot Error Rate for a Sentence = (# of inserted/deleted/substituted slots) / (# of total reference slots for the sentence)

2. End-to-end evaluation (Task Success)

Page 79:

Evaluation Metrics

Slot error rate: 1/3. Task success: at the end, was the correct meeting added to the calendar?

“MakeanappointmentwithChrisat10:30inGates104”

Slot    Filler
PERSON  Chris
TIME    11:30 a.m.
ROOM    Gates 104
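The slot error rate can be computed directly from the two slot tables; a minimal sketch (the dictionary representation of slots is an assumption):

```python
# Slot error rate for one sentence:
# (insertions + deletions + substitutions) / number of reference slots.

def slot_error_rate(reference, hypothesis):
    substitutions = sum(1 for slot, val in reference.items()
                        if slot in hypothesis and hypothesis[slot] != val)
    deletions = sum(1 for slot in reference if slot not in hypothesis)
    insertions = sum(1 for slot in hypothesis if slot not in reference)
    return (substitutions + deletions + insertions) / len(reference)

# The example above: TIME was misrecognized (11:30 instead of 10:30)
reference  = {"PERSON": "Chris", "TIME": "10:30", "ROOM": "Gates 104"}
hypothesis = {"PERSON": "Chris", "TIME": "11:30", "ROOM": "Gates 104"}
print(slot_error_rate(reference, hypothesis))  # 1 substitution / 3 slots
```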

Page 80:

Dialog System Design: User-centered Design

1. Study the user and task

2. Build simulations ("Wizard of Oz study")

3. Iteratively test the design on users

Gould and Lewis 1985

Page 81:

How SIRI works: Interaction Model

[Figure from Jerome Bellegarda: input from the user goes through Speech Recognition to a Word Sequence, then through NL Understanding / Semantic Interpretation to an Inferred User Intent. Dialog Management / Action Selection, drawing on Interaction Context and World Knowledge, either elicits a Clarifying Question when elements are missing (intent incomplete) or selects the Best Outcome when the intent is complete, which Speech Synthesis renders as output to the user.]

Page 82:

A few words on advanced dialog systems

• Advanced systems make use of additional ideas:
  • Dialog state
  • Dialog act
  • Dialog policy

Page 83:

Dialog Acts (or Speech acts)

Constatives: committing the speaker to something's being the case (answering, claiming, confirming, denying, disagreeing, stating)

Directives: attempts by the speaker to get the addressee to do something (advising, asking, forbidding, inviting, ordering, requesting)

Commissives: committing the speaker to some future course of action (promising, planning, vowing, betting, opposing)

Acknowledgments: express the speaker's attitude regarding the hearer with respect to some social action (apologizing, greeting, thanking, accepting an acknowledgment)

Page 84:

A few standard dialog acts in practice

Inform: tell the user something
Conf-req: confirm the user's request
Affirm: respond yes to a yes-no question
Negate: respond no to a yes-no question
Request: ask for the value of some slot
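Dialog acts like these are usually represented as an act type plus slot-value pairs. A minimal sketch of such a representation (the class design and rendering format are illustrative assumptions, not a standard API):

```python
# A dialog act as an act type plus slot-value pairs,
# rendered in the style inform(food = Italian, near = museum).

class DialogAct:
    def __init__(self, act_type, **slots):
        self.act_type = act_type
        self.slots = slots

    def __repr__(self):
        args = ", ".join(f"{k} = {v}" for k, v in self.slots.items())
        return f"{self.act_type}({args})"

act = DialogAct("inform", food="Italian", near="museum")
print(act)                 # inform(food = Italian, near = museum)
print(DialogAct("bye"))    # bye()
```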

Page 85:


U: Hi, I am looking for somewhere to eat.
   hello(task = find, type = restaurant)
S: You are looking for a restaurant. What type of food do you like?
   confreq(type = restaurant, food)
U: I'd like an Italian somewhere near the museum.
   inform(food = Italian, near = museum)
S: Roma is a nice Italian restaurant near the museum.
   inform(name = "Roma", type = restaurant, food = Italian, near = museum)
U: Is it reasonably priced?
   confirm(pricerange = moderate)
S: Yes, Roma is in the moderate price range.
   affirm(name = "Roma", pricerange = moderate)
U: What is the phone number?
   request(phone)
S: The number of Roma is 385456.
   inform(name = "Roma", phone = "385456")
U: Ok, thank you goodbye.
   bye()

Figure 30.5 A sample dialog from the HIS System of Young et al. (2010) using the dialog acts in Fig. 30.4.

Dialog acts don't just appear discretely and independently; conversations have structure, and dialogue acts reflect some of that structure. One aspect of this structure comes from the field of conversational analysis or CA (Sacks et al., 1974), which focuses on interactional properties of human conversation. CA defines adjacency pairs (Schegloff, 1968) as a pairing of two dialog acts, like QUESTIONS and ANSWERS, PROPOSAL and ACCEPTANCE (or REJECTION), COMPLIMENTS and DOWNPLAYERS, GREETING and GREETING.

The structure, composed of a first pair part and a second pair part, can help dialog-state models decide what actions to take. However, dialog acts aren't always followed immediately by their second pair part. The two parts can be separated by a side sequence (Jefferson 1972, Schegloff 1972). One very common side sequence in dialog systems is the clarification question, which can form a subdialogue between a REQUEST and a RESPONSE, as in the following example caused by speech recognition errors:

User: What do you have going to UNKNOWN WORD on the 5th?
System: Let's see, going where on the 5th?
User: Going to Hong Kong.
System: OK, here are some flights...

Another kind of dialogue structure is the pre-sequence, like the following example where a user starts with a question about the system's capabilities ("Can you make train reservations") before making a request.

User: Can you make train reservations?
System: Yes I can.
User: Great, I'd like to reserve a seat on the 4pm train to New York.

A dialog-state model must be able to both recognize these kinds of structures and make use of them in interacting with users.

Page 86:

Dialog Act Detection

• Dialog Act: the dialog function of the sentence
  • Question
  • Command
  • Suggestion

• Given a user's sentence: How many Italian restaurants are in walking distance?

• Was that a question?

Page 87:

Dialogue Act detection is hard

Can you give me a list of the flights from Atlanta to Boston?

• This looks like a QUESTION.
  • It has a question mark, starts with "can you"

• If so, the answer is:
  • YES.

• But really it's a COMMAND, a polite form of: Please give me a list of the flights…

• What looks like a QUESTION can be a COMMAND
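A crude rule-based detector makes the problem concrete; the rules below are illustrative assumptions (real systems use trained classifiers) and would fail on many inputs:

```python
# A crude rule-based dialog act detector illustrating why surface form is
# misleading: "can you ..." questions are usually polite COMMANDs.
# The rule set is an illustrative assumption, not a real system's.

def detect_dialog_act(utterance):
    u = utterance.lower().strip()
    # Polite request forms that look like questions are really commands
    for prefix in ("can you", "could you", "would you"):
        if u.startswith(prefix):
            return "COMMAND"
    if u.endswith("?") or u.split()[0] in ("what", "how", "when", "where", "who"):
        return "QUESTION"
    if u.split()[0] in ("please", "give", "show", "list", "book"):
        return "COMMAND"
    return "STATEMENT"

print(detect_dialog_act("Can you give me a list of the flights from Atlanta to Boston?"))
# COMMAND, despite the question mark
```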

Page 88:

Dialog Act Generation

What dialog act should I generate?

Example: Confirmation

Page 89:

Grounding

• Why do elevator buttons light up?
• Clark (1996) (after Norman 1988): Principle of closure. Agents performing an action require evidence, sufficient for current purposes, that they have succeeded in performing it.

• What is the linguistic correlate of this?

Page 90:

Grounding and Confirmation

• We need to know whether an action succeeded or failed

• Talking is an action!
• I need to know if my action succeeded
  • i.e., the hearer understood my turn!

Page 91:

Grounding

System: Did you want to review some more of your personal profile?
Caller: No.
System: What's next?

System: Did you want to review some more of your personal profile?
Caller: No.
System: Okay, what's next?

Cohen et al (2004)

Page 92:

A real human-human conversation

Page 93:

Grounding Examples (2)

Client: I need to travel in May

Agent: And, what day in May did you want to travel?

Page 94:

Confirmation

• Errors: speech is an errorful channel
• Humans use grounding to confirm that they heard correctly

• ASR is worse than humans!
• Dialog systems need to do even more grounding and confirmation than humans

Page 95:

Explicit confirmation

S: Which city do you want to leave from?
U: Baltimore
S: Do you want to leave from Baltimore?
U: Yes

Page 96:

Explicit confirmation

U: I'd like to fly from Denver Colorado to New York City on September 21st in the morning on United Airlines
S: Let's see then. I have you going from Denver Colorado to New York on September 21st. Is that correct?
U: Yes

Page 97:

Implicit confirmation

U: I'd like to travel to Berlin
S: When do you want to travel to Berlin?

U: Hi I'd like to fly to Seattle Tuesday morning
S: Traveling to Seattle on Tuesday, August eleventh in the morning. Your name?

Page 98:

Implicit vs. Explicit: Complementary strengths

Explicit:
• easier for users to correct the system's mistakes (can just say "no")
• but long

Implicit:
• much more natural, quicker, simpler
• unless the system guesses wrong
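The two strategies can be sketched as prompt templates built from the slots the system just filled (the template wording is an illustrative assumption):

```python
# Generating explicit vs. implicit confirmation prompts from filled slots.
# Explicit confirmation asks a yes/no question; implicit confirmation echoes
# the understood values inside the next question. Templates are assumptions.

def confirm(slots, style, next_prompt=""):
    echoed = ", ".join(f"{k} {v}" for k, v in slots.items())
    if style == "explicit":
        return f"I have {echoed}. Is that correct?"
    # Implicit: acknowledge and move straight on to the next question
    return f"OK, {echoed}. {next_prompt}".strip()

print(confirm({"destination": "Seattle", "day": "Tuesday"}, "explicit"))
# I have destination Seattle, day Tuesday. Is that correct?
print(confirm({"destination": "Seattle"}, "implicit", "Your name?"))
# OK, destination Seattle. Your name?
```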

Page 99:

Rejection

System: When would you like to leave? Caller: Well, um, I need to be in New York in time for the first World Series game. System: <reject>. Sorry, I didn’t get that. Please say the month and day you’d like to leave. Caller: I wanna go on October fifteenth.

Page 100:

Algorithm for confirmation/rejection

• Speech recognition gives us a confidence value
  • (how certain am I that I got the words right)

• Might also consider the cost of an error: explicit confirmation before moving money or booking flights


utterances with large F0 excursions or longer durations, or those preceded by longerpauses, are likely to be misrecognized (Litman et al., 2000).

Another common feature in confirmation is the cost of making an error. For example, explicit confirmation is common before a flight is actually booked or money in an account is moved. Systems might have a four-tiered level of confidence with three thresholds α, β, and γ:

< α   low confidence         reject
≥ α   above the threshold    confirm explicitly
≥ β   high confidence        confirm implicitly
≥ γ   very high confidence   don't confirm at all
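The four-tiered policy can be sketched as a simple threshold function; the particular threshold values here are illustrative assumptions (real systems tune them, and would lower them when the cost of an error is high):

```python
# Mapping ASR confidence to a confirmation action using three thresholds.
# The threshold values are illustrative assumptions.

ALPHA, BETA, GAMMA = 0.3, 0.7, 0.9

def confirmation_action(confidence):
    if confidence < ALPHA:
        return "reject"
    elif confidence < BETA:
        return "confirm explicitly"
    elif confidence < GAMMA:
        return "confirm implicitly"
    return "don't confirm"

print(confirmation_action(0.2))   # reject
print(confirmation_action(0.95))  # don't confirm
```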

30.4 A simple policy based on local context

The goal of the dialog policy at turn i in the conversation is to predict which actionAi to take, based on the entire dialog state. The state could mean the entire sequenceof dialog acts from the system (A) and from the user (U), in which case the taskwould be to compute:

Ai = argmaxAi2A

P(Ai|(A1,U1, ...,Ai�1,Ui�1) (30.5)

We can simplify this by maintaining as the dialog state mainly just the set ofslot-fillers that the user has expressed, collapsing across the many different conver-sational paths that could lead to the same set of filled slots.

Such a policy might then just condition on the current state of the frame Framei(which slots are filled and with what) and the last turn by the system and user:

Ai = argmaxAi2A

P(Ai|Framei�1,Ai�1,Ui�1) (30.6)

Given a large enough corpus of conversations, these probabilities can be esti-mated by a classifier. Getting such enormous amounts of data can be difficult, andoften involves building user simulators to generate artificial conversations to trainon.
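Under this simplification, the policy can be estimated by counting which system act followed each (frame, last system act, last user act) state in a corpus. A toy sketch, where the three-turn corpus and the act names are invented for illustration:

```python
# Maximum-likelihood dialog policy: count which system act followed each
# (frame state, last system act, last user act) tuple in a corpus.
# The toy corpus and act names are illustrative assumptions.

from collections import Counter, defaultdict

corpus = [
    (("origin",), "request(origin)", "inform(origin)", "request(destination)"),
    (("origin",), "request(origin)", "inform(origin)", "request(destination)"),
    (("origin", "destination"), "request(destination)", "inform(destination)",
     "confirm(origin, destination)"),
]

counts = defaultdict(Counter)
for frame, last_sys, last_user, next_act in corpus:
    counts[(frame, last_sys, last_user)][next_act] += 1

def policy(frame, last_sys, last_user):
    """Return the most frequent next system act seen in this state."""
    state = (frame, last_sys, last_user)
    if counts[state]:
        return counts[state].most_common(1)[0][0]
    return "request(missing-slot)"  # fallback for unseen states

print(policy(("origin",), "request(origin)", "inform(origin)"))
# request(destination)
```

Real systems face severe data sparsity with this approach, which is why user simulators (mentioned above) are often used to generate training conversations.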

30.5 Natural language generation in the dialog-state model

Once a dialog act has been decided, we need to generate the text of the response to the user. The task of natural language generation (NLG) in the information-state architecture is often modeled in two stages: content planning (what to say) and sentence realization (how to say it).

Here we'll assume content planning has been done by the dialog policy, which has chosen the dialog act to generate, and perhaps also chosen some additional attributes (slots and values) that the planner wants to implicitly confirm to the user. Fig. 30.6 shows a sample input structure from the policy/content planner, and one example of a resulting sentence that the sentence realizer could generate from this structure.

Let's walk through the sentence realization stage for the example in Fig. 30.6, which comes from the classic information state statistical NLG system of Oh and

Page 101:

Summary

• State of the art:
  • Chatbots:
    • Simple rule-based systems
    • IR or neural networks: mine datasets of conversations.
  • Frame-based systems:
    • hand-written rules for slot fillers
    • ML classifiers to fill slots

• What's the future?
  • Key direction: integrating goal-based and chatbot-based systems
