Computational Extraction of Social and Interactional Meaning from Speech
Dan Jurafsky and Mari Ostendorf

Lecture 7: Dialog Acts & Sarcasm
Mari Ostendorf

Note: Uncredited examples are from the Dialogue & Conversational Agents chapter.
Human-Computer Dialog

System (Greeting): Welcome to the Communicator...
User (Request): I wanna go from Denver to ...
System (Clarification Question): What time do you want to leave Denver?
User: I’d like to leave in the morning ...
System (Inform/Response): Eight flight options were returned. Option 1...
Overview
• Dialog acts
  – Definitions
  – Important special cases
  – Detection
• Role of prosody
• Sarcasm
  – In speech
  – In text
Speech/Dialog/Conversation Acts
• Characterize the purpose of an utterance
• Associated with sentences (or intonational phrases)
• Used for:
  – Determining and controlling the “state” of a conversation in a spoken language system
  – Conversation analysis, e.g. extracting social information
• Many different tag sets, depending on application
Aside: Speech vs. Text
• Speech/dialog/conversation act inventories were developed when conversations were spoken
• Now, conversations can also happen online or via text messaging
• Dialog acts are relevant there too; researchers are starting to look at this
• Some differences:
  – Text is impoverished relative to speech, so extra punctuation, emoticons, etc., are added
  – Turn-taking & grounding
Special Cases
• Question detection – punctuation prediction
• 4-category general set (statement, question, incomplete, backchannel) – cross-domain training and transfer
• Agreement vs. disagreement – social analysis
• Error corrections (for communication errors) – human-computer dialogs
Automatic Detection
Two problems:
• Classification given segmentation
• Segmentation (often multiple DAs per turn)
Best treated jointly, but this can be computationally complex – start with the known-segmentation case.

Example turn:
  ok uh let me pull up your profile and I’ll be right with you here and you said you wanted to travel next week

One segmentation:
  1. ok uh let me pull up your profile and I’ll be right with you here
  2. and you said you wanted to travel next week

A finer segmentation:
  1. ok  2. uh let me pull up your profile and  3. I’ll be right with you here and  4. you said you wanted to travel  5. next week
More Segmentation Challenges

A: Ok, so what do you think?
B: Well that’s a pretty loaded topic.
A: Absolutely.
B: Well, here in uh – Hang on just a minute, the dog is barking – Ok, here in Oklahoma, we just went through a major educational reform…

A: After all these things, he raises hundreds of millions of dollars. I mean uh the fella
B: but he never stops talking about it.
A: but ok
B: Aren’t you supposed to y- I mean
A: well that’s a little- the Lord says
B: Does charity mean something if you’re constantly using it as a cudgel to beat your enemies over the- I’m better than you. I give money to charity.
A: Well look, now I…
Knowledge Sources for Classification
• Words and grammar
  – “please,” “would you” – cue to request
  – Aux inversion – cue to Y/N question
  – “uh-huh,” “yeah” – often backchannels
• Prosody
  – Rising final pitch – Y/N question, declarative question
  – Pitch & energy can distinguish a backchannel (“yeah”) from agreement; pitch reset may indicate incomplete
  – Pitch accent type… (more on this)
• Conversational structure (context)
  – Answers follow questions
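The lexical cues above can be turned into a toy rule-based classifier. This is a minimal sketch, not any of the systems discussed later: the cue lists and the four-way inventory are illustrative choices.

```python
# Toy rule-based dialog-act guesser from lexical cues alone.
# Cue lists are illustrative, not from any particular system.
BACKCHANNELS = {"uh-huh", "uhhuh", "yeah", "mhm", "right"}
AUX = {"do", "does", "did", "is", "are", "can", "could", "will", "would"}

def classify_da(utterance: str) -> str:
    words = utterance.lower().strip("?!. ").split()
    if not words:
        return "incomplete"
    # Short, cue-word-only utterances are often backchannels.
    if len(words) <= 2 and all(w in BACKCHANNELS for w in words):
        return "backchannel"
    # "please" / "would you" are cues to requests.
    if "please" in words or ("would" in words and "you" in words):
        return "request"
    # Aux inversion at the start is a cue to a Y/N question.
    if words[0] in AUX:
        return "question"
    return "statement"

print(classify_da("uh-huh"))                               # backchannel
print(classify_da("would you pull up my profile"))         # request
print(classify_da("do you want to leave in the morning"))  # question
```

Real systems replace these hand rules with learned classifiers over the same cue types, as the next slides describe.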
Feature Extraction
• Words
  – N-grams as features
  – DA-dependent n-gram language model score
  – Presence/absence of syntactic constituents
• Prosody (typically with normalization)
  – Speaking rate
  – Mean and variance of log energy
  – Fundamental frequency: mean, variance, overall contour trend, utterance-final contour shape, change in mean across utterance boundaries
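A sketch of computing such utterance-level prosodic features from per-frame f0 and energy tracks. The feature set follows the list above; the input representation (Hz per frame, unvoiced frames as 0) and the least-squares slope as a "contour trend" are my assumptions, and speaker normalization is omitted.

```python
import math
from statistics import mean, variance

def prosodic_features(f0, energy, duration_s, n_words):
    """Utterance-level prosodic features from per-frame f0 (Hz, 0 =
    unvoiced) and per-frame energy. Speaker normalization omitted."""
    log_e = [math.log(e) for e in energy if e > 0]
    voiced = [f for f in f0 if f > 0]
    n = len(voiced)
    # Crude overall contour trend: least-squares slope over voiced f0.
    x_bar, y_bar = (n - 1) / 2, mean(voiced)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(voiced)) / \
            sum((x - x_bar) ** 2 for x in range(n))
    return {
        "speaking_rate": n_words / duration_s,
        "log_energy_mean": mean(log_e),
        "log_energy_var": variance(log_e),
        "f0_mean": y_bar,
        "f0_var": variance(voiced),
        "f0_trend": slope,                # > 0: rising, cue to Y/N question
        "f0_final": voiced[-1] - y_bar,   # final contour relative to mean
    }

rising = prosodic_features([100, 105, 110, 120, 135], [1, 2, 2, 3, 3], 1.2, 4)
print(rising["f0_trend"] > 0)   # True: rising contour
```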
Combining Cues with Context
• With conversational structure: need a sequence model
  – d = dialog act sequence d_1, …, d_T
  – f = prosody features, w = word/grammar features
• Direct model (e.g. conditional random field):
    argmax_d p(d|f,w), where p(d|f,w) = ∏_t p(d_t|f_t, w_t, d_{t-1})
• Generative model (e.g. HMM, or hidden event model):
    argmax_d p(f,w|d) p(d), where p(f,w|d) p(d) = ∏_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1})
• Experimental results show small gain from context
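Decoding the generative sequence model is standard Viterbi search in the log domain. The sketch below uses a three-act toy inventory with made-up probabilities (not from any of the cited systems); the "answers follow questions" regularity is encoded in the transition table.

```python
import math

STATES = ["statement", "question", "backchannel"]

def viterbi(obs_loglik, log_trans, log_prior):
    """argmax_d prod_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1}), in logs.
    obs_loglik[t][d] = log p(f_t|d) + log p(w_t|d)."""
    score = {d: log_prior[d] + obs_loglik[0][d] for d in STATES}
    backptr = []
    for t in range(1, len(obs_loglik)):
        new_score, bp = {}, {}
        for d in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans[(p, d)])
            new_score[d] = score[prev] + log_trans[(prev, d)] + obs_loglik[t][d]
            bp[d] = prev
        score = new_score
        backptr.append(bp)
    path = [max(STATES, key=score.get)]   # best final state
    for bp in reversed(backptr):          # trace back
        path.append(bp[path[-1]])
    return list(reversed(path))

lp = math.log
log_prior = {"statement": lp(0.6), "question": lp(0.2), "backchannel": lp(0.2)}
# Questions are very likely followed by statements (answers follow questions).
log_trans = {(p, d): lp(0.8 if (p == "question" and d == "statement")
                        or (p != "question" and p == d) else 0.1)
             for p in STATES for d in STATES}
# Two segments: the first looks like a question, the second like a statement.
obs = [{"statement": lp(0.2), "question": lp(0.7), "backchannel": lp(0.1)},
       {"statement": lp(0.6), "question": lp(0.2), "backchannel": lp(0.2)}]
print(viterbi(obs, log_trans, log_prior))   # ['question', 'statement']
```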
Assuming Independent Segments
• No sequence model, but DA prior (unigram) still important
• Direct model:
    argmax_{d_t} p(d_t|f_t, w_t)
  – Features can extend beyond the utterance to approximately capture context
  – Need to handle nonhomogeneous cues or make them homogeneous
• Generative model:
    argmax_{d_t} p(f_t|d_t) p(w_t|d_t) p(d_t)
  – Can predict d_t using separate w and f classifiers, then do classifier combination
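The generative independent-segment case reduces to adding log-scores from the two classifiers and the prior. A minimal sketch with made-up numbers:

```python
import math

def combine(word_loglik, pros_loglik, log_prior):
    """argmax_d log p(f|d) + log p(w|d) + log p(d) from separate
    word and prosody classifier scores."""
    return max(log_prior,
               key=lambda d: word_loglik[d] + pros_loglik[d] + log_prior[d])

lp = math.log
# Words look statement-like, but rising pitch tips the decision to question.
word_ll = {"statement": lp(0.5), "question": lp(0.4)}
pros_ll = {"statement": lp(0.2), "question": lp(0.8)}
prior   = {"statement": lp(0.7), "question": lp(0.3)}
print(combine(word_ll, pros_ll, prior))   # question
```

This naive log-score addition assumes the word and prosody cues are independent given the dialog act, which is exactly the generative model's factorization.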
Some Results (not directly comparable)
42 classes (Stolcke et al., CL 2000)Hidden-event model: prosody & words (& context)42-class accuracy: 62-65% Switchboard ASR (68-71% hand transcripts)
4 classes (Margolis et al., DANLP 2009)Liblinear, n-grams + length (no prosody), hand transcripts4-class accuracy: 89% Swbd, 84% MRDA 4-class avg recall: 85% Swbd, 81% MRDA
2 classes (Margolis & Ostendorf, ACL 2011)Liblinear, n-grams + prosody, hand transcriptsquestion F-measure: 0.6 MRDA (recall = 92%)
3 classes (Galley et al., ACL 2004)Maxent, lexical-structural-duration features, hand transcripts3-class accuracy: 86% MRDA
Backchannel “Universals”
• What do backchannels have in common across languages? Short length and low energy – NOT the words
• Example:
  – English: uh-huh, right, yeah
  – Spanish: mmm, si, ya (“mmm,” “yes,” “already”)
• Experiment: cross-language DA classification for English vs. Spanish conversational telephone speech (Margolis et al., 2009)
  – Statement, question, incomplete, backchannel
  – Use automatic translation in cross-language classification

Spanish vs. English DAs
• Backchannels:
  – roughly 20% of DAs
  – lexical cues are useful within languages, so length is not used much
  – length is more important across languages
• Questions:
  – “<s> es que” often starts a statement in Spanish
  – translated, “<s> is that” indicates a question in English
Prosody
• Impact overall is small (from Stolcke et al., CL 2000)
• BUT, it can be important for some distinctions
  – Oh. (disappointment) vs. Oh! (I get it)
  – Yeah: positive vs. negative
  – Other examples: right, so, absolutely, ok, thank you, …
• Whatever! (Benus, Gravano & Hirschberg, 2007)
  – Production: the 1st syllable is more likely to have a pitch accent for the negative interpretation
  – Perception: listeners’ negativity judgments from prosody on “whatever” alone are similar to those made with full context
Sarcasm
• Changing the default (or literal) meaning
• Objectives of sarcasm:
  – Make someone else feel bad or stupid
  – Display anger or annoyance about something
  – Inside joke
• Why is it interesting?
  – More accurate sentiment detection
  – More accurate agreement/disagreement detection
  – General understanding of communication strategies
Negative positives in talk shows: “yeah”
• and i don't think you’re going to be going back … yeah
• oh yeah
• that's right yeah
• yeah
• yeah but …
• yeah well i well m my understanding is …
• yeah it it it gosh you know is that the standard that prosecutors use the maybe possibly she's telling the truth standard
• yeah i i don't think it was just the radical right
• yeah larry i i want to correct something randi said of course
Negative positives (cont.) – “right”
• right
• th that's right
• that's right yeah
• you know what you're right but
• right but but you you can't say that punching him …
• right but the but the psychiatrists in this case were not just …
• senators are not polling very well right
• then as a columnist who's offering opinions on what i think the right policy is it seems to me…
Yeah, right. (Tepperman et al., 2006)
• 131 instances of “yeah right” in Switchboard & Fisher; 23% annotated as sarcastic
• Annotation:
  – In isolation: very low agreement between human listeners (k=0.16)*
  – In context: still weak agreement (k=0.31)
  – Gold standard based on discussion
• Observation: laughter is much more frequent around sarcastic versions

* “Prosody alone is not sufficient to discern whether a speaker is being sarcastic.”
Sarcasm Detector
• Features:
  – Prosody: relative pitch, duration & energy for each word
  – Spectral: class-dependent HMM acoustic model score
  – Context: laughter, gender, pause, Q/A DA, location in utterance
• Classifier: decision tree (WEKA)
  – Implicit feature selection in tree training

Results
• Laughter is the most important contextual feature
• Energy seems a little more important than pitch
Overview
• Dialog acts
• Role of prosody
• Sarcasm
  – In speech
  – In text
Two studies:
• Davidov, Tsur & Rappoport, 2010 – DTR10
• Gonzalez-Ibanez, Muresan & Wacholder, 2011 – GIMW11

Sarcasm in Twitter & Amazon
• Twitter examples (DTR10):
  – “thank you Janet Jackson for yet another year of Super Bowl classic rock!”
  – “He’s with his other woman: XBox 360. It’s 4:30 fool. Sure I can sleep through the gunfire”
  – “Wow GPRS data speeds are blazing fast.”
• More Twitter examples (GIMW11):
  – @UserName That must suck.
  – I can't express how much I love shopping on black Friday.
  – @UserName that's what I love about Miami. Attention to detail in preserving historic landmarks of the past.
  – @UserName im just loving the positive vibes out of that!
• Amazon examples (DTR10):
  – “[I] Love The Cover” (book)
  – “Defective by design” (music player)
Negative positive
Twitter #sarcasm issues
• Problems (DTR10):
  – Used infrequently
  – Used in non-sarcastic cases, e.g. to clarify a previous tweet (it was #Sarcasm)
  – Used when sarcasm is otherwise ambiguous (prosody surrogate?) – biased towards the most difficult cases
• GIMW11 argues that the non-sarcastic cases are easily filtered by only using ones with #sarcasm at the end
DTR10 Study
• Data:
  – Twitter: 5.9M tweets, unconstrained context
  – Amazon: 66k reviews, known product context
  – Mechanical Turk annotation: K = 0.34 on Amazon, K = 0.41 on Twitter
• Features:
  – Patterns of high-frequency words + content word slots, e.g. “[COMPANY] CW does not CW much”
  – Punctuation
• K-NN classifier
• Semi-supervised labeling of training samples
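A much-simplified sketch of pattern features in the DTR10 spirit: high-frequency words are kept literal and CW slots match any single content word. This is not the authors' code, and the two-valued scoring (1.0 for a match, 0.1 otherwise) flattens their graded exact/partial match scheme; the "[COMPANY]" slot of the example pattern is also dropped here.

```python
# Simplified DTR10-style pattern matching: literal high-frequency words
# plus CW slots that match any single content word.
def pattern_score(pattern: str, sentence: str) -> float:
    """1.0 if the pattern matches some span of the sentence, else 0.1
    (a flattened version of the paper's graded match scores)."""
    toks = sentence.lower().split()
    pat = pattern.lower().split()
    for i in range(len(toks) - len(pat) + 1):
        if all(p == "cw" or p == toks[i + j] for j, p in enumerate(pat)):
            return 1.0
    return 0.1

# Pattern modeled on the slide's "[COMPANY] CW does not CW much".
print(pattern_score("CW does not CW much", "this player does not work much"))
```

A vector of such scores over many mined patterns would then feed the K-NN classifier.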
DTR10 Results

Amazon results for different feature sets on the gold standard:

  Feature set       F-score
  Punctuation       0.28
  Patterns          0.77
  Patts + punc      0.81
  Enriched patts    0.40
  Enriched punct    0.77
  All (SASI)        0.83

Amazon/Twitter SASI results for different evaluation paradigms:

  Setting           F-score
  Amazon – Turk     0.79
  Twitter – Turk    0.83
  Twitter – #Gold   0.55
GIMW11 Study
• Data: 2700 tweets, equal amounts of positive, negative and sarcastic (no neutral)
• Annotation by hashtags: sarcasm/sarcastic, happy/joy/lucky, sadness/angry/frustrated
• Features:
  – Unigrams, LIWC classes (grouped), WordNet affect
  – Interjections and punctuation, emoticons & ToUser
• Classifiers: SVM & logistic regression
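A sketch of extracting GIMW11-style surface features from a tweet. The cue lists below are illustrative guesses, and the LIWC and WordNet-Affect class features the study also used are omitted.

```python
# Illustrative surface features for a tweet: unigrams plus counts of
# interjections, punctuation, emoticons, and a ToUser flag.
INTERJECTIONS = {"oh", "wow", "yay", "ugh", "gosh"}
EMOTICONS = {":)", ":(", ":-)", ":-(", ";)", ":P"}

def tweet_features(tweet: str) -> dict:
    toks = tweet.lower().split()
    return {
        "unigrams": [t for t in toks if t.isalpha()],
        "n_interjections": sum(t in INTERJECTIONS for t in toks),
        "n_exclaim": tweet.count("!"),
        "n_question": tweet.count("?"),
        "n_emoticons": sum(t in EMOTICONS for t in tweet.split()),
        "to_user": any(t.startswith("@") for t in toks),
    }

f = tweet_features("@UserName wow that must suck !")
print(f["to_user"], f["n_interjections"], f["n_exclaim"])  # True 1 1
```

These feature dictionaries would then be vectorized and fed to an SVM or logistic regression classifier, as in the study.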
Results
• Automatic system accuracy:
  – 3-way S-P-N: 57%; 2-way S-NS: 65%
  – Equal difficulty in separating sarcastic from positive and from negative
• Human S-P-N labeling: 270-tweet subset, K=0.48
  – Human “accuracy”: 43% unanimous, 63% avg
• New humans, S-NS labeling: K=0.59
  – Human “accuracy”: 59% unanimous, 67% avg; automatic: 68%
• Accuracies & agreement go up for the subset with emoticons
• Conclusion: humans are not so good at this task either…
Summary
• Dialog acts
  – Purpose of an utterance in conversation
  – Useful for punctuation in transcription, social analysis, dialog management in human-computer interaction
  – Detection leverages words, grammar, prosody & context
• Prosody
  – Matters for a small subset of DAs, but can matter a lot for these cases
  – Is realized in continuous (range) and symbolic (accents) cues – needs contextual normalization
• Sarcasm: a difficult task! (for both text and speech)