Making natural language processing robust to sociolinguistic variation
Jacob Eisenstein (@jacobeisenstein)
Georgia Institute of Technology
September 9, 2017
Machine reading: from text to structured representations.
New domains of digitized texts offer opportunities as well as challenges.
Annotate and train.
Language data then and now
Then: news text, small set of authors, professionally edited, fixed style.
Now: open domain, everyone is an author, unedited, many styles.
Social media has forced NLP to confront the challenge of missing social context (Eisenstein, 2013):
- tacit assumptions about audience knowledge
- language variation across social groups
(Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011)
Finding tacit context in the social network
- Social media texts lack context, because it is implicit between the writer and the reader.
- Homophily: socially connected individuals tend to share traits.
Assortativity of entity references
We project embeddings for entities, words, and authors into a shared semantic space, e.g. the noisy mentions "Dirk Novitsky" and "the warriors". Inner products in this space indicate compatibility.
Socially-Infused Entity Linking
Variables: tweet, entity assignments, author.
- g1 is employed to model surface features.
- g2 is used to capture two assumptions:
  - entity homophily
  - semantically related mentions tend to refer to similar entities
Socially-Infused Entity Linking

  g2(x, yt, u, t; Θ2) = (v_u^(u))ᵀ W^(u,e) v_yt^(e) + (v_t^(m))ᵀ W^(m,e) v_yt^(e)

where v_u^(u) is the author embedding, v_t^(m) is the mention embedding, and v_yt^(e) is the embedding of candidate entity yt. The surface-feature score g1 operates on features φ(x, yt, t). Training is loss-augmented.
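The bilinear compatibility score above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function name, dimensions, and random values here are all hypothetical.

```python
import numpy as np

def g2_score(v_u, v_m, v_e, W_ue, W_me):
    """Bilinear compatibility: author-entity term plus mention-entity term.

    v_u: author embedding, v_m: mention embedding, v_e: candidate entity
    embedding; W_ue and W_me are the learned compatibility matrices.
    """
    return float(v_u @ W_ue @ v_e + v_m @ W_me @ v_e)

# Illustrative call with made-up dimensions: at linking time, each
# candidate entity for a mention is scored and the scores enter the
# structured prediction over the whole tweet.
rng = np.random.default_rng(0)
d = 8
score = g2_score(rng.normal(size=d), rng.normal(size=d),
                 rng.normal(size=d), rng.normal(size=(d, d)),
                 rng.normal(size=(d, d)))
```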
Learning
- Loss-augmented inference, with Hamming loss as the cost function.
- Optimization: stochastic gradient descent.
Inference
- Non-overlapping structure: in order to link "Red Sox" to a real entity, "Red" and "Sox" should be linked to Nil.
[Chart: F1 on the NEEL and TACL datasets for Classifier, Struct, Struct+Social, and S-MART; annotated gains: +3.2 on NEEL, +2.0 on TACL.]
- Structured prediction improves accuracy.
- Social context yields further improvements.
- S-MART is the prior state of the art (Yang & Chang, 2015).
Social media has forced NLP to confront the challenge of missing social context (Eisenstein, 2013):
- tacit assumptions about audience knowledge
- language variation across social groups
(Gimpel et al., 2011; Ritter et al., 2011; Foster et al., 2011)
Language variation: a challenge for NLP
"I would like to believe he's sick rather than just mean and evil."
"You could've been getting down to this sick beat."
(Yang & Eisenstein, 2017)
Personalization by ensemble
- Goal: personalized conditional likelihood P(y | x, a), where a is the author.
- Problem: we have labeled examples for only a few authors.
- Personalization ensemble:

  P(y | x, a) = Σ_k P_k(y | x) π_a(k)

- P_k(y | x) is a basis model.
- π_a(·) are the ensemble weights for author a.
Homophily to the rescue?
[Figure: a social network with a few labeled authors and many unlabeled authors, each tweeting "Sick!"]
Are language styles assortative on the social network?
Evidence for linguistic homophily
Pilot study: is classifier accuracy assortative on the Twitter social network?

  assort(G) = (1 / #|G|) Σ_{(i,j)∈G} [ δ(ŷ_i = y_i) δ(ŷ_j = y_j) + δ(ŷ_i ≠ y_i) δ(ŷ_j ≠ y_j) ]

where ŷ is the classifier prediction, y is the true label, and #|G| is the number of edges.

[Figure: assortativity over 0-100 rewiring epochs for the follow, mention, and retweet networks; the original networks (≈0.70-0.735) are consistently more assortative than random rewirings.]
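The assortativity statistic above has a direct implementation: count the fraction of edges whose two endpoints agree on whether the classifier got them right. This is a minimal sketch; the toy network below is made up for illustration.

```python
def assortativity(edges, correct):
    """Fraction of edges whose endpoints agree on classifier accuracy:
    both endpoints classified correctly, or both incorrectly.

    edges: iterable of (i, j) node pairs.
    correct: mapping from node id to True/False (prediction == label).
    """
    edges = list(edges)
    agree = sum(1 for i, j in edges if correct[i] == correct[j])
    return agree / len(edges)

# Hypothetical 4-node network: nodes 0 and 1 are classified correctly,
# nodes 2 and 3 are not. Edges (0,1) and (2,3) agree; (1,2) does not.
edges = [(0, 1), (1, 2), (2, 3)]
correct = {0: True, 1: True, 2: False, 3: False}
```

The rewiring test in the figure then compares this value on the real network against the same statistic after edges are randomly rewired, holding degrees fixed.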
Network-driven personalization
- For each author, estimate a node embedding e_a (Tang et al., 2015).
- Nodes that share neighbors get similar embeddings.

  π_a = SoftMax(f(e_a))

  P(y | x, a) = Σ_{k=1}^{K} P_k(y | x) π_a(k)
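The two equations above compose straightforwardly: author scores f(e_a) pass through a softmax to give mixture weights, which blend the basis models' label distributions. A minimal sketch, with made-up basis distributions and scores:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def personalized(basis_probs, author_scores):
    """P(y | x, a) = sum_k P_k(y | x) * pi_a(k), with pi_a = softmax(f(e_a)).

    basis_probs: K lists, each a distribution over labels from basis model k.
    author_scores: K raw scores f(e_a) for this author.
    """
    pi = softmax(author_scores)
    n_labels = len(basis_probs[0])
    return [sum(pi[k] * basis_probs[k][y] for k in range(len(pi)))
            for y in range(n_labels)]

# Two hypothetical basis models over a binary label; equal author scores
# give uniform weights, so the mixture is the average of the two.
p = personalized([[0.9, 0.1], [0.2, 0.8]], [0.0, 0.0])
```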
Results

[Chart: F1 improvement over the ConvNet baseline on Twitter Sentiment Analysis: Mixture of Experts +0.10, NLSE +1.90, Social Personalization +2.80.]

Improvements over the ConvNet baseline:
- +2.8% on Twitter Sentiment Analysis
- +2.7% on Ciao Product Reviews

NLSE is the prior state of the art (Astudillo et al., 2015).
Variable sentiment words

    More positive                       More negative
 1  banging loss fever broken fucking   dear like god yeah wow
 2  chilling cold ill sick suck         satisfy trust wealth strong lmao
 3  ass damn piss bitch shit            talent honestly voting win clever
 4  insane bawling fever weird cry      lmao super lol haha hahaha
 5  ruin silly bad boring dreadful      lovatics wish beliebers arianators kendall
Summary
Robustness is a key challenge for making NLP effective on social media data:
- Tacit assumptions about shared knowledge; language variation.
- Social metadata gives NLP systems the flexibility to handle each author differently.

The long tail of rare events is the other big challenge:
- Word embeddings for unseen words (Pinter et al., 2017)
- Lexicon-based supervision (Eisenstein, 2017)
- Applications to finding rare events in electronic health records (ongoing work with Jimeng Sun)
Acknowledgments
- Students and collaborators:
  - Yi Yang (GT → Bloomberg)
  - Ming-Wei Chang (Google Research)
  - See https://gtnlp.wordpress.com/ for more!
- Funding: National Science Foundation, National Institutes of Health, Georgia Tech
References

Astudillo, R. F., Amir, S., Lin, W., Silva, M., & Trancoso, I. (2015). Learning word representations from scarce and noisy data with embedding sub-spaces. In Proceedings of the Association for Computational Linguistics (ACL), Beijing.

Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), (pp. 359–369).

Eisenstein, J. (2017). Unsupervised learning for lexicon-based classification. In Proceedings of the National Conference on Artificial Intelligence (AAAI), San Francisco.

Foster, J., Cetinoglu, O., Wagner, J., Le Roux, J., Nivre, J., Hogan, D., & van Genabith, J. (2011). From news to comment: Resources and benchmarks for parsing the language of web 2.0. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), (pp. 893–901), Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., & Smith, N. A. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the Association for Computational Linguistics (ACL), (pp. 42–47), Portland, OR.

Pinter, Y., Guthrie, R., & Eisenstein, J. (2017). Mimicking word embeddings using subword RNNs. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Ritter, A., Clark, S., Mausam, & Etzioni, O. (2011). Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). LINE: Large-scale information network embedding. In Proceedings of the Conference on the World Wide Web (WWW), (pp. 1067–1077).

Yang, Y. & Chang, M.-W. (2015). S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In Proceedings of the Association for Computational Linguistics (ACL), (pp. 504–513), Beijing.

Yang, Y. & Eisenstein, J. (2017). Overcoming language variation in sentiment analysis with social attention. Transactions of the Association for Computational Linguistics (TACL), in press.