User Trait Expression and Portrayal through Social Media
Daniel Preotiuc-Pietro
Bloomberg LP
1 November 2018
Context
The availability of large scale user generated data provides the context for new applications and research.
The key elements are:
• metadata
  • user
  • time
  • location
• volume
• diversity
  • text
  • images
  • social connections
User Traits and Text
Hypothesis
User generated text reveals individual differences in both demographic and psychological traits.
Demographic Traits
• Age (Rao et al. 2010, ACL)
• Gender (Burger et al. 2011, EMNLP)
• Location (Eisenstein et al. 2010, EMNLP)
• Political Orientation (Volkova et al. 2014, ACL)
• Popularity (Lampos et al. 2014, EACL)
• Occupation (Preotiuc-Pietro et al. 2015, ACL)
• Income (Preotiuc-Pietro et al. 2015, PLoS ONE)
• Political Ideology (Preotiuc-Pietro et al. 2017, ACL)
• Race (Preotiuc-Pietro & Ungar 2018, COLING)
Psychological Traits
Psychological traits:
• Mental illness (Coppersmith et al. 2014, ACL)
• Personality (Schwartz et al. 2013, PLoS ONE)
• Empathy (Abdul-Mageed et al. 2017, ICWSM)
• ‘Dark Triad’ Personality (Preotiuc-Pietro et al. 2016, CIKM)
• Active Open-Minded Thinking (Carpenter et al. 2018, JDM, in press)
Aspects
1. Data
2. Prediction
3. Insight
Example: Political Ideology
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Data
Social media data analysis:
• Unobtrusive
  • Observe behaviors, rather than self-reported
• Access to data from a larger and more diverse population
  • Traditional social science research is based on convenience lab samples
• Access to both historical and real-time data
• Fine spatial granularity
Data - Ethics
Twitter – profiles are public by default
Facebook/Instagram – users provide informed consent to share data
User-trait analysis requires trait-level information, which, when provided through surveys, is sensitive and is anonymised.
All studies were approved by the Institutional Review Board (IRB).
Data - Example
We collected a new data set:
• 3,938 users (4.8M tweets)
• public Twitter handles with >100 posts
Political ideology is reported through an online survey:
• our use case is US politics
• the major US ideology spectrum is Conservative – Liberal
• seven-point scale
• additionally reported age, gender and other demographics
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Data - Applications
Social media data enables new types of applications and studies.
Real-time passive polling:
Prediction
              Prediction                     Insight
Perspective   NLP/ML                         Social Science
Goal          Models to predict traits of    Gain a better understanding of
              unknown users                  group behaviors and differences
Framing       Predictive task                Exploring/testing hypotheses
Methods       Regression/Classification      Statistical hypothesis testing
                                             Interpretable features
                                             Use domain experts in analysis
Prediction - Example
• Linear Regression
• Leaning: V. Conservative (1) – V. Liberal (7)
• Engagement: Neutral (4) – Moderate C/L (3&5) – C/L (2&6) – Very C/L (1&7)
• 10-fold cross-validation
• Range of linguistic features
• Evaluation – Pearson r between predictions and true labels

Features     Leaning   Engagement
Unigrams     .294      .165
LIWC         .286      .149
Topics       .300      .169
Emotions     .145      .079
Political    .256      .169
All          .369      .196

Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
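The two target scales and the evaluation metric above can be illustrated with a small sketch. It assumes that engagement is coded numerically as the distance of the leaning score from the neutral point (4); the slide only gives the grouping, so the 0-3 coding, the function names and the toy numbers are illustrative assumptions rather than the authors' implementation.

```python
import math


def engagement_from_leaning(leaning):
    """Fold a 1-7 leaning score into an engagement level.

    Assumed coding: 4 -> 0 (neutral), 3/5 -> 1, 2/6 -> 2, 1/7 -> 3.
    """
    if not 1 <= leaning <= 7:
        raise ValueError("leaning must be between 1 and 7")
    return abs(leaning - 4)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy survey labels and made-up model outputs, for illustration only.
    true_leaning = [1, 2, 4, 5, 7, 6, 3]
    predicted = [2.1, 2.5, 3.8, 4.4, 6.0, 5.5, 4.0]
    print("engagement:", [engagement_from_leaning(s) for s in true_leaning])
    print("Pearson r:", round(pearson_r(predicted, true_leaning), 3))
```

In a cross-validated setup, the predictions from all held-out folds would be pooled (or scored per fold and averaged) before computing the correlation.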
Prediction - Applications
Applications of predictive models of user traits:
• Improving downstream NLP tools:
  • sentiment analysis
  • text classification
• Personalised AI applications:
  • machine translation
  • dialogue systems with an identity
• Uncover and adjust model biases
• Control for demographic biases in data analysis
• Marketing or targeted ads
• Measure communities in real-time over space and time
Insight
Insight - Example
Differences between moderate and extreme users
[Figure: word clouds of words associated with moderate liberals (5 and 6) and words associated with extreme liberals (7). Legend: relative frequency; correlation strength.]
Correlations are age and gender controlled.
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
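One common way to control a correlation for age and gender is a partial correlation. The sketch below uses the standard recursive formula for partial correlations; whether the original analysis used this or a regression with covariates is not stated on the slide, so treat the approach, the function names and the toy data as assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(x, y, controls):
    """Correlation of x and y after controlling for each variable in `controls`.

    Recursive formula: remove one control z at a time,
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    if not controls:
        return pearson_r(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Made-up per-user values: relative frequency of one word, a group
    # indicator (1 = extreme, 0 = moderate), age, and gender (0/1).
    word_freq = [0.1, 0.4, 0.2, 0.5, 0.3, 0.7, 0.2, 0.6]
    is_extreme = [0, 1, 0, 1, 0, 1, 0, 1]
    age = [23, 35, 41, 29, 52, 33, 27, 46]
    gender = [0, 1, 1, 0, 0, 1, 0, 1]
    print("raw r:    ", round(pearson_r(word_freq, is_extreme), 3))
    print("partial r:", round(partial_r(word_freq, is_extreme, [age, gender]), 3))
```

Running this per word and keeping the words with the strongest partial correlations gives the kind of ranked list that a word cloud visualises.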
Applications - Insight
Insight allows us to:
• Gain a better understanding of:
  • human behaviors
  • language use
  • language change
  • cultural differences
  • stylistic differences
  • pragmatic differences
  • human stereotypes
• Confirm or generate new data-driven hypotheses
Aspects
1. Data
2. Prediction
3. Insight
All steps pose unique challenges and implications.
Aspects
This talk will try to address some of these aspects:
1. Data collection
  • User sampling
  • Trait collection
2. Prediction
3. Insight
  • Content
  • Phrase choice
  • Style
  • Pragmatic roles
User sampling
Collecting representative gold data for training models.
For political orientation, previous NLP research collected users:
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
User sampling
Our hypotheses:
1. These users are far more likely to be politically engaged
2. The prediction problem was over-simplified
3. Neutral users are not accounted for
4. There are differences between moderate and extreme users on the same side
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Engagement
Data set obtained using previous methods
[Figure: bar chart "Political word usage across user groups". Y-axis: average percentage of political word usage (0.00 to 4.00). Series: Media/Pundit Names, Politician Names, Political Words. Data labels as extracted: 2.64, 2.95; 0.73, 0.79; 0.11, 0.18.]
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Engagement
Our data set (survey-based, 7 point ideology scale)
[Figure: bar chart "Political word usage across user groups". Y-axis: average percentage of political word usage (0.00 to 4.00). Series: Media/Pundit Names, Politician Names, Political Words. Data labels as extracted:
2.64 0.76 0.55 0.42 0.36 0.46 0.51 0.76 2.95
0.73 0.24 0.14 0.07 0.07 0.09 0.12 0.19 0.79
0.11 0.03 0.03 0.02 0.02 0.03 0.03 0.04 0.18]
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
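The chart's measure, the average percentage of political word usage per user group, can be sketched as follows. The whitespace tokenisation, the tiny lexicon and the group names are placeholders chosen for illustration; the actual lexicons and preprocessing are not given on the slides.

```python
def political_usage_pct(tweets, lexicon):
    """Percentage of a user's tokens that appear in `lexicon`."""
    vocabulary = {word.lower() for word in lexicon}
    tokens = [
        token.strip(".,!?:;\"'()#@").lower()
        for tweet in tweets
        for token in tweet.split()
    ]
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return 100.0 * hits / len(tokens)


def average_by_group(users, lexicon):
    """Mean per-user percentage for each group.

    `users` is a list of (group_label, tweets) pairs.
    """
    per_group = {}
    for group, tweets in users:
        per_group.setdefault(group, []).append(political_usage_pct(tweets, lexicon))
    return {group: sum(vals) / len(vals) for group, vals in per_group.items()}


if __name__ == "__main__":
    # Hypothetical lexicon and users, for illustration only.
    lexicon = {"senate", "vote", "congress"}
    users = [
        ("group_a", ["The senate vote is today", "I love pizza"]),
        ("group_b", ["Nice weather today", "Going for a run"]),
    ]
    print(average_by_group(users, lexicon))
```

Averaging per user first, and then per group, keeps very active users from dominating a group's score.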
Over-simplification
The prediction problem was over-simplified

Features            CvL    1v7    2v6    3v5
Topics              .891   .785   .662   .581
Political Terms     .972   .785   .679   .590
Domain Adaptation   .976   .789   .690   .625

ROC AUC, Logistic Regression, 10-fold cross-validation.
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
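For readers unfamiliar with the metric, ROC AUC can be computed directly from classifier scores as the probability that a randomly chosen positive user is scored above a randomly chosen negative one. The sketch below is a plain pairwise implementation with made-up scores; it is not the evaluation code behind the numbers above.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def binary_task(users, negative_group, positive_group):
    """Keep only users from two survey groups (e.g. 1 vs 7) and relabel them 0/1.

    `users` is a list of (group, score) pairs; returns (labels, scores).
    """
    labels, scores = [], []
    for group, score in users:
        if group == negative_group:
            labels.append(0)
            scores.append(score)
        elif group == positive_group:
            labels.append(1)
            scores.append(score)
    return labels, scores


if __name__ == "__main__":
    # Made-up (survey group, classifier score) pairs.
    users = [(1, 0.10), (1, 0.40), (7, 0.35), (7, 0.80), (4, 0.50)]
    labels, scores = binary_task(users, negative_group=1, positive_group=7)
    print("1v7 AUC:", roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking, which is the reference point for the "close to random" performance on moderate users.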
User sampling
Takeaways:
• 3x more political terms for automatically identified users compared to the highest survey-based scores
• Performance drops by 15% even when predicting extreme users
• Performance drops by 35% to close to random when predicting between political moderates
User sampling has an important impact on experimental results.
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Trait collection
Trait collection: Identifying the trait value for users.
Several common methods exist:
1. Self-report
2. Distant Supervision
3. Perception (Annotation)
4. Survey-based
Trait collection
1. Self-Report
• Method:
  • Mining profile descriptions
  • Mining tweet contents
  • Mining network connections
  • Processing profile images
• Advantages:
  • Large volume
  • Easy to implement
• Disadvantages:
  • Sample biases - some groups of users are more likely to self-disclose personal information
  • Data usually requires post-filtering due to false positives
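As an illustration of mining profile descriptions, the sketch below extracts a self-reported age with a regular expression. The pattern is a hypothetical example written for this note, not one taken from the cited work, and the last call shows the kind of false positive that makes post-filtering necessary.

```python
import re

# Hypothetical pattern: a two-digit age (13-99) followed by "years old",
# "yrs old", "yo" or "y/o". Real systems use richer, validated patterns.
AGE_PATTERN = re.compile(
    r"\b(1[3-9]|[2-9]\d)\s*(?:years?\s+old|yrs?\s+old|y/?o)\b",
    re.IGNORECASE,
)


def extract_age(description):
    """Return the first self-reported age found in a profile description, or None."""
    match = AGE_PATTERN.search(description)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_age("25 years old, coffee lover"))         # matches the age
    print(extract_age("I have 99 problems"))                 # no age statement
    print(extract_age("Proud mum, my son is 14 years old"))  # false positive: not the author's age
```

Only users who choose to disclose the trait are matched at all, which is the sample bias listed above.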
Trait collection
2. Distant Supervision
• Method:
  • Map users to community statistics (e.g. Census data)
• Advantages:
  • Very large volume
  • Wide variety of traits have community statistics
• Disadvantages:
  • Statistics may be outdated
  • Twitter population is a biased sample of the general population
  • Users that can be geolocated are not representative of the Twitter population
  • Geo-located tweets might be posted from a different location than the user’s home
Trait collection
3. Perception
• Method:
  • Human annotation of profiles, including text
• Advantages:
  • Accurate for common traits
  • Medium volume
• Disadvantages:
  • Contains systematic biases and stereotypes of particular traits
  • Models trained on this data will capture only the perception of the annotator
Trait collection
4. Survey-based
• Method:
  • Ask users for trait information through surveys
• Advantages:
  • Collect information from the actual users
  • Can collect multiple traits
  • Can collect less common psychological traits
• Disadvantages:
  • Costly / Low volume
  • May be untruthful – but we can safeguard
Trait collection - Comparison
Comparing trait collection methods, race prediction, evaluated on survey-based traits.
Daniel Preotiuc-Pietro and Lyle Ungar. “User-Level Race and Ethnicity Predictors from Twitter Text”. In: COLING. 2018.
Survey-based vs. Perceived
We studied how the two differ in relation to demographic traits.
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Experimental Setup
• 20 tweets/user
• 9 ratings/user
• Forced choice guess
• Self-rated confidence (1-5)
• Real traits known in advance through self-reports
This way we isolate the textual cues from any other profile-related cues (screen name, profile picture, etc.).
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Data Set
Trait                   Outcome      #Users   #Raters
Gender                  M/F          2607     1083
Age                     Integer      1066     737
Education               Adv/BSc/HS   900      481
Political Orientation   Lib/Cons     2500     943
Data set statistics
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Human perception accuracy

                         Gender (%)   Education (%)   Political Orientation (%)   Age (r)
Random                   .517         .330            .500                        .000
Accuracy                 .757         .445            .816                        .416
Majority/Average Guess   .858         .488            .903                        .631

People are usually correct.
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
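The difference between the "Accuracy" and "Majority/Average Guess" rows can be made concrete with a short sketch: the first scores every single rating, the second first aggregates the ratings collected for each user. The alphabetical tie-break and the toy ratings are assumptions made for illustration.

```python
from collections import Counter


def majority_guess(ratings):
    """Most common label among one user's ratings (ties broken alphabetically)."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def average_guess(ratings):
    """Mean of numeric guesses, e.g. for age."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


def individual_accuracy(ratings_per_user, true_labels):
    """Share of single ratings that match the true label."""
    correct = total = 0
    for ratings, truth in zip(ratings_per_user, true_labels):
        correct += sum(1 for rating in ratings if rating == truth)
        total += len(ratings)
    return correct / total


def majority_accuracy(ratings_per_user, true_labels):
    """Share of users whose majority guess matches the true label."""
    hits = sum(
        1
        for ratings, truth in zip(ratings_per_user, true_labels)
        if majority_guess(ratings) == truth
    )
    return hits / len(true_labels)


if __name__ == "__main__":
    # Three toy users with nine gender ratings each.
    ratings = [
        ["F"] * 6 + ["M"] * 3,
        ["M"] * 5 + ["F"] * 4,
        ["F"] * 5 + ["M"] * 4,
    ]
    truth = ["F", "M", "M"]
    print("individual:", round(individual_accuracy(ratings, truth), 3))
    print("majority:  ", round(majority_accuracy(ratings, truth), 3))
```

Aggregating over raters smooths out individual noise, which is consistent with the majority/average row being higher than the per-rating row for every trait in the table.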
Inaccurate Gender Stereotypes
Trained two models on the same data with:
• perceived labels
• real labels
Training on perceived traits introduces a systematic bias.
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Inaccurate Gender Stereotypes
[Figure: two bar charts over true gender (Males, Females), Y-axis 0 to 50. Model predictions (Pred. Male, Pred. Female), data labels as extracted: 40.1, 6.1, 7.9, 45.8. Human guesses (Perc. Male, Perc. Female), data labels as extracted: 42.2, 9.9, 7.2, 40.7.]
Model trained on >10,000 users with self-reported gender.
The accuracies for correct predictions are reversed.
The accuracies for incorrect predictions are also reversed!
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Inaccurate Gender Stereotypes
[Figure: word clouds. Words more likely to be associated with females among male authors; words more likely to be associated with males among female authors.]
Word size reflects how strongly a word acts as an inaccurate stereotype, e.g. ’love’ is more likely than ’wonderful’ to mislead people into guessing female.
Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preotiuc-Pietro. “Analyzing Biases in Human Perception of User Age and Gender from Text”. In: ACL. 2016.
Controlling Perception
Can we control human perception of demographic traits?
We restrict to selecting tweets from the user’s timeline.
Daniel Preotiuc-Pietro, Sharath Chandra Guntuku, and Lyle Ungar. “Controlling Human Perception of Basic User Traits”. In: EMNLP. 2017.
Controlling Perception
Annotator accuracy on predicting gender in the three conditions.

Condition   Overall   Females   Males
Random      76.66%    40.67%    35.99%
Opposite    55.73%    32.26%    23.47%
Same        91.33%    47.83%    43.50%

Daniel Preotiuc-Pietro, Sharath Chandra Guntuku, and Lyle Ungar. “Controlling Human Perception of Basic User Traits”. In: EMNLP. 2017.
Beyond Survey-based Methods
Survey-based screening methods for mental illnesses are imperfect.
Mental illness is less likely to be self-reported due to lack of awareness or social stigma.
Surveys may not be the best tool for collecting ’gold’ labels.
Social media can be an alternative.
Johannes Eichstaedt et al. “Facebook Language Predicts Depression in Medical Records”. In: PNAS. 2018.
Beyond Survey-based Methods
We linked medical records with clinical diagnosis of depression to Facebook data.
Johannes Eichstaedt et al. “Facebook Language Predicts Depression in Medical Records”. In: PNAS. 2018.
Content analysis
There are differences between neutral users and ideologically extreme users.
[Figure: word clouds of words associated with either extreme conservative or liberal users and words associated with neutral users. Legend: correlation strength.]
Correlations are age and gender controlled. Extreme groups are combined using matched age and gender distributions.
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Content analysis
There are differences between moderate and extreme users on the same side.
[Figure: word clouds of words associated with moderate liberals (5 and 6) and words associated with extreme liberals (7). Legend: relative frequency; correlation strength.]
Correlations are age and gender controlled.
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Content analysis
Rank     Correlation   Topic (most frequent words)
1        .116          hilarious, celeb, capaldi, corrie, chatty, corden, barrowman
2        .106          photo, art, pictures, photos, instagram, photoset, image
3        .106          hot, sex, naked, adult, teen, porn, lesbian, tube, tits
4        .087          turn, accidentally, barely, constantly, onto, bug, suddenly
5        .086          ha, ooo, uh, ohhh, ohhhh, maam, gotcha, gee, ohhhhh
LIWC-1   .104          fuck, gay, sex, sexy, dick, naked, fucks, cock, aids, cum
LIWC-2   .088          hate, fuck, hell, stupid, mad, sucks, suck, war, dumb, ugly
Word2Vec topics with the highest Pearson correlation between moderately liberal users and moderately conservative users (gender/age controlled).
Daniel Preotiuc-Pietro, Liu Ye, Daniel J Hopkins, and Lyle Ungar. “Beyond Binary Labels: Political Ideology Prediction of Twitter Users”. In: ACL. 2017.
Phrase Choice
Which word is more likely to be used by a female?
Charming – Fascinating
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
Which word is more likely to be used by an older person?
Impressive – Amazing
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
Which word is more likely to be used by a person of higher occupational class?
Suggestions – Proposals
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
Which word is more likely to be used by a female?
Brutal – Fierce
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
Which word is more likely to be used by an older person?
Defensive – Protective
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
Which word is more likely to be used by a person of higher occupational class?
Humour – Wit
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
[Figure: bar chart, Y-axis 50% to 100%. Gender: 68.5%; Age: 73.7%; Occ. Class: 67.2%.]
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
The method for quantifying phrase choice is straightforward:
$$\mathrm{Gender}(w) = \log\left(\frac{\mathrm{Female}(w)}{\mathrm{Male}(w)}\right) \qquad (1)$$
Within a paraphrase pair (w1, w2), the difference Gender(w1) − Gender(w2) is the stylistic distance.
We use only equivalent paraphrases of 1–3 grams from PPDB 2.0.
Statistics are computed over large Twitter data sets with user traits.
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
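Equation (1) and the stylistic distance can be sketched in a few lines. The sketch assumes that Female(w) and Male(w) are the relative frequencies of w in text written by each group, which the slide does not spell out; the function names and all frequencies below are invented for illustration.

```python
import math


def log_ratio(freq_group_a, freq_group_b):
    """log(freq_a / freq_b); positive values mean group A uses the word relatively more."""
    if freq_group_a <= 0 or freq_group_b <= 0:
        raise ValueError("frequencies must be positive (smooth zero counts first)")
    return math.log(freq_group_a / freq_group_b)


def stylistic_distance(w1, w2, female_freq, male_freq):
    """Gender(w1) - Gender(w2) for a paraphrase pair (w1, w2)."""
    gender_w1 = log_ratio(female_freq[w1], male_freq[w1])
    gender_w2 = log_ratio(female_freq[w2], male_freq[w2])
    return gender_w1 - gender_w2


if __name__ == "__main__":
    # Invented relative frequencies for a hypothetical paraphrase pair.
    female_freq = {"word_a": 4e-4, "word_b": 1e-4}
    male_freq = {"word_a": 1e-4, "word_b": 1e-4}
    distance = stylistic_distance("word_a", "word_b", female_freq, male_freq)
    print(round(distance, 3))  # positive: word_a is the relatively more "female" choice
```

Because both words in a pair are paraphrases, a large distance points to a stylistic preference rather than a difference in topic.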
Phrase Choice
Study which attributes of words in a pair are preferred by one group:
• Word Length in Characters
• Word Length in Syllables
  Simple proxies for word complexity
• Affective Norms: Valence, Arousal, Dominance
  14k rated words
  Valence: suicide (0.15) → bacon (0.70) → laughter (1)
• Concreteness
  40k rated words: spirituality (1) → morning (3.44) → tiger (5)
• Age of Acquisition
  30k rated words: great (5.05) → splendid (7.22) → tremendous (10.63)
• More in the paper ...
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
[Figure: horizontal bar chart, X-axis −0.25 to 0.25. Rows: Concreteness, Happiness, Word Rareness, # Syllables, Word Length. Series: Occ. Class (High), Age (>30), Gender (M). Data labels as extracted:
−.048 −.051 −.053 .047 .089
−.037 −.022 −.028 .077 .158
−.124 −.026 −.034 .110 .211]
Correlation coefficients between paraphrase pair word differences and user group differences in usage.
Daniel Preotiuc-Pietro, Wei Xu, and Lyle Ungar. “Discovering User Attribute Stylistic Differences via Paraphrasing”. In: AAAI. 2016.
Phrase Choice
[Figure: horizontal bar chart, X-axis −.200 to .200. Rows: Age of Acquisition, Concreteness, Dominance, Arousal, Happiness, # Syllables, Word Length. Series: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism. Data labels as extracted:
.163 −.068 −.043 −.012 −.041 .067 .182
−.002 −.014 .036 −.001 .050 .045 .097
−.060 .010 .031 .028 .050 .047 .080
−.032 −.007 .030 .005 .040 .016 .010
−.014 .023 .000 −.024 .004 −.020 −.065]
Correlation coefficients between paraphrase pair preference and user group usage.
Daniel Preotiuc-Pietro, Jordan Carpenter, and Lyle Ungar. “Personality Driven Differences in Paraphrase Preference”. In: NLP+CSS Workshop, ACL. 2017.
Stylistic Differences
Correlations of stylistic features with age and income.
[Figure: scatter plot of stylistic features. X-axis: Income r (−0.3 to 0.3); Y-axis: Age r (−0.3 to 0.3). Legend (feature groups): Surface, Readability, Syntax, Style. Plotted features: # Char/Token, # Tokens/Tweet, # Chars/Tweet, # words > 5 char, Type/token Ratio, Punctuation, Smileys, URLs, ARI, F-Kincaid, Coleman-Liau, Flesch RE, FOG, SMOG, LIX, Nouns, Verbs, Pronouns, Adverbs, Adjectives, Determiners, Interjections, Named entities, Contextuality, Abstract, Hedging, Specific, Elongations, Hapax legom.]
Lucie Flekova, Lyle Ungar, and Daniel Preotiuc-Pietro. “Exploring Stylistic Variation with Age and Income on Twitter”. In: ACL. 2016.
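A few of the surface features named in the figure are simple enough to sketch directly. The version below uses whitespace tokenisation and hypothetical feature names; the paper's exact tokeniser and feature definitions may differ.

```python
def surface_features(tweets):
    """Per-user surface statistics computed over a list of tweet strings."""
    tokens = [token for tweet in tweets for token in tweet.split()]
    if not tweets or not tokens:
        raise ValueError("need at least one non-empty tweet")
    return {
        # Average token length in characters.
        "chars_per_token": sum(len(token) for token in tokens) / len(tokens),
        # Average number of tokens per tweet.
        "tokens_per_tweet": len(tokens) / len(tweets),
        # Average tweet length in characters (including spaces).
        "chars_per_tweet": sum(len(tweet) for tweet in tweets) / len(tweets),
        # Share of tokens longer than five characters.
        "long_word_ratio": sum(1 for token in tokens if len(token) > 5) / len(tokens),
        # Distinct lowercased tokens divided by all tokens.
        "type_token_ratio": len({token.lower() for token in tokens}) / len(tokens),
    }


if __name__ == "__main__":
    for name, value in surface_features(["the cat sat", "the elephant"]).items():
        print(name, value)
```

Each feature would then be correlated with age and with income across users to place it on the two axes of the plot.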
Stylistic Differences
Specificity – quantifies how much detail is expressed in a text.
1 – Always too much.
5 – Mascara is the most commonly worn cosmetic, and women will spend an average of $4,000 on it in their lifetimes
Yifan Gao, Yang Zhong, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Predicting and Analyzing Language Specificity in Social Media Posts”. In: AAAI. 2019.
Pragmatic roles
Vulgar words are often used in communication (1%)
Despite this, they are a restricted set of words (100)
Demographic traits impact how often users employ vulgar words online (correlations with % vulgar use):
Isabel Cachola, Eric Holgate, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media”. In: COLING. 2018.
Pragmatic roles
Vulgarity is employed purposefully
Vulgar words are used for different pragmatic functions
We identified six different pragmatic functions
We annotated 8,524 instances of vulgar words across 7,800 tweets from users with known demographic traits.
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
1. Express aggression (15.2%)
The word is used in order to harm the person or group the tweet is about.
USER You are an ass Your industry is full of assholes and you do nothing to improve (...)
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
2. Express emotion (24.8%)
The word is used to express emotions (positive or negative) related to the user’s internal states, exclamations, feelings or attitudes towards an object. If the vulgar term is removed, the expressed emotion is lacking.
There are so many things I want to do, But investing in equipment is a pain in the ass
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
3. Emphasise (29.8%)
The word is used to emphasise a statement or feeling.
today is a good ass day URL
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
4. Auxiliary (17.0%)
The use of this word is simply a manner of speaking and does not fit any of the above descriptions. Descriptions of external emotions (those of someone else) fall into this category.
Wish USER could save my ass on these exams like he used to
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
5. Signal Group Identity (4.7%)
This word is used as a marker of identity in a specific social group.
Now this is a group of ass kickers
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Pragmatic roles
6. Non-Vulgar (8.2%)
The use of this word is not vulgar (e.g., named entities that involve vulgar words).
Kick Ass 2 - Red Band Trailer URL
Eric Holgate, Isabel Cachola, Daniel Preotiuc-Pietro, and Junyi Jessy Li. “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions”. In: EMNLP. 2018.
Take Aways
• Data collection poses challenges:
  • Sampling biases
  • Label collection
• Insight is important for social science and is obtained through:
  • Interpretable modelling and prediction methods
  • Linguistically motivated features
  • Collaboration with domain experts
  • Traditional social science approaches
  • Quasi-experimental methods
Thank You!
Thank you to my amazing collaborators: