Date post: | 03-Apr-2018 |
Upload: | vasileios-lampos |
7/28/2019 Exploiting Human-Generated Text for Trend Mining
Exploiting Human-Generated Text for Trend Mining
Vasileios Lampos
July, 2013
V. Lampos [email protected] Exploiting Human-Generated Text for Trend Mining 1/47
Outline
Motivation, Aims [Facts, Questions]
Data
Nowcasting Events
Extracting Mood Patterns
Inferring Voting Intention
Conclusions
Facts
We started to work on those ideas back in 2008, when...
The Web contained 1 trillion unique pages (Google)
Social networks were rising, e.g.
Facebook: 100m (2008) → >1.11 billion active users (March 2013)
Twitter: 6m (2008) → 554m active users (July 2013)
New technologies to handle Big Data (e.g., MapReduce)
User behaviour was changing
Socialising via the Web
Giving up privacy (Debatin et al., 2009)
General questions/aims
Does human-generated text posted on web platforms (or
elsewhere) contain useful information?
How can we extract this information... automatically? That is, not us, but a machine.
Practical / real-life applications?
Can those large samples of human input assist studies in other scientific fields?
Social Sciences, Psychology, Epidemiology
The Data (1/3) Why Twitter?
Twitter...
has a lot of content that is publicly accessible
provides a well-documented API for several forms of data collection
contains opinions and personal statements on various domains
is connected with current affairs (usually in real time)
includes geo-located content
offers the option for personalised, per-user modelling
The Data (2/3) What does a @tweet look like?
Figure 1: Some biased and anonymised examples of tweets (limit of 140 characters/tweet, # denotes a topic)
(a) (user will remain anonymous)
(b) they live around us
(c) citizen journalism
(d) flu attitude
The Data (3/3)
Data Collection & Preprocessing
The easiest part of the process... not true!
Storage space, crawler implementation, parallel data processing, adapting to new technologies
Data collected via Twitter's Search API:
collective sampling
tweets geo-located in 54 urban centres in the UK
periodical crawling (every 3 or 5 minutes per urban centre)
Data collected via Twitter's REST API:
user-centric sampling
preprocessing to approximate each user's location (city & country)
... or manual user selection by domain experts
get their latest tweets (3,000 or more)
Several forms of ground truth (flu/rainfall rates, polls)
Nowcasting Events from the Social Web
Nowcasting?
We do not predict the future, but infer the present, i.e. the very recent past
Figure 2: Nowcasting the magnitude of an event emerging in the real world from Web information
Our case studies: nowcasting (a) flu rates & (b) rainfall rates (?!)
What do we get in the end?
This is a regression problem, i.e. for each time interval $i$ we aim to infer $y_i \in \mathbb{R}$ using text input $\mathbf{x}_i \in \mathbb{R}^n$
[Plot: actual vs. inferred rainfall rate (mm) against days (0 to 30) for Bristol]
Figure 3: Inferred rainfall rates for Bristol, UK (October, 2009)
Methodology (1/5) Text in Vector Space
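Candidate features (n-grams): $C = \{c_i\}$
Set of Twitter posts for a time interval $u$: $P(u) = \{p_j\}$
Frequency of $c_i$ in $p_j$:
\[ g(c_i, p_j) = \begin{cases} \varphi & \text{if } c_i \in p_j, \\ 0 & \text{otherwise.} \end{cases} \]
$g$ is Boolean: the maximum value for $\varphi$ is 1
Score of $c_i$ in $P(u)$:
\[ s\big(c_i, P(u)\big) = \frac{\sum_{j=1}^{|P(u)|} g(c_i, p_j)}{|P(u)|} \]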
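The scoring step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokeniser, the function names `contains` and `score`, and the example posts are all hypothetical and do not come from the slides.

```python
def contains(ngram, post):
    """g(c_i, p_j): 1 if the n-gram occurs in the post, 0 otherwise."""
    tokens = post.lower().split()  # naive whitespace tokeniser (assumption)
    n = len(ngram)
    return int(any(tuple(tokens[k:k + n]) == ngram
                   for k in range(len(tokens) - n + 1)))


def score(ngram, posts):
    """s(c_i, P(u)): fraction of the interval's posts that contain the n-gram."""
    if not posts:
        return 0.0
    return sum(contains(ngram, p) for p in posts) / len(posts)


# Hypothetical posts for one time interval u.
posts = [
    "I have the flu",
    "flu flu everywhere",      # counted once, because g is Boolean
    "sunny day in Bristol",
    "swine flu scare",
]
flu_score = score(("flu",), posts)
swine_flu_score = score(("swine", "flu"), posts)
print(flu_score, swine_flu_score)
```

Because g is Boolean, a post that repeats a term contributes the same as a post that mentions it once, so the score is simply the share of posts mentioning the n-gram.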
Methodology (2/5)
Set of time intervals: $U = \{u_k\}$ (1 hour, 1 day, ...)
Time series of candidate features' scores:
\[ X^{(U)} = \big[\mathbf{x}^{(u_1)} \; \dots \; \mathbf{x}^{(u_{|U|})}\big]^{T}, \]
where
\[ \mathbf{x}^{(u_i)} = \big[s\big(c_1, P(u_i)\big) \; \dots \; s\big(c_{|C|}, P(u_i)\big)\big]^{T} \]
Target variable (event):
\[ \mathbf{y}^{(U)} = \big[y_1 \; \dots \; y_{|U|}\big]^{T} \]
Methodology (3/5) Feature selection
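Solve the following optimisation problem:
\[ \min_{\mathbf{w}} \; \big\|X^{(U)}\mathbf{w} - \mathbf{y}^{(U)}\big\|_2^2 \quad \text{s.t.} \quad \|\mathbf{w}\|_1 \le t, \quad t = \alpha \cdot \|\mathbf{w}_{\text{OLS}}\|_1, \quad \alpha \in (0, 1]. \]
Least Absolute Shrinkage and Selection Operator (LASSO):
\[ \arg\min_{\mathbf{w}} \; \big\|X^{(U)}\mathbf{w} - \mathbf{y}^{(U)}\big\|_2^2 + \lambda \|\mathbf{w}\|_1 \]
(Tibshirani, 1996)
Expect a sparse $\mathbf{w}$ (feature selection)
Least Angle Regression (LARS) computes the entire regularisation path ($\mathbf{w}$'s for different values of $\lambda$) (Efron et al., 2004)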
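To see why the L1 penalty yields a sparse weight vector, here is a minimal sketch that minimises the penalised objective with cyclic coordinate descent. Note the swap: the slides use LARS, whereas this sketch uses coordinate descent because it fits in a few lines of plain Python. The function names and the toy data are assumptions made for illustration.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise ||Xw - y||_2^2 + lam * ||w||_1 by cyclic coordinate descent."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(m):
                pred_without_j = sum(X[i][k] * w[k] for k in range(n) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w


# Toy design with orthogonal columns; y depends strongly on feature 1 only.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [3.1, -2.9, 2.9, -3.1]   # = 3 * column 1 + 0.1 * column 2
w = lasso_cd(X, y, lam=2.0)
print(w)
```

On this toy problem the weak feature (column 2) and the irrelevant one (column 0) receive exactly zero weight, while the weight of the strong feature is shrunk below its least-squares value of 3.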
Methodology (4/5)
LASSO is model-inconsistent:
the inferred sparsity pattern may deviate from the true model, e.g., when predictors are highly correlated (Zhao and Yu, 2006)
bootstrap LASSO (Bolasso) performs a more robust feature selection (Bach, 2008):
in each bootstrap, the input space is sampled with replacement
apply LASSO (LARS) to select features
select features with nonzero weights in all bootstraps
better alternative, soft-Bolasso: a less strict feature selection
select features with nonzero weights in p% of bootstraps (learn p using a separate validation set)
weights of selected features determined via OLS regression
Methodology (5/5) Simplified summary
Observations: $X \in \mathbb{R}^{m \times n}$ (m time intervals, n features)
Response variable: $\mathbf{y} \in \mathbb{R}^{m}$
For i = 1 to number of bootstraps
    Form $X_i \subset X$ by sampling X with replacement
    Solve LASSO for $X_i$ and $\mathbf{y}$, i.e. learn $\mathbf{w}_i \in \mathbb{R}^{n}$
    Get the $k \le n$ features with nonzero weights
End_For
Select the $v \le n$ features with nonzero weight in p% of the bootstraps
Learn their weights with OLS regression on $X^{(v)} \in \mathbb{R}^{m \times v}$ and $\mathbf{y}$
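The summary above can be turned into a small runnable sketch. Several choices here are assumptions rather than the author's implementation: rows of X are resampled together with their y values, LASSO is solved by coordinate descent instead of LARS, p is fixed instead of being learnt on a validation set, and the final OLS step reuses the same solver with a zero penalty (which reduces to least squares when the selected columns are not collinear). The toy data is noise-free by construction.

```python
import random


def soft_threshold(rho, t):
    return rho - t if rho > t else rho + t if rho < -t else 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for ||Xw - y||_2^2 + lam * ||w||_1 (stand-in for LARS)."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            rho = z = 0.0
            for i in range(m):
                rest = sum(X[i][k] * w[k] for k in range(n) if k != j)
                rho += X[i][j] * (y[i] - rest)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w


def soft_bolasso(X, y, lam, n_bootstraps=20, p=0.9, seed=0):
    """Keep features with nonzero LASSO weight in at least a fraction p of bootstraps."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    counts = [0] * n
    for _ in range(n_bootstraps):
        idx = [rng.randrange(m) for _ in range(m)]  # rows sampled with replacement
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(n):
            counts[j] += w[j] != 0.0
    selected = [j for j in range(n) if counts[j] >= p * n_bootstraps]
    # Refit on the selected columns; with lam = 0 the solver reduces to least squares.
    X_sel = [[row[j] for j in selected] for row in X]
    weights = lasso_cd(X_sel, y, lam=0.0) if selected else []
    return selected, weights


# Hypothetical data: feature 0 drives y, features 1 and 2 are small-scale noise.
X = [
    [1.0, 0.2, -0.5], [2.0, -0.1, 0.3], [3.0, 0.4, 0.1], [1.5, -0.3, -0.2],
    [2.5, 0.1, 0.4], [0.5, -0.2, 0.0], [3.5, 0.3, -0.1], [1.0, 0.0, 0.2],
    [2.0, -0.4, -0.3], [3.0, 0.2, 0.5],
]
y = [2.0 * row[0] for row in X]
selected, weights = soft_bolasso(X, y, lam=0.5)
print(selected, weights)
```

Setting p = 1.0 gives the strict Bolasso rule (nonzero in all bootstraps); lower values give the softer variant described on the previous slide.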
How do we form candidate features?
Commonly formed by indexing the entire corpus (Manning, Raghavan and Schütze, 2008)
We extract them from Wikipedia, Google Search results, Public Authority websites (e.g., NHS)
Why? Reduce dimensionality to bound the error of LASSO:
\[ L(\mathbf{w}) \le \hat{L}(\mathbf{w}) + Q, \quad \text{with } Q \sim \min\left\{ \frac{W_1^2}{N} + \frac{p}{N}, \; \frac{W_1^2}{N} + \frac{W_1}{\sqrt{N}} \right\} \]
p candidate features, N samples, empirical loss $\hat{L}(\mathbf{w})$ and $\|\mathbf{w}\|_1 \le W_1$ (Bartlett, Mendelson and Neeman, 2011)
Harry Potter Effect!
The Harry Potter effect (1/2)
Figure 4: Events co-occurring (correlated) with the inference target may affect feature selection, especially when the sample size is small.
[Plot: event score against day number (2009) for Flu (England & Wales), Hypothetical Event I and Hypothetical Event II]
(Lampos, 2012a)
The Harry Potter effect (2/2)
Table 1: Top 1-grams correlated with flu rates in England/Wales (06–12/2009)
1-gram | Event | Corr. Coef.
latitud | Latitude Festival | 0.9367
flu | Flu epidemic | 0.9344
swine | Flu epidemic | 0.9212
harri | Harry Potter Movie | 0.9112
slytherin | Harry Potter Movie | 0.9094
potter | Harry Potter Movie | 0.8972
benicassim | Benicàssim Festival | 0.8966
graduat | Graduation (?) | 0.8965
dumbledor | Harry Potter Movie | 0.8870
hogwart | Harry Potter Movie | 0.8852
quarantin | Flu epidemic | 0.8822
gryffindor | Harry Potter Movie | 0.8813
ravenclaw | Harry Potter Movie | 0.8738
princ | Harry Potter Movie | 0.8635
swineflu | Flu epidemic | 0.8633
ginni | Harry Potter Movie | 0.8620
weaslei | Harry Potter Movie | 0.8581
hermion | Harry Potter Movie | 0.8540
draco | Harry Potter Movie | 0.8533
Solution: ground truth with some degree of variability
(Lampos, 2012a)
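The effect is easy to reproduce with a plain Pearson correlation. In the sketch below all three series are made up for illustration: a term score that happens to peak in the same weeks as the flu rate correlates almost perfectly with it, even though the two are unrelated.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical weekly values with a single peak in the same week.
flu_rate = [10, 20, 80, 150, 90, 30, 15]
potter_score = [1, 2, 9, 14, 8, 3, 1]   # e.g. a film release during the epidemic
rain_score = [5, 1, 4, 2, 6, 3, 5]      # no shared peak
r_potter = pearson(flu_rate, potter_score)
r_rain = pearson(flu_rate, rain_score)
print(round(r_potter, 3), round(r_rain, 3))
```

With only one peak in the ground truth, any co-occurring event looks like a good predictor, which is why ground truth with more variability helps.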
About n-grams
1-grams
decent (dense) representation in the Twitter corpus
unclear semantic interpretation
Example: "I am not sick. But I don't feel great either!"
2-grams
very sparse representation in tweets
sometimes clearer semantic interpretation
Experimental process indicated that a hybrid combination of 1-grams and 2-grams delivers the best inference performance; refer to (Lampos, 2012a)
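A short sketch of the difference, using the example tweet from this slide. The regular-expression tokeniser and the function name `ngrams` are assumptions; the slides stem words with the Porter stemmer, which is left out here, and the sketch lets 2-grams cross sentence boundaries for simplicity.

```python
import re


def ngrams(text, n):
    """Return the list of n-grams (as strings) of a lower-cased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


tweet = "I am not sick. But I don't feel great either!"
unigrams = ngrams(tweet, 1)
bigrams = ngrams(tweet, 2)
print(unigrams)
print(bigrams)
```

The 1-gram "sick" fires even though the author is not ill, whereas the 2-gram "not sick" carries the negation; the price is that each individual 2-gram appears in far fewer tweets.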
Flu rates Example of selected features
Figure 5: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).
(Lampos and Cristianini, 2012)
Rainfall rates Example of selected features
Figure 6: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).
(Lampos and Cristianini, 2012)
Examples of inferences
[Plots: actual vs. inferred values against days (0 to 30): (a) Central England/Wales (flu rate), (b) South England (flu rate), (c) Bristol (rainfall rate, mm)]
Figure 7: Examples of flu and rainfall rates inferences from Twitter content
(Lampos and Cristianini, 2012)
Performance figures
Table 2: RMSE for flu rates inference (5-fold cross validation), 50m tweets, 21/06/2009–19/04/2010
Method | 1-grams | 2-grams | Hybrid
Baseline † | 12.44 ± 2.37 | 13.81 ± 3.29 | 11.62 ± 1.58
Bolasso | 11.14 ± 2.35 | 12.64 ± 2.57 | 10.57 ± 2.2
CART ensemble ‡ | 9.63 ± 5.21 | 13.13 ± 4.72 | 9.4 ± 4.21
Table 3: RMSE (in mm) for rainfall rates inference (6-fold cross validation), 8.5m tweets, 01/07/2009–30/06/2010
Method | 1-grams | 2-grams | Hybrid
Baseline † | 2.91 ± 0.6 | 3.1 ± 0.57 | 4.39 ± 2.99
Bolasso | 2.73 ± 0.65 | 2.95 ± 0.55 | 2.60 ± 0.68
CART ensemble ‡ | 2.71 ± 0.69 | 2.72 ± 0.72 | 2.64 ± 0.63
† As implemented in (Ginsberg et al., 2009)
‡ Classification and Regression Tree (Breiman et al., 1984) & (Sutton, 2005)
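For reference, the error measure in these tables can be computed as below. The fold values are invented for illustration, and the use of the sample standard deviation for the "±" part is an assumption, since the slides do not say which estimator was used.

```python
import math
import statistics


def rmse(actual, inferred):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, inferred)) / len(actual))


# Hypothetical (actual, inferred) pairs for three cross-validation folds.
fold_rmse = [
    rmse([10, 20, 30], [12, 18, 33]),
    rmse([5, 15], [5, 11]),
    rmse([40, 50], [41, 47]),
]
summary = (statistics.mean(fold_rmse), statistics.stdev(fold_rmse))
print("RMSE: %.2f ± %.2f" % summary)
```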
Flu Detector
URL: http://geopatterns.enm.bris.ac.uk/epidemics
Figure 8: Flu Detector uses the content of Twitter to nowcast flu rates in several UK regions
(Lampos, De Bie and Cristianini, 2010)
Extracting Mood Patterns from Human-Generated Content
Computing a mood score
Table 4: Mood terms from WordNet Affect
Fear | Sadness | Joy | Anger
afraid | depressed | admire | angry
fearful | discouraged | cheerful | despise
frighten | disheartened | enjoy | enviously
horrible | dysphoria | enthusiastic | harassed
panic | gloomy | exciting | irritate
... | ... | ... | ...
(92 terms) | (115 terms) | (224 terms) | (146 terms)
Mood score computation for a time interval d using n mood terms:
\[ \text{ms}_d = \frac{1}{n} \sum_{i=1}^{n} \frac{c_i^{(t_d)}}{N(t_d)} \]
$c_i^{(t_d)}$: count of term i in the Twitter corpus of day d
$N(t_d)$: number of tweets for day d
Using the sample of d days, compute a standardised mood score:
\[ \text{ms}_d^{\text{std}} = \frac{\text{ms}_d - \mu_{\text{ms}}}{\sigma_{\text{ms}}} \]
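A minimal sketch of the two steps follows. The counts and tweet totals are invented, the function names are hypothetical, and the population standard deviation is an assumption (the slide does not state which estimator is used).

```python
import statistics


def mood_score(term_counts, n_tweets):
    """ms_d: mean over the n mood terms of (term count / number of tweets)."""
    return sum(c / n_tweets for c in term_counts) / len(term_counts)


def standardise(scores):
    """Subtract the mean and divide by the standard deviation of the sample."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population std (assumption)
    return [(s - mu) / sigma for s in scores]


# Hypothetical counts of 3 mood terms on 4 days, with the daily number of tweets.
daily_counts = [[2, 4, 6], [1, 1, 1], [9, 3, 0], [5, 5, 5]]
daily_tweets = [100, 100, 200, 250]
ms = [mood_score(c, n) for c, n in zip(daily_counts, daily_tweets)]
ms_std = standardise(ms)
print(ms)
print(ms_std)
```

Dividing by the number of tweets makes days with different Twitter volume comparable; the standardisation then puts the four moods on a common scale.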
The mood of the nation (1/5)
Figure 9: Daily time series (actual & their 14-point moving average) for the mood of Joy based on Twitter content geo-located in the UK
[Plot: "933 Day Time Series for Joy in Twitter Content"; normalised emotional valence against date (Jul 09 to Jan 12); raw joy signal and 14-day smoothed joy; annotated events: XMAS, valentine, easter, halloween, roy.wed., CUTS, RIOTS]
(Lansdall-Welfare, Lampos and Cristianini, 2012a&b)
The mood of the nation (2/5)
Figure 10: Daily time series (actual & their 14-point moving average) for the mood of Anger based on Twitter content geo-located in the UK
[Plot: "933 Day Time Series for Anger in Twitter Content"; normalised emotional valence against date (Jul 09 to Jan 12); raw anger signal and 14-day smoothed anger; annotated events: XMAS, valentine, easter, halloween, roy.wed., CUTS, RIOTS]
(Lansdall-Welfare, Lampos and Cristianini, 2012a&b)
The mood of the nation (3/5)
Window of 100 days: 50 before & after the point of interest
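\[ \Delta\text{ms}^{\text{std}}_{i} = \overline{\text{ms}}^{\text{std}}_{i+1 \ldots i+50} - \overline{\text{ms}}^{\text{std}}_{i-50 \ldots i-1} \]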
[Plot: difference in mean against date (Jul 09 to Jan 12) for Anger and Fear; markers for the date of the budget cuts and the date of the riots]
Figure 11: Change point detection using a 100-day moving window
(Lansdall-Welfare, Lampos and Cristianini, 2012a)
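The moving-window statistic can be sketched as follows: for each day, take the mean of the next 50 days minus the mean of the previous 50 days. The function name and the synthetic step signal are assumptions used only to show that the statistic peaks at the change point.

```python
def mean_shift(series, half_window=50):
    """Difference between the mean after and the mean before each point of interest."""
    out = []
    for i in range(half_window, len(series) - half_window):
        after = series[i + 1:i + 1 + half_window]   # days i+1 ... i+50
        before = series[i - half_window:i]          # days i-50 ... i-1
        out.append(sum(after) / half_window - sum(before) / half_window)
    return out


# Synthetic standardised mood signal: a step change after day 99.
series = [0] * 100 + [1] * 100
shift = mean_shift(series)
peak_day = shift.index(max(shift)) + 50  # map back to an index in `series`
print(peak_day, max(shift))
```

The first and last 50 days have no complete window, so the output is 100 points shorter than the input.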
The mood of the nation (4/5)
Figure 12: Projections of 4-dimensional mood score signals (joy, sadness, anger and fear) on their top-2 principal components (PCA); Twitter content from 2011
[Plot (a) Days of the week (2011): 1st vs. 2nd principal component, one point per weekday (Monday to Sunday)]
[Plot (b) Days of the year (2011): 1st vs. 2nd principal component, one point per day number (1 to 365)]
Cluster I: New Year (1), Valentine's (45), Christmas Eve (358), New Year's Eve (365)
Cluster II: O.B. Laden's death (122), Winehouse's death + Breivik (204), UK riots (221)
(Lampos, 2012a)
The mood of the nation (5/5)
URL: http://geopatterns.enm.bris.ac.uk/mood
Figure 13: Mood of the Nation uses the content of Twitter to nowcast mood rates in several UK regions
(Lampos, 2012a)
Circadian mood patterns (1/3)
Compute 24-h mood score patterns
Mood score computation for a time interval u = 24 hours using n mood terms (WordNet) and a sample of D days:
\[ M_s(u) = \frac{1}{|D|} \sum_{j=1}^{|D|} \frac{1}{n} \sum_{i=1}^{n} \text{sf}_i^{(t_j,u)} \]
\[ \text{sf}_i^{(t_d,u)} = \frac{f_i^{(t_d,u)} - \bar{f}_i}{\sigma_{f_i}}, \quad i \in \{1, \dots, n\}. \]
$f_i^{(t_d,u)}$: normalised frequency of a mood term i during time interval u in day $d \in D$
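A sketch of this computation on a tiny made-up example (2 days, 3 intervals, 2 terms) is given below. It assumes that the mean and standard deviation of each term are taken over all days and intervals, which the slide does not spell out; the function name and data layout are hypothetical.

```python
import statistics


def circadian_mood(freq):
    """freq[d][u][i]: normalised frequency of term i in interval u of day d."""
    n_days, n_intervals, n_terms = len(freq), len(freq[0]), len(freq[0][0])
    # Mean and std of each term over all days and intervals (assumption).
    means, stds = [], []
    for i in range(n_terms):
        values = [freq[d][u][i] for d in range(n_days) for u in range(n_intervals)]
        means.append(statistics.mean(values))
        stds.append(statistics.pstdev(values))
    scores = []
    for u in range(n_intervals):
        total = 0.0
        for d in range(n_days):
            sf = [(freq[d][u][i] - means[i]) / stds[i] for i in range(n_terms)]
            total += sum(sf) / n_terms   # average over mood terms
        scores.append(total / n_days)    # average over days
    return scores


# Hypothetical frequencies: both terms are used most in the last interval.
freq = [
    [[0.1, 0.2], [0.2, 0.2], [0.6, 0.8]],
    [[0.2, 0.1], [0.1, 0.3], [0.5, 0.9]],
]
scores = circadian_mood(freq)
print(scores)
```

Standardising each term before averaging stops a few very frequent terms from dominating the pattern.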
Circadian mood patterns (2/3)
Figure 14: Circadian (24-hour) mood patterns based on UK Twitter content
Circadian mood patterns (3/3)
Figure 15: Autocorrelation of circadian mood patterns based on hourly lags, revealing daily and weekly periodicities
[Plots: autocorrelation against lag in hours (1 to 168), with confidence bounds, for (a) Fear, (b) Sadness, (c) Joy, (d) Anger]
Further analysis available in (Lampos, Lansdall-Welfare, Araya and Cristianini, 2013)
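To illustrate how a daily periodicity shows up in such a plot, the sketch below computes the sample autocorrelation of a synthetic hourly signal with a 24-hour cycle. The signal is artificial, and the confidence bound uses the common large-sample approximation 1.96/√N, which is an assumption; the slides do not say how their bounds were obtained.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of series x at the given lag."""
    n = len(x)
    mu = sum(x) / n
    denom = sum((v - mu) ** 2 for v in x)
    num = sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag))
    return num / denom


hours = 7 * 24  # one week of hourly mood scores
signal = [math.sin(2 * math.pi * t / 24) for t in range(hours)]
r12 = autocorrelation(signal, 12)   # half a day apart: opposite phase
r24 = autocorrelation(signal, 24)   # one day apart: same phase
bound = 1.96 / math.sqrt(hours)     # approximate 95% bound (assumption)
print(round(r12, 3), round(r24, 3), round(bound, 3))
```

Peaks at lags of 24, 48, ... hours that exceed the bound indicate a daily cycle; a further rise at 168 hours would indicate a weekly one.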
Emotion in Books
Input: Google Ngram corpus of 5m digitised books (Michel et al., 2010)
Tool: WordNet Affect (Strapparava and Valitutti, 2004)
[Plots: z-scores against year (1900 to 2000) for (a) Joy minus Sadness, (b) Use of emotion-related terms through time (All, Fear, Disgust; emotion minus random), (c) American versus British English]
Figure 16: Emotion trends in 20th century books
(Acerbi, Lampos, Garnett and Bentley, 2013)
Inferring Voting Intention from Social Media Content
... and a new way of modelling text regression
Motivations and Aims
Social Media contain a vast amount of information about various topics (health, politics, finance)
This information (X) can be used to assist predictions (y)
$f : X \rightarrow y$, where $f$ usually formulates a linear regression task
X accounts only for word frequencies; can we incorporate user information as well?
Could we also exploit the statistical information held in multiple response variables?
Data Sets
UK case study
60m tweets by 42K users from 30/04/2010 to 13/02/2012
Random selection and distribution of geo-located users proportional to regional population figures
Main language: English
240 unique voting intention polls from YouGov
percentages for Conservatives (CON), Labour Party (LAB) and Liberal Democrats (LIB)
Austrian case study
800K tweets by 1.1K users from 25/01 to 01/12/2012
Users manually selected by Austrian political analysts
Main language: German
98 unique voting intention polls from various pollsters
percentages for Social Democratic Party (SPÖ), People's Party (ÖVP), Freedom Party (FPÖ) and Green Alternative Party (GRÜ)
The Bilinear Model (1/2)
The main idea is simple:
\[ f(X) = \mathbf{u}^{T} X \mathbf{w} + \beta \]
$X \in \mathbb{R}^{p \times m}$: matrix of user-word frequencies
$\mathbf{u}$, $\mathbf{w}$: user and word weights
Our original bilinear text regression model:
\[ \{\mathbf{w}^{*}, \mathbf{u}^{*}, \beta^{*}\} = \arg\min_{\mathbf{w}, \mathbf{u}, \beta} \sum_{i=1}^{n} \big(\mathbf{u}^{T} Q_i \mathbf{w} + \beta - y_i\big)^2 + \psi(\mathbf{w}, \rho_1) + \psi(\mathbf{u}, \rho_2) \]
$Q_i$: X for time instance i, $\mathbf{y} \in \mathbb{R}^{n}$: response variable (voting intention)
$\mathbf{w} \in \mathbb{R}^{m}$, $\mathbf{u} \in \mathbb{R}^{p}$: word and user weights, $\beta \in \mathbb{R}$: bias
$\psi(\cdot)$: a regularisation function
Elastic Net (Zou and Hastie, 2005) for $\psi(\cdot)$ → Bilinear Elastic Net (BEN) (Lampos, Preoţiuc-Pietro and Cohn, 2013)
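The prediction function itself is small enough to sketch directly. The example below only evaluates the bilinear form for given weights; it does not fit them. The matrix, weights and function names are hypothetical. It also shows that, once the user weights are fixed, the model is an ordinary linear model over user-weighted word frequencies.

```python
def bilinear_predict(u, Q, w, beta):
    """f(Q) = u^T Q w + beta, with Q holding one row per user and one column per word."""
    return sum(u[k] * sum(Q[k][j] * w[j] for j in range(len(w)))
               for k in range(len(u))) + beta


def user_weighted_words(u, Q):
    """u^T Q: word frequencies aggregated over users with weights u (length m)."""
    return [sum(u[k] * Q[k][j] for k in range(len(u))) for j in range(len(Q[0]))]


# Hypothetical example: p = 2 users, m = 3 words.
Q = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
u = [0.5, 1.0]            # user weights
w = [1.0, -1.0, 2.0]      # word weights
beta = 0.1

prediction = bilinear_predict(u, Q, w, beta)
x = user_weighted_words(u, Q)
linear_prediction = sum(xj * wj for xj, wj in zip(x, w)) + beta
print(prediction, linear_prediction)
```

A user with weight zero contributes nothing to the prediction, so sparsity in u acts as user selection in the same way that sparsity in w acts as word selection.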
The Bilinear Model: Multi-Task Learning (2/2)
Apply the $\ell_1/\ell_2$ regulariser (Argyriou, Evgeniou and Pontil, 2008)
Extends the notion of Group LASSO (Yuan and Lin, 2006) for a $\tau$-dimensional $\mathbf{y}$
Bilinear Group $\ell_1/\ell_2$ (BGL):
\[ \{W^{*}, U^{*}, \boldsymbol{\beta}^{*}\} = \arg\min_{W, U, \boldsymbol{\beta}} \sum_{t=1}^{\tau} \sum_{i=1}^{n} \big(\mathbf{u}_t^{T} Q_i \mathbf{w}_t + \beta_t - y_{ti}\big)^2 + \lambda_1 \sum_{j=1}^{m} \|W_j\|_2 + \lambda_2 \sum_{k=1}^{p} \|U_k\|_2 \]
$W = [\mathbf{w}_1 \dots \mathbf{w}_\tau]$: words' weight matrix; $\mathbf{w}_t$ refers to the t-th political entity
$U = [\mathbf{u}_1 \dots \mathbf{u}_\tau]$: users' weight matrix
$W_j$, $U_k$: j-th and k-th rows of weight matrices W and U respectively
$\boldsymbol{\beta} \in \mathbb{R}^{\tau}$: bias term per task
(Lampos, Preoţiuc-Pietro and Cohn, 2013)
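The group penalty is the sum of the Euclidean norms of the rows of a weight matrix. The sketch below computes it for a made-up word weight matrix with three words and two tasks; the function name and values are illustrative only.

```python
import math


def l1_l2_norm(W):
    """Sum over rows of the row-wise Euclidean norm (the l1/l2 penalty)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


# Hypothetical weights: one row per word, one column per task (political party).
W = [
    [3.0, 4.0],   # word used by both tasks
    [0.0, 0.0],   # word dropped for every task
    [1.0, 0.0],   # word kept, although one task ignores it
]
penalty = l1_l2_norm(W)
active_words = [j for j, row in enumerate(W) if any(v != 0.0 for v in row)]
print(penalty, active_words)
```

Because the penalty is applied per row, it tends to zero out whole rows, so a word (or user) is either kept for all parties or dropped for all of them.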
Evaluation: Performance Tables (1/3)
Table 5: UK case study. Average RMSEs representing the error of the inferred voting intention percentage for a 10-step validation process (* marks the lowest error per column)
Method | CON | LAB | LIB | μ
B_μ | 2.272 | 1.663 | 1.136 | 1.69
B_last | 2 | 2.074 | 1.095 | 1.723
LEN | 3.845 | 2.912 | 2.445 | 3.067
BEN | 1.939 | 1.644 | 1.136 | 1.573
BGL | 1.785* | 1.595* | 1.054* | 1.478*
Table 6: Austrian case study (* marks the lowest error per column)
Method | SPÖ | ÖVP | FPÖ | GRÜ | μ
B_μ | 1.535 | 1.373 | 3.3 | 1.197 | 1.851
B_last | 1.148* | 1.556 | 1.639* | 1.536 | 1.47
LEN | 1.291 | 1.286 | 2.039 | 1.152* | 1.442
BEN | 1.392 | 1.31 | 2.89 | 1.205 | 1.699
BGL | 1.619 | 1.005* | 1.757 | 1.374 | 1.439*
Evaluation (2/3)
[Plots: voting intention (%) against time (polls 5 to 45) for CON, LAB and LIB: (a) Polls, (b) BEN, (c) BGL]
Figure 17: UK case study: 50 consecutive poll predictions
Evaluation (3/3)
[Plots: voting intention (%) against time (polls 5 to 45) for SPÖ, ÖVP, FPÖ and GRÜ: (a) Polls, (b) BEN, (c) BGL]
Figure 18: Austrian case study: 50 consecutive poll predictions
Conclusions
Social Media hold valuable information
We can develop methods to extract portions of this information automatically:
detect, quantify, nowcast events (examples of flu and rainfall rates)
extract collective mood patterns (we can do this for books too!)
model other domains (such as politics)
Different types of information (word frequencies, user accounts) can be fused for improved inference performance
Side effect: user privacy
Significant collaborators...
Prof. Nello Cristianini, University of Bristol (Artificial Intelligence)
Prof. Alexander Bentley, University of Bristol (Anthropology)
Dr. Trevor Cohn, University of Sheffield (Natural Language Processing)
Dr. Alberto Acerbi, University of Bristol (Anthropology)
Daniel Preoţiuc-Pietro, University of Sheffield (Computer Science)
Last Slide!
The end. Any questions?
Download the slides from http://www.lampos.net/research/presentations-and-posters
References
Acerbi, Lampos, Garnett and Bentley. The Expression of Emotions in 20th Century Books. PLoS ONE, 2013.
Argyriou, Evgeniou and Pontil. Convex multi-task feature learning. Machine Learning, 2008.
Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. ICML, 2008.
Bartlett, Mendelson and Neeman. L1-regularized linear regression: persistence and oracle inequalities. PTRF, 2011.
Debatin, Lovejoy, Horn and Hughes. Facebook and Online Privacy: Attitudes, Behaviors, and Unintended Consequences. JCMC, 2009.
Efron et al. Least Angle Regression. The Annals of Statistics, 2004.
Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature, 2009.
Lampos and Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.
Lampos, De Bie and Cristianini. Flu Detector: Tracking Epidemics on Twitter. ECML PKDD, 2010.
Lampos and Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.
Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods. Ph.D. Thesis, University of Bristol, 2012. (a)
Lampos. On voting intentions inference from Twitter content: a case study on UK 2010 General Election. CoRR, 2012. (b)
Lampos, Preoţiuc-Pietro and Cohn. A user-centric model of voting intention from Social Media. ACL, 2013.
Lampos, Lansdall-Welfare, Araya and Cristianini. Analysing Mood Patterns in the United Kingdom through Twitter Content. CoRR, 2013.
Lansdall-Welfare, Lampos and Cristianini. Effects of the Recession on Public Mood in the UK. WWW, 2012. (a)
Lansdall-Welfare, Lampos and Cristianini. Nowcasting the mood of the nation. Significance, 2012. (b)
Manning, Raghavan and Schütze. Introduction to Information Retrieval, 2008.
Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Nature, 2010.
Porter. An algorithm for suffix stripping. Program, 1980.
Strapparava and Valitutti. WordNet-Affect: an affective extension of WordNet. LREC, 2004.
Tibshirani. Regression Shrinkage and Selection via the LASSO. JRSS, 1996.
Yuan and Lin. Model selection and estimation in regression with grouped variables. JRSS, 2006.
Zhao and Yu. On model selection consistency of LASSO. JMLR, 2006.
Zou and Hastie. Regularization and variable selection via the elastic net. JRSS, 2005.