Co-occurrences Networks Other co-occurrence based methods Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
[email protected]
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
   The idea
   A real-life example
2 Other co-occurrence based methods
   PCA
   LDA
3 Next meetings & final project
Integrating word counts and network analysis: Word co-occurrences
The idea
Simple word count

We already know this:

from collections import Counter
tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
c = Counter(tekst.split())
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The output:

4 is
3 test
2 a
2 this
2 it
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'), ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'), ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The output:

3    ('i', 'coffee')
3    ('i', 'like')
2    ('i', 'beer')
2    ('like', 'beer')
2    ('like', 'coffee')
1    ('coffee', 'beer')
1    ('and', 'beer')
...
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge:

• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
am,coffee,1
with,my,1
i,friend,1
like,and,1
am,with,1
having,with,1
i,my,1
having,coffee,1
i,like,3
coffee,friend,1
having,my,1
am,my,1
coffee,my,1
my,friend,1
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network analysis
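Writing such a GDF file from Python is straightforward. The following is a minimal sketch (the function name `save_gdf` is my own, not from the slides); it mirrors the node/edge layout of the example file and reuses the co-occurrence counting shown earlier:

```python
from collections import Counter, defaultdict
from itertools import combinations

def save_gdf(filename, wordcounts, cooc):
    """Write words (with their frequencies as node weights) and
    co-occurrence counts (as edge weights) to a GDF file for Gephi."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, count in wordcounts.items():
            f.write("{},{}\n".format(word, count))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), weight in cooc.items():
            f.write("{},{},{}\n".format(a, b, weight))

# Rebuild the counts from the earlier example
tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
# node weight: in how many tweets does a word occur?
wordcounts = Counter(w for t in tweets for w in set(t.split()))
cooc = defaultdict(int)
for tweet in tweets:
    for a, b in set(combinations(tweet.split(), 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

save_gdf("cooccurrences.gdf", wordcounts, cooc)
```

The resulting file can be opened directly in Gephi.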
Gephi
• Install (NOT in the VM) from https://gephi.org
• For problems on MacOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi:10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?

RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30–22.00

The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
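The word log-likelihood step can be sketched with a short function. This is a hedged illustration, not the paper's actual script; it follows the standard corpus-comparison formula, in which a word's expected frequency in each corpus is derived from the pooled corpora:

```python
import math

def log_likelihood(a, b, c, d):
    """Log likelihood that a word is distinctive for one of two corpora.
    a, b: frequency of the word in corpus 1 and corpus 2;
    c, d: total number of words in corpus 1 and corpus 2."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# e.g., a word occurring 20 times in a 5,000-word corpus
# and never in a 6,000-word corpus:
print(round(log_likelihood(20, 0, 5000, 6000), 2))  # prints 31.54
```

Words with high LL values are the ones that distinguish the two corpora most strongly, which is how the "most distinctive words" tables further on can be read.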
[Figure: tweet volume (y-axis: 0–8,000) per minute, from 60 minutes before to 150 minutes after the debate, with the start and end of the debate marked]
Relationship between words on TV and on Twitter
[Figure: scatterplot of ln(word on Twitter +1) against ln(word on TV +1)]
Word frequency TV ⇒ word frequency Twitter
                   Model 1            Model 2            Model 3
                   ln(Twitter +1)     ln(Twitter +1)     ln(Twitter +1)
                                      together w/ M.     together w/ S.
                   b (SE) / beta      b (SE) / beta      b (SE) / beta

ln (TV M. +1)      1.59 (.052) ***    1.54 (.041) ***     .77 (.037) ***
                    .21                .26                 .14
ln (TV S. +1)      1.29 (.051) ***     .88 (.041) ***    1.25 (.037) ***
                    .17                .15                 .24
intercept          1.64 (.008) ***     .87 (.007) ***     .60 (.006) ***
R2                  .100               .115                .100
b M. & S. differ?  F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
                   12.29, p < .001    96.69, p < .001    63.38, p < .001

M = Merkel; S = Steinbrück
Most distinctive words on TV
LL      word                      Frequency Merkel   Frequency Steinbrück
27.73   merkel                    0                  20
19.41   arbeitsplatz [job]        14                 0
15.25   steinbruck                11                 0
 9.70   koalition [coalition]     7                  0
 9.70   international             7                  0
 9.70   gemeinsam [together]      7                  0
 8.55   griechenland [Greece]     10                 1
 8.32   investi [investment]      6                  0
 6.93   uberzeug [belief]         5                  0
 6.93   okonom [economic]         0                  5
Most distinctive words on Twitter
LL        word                        Frequency Merkel   Frequency Steinbrück
32443.39  merkel                      29672              0
30751.65  steinbrueck                 0                  17780
 1507.08  kett [necklace]             1628               34
 1241.14  vertrau [trust]             1240               12
  863.84  fdp [a coalition partner]   985                29
  775.93  nsa                         1809               298
  626.49  wikipedia                   40                 502
  574.65  twittert [tweets]           40                 469
  544.87  koalition [coalition]       864                77
  517.99  gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• nsa affair
• coalition partners

Steinbrück
• suggestion to look sth. up on Wikipedia
• tweets from his account during the debate
Other (non-network-based, statistical) co-occurrence-based methods
Enter unsupervised machine learning

(something you already did in your Bachelor's – no kidding.)
Some terminology
Supervised machine learning
You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset.
Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.

Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques to find out how x1, x2, . . . x_i co-occur from other courses:
• Principal Component Analysis (PCA)
• Cluster analysis
• . . .
PCA
Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression
PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix

       w1, w2, w3, w4, w5, w6 ...
text1,  2,  0,  0,  1,  2,  3 ...
text2,  0,  0,  1,  2,  3,  4 ...
text3,  9,  0,  1,  1,  0,  0 ...
...

These can be simple counts, but also more advanced metrics, like tf-idf scores (where you weigh the frequency by the number of documents in which the word occurs), cosine distances, etc.
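To make the idea concrete, here is a minimal sketch (my own toy data, not from the slides): build a small term-document matrix by hand and obtain the principal components via numpy's SVD, which is what a PCA of the column-centered matrix amounts to.

```python
import numpy as np

texts = ["economy tax deficit economy tax", "football worldcup goal",
         "tax economy budget", "goal football match worldcup"]

# term-document matrix: rows = documents, columns = words
vocab = sorted(set(w for t in texts for w in t.split()))
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts],
               dtype=float)

# PCA = SVD of the column-centered matrix
centered = tdm - tdm.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T   # document scores on the components
loadings = Vt              # word loadings per component

# words loading most strongly (in absolute terms) on the first component
top = sorted(zip(np.abs(loadings[0]), vocab), reverse=True)[:3]
print([w for _, w in top])
```

With real data one would of course use a dedicated tool or library rather than hand-rolling the matrix, but the mechanics are the same.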
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, to find a solution in which each word loads on one component, match real life, where a word can belong to several topics or frames?
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what’s that?
No mathematical details here, but the general idea:

• There are k topics, T1 . . . Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...',
            'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                               num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)

1  0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2  0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3  0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4  0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5  0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6  0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7  0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8  0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9  0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Next meetings
Wednesday, 11–5: Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!

No meeting on Monday (Pentecost)

Wednesday, 18–5: Supervised machine learning