Co-occurrences Networks Other co-occurrence based methods Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
[email protected]
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
   The idea
   A real-life example
2 Other co-occurrence based methods
   PCA
   LDA
3 Next meetings & final project
Integrating word counts and network analysis: Word co-occurrences
The idea
Simple word count

We already know this:

from collections import Counter
tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
c = Counter(tekst.split())
print("The top 5 are: ")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
The output:

4 is
3 test
2 a
2 this
2 it
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'), ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'), ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
The output:

3    ('i', 'coffee')
3    ('i', 'like')
2    ('i', 'beer')
2    ('like', 'beer')
2    ('like', 'coffee')
1    ('coffee', 'beer')
1    ('and', 'beer')
...
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge:

• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
am,coffee,1
with,my,1
i,friend,1
like,and,1
am,with,1
having,with,1
i,my,1
having,coffee,1
i,like,3
coffee,friend,1
having,my,1
am,my,1
coffee,my,1
my,friend,1
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network analysis
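Writing such a GDF file from Python is straightforward. The following is a minimal sketch (the function name `save_gdf` is my own, not from the slides); it mirrors the node/edge layout of the example file and reuses the co-occurrence counting shown earlier:

```python
from collections import Counter, defaultdict
from itertools import combinations

def save_gdf(filename, wordcounts, cooc):
    """Write words (with their frequencies as node weights) and
    co-occurrence counts (as edge weights) to a GDF file for Gephi."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, count in wordcounts.items():
            f.write("{},{}\n".format(word, count))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), weight in cooc.items():
            f.write("{},{},{}\n".format(a, b, weight))

# Rebuild the counts from the earlier example
tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]
# node weight: in how many tweets does a word occur?
wordcounts = Counter(w for t in tweets for w in set(t.split()))
cooc = defaultdict(int)
for tweet in tweets:
    for a, b in set(combinations(tweet.split(), 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

save_gdf("cooccurrences.gdf", wordcounts, cooc)
```

The resulting file can be opened directly in Gephi.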
Gephi
• Install (NOT in the VM) from https://gephi.org
• For problems on MacOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi:10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?

RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30–22.00

The analysis
• Series of self-written Python scripts:
  1 preprocessing (stemming, stopword removal)
  2 word counts
  3 word log likelihood (corpus comparison)
• Stata: regression analysis
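The word log-likelihood step can be sketched with a short function. This is a hedged illustration, not the paper's actual script; it follows the standard corpus-comparison formula, in which a word's expected frequency in each corpus is derived from the pooled corpora:

```python
import math

def log_likelihood(a, b, c, d):
    """Log likelihood that a word is distinctive for one of two corpora.
    a, b: frequency of the word in corpus 1 and corpus 2;
    c, d: total number of words in corpus 1 and corpus 2."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# e.g., a word occurring 20 times in a 5,000-word corpus
# and never in a 6,000-word corpus:
print(round(log_likelihood(20, 0, 5000, 6000), 2))  # prints 31.54
```

Words with high LL values are the ones that distinguish the two corpora most strongly, which is how the "most distinctive words" tables further on can be read.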
[Figure: tweet volume (y-axis: 0–8,000) per minute, from 60 minutes before to 150 minutes after the debate, with the start and end of the debate marked]
Relationship between words on TV and on Twitter
[Figure: scatterplot of ln(word on Twitter +1) against ln(word on TV +1)]
Word frequency TV ⇒ word frequency Twitter
                   Model 1            Model 2            Model 3
                   ln(Twitter +1)     ln(Twitter +1)     ln(Twitter +1)
                                      together w/ M.     together w/ S.
                   b (SE) / beta      b (SE) / beta      b (SE) / beta

ln (TV M. +1)      1.59 (.052) ***    1.54 (.041) ***     .77 (.037) ***
                    .21                .26                 .14
ln (TV S. +1)      1.29 (.051) ***     .88 (.041) ***    1.25 (.037) ***
                    .17                .15                 .24
intercept          1.64 (.008) ***     .87 (.007) ***     .60 (.006) ***
R2                  .100               .115                .100
b M. & S. differ?  F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
                   12.29, p < .001    96.69, p < .001    63.38, p < .001

M = Merkel; S = Steinbrück
Most distinctive words on TV
LL      word                      Frequency Merkel   Frequency Steinbrück
27.73   merkel                    0                  20
19.41   arbeitsplatz [job]        14                 0
15.25   steinbruck                11                 0
 9.70   koalition [coalition]     7                  0
 9.70   international             7                  0
 9.70   gemeinsam [together]      7                  0
 8.55   griechenland [Greece]     10                 1
 8.32   investi [investment]      6                  0
 6.93   uberzeug [belief]         5                  0
 6.93   okonom [economic]         0                  5
Most distinctive words on Twitter
LL        word                        Frequency Merkel   Frequency Steinbrück
32443.39  merkel                      29672              0
30751.65  steinbrueck                 0                  17780
 1507.08  kett [necklace]             1628               34
 1241.14  vertrau [trust]             1240               12
  863.84  fdp [a coalition partner]   985                29
  775.93  nsa                         1809               298
  626.49  wikipedia                   40                 502
  574.65  twittert [tweets]           40                 469
  544.87  koalition [coalition]       864                77
  517.99  gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• nsa affair
• coalition partners

Steinbrück
• suggestion to look sth. up on Wikipedia
• tweets from his account during the debate
Other (non-network-based, statistical) co-occurrence-based methods
Enter unsupervised machine learning

(something you already did in your Bachelor's – no kidding.)
Some terminology
Supervised machine learning
You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset.
Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.

Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques to find out how x1, x2, . . . x_i co-occur from other courses:
• Principal Component Analysis (PCA)
• Cluster analysis
• . . .
PCA
Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression
PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix

       w1, w2, w3, w4, w5, w6 ...
text1,  2,  0,  0,  1,  2,  3 ...
text2,  0,  0,  1,  2,  3,  4 ...
text3,  9,  0,  1,  1,  0,  0 ...
...

These can be simple counts, but also more advanced metrics, like tf-idf scores (where you weigh the frequency by the number of documents in which the word occurs), cosine distances, etc.
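To make the idea concrete, here is a minimal sketch (my own toy data, not from the slides): build a small term-document matrix by hand and obtain the principal components via numpy's SVD, which is what a PCA of the column-centered matrix amounts to.

```python
import numpy as np

texts = ["economy tax deficit economy tax", "football worldcup goal",
         "tax economy budget", "goal football match worldcup"]

# term-document matrix: rows = documents, columns = words
vocab = sorted(set(w for t in texts for w in t.split()))
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts],
               dtype=float)

# PCA = SVD of the column-centered matrix
centered = tdm - tdm.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T   # document scores on the components
loadings = Vt              # word loadings per component

# words loading most strongly (in absolute terms) on the first component
top = sorted(zip(np.abs(loadings[0]), vocab), reverse=True)[:3]
print([w for _, w in top])
```

With real data one would of course use a dedicated tool or library rather than hand-rolling the matrix, but the mechanics are the same.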
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, to find a solution in which each word loads on one component, match real life, where a word can belong to several topics or frames?
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what’s that?
No mathematical details here, but the general idea:

• There are k topics, T1 . . . Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...',
            'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word,
                               num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)

1  0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2  0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3  0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4  0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5  0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6  0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7  0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8  0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9  0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Next meetings
Wednesday, 11–5: Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!

No meeting on Monday (Pentecost)

Wednesday, 18–5: Supervised machine learning