
Digital Paleontology - Digging for Ancient Tweets

Date post: 01-Sep-2014
Upload: martin255
Transcript
Page 1: Digital Paleontology - Digging for Ancient Tweets

Digital Paleontology
Digging for Ancient Tweets

Martin Lafréchoux
JITSO 2012 – EPFL, December 4th, 2012

Full text of the Research notes is available on the JITSO website and on my research blog

--Pic : www.flickr.com/photos/mag3737/307016400/

Page 2: Digital Paleontology - Digging for Ancient Tweets

Hi. My name is Martin Lafréchoux.

I am a PhD student at Paris Ouest Nanterre. My dissertation deals with the web page as a document.

I'll try not to repeat the content of the research notes published on the website. I'd rather address some points that were left out due to space constraints or because they were not ready for publication.

Mostly I’ll talk about the various twitter APIs and their implications for the researcher.

But first I'd like to try and explain the title of the presentation. Digital Paleontology - why ?

Page 3: Digital Paleontology - Digging for Ancient Tweets

Twitter is live. You watch twitter as you would TV, to see what's happening right now.

When a piece of information spreads on twitter, it creates a cascade.

Cascades are live data, best observed in the wild; that's the whole point of just-in-time sociology.

The cascade goes on for a short while, then it disappears.

Twitter offers several ways to observe these cascades.

We will go through them briefly.

It’s a bit hard to represent something so vivid in a slide, so I hope you can pardon me for using shaky visual metaphors.

Page 4: Digital Paleontology - Digging for Ancient Tweets

Basic use: visiting twitter.com

The simplest way to observe information cascades is in the wild, using the twitter web client.

You’re in the middle of the action, but that’s not always the best place to see the big picture.

If that’s not precise enough, twitter offers several APIs with distinct characteristics.

--Pic: http://www.public-domain-image.com/

Page 5: Digital Paleontology - Digging for Ancient Tweets

Real Time: the Streaming API

The most complete API is called the streaming API. Complete access to all posted tweets is called the firehose.

--Pic: http://www.flickr.com/photos/usnavy/5887790560/

Page 6: Digital Paleontology - Digging for Ancient Tweets

Access is pretty limited, both expensive and exclusive.

Gaining access to the firehose is one thing. Then you have to have the stomach to drink from it.

Page 7: Digital Paleontology - Digging for Ancient Tweets

October 2012 average: 500 000 000 / day

A complete recording of a month of twitter in January 2011 is about 1 billion tweets (Myers & Leskovec, 2012). That’s about 400 unique tweets per second on average, not counting RTs.

That’s for an average day, almost two years ago.

In October, twitter's CEO reported 500 million tweets per day (http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/). That’s about 5800 per second.
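The per-second figures can be checked with a little arithmetic. The sketch below only uses the two totals quoted in the notes; the function name is made up for illustration. The January 2011 average comes out closer to 370 per second, which the notes round up to about 400.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(total_tweets: float, days: float) -> float:
    """Average number of tweets per second over a period of `days` days."""
    return total_tweets / (days * SECONDS_PER_DAY)


# About 1 billion tweets recorded over January 2011 (31 days).
jan_2011 = per_second(1_000_000_000, 31)

# 500 million tweets per day, as reported in October 2012.
oct_2012 = per_second(500_000_000, 1)

print(f"January 2011: {jan_2011:.0f} tweets/s")
print(f"October 2012: {oct_2012:.0f} tweets/s")
print(f"Growth factor: {oct_2012 / jan_2011:.1f}x")
```

On these numbers, the stream grew roughly fifteenfold in under two years.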

--Pic: http://www.flickr.com/photos/chrish_99/7431798496/

Page 8: Digital Paleontology - Digging for Ancient Tweets

There is not much you can do with such an astonishing stream of data besides displaying it, as twitter does, or storing it to analyze it later. Even that requires huge resources.

--Pic: http://www.flickr.com/photos/thomashawk/5683179189/

Page 9: Digital Paleontology - Digging for Ancient Tweets

Digital Taxidermy ?

Once recorded, the live data is frozen. The cascades are turned into something like a stuffed animal. Digital taxidermy, if you will.

--Damien Hirst, The Physical Impossibility of Death in the Mind of Someone Living
Photo: http://www.flickr.com/photos/chaostrophy/2594401926/

Page 10: Digital Paleontology - Digging for Ancient Tweets

Just-in-time: The Search API

The next best option is to use the Search API to analyze tweets in real-time, as they are published.

After real-time recording, you get just-in-time. For six to nine days, you can use the twitter Search API.

The Search API would be like an underwater tunnel, giving you easy access to the closest data.

--Pic: http://www.flickr.com/photos/lorensztajer/4201751064/

Page 11: Digital Paleontology - Digging for Ancient Tweets

A bit too late : the REST API

But what happens after that, after real-time? When you are just-too-late?

After a week or so, you only have the REST API available. It means you cannot search, and you only get 150 API calls an hour. You can view the details of a tweet or a user if you already know where to look.

As the twitter dev docs put it, you need to get “creative” to get what you want with such limitations.
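To get a feel for what 150 calls an hour means in practice, here is a rough budgeting sketch. It assumes one API call per lookup, which is a simplification, and the function name is hypothetical. The figure of 2657 is the number of distinct authors in the Topsy data discussed later in the talk.

```python
import math

CALLS_PER_HOUR = 150  # REST API limit quoted in the notes


def hours_needed(lookups: int, calls_per_lookup: int = 1) -> int:
    """Whole hours needed to run `lookups` lookups under the hourly limit.

    Assumes every lookup costs `calls_per_lookup` API calls (an assumption,
    not something the notes specify).
    """
    total_calls = lookups * calls_per_lookup
    return math.ceil(total_calls / CALLS_PER_HOUR)


# Checking every distinct author of the Topsy data, one call each.
print(hours_needed(2657))
```

Even a simple pass over a few thousand accounts takes the better part of a day at that rate.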

--Pic: http://www.flickr.com/photos/stinkenroboter/6604532503/

Page 12: Digital Paleontology - Digging for Ancient Tweets

Much too late... Digital Paleontology

You don’t have access to the real, live cascade anymore, but you can still see its shadow, its ghost.

And if you spend enough time putting back together the scattered pieces, it can give you a good idea of what the cascade was.

So, to me, what you do when you are too late for just-in-time amounts to digital paleontology.

There are bits and pieces of data, scattered all over the web if you only know where to dig.

For my PhD, I am trying to work a bit of digital paleontology to reconstruct an information cascade that happened more than a year and a half ago, on May 2, 2011.

As you may have guessed, I originally tried to map the cascade about 15 days after the fact using the twitter API, and came back sorely disappointed. I tried again six months later, and this presentation is the result of my work.

--Pic: http://www.flickr.com/photos/14508691@N08/4531324072/

Page 13: Digital Paleontology - Digging for Ancient Tweets

May 01, 2011 – 03:58PM ET

On twitter, it began like this. A Pakistani can’t sleep because of a helicopter.

(Timestamps are crucial. I have converted everything to Eastern Time for clarity.)
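Normalizing timestamps to a single time zone can be sketched as follows. This is a minimal illustration, not the procedure actually used for the slides. It assumes fixed UTC offsets (Eastern Daylight Time at UTC-4, which applied in May 2011, and Pakistan Standard Time at UTC+5), and the local time in the example is back-computed from the Eastern Time shown on the slide.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, valid for early May 2011 (assumption: no DST change in the window).
EDT = timezone(timedelta(hours=-4), "EDT")  # US Eastern Daylight Time
PKT = timezone(timedelta(hours=5), "PKT")   # Pakistan Standard Time


def to_eastern(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to Eastern (Daylight) Time."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(EDT)


# Hypothetical local time of the first tweet, derived from the slide's 03:58PM ET.
local = datetime(2011, 5, 2, 0, 58, tzinfo=PKT)
print(to_eastern(local).strftime("%b %d, %Y - %I:%M%p ET"))
```

Note how the date itself changes: what is already May 2 in Pakistan is still the afternoon of May 1 on the US East Coast.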

Page 14: Digital Paleontology - Digging for Ancient Tweets

May 01, 2011 – 10:24PM ET

Six and a half hours later, wrestler and entertainer Dwayne ‘The Rock’ Johnson posts a rather cryptic tweet.

Page 15: Digital Paleontology - Digging for Ancient Tweets

May 01, 2011 – 10:24PM ET

At the exact same minute, Donald Rumsfeld's former chief of staff posts a more explicit one.

Page 16: Digital Paleontology - Digging for Ancient Tweets

People begin gathering in front of the White House, waiting for the press conference.

--Pic: http://www.flickr.com/photos/theqspeaks/5679548043/

Page 17: Digital Paleontology - Digging for Ancient Tweets

May 01, 2011 – 11:35PM ET

An hour later, Barack Obama appeared on TV to announce the news.

Responses were mixed.

--Pic: http://www.flickr.com/photos/us_embassy_newzealand/5682145416/

Page 18: Digital Paleontology - Digging for Ancient Tweets

Some Americans were very vocal in their enthusiasm.

Pic: http://www.flickr.com/photos/zokuga/5678699597/

Page 19: Digital Paleontology - Digging for Ancient Tweets

“I have never wished a man dead, but I have read some obituaries with great pleasure.”

Mark Twain

Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain.

... which is actually by civil rights lawyer Clarence Darrow.

--Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg

Page 20: Digital Paleontology - Digging for Ancient Tweets

“I have never wished a man dead, but I have read some obituaries with great pleasure.”

Clarence Darrow

Page 21: Digital Paleontology - Digging for Ancient Tweets

It seems that Mark Twain is to the US as Winston Churchill is to the UK or Jules Renard to France: funny quotes are attributed to him by default.

The mistake probably comes down to this page. You can imagine how this would look as a Google snippet.

http://www.estatevaults.com/lm/archives/2006/08/16/twain_and_darro.html

Page 22: Digital Paleontology - Digging for Ancient Tweets

“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”

Martin Luther King, Jr.

Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings.

As you may have guessed, Martin Luther King never said or wrote this sentence.

And this is what I will be talking about today, after a rather lengthy introduction.

Now, how did this happen? How did this misattributed quote go viral?

Now that’s a job for a digital paleontologist.

--Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg

Page 23: Digital Paleontology - Digging for Ancient Tweets

“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”

Page 24: Digital Paleontology - Digging for Ancient Tweets

“I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”

?

Page 25: Digital Paleontology - Digging for Ancient Tweets

The cascade

We don’t often get to see the starting point of an information epidemic, but in this case it is known.

--Pic: http://www.flickr.com/photos/vilseskogen/3279138165/

Page 26: Digital Paleontology - Digging for Ancient Tweets

It all started from this Facebook post.

Jessica Dovey, who teaches English in Kobe, Japan, posts the following message on her Facebook wall.

Page 27: Digital Paleontology - Digging for Ancient Tweets

May 02, 2011 – 12:15PM ET

It’s interesting because:
- we don’t always get access to the starting point of a cascade
- we don’t often get to see Facebook at all.

I was prepared to say that we have no idea what happened exactly, but I’ll make a guess.

Page 28: Digital Paleontology - Digging for Ancient Tweets

All that we know is that at some point, someone stripped the quote of anything that was actually written by MLK and presented him as the author.

Here is what (to the best of my knowledge) happened.

Page 30: Digital Paleontology - Digging for Ancient Tweets

02:22PM ET

A user posted the whole quote to twitlonger, mistakenly attributing the whole of it to MLK.

Quite a few tweets point to this page (the page itself reports 80; Topsy recorded a dozen or so).

Page 31: Digital Paleontology - Digging for Ancient Tweets

Many of these tweets have been deleted, probably in shame when their authors realized they had posted a fake quote.

Page 32: Digital Paleontology - Digging for Ancient Tweets

And then someone posted the first part of the misattributed quote on twitter.

We only have the faint trace it left on Topsy, a twitter archiving service.

Page 33: Digital Paleontology - Digging for Ancient Tweets

02:42PM ET

Fortunately for us, some twitter archives exist.

Here is the earliest recorded tweet.

(These tweets are called the ‘quote corpus’ in my research notes.)

Page 34: Digital Paleontology - Digging for Ancient Tweets

02:52PM ET

Ten minutes later, a properly formatted quote reaches a so-called ‘influential account’.

Page 35: Digital Paleontology - Digging for Ancient Tweets

03:15PM ET

About 25 minutes later, the quote is reposted by Penn Jillette, a pretty famous US magician.

At that point, several things happen.

(a) The cascade accelerates exponentially.
(b) Some people, mostly journalists, begin to doubt the authenticity of the quote.
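The pace of these first steps is easier to see once the gaps between timestamps are computed. The sketch below uses only the Eastern Time timestamps printed on the slides for May 2, 2011. The event labels are shorthand chosen for this example, and 12:15PM is the time shown on the Facebook slide.

```python
from datetime import datetime

# Timestamps (Eastern Time, May 2, 2011) as printed on the slides.
events = [
    ("Facebook post", "12:15PM"),
    ("twitlonger post", "02:22PM"),
    ("earliest recorded tweet", "02:42PM"),
    ("influential account", "02:52PM"),
    ("Penn Jillette repost", "03:15PM"),
]


def parse(clock: str) -> datetime:
    """Parse a 12-hour clock time such as '02:42PM' on May 2, 2011."""
    return datetime.strptime("2011-05-02 " + clock, "%Y-%m-%d %I:%M%p")


def gaps_in_minutes(evts):
    """Minutes elapsed between each event and the one before it."""
    times = [parse(clock) for _, clock in evts]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


for (name, _), gap in zip(events[1:], gaps_in_minutes(events)):
    print(f"{name}: +{gap} min")
```

The quote takes about two hours to leave Facebook, then only about an hour to go from an obscure twitlonger page to a celebrity account. The last gap is 23 minutes, which the notes round to about 25.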

Page 36: Digital Paleontology - Digging for Ancient Tweets

May 2, 6:23PM

Megan McArdle, then an editor at The Atlantic, publishes a blog post at the end of the afternoon where she expresses her doubts as to the authenticity of the quote.

--Newspaper icon: http://thenounproject.com/noun/newspaper/#icon-No1233

Page 37: Digital Paleontology - Digging for Ancient Tweets

It’s thanks to the work of Megan McArdle at the Atlantic that we have Dovey’s screencap.

She obtained it via conventional journalism techniques, such as e-mailing a human being.

That’s significant because we would never have gained access to a private FB post via an API.

Journalists have their own investigation techniques.

They are a tad low-tech but this is changing, thanks to the ‘data-journalism’ trend currently going on. (This comes with its own set of issues which I won’t discuss here).

She followed up with a second article where she identifies Jessica Dovey.

Page 38: Digital Paleontology - Digging for Ancient Tweets

Penn Jillette retracts his previous tweet as soon as he realizes his mistake. This tweet does not really attract the same kind of attention as the quote.

Page 39: Digital Paleontology - Digging for Ancient Tweets

May 3, 03:20PM ET

The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening.

But they did not know how to use an API. This Salon.com article mistakenly claims that Penn Jillette (PJ) was the first one to post the quote on twitter.

Page 40: Digital Paleontology - Digging for Ancient Tweets

Why did Penn Jillette create a fake Martin Luther King Jr. quote yesterday? - Twitter - Salon.com

Page 41: Digital Paleontology - Digging for Ancient Tweets

PJ reacts.

Page 42: Digital Paleontology - Digging for Ancient Tweets

Salon then posted a follow up piece, which was actually deleted at some later point - I don’t know when, I just realized it while working on these very slides.

A few days ago I could still access it using Google Cache, which is another form of ‘ghosting’.

Page 43: Digital Paleontology - Digging for Ancient Tweets

And that was the cache last night.

Penn Jillette did not delete his tweets, which is a rather classy move on his part.

Page 44: Digital Paleontology - Digging for Ancient Tweets

Then the big players and blogs take it from there.

Kottke, WaPo, CNN, AllThingsD, New Yorker (with timestamps)

And Jason Kottke, who actually wrote what I consider to be the best write-up.

Page 46: Digital Paleontology - Digging for Ancient Tweets

May 2, 06:26PM ET

From there, the cascade branches again, creating a new twitter track.

Here is the first tweet identifying the quote as fake. It was posted a few minutes after Megan McArdle’s first blog post but does not contain a link.

Page 47: Digital Paleontology - Digging for Ancient Tweets

Most tweets calling out the MLK quote as fake contain a link to some type of blog post or news article.

My hypothesis is that people are more likely to include a link to some outside source when their tweet goes 'against the flow'. It gives more weight to their tweets.

Dissenting voices have a hard time being heard on twitter.

[These tweets are called the ‘links corpus’ in my research notes.]

Page 48: Digital Paleontology - Digging for Ancient Tweets

As you can see, the peak of the ‘links corpus’ (yellow) comes about 24h after the peak of the ‘quote corpus’.

Page 49: Digital Paleontology - Digging for Ancient Tweets

Tools & Evaluation

A quick review of the tools used for this presentation.

--Pic: http://www.flickr.com/photos/anomieus/6205869109/

Page 50: Digital Paleontology - Digging for Ancient Tweets

[journalists]

If I may go back to the wildlife metaphor for a second, journalists are a bit like a crew filming a wildlife documentary.

Journalists were instrumental in this whole affair. Their analog methods worked and got results that would have been impossible to obtain using an API (most notably on Facebook).

--Pic: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/people-public-domain-images-pictures/people-filming-in-wild-nature.jpg

Page 51: Digital Paleontology - Digging for Ancient Tweets

Some of them used a bit of google-fu to identify the quote as fake. With a date range filter, you can see whether a quote just appeared on the web or not.
http://www.pcworld.com/article/226912/google_daterange_filter.html

But their lack of knowledge of the inner workings of twitter showed. Salon blamed PJ, and no one was able to identify the first twitter posts, less than 48 hours after the beginning of the cascade. The twitter Search API was still available.

Salon, for example, does not reveal how they found their results, but obviously the methods were suboptimal (my guess is that they did not realize that twitter displays so-called ‘Top results’ as default search results).

That's changing as we speak. Data journalism is probably going to be featured in a number of 2012 buzzword lists.

Page 52: Digital Paleontology - Digging for Ancient Tweets

[data journalists]

[This is probably what data journalists look like.]

--Pic: http://www.public-domain-image.com/

Page 53: Digital Paleontology - Digging for Ancient Tweets

As for my own tools, the results shown today rely mostly on a twitter archiving service called Topsy.

When twitter announced its new terms of service this summer, almost all members of the twitter ecosystem cried in terror. Not Topsy. Topsy is exactly the kind of service twitter wants as a partner. This is the kind of ecosystem twitter is trying to build.

Page 54: Digital Paleontology - Digging for Ancient Tweets

There are various twitter archiving services. I chose Topsy because it is the most fully featured free twitter archive that I could find.

The crucial point was that it allowed me to use a date range filter on searches.

Page 55: Digital Paleontology - Digging for Ancient Tweets

Topsy even offers an API, otter. You can interact with it using python-otter, a python library, or directly using the REST API.

Page 56: Digital Paleontology - Digging for Ancient Tweets

Topsy archives...
> tweets containing a link, or
> tweets that were retweeted

How much does Topsy archive? The first question would be ‘What does Topsy archive?’

Tweets are recorded when either (a) they are retweeted by someone else or (b) they contain a link.
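The archiving rule described here can be written down as a simple predicate. This is only a sketch of the rule as stated in the notes, not Topsy's actual implementation. The `Tweet` class, its fields and the URL pattern are hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass

# Naive URL detector (assumption: a link is anything starting with http:// or https://).
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    """Hypothetical minimal tweet record, for illustration only."""
    text: str
    retweet_count: int = 0


def would_be_archived(tweet: Tweet) -> bool:
    """Apply the rule from the notes: archived if retweeted or if it contains a link."""
    has_link = URL_PATTERN.search(tweet.text) is not None
    was_retweeted = tweet.retweet_count > 0
    return has_link or was_retweeted


print(would_be_archived(Tweet("Can't sleep, helicopter overhead")))
print(would_be_archived(Tweet("Worth reading: http://example.com/post")))
print(would_be_archived(Tweet("A popular quote", retweet_count=12)))
```

Under this rule a plain tweet that nobody retweeted leaves no trace, which is why only the linked or retweeted part of a cascade can be dug up later.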

Page 57: Digital Paleontology - Digging for Ancient Tweets

Interesting caveat: Topsy records tweets when they are published. It means that if you delete a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from Topsy.

It’s something to keep in mind: Topsy offers something like a ghost image, an afterimage.

Page 59: Digital Paleontology - Digging for Ancient Tweets

Ghosting effect on the ‘quote’ corpus

2657 different authors in Topsy data
394 were unavailable on twitter:
Hidden: 94
Suspended: 4
Closed: 296
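The size of the ghosting effect follows directly from the counts on this slide. The short sketch below only restates them. The variable names are made up for the example.

```python
# Counts from the slide: authors found in Topsy data but unavailable on twitter.
unavailable = {"hidden": 94, "suspended": 4, "closed": 296}
total_authors = 2657

missing = sum(unavailable.values())
share_missing = missing / total_authors

print(f"{missing} of {total_authors} authors unavailable ({share_missing:.1%})")
for reason, count in unavailable.items():
    # Share of each reason among the unavailable accounts.
    print(f"  {reason}: {count} ({count / missing:.1%})")
```

Roughly one author in seven had become unreachable on twitter, and about three quarters of those had closed their account.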

Page 60: Digital Paleontology - Digging for Ancient Tweets

Test sample

174368 tweets, including:
3140 RT
18400 links

Topsy’s coverage ≈ 12.1%

Topsy archives tweets that were retweeted or tweets containing a URL. Topsy archives tweets *that are links*. This is the value. The textual content is impossible / not cost-effective to exploit right now.

To evaluate our results is to evaluate Topsy.

It has become pretty hard to get a twitter dataset: twitter recently changed its terms of service and several previously available datasets have become unavailable.

I got a sample from the streaming API. It contains about 175000 tweets, which amounts to a few minutes of firehose output.

On our sample, Topsy’s coverage would be about 12%, excluding duplicates.
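The coverage estimate can be reproduced from the sample counts. The sketch below assumes the estimate is simply the share of sampled tweets that are retweets or contain a link, with tweets that are both counted once. The number of such overlapping tweets is not given in the notes, so it is left as a parameter, and the function name is hypothetical.

```python
def coverage(total: int, retweeted: int, with_link: int, both: int = 0) -> float:
    """Share of tweets that are retweeted or contain a link.

    `both` is the number of tweets counted in both categories; it is
    subtracted once so that no tweet is counted twice.
    """
    archived = retweeted + with_link - both
    return archived / total


# Sample counts from the slide; the overlap is unknown, so this is an upper bound.
upper_bound = coverage(174_368, 3_140, 18_400)
print(f"Upper bound without overlap: {upper_bound:.2%}")

# Overlap that would be needed to land on the 12.1% shown on the slide
# (inferred here for illustration, not reported in the notes).
implied_overlap = round(3_140 + 18_400 - 0.121 * 174_368)
print(f"Implied overlap: about {implied_overlap} tweets")
```

Without any overlap the figure is slightly above 12.3%. The 12.1% on the slide is consistent with a few hundred tweets being both retweets and links.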

Page 61: Digital Paleontology - Digging for Ancient Tweets

In closing...

--Pic: http://www.flickr.com/photos/biodivlibrary/6217534124/

Page 62: Digital Paleontology - Digging for Ancient Tweets

• Resource decay is not uniform

• Linked content has more value

• Media coverage plays an important part in the phenomenon and in what survives

At least in the case of twitter, decay is not evenly distributed.

Content that survives is content that was interesting to someone, content that is linked. The value is the link, not the content.

In our case, it’s worth considering that media coverage both unearthed some content (the FB screenshot) and buried other pieces of information (deleted tweets).

Journalists and data investigation techniques are complementary. Data journalism will probably fuse the two in the long run.

Page 63: Digital Paleontology - Digging for Ancient Tweets

Thank you

[email protected]

http://nologos.net/
@lagayascienza

Thank you for your attention.

Pic: http://www.flickr.com/photos/mmechtley/5379944214/

