Implicit Structure and Dynamics of BlogSpace
Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose
HP Labs, Palo Alto, CA
Blogs (weblogs) contain online stamped entries
[Figure: annotated blog page; callouts: date and time stamps, list of read blogs, URL that is being commented on, via link]
Blogs: structure and transmission
• Blog use:
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net
• Blog structure: blog-to-blog linking
• Use + Structure
– Great to track “memes” (catchy ideas)
• Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?
• Ranking algorithms that take advantage of transmission patterns
Related Work
Link prediction in social networks:
Butts, C., Network Inference, Error, and Information (In)Accuracy: A Bayesian Approach, Social Networks, 25(2):103-140.
Dombroski, M., P. Fischbeck, and K. Carley, An Empirically-Based Model for Network Estimation and Prediction, NAACSOS conference proceeding, Pittsburgh, PA, 2003.
O’Madadhain J., Smyth P., Adamic L., Learning Predictive Models for Link Formation, Sunbelt 2005 (hope you were there!)
Getoor, L., N. Friedman, D. Koller, and B. Taskar, Learning Probabilistic Models of Link Structure, Journal of Machine Learning Research, vol. 3 (2002), pp. 690-707.
Adamic L., Adar E., Friends and neighbors on the Web, Social Networks, 2003.
Kleinberg, J., and D. Liben-Nowell, The Link Prediction Problem for Social Networks, in Proceedings of CIKM ’03 (New Orleans, LA, November 2003), ACM Press.
Blog ranking: Technorati, BlogPulse, Daypop…
Blog epidemic tracking:
Blogdex at MIT Media Lab, Cameron Marlow, Sunbelt 2003
BlogPulse
Intelliseek’s BlogPulse
Service for tracking trends in the blogosphere: popular URLs, phrases, people
BlogPulse Data analyzed
37,153 blogs
Differential daily crawls (to find new posts) for May 2003
Full page crawl for May 18, 2003 to capture blogrolls
175,712 URLs occurring on > 2 blogs
[Figure: popularity vs. time, annotated with the Slashdot Effect and the BoingBoing Effect]
Tracking popularity over time
Blogdex, BlogPulse, etc. track the most popular links/phrases of the day
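As a rough illustration of what "most popular links of the day" means in practice, the sketch below counts how many distinct blogs mention each URL on each day. The record layout `(blog, url, day)` and the function names are assumptions made for this example; they do not describe how Blogdex or BlogPulse are implemented.

```python
from collections import defaultdict

def daily_popularity(records):
    """records: iterable of (blog, url, day) tuples (hypothetical crawl output).
    Returns {day: {url: number of distinct blogs mentioning it that day}}."""
    blogs_seen = defaultdict(set)
    for blog, url, day in records:
        blogs_seen[(day, url)].add(blog)  # a set ignores repeat mentions by one blog
    counts = defaultdict(dict)
    for (day, url), blogs in blogs_seen.items():
        counts[day][url] = len(blogs)
    return counts

def top_urls(counts, day, n=3):
    """Most-mentioned URLs for one day, ties broken alphabetically."""
    ranked = sorted(counts.get(day, {}).items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Toy data, invented for illustration only.
records = [
    ("b1", "u1", 1), ("b2", "u1", 1), ("b1", "u1", 1),
    ("b3", "u2", 1), ("b1", "u2", 2),
]
counts = daily_popularity(records)
print(top_urls(counts, day=1))
```

Counting distinct blogs rather than raw mentions keeps a single prolific blog from dominating the daily ranking.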
Election Map Cartograms
Michael Gastner, Cosma Shalizi, and Mark Newman
University of Michigan
http://www-personal.umich.edu/~mejn/election/
Tracking popularity over time
[Figure: number of blogs mentioning a URL per day (x-axis: day, 0 to 25; y-axis: blogs, 0 to 50). Series: U-M: Election Cartographs; WIRED: Orrin Hatch: Software Pirate? Annotations: total mentions: 100; total mentions: 92]
Clustering information popularity profiles
May 2003
Total # of mentions substantial (40)
URL mentioned for the first time in May
K-means clustering
259 URLs in the sample satisfy criteria
Take normalized cumulative profiles
[Figure: normalized cumulative profile; axes: all mentions, day]
K-means minimizes the sum of squared distances between each profile and the centroid of its cluster
4 clusters captured most of the differences
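The clustering step can be sketched roughly as follows. This is a minimal pure-Python version of Lloyd's k-means applied to normalized cumulative profiles, not the authors' code: the function names, the deterministic farthest-point seeding, and the toy daily counts are all assumptions made for illustration.

```python
def normalized_cumulative(daily_counts):
    """Turn daily mention counts into a cumulative profile that ends at 1.0."""
    total = sum(daily_counts)
    running, profile = 0, []
    for count in daily_counts:
        running += count
        profile.append(running / total)
    return profile

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def initial_centroids(profiles, k):
    """Assumed seeding: start from the first profile, then repeatedly add
    the profile farthest from the centroids chosen so far."""
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        farthest = max(profiles,
                       key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(profiles, k, iterations=50):
    """Lloyd's algorithm; returns (centroids, cluster label for each profile)."""
    centroids = initial_centroids(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy daily counts (invented): two early-peak URLs and two sustained-interest URLs.
daily_counts = [
    [30, 6, 2, 1, 1],
    [8, 8, 8, 8, 8],
    [32, 5, 2, 1, 0],
    [9, 8, 7, 8, 8],
]
profiles = [normalized_cumulative(counts) for counts in daily_counts]
centroids, labels = kmeans(profiles, k=2)
print(labels)
```

Normalizing each profile to end at 1.0 means URLs are grouped by the shape of their popularity curve rather than by how many mentions they received in total.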
Different kinds of information have different popularity profiles
[Figure: four cluster profiles, panels 1 to 4 (x-axis: day, 5 to 15; y-axis: 0 to 1). Panel labels: Slashdot postings; Front-page news; Major-news site (editorial content), back of the paper; Products, etc.]
Cluster   Profile                                       # URLs   Examples
1         Sharp peak on day 1 followed by fast decay    38       Slashdot postings
2         Day 1 peak followed by decay                  46       Front page news
3         Day 2 peak followed by gradual decay          51       Editorial content, Sun java release
4         Sustained interest                            124      iPod, iTunes, quizzila
Popularity profiles
[Figure: the four cluster profiles overlaid on one plot (x-axis: 2 to 22; y-axis: 0 to 1); legend: cluster 1, cluster 2, cluster 3, cluster 4]
Microscale Dynamics
• What do we need to track specific info ‘epidemics’?
– Timings
– Underlying network
[Figure: blogs b1, b2, b3 placed along a time-of-infection axis from t0 to t1]
Microscale Dynamics
• Challenges
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media (email, voice)
– No links
[Figure: blogs b1, b2, b3, and bn placed along a time-of-infection axis from t0 to t1, with unknown (“??”) paths between them]
Microscale Dynamics: who is getting info from whom
• Via links (< 2% of links, 50% within sample): unambiguous
• Multiple explicit links: which link is more likely?
• No explicit links (70%): which implicit path is more likely?
Link Inference
Use machine learning algorithms:
A) Support Vector Machine (SVM)
B) Logistic Regression
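To make the logistic regression option concrete, here is a minimal sketch trained by plain batch gradient descent. It is not the authors' implementation: the two feature columns (a blog-link similarity and a URL-link similarity), the toy values, and the learning-rate and epoch settings are all assumptions chosen only so the example runs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """weights holds one coefficient per feature plus a trailing bias term."""
    z = weights[-1] + sum(w * value for w, value in zip(weights, x))
    return sigmoid(z)

def train_logistic(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit the weights by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            for j, value in enumerate(x):
                gradient[j] += error * value
            gradient[-1] += error  # bias term
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# Hypothetical features per blog pair: [blog-link similarity, URL-link similarity].
rows = [
    [0.40, 0.30], [0.35, 0.25], [0.50, 0.20],   # pairs with a link (label 1)
    [0.02, 0.05], [0.05, 0.01], [0.00, 0.03],   # pairs without a link (label 0)
]
labels = [1, 1, 1, 0, 0, 0]
weights = train_logistic(rows, labels)
print(round(predict_proba(weights, [0.45, 0.28]), 3))
```

In practice a library implementation would replace the hand-written loop; the point here is only that each blog pair becomes a feature vector and the model outputs a probability that the pair is linked.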
What we can use
Full text
Blogs in common
Links in common
History of infection
[Figure labels: BoingBoing, WIRED]
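The "blogs in common" and "links in common" features can be computed as set similarities. The sketch below uses cosine similarity over binary link vectors; the choice of cosine similarity, the dictionary layout, and the feature names are assumptions for illustration, since the slides say only that these overlaps are used.

```python
import math

def cosine_similarity(links_a, links_b):
    """Cosine similarity between two link sets treated as binary vectors."""
    if not links_a or not links_b:
        return 0.0
    shared = len(links_a & links_b)
    return shared / math.sqrt(len(links_a) * len(links_b))

def pair_features(blog_a, blog_b):
    """Similarity features for one blog pair (hypothetical field names)."""
    return {
        "blog_link_similarity": cosine_similarity(blog_a["blogroll"], blog_b["blogroll"]),
        "url_link_similarity": cosine_similarity(blog_a["urls"], blog_b["urls"]),
    }

# Toy blogs, invented for illustration.
blog_a = {"blogroll": {"boingboing", "metafilter"}, "urls": {"u1", "u2", "u3", "u4"}}
blog_b = {"blogroll": {"boingboing", "slashdot"}, "urls": {"u1", "u2"}}
features = pair_features(blog_a, blog_b)
print(features)
```

The same function could be applied to word sets or word counts for the full-text feature, with the vectors built from post text instead of links.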
Percentage of blog pairs sharing at least one link
link type               same day   A after B   A before B
A B (reciprocated)      17.4%      24.5%       24.5%
A B (unreciprocated)    10.9%      22.9%       17.0%
A, B unlinked           0.6%       1.5%        1.3%
[Figure: two panels showing fraction of pairs (y-axis, 0 to 1) against similarity in links to non-blog URLs and similarity in links to other blogs (x-axis, 0 to 0.5); legend: bidirectional, unidirectional, not linked]
Similarity in links between reciprocated, unreciprocated, and non-linked blog pairs
Training on positive and negative examples of ‘infection’
[Figure: two Blog A / Blog B pairs. Positive example (+): Tinfection(Blog B) > Tinfection(Blog A). Negative example (-). Legend: infected, uninfected]
Prediction results
Link inference:
SVM: 91% accuracy
Regression: 92% accuracy (blog-blog links most predictive)
Infection inference:
SVM: 71.5% accuracy, using blog and non-blog link similarity + timing features (AbeforeB)/nA, (BbeforeA)/nA, (A same day B)/nA, …
Regression: 75% accuracy using only timing features
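The timing features named above can be sketched as follows. The slides do not define nA; this example assumes it is the number of URLs blog A mentioned, and the input layout (a dictionary from URL to the day of first mention) and feature names are likewise assumptions.

```python
def timing_features(mentions_a, mentions_b):
    """mentions_*: dict mapping URL -> day the blog first mentioned it.
    Returns the three timing ratios, normalized by n_a (assumed here to be
    the number of URLs mentioned by blog A)."""
    n_a = len(mentions_a)
    if n_a == 0:
        return {"a_before_b": 0.0, "b_before_a": 0.0, "same_day": 0.0}
    shared = mentions_a.keys() & mentions_b.keys()
    a_first = sum(1 for url in shared if mentions_a[url] < mentions_b[url])
    b_first = sum(1 for url in shared if mentions_b[url] < mentions_a[url])
    same_day = len(shared) - a_first - b_first
    return {
        "a_before_b": a_first / n_a,
        "b_before_a": b_first / n_a,
        "same_day": same_day / n_a,
    }

# Toy mention histories, invented for illustration.
blog_a_mentions = {"u1": 3, "u2": 5, "u3": 7, "u4": 9}
blog_b_mentions = {"u1": 4, "u2": 5, "u3": 6, "u5": 1}
timing = timing_features(blog_a_mentions, blog_b_mentions)
print(timing)
```

With one-day crawl resolution, the same-day ratio absorbs every case where the true order cannot be observed, which is one reason timing alone is a noisy signal.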
[Figure: infection paths along a time axis; legend: inferred, actual, uncrawled blog or media source]
Sources of error
Coarseness and sparseness of timing data (1 day resolution)
Mirror URLs (actually helps)
Incomplete crawls
[Figure: diagram of blogs A, B, and C]
Visualization
by Eytan Adar
• GUESS tool (build your own, see demo @ 5:30!)
– Using GraphViz (by AT&T) layouts
• Simple algorithm
– If single, explicit link exists, draw it (add node if needed)
– Otherwise use ML algorithm
  • Pick the most likely explicit link
  • Pick the most likely possible link
• Tool lets you zoom around space, control threshold, link types, etc.
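The simple algorithm above can be sketched as a function that picks, for each infected blog, the edge to draw. This is an interpretation rather than the tool's actual code: restricting candidates to blogs infected strictly earlier is an assumption, and `score` is a hypothetical stand-in for the trained classifier's confidence.

```python
def choose_parent(blog, infection_day, explicit_links, score):
    """Pick the blog that most plausibly passed the URL on to `blog`.

    infection_day:  dict blog -> day it first mentioned the URL
    explicit_links: dict blog -> set of blogs it links to
    score:          hypothetical callable (source, target) -> confidence
    Returns (parent or None, kind of edge).
    """
    # Assumption: only blogs infected strictly earlier can be the source.
    earlier = [b for b in infection_day if infection_day[b] < infection_day[blog]]
    linked = [b for b in earlier if b in explicit_links.get(blog, set())]
    if len(linked) == 1:
        return linked[0], "explicit"
    if linked:
        return max(linked, key=lambda b: score(b, blog)), "explicit (most likely)"
    if earlier:
        return max(earlier, key=lambda b: score(b, blog)), "inferred"
    return None, "root"

# Toy data, invented for illustration.
infection_day = {"b1": 1, "b2": 2, "b3": 3, "b4": 3}
explicit_links = {"b2": {"b1"}, "b3": {"b1", "b2"}}
toy_scores = {"b1": 0.2, "b2": 0.9}

def score(source, target):
    return toy_scores.get(source, 0.0)

for blog in infection_day:
    print(blog, choose_parent(blog, infection_day, explicit_links, score))
```

A threshold on the score, like the one the tool exposes, would simply drop inferred edges whose confidence falls below it.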
http://www-idl.hpl.hp.com/blogstuff
iRank
Find early sources of good information using inferred information paths or timing
[Figure: blogs b1, b2, b3, b4, b5, … bn, with b1 marked as the true source and another marked as a popular site]
iRank Algorithm
• Draw a weighted edge for all pairs of blogs that cite the same URL
• Higher weight for mentions closer together
• Run PageRank
• Control for ‘spam’
[Figure: time-of-infection axis from t0 to t1]
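A rough sketch of the iRank steps follows. Several details are assumptions rather than facts from the slides: edges point from the later blog to the earlier one so that rank flows toward early sources, the weight is 1 / (gap in days), same-day pairs are skipped, and the ‘spam’ control is omitted entirely. The PageRank routine is a standard weighted power iteration.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(mentions):
    """mentions: dict URL -> dict blog -> day of first mention.
    Returns weights[later_blog][earlier_blog]."""
    weights = defaultdict(lambda: defaultdict(float))
    for days in mentions.values():
        for a, b in combinations(sorted(days), 2):
            if days[a] == days[b]:
                continue  # assumption: same-day mentions give no direction
            later, earlier = (a, b) if days[a] > days[b] else (b, a)
            # Assumed weighting: closer mentions get a higher weight.
            weights[later][earlier] += 1.0 / abs(days[a] - days[b])
    return weights

def pagerank(weights, nodes, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            out = weights.get(source, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[source] / len(nodes)
            else:
                for target, w in out.items():
                    new_rank[target] += damping * rank[source] * w / total
        rank = new_rank
    return rank

# Toy data, invented: "source" posts first, "popular" picks it up, fans follow.
mentions = {
    "url1": {"source": 1, "popular": 2, "fan1": 3, "fan2": 3},
    "url2": {"source": 2, "popular": 4, "fan1": 5},
}
graph = build_graph(mentions)
nodes = sorted({blog for days in mentions.values() for blog in days})
rank = pagerank(graph, nodes)
print(max(rank, key=rank.get))
```

In this toy graph the early source outranks the popular site even though more blogs follow the popular site directly, which is the behavior the iRank slide describes.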
Do Bloggers Kill Kittens?
02:00 AM Friday Mar. 05, 2004 PST Wired publishes:
"Warning: Blogs Can Be Infectious."
7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:
"Bloggers' Plagiarism Scientifically Proven"
9:55 AM Friday Mar. 05, 2004 PST Metafilter announces:
"A good amount of bloggers are outright thieves."
For more info
Information Dynamics Lab @ HP
http://www.hpl.hp.com/research/idl
Blog Epidemic Analyzer
http://www-idl.hpl.hp.com/blogstuff
Eytan, Li, Lada & Rajan
http://www.hpl.hp.com/research/idl/people/eytan/
http://www.hpl.hp.com/personal/Li_Zhang/
http://www.hpl.hp.com/personal/Lada_Adamic
http://www.hpl.hp.com/research/idl/people/lukose/