Characterizing the Splogosphere

Date post: 14-Jan-2015
Upload: hiroshi-ono
Description:
UMBC, an Honors University in Maryland. Characterizing the Splogosphere. Tim Finin. http://ebiquity.umbc.edu/paper/html/id/299/ Pranam Kolari, Akshay Java and Tim Finin, University of Maryland, Baltimore County. 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 22 May 2006.
Transcript
Page 1: Characterizing the Splogosphere

UMBC, an Honors University in Maryland

Characterizing the Splogosphere

Tim Finin

http://ebiquity.umbc.edu/paper/html/id/299/

Pranam Kolari, Akshay Java and Tim Finin

University of Maryland, Baltimore County

3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics

22 May 2006

Page 2: Characterizing the Splogosphere

Outline

• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications

Page 3: Characterizing the Splogosphere

The Blogosphere

• 57% of online US teens generate content, 40% read blogs, 20% have them! (Pew, Nov. 2005)

• 53% of companies are blogging (Guideware Oct. 2005)

• MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!

• But … the Blogosphere is awash in spam

Source: Wikipedia

Page 4: Characterizing the Splogosphere

Blogosphere/Splogosphere

Page 5: Characterizing the Splogosphere

Spam in the Blogosphere

• Types: comment spam, ping spam, spam blogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites”
• “Spings, or ping spam, are pings that are sent from spam blogs”

1 Wikipedia

Page 6: Characterizing the Splogosphere

Motivation: host ads

Page 7: Characterizing the Splogosphere

Motivation: index affiliates, promote PageRank

Page 8: Characterizing the Splogosphere

Spings from weblogs.com

Page 9: Characterizing the Splogosphere

Where do splogs come from?

“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”

“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”

“Holy Grail Of Advertising...”

“Easily Dominate Any Market, Any Search Engine, Any Keyword.”

$197

Page 10: Characterizing the Splogosphere

Page 11: Characterizing the Splogosphere

Our splog bait was picked up and used by dozens of sploggers

Page 12: Characterizing the Splogosphere

Page 13: Characterizing the Splogosphere

Our feed is RSSjacked by at least one splogger

Page 14: Characterizing the Splogosphere

Why are splogs a problem?

• Splogs undermine ranking algorithms
• Splogs water down search results
• Splogs threaten the Web advertising model
• Splogs indulge in “plagiarism”
• Splogs skew results of market research tools
• Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.

Page 15: Characterizing the Splogosphere

Outline

• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications

Page 16: Characterizing the Splogosphere

Splog Detection

• SVM based probabilistic splog detection (Kolari et al., 2006)
• Hand verified training set of blogs and splogs
• Precision/Recall of 87%
• Bag-of-words based feature using text on blog home-page, O(x)
• Some additional local features

P( x is a splog | O(x) )
P( x is a blog | O(x) )

Top features

blogs: we, what, was, my, org, flickr, paper, 600, open, words, weblog, motion, me, thank, go, january, trackback, archives, now, political

splogs: find, info, news, your, 27, another, website, best, articles, on, perfect, products, uncategorized, 280, hot, resources, inc, 60, three, copyright
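The bag-of-words idea on this slide can be illustrated with a minimal Python sketch. It is not the authors' SVM: it uses a plain linear score over binary word features and a sigmoid to turn the score into a value that plays the role of P( x is a splog | O(x) ). The words are taken from the top-features lists, but the weights, the bias, and the function names are hypothetical toy values chosen only for illustration.

```python
import math
import re

# Hypothetical toy weights: positive values push toward "splog",
# negative values toward "blog". A real system would learn these
# from a hand-verified training set.
TOY_WEIGHTS = {
    "uncategorized": 1.2,
    "resources": 0.9,
    "products": 0.8,
    "trackback": -1.1,
    "archives": -0.7,
    "thank": -0.6,
}
TOY_BIAS = 0.0


def bag_of_words(text):
    """Return the set of lowercase tokens on a page (binary features)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def splog_probability(text, weights=TOY_WEIGHTS, bias=TOY_BIAS):
    """Linear score over the words present, squashed into (0, 1)."""
    words = bag_of_words(text)
    score = bias + sum(w for word, w in weights.items() if word in words)
    return 1.0 / (1.0 + math.exp(-score))
```

Under these assumptions, P( x is a blog | O(x) ) is simply one minus the returned value.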

Page 17: Characterizing the Splogosphere

Outline

• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications

Page 18: Characterizing the Splogosphere

BlogPulse Dataset

• 21 days of July 2005
• 1.3 million blogs
• Eliminated LiveJournal
• Re-fetched blog home pages; many spam blogs were non-existent since spam blogs are short-lived
• Arrived at 500K samples
• Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
• Identified 27K splogs
• Sampled for 27K authentic blogs
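The two-threshold labelling step can be sketched as follows. The thresholds 0.2 and 0.8 come from the slide; reading them as cut-offs on the splog probability, treating the boundaries as inclusive, and the label names themselves are assumptions made for this example.

```python
BLOG_THRESHOLD = 0.2   # at or below: treated as an authentic blog
SPLOG_THRESHOLD = 0.8  # at or above: treated as a splog


def label_sample(p_splog):
    """Map a splog probability to 'blog', 'splog', or 'uncertain'."""
    if not 0.0 <= p_splog <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_splog <= BLOG_THRESHOLD:
        return "blog"
    if p_splog >= SPLOG_THRESHOLD:
        return "splog"
    # Samples between the thresholds are left out of both sets.
    return "uncertain"
```

Keeping only the confident ends of the range trades coverage for cleaner labels, which fits the slide's move from 500K samples to 27K splogs.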

Page 19: Characterizing the Splogosphere

Splogs vs. Blogs – Word Count

[Figure panels: blogs; splogs; blogs and splogs]

Page 20: Characterizing the Splogosphere

Splogs vs. Blogs – In-degree

Top 5

http://www.engadget.com 1942
http://www.huffingtonpost.com/theblog 905
http://www.crooksandliars.com 637
http://blogs.guardian.co.uk/news 616
http://www.littlegreenfootballs.com/weblog 611

Top 5

http://spaces.msn.com/members/pony-girl 505
http://spaces.msn.com/members/black-puss 505
http://spaces.msn.com/members/amputee-women 505
http://spaces.msn.com/members/free-stories 505
http://spaces.msn.com/members/first-time-girl 505

Page 21: Characterizing the Splogosphere

Splogs vs. Blogs – Out-degree

Top 5

http://www.xanga.com/home.aspx?user=hit_me_layoutz 273
http://www.xanga.com/home.aspx?user=i_jock_layouts 271
http://www.xanga.com/home.aspx?user=slp_layouts_slp 198
http://spaces.msn.com/members/cyrustse1986 193
http://www.xanga.com/home.aspx?user=layouts_n_codes2005 180

Top 5

http://worldseriesofpokerchipscardguard.blogspot.com 898
http://rule-wsop.blogspot.com 898
http://worldseries-ofpoler.blogspot.com 898
http://qsopcom-1.blogspot.com 898
http://weopcom.blogspot.com 898
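In-degree and out-degree rankings like the ones on these two slides can be computed from a list of links. The sketch below assumes the link data is available as (source URL, target URL) pairs and that repeated links between the same pair count once; both are assumptions, since the slides do not say how links were extracted or counted.

```python
from collections import Counter


def degree_counts(links):
    """Return (in_degree, out_degree) Counters for (source, target) pairs."""
    in_degree = Counter()
    out_degree = Counter()
    # Deduplicate so a repeated link between the same pair counts once.
    for source, target in set(links):
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def top_n(degree, n=5):
    """The n URLs with the highest degree, as (url, count) pairs."""
    return degree.most_common(n)
```

For example, `top_n(degree_counts(links)[0])` would give a top-5 in-degree list in the same shape as the tables above.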

Page 22: Characterizing the Splogosphere

Outline

• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications

Page 23: Characterizing the Splogosphere

Weblogs.com Dataset

• 20 Nov 2005 – 11 Dec 2005
• 16 million update pings
• Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
• Heuristics to identify Japanese, Chinese, Korean
• Set threshold of 0.5 to separate out authentic blogs from splogs.

1 Thanks to James Mayfield, JHU APL
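A rough sketch of how the 0.5 threshold and the per-language split might be combined to estimate the share of spings in each language. The input format (one `(language, p_splog)` pair per ping), the inclusive boundary at 0.5, and the function name are assumptions; the slides do not describe the actual data layout.

```python
from collections import defaultdict

SPLOG_THRESHOLD = 0.5  # single cut-off stated on the slide


def sping_share_by_language(pings):
    """Fraction of pings per language whose source scores as a splog."""
    totals = defaultdict(int)
    spings = defaultdict(int)
    for language, p_splog in pings:
        totals[language] += 1
        if p_splog >= SPLOG_THRESHOLD:
            spings[language] += 1
    return {lang: spings[lang] / totals[lang] for lang in totals}
```

A single threshold labels every ping, unlike the 0.2/0.8 pair used for the BlogPulse samples, which leaves uncertain cases out.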

Page 24: Characterizing the Splogosphere

Ping times – Italian Blogs

Page 25: Characterizing the Splogosphere

Sping vs. Ping times

Page 26: Characterizing the Splogosphere

Spings vs. Pings: frequency

Plotting blogs against their ping frequency follows a power law, but plotting splogs against their sping frequency does not.
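One simple way to eyeball a power law is to count how many blogs sent exactly k pings and fit a straight line to the log-log points. The sketch below does that with ordinary least squares. It is only a rough diagnostic, not a rigorous statistical test, and nothing in the slides says this is the method the authors used; the input format and function names are assumptions.

```python
import math
from collections import Counter


def frequency_distribution(pings_per_blog):
    """Map ping count k to the number of blogs that sent exactly k pings."""
    return Counter(pings_per_blog.values())


def log_log_slope(distribution):
    """Least-squares slope of log(number of blogs) against log(ping count)."""
    points = [
        (math.log(k), math.log(n))
        for k, n in distribution.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct ping counts")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

Points that fall close to a straight line with negative slope are consistent with a power law; a curve that bends or clusters, as the slide reports for splogs, is not.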

Page 27: Characterizing the Splogosphere

All Pings – 16 Million

• Close to 40% spings
• Among English blogs
  – 75% pings are spings
  – Authentic blogs are 13% of all pings
• Including Info domain
  – 50% of all pings are spings

url count

http://www.wiccapaganblog.com 1491
http://www.freecancerfacts.com/wp 1452
http://www.myaquariumiplace.com 1375
http://www.criss-angel.biz 1215
http://www.microdermabrasion-secrets.com 1211
http://www.tipstohealth.com/blog 1207
http://www.countrymusicdigest.com 1191

Page 28: Characterizing the Splogosphere

Outline

• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications

Page 29: Characterizing the Splogosphere

Implications (1)

• BlogPulse dataset
  – Local word models most effective for fast splog detection
  – If splogs escape filters, in-link and out-link distribution point to link-based classification
• Weblogs.com dataset
  – Ping frequency can be useful
  – Splogs probably not a big problem in most European languages. Yet.
• The nature of the domain points to spam filters employing a multi-step, adaptive approach, which we are currently pursuing

Page 30: Characterizing the Splogosphere

Implications (2) – Filter Design

[Diagram: Spam Blog Filter with steps numbered 1 to 4. Labels: Heuristics, Language Identifiers, Blog Identifier, Spam Blog Detectors, IP Blacklists, Supporting Info (optional), Authentic Blogs, Spam Blogs.]
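A multi-step filter of the kind named on this slide can be sketched as a chain of stages, where each stage either decides or passes the candidate on. The stage names follow the slide labels, but their order, the candidate fields (`ip`, `language`, `is_blog`, `p_splog`), the verdict strings, and all values below are hypothetical; the transcript does not preserve how the diagram connects the components.

```python
# Hypothetical inputs: none of these values come from the slides.
BLACKLISTED_IPS = {"203.0.113.7"}
SUPPORTED_LANGUAGES = {"en"}
SPLOG_THRESHOLD = 0.5


def heuristics(candidate):
    """Cheap checks first: flag sources found on an IP blacklist."""
    return "splog" if candidate.get("ip") in BLACKLISTED_IPS else None


def language_identifier(candidate):
    """Stop early when no detector exists for the page language."""
    if candidate.get("language") not in SUPPORTED_LANGUAGES:
        return "unsupported language"
    return None


def blog_identifier(candidate):
    """Drop pages that are not blogs at all."""
    return None if candidate.get("is_blog") else "not a blog"


def spam_blog_detector(candidate):
    """Final stage: threshold a splog probability from an upstream model."""
    if candidate.get("p_splog", 0.0) >= SPLOG_THRESHOLD:
        return "splog"
    return "blog"


STAGES = [heuristics, language_identifier, blog_identifier, spam_blog_detector]


def run_filter(candidate, stages=STAGES):
    """Return (stage name, verdict) from the first stage that decides."""
    for stage in stages:
        verdict = stage(candidate)
        if verdict is not None:
            return stage.__name__, verdict
    return None, "undecided"
```

Putting the cheapest checks first means most candidates never reach the content-based detector, which is one plausible reason for a staged design.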

Page 31: Characterizing the Splogosphere

Conclusions

• Blog spam is a serious problem
  – Classic arms race, e.g., increased plagiarism, feedjacking
• Blog spam identification requires different tactics than used for email and Web spam
  – Local features effective, but not sufficient
  – Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities) but dynamism reduces effectiveness of analysis
• Getting good training sets expensive, especially in a multilingual environment.
  – Minute or more a judgment
• Good opportunities for infrastructure insertion, e.g., sping free ping servers

Page 32: Characterizing the Splogosphere

For more information

http://ebiquity.umbc.edu/

Annotated in OWL

Page 33: Characterizing the Splogosphere

Questions?

Page 34: Characterizing the Splogosphere

Blogs – A Specialized Domain

[Diagram: steps numbered 1 to 4. Labels: Update Pings, Ping Stream, Update Stream, Fetch Content.]

