Nilesh Bansal and Nick Koudas

WebDB 2007

SEARCHING THE BLOGOSPHERE

Nilesh Bansal, Nick Koudas

University of Toronto

BLOGOSPHERE

67M KNOWN BLOGS

100K NEW EVERY DAY

DOUBLING EVERY 200 DAYS

WHAT ARE THEY WRITING ABOUT??

PERSONAL LIFE

PRODUCT REVIEWS

POLITICS

TECHNOLOGY

TOURISM

SPORTS

ENTERTAINMENT

WHY SHOULD WE CARE?

HUGE DATA REPOSITORY

WILL CONTINUE TO GROW

EXTRACT PUBLIC OPINION

VALUABLE INSIGHTS

KEY INSIGHTS

MARKET RESEARCH

PUBLIC RELATIONS STRATEGIES

CUSTOMER OPINION TRACKING

CHALLENGES AND OPPORTUNITIES

HUGE AMOUNTS OF UNSTRUCTURED TEXT

MACHINE CREATED WEBLOGS

MORE THAN HALF OF BLOGSPOT IS SPAM

33% OF WEBSPAM HOSTED AT BLOGSPOT

TEMPORAL DIMENSION

GEOGRAPHICAL ASSOCIATION

CONVERSATION

Gruhl et al., The Predictive Power of Online Chatter, KDD 2005

Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003

Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006

Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006

Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

BLOGSCOPE

CRAWLER RUNNING 24x7

TRACKING 9M BLOGS

INDEXING 70M ARTICLES

AGGREGATION AND PREPROCESSING

INTERACTIVE SEARCH AND ANALYSIS

ANY STREAMING TEXT SOURCE

NEWS

MAILING LISTS

FORUMS

SOCIAL MEDIA

www.blogscope.net

HotKeywords

RelatedTerms

PopularityCurve

SearchResults

GeoSearch

Hawaii Earthquake

Taiwan Undersea Earthquake

Sumatra Earthquake

December 15 2006

March 06 2007

IPHONE ON JAN 09 2007

Curves are usually correlated, except at one point

TECHNIQUES

CRAWLS RSS FEEDS

250 THOUSAND NEW POSTS DAILY

PING SERVER: WEBLOGS.COM
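A feed crawler ultimately has to turn each RSS document into a list of posts. The sketch below is only an illustration of that step using Python's standard library; the sample feed, the `parse_posts` helper, and the choice of fields are assumptions and are not taken from BlogScope.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a fetched feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <pubDate>Tue, 09 Jan 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <pubDate>Wed, 10 Jan 2007 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_posts(feed_xml):
    """Extract title, link, and publication date from every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return posts


for post in parse_posts(SAMPLE_FEED):
    print(post["published"], post["title"], post["link"])
```

A real crawler would also need fetching, deduplication, and handling of Atom feeds, none of which is shown here.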

[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007

[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004

[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006

LINK BASED ANALYSIS IS NOT EFFECTIVE

SPAMMERS ARE INTELLIGENT

WE USE HEURISTICS

ONGOING BATTLE

INTERACTIVE APPLICATION

TWO SECOND RESPONSE TIME

HUGE AMOUNTS OF DATA

SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY

SCALABILITY

BURST DETECTION

[Kleinberg] Bursty and Hierarchical Structures in Streams, DMKD 2007

[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

POPULARITY = BASE + ZERO MEAN GAUSSIAN

BURST = STATISTICAL OUTLIER

$x \sim \mu + \mathcal{N}(0, \sigma^2)$

$x > \mu + 2\sigma$
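The outlier view of a burst can be sketched in a few lines: estimate the base popularity and its spread from a trailing window of daily counts, then flag any day that sits well above that estimate. The window length, the threshold `k`, the function name `find_bursts`, and the sample counts are all assumptions made for illustration, not the system's actual settings.

```python
from statistics import mean, pstdev


def find_bursts(counts, window=7, k=2.0):
    """Return indices of days whose count exceeds the trailing-window
    mean by more than k standard deviations."""
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)       # estimate of the base popularity
        sigma = pstdev(history)  # estimate of the Gaussian noise spread
        if counts[i] > mu + k * sigma:
            bursts.append(i)
    return bursts


# Hypothetical daily post counts for one keyword; day 7 is a spike.
daily_counts = [10, 12, 11, 9, 10, 11, 10, 48, 12, 10]
print(find_bursts(daily_counts))  # → [7]
```

Note that the spike itself inflates the window statistics for the following days, so a production detector would probably want to exclude flagged days from the history.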

IDENTIFYING RELATED TERMS

COLLOCATIONS

POINTWISE MUTUAL INFORMATION

EXPENSIVE

[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis

[Manning and Schütze] Foundations of Statistical Natural Language Processing

[Church and Hanks] Word Association Norms, Mutual Information and Lexicography, ACL 1989

$$\mathrm{score}(a, b) = \frac{P(a \in D \mid b \in D)}{P(a \in D)} = \frac{P(b \in D \mid a \in D)}{P(b \in D)}$$
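As a rough illustration of the association score, the sketch below counts document-level co-occurrence directly and returns the ratio P(a in D | b in D) / P(a in D); pointwise mutual information is the logarithm of this ratio. The toy documents and the name `association_score` are assumptions. The full pass over every document is what makes the exact computation expensive at blog scale.

```python
def association_score(docs, a, b):
    """Ratio P(a in D | b in D) / P(a in D), with each document a set of terms."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_a == 0 or n_b == 0:
        return 0.0  # score is undefined when a term never occurs; use 0 here
    p_a = n_a / n
    p_a_given_b = n_ab / n_b
    return p_a_given_b / p_a


# Hypothetical mini-corpus of four posts.
docs = [
    {"iphone", "apple", "launch"},
    {"iphone", "apple"},
    {"apple", "pie", "recipe"},
    {"earthquake", "hawaii"},
]

print(association_score(docs, "iphone", "apple"))
print(association_score(docs, "apple", "iphone"))   # same value: the score is symmetric
print(association_score(docs, "iphone", "hawaii"))  # never co-occur
```

A score above 1 means the two terms appear together more often than independence would predict.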

FAST COMPUTATION OF RELATED TERMS

RANDOM SAMPLE

MUTUAL INFORMATION IN EXPECTATION

USE TF WITH PRECOMPUTED IDF

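$$s(t, q) = \frac{P(q \in d,\ t \in d)}{P(t \in d)\, P(q \in d)}$$

$$s(t, q) \propto \frac{|\{d \mid d \in D_q,\ t \in d\}| \cdot |D|}{|\{d \mid t \in d\}|}$$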
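One way to read the three points above: draw a random sample of the documents matching the query, count how many sampled documents contain each term (TF), and weight that count by an inverse document frequency computed ahead of time over the whole corpus. The sketch below follows that reading; the sample size, the log-based IDF, and the helper names are assumptions rather than the published method.

```python
import math
import random
from collections import Counter


def precompute_idf(corpus):
    """Offline step: IDF of every term over the full corpus."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def related_terms(corpus, idf, query, sample_size=100, top_k=3, seed=0):
    """Query-time step: score terms by sampled TF times precomputed IDF."""
    matches = [doc for doc in corpus if query in doc]
    rng = random.Random(seed)
    if len(matches) > sample_size:
        matches = rng.sample(matches, sample_size)
    tf = Counter(t for doc in matches for t in set(doc) if t != query)
    scores = {t: count * idf[t] for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical corpus: "today" is common everywhere, "apple" is specific.
corpus = [
    ["iphone", "apple", "today"],
    ["iphone", "apple", "today"],
    ["today", "weather"],
    ["today", "news"],
]
idf = precompute_idf(corpus)
print(related_terms(corpus, idf, "iphone"))  # → ['apple', 'today']
```

Only the sampled matches are scanned at query time, which is what keeps the response interactive; the IDF table absorbs the corpus-wide cost offline.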

COMPUTING HOT KEYWORDS

POPULAR DOES NOT MEAN HOT

INTERESTING = SURPRISING

MIXTURE OF DIFFERENT SCORING FUNCTIONS

DEVIATION FROM EXPECTED
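The distinction between popular and hot can be made concrete with a single deviation-from-expected score: compare today's count for a keyword with what its own history predicts, scaled by how much it normally fluctuates. This is just one plausible scoring function, not the mixture the slides refer to; the smoothing constant, function names, and counts are assumptions.

```python
from statistics import mean, pstdev


def hotness(history, today, smoothing=1.0):
    """Deviation of today's count from the historical mean, in units of
    historical spread (smoothing avoids division by zero)."""
    expected = mean(history)
    spread = pstdev(history)
    return (today - expected) / (spread + smoothing)


def hot_keywords(history_by_term, today_by_term, top_k=2):
    """Rank keywords by how surprising today's count is."""
    scores = {
        term: hotness(history, today_by_term.get(term, 0))
        for term, history in history_by_term.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical counts: "weather" is the most popular, "iphone" the most surprising.
history = {
    "weather": [900, 1000, 1100, 1000],
    "iphone": [20, 30, 25, 25],
    "recipe": [200, 210, 190, 200],
}
today = {"weather": 1050, "iphone": 400, "recipe": 195}
print(hot_keywords(history, today))  # → ['iphone', 'weather']
```

Here "weather" has by far the largest raw count yet ranks below "iphone", whose count jumped far beyond its usual range.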

INTELLIGENT ALERT SERVICE

BURST SYNOPSIS

AUTHORITATIVE RANKING

Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.

Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

JUST THE BEGINNING

Source: xkcd.com

THANK YOU. QUESTIONS?