New Data Mining Techniques · 2016. 11. 11. · Data Mining Techniques CS 6220 - Section 3 - Fall...

Date post: 16-Oct-2020
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 17: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)
Transcript
Data Mining Techniques
CS 6220 - Section 3 - Fall 2016

Lecture 17: Link Analysis
Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Graph Data: Media Networks

Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)


Schedule Updates

Web search before PageRank

• Human-curated (e.g. Yahoo, Looksmart)
  • Hand-written descriptions
  • Wait time for inclusion

• Text-search (e.g. WebCrawler, Lycos)
  • Prone to term-spam


Web as a Directed Graph


PageRank: Links as Votes

Not all pages are equally important

• Pages with more inbound links are more important
• Inbound links from important pages carry more weight

[Figure labels: many inbound links vs. few/no inbound links; links from unimportant pages vs. links from important pages]

Example: PageRank Scores

[Figure: example graph with PageRank scores B 38.4, C 34.3, E 8.1, D 3.9, F 3.9, A 3.3, and five unlabeled nodes at 1.6 each]

Simple Recursive Formulation

• A link’s vote is proportional to the importance of its source page

• If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes

• Page j’s own importance is the sum of the votes on its in-links

[Figure: page j with in-links from page i and page k, and three out-links each carrying $r_j/3$]

$r_j = r_i/3 + r_k/4$

Equivalent Formulation: Random Surfer

• At time t a surfer is on some page i
• At time t+1 the surfer follows a link to a new page at random
• Define rank $r_i$ as fraction of time spent on page i

[Figure: page j with in-links from page i and page k, and three out-links each carrying $r_j/3$]

$r_j = r_i/3 + r_k/4$

PageRank: The “Flow” Model

$r_j = \sum_{i \to j} \frac{r_i}{d_i}$

[Figure: three-page graph. y links to y and a; a links to y and m; m links to a.]

“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

• 3 equations, 3 unknowns
• Impose constraint: $r_y + r_a + r_m = 1$
• Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
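The flow equations plus the normalization constraint form a small linear system that can be solved exactly. The following is a minimal sketch using only the Python standard library; the helper name `solve` and the variable ordering (y, a, m) are illustrative choices, not part of the slides.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns ordered (r_y, r_a, r_m). The three flow equations are linearly
# dependent, so the third one is replaced by the constraint r_y + r_a + r_m = 1.
A = [
    [F(1, 2) - 1, F(1, 2), F(0)],  # r_y = r_y/2 + r_a/2
    [F(1, 2), F(-1), F(1)],        # r_a = r_y/2 + r_m
    [F(1), F(1), F(1)],            # r_y + r_a + r_m = 1
]
b = [F(0), F(0), F(1)]
r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)
```

Exact fractions are used so the result can be compared directly with the solution stated on the slide.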

PageRank: The “Flow” Model

Matrix formulation of the flow equations: $r = M \cdot r$

Matrix M is stochastic (i.e. columns sum to one)
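As a sketch of how M could be assembled for the three-page example, the code below builds M column by column and checks both properties: columns sum to one, and the flow solution is a fixed point of M. The dictionary `out_links` and the ordering (y, a, m) are assumptions made for illustration.

```python
from fractions import Fraction as F

# Out-links of the three-page example: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]
index = {p: k for k, p in enumerate(pages)}
n = len(pages)

# M[j][i] = 1/d_i if page i links to page j, and 0 otherwise.
M = [[F(0)] * n for _ in range(n)]
for src, dests in out_links.items():
    for dst in dests:
        M[index[dst]][index[src]] = F(1, len(dests))

column_sums = [sum(M[j][i] for j in range(n)) for i in range(n)]

# The flow solution (2/5, 2/5, 1/5) should satisfy r = M r.
r = [F(2, 5), F(2, 5), F(1, 5)]
Mr = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
print(column_sums, Mr)
```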

PageRank: Eigenvector Problem

• PageRank: Solve for eigenvector $r = M r$ with eigenvalue λ = 1

• Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if $a = M b$ then $\sum_i a_i = \sum_i b_i$)

• Problem: There are billions of pages on the internet. How do we solve for an eigenvector with on the order of $10^{10}$ elements?
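A short sketch of why λ = 1 is always an eigenvalue, using only the column-sum property. Here $\mathbf{1}$ denotes the all-ones vector, a symbol introduced for this note and not used on the slides.

```latex
\mathbf{1}^\top M = \mathbf{1}^\top
\quad\Longrightarrow\quad
\sum_j a_j = \mathbf{1}^\top M b = \mathbf{1}^\top b = \sum_j b_j ,
\qquad
M^\top \mathbf{1} = \mathbf{1}
\quad\Longrightarrow\quad
\det(M - I) = \det(M^\top - I) = 0 .
```

The first implication is the conservation property quoted in the bullet. The second shows that 1 is an eigenvalue of $M^\top$ and therefore of $M$, since a matrix and its transpose have the same characteristic polynomial.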

PageRank: Power Iteration

Model for random surfer:

• At time t = 0 pick a page at random
• At each subsequent time t follow an outgoing link at random

Probabilistic interpretation:

PageRank: Power Iteration

[Figure: three-page example graph (y, a, m)]

$p_t$ converges to $r$. Iterate until $|p_t - p_{t-1}| < \varepsilon$
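A minimal sketch of power iteration on the three-page example, assuming the standard update $p_t = M p_{t-1}$, a uniform starting vector, and an L1 norm for the stopping test (the slide does not say which norm is meant). The function name and defaults are illustrative.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate p_t = M p_{t-1} from a uniform start until the L1 change is below eps."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        p_next = [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(p_next, p)) < eps:
            return p_next
        p = p_next
    return p

# Column-stochastic matrix for y -> {y, a}, a -> {y, m}, m -> {a}.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print(r)
```

Each iteration costs one matrix-vector product, which is why this approach scales to very large sparse M where a direct eigen-solver would not.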

Aside: Ergodicity

• PageRank assumes a random walk model for individual surfers

• Equivalent assumption: flow model in which equal fractions of surfers follow each link at every time

• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk

Averaging over individuals is equivalent to averaging a single individual over time
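The time-average side of this statement can be checked empirically. The sketch below simulates one surfer on the three-page example and records the fraction of time spent on each page; the function name, step count, and seed are arbitrary choices for illustration.

```python
import random

# Three-page example: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def time_average(start, steps, seed=0):
    """Fraction of time a single random surfer spends on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(out_links[page])  # follow an outgoing link at random
    return {p: count / steps for p, count in visits.items()}

freq = time_average("y", 200_000)
print(freq)
```

For a long enough walk the frequencies should land close to the flow-model equilibrium (2/5, 2/5, 1/5), up to sampling noise.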

PageRank: Problems

1. Dead Ends
• Nodes with no outgoing links.
• Where do surfers go next?

2. Spider Traps
• Subgraph with no outgoing links to wider graph
• Surfers are “trapped” with no way out.

[Figure labels: dead end, spider trap]

Power Iteration: Dead Ends

[Figure: example graph in which m has no outgoing links. y links to y and a; a links to y and m.]

Probability not conserved
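A small sketch of the leak, assuming the example graph is y -> {y, a}, a -> {y, m}, with m a dead end. The column of M for m is then all zeros, so each step of power iteration discards whatever probability currently sits on m.

```python
# Columns are ordered (y, a, m). The m column is zero because m has no
# out-links, so that column no longer sums to one.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

p = [1 / 3, 1 / 3, 1 / 3]
totals = []  # total probability mass after each step
for _ in range(50):
    p = [sum(M[j][i] * p[i] for i in range(3)) for j in range(3)]
    totals.append(sum(p))

print(totals[0], totals[-1])
```

The total mass shrinks at every step and tends to zero, so the iteration has no meaningful limit on this graph.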

Power Iteration: Dead Ends

[Figure: example graph with dead end m]

Fixes “probability sink” issue (teleport at dead ends)
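One way to realize "teleport at dead ends" is to replace every all-zero column of M by the uniform column 1/N, so a surfer on a dead end jumps to a random page. This is a sketch under that assumption; `patch_dead_ends` is a hypothetical helper name.

```python
def patch_dead_ends(M):
    """Return a copy of M in which all-zero columns are replaced by 1/N."""
    n = len(M)
    patched = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):  # page i has no out-links
            for j in range(n):
                patched[j][i] = 1.0 / n
    return patched

# y -> {y, a}, a -> {y, m}, m is a dead end (columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
M_fixed = patch_dead_ends(M)

p = [1 / 3, 1 / 3, 1 / 3]
for _ in range(100):
    p = [sum(M_fixed[j][i] * p[i] for i in range(3)) for j in range(3)]

print(p, sum(p))
```

With the patched matrix every column sums to one again, so the total probability stays at 1 and the iteration settles on a proper distribution.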

Power Iteration: Spider Traps

[Figure: example graph in which m links only to itself. y links to y and a; a links to y and m; m links to m.]

Probability accumulates in traps (surfers get stuck)

Solution: Random Teleports

Model for teleporting random surfer:

• At time t = 0 pick a page at random
• At each subsequent time t
  • With probability β follow an outgoing link at random
  • With probability 1-β teleport to a new initial location at random

PageRank Equation [Page & Brin 1998]

$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$
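The PageRank equation can be iterated directly. Below is a minimal sketch applied to a spider-trap graph (y -> {y, a}, a -> {y, m}, m -> {m}); the value β = 0.8, the function name, and the stopping tolerance are assumptions for illustration, and the code assumes every page has at least one out-link.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N.

    Assumes there are no dead ends (every page has an out-link).
    """
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        r_next = {p: (1.0 - beta) / n for p in pages}  # teleport term
        for src, dests in out_links.items():
            share = beta * r[src] / len(dests)          # beta * r_i / d_i
            for dst in dests:
                r_next[dst] += share
        if sum(abs(r_next[p] - r[p]) for p in pages) < eps:
            return r_next
        r = r_next
    return r

spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
r = pagerank(spider_trap, beta=0.8)
print(r)
```

Without teleports all probability would end up on m. With β = 0.8 the trap still gets the largest score (21/33), but y and a keep nonzero rank (7/33 and 5/33).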

Power Iteration: Teleports

[Figure: three-page example graph (y, a, m)]

(can use power iteration as normal)

Computing PageRank

• M is sparse - only store nonzero entries
• Space proportional roughly to number of links
• Say 10N entries, or 4 bytes * 10 * 1 billion = 40GB
• Still won’t fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
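A sketch of one update step driven by this row encoding, where each row is (source node, degree, destination nodes). The function name `sparse_update`, the value β = 0.8, and the three-node example rows are assumptions; the rows encode y -> {y, a}, a -> {y, m}, m -> {a} with y, a, m numbered 0, 1, 2.

```python
def sparse_update(rows, r_old, beta=0.8):
    """One PageRank step that reads M as (source, degree, destinations) rows."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in rows:          # one sequential pass over M
        for dst in dests:
            r_new[dst] += beta * r_old[src] / degree
    return r_new

rows = [
    (0, 2, [0, 1]),
    (1, 2, [0, 2]),
    (2, 1, [1]),
]
r_old = [1 / 3, 1 / 3, 1 / 3]
r_new = sparse_update(rows, r_old)
print(r_new)
```

Only the nonzero entries of M are ever touched, and the rows are read in order, which suits a matrix that is streamed from disk.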

Block-based Update Algorithm

• Break $r^{new}$ into k blocks that fit in memory
• Scan M and $r^{old}$ once for each block

src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4

[Figure: $r^{new}$ with entries 0 to 5 split into blocks {0, 1}, {2, 3}, {4, 5}; M; $r^{old}$ with entries 0 to 5]
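The block-based scheme can be sketched as follows, using the three rows shown in the table (nodes 3 to 5 have no rows in this excerpt, so they contribute nothing here). The function names, β = 0.8, and the block size of 2 are illustrative assumptions; `full_update` is included only as a reference to compare against.

```python
def full_update(rows, r_old, beta=0.8):
    """Reference: update all of r_new in one pass over M."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in rows:
        for dst in dests:
            r_new[dst] += beta * r_old[src] / degree
    return r_new

def block_update(rows, r_old, beta=0.8, block_size=2):
    """Update r_new one block at a time, scanning all of M for each block."""
    n = len(r_old)
    r_new = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        block = [(1.0 - beta) / n] * (stop - start)  # only this block is in memory
        for src, degree, dests in rows:               # full scan of M per block
            for dst in dests:
                if start <= dst < stop:
                    block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)                           # stands in for writing to disk
    return r_new

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
r_old = [1 / 6] * 6
blocked = block_update(rows, r_old)
reference = full_update(rows, r_old)
print(blocked)
```

The result matches the single-pass update; the price is k scans of M and $r^{old}$ instead of one.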

Block-Stripe Update Algorithm

Break M into stripes: Each stripe contains only destination nodes in the corresponding block of $r^{new}$

Stripe for $r^{new}$ block {0, 1}:
src   degree   destination
0     4        0, 1
1     3        0
2     2        1

Stripe for $r^{new}$ block {2, 3}:
src   degree   destination
0     4        3
2     2        3

Stripe for $r^{new}$ block {4, 5}:
src   degree   destination
0     4        5
1     3        5
2     2        4
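A sketch of how stripes could be built and used. The input rows below are an assumed example (source 0 with degree 4, sources 1 and 2 with degree 2), and `make_stripes`, `stripe_update`, β = 0.8, and the block size of 2 are hypothetical names and values chosen for illustration.

```python
def make_stripes(rows, n, block_size):
    """Split (src, degree, dests) rows into one stripe per block of r_new."""
    stripes = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        stripe = []
        for src, degree, dests in rows:
            kept = [d for d in dests if start <= d < stop]
            if kept:
                # degree stays the full out-degree of src, not len(kept)
                stripe.append((src, degree, kept))
        stripes.append((start, stop, stripe))
    return stripes

def stripe_update(stripes, r_old, beta=0.8):
    """Update each block of r_new by scanning only its own stripe."""
    n = len(r_old)
    r_new = []
    for start, stop, stripe in stripes:
        block = [(1.0 - beta) / n] * (stop - start)
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new

rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
stripes = make_stripes(rows, n=6, block_size=2)
r_new = stripe_update(stripes, [1 / 6] * 6)
print(stripes[1], r_new)
```

Each stripe repeats the source and degree columns, so the stripes together take somewhat more space than M, but every block update now reads only the entries it needs.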

First Spammers: Term Spam

• How do you make your page appear to be about movies?
  • (1) Add the word movie 1,000 times to your page
    • Set text color to the background color, so only search engines would see it
  • (2) Or, run the query “movie” on your target search engine
    • See what page came first in the listings
    • Copy it into your page, make it “invisible”

• These and similar techniques are term spam

Google’s Solution to Term Spam

• Believe what people say about you, rather than what you say about yourself
  • Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text

• PageRank as a tool to measure the “importance” of Web pages

Google vs. Spammers: Round 2!

• Once Google became the dominant search engine, spammers began to work out ways to fool Google

• Spam farms were developed to concentrate PageRank on a single page

• Link spam:
  • Creating link structures that boost PageRank of a particular page

Link Spamming

• Three kinds of web pages from a spammer’s point of view
  • Inaccessible pages
  • Accessible pages
    • e.g., blog comments pages
    • spammer can post links to his pages
  • Owned pages
    • Completely controlled by spammer
    • May span multiple domain names

Link Farms

• Spammer’s goal:
  • Maximize the PageRank of target page t

• Technique:
  • Get as many links from accessible pages as possible to target page t
  • Construct “link farm” to get PageRank multiplier effect

Link Farms

[Figure: link farm layout with inaccessible pages, accessible pages, and owned pages (target page t and farm pages 1, 2, ..., M)]

One of the most common and effective organizations for a link farm

Millions of farm pages
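The effect of farm size on the target's score can be explored with a toy simulation. Everything about the graph below is an assumption made for illustration: the ordinary pages form a ring, one of them (p0) plays the accessible page linking to t, t links to every farm page, and every farm page links back to t. β = 0.85 is likewise an arbitrary choice.

```python
def pagerank(out_links, beta=0.85, iters=200):
    """Plain power iteration with uniform teleports (no dead ends assumed)."""
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - beta) / n for p in pages}
        for src, dests in out_links.items():
            for dst in dests:
                nxt[dst] += beta * r[src] / len(dests)
        r = nxt
    return r

def web_with_farm(n_farm, n_other=20):
    """Hypothetical toy web: a ring of ordinary pages plus a farm around 't'."""
    web = {f"p{k}": [f"p{(k + 1) % n_other}"] for k in range(n_other)}
    web["p0"].append("t")                  # the accessible page links to the target
    farm = [f"farm{k}" for k in range(n_farm)]
    web["t"] = farm                        # target links to every farm page
    for page in farm:
        web[page] = ["t"]                  # every farm page links back to the target
    return web

small = pagerank(web_with_farm(10))["t"]
large = pagerank(web_with_farm(100))["t"]
print(small, large)
```

In this toy setting, growing the farm from 10 to 100 pages raises the target's score substantially even though only one outside page links to it.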

PageRank: Extensions

• Topic-specific PageRank:
  • Restrict teleportation to some set S of pages related to a specific topic
  • Set $p_{0i} = 1/|S|$ if $i \in S$, $p_{0i} = 0$ otherwise

• Trust Propagation
  • Use set S of trusted pages for teleport set
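A minimal sketch of topic-specific PageRank on the three-page example, where the teleport distribution is 1/|S| on the topic set and 0 elsewhere. The function name, β = 0.8, and the choice S = {m} are assumptions for illustration; the same distribution is also used as the starting vector.

```python
def topic_pagerank(out_links, topic_set, beta=0.8, iters=200):
    """PageRank in which teleports land only on pages in topic_set."""
    pages = list(out_links)
    teleport = {p: (1.0 / len(topic_set) if p in topic_set else 0.0) for p in pages}
    r = dict(teleport)                     # start from the teleport distribution
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * teleport[p] for p in pages}
        for src, dests in out_links.items():
            for dst in dests:
                nxt[dst] += beta * r[src] / len(dests)
        r = nxt
    return r

web = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
uniform = topic_pagerank(web, topic_set={"y", "a", "m"})  # ordinary PageRank
topical = topic_pagerank(web, topic_set={"m"})            # teleport only to m
print(uniform, topical)
```

Restricting teleports to S shifts rank toward S and the pages reachable from it. The same code covers trust propagation if S is taken to be a set of trusted pages.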

