ECE 417 Guest Lecture
Ranking in Heterogeneous Networks
Min-Hsuan Tsai
Apr 23, 2013
Homogeneous (Info) Network
• Information network
  – Node: an entity
  – Edge: a relationship between entities
• Homogeneous info network
  – Nodes are all of the same kind
    • Social network (friendships, co-authorships, …)
    • Webpage network
    • Citation network
    • Taxonomy
[Figure: example homogeneous network with nodes A–E]
Heterogeneous (Info) Network
• Multiple types of nodes
  – Social media network
  – Bibliographic network
  – Medical network
  – …
[Figure: example heterogeneous network with nodes A–E of different types]
Ranking in Homogeneous Network
• PageRank for the webpage network:
  – PageRank tries to capture page "popularity"
  – Intuitions (cf. the impact factor index for journal papers):
    • Links are like citations in the literature
    • A page that is cited often can be expected to be more useful in general
  – PageRank is essentially "citation voting", but improves over simple voting:
    • "Soft voting": each page spreads its vote out evenly over the pages it links to
    • "Indirect citation": being cited by a highly cited page counts for more
    • Smoothing of citations: every page is assumed to have a non-zero citation count
Slide courtesy of Prof. ChengXiang Zhai
PageRank as a Random Surfer
• PageRank of a page is the probability of arriving at that page after a large number of random clicks
• At any page,
  1. with probability α, randomly pick a link to follow
  2. with probability (1−α), randomly jump to another page
  α is called the damping factor
• Given
  – p_t(d_i) = probability of visiting page d_i at time t
  – M_ij = probability of going from d_i to d_j (row-stochastic transition matrix, Σ_j M_ij = 1)
• The probability of visiting page d_j at time t+1 is

    p_{t+1}(d_j) = α Σ_{i=1}^{N} M_ij p_t(d_i) + (1−α) Σ_{i=1}^{N} (1/N) p_t(d_i)

  (first term: reach d_j via following a link; second term: reach d_j via a random jump)
• Example with four pages d1–d4 (d1→{d3, d4}, d2→{d3}, d3→{d4}, d4→{d1, d2}):

    M = [  0    0   1/2  1/2
           0    0    1    0
           0    0    0    1
          1/2  1/2   0    0  ]
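As a sketch, one step of this damped update can be written directly from the formula (the 4-page example matrix and α = 0.8 follow the lecture's example; note that Σ_i (1/N) p_t(d_i) = 1/N when p_t sums to 1):

```python
N = 4
alpha = 0.8  # damping factor: probability of following a link

# Row-stochastic transition matrix: M[i][j] = Pr(d_i -> d_j)
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

def step(p):
    """p_{t+1}(d_j) = alpha * sum_i M[i][j] * p_t(d_i) + (1 - alpha)/N."""
    return [alpha * sum(M[i][j] * p[i] for i in range(N)) + (1 - alpha) / N
            for j in range(N)]

p = step([0.25] * N)   # one update from the uniform distribution
print(p)               # -> [0.15, 0.15, 0.35, 0.35]; still sums to 1
```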
PageRank as a Random Surfer
• With α = 0.8, the update in matrix form is

    p_{t+1} = 0.8 M^T p_t + 0.2 · (1/4) · e        (e: a vector of 1's)

  with, for the four-page example,

    M = [  0    0   1/2  1/2
           0    0    1    0
           0    0    0    1
          1/2  1/2   0    0  ]

• Starting with ANY initial p_0, p_t will converge to

    p = (0.1886, 0.1886, 0.2763, 0.3465)^T

  the probability of visiting each page after a long time (the PageRank score)
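The convergence claim is easy to check numerically; this sketch runs the damped update on the 4-page example from a deliberately non-uniform start:

```python
N, alpha = 4, 0.8
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

def step(p):
    # p_{t+1}(d_j) = alpha * sum_i M[i][j] p_t(d_i) + (1 - alpha)/N
    return [alpha * sum(M[i][j] * p[i] for i in range(N)) + (1 - alpha) / N
            for j in range(N)]

p = [1.0, 0.0, 0.0, 0.0]   # all mass on d1 to start
for _ in range(100):       # contraction factor <= alpha, so 100 steps is plenty
    p = step(p)
print([round(x, 4) for x in p])   # -> [0.1886, 0.1886, 0.2763, 0.3465]
```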
PageRank as a Markov Chain
• Rewriting the update with A = αM + (1−α)E:

    p_{t+1}(d_j) = Σ_{i=1}^{N} [αM_ij + (1−α)/N] p_t(d_i)

    p_{t+1} = (αM + (1−α)E)^T p_t = A^T p_t

• E is a stochastic matrix whose entries are all 1/N
• A is still a stochastic matrix
• PageRank is a finite Markov chain (a time-homogeneous Markov chain with a finite state space)
  – The states are the pages
  – p_t is the probability distribution over pages visited at time t
  – Transitions obey the Markov property: A_ij = Pr(X_{t+1} = d_j | X_t = d_i)
PageRank as a Markov Chain
• A finite Markov chain is irreducible if
  – There is a path from every node to every other node (strongly connected)
[Figure: an irreducible (strongly connected) chain vs. a chain that is not irreducible]
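Irreducibility can be checked mechanically: the chain's graph is strongly connected iff node 0 reaches every node in both the graph and its reverse. A small sketch (the example adjacency list is the lecture's 4-page graph, 0-indexed):

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from `start` by BFS."""
    seen, q = {start}, deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def is_irreducible(adj):
    """Strong connectivity via forward + reverse reachability from node 0."""
    n = len(adj)
    rev = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            rev[v].append(u)
    return len(reachable(adj, 0)) == n and len(reachable(rev, 0)) == n

# d1->{d3,d4}, d2->{d3}, d3->{d4}, d4->{d1,d2}
print(is_irreducible([[2, 3], [2], [3], [0, 1]]))   # -> True
```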
PageRank as a Markov Chain
• A state i in a finite Markov chain is aperiodic if
  – The greatest common divisor of all its return-cycle lengths is 1
  – Periodicity: k = gcd{n: Pr(X_n = i | X_0 = i) > 0}
  – A finite Markov chain is aperiodic if all of its states are aperiodic
• Example (two states 1, 2 with edges 1→2 and 2→1): periodicity is 2
  – X_0 = 1: 121212… returns to 1 at times {2, 4, 6, 8, …}, k = gcd{2, 4, 6, 8, …} = 2
  – X_0 = 2: 212121… returns to 2 at times {2, 4, 6, 8, …}, k = 2
• Example with a self-loop added at state 2: aperiodic
  – X_0 = 2: returns to 2 at times {1, 2, 3, …} (22…, 212…, …), k = 1
  – X_0 = 1: returns to 1 at times {2, 3, 4, 5, …} (121, 1221, 12221, …), k = gcd{2, 3, 4, 5, …} = 1
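The periodicity definition above translates directly into code: track which states are reachable in exactly k steps and take the gcd of the return times (truncating the infinite set at n_max, which suffices for small chains like these):

```python
from math import gcd

def period(adj, i, n_max=20):
    """gcd of return times to state i, over walks of length 1..n_max."""
    cur = {i}       # states reachable from i in exactly k steps
    k_gcd = 0       # gcd(0, k) == k, so this seeds the running gcd
    for k in range(1, n_max + 1):
        cur = {v for u in cur for v in adj[u]}
        if i in cur:
            k_gcd = gcd(k_gcd, k)
    return k_gcd

two_cycle = [[1], [0]]       # 1 <-> 2: returns at 2, 4, 6, ...
with_loop = [[1], [0, 1]]    # same, plus a self-loop on state 2
print(period(two_cycle, 0))  # -> 2
print(period(with_loop, 0))  # -> 1
```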
PageRank as a Markov Chain
• Stationary distribution p
  – Reached when the distribution no longer changes: p_{t+1} = p_t = p, i.e., p^T = p^T A
  – p captures the long-run probability of visiting each page (the PageRank score)
  – If a finite Markov chain is irreducible and aperiodic, then the largest eigenvalue of the transition matrix equals 1 and all other eigenvalues are strictly smaller in magnitude (Perron–Frobenius theorem)
  – Hence there exists a unique stationary distribution
    • The stationary distribution is the left eigenvector of the transition matrix A corresponding to eigenvalue 1
PageRank as a Markov Chain
• E guarantees that the Markov chain is irreducible and aperiodic
  – A unique stationary distribution exists!
• For the four-page example, with A = 0.8 M + (1 − 0.8) E:

    M = [  0    0   1/2  1/2        E = [ 1/4  1/4  1/4  1/4
           0    0    1    0               1/4  1/4  1/4  1/4
           0    0    0    1               1/4  1/4  1/4  1/4
          1/2  1/2   0    0  ]            1/4  1/4  1/4  1/4 ]

    A = [ 0.05  0.05  0.45  0.45
          0.05  0.05  0.85  0.05
          0.05  0.05  0.05  0.85
          0.45  0.45  0.05  0.05 ]

    p = (0.1886, 0.1886, 0.2763, 0.3465): the left eigenvector of A corresponding to eigenvalue 1
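A quick sanity check: building A = 0.8·M + 0.2·E and multiplying the claimed stationary vector on the left should give the vector back (up to the rounding of the printed scores):

```python
N, alpha = 4, 0.8
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

# A = alpha*M + (1 - alpha)*E, where every entry of E is 1/N
A = [[alpha * M[i][j] + (1 - alpha) / N for j in range(N)] for i in range(N)]

p = [0.1886, 0.1886, 0.2763, 0.3465]     # claimed stationary distribution
pA = [sum(p[i] * A[i][j] for i in range(N)) for j in range(N)]
print(pA)   # very close to p itself: p is a left eigenvector for eigenvalue 1
```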
PageRank by hand
• Two ways to calculate the PageRank scores (let α = 1 for now; make sure the chain is irreducible and aperiodic)
• Example: two pages, 1→2 and 2→{1, 2}:

    A = M = [  0    1
              1/2  1/2 ]

• Solve directly (find the left eigenvector corresponding to eigenvalue 1):

    p_1 = (1/2) p_2,   p_2 = p_1 + (1/2) p_2,   p_1 + p_2 = 1
    ⇒ p = (1/3, 2/3) ≈ (0.3333, 0.6667)

• Power method (repeated squaring of A converges to a matrix whose rows are p):

    A^2 = [ 0.5   0.5       A^4 = [ 0.375   0.625
            0.25  0.75 ]            0.3125  0.6875 ]

    A^8 = [ 0.3359  0.6641      A^16 = [ 0.3333  0.6667
            0.3320  0.6680 ]             0.3333  0.6667 ]
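The power method can equivalently be run as repeated vector updates p ← pA. A sketch of the two-page example with exact fractions, so the direct solution and the iteration can be compared:

```python
from fractions import Fraction as F

# Two-page example: A = M = [[0, 1], [1/2, 1/2]] (alpha = 1)
A = [[F(0), F(1)], [F(1, 2), F(1, 2)]]

# Direct solve of p = pA with p1 + p2 = 1:
#   p1 = (1/2) p2  and  p2 = p1 + (1/2) p2  =>  p2 = 2*p1  =>  p = (1/3, 2/3)
p_direct = (F(1, 3), F(2, 3))

# Power method: repeatedly apply p <- pA from the uniform start
p = [F(1, 2), F(1, 2)]
for _ in range(50):
    p = [p[0] * A[0][j] + p[1] * A[1][j] for j in range(2)]

print(float(p[0]), float(p[1]))   # approaches 1/3, 2/3
```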
Extensions of PageRank
• M can be any stochastic matrix
  – It does not have to be uniformly distributed over out-links
  – It can be based on content similarity (e.g., VisualRank):

      M_ij = sim(I_i, I_j) / Z_i

    where Z_i is the normalization factor that makes M row-stochastic
• Example with similarity-weighted edges (weights 0.7, 0.3, 0.2, 0.8):

    M = [  0    0   0.7  0.3
           0    0    1    0
           0    0    0    1
          0.8  0.2   0    0  ]
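Building such a similarity-based M is a one-liner per row; the pairwise similarity values below are made up purely for illustration:

```python
# sim[i][j] = content similarity between images I_i and I_j (illustrative values)
sim = [[0.0, 0.2, 0.7, 0.3],
       [0.2, 0.0, 0.9, 0.1],
       [0.7, 0.9, 0.0, 0.4],
       [0.3, 0.1, 0.4, 0.0]]

M = []
for row in sim:
    Z = sum(row)                      # normalization factor Z_i
    M.append([s / Z for s in row])    # M_ij = sim(I_i, I_j) / Z_i

for row in M:
    print([round(x, 3) for x in row], "sum =", round(sum(row), 6))
```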
Extensions of PageRank
• E can be any stochastic matrix as well
  – It does not have to be uniformly distributed, e.g., topic-sensitive PageRank, where the random jump is biased toward topic-relevant pages:

      E = e v^T

    with v a (non-uniform) jump probability vector
Heterogeneous (Info) Network
• An instance: social media network

  Content
    – Image domain: photos or videos shared by the actors
    – Text domain: tags or comments attached to the images
    – Social domain: users who share, tag, or comment on the images; user groups whose members favor images
  Homogeneous edges
    – Image domain: content-based visual similarity to other image nodes
    – Text domain: semantic similarities between text contents
    – Social domain: members attending the same group
  Heterogeneous edges
    – Images may be described, tagged, or commented on (I-T links) by users (I-A and T-A links), or favored by user groups (I-A links)
Ranking in Heterogeneous Network
• Problems with directly applying a PageRank-like algorithm to a heterogeneous network
  – Edges are of different types of measurements
    • How do we handle an image node with an edge to another image node as well as an edge to a user node via a "favor" link?
  – Cross-domain edges are usually sparse
    • A random surfer would easily get trapped in one domain
    • Meaning slow convergence!
[Figure: Image 1 —similar— Image 2; User 1 —like— Image 2]
Our approach
• Decomposed Heterogeneous Network
  – A heterogeneous network can be decomposed into homogeneous sub-networks connected by heterogeneous links
  – We can then employ homogeneous network analysis with a PageRank-like algorithm within each sub-network
  – The heterogeneous links are used for knowledge propagation
• Augmented Similarity Function (ASF)
  – Content-based similarity + link-based similarity
  – Exploits the heterogeneous linkage for knowledge propagation
  – Extends the idea that two objects are similar if they are linked to similar objects
  – May also consider the relevance of each object to the query
[Figure: decomposition of a heterogeneous network into homogeneous sub-networks connected by heterogeneous links]
Augmented similarity function
• Content-based similarity
  – A (homogeneous) similarity metric based on the content of the two objects:
      s1 = g1(Ca, Cb)
• Link-based similarity
  – Based on linkages of the two objects to the same object c:
      s2 = g2(l_ac, l_bc), or with relevance importance: s2 = g2(l_ac, l_bc, r_c)
  – Based on linkages of the two objects to different objects d, e, weighted by the similarity of the linked objects:
      s3 = g3(S_de, l_ad, l_be), or with relevance importance: s3 = g3(S_de, l_ad, l_be, r_d, r_e)
• Combined augmented similarity:
      S_ab = f(s1, s2, s3)
[Figure: objects a, b with contents Ca, Cb; cross-domain links l_ac, l_bc to c and l_ad, l_be to d, e; relevance importance scores r_c, r_d, r_e]
Algorithms - SocialRank
• For each homogeneous domain
  – Augmented similarities obtained from the heterogeneous domains
  – Content-based linkage with a query-sensitive random-surfer model:

      p_{t+1} = α (S~ D^{-1}) p_t + (1 − α) z

    (S~: augmented similarity matrix, D: normalization matrix, z: query-biased vector)
• Iterative improvement of the augmented similarities across the different homogeneous sub-networks
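A minimal sketch of this per-domain random walk with restart. The similarity values, the restart vector z, and normalizing by column sums (so that S~ D^{-1} is column-stochastic) are all illustrative assumptions; the slide's actual g-functions for building S~ are not reproduced here:

```python
N, alpha = 3, 0.8
# S~: made-up symmetric augmented-similarity matrix (illustrative)
S_tilde = [[0.0, 0.5, 0.2],
           [0.5, 0.0, 0.3],
           [0.2, 0.3, 0.0]]
# D: diagonal of column sums, so W = S~ D^{-1} has columns summing to 1
col_sums = [sum(S_tilde[i][j] for i in range(N)) for j in range(N)]
z = [0.6, 0.3, 0.1]   # query-biased restart distribution (illustrative)

p = [1.0 / N] * N
for _ in range(200):
    # p_{t+1} = alpha * (S~ D^-1) p_t + (1 - alpha) * z
    p = [alpha * sum(S_tilde[i][j] / col_sums[j] * p[j] for j in range(N))
         + (1 - alpha) * z[i] for i in range(N)]
print([round(x, 3) for x in p])
```

Because the walk matrix is column-stochastic and z is a distribution, p remains a probability distribution at every step.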
Query-biased vector
• To make the random walk query-sensitive, we use a biased vector z
  – The biased vector influences the final probability distribution by letting the random walk restart from z with probability (1 − α):

      p_{t+1} = α (S~ D^{-1}) p_t + (1 − α) z

• For a query with keywords Q, the biased vector is the semantic similarity between the query and the content of the text nodes:

      z_j = g(Σ_{q∈Q} sim(q, v_j))

    (v_j: the text content of node j)
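A toy version of this bias vector, where the keyword/node similarity sim() is stood in for by simple word overlap and g() by normalization into a distribution (both are illustrative assumptions, not the slide's actual choices):

```python
query = {"sunset", "beach"}              # query keywords Q
node_text = [
    {"sunset", "sea", "beach"},          # text content v_0
    {"cat", "kitten"},                   # text content v_1
    {"beach", "volleyball"},             # text content v_2
]

# z_j proportional to sum over keywords of sim(q, v_j); here sim = overlap count
raw = [sum(1 for q in query if q in words) for words in node_text]
total = sum(raw)
z = [r / total for r in raw]             # normalize into a distribution
print([round(x, 3) for x in z])          # -> [0.667, 0.0, 0.333]
```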
Experimental results – Dataset collection
• No existing image dataset contains a social linkage structure
• We crawled the Flickr site to construct the Flickrgroup dataset
  – Groups in Flickr are communities of people who share an interest in a target subject
  – Group members favor photos that are closely related to the target subject
Experimental results – Dataset collection
• Statistics of the Flickrgroup dataset
  – 140 user groups
  – 118,000 images favored by those 140 user groups
    • Encoded with 10^6 visual words
    • Similarity measured by
      – COT: number of co-occurring terms
      – TF: term frequency of co-occurring terms
      – TF-IDF: term frequency of co-occurring terms penalized by document frequency
  – 150,000 unique tags associated with these images
    • The top 5,000 most frequent tags were selected as the codebook tags
Evaluation Measures for Ranking
• Precision
  – The fraction of retrieved documents that are relevant to the query
• Recall
  – The fraction of documents relevant to the query that are successfully retrieved
Evaluation Measures for Ranking
• Precision-recall curve
  – For each number of retrieved documents (N), we obtain a (precision, recall) pair
  – By varying N so that recall ranges from 0 to 1, we can connect the (precision, recall) pairs into a curve
• AP
  – Average of precision over recall values from 0 to 1
  – The area under the precision-recall curve
• AP@K
  – Average of precision with the cut-off N ranging from 1 to K
  – Often used when the number of relevant documents is too large
• mAP(@K)
  – Average of AP(@K) over a number of queries
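One common concrete form of AP averages the precision at each rank where a relevant document appears; a sketch on a toy ranked list of relevance labels:

```python
def average_precision(relevance, total_relevant=None):
    """AP for a ranked list of 0/1 relevance labels (best rank first)."""
    total_relevant = total_relevant or sum(relevance)
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k        # precision at this cut-off
    return score / total_relevant

ranked = [1, 0, 1, 1, 0]             # toy ranking: relevant at ranks 1, 3, 4
print(average_precision(ranked))     # (1/1 + 2/3 + 3/4) / 3 ~ 0.806
```

Truncating the loop at rank K (and averaging over queries) gives AP@K and mAP@K as described above.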
[Figure: a Flickr group with its members and its favored image pool]
Experimental results – Image Ranking (I)
• An 11.5% improvement (in AP@100) over the state-of-the-art ranking methods
Experimental results – Image Ranking (II)
• Consistent improvements across various levels of recall and numbers of retrieved images
Experimental results – Image Ranking (III)
[Figure: top-ranked results for the queries "Cat", "Bird", and "Car"; upper row: VisualRank, bottom row: SocialRank]