ECE 417 Guest Lecture
Ranking in Heterogeneous Networks
Min-Hsuan Tsai
Apr 23, 2013
Homogeneous (Info) Network
• Information network
  – Node: an entity
  – Edge: a relationship between entities
• Homogeneous info network
  – Nodes are all of the same kind
    • Social network (friendships, co-authorships, …)
    • Webpage network
    • Citation network
    • Taxonomy
[Figure: example homogeneous network with nodes A–E]
Heterogeneous (Info) Network
• Multiple types of nodes
  – Social media network
  – Bibliographic network
  – Medical network
  – …
[Figure: example heterogeneous network with nodes A–E of different types]
Ranking in Homogeneous Network
• PageRank for the webpage network:
  – PageRank tries to capture page "popularity"
  – Intuitions (cf. the impact factor index for journal papers):
    • Links are like citations in the literature
    • A page that is cited often can be expected to be more useful in general
  – PageRank is essentially "citation voting", but improves over simple voting:
    • "Soft voting": each page spreads its vote out evenly over the pages it links to
    • "Indirect citation": being cited by a highly cited page counts for more
    • Smoothing of citations: every page is assumed to have a non-zero citation count
Slide courtesy of Prof. ChengXiang Zhai
PageRank as a Random Surfer
• PageRank of a page is the probability of arriving at that page after a large number of random clicks
• At any page,
  1. with probability α, randomly pick a link to follow
  2. with probability (1−α), randomly jump to another page
  α is called the damping factor
• Given
  – p_t(d_i) = probability of visiting page d_i at time t
  – M_ij = probability of going from d_i to d_j (row-stochastic transition matrix, Σ_j M_ij = 1)
• The probability of visiting page d_j at time t+1 is

    p_{t+1}(d_j) = α Σ_{i=1}^{N} M_ij p_t(d_i) + (1−α) Σ_{i=1}^{N} (1/N) p_t(d_i)

  (first term: reach d_j via following a link; second term: reach d_j via a random jump)
• Example with four pages d1–d4 (d1→{d3, d4}, d2→{d3}, d3→{d4}, d4→{d1, d2}):

    M = [  0    0   1/2  1/2
           0    0    1    0
           0    0    0    1
          1/2  1/2   0    0  ]
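As a sketch, one step of this damped update can be written directly from the formula (the 4-page example matrix and α = 0.8 follow the lecture's example; note that Σ_i (1/N) p_t(d_i) = 1/N when p_t sums to 1):

```python
N = 4
alpha = 0.8  # damping factor: probability of following a link

# Row-stochastic transition matrix: M[i][j] = Pr(d_i -> d_j)
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

def step(p):
    """p_{t+1}(d_j) = alpha * sum_i M[i][j] * p_t(d_i) + (1 - alpha)/N."""
    return [alpha * sum(M[i][j] * p[i] for i in range(N)) + (1 - alpha) / N
            for j in range(N)]

p = step([0.25] * N)   # one update from the uniform distribution
print(p)               # -> [0.15, 0.15, 0.35, 0.35]; still sums to 1
```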
PageRank as a Random Surfer
• With α = 0.8, the update in matrix form is

    p_{t+1} = 0.8 M^T p_t + 0.2 · (1/4) · e        (e: a vector of 1's)

  with, for the four-page example,

    M = [  0    0   1/2  1/2
           0    0    1    0
           0    0    0    1
          1/2  1/2   0    0  ]

• Starting with ANY initial p_0, p_t will converge to

    p = (0.1886, 0.1886, 0.2763, 0.3465)^T

  the probability of visiting each page after a long time (the PageRank score)
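The convergence claim is easy to check numerically; this sketch runs the damped update on the 4-page example from a deliberately non-uniform start:

```python
N, alpha = 4, 0.8
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

def step(p):
    # p_{t+1}(d_j) = alpha * sum_i M[i][j] p_t(d_i) + (1 - alpha)/N
    return [alpha * sum(M[i][j] * p[i] for i in range(N)) + (1 - alpha) / N
            for j in range(N)]

p = [1.0, 0.0, 0.0, 0.0]   # all mass on d1 to start
for _ in range(100):       # contraction factor <= alpha, so 100 steps is plenty
    p = step(p)
print([round(x, 4) for x in p])   # -> [0.1886, 0.1886, 0.2763, 0.3465]
```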
PageRank as a Markov Chain
• Rewriting the update with A = αM + (1−α)E:

    p_{t+1}(d_j) = Σ_{i=1}^{N} [αM_ij + (1−α)/N] p_t(d_i)

    p_{t+1} = (αM + (1−α)E)^T p_t = A^T p_t

• E is a stochastic matrix whose entries are all 1/N
• A is still a stochastic matrix
• PageRank is a finite Markov chain (a time-homogeneous Markov chain with a finite state space)
  – The states are the pages
  – p_t is the probability distribution over pages visited at time t
  – Transitions obey the Markov property: A_ij = Pr(X_{t+1} = d_j | X_t = d_i)
PageRank as a Markov Chain
• A finite Markov chain is irreducible if
  – There is a path from every node to every other node (strongly connected)
[Figure: an irreducible (strongly connected) chain vs. a chain that is not irreducible]
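Irreducibility can be checked mechanically: the chain's graph is strongly connected iff node 0 reaches every node in both the graph and its reverse. A small sketch (the example adjacency list is the lecture's 4-page graph, 0-indexed):

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from `start` by BFS."""
    seen, q = {start}, deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def is_irreducible(adj):
    """Strong connectivity via forward + reverse reachability from node 0."""
    n = len(adj)
    rev = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            rev[v].append(u)
    return len(reachable(adj, 0)) == n and len(reachable(rev, 0)) == n

# d1->{d3,d4}, d2->{d3}, d3->{d4}, d4->{d1,d2}
print(is_irreducible([[2, 3], [2], [3], [0, 1]]))   # -> True
```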
PageRank as a Markov Chain
• A state i in a finite Markov chain is aperiodic if
  – The greatest common divisor of all its return-cycle lengths is 1
  – Periodicity: k = gcd{n: Pr(X_n = i | X_0 = i) > 0}
  – A finite Markov chain is aperiodic if all of its states are aperiodic
• Example (two states 1, 2 with edges 1→2 and 2→1): periodicity is 2
  – X_0 = 1: 121212… returns to 1 at times {2, 4, 6, 8, …}, k = gcd{2, 4, 6, 8, …} = 2
  – X_0 = 2: 212121… returns to 2 at times {2, 4, 6, 8, …}, k = 2
• Example with a self-loop added at state 2: aperiodic
  – X_0 = 2: returns to 2 at times {1, 2, 3, …} (22…, 212…, …), k = 1
  – X_0 = 1: returns to 1 at times {2, 3, 4, 5, …} (121, 1221, 12221, …), k = gcd{2, 3, 4, 5, …} = 1
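The periodicity definition above translates directly into code: track which states are reachable in exactly k steps and take the gcd of the return times (truncating the infinite set at n_max, which suffices for small chains like these):

```python
from math import gcd

def period(adj, i, n_max=20):
    """gcd of return times to state i, over walks of length 1..n_max."""
    cur = {i}       # states reachable from i in exactly k steps
    k_gcd = 0       # gcd(0, k) == k, so this seeds the running gcd
    for k in range(1, n_max + 1):
        cur = {v for u in cur for v in adj[u]}
        if i in cur:
            k_gcd = gcd(k_gcd, k)
    return k_gcd

two_cycle = [[1], [0]]       # 1 <-> 2: returns at 2, 4, 6, ...
with_loop = [[1], [0, 1]]    # same, plus a self-loop on state 2
print(period(two_cycle, 0))  # -> 2
print(period(with_loop, 0))  # -> 1
```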
PageRank as a Markov Chain
• Stationary distribution p
  – Reached when the distribution no longer changes: p_{t+1} = p_t = p, i.e., p^T = p^T A
  – p captures the long-run probability of visiting each page (the PageRank score)
  – If a finite Markov chain is irreducible and aperiodic, then the largest eigenvalue of the transition matrix equals 1 and all other eigenvalues are strictly smaller in magnitude (Perron–Frobenius theorem)
  – Hence there exists a unique stationary distribution
    • The stationary distribution is the left eigenvector of the transition matrix A corresponding to eigenvalue 1
PageRank as a Markov Chain
• E guarantees that the Markov chain is irreducible and aperiodic
  – A unique stationary distribution exists!
• For the four-page example, with A = 0.8 M + (1 − 0.8) E:

    M = [  0    0   1/2  1/2        E = [ 1/4  1/4  1/4  1/4
           0    0    1    0               1/4  1/4  1/4  1/4
           0    0    0    1               1/4  1/4  1/4  1/4
          1/2  1/2   0    0  ]            1/4  1/4  1/4  1/4 ]

    A = [ 0.05  0.05  0.45  0.45
          0.05  0.05  0.85  0.05
          0.05  0.05  0.05  0.85
          0.45  0.45  0.05  0.05 ]

    p = (0.1886, 0.1886, 0.2763, 0.3465): the left eigenvector of A corresponding to eigenvalue 1
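A quick sanity check: building A = 0.8·M + 0.2·E and multiplying the claimed stationary vector on the left should give the vector back (up to the rounding of the printed scores):

```python
N, alpha = 4, 0.8
M = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0, 0.0]]

# A = alpha*M + (1 - alpha)*E, where every entry of E is 1/N
A = [[alpha * M[i][j] + (1 - alpha) / N for j in range(N)] for i in range(N)]

p = [0.1886, 0.1886, 0.2763, 0.3465]     # claimed stationary distribution
pA = [sum(p[i] * A[i][j] for i in range(N)) for j in range(N)]
print(pA)   # very close to p itself: p is a left eigenvector for eigenvalue 1
```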
PageRank by hand
• Two ways to calculate the PageRank scores (let α = 1 for now; make sure the chain is irreducible and aperiodic)
• Example: two pages, 1→2 and 2→{1, 2}:

    A = M = [  0    1
              1/2  1/2 ]

• Solve directly (find the left eigenvector corresponding to eigenvalue 1):

    p_1 = (1/2) p_2,   p_2 = p_1 + (1/2) p_2,   p_1 + p_2 = 1
    ⇒ p = (1/3, 2/3) ≈ (0.3333, 0.6667)

• Power method (repeated squaring of A converges to a matrix whose rows are p):

    A^2 = [ 0.5   0.5       A^4 = [ 0.375   0.625
            0.25  0.75 ]            0.3125  0.6875 ]

    A^8 = [ 0.3359  0.6641      A^16 = [ 0.3333  0.6667
            0.3320  0.6680 ]             0.3333  0.6667 ]
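The power method can equivalently be run as repeated vector updates p ← pA. A sketch of the two-page example with exact fractions, so the direct solution and the iteration can be compared:

```python
from fractions import Fraction as F

# Two-page example: A = M = [[0, 1], [1/2, 1/2]] (alpha = 1)
A = [[F(0), F(1)], [F(1, 2), F(1, 2)]]

# Direct solve of p = pA with p1 + p2 = 1:
#   p1 = (1/2) p2  and  p2 = p1 + (1/2) p2  =>  p2 = 2*p1  =>  p = (1/3, 2/3)
p_direct = (F(1, 3), F(2, 3))

# Power method: repeatedly apply p <- pA from the uniform start
p = [F(1, 2), F(1, 2)]
for _ in range(50):
    p = [p[0] * A[0][j] + p[1] * A[1][j] for j in range(2)]

print(float(p[0]), float(p[1]))   # approaches 1/3, 2/3
```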
Extensions of PageRank
• M can be any stochastic matrix
  – It does not have to be uniformly distributed over out-links
  – It can be based on content similarity (e.g., VisualRank):

      M_ij = sim(I_i, I_j) / Z_i

    where Z_i is the normalization factor that makes M row-stochastic
• Example with similarity-weighted edges (weights 0.7, 0.3, 0.2, 0.8):

    M = [  0    0   0.7  0.3
           0    0    1    0
           0    0    0    1
          0.8  0.2   0    0  ]
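Building such a similarity-based M is a one-liner per row; the pairwise similarity values below are made up purely for illustration:

```python
# sim[i][j] = content similarity between images I_i and I_j (illustrative values)
sim = [[0.0, 0.2, 0.7, 0.3],
       [0.2, 0.0, 0.9, 0.1],
       [0.7, 0.9, 0.0, 0.4],
       [0.3, 0.1, 0.4, 0.0]]

M = []
for row in sim:
    Z = sum(row)                      # normalization factor Z_i
    M.append([s / Z for s in row])    # M_ij = sim(I_i, I_j) / Z_i

for row in M:
    print([round(x, 3) for x in row], "sum =", round(sum(row), 6))
```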
Extensions of PageRank
• E can be any stochastic matrix as well
  – It does not have to be uniformly distributed, e.g., topic-sensitive PageRank, where the random jump is biased toward topic-relevant pages:

      E = e v^T

    with v a (non-uniform) jump probability vector
Heterogeneous (Info) Network
• An instance: social media network

  Content
    – Image domain: photos or videos shared by the actors
    – Text domain: tags or comments attached to the images
    – Social domain: users who share, tag, or comment on the images; user groups whose members favor images
  Homogeneous edges
    – Image domain: content-based visual similarity to other image nodes
    – Text domain: semantic similarities between text contents
    – Social domain: members attending the same group
  Heterogeneous edges
    – Images may be described, tagged, or commented on (I-T links) by users (I-A and T-A links), or favored by user groups (I-A links)
Ranking in Heterogeneous Network
• Problems with directly applying a PageRank-like algorithm to a heterogeneous network
  – Edges are of different types of measurements
    • How do we handle an image node with an edge to another image node as well as an edge to a user node via a "favor" link?
  – Cross-domain edges are usually sparse
    • A random surfer would easily get trapped in one domain
    • Meaning slow convergence!
[Figure: Image 1 —similar— Image 2; User 1 —like— Image 2]
Our approach
• Decomposed Heterogeneous Network
  – A heterogeneous network can be decomposed into homogeneous sub-networks connected by heterogeneous links
  – We can then employ homogeneous network analysis with a PageRank-like algorithm within each sub-network
  – The heterogeneous links are used for knowledge propagation
• Augmented Similarity Function (ASF)
  – Content-based similarity + link-based similarity
  – Exploits the heterogeneous linkage for knowledge propagation
  – Extends the idea that two objects are similar if they are linked to similar objects
  – May also consider the relevance of each object to the query
[Figure: decomposition of a heterogeneous network into homogeneous sub-networks connected by heterogeneous links]
Augmented similarity function
• Content-based similarity
  – A (homogeneous) similarity metric based on the content of the two objects:
      s1 = g1(Ca, Cb)
• Link-based similarity
  – Based on linkages of the two objects to the same object c:
      s2 = g2(l_ac, l_bc), or with relevance importance: s2 = g2(l_ac, l_bc, r_c)
  – Based on linkages of the two objects to different objects d, e, weighted by the similarity of the linked objects:
      s3 = g3(S_de, l_ad, l_be), or with relevance importance: s3 = g3(S_de, l_ad, l_be, r_d, r_e)
• Combined augmented similarity:
      S_ab = f(s1, s2, s3)
[Figure: objects a, b with contents Ca, Cb; cross-domain links l_ac, l_bc to c and l_ad, l_be to d, e; relevance importance scores r_c, r_d, r_e]
Algorithms - SocialRank
• For each homogeneous domain
  – Augmented similarities obtained from the heterogeneous domains
  – Content-based linkage with a query-sensitive random-surfer model:

      p_{t+1} = α (S~ D^{-1}) p_t + (1 − α) z

    (S~: augmented similarity matrix, D: normalization matrix, z: query-biased vector)
• Iterative improvement of the augmented similarities across the different homogeneous sub-networks
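A minimal sketch of this per-domain random walk with restart. The similarity values, the restart vector z, and normalizing by column sums (so that S~ D^{-1} is column-stochastic) are all illustrative assumptions; the slide's actual g-functions for building S~ are not reproduced here:

```python
N, alpha = 3, 0.8
# S~: made-up symmetric augmented-similarity matrix (illustrative)
S_tilde = [[0.0, 0.5, 0.2],
           [0.5, 0.0, 0.3],
           [0.2, 0.3, 0.0]]
# D: diagonal of column sums, so W = S~ D^{-1} has columns summing to 1
col_sums = [sum(S_tilde[i][j] for i in range(N)) for j in range(N)]
z = [0.6, 0.3, 0.1]   # query-biased restart distribution (illustrative)

p = [1.0 / N] * N
for _ in range(200):
    # p_{t+1} = alpha * (S~ D^-1) p_t + (1 - alpha) * z
    p = [alpha * sum(S_tilde[i][j] / col_sums[j] * p[j] for j in range(N))
         + (1 - alpha) * z[i] for i in range(N)]
print([round(x, 3) for x in p])
```

Because the walk matrix is column-stochastic and z is a distribution, p remains a probability distribution at every step.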
Query-biased vector
• To make the random walk query-sensitive, we use a biased vector z
  – The biased vector influences the final probability distribution by letting the random walk restart from z with probability (1 − α):

      p_{t+1} = α (S~ D^{-1}) p_t + (1 − α) z

• For a query with keywords Q, the biased vector is the semantic similarity between the query and the content of the text nodes:

      z_j = g(Σ_{q∈Q} sim(q, v_j))

    (v_j: the text content of node j)
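A toy version of this bias vector, where the keyword/node similarity sim() is stood in for by simple word overlap and g() by normalization into a distribution (both are illustrative assumptions, not the slide's actual choices):

```python
query = {"sunset", "beach"}              # query keywords Q
node_text = [
    {"sunset", "sea", "beach"},          # text content v_0
    {"cat", "kitten"},                   # text content v_1
    {"beach", "volleyball"},             # text content v_2
]

# z_j proportional to sum over keywords of sim(q, v_j); here sim = overlap count
raw = [sum(1 for q in query if q in words) for words in node_text]
total = sum(raw)
z = [r / total for r in raw]             # normalize into a distribution
print([round(x, 3) for x in z])          # -> [0.667, 0.0, 0.333]
```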
Experimental results – Dataset collection
• No existing image dataset contains a social linkage structure
• We crawled the Flickr site to construct the Flickrgroup dataset
  – Groups in Flickr are communities of people who share an interest in a target subject
  – Group members favor photos that are closely related to the target subject
Experimental results – Dataset collection
• Statistics of the Flickrgroup dataset
  – 140 user groups
  – 118,000 images favored by those 140 user groups
    • Encoded with 10^6 visual words
    • Similarity measured by
      – COT: number of co-occurring terms
      – TF: term frequency of co-occurring terms
      – TF-IDF: term frequency of co-occurring terms penalized by document frequency
  – 150,000 unique tags associated with these images
    • The top 5,000 most frequent tags were selected as the codebook tags
Evaluation Measures for Ranking
• Precision
  – The fraction of retrieved documents that are relevant to the query
• Recall
  – The fraction of documents relevant to the query that are successfully retrieved
Evaluation Measures for Ranking
• Precision-recall curve
  – For each number of retrieved documents (N), we obtain a (precision, recall) pair
  – By varying N so that recall ranges from 0 to 1, we can connect the (precision, recall) pairs into a curve
• AP
  – Average of precision over recall values from 0 to 1
  – The area under the precision-recall curve
• AP@K
  – Average of precision with the cut-off N ranging from 1 to K
  – Often used when the number of relevant documents is too large
• mAP(@K)
  – Average of AP(@K) over a number of queries
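One common concrete form of AP averages the precision at each rank where a relevant document appears; a sketch on a toy ranked list of relevance labels:

```python
def average_precision(relevance, total_relevant=None):
    """AP for a ranked list of 0/1 relevance labels (best rank first)."""
    total_relevant = total_relevant or sum(relevance)
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k        # precision at this cut-off
    return score / total_relevant

ranked = [1, 0, 1, 1, 0]             # toy ranking: relevant at ranks 1, 3, 4
print(average_precision(ranked))     # (1/1 + 2/3 + 3/4) / 3 ~ 0.806
```

Truncating the loop at rank K (and averaging over queries) gives AP@K and mAP@K as described above.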
[Figure: a Flickr group with its members and its favored image pool]
Experimental results – Image Ranking (I)
• An 11.5% improvement (in AP@100) over the state-of-the-art ranking methods
Experimental results – Image Ranking (II)
• Consistent improvements across various levels of recall and numbers of retrieved images
Experimental results – Image Ranking (III)
[Figure: top-ranked results for the queries "Cat", "Bird", and "Car"; upper row: VisualRank, bottom row: SocialRank]