7/23/2019 Cse535 Link Analysis
http://slidepdf.com/reader/full/cse535-link-analysis 1/19
CS315 – Link Analysis
Three generations of Search Engines
Anchor text
Link analysis for ranking: Pagerank, HITS

1st Generation: Content Similarity
Content Similarity Ranking: The more rare words two documents share, the more similar they are
Documents are treated as “bags of words” (no effort to “understand” the contents)
Similarity is measured by vector angles
Query Results are ranked by sorting the angles between query and documents
[Figure: documents d1 and d2 drawn as vectors in a term space with axes t1, t2, t3; θ is the angle between them]
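A minimal sketch of ranking by vector angle, assuming raw term counts as the vector weights (the slide's emphasis on rare words suggests a weighting such as idf, which is left out here). The function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_angle(query, docs):
    """Document ids sorted by decreasing cosine, i.e. increasing angle."""
    q = Counter(query.lower().split())
    scores = {
        doc_id: cosine_similarity(q, Counter(text.lower().split()))
        for doc_id, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical documents, treated as bags of words.
docs = {
    "d1": "venture capital funds invest in startups",
    "d2": "the capital of france is paris",
    "d3": "gardening tips for spring",
}
ranking = rank_by_angle("venture capital", docs)
```

Here d1 shares two query words, d2 shares one, and d3 shares none, so the angles grow in that order.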
But we also have links (los links!)
Assumption 1: A hyperlink from a page denotes a vote of confidence to the second page (quality signal)
Assumption 2: The anchor text of the hyperlink describes the target page (textual context)
[Figure: Page A links to Page B through a hyperlink with anchor text]

2nd Generation: Add Popularity
A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B
Score of a page = number of in-links
Query Processing: First retrieve all pages meeting the text query (say venture capital).
Order these by the link popularity (of the page or the site)
[Figure: link graph of five sites, each labeled with a number: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
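A small sketch of popularity ranking, assuming the score is simply the number of in-links. The link list and the set of matching pages are invented for illustration; text matching is taken as already done.

```python
from collections import Counter

def rank_by_popularity(matching_pages, links):
    """Order pages that matched the text query by their in-link count."""
    in_links = Counter(target for _source, target in links)
    # Counter returns 0 for pages nobody links to.
    return sorted(matching_pages, key=lambda page: in_links[page], reverse=True)

# Hypothetical (source, target) hyperlinks between sites.
links = [
    ("www.zz.com", "www.aa.com"),
    ("www.aa.com", "www.bb.com"),
    ("www.cc.com", "www.bb.com"),
    ("www.bb.com", "www.cc.com"),
    ("www.aa.com", "www.dd.com"),
    ("www.cc.com", "www.dd.com"),
]
matching = ["www.zz.com", "www.aa.com", "www.bb.com"]
ordered = rank_by_popularity(matching, links)
```

Every link counts the same here, which is exactly what the next generation changes.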

3rd Generation: Add Reputation
Each page starts with some basic “reputation” (e.g., 1) and repeatedly distributes equal fractions to its links while receiving from them, until some “equilibrium”
The reputation “PageRank” of a page P = the sum of a fair fraction of the reputations of all pages $P_j$ that point to P
Beautiful Math behind it
PR = principal eigenvector of the web’s link matrix
PR equivalent to the chance of randomly surfing to the page
Idea similar to academic co-citations
$$PR(W) = \frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}$$

Roots of PR: Citation Analysis
Citation frequency: The kind of background work Deans are doing at tenure time
Co-citation coupling frequency
Co-citations with a given author measures “impact”
Are you co-cited with influential publications?
Bibliographic coupling frequency
Articles that co-cite the same articles are related
Citation indexing: Who is author cited by?

PageRank (PR) – Complete Definition
W is a web page
$W_i$ are the web pages that have a link to W
$O(W_i)$ is the number of out-links from $W_i$
t is the teleportation probability (e.g. 0.15)
N is the size of the web (that we have seen)
$$PR(W) = \frac{t}{N} + (1 - t)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}\right)$$
[Figure: link graph with nodes labeled W, W1, W2, W3]

PageRank: Iterative Computation
t is normally set to 0.15, but for this example, for simplicity let’s set it to 0.5
Set initial PR values to 1
Solve the following equations iteratively:
$$PR(A) = \frac{0.5}{3} + 0.5\,PR(C)$$
$$PR(B) = \frac{0.5}{3} + 0.5\left(\frac{PR(A)}{2}\right)$$
$$PR(C) = \frac{0.5}{3} + 0.5\left(\frac{PR(A)}{2} + PR(B)\right)$$
$$PR(W) = \frac{t}{N} + (1 - t)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}\right)$$
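The three equations for A, B and C can be iterated directly. The sketch below assumes t = 0.5, N = 3, initial values of 1, and that all three values are updated together from the previous round; the function name and the step count are arbitrary choices.

```python
def iterate_pagerank(steps=50, t=0.5, n=3):
    """Repeatedly apply the slide's three equations, starting from PR = 1."""
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    for _ in range(steps):
        # The right-hand sides read the old values; the new dict replaces them.
        pr = {
            "A": t / n + (1 - t) * pr["C"],
            "B": t / n + (1 - t) * (pr["A"] / 2),
            "C": t / n + (1 - t) * (pr["A"] / 2 + pr["B"]),
        }
    return pr

pr = iterate_pagerank()
# The values settle at the exact solution of the three equations:
# PR(A) = 14/39, PR(B) = 10/39, PR(C) = 15/39 (they sum to 1).
```

The equations imply the links A→B, A→C, B→C and C→A: A has two out-links, so it passes on PR(A)/2 to each target.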

Example Computation of PR in Excel

Pagerank – Matrix Multiplication (Equivalent Def.)
Imagine a browser doing a random walk on web pages:
Start at a random page P
At each step, walk with equal probability out of the current page along one of the links on that page
Continue doing this random walk for a long time
“In the steady state” each page has a long-term visit rate.
Use this rate as the page’s score.
[Figure: a page with three out-links, each followed with probability 1/3]
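The long-term visit rate can be estimated by simulating the walk. This sketch assumes a small hypothetical graph with no dead ends (A→B, A→C, B→C, C→A), a fixed random seed, and no teleporting.

```python
import random

def visit_rates(out_links, steps=200_000, seed=0):
    """Estimate long-term visit rates by walking the graph at random."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    current = rng.choice(pages)            # start at a random page
    visits = dict.fromkeys(pages, 0)
    for _ in range(steps):
        # Follow one of the current page's links with equal probability.
        current = rng.choice(out_links[current])
        visits[current] += 1
    return {page: visits[page] / steps for page in pages}

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rates = visit_rates(graph)
```

For this graph the exact steady state is 0.4 for A, 0.2 for B and 0.4 for C, and the simulated rates should land close to those values.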

Not quite enough
The web is full of dead-ends.
Random walk can get stuck in dead-ends.
Makes no sense to talk about long-term visit rates.
[Figure: a dead-end page, marked “??”]

Teleporting
At a dead end, jump to a random web page.
At any non-dead end,
With probability, say, 15%, jump to a random web page.
With remaining probability (85%), go out on a random link.
t = 0.15 is the “teleporting” parameter.
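One step of the teleporting walk, as a sketch. The function name, its arguments and the three-page graph are hypothetical; only the rule itself (always jump at a dead end, otherwise jump with probability t) comes from the slide.

```python
import random

def teleport_step(current, out_links, pages, t=0.15, rng=random):
    """Return the next page of a random walk with teleporting."""
    links = out_links.get(current, [])
    # Dead end: always jump. Otherwise jump with probability t.
    if not links or rng.random() < t:
        return rng.choice(pages)
    # With the remaining probability 1 - t, follow a random out-link.
    return rng.choice(links)

pages = ["A", "B", "C"]
out_links = {"A": ["B"], "B": ["C"], "C": []}   # C is a dead end
rng = random.Random(0)
page = "A"
for _ in range(10):
    page = teleport_step(page, out_links, pages, rng=rng)
```

Passing a seeded `random.Random` as `rng` makes a run repeatable; the default uses the module-level generator.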

Result of teleporting
Now the walk cannot get stuck locally.
There exists a computable long-term rate at which any page is visited (this is not obvious, but it has been proven)
How do we compute this visit rate?

Markov chains: abstractions of random walks
A Markov chain consists of n states, and an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For 1 ≤ i,j ≤ n, the matrix entry $P_{ij}$ tells us the probability of j being the next state, given we are currently in state i.
Clearly, for all i,
$$\sum_{j=1}^{n} P_{ij} = 1.$$
[Figure: states i and j joined by an arrow labeled $P_{ij}$]
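The row-sum condition is easy to check in code. A small sketch, assuming the matrix is stored as a list of rows; the function name and tolerance are arbitrary.

```python
def is_transition_matrix(P, tol=1e-9):
    """True if P is square, non-negative, and every row sums to 1."""
    n = len(P)
    return all(
        len(row) == n
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in P
    )

ok = is_transition_matrix([[0.5, 0.5], [1.0, 0.0]])    # rows sum to 1
bad = is_transition_matrix([[0.5, 0.4], [1.0, 0.0]])   # first row sums to 0.9
```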

Computing PR with Markov chains
Example (next two slides): Represent the teleporting random walk with teleporting parameter t=1! as a Markov chain, for this graph:
[Figure: graph with nodes A, B, C, D]

Computing P with Matrix Multiplication
Start with Adjacency matrix A of the Web Graph
If there is hyperlink from i to j, $A_{ij} = 1$, else $A_{ij} = 0$
If a row has all 0’s, replace each element by 1/N
Else divide each 1 by the number of 1’s in the row
Multiply the matrix by 1-t
Add t/N to every entry of the resulting matrix
[Figure: graph with nodes A, B, C, D and its matrix P]
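The steps above can be sketched in a few lines. The three-page adjacency matrix is a hypothetical example (A→B, A→C, B→C, and C a dead end), not the four-node graph from the figure, and t = 0.5 is chosen only to keep the numbers simple.

```python
def transition_matrix(adjacency, t=0.15):
    """Build the teleporting transition matrix P from a 0/1 adjacency matrix."""
    n = len(adjacency)
    P = []
    for row in adjacency:
        ones = sum(row)
        if ones == 0:
            base = [1.0 / n] * n               # all-zero row: 1/N everywhere
        else:
            base = [a / ones for a in row]     # divide each 1 by the row's 1-count
        # Multiply by (1 - t), then add t/N to every entry.
        P.append([(1 - t) * p + t / n for p in base])
    return P

adjacency = [
    [0, 1, 1],   # A links to B and C
    [0, 0, 1],   # B links to C
    [0, 0, 0],   # C is a dead end
]
P = transition_matrix(adjacency, t=0.5)
# Row A becomes [1/6, 5/12, 5/12]; the dead-end row C stays at 1/3 each.
```

Every row of the result sums to 1, so P is a valid transition probability matrix.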

Computing all Pageranks
Theorem: Regardless of where we start, we eventually reach the steady state a.
Start with any distribution (say x = (1 0 … 0)).
After one step, we’re at xP; after two steps at $xP^2$, then $xP^3$ and so on.
“Eventually” means for “large” k, $xP^k = a$.
Algorithm: multiply x by increasing powers of P until the product looks stable.
[Figure: graph with nodes A, B, C, D and its matrix P]
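A sketch of the algorithm, assuming "looks stable" means that no entry changes by more than a small tolerance between steps. The matrix below is a hypothetical example: the teleporting matrix for the links A→B, A→C, B→C, C→A with t = 0.5.

```python
def steady_state(P, tol=1e-12, max_steps=10_000):
    """Multiply x by P repeatedly until the distribution stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)                # start at x = (1 0 ... 0)
    for _ in range(max_steps):
        # One step of the walk: next_x = x P (row vector times matrix).
        next_x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(next_x, x)) < tol:
            return next_x
        x = next_x
    return x

P = [
    [1 / 6, 5 / 12, 5 / 12],
    [1 / 6, 1 / 6, 2 / 3],
    [2 / 3, 1 / 6, 1 / 6],
]
a = steady_state(P)
# a is approximately (14/39, 10/39, 15/39).
```

Computing xP, then (xP)P, and so on avoids forming the powers of P explicitly while giving the same sequence of distributions.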

Pagerank summary
Preprocessing:
Given graph of links, build matrix P.
From it compute a.
The entry $a_i$ is a number between 0 and 1: the pagerank of page i.
Query processing:
Retrieve pages meeting query.
Rank them by their pagerank.
Order is query-independent
If PR(A) > PR(B) for some query, A beats B in every query
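A sketch of the query-processing step, assuming a page "meets" a query when it contains every query word, and that pagerank scores were computed beforehand. The page names, texts and scores are invented for illustration.

```python
def search(query, pages, pagerank):
    """Retrieve pages containing all query words, ordered by pagerank."""
    terms = query.lower().split()
    hits = [
        url for url, text in pages.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)

# Hypothetical page texts and precomputed pagerank scores.
pages = {
    "a.example": "venture capital news",
    "b.example": "venture capital blog and capital markets",
    "c.example": "cooking news",
}
pagerank = {"a.example": 0.2, "b.example": 0.5, "c.example": 0.3}

first = search("venture capital", pages, pagerank)
second = search("capital", pages, pagerank)
third = search("news", pages, pagerank)
```

Because the scores do not depend on the query, b.example comes before a.example in every result list that contains both.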

How is Pagerank used?
PageRank Technology:
PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
This claim has recently changed:
“Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis”
Pagerank is dead, long live Pagerank
http://www.google.com/corporate/tech.html