Post on 20-Jan-2016
9 Algorithms: PageRank
Ranking
• After matching, have to rank:
Index Based Ranking
• Strategies we could (do) use:
– Frequency
– Position
– Metadata
Missing Ingredient
• Index lacks inter-page information (links between pages)
Link Quality
• Not all links are equal
• Who do you trust?
– CS Prof
– World Famous Chef
Identifying Authority
• Links into a page give it authority
• Page value = sum of authorities of pages linking to it
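The rule above can be sketched in a few lines of Python. This is only an illustration: the page names, the link graph, and the starting authority numbers are all made up here, and `page_value` is a hypothetical helper, not something from the slides.

```python
# Hypothetical link graph: each page maps to the list of pages it links to.
links = {
    "cs_prof": ["recipe_a"],
    "famous_chef": ["recipe_b"],
    "blog": ["recipe_a", "recipe_b"],
}

# Assumed starting authorities (invented numbers, for illustration only).
authority = {"cs_prof": 1.0, "famous_chef": 5.0, "blog": 1.0}


def page_value(page, links, authority):
    """Sum the authorities of every page that links to `page`."""
    return sum(authority[src] for src, targets in links.items() if page in targets)


values = {p: page_value(p, links, authority) for p in ("recipe_a", "recipe_b")}
print(values)
```

With these assumed numbers, the recipe linked by the famous chef ends up with the higher value, because the chef's link carries more authority than the professor's.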
Link Quality
• Counting more links as better is easy to abuse (e.g., spam link pages)
Issues
• Spam Links
– Discourage with negative weight
[Figure: spam link pages, each link weighted -1]
Issues
• Cycles:
…
Random Surfer
• Simulating a web surfing session
– Start at a random page
– At each page, have a chance to:
• Pick a random link to go to
• Jump to a completely random page
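A minimal sketch of such a session in Python follows. The slides do not give the jump probability, so `jump_chance=0.15` is an assumption, as are the four-page graph, the function name `random_surfer`, and the choice to jump whenever a page has no outgoing links.

```python
import random


def random_surfer(links, steps=100_000, jump_chance=0.15, seed=0):
    """Simulate one long surfing session; return each page's share of visits."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)           # start at a random page
    for _ in range(steps):
        visits[page] += 1
        outgoing = links[page]
        if not outgoing or rng.random() < jump_chance:
            page = rng.choice(pages)   # jump to a completely random page
        else:
            page = rng.choice(outgoing)  # follow a random link on this page
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: page -> pages it links to. Nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
shares = random_surfer(links)
for page, share in sorted(shares.items()):
    print(f"{page}: {share:.1%}")
```

Page D has no incoming links, so the surfer only reaches it through random jumps and it collects the smallest share of visits.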
Results
• Results of many random sessions:
Results
• Expressed as percentages, results stabilize
– Law of large numbers
Cycle Buster
• Random surfer not fazed by cycles:
Random Surfer In Use
• The recipe pages visited by random surfers:
Simulator
• PageRank Simulator: http://caccio.blogdns.net/software/pagerank-simulator
The Real Math
• Markov Chains
– Set of states
– Each state has probability of leading to other states
– Represent as matrix
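As a rough sketch of the matrix view, the Python below builds a transition matrix for a small graph and repeatedly multiplies a probability distribution by it until it settles. The three-page graph, the 0.15 jump probability, and the helper names `transition_matrix` and `stationary` are assumptions for illustration, not values from the slides.

```python
def transition_matrix(links, pages, jump_chance=0.15):
    """Row i holds the probabilities of moving from pages[i] to each page."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    # Every page can be reached from anywhere by a random jump.
    matrix = [[jump_chance / n] * n for _ in range(n)]
    for src in pages:
        row = matrix[index[src]]
        targets = links[src]
        if not targets:
            # Assumed handling of a dead end: always jump to a random page.
            for j in range(n):
                row[j] = 1.0 / n
        else:
            # Split the remaining probability evenly across outgoing links.
            for t in targets:
                row[index[t]] += (1 - jump_chance) / len(targets)
    return matrix


def stationary(matrix, rounds=100):
    """Start uniform, then apply the matrix `rounds` times."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(rounds):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical graph
P = transition_matrix(links, pages)
ranks = stationary(P)
for page, rank in zip(pages, ranks):
    print(f"{page}: {rank:.3f}")
```

Each row of `P` sums to 1, and the resulting distribution is the long-run fraction of time a random surfer would spend on each page, computed without simulating individual sessions.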
Excel Simulation
• Three pages:
Limitations
• Still have issues/room for growth
– Link Spam
– Context of link
• Where link is on page
• "Bob's recipe is terrible" vs "Bob's recipe is great"
– Lack of semantic knowledge
• Page's Authority should not be the same for all domains
Power
• Controlling search is power:
http://www.bitsbook.com/
"If you're not paying for the product, you are the product."