Finding Related Communities on the Web
Masashi Toyoda
We propose a new web search technique that finds related communities from a given URL. A community is a set of web pages written by authors who have a common interest in a specific topic, such as fan pages of a professional baseball team. Our technique uses hyperlink analysis to find a community that includes the given URL, as well as communities on related topics.
What the proposed technique finds
Communities related to a given seed page
A fan of SONY VAIO PC
PC vendors
A community of VAIO fans
HITS [Kleinberg ’97]
Extracts good authorities and hubs from a given subset of the web graph
Authorities: pages pointed to by many good hubs
Hubs: pages pointing to many good authorities
[Figure: hubs pointing to authorities]
auth(n) = Σ hub(m), for all m pointing to n
hub(n) = Σ auth(m), for all m pointed to by n
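The two update rules above can be sketched as a simple iteration. This is a minimal illustration, not the implementation used in this work: the function names `hits` and `top_authorities`, the fixed iteration count, and the normalization step (needed so that scores do not grow without bound) are assumptions, and the input is assumed to be a dictionary mapping each page to the pages it points to.

```python
from math import sqrt


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # auth(n) = sum of hub(m), for all m pointing to n
        new_auth = {n: 0.0 for n in nodes}
        for m, targets in links.items():
            for n in targets:
                new_auth[n] += hub[m]
        # hub(n) = sum of auth(m), for all m pointed to by n
        new_hub = {n: sum(new_auth[m] for m in links.get(n, ())) for n in nodes}
        # Assumption: rescale to unit length after every round.
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub


def top_authorities(links, k=10):
    """Return the k pages with the highest authority score."""
    auth, _ = hits(links)
    return sorted(auth, key=lambda n: (-auth[n], n))[:k]
```

For example, with `links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a", "b"]}`, page `a` ends up with a higher authority score than `b`, because it is pointed to by all three hubs.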
Sub-graph for finding related pages
Seed
Algorithm
Seed URL: http://foo.bar/
HITS → Top 10 authorities: URL1, URL2, …, URL10
Use each authority as a next seed
HITS → 10 sets of top 10 authorities:
URL1.1, URL1.2, …, URL1.10
URL2.1, URL2.2, …, URL2.10
…
URL10.1, URL10.2, …, URL10.10
Clustering: merge two “top 10 authorities” into a cluster when they share 3 or more URLs
→ Result communities
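The clustering step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: `top_authorities` is a hypothetical callable that takes a seed URL and returns its top authorities (building the sub-graph and running HITS is left to the caller), only the ten second-level lists are clustered, and merging is assumed to be transitive, so lists A and C end up in one cluster if A shares 3 or more URLs with B and B shares 3 or more with C.

```python
def merge_authority_lists(lists, min_shared=3):
    """Group authority lists that share at least min_shared URLs."""
    parent = list(range(len(lists)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [set(urls) for urls in lists]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) >= min_shared:
                parent[find(j)] = find(i)

    clusters = {}
    for i, urls in enumerate(sets):
        clusters.setdefault(find(i), set()).update(urls)
    return list(clusters.values())


def related_communities(seed, top_authorities, min_shared=3):
    """top_authorities(url) -> list of URLs; a hypothetical caller-supplied function."""
    first = top_authorities(seed)
    # Use each authority as a next seed, then cluster the resulting lists.
    second = [top_authorities(url) for url in first]
    return merge_authority_lists(second, min_shared)
```

Each returned set is one candidate community. Lists that share fewer than 3 URLs with every other list stay as separate clusters.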
Typical Behavior
Seed: VAIO fan A
The first top 10 authorities: VAIO fan A, VAIO fan B, VAIO official page, VAIO fan C, VAIO and WinCE, …
Result communities:
• VAIO fan A, VAIO fan B, VAIO fan C, VAIO fan D, …
• VAIO official page, SONY, IBM, TOSHIBA, …
• VAIO and WinCE, WinCE fan A, WinCE fan B, …
Data Set
• 17 million web pages (90 GB)
• Crawled from July to September, 1999
• Pages in the jp domain, or pages in other domains that include Japanese characters
• Root URL: http://www.yahoo.co.jp/
• Crawling strategy: breadth first
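A breadth-first crawl with the page filter described above might look like the sketch below. It is only an illustration: `fetch` is a hypothetical caller-supplied function returning a page's text and out-links, the Unicode ranges used to detect Japanese characters are an approximation (the ideograph range is shared with Chinese), and not following links from rejected pages is an assumption, since the slides do not say how such pages were handled.

```python
from collections import deque
from urllib.parse import urlparse


def in_jp_domain(url):
    host = urlparse(url).hostname or ""
    return host == "jp" or host.endswith(".jp")


def has_japanese(text):
    # Hiragana and katakana (U+3040 to U+30FF) or CJK ideographs (U+4E00 to U+9FFF).
    return any("\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text)


def crawl_breadth_first(root, fetch, max_pages):
    """fetch(url) -> (text, out_links); hypothetical, supplied by the caller."""
    queue, seen, kept = deque([root]), {root}, []
    while queue and len(kept) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        text, out_links = fetch(url)
        if not (in_jp_domain(url) or has_japanese(text)):
            continue  # assumption: rejected pages are not expanded
        kept.append(url)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```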
Web Graph
21M pages pointed to by retrieved pages
17M pages retrieved by the crawler
• 38M URLs
• 23M inter-server links
• Mapped on main memory (2.5 GB)
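One way to hold a link graph of this kind in main memory is to map URLs to integer IDs and store the links in flat arrays. The sketch below is an assumption about a possible layout, not a description of the structure actually used: the name `build_graph` is hypothetical, and an inter-server link is taken to mean a link whose source and target have different host names.

```python
from array import array
from urllib.parse import urlparse


def build_graph(links):
    """links: iterable of (source_url, target_url) pairs."""
    ids = {}

    def url_id(url):
        return ids.setdefault(url, len(ids))

    edges = []
    for src, dst in links:
        # Assumption: keep only links between different hosts.
        if urlparse(src).hostname != urlparse(dst).hostname:
            edges.append((url_id(src), url_id(dst)))
    edges.sort()

    # Compressed layout: targets of node i are targets[offsets[i]:offsets[i + 1]].
    offsets = array("l", [0] * (len(ids) + 1))
    for s, _ in edges:
        offsets[s + 1] += 1
    for i in range(len(ids)):
        offsets[i + 1] += offsets[i]
    targets = array("l", (d for _, d in edges))
    return ids, offsets, targets
```

Storing integer IDs instead of URL strings for every link keeps the per-link cost to a few bytes, which is what makes tens of millions of links fit in memory.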
Experiment
Result
[Figure: bar chart of # of seeds (axis ticks 0 to 8) against # of communities (axis ticks 1 to 7)]
• Randomly select 50 moderately popular pages as seeds (10 ≦ # of in-links ≦ 50)
• Examine whether the result communities are related
• 35 seeds produce related communities
• 15 seeds produce unrelated communities
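The seed selection used in the experiment can be sketched as below, reading "moderately popular" as an in-link count between 10 and 50 inclusive. The function name `select_seeds` and the `in_link_counts` dictionary (URL to number of in-links) are assumptions made for illustration.

```python
import random


def select_seeds(in_link_counts, k=50, low=10, high=50, rng=None):
    """Pick up to k random pages whose in-link count lies in [low, high]."""
    rng = rng or random.Random()
    candidates = sorted(
        url for url, count in in_link_counts.items() if low <= count <= high
    )
    return rng.sample(candidates, min(k, len(candidates)))
```

Passing a seeded generator, for example `rng=random.Random(0)`, makes the selection repeatable.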