
Finding Related Communities on the Web

Date post: 09-Jan-2016

Masashi Toyoda

We propose a new web search technique that finds related communities from a given URL. A community is a set of web pages written by authors who share a common interest in a specific topic, such as fan pages of a professional baseball team. Our technique uses hyperlink analysis to find the community that includes the given URL, as well as communities on related topics.

What the proposed technique finds

Communities related to a given seed page

[Figure: a seed page (a fan of SONY VAIO PC), a community of VAIO fans, and a community of PC vendors]

HITS [Kleinberg ’97]

Extracts good authorities and hubs from a given subset of the web graph

Authorities: pages pointed to by many good hubs

Hubs: pages pointing to many good authorities

[Figure: hubs pointing to authorities]

auth(n) = Σ hub(m), for all m pointing to n

hub(n) = Σ auth(m), for all m pointed to by n
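The two update rules above can be sketched as a small iterative routine. This is a minimal illustration, not the implementation used in the talk: the function names `hits` and `top_authorities`, the fixed iteration count, and the normalization step are all assumptions, since the slide gives only the two sums.

```python
from collections import defaultdict
from math import sqrt


def hits(edges, iterations=50):
    """Return (auth, hub) score dicts for a directed graph of (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    in_links = defaultdict(list)   # n -> pages m pointing to n
    out_links = defaultdict(list)  # n -> pages m pointed to by n
    for src, dst in edges:
        in_links[dst].append(src)
        out_links[src].append(dst)

    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # auth(n) = sum of hub(m), for all m pointing to n
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        # hub(n) = sum of auth(m), for all m pointed to by n
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        # Assumption: normalize each score vector so values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub


def top_authorities(edges, k=10):
    """Return the k nodes with the highest authority score (ties broken by name)."""
    auth, _ = hits(edges)
    return sorted(auth, key=lambda n: (-auth[n], n))[:k]
```

With `k=10`, `top_authorities` would yield the kind of "top 10 authorities" list that the later slides use.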

Sub-graph for finding related pages

[Figure: sub-graph around the seed page]

Algorithm

• Start from a seed URL, e.g. http://foo.bar/
• Apply HITS to obtain the top 10 authorities: URL1, URL2, …, URL10
• Use each authority as a next seed and apply HITS again, giving 10 sets of top 10 authorities: URL1.1 … URL1.10, URL2.1 … URL2.10, …, URL10.1 … URL10.10
• Clustering: merge two “top 10 authorities” into a cluster when they share 3 or more URLs
• The resulting clusters are the result communities
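The expansion and clustering steps can be sketched as follows. This is an illustration under stated assumptions: `top_authorities` is a hypothetical callable standing in for a HITS run on one seed, only the second-level lists are clustered, and merging is treated as transitive (via union-find), which the slide does not specify.

```python
def expand_seed(seed, top_authorities):
    """Use each top authority of `seed` as a next seed; return their authority lists.

    `top_authorities` is a hypothetical callable: URL -> list of authority URLs.
    """
    return [top_authorities(url) for url in top_authorities(seed)]


def cluster_authority_lists(lists, min_shared=3):
    """Merge lists that share at least `min_shared` URLs; return a list of URL sets."""
    parent = list(range(len(lists)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [set(urls) for urls in lists]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) >= min_shared:
                # Assumption: merging is transitive across pairs.
                parent[find(j)] = find(i)

    clusters = {}
    for i, urls in enumerate(sets):
        clusters.setdefault(find(i), set()).update(urls)
    return list(clusters.values())
```

A call such as `cluster_authority_lists(expand_seed(seed, top_authorities))` would then produce candidate communities for one seed.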

Typical Behavior

Seed: VAIO fan A

The first top 10 authorities: VAIO fan A, VAIO fan B, VAIO official page, VAIO fan C, VAIO and WinCE, …

Result communities
• VAIO fan A, VAIO fan B, VAIO fan C, VAIO fan D, …
• VAIO official page, SONY, IBM, TOSHIBA, …
• VAIO and WinCE, WinCE fan A, WinCE fan B, …

Data Set
• 17 million web pages (90 GB)
• Crawled from July to September 1999
• Pages in the jp domain, or pages in other domains that include Japanese characters
• Root URL: http://www.yahoo.co.jp/
• Crawling strategy: breadth first
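A breadth-first crawl from a root URL can be sketched with a FIFO queue. This is only an illustration of the strategy named above, not the crawler used for the data set: `fetch_links` is a hypothetical callable that returns the links found on a page, and `max_pages` is an assumed stopping condition.

```python
from collections import deque


def crawl_breadth_first(root, fetch_links, max_pages):
    """Visit pages level by level from `root`; return URLs in retrieval order.

    `fetch_links` is a hypothetical callable: URL -> iterable of linked URLs.
    """
    queue = deque([root])
    seen = {root}
    retrieved = []
    while queue and len(retrieved) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        retrieved.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return retrieved
```

Pages closer to the root are retrieved before deeper ones, so a size-limited crawl favors pages a few links away from the starting directory.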

Web Graph
• 17M pages retrieved by the crawler
• 21M pages pointed to by retrieved pages
• 38M URLs
• 23M inter-server links
• Mapped on main memory (2.5 GB)
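One way to build such a graph is to keep only links between different servers and to replace URLs with integer ids. The sketch below assumes that "inter-server" means the source and target host names differ; the function names and the example.com-style URLs used to exercise it are hypothetical, and the talk's actual in-memory layout is not described.

```python
from urllib.parse import urlsplit


def is_inter_server(source_url, target_url):
    """True when the two URLs have different host names (assumed meaning of inter-server)."""
    return urlsplit(source_url).hostname != urlsplit(target_url).hostname


def build_graph(links):
    """Map URLs to integer ids and return (url_ids, adjacency) for inter-server links only."""
    url_ids = {}
    adjacency = {}
    for src, dst in links:
        if not is_inter_server(src, dst):
            continue  # drop links that stay on the same server
        s = url_ids.setdefault(src, len(url_ids))
        d = url_ids.setdefault(dst, len(url_ids))
        adjacency.setdefault(s, []).append(d)
    return url_ids, adjacency
```

Integer ids keep the adjacency lists compact, which matters when tens of millions of URLs must fit in main memory.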

Experiment

Result

[Figure: chart of # of seeds (axis ticks 0 to 8) against # of communities (axis ticks 1 to 7)]

• Randomly select 50 moderately popular pages as seeds (10 ≦ # of in-links ≦ 50)
• Examine whether result communities are related
• 35 seeds produce related communities
• 15 seeds produce unrelated communities
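The seed selection step can be sketched as a filtered random sample. This is a minimal illustration: `select_seeds` and the `in_link_counts` mapping (URL to number of in-links) are assumed names, and the slide does not say how the random choice was made.

```python
import random


def select_seeds(in_link_counts, count=50, low=10, high=50, rng=None):
    """Randomly pick `count` pages whose in-link count lies within [low, high]."""
    rng = rng or random.Random()
    # Sort candidates so a seeded rng gives a reproducible sample.
    candidates = sorted(url for url, n in in_link_counts.items() if low <= n <= high)
    return rng.sample(candidates, count)
```

`rng.sample` raises `ValueError` when fewer than `count` pages fall in the range, so a graph that is too small fails loudly instead of returning a short seed list.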
