Learning URL Patterns for Webpage De-duplication

Post on 13-Jan-2016


description

Learning URL Patterns for Webpage De-duplication. Authors: Hema Swetha Koppula… WSDM 2010. Reporter: Jing Chiu. Email: D9815013@mail.ntust.edu.tw. Outline: Introduction (Duplicate URLs, Problem Definition), Related Works, Algorithms (URL Preprocessing, Rule Generation), Evaluation, Conclusions.

transcript

Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: D9815013@mail.ntust.edu.tw

112/04/21 1Data Mining & Machine Learning Lab

Outlines

•Introduction
  ▫Duplicate URLs
  ▫Problem Definition
•Related Works
•Algorithms
  ▫URL Preprocessing
  ▫Rule Generation
•Evaluation
•Conclusions

Introduction

•Duplicate URLs
•Problem Definition

Duplicate URLs

• Making URLs search engine friendly
  ▫ http://en.wikipedia.org/wiki/Casino_Royale
  ▫ http://en.wikipedia.org/?title=Casino_Royale
• Session-id or cookie information present in URLs
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
• Irrelevant or superfluous components in URLs
  ▫ http://www.amazon.com/Lord-Rings/dp/B000634DCW
  ▫ http://www.amazon.com/dp/B000634DCW
• Webmasters construct URL representations with custom delimiters
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2

Problem Definition

•Given a set of duplicate clusters and their corresponding URLs
  ▫Learn Rules from URL strings that can identify duplicates
  ▫Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL
•Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL

Related Works

• Do not crawl in the DUST: different URLs with similar text
  ▫ Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  ▫ Conference: International Conference on World Wide Web 2007
  ▫ DUST algorithm
    ▪ Discovers substring substitution rules that transform URLs of similar content into one canonical URL
    ▪ Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
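To make the idea of a substring substitution rule concrete, here is a minimal Python sketch. It is not the DUST algorithm itself, which learns rules and their confidence from logs; it only shows how an already-known rule might be applied. The function name `apply_substitution_rules` and the single rule `("/?title=", "/wiki/")` are illustrative assumptions, chosen to match the Wikipedia URL pair from the Duplicate URLs slide.

```python
def apply_substitution_rules(url, rules):
    """Rewrite url by applying each (old, new) substring rule at most once, in order."""
    for old, new in rules:
        if old in url:
            url = url.replace(old, new, 1)
    return url


# Hypothetical rule for illustration only; it is not taken from the DUST paper.
RULES = [("/?title=", "/wiki/")]

canonical = apply_substitution_rules(
    "http://en.wikipedia.org/?title=Casino_Royale", RULES
)
print(canonical)
```

A URL that is already in the canonical form contains no `/?title=` substring, so the rule leaves it unchanged and both duplicates end up as the same string.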

Related Works (cont.)

• De-duping URLs via rewrite rules
  ▫ Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  ▫ Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  ▫ Considers a broader set of rule types that subsume the DUST rules
    ▪ DUST rules
    ▪ Session-id rules
    ▪ Irrelevant path components
    ▪ Complicated rewrites
  ▫ The algorithm learns rules from a cluster of URLs with similar page content
    ▪ Such a cluster is referred to as a duplicate cluster or a dup cluster
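As an illustration of one rule type named on this slide, the session-id rule, the Python sketch below drops a query argument that is assumed to be irrelevant to page content. It does not reproduce how Dasgupta et al. learn such rules; the helper `drop_query_args` and the choice of `sid` as the irrelevant argument are assumptions based on the Stanford URL pair from the Duplicate URLs slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_args(url, irrelevant):
    """Return url with every query argument whose name is in `irrelevant` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in irrelevant
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Assumption: `sid` carries only session state, so removing it keeps the content.
a = drop_query_args(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8", {"sid"}
)
b = drop_query_args(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8", {"sid"}
)
print(a == b, a)
```

Both session-specific URLs collapse to the same string, which is the behaviour a session-id rewrite rule is meant to achieve.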

Algorithms

•URL Preprocessing
  ▫Basic Tokenization
  ▫Deep Tokenization
•Rule Generation
  ▫Pair-wise Rule Generation
  ▫Rule Generalization

URL Preprocessing

•Basic Tokenization
  ▫Uses the standard delimiters specified in RFC 1738
  ▫Extracted tokens:
    ▪Protocol
    ▪Hostname
    ▪Path components
    ▪Query-args
•Deep Tokenization
  ▫Uses an unsupervised technique to learn custom URL encodings used by webmasters
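Basic tokenization can be sketched in a few lines of Python. This is a rough approximation that relies on the standard library's `urllib.parse` to split on the usual URL delimiters; the function name `basic_tokenize` and the token labels are assumptions made for illustration. Deep tokenization is not sketched, because the slide gives no detail on the unsupervised technique.

```python
from urllib.parse import parse_qsl, urlsplit


def basic_tokenize(url):
    """Split a URL into (kind, token) pairs: protocol, hostname, path parts, query args."""
    parts = urlsplit(url)
    tokens = [("protocol", parts.scheme), ("hostname", parts.netloc)]
    # Path components are separated by "/"; empty segments are skipped.
    tokens += [("path", segment) for segment in parts.path.split("/") if segment]
    # Query arguments are separated by "&", names and values by "=".
    tokens += [
        ("query-arg", f"{name}={value}")
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return tokens


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
for kind, token in tokens:
    print(kind, token)
```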

URL Preprocessing (cont.)


Rule Generation

• Definitions
  ▫ URL
  ▫ Rule
• Example
  ▫ u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = −, k(3.3,1.1) = lttU7d6kIuGq}
  ▫ u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = −, k(3.3,1.1) = nMfcaJRPUSMQ}
  ▫ Rule
    Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = −, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
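The keyed representation and the pair-wise rule in this example can be reproduced with a short Python sketch. Several simplifications are assumed: deep tokens are found by splitting on `-` and `_` only (the presentation describes a learned, unsupervised technique instead), query arguments are ignored, and keys missing from the target are not mapped to ⊥. The names `url_to_keys` and `pairwise_rule` are hypothetical, and u2 is written with the hyphen that its token list implies.

```python
import re
from urllib.parse import urlsplit


def url_to_keys(url):
    """Map keys k(x,y) or k(x.i,y.j) to tokens, indexed from the start and the end."""
    parts = urlsplit(url)
    basic = [parts.scheme, parts.netloc] + [s for s in parts.path.split("/") if s]
    n = len(basic)
    keyed = {}
    for x, token in enumerate(basic, start=1):
        y = n - x + 1  # position counted from the end of the URL
        # Assumed deep tokenization: split on "-" and "_" and keep the delimiters.
        deep = [t for t in re.split(r"([-_])", token) if t]
        if len(deep) == 1:
            keyed[f"k({x},{y})"] = token
        else:
            m = len(deep)
            for i, part in enumerate(deep, start=1):
                keyed[f"k({x}.{i},{y}.{m - i + 1})"] = part
    return keyed


def pairwise_rule(source, target):
    """Context is the full source URL; transformation lists keys whose value must change."""
    context = dict(source)
    transformation = {k: v for k, v in target.items() if source.get(k) != v}
    return context, transformation


u1 = url_to_keys("http://360.yahoo.com/friends-lttU7d6kIuGq")
u2 = url_to_keys("http://360.yahoo.com/friends-nMfcaJRPUSMQ")
context, transformation = pairwise_rule(source=u2, target=u1)
print(transformation)
```

With u2 as the source and u1 as the target, the only key in the transformation is `k(3.3,1.1)`, matching the rule shown on the slide.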

Rule Generation (cont.)

• Pair-wise Rule Generation
  ▫ Target Selection
  ▫ Source Selection
• Rule Generalization
  ▫ Pair 1:
    http://www.imdb.com/title/tt0810900/photogallery
    http://www.imdb.com/title/tt0810900/mediaindex
  ▫ Pair 2:
    http://www.imdb.com/title/tt0053198/photogallery
    http://www.imdb.com/title/tt0053198/mediaindex
  ▫ Rule 1:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  ▫ Rule 2:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
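Rule 1 and Rule 2 differ only in the context value of `k(4.2,2.1)`, which suggests merging them into one rule that accepts any value there. The Python sketch below shows that idea with a plain wildcard merge. It is a deliberate simplification and should not be read as the generalization procedure of the paper; the names `generalize`, `apply_rule`, and `imdb_rule`, and the title id `0000001` in the usage note, are made up for illustration.

```python
WILDCARD = "*"


def generalize(rule_a, rule_b):
    """Merge two rules with equal transformations; differing context values become '*'."""
    (ctx_a, t_a), (ctx_b, t_b) = rule_a, rule_b
    if t_a != t_b or ctx_a.keys() != ctx_b.keys():
        return None  # this simple merge does not apply
    context = {k: (v if ctx_b[k] == v else WILDCARD) for k, v in ctx_a.items()}
    return context, dict(t_a)


def apply_rule(rule, url_keys):
    """Return the rewritten key map, or None when the context does not match."""
    context, transformation = rule
    for key, expected in context.items():
        if key not in url_keys:
            return None
        if expected != WILDCARD and url_keys[key] != expected:
            return None
    rewritten = dict(url_keys)
    rewritten.update(transformation)
    return rewritten


def imdb_rule(title_id):
    """Build Rule 1 / Rule 2 from the slide for a given numeric title id."""
    context = {
        "k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
        "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id, "k(5,1)": "photogallery",
    }
    return context, {"k(5,1)": "mediaindex"}


rule_1 = imdb_rule("0810900")
rule_2 = imdb_rule("0053198")
general = generalize(rule_1, rule_2)
print(general[0]["k(4.2,2.1)"])
```

The merged rule also fires on a photogallery URL with an unseen title id such as the made-up `0000001`, whereas Rule 1 alone would not match it.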

Evaluation

•Dataset
•Rule Numbers after each step

Evaluation (cont.)

•Small dataset
•Large dataset

Conclusion

•Presented a set of scalable and robust techniques for de-duplication of URLs
  ▫Basic and deep tokenization
  ▫Rule generation and generalization
•Easy adaptability to the MapReduce paradigm
•Evaluated effectiveness on both small and large datasets

Thanks for your attention

•Questions?

Algorithm 1

Algorithm 2

Algorithm 3

Algorithm 4

Algorithm 5

Definitions of URL

•URL: A URL u is defined as a function
  ▫u : K → V ∪ {⊥}
  ▫K: keys
    ▪k(x.i,y.j)
    ▪x, y represent the position index from the start and end of the URL
    ▪i, j represent the deep token index
  ▫V: values
  ▫A key not present in the URL is denoted by ⊥

Definitions of Rule

•RULE: A Rule r is defined as a function
  ▫r : C → T
  ▫C: context
    ▪C : K → V ∪ {∗}
  ▫T: transformation
    ▪T : K → V ∪ {⊥, K’}
    ▪K’ = K ∪ ValueConversions
    ▪ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
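The three kinds of transformation target in this definition (a literal value from V, the missing marker ⊥, or a reference into K’ with an optional value conversion) can be modelled as follows. This Python sketch is one possible reading of the definition, not code from the paper: `KeyRef`, `apply_transformation`, the use of `None` for ⊥, and the example host `WWW.Example.COM` are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KeyRef:
    """Reference to another key's value, optionally passed through a value conversion."""
    key: str
    convert: Optional[Callable[[str], str]] = None


def apply_transformation(url_keys, transformation):
    """Apply T to a keyed URL. Targets: None (stands for ⊥), a KeyRef, or a literal."""
    result = dict(url_keys)
    for key, target in transformation.items():
        if target is None:
            result.pop(key, None)  # ⊥: the key is absent from the output URL
        elif isinstance(target, KeyRef):
            value = url_keys.get(target.key)
            if value is None:
                result.pop(key, None)  # referenced key is absent, so the output is too
            else:
                result[key] = target.convert(value) if target.convert else value
        else:
            result[key] = target  # literal value from V
    return result


# Made-up keyed URL; Lowercase(K) is modelled with str.lower.
keys = {"k(1,3)": "http", "k(2,2)": "WWW.Example.COM", "k(3,1)": "Index"}
out = apply_transformation(
    keys,
    {"k(2,2)": KeyRef("k(2,2)", str.lower), "k(3,1)": None},
)
print(out)
```

Encode(K) and Decode(K) could be modelled the same way, for instance by passing `urllib.parse.quote` or `urllib.parse.unquote` as `convert`, although the slide does not say which encoding is meant.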

Rule Coverage


MapReduce
