Web Characteristics CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Some slides have been adapted from: Profs. Leskovec, Rajaraman, and Ullman (Mining of Massive Datasets course, Stanford)

Web Characteristics

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2017

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Some slides have been adapted from: Profs. Leskovec, Rajaraman, and Ullman (Mining of Massive Datasets course, Stanford)


Web document collection

No design/co-ordination

Distributed content creation, linking, democratization of publishing

Content includes truth, lies, obsolete information, contradictions …

Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …

Scale much larger than previous text collections … but corporate records are catching up

Growth – slowed down from initial “volume doubling every few months” but still expanding

Content can be dynamically generated

The Web

Sec. 19.2


Web search basics

[Figure: components of web search: the Web, web spider, indexer, indexes, ad indexes, search, user. Shown with a sample results page for the query “miele” (results 1 - 10 of about 7,310,000 in 0.12 seconds) listing algorithmic results and sponsored links.]

Sec. 19.4.1


Web graph

HTML pages together with hyperlinks between them

Can be modeled as a directed graph

Anchor text: text surrounding the origin of the hyperlink on page A


SPAM (SEARCH ENGINE OPTIMIZATION)


The trouble with paid search ads …

It costs money. What’s the alternative?

Search Engine Optimization:

“Tuning” your web page to rank highly in the algorithmic search results for selected keywords

Alternative to paying for placement

Thus, intrinsically a marketing function

Performed by companies, webmasters & consultants (“search engine optimizers”) for their clients

Some perfectly legitimate, some very shady

Sec. 19.2.2


Search Engine Optimizer (SEO)

Motives

Commercial, political, religious, lobbies

Promotion funded by advertising budget

Operators

Contractors (search engine optimizers) for lobbies, companies

Webmasters

Hosting services

Forums

E.g., WebmasterWorld (www.webmasterworld.com)

Search-engine-specific tricks

Discussions about academic papers

Sec. 19.2.2


Simplest forms

First-generation engines relied heavily on tf/idf

The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s

SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort

Often, the repetitions would be in the same color as the background of the web page

Repeated terms got indexed by crawlers

But not visible to humans on browsers

Pure word density cannot be trusted as an IR signal

Sec. 19.2.2
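A minimal sketch of why pure word density is easy to game. The scoring function `tf_score` and both page texts are illustrative assumptions, not the ranking of any real engine: the score is simply the raw count of query terms on the page.

```python
from collections import Counter

def tf_score(query, page_text):
    """Toy ranker: sum of raw term frequencies of the query terms in the page."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())

# Hypothetical pages: one honest description, one stuffed with the target terms.
honest_page = "our maui resort offers ocean views and a quiet beach"
stuffed_page = "maui resort " * 50 + "buy cheap pills here"

query = "maui resort"
honest = tf_score(query, honest_page)    # each query term appears once
stuffed = tf_score(query, stuffed_page)  # each query term appears 50 times
```

Under this toy scorer the stuffed page outranks the honest one by a wide margin, which is the weakness the slide describes.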


Variants of keyword stuffing

Misleading meta-tags, excessive repetition

Hidden text with colors, style sheet tricks, etc.

Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, …”

Sec. 19.2.2


Cloaking

Serve fake content to search engine spider

DNS cloaking: Switch IP address. Impersonate

[Figure: cloaking decision. “Is this a search engine spider?” Y: serve SPAM; N: serve the real doc.]

Sec. 19.2.2


More spam techniques

Doorway pages

Pages optimized for a single keyword that re-direct to the real target page

Link spamming

Mutual admiration societies, hidden links, awards

Domain flooding: numerous domains that point or re-direct to a target page

Robots

Fake query stream – rank checking programs

“Curve-fit” ranking programs of search engines

Millions of submissions via Add-Url

Sec. 19.2.2


The war against spam

Quality signals: prefer authoritative pages based on:

Votes from authors (linkage signals)

Votes from users (usage signals)

Policing of URL submissions

Anti-robot test

Limits on meta-keywords

Robust link analysis

Ignore statistically implausible linkage (or text)

Use link analysis to detect spammers (guilt by association)

Spam recognition by machine learning

Training set based on known spam

Family-friendly filters

Linguistic analysis, general classification techniques, etc.

For images: flesh-tone detectors, source text analysis, etc.

Editorial intervention

Blacklists

Top queries audited

Complaints addressed

Suspect pattern detection


More on spam

Web search engines have policies on SEO practices they tolerate/block

http://help.yahoo.com/help/us/ysearch/index.html

http://www.google.com/intl/en/webmasters/

Adversarial IR: the unending (technical) battle between SEOs and web search engines

Research: http://airweb.cse.lehigh.edu/


Understanding the users


User needs

Need [Brod02, RL04]

Informational – want to learn about something (~40% / 65%), e.g., “low hemoglobin”

Navigational – want to go to that page (~25% / 15%), e.g., “United Airlines”

Transactional – want to do something (web-mediated) (~35% / 20%)

Access a service, e.g., “Seattle weather”

Downloads, e.g., “Mars surface images”

Shop, e.g., “Canon S410”

Gray areas

Find a good hub, e.g., “car rental Brasil”

Exploratory search: “see what’s there”

Sec. 19.4.1


Users’ empirical evaluation of results

Quality of pages varies widely

Relevance is not enough

Other desirable qualities (non-IR!!)

Content: trustworthy, diverse, non-duplicated, well maintained

Web readability: display correctly & fast

No annoyances: pop-ups, etc.

Precision vs. recall:

On the web, recall seldom matters

What matters: precision at 1? Precision above the fold?

Comprehensiveness: must be able to deal with obscure queries

Recall matters when the number of matches is very small

User perceptions may be unscientific, but are significant over a large aggregate


Users’ empirical evaluation of engines

Relevance and validity of results

Coverage of topics for polysemic queries

Trust – Results are objective

UI – Simple, no clutter, error tolerant

Pre/post-process tools provided

Mitigate user errors (auto spell check, search assist, …)

Explicit: search within results, more like this, refine ...

Anticipative: related searches

Deal with idiosyncrasies

Web-specific vocabulary (impact on stemming, spell-check, etc.)

Web addresses typed in the search box


DUPLICATE DETECTION


Duplicate documents

The web is full of duplicated content

Strict duplicate detection = exact match

Not as common

But many, many cases of near duplicates

E.g., last-modified date the only difference between two copies of a page

Sec. 19.6


Duplicate/near-duplicate detection

Duplication: exact match can be detected with fingerprints

Near-duplication: approximate match

Overview

Compute syntactic similarity with an edit-distance measure

Use similarity threshold to detect near-duplicates

E.g., similarity > 80% ⇒ docs are “near duplicates”

Not transitive, though sometimes used transitively

Sec. 19.6
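A minimal sketch of exact-duplicate detection with fingerprints. The slides do not say which fingerprint function is used; SHA-256 from Python's `hashlib` is an assumed stand-in, and the three documents are made up for illustration.

```python
import hashlib

def fingerprint(text):
    """Fixed-size digest of the document text; equal texts give equal fingerprints."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_duplicate_groups(docs):
    """Group document ids that share a fingerprint (docs: dict of id -> text)."""
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(fingerprint(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical collection: d3 differs from d1 only by a last-modified note.
docs = {
    "d1": "a rose is red",
    "d2": "a rose is red",
    "d3": "a rose is red (last modified 2017-10-01)",
}
dupes = exact_duplicate_groups(docs)
```

Here `d1` and `d2` are grouped, while `d3` is missed because a single changed character gives a different fingerprint. That gap is what near-duplicate detection addresses.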


Computing Similarity

Features:

Segments of a doc (natural or artificial breakpoints)

Shingles (Word N-Grams)

Similarity Measure between two docs (= sets of shingles)

Jaccard coefficient: $\frac{|A \cap B|}{|A \cup B|}$

Sec. 19.6


Example

Doc A: “a rose is red a rose is white”

Doc B: “a rose is white a rose is red”

Doc A: 4-shingles (word 4-grams)

“a rose is red”

“rose is red a”

“is red a rose”

“red a rose is”

“a rose is white”

Doc B: 4-shingles (word 4-grams)

“a rose is white”

“rose is white a”

“is white a rose”

“white a rose is”

“a rose is red”

The docs share 2 shingles out of 8 distinct shingles overall, so $\mathit{Jaccard} = 2/8 = 0.25$
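The example can be reproduced with a short sketch. The helper names `shingles` and `jaccard` are illustrative, and splitting on whitespace is an assumed tokenization.

```python
def shingles(text, k=4):
    """Set of word k-grams (k-shingles) of the text, using whitespace tokenization."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

doc_a = "a rose is red a rose is white"
doc_b = "a rose is white a rose is red"
shingles_a = shingles(doc_a)
shingles_b = shingles(doc_b)
similarity = jaccard(shingles_a, shingles_b)
```

Each doc yields five 4-shingles; two are shared and eight are distinct overall, so `similarity` is 0.25.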


Shingles + Set Intersection

Computing exact set intersection of shingles between all pairs of docs is expensive/intractable

Approximate using a cleverly chosen subset of shingles from each (called a sketch)

Estimate $\frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$ based on short sketches of Doc A and Doc B

[Figure: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the two sketches are compared to estimate Jaccard]

Sec. 19.6


From sets to Boolean matrices

Rows = elements of the universal set.

Example: the set of all k-shingles.

Columns = sets.

View sets as columns of a matrix $C$; one row for each element in the universe of shingles.

$C_{ij} = 1$ indicates presence of shingle $i$ in set $j$

Typical matrix is sparse.


Example: Column similarity


Four types of rows (for a pair of columns)

For columns $C_i$, $C_j$, four types of rows:

$C_i$  $C_j$

1  1

1  0

0  1

0  0

$n_{11}$: # of rows where both columns are 1 (# of items that exist in both sets $C_i$ and $C_j$)

$n_{10}$: # of rows where $C_i$ contains 1 but $C_j$ contains 0 (# of items that exist in $C_i$ but not in $C_j$)

and so on

$\mathit{Jaccard}(C_i, C_j) = \frac{n_{11}}{n_{10} + n_{01} + n_{11}}$
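A minimal sketch of the row-type counts for one pair of columns. The two columns are made-up data, and the function names are illustrative.

```python
def row_type_counts(col_i, col_j):
    """Count the four row types (n11, n10, n01, n00) for two 0/1 columns."""
    pairs = list(zip(col_i, col_j))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    return n11, n10, n01, n00

def jaccard_from_columns(col_i, col_j):
    """n11 / (n10 + n01 + n11); assumes at least one column has a 1 somewhere."""
    n11, n10, n01, _ = row_type_counts(col_i, col_j)
    return n11 / (n10 + n01 + n11)

# Hypothetical columns over a universe of six shingles.
c_i = [1, 1, 0, 0, 1, 0]
c_j = [1, 0, 1, 0, 1, 0]
counts = row_type_counts(c_i, c_j)
similarity = jaccard_from_columns(c_i, c_j)
```

Rows of type 0-0 do not enter the formula, which is why sparse matrices lose nothing when only the 1-entries are stored.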


Minhashing

Imagine the rows permuted randomly, and define the minhash function:

$h_\pi(C_i)$ = index of the first row with a 1 in column $C_i$ (after the permutation $\pi$ on rows)

Use several (e.g., 100) independent permutations to create a signature for each column.

The signatures can be displayed in another matrix:

The signature matrix, whose columns represent the sets and whose rows represent the minhash values, in order, for that column.
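A minimal sketch of building a signature matrix with explicit row permutations. The input columns, the seed, and the function names are assumptions for illustration; every column is assumed to contain at least one 1.

```python
import random

def minhash(column, perm):
    """perm[r] is the new position of original row r. Returns the smallest
    new position among the rows where the column has a 1."""
    return min(perm[r] for r, bit in enumerate(column) if bit == 1)

def signature_matrix(columns, num_perms, seed=0):
    """One signature row per random permutation, one entry per column (set)."""
    rng = random.Random(seed)
    n_rows = len(columns[0])
    signatures = []
    for _ in range(num_perms):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # a uniformly random permutation of the rows
        signatures.append([minhash(col, perm) for col in columns])
    return signatures

# Hypothetical sets over five shingles; columns 0 and 2 are identical.
columns = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]
sig = signature_matrix(columns, num_perms=100)
```

Identical columns always receive identical signatures, whatever permutations are drawn.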


Minhashing: example

Randomly permute rows through permutation $\pi$

For each permutation $\pi$, $h_\pi(C_i)$ denotes the index of the first row with a 1 in column $C_i$ (after the permutation $\pi$)


A property of minhashing

The probability (over all permutations of the rows) that $h_\pi(C_i) = h_\pi(C_j)$ is the same as $\mathit{Jaccard}(C_i, C_j)$.

Both are $n_{11} / (n_{01} + n_{10} + n_{11})$.

Proof sketch:

• Look down the permuted columns $C_i$ and $C_j$ until we see a 1.

• If both columns have a 1 in this row, then $h_\pi(C_i) = h_\pi(C_j)$. However, if only one of them contains a 1, then $h_\pi(C_i) \neq h_\pi(C_j)$.


Proof

We intend to show $P[h_\pi(C_i) = h_\pi(C_j)] = \mathit{Jaccard}(C_i, C_j)$

Look down columns $C_i$, $C_j$ until the first row in which at least one of $C_i$ or $C_j$ is non-zero ⇒ the shingle corresponding to this row is in $C_i \cup C_j$.

If both $C_i$ and $C_j$ are non-zero in this row (so $h_\pi(C_i) = h_\pi(C_j)$), the corresponding shingle is in $C_i \cap C_j$.

Thus, each permutation in effect selects a uniformly random element of $C_i \cup C_j$ and checks whether it also lies in $C_i \cap C_j$.

Therefore, the expectation of the indicator $I(h_\pi(C_i) = h_\pi(C_j))$ over permutations $\pi$ equals $\mathit{Jaccard}(C_i, C_j)$

Sec. 19.6
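The claim can be checked exhaustively on a tiny example. This sketch enumerates all 720 permutations of six rows and compares the exact agreement probability with the Jaccard coefficient; the two columns and the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations

def minhash(column, perm):
    """Smallest permuted position among the rows where the column has a 1."""
    return min(perm[r] for r, bit in enumerate(column) if bit == 1)

def agreement_probability(col_i, col_j):
    """Exact fraction of row permutations on which the two minhash values agree."""
    agree = total = 0
    for perm in permutations(range(len(col_i))):
        total += 1
        if minhash(col_i, perm) == minhash(col_j, perm):
            agree += 1
    return Fraction(agree, total)

def jaccard(col_i, col_j):
    """n11 / (n10 + n01 + n11) as an exact fraction."""
    n11 = sum(a & b for a, b in zip(col_i, col_j))
    n_union = sum(a | b for a, b in zip(col_i, col_j))
    return Fraction(n11, n_union)

c_i = [1, 1, 0, 0, 1, 0]
c_j = [1, 0, 1, 0, 1, 0]
p_agree = agreement_probability(c_i, c_j)
j = jaccard(c_i, c_j)
```

Enumeration is only feasible for a handful of rows; it is meant as a sanity check of the property, not as a method.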


Similarity for signatures

The similarity of signatures is the fraction of the minhash functions in which they agree.

Thinking of signatures as columns of integers, the similarity of signatures is the fraction of rows in which they agree.

Thus, the expected similarity of two signatures equals the Jaccard similarity of the columns or sets that the signatures represent.

And the longer the signatures, the smaller the expected error will be.


Min Hashing: Example


Sketch of a document

Create a “sketch vector” (of size ~200) for each doc $D$:

Let $f$ map all shingles to $0 \ldots 2^m - 1$ (e.g., $f$ = fingerprinting).

That is, it maps each shingle to an $m$-bit integer.

For $i = 1$ to the size of the sketch vector:

Let $\pi_i$ be a random permutation

Let $\mathit{Sketch}_D[i]$ be the minimum shingle value for doc $D$ after applying permutation $\pi_i$ to these numbers

Docs that share ≥ threshold (e.g., 90%) of corresponding sketch vector elements are near duplicates

If the size of the sketch vector is $M$, we have $\mathit{Jaccard}(A, B) \approx \frac{1}{M} \sum_{i=1}^{M} I(\mathit{Sketch}_A[i] = \mathit{Sketch}_B[i])$

Sec. 19.6
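A minimal sketch of the estimator and the threshold test, given two sketch vectors that have already been computed. The vectors below are made-up values of size 10 rather than ~200, and the function names are illustrative.

```python
def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions i where Sketch_A[i] == Sketch_B[i]."""
    if len(sketch_a) != len(sketch_b) or not sketch_a:
        raise ValueError("sketch vectors must be non-empty and of equal size")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)

def near_duplicates(sketch_a, sketch_b, threshold=0.9):
    """True when the sketches agree on at least `threshold` of their positions."""
    return estimate_jaccard(sketch_a, sketch_b) >= threshold

# Hypothetical sketch vectors that differ in exactly one position.
sketch_a = [17, 4, 52, 9, 33, 8, 71, 2, 46, 13]
sketch_b = [17, 4, 52, 9, 33, 8, 71, 2, 46, 99]
estimate = estimate_jaccard(sketch_a, sketch_b)
is_near_dup = near_duplicates(sketch_a, sketch_b)
```

With a real sketch size of ~200 the estimate is less noisy; the comparison itself is unchanged.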


Computing Sketch[i] for Doc1

[Figure: the shingles of Document 1 shown as points on number lines running from 0 to $2^{64}$]

Start with 64-bit f(shingles)

Permute on the number line

Pick the min value

Sec. 19.6


Test if Doc1.Sketch[i] = Doc2.Sketch[i]

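[Figure: the shingles of Document 1 and Document 2 on number lines running from 0 to $2^{64}$; after permutation, the minimum values are A (Document 1) and B (Document 2)]

Are these equal?

Test for 200 random permutations: p1, p2, …, p200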

Sec. 19.6


However…

[Figure: Document 1 and Document 2 on number lines running from 0 to $2^{64}$, with minimum values A and B]

A = B exactly when the shingle with the MIN value over both Doc1 and Doc2 (i.e., over their union) is common to both (i.e., lies in the intersection)

Sec. 19.6


Min-Hash sketches: summary

Use $f$ to map shingles to $m$ bits

Pick $P$ random row permutations of the numbers

MinHash sketch

$\mathit{Sketch}_C[i]$ is the first row with a 1 in column $C$ under the $i$-th permutation

Similarity of signatures

Let $\mathit{sim}[\mathit{sketch}_C, \mathit{sketch}_{C'}]$ = fraction of identical elements in the vectors $\mathit{sketch}_C$ and $\mathit{sketch}_{C'}$, i.e., the fraction of permutations where the MinHash values agree

Observe $\mathit{sim}[\mathit{sketch}_C, \mathit{sketch}_{C'}] \approx \mathit{Jaccard}(C, C')$

Sec. 19.6


Implementation of Min-Hashing

Suppose one billion rows.

Hard to pick a random permutation of 1…billion.

Representing a random permutation requires 1 billion entries.

Accessing rows in permuted order leads to thrashing.


Implementation of Min-Hashing

A good approximation to permuting rows:

Pick, say, 100 hash functions.

$h_i(\cdot)$ gives the order of rows for the $i$-th permutation.

For each column $C$ and each hash function $h_i$, $\mathit{Sketch}_C[i]$ becomes the smallest value of $h_i(r)$ over the rows $r$ in which column $C$ has a 1.
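A minimal sketch of the hash-function approximation. The slides do not name a hash family; the affine form $h(r) = (a r + b) \bmod p$ with a prime $p$ is an assumed, commonly used choice, and the row sets, seed, and function names are illustrative.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, assumed larger than any row index

def make_hash_functions(num_hashes, seed=0):
    """Draw (a, b) pairs defining h(r) = (a * r + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def sketch(rows_with_one, hash_functions):
    """Sketch_C[i] = smallest h_i(r) over the rows r where column C has a 1."""
    return [min((a * r + b) % PRIME for r in rows_with_one)
            for a, b in hash_functions]

def similarity(sketch_1, sketch_2):
    """Fraction of hash functions on which the two sketches agree."""
    matches = sum(1 for x, y in zip(sketch_1, sketch_2) if x == y)
    return matches / len(sketch_1)

hashes = make_hash_functions(100)
# Columns given sparsely as the sets of row indices that hold a 1.
col_a = {0, 3, 7, 12}
col_b = {0, 3, 7, 12}   # identical to col_a
col_c = {1, 4, 9}       # disjoint from col_a
sim_ab = similarity(sketch(col_a, hashes), sketch(col_b, hashes))
sim_ac = similarity(sketch(col_a, hashes), sketch(col_c, hashes))
```

Only the rows holding a 1 are visited, so no permutation table is stored and rows can be read in their natural order.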


All signature pairs

We have an extremely efficient method for estimating a Jaccard coefficient for a single pair of docs.

But we still have to estimate $N^2$ coefficients, where $N$ is the number of web pages (still slow)

One solution: locality sensitive hashing (LSH)

Another solution: sorting (Henzinger 2006)

Sec. 19.6
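The slides only name LSH as a way to avoid comparing all pairs. One common realization for minhash signatures is banding: split each signature into bands and treat docs that agree on any whole band as candidate pairs. The sketch below assumes that variant; the signatures, band sizes, and function name are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands, rows_per_band):
    """signatures: dict of doc_id -> list of bands * rows_per_band minhash values.
    Returns the pairs of doc ids that agree on every value of at least one band."""
    candidates = set()
    for band in range(bands):
        start = band * rows_per_band
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[start:start + rows_per_band])].append(doc_id)
        for doc_ids in buckets.values():
            for pair in combinations(sorted(doc_ids), 2):
                candidates.add(pair)
    return candidates

# Hypothetical signatures of length 6, split into 3 bands of 2 values each.
signatures = {
    "d1": [1, 5, 2, 8, 3, 9],
    "d2": [1, 5, 7, 8, 3, 9],
    "d3": [4, 6, 0, 2, 7, 1],
}
pairs = candidate_pairs(signatures, bands=3, rows_per_band=2)
```

Only the candidate pairs then need a full signature comparison, so most of the $N^2$ comparisons are never made.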


More resources

IIR Chapter 19
