Date post: 12-Mar-2016
Upload: aarti-sharma
Special Topics in Search Engines: Result Summaries, Anti-spamming, Duplicate Elimination
Transcript
Page 1: seo service

Special Topics in Search Engines

Result Summaries
Anti-spamming

Duplicate elimination

Page 2: seo service

Result summaries

Page 3: seo service

Summaries

Having ranked the documents matching a query, we wish to present a results list. Most commonly: the document title plus a short summary. The title is typically automatically extracted from document metadata. What about the summaries?

Page 4: seo service

Summaries

Two basic kinds: static and dynamic.

A static summary of a document is always the same, regardless of the query that hit the doc.

Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.

Page 5: seo service

Static summaries

In typical systems, the static summary is a subset of the document.

Simplest heuristic: the first 50 (or so; this can be varied) words of the document. The summary is cached at indexing time.

More sophisticated: extract from each document a set of “key” sentences. Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.

Most sophisticated: NLP used to synthesize a summary. Seldom used in IR; cf. text summarization work.
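The first two heuristics are easy to sketch in code. The following is a toy illustration (not from any particular engine; the sentence-scoring rule here is an assumed stand-in for the "simple NLP heuristics" mentioned above):

```python
import re

def first_k_words(doc, k=50):
    """Simplest static summary: the first k words, cached at index time."""
    return " ".join(doc.split()[:k])

def key_sentence_summary(doc, num_sentences=2):
    """More sophisticated: score each sentence with a crude heuristic
    (here: longer sentences score higher, later ones are penalized)
    and keep the top scorers in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())

    def score(idx_sent):
        idx, sent = idx_sent
        return len(sent.split()) - 0.5 * idx  # length bonus, position penalty

    top = sorted(enumerate(sentences), key=score, reverse=True)[:num_sentences]
    return " ".join(s for _, s in sorted(top))  # restore document order
```

Because the static summary does not depend on the query, both functions can be run once per document at indexing time and their output cached alongside the index.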

Page 6: seo service

Dynamic summaries

Present one or more “windows” within the document that contain several of the query terms. “KWIC” snippets: KeyWord In Context presentation.

Generated in conjunction with scoring: if the query is found as a phrase, show the/some occurrences of the phrase in the doc; if not, show windows within the doc that contain multiple query terms.

The summary itself gives the entire content of the window (all terms, not only the query terms). How?
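Window extraction over a cached document text can be sketched as follows (a minimal illustration, assuming the document is cached as plain text; the window size and snippet count are arbitrary parameters):

```python
def kwic_snippets(doc, query_terms, window=8, max_snippets=2):
    """Dynamic summary: find windows of the cached document text that
    contain query terms, and return them with surrounding context
    (all terms in the window, not only the query terms)."""
    words = doc.split()
    terms = {t.lower() for t in query_terms}
    snippets, covered_upto = [], 0
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") in terms and i >= covered_upto:
            lo, hi = max(0, i - window // 2), i + window // 2 + 1
            snippets.append("... " + " ".join(words[lo:hi]) + " ...")
            covered_upto = hi  # avoid overlapping snippets
            if len(snippets) == max_snippets:
                break
    return snippets
```

A production system would instead cue directly to hit positions from the positional index, as described on the next slide, rather than rescan the text.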

Page 7: seo service

Generating dynamic summaries

If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits. If we cache the documents at index time, we can run the window through the cached copy, cueing to hits found in the positional index. E.g., the positional index says “the query is a phrase at position 4378”, so we go to this position in the cached document and stream out the content.

Most often, a fixed-size prefix of the doc is cached. Note: the cached copy can be outdated.

Page 8: seo service

Dynamic summaries

Producing good dynamic summaries is a tricky optimization problem:
The real estate for the summary is normally small and fixed.
Want a short item, so show as many KWIC matches as possible, and perhaps other things like the title.
Want snippets to be long enough to be useful.
Want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
Want snippets maximally informative about the doc.

But users really like snippets, even if they complicate IR system design.

Page 9: seo service

Anti-spamming

Page 10: seo service

Adversarial IR (Spam)

Motives: commercial, political, religious, lobbies; promotion funded by an advertising budget.

Operators: contractors (“Search Engine Optimizers”) for lobbies and companies; web masters; hosting services.

Forum: Webmaster World ( www.webmasterworld.com ) — search-engine-specific tricks; discussions about academic papers.

Page 11: seo service

Search Engine Optimization I
Adversarial IR

(“search engine wars”)

Page 12: seo service

Can you trust words on the page?

Examples from July 2002: auctions.hitsoffice.com/ and www.ebay.com/ (screenshots showing pornographic content).

Page 13: seo service

Simplest forms

Early engines relied on the density of terms: the top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s.

SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort. Often, the repetitions would be in the same color as the background of the web page: repeated terms got indexed by crawlers but were not visible to humans in browsers.

Can’t trust the words on a web page, for ranking.
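A toy version of such a density-based ranking signal makes the vulnerability obvious (this is an illustrative caricature, not any real engine's scoring function):

```python
from collections import Counter

def density_score(doc, query):
    """Naive early-engine signal: fraction of the document's terms
    that are query terms (term density)."""
    words = doc.lower().split()
    counts = Counter(words)
    hits = sum(counts[t] for t in query.lower().split())
    return hits / len(words)

# An honest page vs. a page stuffed with (possibly invisible) repetitions:
honest = "our maui resort offers beachfront rooms and a quiet garden"
spam = "maui resort " * 20  # dense repetition, e.g. colored like the background
```

Here the stuffed page scores a perfect density of 1.0 while the honest page scores 0.2, which is exactly why density alone cannot be trusted for ranking.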

Page 14: seo service

A few spam technologies

Cloaking: serve fake content to the search engine robot. DNS cloaking: switch IP address; impersonate.

Doorway pages: pages optimized for a single keyword that redirect to the real target page.

Keyword spam: misleading meta-keywords, excessive repetition of a term, fake “anchor text”, hidden text with colors, CSS tricks, etc.

Link spamming: mutual admiration societies, hidden links, awards. Domain flooding: numerous domains that point or redirect to a target page.

Robots: fake click stream, fake query stream, millions of submissions via Add-Url.

Page 15: seo service

More spam techniques

Cloaking: serve fake content to the search engine spider. DNS cloaking: switch IP address; impersonate.

[Diagram: “Is this a search engine spider?” Y → serve the SPAM page; N → serve the real doc. Cloaking.]

Page 16: seo service

[Screenshot: tutorial on “Cloaking & Stealth Technology”]

Page 17: seo service

Variants of keyword stuffing

Misleading meta-tags, excessive repetition. Hidden text with colors, style-sheet tricks, etc.

Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”

Page 18: seo service

More spam techniques

Doorway pages: pages optimized for a single keyword that redirect to the real target page.

Link spamming: mutual admiration societies, hidden links, awards (more on these later). Domain flooding: numerous domains that point or redirect to a target page.

Robots: fake query stream — rank-checking programs that “curve-fit” the ranking programs of search engines; millions of submissions via Add-Url.

Page 19: seo service

The war against spam

Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals).

Policing of URL submissions: anti-robot tests; limits on meta-keywords.

Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association).

Page 20: seo service

The war against spam

Spam recognition by machine learning: training set based on known spam.

Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images: flesh-tone detectors, source-text analysis, etc.

Editorial intervention: blacklists, top queries audited, complaints addressed.

Page 21: seo service

Acid test

Which SEOs rank highly on the query seo? Web search engines have policies on the SEO practices they tolerate or block; see pointers in Resources.

Adversarial IR: the unending (technical) battle between SEOs and web search engines. See for instance http://airweb.cse.lehigh.edu/

Page 22: seo service

Duplicate detection

Page 23: seo service

Duplicate/Near-Duplicate Detection

Duplication: exact match with fingerprints. Near-duplication: approximate match.

Overview: compute syntactic similarity with an edit-distance measure, and use a similarity threshold to detect near-duplicates. E.g., similarity > 80% => documents are “near duplicates”. Not transitive, though sometimes used transitively.

Page 24: seo service

Computing Similarity

Features: segments of a document (natural or artificial breakpoints) [Brin95]; shingles (word k-grams) [Brin95, Brod98]. E.g., for k = 4, “a rose is a rose is a rose” => { a_rose_is_a, rose_is_a_rose, is_a_rose_is }.

Similarity measure between two docs (= sets of shingles): set intersection [Brod98] — specifically, Size_of_Intersection / Size_of_Union, the Jaccard measure.
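Shingling and the Jaccard measure fit in a few lines (a direct transcription of the definitions above):

```python
def shingles(text, k=4):
    """Word k-grams ("shingles") of a document, as a set."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Jaccard measure: Size_of_Intersection / Size_of_Union."""
    return len(s1 & s2) / len(s1 | s2)
```

Note that because shingles form a set, repeated k-grams (as in the rose example) collapse to a single element.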

Page 25: seo service

Shingles + Set Intersection

Computing the exact set intersection of shingles between all pairs of documents is expensive. Instead, approximate using a cleverly chosen subset of shingles from each document (a sketch), and estimate Jaccard from the short sketch.

Create a “sketch vector” (e.g., of size 200) for each document. Documents which share more than t (say 80%) of corresponding vector elements are deemed similar.

For doc d, sketch_d[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m (with m = 64).
Let π_i be a specific random permutation on 0..2^m.
sketch_d[i] = MIN of π_i(f(s)) over all shingles s in d.
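A sketch of this computation follows. Two assumptions are worth flagging: the fingerprint map f is implemented with a cryptographic hash (the slides only require some map to m-bit integers), and the random permutations π_i are approximated by universal hash functions (a·x + b) mod p, a standard stand-in for true random permutations:

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime larger than the fingerprint range

def shingle_fingerprint(shingle):
    """f: map each shingle to an integer (here, 56 bits of a SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:7], "big")

def make_permutations(n, seed=0):
    """n "random permutations", approximated by pi(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]

def sketch(doc_shingles, perms):
    """sketch[i] = min over shingles s of pi_i(f(s))."""
    fps = [shingle_fingerprint(s) for s in doc_shingles]
    return [min((a * f + b) % P for f in fps) for a, b in perms]

def estimated_jaccard(sk1, sk2):
    """Fraction of corresponding sketch entries that agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)
```

With a sketch of size 200, two near-duplicate documents agree on most entries, while unrelated documents agree on almost none.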

Page 26: seo service

Shingling with sampling minima

Given two documents A1, A2, let S1 and S2 be their shingle sets.
Resemblance = |S1 ∩ S2| / |S1 ∪ S2|.
For a random permutation π, let Alpha = min(π(S1)) and Beta = min(π(S2)).
Then Probability(Alpha = Beta) = Resemblance.

Page 27: seo service

Computing Sketch[i] for Doc 1

[Diagram: start with the 64-bit shingles of Document 1 on the number line 0..2^64; permute the number line with π_i; pick the minimum value as Sketch[i].]

Page 28: seo service

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Diagram: the 64-bit shingles of Document 1 and Document 2 are permuted on 0..2^64 with the same π_i, giving minima A and B. Are these equal?]

Test for 200 random permutations: π_1, π_2, …, π_200.

Page 29: seo service

However…

[Diagram: the shingles of Document 1 and Document 2 on the permuted number line 0..2^64, with minima A and B.]

A = B iff the shingle with the MIN value in the union of Doc 1 and Doc 2 is common to both (i.e., lies in the intersection). Why? This happens with probability Size_of_intersection / Size_of_union.

Page 30: seo service

Set Similarity (Jaccard measure)

View sets as columns of a matrix, with one row for each element in the universe; a_ij = 1 indicates the presence of item i in set j.

sim_J(C_i, C_j) = |C_i ∩ C_j| / |C_i ∪ C_j|

Example:

C1 C2
 0  1
 1  0
 1  1    sim_J(C1, C2) = 2/5 = 0.4
 0  0
 1  1
 0  1

Page 31: seo service

Key Observation

For columns C_i, C_j, there are four types of rows:

    C_i  C_j
A    1    1
B    1    0
C    0    1
D    0    0

Overload notation: A = # of rows of type A (and similarly B, C, D). Claim:

sim_J(C_i, C_j) = A / (A + B + C)

Page 32: seo service

Min Hashing

Randomly permute the rows, and let h(C_i) = index of the first row with a 1 in column C_i.

Surprising property:

P[ h(C_i) = h(C_j) ] = sim_J(C_i, C_j)

Why? Both are A/(A+B+C): look down columns C_i, C_j until the first non-type-D row; h(C_i) = h(C_j) iff that row is of type A.
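The property can be checked empirically on the running example (a small simulation sketch; the function names are my own):

```python
import random

def min_hash(column, permutation):
    """h(C) = index (under the permutation) of the first row with a 1."""
    return min(permutation[row] for row, bit in enumerate(column) if bit)

def agreement_rate(c1, c2, trials=2000, seed=42):
    """Estimate P[h(C1) = h(C2)] over random row permutations."""
    rng = random.Random(seed)
    rows = list(range(len(c1)))
    hits = 0
    for _ in range(trials):
        perm = rows[:]
        rng.shuffle(perm)  # perm[row] = new index of that row
        hits += min_hash(c1, perm) == min_hash(c2, perm)
    return hits / trials

# Columns from the slide's example, where sim_J(C1, C2) = 2/5 = 0.4:
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
```

Running `agreement_rate(C1, C2)` gives a value close to 0.4, matching the Jaccard measure computed from the matrix.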

Page 33: seo service

Mirror Detection

Mirroring is the systematic replication of web pages across hosts; it is the single largest cause of duplication on the web.

Host1/α and Host2/β are mirrors iff, for all (or most) paths p, when http://Host1/α/p exists, http://Host2/β/p exists as well, with identical (or near-identical) content, and vice versa.

Page 34: seo service

Mirror Detection example

http://www.elsevier.com/ and http://www.elsevier.nl/

Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/

Page 35: seo service

Repackaged Mirrors

[Screenshots, Aug 2001: Auctions.msn.com and Auctions.lycos.com]

Page 36: seo service

Motivation

Why detect mirrors?
Smart crawling: fetch from the fastest or freshest server; avoid duplication.
Better connectivity analysis: combine inlinks; avoid double-counting outlinks.
Redundancy in result listings: “if that fails you can try: <mirror>/samepath”.
Proxy caching.

Page 37: seo service

Bottom-Up Mirror Detection [Cho00]

Maintain clusters of subgraphs.
Initialize clusters of trivial subgraphs: group near-duplicate single documents into a cluster.
Subsequent passes: merge clusters of the same cardinality and corresponding linkage; avoid decreasing cluster cardinality.

To detect mirrors we need: adequate path overlap, and contents of corresponding pages within a small time range.

Page 38: seo service

Can we use URLs to find mirrors?

[Diagram: directory trees of www.synthesis.org and synthesis.stanford.edu]

www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
www.synthesis.org/Docs/annual.report96.final.html
www.synthesis.org/Docs/cicee-berlin-paper.html
www.synthesis.org/Docs/myr5
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
www.synthesis.org/Docs/myr5/mech/mech-take-home.html
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
www.synthesis.org/Docs/yr5ar
www.synthesis.org/Docs/yr5ar/assess
www.synthesis.org/Docs/yr5ar/cicee
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html

synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
synthesis.stanford.edu/Docs/annual.report96.final.html
synthesis.stanford.edu/Docs/annual.report96.final_fn.html
synthesis.stanford.edu/Docs/myr5/assessment
synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
synthesis.stanford.edu/Docs/myr5/cicee
synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html

Page 39: seo service

Top-Down Mirror Detection [Bhar99, Bhar00c]

E.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html and synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html

What features could indicate mirroring?
Hostname similarity: word unigrams and bigrams, e.g. { www, www.synthesis, synthesis, … }
Directory similarity: positional path bigrams, e.g. { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
IP address similarity: 3- or 4-octet overlap. Many hosts sharing an IP address => virtual hosting by an ISP.
Host outlink overlap.
Path overlap; potentially, path + sketch overlap.
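The hostname and directory features can be extracted as small term sets (a sketch of the feature extraction; function names are my own):

```python
def hostname_terms(host):
    """Hostname features: word unigrams and bigrams of the dot-separated name."""
    parts = host.split(".")
    return set(parts) | {".".join(parts[i:i + 2]) for i in range(len(parts) - 1)}

def path_bigrams(path):
    """Directory features: positional path bigrams, e.g. "0:Docs/ProjAbs"."""
    segs = [s for s in path.split("/") if s]
    return {f"{i}:{segs[i]}/{segs[i + 1]}" for i in range(len(segs) - 1)}
```

Overlap between these sets for two hosts (e.g., measured with the same Jaccard machinery as for shingles) then serves as a candidate-pair signal.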

Page 40: seo service

Implementation

Phase I - Candidate Pair Detection: find features that pairs of hosts have in common; compute a list of host pairs which might be mirrors.

Phase II - Host Pair Validation: test each host pair and determine the extent of mirroring. Check if 20 paths sampled from Host1 have near-duplicates on Host2, and vice versa. Use transitive inferences:
IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)

Evaluation: 140 million URLs on 230,000 hosts (1999). The best approach combined 5 sets of features; the top 100,000 host pairs had precision = 0.57 and recall = 0.86.

Page 41: seo service

WebIR Infrastructure

Connectivity Server: fast access to links, to support link analysis.

Term Vector Database: fast access to document vectors, to augment link analysis.

Page 42: seo service

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]

Fast web-graph access to support connectivity analysis.

Stores mappings in memory from URL to outlinks and from URL to inlinks.

Applications: HITS and PageRank computations; crawl simulation; graph algorithms (web connectivity, diameter, etc. — more on this later); visualizations.

Page 43: seo service

Usage

Input: graph algorithm + URLs + Values. URLs are translated to FPs to IDs.
Execution: the graph algorithm runs in memory.
Output: IDs are translated back to URLs, giving URLs + Values.

Translation tables on disk:
URL text: 9 bytes/URL (compressed from ~80 bytes)
FP (64b) -> ID (32b): 5 bytes
ID (32b) -> FP (64b): 8 bytes
ID (32b) -> URLs: 0.5 bytes

Page 44: seo service

ID assignment

Partition URLs into 3 sets, sorted lexicographically:
High: max degree > 254
Medium: 254 > max degree > 24
Low: remaining (75%)

IDs assigned in sequence (densely). E.g., HIGH IDs: max(indegree, outdegree) > 254.

ID        URL
…
9891      www.amazon.com/
9912      www.amazon.com/jobs/
…
9821878   www.geocities.com/
…
40930030  www.google.com/
…
85903590  www.yahoo.com/

Adjacency lists: in-memory tables for outlinks and inlinks; a list index maps from a source ID to the start of its adjacency list.

Page 45: seo service

Adjacency List Compression - I

[Diagram: list index entries 104, 105, 106 point into a sequence of adjacency lists, e.g. 98 132 153 and 98 147 153; delta-encoded relative to the source ID, these become -6 34 21 and -8 49 6.]

• Adjacency lists: smaller delta values are exponentially more frequent (80% of links stay on the same host); compress deltas with a variable-length encoding (e.g., Huffman).

• List index pointers: 32b for high, Base+16b for medium, Base+8b for low; avg = 12b per pointer.
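The delta-plus-variable-length scheme can be sketched as follows. Two assumptions to flag: variable-byte coding is used here as a simpler stand-in for the Huffman-style coding the slides mention, and zigzag coding handles the negative deltas that arise when a destination ID is smaller than the source ID:

```python
def zigzag(n):
    """Map a signed delta to an unsigned int: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1

def unzigzag(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def encode_adjacency(source_id, neighbors):
    """Delta-encode an adjacency list relative to the source ID (then to the
    previous neighbor), variable-byte coding each zigzagged delta."""
    out, prev = bytearray(), source_id
    for dest in neighbors:
        out += vbyte_encode(zigzag(dest - prev))
        prev = dest
    return bytes(out)

def decode_adjacency(source_id, data):
    neighbors, prev, n, shift = [], source_id, 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if b & 0x80:  # last byte of this delta
            prev += unzigzag(n)
            neighbors.append(prev)
            n, shift = 0, 0
    return neighbors
```

On the example above, the list 98 132 153 for source 104 becomes the deltas -6, 34, 21, each of which fits in a single byte.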

Page 46: seo service

Adjacency List Compression - II

Inter-list compression. Basis: similar URLs may share links; close in ID space => adjacency lists may overlap.

Approach: define a representative adjacency list for a block of IDs — either the adjacency list of a reference ID, or the union of the adjacency lists in the block — and represent each adjacency list in terms of deletions and additions when it is cheaper to do so.

Measurements:
Intra-list + starts: 8-11 bits per link (580M pages / 16GB RAM)
Inter-list: 5.4-5.7 bits per link (870M pages / 16GB RAM)

Page 47: seo service

Term Vector Database [Stat00]

Fast access to 50-word term vectors for web pages.

Term selection: restricted to the middle 1/3rd of the lexicon by document frequency; top 50 words in the document by TF.IDF.

Term weighting: deferred till run-time (can be based on term freq, doc freq, doc length).

Applications: content + connectivity analysis (e.g., topic distillation); topic-specific crawls; document classification.

Performance: storage 33GB for 272M term vectors; speed 17 ms/vector on an AlphaServer 4100 (latency to read a disk block).

Page 48: seo service

Architecture

[Diagram: URLid-to-term-vector lookup, offset = URLid * 64 / 480. A base (4 bytes) plus a bit vector for 480 URLids locates a 128-byte TV record; each record stores terms as LC:TID fields and frequencies as FRQ:RL fields.]

