Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 15: Web search basics

Brief (non-technical) history
Early keyword-based engines ca. 1995-1997
  Altavista, Excite, Infoseek, Inktomi, Lycos
Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  Your search ranking depended on how much you paid
  Auction for keywords: casino was expensive!

Brief (non-technical) history
1998+: Link-based ranking pioneered by Google
  Blew away all early engines save Inktomi
  Great user experience in search of a business model
  Meanwhile Goto/Overture's annual revenues were nearing $1 billion
Result: Google added paid search "ads" to the side, independent of search results
  Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
2005+: Google gains search share, dominating in Europe and very strong in North America
  2009: Yahoo! and Microsoft propose combined paid search offering

[Figure: search results page with two labeled regions: "Algorithmic results" and "Paid Search Ads"]

Web search basics
[Figure: web search architecture. Components: User, Search, Indexes, Ad indexes, Indexer, Web spider, The Web]
Sec. 19.4.1

User Needs
Need [Brod02, RL04]
  Informational – want to learn about something (~40% / 65%)
  Navigational – want to go to that page (~25% / 15%)
  Transactional – want to do something (web-mediated) (~35% / 20%)
    Access a service
    Downloads
    Shop
  Gray areas
    Find a good hub
    Exploratory search "see what's there"
Sec. 19.4.1

How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
Quality of pages varies widely
  Relevance is not enough
  Other desirable qualities (non IR!!)
    Content: Trustworthy, diverse, non-duplicated, well maintained
    Web readability: display correctly & fast
    No annoyances: pop-ups, etc.
Precision vs. recall
  On the web, recall seldom matters
What matters
  Precision at 1? Precision above the fold?
  Comprehensiveness – must be able to deal with obscure queries
    Recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate

Users' empirical evaluation of engines
Relevance and validity of results
UI – Simple, no clutter, error tolerant
Trust – Results are objective
Coverage of topics for polysemic queries
Pre/Post process tools provided
  Mitigate user errors (auto spell check, search assist, …)
  Explicit: Search within results, more like this, refine ...
  Anticipative: related searches
Deal with idiosyncrasies
  Web specific vocabulary
    Impact on stemming, spell-check, etc.
  Web addresses typed in the search box
"The first, the last, the best and the worst …"

The Web document collection
No design/co-ordination
Distributed content creation, linking, democratization of publishing
Content includes truth, lies, obsolete information, contradictions …
Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) …
Scale much larger than previous text collections … but corporate records are catching up
Growth – slowed down from initial "volume doubling every few months" but still expanding
Content can be dynamically generated
Sec. 19.2

SPAM (SEARCH ENGINE OPTIMIZATION)

The trouble with paid search ads …
It costs money. What's the alternative?
Search Engine Optimization:
  "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  Alternative to paying for placement
  Thus, intrinsically a marketing function
Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients
Some perfectly legitimate, some very shady
Sec. 19.2.2

Search engine optimization (Spam)
Motives
  Commercial, political, religious, lobbies
  Promotion funded by advertising budget
Operators
  Contractors (Search Engine Optimizers) for lobbies, companies
  Webmasters
  Hosting services
Forums
  E.g., Webmasterworld (www.webmasterworld.com)
    Search engine specific tricks
    Discussions about academic papers
Sec. 19.2.2

Simplest forms
First generation engines relied heavily on tf/idf
  The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
SEOs responded with dense repetitions of chosen terms
  e.g., maui resort maui resort maui resort
  Often, the repetitions would be in the same color as the background of the web page
    Repeated terms got indexed by crawlers
    But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
Sec. 19.2.2

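The weakness of pure word density can be illustrated with a toy scorer. This is a minimal sketch, assuming a ranking that simply sums raw term counts (no idf weighting, no length normalization); the function name `tf_score` and the two sample pages are made up for illustration and do not describe any real engine.

```python
from collections import Counter


def tf_score(query: str, page_text: str) -> int:
    """Score a page by summing raw counts of each query term (toy model)."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())


# Hypothetical pages: an honest description versus a keyword-stuffed one.
honest_page = "our maui resort offers beach views and local food"
stuffed_page = "maui resort maui resort maui resort"

honest_score = tf_score("maui resort", honest_page)
stuffed_score = tf_score("maui resort", stuffed_page)
```

Under this scoring the stuffed page outranks the honest one, which is the behavior the slide warns about.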
Variants of keyword stuffing
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc.
Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"
Sec. 19.2.2

Cloaking
Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate
[Figure: cloaking flowchart. "Is this a Search Engine spider?" Y → SPAM, N → Real Doc]
Sec. 19.2.2

More spam techniques
Doorway pages
  Pages optimized for a single keyword that re-direct to the real target page
Link spamming
  Mutual admiration societies, hidden links, awards – more on these later
  Domain flooding: numerous domains that point or re-direct to a target page
Robots
  Fake query stream – rank checking programs
    "Curve-fit" ranking programs of search engines
  Millions of submissions via Add-Url
Sec. 19.2.2

The war against spam
Quality signals - Prefer authoritative pages based on:
  Votes from authors (linkage signals)
  Votes from users (usage signals)
Policing of URL submissions
  Anti robot test
Limits on meta-keywords
Robust link analysis
  Ignore statistically implausible linkage (or text)
  Use link analysis to detect spammers (guilt by association)
Spam recognition by machine learning
  Training set based on known spam
Family friendly filters
  Linguistic analysis, general classification techniques, etc.
  For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
  Blacklists
  Top queries audited
  Complaints addressed
  Suspect pattern detection

More on spam
Web search engines have policies on SEO practices they tolerate/block
  http://help.yahoo.com/help/us/ysearch/index.html
  http://www.google.com/intl/en/webmasters/
Adversarial IR: the unending (technical) battle between SEO's and web search engines
Research http://airweb.cse.lehigh.edu/

SIZE OF THE WEB

What is the size of the web?
Issues
  The web is really infinite
    Dynamic content, e.g., calendars
    Soft 404: www.yahoo.com/<anything> is a valid page
  Static web contains syntactic duplication, mostly due to mirroring (~30%)
  Some servers are seldom connected
Who cares?
  Media, and consequently the user
  Engine design
  Engine crawl policy. Impact on recall.
Sec. 19.5

What can we attempt to measure?
The relative sizes of search engines
  The notion of a page being indexed is still reasonably well defined.
  Already there are problems
    Document extension: e.g., engines index pages not yet crawled, by indexing anchor text.
    Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)
Sec. 19.5

New definition?
The statically indexable web is whatever search engines index.
  IQ is whatever the IQ tests measure.
Different engines have different preferences
  max url depth, max count/host, anti-spam rules, priority rules, etc.
Different engines index different things under the same URL:
  frames, meta-keywords, document restrictions, document extensions, ...
Sec. 19.5

Relative Size from Overlap
Given two engines A and B
  Sample URLs randomly from A
  Check if contained in B and vice versa
Each test involves: (i) Sampling (ii) Checking
Example:
  A ∩ B = (1/2) * Size A
  A ∩ B = (1/6) * Size B
  (1/2) * Size A = (1/6) * Size B
  ∴ Size A / Size B = (1/6) / (1/2) = 1/3
Sec. 19.5

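The overlap arithmetic can be written as a small helper. This is a sketch under stated assumptions: the two engines are modeled as plain Python sets of URLs (real engines can only be probed through queries), and the names `overlap_fraction` and `relative_size` are illustrative, not from the lecture.

```python
import random


def overlap_fraction(source: set, other: set, n: int, rng: random.Random) -> float:
    """Sample n items uniformly from `source`; return the fraction found in `other`."""
    sample = rng.sample(sorted(source), n)
    return sum(1 for url in sample if url in other) / n


def relative_size(frac_a_in_b: float, frac_b_in_a: float) -> float:
    """Estimate Size A / Size B.

    frac_a_in_b approximates |A ∩ B| / |A| and frac_b_in_a approximates
    |A ∩ B| / |B|, so their ratio (B-side over A-side) is |A| / |B|.
    """
    if frac_a_in_b <= 0:
        raise ValueError("no overlap observed from A; ratio is undefined")
    return frac_b_in_a / frac_a_in_b


# Toy indexes chosen to match the slide: half of A is in B, one sixth of B is in A.
index_a = set(range(0, 300))
index_b = set(range(150, 1050))
rng = random.Random(0)
frac_a = overlap_fraction(index_a, index_b, len(index_a), rng)
frac_b = overlap_fraction(index_b, index_a, len(index_b), rng)
ratio = relative_size(frac_a, frac_b)
```

Here the whole index is sampled, so the fractions are exact; with a smaller sample the ratio is only an estimate.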
Sampling URLs
Ideal strategy: Generate a random URL and check for containment in each index.
Problem: Random URLs are hard to find! Enough to generate a random URL contained in a given Engine.
Approach 1: Generate a random URL contained in a given engine
  Suffices for the estimation of relative size
Approach 2: Random walks / IP addresses
  In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)
Sec. 19.5

Statistical methods
Approach 1
  Random queries
  Random searches
Approach 2
  Random IP addresses
  Random walks
Sec. 19.5

Random URLs from random queries
Generate random query: how?
  Lexicon: 400,000+ words from a web crawl
    Not an English dictionary
  Conjunctive Queries: w1 and w2, e.g., vocalists AND rsi
Get 100 result URLs from engine A
Choose a random URL as the candidate to check for presence in engine B
This distribution induces a probability weight W(p) for each page.
Sec. 19.5

Query Based Checking
Strong Query to check whether an engine B has a document D:
  Download D. Get list of words.
  Use 8 low frequency words as AND query to B
  Check if D is present in result set.
Problems:
  Near duplicates
  Frames
  Redirects
  Engine time-outs
  Is 8-word query good enough?
Sec. 19.5

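The strong-query construction can be sketched as follows. The assumptions are mine: a document-frequency table `doc_freq` (word → number of documents containing it) is available from some crawl, words missing from that table are skipped, and the query syntax is a plain `AND` join. The function name `strong_query` is hypothetical, and no engine is actually contacted.

```python
import re


def strong_query(text: str, doc_freq: dict, k: int = 8) -> str:
    """Build a conjunctive query from the k lowest-frequency words in the document."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    known = [w for w in words if w in doc_freq]
    # Rarest words first; ties broken alphabetically so the result is deterministic.
    rarest = sorted(known, key=lambda w: (doc_freq[w], w))[:k]
    return " AND ".join(rarest)


# Hypothetical frequency table and document.
doc_freq = {"the": 1000, "rose": 50, "hamiltonian": 2, "topographic": 3}
query = strong_query("The topographic Hamiltonian rose", doc_freq, k=2)
```

The resulting string would then be submitted to engine B, and D counts as present if it appears in the result set, subject to the problems listed above.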
Advantages & disadvantages
Statistically sound under the induced weight.
Biases induced by random query
  Query Bias: Favors content-rich pages in the language(s) of the lexicon
  Ranking Bias: Solution: Use conjunctive queries & fetch all
  Checking Bias: Duplicates, impoverished pages omitted
  Document or query restriction bias: engine might not deal properly with 8 words conjunctive query
  Malicious Bias: Sabotage by engine
  Operational Problems: Time-outs, failures, engine inconsistencies, index modification.
Sec. 19.5

Random searches
Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
  Use only queries with small result sets.
  Count normalized URLs in result sets.
  Use ratio statistics
Sec. 19.5

Advantages & disadvantages
Advantage
  Might be a better reflection of the human perception of coverage
Issues
  Samples are correlated with source of log
  Duplicates
  Technical statistical problems (must have non-zero results, ratio average not statistically sound)
Sec. 19.5

Random searches
575 & 1050 queries from the NEC RI employee logs
6 Engines in 1998, 11 in 1999
Implementation:
  Restricted to queries with < 600 results in total
  Counted URLs from each engine after verifying query match
  Computed size ratio & overlap for individual queries
  Estimated index size ratio & overlap by averaging over all queries
Sec. 19.5

Queries from Lawrence and Giles study
  adaptive access control
  neighborhood preservation topographic
  hamiltonian structures
  right linear grammar
  pulse width modulation neural
  unbalanced prior probabilities
  ranked assignment method
  internet explorer favourites importing
  karvel thornber
  zili liu
  softmax activation function
  bose multidimensional system theory
  gamma mlp
  dvi2pdf
  john oliensis
  rieke spikes exploring neural
  video watermarking
  counterpropagation network
  fat shattering dimension
  abelson amorphous computing
Sec. 19.5

Random IP addresses
Generate random IP addresses
Find a web server at the given address
  If there's one
Collect all pages from server
  From this, choose a page at random
Sec. 19.5

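The first and last steps of this procedure can be sketched without touching the network. The assumptions are mine: only IPv4 is considered, non-routable addresses (private, loopback, reserved, and so on) are rejected using the standard `ipaddress` module's `is_global` flag, and the server probe and page collection are left out entirely. The helper names are hypothetical.

```python
import ipaddress
import random


def random_public_ipv4(rng: random.Random) -> ipaddress.IPv4Address:
    """Draw uniformly random 32-bit addresses until one is globally routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:
            return addr


def choose_random_page(pages: list, rng: random.Random) -> str:
    """Pick one page uniformly at random from the pages collected on a server."""
    return rng.choice(pages)


rng = random.Random(0)
candidates = [random_public_ipv4(rng) for _ in range(100)]
```

Each candidate would then be probed with an HTTP request. Most addresses will not answer, which is why studies of this kind need very large samples.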
Random IP addresses
HTTP requests to random IP addresses
  Ignored: empty or authorization required or excluded
  [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.
  OCLC using IP sampling found 8.7 M hosts in 2001
    Netcraft [Netc02] accessed 37.2 million hosts in July 2002
[Lawr99] exhaustively crawled 2500 servers and extrapolated
  Estimated size of the web to be 800 million pages
  Estimated use of metadata descriptors:
    Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%
Sec. 19.5

Advantages & disadvantages
Advantages
  Clean statistics
  Independent of crawling strategies
Disadvantages
  Doesn't deal with duplication
  Many hosts might share one IP, or not accept requests
  No guarantee all pages are linked to root page.
    E.g.: employee pages
  Power law for # pages/hosts generates bias towards sites with few pages.
    But bias can be accurately quantified IF underlying distribution understood
  Potentially influenced by spamming (multiple IP's for same server to avoid IP block)
Sec. 19.5

Random walks
View the Web as a directed graph
Build a random walk on this graph
  Includes various "jump" rules back to visited sites
    Does not get stuck in spider traps!
    Can follow all links!
  Converges to a stationary distribution
    Must assume graph is finite and independent of the walk.
    Conditions are not satisfied (cookie crumbs, flooding)
    Time to convergence not really known
Sample from stationary distribution of walk
Use the "strong query" method to check coverage by SE
Sec. 19.5

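A toy version of such a walk is sketched below. The assumptions are mine: the Web is a small in-memory adjacency dict, the only jump rule is "with probability `jump_prob`, or at a dead end, jump to a uniformly chosen previously visited page", and visit counts stand in for the stationary distribution. The published methods use more careful jump rules and bias corrections; the function name is hypothetical.

```python
import random
from collections import Counter


def random_walk_with_jumps(graph, start, steps, jump_prob, rng):
    """Walk the directed graph, occasionally jumping back to a visited page.

    Returns a Counter of how often each page was the walk's position.
    """
    visited = [start]   # pages eligible as jump targets
    seen = {start}
    current = start
    visits = Counter()
    for _ in range(steps):
        out_links = graph.get(current, [])
        if not out_links or rng.random() < jump_prob:
            current = rng.choice(visited)      # jump rule
        else:
            current = rng.choice(out_links)    # follow a random out-link
            if current not in seen:
                seen.add(current)
                visited.append(current)
        visits[current] += 1
    return visits


# Toy graph with a spider trap: "trap" links only to itself.
toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["trap"], "trap": ["trap"]}
visits = random_walk_with_jumps(toy_graph, "a", 2000, 0.2, random.Random(1))
```

Because of the jump rule the walk keeps leaving the trap, but the visit counts are far from uniform, which is the non-uniform sampling the slides list as a disadvantage.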
Advantages & disadvantages
Advantages
  "Statistically clean" method, at least in theory!
  Could work even for infinite web (assuming convergence) under certain metrics.
Disadvantages
  List of seeds is a problem.
  Practical approximation might not be valid.
  Non-uniform distribution
    Subject to link spamming
Sec. 19.5

Conclusions
No sampling solution is perfect.
Lots of new ideas ...
... but the problem is getting harder
Quantitative studies are fascinating and a good research problem
Sec. 19.5

DUPLICATE DETECTION
Sec. 19.6

Duplicate documents
The web is full of duplicated content
Strict duplicate detection = exact match
  Not as common
But many, many cases of near duplicates
  E.g., last-modified date the only difference between two copies of a page
Sec. 19.6

Duplicate/Near-Duplicate Detection
Duplication: Exact match can be detected with fingerprints
Near-Duplication: Approximate match
  Overview
    Compute syntactic similarity with an edit-distance measure
    Use similarity threshold to detect near-duplicates
      E.g., Similarity > 80% => Documents are "near duplicates"
      Not transitive though sometimes used transitively
Sec. 19.6

Computing Similarity
Features:
  Segments of a document (natural or artificial breakpoints)
  Shingles (Word N-Grams)
  a rose is a rose is a rose →
    a_rose_is_a
    rose_is_a_rose
    is_a_rose_is
    a_rose_is_a
Similarity Measure between two docs (= sets of shingles)
  Jaccard coefficient: Size_of_Intersection / Size_of_Union
Sec. 19.6

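Shingling and the Jaccard coefficient are short enough to write out directly. This is a minimal sketch, assuming whitespace tokenization, lowercasing, and 4-word shingles joined with underscores as in the slide's example; the function names are illustrative.

```python
def shingles(text: str, n: int = 4) -> set:
    """Return the set of word n-grams (shingles) of the text."""
    words = text.lower().split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size_of_Intersection / Size_of_Union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


rose = shingles("a rose is a rose is a rose")
```

Because a document is treated as a set of shingles, the repeated shingle `a_rose_is_a` from the slide's example is counted once.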
Shingles + Set Intersection
Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
  Approximate using a cleverly chosen subset of shingles from each (a sketch)
Estimate (size_of_intersection / size_of_union) based on a short sketch
[Figure: Doc A → Shingle set A → Sketch A; Doc B → Shingle set B → Sketch B; the two sketches are compared to estimate Jaccard]
Sec. 19.6

Sketch of a document
Create a "sketch vector" (of size ~200) for each document
  Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
  For doc D, sketch_D[i] is as follows:
    Let f map all shingles in the universe to 0..2^m-1 (e.g., f = fingerprinting)
    Let π_i be a random permutation on 0..2^m-1
    Pick MIN {π_i(f(s))} over all shingles s in D
Sec. 19.6

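The sketch construction can be approximated in a few lines. The assumptions here are mine, not the lecture's: f is the first 8 bytes of SHA-1 reduced modulo the prime P = 2^61 - 1, and each "random permutation" π_i is replaced by a random linear map x → (a·x + b) mod P, which is a true permutation of 0..P-1 but only approximately behaves like a uniformly random one. All names are hypothetical.

```python
import hashlib
import random

P = (1 << 61) - 1  # Mersenne prime; fingerprints live in 0..P-1


def fingerprint(shingle: str) -> int:
    """f: map a shingle to an integer in 0..P-1."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % P


def make_permutations(count: int, seed: int = 0) -> list:
    """Draw `count` (a, b) pairs; each defines the map x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(count)]


def sketch(shingle_set: set, perms: list) -> list:
    """sketch[i] = MIN over shingles s of pi_i(f(s)). Requires a non-empty set."""
    fps = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % P for x in fps) for a, b in perms]


def sketch_similarity(s1: list, s2: list) -> float:
    """Fraction of corresponding sketch elements that agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


perms = make_permutations(200)
doc1 = {f"shingle_{i}" for i in range(0, 100)}
doc2 = {f"shingle_{i}" for i in range(50, 150)}   # true Jaccard = 50/150
estimate = sketch_similarity(sketch(doc1, perms), sketch(doc2, perms))
```

With 200 sketch elements the estimate should land near the true Jaccard value of 1/3, though it varies with the seed.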
Computing Sketch[i] for Doc1
[Figure: the shingles of Document 1 placed on number lines from 0 to 2^64]
  Start with 64-bit f(shingles)
  Permute on the number line with π_i
  Pick the min value
Sec. 19.6

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: number lines from 0 to 2^64 for Document 1 and Document 2, with their min values marked A and B]
Are these equal?
Test for 200 random permutations: π1, π2, … π200
Sec. 19.6

However…
[Figure: number lines from 0 to 2^64 for Document 1 and Document 2, with their min values marked A and B]
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: This happens with probability Size_of_intersection / Size_of_union
Why?
Sec. 19.6

Set Similarity of sets Ci, Cj
Jaccard(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
View sets as columns of a matrix A; one row for each element in the universe. a_ij = 1 indicates presence of item i in set j
Example
  C1  C2
   0   1
   1   0
   1   1
   0   0
   1   1
   0   1
  Jaccard(C1, C2) = 2/5 = 0.4
Sec. 19.6

Key Observation
For columns Ci, Cj, four types of rows
       Ci  Cj
  A     1   1
  B     1   0
  C     0   1
  D     0   0
Overload notation: A = # of rows of type A
Claim: Jaccard(Ci, Cj) = A / (A + B + C)
Sec. 19.6

"Min" Hashing
Randomly permute rows
Hash h(Ci) = index of first row with 1 in column Ci
Surprising Property: P[h(Ci) = h(Cj)] = Jaccard(Ci, Cj)
Why?
  Both are A / (A + B + C)
  Look down columns Ci, Cj until first non-Type-D row
  h(Ci) = h(Cj) ⟺ that row is a type A row
Sec. 19.6

Min-Hash sketches
Pick P random row permutations
MinHash sketch: Sketch_D = list of P indexes of first rows with 1 in column C
Similarity of signatures
  Let sim[sketch(Ci), sketch(Cj)] = fraction of permutations where MinHash values agree
  Observe E[sim(sketch(Ci), sketch(Cj))] = Jaccard(Ci, Cj)
Sec. 19.6

Example
        C1  C2  C3
  R1     1   0   1
  R2     0   1   1
  R3     1   0   0
  R4     1   0   1
  R5     0   1   0

Signatures
                      S1  S2  S3
  Perm 1 = (12345)     1   2   1
  Perm 2 = (54321)     4   5   4
  Perm 3 = (34512)     3   5   4

Similarities
             1-2    1-3    2-3
  Col-Col   0.00   0.50   0.25
  Sig-Sig   0.00   0.67   0.00
Sec. 19.6

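The example can be checked mechanically. In this sketch each column is stored as the set of row numbers holding a 1, and each permutation as the order in which rows are scanned; that representation and the helper names are my own, while the data is taken from the example.

```python
# Columns as sets of row numbers that contain a 1 (from the example matrix).
columns = {"C1": {1, 3, 4}, "C2": {2, 5}, "C3": {1, 2, 4}}

# Each permutation lists the rows in the order they are scanned.
perms = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [3, 4, 5, 1, 2]]


def signature(column: set) -> list:
    """For each permutation, record the first scanned row that has a 1."""
    return [next(row for row in perm if row in column) for perm in perms]


def col_similarity(a: set, b: set) -> float:
    """Exact Jaccard coefficient of two columns."""
    return len(a & b) / len(a | b)


def sig_similarity(a: set, b: set) -> float:
    """Fraction of permutations on which the two MinHash values agree."""
    sa, sb = signature(a), signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(perms)


signatures = {name: signature(col) for name, col in columns.items()}
```

With only three permutations the signature similarity is a coarse estimate, which is why the 2-3 pair shows 0.00 against a true value of 0.25.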
All signature pairs
Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of documents.
But we still have to estimate N^2 coefficients where N is the number of web pages.
  Still slow
One solution: locality sensitive hashing (LSH)
Another solution: sorting (Henzinger 2006)
Sec. 19.6

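One common way to apply LSH to MinHash sketches is banding: split each sketch into bands and treat two documents as a candidate pair if they agree on every element of at least one band. The slides do not say which LSH scheme is intended, so the following is only an illustrative sketch with hypothetical names and toy data.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(sketches: dict, bands: int) -> set:
    """Return pairs of document ids whose sketches agree on at least one full band.

    `sketches` maps document id -> sketch (list of ints, all the same length,
    with the length divisible by `bands`).
    """
    length = len(next(iter(sketches.values())))
    if length % bands != 0:
        raise ValueError("sketch length must be divisible by the number of bands")
    rows = length // bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sk in sketches.items():
            key = tuple(sk[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy sketches: d1 and d2 agree on the first band only.
toy_sketches = {"d1": [1, 4, 3, 7], "d2": [1, 4, 9, 9], "d3": [5, 5, 5, 5]}
pairs = candidate_pairs(toy_sketches, bands=2)
```

Only the candidate pairs then need a full sketch comparison, which avoids estimating all N^2 coefficients.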
More resources
IIR Chapter 19