
Instant Indexing

Page 1: Instant Indexing

Instant Indexing

Greg Lindahl, CTO, Blekko

October 21, 2010 - BCS Search Solutions 2010

Page 2: Instant Indexing

Blekko Who?

• Founded in 2007, $24m in funding
• Whole-web search engine
• Currently in invite-only beta
  – 3B page crawl
  – innovative UI
• … but this talk is about indexing

Page 3: Instant Indexing

What whole-web search was

• Sort by relevance only
• News and blog search done with separate engines
• Main index updated slowly with a batch process
• Months to weeks update cycle

Page 4: Instant Indexing

What web-scale search is now

• Relevance and date sorting
• Everything in a single index
• Incremental updating
• Live-crawled pages should appear in the main index in seconds
• All data stored as tables

Page 5: Instant Indexing

Instant Search Indexing

• /date screenshot

Page 6: Instant Indexing

Another Example

Page 7: Instant Indexing

Google’s take on the issue

• Daniel Peng and Frank Dabek, Large Scale Incremental Processing Using Distributed Transactions and Notifications

• “Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”

Page 8: Instant Indexing

Percolator details

• ACID, with multi-row transactions
• triggers ("observers"), can be cascaded
• crawler is a cascade of triggers:
  – MapReduce writes new documents into bigtable
  – trigger parses and extracts links
  – cascaded trigger does clustering
  – cascaded trigger exports changed clusters
  – 10 triggers total in indexing system
• max 1 observer per column for complexity reasons
• message collapsing when there are multiple updates to a column

Page 9: Instant Indexing

Blekko’s take on this

• We want to run the same code in a mapjob or in an incremental crawler/indexer

• Our bigtable-like thingie shouldn’t need a percolator-sized addition to do it

• Needs to be more efficient than other approaches

• OK with non-ACID, relaxed eventual consistency, etc

Page 10: Instant Indexing

Combinators

• Task: gather incoming links and anchortext
• Each crawled webpage has dozens of outlinks
• Crawler wants to write into dozens of inlists, each in a separate cell in a table
• TopN combinator: list of N highest-ranked items
• If a cell is frequently written, writes can be combined before hitting disk
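A minimal Python sketch of what a TopN combinator could look like, assuming each item is a (rank, key, data) tuple and that a repeated key keeps its highest-ranked entry. The function name, the deduplication rule, and the sample URLs are illustrative assumptions, not Blekko's implementation. The point it shows is that the merge is order-independent, which is what lets writes be combined anywhere before they reach disk.

```python
import heapq

def topn_combine(a, b, n):
    """Merge two TopN values, keeping the n highest-ranked items.

    Each value is a list of (rank, key, data) tuples. If a key shows up
    more than once, the highest-ranked entry wins (an assumption; the
    slides do not say how duplicate keys are handled).
    """
    best = {}
    for rank, key, data in list(a) + list(b):
        if key not in best or rank > best[key][0]:
            best[key] = (rank, key, data)
    # nlargest returns the items sorted from highest to lowest rank.
    return heapq.nlargest(n, best.values(), key=lambda item: item[0])

# Two writers saw different inlinks for the same target page
# (hypothetical keys and anchortext).
from_writer_1 = [(540, "a.example/", "anchor a")]
from_writer_2 = [(1000, "b.example/page", "anchor b"),
                 (1, "c.example/", "anchor c")]

merged = topn_combine(from_writer_1, from_writer_2, n=2)
for rank, key, data in merged:
    print(rank, key, data)
```

Because combining partial results in any grouping gives the same list, a writer can merge its own updates first and hand a single, smaller value downstream.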

Page 11: Instant Indexing

Combining combinators

• Combine within the writing process
• Combine within the local write daemon
• Combine within the 3 disk daemons, and the ram daemon
  – highly contended cells result in 1 disk transaction per 30 seconds

• Combinators are represented as strings and can be used without the database

• Using combinators seems to be a significant reduction of RPCs over Percolator, but I have no idea what the relative performance is.
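The layered combining described on this slide can be illustrated with a small buffer that merges writes to the same cell and only passes one value per cell to the next stage. This is a sketch under assumptions: the class name, the `flush_fn` callback, and the cell tuples are hypothetical, and the flush is triggered manually here rather than by a timer such as the 30-second interval mentioned above.

```python
class CombiningBuffer:
    """Buffers writes per cell and combines them before flushing.

    `combine` is any associative function of two values; `flush_fn`
    stands in for the next stage (write daemon, disk daemon, ...).
    """

    def __init__(self, combine, flush_fn):
        self.combine = combine
        self.flush_fn = flush_fn
        self.pending = {}

    def write(self, cell, value):
        # Merge with whatever is already pending for this cell.
        if cell in self.pending:
            self.pending[cell] = self.combine(self.pending[cell], value)
        else:
            self.pending[cell] = value

    def flush(self):
        # One downstream write per cell, however many writes came in.
        for cell, value in self.pending.items():
            self.flush_fn(cell, value)
        self.pending = {}

disk_writes = []
buf = CombiningBuffer(
    combine=lambda a, b: a + b,  # a "sum" combinator
    flush_fn=lambda cell, value: disk_writes.append((cell, value)),
)

hot_cell = ("counts", "example.com", "hits")    # hypothetical table/row/column
cold_cell = ("counts", "example.org", "hits")
for _ in range(1000):
    buf.write(hot_cell, 1)
buf.write(cold_cell, 1)
buf.flush()

print(len(disk_writes))  # two downstream writes for 1001 incoming writes
```

Stacking several such buffers (writer, local daemon, disk daemon) gives the cascade the slide describes; each layer sees fewer writes than the one before it.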

Page 12: Instant Indexing

TopN example

• table: /index/32/url row: pbm.com/~lindahl/ column: inlinks
  – a list of: rank, key, data
  – 1000, www.disney.com, “great website”
  – 540, britishmuseum.com/dance, “16th century dance manuals in facsimile”
  – 1, www.ehow.com/dance, “renaissance dance”

Page 13: Instant Indexing

MapReduce from a combinator perspective

• MapReduce is really map, shuffle, reduce
• input: a file/table, output: a file/table
• An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce

• Could write into BigTable cells…

Page 14: Instant Indexing

MapJobs+Combinators

• Map function runs on shards
• All output is done by writing into a table, using combinators
• The same map function can also be run incrementally on individual inputs
• The shuffle+reduce is still there, it’s just done by the database+combinators
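To make the "same map function, batch or incremental" idea concrete, here is a toy Python sketch. The `Table` class, `map_page`, `run_mapjob`, and the page records are all hypothetical stand-ins; the only thing taken from the slides is the shape of the design: the map function writes into a table through a combinator, and the table does the grouping and reducing.

```python
class Table:
    """Toy in-memory table whose cells are updated through a combinator."""

    def __init__(self):
        self.cells = {}

    def write(self, row, column, value, combine):
        key = (row, column)
        if key in self.cells:
            self.cells[key] = combine(self.cells[key], value)
        else:
            self.cells[key] = value

def add(a, b):
    """A 'sum' combinator."""
    return a + b

def map_page(page, table):
    """Map function: one 'sum' write per outlink of a crawled page."""
    for target in page["outlinks"]:
        table.write(target, "inlink_count", 1, add)

def run_mapjob(shards, table):
    """Batch mode: run the map function over every page in every shard."""
    for shard in shards:
        for page in shard:
            map_page(page, table)

pages = [
    {"url": "a.example/", "outlinks": ["c.example/", "d.example/"]},
    {"url": "b.example/", "outlinks": ["c.example/"]},
]

batch = Table()
run_mapjob([pages[:1], pages[1:]], batch)  # two shards

incremental = Table()
for page in pages:                          # pages arriving one at a time
    map_page(page, incremental)

print(batch.cells == incremental.cells)
```

Keying writes by (row, column) plays the role of the shuffle, and the combinator plays the role of the reduce, so no separate incremental code path is needed.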

Page 15: Instant Indexing

Combinator types

• topN
• lastN = topN, using time as the rank
• sum, avg, eavg, min, max
• counting things
  – logcount: ±50% count of strings in 16 bytes

• set -- everything is a combinator

• Cells in our tables are native Perl/Python data structures

• hence: atomic updates on a sub-cell level
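A few of the combinator types listed above can be sketched as plain two-argument merge functions in Python. The function names are made up for illustration, and representing avg as a (total, count) pair is an assumption (a common way to make averages mergeable), not something the slides state. eavg and logcount are left out because the slides do not describe how they work.

```python
import heapq

def comb_sum(a, b):
    return a + b

def comb_min(a, b):
    return min(a, b)

def comb_max(a, b):
    return max(a, b)

def comb_avg(a, b):
    """Average kept as a (total, count) pair so partial results can merge."""
    return (a[0] + b[0], a[1] + b[1])

def avg_value(cell):
    """Read the current average out of an avg cell."""
    total, count = cell
    return total / count

def make_lastn(n):
    """lastN is topN with the timestamp used as the rank."""
    def comb_lastn(a, b):
        # Items are (timestamp, value) tuples; keep the n newest.
        return heapq.nlargest(n, a + b, key=lambda item: item[0])
    return comb_lastn

last2 = make_lastn(2)
recent = last2([(1, "a"), (3, "c")], [(2, "b")])
mean = avg_value(comb_avg((10, 2), (20, 3)))
print(recent, mean)
```

Each of these is associative, which is the property that allows writes to be merged in any order and at any layer.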

Page 16: Instant Indexing

Combinators for indexing

• The basic data structure for search is the posting list:
  – for each term, a list with rows
    • docid, rank
• Sounds like a custom topN to us
  – rank = rank or date or …
  – lists heavily compressed

• Each posting list has N shards
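The following Python sketch shows one way a posting list could behave like a sharded, bounded topN. Everything specific here is an assumption: the slides do not say how documents are assigned to shards (a CRC of the docid is used below), what the capacities are, or how compression works (it is omitted).

```python
import heapq
import zlib

NUM_SHARDS = 4      # hypothetical shard count
SHARD_CAPACITY = 3  # hypothetical per-shard limit

def shard_for(docid):
    """Pick a shard with a stable hash of the docid (an assumption)."""
    return zlib.crc32(docid.encode("utf-8")) % NUM_SHARDS

class PostingList:
    """Posting list for one term: rows of (docid, rank), split into shards."""

    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]  # docid -> rank

    def add(self, docid, rank):
        shard = self.shards[shard_for(docid)]
        shard[docid] = rank
        if len(shard) > SHARD_CAPACITY:
            # The shard has overflowed: drop its lowest-ranked row.
            lowest = min(shard, key=shard.get)
            del shard[lowest]

    def remove(self, docid):
        self.shards[shard_for(docid)].pop(docid, None)

    def top(self, k):
        """Merge all shards and return the k highest-ranked rows."""
        rows = [(rank, docid)
                for shard in self.shards
                for docid, rank in shard.items()]
        return [(docid, rank) for rank, docid in heapq.nlargest(k, rows)]

plist = PostingList()
for i, docid in enumerate(["d1", "d2", "d3", "d4", "d5", "d6"], start=1):
    plist.add(docid, i * 10)

print(plist.top(2))
```

Using a date instead of a relevance score as the rank turns the same structure into a date-sorted list, matching the "rank = rank or date or …" bullet.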

Page 17: Instant Indexing

Combinators for crawling

• Pick a site, crawl the most important uncrawled pages
  – that’s stored as a topN
• (the “livecrawl” uses other criteria)
• Crawl, parse, and spew writes
  – outlinks into inlinks cells
  – page ip/geo into incoming ips, geos
  – page hashes into duptext detection table
  – count everything under the sun
  – 100s of writes total

Page 18: Instant Indexing

Instant index step

• Crawler does the indexing
• Decides which terms to index based on page contents and incoming anchortext
• Writes into posting lists
  – if indexed before, use list of previously indexed terms to delete any obsolete terms
• Heavily-contended posting lists are not a problem due to combining -- that’s how a naked [/date] query works.
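The re-index step, including the deletion of obsolete terms, can be sketched in a few lines of Python. The dictionaries below are simplified stand-ins for the real tables, and the function and variable names are hypothetical; only the logic (compare the previously indexed terms with the new ones, delete the difference) comes from the slide.

```python
def reindex(docid, new_terms, rank, posting_lists, indexed_terms):
    """Update posting lists for one freshly crawled page.

    posting_lists: term -> {docid: rank}
    indexed_terms: docid -> set of terms written on the previous crawl
    """
    new_terms = set(new_terms)
    old_terms = indexed_terms.get(docid, set())

    # Terms the page no longer has: remove the page from those lists.
    for term in old_terms - new_terms:
        posting_lists.get(term, {}).pop(docid, None)

    # Write (or overwrite) the page's row in every current term's list.
    for term in new_terms:
        posting_lists.setdefault(term, {})[docid] = rank

    # Remember what was indexed so the next crawl can clean up after it.
    indexed_terms[docid] = new_terms

posting_lists = {}
indexed_terms = {}
doc = "pbm.com/~lindahl/"

reindex(doc, ["dance", "renaissance"], 10, posting_lists, indexed_terms)
reindex(doc, ["dance", "manuals"], 12, posting_lists, indexed_terms)

print(sorted(term for term, rows in posting_lists.items() if doc in rows))
```

Keeping the per-document term list is what makes deletion cheap: the crawler never has to scan posting lists to find stale entries.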

Page 19: Instant Indexing

Supporting date queries

• /date queries fetch about 3X the posting lists of a relevance query

• to support [/health /date], we keep a posting list of the most recent dated pages for each website

• date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles
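A small Python sketch of why the companion list helps, under stated assumptions: the list capacity, the relevance threshold, and the URLs are invented, and the slides do not say how the two lists are combined at query time, so that step is not shown. The sketch only demonstrates that a highly relevant older article survives in the companion list after newer, low-relevance pages push it out of the main date-sorted list.

```python
import heapq

LIST_CAPACITY = 3          # hypothetical list size
RELEVANCE_THRESHOLD = 0.8  # hypothetical cutoff for "highly relevant"

def add_bounded(lst, item, capacity):
    """Insert into a date-sorted list, keeping only the newest items."""
    lst.append(item)
    lst[:] = heapq.nlargest(capacity, lst)  # items are (date, docid)

def index_dated_page(term_lists, docid, date, relevance):
    """Write a dated page into the date list and, if relevant, its companion."""
    add_bounded(term_lists["date"], (date, docid), LIST_CAPACITY)
    if relevance >= RELEVANCE_THRESHOLD:
        add_bounded(term_lists["date_relevant"], (date, docid), LIST_CAPACITY)

term_lists = {"date": [], "date_relevant": []}

# One highly relevant article, followed by a burst of newer weak pages.
index_dated_page(term_lists, "news.example/analysis", "2010-10-01", relevance=0.9)
for day in ("2010-10-18", "2010-10-19", "2010-10-20"):
    index_dated_page(term_lists, "blog.example/" + day, day, relevance=0.2)

print(term_lists["date"])
print(term_lists["date_relevant"])
```

ISO-formatted date strings are used so that plain string comparison orders them chronologically.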

Page 20: Instant Indexing

Example: [obama /date]

• The term posting list for ‘obama’ has overflowed -- moderately relevant dated queries are probably smushed out

• The date posting list for ‘obama’ has overflowed

• The date posting list for highly-relevant dated ‘obama’ is not full

Page 21: Instant Indexing

To Sum Up

• There’s more than one way to do it
  – yes, we use Perl

• I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different

• See me if you’d like an invite to our beta-test
