Build Your Own Search Engine

Date post: 10-Jun-2015
Upload: goodfriday
Description:
Amazon subsidiary Alexa.com is leveling the search playing field. For the first time, developers looking to build the next "big thing" in search, or an ultra-custom search engine, have access to the 300 terabytes of Alexa crawl data, along with the utilities to search, process, and publish their own custom subset of the data, all at a reasonable price.
Transcript
Page 1: Build Your Own Search Engine

Build Your Own Search Engine

Jeff Barr
Web Services Evangelist
Amazon Web Services

NGW044

Page 2: Build Your Own Search Engine

Agenda

Amazon Web Services Overview
Looking Back
Build Your Own Search Engine
Q&A

Page 3: Build Your Own Search Engine

Introduction And Background

Software development background
Veteran of several startups
Visual Studio team at Microsoft (DHTML, XML, Web Services)
3.5 Years with Amazon
Amazon Web Services Evangelist

Page 4: Build Your Own Search Engine

What Is Amazon?

Online Retailer
  Over 55 million active customer accounts
  Seven countries: US, UK, Germany, Japan, France, Canada, China
Technology Consumer
  Multi-National Web Sites
  Vast Data Warehouse – 25 TB
  World-Class Logistics – 21 fulfillment centers; 9 million sq ft
Technology Provider
  Hundreds of thousands of Amazon Associates
  Over 1,050,000 active seller accounts
  Over 150,000 software developers registered to use Amazon Web Services

Page 5: Build Your Own Search Engine

What Is Alexa?

Amazon subsidiary since 1999
Alexa Toolbar
Web metrics
Traffic rankings
Web crawling

Page 6: Build Your Own Search Engine

What Is Amazon Web Services?

APIs that give developers programmatic access to Amazon’s data and technology
Building-block web services
Web-scale infrastructure
E-commerce capability
Content, data, and information
New business models
Customer-created content

Page 7: Build Your Own Search Engine

AWS Product Family

Amazon E-Commerce Service
  Complete access to Amazon’s product catalog
  Free + Associates commissions paid
Amazon Historical Pricing
  Data warehouse access for product pricing
  Monthly Fee
Amazon Mechanical Turk
  Artificial Artificial Intelligence
  10% Commission
  Paid workforce
Amazon Simple Queue Service
  IT building block
  In beta
Amazon S3
  Storage for the internet
  Charge by storage/bandwidth usage
Alexa Web Information Service
  Data warehouse access for web crawl data
  10K calls per month free, then 15 cents per 1000 calls
Alexa Top Sites
  Top sites by Alexa traffic rank
  Charges by URL
Alexa Web Search Platform
  Roll your own search engine
  Pay for time, storage, bandwidth
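
The Alexa Web Information Service pricing quoted above (10K calls per month free, then 15 cents per 1000 calls) can be worked through with a small calculation. This is only a sketch: the function name is made up for illustration, and it assumes usage beyond the free tier is pro-rated rather than billed in whole blocks of 1000 calls, which the slide does not specify.

```python
FREE_CALLS_PER_MONTH = 10_000   # from the slide
USD_PER_1000_CALLS = 0.15       # from the slide


def awis_monthly_cost(calls: int) -> float:
    """Estimate the monthly AWIS charge in US dollars.

    Assumption: calls past the free tier are pro-rated, not rounded
    up to the next block of 1000.
    """
    billable = max(0, calls - FREE_CALLS_PER_MONTH)
    return billable / 1000 * USD_PER_1000_CALLS


if __name__ == "__main__":
    for calls in (5_000, 10_000, 50_000):
        print(f"{calls:>6} calls -> ${awis_monthly_cost(calls):.2f}")
```

Under these assumptions, anything up to 10,000 calls in a month costs nothing, and each additional 1000 calls adds 15 cents.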

Page 8: Build Your Own Search Engine

Amazon S3: Simple Storage Service

Storage for the internet - web service to read and write data
15 cents per Gigabyte-Month to store data
20 cents per Gigabyte to access data
Private and public storage
Scalable, reliable, cost-effective, and simple!
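
To make the two S3 prices above concrete, here is a minimal sketch of a monthly cost estimate. The function and constant names are illustrative, and the calculation assumes a simple linear bill with no minimums or tiers, which is all the slide states.

```python
STORAGE_USD_PER_GB_MONTH = 0.15  # "15 cents per Gigabyte-Month to store data"
ACCESS_USD_PER_GB = 0.20         # "20 cents per Gigabyte to access data"


def s3_monthly_cost(stored_gb: float, accessed_gb: float) -> float:
    """Estimate one month's S3 bill in US dollars from the slide's rates."""
    return stored_gb * STORAGE_USD_PER_GB_MONTH + accessed_gb * ACCESS_USD_PER_GB


if __name__ == "__main__":
    # Example: keep 100 GB stored for the month and serve 50 GB of it.
    print(f"${s3_monthly_cost(100, 50):.2f}")  # prints $25.00
```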

Page 9: Build Your Own Search Engine

Looking Back

Page 10: Build Your Own Search Engine

Getting Online

History Lesson
1996 vs. 2006
A lot has changed
Let’s take a look

Page 11: Build Your Own Search Engine

Going Online: Then and Now

What does it take to bring a simple web site online?

Domain registration
DNS support
Network connection
Server Hardware
Development Tools
Publicity Vehicle
Monetization System

Page 12: Build Your Own Search Engine

Then And Now: Domain Registration

Then
  Expensive ($70/year)
  Single vendor
  Multi-step, multi-day process
Now
  Cheap ($10 or less / year)
  Dozens of vendors
  Single step, 10 minute process

Page 13: Build Your Own Search Engine

Then And Now: DNS Support

Then
  Leech off of friend or university
  Long propagation times
  Complicated
  Days to understand & set up
Now
  Free services (e.g. ZoneEdit)
  Very short propagation time
  Minutes to understand & set up

Page 14: Build Your Own Search Engine

Then Versus Now: Network Connection

Then
  9600 baud modem
  ISDN
  T1
  Expensive
Now
  DSL
  Dedicated hosting
  Cheap

Page 15: Build Your Own Search Engine

Then Versus Now: Server Hardware

Then
  Start with dedicated PC
  Upgrade to expensive Sun hardware
Now
  Build your own PC
  Hosting providers (EV1, BocaCom, Server Beach)
  Expensive Sun hardware

Page 17: Build Your Own Search Engine

Then And Now: Development Tools

Then
  Text Editor
  Shell Window
Now
  Visual Web Developer
  HTML Kit
  Front Page

Page 18: Build Your Own Search Engine

Then Versus Now: Publicity Vehicle

Then
  Yahoo What’s New
  Usenet
  Press Release
  Wired Magazine
Now
  Blogs / RSS / Pings
  Link sites
  Word of Mouth

Page 19: Build Your Own Search Engine

Then Versus Now: Monetization System

Then
  Money? We are purists and we are doing this for fun!
  Banner ads
  Ad sales people
  Large sites only
Now
  Pay per click
  Self serve
  Monetize page views

Page 20: Build Your Own Search Engine

Then: Building a Search Engine

Lots of Servers
Lots of Bandwidth
Lots of Software
Lots of Money
Lots of Intellectual Capital
Lots of Time

Page 21: Build Your Own Search Engine

Now: Building a Search Engine

Use our infrastructure
Leverage Alexa’s Crawl
Alexa Web Search Platform
300 TB Archive
10 Billion web pages
Pay as you go

Page 22: Build Your Own Search Engine

AWSP: Alexa Web Search Platform

Build your own search engine!

Process
  Specify pages to access within the 300TB archive
  Write parallelizable application to process pages
  Publish results as XML feed or as web service
Pricing – everything costs $1
  50 GB of data processing
  1 CPU Hour
  1 GB of data downloaded
  4000 web service requests
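
The "everything costs $1" pricing above lends itself to a quick estimate. The sketch below adds up the four metered quantities at the slide's rates; the `JobUsage` type and `awsp_cost` function are hypothetical names for illustration, and linear pro-rated billing is an assumption.

```python
from dataclasses import dataclass

# Units that $1 buys, taken from the slide.
GB_PROCESSED_PER_USD = 50
CPU_HOURS_PER_USD = 1
GB_DOWNLOADED_PER_USD = 1
REQUESTS_PER_USD = 4000


@dataclass
class JobUsage:
    """Metered usage for one hypothetical AWSP project."""
    gb_processed: float
    cpu_hours: float
    gb_downloaded: float
    web_service_requests: int


def awsp_cost(usage: JobUsage) -> float:
    """Total cost in US dollars, assuming linear (pro-rated) billing."""
    return (
        usage.gb_processed / GB_PROCESSED_PER_USD
        + usage.cpu_hours / CPU_HOURS_PER_USD
        + usage.gb_downloaded / GB_DOWNLOADED_PER_USD
        + usage.web_service_requests / REQUESTS_PER_USD
    )


if __name__ == "__main__":
    usage = JobUsage(gb_processed=500, cpu_hours=20,
                     gb_downloaded=2, web_service_requests=8000)
    # 500/50 + 20/1 + 2/1 + 8000/4000 = 10 + 20 + 2 + 2
    print(f"${awsp_cost(usage):.2f}")  # prints $34.00
```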

Page 23: Build Your Own Search Engine

AWSP Concepts

Interactive Node - Development
User Store – 12 TB of storage
Compute Node – Processing
Data Store
  4 billion documents per crawl
  3 crawls @ 100 TB
    In Process
    Current
    Previous
  All document types (HTML, Media, XML)
  Document header data

Page 24: Build Your Own Search Engine

AWSP Design Process

Great Idea
Write Code
Test Code
Identify Pages
Schedule Job
Run Job
Check Results
Publish Results

Page 25: Build Your Own Search Engine

Great Ideas

Vertical search engine
Search engine optimization (SEO)
Search engine marketing (SEM)
Research
< your idea here >

Page 27: Build Your Own Search Engine

Write Code

Run on Interactive Node
  Linux command line
  Interactive application development
Use Collection API for data retrieval
Use any language
Libraries for C, Java, Perl
Execution framework
Application processes one document

Page 28: Build Your Own Search Engine

Write Code

Code can
  Examine document
  Examine headers
  Write to a collection
  Write to <stdout>
  Store data to Amazon S3
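
As an illustration of the "application processes one document" model, here is a minimal sketch of a per-document handler that examines the headers and the document, then writes one line to stdout. The slides do not show the Collection API, so the `Document` type and its fields are stand-ins invented for this example, not the real interface.

```python
import re
import sys
from dataclasses import dataclass
from typing import Dict, Optional

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


@dataclass
class Document:
    """Hypothetical stand-in for one crawled document and its headers."""
    url: str
    headers: Dict[str, str]
    body: str


def process_document(doc: Document) -> Optional[str]:
    """Return a tab-separated 'url<TAB>title' line, or None to skip."""
    # Examine headers: only HTML documents are of interest here.
    if "html" not in doc.headers.get("Content-Type", "").lower():
        return None
    # Examine document: pull out the <title> text if there is one.
    match = TITLE_RE.search(doc.body)
    if match is None:
        return None
    title = " ".join(match.group(1).split())  # collapse whitespace
    return f"{doc.url}\t{title}"


if __name__ == "__main__":
    sample = Document(
        url="http://example.com/",
        headers={"Content-Type": "text/html"},
        body="<html><head><title>Example Domain</title></head></html>",
    )
    line = process_document(sample)
    if line is not None:
        sys.stdout.write(line + "\n")  # write to <stdout>
```

Keeping the handler a pure function of a single document is what makes it easy to run many copies side by side on separate compute nodes.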

Page 30: Build Your Own Search Engine

Test Code

Run small test on Interactive Node
Use predefined document collection
Ensure proper functioning
Measure document processing time
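
Measuring per-document processing time on a small sample makes it possible to estimate how many CPU hours a full run would need. The sketch below shows one way to do that; both helper names are hypothetical, and the extrapolation assumes the sample is representative of the full collection.

```python
import time
from typing import Callable, Iterable


def mean_seconds_per_document(process: Callable[[str], object],
                              documents: Iterable[str]) -> float:
    """Run `process` over a sample and return the average seconds per document."""
    count = 0
    start = time.perf_counter()
    for doc in documents:
        process(doc)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0:
        raise ValueError("need at least one sample document")
    return elapsed / count


def estimated_cpu_hours(seconds_per_document: float, total_documents: int) -> float:
    """Extrapolate the sample timing to a full collection."""
    return seconds_per_document * total_documents / 3600


if __name__ == "__main__":
    sample = ["some page text"] * 1000
    per_doc = mean_seconds_per_document(lambda text: text.upper(), sample)
    hours = estimated_cpu_hours(per_doc, 1_000_000)
    print(f"{per_doc:.6f} s/doc, about {hours:.3f} CPU hours per million docs")
```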

Page 32: Build Your Own Search Engine

Identify Pages

Choose a crawl
Choose pages within the crawl by
  URL
  Linkage
  Alexa Traffic Rank (Top N)
  Redirection status
  Content
Define a Collection
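
Defining a collection amounts to filtering a crawl by criteria like those listed above. The sketch below applies three of them (URL, traffic rank, redirection status) to in-memory records. The `PageRecord` type and `define_collection` function are invented for illustration; on the platform itself the selection is made through Alexa's own tools, which this transcript does not show.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical per-page metadata used for selection."""
    url: str
    traffic_rank: int   # 1 = most visited
    is_redirect: bool


def define_collection(pages: Iterable[PageRecord],
                      domain: str, top_n: int) -> List[PageRecord]:
    """Keep non-redirect pages on `domain` (or a subdomain) ranked in the top N."""
    selected = []
    for page in pages:
        host = urlparse(page.url).hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and page.traffic_rank <= top_n and not page.is_redirect:
            selected.append(page)
    return selected


if __name__ == "__main__":
    crawl = [
        PageRecord("http://www.example.com/a", 120, False),
        PageRecord("http://www.example.com/old", 120, True),
        PageRecord("http://other.org/", 5, False),
        PageRecord("http://blog.example.com/", 90_000, False),
    ]
    for page in define_collection(crawl, "example.com", top_n=1000):
        print(page.url)
```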

Page 41: Build Your Own Search Engine

Schedule Job

Allocate compute cluster resources
  Time
  Processors (1-10)
Each processor
  3.6 GHz CPU
  4 GB of RAM
  500 GB of local disk storage
Charged at $1 per CPU hour
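
The scheduling trade-off above can be sketched as a small calculation: more processors shorten the wall-clock time, while the bill follows total CPU hours. The function name is illustrative, and the model assumes the work divides evenly across processors with no overhead, which a real job will only approximate.

```python
from typing import Tuple

USD_PER_CPU_HOUR = 1.0   # from the slide
MAX_PROCESSORS = 10      # "Processors (1-10)"


def schedule_estimate(total_cpu_hours: float, processors: int) -> Tuple[float, float]:
    """Return (wall_clock_hours, cost_usd) for a job.

    Assumption: perfect parallelism, so wall-clock time is the total
    CPU time divided by the number of processors.
    """
    if not 1 <= processors <= MAX_PROCESSORS:
        raise ValueError("processors must be between 1 and 10")
    wall_clock_hours = total_cpu_hours / processors
    cost_usd = total_cpu_hours * USD_PER_CPU_HOUR
    return wall_clock_hours, cost_usd


if __name__ == "__main__":
    for n in (1, 5, 10):
        hours, cost = schedule_estimate(40, n)
        print(f"{n:>2} processors: {hours:.1f} h wall clock, ${cost:.2f}")
```

Under this model the cost is the same for any processor count, so the choice mainly affects how soon the results arrive.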

Page 45: Build Your Own Search Engine

Run Job

Job runs at specified time
Code instances created on each node
Job output combined automatically

[Diagram: Collection -> Compute Node #1 ... Compute Node N -> Combine Results]
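
The fan-out and combine pattern in the diagram can be imitated locally. In the sketch below a thread pool stands in for the compute nodes and a word count stands in for the user's application; this is an analogy for how the pieces fit together, not the platform's actual execution framework, and all names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(collection: Sequence[str], nodes: int) -> List[List[str]]:
    """Split the collection into one slice per node (round-robin)."""
    return [list(collection[i::nodes]) for i in range(nodes)]


def node_task(documents: List[str]) -> Counter:
    """The code instance run on one node: count words in its slice."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def run_job(collection: Sequence[str], nodes: int) -> Counter:
    """Run one task per node, then combine the partial outputs."""
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(node_task, partition(collection, nodes)))
    combined: Counter = Counter()
    for partial in partials:
        combined.update(partial)
    return combined


if __name__ == "__main__":
    docs = ["search engine", "Search platform", "engine room"]
    print(run_job(docs, nodes=2).most_common(2))
```

Because each node only sees its own slice and the combine step just adds counts, the result does not depend on how the collection is partitioned.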

Page 47: Build Your Own Search Engine

Check Results

Monitor progress using portal
Final status email
Log files
Output

Page 49: Build Your Own Search Engine

Publishing Results

Store data to S3
Create a new index for AWIS use
Publish data for access via web search
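
Since results can be published as an XML feed, here is a minimal sketch of turning job output into XML with Python's standard library. The element names (`results`, `result`, `url`, `title`) are invented for this example; the slides do not specify a feed schema.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def results_to_xml(results: Iterable[Tuple[str, str]]) -> str:
    """Serialize (url, title) pairs as an XML string.

    The element names are illustrative, not a documented feed format.
    """
    root = ET.Element("results")
    for url, title in results:
        item = ET.SubElement(root, "result")
        ET.SubElement(item, "url").text = url
        ET.SubElement(item, "title").text = title
    # ElementTree escapes characters such as & and < in text nodes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(results_to_xml([("http://example.com/", "Example Domain")]))
```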

Page 50: Build Your Own Search Engine

Q & A

Page 52: Build Your Own Search Engine

For More Information:

AWSP: websearch.alexa.com
Alexa Blog: awis.blogspot.com
AWS Blog: aws.typepad.com
Amazon Web Services: aws.amazon.com

© 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
