Date post: | 10-Jun-2015 |
1
Build Your Own Search Engine
Jeff Barr
Web Services Evangelist
Amazon Web Services
NGW044
2
Agenda
Amazon Web Services Overview
Looking Back
Build Your Own Search Engine
Q&A
3
Introduction And Background
Software development background
Veteran of several startups
Visual Studio team at Microsoft (DHTML, XML, Web Services)
3.5 Years with Amazon
Amazon Web Services Evangelist
4
What Is Amazon?
Online Retailer
Over 55 million active customer accounts
Seven countries: US, UK, Germany, Japan, France, Canada, China
Technology Consumer
Multi-National Web Sites
Vast Data Warehouse – 25 TB
World-Class Logistics – 21 fulfillment centers; 9 million ft²
Technology Provider
Hundreds of thousands of Amazon Associates
Over 1,050,000 active seller accounts
Over 150,000 software developers registered to use Amazon Web Services
5
What Is Alexa?
Amazon subsidiary since 1999
Alexa Toolbar
Web metrics
Traffic rankings
Web crawling
6
What Is Amazon Web Services?
APIs that give developers programmatic access to Amazon’s data and technology
Building-block web services
Web-scale infrastructure
E-commerce capability
Content, data, and information
New business models
Customer-created content
7
AWS Product Family
Amazon E-Commerce Service
Complete access to Amazon’s product catalog
Free + Associates commissions paid
Amazon Historical Pricing
Data warehouse access for product pricing
Monthly Fee
Amazon Mechanical Turk
Artificial Artificial Intelligence
10% Commission
Paid workforce
Amazon Simple Queue Service
IT building block
In beta
Amazon S3
Storage for the internet
Charge by storage/bandwidth usage
Alexa Web Information Service
Data warehouse access for web crawl data
10K calls per month free, then 15 cents per 1000 calls
Alexa Top Sites
Top sites by Alexa traffic rank
Charges by URL
Alexa Web Search Platform
Roll your own search engine
Pay for time, storage, bandwidth
8
Amazon S3
Simple Storage Service
Storage for the internet - web service to read and write data
15 cents per Gigabyte-Month to store data
20 cents per Gigabyte to access data
Private and public storage
Scalable, reliable, cost-effective, and simple!
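As a rough illustration of the S3 prices on this slide, the sketch below estimates a monthly bill. It assumes charges scale linearly with usage and that "access" means gigabytes transferred; the function name and the integer-cents representation are illustrative choices, not part of any Amazon API.

```python
# Prices taken from the slide, expressed in whole cents to avoid float rounding.
STORAGE_CENTS_PER_GB_MONTH = 15
ACCESS_CENTS_PER_GB = 20


def s3_monthly_cost_cents(stored_gb: int, accessed_gb: int) -> int:
    """Estimate one month's S3 bill in cents (assumes simple linear pricing)."""
    if stored_gb < 0 or accessed_gb < 0:
        raise ValueError("usage figures must be non-negative")
    return stored_gb * STORAGE_CENTS_PER_GB_MONTH + accessed_gb * ACCESS_CENTS_PER_GB


def format_usd(cents: int) -> str:
    """Render a cent amount as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"
```

For example, storing 100 GB for a month and transferring 50 GB comes to `format_usd(s3_monthly_cost_cents(100, 50))`, which is `$25.00` under these assumptions.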
9
Looking Back
10
Getting Online
History Lesson
1996 vs. 2006
A lot has changed
Let’s take a look
11
Going Online
Then and Now
What does it take to bring a simple web site online?
Domain registration
DNS support
Network connection
Server Hardware
Development Tools
Publicity Vehicle
Monetization System
12
Then And Now
Domain Registration
Then
Expensive ($70/year)
Single vendor
Multi-step, multi-day process
Now
Cheap ($10 or less / year)
Dozens of vendors
Single step, 10 minute process
13
Then And Now
DNS Support
Then
Leech off of friend or university
Long propagation times
Complicated
Days to understand & set up
Now
Free services (e.g. ZoneEdit)
Very short propagation time
Minutes to understand & set up
14
Then Versus Now
Network Connection
Then
9600 baud modem
ISDN
T1
Expensive
Now
DSL
Dedicated hosting
Cheap
15
Then Versus Now
Server Hardware
Then
Start with dedicated PC
Upgrade to expensive Sun hardware
Now
Build your own PC
Hosting providers (EV1, BocaCom, Server Beach)
Expensive Sun hardware
16
17
Then And Now
Development Tools
Then
Text Editor
Shell Window
Now
Visual Web Developer
HTML Kit
Front Page
18
Then Versus Now
Publicity Vehicle
Then
Yahoo What’s New
Usenet
Press Release
Wired Magazine
Now
Blogs / RSS / Pings
Link sites
Word of Mouth
19
Then Versus Now
Monetization System
Then
Money? We are purists and we are doing this for fun!
Banner ads
Ad sales people
Large sites only
Now
Pay per click
Self serve
Monetize page views
20
Then
Building a Search Engine
Lots of Servers
Lots of Bandwidth
Lots of Software
Lots of Money
Lots of Intellectual Capital
Lots of Time
21
Now
Building a Search Engine
Use our infrastructure
Leverage Alexa’s Crawl
Alexa Web Search Platform
300 TB Archive
10 Billion web pages
Pay as you go
22
AWSP
Alexa Web Search Platform
Build your own search engine!
Process
Specify pages to access within the 300 TB archive
Write parallelizable application to process pages
Publish results as XML feed or as web service
Pricing – everything costs $1
50 GB of data processing
1 CPU Hour
1 GB of data downloaded
4000 web service requests
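To make the "everything costs $1" pricing concrete, here is a minimal cost estimator. It assumes each $1 unit (50 GB processed, 1 CPU hour, 1 GB downloaded, 4000 web service requests) is billed pro rata, which the slide does not state; the function and parameter names are hypothetical.

```python
def awsp_cost_usd(gb_processed: float, cpu_hours: float,
                  gb_downloaded: float, web_requests: int) -> float:
    """Estimate an AWSP bill in dollars from the four $1 units on the slide.

    Assumption: partial units are charged proportionally.
    """
    processing = gb_processed / 50   # $1 per 50 GB of data processing
    compute = cpu_hours * 1.0        # $1 per CPU hour
    download = gb_downloaded * 1.0   # $1 per GB of data downloaded
    requests = web_requests / 4000   # $1 per 4000 web service requests
    return processing + compute + download + requests
```

Under these assumptions, a job that processes 500 GB, uses 20 CPU hours, downloads 2 GB of results, and then serves 8000 requests would cost `awsp_cost_usd(500, 20, 2, 8000)`, i.e. 10 + 20 + 2 + 2 = 34 dollars.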
23
AWSP Concepts
Interactive Node - Development
User Store – 12 TB of storage
Compute Node – Processing
Data Store
4 billion documents per crawl
3 crawls @ 100 TB
In Process
Current
Previous
All document types (HTML, Media, XML)
Document header data
24
AWSP Design Process
Great Idea
Write Code
Test Code
Identify Pages
Schedule Job
Run Job
Check Results
Publish Results
25
Great Ideas
Vertical search engine
Search engine optimization (SEO)
Search engine marketing (SEM)
Research
< your idea here >
27
Write Code
Run on Interactive Node
Linux command line
Interactive application development
Use Collection API for data retrieval
Use any language
Libraries for C, Java, Perl
Execution framework
Application processes one document
28
Write Code
Code can
Examine document
Examine headers
Write to a collection
Write to <stdout>
Store data to Amazon S3
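The "application processes one document" model can be sketched as a single function that examines the headers and the document body and returns one line of output. This is an illustration only: the deck does not show the Collection API, so `process_document`, its parameters, and the tab-separated output format are all hypothetical, and the standard-library HTML parser stands in for whatever a real application would use.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def process_document(url, headers, body):
    """Handle exactly one document; return an output line, or None to skip it.

    Hypothetical signature: a real framework would supply the document.
    """
    # Examine headers: skip anything that is not HTML.
    if not headers.get("Content-Type", "").startswith("text/html"):
        return None
    # Examine document: pull out the page title.
    parser = TitleParser()
    parser.feed(body)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    return f"{url}\t{title}"
```

A caller (standing in for the execution framework) would print each non-`None` result to standard output, one line per document.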
30
Test Code
Run small test on Interactive Node
Use predefined document collection
Ensure proper functioning
Measure document processing time
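One way to measure document processing time during a small test run is to time the per-document function over a sample and average the result. The helper below is a sketch; `process` is whatever per-document callable is under test, and `documents` is a small in-memory sample standing in for the predefined test collection.

```python
import time


def average_seconds_per_document(process, documents):
    """Run `process` on each sample document and return the mean wall-clock time."""
    if not documents:
        raise ValueError("need at least one sample document")
    start = time.perf_counter()
    for doc in documents:
        process(doc)
    elapsed = time.perf_counter() - start
    return elapsed / len(documents)
```

The resulting figure can then be multiplied by the number of documents in the full collection to get a first estimate of the CPU time a job will need.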
32
Identify Pages
Choose a crawl
Choose pages within the crawl by
URL
Linkage
Alexa Traffic Rank (Top N)
Redirection status
Content
Define a Collection
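The selection criteria above can be pictured as a filter over page records. The sketch below covers three of them (URL, traffic rank, redirection status); `PageRecord` and `define_collection` are hypothetical names for illustration and do not reflect the actual AWSP interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical metadata for one crawled page."""
    url: str
    traffic_rank: int   # 1 is the most visited site
    is_redirect: bool


def define_collection(records, url_substring, top_n):
    """Keep non-redirect pages whose URL matches and whose rank is within the top N."""
    matches = [
        r for r in records
        if url_substring in r.url
        and not r.is_redirect
        and r.traffic_rank <= top_n
    ]
    # Return the collection ordered by rank, best first.
    return sorted(matches, key=lambda r: r.traffic_rank)
```

For instance, `define_collection(records, ".edu", 1000)` would keep only non-redirecting pages on `.edu` URLs from sites ranked in the top 1000.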
33
34
35
36
37
38
39
41
Schedule Job
Allocate compute cluster resources
Time
Processors (1-10)
Each processor
3.6 GHz CPU
4 GB of RAM
500 GB of local disk storage
Charged at $1 per CPU hour
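When choosing how many processors to allocate, a back-of-the-envelope estimate helps. The sketch below assumes the work divides evenly across processors and that CPU hours are billed pro rata at $1 each; both are simplifying assumptions, and the function name is illustrative.

```python
def estimate_job(document_count, seconds_per_document, processors):
    """Return (cpu_hours, wall_clock_hours, cost_usd) for a job.

    Assumes perfect parallel scaling and $1 per CPU hour, billed proportionally.
    """
    if not 1 <= processors <= 10:
        raise ValueError("the slide allows between 1 and 10 processors")
    cpu_hours = document_count * seconds_per_document / 3600
    wall_clock_hours = cpu_hours / processors
    cost_usd = cpu_hours * 1.0
    return cpu_hours, wall_clock_hours, cost_usd
```

With 36,000 documents at one second each on 10 processors, this gives 10 CPU hours, about one hour of wall-clock time, and a cost of $10. Note that under these assumptions adding processors shortens the run without changing the bill.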
42
43
45
Run Job
Job runs at specified time
Code instances created on each node
Job output combined automatically
[Diagram: Collection, split across Compute Node #1 ... Compute Node N, then Combine Results]
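The split, process, and combine flow on this slide can be imitated locally. The sketch below uses threads as stand-ins for compute nodes and round-robin sharding as the split; how AWSP actually distributes documents and merges output is not described in the deck, so every name here is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(documents, node_count):
    """Split the collection into one shard per node (round-robin)."""
    return [documents[i::node_count] for i in range(node_count)]


def run_node(process, shard):
    """One code instance: apply the per-document function to every document in its shard."""
    return [process(doc) for doc in shard]


def run_job(process, documents, node_count):
    """Run all shards concurrently, then combine the per-node outputs into one list."""
    shards = partition(documents, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        per_node = list(pool.map(lambda shard: run_node(process, shard), shards))
    return [line for output in per_node for line in output]
```

The combined output is grouped by node rather than in the original document order, so callers that care about ordering should sort the result.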
47
Check Results
Monitor progress using portal
Final status email
Log files
Output
49
Publishing Results
Store data to S3
Create a new index for AWIS use
Publish data for access via web search
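Earlier in the deck, an XML feed is named as one way to publish results. As a minimal sketch of what producing such a feed could look like, the function below serializes (URL, title) pairs with the Python standard library. The element names `results`, `result`, `url`, and `title` are invented for illustration; the deck does not specify a feed schema.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize (url, title) pairs as a small XML document (hypothetical schema)."""
    root = ET.Element("results")
    for url, title in results:
        item = ET.SubElement(root, "result")
        ET.SubElement(item, "url").text = url
        ET.SubElement(item, "title").text = title
    return ET.tostring(root, encoding="unicode")
```

The returned string could then be written to a file or stored in S3 as the published output of a job.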
50
Q & A
51
© 2006 Microsoft Corporation. All rights reserved.This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
For More Information:
AWSP: websearch.alexa.com
Alexa Blog: awis.blogspot.com
AWS Blog: aws.typepad.com
Amazon Web Services: aws.amazon.com