The Internet as a Single Database: Technologies Used & Lessons Learned
Houston Code Camp, August 2011
Shion Deysarkar, CEO, Datafiniti
What does that mean?
Places, people, news, URLs, products, etc., etc.
All web data in one, unified format
Accessible as if you were querying a database
Why build such a thing?
Web crawling is kludgy and unintuitive
Our users needed a better way of getting web data
Developers deserve something better than current APIs
Why build such a thing?
Because it would be awesome!
Not an easy task…
The Challenges
The Challenges
There’s a lot of data on the web
100 million registered domains
Maybe only 100,000 have interesting stuff? (Which ones?)
Some sites have millions or billions of data points
It’s all structured differently!
Do we have to write a web crawler for each website?
Writing 100,000 web crawlers seems... not fun
The Challenges
Data can conflict
How do we know which data is correct?
Website    | Name            | Categories                              | Address                | Zip Code   | Neighborhood                                    | Phone
-----------|-----------------|-----------------------------------------|------------------------|------------|-------------------------------------------------|---------------
Yelp       | Max's Wine Dive | Wine Bars, American (New), Music Venues | 4720 Washington Ave    | 77007      | Washington Corridor, Rice Military, The Heights | (713) 880-8737
Citysearch | Max's Wine Dive | Restaurants, Wine Bars                  | 4720 Washington Ave #B | 77007      | Washington Ave., Memorial Park, Central         | (713) 880-8737
Urbanspoon | Max's Wine Dive | American, International                 | 4720 Washington Ave    | 77007      | Rice Military                                   | (713) 880-8737
Google     | Max's Wine Dive | Wine Bar, American Restaurant           | 4720 Washington Ave    | 77007-5436 |                                                 | (713) 880-8737
Zagat      | Max's Wine Dive | Eclectic, Int'l                         | 4720 Washington Ave.   | 77007      | Heights                                         | 713-880-8737
The Challenges
So let’s start at the beginning:
Data Collection
Data Collection: Building a scalable web crawler
Cloud or local data center? Neither.
Grid computing (think SETI@home)
1000s of home PCs that exchange time & bandwidth for $
Crawl very fast for relatively little $
Data Collection: Building a scalable web crawler
Coding 1000s of extraction apps
Abstract away everything but pattern matching and link generation
Build a framework that handles all the kludgy work:
- Link following & de-duplication
- Result formatting & storage
- Throttle rates & crawling behavior
- Any other crawling activity not specific to a website's structure
Load lightweight, website-specific apps into the above framework
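The split described above can be sketched in a few lines of Python. This is an illustrative toy, not Datafiniti's actual framework: the framework owns queueing, URL de-duplication, and result storage, while a lightweight site-specific "app" supplies only the pattern matching and link generation. All class and function names here are invented for the example; a real fetcher would also throttle and retry.

```python
import re
from collections import deque

class SiteApp:
    """A lightweight, site-specific extraction app: patterns and links only."""
    def extract(self, url, html):
        raise NotImplementedError
    def links(self, url, html):
        raise NotImplementedError

class CrawlFramework:
    """Handles the kludgy, site-agnostic work: queueing, de-dup, storage."""
    def __init__(self, app, fetch):
        self.app = app
        self.fetch = fetch      # injected fetcher (a real one throttles, retries)
        self.seen = set()       # URL de-duplication
        self.results = []       # result storage (a real one writes to the data store)

    def crawl(self, start_url):
        queue = deque([start_url])
        while queue:
            url = queue.popleft()
            if url in self.seen:
                continue
            self.seen.add(url)
            html = self.fetch(url)
            self.results.extend(self.app.extract(url, html))
            for link in self.app.links(url, html):
                if link not in self.seen:
                    queue.append(link)
        return self.results

# An example site-specific app: nothing but pattern matching + link generation.
class PhonePageApp(SiteApp):
    def extract(self, url, html):
        return re.findall(r"\(\d{3}\) \d{3}-\d{4}", html)
    def links(self, url, html):
        return re.findall(r'href="(/page/\d+)"', html)
```

Because the fetcher is injected, a per-site app stays a few dozen lines and the same framework code runs on every grid node.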
Data Collection: Building a scalable web crawler
Current peak performance: 4.32 billion URLs per month
Deploying 20 new website crawls every month
Easy to scale crawling performance (just add grid nodes)
Easy to scale deployment (just add contractors)
Now for step 2! (step 1 took us 3 years >_<)
Data Storage
Data Storage: Building a scalable data store
What we're dealing with:
TBs (eventually PBs) of data
Billions of rows, thousands of columns (maybe more)
Don't want to deal with sharding
Don't actually care about ACID
Do care about high throughput and fault tolerance
Data Storage: Building a scalable data store
NoSQL (Cassandra) >> MySQL (for us)
Can increase throughput and storage linearly by adding nodes
Virtually unlimited and variable # of columns
Much faster read/write
Some challenges:
- Doesn't yet support all the SELECT features you're used to
- Not a mature technology yet; expect frequent updates
Data Storage: Building a scalable data store
Choosing Cassandra over other NoSQL databases:
More active community; seems to be gaining traction most quickly
Impressive production-scale examples
Backed by corporations (DataStax) and some really smart people
Integrated with other relevant technologies:
- Solr for text search
- Hadoop for batch-style processing
Though, admittedly, it has seen some high-profile scrappings
Data Storage: Building a unified database of everything
Normalizing separate data points that represent the same thing
Co-occurrence: most popular choice wins
[Same listing-comparison table as above: Max's Wine Dive across Yelp, Citysearch, Urbanspoon, Google, and Zagat]
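Co-occurrence voting is simple to state in code: for each field, count how often each value appears across the per-site records and keep the most common one. A minimal sketch (illustrative only; the function name and record shape are assumptions, not Datafiniti's actual pipeline):

```python
from collections import Counter

def normalize_by_cooccurrence(records):
    """Merge per-site records for one entity: for each field,
    the most frequently co-occurring value wins."""
    merged = {}
    fields = {f for rec in records for f in rec}
    for field in fields:
        values = [rec[field] for rec in records if field in rec]
        merged[field] = Counter(values).most_common(1)[0][0]
    return merged
```

On the listing table above, "4720 Washington Ave" (3 of 5 sites) beats the "#B" and trailing-period variants, and "77007" (4 of 5) beats "77007-5436".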
Data Storage: Building a unified database of everything
Normalizing separate data points that represent the same thing
Trusted sources: put more weight on sources that tend to be right
[Same listing-comparison table as above: Max's Wine Dive across Yelp, Citysearch, Urbanspoon, Google, and Zagat]
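Trusted-source weighting is the same vote, but each source's ballot counts in proportion to how often that source has been right historically. A hedged sketch (the function, the `(source, record)` shape, and the trust weights are all invented for illustration):

```python
from collections import defaultdict

def normalize_by_trust(records, trust):
    """Merge (source, record) pairs for one entity: each source's vote for a
    field value is weighted by that source's historical accuracy."""
    merged = {}
    fields = {f for _, rec in records for f in rec}
    for field in fields:
        scores = defaultdict(float)
        for source, rec in records:
            if field in rec:
                # Unknown sources get a neutral weight of 1.0.
                scores[rec[field]] += trust.get(source, 1.0)
        merged[field] = max(scores, key=scores.get)
    return merged
```

With neutral weights this reduces to co-occurrence voting; give one source enough weight and its minority answer (e.g. Google's ZIP+4 "77007-5436") can win.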
Data Storage: Building a unified database of everything
Identifying interesting data on a random web page
Yay, step 3! (step 2 took us 3 months :D)
Data Retrieval
Data Retrieval: Building an easy way to get lots of data fast
Making the right choices for our API:
Single channel for all data retrieval
- RESTful API so anyone can develop with it
- All external and internal functionality uses the same API (easier to manage)
As user-friendly and intuitive as possible:
- SQL-style querying on a NoSQL database
- JSON default output, but also supports CSV and XML
- SSL authentication with token
Briefly considered using a 3rd-party service like Mashery
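To make "SQL-style querying on a NoSQL database" concrete, here is a toy Python sketch of the flavor of such a layer: parse a tiny SQL-ish query, filter in-memory records, and return JSON the way a REST endpoint would. This is not Datafiniti's actual query syntax or API; the grammar, function, and field names are all assumptions for illustration.

```python
import json
import re

def run_query(query, records):
    """Evaluate a tiny SQL-style query ("SELECT f1, f2 WHERE field = value")
    against a list of dicts and return a JSON response body."""
    m = re.match(r"SELECT (.+?) WHERE (\w+) = (.+)", query)
    if not m:
        return json.dumps({"error": "unsupported query"})
    fields = [f.strip() for f in m.group(1).split(",")]
    key, value = m.group(2), m.group(3).strip()
    rows = [
        {f: rec.get(f) for f in fields}
        for rec in records
        if str(rec.get(key)) == value
    ]
    # JSON is the default output; a real endpoint would also offer CSV/XML.
    return json.dumps({"records": rows, "count": len(rows)})
```

The appeal of this shape is that a developer writes a familiar-looking query over one channel instead of learning each website's structure or a NoSQL store's native API.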
Put it all together… (step 3 took 3 weeks!!!)
Sneak Peek
Sign up for the beta at http://www.datafiniti.net
Follow us @Datafiniti
Launching Soon