CONFIDENTIALCONFIDENTIALCONFIDENTIALCONFIDENTIAL
Geo Searches for Health Care Pricing Data with MongoDB
NoSQL Now 2013
Robert Stewart
Senior Architect, Castlight Health
@wombatnation
Castlight Health
The Business and Technical Problems
Initial Solution
MongoDB, Geospatial Indexes and SSDs
Replica Set Flipping
Castlight Health
Hosted web and mobile applications providing unbiased information on health care cost and quality
Customers are employers and health plans
Founded in San Francisco in 2008
$181 million in VC funding
#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
Hiring!
Home Page
Search Results
Business Problem
Support searches for
Prices for a procedure performed by any in-network provider in a geographical area
Prices for all procedures performed by a single provider
Sub-second response, even if returning data on thousands of prices
Technical Problems
Need a very fast geospatial index
Rate count at 1 billion and rising
Major rate updates monthly
Difficult to index data to ensure sequential reads
Sometimes lots of random reads
[Chart: time axis from Apr-11 to Aug-13]
Pricing Retrieval Architecture
Initial Solution
Store pricing data in MySQL
When Pricing Service starts, create two in-memory indexes and cache most of the rates
55 GB JVM Heap with lots of GC tuning
20-minute service startup time to build indexes
3 hours for background caching of most rates
Trouble Brewing:
  Total rates growing quickly
  Rolling restart becoming unacceptably slow
  If rates not in Java or MySQL cache, retrieval was very slow
Enter the Mongo
Geospatial Indexes We Evaluated
Standard 2D index in MongoDB 2.2 too slow for my use case
Geo Haystack index
  From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
2DSphere index in MongoDB 2.4
Mercator Projection with 10 degree grid
Geo Haystack
We chose degrees long-lat for x-y coordinate system
25 miles is our default search radius
  Roughly 0.5 degrees in the middle of the US
db.priceables_1.ensureIndex(
{ loc: "geoHaystack", pm: 1 },
{ bucketSize: 0.5 })
db.runCommand(
{ geoSearch: "priceables_1",
near: [-122.4, 37.79],
maxDistance: 0.5,
search: { pm: 6757 },
limit: 50000 })
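The "roughly 0.5 degrees" figure can be checked with a little arithmetic. The sketch below is illustrative only: the Earth radius constant and the helper names `milesToDegrees` and `buildHaystackSearch` are assumptions introduced here, not part of the original deck. It converts a distance in miles to degrees at a given latitude and assembles a command document shaped like the geoSearch call above.

```javascript
// Mean Earth radius in miles (an assumed constant, not from the slides).
const EARTH_RADIUS_MILES = 3959;

// Miles spanned by one degree of latitude (about 69.1 with this radius).
const MILES_PER_DEGREE_LAT = (2 * Math.PI * EARTH_RADIUS_MILES) / 360;

// Convert a distance in miles to degrees of latitude and of longitude
// at the given latitude. Longitude degrees shrink by cos(latitude).
function milesToDegrees(miles, latitudeDeg) {
  const latRadians = (latitudeDeg * Math.PI) / 180;
  return {
    latDegrees: miles / MILES_PER_DEGREE_LAT,
    lngDegrees: miles / (MILES_PER_DEGREE_LAT * Math.cos(latRadians)),
  };
}

// Build a geoSearch command document shaped like the one on the slide.
// Hypothetical helper: it only assembles the document, it does not run it.
function buildHaystackSearch(collection, lng, lat, maxDistanceDeg, pm, limit) {
  return {
    geoSearch: collection,
    near: [lng, lat],
    maxDistance: maxDistanceDeg,
    search: { pm: pm },
    limit: limit,
  };
}

const span = milesToDegrees(25, 37.79);
console.log(span.latDegrees.toFixed(2), span.lngDegrees.toFixed(2));

const cmd = buildHaystackSearch("priceables_1", -122.4, 37.79, 0.5, 6757, 50000);
console.log(JSON.stringify(cmd));
```

Under these assumptions, 25 miles at 37.79 degrees north works out to roughly 0.36 degrees of latitude and 0.46 degrees of longitude, so 0.5 is a slightly generous round-up for both maxDistance and bucketSize.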
Geo Haystack Cons
Only one secondary filter
Second part of index can’t have an array value
Error on unindexed query on only the second part of the key
2DSphere Index
Supports earth-like spherical geometries
Points can be GeoJSON or x,y pairs
GeoJSON LineString and Polygon
Queries for inclusion, intersection and proximity
2DSphere Index Creation and Sample Query
db.priceables_1.ensureIndex(
  { loc: "2dsphere", pm: 1, pn: 1 })
db.priceables_1.find(
  { "loc":
    { "$geoWithin":
      { "$centerSphere":
        [ [ -94.2128, 36.3840 ], 0.006314 ] } },
    "pm": 6441,
    "pn": { "$in": [ 5236, 5237 ] }
  })
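The radius passed to $centerSphere is in radians, that is, the search distance divided by the radius of the sphere. The following sketch shows that conversion. The 3959-mile Earth radius is an assumption; it is chosen because it reproduces the 0.006314 value in the sample query for a 25-mile radius. The helper names `milesToRadians` and `buildPriceFilter` are hypothetical.

```javascript
// Mean Earth radius in miles. Assumed value: with a 25-mile radius it
// reproduces the 0.006314 used in the sample query.
const MEAN_EARTH_RADIUS_MILES = 3959;

// $centerSphere expects radians: distance divided by the sphere's radius.
function milesToRadians(miles) {
  return miles / MEAN_EARTH_RADIUS_MILES;
}

// Build a filter document shaped like the sample query: a spherical
// circle on loc, an equality match on pm, and an $in match on pn.
function buildPriceFilter(lng, lat, radiusMiles, pm, pnList) {
  return {
    loc: {
      $geoWithin: {
        $centerSphere: [[lng, lat], milesToRadians(radiusMiles)],
      },
    },
    pm: pm,
    pn: { $in: pnList },
  };
}

const filter = buildPriceFilter(-94.2128, 36.384, 25, 6441, [5236, 5237]);
console.log(JSON.stringify(filter));
```

A filter built this way could be passed to find() on the collection. Because the radius is derived from miles at call time, the search distance can change without recomputing constants by hand.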
2DSphere Results
Geospatially Accurate
Even Faster than Haystack
SSDs
For uncached data on HDD, MongoDB geo index was twice as fast as custom Java geo index with MySQL
Still close to 1 minute for big queries with full data set
Death by random read
Tested with a $200 Samsung SSD
  Typical query dropped to 20 ms
  Big query only about 150 ms
Mongoperf on SSDs
Random 4k block reads, 5 GB file, 16 threads
Env   | SSD                        | Read Ops/s | Read MB/s
Prod  | Samsung 200GB SLC          | 74k        | 288
QA VM | Samsung 200GB SLC          | 30k        | 117
Dev   | Samsung 830 256GB SATA MLC | 47k        | 183
Sequential write of the 5 GB file
Env   | SSD                        | Write Ops/s | Write MB/s
Prod  | Samsung 200GB SLC          | 1074        | 289
QA VM | Samsung 200GB SLC          | 405         | 196
Dev   | Samsung 830 256GB SATA MLC | 438         | 210
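The read figures above are consistent with each other: throughput is just operations per second times the 4k block size. The sketch below checks that, assuming 4k means 4096 bytes and that the MB/s column uses binary megabytes (both are assumptions; the ops/s inputs are the rounded values from the slide).

```javascript
// Convert random-read operations per second into MiB per second,
// assuming each operation reads one block of blockBytes bytes.
function readThroughputMiB(opsPerSecond, blockBytes = 4096) {
  return (opsPerSecond * blockBytes) / (1024 * 1024);
}

// Read ops/s as reported on the slide (already rounded to thousands).
const readOps = { Prod: 74000, "QA VM": 30000, Dev: 47000 };

for (const [env, ops] of Object.entries(readOps)) {
  console.log(env, readThroughputMiB(ops).toFixed(0), "MiB/s");
}
```

With these inputs the computed values land within about 1 MB/s of the reported 288, 117 and 183. The small differences are expected because the ops/s column is rounded.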
Low Impact Pricing Updates
Requirements
  Major price updates monthly
  Minor updates more frequently
Huge bulk loads with no impact on active replica set
I/O bound, not CPU bound
Solution
  Two MongoDB replica sets
  Multiple SSDs per server
Replica Set Architecture
Replica Sets: prodpricing1, prodpricing2
Physical Servers:
  Server pricing1
    mongod 28001 primary
    mongod 28002 secondary
  Server pricing2
    mongod 28001 secondary
    mongod 28002 primary
  Server db1
    mongod 28001 arbiter
  Server db2
    mongod 28002 arbiter
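As a rough illustration of this layout, the sketch below builds configuration documents of the shape that rs.initiate() accepts: two data-bearing members and one arbiter per replica set. It rests on assumptions not confirmed by the slides: that port 28001 belongs to prodpricing1 and port 28002 to prodpricing2, and that the server names resolve as hostnames. The helper `buildReplicaSetConfig` is hypothetical.

```javascript
// Build a replica set configuration document: data-bearing members
// first, then a single arbiter flagged with arbiterOnly.
function buildReplicaSetConfig(name, port, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, i) => ({
    _id: i,
    host: `${host}:${port}`,
  }));
  members.push({
    _id: members.length,
    host: `${arbiterHost}:${port}`,
    arbiterOnly: true,
  });
  return { _id: name, members: members };
}

// Assumption: port 28001 is prodpricing1 and port 28002 is prodpricing2.
const rs1 = buildReplicaSetConfig("prodpricing1", 28001, ["pricing1", "pricing2"], "db1");
const rs2 = buildReplicaSetConfig("prodpricing2", 28002, ["pricing2", "pricing1"], "db2");

console.log(JSON.stringify(rs1, null, 2));
console.log(JSON.stringify(rs2, null, 2));
```

Member order in the configuration does not by itself decide which node becomes primary. The slide's arrangement, with each physical server holding the primary of one set and a secondary of the other, would need to be arranged separately, for example through member priorities.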
Replica Set Flipping Solution
Transfer compressed data files to passive replica set
  Protip: to compress and uncompress
  tar cvf - pricing | pigz > ~/pricing.tgz
  pigz -dc pricing.tgz | tar xvf -
Page in index and data
  db.runCommand({ touch: "priceables_1", index: true, data: true })
Pricing Service operation to atomically flip
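The deck does not show how the Pricing Service performs the flip. One way to picture it is a selector that records which replica set is active and swaps the two roles in a single step. The class below is a hypothetical sketch under that assumption, not the service's actual implementation.

```javascript
// Hypothetical holder for which replica set the service reads from.
class ReplicaSetSelector {
  constructor(active, passive) {
    this.active = active;
    this.passive = passive;
  }

  // Swap the active and passive roles in one assignment and return the
  // newly active set. A caller that reads this.active once per request
  // sees either the old set or the new one, never a mix.
  flip() {
    [this.active, this.passive] = [this.passive, this.active];
    return this.active;
  }
}

const selector = new ReplicaSetSelector("prodpricing1", "prodpricing2");

// After the passive set has been loaded and paged in, switch reads to it.
selector.flip();
console.log(selector.active);
```

The set that was serving traffic becomes the passive one after the flip, so it is free to receive the next bulk load.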
Replica Set Flipping Drawbacks
Obviously, increased cost, but only for extra SSDs
Recently added caching of remote pricing lookups
  TTL collections
Cache is lost during a flip
But, usually flip late at night
Cache eviction time is only a few hours
Overall Results
Geo search speed with cold cache acceptable
Geo search speed with warm cache awesome
Pricing Service startup down to a few seconds
No production impact for major rate updates
Lowered risk for minor rate updates
Summary
Geo Haystack Index great for …
  Retrieving lots of documents in a constrained search area
  Very simple geospatial searches with a single secondary filter
2DSphere Index great for …
  Complex geospatial searches or complex indexing
SSDs great for …
  Random reads
  Reducing need for lots of complex indexes
Replica set flipping great for …
  Instant swap of large amounts of data
  Primarily, if not solely, read only
  Trading cost for operational flexibility
Q & A