Geo Searches for Health Care Pricing Data with MongoDB

NoSQL Now 2013

Robert Stewart

Senior Architect, Castlight Health

@wombatnation

Castlight Health

The Business and Technical Problems

Initial Solution

MongoDB, Geospatial Indexes and SSDs

Replica Set Flipping

Castlight Health

Hosted web and mobile applications providing unbiased information on health care cost and quality

Customers are employers and health plans

Founded in San Francisco in 2008

$181 million in VC funding

#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011

Hiring!

Home Page

Search Results

Business Problem

Support searches for:

Prices for a procedure performed by any in-network provider in a geographical area

Prices for all procedures performed by a single provider

Sub-second response, even if returning data on thousands of prices

Technical Problems

Need a very fast geospatial index

Rate count at 1 billion and rising

Major rate updates monthly

Difficult to index data to ensure sequential reads

Sometimes lots of random reads

[Chart: time axis from Apr-11 to Aug-13]

Pricing Retrieval Architecture

Initial Solution

Store pricing data in MySQL

When Pricing Service starts, create two in-memory indexes and cache most of the rates

55 GB JVM Heap with lots of GC tuning

20-minute service startup time to build indexes

3 hours for background caching of most rates

Trouble Brewing:

Total rates growing quickly

Rolling restart becoming unacceptably slow

If rates not in Java or MySQL cache, retrieval was very slow


Enter the Mongo

Geospatial Indexes We Evaluated

Standard 2D index in MongoDB 2.2 too slow for my use case

Geo Haystack index

From docs.mongodb.org:

“A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”

2DSphere index in MongoDB 2.4

Mercator Projection with 10 degree grid

Geo Haystack

We chose degrees long-lat for x-y coordinate system

25 miles is our default search radius

Roughly 0.5 degrees in middle of the US

    db.priceables_1.ensureIndex(
        { loc: "geoHaystack", pm: 1 },
        { bucketSize: 0.5 })

    db.runCommand(
        { geoSearch: "priceables_1",
          near: [-122.4, 37.79],
          maxDistance: 0.5,
          search: { pm: 6757 },
          limit: 50000 })
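The "25 miles is roughly 0.5 degrees" rule of thumb can be sanity-checked with a little arithmetic. The sketch below is not from the talk: it assumes about 69 miles per degree of latitude and scales longitude by the cosine of the latitude, and the function name `milesToDegrees` is hypothetical.

```javascript
// Assumption: one degree of latitude spans roughly 69 miles.
const MILES_PER_DEGREE_LAT = 69.0;

// Convert a radius in miles to degrees of latitude and longitude
// at the given latitude (flat-geometry approximation).
function milesToDegrees(miles, latitudeDeg) {
  const latDegrees = miles / MILES_PER_DEGREE_LAT;
  // A degree of longitude shrinks with the cosine of the latitude.
  const milesPerDegreeLon =
    MILES_PER_DEGREE_LAT * Math.cos((latitudeDeg * Math.PI) / 180);
  return { lat: latDegrees, lon: miles / milesPerDegreeLon };
}

// Latitude taken from the geoSearch example on the slide.
const span = milesToDegrees(25, 37.79);
console.log(span.lat.toFixed(2), span.lon.toFixed(2));
```

Under these assumptions the longitude span comes out a little under 0.5 degrees and the latitude span somewhat smaller, so a single maxDistance of 0.5 errs on the side of returning slightly more area than a true 25-mile circle.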

Geo Haystack Cons

Only one secondary filter

Second part of index can’t have an array value

Error on unindexed query on only the second part of the key

2DSphere Index

Supports earth-like spherical geometries

Points can be GeoJSON or x,y pairs

GeoJSON LineString and Polygon

Queries for inclusion, intersection and proximity

2DSphere Index Creation and Sample Query

    db.priceables_1.ensureIndex(
        { loc: "2dsphere", pm: 1, pn: 1 })

    db.priceables_1.find(
        { "loc": { "$geoWithin": { "$centerSphere": [ [ -94.2128, 36.3840 ], 0.006314 ] } },
          "pm": 6441,
          "pn": { "$in": [ 5236, 5237 ] } })
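$centerSphere takes its radius in radians, which is where a value like 0.006314 comes from: the search distance divided by the Earth's radius. The sketch below assumes a mean Earth radius of about 3,959 miles; the helper names `milesToRadians` and `buildCenterSphereFilter` are hypothetical and only build the filter document, they do not talk to a database.

```javascript
// Assumption: mean Earth radius in miles.
const EARTH_RADIUS_MILES = 3959;

// Convert a distance along the Earth's surface to radians.
function milesToRadians(miles) {
  return miles / EARTH_RADIUS_MILES;
}

// Build a $geoWithin/$centerSphere filter for a point given as [lon, lat].
function buildCenterSphereFilter(lon, lat, radiusMiles) {
  return {
    loc: {
      $geoWithin: {
        $centerSphere: [[lon, lat], milesToRadians(radiusMiles)],
      },
    },
  };
}

// Same center point as the sample query, with the 25-mile default radius.
const filter = buildCenterSphereFilter(-94.2128, 36.384, 25);
const radians = filter.loc.$geoWithin.$centerSphere[1];
console.log(radians.toFixed(4));
```

With these assumptions a 25-mile radius works out to about 0.0063 radians, in line with the value in the sample query.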

2DSphere Results

Geospatially Accurate

Even Faster than Haystack

SSDs

For uncached data on HDD, MongoDB geo index was twice as fast as custom Java geo index with MySQL

Still close to 1 minute for big queries with full data set

Death by random read

Tested with a $200 Samsung SSD

Typical query dropped to 20 ms

Big query only about 150 ms

Mongoperf on SSDs

Random 4k block reads, 5 GB file, 16 threads

Env      SSD                           Read Ops/s    Read MB/s
Prod     Samsung 200GB SLC             74k           288
QA VM    Samsung 200GB SLC             30k           117
Dev      Samsung 830 256GB SATA MLC    47k           183

Sequential write of the 5 GB file

Env      SSD                           Write Ops/s   Write MB/s
Prod     Samsung 200GB SLC             1074          289
QA VM    Samsung 200GB SLC             405           196
Dev      Samsung 830 256GB SATA MLC    438           210
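The read throughput figures follow from the operation counts and the 4k block size. The sketch below is a cross-check rather than anything from the talk: it assumes "4k" means 4,096 bytes, that MB means 2^20 bytes, and that the ops/s values were rounded to the nearest thousand.

```javascript
// Assumptions: 4k block = 4096 bytes, 1 MB = 1024 * 1024 bytes.
const BLOCK_BYTES = 4 * 1024;
const BYTES_PER_MB = 1024 * 1024;

// Throughput implied by a random-read rate at a fixed block size.
function readThroughputMB(opsPerSecond) {
  return (opsPerSecond * BLOCK_BYTES) / BYTES_PER_MB;
}

// Read results as reported on the slide (ops/s expanded from "74k" etc.).
const reported = [
  { env: "Prod", opsPerSecond: 74000, reportedMB: 288 },
  { env: "QA VM", opsPerSecond: 30000, reportedMB: 117 },
  { env: "Dev", opsPerSecond: 47000, reportedMB: 183 },
];

const derived = reported.map((row) => ({
  env: row.env,
  derivedMB: readThroughputMB(row.opsPerSecond),
  reportedMB: row.reportedMB,
}));

for (const row of derived) {
  console.log(row.env, row.derivedMB.toFixed(1), row.reportedMB);
}
```

Each derived value lands within a couple of MB/s of the reported one, which suggests the two read columns describe the same measurement.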

Low Impact Pricing Updates

Requirements:

Major price updates monthly

Minor updates more frequently

Huge bulk loads with no impact on active replica set

I/O bound, not CPU bound

Solution:

Two MongoDB replica sets

Multiple SSDs per server

Replica Set Architecture

Replica sets: prodpricing1, prodpricing2

Physical servers:

Server      mongod 28001    mongod 28002
pricing1    primary         secondary
pricing2    secondary       primary
db1         arbiter         -
db2         -               arbiter
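The layout above could be expressed as two replica set configuration documents of the kind passed to rs.initiate(). This is a sketch under assumptions: the slide does not say which replica set name goes with which port, so pairing prodpricing1 with 28001 and prodpricing2 with 28002 is a guess, and `buildReplicaSetConfig` is a hypothetical helper. Which member becomes primary is decided by election and is not modelled here.

```javascript
// Build a replica set config with data-bearing members plus one arbiter.
function buildReplicaSetConfig(name, port, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, i) => ({
    _id: i,
    host: `${host}:${port}`,
  }));
  // Arbiters vote in elections but hold no data.
  members.push({
    _id: members.length,
    host: `${arbiterHost}:${port}`,
    arbiterOnly: true,
  });
  return { _id: name, members };
}

// Assumed name-to-port pairing; host names are taken from the slide.
const config1 = buildReplicaSetConfig(
  "prodpricing1", 28001, ["pricing1", "pricing2"], "db1");
const config2 = buildReplicaSetConfig(
  "prodpricing2", 28002, ["pricing1", "pricing2"], "db2");

console.log(JSON.stringify(config1, null, 2));
console.log(JSON.stringify(config2, null, 2));
```

Putting one mongod from each set on each pricing server means both physical servers carry a copy of both data sets, while the arbiters on db1 and db2 only break ties.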

Replica Set Flipping Solution

Transfer compressed data files to passive replica set

Protip: use pigz to compress and uncompress

    tar cvf - pricing | pigz > ~/pricing.tgz

    pigz -dc pricing.tgz | tar xvf -

Page in index and data

    db.runCommand({ touch: "priceables_1", index: true, data: true })

Pricing Service operation to atomically flip
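The talk does not show how the Pricing Service performs the flip. One way to picture it is a switch that holds a handle for each replica set and changes which one is active in a single assignment, so every later read sees the new set. Everything in this sketch is hypothetical: the class name `ReplicaSetSwitch`, the plain objects standing in for database connections, and the name-to-port pairing.

```javascript
// Holds one handle per replica set and tracks which one serves reads.
class ReplicaSetSwitch {
  constructor(handles, activeName) {
    if (!(activeName in handles)) {
      throw new Error(`Unknown replica set: ${activeName}`);
    }
    this.handles = handles;
    this.activeName = activeName;
  }

  // Callers fetch the handle per request, so they always see the current set.
  active() {
    return this.handles[this.activeName];
  }

  // Switch to the other replica set with a single assignment.
  flip() {
    const next = Object.keys(this.handles).find(
      (name) => name !== this.activeName);
    this.activeName = next;
    return next;
  }
}

// Stand-in handles; a real service would hold driver connections here.
const pricingSwitch = new ReplicaSetSwitch(
  { prodpricing1: { port: 28001 }, prodpricing2: { port: 28002 } },
  "prodpricing1");

const before = pricingSwitch.active().port;
const flippedTo = pricingSwitch.flip();
const after = pricingSwitch.active().port;
console.log(before, flippedTo, after);
```

The passive set is loaded and paged in first, so the flip itself is only a pointer change and the old active set becomes the target for the next bulk load.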

Replica Set Flipping Drawbacks

Obviously, increased cost, but only for extra SSDs

Recently added caching of remote pricing lookups, using TTL collections

Cache is lost during a flip

But we usually flip late at night

Cache eviction time is only a few hours

Overall Results

Geo search speed with cold cache acceptable

Geo search speed with warm cache awesome

Pricing Service startup down to a few seconds

No production impact for major rate updates

Lowered risk for minor rate updates

Summary

Geo Haystack Index great for …

Retrieving lots of documents in a constrained search area

Very simple geospatial searches with a single secondary filter

2DSphere Index great for …

Complex geospatial searches or complex indexing

SSDs great for …

Random reads

Reducing need for lots of complex indexes

Replica set flipping great for …

Instant swap of large amounts of data

Primarily, if not solely, read only

Trading cost for operational flexibility


Q & A
