Date post: | 10-May-2015 |
Category: | Technology |
Upload: | brian-oneill |
1.800.593.4467 • info@healthmarketscience.com
The Big Data Quadfecta
Brian O’Neill, Lead Architect, Health Market Science
@boneill42, [email protected]
Quadfecta?
1. Quadfecta
• A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.
• Kafka
• Storm
• Elastic Search
• Cassandra
http://www.flickr.com/photos/yogma/3584984540/
http://www.urbandictionary.com/define.php?term=quadfecta
Hold on Tight
http://www.flickr.com/photos/aspexdesign/7817329758/
3 V’s
Volume Variety Velocity
http://www.flickr.com/photos/20989942@N00/373985217/
http://www.flickr.com/photos/rhruzek/4071408305/
http://www.flickr.com/photos/adriansalgado/5310969147/
The Use Case
Our Mission
Prescriber eligibility and remediation
Eliminate fraud, waste and abuse
Insights into the healthcare space
The Business
Business Solutions
Health Care Provider & Facilities
Variety/Velocity
• >2000 sources
• 6 Million unique HCPs
• 10+ years history
Data Challenges
• Constant change in real-world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality
Master Data Solutions
Medical Procedures & Diagnosis
Volume/Velocity
• ~1B claims annually
• +5B records annually
• 5+ years history
Data Challenges
• Sources have incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships
Medical Claims Data
CompleteView, Expense Manager, CompleteSpend
Prescriber Eligibility/Remediation
Analytics (Influencer Networks)
Our Solutions
Business Needs: Finance & Legal, Business Systems, Compliance, Sales & Marketing
Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence
HMS Authoritative Sources: PDC, Federal, State, Medical Claims, Web Derived
Advanced Technology: Master Data Management
Datacenter
Hundreds of Machines
1.5 Petabytes of raw storage
Virtualized (VMware)
On a SAN
Should we go physical???
Under the Hood
Visualization
Dashboard / Reports
Structured Storage
Relational
Indexing
Flexible Storage
NoSQL Graph(s)
Interfacing
Web Services
Distributed Processing
Standardize
Validate
Match
Consolidate
Analytics
Data Sources
Government
Web
Customer
I’m happy
User Interface
Master Data Management
Harvested
Government
Private
Schema Change!
The Design
System of Record
Flexibility (Variety)
Scalability (Velocity + Volume)
Deep Dive
www.history.navy.mil/museums/seabee_museum.htm
Installation
As easy as…
Download: http://cassandra.apache.org/download/
Uncompress: tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz
Run: bin/cassandra -f
(-f keeps it in the foreground)
Data Model
Schema (a.k.a. Keyspace)
Table (a.k.a. Column Family)
Row
• Has an arbitrary number of columns
• Validator for keys (e.g. UTF8Type)
Column
• Validator for values and keys
• Comparator for keys (e.g. DateType or BYOC)
(http://www.youtube.com/watch?v=bKfND4woylw)
Distributed Architecture
Nodes form a token ring.
Nodes partition the ring by initial token.
• initial_token: (in cassandra.yaml)
Partitioners map row keys to tokens.
• Usually randomly, to distribute the data evenly
All columns for a row are stored together on disk in sorted order.
Visually
Token/Hash Range: 0-99
Nodes: A (67-0), B (1-33), C (34-66)
Row     Hash
Alice   50
Bob     3
Eve     15
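As a rough illustration of the ring above, the lookup from token to owning node can be sketched in Python. The hash values are copied from the slide rather than computed (a real partitioner derives the token by hashing the row key), and `END_TOKENS`, `SLIDE_HASHES`, `owner_for_token`, and `owner_for_row` are hypothetical names used only for this sketch, not Cassandra APIs.

```python
from bisect import bisect_left

# Ring from the slide: token space 0-99.
# A owns 67-99 plus 0 (its range wraps around), B owns 1-33, C owns 34-66.
END_TOKENS = [0, 33, 66]   # last token of each node's range, sorted
OWNERS = ["A", "B", "C"]   # OWNERS[i] owns the range ending at END_TOKENS[i]

# Hashes as shown on the slide (assumed, not computed by a real partitioner).
SLIDE_HASHES = {"Alice": 50, "Bob": 3, "Eve": 15}


def owner_for_token(token):
    """Return the node that owns the given token."""
    if not 0 <= token <= 99:
        raise ValueError("token must be in 0-99")
    # Find the first range whose end token is >= token.
    # Tokens past the last end token (67-99) wrap around to node A.
    index = bisect_left(END_TOKENS, token)
    return OWNERS[index % len(OWNERS)]


def owner_for_row(row_key):
    """Look up the slide's hash for a row key and return the owning node."""
    return owner_for_token(SLIDE_HASHES[row_key])


if __name__ == "__main__":
    for name in SLIDE_HASHES:
        print(name, "->", owner_for_row(name))
```

Under these assumptions Alice (50) lands on node C, while Bob (3) and Eve (15) both land on node B.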
Java Interpretation
Each table is a distributed HashMap.
Each row is a SortedMap.
Each column is an entry in the SortedMap.
Cassandra provides a massively scalable version of: HashMap<rowKey, SortedMap<columnKey, columnValue>>
Implications:
• Direct row fetch is fast.
• Searching a range of rows can be costly.
• Searching a range of columns is cheap.
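To make the map-of-sorted-maps picture concrete, here is a minimal single-process sketch in Python. It is only an analogy for the data model, not how Cassandra is implemented; the `Row` class and its `put` and `column_slice` methods are hypothetical names. Python has no built-in SortedMap, so the sketch keeps the column keys in a sorted list.

```python
from bisect import bisect_left, bisect_right, insort


class Row:
    """One row: column keys kept in sorted order, like a SortedMap."""

    def __init__(self):
        self._keys = []     # sorted column keys
        self._values = {}   # column key -> column value

    def put(self, column_key, value):
        # Insert the key in sorted position the first time it is seen.
        if column_key not in self._values:
            insort(self._keys, column_key)
        self._values[column_key] = value

    def column_slice(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


# The "table": a plain dict standing in for the distributed hash map.
table = {}
row = table.setdefault("frodo", Row())   # direct row fetch by key
row.put("2012-10-03", "mordor")
row.put("2012-10-01", "the shire")
row.put("2012-10-02", "rivendell")

if __name__ == "__main__":
    # Column range scan within a single row.
    print(table["frodo"].column_slice("2012-10-01", "2012-10-02"))
```

The row lookup is a hash access and the column slice is two binary searches over sorted keys, which mirrors why row fetches and column ranges are the cheap operations. A range of rows has no such ordering to exploit when keys are hashed.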
The World-Wide Globally Scalable Naughty List!
How about a Naughty and Nice list for Santa?
1.9 billion children
That will fit in a single row!
Queries to support:
• Children can log in and check their standing.
• Santa can find nice children by country, state or zip.
• Toy lists for every child in the world.
Two Tables
Children Table
• Stores all the children in the world.
• One row per child.
• One column per attribute.
NaughtyOrNice Table
• Supports the queries we anticipate.
• Wide-row strategy.
Details of the NaughtyOrNice List
One row per standing:country
• Ensures all children in a country are grouped together on disk.
One column per child, using a compound key
• Ensures the columns are sorted to support our search at varying levels of granularity.
• e.g. All nice children in the US.
• e.g. All naughty children in PA.
Node 3
Node 2
Node 1
Visually
Nice:USA
CA:94333:johny.b.good
CA:94333:richie.rich
Nice:IRL
D:EI33:collin.oneill
D:EI33:owen.oneill
Naughty:USA
CA:94111:bart.simpson
CA:94222:dennis.menace
PA:18964:michael.myers
Watch out for:
• Hot spotting
• Unbalanced clusters
(1) Go to the row.
(2) Get the column slice.
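The two-step lookup (go to the row, then get the column slice) can be sketched in Python using the rows and compound column keys shown on the slide. This is an in-memory stand-in, not a Cassandra client call; `NAUGHTY_OR_NICE` and `find_children` are hypothetical names, and the state:zip:name key layout is read off the slide's examples.

```python
from bisect import bisect_left

# Row key is standing:country; each row holds sorted compound column keys
# of the form state:zip:name (data copied from the slide).
NAUGHTY_OR_NICE = {
    "Nice:USA": sorted([
        "CA:94333:johny.b.good",
        "CA:94333:richie.rich",
    ]),
    "Nice:IRL": sorted([
        "D:EI33:collin.oneill",
        "D:EI33:owen.oneill",
    ]),
    "Naughty:USA": sorted([
        "CA:94111:bart.simpson",
        "CA:94222:dennis.menace",
        "PA:18964:michael.myers",
    ]),
}


def find_children(standing, country, prefix=""):
    """Step 1: go to the row. Step 2: slice the columns matching the prefix."""
    columns = NAUGHTY_OR_NICE.get(f"{standing}:{country}", [])
    # Because columns are sorted, every key sharing the prefix is contiguous,
    # starting at the first key that is >= the prefix.
    start = bisect_left(columns, prefix)
    matches = []
    for column in columns[start:]:
        if not column.startswith(prefix):
            break
        matches.append(column)
    return matches


if __name__ == "__main__":
    print(find_children("Nice", "USA"))           # all nice children in the US
    print(find_children("Naughty", "USA", "PA:")) # all naughty children in PA
```

A longer prefix such as "CA:94111:" narrows the same slice to a single zip code, which is the "varying levels of granularity" the compound key is meant to support.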
What about the toys?
No problem. We’re in a NoSQL store. =)
Let’s just add a column.
CQL Collections!
http://www.datastax.com/dev/blog/cql3_collections
Set
UPDATE users SET emails = emails + {'[email protected]'} WHERE user_id = 'frodo';
List
UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';
Map
UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';
Let’s Crank a Bit...
Let’s code!
What API should we use?
API             Production-Readiness   Potential   Momentum
Thrift          10                     -1          -1
Hector          10                     8           8
Astyanax        8                      9           10
Kundera (JPA)   6                      9           9
Pelops          7                      6           7
Firebrand       8                      9           8
PlayORM         5                      8           7
GORA            6                      9           7
CQL Driver      8                      10          10
IMHO!
Astyanax + CQL FTW!
Coming up for air...
http://www.flickr.com/photos/64738468@N00/7184463727/
But continuing at warp speed...
http://www.flickr.com/photos/19942094@N00/4937185452/lightbox/
Primitives of Distributed Processing
emit/process(tuple(…))
map<key<map<[], value>>
pop(push(v))
index(field, type)
Kafka
What we did wrong…
Could not react to transactional changes
Needed extra logic to track what changed
Took too long
What we did wrong… (II)
AOP-based triggers
• Worked well initially.
• Business processes captured as side effects.
Design Principles
Patterns
• Idempotent operations: elegantly handle replay
• Immutable data: assertions of facts over time
Anti-Patterns
• Transactions / Locking
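One way to read these principles together is sketched below in Python, under stated assumptions: each fact carries a unique id, facts are never updated in place, and the current view is derived from the newest assertion. The fact fields (`id`, `entity`, `attribute`, `value`, `asserted_at`) and the functions `apply_fact` and `current_value` are hypothetical, chosen only for illustration.

```python
def apply_fact(store, fact):
    """Record a fact keyed by its id. Replaying the same fact changes nothing."""
    store.setdefault(fact["id"], fact)


def current_value(store, entity, attribute):
    """Derive the current value from the most recently asserted matching fact."""
    matching = [
        fact for fact in store.values()
        if fact["entity"] == entity and fact["attribute"] == attribute
    ]
    if not matching:
        return None
    return max(matching, key=lambda fact: fact["asserted_at"])["value"]


if __name__ == "__main__":
    facts = [
        {"id": "f1", "entity": "hcp-1", "attribute": "state",
         "value": "PA", "asserted_at": 1},
        {"id": "f2", "entity": "hcp-1", "attribute": "state",
         "value": "NJ", "asserted_at": 2},
    ]
    store = {}
    # Process the stream, then replay all of it: the store ends up identical.
    for fact in facts + facts:
        apply_fact(store, fact)
    print(len(store), current_value(store, "hcp-1", "state"))
```

Because applying a fact twice is a no-op and nothing is overwritten, a replayed message stream needs no locking or transaction to stay consistent in this sketch.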
What we did right.
REST APIs for Loose Coupling
See Virgil: https://github.com/hmsonline/virgil
But really… watch out for Intravert: https://github.com/zznate/intravert-ug
Kafka
• Millions of messages
• Replay enabled
• No transactions / lightning fast
Elastic Search
• Edit distance / Soundex
• Native scalability
• Fuzzy search
• Geospatial
• Facets
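For readers unfamiliar with edit distance, the following Python sketch shows the classic Levenshtein computation and a naive fuzzy match built on it. It illustrates the concept only and is not how Elasticsearch implements fuzzy search; `edit_distance` and `fuzzy_match` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, candidates, max_distance=1):
    """Return the candidates within max_distance edits of the query."""
    return [c for c in candidates if edit_distance(query, c) <= max_distance]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(fuzzy_match("oneil", ["oneill", "smith"]))
```

A linear scan like `fuzzy_match` does not scale to millions of provider names, which is one reason to hand this kind of matching to a search index.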
Storm
• Guaranteed once semantics
• Well-designed processing abstraction
• Beats BYODP
• Momentum
The System
Kafka Queue(s)
Offset
C*
A
BC
C* ES1Kafka
ElasticSearch
ES2C*
REST API
NP. We can route around it.
NP. Replication Factor > 1.
NP. Rewind!
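The "Rewind!" answer depends on consumers tracking an offset into an append-only queue. The Python sketch below models that idea in memory; it does not use the Kafka client API, and the `Log` and `Consumer` classes with their methods are hypothetical names for illustration.

```python
class Log:
    """Append-only message log; messages are addressed by integer offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def read(self, offset, max_messages=100):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads from a Log and remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        # Nothing is deleted on read, so moving the offset back replays messages.
        self.offset = offset


if __name__ == "__main__":
    log = Log()
    for message in ["a", "b", "c"]:
        log.append(message)
    consumer = Consumer(log)
    print(consumer.poll())   # first pass over the log
    consumer.rewind(1)       # a downstream store failed: replay from offset 1
    print(consumer.poll())
```

Replaying from an earlier offset delivers some messages a second time, which is why the downstream writes need to be idempotent.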
Next Steps
Shameless Shoutouts
HMS (https://github.com/hmsonline/)
• storm-cassandra
• storm-elastic-search
• storm-jdbi (coming soon)
ptgoetz (https://github.com/ptgoetz)
• storm-jms
• storm-signals
The Team
We’re hiring!