
Big data philly_jug

Date post: 10-May-2015
Upload: brian-oneill
Description:
Big Data Overview and Cassandra Deep Dive for the Philly JUG
Transcript
Page 1: Big data philly_jug

1• 800.593.4467 • info@healthmarketscience.com

The Big Data Quadfecta

Brian O’Neill
Lead Architect, Health Market Science
@boneill42, [email protected]

Page 2: Big data philly_jug

Quadfecta?

1. Quadfecta

• A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.

• Kafka
• Storm
• Elastic Search
• Cassandra

http://www.flickr.com/photos/yogma/3584984540/

http://www.urbandictionary.com/define.php?term=quadfecta

Page 3: Big data philly_jug

Hold on Tight

http://www.flickr.com/photos/aspexdesign/7817329758/

Page 4: Big data philly_jug

3 V’s

Volume Variety Velocity

http://www.flickr.com/photos/20989942@N00/373985217/

http://www.flickr.com/photos/rhruzek/4071408305/

http://www.flickr.com/photos/adriansalgado/5310969147/

Page 5: Big data philly_jug

The Use Case

Page 6: Big data philly_jug

Our Mission

Prescriber eligibility and remediation

Eliminate fraud, waste and abuse

Insights into the healthcare space

Page 7: Big data philly_jug

The Business

Business Solutions

Health Care Provider & Facilities

Variety/Velocity
• >2000 sources
• 6 million unique HCPs
• 10+ years history

Data Challenges
• Constant change in real world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality

Master Data Solutions

Medical Procedures & Diagnosis

Volume/Velocity
• ~1B claims annually
• +5B records annually
• 5+ years history

Data Challenges
• Sources have incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships

Medical Claims Data

CompleteView, Expense Manager, CompleteSpend

Prescriber Eligibility/Remediation

Analytics (Influencer Networks)

Page 8: Big data philly_jug

Our Solutions

Business Needs: Finance & Legal, Business Systems, Compliance, Sales & Marketing

Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence

HMS Authoritative Sources: PDC, Federal, State, Medical Claims, Web Derived

Advanced Technology: Master Data Management

Page 9: Big data philly_jug

Datacenter

Hundreds of Machines

1.5 Petabytes of raw storage

Virtualized (VMware)

On a SAN

Should we go physical???

Page 10: Big data philly_jug

Under the Hood

Visualization

Dashboard / Reports

Structured Storage

Relational

Indexing

Flexible Storage

NoSQL Graph(s)

Interfacing

Web Services

Distributed Processing

Standardize

Validate

Match

Consolidate

Analytics

Data Sources

Government

Web

Customer

I’m happy

User Interface

Page 11: Big data philly_jug

Master Data Management

Harvested

Government

Private

Schema Change!

Page 12: Big data philly_jug

The Design

Page 13: Big data philly_jug

System of Record

Flexibility (Variety)

Scalability (Velocity + Volume)

Page 14: Big data philly_jug

Deep Dive

www.history.navy.mil/museums/seabee_museum.htm

Page 15: Big data philly_jug

Installation

As easy as…

Download
http://cassandra.apache.org/download/

Uncompress
tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz

Run
bin/cassandra -f

(-f keeps the process in the foreground)

Page 16: Big data philly_jug

Data Model

Schema (a.k.a. Keyspace)

Table (a.k.a. Column Family)

Row
• Has an arbitrary number of columns
• Validator for keys (e.g. UTF8Type)

Column
• Validator for values and keys
• Comparator for keys (e.g. DateType or BYOC)

(http://www.youtube.com/watch?v=bKfND4woylw)

Page 17: Big data philly_jug

Distributed Architecture

Nodes form a token ring.

Nodes partition the ring by initial token.
initial_token: (in cassandra.yaml)

Partitioners map row keys to tokens.
Usually randomly, to evenly distribute the data.

All columns for a row are stored together on disk in sorted order.

Page 18: Big data philly_jug

Visually

A(67-0)

B(1-33)

C(34-66)

Row Hash

Alice 50

Bob 3

Eve 15

Token/Hash Range : 0-99
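
The ring on this slide can be sketched in a few lines of Java. This is only a toy model of the 0-99 token space shown above, not Cassandra's partitioner: the class name `TokenRing` and its methods are made up for illustration, and the hashes for Alice, Bob and Eve are taken from the slide rather than computed.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy token ring over the slide's 0-99 token space (hypothetical class, not Cassandra code).
class TokenRing {
    // Maps the upper bound of each node's token range to the node name.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String name, int upperToken) {
        ring.put(upperToken, name);
    }

    // The owner is the first node whose upper bound is >= the token.
    // Tokens above the highest bound wrap around to the first node.
    String ownerOf(int token) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Builds the three-node ring from the slide.
    static TokenRing slideRing() {
        TokenRing r = new TokenRing();
        r.addNode("A", 0);   // A owns 67-99 and wraps around to 0
        r.addNode("B", 33);  // B owns 1-33
        r.addNode("C", 66);  // C owns 34-66
        return r;
    }

    public static void main(String[] args) {
        TokenRing r = slideRing();
        System.out.println("Alice (50) -> " + r.ownerOf(50));
        System.out.println("Bob (3)    -> " + r.ownerOf(3));
        System.out.println("Eve (15)   -> " + r.ownerOf(15));
    }
}
```

With the slide's hashes, Alice lands on C while Bob and Eve both land on B, which hints at how an uneven key distribution can leave a node idle.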

Page 19: Big data philly_jug

Java Interpretation

Each table is a distributed HashMap.
Each row is a SortedMap.
Each column is an entry in the SortedMap.

Cassandra provides a massively scalable version of:
HashMap<rowKey, SortedMap<columnKey, columnValue>>

Implications:
• Direct row fetch is fast.
• Searching a range of rows can be costly.
• Searching a range of columns is cheap.
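
As a rough single-JVM analogy for the points above, the sketch below builds the same nested-map shape with plain collections. `ToyTable`, the row key `sensor-1` and the date-like column keys are all hypothetical; nothing here talks to Cassandra.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory analogy for one table: HashMap<rowKey, SortedMap<columnKey, columnValue>>.
class ToyTable {
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Direct row fetch: a single hash lookup on the row key.
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Column range inside one row: columns are already sorted, so this is a
    // view over a contiguous range. Bounds are [from, to) as in SortedMap.subMap.
    SortedMap<String, String> columnSlice(String rowKey, String from, String to) {
        return row(rowKey).subMap(from, to);
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.put("sensor-1", "2012-12-01", "a");
        table.put("sensor-1", "2012-12-02", "b");
        table.put("sensor-1", "2012-12-03", "c");
        System.out.println(table.columnSlice("sensor-1", "2012-12-02", "2012-12-04"));
    }
}
```

A range over row keys has no equivalent cheap operation here, because the outer map is hashed rather than sorted. That mirrors the "range of rows can be costly" implication.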

Page 20: Big data philly_jug

The World-Wide Globally Scalable Naughty List!

How about a Naughty and Nice list for Santa?

1.9 billion children
That will fit in a single row!

Queries to support:
• Children can log in and check their standing.
• Santa can find nice children by country, state or zip.
• Toy lists for every child in the world.

Page 21: Big data philly_jug

Two Tables

Children Table
• Stores all the children in the world.
• One row per child.
• One column per attribute.

NaughtyOrNice Table
• Supports the queries we anticipate
• Wide-Row Strategy

Page 22: Big data philly_jug

Details of the NaughtyOrNice List

One row per standing:country
Ensures all children in a country are grouped together on disk.

One column per child using a compound key
Ensures the columns are sorted to support our search at varying levels of granularity

e.g. All nice children in the US.
e.g. All naughty children in PA.

Page 23: Big data philly_jug

Visually

Node 1, Node 2, Node 3

Nice:USA
  CA:94333:johny.b.good
  CA:94333:richie.rich

Nice:IRL
  D:EI33:collin.oneill
  D:EI33:owen.oneill

Naughty:USA
  CA:94111:bart.simpson
  CA:94222:dennis.menace
  PA:18964:michael.myers

Watch out for:
• Hot spotting
• Unbalanced Clusters

(1) Go to the row.
(2) Get the column slice.
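
The two-step lookup can be mimicked in memory with a sorted map per row. This is a sketch under stated assumptions, not a Cassandra client: `NaughtyOrNice`, `add` and `find` are invented names, the prefix trick stands in for a real column slice, and the sample keys come from the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Wide-row sketch: row key "standing:country", sorted columns "state:zip:childId".
class NaughtyOrNice {
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    void add(String standing, String country, String state, String zip, String childId) {
        rows.computeIfAbsent(standing + ":" + country, k -> new TreeMap<>())
            .put(state + ":" + zip + ":" + childId, "");
    }

    // (1) Go to the row. (2) Get the slice of columns whose key starts with the prefix.
    List<String> find(String standing, String country, String prefix) {
        List<String> result = new ArrayList<>();
        TreeMap<String, String> row = rows.get(standing + ":" + country);
        if (row == null) {
            return result;
        }
        // '\uffff' sorts after every character used in these keys, so it closes the range.
        result.addAll(row.subMap(prefix, prefix + '\uffff').keySet());
        return result;
    }

    public static void main(String[] args) {
        NaughtyOrNice list = new NaughtyOrNice();
        list.add("Nice", "USA", "CA", "94333", "johny.b.good");
        list.add("Nice", "USA", "CA", "94333", "richie.rich");
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        System.out.println("Nice in the US: " + list.find("Nice", "USA", ""));
        System.out.println("Naughty in PA:  " + list.find("Naughty", "USA", "PA:"));
    }
}
```

Because state comes first in the compound key, a longer prefix such as `CA:94111:` narrows the same slice to one zip code without any extra index.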

Page 24: Big data philly_jug

What about the toys?

No problem. We’re in a NoSQL store. =)
Let’s just add a column.

Page 25: Big data philly_jug

CQL Collections!

http://www.datastax.com/dev/blog/cql3_collections

Set
UPDATE users SET emails = emails + {'[email protected]'} WHERE user_id = 'frodo';

List
UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';

Maps
UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';

Page 26: Big data philly_jug

Let’s Crank a Bit...

Page 27: Big data philly_jug

Let’s code!
What API should we use?

API             Production-Readiness   Potential   Momentum
Thrift          10                     -1          -1
Hector          10                     8           8
Astyanax        8                      9           10
Kundera (JPA)   6                      9           9
Pelops          7                      6           7
Firebrand       8                      9           8
PlayORM         5                      8           7
GORA            6                      9           7
CQL Driver      8                      10          10

IMHO!

Astyanax + CQL FTW!

Page 28: Big data philly_jug

Coming up for air...

http://www.flickr.com/photos/64738468@N00/7184463727/

Page 29: Big data philly_jug

But continuing at warp speed...

http://www.flickr.com/photos/19942094@N00/4937185452/lightbox/

Page 30: Big data philly_jug

Primitives of Distributed Processing

emit/process(tuple(…))

map<key<map<[], value>>

pop(push(v))

index(field, type)

Kafka

Page 31: Big data philly_jug

What we did wrong…

Could not react to transactional changes

Needed extra logic to track what changed

Took too long

Page 32: Big data philly_jug

What we did wrong… (II)

AOP-based triggers
• Worked well initially.
• Business processes captured as side-effects.

Page 33: Big data philly_jug

Design Principles

Patterns
• Idempotent Operations
  Elegantly handle replay
• Immutable data
  Assertions of facts over time

Anti-Patterns
• Transactions / Locking
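
One way to picture these patterns together is a store that only ever records timestamped facts, so applying the same fact twice changes nothing. The sketch below is an illustration under that assumption, not the design used in the talk: `FactStore`, `Fact` and the sample prescriber data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Facts are keyed by (key, timestamp), so re-applying one is a harmless overwrite.
class FactStore {
    // An immutable assertion: "at this time, this key had this value".
    static final class Fact {
        final String key;
        final long timestamp;
        final String value;

        Fact(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    private final Map<String, TreeMap<Long, String>> history = new HashMap<>();

    // Idempotent: the same fact always writes the same value to the same slot.
    void apply(Fact fact) {
        history.computeIfAbsent(fact.key, k -> new TreeMap<>()).put(fact.timestamp, fact.value);
    }

    // Replaying a log, in full or in part, needs no locking or change tracking.
    void replay(List<Fact> log) {
        for (Fact fact : log) {
            apply(fact);
        }
    }

    // The current state is the fact with the highest timestamp, or null if none.
    String latest(String key) {
        TreeMap<Long, String> facts = history.get(key);
        return facts == null ? null : facts.lastEntry().getValue();
    }

    int factCount() {
        int count = 0;
        for (TreeMap<Long, String> facts : history.values()) {
            count += facts.size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<Fact> log = Arrays.asList(
            new Fact("prescriber-1", 1L, "eligible"),
            new Fact("prescriber-1", 2L, "suspended"));
        FactStore store = new FactStore();
        store.replay(log);
        store.replay(log); // a replay after a failure leaves the state unchanged
        System.out.println(store.factCount() + " facts, latest = " + store.latest("prescriber-1"));
    }
}
```

Nothing is updated in place, so the full history stays available and out-of-order delivery still converges to the same state.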

Page 34: Big data philly_jug

What we did right.

REST APIs for Loose Coupling

See Virgil: https://github.com/hmsonline/virgil

But really… watch out for Intravert: https://github.com/zznate/intravert-ug

Page 35: Big data philly_jug

Kafka
• Millions of Messages
• Replay Enabled
• No transactions / Lightning Fast

Page 36: Big data philly_jug

Elastic Search
• Edit Distance / Soundex
• Native Scalability
• Fuzzy Search
• Geospatial
• Facets

Page 37: Big data philly_jug

Storm
• Guaranteed once semantics
• Well-designed processing abstraction
• Beats BYODP
• Momentum

Page 38: Big data philly_jug

The System

[Diagram labels: REST API, Kafka Queue(s), Offset, A, B, C, C*, ElasticSearch (ES1, ES2)]

NP. We can route around it.

NP. Replication Factor > 1.

NP. Rewind!

Page 39: Big data philly_jug

Next Steps

Page 40: Big data philly_jug

Shameless Shoutouts

HMS (https://github.com/hmsonline/)
• storm-cassandra
• storm-elastic-search
• storm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz)
• storm-jms
• storm-signals

Page 41: Big data philly_jug

The Team

We’re hiring!

