
Joe Caserta President Elliott Cordo Chief Architect September 30, 2015, Javits Center, New York City...

Date post: 17-Jan-2016
Transcript

Joe Caserta, President

Elliott Cordo, Chief Architect

September 30, 2015, Javits Center, New York City

Building a Data Lake for Digital Music Dominance

Big Data Strategy

Innovation

Technical Implementation

Awards and Recognition

The Music Maze

Build a Dynamic Platform – Paradigm Shift

OLD WAY:
• Structure → Ingest → Analyze
• Fixed Capacity
• Monolith

NEW WAY:
• Ingest → Analyze → Structure
• Dynamic Capacity
• Ecosystem

RECIPE:
• Cloud
• Data Lake
• Polyglot Warehouse

Move to the Cloud

Existing On-Premise Solution
• Challenges with operations of Hadoop servers in the data center
• Increasing infrastructure complexity
• Keeping up with data growth

Cloud Advantages
• Reduced upfront capital investment
• Faster speed to value
• Elasticity

“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS

Cost savings of dynamic capacity
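
As a rough illustration of the cost argument, the sketch below compares a fleet sized for peak load with one that is resized every hour. The demand profile and the price per node-hour are made-up numbers chosen only to show the arithmetic; they are not figures from the talk.

```python
def fixed_capacity_cost(hourly_demand, price_per_node_hour):
    """Cost when the fleet is sized for the peak hour and never shrinks."""
    return max(hourly_demand) * len(hourly_demand) * price_per_node_hour


def dynamic_capacity_cost(hourly_demand, price_per_node_hour):
    """Cost when the fleet is resized every hour to match demand."""
    return sum(hourly_demand) * price_per_node_hour


# Hypothetical day: 2 nodes for 20 hours, then a 4-hour batch window at 20 nodes.
demand = [2] * 20 + [20] * 4
price = 0.50  # hypothetical price per node-hour

fixed = fixed_capacity_cost(demand, price)
dynamic = dynamic_capacity_cost(demand, price)
print(f"fixed: {fixed:.2f}  dynamic: {dynamic:.2f}")
```

The gap between the two totals grows with how spiky the workload is; a flat workload would show no difference.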

Elasticity not only saves money

Essentially, Servers Suck

But more importantly, think infrastructure as code
• Your servers should be API calls
• Use stateless processes
• Make all resources ephemeral
• Make everything scalable and elastic!

Ephemeral?

Disposable:
• Processing Fleets
• Elastic MapReduce Clusters
• Redshift Clusters

Use distributed services and systems to maintain state and preserve your data:
• Cassandra, Dynamo
• S3

Anatomy of our Processing Fleet

[Diagram labels: S3 Input Buckets; Auto-scaling; Queuing service; S3 Output Buckets]
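
One way to picture the processing fleet is a stateless worker that pulls object keys from a queue, reads from an input bucket, and writes to an output bucket, with fleet size derived from the queue backlog. The sketch below is an assumption-laden stand-in, not the presenters' implementation: plain dictionaries play the role of S3 buckets, `queue.Queue` plays the role of the queuing service, and `desired_workers`, `process`, `run_worker`, and the scaling thresholds are hypothetical names and values.

```python
import math
import queue


def desired_workers(queue_depth, msgs_per_worker=10, min_workers=0, max_workers=50):
    """Derive fleet size from the backlog: a deeper queue asks for more workers."""
    wanted = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def process(raw):
    """Placeholder transformation; real refinement logic would go here."""
    return raw.strip().upper()


def run_worker(work_queue, input_bucket, output_bucket):
    """Stateless worker: all state lives in the queue and the two buckets."""
    handled = 0
    while True:
        try:
            key = work_queue.get_nowait()
        except queue.Empty:
            return handled  # nothing left to do, so the worker can be disposed of
        output_bucket[key] = process(input_bucket[key])
        handled += 1


# In-memory stand-ins for the input bucket, output bucket, and queue.
input_bucket = {
    "plays/2015-09-30/a.csv": " track1,3 ",
    "plays/2015-09-30/b.csv": "track2,5",
}
output_bucket = {}
work_queue = queue.Queue()
for key in input_bucket:
    work_queue.put(key)

handled = run_worker(work_queue, input_bucket, output_bucket)
print(handled, sorted(output_bucket))
```

Because the worker keeps nothing between messages, any instance can be terminated or added at any time without losing data.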

Elastic MapReduce

Hadoop on Demand
• No operations: if your cluster dies, so what
• Bootstrap whatever processing engine makes sense
• Programmatically estimate instance type and cluster size
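
The last bullet can be sketched as a small sizing function. Everything here is an assumption made for illustration: the `CATALOG` entries are hypothetical labels rather than real instance types, the throughput figures are invented, and the rule (pick a larger type above a size threshold, then divide the work by per-node throughput and the deadline) is just one plausible heuristic.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str                # hypothetical label, not a real instance type
    gb_per_node_hour: float  # assumed throughput of a single node


# Hypothetical catalog with invented throughput numbers.
CATALOG = {
    "small": InstanceType("small", 5.0),
    "large": InstanceType("large", 25.0),
}


def estimate_cluster(input_gb, deadline_hours, large_threshold_gb=500.0, max_nodes=100):
    """Pick an instance type and node count so the job fits the deadline."""
    if input_gb < 0 or deadline_hours <= 0:
        raise ValueError("input_gb must be >= 0 and deadline_hours must be > 0")
    itype = CATALOG["large"] if input_gb >= large_threshold_gb else CATALOG["small"]
    nodes = math.ceil(input_gb / (itype.gb_per_node_hour * deadline_hours))
    # Always request at least one node and never exceed the configured cap.
    return itype.name, max(1, min(max_nodes, nodes))


print(estimate_cluster(40, 2))
print(estimate_cluster(1000, 4))
```

The returned pair would then be passed to whatever call launches the cluster, so sizing is decided per job instead of being fixed in advance.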

You May Need Some Persistent Servers

If at all possible they should be inherently scalable, distributed, and elastic

Move to a Data Lake Paradigm

Technology:
• Scalable distributed storage: S3
• Pluggable fit-for-purpose processing: EMR

Functional Capabilities:
• Remove barriers from data ingestion and analysis
• Storage and processing for all data
• Tunable Governance

Putting it together: The Big Data Pyramid

Pyramid tiers:
• Big Data Warehouse
• Data Science Workspace
• Data Lake – Integrated Sandbox
• Landing Area – Source Data in “Full Fidelity”

Usage Pattern and Data Governance labels:
• Ingest Raw Data
• Organize, Define, Complete
• Munging, Blending
• Machine Learning
• Data Quality and Monitoring
• Metadata, ILM, Security
• Data Catalog
• Data Integration
• Fully Governed (trusted)
• Arbitrary/Ad-hoc Queries and Reporting

Data Ingestion and Onboarding

• Incoming to S3:
  – Lightweight API wrapper
  – Web front end
  – Direct writes to S3

• Ingest the data in a reasonable partitioning scheme: Bucket and Keys

• Turn analysts and data scientists loose: late-binding analytics
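
A partitioning scheme for keys might look like the sketch below, which builds a date-partitioned object key so that one day of one dataset can be listed by prefix. The layout (`source/dataset/year=/month=/day=`) and the function name `build_object_key` are assumptions for illustration; the talk does not specify the actual scheme.

```python
from datetime import datetime, timezone


def build_object_key(source, dataset, event_time, filename):
    """Return a date-partitioned key; event_time must be timezone-aware."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    ts = event_time.astimezone(timezone.utc)  # normalize so partitions are in UTC
    return (
        f"{source}/{dataset}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{filename}"
    )


# Hypothetical source, dataset, and file names.
key = build_object_key(
    "streaming_service",
    "plays",
    datetime(2015, 9, 30, 14, 5, tzinfo=timezone.utc),
    "part-0001.json",
)
print(key)
```

Keeping the partition values in the key itself means downstream jobs can select a date range by prefix without opening any files.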

But we need to feed the cash register

• Data needs to be refined and mapped:
  – Processing Fleet
  – EMR

• 80/20 rule: metadata driven when possible
• Abstract away “Big Data”
• And make sure it’s right!
  – Automated data quality checks using HAMBOT, soon to be open sourced
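
To make "metadata driven" data quality checks concrete, here is a minimal sketch in which the rules are data and a single generic function applies them. This is not HAMBOT and does not reflect its design; the rule format, check names, and sample rows are all hypothetical.

```python
# Rules expressed as metadata: adding a check means adding a dict, not code.
RULES = [
    {"column": "track_id", "check": "not_null"},
    {"column": "play_count", "check": "min", "value": 0},
    {"column": "country", "check": "allowed", "values": {"US", "GB", "DE"}},
]


def _passes(rule, value):
    """Evaluate one rule against one field value."""
    check = rule["check"]
    if check == "not_null":
        return value is not None and value != ""
    if check == "min":
        return value is not None and value >= rule["value"]
    if check == "allowed":
        return value in rule["values"]
    raise ValueError(f"unknown check: {check}")


def run_checks(rows, rules):
    """Return (row_index, column, check) for every rule a row fails."""
    failures = []
    for i, row in enumerate(rows):
        for rule in rules:
            if not _passes(rule, row.get(rule["column"])):
                failures.append((i, rule["column"], rule["check"]))
    return failures


# Hypothetical sample rows: the first is clean, the second fails every rule.
rows = [
    {"track_id": "t1", "play_count": 3, "country": "US"},
    {"track_id": "", "play_count": -1, "country": "FR"},
]
failures = run_checks(rows, RULES)
print(failures)
```

A pipeline step could fail the load, or route rows to a quarantine area, whenever the returned list is non-empty.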

“…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler

Think Data Ecosystem, Not Tech Stack

Polyglot in Practice

Best practices from traditional EDW
• Consolidation
• Data Governance
• Master Data
• Tuned for analytics

Applied to:
• Fit-for-purpose technologies and approaches
• Relational, MPP, Graph, KV, Time-series DB, Data Lake
• Apply “tunable governance” and traditional principles

Use the right tool for the job

The Landscape for Digital Dominance

[Architecture diagram labels: Landing Queue; Data Lake; BDW; Data Science; API; Data Providers; Near Real-time; Batch; Data Science Clusters; EDW; Graph; RDS Metastore]

Joe Caserta, President, Caserta [email protected] @joe_Caserta

Elliott Cordo, Chief Architect, Caserta [email protected]

• Award-winning company
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture

• Innovation Partner
• Strategic Consulting
• Advanced Technical Design
• Build & Deploy Solutions

• BDW Meetup
• New York City
• 3,000+ members
• Knowledge sharing

Data is not important, it’s what you do with it that’s important!

Thank You

