Develop a Custom Data Solution Architecture with NorthBay


“Teaching Old Data New Tricks™”

Brian Barker • CEO • NorthBay Solutions

John Puopolo • SVP • Engineering • Eliza Corporation

Ali Khan • Director, Business Intelligence and Analytics • Scholastic

Sai Reddy Thangirala • Solutions Architect • Amazon Web Services

Agenda

• Big Data on AWS
• NorthBay
• Eliza Corporation Case Study
  • Challenges Eliza Faced
  • Strategic Goals
  • Why a Data Lake Approach Was Chosen
  • Outcomes & Benefits Eliza Achieved
• Scholastic Case Study
  • Challenges
  • Goals
  • The AWS/NorthBay Decision
  • How the Initiative Unfolded
  • Key Learnings

Data is Growing

1.7 MB of new data will be created every second for every human being on the planet by 2020.
http://www.whizpr.be/upload/medialab/21/company/Media_Presentation_2012_DigiUniverseFINAL1.pdf

A compound annual growth rate of 58%, surpassing $1 billion by 2020, is forecasted for the Hadoop market.
http://www.ap-institute.com/big-data-articles/big-data-what-is-hadoop-%E2%80%93-an-explanation-for-absolutely-anyone.aspx
http://www.marketanalysis.com/?p=279

Less than 0.5% of all data is ever analyzed and used at the moment.
http://www.technologyreview.com/news/514346/the-data-made-me-do-it/

Big Data Is for Everyone

The market for Big Data technologies is growing more than six times faster than the information technology market as a whole….

…and the companies that use their data well win.

Why AWS for Big Data?

Immediately Available

Broad and Deep Capabilities

Trusted and Secure

Scalable

AWS Provides the Most Complete Platform for Big Data

It's easy to get data to AWS, store it securely, and analyze it with the engine of your choice, without any long-term commitment or vendor lock-in.

Collect: Import/Export, Snowball, Direct Connect, VM Import/Export

Store: Amazon S3, EMR, Amazon Glacier, Amazon Redshift, DynamoDB, Aurora

Analyze: Amazon Kinesis, Lambda, EMR, EC2

What Can You Do With Big Data on AWS?

• Big Data Repositories
• Clickstream Analysis
• ETL Offload
• Machine Learning
• Online Ad Serving
• BI Applications

“Teaching Old Data New Tricks™” with NorthBay

“Teaching Old Data New Tricks™”

Untapped wealth: companies gain tremendous leverage when "Teaching Old Data New Tricks™"

• So what does that mean?
• You'll hear two exciting customer examples/use cases presented today

Customer Examples/Use Cases

• Building a HIPAA-compliant Data Lake
• Re-tooling old on-premises technology on the fly

Scholastic Preview of Coming Attractions

• How did an old-school, $1.5B, 100-year-old company re-invent its IBM- and Microsoft-based big data & analytics system on the fly?
• What was their starting point?
• What factors did they consider when making their decision?
• What did they decide on for technology and partners, and why?
• How did they implement?
• What were the results?
• Lessons learned?

AWS & NorthBay Background

Global Provider of Big Data Solutions

250+ Full-time professionals

145+ Clients

200+ Solutions launched

Conceptual Data Lake Architecture

Eliza Preview of Coming Attractions

• How does a high flying Healthcare services company re-platform its Enterprise Data Platform while processing millions of 'interactions' every day.

• Why the need to change?• What strategic goals had to be achieved?• What is so tough about "named value pairs" • Why a Data Lake and why NorthBay?• Which AWS services were chosen to leverage?• What did they decide on for technology and partners and why?• How did it turn out?• What did they learn?

Eliza Corporation
John Puopolo, SVP, Engineering, Eliza Corporation

About Eliza Corporation

• Founded 2000
• Leader in Health Engagement Management (HEM) outreach services
• Hundreds of millions of outreaches for intensive operation and analytics processing
• High-volume, semi-structured data; complex business flow of data
• Variety of analytics/consumption needs, ranging from a customer portal to ML workloads

Challenges Eliza Faced

Eliza Corporation analyzes more than 300 million interactions per year.

Outreach questions and responses form a decision tree, and each question and response are captured as a pair, e.g., <question, response> = <"Did you visit your physician in the last 30 days?", "Yes"> (see the sketch below).

Diverse downstream consumption requirements

Challenging to process and analyze data
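
The name-value shape of this data is easy to see in code. Below is a minimal Python sketch of flattening one interaction's decision-tree pairs into rows for downstream analysis; the record layout and field names are hypothetical, not Eliza's actual schema.

```python
# Minimal sketch: flattening outreach interactions into <question, response>
# rows. The dict layout and field names are hypothetical illustrations.
from typing import Iterator

def flatten_interaction(interaction: dict) -> Iterator[tuple]:
    """Yield one (interaction_id, question, response) row per captured pair."""
    for pair in interaction.get("pairs", []):
        yield (interaction["id"], pair["question"], pair["response"])

interaction = {
    "id": "int-001",
    "pairs": [
        {"question": "Did you visit your physician in the last 30 days?",
         "response": "Yes"},
    ],
}

for row in flatten_interaction(interaction):
    print(row)  # ('int-001', 'Did you visit your physician...', 'Yes')
```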

Strategic Goals

Create next generation data architecture

Decouple Storage and Compute

Ability to process old & new data streams

Achieve HIPAA compliance

Ingest & store original datasets

Allow both real-time & batch processing

Enable access through entitlements and governance

Increase self-service for end-users

Conceptual Data Lake Architecture

[Architecture diagram: streaming and batch data sources flow through ingestion into data lake storage and a data lake archive. Cross-cutting layers provide monitoring, auditing, management, and alerting; data system analytics (lineage, profiling); catalog & search & data discovery; an API & UI; entitlements & authorizations; and data quality & governance. An ETL layer feeds the EDW. Consumption & analytics spans real-time analytics, BI tools and a BI UI, shared-services Hadoop, business units, and business-unit-dedicated Hadoop/SAS.]
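
To make the storage and catalog layers concrete, here is a hedged sketch of the ingest-and-catalog pattern the diagram implies: the original dataset lands unmodified in a raw S3 zone and is registered in a catalog table so it can be discovered. The bucket and table names are hypothetical; the talk does not name Eliza's actual resources.

```python
# Minimal sketch of ingest-and-catalog: store the unmodified original
# dataset in a raw S3 zone, then register it for catalog & search.
# Bucket and table names are hypothetical.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
catalog = boto3.resource("dynamodb").Table("data-lake-catalog")

def ingest(bucket: str, key: str, payload: bytes, source: str) -> None:
    # Land the original dataset, untouched, in the raw zone.
    s3.put_object(Bucket=bucket, Key=f"raw/{source}/{key}", Body=payload)
    # Register the object so downstream users can discover it.
    catalog.put_item(Item={
        "dataset_key": f"raw/{source}/{key}",
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })

ingest("example-data-lake", "2017-04-14/outreach.json",
       b'{"id": "int-001"}', source="outreach")
```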

Benefits of the New Enterprise Data Platform Architecture

• Hub & spoke model for one original copy of all enterprise analytics data

• Quality layer for consistent transformations and cleansing of data• Governance layer for entitlements and security management • Enable multiple consumption patterns called projections• A purpose-designed schema for an Enterprise Data Warehouse

(Redshift) for efficient reporting of known queries • Streamline and automated ingestion of source batch and streaming

data reducing human/manual touch points

Technical Architecture

Major AWS Services Used

• Aurora
• Amazon Kinesis Streams
• Amazon Redshift
• DynamoDB
• Hive, Presto, and Spark on EMR
• CloudSearch, EC2
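
As an illustration of the real-time ingestion path, the sketch below pushes one interaction event onto a Kinesis stream with boto3. The stream name and record format are assumptions for illustration, not details given in the talk.

```python
# Hedged sketch: publishing an interaction event to Kinesis Streams.
# The stream name and event shape are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"interaction_id": "int-001",
         "question": "Did you visit your physician in the last 30 days?",
         "response": "Yes"}

kinesis.put_record(
    StreamName="interaction-events",          # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["interaction_id"],     # spreads records across shards
)
```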

Benefits of a New Enterprise Data Platform

• Streamlined data load process by enabling schema-on-read (illustrated below)

• Improved business agility by allowing schema-on-read

• Improved ability to manage costs by separating storage and compute

• Resources that scale on demand

• Reduced end-to-end client analytics time
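
Schema-on-read is the key enabler in the first two bullets: the raw data stays in S3 as-is, and a schema is applied only at query time. A minimal PySpark sketch, assuming a hypothetical S3 path and the flattened record layout from earlier:

```python
# Illustrative schema-on-read with Spark on EMR: no upfront DDL or load
# step; Spark infers the schema when the JSON is read. Path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw files in S3 are untouched; the schema exists only at read time.
interactions = spark.read.json("s3://example-data-lake/raw/outreach/")

interactions.filter(
    interactions.question == "Did you visit your physician in the last 30 days?"
).groupBy("response").count().show()
```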

Key Learnings

• The nature of our data is name-value; we were doing too many transformations due to our original storage formats.

• Using mini-PoCs to form hypotheses and prove/disprove them led to an emergent architecture, which pointed us toward a data lake.

• A data lake architecture fits our core business and growth plans extremely well.

Scholastic
Ali Khan, Director, Business Intelligence and Analytics, Scholastic

About Scholastic

• $1.6B in annual revenue; the world's largest publisher and distributor of children's books
• #1 website for U.S. elementary school teachers
• 8,400+ employees globally
• 165+ countries, 45+ languages
• A leader in comprehensive educational solutions

Existing Platform & Challenges

• We taught old data new tricks

• IBM AS/400 was the primary data warehouse platform, supplemented by Microsoft SQL Server to enable business intelligence

• 5,500+ AS/400 workloads, 350+ SQL Server workloads

• Inflexible architecture; slow time to market

• Unable to meet internal SLAs due to the performance of daily ETL processes

• Scalability limitations with SQL Server Analysis Services (SSAS) for dashboards/reports

• Limited ability to perform self-service business intelligence


Project Goals

Improve performance, scalability, availability, logging, security

Enable self-service business intelligence

Integrate with existing technology stack

Align with the tech strategy (DevOps model, Cloud First)

Leverage the skill set of current team (SQL/relational)

Team up with an experienced partner

The Decision

• AWS was chosen for its agility, scalability, elasticity, security, and alignment with corporate strategy

• Redshift was chosen to replace the AS/400 and SQL Server as a relational-style, high-performance data store

• NorthBay was chosen for its expertise in Big Data and Amazon Redshift migrations


Pilot Plans

Migrate a functional area in a key business unit during a 3-month pilot

Demonstrate immediate business value

Stand up the AWS environment to allow IT to gain competence with AWS

Pilot Outcomes

• Create the core framework for migration

• Implement the ELT architecture and perform validation (see the sketch below)

• Establish visualization/self-service capability through Tableau
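
One simple way the validation step can work (a sketch, not Scholastic's actual implementation): compare the row count landed in a Redshift staging table against the count recorded during extraction. All connection details and table names below are hypothetical.

```python
# Hedged validation sketch: does the Redshift staging row count match
# the count recorded by the extraction job? All names are placeholders.
import psycopg2

expected_rows = 1_204_993  # e.g., recorded by the extraction job

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="etl_user", password="...")

with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM staging.orders;")
    actual_rows = cur.fetchone()[0]

if actual_rows != expected_rows:
    raise RuntimeError(
        f"Validation failed: expected {expected_rows} rows, got {actual_rows}")
```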

Technical Architecture

[Architecture diagram: AS/400 / DB2 (source DB) → EMR cluster running a Sqoop script → S3 output bucket → EC2 instance running the COPY command → Redshift (staging) → Redshift (data warehouse) → Tableau (reporting tool). AWS Data Pipeline orchestrates the flow, reading pipeline configurations from DynamoDB; an SNS topic publishes pipeline status and failure notifications via email, and a Lambda function saves pipeline statistics to an RDS MySQL instance.]
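
The "EC2 instance running Copy Command" step can be pictured as a short script that issues a Redshift COPY against the Sqoop output bucket. This is a hedged sketch; the bucket, table, IAM role, and file-format options are placeholders, not Scholastic's actual values.

```python
# Sketch of the COPY step: load the Sqoop extract files from S3 into a
# Redshift staging table. All identifiers below are hypothetical.
import psycopg2

COPY_SQL = """
    COPY staging.orders
    FROM 's3://example-output-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '|' GZIP;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="etl_user", password="...")

with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)  # Redshift loads the S3 files in parallel
```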

Core Framework

• Jobs and job groups are defined as metadata in DynamoDB (see the sketch below)
• Control-M Scheduler, a custom application, and Data Pipeline for orchestration
• ELT process: EMR/Sqoop for extraction, with Redshift loading and transforming the data through SQL scripts
• The core framework allows for:
  • Restart capability from the point of failure
  • Capturing of operational statistics (e.g., # of rows updated)
  • Audit capability (which feed caused the fact to change)

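A hedged sketch of what metadata-driven jobs in DynamoDB can look like: each job is an item the orchestration layer reads to decide what to run, where to restart after a failure, and what operational stats to record. Table and attribute names are hypothetical, not Scholastic's schema.

```python
# Hypothetical job-metadata item: the orchestration layer reads this to
# drive extraction/load/transform, restarts, and operational stats.
import boto3

jobs = boto3.resource("dynamodb").Table("etl-job-metadata")

# Define a job: which source table to extract and which Redshift
# transform script to run after the load.
jobs.put_item(Item={
    "job_id": "orders-daily",
    "job_group": "sales",
    "source_table": "DB2.ORDERS",
    "transform_script": "sql/transform_orders.sql",
    "last_successful_step": "load",   # enables restart from point of failure
    "rows_loaded": 1204993,           # operational stats for auditing
})

job = jobs.get_item(Key={"job_id": "orders-daily"})["Item"]
print(job["last_successful_step"])
```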

Data Visualization Through Tableau

• Business users have access to facts/dimensions for standard reports through Tableau

• Power users have access to staging tables for ad-hoc queries through Tableau

• Data scientists have access to files in S3 (all extracts, serving as a data archive) using Hive and/or Presto (see the sketch below)

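For the data-science path, queries can go straight at the archived S3 extracts through Presto on EMR. Below is an illustrative sketch using PyHive; the host and table names are assumptions, and PyHive is one common client, not necessarily what Scholastic used.

```python
# Illustrative Presto query over the S3 data archive via PyHive.
# Host, schema, and table names are hypothetical.
from pyhive import presto

conn = presto.connect(host="emr-master.example.internal", port=8889)
cur = conn.cursor()
cur.execute("""
    SELECT order_year, COUNT(*) AS orders
    FROM archive.orders_extracts
    GROUP BY order_year
""")
for row in cur.fetchall():
    print(row)
```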

Accelerating the Program Timeline


• The CTO moved budget forward to:
  • Reduce the project timeline by 50%
  • Eliminate the overhead of running two platforms

• Parallel work streams (swim lanes) utilized the same core framework for migrating data for other business units

• NorthBay partnered with each of those work streams to accelerate migration

• Users wanted to be on the new platform sooner

Lessons Learned - Technology

Isolate the core framework in its own repository, separate from project-specific code repositories

Make appropriate schema changes when migrating to the new platform

Customize the framework for gathering operational stats (e.g., # of rows loaded)

Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project

Lessons Learned – Program Execution

Creating new data platforms and migrating data into them is easy, especially with AWS; decommissioning existing data platforms is hard!

A "Data Champion" / "Data Guide" partnership is absolutely critical for successful adoption of new platforms and working models

Strong Agile coaches are important when scaling out Agile teams

Questions & Answers

Brian Barker • CEO • NorthBay Solutions • brian.barker@northbaysolutions.com
John Puopolo • SVP • Engineering • Eliza Corporation
Ali Khan • Director, Business Intelligence and Analytics • Scholastic
Sai Reddy Thangirala • Solutions Architect • Amazon Web Services

www.northbaysolutions.com info@northbaysolutions.com