Ingest, Transform & Visualize with Amazon Web Services

Post on 21-Jan-2018


Cloudwick© 2015 Cloudwick. All rights reserved. Confidential and Proprietary.

Data Ingestion, Transformation, and Visualization with Amazon Web Services
Sponsored by Amazon Web Services and Intel (Venue)
Presented by Arun Kumar Palathumpattu (Cloudwick)


Agenda

• Introduction
• Common Challenges in Data Analytics Environments
• Creating Data Lake – Single source of truth
• Build Effective Data workflow
• Demo
• Q&A


Top 5 Challenges in Building Data Analysis Environments


# 1 Market Landscape


Image:http://mwicorp.com/lake-okeechobee-water-transfer/

Structured

# 2 Issues of Storing All the Data


# 3 Time it Takes to Find the Useful Information

Image:https://img.clipartfest.com/0ffb2c38607970437e20a6fdd2872eb9_4-benefits-from-taking-guitar-time-value-of-money-clipart_500-500.png


# 4 Security

Image:https://insights.ubuntu.com/2017/03/20/three-flaws-at-the-heart-of-iot-security/


# 5 Skill Gap

https://www.linkedin.com/pulse/wonder-three-years-analytics-big-data-skills-most-demand-hosseini

Evolution of Data Architectures
1985: Data Warehouse Appliances

Diagram: four Compute Nodes on a Shared Storage Tier (NAS Appliance)

Benefits
• Consolidated multiple decision support environments (i.e. databases) into a single architecture
• Best performance available at time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top

Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing

Evolution of Data Architectures
2006: Hadoop Clusters

Diagram: Hadoop Master Node; three nodes, each with CPU, Memory, and HDFS Storage

Improvements
• Open source based software license!!!
• Commodity white box servers!!!!
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)

Constraints
• HDFS 3X replication to protect against node failure gets expensive at scale
• 500 TB data set = 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General purpose, monolithic cluster with many different apps on same hardware
• Still a data silo
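The replication arithmetic on this slide can be checked with a short sketch. The function name and the decimal TB-to-PB conversion are illustrative assumptions; real clusters also need headroom for temporary and intermediate data, which is ignored here.

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int = 3) -> float:
    """Raw disk needed to hold a data set under HDFS-style block replication."""
    return dataset_tb * replication_factor


dataset_tb = 500
raw_tb = raw_capacity_tb(dataset_tb)  # 3 copies of every block
print(f"{dataset_tb} TB data set -> {raw_tb / 1000:.1f} PB of raw cluster storage")
```

With the slide's figures this gives 1.5 PB of raw storage for a 500 TB data set.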


Legacy Data Architectures Exist as Isolated Data Silos

Hadoop Cluster

SQL Database

Data Warehouse Appliance

Evolution of Data Architectures
2009: Decoupled EMR Architecture

Diagram: Hadoop Master Node; three nodes, each with CPU and Memory; S3 as HDFS

Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently and up & down
• Only pay for the 500 TB data set (not 3X)
• Multi-physical facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, i.e. Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, etc.

Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via Hive, Presto, etc.

Evolution of Data Architectures
2012: Amazon Redshift – Cloud DW

Diagram labels: BI tools, SQL clients, Analytics tools; JDBC/ODBC; Customer VPC; Leader node; Internal VPC; Compute nodes; 10 GigE (HPC); Ingestion, Backup, Restore

Improvements
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP columnar relational database
• $1,000/TB/Year
• Accessible to any ODBC or JDBC BI tool

Constraints
• Still have to load data into a schema

Evolution of Data Architectures
Today: Clusterless

Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives

Constraints
• Limited to SQL, Hive and Spark jobs today. More frameworks to come!

Diagram: Athena for SQL (SQL interface in web browser) over the S3 Data Lake; Glue for ETL (Spark & Hive interface in web browser) over the S3 Data Lake
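As a rough illustration of "process data where it lives", the sketch below builds the kind of Hive-style DDL that Athena accepts for declaring a table over files already sitting in S3. The database, table, column, and bucket names are hypothetical, and the JSON SerDe is only one possible choice that assumes the files are JSON lines.

```python
def external_table_ddl(database, table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement over existing S3 objects.

    `columns` is a list of (name, sql_type) pairs. Nothing is loaded or
    copied: the table is only metadata pointing at `s3_location`.
    """
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical names, for illustration only.
ddl = external_table_ddl(
    "weblogs",
    "requests",
    [("path", "string"), ("status", "int")],
    "s3://example-bucket/raw/requests/",
)
query = "SELECT status, COUNT(*) AS hits FROM weblogs.requests GROUP BY status"
print(ddl)
print(query)
```

Both statements could then be run from the Athena console or API; billing is per query rather than per cluster hour, which is the "pay by the query" point above.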

© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Building a Flexible Data Lake Architecture on AWS


Enter Data Lake Architectures

Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.

Separating your storage and compute allows you to scale each component as required.

“How can I scale up with the volume of data being generated?”


Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data, from all of your sources, in one centralized location.

“Why is the data distributed in many locations? Where is the single source of truth?”

Quickly ingest data without needing to force it into a pre-defined schema.

“How can I collect data quickly from various sources and store it efficiently?”
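One common way to keep a single landing area organized while still ingesting data as-is is to encode the source and arrival date in the object key. The sketch below is an illustration only: the `raw/` prefix, the `source=`/`year=`/`month=`/`day=` layout, and the function name are assumptions, not something the slides prescribe.

```python
from datetime import datetime, timezone


def landing_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a date-partitioned object key for a raw file from `source`."""
    t = event_time.astimezone(timezone.utc)  # normalize so partitions are in UTC
    return f"raw/source={source}/year={t:%Y}/month={t:%m}/day={t:%d}/{filename}"


key = landing_key(
    "clickstream",
    datetime(2018, 1, 21, 9, 30, tzinfo=timezone.utc),
    "events-0001.json",
)
print(key)
```

Query engines that understand this key layout can skip whole prefixes when a query filters on date, so less data is scanned.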


Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
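A minimal sketch of schema on read, using only the Python standard library: the raw records are stored untouched, and each consumer applies its own schema when it reads them. The sample records, field names, and helper function are made up for illustration.

```python
import json

# Raw records as they landed; note the second one has no "country" field.
RAW_LINES = [
    '{"user": "a1", "ts": "2018-01-21T09:30:00Z", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "ts": "2018-01-21T09:31:00Z", "amount": "5.00"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


# Two consumers, two schemas, one copy of the data.
billing_view = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
geo_view = list(read_with_schema(RAW_LINES, {"user": str, "country": str}))
print(billing_view)
print(geo_view)
```

Neither view required rewriting the stored data, and a missing field surfaces as `None` at read time instead of blocking ingestion.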

Benefits of an AWS S3 Data Lake

Fixed Cluster Data Lake
• Limited to only the single tool contained on the cluster (i.e. Hadoop or data warehouse or Cassandra, etc.). Use cases & ecosystem tools change rapidly
• Expensive to add nodes to add storage capacity
• Expensive to replicate data against node loss
• Complexity in scaling local storage capacity
• Long refresh cycles to add additional storage equipment

AWS S3 Data Lake
• Decouple storage and compute by making S3 object based storage, not a fixed tool cluster, the data lake
• Flexibility to use any and all tools in the ecosystem. The right tool for the job
• Future proof your architecture. As new use cases and new tools emerge you can plug and play current best of breed.


S3 for data lake

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance
• Multipart upload
• Range GET

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB

Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
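To illustrate the Range GET item, the sketch below splits an object into HTTP `Range` header values so that several workers could fetch parts of it in parallel. The helper name and chunk size are illustrative; the resulting strings are what an S3 GET request (for example the `Range` parameter of boto3's `get_object`) would carry.

```python
def byte_ranges(object_size: int, chunk_size: int):
    """Yield HTTP Range header values that together cover the whole object."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1  # Range ends are inclusive
        yield f"bytes={start}-{end}"


ranges = list(byte_ranges(object_size=10, chunk_size=4))
print(ranges)
```

For a 10-byte object and 4-byte chunks this yields three ranges, the last one shorter than the others.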


Building a Data Lake on AWS

Kinesis Firehose

Athena Query Service


Processing & Analytics

Categories: Real-time, Batch, AI & Predictive, BI & Data Visualization, Transactional & RDBMS

Services shown:
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB (NoSQL DB)
• Aurora (Relational Database)
• EMR (Hadoop, Spark, Presto)
• Redshift (Data Warehouse)
• Athena (Query Service)
• Amazon Lex (Speech recognition)
• Amazon Rekognition
• Amazon Polly (Text to speech)
• Machine Learning (Predictive analytics)
• Kinesis Streams & Firehose

Summary of AWS Analytics, Database & AI Tools

• Amazon Redshift: Enterprise Data Warehouse
• Amazon EMR: Hadoop/Spark
• Amazon Athena: Clusterless SQL
• Amazon Glue: Clusterless ETL
• Amazon Aurora: Managed Relational Database
• Amazon Machine Learning: Predictive Analytics
• Amazon QuickSight: Business Intelligence/Visualization
• Amazon Elasticsearch Service: Elasticsearch
• Amazon ElastiCache: Redis In-memory Datastore
• Amazon DynamoDB: Managed NoSQL Database
• Amazon Rekognition: Deep Learning-based Image Recognition
• Amazon Lex: Voice or Text Chatbots


Implement the right cloud security controls

Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3

Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption

Compliance
• Buckets access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3 etc.
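Two of the controls listed on this slide, SSL endpoints and server-side encryption, are often enforced together through a bucket policy. The sketch below builds one such policy document as a Python dict. The bucket name and statement IDs are hypothetical, and this is one widely used pattern rather than the presenters' own configuration.

```python
import json


def encryption_and_tls_policy(bucket: str) -> dict:
    """Bucket policy that denies non-TLS access and unencrypted uploads."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that did not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [arn, f"{arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not request server-side encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{arn}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
        ],
    }


policy = encryption_and_tls_policy("example-data-lake")  # hypothetical bucket
print(json.dumps(policy, indent=2))
```

The printed JSON is the form a bucket policy takes when attached to the bucket; IAM policies, ACLs, and VPC endpoints from the list above would be configured separately.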


More Efficient Data Lake Architectures


What is a Modern Enterprise Data Warehouse?

A Modern Enterprise Data Warehouse (EDW) is designed to support rapid data growth and quick analytics over relational, non-relational, and streaming data, with a single, easy interface for consuming all of these data types.


Building a Modern EDW on AWS

Diagram labels:
• Stages: Data Ingestion, Data Consumption
• Source types: Streaming, Un-structured, Relational
• Services: Amazon Kinesis Firehose, Amazon S3, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, Amazon QuickSight, Amazon Machine Learning
• Roles: Data Mart, EDW, Ad-hoc query, Analyze, Predict


What is a Real-time Analytics Platform?

A real-time analytics platform provides the ability to load and analyze streaming data, and to build custom streaming data applications for specialized needs. Such a platform can handle staggering amounts of streaming data, sometimes TBs per hour, that need to be collected, stored, and processed continuously.
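To put "TBs per hour" into concrete terms, the sketch below converts an hourly volume into a sustained byte rate and then into a shard count for a Kinesis stream. The per-shard write limit of 1 MB/s is an assumption taken from the commonly published Kinesis Streams limit; it ignores the separate records-per-second limit and any headroom for traffic spikes.

```python
import math

SHARD_INGEST_BYTES_PER_SEC = 1_000_000  # assumed per-shard write limit (1 MB/s)


def shards_needed(tb_per_hour: float) -> int:
    """Minimum shard count to sustain `tb_per_hour` of evenly spread writes."""
    bytes_per_sec = tb_per_hour * 1e12 / 3600  # decimal TB, 3600 s per hour
    return math.ceil(bytes_per_sec / SHARD_INGEST_BYTES_PER_SEC)


print(shards_needed(1))  # 1 TB/hour is roughly 278 MB/s sustained
```

Under these assumptions, 1 TB per hour needs at least 278 shards, which shows why this kind of capacity is usually provisioned and scaled as a managed service rather than by hand.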


Building a Real-time analytics platform (Streaming vs batching)

Stream Delivery

Stream Analytics

Stream processing

Event Driven

Kinesis Streaming
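The streaming versus batching trade-off can be illustrated with a small delivery buffer that collects records and flushes them as one batch when either a size or an age threshold is reached. This is a toy model of the buffering idea only, not how Kinesis Firehose is implemented; the class name, thresholds, and sink are all hypothetical.

```python
import time


class DeliveryBuffer:
    """Collect records and hand them to `sink` in batches.

    A batch is flushed when it reaches `max_bytes` or when it has been open
    for `max_age_seconds`. Age is only checked when a record arrives, so an
    idle buffer must be flushed explicitly.
    """

    def __init__(self, max_bytes, max_age_seconds, sink, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.sink = sink
        self.clock = clock
        self._records = []
        self._size = 0
        self._opened_at = None

    def put(self, record: bytes) -> None:
        if self._opened_at is None:
            self._opened_at = self.clock()  # first record opens a new batch
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._opened_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._records:
            self.sink(b"".join(self._records))
        self._records, self._size, self._opened_at = [], 0, None


batches = []
buffer = DeliveryBuffer(max_bytes=10, max_age_seconds=60, sink=batches.append)
for record in [b"aaaa\n", b"bbbb\n", b"cccc\n"]:
    buffer.put(record)
buffer.flush()  # deliver whatever is left
print(batches)
```

Smaller thresholds push the system toward streaming (lower latency, more small objects); larger thresholds push it toward batching (higher latency, fewer and larger objects).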


Simplify Big Data Processing

Diagram: data → ingest / collect → store → process / analyze → consume / visualize → answers

Trade-offs:
• Time to Answer (Latency)
• Throughput
• Cost


Amazon QuickSight is a Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data.


Amazon S3 Data Lake

Visualizing the Data Lake

Amazon Athena


DEMO


Credits

Event sponsored by Amazon Web Services
Venue by Intel
Presented by Arun Kumar Palathumpattu (Cloudwick)
Slides credit: AWS + Cloudwick