Launching Your First Big Data Project on AWS

Structured, Unstructured & Streaming Data

Russell Nash

[Figure: data stores plotted by Structure (Low to High) against Size (Small to Large): Traditional Database, NoSQL, Hadoop, MPP Database]

Structured: MPP Databases (Amazon Redshift)

Unstructured: Hadoop (Amazon EMR)

Streaming: Real-time Analysis (Amazon Kinesis)

•  Standard SQL

•  Optimized for fast analysis

•  Very scalable

[Figure: SQL clients/BI tools connect to a leader node in front of the compute nodes; nodes are labeled 128GB RAM, 16TB disk, 16 cores]

Amazon Redshift

Q1. What is it?

MPP SQL Database

Optimised for Analytics

Gigabytes to Petabytes

Fully relational

Fully managed

Amazon Redshift

Q2. How does it work?


[Figure: clients connect over JDBC/ODBC to the leader node; nodes are labeled 128GB RAM, 16TB disk, 16 cores]

[Figure: the same cluster, with the rows of a table spread across the nodes]

ID  Name
1   John Smith
2   Jane Jones
3   Peter Black
4   Pat Partridge
5   Sarah Cyan
6   Brian Snail

Rows as distributed across the nodes:

1 John Smith, 4 Pat Partridge

2 Jane Jones, 5 Sarah Cyan

3 Peter Black, 6 Brian Snail
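The grouping on the slide (rows 1 and 4, 2 and 5, 3 and 6) is what a simple round-robin assignment over three nodes would produce. The following is a minimal sketch of that idea, not Redshift's actual implementation; the function name and the node count of three are assumptions made for illustration.

```python
from collections import defaultdict

# The six example rows from the slide.
rows = [
    (1, "John Smith"), (2, "Jane Jones"), (3, "Peter Black"),
    (4, "Pat Partridge"), (5, "Sarah Cyan"), (6, "Brian Snail"),
]


def distribute_round_robin(rows, node_count):
    """Assign rows to nodes in turn: first row to node 0, second to node 1, and so on."""
    nodes = defaultdict(list)
    for position, row in enumerate(rows):
        nodes[position % node_count].append(row)
    return dict(nodes)


# Hypothetical cluster of three nodes holding data.
placement = distribute_round_robin(rows, node_count=3)
for node, node_rows in sorted(placement.items()):
    print(node, node_rows)
```

Spreading rows evenly like this lets every node work on its own share of a query in parallel.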


•  Column storage

•  Data compression

•  Zone maps

•  With row storage you do unnecessary I/O

•  To get average Amount by State, you have to read everything

ID Age State Amount

123 20 QLD 500

345 25 WA 250

678 40 NSW 125

957 37 WA 375

Dramatically reduces I/O

•  With column storage, you only read the data you need

ID Age State Amount

123 20 QLD 500

345 25 WA 250

678 40 NSW 125

957 37 WA 375
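To make the I/O difference concrete, here is a small sketch that stores the example table both ways and counts how many values each layout has to read to compute the average Amount by State. The in-memory lists are only a stand-in for disk blocks, and the helper name is hypothetical.

```python
# Row layout: each record is stored together, so a scan touches every field.
rows = [
    (123, 20, "QLD", 500),
    (345, 25, "WA", 250),
    (678, 40, "NSW", 125),
    (957, 37, "WA", 375),
]

# Column layout: the same data, stored as one list per column.
columns = {
    "ID": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "State": [r[2] for r in rows],
    "Amount": [r[3] for r in rows],
}


def average_amount_by_state(state_col, amount_col):
    """Group amounts by state and return the mean for each state."""
    totals = {}
    for state, amount in zip(state_col, amount_col):
        total, count = totals.get(state, (0, 0))
        totals[state] = (total + amount, count + 1)
    return {state: total / count for state, (total, count) in totals.items()}


result = average_amount_by_state(columns["State"], columns["Amount"])

# Values touched by the query under each layout.
values_read_row_storage = sum(len(r) for r in rows)
values_read_column_storage = len(columns["State"]) + len(columns["Amount"])

print(result)
print(values_read_row_storage, values_read_column_storage)
```

With four columns and a query that needs two of them, the column layout reads half as many values; the gap widens as tables get wider.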



analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw



•  COPY compresses automatically

•  You can analyze and override

•  More performance, less cost
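One of the encodings in the output above is delta. As a rough illustration of why it helps on columns whose values increase steadily, the sketch below stores the first value and then only the differences between neighbours. This is a simplified model, not Redshift's on-disk format, and the sample list IDs are made up.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as the difference from the previous one."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    return list(accumulate(encoded))


# Hypothetical, mostly sequential IDs.
listids = [100000, 100001, 100002, 100004, 100007]
encoded = delta_encode(listids)
print(encoded)
print(delta_decode(encoded) == listids)
```

The differences are small numbers that fit in far fewer bytes than the original values, which is where the space saving comes from.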





•  Zone maps

Block values                              Min   Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324    10    324
375 | 393 | 417 | … | 512 | 549 | 623      375   623
637 | 712 | 809 | … | 834 | 921 | 959      637   959

•  Track the minimum and maximum value for each block

•  Skip over blocks that don’t contain relevant data
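The two points above can be sketched in a few lines. The blocks below use only the values visible on the slide (the elided values are left out), and the function name is hypothetical; treat it as an illustration of the idea rather than how Redshift implements it.

```python
# Sorted column data split into blocks, using the values shown on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# The zone map: one (minimum, maximum) pair per block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_range(low, high):
    """Return matching values and the indexes of blocks that had to be read."""
    matches, blocks_read = [], []
    for index, (block_min, block_max) in enumerate(zone_map):
        # Skip any block whose range cannot overlap [low, high].
        if block_max < low or block_min > high:
            continue
        blocks_read.append(index)
        matches.extend(v for v in blocks[index] if low <= v <= high)
    return matches, blocks_read


matches, blocks_read = scan_range(400, 600)
print(zone_map)
print(matches, blocks_read)
```

For a predicate such as "between 400 and 600", only the middle block is read; the other two are skipped using the zone map alone.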



Q3. What’s good about it?

Performance, Scalability, Ease of Use, Cost

[Figure: Redshift scales from 160 GB on a DW2.L node up to 2 PB across a cluster of 8XL nodes]

Q4. How do I integrate with Redshift?

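Works with your existing analysis tools

[Figure: analysis tools connect to Amazon Redshift over JDBC/ODBC]

Loading data

[Figure: S3, DynamoDB, EMR, and Linux shown as data sources around Redshift]

[Figure: source systems feed Amazon Redshift through ETL]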



[Figure: an input file is processed by functions on a Hadoop cluster to produce output]

1.  Very Flexible

2.  Very Scalable

3.  Often Transient

Amazon Elastic MapReduce (EMR)

Q1. What is it?

Managed Hadoop

[Figure: an input file is processed by functions on an EMR cluster of EC2 instances to produce output]

EMR

[Figure: EMR cluster and S3]

1. Put the data into S3

2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase

3. Launch the cluster using the EMR console, CLI, SDK, or APIs

4. Get the output from S3
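Steps 2 and 3 above can be sketched with the SDK. The example below only builds the request as a plain dictionary whose keys follow the parameters of boto3's `run_job_flow` call; it does not contact AWS. The bucket name, cluster name, release label, instance types, and node counts are all placeholder assumptions, and job steps are left out for brevity.

```python
def build_emr_request(bucket, core_nodes, task_nodes):
    """Build keyword arguments for an EMR cluster launch (all values are illustrative)."""
    return {
        "Name": "first-big-data-job",            # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",            # assumed release; pick one available to you
        "LogUri": f"s3://{bucket}/logs/",        # cluster logs go back to S3
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": task_nodes},
            ],
            # A transient cluster: shut down once there is no more work.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("my-example-bucket", core_nodes=4, task_nodes=5)
roles = [group["InstanceRole"] for group in request["Instances"]["InstanceGroups"]]
print(roles)

# With credentials configured, the request could then be submitted along these lines:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping input, output, and logs in S3 is what allows the cluster itself to be thrown away after the job.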

[Figure: EMR cluster and S3]

You can easily resize the cluster and launch parallel clusters using the same data

[Figure: EMR cluster and S3]

Use Spot nodes to save time and money

[Figure: EMR cluster and S3]

When processing is complete, you can terminate the cluster (and stop paying)

[Figure: EMR cluster]

Or just store everything in HDFS (local disk)


Scalability, Cost & Ease of Use

EMR with spot instances

Scenario #1

Duration: 14 Hours

Cost without Spot: 4 instances * 14 hrs * $0.50 = $28

Scenario #2

Duration: 7 Hours

Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time Savings: 50%    Cost Savings: ~19%
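The arithmetic for the two scenarios can be checked with a short script. The instance counts, durations, and hourly rates are the ones given above; the helper name is hypothetical. Measured against the Scenario #1 cost of $28, the saving of $5.25 works out to 18.75%, or roughly 19%.

```python
def cluster_cost(groups, hours):
    """Total cost for a list of (instance_count, hourly_rate) groups run for `hours`."""
    return sum(count * hours * rate for count, rate in groups)


# Scenario #1: 4 on-demand instances for 14 hours.
cost_without_spot = cluster_cost([(4, 0.50)], hours=14)

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances, for 7 hours.
cost_with_spot = cluster_cost([(4, 0.50), (5, 0.25)], hours=7)

time_saving = 1 - 7 / 14
cost_saving = 1 - cost_with_spot / cost_without_spot

print(f"${cost_without_spot:.2f} vs ${cost_with_spot:.2f}")
print(f"time saving {time_saving:.0%}, cost saving {cost_saving:.2%}")
```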

[Figure: an EMR cluster with a master instance group, a core instance group (HDFS), and a task instance group, alongside Amazon S3; annotated "Great for Spot Instances"]

The Hadoop Ecosystem


Q4. How are customers using it?

Big Data Verticals

Media/Advertising: Targeted Advertising; Image and Video Processing

Oil & Gas: Seismic Analysis

Retail: Recommendations; Transactions Analysis

Life Sciences: Genome Analysis

Financial Services: Monte Carlo Simulations; Risk Analysis

Security: Anti-virus; Fraud Detection; Image Recognition

Social Network/Gaming: User Demographics; Usage analysis; In-game metrics



Real-time Scenarios in Industry Segments

Software/Technology
    Log Ingest: IT server logs ingestion
    Continual Metrics: IT operational metrics dashboards
    Real Time Data Analytics: Devices / Sensor Operational Intelligence

Digital Ad Tech./Marketing
    Log Ingest: Advertising data aggregation
    Continual Metrics: Advertising metrics like coverage, yield, conversion
    Real Time Data Analytics: Analytics on user engagement with ads
    Complex Stream Processing: Optimized bid/buy engines

Financial Services
    Log Ingest: Market/financial transaction order data collection
    Continual Metrics: Financial market data metrics
    Real Time Data Analytics: Fraud monitoring, and Value-at-Risk assessment
    Complex Stream Processing: Auditing of market order data

E-Commerce
    Log Ingest: Online customer engagement data aggregation
    Continual Metrics: Consumer engagement metrics like page views, CTR
    Real Time Data Analytics: Customer clickstream analytics
    Complex Stream Processing: Recommendation engines

Q1. What is it?

Kinesis

A fully managed service for real-time processing of high-volume, streaming data.


[Figure: data sources send records to a Kinesis stream spanning three Availability Zones; applications for logging, metrics, analysis, and machine learning read from the stream, alongside S3, DynamoDB, Redshift, and EMR]

Putting data into Kinesis

•  Each shard:

    •  1000 Tx Per Second

    •  1MB Per Second

    •  50KB Payload Per Tx

•  Messages kept for 24 hours
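Given the per-shard limits listed above, a rough shard count for a workload can be estimated as follows. This is a back-of-the-envelope sketch: the function name is hypothetical, 1MB is treated as 1,000,000 bytes for simplicity, and the limits are the ones quoted on the slide rather than current service quotas.

```python
import math

# Per-shard limits as quoted on the slide.
RECORDS_PER_SECOND_PER_SHARD = 1000
BYTES_PER_SECOND_PER_SHARD = 1_000_000   # "1MB", taken here as 10^6 bytes (assumption)
MAX_PAYLOAD_BYTES = 50_000               # "50KB", taken here as 50,000 bytes (assumption)


def shards_needed(records_per_second, avg_record_bytes):
    """Estimate shards from whichever limit (record count or bytes) is hit first."""
    if avg_record_bytes > MAX_PAYLOAD_BYTES:
        raise ValueError("record exceeds the per-transaction payload limit")
    by_count = math.ceil(records_per_second / RECORDS_PER_SECOND_PER_SHARD)
    by_bytes = math.ceil(records_per_second * avg_record_bytes / BYTES_PER_SECOND_PER_SHARD)
    return max(1, by_count, by_bytes)


print(shards_needed(5000, 500))    # limited by record count
print(shards_needed(2000, 2000))   # limited by bytes per second
```

Small records hit the transaction limit first, while large records hit the bandwidth limit first, so both need checking.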

[Figure: many producers write into a Kinesis stream made up of Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n]

Getting data out of Kinesis

[Figure: Shard 1 to Shard n of a Kinesis stream are read by KCL Worker 1 to KCL Worker n, which run on EC2 instances]

Kinesis Client Library (KCL):

•  Abstracts code from individual shards

•  Starts a Kinesis Worker for each shard

•  Increases and decreases workers

•  Tracks a Worker's location in the stream
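The worker-per-shard and location-tracking ideas above can be modelled in plain Python. This is not the KCL API: the stream is an in-memory dictionary, and `ShardWorker`, `run_workers`, and the checkpoint store are hypothetical names used only to illustrate how a saved position lets processing resume without re-reading old records.

```python
# Hypothetical in-memory stand-in for a stream: shard id -> ordered records.
stream = {
    "shard-1": ["a1", "a2", "a3"],
    "shard-2": ["b1", "b2"],
    "shard-3": ["c1"],
}


class ShardWorker:
    """Processes one shard and records how far it has read (its checkpoint)."""

    def __init__(self, shard_id, checkpoints):
        self.shard_id = shard_id
        self.checkpoints = checkpoints

    def run(self, records, handler, batch_size=2):
        # Resume from the last saved position, or from the start if there is none.
        position = self.checkpoints.get(self.shard_id, 0)
        while position < len(records):
            batch = records[position:position + batch_size]
            for record in batch:
                handler(self.shard_id, record)
            position += len(batch)
            self.checkpoints[self.shard_id] = position


def run_workers(stream, checkpoints, handler):
    """Start one worker per shard, so application code never addresses shards directly."""
    for shard_id, records in stream.items():
        ShardWorker(shard_id, checkpoints).run(records, handler)


checkpoints = {}
seen = []


def collect(shard_id, record):
    seen.append((shard_id, record))


run_workers(stream, checkpoints, collect)
first_pass_count = len(seen)

# A new record arrives; a second pass picks up only what is new.
stream["shard-1"].append("a4")
run_workers(stream, checkpoints, collect)

print(first_pass_count, seen[-1], checkpoints)
```

In a real deployment the checkpoints would live in durable storage so that a replacement worker on another instance can take over a shard.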

Easy Administration

Real-time Performance

High Throughput. Elastic

Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch

Build Real-time Applications

Low Cost

aws.amazon.com/big-data

Date post:	16-Jul-2015
Category:	Technology
Upload:	amazon-web-services