
Launching Your First Big Data Project on AWS

Date post: 16-Jul-2015
Upload: amazon-web-services
Transcript

Launching Your First Big Data Project on AWS: Structured, Unstructured & Streaming Data

Russell Nash

[Figure: data stores plotted by structure (low to high) and size (small to large): Traditional Database, Hadoop, NoSQL, MPP Database]

•  Structured: MPP Databases (Amazon Redshift)
•  Unstructured: Hadoop (Amazon EMR)
•  Streaming: Real-time Analysis (Amazon Kinesis)

Amazon Redshift

•  Standard SQL
•  Optimized for fast analysis
•  Very scalable

[Diagram: SQL clients/BI tools connect to a leader node and three compute nodes, each shown with 128GB RAM, 16TB disk and 16 cores]

Q1. What is it?

Amazon Redshift

•  MPP SQL Database
•  Optimised for Analytics
•  Gigabytes to Petabytes
•  Fully relational
•  Fully managed

Q2. How does it work?

[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node, which sits in front of three compute nodes; each node is shown with 128GB RAM, 16TB disk and 16 cores]

ID   Name
1    John Smith
2    Jane Jones
3    Peter Black
4    Pat Partridge
5    Sarah Cyan
6    Brian Snail

[Diagram: the same rows spread across the compute nodes: 1 and 4, 2 and 5, 3 and 6]
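
The row placement in the diagram can be imitated with a few lines of Python. This is only a sketch of round-robin placement, assuming three compute nodes as drawn on the slide. The function name `distribute_round_robin` is hypothetical, and the deck does not say which distribution method Redshift actually applies here.

```python
def distribute_round_robin(rows, node_count):
    """Assign rows to nodes in turn: first row to node 0, second to node 1, and so on."""
    nodes = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        nodes[index % node_count].append(row)
    return nodes


# The six sample rows from the slide: (ID, Name).
rows = [
    (1, "John Smith"),
    (2, "Jane Jones"),
    (3, "Peter Black"),
    (4, "Pat Partridge"),
    (5, "Sarah Cyan"),
    (6, "Brian Snail"),
]

# Three compute nodes is an assumption taken from the diagram.
nodes = distribute_round_robin(rows, node_count=3)
for number, node_rows in enumerate(nodes, start=1):
    print(f"Compute node {number}: {[row_id for row_id, _ in node_rows]}")
```

With the slide's six rows this puts IDs 1 and 4 on the first node, 2 and 5 on the second, and 3 and 6 on the third, which matches the grouping in the diagram.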

Dramatically reduces I/O

•  Column storage
•  Data compression
•  Zone maps

•  With row storage you do unnecessary I/O
•  To get average Amount by State, you have to read everything

ID    Age   State   Amount
123   20    QLD     500
345   25    WA      250
678   40    NSW     125
957   37    WA      375

Column storage

•  With column storage, you only read the data you need

ID    Age   State   Amount
123   20    QLD     500
345   25    WA      250
678   40    NSW     125
957   37    WA      375
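
The difference between the two layouts can be made concrete with the four sample rows from the slide. The sketch below is an in-memory illustration only: it counts how many values each layout has to touch to compute the average Amount by State, and says nothing about how Redshift stores blocks on disk. The names `columns` and `average_amount_by_state` are made up for this example.

```python
from collections import defaultdict

# The four sample rows from the slide: (ID, Age, State, Amount).
rows = [
    (123, 20, "QLD", 500),
    (345, 25, "WA", 250),
    (678, 40, "NSW", 125),
    (957, 37, "WA", 375),
]

# Column layout: one list per column, kept in the same row order.
columns = {
    "ID": [r[0] for r in rows],
    "Age": [r[1] for r in rows],
    "State": [r[2] for r in rows],
    "Amount": [r[3] for r in rows],
}


def average_amount_by_state(states, amounts):
    """Group amounts by state and return the mean per state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum, count]
    for state, amount in zip(states, amounts):
        totals[state][0] += amount
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Row storage: whole rows are read, so every value is touched.
values_read_row_store = sum(len(r) for r in rows)

# Column storage: only the State and Amount columns are touched.
values_read_column_store = len(columns["State"]) + len(columns["Amount"])

averages = average_amount_by_state(columns["State"], columns["Amount"])
print(averages)
print(values_read_row_store, values_read_column_store)
```

For this tiny table the row layout touches 16 values and the column layout touches 8; the gap widens as tables gain columns that a query does not need.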

Data compression

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

•  COPY compresses automatically
•  You can analyze and override
•  More performance, less cost
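
To give a feel for why an encoding such as `delta` can shrink a column like `listid`, here is a toy sketch of the general idea of delta encoding: keep the first value and then store only the differences between neighbours. It is an illustration under stated assumptions, not Redshift's actual storage format, and the sample IDs are invented for the example.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the stored differences."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values


# Hypothetical, mostly sequential IDs (not taken from the slide).
list_ids = [1000, 1001, 1002, 1005, 1006]

encoded = delta_encode(list_ids)
print(encoded)
print(delta_decode(encoded) == list_ids)
```

The encoded list holds one full value followed by small differences, which need fewer bits each than the original numbers; that is the intuition behind the smaller footprint.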

Zone maps

Block 1:  10 | 13 | 14 | 26 | … | 100 | 245 | 324    (min 10, max 324)
Block 2:  375 | 393 | 417 | … | 512 | 549 | 623      (min 375, max 623)
Block 3:  637 | 712 | 809 | … | 834 | 921 | 959      (min 637, max 959)

•  Track the minimum and maximum value for each block
•  Skip over blocks that don’t contain relevant data
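
The block-skipping idea can be sketched with the minimum and maximum values shown on the slide. The code below is a simplified model, assuming a query that filters on a value range; `blocks_to_read` is a hypothetical helper name, not a Redshift function.

```python
# (min, max) per block, taken from the slide's example.
zone_map = [(10, 324), (375, 623), (637, 959)]


def blocks_to_read(zone_map, low, high):
    """Return the indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [
        index
        for index, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


print(blocks_to_read(zone_map, 400, 500))  # only the second block can match
print(blocks_to_read(zone_map, 300, 400))  # the range straddles the first two blocks
print(blocks_to_read(zone_map, 330, 370))  # falls in a gap, so no block is read
```

A block is read only if its range could contain a matching value; everything else is skipped without touching the data.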

Q3. What’s good about it?

Performance, Scalability, Ease of Use, Cost

[Figure: scaling from a single DW2.L node (160 GB) up to a grid of 8XL nodes (2 PB)]

Q4. How do I integrate with Redshift?

Works with your existing analysis tools

[Diagram: analysis tools connect to Amazon Redshift over JDBC/ODBC]

Loading data

[Diagram: S3, DynamoDB, EMR and Linux shown alongside Redshift]

[Diagram: Source Systems, ETL, Amazon Redshift]


Amazon Elastic MapReduce (EMR)

[Diagram: an input file is processed by functions on a Hadoop cluster to produce output]

1. Very Flexible
2. Very Scalable
3. Often Transient

Q1. What is it?

Managed Hadoop

[Diagram: an input file is processed by functions on an EMR cluster of EC2 instances to produce output]

Q2. How does it work?

[Diagram: EMR, an EMR cluster and S3]

1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
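
The choices in step 2 can be written down as a request before anything is launched. The sketch below only builds a Python dictionary shaped like the keyword arguments of boto3's EMR `run_job_flow` call; it does not contact AWS. The helper name `build_emr_request`, the bucket name, the release label and the instance type are all placeholders chosen for illustration, and the job steps themselves are omitted.

```python
def build_emr_request(name, release_label, instance_type, core_count, apps, log_uri):
    """Collect the cluster choices into keyword arguments for an EMR launch request."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,                      # Hadoop distribution / release
        "LogUri": log_uri,                                  # where cluster logs are written in S3
        "Applications": [{"Name": app} for app in apps],    # Hadoop apps such as Hive or Pig
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Transient cluster: shut down once there is no more work to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# All values below are hypothetical placeholders.
request = build_emr_request(
    name="first-big-data-job",
    release_label="emr-5.36.0",
    instance_type="m5.xlarge",
    core_count=4,
    apps=["Hive", "Pig"],
    log_uri="s3://example-bucket/logs/",
)

# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

Keeping the request as plain data makes it easy to review the node count, node types and applications before paying for a cluster.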

You can easily resize the cluster, and launch parallel clusters using the same data

Use Spot nodes to save time and money

When processing is complete, you can terminate the cluster (and stop paying)

Or just store everything in HDFS (local disk)

Q3. What’s good about it?

Scalability, Cost & Ease of Use

EMR with spot instances

Scenario #1 (without Spot)
Duration: 14 Hours
Cost: 4 instances * 14 hrs * $0.50 = $28

Scenario #2 (with Spot)
Duration: 7 Hours
Cost: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~19%
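
The two scenarios can be checked with a short calculation. The sketch below uses only the instance counts, durations and hourly rates quoted on the slide; the function name `cluster_cost` is made up for the example. Measured against the Scenario #1 cost of $28, the saving of $5.25 works out to 18.75%.

```python
ON_DEMAND_RATE = 0.50  # dollars per instance-hour, from the slide
SPOT_RATE = 0.25       # dollars per instance-hour, from the slide


def cluster_cost(instances, hours, rate):
    """Cost of running a group of instances for a number of hours at one hourly rate."""
    return instances * hours * rate


# Scenario #1: 4 on-demand instances for 14 hours.
scenario_1 = cluster_cost(4, 14, ON_DEMAND_RATE)

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances, for 7 hours.
scenario_2 = cluster_cost(4, 7, ON_DEMAND_RATE) + cluster_cost(5, 7, SPOT_RATE)

time_savings = (14 - 7) / 14
cost_savings = (scenario_1 - scenario_2) / scenario_1

print(f"Scenario #1: ${scenario_1:.2f}")
print(f"Scenario #2: ${scenario_2:.2f}")
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```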

[Diagram: EMR cluster with master, core and task instance groups, HDFS, and Amazon S3]

Great for Spot Instances

The Hadoop Ecosystem

Q4. How are customers using it?

Big Data Verticals

•  Media/Advertising: Targeted Advertising; Image and Video Processing
•  Oil & Gas: Seismic Analysis
•  Retail: Recommendations; Transactions Analysis
•  Life Sciences: Genome Analysis
•  Financial Services: Monte Carlo Simulations; Risk Analysis
•  Security: Anti-virus; Fraud Detection; Image Recognition
•  Social Network/Gaming: User Demographics; Usage analysis; In-game metrics


Real-time Scenarios in Industry Segments

Scenario types: Log Ingest, Continual Metrics, Real Time Data Analytics, Complex Stream Processing

Software/Technology
•  IT server logs ingestion
•  IT operational metrics dashboards
•  Devices / Sensor Operational Intelligence

Digital Ad Tech./Marketing
•  Advertising Data aggregation
•  Advertising metrics like coverage, yield, conversion
•  Analytics on User engagement with Ads
•  Optimized bid/buy engines

Financial Services
•  Market/Financial Transaction order data collection
•  Financial market data metrics
•  Fraud monitoring, and Value-at-Risk assessment
•  Auditing of market order data

E-Commerce
•  Online customer engagement data aggregation
•  Consumer engagement metrics like page views, CTR
•  Customer clickstream analytics
•  Recommendation engines

Q1. What is it?

Kinesis

A fully managed service for real-time processing of high-volume, streaming data.

Q2. How does it work?

[Diagram: data sources write to a Kinesis stream spanning three Availability Zones; applications for logging, metrics, analysis and machine learning read from the stream, alongside S3, DynamoDB, Redshift and EMR]

Putting data into Kinesis

•  Each shard:
   •  1000 Tx per second
   •  1MB per second
   •  50KB payload per Tx
•  Messages kept for 24 hours

[Diagram: multiple producers write to a Kinesis stream made up of shards 1 to n]
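
The per-shard figures above translate directly into a sizing estimate. The sketch below is a back-of-the-envelope calculation using only the limits quoted on the slide; it assumes 1MB means 1,000,000 bytes and 50KB means 50,000 bytes for simplicity, and `shards_needed` is a hypothetical helper, not part of any AWS SDK.

```python
import math

# Per-shard limits as quoted on the slide.
RECORDS_PER_SECOND_PER_SHARD = 1000
BYTES_PER_SECOND_PER_SHARD = 1_000_000  # 1MB, taken as 10**6 bytes (assumption)
MAX_PAYLOAD_BYTES = 50_000              # 50KB, taken as 50 * 10**3 bytes (assumption)


def shards_needed(records_per_second, average_record_bytes):
    """Estimate the shard count from the record-rate limit and the byte-rate limit."""
    if average_record_bytes > MAX_PAYLOAD_BYTES:
        raise ValueError("record exceeds the per-transaction payload limit")
    by_count = math.ceil(records_per_second / RECORDS_PER_SECOND_PER_SHARD)
    by_bytes = math.ceil(
        records_per_second * average_record_bytes / BYTES_PER_SECOND_PER_SHARD
    )
    # Whichever limit is hit first decides the shard count; never fewer than one shard.
    return max(1, by_count, by_bytes)


print(shards_needed(5000, 100))     # many small records: the record-rate limit dominates
print(shards_needed(500, 10_000))   # fewer, larger records: the byte-rate limit dominates
print(shards_needed(2500, 200))
```

Either limit can be the bottleneck, so the estimate takes the larger of the two counts.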

Getting data out of Kinesis

[Diagram: shards 1 to n in Kinesis are read by KCL workers 1 to n running on EC2 instances]

Kinesis Client Library (KCL):
•  Abstracts code from individual shards
•  Starts a Kinesis Worker for each shard
•  Increases and decreases workers
•  Tracks a Worker’s location in the stream
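
The responsibilities listed for the KCL can be imitated in a small in-memory simulation. This is a toy model, not the real KCL API: the class `ShardWorker`, the function `sync_workers` and the shard names are all invented for illustration. It shows one worker per shard, workers being added as shards appear, and a checkpoint that remembers each worker's position.

```python
class ShardWorker:
    """Toy stand-in for a per-shard worker; not the real KCL interface."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.checkpoint = 0  # index of the next unprocessed record in the shard

    def process(self, records, handler):
        """Hand every record after the checkpoint to the handler, then advance it."""
        for record in records[self.checkpoint:]:
            handler(self.shard_id, record)
        self.checkpoint = len(records)


def sync_workers(workers, shard_ids):
    """Keep exactly one worker per shard: add workers for new shards, drop stale ones."""
    for shard_id in shard_ids:
        if shard_id not in workers:
            workers[shard_id] = ShardWorker(shard_id)
    for shard_id in list(workers):
        if shard_id not in shard_ids:
            del workers[shard_id]
    return workers


def run_once(workers, stream, handler):
    """Let each worker process whatever is new in its own shard."""
    for shard_id, worker in workers.items():
        worker.process(stream[shard_id], handler)


# Hypothetical stream contents, keyed by shard.
stream = {"shard-1": ["a", "b"], "shard-2": ["c"]}
seen = []


def handler(shard_id, record):
    # The application code only sees records, never shard bookkeeping.
    seen.append((shard_id, record))


workers = sync_workers({}, stream.keys())
run_once(workers, stream, handler)

# More data arrives and a shard is added; only unprocessed records are handled.
stream["shard-1"].append("d")
stream["shard-3"] = ["e"]
sync_workers(workers, stream.keys())
run_once(workers, stream, handler)

print(seen)
```

Because each worker keeps its own checkpoint, the second pass handles only the new records rather than replaying the whole stream.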

Q3. What’s good about it?

•  Easy Administration
•  Real-time Performance
•  High Throughput. Elastic
•  Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch
•  Build Real-time Applications
•  Low Cost

aws.amazon.com/big-data

