How to run your Hadoop Cluster in 10 minutes


Vladimir Simek, Solutions Architect @ AWS

22/03/2016

Amazon Elastic MapReduce: How to run your Hadoop Cluster in 10 minutes

Agenda

• Two different companies – 2 stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-prem Hadoop cluster to the AWS cloud
• Short demos

In the beginning – 2 different stories

• In 2007, The New York Times decided to create a digital archive on the web – all articles from 1851-1922

• 11 million articles (4 TB of data) composed of:
  • 405,000 large TIFF images
  • 405,000 XML files
  • 3.3 million SGML files

• Used Amazon EC2 and Hadoop to process the data

Time to process? Less than 24 hours

Costs? About $240

(Undisclosed international company) – subsidiary in France

• In 2014, decided to run a POC on Big Data analytics

• What was the first step they took? Invested €7M in server purchases

“Want to increase innovation? Lower the cost of failure.”

Joi Ito, Director of MIT Media Lab

How many big ticket technology ideas can your budget tolerate?

(Big) Data for Competitive Advantage

Customer segmentation

Marketing spend optimization

Financial modeling & forecasting

Ad targeting & real-time bidding

Clickstream analysis

Fraud detection

Security threat detection

Challenges with In-House Infrastructure

Fixed Cost

Slow Deployment Cycle

Always On

Self Serve

Static: Not Scalable

Outages Impact

Production Upgrade

Storage

Compute

What is Amazon EMR and how does it address these issues?

Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize

Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud

What Do I Need to Build a Cluster?

1. Choose instances
2. Choose your software
3. Choose your access method

Choice of Multiple Instances

CPU: c3 family, cc1.4xlarge, cc2.8xlarge

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

General: m1 family, m3 family

Machine Learning

Batch Processing

In-memory (Spark & Presto)

Large HDFS

Select an Instance

Choose Your Software (Quick Bundles)

Choose Your Software – Custom

Hadoop Applications Available in Amazon EMR

Choose Security and Access Control

You Are Up and Running!

Master Node DNS

Information about the software you are running, logs and features

Infrastructure for this cluster

Security Groups and Roles

Use the CLI

aws emr create-cluster --release-label emr-4.0.0 --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK
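
As one possible illustration of the SDK route, the same two-group cluster can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch, not the presenter's code: the cluster name, the region, and the two default role names are assumptions, and the roles are assumed to exist already in the account.

```python
def build_cluster_request(name="demo-cluster"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Mirrors the CLI example: one m3.xlarge master, two m3.xlarge core nodes.
    """
    return {
        "Name": name,  # hypothetical name, not taken from the slides
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster running after its steps finish (long-running cluster).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles have already been created.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(region="eu-west-1"):
    """Submit the request and return the new cluster ID (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without the SDK installed

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]
```

Calling build_cluster_request() alone lets you inspect the request without touching AWS; launch() actually creates billable resources.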

Demo – Build EMR cluster

Now that I have a cluster, I need to process some data

Amazon EMR can process data from multiple sources

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis


On an On-premises Environment

Tightly coupled

Compute and Storage Grow Together

Tightly coupled

Storage grows along with compute

Compute requirements vary

Underutilized or Scarce Resources

[Chart: resource demand over time (x-axis 1-26, y-axis 0-120), showing re-processing spikes, weekly peaks, and steady state]

Underutilized or Scarce Resources

[Chart: provisioned capacity vs. demand (x-axis 1-26), highlighting underutilized capacity]

Contention for Same Resources

Compute bound vs. memory bound

Separation of Resources Creates Data Silos

Team A

Replication Adds to Cost

Single datacenter

So how does Amazon EMR solve these problems?

Decouple Storage and Compute

Amazon S3 is Your Persistent Data Store

• Designed for 11 9’s durability
• $0.03 / GB / month in Ireland
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS

Amazon S3

The Amazon EMR File System (EMRFS)

• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- RegexSerDe also needs WITH SERDEPROPERTIES ("input.regex" = ...); not shown on the slide
LOCATION 'samples/pig-apache/input/';

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- RegexSerDe also needs WITH SERDEPROPERTIES ("input.regex" = ...); not shown on the slide
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/';

Benefit 1: Switch Off Clusters

Amazon S3

Auto-Terminate Clusters

You Can Build a Pipeline

Run Transient or Long-Running Clusters

Benefit 2: Resize Your Cluster

Resize the Cluster

Scale Up, Scale Down, Stop a resize, issue a resize on another
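
A resize can also be issued programmatically. The following is a minimal sketch using boto3's modify_instance_groups call; the cluster ID, instance group ID, and region are placeholders you would supply, and none of them come from the slides.

```python
def build_resize_request(instance_group_id, new_count):
    """Build the InstanceGroups argument that asks EMR for a new node count."""
    if new_count < 0:
        raise ValueError("new_count must be zero or positive")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }


def resize(cluster_id, instance_group_id, new_count, region="eu-west-1"):
    """Request the resize (needs AWS credentials and a running cluster)."""
    import boto3  # imported here so the request builder works without the SDK installed

    emr = boto3.client("emr", region_name=region)
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        **build_resize_request(instance_group_id, new_count),
    )
```

The same call covers scaling up and scaling down: only the target count changes.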

How do you scale up and save cost?

Spot Instance

Bid Price

OD Price

Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 --instance-groups \
  InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
  InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
  InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Spot Integration with Amazon EMR

• Can provision instances from the Spot market
• Impact of interruption
  • Master node – Can lose the cluster
  • Core node – Can lose intermediate data
  • Task nodes – Jobs will restart on other nodes (application dependent)

Scale up with Spot Instances

10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

20 node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35

Total $105

50% less run-time (14 → 7 hours)

25% less cost ($140 → $105)
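
The arithmetic on these slides can be checked with a few lines of Python. The hourly rates ($1.00 On-Demand, $0.50 Spot) are the illustrative figures used on the slides, not real price quotes, and the function name is made up for this sketch.

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Cost of running `nodes` instances for `hours` at `hourly_rate` dollars each."""
    return nodes * hours * hourly_rate


ON_DEMAND_RATE = 1.0  # illustrative $/node-hour from the slides
SPOT_RATE = 0.5       # illustrative $/node-hour from the slides

# Baseline: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# Resized: 10 On-Demand plus 10 Spot nodes finish the same work in 7 hours.
resized = cluster_cost(10, 7, ON_DEMAND_RATE) + cluster_cost(10, 7, SPOT_RATE)

time_saving = 1 - 7 / 14
cost_saving = 1 - resized / baseline

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"run-time saving {time_saving:.0%}, cost saving {cost_saving:.0%}")
```

The saving assumes the job scales linearly with node count and that the Spot nodes are not interrupted during the run.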

Intelligent Scale Down

Effectively Utilize Clusters


Benefit 3: Logical Separation of Jobs

Hive, Pig, Cascading

Presto Ad-Hoc

Amazon S3

Benefit 4: Disaster Recovery Built In

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Demo 2 – Word Count Example

Case study: How AOL moved a 2 PB cluster to the AWS cloud

AOL Data Platforms Architecture 2014

Source Systems In-house Hadoop Cluster

Database

Reporting Tools

Data Stats & Insights

Cluster Size: 2 PB

In-House Cluster: 100 Nodes

Raw Data/Day: 2-3 TB

Data Retention: 13-24 Months

Challenges with In-House Infrastructure

Fixed Cost

Slow Deployment Cycle

Always On

Self Serve

Static: Not Scalable

Outages Impact

Production Upgrade

Storage

Compute

AOL Data Platforms Architecture 2015

Source Systems

Amazon S3

Amazon EMR

Cluster Watchdog

Amazon SNS

Amazon IAM

AWS Direct Connect

Reporting Tools

Database

EMR Design Options

• Transient vs. Persistent Cluster
• Amazon S3 vs. local HDFS
• Elastic Cluster vs. Static Cluster
• On-Demand vs. Reserved vs. Spot
• Core Nodes vs. Task Nodes

AWS vs. In-House Cost

[Chart: Cost Comparison by service, AWS vs. In-House]

In-House: 4x / month

AWS: 1x / month

Source: AOL & AWS Billing Tool

** In-House cluster includes Storage, Power and Network cost.

[Chart: Cores Nodes Demand - 06/01/2015 (x-axis 1-24)]

Restatement Use Case

• Restate historical data going back 6 months

10 Availability Zones

550 EMR Clusters

24,000 Spot EC2 Instances

[Chart: Timing Comparison, In-House vs. AWS (axis 0-70)]

Any questions?

Thank you!