
Analytics on AWS - IP Expo 2013

Date post: 15-Jan-2015
Upload: amazon-web-services
Description:
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Transcript
Page 1: Analytics on AWS - IP Expo 2013

Analytics on AWS

IP Expo 2013

Page 2: Analytics on AWS - IP Expo 2013

BIG DATA

When innovation is required to collect, store, analyze, and manage your data

Page 3: Analytics on AWS - IP Expo 2013

VOLUME

VELOCITY

VARIETY

Page 4: Analytics on AWS - IP Expo 2013

Customer Needs

• Store Any Amount of Data

– Without Capacity Planning

• Perform Complex Analysis on Any Data

– Scale on Demand

• Store Data Securely

• Decrease Time to Market

– Build Environments Quickly

• Reduce Costs

– Reduce Capital Expenditure

• Enable Global Reach

Page 5: Analytics on AWS - IP Expo 2013

Ingestion | Integration

Page 6: Analytics on AWS - IP Expo 2013

Simple Storage Service: Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability

Availability: 99.99%
Durability: 99.999999999%
Is a web store, not a file system
No single points of failure
Eventually consistent
Paradigm: Object store
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month (DUB)
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects

Elastic Block Store: High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities

Page 7: Analytics on AWS - IP Expo 2013

Peak Requests: 1.2 Million / Second

[Chart: Objects in S3 (billions). Q4 2007: 14; Q4 2008: 40; Q4 2009: 102; Q4 2010: 262; Q4 2011: 762; Q4 2012: 1300; Today: 2100]

Page 8: Analytics on AWS - IP Expo 2013

Performance & Scalability

Amazon S3 provides near linear scalability

S3 Streaming Performance
100 VMs; 9.6GB/s; $26/hr
350 VMs; 28.7GB/s; $90/hr
34 secs per terabyte

[Chart: GB/Second against Reader Connections]
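
The "34 secs per terabyte" figure can be checked against the 350-VM throughput. The sketch below is a minimal illustration of that arithmetic, assuming 1 TB = 1,000 GB (the slide does not state which convention it uses); the function names are illustrative and not part of any AWS tooling.

```python
def seconds_per_terabyte(gb_per_second, gb_per_tb=1000):
    """Time to stream one terabyte at a given aggregate throughput."""
    return gb_per_tb / gb_per_second


def mb_per_second_per_vm(gb_per_second, vm_count):
    """Average throughput contributed by each VM, in MB/s."""
    return gb_per_second / vm_count * 1000


# Figures quoted on the slide: (VM count, aggregate GB/s)
runs = [(100, 9.6), (350, 28.7)]
for vms, gbps in runs:
    print(vms, "VMs:",
          round(seconds_per_terabyte(gbps), 1), "s/TB,",
          round(mb_per_second_per_vm(gbps, vms), 1), "MB/s per VM")
```

Under that assumption the 350-VM run works out to just under 35 seconds per terabyte, in line with the slide, and per-VM throughput drops only modestly between the two runs, which is what "near linear" suggests.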

Page 9: Analytics on AWS - IP Expo 2013

Spotify uses Amazon S3 for Music Storage

• Spotify is an online music service offering instant access to over 16 million licensed songs

• Over 15 million active users and 4 million paying subscribers

• Spotify adds over 20,000 tracks a day to its catalogue

"AMAZON S3 GIVES US CONFIDENCE IN OUR ABILITY TO EXPAND STORAGE QUICKLY WHILE ALSO PROVIDING HIGH DATA DURABILITY"

-Emil Fredriksson, Operations Director for Spotify

Page 10: Analytics on AWS - IP Expo 2013

Amazon Glacier: Long term object archive
Extremely low cost per gigabyte
99.999999999% durability

Durability: 99.999999999%
Designed for archival, not a file system
Vaults & Archives
3-5 hour retrieval time
Paradigm: Archive store
Performance: Configurable - Low
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)

Page 11: Analytics on AWS - IP Expo 2013

Simple Storage Service: Highly scalable object storage
1 byte to 5TB in size
99.999999999% durability

Glacier: Long term object archive
Extremely low cost per gigabyte
99.999999999% durability

Storage Lifecycle Integration
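
To illustrate why moving older objects from S3 to Glacier matters, here is a rough cost sketch using only the per-gigabyte prices quoted on the S3 and Glacier slides of this deck. It is a simplification under stated assumptions: it ignores request, retrieval and transfer charges, and the 10,000 GB data set and the function name are hypothetical.

```python
S3_PRICE = 0.095       # $/GB/month, S3 price quoted in this deck (DUB)
GLACIER_PRICE = 0.011  # $/GB/month, Glacier price quoted in this deck


def monthly_storage_cost(total_gb, archived_fraction):
    """Monthly storage cost when a fraction of the data has moved from S3 to Glacier."""
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    archived_gb = total_gb * archived_fraction
    live_gb = total_gb - archived_gb
    return live_gb * S3_PRICE + archived_gb * GLACIER_PRICE


# Hypothetical 10,000 GB data set at different archive ratios
for fraction in (0.0, 0.5, 0.9):
    cost = monthly_storage_cost(10_000, fraction)
    print(f"{fraction:.0%} archived: ${cost:,.2f}/month")
```

With these prices, archiving half of the data roughly halves the monthly storage bill, which is the motivation for lifecycle rules that transition objects automatically.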

Page 12: Analytics on AWS - IP Expo 2013

NoSQL Data Capture

DynamoDB: Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive

[Diagram: AWS service stack: Compute, Storage, Database, App Services, Deployment & Administration and Networking on the AWS Global Infrastructure; database services shown: RDS, DynamoDB, Redshift]

Page 13: Analytics on AWS - IP Expo 2013

Dynamo Consistency

• Writes
– Writes are acknowledged (committed) once they exist in at least two physical data centers
– Writes are persisted to SSD

• Reads
– Tunable for application requirements
– No reduction in durability or consistency in order to achieve throughput

Eventually Consistent Read | Strongly Consistent Read
Stale value reads possible | No stale values read
Highest throughput | Lower potential throughput
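
The difference between the two read modes can be shown with a toy model. This is only a sketch: it assumes three replicas and a write acknowledged after two of them hold the value, as the slide describes, and it is not how DynamoDB is implemented. The class and method names are hypothetical; in the real service the choice is made per request through the `ConsistentRead` parameter.

```python
class ToyReplicatedTable:
    """Toy model only: three replicas, a write is acknowledged after two hold it."""

    def __init__(self):
        self.replicas = [{}, {}, {}]
        self.lagging = []  # writes not yet applied to the third replica

    def put(self, key, value):
        # Acknowledge once the value exists on two replicas.
        self.replicas[0][key] = value
        self.replicas[1][key] = value
        self.lagging.append((key, value))

    def propagate(self):
        # Background replication catches the third replica up.
        for key, value in self.lagging:
            self.replicas[2][key] = value
        self.lagging.clear()

    def get(self, key, consistent_read=False):
        # A strongly consistent read goes to an up-to-date replica; the
        # eventually consistent read here always hits the lagging replica,
        # which is the worst case for staleness.
        replica = self.replicas[0] if consistent_read else self.replicas[2]
        return replica.get(key)


table = ToyReplicatedTable()
table.put("song", "v1")
table.propagate()
table.put("song", "v2")
print(table.get("song", consistent_read=True))   # v2
print(table.get("song", consistent_read=False))  # v1 until propagate() runs
```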

Page 14: Analytics on AWS - IP Expo 2013

Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl ad

• Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love

• When customers hear a song or see a TV program or ad they like, they simply activate the app to "tag" it

• Shazam realized it could support over 500,000 writes per second with DynamoDB

• Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second

"AWS GAVE US THE ABILITY TO BRING A MASSIVE AMOUNT OF CAPACITY ONLINE IN A SHORT PERIOD OF TIME"

-Jason Titus, Shazam CTO

Page 15: Analytics on AWS - IP Expo 2013

Complex Data Analysis …

Parallel ETL

Page 16: Analytics on AWS - IP Expo 2013

Elastic MapReduce: Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Automated installation of Hive & Pig
Support for Spot Instances
Integrated HBase NoSQL database


Page 17: Analytics on AWS - IP Expo 2013
Page 18: Analytics on AWS - IP Expo 2013

EMR Data Sources

Page 19: Analytics on AWS - IP Expo 2013

Reducing Costs with Spot Instances

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot
Job flow duration: 14 hours
4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot
Job flow duration: 7 hours
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time Savings: 50%
Cost Savings: ~20%

Other EMR + Spot Use Cases
• Run entire cluster on Spot for biggest cost savings
• Reduce the cost of application testing
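
The two scenarios can be reproduced with a few lines of arithmetic. The sketch below uses only the instance counts, durations and hourly prices from the slide; the helper name `job_cost` is illustrative.

```python
def job_cost(fleets, hours):
    """Total cost of a job flow; fleets is a list of (instance_count, hourly_price)."""
    return sum(count * price * hours for count, price in fleets)


# Scenario #1: 4 On-Demand instances for 14 hours
on_demand_only = job_cost([(4, 0.50)], hours=14)

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances for 7 hours
with_spot = job_cost([(4, 0.50), (5, 0.25)], hours=7)

cost_savings = (on_demand_only - with_spot) / on_demand_only
time_savings = (14 - 7) / 14

print(f"Scenario #1: ${on_demand_only:.2f}")
print(f"Scenario #2: ${with_spot:.2f}")
print(f"Cost savings: {cost_savings:.2%}, time savings: {time_savings:.0%}")
```

The exact cost saving is 18.75%, which the slide rounds to roughly 20%.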

Page 20: Analytics on AWS - IP Expo 2013


Compute

Elastic Compute Cloud (EC2): Basic unit of compute capacity
Range of CPU, memory & local disk options
13 instance types available, from micro to cluster compute

Vertical Scaling
From $0.02/hr

Feature: Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2

Page 21: Analytics on AWS - IP Expo 2013
Page 22: Analytics on AWS - IP Expo 2013

Cluster Compute

1. EC2 Instance: 2nd generation cluster compute instance
Cluster Compute instances implement HVM process execution
Intel® Xeon® E5-2670 processors
10 Gigabit Ethernet

80 EC2 Compute Units
60GB RAM
3TB Local Disk

Page 23: Analytics on AWS - IP Expo 2013

Cluster Compute

2. Network placement groups: Cluster instances deployed in a 'Placement Group' enjoy low latency, full bisection 10 Gbps bandwidth

Page 24: Analytics on AWS - IP Expo 2013

CC2 Instance Cluster

240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11)

(Test performed Nov 2011, benchmark published June 2012)

Page 25: Analytics on AWS - IP Expo 2013

Cluster GPU

1. EC2 instance
GPU compute instances: Intel® Xeon® X5570 processors
2 x NVIDIA Tesla "Fermi" M2050 GPUs
I/O Performance: Very High (10 Gigabit Ethernet)

33.5 EC2 Compute Units
20GB RAM
2x NVIDIA GPU @ >400 cores each

Page 26: Analytics on AWS - IP Expo 2013

S&P Capital IQ Uses AWS for Big Data Processing

Provides data to 4200+ top global investment firms

Launched Hadoop faster, learned Hadoop faster

[Diagram: S3, Hadoop Cluster]

Page 27: Analytics on AWS - IP Expo 2013

Structured Data Management

Page 28: Analytics on AWS - IP Expo 2013


Structured Data Analysis

Relational Database Service: Managed Oracle, MySQL & SQL Server

DynamoDB: Managed NoSQL database

Amazon Redshift: Massively parallel, petabyte-scale data warehouse

Page 29: Analytics on AWS - IP Expo 2013


Structured Data Analysis

Relational Database Service: Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline

Page 30: Analytics on AWS - IP Expo 2013


Structured Data Analysis

Redshift: Managed, massively parallel, petabyte-scale data warehouse
Streaming Backup/Restore to S3
Extensive Security
2 TB -> 1.6 PB

Page 31: Analytics on AWS - IP Expo 2013

Redshift parallelizes and distributes everything

• Query
• Load
• Backup
• Restore
• Resize

[Diagram: Common BI tools connect over JDBC/ODBC to a Leader Node, which is linked to the Compute Nodes by a 10GigE mesh]

Page 32: Analytics on AWS - IP Expo 2013

Redshift lets you start small and grow big

Extra Large Node (XL)

3 spindles, 2TB, 15GiB RAM

2 virtual cores, 10GigE

Single Node (2TB)

Cluster 2-32 Nodes (4TB – 64TB)

8 Extra Large Node (8XL)

24 spindles, 16TB, 120GiB RAM

16 virtual cores, 10GigE

Cluster 2-100 Nodes (32TB – 1.6PB)
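
The node limits above translate directly into a sizing calculation. The following is a rough sketch based only on the per-node storage and cluster sizes listed on this slide; it deliberately ignores compression, replication overhead and working space, and the names `NODE_TYPES` and `cluster_options` are hypothetical.

```python
import math

# Node types as listed on the slide: storage per node and allowed node counts.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "min_nodes": 1, "max_nodes": 32},
    "8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_options(data_tb):
    """Return (node_type, node_count) pairs whose raw storage can hold data_tb."""
    options = []
    for name, spec in NODE_TYPES.items():
        needed = math.ceil(data_tb / spec["tb_per_node"])
        nodes = max(spec["min_nodes"], needed)
        if nodes <= spec["max_nodes"]:
            options.append((name, nodes))
    return options


for size_tb in (1, 50, 1600, 2000):
    print(size_tb, "TB ->", cluster_options(size_tb))
```

A 50 TB data set fits on either node type, while anything above 64 TB needs 8XL nodes and anything above 1.6 PB exceeds the limits shown here.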

Page 33: Analytics on AWS - IP Expo 2013

Important Redshift Features

No Downtime Resize

Streaming Backup/Restore to S3

Automated Point In Time Snapshotting

Workload Management

Support for VPC

Support for Encrypted Data Loads

Cluster SSL Only Communications

Page 34: Analytics on AWS - IP Expo 2013

Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.

Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.

Output Datanode: This supports all the same data sources as the input datanode, but they don't have to be the same type.


Data Pipeline
Automatically Provision EC2 & EMR Resources
Manage Dependencies & Scheduling
Automatically Retry and Notify of Success & Failure

Page 35: Analytics on AWS - IP Expo 2013

Output: S3 file

Path: s3://trend-data/#{year-month-day}.csv

Activity: EMR Transform

Hive Query: user-metrics.hql

Frequency: Daily

Input: RDS Table

Table: User-Demographics

SQL Precondition: “Select last_update from table“ > #{YY-MM-DD}

Input: DynamoDB Table

Table: User-Event-Data-#{year-month}

Success Notification: [email protected]

Failure Notification: [email protected]

Delay Notification: [email protected]

Sample Use Case
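
To make the date placeholders in the sample use case concrete, here is a small sketch that writes the pipeline down as plain Python data and expands the placeholders for one run. It is an illustration only: the dictionary layout, the `render` helper and the example run date are assumptions, and this is not the Data Pipeline definition format or its expression language.

```python
from datetime import date


def render(template, run_date):
    """Expand the slide's #{...} date placeholders (illustrative helper only)."""
    replacements = {
        "#{year-month-day}": run_date.strftime("%Y-%m-%d"),
        "#{year-month}": run_date.strftime("%Y-%m"),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


# The sample pipeline from the slide as plain data (hypothetical structure).
pipeline = {
    "inputs": [
        {"type": "RDS Table", "table": "User-Demographics"},
        {"type": "DynamoDB Table", "table": "User-Event-Data-#{year-month}"},
    ],
    "activity": {
        "type": "EMR Transform",
        "hive_query": "user-metrics.hql",
        "frequency": "Daily",
    },
    "output": {"type": "S3 file", "path": "s3://trend-data/#{year-month-day}.csv"},
}

run_date = date(2013, 10, 16)  # arbitrary example date
output_path = render(pipeline["output"]["path"], run_date)
events_table = render(pipeline["inputs"][1]["table"], run_date)
print(output_path)
print(events_table)
```

Each daily run therefore reads from the current month's event table and writes a new dated file, so earlier outputs are never overwritten.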

Page 36: Analytics on AWS - IP Expo 2013

Integrated Analytics

Page 37: Analytics on AWS - IP Expo 2013

Integrated Analytics

Page 38: Analytics on AWS - IP Expo 2013

End User Reporting

Page 39: Analytics on AWS - IP Expo 2013

End User Reporting

Redshift RDS EMR

