AWS Summit Milan - Data Analysis

Date post: 14-Jan-2015
Upload: amazon-web-services
Page 1: AWS Summit Milan - Data Analysis

AWS Summit 2013 Milan, 31 October 2013

Hakan Gurel

Solutions Architecture

DATA ANALYSIS ON AWS

Page 2: AWS Summit Milan - Data Analysis
Page 3: AWS Summit Milan - Data Analysis

THE COST OF

GENERATING DATA

IS FALLING

Page 4: AWS Summit Milan - Data Analysis
Page 5: AWS Summit Milan - Data Analysis

THE MORE DATA YOU COLLECT

THE MORE VALUE YOU CAN

DERIVE FROM IT

Page 6: AWS Summit Milan - Data Analysis
Page 7: AWS Summit Milan - Data Analysis
Page 8: AWS Summit Milan - Data Analysis
Page 9: AWS Summit Milan - Data Analysis
Page 10: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

Lower cost,

higher throughput

Highly

constrained

Page 11: AWS Summit Milan - Data Analysis

Generated data

Available for analysis

DATA VOLUME

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 12: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

ACCELERATE

Page 13: AWS Summit Milan - Data Analysis

+ ELASTIC AND HIGHLY SCALABLE

+ NO UPFRONT CAPITAL EXPENSE

+ PAY FOR ONLY WHAT YOU USE

+ AVAILABLE ON-DEMAND

= REMOVE CONSTRAINTS

Page 14: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

AWS Import / Export

AWS Direct Connect

Page 15: AWS Summit Milan - Data Analysis

Generated and stored in AWS

Inbound data transfer is free

Multipart upload to S3

Physical media

AWS Direct Connect

Regional replication of AMIs and snapshots
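
A minimal sketch of how a multipart upload to S3 divides a large object into parts. The function name `plan_parts` and the 64 MiB default part size are assumptions made for illustration; the 5 MiB minimum part size (for every part except the last) and the 10,000-part ceiling are S3's documented limits. The code only plans byte ranges and does not call any AWS API.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=64 * 1024 * 1024):
    """Return a list of (part_number, first_byte, last_byte_inclusive)."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the object would otherwise need too many parts.
    while -(-object_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts = []
    start = 0
    number = 1
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        parts.append((number, start, end))
        start = end + 1
        number += 1
    return parts


if __name__ == "__main__":
    # A 150 MiB object is planned as two 64 MiB parts plus a shorter last part.
    for part in plan_parts(150 * 1024 * 1024):
        print(part)
```

Each planned range could then be uploaded independently and in parallel, which is what makes multipart upload attractive for large inbound transfers.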

Page 16: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon Glacier,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

AWS Storage Gateway,

Data on Amazon EC2

Page 17: AWS Summit Milan - Data Analysis

AMAZON S3 SIMPLE STORAGE SERVICE

Page 18: AWS Summit Milan - Data Analysis
Page 19: AWS Summit Milan - Data Analysis

AMAZON

DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED

NoSQL DATABASE SERVICE

Page 20: AWS Summit Milan - Data Analysis

DURABLE &

AVAILABLE CONSISTENT, DISK-ONLY

WRITES (SSD)

Page 21: AWS Summit Milan - Data Analysis

LOW LATENCY AVERAGE READS < 5MS,

WRITES < 10MS

Page 22: AWS Summit Milan - Data Analysis

NO ADMINISTRATION

Page 23: AWS Summit Milan - Data Analysis

500,000 WRITES PER SECOND

DURING SUPER BOWL

Page 24: AWS Summit Milan - Data Analysis

AMAZON

REDSHIFT FULLY MANAGED, PETABYTE-SCALE

DATA WAREHOUSE ON AWS

Page 25: AWS Summit Milan - Data Analysis
Page 26: AWS Summit Milan - Data Analysis

DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…

AMAZON REDSHIFT

A Whole Lot Simpler

A Lot Cheaper

A Lot Faster

Page 27: AWS Summit Milan - Data Analysis

AMAZON REDSHIFT

RUNS ON OPTIMIZED HARDWARE

HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage

Page 28: AWS Summit Milan - Data Analysis
Page 29: AWS Summit Milan - Data Analysis
Page 30: AWS Summit Milan - Data Analysis

30 MINUTES

DOWN TO

12 SECONDS

Page 31: AWS Summit Milan - Data Analysis
Page 32: AWS Summit Milan - Data Analysis

Extra Large Node (HS1.XL)

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

AMAZON REDSHIFT LETS YOU

START SMALL AND GROW BIG

Eight Extra Large Node (HS1.8XL)

Cluster 2-100 Nodes (32 TB – 1.6 PB)
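
The node sizes and cluster limits on this slide can be turned into a small sizing helper. This is a sketch under stated assumptions: the function name `smallest_cluster` is hypothetical, and preferring the smaller node type whenever it fits is an illustrative policy, not a sizing recommendation from the talk. The per-node capacities and node-count ranges are the 2013 figures from the slide.

```python
import math

# Node types and cluster limits as listed on the slide (2013 figures).
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_cluster": 2, "max_cluster": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_cluster": 2, "max_cluster": 100},
}


def smallest_cluster(required_tb):
    """Return (node_type, node_count) for the first configuration that fits."""
    if required_tb <= 0:
        raise ValueError("required_tb must be positive")
    # A single HS1.XL node covers up to 2 TB.
    if required_tb <= NODE_TYPES["HS1.XL"]["tb_per_node"]:
        return ("HS1.XL", 1)
    # Assumption: try the smaller node type first, then the larger one.
    for name in ("HS1.XL", "HS1.8XL"):
        spec = NODE_TYPES[name]
        nodes = max(spec["min_cluster"],
                    math.ceil(required_tb / spec["tb_per_node"]))
        if nodes <= spec["max_cluster"]:
            return (name, nodes)
    raise ValueError("requirement exceeds the 1.6 PB maximum on the slide")


if __name__ == "__main__":
    for tb in (1, 10, 64, 65, 1600):
        print(tb, "TB ->", smallest_cluster(tb))
```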

Page 33: AWS Summit Milan - Data Analysis

CREATE A DATA WAREHOUSE IN

MINUTES

Page 34: AWS Summit Milan - Data Analysis
Page 35: AWS Summit Milan - Data Analysis
Page 36: AWS Summit Milan - Data Analysis
Page 37: AWS Summit Milan - Data Analysis
Page 38: AWS Summit Milan - Data Analysis

JDBC/ODBC

Page 39: AWS Summit Milan - Data Analysis
Page 40: AWS Summit Milan - Data Analysis
Page 41: AWS Summit Milan - Data Analysis
Page 42: AWS Summit Milan - Data Analysis

                     Price Per Hour for    Effective Hourly   Effective Annual
                     HS1.XL Single Node    Price Per TB       Price Per TB

On-Demand            $0.850                $0.425             $3,723

1 Year Reservation   $0.500                $0.250             $2,190

3 Year Reservation   $0.228                $0.114             $999
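
The effective per-TB prices above follow from the node price and the 2 TB of compressed storage on an HS1.XL node. A short sketch of that arithmetic, assuming a 365-day year (8,760 hours); the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, assuming a 365-day year
NODE_STORAGE_TB = 2         # compressed storage per HS1.XL node (from the slides)

# Hourly price of a single HS1.XL node, in USD, as listed on the slide.
NODE_PRICE_PER_HOUR = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(node_price_per_hour):
    """Return (hourly price per TB, annual price per TB) in USD."""
    hourly_per_tb = node_price_per_hour / NODE_STORAGE_TB
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


if __name__ == "__main__":
    for plan, price in NODE_PRICE_PER_HOUR.items():
        hourly, annual = effective_prices(price)
        print(f"{plan}: ${hourly:.3f}/TB/hour, ${annual:,.0f}/TB/year")
```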

Page 43: AWS Summit Milan - Data Analysis

DATA WAREHOUSING DONE THE AWS WAY

No upfront costs, pay as you go

Really fast performance at a really low price

Open and flexible with support for popular tools

Easy to provision and scale up massively

Page 44: AWS Summit Milan - Data Analysis

USAGE SCENARIOS

Page 45: AWS Summit Milan - Data Analysis

S3

EMR

Redshift

Reporting and BI

• Maintain online SQL access to historical logs

• Transformation and enrichment with EMR

• Longer history ensures better insight

Cloud ETL for Big Data

Page 46: AWS Summit Milan - Data Analysis

Live archive for (structured) Big Data

• Direct integration with copy command

• High velocity data

• Data ages into Redshift

• Low cost, high scale option for new apps
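
The "copy command" bullet refers to Redshift's `COPY` statement, which can read directly from a DynamoDB table. The sketch below only builds the SQL text. The helper name and the table and role names in the usage example are hypothetical, and the `IAM_ROLE` clause is the present-day way to authorize the load, which may differ from what was shown at the 2013 talk. `READRATIO` is the percentage of the DynamoDB table's provisioned read throughput the load may consume.

```python
def build_dynamodb_copy(target_table, dynamodb_table, iam_role_arn, read_ratio=50):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    The values are interpolated directly into the SQL text, so this helper
    must only be used with trusted, hard-coded identifiers.
    """
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {target_table} "
        f"FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"READRATIO {read_ratio};"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_dynamodb_copy(
        "page_views",
        "PageViews",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    ))
```

The resulting statement would be run through any SQL client connected to the cluster; a lower `READRATIO` leaves more DynamoDB throughput for the live application.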

DynamoDB Redshift

OLTP Web Apps Reporting

and BI

Page 47: AWS Summit Milan - Data Analysis

Reporting Warehouse

• Accelerated operational reporting

• Support for short-term use cases

• Data compression, index redundancy

RDBMS Redshift

OLTP ERP Reporting

and BI

Page 48: AWS Summit Milan - Data Analysis

+

RDBMS Redshift

OLTP

ERP Reporting

& BI

On-Premises Integration

Page 49: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon Elastic

MapReduce

Page 50: AWS Summit Milan - Data Analysis

AMAZON EC2 ELASTIC COMPUTE CLOUD

Page 51: AWS Summit Milan - Data Analysis
Page 52: AWS Summit Milan - Data Analysis

CLUSTER GPU

QUADRUPLE EXTRA LARGE

Intel Xeon X5570, quad-core

Nehalem architecture

NVIDIA Tesla Fermi

M2050 GPUs

22 GB of memory – 1.7 TB of storage

2x

2x

Page 53: AWS Summit Milan - Data Analysis

ON A SINGLE INSTANCE

COMPUTE TIME: 4h

COST: 4h x $2.1 = $8.4

Page 54: AWS Summit Milan - Data Analysis

ON MULTIPLE INSTANCES

COMPUTE TIME: 1h

COST: 1h x 4 x $2.1 = $8.4
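
The two cost slides make the point that, with hourly pricing, four instances for one hour cost the same as one instance for four hours. A sketch of that arithmetic, assuming the job parallelizes perfectly and that each instance is billed in whole hours; the function name `job_cost` is illustrative.

```python
import math

HOURLY_RATE = 2.10  # USD per instance-hour, from the slide


def job_cost(total_instance_hours, instances, hourly_rate=HOURLY_RATE):
    """Return (wall_clock_hours, cost_usd).

    Assumes the work divides evenly across instances (perfect scaling)
    and that every instance is billed for whole hours.
    """
    wall_clock_hours = total_instance_hours / instances
    billed_hours = math.ceil(wall_clock_hours)
    cost = round(billed_hours * instances * hourly_rate, 2)
    return wall_clock_hours, cost


if __name__ == "__main__":
    print(job_cost(4, 1))  # one instance: 4 hours of wall-clock time
    print(job_cost(4, 4))  # four instances: 1 hour of wall-clock time
```

Under these assumptions the cost stays at $8.40 while the time to result drops from four hours to one.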

Page 55: AWS Summit Milan - Data Analysis
Page 56: AWS Summit Milan - Data Analysis

For 3 hours

$4828.85/hr

instead of $20+ MILLION in infrastructure
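
For scale, the figures on this slide can be multiplied out. The sketch below treats "$20+ million" as a lower bound of exactly $20,000,000, which is an assumption made only to get a ratio; the names are illustrative.

```python
from decimal import Decimal

HOURLY_CLUSTER_COST = Decimal("4828.85")       # USD per hour, from the slide
RUN_HOURS = 3                                   # from the slide
INFRASTRUCTURE_COST = Decimal("20000000")      # lower bound of "$20+ million"


def run_cost(hours=RUN_HOURS, hourly=HOURLY_CLUSTER_COST):
    """Total on-demand cost of the run in USD."""
    return hours * hourly


if __name__ == "__main__":
    total = run_cost()
    print(f"Run cost: ${total:,.2f}")
    print(f"Infrastructure is at least {INFRASTRUCTURE_COST / total:,.0f}x more")
```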

Page 57: AWS Summit Milan - Data Analysis

AMAZON ELASTIC

MAPREDUCE HADOOP AS A SERVICE

Page 58: AWS Summit Milan - Data Analysis

• A FRAMEWORK

• SPLITS DATA INTO PIECES

• LETS PROCESSING OCCUR

• GATHERS THE RESULTS
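
The three steps above (split, process, gather) can be sketched in a few lines. This is a toy word count in a single process, not Hadoop or the EMR API; in a real cluster each piece would be processed on a different node in parallel. All function names are illustrative.

```python
from collections import Counter
from functools import reduce


def split(records, pieces):
    """Split a list into at most `pieces` roughly equal chunks."""
    size = max(1, -(-len(records) // pieces))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_piece(lines):
    """Process one chunk independently: count the words it contains."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, pieces=4):
    """Split the input, process each piece, then gather the partial results."""
    partials = [map_piece(piece) for piece in split(lines, pieces)]
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    logs = ["GET /lyrics", "GET /artists", "POST /lyrics", "GET /lyrics"]
    print(word_count(logs, pieces=2).most_common(2))
```

Because each piece is processed without looking at the others, adding nodes shortens the run without changing the result.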

Page 59: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

Application data

and logs for

analysis pushed

to S3

Page 60: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

Amazon Elastic

MapReduce

name node to

control analysis

N

Page 61: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

Hadoop cluster

started by Elastic

MapReduce

N

Page 62: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

N

Adding many

hundreds or

thousands of

nodes

Page 63: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

N

Disposed of when

job completes

Page 64: AWS Summit Milan - Data Analysis

Corporate Data

Center

Elastic Data

Center

Results of

analysis pulled

back into your

systems

Page 65: AWS Summit Milan - Data Analysis
Page 66: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

Data on Amazon EC2

Page 67: AWS Summit Milan - Data Analysis
Page 68: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

AWS Data Pipeline

Page 69: AWS Summit Milan - Data Analysis

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage compute resources
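
Two of the ideas listed here, mapping data dependencies and retrying failed steps, can be illustrated generically. The sketch below is not the AWS Data Pipeline API; it is a small in-process runner with hypothetical names, using the standard library's `graphlib` (Python 3.9 or later) to order the steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies, max_attempts=3):
    """Run tasks in dependency order, retrying each one on failure.

    tasks: dict of task name -> zero-argument callable.
    dependencies: dict of task name -> set of names that must finish first.
    Returns the task names in the order they completed.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                # Give up only after the last allowed attempt.
                if attempt == max_attempts:
                    raise
    return completed


if __name__ == "__main__":
    steps = {
        "load": lambda: print("load into warehouse"),
        "extract": lambda: print("extract logs"),
        "transform": lambda: print("transform logs"),
    }
    deps = {"transform": {"extract"}, "load": {"transform"}}
    print(run_pipeline(steps, deps))
```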

Page 70: AWS Summit Milan - Data Analysis
Page 71: AWS Summit Milan - Data Analysis
Page 72: AWS Summit Milan - Data Analysis
Page 73: AWS Summit Milan - Data Analysis

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon Glacier,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

AWS Storage Gateway,

Data on Amazon EC2

AWS Import / Export

AWS Direct Connect

Amazon S3,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

Data on Amazon EC2

Amazon EC2

Amazon Elastic

MapReduce

AWS Data Pipeline

Page 74: AWS Summit Milan - Data Analysis

FROM DATA TO

ACTIONABLE

INFORMATION

Page 75: AWS Summit Milan - Data Analysis
Page 76: AWS Summit Milan - Data Analysis

Stefano Rodighiero

Page 77: AWS Summit Milan - Data Analysis

Words matter

Catalogue of 7+ million lyrics in more than 50 distinct languages

Music Discography Metadata: Lyrics, Artists, Albums, Songs, Biographies, Worldwide Charts

MXM FACTS

Synced lyrics! Updated daily, with more than 1 million artists and more than 20 million music tracks

Currently, musiXmatch is the only lyrics platform cleared for worldwide licensing, and it has deals with top music publishers: Warner Chappell, Universal, BMG, EMI Publishing, Sony ATV, Peer Music, ...

Page 78: AWS Summit Milan - Data Analysis

SYNCED LYRICS

Page 79: AWS Summit Milan - Data Analysis

MUSIC

METADATA:

RECORDING &

PUBLISHING

OUR DATA

Page 80: AWS Summit Milan - Data Analysis

CONTENT USAGE

OUR DATA

Page 81: AWS Summit Milan - Data Analysis

OTHER SOURCES

OUR DATA

Page 82: AWS Summit Milan - Data Analysis

CONTENT

USAGE:

REPORTING &

ANALYTICS

DATA ANALYSIS @ MXM

Page 83: AWS Summit Milan - Data Analysis

Words matter

Publishing catalogue

"Unrolling" Frontend Filter/normalization

Hive

Post process

Batch

Analytics

Redshift

DATAFLOW

Redis

(real time analytics)

Analytics

Redshift

Page 84: AWS Summit Milan - Data Analysis

Words matter

Hive

Post process

Batch

BATCH REPORTING

Step 1. Aggregation of views by country,

application and content type

Step 2. Join with a 500M+ rows table

It takes approx 1 hour with 5 c1.xlarge

instances

It used to take days with traditional techniques!

SQL interface makes it easier to review and

share the process
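
The two steps of the batch report can be expressed as SQL, which is the point of the last bullet. The sketch below uses the standard library's `sqlite3` as a stand-in for Hive so that it runs anywhere; the table layout, the column names, and the idea that the large table is a publisher catalogue keyed by track are assumptions, not musiXmatch's actual schema.

```python
import sqlite3


def views_report(views, catalogue):
    """Aggregate views, then join the aggregate with a catalogue table.

    views: rows of (track_id, country, application, content_type).
    catalogue: rows of (track_id, publisher).
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE views (track_id INTEGER, country TEXT, "
               "application TEXT, content_type TEXT)")
    db.execute("CREATE TABLE catalogue (track_id INTEGER PRIMARY KEY, "
               "publisher TEXT)")
    db.executemany("INSERT INTO views VALUES (?, ?, ?, ?)", views)
    db.executemany("INSERT INTO catalogue VALUES (?, ?)", catalogue)
    rows = db.execute("""
        WITH aggregated AS (              -- Step 1: aggregation of views
            SELECT track_id, country, application, content_type,
                   COUNT(*) AS n
            FROM views
            GROUP BY track_id, country, application, content_type
        )
        SELECT c.publisher, a.country, a.application, a.content_type,
               SUM(a.n) AS views          -- Step 2: join with the big table
        FROM aggregated AS a
        JOIN catalogue AS c ON c.track_id = a.track_id
        GROUP BY c.publisher, a.country, a.application, a.content_type
        ORDER BY c.publisher, a.country, a.application, a.content_type
    """).fetchall()
    db.close()
    return rows


if __name__ == "__main__":
    sample_views = [(1, "IT", "ios", "lyrics"), (1, "IT", "ios", "lyrics"),
                    (2, "US", "web", "lyrics")]
    sample_catalogue = [(1, "Publisher A"), (2, "Publisher B")]
    for row in views_report(sample_views, sample_catalogue):
        print(row)
```

Aggregating before the join keeps the number of rows that reach the large table small, which is the usual reason to order the steps this way.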

Page 85: AWS Summit Milan - Data Analysis

Words matter

Publishing catalogue

"Unrolling" Frontend proxy

Filter/normalization

Hive

Post process

DATAFLOW

Redis

(real time analytics)

Analytics

Redshift

Interactive

Page 86: AWS Summit Milan - Data Analysis

Words matter

INTERACTIVE ANALYTICS

SQL interface like Hive, accessible with any

PostgreSQL client...

...but faster!

Flexible costs

With Redshift doing all the heavy lifting, it's

easier to build analytics tools

Redis

(real time analytics)

Analytics

Redshift

Interactive

Page 87: AWS Summit Milan - Data Analysis

Words matter

Publishing catalogue

"Unrolling" Frontend proxy

Filter/normalization

Hive

Post process

Batch

DATAFLOW

Redis

(real time analytics)

Analytics

Redshift

Interactive

Page 88: AWS Summit Milan - Data Analysis

Words matter

Stefano Rodighiero

@larsen

MUSIXMATCH

Page 89: AWS Summit Milan - Data Analysis

THANK YOU!

MUSIXMATCH

Page 90: AWS Summit Milan - Data Analysis
Page 91: AWS Summit Milan - Data Analysis

THANK YOU

