Leveraging Amazon Redshift for your Data Warehouse


©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved


John Loughlin, Solutions Architect @ AWS

Kyle Hubert, Principal Data Architect @ Simulmedia

Amazon Redshift
• Petabyte scale
• Massively parallel
• Relational data warehouse
• Fully managed; zero admin

A lot faster, a lot cheaper, a whole lot simpler.

[Diagram: the AWS big data pipeline. Collect: AWS Direct Connect, Amazon Kinesis, Amazon S3. Store: Amazon S3, Amazon DynamoDB, Amazon Glacier. Analyze: Amazon Redshift, Amazon EMR, Amazon EC2, AWS Data Pipeline.]

Common customer use cases

Traditional enterprise DW:
• Reduce costs by extending the DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business needs

Companies with big data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW and SW costs by an order of magnitude

Amazon.com enterprise data warehouse

• Generates weblogs @ 2 terabytes/day, growing 67% YoY
• Oracle RAC legacy system
  – Scan rate: 1 week of data/hour
  – Hit RAC node limit of 32 nodes
  – More data => slower queries
• Migrated to Amazon Redshift
  – Scan rate: 15 months of data (2.25 trillion rows) in 14 minutes
  – More than 10x performance with a 100-node cluster
  – 21 billion rows joined with 10 billion rows in under 2 hours, down from days

Amazon Redshift architecture

• Leader node
  – SQL endpoint, JDBC/ODBC
  – Stores metadata
  – Coordinates query execution
• Compute nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3
  – Load from Amazon DynamoDB or SSH
• Two hardware platforms
  – Optimized for data processing
  – DS2: HDD; scale from 2 TB to 2 PB
  – DC1: SSD; scale from 160 GB to 326 TB

[Diagram: clients connect to the leader node via JDBC/ODBC; compute nodes communicate over a 10 GigE (HPC) interconnect; ingestion, backup, and restore flow through Amazon S3.]
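To make the load paths concrete, here is a minimal COPY sketch; the events table, bucket path, DynamoDB table, and credentials are hypothetical placeholders, not from the talk:

-- Load gzipped, pipe-delimited objects under an S3 prefix,
-- in parallel across the compute nodes:
copy events
from 's3://my-bucket/events/2015/08/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '|'
gzip;

-- Read directly from a DynamoDB table, capped at 50% of its
-- provisioned read throughput:
copy events
from 'dynamodb://EventsTable'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
readratio 50;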

Amazon Redshift node types

DS2 (HDD):
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 2 PB
  – DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan
  – DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan

DC1 (SSD):
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 326 TB
  – DC1.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
  – DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage

Amazon Redshift lets you analyze all your data

• Price is nodes × hourly cost; no charge for the leader node
• 3x data compression on average
• Price includes 3 copies of data

For example, the smallest DS2 node on demand works out to $0.85/hour × 8,760 hours ≈ $7,446/year for 2 TB of compressed storage, or roughly $3,725/TB/year.

DS2 (HDD)            Price per hour, smallest single node   Effective annual price per TB compressed
On-Demand            $0.850                                 $3,725
1-Year Reservation   $0.500                                 $2,190
3-Year Reservation   $0.228                                 $999

DC1 (SSD)            Price per hour, smallest single node   Effective annual price per TB compressed
On-Demand            $0.250                                 $13,690
1-Year Reservation   $0.161                                 $8,795
3-Year Reservation   $0.100                                 $5,500

Amazon Redshift works with your analysis tools

[Diagram: BI and analysis tools connect to Amazon Redshift over JDBC/ODBC.]

Amazon Redshift is easy to use

• Provision in minutes
• Monitor query performance
• Point-and-click resize
• Automatic backup
• Built-in security

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
  – Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery

Amazon Redshift has security built-in

• Load encrypted from S3
• SSL to secure data in transit; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
  – All blocks on disks and in S3 encrypted
  – Block key, cluster key, master key (AES-256)
  – On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP

[Diagram: the cluster runs in an internal VPC and is reached from the customer VPC over JDBC/ODBC; ingestion, backup, and restore traffic flows to Amazon S3 over the 10 GigE (HPC) network.]
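To make "load encrypted from S3" concrete, here is a minimal sketch of loading client-side-encrypted files; the sales table, bucket, key, and credentials are hypothetical placeholders:

-- Files were encrypted client-side with a base64-encoded
-- AES-256 master symmetric key before upload to S3:
copy sales
from 's3://my-bucket/encrypted/sales/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>;master_symmetric_key=<base64-key>'
encrypted
delimiter '|';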

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

With row storage you do unnecessary I/O: to get the total amount, you have to read everything. With column storage, you read only the data you need.

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
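A one-query illustration: assuming the four-column table above is called payments (a hypothetical name), the aggregate below touches only the amount column's blocks, while a row store would have to read every row in full:

-- Only the amount column is scanned in a column store:
select sum(amount) from payments;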

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• COPY compresses automatically
• You can analyze and override
• More performance, less cost
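COPY samples the data and picks encodings automatically on the first load into an empty table; to override, declare an encoding per column. A minimal sketch that pins the encodings suggested by the ANALYZE COMPRESSION output above (the column types are assumptions):

-- With encodings declared, COPY loads data as-is rather than
-- re-running automatic compression analysis:
create table listing (
    listid         integer      encode delta,
    sellerid       integer      encode delta32k,
    eventid        integer      encode delta32k,
    dateid         smallint     encode bytedict,
    numtickets     smallint     encode bytedict,
    priceperticket decimal(8,2) encode delta32k,
    totalprice     decimal(8,2) encode mostly32,
    listtime       timestamp    encode raw
);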

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data

[Diagram: sorted blocks of a column with per-block zone map entries, e.g. min/max pairs 10–324, 375–623, and 637–959.]
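Zone maps pay off when data is sorted on the filter column, so each block covers a narrow range. A minimal sketch with a hypothetical clickstream table:

-- Sorting on event_time keeps each block's min/max tight:
create table clicks (
    event_time timestamp,
    user_id    integer,
    page_url   varchar(300)
)
sortkey (event_time);

-- The scan skips any block whose zone map range falls entirely
-- outside the one-week predicate:
select count(*)
from clicks
where event_time between '2015-08-01' and '2015-08-08';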

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Use local storage for performance
• Maximize scan rates
• Automatic replication and continuous backup
• HDD and SSD platforms

Amazon Redshift @ Simulmedia

"Half the money I spend on advertising is wasted; the trouble is I don't know which half."
—John Wanamaker

A data-centric approach to TV advertising

Targeted TV advertising that reaches 110 million households

Anonymous viewing data from millions of set-top boxes and smart TVs, overlaid with 3rd-party viewing data

Reinvested in our platform with Amazon Redshift:
• 10–100x improvement in performance
• Decreased time to release
• Proliferation of experiments on the data
• Business opportunity/capacity has increased exponentially; headcount for the team has remained stable

On-premises Hadoop/Hive cluster with >80 nodes storing 150 TB of data

HDFS -> S3:
• Freedom from the replication factor
• Separate archives and active data set
• Scalable performance

Production data was optimal for MPP

[Chart: MPP cost per TB per year, $0–$175,000, comparing Amazon Redshift (HDD and SSD) against solutions A–D.]

Managed service:
• Continual upgrades
• Automatic snapshotting

<1 sec to query 2 years of historical viewing data (N.B.: skinny fact table)

Flexible data discovery period:
• Better understanding of data
• Tuned facts and distributed dimensions (see the sketch after this list)

Production Amazon Redshift cluster with 3 nodes storing ~1.4 TB
Non-production Amazon Redshift cluster with 2 nodes storing ~8 TB

S3 data lake:
• Minor transformations during ingestion
• Idempotent audit tables in Amazon Redshift
• Star schema design
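A minimal sketch of the "tuned facts and distributed dimensions" pattern in a star schema; the table and column names are hypothetical, not Simulmedia's actual schema:

-- Skinny fact table: narrow rows, distributed on the join key,
-- sorted on time so zone maps prune historical scans:
create table fact_viewing (
    household_id bigint,
    spot_id      integer,
    viewed_at    timestamp
)
distkey (household_id)
sortkey (viewed_at);

-- Small dimension replicated to every node (DISTSTYLE ALL),
-- so joins against it need no data redistribution:
create table dim_spot (
    spot_id    integer,
    advertiser varchar(100),
    daypart    varchar(20)
)
diststyle all;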

Decreased our infrastructure costs

Cleaned up our architecture:
• Operational complexity removed
• Capacity planning eased

Demographics/Targeting/Forecasting: from ~1 hour to ~10 seconds
Measurement: from ~7–10 hours to ~5 minutes

SQL everywhere

Data science:
• Improve forecasting
• Improve optimizations
• Improve measurement

Analytics:
• Build new reports
• Discover more about effective spots

Best practices

Learn the Amazon Redshift Management Console:
• Set up queueing
• Set up alerts
• Track CPU utilization when debugging
• Low concurrency (1–3 queries)
• Alerts on disk usage
• Query execution details (see the sketch after this list)
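The console's query details are also available in system tables, which helps when scripting the monitoring described above. A minimal sketch using the built-in STL/SVL views (the query ID and row limit are arbitrary):

-- Recent queries the engine flagged (missing statistics,
-- nested-loop joins, very large scans, and so on):
select query, event, solution
from stl_alert_event_log
order by event_time desc
limit 10;

-- Per-step execution detail for a single query:
select *
from svl_query_summary
where query = 12345;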

• Use COPY/UNLOAD
• Remember to analyze tables for the planner
• Take advantage of compression analysis
• Use timestamp/date data types (add the timezone to the column name)
• Use varchar

A sketch of these practices in SQL follows.
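This is a minimal sketch of the loading, maintenance, and typing tips above, assuming a hypothetical viewing_events table; the bucket and credentials are placeholders:

-- Native temporal types compress well and enable zone maps;
-- the column name records that values are stored in UTC:
create table viewing_events (
    viewed_at_utc timestamp,
    market        varchar(50)  -- size varchars to fit the data
);

-- After large loads, keep planner statistics fresh and
-- revisit the suggested encodings:
analyze viewing_events;
analyze compression viewing_events;

-- UNLOAD exports query results to S3 in parallel across slices:
unload ('select * from viewing_events')
to 's3://my-bucket/exports/viewing_events_'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '|'
gzip;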

Your feedback is important to AWS. Please complete the session evaluation. Tell us what you think!
