Leveraging Amazon Redshift for your Data Warehouse


©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved


John Loughlin, Solutions Architect @ AWS

Kyle Hubert, Principal Data Architect @ Simulmedia

Amazon Redshift
• Petabyte scale
• Massively parallel
• Relational data warehouse
• Fully managed; zero admin

A lot faster, a lot cheaper, a whole lot simpler.

[Diagram: the AWS big data pipeline. Collect: AWS Direct Connect, Amazon Kinesis, Amazon S3. Store: Amazon S3, Amazon DynamoDB, Amazon Glacier. Analyze: Amazon Redshift, Amazon EMR, Amazon EC2, AWS Data Pipeline.]

Common customer use cases

Traditional enterprise DW:
• Reduce costs by extending the DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business needs

Companies with big data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW and SW costs by an order of magnitude

Amazon.com enterprise data warehouse

• Generates weblogs @ 2 terabytes/day, growing 67% YoY
• Oracle RAC legacy system
  – Scan rate: 1 week of data/hour
  – Hit RAC node limit of 32 nodes
  – More data => slower queries
• Migrated to Amazon Redshift
  – Scan rate: 15 months of data (2.25 trillion rows) in 14 minutes
  – More than 10x performance with a 100-node cluster
  – 21 billion rows joined with 10 billion rows in under 2 hours, down from days

Amazon Redshift architecture

• Leader node
  – SQL endpoint, JDBC/ODBC
  – Stores metadata
  – Coordinates query execution
• Compute nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3
  – Load from Amazon DynamoDB or SSH
• Two hardware platforms
  – Optimized for data processing
  – DS2: HDD; scale from 2 TB to 2 PB
  – DC1: SSD; scale from 160 GB to 326 TB

[Diagram: clients connect to the leader node via JDBC/ODBC; compute nodes communicate over a 10 GigE (HPC) interconnect; ingestion, backup, and restore flow through Amazon S3.]
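To make the load paths concrete, here is a minimal COPY sketch; the events table, bucket path, DynamoDB table, and credentials are hypothetical placeholders, not from the talk:

-- Load gzipped, pipe-delimited objects under an S3 prefix,
-- in parallel across the compute nodes:
copy events
from 's3://my-bucket/events/2015/08/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '|'
gzip;

-- Read directly from a DynamoDB table, capped at 50% of its
-- provisioned read throughput:
copy events
from 'dynamodb://EventsTable'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
readratio 50;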

Amazon Redshift node types

DS2 (HDD):
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 2 PB
  – DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan
  – DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan

DC1 (SSD):
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 326 TB
  – DC1.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
  – DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage

Amazon Redshift lets you analyze all your data

• Price is nodes × hourly cost; no charge for the leader node
• 3x data compression on average
• Price includes 3 copies of data

For example, the smallest DS2 node on demand works out to $0.85/hour × 8,760 hours ≈ $7,446/year for 2 TB of compressed storage, or roughly $3,725/TB/year.

DS2 (HDD)            Price per hour, smallest single node   Effective annual price per TB compressed
On-Demand            $0.850                                 $3,725
1-Year Reservation   $0.500                                 $2,190
3-Year Reservation   $0.228                                 $999

DC1 (SSD)            Price per hour, smallest single node   Effective annual price per TB compressed
On-Demand            $0.250                                 $13,690
1-Year Reservation   $0.161                                 $8,795
3-Year Reservation   $0.100                                 $5,500

Amazon Redshift works with your analysis tools

[Diagram: BI and analysis tools connect to Amazon Redshift over JDBC/ODBC.]

Amazon Redshift is easy to use

• Provision in minutes
• Monitor query performance
• Point-and-click resize
• Automatic backup
• Built-in security

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
  – Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery

Amazon Redshift has security built-in

• Load encrypted from S3
• SSL to secure data in transit; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
  – All blocks on disks and in S3 encrypted
  – Block key, cluster key, master key (AES-256)
  – On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP

[Diagram: the cluster runs in an internal VPC and is reached from the customer VPC over JDBC/ODBC; ingestion, backup, and restore traffic flows to Amazon S3 over the 10 GigE (HPC) network.]
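To make "load encrypted from S3" concrete, here is a minimal sketch of loading client-side-encrypted files; the sales table, bucket, key, and credentials are hypothetical placeholders:

-- Files were encrypted client-side with a base64-encoded
-- AES-256 master symmetric key before upload to S3:
copy sales
from 's3://my-bucket/encrypted/sales/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>;master_symmetric_key=<base64-key>'
encrypted
delimiter '|';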

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

With row storage you do unnecessary I/O: to get the total amount, you have to read everything. With column storage, you read only the data you need.

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
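A one-query illustration: assuming the four-column table above is called payments (a hypothetical name), the aggregate below touches only the amount column's blocks, while a row store would have to read every row in full:

-- Only the amount column is scanned in a column store:
select sum(amount) from payments;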

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• COPY compresses automatically
• You can analyze and override
• More performance, less cost
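COPY samples the data and picks encodings automatically on the first load into an empty table; to override, declare an encoding per column. A minimal sketch that pins the encodings suggested by the ANALYZE COMPRESSION output above (the column types are assumptions):

-- With encodings declared, COPY loads data as-is rather than
-- re-running automatic compression analysis:
create table listing (
    listid         integer      encode delta,
    sellerid       integer      encode delta32k,
    eventid        integer      encode delta32k,
    dateid         smallint     encode bytedict,
    numtickets     smallint     encode bytedict,
    priceperticket decimal(8,2) encode delta32k,
    totalprice     decimal(8,2) encode mostly32,
    listtime       timestamp    encode raw
);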

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data

[Diagram: sorted blocks of a column with per-block zone map entries, e.g. min/max pairs 10–324, 375–623, and 637–959.]
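Zone maps pay off when data is sorted on the filter column, so each block covers a narrow range. A minimal sketch with a hypothetical clickstream table:

-- Sorting on event_time keeps each block's min/max tight:
create table clicks (
    event_time timestamp,
    user_id    integer,
    page_url   varchar(300)
)
sortkey (event_time);

-- The scan skips any block whose zone map range falls entirely
-- outside the one-week predicate:
select count(*)
from clicks
where event_time between '2015-08-01' and '2015-08-08';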

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Use local storage for performance
• Maximize scan rates
• Automatic replication and continuous backup
• HDD and SSD platforms

Amazon Redshift @ Simulmedia

"Half the money I spend on advertising is wasted; the trouble is I don't know which half."
—John Wanamaker

A data-centric approach to TV advertising

Targeted TV advertising that reaches 110 million households

Anonymous viewing data from millions of set-top boxes and smart TVs, overlaid with 3rd-party viewing data

Reinvested in our platform with Amazon Redshift:
• 10–100x improvement in performance
• Decreased time to release
• Proliferation of experiments on the data
• Business opportunity/capacity has increased exponentially; headcount for the team has remained stable

On-premises Hadoop/Hive cluster with >80 nodes storing 150 TB of data

HDFS -> S3:
• Freedom from the replication factor
• Separate archives and active data set
• Scalable performance

Production data was optimal for MPP

[Chart: MPP cost per TB per year, $0–$175,000, comparing Amazon Redshift (HDD and SSD) against solutions A–D.]

Managed service:
• Continual upgrades
• Automatic snapshotting

<1 sec to query 2 years of historical viewing data (N.B.: skinny fact table)

Flexible data discovery period:
• Better understanding of data
• Tuned facts and distributed dimensions (see the sketch after this list)

Production Amazon Redshift cluster with 3 nodes storing ~1.4 TB
Non-production Amazon Redshift cluster with 2 nodes storing ~8 TB

S3 data lake:
• Minor transformations during ingestion
• Idempotent audit tables in Amazon Redshift
• Star schema design
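A minimal sketch of the "tuned facts and distributed dimensions" pattern in a star schema; the table and column names are hypothetical, not Simulmedia's actual schema:

-- Skinny fact table: narrow rows, distributed on the join key,
-- sorted on time so zone maps prune historical scans:
create table fact_viewing (
    household_id bigint,
    spot_id      integer,
    viewed_at    timestamp
)
distkey (household_id)
sortkey (viewed_at);

-- Small dimension replicated to every node (DISTSTYLE ALL),
-- so joins against it need no data redistribution:
create table dim_spot (
    spot_id    integer,
    advertiser varchar(100),
    daypart    varchar(20)
)
diststyle all;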

Decreased our infrastructure costs

Cleaned up our architecture:
• Operational complexity removed
• Capacity planning eased

Demographics/Targeting/Forecasting: from ~1 hour to ~10 seconds
Measurement: from ~7–10 hours to ~5 minutes

SQL everywhere

Data science:
• Improve forecasting
• Improve optimizations
• Improve measurement

Analytics:
• Build new reports
• Discover more about effective spots

Best practices

Learn the Amazon Redshift Management Console:
• Set up queueing
• Set up alerts
• Track CPU utilization when debugging
• Low concurrency (1–3 queries)
• Alerts on disk usage
• Query execution details (see the sketch after this list)
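The console's query details are also available in system tables, which helps when scripting the monitoring described above. A minimal sketch using the built-in STL/SVL views (the query ID and row limit are arbitrary):

-- Recent queries the engine flagged (missing statistics,
-- nested-loop joins, very large scans, and so on):
select query, event, solution
from stl_alert_event_log
order by event_time desc
limit 10;

-- Per-step execution detail for a single query:
select *
from svl_query_summary
where query = 12345;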

• Use COPY/UNLOAD
• Remember to analyze tables for the planner
• Take advantage of compression analysis
• Use timestamp/date data types (add the timezone to the column name)
• Use varchar

A sketch of these practices in SQL follows.
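This is a minimal sketch of the loading, maintenance, and typing tips above, assuming a hypothetical viewing_events table; the bucket and credentials are placeholders:

-- Native temporal types compress well and enable zone maps;
-- the column name records that values are stored in UTC:
create table viewing_events (
    viewed_at_utc timestamp,
    market        varchar(50)  -- size varchars to fit the data
);

-- After large loads, keep planner statistics fresh and
-- revisit the suggested encodings:
analyze viewing_events;
analyze compression viewing_events;

-- UNLOAD exports query results to S3 in parallel across slices:
unload ('select * from viewing_events')
to 's3://my-bucket/exports/viewing_events_'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '|'
gzip;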

Your feedback is important to AWS. Please complete the session evaluation. Tell us what you think!
