Getting started with Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ian Meyers, Principal Solution Architect, AWS

July 7th, 2016


Diagram: AWS services grouped under Collect, Store, and Analyze: Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Import/Export, AWS Direct Connect, AWS Database Migration Service, AWS Data Pipeline, Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, Amazon CloudSearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon QuickSight, Amazon CloudWatch

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

a lot faster

a lot simpler

a lot cheaper

The Amazon Redshift view of data warehousing

Enterprise: 10x cheaper; easy to provision; higher DBA productivity

Big data: 10x faster; no programming; easily leverage BI tools, Hadoop, machine learning, streaming

SaaS: analysis inline with process flows; pay as you go, grow as you need; managed availability and disaster recovery

Selected Amazon Redshift customers

Amazon Redshift architecture

Leader node

Simple SQL endpoint

Stores metadata

Optimizes query plan

Coordinates query execution

Compute nodes

Local columnar storage

Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore

Benefit #1: Amazon Redshift is fast

Parallel and distributed

Query

Load

Export

Backup

Restore

Resize

Benefit #1: Amazon Redshift is fast

Dense Storage DS2 (HDD) instance type

Improved memory 2x, compute 2x, disk throughput 1.5x

Cost: Same as our prior generation DS1!

Performance improvement: 50%

Enhanced I/O and commit improvements (Jan ’16)

Reduce amount of time to commit data

Performance improvement: 35%

Benefit #2: Amazon Redshift is inexpensive

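 DS2 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On-demand          | $0.850                                | $3,725
 1 year reservation | $0.500                                | $2,190
 3 year reservation | $0.228                                | $999

 DC1 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On-demand          | $0.250                                | $13,690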



Pricing is simple

Number of nodes x price/hour

No charge for leader node

No upfront costs

Pay as you go
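The pricing rule above (number of nodes x price/hour, with no charge for the leader node) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the 2 TB-per-node figure is an assumption taken from the "scale from 2 TB" line on the architecture slide, and the year is treated as 8,760 hours, so the results only approximate the rounded figures in the price table.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; leap years ignored


def cluster_hourly_cost(compute_nodes: int, price_per_node_hour: float) -> float:
    """Hourly cluster cost: compute nodes x price/hour (the leader node is free)."""
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    return compute_nodes * price_per_node_hour


def annual_price_per_tb(price_per_node_hour: float, tb_per_node: float) -> float:
    """Effective annual price per TB of node storage for a node that runs all year."""
    return price_per_node_hour * HOURS_PER_YEAR / tb_per_node


if __name__ == "__main__":
    # Hourly rates from the DS2 price table; 2.0 TB per node is an assumption.
    rates = [
        ("On-demand", 0.850),
        ("1 year reservation", 0.500),
        ("3 year reservation", 0.228),
    ]
    for label, rate in rates:
        print(f"{label}: ${annual_price_per_tb(rate, 2.0):,.0f} per TB per year")
    print(f"4-node on-demand cluster: ${cluster_hourly_cost(4, 0.850):.2f}/hour")
```

The three-year reservation works out to roughly $999 per TB per year, which is where the "$1,000/TB/year" headline figure comes from.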

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups to Amazon S3

Continuous and incremental backups across regions

Streaming restore


Benefit #3: Amazon Redshift is fully managed


Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/region level disasters

Benefit #4: Security is built-in

• Load encrypted from S3

• SSL to secure data in transit

ECDHE perfect forward secrecy

• Amazon VPC for network isolation

• Encryption to secure data at rest

All blocks on disks and in S3 encrypted

Block key, cluster key, master key (AES-256)

On-premises HSM & AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup & Restore

Benefit #5: We innovate quickly

Well over 100 new features added since launch

Release every two weeks

Automatic patching

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

DUB (4/25)

SOC1/2/3 (5/8)

Unload Encrypted Files

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

SHA1 Builtin (7/15)

4 byte UTF-8 (7/18)

Sharing snapshots (7/18)

Statement Timeout (7/22)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

Resource Level IAM (8/9)

PCI (8/22)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)

EIP Support for VPC Clusters (12/28)

New query monitoring system tables and diststyle all (1/13)

Redshift on DW2 (SSD) Nodes (1/23)

Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)

Resize progress indicator & Cluster Version (3/21)

Regex_Substr, COPY from JSON (3/25)

50 slots, COPY from EMR, ECDHE ciphers (4/22)

3 new regex features, Unload to single file, FedRAMP (5/6)

Rename Cluster (6/2)

Copy from multiple regions, percentile_cont, percentile_disc (6/30)

Free Trial (7/1)

pg_last_unload_count (9/15)

AES-128 S3 encryption (9/29)

UTF-16 support (9/29)

Benefit #6: Amazon Redshift has a large ecosystem

Data integration, Business intelligence, Systems integrators

Getting started

Enter cluster details

Select node configuration

Select security settings and provision

Point-and-click resize

Resize

Resize while remaining online

Provision a new cluster in the background

Copy data in parallel from node to node

You are only charged for the source cluster

Data modeling

3 Important Details…

Column Encoding

• Applied on First Data Load Automatically

• Ensure correct encoding is used

• Periodically revisit encodings in case of change

Data Distribution

• Even, Key Based, or Replicated distribution of data is available

• Focus on colocation of data to limit network transfer

• View network transfer information in Explain Plan

Data Sorting

• Compound (default) Sort Keys for predictable query patterns

• Interleaved Sort Keys for tables that can be queried in any way


Columnar Encoding

Dramatically less I/O

Column storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

Hardware optimized for I/O intensive workloads, 4 GB/sec/node

Enhanced networking, over 1 million packets/sec/node

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
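To give a feel for why an encoding such as delta suits a column like listid, here is a small Python sketch of the general idea behind delta encoding: keep the first value, then store only the difference from the previous value. It is a conceptual illustration with made-up sample values and hypothetical function names, not a description of how Amazon Redshift lays the data out on disk.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded: List[int]) -> List[int]:
    """Rebuild the original values with a running sum of the stored differences."""
    decoded: List[int] = []
    running = 0
    for delta in encoded:
        running += delta
        decoded.append(running)
    return decoded


if __name__ == "__main__":
    list_ids = [100001, 100002, 100003, 100007, 100008]  # made-up sample values
    encoded = delta_encode(list_ids)
    print(encoded)
    print(delta_decode(encoded) == list_ids)
```

When consecutive values are close together, the stored differences are small numbers that fit in far fewer bytes than the original values, which is the source of the I/O saving.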

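Diagram: sorted blocks and the minimum and maximum value recorded for each

 Block values                                | MIN | MAX
---------------------------------------------+-----+-----
 10 | 13 | 14 | 26 | … | 100 | 245 | 324     | 10  | 324
 375 | 393 | 417 | … | 512 | 549 | 623       | 375 | 623
 637 | 712 | 809 | … | 834 | 921 | 959       | 637 | 959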

Even: Data is distributed evenly amongst all Compute Nodes in round-robin fashion, regardless of the values in any column

Key Based: Data is distributed to Compute Nodes on the basis of the provided distribution key column from a given record

All: Data is replicated onto each Compute Node

When to use which type of distribution?

Key Based

- Large fact tables

- Large dimension tables

All

- Medium dimension tables (1K–2M)

Even

- Tables with no joins or group bys

- Small dimension tables (<1000)

Choosing a good distribution key

• High cardinality

  - Number of unique values in the distribution key is significantly larger than the number of slices in the cluster

• Low skew (uniform distribution)

  - Each unique value in the distribution key is associated with the same number of records in the table

• High entropy

  - The unique values in the distribution key vary from each other greatly

• Think GUIDs, not sequential IDs

• Frequently joined to other tables
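The cardinality and skew criteria above can be checked before committing to a key. The Python sketch below simulates key-based placement by hashing each key value onto a slice and then comparing slice sizes. The MD5-based hash, the slice count of 8, and the sample key values are all assumptions chosen for illustration; Amazon Redshift uses its own internal hash function.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def rows_per_slice(keys: Iterable[str], slices: int) -> List[int]:
    """Count how many rows land on each slice when rows are placed by a hash of the key."""
    counts: Counter = Counter()
    for key in keys:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % slices] += 1
    return [counts.get(s, 0) for s in range(slices)]


def skew_ratio(counts: List[int]) -> float:
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average


if __name__ == "__main__":
    slices = 8  # hypothetical cluster size
    # High cardinality: 10,000 distinct customer ids (made-up values).
    high = rows_per_slice((f"cust-{i}" for i in range(10_000)), slices)
    # Low cardinality: only two distinct status values across 10,000 rows.
    low = rows_per_slice((("open", "closed")[i % 2] for i in range(10_000)), slices)
    print("high cardinality:", high, f"skew {skew_ratio(high):.2f}")
    print("low cardinality: ", low, f"skew {skew_ratio(low):.2f}")
```

With only two distinct key values, at most two slices can ever hold data, so most of the cluster sits idle however many nodes are added.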

SELECT COUNT(*) FROM LOGS WHERE MY_DATE = '09-JUNE-2013';

Unsorted table

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 20-JUNE-2013
 2     | 08-JUNE-2013 | 30-JUNE-2013
 3     | 12-JUNE-2013 | 20-JUNE-2013
 4     | 02-JUNE-2013 | 25-JUNE-2013

Sorted by MY_DATE

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 06-JUNE-2013
 2     | 07-JUNE-2013 | 12-JUNE-2013
 3     | 13-JUNE-2013 | 18-JUNE-2013
 4     | 19-JUNE-2013 | 24-JUNE-2013
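The effect of sorting on the query above can be worked through by hand: a block needs to be read only if the requested date falls between its MIN and MAX. The Python sketch below applies that test to the two sets of block ranges shown on the slide. The function and variable names are hypothetical, and the sketch models only the min/max comparison, not the rest of query execution.

```python
from datetime import date
from typing import List, Tuple

Block = Tuple[date, date]  # (MIN, MAX) of MY_DATE within one block


def blocks_to_scan(zone_map: List[Block], wanted: date) -> List[int]:
    """Return the indexes of blocks whose MIN/MAX range could contain `wanted`."""
    return [i for i, (low, high) in enumerate(zone_map) if low <= wanted <= high]


def june(day: int) -> date:
    """Shorthand for a date in June 2013, the month used on the slide."""
    return date(2013, 6, day)


UNSORTED = [(june(1), june(20)), (june(8), june(30)), (june(12), june(20)), (june(2), june(25))]
SORTED_BY_DATE = [(june(1), june(6)), (june(7), june(12)), (june(13), june(18)), (june(19), june(24))]

if __name__ == "__main__":
    target = june(9)  # WHERE MY_DATE = '09-JUNE-2013'
    print("unsorted blocks to read:", blocks_to_scan(UNSORTED, target))
    print("sorted blocks to read:  ", blocks_to_scan(SORTED_BY_DATE, target))
```

For 09-JUNE-2013, three of the four unsorted blocks have ranges that include the date, while only one sorted block does, so the sorted table needs a third of the reads for this query.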

Types of Sort Keys

• Compound (default)

  - Good for known query patterns

  - Contains up to 400 columns

• Interleaved

  - Good for unknown query patterns

  - Can contain up to 8 columns

  - Must be maintained during Vacuum phase

Getting data in…

Data loading options - Files

Diagram: flat files, corporate data center, Amazon S3, Amazon Redshift

Data loading options - ETL Tools

Diagram: source DBs, ETL, Amazon Redshift

Data loading options - Replication

Diagram: source DBs, AWS Database Migration Service, Amazon Redshift

Data loading options - Stream Loading

Diagram: Amazon Kinesis Firehose, Amazon S3, Amazon Redshift
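For the file-based option, where flat files are staged in Amazon S3, the load itself is normally a single COPY statement. The Python helper below is a minimal sketch that assembles such a statement as text; the function name, bucket, table, and role ARN are all hypothetical, and it assumes CSV input and IAM-role authorization. It builds the SQL only and does not connect to a cluster.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str, gzip: bool = False) -> str:
    """Build a COPY statement that loads CSV files under an S3 prefix into a table."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    for value in (s3_prefix, iam_role_arn):
        if "'" in value:
            raise ValueError("single quotes are not allowed in S3 paths or role ARNs")
    options = ["FORMAT AS CSV"]
    if gzip:
        options.append("GZIP")  # input files are gzip-compressed
    options_sql = " ".join(options)
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"{options_sql};"
    )


if __name__ == "__main__":
    # Every name below is a placeholder for illustration.
    print(
        build_copy_statement(
            table="listing",
            s3_prefix="s3://example-bucket/listing/",
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
            gzip=True,
        )
    )
```

Pointing COPY at a prefix that holds several files, rather than one large file, lets the compute nodes load them in parallel.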

Getting data out…

JDBC/ODBC

Amazon Redshift

Amazon Redshift works with your existing analysis tools

Monitor query performance

View explain plans


[email protected] – Senior Engineer, Big Data

7th July 2016, London UK

Customer 360+

Dream Stack: Redshift, Matillion, Python & Tableau

3D Printing

Bot-Farm, Innovation Centre

www.thingiverse.com —> Community

www.makerbot.com/uses/for-educators

Enable – Prosthetics for Kids

MakerBot.com

• MakerBot, a subsidiary of Stratasys Ltd. (Nasdaq: SSYS), is leading the next industrial revolution by setting the standards in reliable and affordable desktop 3D-Printing

• Founded in 2009, MakerBot sells desktop 3D-Printers to innovative and industry-leading customers worldwide, including engineers, architects, designers, educators and consumers

• Has the largest installed base, and market share, of the desktop 3D-Printing industry

• Runs Thingiverse.com, the largest 3D-Printing Community

• Makes 3D-Printing easy and accessible for everyone

Thingiverse.com

The 50 Most Influential

Gadgets of All Time

http://www.thingiverse.com/

Richard L Williams

~20 years in Data Warehousing in HK & USA

• discovered unknown author Ralph Kimball

• used Cognos (shipped with VB 4.0) & RedBrick

• eCommerce, Retail, Insurance, Pharma

• Email/Lifecycle Marketing, Campaign Mgt, Actuarial

• Using AWS: 1800-Flowers, BMS, Janssen (J&J), MakerBot

Ecosystem – where’s the data?

Largest table ~130m rows

But most in 100k – 1m range

Tables Slowest to Load:

- Salesforce

- 100-200 columns “wide”

SQL-Tool:-

- DBVisualizer

- SQLWorkBench/J

- Aginity (Windows)

MS SQL-Svr on EC2

MySQL as RDS

Cloud apps

Internal web-sites

Desktop s/w

Firmware (on printer) s/w

Dream Stack

Redshift Matillion Tableau

Python

Addresses all the issues in DW:-

- can even do unstructured data..!

Works with Redshift, and Fast:-

- Informatica, Snaplogic, Talend do not work with MPP

- Hadoop/EMR not necessary

Power to the users

Intuitive, data-types, Boto3, libraries, widely used

So what..?

Personally: Career Transformative

- accurately predict effort and time

Manager: very happy

- Quickly build

- Quickly iterate

- “No Limits” –> Roadmap to the Vision

Company: becoming strategic

- Competitive Advantage

AWS Marketplace

Ease of Purchase

Reserved Instances

Demo - Master Class

Deep Copy

Deep Insert “Waves”

S3 “Trigger” files

Grants on Schemas to Groups

Groups are “roles”, add Users

Revoke on Schema [Public]

Matillion working Schema

Deltas

Lookups

…

Scripts

Python + Boto(3)

ETL Matillion

Future

I wish I could describe these in more detail but they are the company’s Competitive Advantage


Thank you!

aws.amazon.com/big-data

Please remember to rate this session under My Agenda on awssummit.london
