
Getting started with Amazon Redshift

Date post: 20-Jan-2017
Upload: amazon-web-services
Transcript
Page 1: Getting started with Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ian Meyers, Principal Solution Architect, AWS

July 7th, 2016

Getting Started with Amazon Redshift

Page 2: Getting started with Amazon Redshift

[Diagram: AWS services for the Collect, Store, and Analyze stages]

Services shown: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, AWS Data Pipeline, Amazon CloudSearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon QuickSight, Amazon Kinesis Firehose, AWS Import/Export, Amazon Kinesis Streams, AWS Direct Connect, AWS Database Migration Service, Amazon CloudWatch

Page 3: Getting started with Amazon Redshift

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

a lot faster

a lot simpler

a lot cheaper

Page 4: Getting started with Amazon Redshift

The Amazon Redshift view of data warehousing

Enterprise: 10x cheaper; Easy to provision; Higher DBA productivity

Big data: 10x faster; No programming; Easily leverage BI tools, Hadoop, machine learning, streaming

SaaS: Analysis inline with process flows; Pay as you go, grow as you need; Managed availability and disaster recovery

Page 5: Getting started with Amazon Redshift

Selected Amazon Redshift customers

Page 6: Getting started with Amazon Redshift

Amazon Redshift architecture

Leader node

Simple SQL endpoint

Stores metadata

Optimizes query plan

Coordinates query execution

Compute nodes

Local columnar storage

Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion/Backup; Backup; Restore]

Page 7: Getting started with Amazon Redshift

Benefit #1: Amazon Redshift is fast

Parallel and distributed

Query

Load

Export

Backup

Restore

Resize

Page 8: Getting started with Amazon Redshift

Benefit #1: Amazon Redshift is fast

Dense Storage DS2 (HDD) instance type

Improved memory 2x, compute 2x, disk throughput 1.5x

Cost: Same as our prior generation DS1!

Performance improvement: 50%

Enhanced I/O and commit improvements (Jan ’16)

Reduce amount of time to commit data

Performance improvement: 35%

Page 9: Getting started with Amazon Redshift

Benefit #2: Amazon Redshift is inexpensive

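 DS2 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On-demand          | $0.850                                | $3,725
 1 year reservation | $0.500                                | $2,190
 3 year reservation | $0.228                                | $999

 DC1 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB compressed
--------------------+---------------------------------------+------------------------------------------
 On-demand          | $0.250                                | $13,690
 1 year reservation | $0.161                                | $8,795
 3 year reservation | $0.100                                | $5,500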

Pricing is simple

Number of nodes x price/hour

No charge for leader node

No upfront costs

Pay as you go
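
The "number of nodes x price/hour" rule can be sketched in a few lines of Python. This is only an illustration: the function names, the 8,760-hour year, and the 2 TB-per-node figure used for the per-TB calculation are assumptions, not values stated on the slide. Under those assumptions the result lands close to, but not exactly on, the slide's rounded on-demand figure.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, cluster runs continuously


def cluster_cost(nodes: int, price_per_hour: float, hours: float) -> float:
    """Cost of running `nodes` compute nodes for `hours` (no charge for the leader node)."""
    return nodes * price_per_hour * hours


def annual_price_per_tb(price_per_hour: float, tb_per_node: float) -> float:
    """Effective annual price per TB for a single node running all year."""
    return cluster_cost(1, price_per_hour, HOURS_PER_YEAR) / tb_per_node


# On-demand HDD price from the slide; 2 TB per node is an assumption.
ds2_on_demand = annual_price_per_tb(0.850, tb_per_node=2.0)
print(f"HDD on-demand: ${ds2_on_demand:,.0f} per TB per year")

# A hypothetical 4-node cluster running for a 30-day month.
monthly = cluster_cost(4, 0.850, 24 * 30)
print(f"4 nodes, 30 days: ${monthly:,.2f}")
```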

Page 10: Getting started with Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups to Amazon S3

Continuous and incremental backups across regions

Streaming restore

[Diagram labels: Amazon S3 in Region 1; Amazon S3 in Region 2]

Page 11: Getting started with Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

[Diagram labels: Amazon S3 in Region 1; Amazon S3 in Region 2]

Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/region level disasters

Page 12: Getting started with Amazon Redshift

Benefit #4: Security is built-in

• Load encrypted from S3

• SSL to secure data in transit

  • ECDHE perfect forward secrecy

• Amazon VPC for network isolation

• Encryption to secure data at rest

  • All blocks on disks and in S3 encrypted

  • Block key, cluster key, master key (AES-256)

  • On-premises HSM & AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

[Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup & Restore]

Page 13: Getting started with Amazon Redshift

Benefit #5: We innovate quickly

Well over 100 new features added since launch

Release every two weeks

Automatic patching

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

DUB (4/25)

SOC1/2/3 (5/8)

Unload Encrypted Files

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

SHA1 Builtin (7/15)

4 byte UTF-8 (7/18)

Sharing snapshots (7/18)

Statement Timeout (7/22)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

Resource Level IAM (8/9)

PCI (8/22)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)

EIP Support for VPC Clusters (12/28)

New query monitoring system tables and diststyle all (1/13)

Redshift on DW2 (SSD) Nodes (1/23)

Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)

Resize progress indicator & Cluster Version (3/21)

Regex_Substr, COPY from JSON (3/25)

50 slots, COPY from EMR, ECDHE ciphers (4/22)

3 new regex features, Unload to single file, FedRAMP (5/6)

Rename Cluster (6/2)

Copy from multiple regions, percentile_cont, percentile_disc (6/30)

Free Trial (7/1)

pg_last_unload_count (9/15)

AES-128 S3 encryption (9/29)

UTF-16 support (9/29)

Page 14: Getting started with Amazon Redshift

Benefit #6: Amazon Redshift has a large ecosystem

Data integration; Business intelligence; Systems integrators

Page 15: Getting started with Amazon Redshift

Getting started

Page 16: Getting started with Amazon Redshift

Enter cluster details

Page 17: Getting started with Amazon Redshift

Select node configuration

Page 18: Getting started with Amazon Redshift

Select security settings and provision

Page 19: Getting started with Amazon Redshift

Point-and-click resize

Page 20: Getting started with Amazon Redshift

Resize

Resize while remaining online

Provision a new cluster in the background

Copy data in parallel from node to node

You are only charged for the source cluster

Page 21: Getting started with Amazon Redshift

Data modeling

Page 22: Getting started with Amazon Redshift

3 Important Details…

Column Encoding

• Applied on First Data Load Automatically

• Ensure correct encoding is used

• Periodically revisit encodings in case of change

Data Distribution

• Even, Key Based, or Replicated distribution of data is available

• Focus on colocation of data to limit network transfer

• View network transfer information in Explain Plan

Data Sorting

• Compound (default) Sort Keys for predictable query patterns

• Interleaved Sort Keys for tables that can be queried in any way

Unsorted table

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 20-JUNE-2013
 2     | 08-JUNE-2013 | 30-JUNE-2013
 3     | 12-JUNE-2013 | 20-JUNE-2013
 4     | 02-JUNE-2013 | 25-JUNE-2013

Sorted by date

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 06-JUNE-2013
 2     | 07-JUNE-2013 | 12-JUNE-2013
 3     | 13-JUNE-2013 | 18-JUNE-2013
 4     | 19-JUNE-2013 | 24-JUNE-2013

Page 23: Getting started with Amazon Redshift

Columnar Encoding

Dramatically less I/O

Column storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

Hardware optimized for I/O intensive workloads, 4 GB/sec/node

Enhanced networking, over 1 million packets/sec/node

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
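
One way to act on the recommendations from analyze compression is to carry them into a table definition with per-column ENCODE clauses. The sketch below only builds the SQL text. The column types and the table name listing_encoded are assumptions for illustration (the slide shows encodings, not types), so check them against the real schema before running anything.

```python
# Encodings copied from the ANALYZE COMPRESSION output shown on the slide.
RECOMMENDED = {
    "listid": "delta",
    "sellerid": "delta32k",
    "eventid": "delta32k",
    "dateid": "bytedict",
    "numtickets": "bytedict",
    "priceperticket": "delta32k",
    "totalprice": "mostly32",
    "listtime": "raw",
}

# Column types are assumptions; the slide does not show them.
COLUMN_TYPES = {
    "listid": "INTEGER",
    "sellerid": "INTEGER",
    "eventid": "INTEGER",
    "dateid": "SMALLINT",
    "numtickets": "SMALLINT",
    "priceperticket": "DECIMAL(8,2)",
    "totalprice": "DECIMAL(8,2)",
    "listtime": "TIMESTAMP",
}


def create_table_sql(table: str, types: dict, encodings: dict) -> str:
    """Build a CREATE TABLE statement with an ENCODE clause on every column."""
    cols = [
        f"    {name} {sql_type} ENCODE {encodings[name]}"
        for name, sql_type in types.items()
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"


ddl = create_table_sql("listing_encoded", COLUMN_TYPES, RECOMMENDED)
print(ddl)
```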

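[Diagram: sorted values stored in blocks, with MIN and MAX tracked per block]

Block 1: 10, 13, 14, 26, …, 100, 245, 324 (MIN 10, MAX 324)

Block 2: 375, 393, 417, …, 512, 549, 623 (MIN 375, MAX 623)

Block 3: 637, 712, 809, …, 834, 921, 959 (MIN 637, MAX 959)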

Page 24: Getting started with Amazon Redshift

Even: Data is distributed evenly amongst all Compute Nodes on the basis of the …

Key Based: Data is distributed to Compute Nodes on the basis of the provided distribution key column from a given record

All: Data is replicated onto each Compute Node
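
The three distribution styles can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not how Redshift is implemented: round-robin is used here as one way to spread rows evenly, CRC32 stands in for the service's internal hash function, and the `distribute` helper and the sample `orders` rows are hypothetical.

```python
import zlib
from collections import defaultdict


def distribute(rows, nodes, style, key=None):
    """Return {node_index: [rows]} for the given distribution style."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "even":
            # Round-robin: row i goes to node i mod N.
            placement[i % nodes].append(row)
        elif style == "key":
            # Stand-in hash (CRC32); equal key values always land together.
            h = zlib.crc32(str(row[key]).encode())
            placement[h % nodes].append(row)
        elif style == "all":
            # Every node receives a full copy of the table.
            for n in range(nodes):
                placement[n].append(row)
        else:
            raise ValueError(f"unknown style: {style}")
    return dict(placement)


orders = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

even = distribute(orders, nodes=4, style="even")
by_key = distribute(orders, nodes=4, style="key", key="customer_id")
replicated = distribute(orders, nodes=4, style="all")

for name, result in [("even", even), ("key", by_key), ("all", replicated)]:
    print(name, {node: len(rows) for node, rows in sorted(result.items())})
```

With key-based placement, rows that share a customer_id end up on the same node, which is what lets a join on that column avoid moving data across the network.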

Page 25: Getting started with Amazon Redshift

When to use which type of distribution?

Key Based

• Large fact tables

• Large dimension tables

All

• Medium dimension tables (1K–2M)

Even

• Tables with no joins or group bys

• Small dimension tables (<1000)

Page 26: Getting started with Amazon Redshift

Choosing a good distribution key

• High cardinality

  • Number of unique values in the distribution key is significantly larger than the number of slices in the cluster

• Low skew (uniform distribution)

  • Each unique value in the distribution key is associated with the same number of records in the table

• High entropy

  • The unique values in the distribution key vary from each other greatly

  • Think GUIDs not sequential IDs

• Frequently joined to other tables
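
The cardinality and skew criteria can be made concrete by hashing candidate key values onto slices and comparing the fullest slice with the ideal share. The sketch below is illustrative only: MD5 is a stand-in for the real hash, and the slice count, row count, and sample key values are made up.

```python
import hashlib
from collections import Counter


def slice_for(value, slices):
    """Map a key value to a slice using a stand-in hash (not Redshift's own)."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % slices


def skew_ratio(values, slices):
    """Rows on the fullest slice divided by the ideal rows per slice (1.0 is perfectly even)."""
    counts = Counter(slice_for(v, slices) for v in values)
    ideal = len(values) / slices
    return max(counts.values()) / ideal


SLICES = 8
ROWS = 10_000

# Unique value per row: far more distinct values than slices.
high_cardinality = [f"user-{i}" for i in range(ROWS)]
# Three distinct values, one of which dominates the table.
low_cardinality = ["US"] * 7_000 + ["UK"] * 2_000 + ["DE"] * 1_000

high = skew_ratio(high_cardinality, SLICES)
low = skew_ratio(low_cardinality, SLICES)
print("high-cardinality key skew ratio:", round(high, 2))
print("low-cardinality key skew ratio: ", round(low, 2))
```

A ratio near 1.0 means every slice does a similar amount of work; a large ratio means one slice holds most of the rows and the rest sit idle while it finishes.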

Page 27: Getting started with Amazon Redshift

SELECT COUNT(*) FROM LOGS WHERE MY_DATE = '09-JUNE-2013'

Unsorted table

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 20-JUNE-2013
 2     | 08-JUNE-2013 | 30-JUNE-2013
 3     | 12-JUNE-2013 | 20-JUNE-2013
 4     | 02-JUNE-2013 | 25-JUNE-2013

Sorted by MY_DATE

 Block | MIN          | MAX
-------+--------------+--------------
 1     | 01-JUNE-2013 | 06-JUNE-2013
 2     | 07-JUNE-2013 | 12-JUNE-2013
 3     | 13-JUNE-2013 | 18-JUNE-2013
 4     | 19-JUNE-2013 | 24-JUNE-2013
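
The block ranges on this slide can be replayed in Python to see why sorting matters for the query above. The sketch treats each block's MIN/MAX pair as a zone map entry and keeps only the blocks whose range could contain 09-JUNE-2013. The helper names are hypothetical; the dates are the ones shown on the slide.

```python
from datetime import date


def june(day):
    """Shorthand for a date in June 2013, matching the slide's examples."""
    return date(2013, 6, day)


# (MIN, MAX) per block, copied from the slide.
UNSORTED = [(june(1), june(20)), (june(8), june(30)), (june(12), june(20)), (june(2), june(25))]
SORTED = [(june(1), june(6)), (june(7), june(12)), (june(13), june(18)), (june(19), june(24))]


def blocks_to_scan(zone_map, target):
    """Indexes of blocks whose [MIN, MAX] range could contain `target`."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


target = june(9)
unsorted_hits = blocks_to_scan(UNSORTED, target)
sorted_hits = blocks_to_scan(SORTED, target)

print("unsorted table: scan", len(unsorted_hits), "of", len(UNSORTED), "blocks")
print("sorted table:   scan", len(sorted_hits), "of", len(SORTED), "blocks")
```

In the unsorted layout only the 12-JUNE to 20-JUNE block can be skipped, while in the sorted layout every block except the 07-JUNE to 12-JUNE one is skipped.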

Page 28: Getting started with Amazon Redshift

Types of Sort Keys

• Compound (default)

• Good for known query patterns

• Contains up to 400 columns

• Interleaved

• Good for unknown query patterns

• Can contain up to 8 columns

• Must be maintained during Vacuum phase
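
As a rough illustration of the two sort key types, the helper below builds the SORTKEY clause for a CREATE TABLE statement and enforces the column limits quoted on this slide. The function name, the `logs` table, and its columns are hypothetical; only the SQL text is generated, nothing is executed.

```python
# Column limits as stated on the slide.
MAX_COLUMNS = {"compound": 400, "interleaved": 8}


def sortkey_clause(style, columns):
    """Return a 'COMPOUND SORTKEY (...)' or 'INTERLEAVED SORTKEY (...)' clause."""
    style = style.lower()
    if style not in MAX_COLUMNS:
        raise ValueError(f"unknown sort key style: {style}")
    if not columns:
        raise ValueError("at least one sort column is required")
    if len(columns) > MAX_COLUMNS[style]:
        raise ValueError(f"{style} sort keys allow at most {MAX_COLUMNS[style]} columns")
    return f"{style.upper()} SORTKEY ({', '.join(columns)})"


# Known query pattern: filter by date first, then by user.
compound_ddl = (
    "CREATE TABLE logs (my_date DATE, user_id INTEGER, url VARCHAR(256))\n"
    + sortkey_clause("compound", ["my_date", "user_id"])
    + ";"
)

# Unknown query pattern: filters may arrive on either column.
interleaved_ddl = (
    "CREATE TABLE logs_adhoc (my_date DATE, user_id INTEGER, url VARCHAR(256))\n"
    + sortkey_clause("interleaved", ["my_date", "user_id"])
    + ";"
)

print(compound_ddl)
print(interleaved_ddl)
```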

Page 29: Getting started with Amazon Redshift

Getting data in…

Page 30: Getting started with Amazon Redshift

Data loading options - Files

[Diagram labels: Corporate data center; Flat files; Amazon S3; Amazon Redshift]
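
For the file-based path, data staged in Amazon S3 is loaded into a table with Redshift's COPY command. The sketch below only assembles the statement text. The bucket, prefix, role ARN, and table name are placeholders, and the pipe delimiter and gzip compression are assumptions about how the flat files were written.

```python
def copy_from_s3(table, s3_prefix, iam_role, delimiter="|", gzip=True):
    """Build a COPY statement that loads delimited files under an S3 prefix.

    Values are interpolated directly, so pass trusted constants only.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")
    return "\n".join(parts) + ";"


# All names below are hypothetical placeholders.
statement = copy_from_s3(
    table="listing",
    s3_prefix="s3://example-bucket/listing/part-",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Pointing COPY at a prefix that matches several files, rather than one large file, lets the load run in parallel across the compute nodes.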

Page 31: Getting started with Amazon Redshift

Data loading options – ETL Tools

[Diagram labels: Corporate data center; Source DBs; ETL; Amazon Redshift]

Page 32: Getting started with Amazon Redshift

Data loading options - Replication

[Diagram labels: Corporate data center; Source DBs; AWS Database Migration Service; Amazon Redshift]

Page 33: Getting started with Amazon Redshift

Data loading options – Stream Loading

[Diagram labels: Amazon Kinesis Firehose; Amazon S3; Amazon Redshift]

Page 34: Getting started with Amazon Redshift

Getting data out…

Page 35: Getting started with Amazon Redshift

Amazon Redshift works with your existing analysis tools

[Diagram labels: JDBC/ODBC; Amazon Redshift]

Page 36: Getting started with Amazon Redshift

Monitor query performance

Page 37: Getting started with Amazon Redshift

View explain plans
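
The data distribution notes earlier in the deck point to the explain plan as the place to see network transfer. In Redshift's EXPLAIN output, join steps carry labels such as DS_DIST_NONE or DS_BCAST_INNER that describe whether rows are redistributed or broadcast. The sketch below scans plan text for those labels. The sample plan is hand-written for illustration (costs elided) and is not output from a real cluster; the helper name is hypothetical.

```python
import re

# Labels that mean the join ran without moving rows between nodes.
NO_MOVEMENT = {"DS_DIST_NONE", "DS_DIST_ALL_NONE"}

# Matches join labels such as DS_DIST_INNER, DS_DIST_BOTH, DS_BCAST_INNER.
LABEL = re.compile(r"\bDS_(?:DIST|BCAST)_[A-Z_]+\b")


def data_movement(plan_text):
    """Return the join labels in an EXPLAIN plan that imply network transfer."""
    return [label for label in LABEL.findall(plan_text) if label not in NO_MOVEMENT]


# Hand-written sample for illustration only.
SAMPLE_PLAN = """\
XN Hash Join DS_DIST_NONE  (cost=...)
  ->  XN Hash Join DS_BCAST_INNER  (cost=...)
        ->  XN Seq Scan on sales
"""

moving = data_movement(SAMPLE_PLAN)
print("steps that move data:", moving)
```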

Page 38: Getting started with Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

[email protected] – Senior Engineer, Big Data

7th July 2016, London UK

Customer 360+

Dream Stack: Redshift, Matillion, Python & Tableau

Page 39: Getting started with Amazon Redshift

3D Printing

Bot-Farm, Innovation Centre

www.thingiverse.com —> Community

www.makerbot.com/uses/for-educators

Page 40: Getting started with Amazon Redshift

Enable – Prosthetics for Kids

Page 41: Getting started with Amazon Redshift

MakerBot.com

• MakerBot, a subsidiary of Stratasys Ltd. (Nasdaq: SSYS), is leading the next industrial revolution by setting the standards in reliable and affordable desktop 3D-Printing

• Founded in 2009, MakerBot sells desktop 3D-Printers to innovative and industry-leading customers worldwide, including engineers, architects, designers, educators and consumers

• Has the largest installed base, and market share, of the desktop 3D-Printing industry

• Runs Thingiverse.com, the largest 3D-Printing Community

• 3D-Printing easy and accessible for everyone

[Image captions: Thingiverse.com; The 50 Most Influential Gadgets of All Time]

Page 42: Getting started with Amazon Redshift

Richard L Williams

~20 years in Data Warehousing in HK & USA

• discovered unknown author Ralph Kimball

• used Cognos (shipped with VB 4.0) & RedBrick

• eCommerce, Retail, Insurance, Pharma

• Email/Lifecycle Marketing, Campaign Mgt, Actuarial

• Using AWS: 1800-Flowers, BMS, Janssen (J&J), MakerBot

Page 43: Getting started with Amazon Redshift

Ecosystem – where’s the data?

Largest table ~130m rows, but most in 100k – 1m range

Tables slowest to load:

- Salesforce

- 100-200 columns “wide”

SQL tools:

- DBVisualizer

- SQLWorkBench/J

- Aginity (Windows)

[Diagram labels: MS SQL-Svr on EC2; MySQL as RDS; Cloud apps; Internal web-sites; Desktop s/w; Firmware (on printer) s/w]

Page 44: Getting started with Amazon Redshift

Dream Stack

Redshift, Matillion, Tableau, Python

Addresses all the issues in DW:-

- can even do unstructured data..!

Works with Redshift, and Fast:-

- Informatica, Snaplogic, Talend do not work with MPP

- Hadoop/EMR not necessary

Power to the users

Intuitive, data-types, Boto3, libraries, widely used

Page 45: Getting started with Amazon Redshift

So what..?

Personally: Career Transformative

- accurately predict effort and time

Manager: very happy

- Quickly build

- Quickly iterate

- “No Limits” –> Roadmap to the Vision

Company: becoming strategic

- Competitive Advantage

Page 46: Getting started with Amazon Redshift

AWS Marketplace

Ease of Purchase

Reserved Instances

Page 47: Getting started with Amazon Redshift

Demo - Master Class

Deep Copy

Deep Insert “Waves”

S3 “Trigger” files

Grants on Schemas to Groups

Groups are “roles”, add Users

Revoke on Schema [Public]

Matillion working Schema

Deltas

Lookups

Scripts

Python + Boto(3)

ETL Matillion

Page 48: Getting started with Amazon Redshift

Future

I wish I could describe these in more detail but they are the company’s Competitive Advantage

[email protected]

Page 49: Getting started with Amazon Redshift

Thank you!

aws.amazon.com/big-data

Page 50: Getting started with Amazon Redshift

Please remember to rate this session under My Agenda on awssummit.london

