Redshift Chartio Event Presentation

Date post: 15-Apr-2017
Upload: chartio
Page 1: Redshift Chartio Event Presentation

Amazon Redshift

Spend time with your data, not your database….

Page 2: Redshift Chartio Event Presentation

Data Warehouse Challenges

Cost

Complexity

Performance

Rigidity

[Chart, 1990 to 2020: Enterprise Data vs. Data in Warehouse]

Page 3: Redshift Chartio Event Presentation

Amazon Redshift powers Clickstream Analytics for Amazon.com

• Web log analysis for Amazon.com
  – Petabyte workload
  – Largest table: 400 TB

• Understand customer behavior
  – Who is browsing but not buying
  – Which products/features are winners
  – What sequence led to higher customer conversion

• Solution
  – Best scale-out solution: query across 1 week
  – Hadoop: query across 1 month

Page 4: Redshift Chartio Event Presentation

Amazon Redshift benefits realized

• Performance
  – Scan 2.25 trillion rows of data: 14 minutes
  – Load 5 billion rows of data: 10 minutes
  – Backfill 150 billion rows of data: 9.75 hours
  – Pig → Amazon Redshift: 2 days to 1 hr
    • 10B row join with 700M rows
  – Oracle → Amazon Redshift: 90 hours to 8 hrs

• Cost
  – 1.6 PB cluster
  – 100 8xl HDD nodes
  – $180/hr

• Complexity
  – 20% of one DBA's time
    • Backup
    • Restore
    • Resizing
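As a rough sanity check, the throughput and unit cost implied by the figures on this slide can be worked out directly. The sketch below is only arithmetic on the slide's numbers; the helper names are made up for illustration, and 1 PB is assumed to equal 1,000 TB.

```python
# Back-of-the-envelope arithmetic on the figures quoted on this slide.
# Helper names are illustrative; 1 PB is assumed to equal 1,000 TB.

def rows_per_second(rows, minutes):
    """Average row throughput for a job that handled `rows` in `minutes`."""
    return rows / (minutes * 60.0)

def cost_per_tb_hour(cluster_cost_per_hour, cluster_tb):
    """Hourly cluster cost spread evenly over its storage capacity."""
    return cluster_cost_per_hour / cluster_tb

scan_rate = rows_per_second(2.25e12, 14)            # 2.25 trillion rows in 14 min
load_rate = rows_per_second(5e9, 10)                # 5 billion rows in 10 min
backfill_rate = rows_per_second(150e9, 9.75 * 60)   # 150 billion rows in 9.75 h

per_node_hour = 180.0 / 100                         # $180/hr across 100 nodes
per_tb_hour = cost_per_tb_hour(180.0, 1.6 * 1000)   # $180/hr across 1.6 PB
per_tb_year = per_tb_hour * 24 * 365

print("scan:     %.3g rows/s" % scan_rate)
print("load:     %.3g rows/s" % load_rate)
print("backfill: %.3g rows/s" % backfill_rate)
print("cost:     $%.2f per node-hour, $%.2f per TB-year" % (per_node_hour, per_tb_year))
```

Under these assumptions the cluster works out to about $1.80 per node-hour and a little under $1,000 per TB per year.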

Page 5: Redshift Chartio Event Presentation

Expanding Amazon Redshift Functionality

Page 6: Redshift Chartio Event Presentation

Scalar User-Defined Functions (UDF)

• Scalar UDFs using Python 2.7
  – Return single result value for each input value
  – Executed in parallel across cluster
  – Syntax largely identical to PostgreSQL
  – We reserve the f_ prefix for customer-defined functions

• Pandas, NumPy, SciPy pre-installed
  – Do matrix operations, build optimization algorithms, and run statistical analyses
  – Build end-to-end modeling workflow

• Import your own libraries

CREATE FUNCTION f_function_name
  ( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
  python_program
$$ LANGUAGE plpythonu;

Page 7: Redshift Chartio Event Presentation

Scalar UDF Security

• Run in restricted container that is fully isolated
  – Cannot make system and network calls
  – Cannot corrupt your cluster or negatively impact its performance

• Current limitations
  – Can’t access file system: functions that write files won’t work
  – Don’t yet cache stable and immutable functions
  – Slower than built-in functions compiled to machine code
    • Haven’t fully optimized some cases, including nested functions

Page 8: Redshift Chartio Event Presentation

Scalar UDF example - URL parsing

Rather than using complex regular expressions (e.g. to extract a host name from a URL)…

SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;

…you can use a built-in Python URL parsing library directly in your SQL:

CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
  import urlparse
  return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

SELECT f_hostname(url) FROM table;
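To compare the two approaches outside the cluster, here is a minimal sketch that runs the slide's regular expression and a URL parser side by side in plain Python. It assumes Python 3, where the 2.7 `urlparse` module lives at `urllib.parse`; the function names and sample URLs are made up for illustration.

```python
import re
from urllib.parse import urlparse  # Python 3 home of the 2.7 `urlparse` module

# The regular expression from the slide, applied with Python's re module.
HOST_RE = re.compile(r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)')

def hostname_regex(url):
    """Host name via the slide's regex: keep only capture group 3."""
    return HOST_RE.sub(r'\3', url)

def hostname_parsed(url):
    """Host name via the standard URL parser, as the f_hostname body does."""
    return urlparse(url).hostname

# Hypothetical sample URLs, chosen only to exercise both functions.
samples = (
    "https://www.example.com/index.html",
    "https://user@www.example.com:8080/path?x=1",
    "http://example.com/a@b",
)
for url in samples:
    print(url, "| regex:", hostname_regex(url), "| parser:", hostname_parsed(url))
```

The last sample has an "@" in its path. The regex's optional user-info group can run past the host in that case, while the parser separates the path before looking for user info, so the two results differ.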

Page 9: Redshift Chartio Event Presentation

Scalar UDF example – Distance

CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE AS $$
  import math
  r = 3963.1676  # earth's radius, in miles
  phi_orig = math.radians(orig_lat)
  phi_dest = math.radians(dest_lat)
  delta_lat = math.radians(dest_lat - orig_lat)
  delta_long = math.radians(dest_long - orig_long)
  a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
      * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
  c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  d = r * c
  return d
$$ LANGUAGE plpythonu;

Calculate approx distance in miles between origin and destination
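The body of f_distance is the haversine formula, so it can be checked locally before being registered as a UDF. The sketch below restates it as an ordinary Python function; the name `haversine_miles` and the clamp on `a` are additions for illustration and are not part of the slide.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as the slide's f_distance

def haversine_miles(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two points given in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1 (not in the slide)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c

# A quarter of the way around the equator should equal r * pi / 2.
quarter = haversine_miles(0.0, 0.0, 0.0, 90.0)
print(quarter, EARTH_RADIUS_MILES * math.pi / 2)
```

A point compared with itself should give zero, and swapping origin and destination should not change the result; both make quick checks before deploying the function.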

Page 10: Redshift Chartio Event Presentation

Redshift Github UDF Repository

Script                          Purpose
f_encryption.sql                Uses pyaes library to encrypt/decrypt strings using passphrase
f_next_business_day.sql         Uses pandas library to return dates which are US Federal Holiday aware
f_null_syns.sql                 Uses Python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql    Uses urlparse to parse the field-value pairs from a URL query string
f_parse_xml.sql                 Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql       Uses pandas library to convert a Unix timestamp to UTC datetime

github.com/awslabs/amazon-redshift-udfs

Page 11: Redshift Chartio Event Presentation

Amazon Kinesis Firehose to Amazon Redshift

Load massive volumes of streaming data into Amazon Redshift

• Zero administration: Capture and deliver streaming data into Redshift without writing an application

• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery

• Seamless elasticity: Scales to match data throughput without intervention

Capture and submit streaming data to Firehose

Firehose loads streaming data continuously into S3 and Redshift

Analyze streaming data using Chartio

Page 12: Redshift Chartio Event Presentation

Amazon Kinesis Firehose to Amazon Redshift

• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder – holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If cannot reach cluster, retries every 5 min for 60 min and then moves on to next batch of objects
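For readers unfamiliar with manifest-driven loads, the sketch below builds a manifest in the general JSON format that the Redshift COPY command accepts and the matching COPY statement. The slides do not show what Firehose itself writes, so treat this as an illustration of the mechanism only; the bucket, table, and role names are hypothetical placeholders, as are the helper function names.

```python
import json

def build_manifest(s3_urls, mandatory=True):
    """Return a COPY manifest (JSON text) listing the given S3 objects."""
    entries = [{"url": url, "mandatory": mandatory} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)

def copy_statement(table, manifest_url, iam_role):
    """Return a COPY statement that loads the files named in the manifest."""
    return ("COPY {t} FROM '{m}' IAM_ROLE '{r}' MANIFEST;"
            .format(t=table, m=manifest_url, r=iam_role))

# Hypothetical names, for illustration only.
manifest = build_manifest([
    "s3://example-bucket/stream/part-0001.gz",
    "s3://example-bucket/stream/part-0002.gz",
])
statement = copy_statement(
    "clickstream_events",
    "s3://example-bucket/manifests/batch-0001",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(manifest)
print(statement)
```

Because a single COPY reads every file named in the manifest, one bad record fails the whole batch, which is consistent with the "no partial loads" behavior described above.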

Page 13: Redshift Chartio Event Presentation

Multi-Column Sort

• Compound sort keys
  – Filter data by one leading column

• Interleaved sort keys
  – Filter data by up to eight columns
  – No storage overhead, unlike an index or projection
  – Lower maintenance penalty

Page 14: Redshift Chartio Event Presentation

Compound sort keys illustrated

• Four records fill a block, sorted by customer

• Records with a given customer are all in one block.

• Records with a given product are spread across four blocks.

[Figure: 16 records, one per [cust_id, prod_id] pair, stored four to a block in compound sort order. Table columns: cust_id, prod_id, other columns, blocks.]

                 prod_id
cust_id     1      2      3      4
   1      [1,1]  [1,2]  [1,3]  [1,4]
   2      [2,1]  [2,2]  [2,3]  [2,4]
   3      [3,1]  [3,2]  [3,3]  [3,4]
   4      [4,1]  [4,2]  [4,3]  [4,4]

Page 15: Redshift Chartio Event Presentation

Interleaved sort keys illustrated

• Records with a given customer are spread across two blocks.

• Records with a given product are also spread across two blocks.

• Both keys carry equal weight.

[Figure: the same 16 records, one per [cust_id, prod_id] pair, stored four to a block in interleaved sort order. Table columns: cust_id, prod_id, other columns, blocks.]

                 prod_id
cust_id     1      2      3      4
   1      [1,1]  [1,2]  [1,3]  [1,4]
   2      [2,1]  [2,2]  [2,3]  [2,4]
   3      [3,1]  [3,2]  [3,3]  [3,4]
   4      [4,1]  [4,2]  [4,3]  [4,4]
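The block counts quoted on the two illustration slides can be reproduced with a small simulation. The sketch below lays out the 16 [cust_id, prod_id] records four to a block, once in compound order and once in an interleaved order. The slides do not say how Redshift interleaves keys internally, so a Z-order (Morton) bit interleaving is assumed here as a stand-in; all function names are illustrative.

```python
from collections import defaultdict

RECORDS_PER_BLOCK = 4  # "four records fill a block", as on the slides

def interleave_bits(x, y, bits=2):
    """Z-order (Morton) code: alternate the bits of x and y, x's bit first."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code

def blocks_touched(records, sort_key):
    """Sort records, cut them into blocks, and map each key value to its blocks."""
    ordered = sorted(records, key=sort_key)
    by_cust, by_prod = defaultdict(set), defaultdict(set)
    for position, (cust, prod) in enumerate(ordered):
        block = position // RECORDS_PER_BLOCK
        by_cust[cust].add(block)
        by_prod[prod].add(block)
    return by_cust, by_prod

# One record for every (cust_id, prod_id) pair in the 4 x 4 grid.
records = [(c, p) for c in range(1, 5) for p in range(1, 5)]

compound = blocks_touched(records, sort_key=lambda r: (r[0], r[1]))
interleaved = blocks_touched(
    records, sort_key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))

for name, (by_cust, by_prod) in (("compound", compound), ("interleaved", interleaved)):
    print(name,
          "| blocks per customer:", sorted(len(b) for b in by_cust.values()),
          "| blocks per product:", sorted(len(b) for b in by_prod.values()))
```

With compound ordering a filter on cust_id reads one block but a filter on prod_id reads all four; with the interleaved ordering either filter reads two, which is the trade-off the slides describe.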

Page 16: Redshift Chartio Event Presentation

Interleaved Sort Key Considerations

• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
  – You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys

Page 17: Redshift Chartio Event Presentation

SAN FRANCISCO

Questions/Comments? Please contact us at [email protected]

