Date posted: 15-Apr-2017 · Category: Technology · Uploaded by: chartio
Amazon Redshift
Spend time with your data, not your database.
Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart: growth of enterprise data vs. data in the data warehouse, 1990–2020]
Amazon Redshift powers Clickstream Analytics for Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution: query across 1 week
– Hadoop: query across 1 month
Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hour
• 10B-row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hours
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% of one DBA's time, covering:
• Backup
• Restore
• Resizing
Expanding Amazon Redshift Functionality
Scalar User-Defined Functions (UDF)
• Scalar UDFs written in Python 2.7
– Return a single result value for each input value
– Executed in parallel across the cluster
– Syntax largely identical to PostgreSQL
– Function names beginning with f_ are reserved for customer UDFs
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run statistical analyses
– Build end-to-end modeling workflows
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
Scalar UDF Security
• Run in a restricted, fully isolated container
– Cannot make system or network calls
– Cannot corrupt your cluster or degrade its performance
• Current limitations
– No file system access: functions that write files won't work
– Results of STABLE and IMMUTABLE functions are not yet cached
– Slower than built-in functions, which are compiled to machine code
– Some cases, such as nested function calls, are not yet fully optimized
Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
SELECT f_hostname(url) FROM table;
Rather than using a complex regular expression to extract the host name from a URL…
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
…you can use a built-in Python URL parsing library directly in your SQL.
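The UDF above runs under Python 2.7, where urlparse is a top-level module. To sanity-check the same logic locally under Python 3 (where the module moved to urllib.parse), a sketch:

```python
from urllib.parse import urlparse  # Python 3 home of the urlparse function

def hostname(url):
    # Same logic as the f_hostname UDF body above: returns the host name,
    # lowercased, with any userinfo and port stripped.
    return urlparse(url).hostname

print(hostname("https://user@www.example.com:8080/path?q=1"))  # www.example.com
```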
Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE AS $$
    import math
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
        * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = r * c
    return d
$$ LANGUAGE plpythonu;
Calculate the approximate distance in miles between origin and destination (haversine formula)
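Because the UDF body is plain Python, you can test the haversine logic locally before creating the function in the cluster; a sketch (the JFK/LAX coordinates are just sample inputs):

```python
import math

def distance(orig_lat, orig_long, dest_lat, dest_long):
    # Haversine great-circle distance; r is the earth's radius in miles,
    # matching the f_distance UDF body above.
    r = 3963.1676
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    return r * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# JFK to LAX is about 2,475 miles great-circle.
print(round(distance(40.6413, -73.7781, 33.9416, -118.4085)))
```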
Redshift GitHub UDF Repository
Script                        Purpose
f_encryption.sql              Uses the pyaes library to encrypt/decrypt strings using a passphrase
f_next_business_day.sql       Uses the pandas library to return dates that are US federal holiday aware
f_null_syns.sql               Uses Python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql  Uses urlparse to parse the field-value pairs from a URL query string
f_parse_xml.sql               Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql     Uses the pandas library to convert a Unix timestamp to a UTC datetime
github.com/awslabs/amazon-redshift-udfs
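As an illustration of the set-matching idea behind f_null_syns (the repository's actual implementation and synonym list may differ; the synonyms below are assumptions for illustration):

```python
NULL_SYNONYMS = {"null", "n/a", "na", "none", "-", ""}  # assumed list

def is_null_synonym(value):
    # Treats several spellings of "missing" as equivalent,
    # much like value IN ('null', 'n/a', ...) in SQL.
    if value is None:
        return True
    return value.strip().lower() in NULL_SYNONYMS

print(is_null_synonym("N/A"))  # True
print(is_null_synonym("42"))   # False
```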
Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: capture and deliver streaming data into Redshift without writing an application
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: scales to match data throughput without intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using Chartio
• Uses your S3 bucket as an intermediate destination
• The S3 bucket has a 'manifests' folder holding manifests of the files to be copied
• Issues COPY commands synchronously; a single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues the next COPY once the previous one finishes; COPY frequency is determined by how fast your cluster can load files
• No partial loads: if a single record fails, the whole file or batch fails
• Info on skipped files is delivered to the S3 bucket as a manifest in the errors folder
• If Firehose cannot reach the cluster, it retries every 5 minutes for 60 minutes and then moves on to the next batch of objects
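Because Firehose concatenates record payloads exactly as submitted, a common pattern is to append a newline delimiter so Redshift's COPY sees one JSON object per line. A sketch of building such a record (the boto3 call in the comment and the "clickstream" stream name are assumptions about your setup):

```python
import json

def make_firehose_record(event):
    # Firehose concatenates records as-is, so add the newline delimiter
    # yourself if COPY expects one JSON object per line.
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

# Delivery with boto3 would then look like:
#   firehose = boto3.client("firehose")
#   firehose.put_record(DeliveryStreamName="clickstream",  # hypothetical name
#                       Record=make_firehose_record(event))
record = make_firehose_record({"user_id": "a1", "page": "/home"})
print(record["Data"])
```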
Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty
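Both styles are declared in the table DDL; a minimal sketch of the two declarations (table and column names are hypothetical, following the [cust_id, prod_id] example on the next slides):

CREATE TABLE sales_compound (
    cust_id int,
    prod_id int,
    amount  decimal(10,2)
)
COMPOUND SORTKEY (cust_id, prod_id);

CREATE TABLE sales_interleaved (
    cust_id int,
    prod_id int,
    amount  decimal(10,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);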
Compound sort keys illustrated
• Four records fill a block, sorted by customer
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Diagram: 16 records labeled [cust_id, prod_id]; with the compound key (cust_id, prod_id), each row below — one customer — fills one block]

              prod_id
cust_id     1      2      3      4
   1      [1,1]  [1,2]  [1,3]  [1,4]
   2      [2,1]  [2,2]  [2,3]  [2,4]
   3      [3,1]  [3,2]  [3,3]  [3,4]
   4      [4,1]  [4,2]  [4,3]  [4,4]
Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys get equal weight.
[Diagram: the same 16 records labeled [cust_id, prod_id], interleaved so that each block holds two cust_id values and two prod_id values]
Interleaved Sort Key Considerations
• Vacuum time can increase by 10–50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, the interleaved sort order will skew over time
– You'll need to run a vacuum operation to re-analyze the distribution and re-sort the data
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved keys