Date post: | 16-Jul-2015 |
Category: | Technology |
Upload: | amazon-web-services |
Unstructured Structured Streaming
MPP Databases
Amazon Redshift
Hadoop
Amazon EMR
Real-time Analysis
Amazon Kinesis
• Standard SQL
• Optimized for fast analysis
• Very scalable
SQL Clients/BI Tools
Leader Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
MPP SQL Database
Optimised for Analytics
Gigabytes to Petabytes
Fully relational
Fully managed
Amazon Redshift
SQL Clients/BI Tools
JDBC/ODBC
Leader Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
Compute Node: 128GB RAM, 16TB disk, 16 cores
ID Name
1 John Smith
2 Jane Jones
3 Peter Black
4 Pat Partridge
5 Sarah Cyan
6 Brian Snail

The same rows spread across the three compute nodes, two rows per node:
1 John Smith, 4 Pat Partridge
2 Jane Jones, 5 Sarah Cyan
3 Peter Black, 6 Brian Snail
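The grouping above, (1, 4), (2, 5), (3, 6), is consistent with rows being dealt out to the compute nodes in turn. The slides do not name the placement rule, so the round-robin assignment below is an assumption, and `distribute_round_robin` is a hypothetical helper rather than anything Redshift exposes.

```python
# Rows copied from the ID/Name table above.
ROWS = [
    (1, "John Smith"),
    (2, "Jane Jones"),
    (3, "Peter Black"),
    (4, "Pat Partridge"),
    (5, "Sarah Cyan"),
    (6, "Brian Snail"),
]


def distribute_round_robin(rows, node_count):
    """Assign each row to a node in turn (assumed rule, not confirmed by the slides)."""
    nodes = [[] for _ in range(node_count)]
    for position, row in enumerate(rows):
        nodes[position % node_count].append(row)
    return nodes


placement = distribute_round_robin(ROWS, node_count=3)
for node_number, node_rows in enumerate(placement, start=1):
    print(f"Compute node {node_number}: {node_rows}")
```

With three nodes, each node ends up holding one third of the table, so a scan can be worked on by all nodes at once.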
• Column storage
• Data compression
• Zone maps
• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
ID Age State Amount
123 20 QLD 500
345 25 WA 250
678 40 NSW 125
957 37 WA 375
Dramatically reduces I/O
• With column storage, you only read the data you need
ID Age State Amount
123 20 QLD 500
345 25 WA 250
678 40 NSW 125
957 37 WA 375
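To make the I/O difference concrete, here is a minimal sketch that answers "average Amount by State" over the four sample rows in both layouts and counts how many stored values each one touches. The in-memory lists and the `values_read` counter are illustrative assumptions; they stand in for disk blocks and are not how Redshift stores data.

```python
# Sample rows copied from the table above.
ROWS = [
    {"ID": 123, "Age": 20, "State": "QLD", "Amount": 500},
    {"ID": 345, "Age": 25, "State": "WA", "Amount": 250},
    {"ID": 678, "Age": 40, "State": "NSW", "Amount": 125},
    {"ID": 957, "Age": 37, "State": "WA", "Amount": 375},
]


def average_by_state(states, amounts):
    """Average the amounts per state."""
    totals = {}
    for state, amount in zip(states, amounts):
        total, count = totals.get(state, (0, 0))
        totals[state] = (total + amount, count + 1)
    return {state: total / count for state, (total, count) in totals.items()}


def query_row_store(rows):
    """Row layout: every field of every row is read to reach State and Amount."""
    values_read = sum(len(row) for row in rows)
    states = [row["State"] for row in rows]
    amounts = [row["Amount"] for row in rows]
    return average_by_state(states, amounts), values_read


def to_column_store(rows):
    """Pivot the rows into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def query_column_store(columns):
    """Column layout: only the State and Amount columns are read."""
    states, amounts = columns["State"], columns["Amount"]
    return average_by_state(states, amounts), len(states) + len(amounts)


print(query_row_store(ROWS))
print(query_column_store(to_column_store(ROWS)))
```

Both queries return the same averages, but the row layout touches 16 values and the column layout touches 8. The gap grows with the number of columns the query does not need.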
Data compression

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw

• COPY compresses automatically
• You can analyze and override
• More performance, less cost
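The `delta` encoding recommended for `listid` above stores differences between consecutive values instead of the values themselves, which pays off when a column increases in small steps. The sketch below shows the idea only; the sample IDs are hypothetical and the functions are not Redshift's on-disk format.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [current - previous for previous, current in zip(values, values[1:])]
    return [values[0]] + deltas


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    return list(accumulate(encoded))


# Hypothetical listing IDs, chosen only to illustrate small consecutive steps.
list_ids = [1001, 1002, 1003, 1005, 1010]
encoded = delta_encode(list_ids)
print(encoded)
print(delta_decode(encoded))
```

After the first entry, every stored number is small enough to fit in a single byte, which is where the space saving comes from.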
Zone maps
Block contents                              | Min | Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324     | 10  | 324
375 | 393 | 417 | … | 512 | 549 | 623       | 375 | 623
637 | 712 | 809 | … | 834 | 921 | 959       | 637 | 959
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
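The two bullets above can be sketched in a few lines: record each block's minimum and maximum, then consult only those two numbers before deciding whether to read the block. The blocks below contain just the values visible on the slide (the elided ones are left out), and `build_zone_map` and `range_scan` are hypothetical names for illustration.

```python
# Only the values shown on the slide; elided values are omitted.
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]


def build_zone_map(blocks):
    """Track the minimum and maximum value for each block."""
    return [(min(block), max(block)) for block in blocks]


def range_scan(blocks, zone_map, low, high):
    """Return values in [low, high], skipping blocks whose range cannot overlap."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # nothing relevant in this block, so it is never read
        blocks_read += 1
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


zone_map = build_zone_map(BLOCKS)
print(zone_map)
print(range_scan(BLOCKS, zone_map, low=400, high=600))
```

A query for values between 400 and 600 reads one block out of three, because the other two blocks' ranges fall entirely outside the request.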
[Figure: cluster sizes ranging from 160 GB (a single DW2.L node) to 2 PB (100 8XL nodes)]
EMR
EMR Cluster S3
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
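For the SDK route in step 3, the choices from step 2 end up as fields of a launch request. The sketch below only assembles such a request as a plain dictionary, shaped like the keyword arguments of boto3's EMR `run_job_flow` call; it does not contact AWS. The bucket name, instance type, release label, and role names are placeholder assumptions, and using a release label to stand in for the "Hadoop distribution" choice is also an assumption.

```python
def build_emr_request(name, log_uri, core_node_count, instance_type, apps):
    """Assemble launch settings for a cluster (nothing is sent to AWS here)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        # Assumed release label; pick whichever release you actually need.
        "ReleaseLabel": "emr-5.36.0",
        # Hadoop apps such as Hive, Pig, or HBase.
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_node_count,
                },
            ],
            # Shut the cluster down once its work is finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumed role names; substitute the roles defined in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    name="example-cluster",
    log_uri="s3://example-bucket/logs/",  # hypothetical bucket
    core_node_count=4,
    instance_type="m5.xlarge",  # assumed instance type
    apps=["Hive", "Pig", "HBase"],
)
print(request["Applications"])

# With credentials configured, the request could be submitted along these lines:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as data makes it easy to launch several clusters with different node counts against the same S3 input.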
EMR
You can easily resize the cluster and launch parallel clusters using the same data
EMR with spot instances
Scenario #1 (without Spot), Duration: 14 Hours
Cost: 4 instances * 14 hrs * $0.50 = $28
Scenario #2 (with Spot), Duration: 7 Hours
Cost: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50% Cost Savings: ~19%
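The arithmetic for the two scenarios can be checked with a short calculation. The instance counts, durations, and hourly prices come from the slide; `cluster_cost` is a hypothetical helper, and the savings are computed relative to Scenario #1.

```python
def cluster_cost(instance_groups):
    """Sum instances * hours * hourly price over all instance groups."""
    return sum(count * hours * price for count, hours, price in instance_groups)


# Scenario #1: 4 on-demand instances for 14 hours at $0.50/hour.
scenario_1_hours = 14
scenario_1_cost = cluster_cost([(4, scenario_1_hours, 0.50)])

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances
# at $0.25/hour, finishing in 7 hours.
scenario_2_hours = 7
scenario_2_cost = cluster_cost([
    (4, scenario_2_hours, 0.50),
    (5, scenario_2_hours, 0.25),
])

time_savings = 1 - scenario_2_hours / scenario_1_hours
cost_savings = 1 - scenario_2_cost / scenario_1_cost

print(f"Scenario #1: ${scenario_1_cost:.2f}")
print(f"Scenario #2: ${scenario_2_cost:.2f}")
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```

The saving is $5.25 on a $28 baseline, which is 18.75%, while the job finishes in half the time.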
[Figure: EMR cluster with a master instance group, a core instance group (HDFS), and a task instance group, alongside Amazon S3. Caption: Great for Spot Instances]
Big Data Verticals
Media/Advertising: Targeted Advertising; Image and Video Processing
Oil & Gas: Seismic Analysis
Retail: Recommendations; Transactions Analysis
Life Sciences: Genome Analysis
Financial Services: Monte Carlo Simulations; Risk Analysis
Security: Anti-virus; Fraud Detection; Image Recognition
Social Network/Gaming: User Demographics; Usage analysis; In-game metrics
Real-time Scenarios in Industry Segments

Segment | Log Ingest | Continual Metrics | Real Time Data Analytics | Complex Stream Processing
Software/Technology | IT server logs ingestion | IT operational metrics dashboards | Devices/Sensor Operational Intelligence |
Digital Ad Tech./Marketing | Advertising Data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on User engagement with Ads | Optimized bid/buy engines
Financial Services | Market/Financial Transaction order data collection | Financial market data metrics | Fraud monitoring, and Value-at-Risk assessment | Auditing of market order data
E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics | Recommendation engines
[Figure: Kinesis stream spanning three Availability Zones. Inputs: data sources. Uses: logging, metrics, analysis, machine learning. Destinations: S3, DynamoDB, Redshift, EMR]
Putting data into Kinesis
• Each shard:
  • 1000 Tx Per Second
  • 1MB Per Second
  • 50KB Payload Per Tx
• Messages kept for 24 hours
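The per-shard limits above translate directly into a sizing estimate: a stream needs enough shards to satisfy both the transaction rate and the byte rate. The sketch below uses the figures from the slide; treating 1MB as 1024 * 1024 bytes and 50KB as 50 * 1024 bytes is an assumption, and `shards_needed` is a hypothetical helper rather than part of any AWS SDK.

```python
import math

# Per-shard limits as stated on the slide.
MAX_TX_PER_SECOND = 1000
MAX_BYTES_PER_SECOND = 1024 * 1024  # "1MB", binary interpretation assumed
MAX_PAYLOAD_BYTES = 50 * 1024       # "50KB", binary interpretation assumed


def shards_needed(records_per_second, average_record_bytes):
    """Estimate the shard count required for a given write load."""
    if average_record_bytes > MAX_PAYLOAD_BYTES:
        raise ValueError("record exceeds the per-transaction payload limit")
    by_transactions = math.ceil(records_per_second / MAX_TX_PER_SECOND)
    by_bytes = math.ceil(
        records_per_second * average_record_bytes / MAX_BYTES_PER_SECOND
    )
    # Whichever limit is hit first determines the shard count.
    return max(1, by_transactions, by_bytes)


print(shards_needed(records_per_second=2500, average_record_bytes=100))
print(shards_needed(records_per_second=5000, average_record_bytes=2048))
```

Small records are bounded by the transaction limit (3 shards in the first case), while larger records are bounded by throughput (10 shards in the second).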
[Figure: multiple producers writing to Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n of a Kinesis stream]
Getting data out of Kinesis
[Figure: Shard 1 to Shard n of a Kinesis stream, each read by its own KCL worker (KCL Worker 1 to KCL Worker n), with the workers running on EC2 instances]
Kinesis Client Library (KCL):
• Abstracts code from individual shards
• Starts a Kinesis Worker for each shard
• Increases and decreases workers
• Tracks a Worker’s location in the stream
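The bookkeeping described in the bullets above (one worker per shard, plus a tracked position so work can resume) can be illustrated with a small in-memory simulation. This is a conceptual sketch only: `ShardWorker` and `start_workers` are hypothetical names, the shards are plain Python lists, and none of this is the KCL API.

```python
class ShardWorker:
    """Reads one shard and remembers how far it has got."""

    def __init__(self, shard_id, records, checkpoint=0):
        self.shard_id = shard_id
        self.records = records
        self.checkpoint = checkpoint  # index of the next unread record

    def process(self, batch_size):
        """Return the next batch and advance the tracked position."""
        batch = self.records[self.checkpoint:self.checkpoint + batch_size]
        self.checkpoint += len(batch)
        return batch


def start_workers(stream):
    """Start one worker per shard, so application code never picks shards itself."""
    return {
        shard_id: ShardWorker(shard_id, records)
        for shard_id, records in stream.items()
    }


# Hypothetical stream contents: shard id -> ordered records.
stream = {"shard-1": ["a", "b", "c"], "shard-2": ["d", "e"]}
workers = start_workers(stream)
first_batch = workers["shard-1"].process(batch_size=2)

# If the worker is lost, a replacement resumes from the saved position.
saved_position = workers["shard-1"].checkpoint
replacement = ShardWorker("shard-1", stream["shard-1"], checkpoint=saved_position)
resumed_batch = replacement.process(batch_size=2)

print(first_batch, resumed_batch)
```

Because the position is stored separately from the worker, adding or removing workers does not cause records to be skipped or re-read from the start.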
Easy Administration
Real-time Performance
High Throughput. Elastic
Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch
Build Real-time Applications
Low Cost