© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Arie Leeuwesteijn, Solutions Architect, AWS
May 2016
Big Data Architectural Patterns and Best Practices on AWS
Agenda
• Big data challenges
• How to simplify big data processing
• What technologies should you use?
  • Why?
  • How?
• Reference architecture
• Sanoma use case
• Design patterns
Ever Increasing Big Data
Volume
Velocity
Variety
Big Data Evolution
Batch processing
Stream processing
Machine learning
Plethora of Tools
Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Kinesis-enabled apps, Lambda, ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service
Big Data Challenges
Is there a reference architecture?
What tools should I use?
How?
Why?
Decoupled “data bus”• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job• Data structure, latency, throughput, access patterns
Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer
Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin
Big data ≠ big cost
Architectural Principles
Simplify Big Data Processing
COLLECT STORE PROCESS/ANALYZE CONSUME
Time to answer (latency)
Throughput
Cost
COLLECT
Types of Data
Database records
Search documents
Log files
Messaging events
Devices / sensors / IoT stream
[Diagram: COLLECT → STORE. Devices, sensors & IoT platforms (AWS IoT), mobile apps, web apps, applications, data centers (AWS Direct Connect, AWS Import/Export Snowball), and logging (Amazon CloudWatch, AWS CloudTrail) produce streams, records, documents, files, and messages that land in stream storage, a database, search, a file store, or a message queue.]
Amazon Kinesis Firehose
Amazon KinesisStreams
Apache Kafka
Amazon DynamoDB Streams
Amazon SQS
• Amazon SQS – managed message queue service
• Apache Kafka – high-throughput distributed messaging system
• Amazon Kinesis Streams – managed stream storage + processing
• Amazon Kinesis Firehose – managed data delivery
• Amazon DynamoDB – managed NoSQL database; tables can be stream-enabled
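As a hedged sketch of what writing to stream storage looks like in practice: the helper below builds a Kinesis `PutRecord`-style payload. The stream name `clickstream` and the event fields are invented for illustration; the commented boto3 call shows how it would actually be sent.

```python
# Sketch: preparing a clickstream event for Amazon Kinesis Streams.
# "clickstream" and the event fields are illustrative assumptions.
import json

def build_kinesis_record(event: dict, partition_key: str) -> dict:
    """Build a PutRecord-style payload. The partition key decides the shard,
    so records sharing a key keep their client ordering within that shard."""
    return {
        "StreamName": "clickstream",                    # hypothetical stream
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }

record = build_kinesis_record({"user": "u42", "action": "click"},
                              partition_key="u42")

# Sending requires AWS credentials; with boto3 it would look like:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**record)
```

Keying the partition on the user ID means all of one user's events stay ordered relative to each other, which matters for the shard/ordering discussion that follows.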
Message & Stream Storage
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption
[Diagram: ordered records 1–4 flow through shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic; Consumer 1 counts red = 4 and violet = 4 while Consumer 2 counts blue = 4 and green = 4, in parallel.]
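The shard diagram above can be modeled in a few lines: records with the same partition key land on the same shard, ordering is preserved within a shard, and each consumer tallies its shard independently. A toy, stdlib-only sketch (the colors and shard count mirror the diagram; the hashing scheme is illustrative):

```python
# Toy model of partitioned stream storage: route records to shards by
# partition key, then let per-shard consumers count records per color.
from collections import Counter, defaultdict

def route(records, num_shards):
    """Assign each (partition_key, sequence) record to a shard by key hash;
    appending in arrival order preserves per-key ordering within a shard."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[hash(key) % num_shards].append((key, seq))
    return shards

events = [(color, i)
          for color in ("red", "violet", "blue", "green")
          for i in range(4)]
shards = route(events, num_shards=2)

# Each consumer reads one shard independently (parallel consumption)
# and keeps its own per-color counts, like Consumer 1 / Consumer 2 above.
counts = {s: Counter(k for k, _ in recs) for s, recs in shards.items()}
```

Because every record for a given color hashes to one shard, exactly one consumer sees all four of that color, and it sees them in the order they were produced.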
What About Queues & Pub/Sub?
• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or AWS Lambda functions
• No streaming MapReduce
[Diagram: producers publish to an Amazon SNS topic, which fans out to Amazon SQS queues and AWS Lambda functions read by consumers/subscribers.]
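Wiring the fan-out above involves one piece of configuration that is easy to forget: the SQS queue's access policy must grant the SNS topic permission to send to it. A sketch with placeholder ARNs (both ARNs below are hypothetical):

```python
# Sketch: an SQS queue policy allowing one SNS topic to SendMessage to it.
# Both ARNs are illustrative placeholders, not real resources.
import json

TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:orders"    # hypothetical
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:orders-q"  # hypothetical

def sns_to_sqs_policy(topic_arn: str, queue_arn: str) -> str:
    """Build a queue policy restricted to one source topic via aws:SourceArn."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })

policy = sns_to_sqs_policy(TOPIC_ARN, QUEUE_ARN)
# With boto3 this would be attached and subscribed roughly as:
#   sqs.set_queue_attributes(QueueUrl=..., Attributes={"Policy": policy})
#   sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=QUEUE_ARN)
```

The `aws:SourceArn` condition keeps the queue from accepting messages from arbitrary SNS topics in other accounts.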
What Stream Storage Should I Use?
(Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS)
• AWS managed service: Yes | Yes | Yes | No | Yes
• Guaranteed ordering: Yes | Yes | Yes | Yes | No
• Delivery: exactly-once | at-least-once | exactly-once | at-least-once | at-least-once
• Data retention period: 24 hours | 7 days | N/A | Configurable | 14 days
• Availability: 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ
• Scale / throughput: No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limit / automatic
• Parallel clients: Yes | Yes | No | Yes | No
• Stream MapReduce: Yes | Yes | N/A | Yes | N/A
• Record/object size: 400 KB | 1 MB | Amazon Redshift row size | Configurable | 256 KB
• Cost: Higher (table cost) | Low | Low | Low (+ admin) | Low–medium
[Diagram: COLLECT → STORE, now adding Amazon S3 as the hot/warm file store alongside stream storage (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams) and Amazon SQS message storage.]
File Storage
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple distinct clusters (Spark, Hive, Presto) can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for infrequently accessed data
• Use Amazon Glacier for archiving cold data
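The tiering advice above maps directly onto an S3 lifecycle configuration. A sketch (the `logs/` prefix and the 30/90-day thresholds are illustrative choices, not a recommendation):

```python
# Sketch: S3 lifecycle rule moving objects from Standard to Standard-IA
# after 30 days and to Glacier after 90. Prefix and day counts are
# illustrative assumptions.
lifecycle = {
    "Rules": [{
        "ID": "tier-cold-data",
        "Filter": {"Prefix": "logs/"},   # hypothetical key prefix
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }]
}
# Applied with boto3 (requires credentials and an existing bucket):
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```

Once the rule is in place, objects age through the tiers automatically; nothing in the application has to move data by hand.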
Cache, database, search
[Diagram: the STORE stage now adds Amazon Elasticsearch Service (search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), and Amazon S3 (file) alongside stream and message storage.]
Database Anti-pattern
A single database tier serving every data-tier workload.
Best Practice – Use the Right Tool for the Job
Data tier:
• Search – Amazon Elasticsearch Service
• Cache – Amazon ElastiCache (Redis, Memcached)
• SQL – Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server, MariaDB
• NoSQL – Amazon DynamoDB, Cassandra, HBase, MongoDB
Materialized Views
[Diagram: a KCL application maintains a materialized view in Amazon Elasticsearch Service.]
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data / access characteristics → Hot, warm, cold
Cost → Right cost
Data Structure and Access Patterns

Access pattern → what to use:
• Put/Get (key, value) → Cache, NoSQL
• Simple relationships (1:N, M:N) → NoSQL
• Cross-table joins, transactions, SQL → SQL
• Faceting, search, free text → Search

Data structure → what to use:
• Fixed schema → SQL, NoSQL
• Schema-free (JSON) → NoSQL, Search
• (Key, value) → Cache, NoSQL
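The two decision tables above can be read as a lookup. A simplified sketch (real decisions also weigh data temperature and cost; the category names here are just the table rows):

```python
# Sketch: the slide's decision tables as a lookup. A store is a candidate
# only if it satisfies both the data structure and the access pattern.
STORE_FOR_ACCESS = {
    "put/get (key, value)": ["Cache", "NoSQL"],
    "simple relationships (1:N, M:N)": ["NoSQL"],
    "cross-table joins, transactions": ["SQL"],
    "faceting, search, free text": ["Search"],
}
STORE_FOR_STRUCTURE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (JSON)": ["NoSQL", "Search"],
    "(key, value)": ["Cache", "NoSQL"],
}

def candidate_stores(structure: str, access: str) -> list:
    """Intersect both tables to find stores that fit the workload."""
    fits = set(STORE_FOR_STRUCTURE[structure]) & set(STORE_FOR_ACCESS[access])
    return sorted(fits)

print(candidate_stores("schema-free (JSON)", "faceting, search, free text"))
# ['Search']
```

For example, JSON documents with free-text search narrow to a search store, while fixed-schema data with cross-table joins narrows to SQL.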
What Is the Temperature of Your Data / Access?
(Hot data | Warm data | Cold data)
• Volume: MB–GB | GB–TB | PB
• Item size: B–KB | KB–MB | KB–TB
• Latency: ms | ms, sec | min, hrs
• Durability: Low–high | High | Very high
• Request rate: Very high | High | Low
• Cost/GB: $$–$ | $–¢¢ | ¢
Data / Access Characteristics: Hot, Warm, Cold
[Diagram: from hot to cold data, request rate and cost/GB go from high to low while latency and data volume go from low to high; cache and NoSQL serve hot data, SQL and search serve warm data, and Amazon Glacier holds cold data.]
(Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon Elasticsearch | Amazon S3 | Amazon Glacier — hot data → cold data)
• Average latency: ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
• Typical data stored: GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit)
• Typical item size: B–KB | KB (400 KB max) | KB (64 KB max) | KB (2 GB max) | KB–TB (5 TB max) | GB (40 TB max)
• Request rate: High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low
• Storage cost GB/month: $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢/10
• Durability: Low – moderate | Very high | Very high | High | Very high | Very high
• Availability: High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ
What Data Store Should I Use?
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
Request rate: 300 writes/sec | Object size: 2,048 bytes | Total size: 1,483 GB/month | Objects per month: 777,600,000
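A back-of-the-envelope check for this workload, using illustrative 2016-era list prices (assumptions: $0.005 per 1,000 S3 PUTs and $0.00065 per DynamoDB write-capacity-unit-hour; actual prices vary by region and date):

```python
# Sketch: request-cost comparison for 300 writes/sec of 2 KB objects.
# Both unit prices are illustrative assumptions, not current pricing.
writes_per_sec = 300
seconds_per_month = 30 * 24 * 3600                       # 2,592,000
objects_per_month = writes_per_sec * seconds_per_month   # 777,600,000

# Amazon S3 charges per PUT request (assumed $0.005 per 1,000 PUTs).
s3_put_cost = objects_per_month / 1000 * 0.005

# Amazon DynamoDB: a 2 KB item consumes 2 write capacity units (1 KB each),
# so 300 writes/sec needs ~600 WCU (assumed $0.00065 per WCU-hour).
wcu = writes_per_sec * 2
dynamodb_write_cost = wcu * 24 * 30 * 0.00065

print(round(s3_put_cost, 2), round(dynamodb_write_cost, 2))
```

Under these assumed prices, S3 request charges land near $3,900/month while DynamoDB write capacity is under $300/month: request cost dominates for many small objects, which is why small, frequent writes tend to favor DynamoDB and larger objects favor S3.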
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
https://calculator.s3.amazonaws.com/index.html
Simple Monthly Calculator
Amazon S3 or Amazon DynamoDB?
• Scenario 1: 300 writes/sec | 2,048 bytes | 1,483 GB/month | 777,600,000 objects/month → use Amazon DynamoDB
• Scenario 2: 300 writes/sec | 32,768 bytes | 23,730 GB/month | 777,600,000 objects/month → use Amazon S3
[Diagram: the reference architecture grows to COLLECT → STORE → PROCESS/ANALYZE, with all of the collection and storage services shown earlier feeding the process/analyze stage.]
Process / Analyze
• Batch – minutes or hours on cold data (daily/weekly/monthly reports)
• Interactive – seconds on warm/cold data (self-service dashboards)
• Messaging – milliseconds or seconds on hot data (message/event buffering)
• Streaming – milliseconds or seconds on hot data (billing/fraud alerts, 1-minute metrics)
What Analytics Technology Should I Use?
(Amazon Redshift | Amazon EMR: Presto | Spark | Hive)
• Query latency: Low | Low | Low | High
• Durability: High | High | High | High
• Data volume: 1.6 PB max | ~ nodes | ~ nodes | ~ nodes
• AWS managed: Yes | Yes | Yes | Yes
• Storage: Native | HDFS / S3 | HDFS / S3 | HDFS / S3
• SQL compatibility: High | High | Low (SparkSQL) | Medium (HQL)
Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly programmed.
Machine learning algorithms:
• Supervised learning ← "teach" the program
  – Classification ← Is this transaction fraud? (yes/no)
  – Regression ← Customer lifetime value?
• Unsupervised learning ← let it learn by itself
  – Clustering ← market segmentation
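To make the fraud-classification example concrete, here is a deliberately tiny supervised-learning sketch: a one-nearest-neighbour classifier over invented (amount, hour-of-day) features. A real system would train on far richer data with Amazon ML or Spark ML; everything below is illustrative.

```python
# Toy supervised classifier: label a transaction by its nearest training
# example. Training data is fabricated for illustration only.
import math

# ((amount_eur, hour_of_day), label): 1 = fraud, 0 = legitimate
train = [((12.0, 14), 0), ((30.0, 10), 0), ((25.0, 16), 0),
         ((980.0, 3), 1), ((1500.0, 4), 1), ((700.0, 2), 1)]

def predict(x):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    _, label = min(train, key=lambda example: math.dist(example[0], x))
    return label

print(predict((1200.0, 3)))  # large night-time transaction -> 1 (fraud)
print(predict((20.0, 15)))   # small daytime transaction    -> 0
```

The "teach" arrow on the slide is exactly the `train` list: labeled examples from which the program generalizes to unseen transactions.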
Tools and Frameworks
• Machine learning – Amazon ML, Amazon EMR (Spark ML)
• Interactive – Amazon Redshift, Amazon EMR (Presto, Spark)
• Batch – Amazon EMR (MapReduce, Hive, Pig, Spark)
• Messaging – Amazon SQS application on Amazon EC2
• Streaming – micro-batch: Spark Streaming, KCL; real-time: Amazon Kinesis Analytics, Storm, AWS Lambda, KCL
[Diagram: the PROCESS/ANALYZE stage, fast to slow: streaming and messaging (Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon SQS apps on Amazon EC2), interactive and batch (Amazon Redshift, Amazon EMR with Presto), and ML (Amazon Machine Learning).]
What Streaming / Messaging Technology Should I Use?
(Spark Streaming | Apache Storm | Kinesis KCL application | AWS Lambda | Amazon SQS apps)
• Scale: ~ nodes | ~ nodes | ~ nodes | Automatic | ~ nodes
• Micro-batch or real-time: Micro-batch | Real-time | Near-real-time | Near-real-time | Near-real-time
• AWS managed service: Yes (EMR) | No (EC2) | No (KCL + EC2 + Auto Scaling) | Yes | No (EC2 + Auto Scaling)
• Scalability: No limits, ~ nodes | No limits, ~ nodes | No limits, ~ nodes | No limits | No limits
• Availability: Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ
• Programming languages: Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …)
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
[Diagram: ETL sits between the STORE and PROCESS/ANALYZE stages; partner solutions provide managed ETL tooling.]
CONSUME
[Diagram: the full COLLECT → STORE → PROCESS/ANALYZE → CONSUME architecture with ETL; the CONSUME stage adds Amazon QuickSight and the consumers listed below.]
Applications & API
Analysis and visualization
Notebooks
IDE
Business users
Data scientists, developers
Putting It All Together
[Reference architecture diagram: the complete COLLECT → STORE → PROCESS/ANALYZE → CONSUME pipeline with ETL, combining every service introduced in the previous sections.]
AWS Summit – Sanoma Big Data Migration Case
Sander Kieft
Sanoma, a publishing and learning company
27 May 2016, AWS Summit
• 2 Finnish newspapers
• Over 100 magazines in the Netherlands, Belgium, and Finland
• 7 TV channels in Finland and the Netherlands, incl. on-demand platforms
• 200+ websites
• 100 mobile applications on various mobile platforms
• 30+ learning applications
• Users of Sanoma's websites, mobile applications, and online TV products generate large volumes of data
• We use this data to improve our products for our users and our advertisers
• Use cases:
Sanoma, Big Data use cases
• Dashboards and reporting – reporting on various data sources, self-service data access, editorial guidance
• Product improvements – A/B testing, recommenders, search optimizations, adaptive learning/tutoring
• Advertising optimization – (re)targeting, attribution, auction price optimization
Starting point
[Diagram: in the Sanoma data center, sources feed collecting (Kafka, Storm stream processing) and ETL (Jenkins & Jython) into Cloudera Distribution Hadoop compute & storage with Hive and Hadoop User Environment (HUE); serving runs on Redis / Druid.io with QlikView reporting, R Studio, and an EDW; only S3 sits in the AWS cloud.]
• 60+ nodes
• 720 TB capacity
• 475 TB data stored (gross)
• 180 TB data stored (net)
• 150 GB daily growth
• 50+ data sources
• 200+ data processes
• 3,000+ daily Hadoop jobs
• 200+ dashboards
• 275+ average monthly dashboard users
Challenges
Positives
• Users have been steadily increasing
• Demand for:
  – more real-time processing
  – faster data availability
  – higher availability (SLA)
  – quicker availability of new versions
  – specialized hardware (GPUs/SSDs)
  – quicker experiments (fail fast)

Negatives
• Data center network support end of life
• Outsourcing of own operations team
• Version upgrades harder to manage
• Poor job isolation between test, dev, and prod, and between interactive, ETL, and data science workloads
• Higher level of security and access control needed
Amazon Migration Scenarios
Three scenarios (S, M, L) were evaluated for the move:
• S – all services on EC2 (EBS); only source data on S3
• M – all data on S3; EMR for Hadoop; EC2 only for utility services not provided by EMR
• L – all data on S3; EMR for Hadoop; interactive querying workload moved to Redshift instead of Hive
Easier to leverage spot pricing, due to data on S3.
Target architecture
[Diagram: sources still feed collecting (Kafka, Storm stream processing) and ETL (Jenkins & Jython), but all data now lands in S3 buckets (sources, work, Hive data warehouse); an Amazon EMR cluster with core instances, task instances, and task spot instances runs Hive, HUE, and Zeppelin, with metadata on Amazon RDS; serving (Redis / Druid.io), QlikView reporting, R Studio, and the EDW complete the picture, split across the AWS cloud and the Sanoma data center.]
• We're almost done with the migration; running in parallel now
• The setup solves our challenges; some parts still require work
• Take your time testing and validating your setup
  – If you have time, rethink the whole setup
  – No time? Move as-is first and then start optimizing
Conclusion & Learnings
• EMR
  – Check data formats! RC vs ORC/Parquet
  – Start jobs from the master node
  – Break up long-running jobs into shorter, independent ones
  – Spot pricing & pre-empted nodes with Spark
  – HUE and Zeppelin metadata on RDS
  – Research EMRFS behavior for your use case
• S3
  – Bucket structure impacts performance
  – Set up access control; human error will occur
  – Uploading data takes time. Start early!
  – Check Snowball or the new upload service
Design Patterns
Primitive: Multi-Stage Decoupled “Data Bus”
• Multiple stages
• Storage decouples multiple processing stages:
Store → Process → Store → Process
Primitive: Multiple Stream Processing Applications Can Read from Amazon Kinesis
[Diagram: one Amazon Kinesis stream is read in parallel by AWS Lambda, which writes to Amazon DynamoDB, and by the Amazon Kinesis S3 connector, which writes to Amazon S3 for Amazon EMR to process.]
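One of the parallel readers in this pattern could be an AWS Lambda function. The sketch below follows the standard shape of a Kinesis-to-Lambda event (record data arrives base64-encoded); the per-color aggregation and the DynamoDB write hinted at in the comment are illustrative assumptions.

```python
# Sketch: a Lambda handler consuming a batch of Kinesis records, decoding
# each payload and aggregating counts per color. The aggregation and the
# downstream DynamoDB write are illustrative.
import base64
import json
from collections import Counter

def handler(event, context):
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        counts[payload["color"]] += 1
    # A real function would persist the counts, e.g. to a DynamoDB table.
    return dict(counts)

# Local smoke test with a fabricated Kinesis-shaped event:
fake = {"Records": [
    {"kinesis": {"data": base64.b64encode(
        json.dumps({"color": c}).encode()).decode()}}
    for c in ["red", "red", "blue"]
]}
print(handler(fake, None))  # {'red': 2, 'blue': 1}
```

Because Lambda and the S3 connector each track their own position in the stream, adding this consumer does not affect the other reader.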
Primitive: Analysis Frameworks Could Read from Multiple Data Stores
[Diagram: Amazon Kinesis feeds AWS Lambda (→ Amazon DynamoDB) and the Amazon Kinesis S3 connector (→ Amazon S3); Spark Streaming processes the stream while Spark SQL queries across both stores.]
Data Temperature vs. Processing Speed
[Diagram: processing speed (fast → slow) against data temperature (hot → cold): Spark Streaming, Apache Storm, AWS Lambda, and KCL apps give fast answers over hot data in Amazon Kinesis or Apache Kafka; Amazon Redshift and Amazon EMR with Hive, Spark, and Presto give slower answers over warm and cold data in Amazon DynamoDB and Amazon S3.]
Real-time Analytics
[Diagram: Apache Kafka, Amazon Kinesis, and DynamoDB Streams are processed by KCL apps, AWS Lambda, Spark Streaming, and Apache Storm; outputs go to Amazon SNS (notifications, alerts), Amazon ML (real-time prediction), Amazon ElastiCache (Redis) for app state, and Amazon DynamoDB / Amazon RDS / Amazon ES for KPIs.]
Message / Event Processing
[Diagram: messages/events arrive on Amazon SQS priority queues; Amazon SQS apps in an Auto Scaling group process them, publish to Amazon SNS subscribers, and store app state and KPIs in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]
Interactive & Batch Analytics
[Diagram: Amazon Kinesis Firehose delivers data into Amazon S3; Amazon EMR (Hive, Pig, Spark) handles batch processing and batch prediction with Amazon ML; Amazon Redshift and Amazon EMR (Presto, Spark) serve interactive queries; results are consumed downstream, with Amazon ML also doing real-time prediction.]
Lambda Architecture
[Diagram: a data stream from Amazon Kinesis feeds both a batch layer (Amazon Kinesis S3 connector → Amazon S3 → Amazon Redshift and Amazon EMR with Presto, Hive, Pig, Spark) and a speed layer (KCL, AWS Lambda, Storm, Spark Streaming on Amazon EMR); both produce answers into a serving layer (Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES) read by applications, with Amazon ML for predictions.]
Summary
Decoupled “data bus”• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job• Data structure, latency, throughput, access patterns
Use Lambda architecture ideas• Immutable (append-only) log, batch/speed/serving layer
Leverage AWS managed services• Scalable/elastic, available, reliable, secure, no/low admin
Big data ≠ big cost
Thank you!
aws.amazon.com/big-data