Date post: | 04-Aug-2015 |
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Deep Dive: Amazon EMR
Matt Yanchyshyn - Sr. Manager, Solutions Architecture
Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: managed firewalls
Flexible: you control the cluster
Easy to deploy
Use the AWS Management Console or the command line,
or use the EMR API with your favorite SDK
Easy to monitor and debug
Monitor Debug
integrated with Amazon CloudWatch
monitor cluster, node, and IO
Choose your instance types
Try different configurations to find your optimal architecture
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Typical workloads: batch process, machine learning, Spark and interactive, large HDFS
Easy to add and remove compute capacity on your cluster
Match compute demands with cluster sizing
Resizable clusters
Spot for task nodes: up to 90% off EC2 on-demand pricing
On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Easy to use Spot Instances
Meet SLA at predictable cost; exceed SLA at lower cost
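The split between On-Demand core nodes and Spot task nodes can be written down as the instance-group list that the EMR RunJobFlow API accepts. The following is a minimal sketch in Python, assuming the request would be submitted with the boto3 SDK; the instance type, node counts, and bid price are illustrative values that do not come from the slides, and the actual API call is left as a comment so the block runs without AWS credentials.

```python
def build_instance_groups(core_count, task_count, bid_price,
                          instance_type="m3.xlarge"):
    """Return instance groups: On-Demand master and core, Spot task nodes."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing Spot capacity costs
        # compute time but not data. The API expects BidPrice as a string.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": f"{bid_price:.3f}",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


# Illustrative numbers only (assumptions, not from the presentation).
groups = build_instance_groups(core_count=4, task_count=8, bid_price=0.10)
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])

# With boto3 installed and credentials configured, the list could be passed
# inside the Instances argument, alongside the other required parameters:
#   boto3.client("emr").run_job_flow(
#       Name="demo", Instances={"InstanceGroups": groups}, ...)
```

Keeping core nodes On-Demand gives the predictable baseline, while the Spot task group can be resized or lost without affecting data stored on the cluster.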
Read data directly into Hive, Pig, streaming and Cascading from Amazon Kinesis streams
No intermediate data persistence required
Simple way to introduce real-time sources into batch-oriented systems
Multi-application support and automatic checkpointing
Amazon EMR Integration with Amazon Kinesis
The Hadoop ecosystem can run in Amazon EMR
Bootstrap Actions
Use bootstrap actions to install applications or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file – merge values in new config to existing
--keyword-key-value – override values provided
Configuration file name   Configuration file keyword   File name shortcut   Key-value pair shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
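The shortcut letters above map directly onto the arguments passed to the configure-hadoop bootstrap action: an upper-case letter merges a whole configuration file, a lower-case letter overrides a single key. The helper below is a small sketch of that mapping in Python; the function names and the S3 path in the usage lines are hypothetical, chosen only for illustration.

```python
# Key-value shortcut letter for each configuration file keyword.
# The file-name shortcut is the same letter in upper case.
SHORTCUTS = {"core": "c", "hdfs": "h", "mapred": "m", "yarn": "y"}


def key_value_args(keyword, settings):
    """Build arguments that override individual key-value pairs."""
    flag = "-" + SHORTCUTS[keyword]
    args = []
    for key, value in settings.items():
        args += [flag, f"{key}={value}"]
    return args


def config_file_args(keyword, s3_path):
    """Build arguments that merge a configuration file stored in S3."""
    return ["-" + SHORTCUTS[keyword].upper(), s3_path]


# Override one mapred-site.xml value.
override = key_value_args("mapred", {"dfs.block.size": 1048576})
print(",".join(override))

# Merge a yarn-site.xml file; the bucket and key are hypothetical.
merge = config_file_args("yarn", "s3://my-bucket/conf/yarn-site.xml")
print(",".join(merge))
```

The comma-joined string is the form expected after `--args` on the command line.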
Hue: Amazon S3 and HDFS
Hue: Query Editor
Hue: Job Browser
Leverage Amazon S3 with EMRFS
Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3
(Diagram: two EMR clusters reading from Amazon S3)
EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error-handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
EMRFS client-side encryption
Amazon S3 (client-side encrypted objects)
Amazon S3 encryption clients
EMRFS enabled for Amazon S3 client-side encryption
Key vendor (AWS KMS or your custom key vendor)
HDFS is still there if you need it
• Iterative workloads
– If you’re processing the same dataset more than once
• Disk I/O-intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
Amazon EMR – Design Patterns
EMR example #1: Batch processing
GB of logs pushed to S3 hourly
Daily EMR cluster using Hive to process data
Input and output stored in S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
EMR example #2: Long-running cluster
Data pushed to S3
Daily EMR cluster: ETL data into database
24/7 EMR cluster running HBase holds last two years of data
Front-end service uses HBase cluster to power dashboard with high concurrency
EMR example #3: Interactive query
TBs of logs sent daily
Logs stored in Amazon S3
Hive metastore on Amazon EMR
Interactive query using Presto on multi-petabyte warehouse
http://nflx.it/1dO7Pnt
EMR example #4: Streaming-data processing
TBs of logs sent daily
Logs stored in Amazon Kinesis
Amazon Kinesis Client Library
AWS Lambda
Amazon EMR
Amazon EC2
Optimizations for Storage
File formats
• Row-oriented
– text files
– sequence files
• writable object
– Avro data files
• described by schema
• Columnar format
– Optimized Row Columnar (ORC)
– Parquet
(Diagram: a logical table stored row-oriented versus column-oriented)
Choosing the right file format
• Processing and query tools
– Hive, Impala, and Presto
• Evolution of schema
– Avro for schema and a columnar format such as Parquet for storage
• File format “splittability”
– Avoid JSON/XML files; use them as records instead
• Compression
– block or file
File sizes
• Avoid small files
– Anything smaller than 100 MB
• Each mapper is a single JVM
– CPU time is required to spawn JVMs/mappers
• Fewer files, matching closely to block size
– fewer calls to S3
– fewer network/HDFS requests
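The cost of small files is easy to see with a little arithmetic. The sketch below is a simplified Python model, assuming one mapper per input split, a file that is never merged with its neighbours, and a 128 MB block size; real input formats add details (such as combining inputs or non-splittable codecs) that this model ignores.

```python
import math


def count_splits(file_sizes_mb, block_size_mb=128):
    """Estimate input splits: at least one per file, one per block otherwise."""
    return sum(max(1, math.ceil(size / block_size_mb))
               for size in file_sizes_mb)


# The same 10,000 MB of data, stored two ways.
small_files = [1] * 10_000           # ten thousand 1 MB files
combined_files = [128] * 78 + [16]   # 78 block-sized files plus a remainder

print("small files:   ", count_splits(small_files), "mappers")
print("combined files:", count_splits(combined_files), "mappers")
```

Under these assumptions the small-file layout needs 10,000 mappers and the combined layout 79, so most of the JVM start-up cost in the first case is spent on very little data.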
Dealing with small files
• Reduce HDFS block size, e.g. 1 MB (default is 128 MB)
– --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together
– S3DistCp takes a pattern and target path to combine smaller input files into larger ones
– Supply a target size and compression codec
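As a rough illustration of the S3DistCp approach, the Python sketch below assembles the arguments for a combining copy and previews which input files would end up together. The option names (`--src`, `--dest`, `--groupBy`, `--targetSize`, `--outputCodec`) are S3DistCp options; the bucket names, key layout, and regular expression are hypothetical, and the preview function only approximates the grouping behaviour (files whose capture groups match the same text are combined).

```python
import re
from collections import defaultdict


def s3distcp_args(src, dest, group_by, target_size_mib, codec):
    """Assemble S3DistCp arguments for combining small files."""
    return [
        "--src", src,
        "--dest", dest,
        "--groupBy", group_by,
        "--targetSize", str(target_size_mib),
        "--outputCodec", codec,
    ]


def preview_groups(keys, group_by):
    """Approximate grouping: keys with equal capture-group text go together."""
    pattern = re.compile(group_by)
    groups = defaultdict(list)
    for key in keys:
        match = pattern.fullmatch(key)
        if match:  # keys that do not match the pattern are skipped
            groups["".join(match.groups())].append(key)
    return dict(groups)


# Hypothetical hourly log files, grouped into one output per day.
pattern = r".*/(\d{4}-\d{2}-\d{2})-\d{2}\.log"
keys = ["logs/2015-08-04-00.log", "logs/2015-08-04-01.log",
        "logs/2015-08-05-00.log", "logs/readme.txt"]

print(preview_groups(keys, pattern))
print(s3distcp_args("s3://my-bucket/logs/", "s3://my-bucket/combined/",
                    pattern, 128, "gzip"))
```

Previewing the groups locally is a cheap way to check a pattern before running the copy on a cluster.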
Compression
• Always compress data files on Amazon S3
– reduces network traffic between Amazon S3 and
Amazon EMR
– speeds up your job
• Compress mapper and reducer output
Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2
Choosing the right compression
• Time-sensitive workloads: faster compression is the better choice
• Large amounts of data: use space-efficient compression
• Combined workload: use gzip
Algorithm        Splittable?   Compression ratio   Compress and decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
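The ratio-versus-speed trade-off in the table can be checked on your own data before committing to a codec. The sketch below uses only the gzip and bzip2 codecs from the Python standard library (LZO and Snappy need third-party packages and are left out); the synthetic log lines are an assumption made for the example, and timings will vary by machine.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress data once and report the ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    seconds = time.perf_counter() - start
    if decompress(packed) != data:
        raise ValueError(f"{name} round trip failed")
    return {"codec": name, "ratio": len(data) / len(packed),
            "seconds": seconds}


# Synthetic, repetitive log lines standing in for real input files.
sample = b"".join(
    b"2015-08-04T12:%02d:%02d GET /index.html 200\n" % (i // 60 % 60, i % 60)
    for i in range(20000))

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for row in results:
    print(f"{row['codec']:6s} ratio {row['ratio']:.1f}x "
          f"in {row['seconds']:.3f}s")
```

Running the same measurement on a representative slice of production data gives a better basis for the choice than the synthetic sample used here.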
The Nielsen Company and Amazon EMR
Copyright © 2012 The Nielsen Company. Confidential and proprietary.
SOCIAL TV IS A CONSUMER PHENOMENON
Viewers interacting around TV programming through social to connect with friends, fans, stars, and advertisers in real time.
84 percent of smartphone/tablet owners use devices as second screens while watching TV
1 Billion Tweets about U.S. TV in 2014, sent by 25 million people
WHAT IS NIELSEN SOCIAL?
The leading provider of social TV measurement, analytics, and audience engagement solutions.
• 90+ network, agency and advertiser clients in the U.S.
• Exclusive provider of Nielsen Twitter TV Ratings
• International presence in Italy, Australia, and Mexico
HOW DOES NIELSEN SOCIAL WORK?
Nielsen Social captures Twitter activity about every TV program across 250+ U.S. networks, 1900+ brands, movies, and sports.
Twitter provides impressions and demographics for every tweet about TV, which we aggregate and de-duplicate to produce Nielsen Twitter TV Ratings.
NIELSEN SOCIAL – AMAZON EMR JOB FLOW
1) Country Segmentation Cluster
• Input: Global Firehose (500M TPD, 200 GB gz files on S3)
• Output: US/IT/AU/MX Tweets (150M TPD)
• Data Volatility: Low (+/- 20%)
2) TV Tweet Matching Cluster
• Input: US/IT/AU/MX Tweets (150M TPD)
• Output: TV Tweets (2 to 25M)
• Data Volatility: High (+/- 1200%)
3) TV Analytics Cluster
• Input: TV Tweets (2 to 25M TPD)
• Output: Analytics and Reports
• Data Volatility: High (+/- 1200%)
Amazon EMR Config:
• Multiple Transient Clusters
• On Demand Instances + more for spikes
• Type: m4.2xlarge / m3.xl
• Job Freq: daily
Amazon EMR Config:
• Dedicated (24/7)
• Reserved Instances
• Type: m4.2xlarge
• Job Freq: every 10 min
Amazon EMR Config:
• Transient
• Reserved Instances + On Demand for spikes
• Type: m2.2xlarge
• Job Freq: hourly/overlap
TWITTER FIREHOSE
TV SCHEDULES
DEMOS & IMPRESSIONS
Output: NIELSEN TWITTER TV RATINGS
Takeaway
Cost-saving tips for Amazon EMR
• Use Amazon S3 as your persistent data store
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save >80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alarms to notify you if a cluster is underutilized, then shut it down (for example, 0 mappers running for >N hours)
• Contact AWS sales for pricing options if you are spending >$10K/mo on Amazon EMR
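The "0 mappers running for >N hours" rule from the tips above can be sketched as a simple check over a series of metric samples. This Python example is only an illustration of the rule: it assumes the running-mapper counts have already been fetched (for instance from CloudWatch) at five-minute intervals, and the function name and sample data are hypothetical.

```python
def is_underutilized(running_mappers, idle_hours, samples_per_hour=12):
    """Return True if the latest idle_hours of samples all show 0 mappers.

    running_mappers is ordered oldest to newest; samples_per_hour=12
    assumes one sample every five minutes.
    """
    needed = idle_hours * samples_per_hour
    if len(running_mappers) < needed:
        return False  # not enough history to decide
    return all(count == 0 for count in running_mappers[-needed:])


# One busy hour followed by two idle hours (made-up data).
history = [4] * 12 + [0] * 24

print(is_underutilized(history, idle_hours=2))  # idle for the last 2 hours
print(is_underutilized(history, idle_hours=3))  # the third hour back was busy
```

A check like this could drive a notification or an automatic shutdown, so that an idle cluster does not keep accruing hourly charges.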
CHICAGO