Big Data on AWS
Peter-Mark Verwoerd, Solutions Architect
What to get out of this talk
• Non-technical:
– Big Data processing stages: ingest, store, process, visualize
– Hot vs. Cold data
– Low latency processing vs. high latency processing
• Technical:
– Concepts above
– Big Data reference architectures and design patterns
The World is Producing Ever-Larger Volumes of Big Data
[Figure: data volumes growing from GB through TB and PB toward EB and ZB]
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Websites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ tweets/day
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly / monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your website's clickstream: tells you what deal or ad to try next time
• Daily fraud reports: tells you if there was fraud yesterday
Real-time Big Data
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can't overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
Big Data: Best Served Fresh
The Challenge
Data → Big Data → Real-time Big Data = a plethora of tools
The Zoo
Apache Kafka, Amazon Kinesis, Apache Flume, Storm, Apache Spark, Spark Streaming, Hadoop/EMR, Redshift, S3, DynamoDB, Hive, Pig, Shark, HDFS, Impala … ?
Partners: Flume, Sqoop, HParser
Simplify
Data → Ingest → Store → Process → Visualize → Answers
• Ingest: Kinesis, Kafka, Flume, Scribe
• Store: S3, HDFS, DynamoDB, Redshift
• Process: Hadoop/EMR, Hive/Pig, Spark, Shark, Spark Streaming, Storm
• Visualize: Jaspersoft, Tableau
Ingest
Data → Ingest
• The act of collecting and storing data
Why Data Ingest Tools?
• Collect random, high-velocity data
– Many different sources
– High TPS
• Collecting random, high-velocity data is a challenging task
– Hard to durably store data at scale
– Hard to keep highly available
– Hard to scale
Why Data Ingest Tools?
• Data ingest tools convert random streams of data into a smaller set of sequential streams
– Sequential streams are easier to process
– Easier to scale
– Easier to persist
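To make the "random streams → fewer sequential streams" idea concrete, here is a minimal Python sketch of partition-key sharding, the mechanism tools like Kinesis and Kafka are built on. The `device_id` key and record shape are made up for illustration; this is not a real client API:

```python
import hashlib
from collections import defaultdict

def shard_for(partition_key: str, num_shards: int) -> int:
    """Hash the partition key to pick a shard, as Kinesis/Kafka do."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def ingest(records, num_shards=4):
    """Fan many unordered producer records into a few sequential streams."""
    shards = defaultdict(list)  # shard id -> ordered list of records
    for rec in records:
        shards[shard_for(rec["device_id"], num_shards)].append(rec)
    return shards

events = [{"device_id": f"sensor-{i % 10}", "temp": 20 + i} for i in range(100)]
shards = ingest(events)
# All records for a given device land in one shard, in arrival order,
# so a consumer can process each shard sequentially.
```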
Data → Kafka or Kinesis → Processing
Data Ingest Tools
• Facebook Scribe: data collector
• Amazon Kinesis: data collector
• Apache Kafka: data collector
• Apache Flume: data movement and transformation
Partners – Data Load and Transformation
• Big Data Edition
• Flume, Sqoop
• HParser
Storage
Data → Ingest → Store
Structured – Complex Query
• SQL
– Amazon RDS (MySQL, Oracle, SQL Server, Postgres)
• Data Warehouse
– Amazon Redshift
• Search
– Amazon CloudSearch
Unstructured – Custom Query
• Hadoop/HDFS
– Amazon Elastic MapReduce (EMR)
Structured – Simple Query
• NoSQL
– Amazon DynamoDB
• Cache
– Amazon ElastiCache (Memcached, Redis)
Unstructured – No Query
• Cloud Storage
– Amazon S3
– Amazon Glacier
[Figure: storage options (ElastiCache, DynamoDB, RDS, Redshift, EMR, S3, Glacier) arranged along axes — request rate: high → low; cost/GB: high → low; latency: low → high; data volume: low → high; structure: low → high]
Service      | Avg latency          | Data volume        | Item size        | Request rate             | Cost ($/GB/mo) | Durability
ElastiCache  | ms                   | GB                 | B–KB             | Very high                | $$             | Low–moderate
DynamoDB     | ms                   | GB–TB (no limit)   | KB (64 KB max)   | Very high                | ¢¢             | Very high
RDS          | ms, sec              | GB–TB (3 TB max)   | KB (~row size)   | High                     | ¢¢             | High
CloudSearch  | ms, sec              | GB–TB              | KB (1 MB max)    | High                     | $              | High
Redshift     | sec, min             | TB–PB (1.6 PB max) | KB (64 K max)    | Low                      | ¢              | High
EMR (Hive)   | sec, min, hrs        | GB–PB (~nodes)     | KB–MB            | Low                      | ¢              | High
S3           | ms, sec, min (~size) | GB–PB (no limit)   | KB–GB (5 TB max) | Low–very high (no limit) | ¢              | Very high
Glacier      | hrs                  | GB–PB (no limit)   | GB (40 TB max)   | Very low (no limit)      | ¢              | Very high
Process
Data → Ingest → Store → Process
• Answering questions about data
• Questions:
– Analytics: think SQL / data warehouse
– Classification: think sentiment analysis
– Prediction: think page-view prediction
– Etc.
Processing Frameworks
• Generally come in two major types
– Batch processing
– Stream processing
Processing Frameworks
• Batch Processing
– Take a large amount (>100 TB) of cold data and ask questions
– Takes hours to get answers back
Example: generating monthly AWS billing reports
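As a toy illustration of the batch pattern — one full pass over a cold dataset, computed on a schedule rather than per event — here is a sketch that aggregates a month of usage records into a bill. The record shape and rates are invented, not real AWS billing data:

```python
from collections import defaultdict

def monthly_bill(usage_records):
    """Batch job: one scan over the whole (cold) dataset.
    In real life this pass runs for hours over >100 TB."""
    totals = defaultdict(int)  # service -> cost in cents (ints avoid float drift)
    for rec in usage_records:
        totals[rec["service"]] += rec["hours"] * rec["rate_cents"]
    return dict(totals)

usage = [
    {"service": "EC2", "hours": 720, "rate_cents": 10},
    {"service": "S3",  "hours": 720, "rate_cents": 1},
    {"service": "EC2", "hours": 360, "rate_cents": 20},
]
report = monthly_bill(usage)
# report == {"EC2": 14400, "S3": 720}  (cents)
```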
Processing Frameworks
• Stream Processing (a.k.a. real-time)
– Take a small amount of hot data and ask questions
– Takes a short amount of time to get your answer back
Example: CloudWatch 1-min metrics
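The stream pattern can be sketched as incremental aggregation into tumbling one-minute windows, loosely in the spirit of CloudWatch 1-min metrics. The class name and record shape are invented for illustration:

```python
from collections import defaultdict

class OneMinuteMetrics:
    """Stream job: the answer is updated as each record arrives,
    instead of rescanning the whole dataset."""
    def __init__(self):
        self.windows = defaultdict(lambda: {"count": 0, "sum": 0})

    def on_record(self, timestamp_sec, value):
        minute = timestamp_sec // 60  # tumbling 1-minute window
        w = self.windows[minute]
        w["count"] += 1
        w["sum"] += value

    def average(self, minute):
        w = self.windows[minute]
        return w["sum"] / w["count"] if w["count"] else None

m = OneMinuteMetrics()
for t, latency in [(0, 100), (30, 200), (65, 50)]:
    m.on_record(t, latency)
# m.average(0) -> 150.0, m.average(1) -> 50.0
```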
Processing Frameworks
• Hadoop/EMR: batch processing
• Spark: batch processing
• Spark Streaming: stream processing
• Storm: stream processing
• Redshift: batch processing
• Impala: low-latency query
Partners – Advanced Analytics
Visualize
Data → Ingest → Store → Process → Visualize
• Which country consumes the most oil?
• Which countries are oil exporters?
• Is there a trend of increasing oil consumption over time?
• Order countries by oil consumption/production
• Is there a cluster of oil producers?
• What is the oil consumption of the USA per day?
• What is the average oil consumption per day of Europe?
• Are there any outliers?
• What is the range of oil production?
• What is the distribution of oil-producing countries?
Activities of Data Visualization Users
Partners – BI & Data Visualization
Putting it all together (coupled architecture)
• Ingest/Store and processing tightly coupled
• Examples:
– S3 + EMR/Hadoop
– HDFS + EMR/Hadoop
– S3 + Redshift
Putting it all together (coupled architecture)
• Coupled systems provide less flexibility
– Cold vs. hot data
– High-latency vs. low-latency processing
• Examples
– EMR + HDFS/S3
• Cold: can handle processing 100 records/sec
• Hot: processing 1,000,000 records/sec?
– Redshift + S3
• High latency: generate reports once a day
• Low latency: generate reports every minute
Putting it all together (de-coupled architecture)
• Multi-tier data processing architecture
– Similar to multi-tier web-application architectures
• Ingest & Store de-coupled from Processing
– Concept of “databus”
Data → Databus → Process → Answers
Putting it all together (de-coupled architecture)
• Ingest tools write to multiple data stores within the "databus"
• Processing frameworks (Hadoop, Spark, etc.) consume from the "databus"
• Consumers can decide which data store to read from depending on their data processing requirements
Data → Ingest (Kafka) → Store (S3, HDFS) → Process → Answers
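A toy model of the databus idea: ingest writes each record once and it lands in several backing stores, then each consumer reads from the store that fits its latency and volume needs. In-memory lists stand in for Kafka and S3/HDFS; the names and methods are illustrative, not a real client library:

```python
class Databus:
    """Toy databus: ingest fans records out to several backing stores;
    consumers pick the store matching their processing requirements."""
    def __init__(self):
        # stand-ins for Kafka (hot, recent data) and S3/HDFS (cold, complete)
        self.stores = {"kafka": [], "s3": []}

    def publish(self, record):
        for store in self.stores.values():  # write once, land everywhere
            store.append(record)

    def read(self, store_name):
        return list(self.stores[store_name])

bus = Databus()
for i in range(5):
    bus.publish({"event": i})

realtime_view = bus.read("kafka")  # a stream processor would tail this
batch_view = bus.read("s3")        # a nightly Hadoop job would scan this
```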
Data temperature & processing latency
Pattern 1: Redshift (cold, high latency)
Pattern 2: DynamoDB (warm, low latency)
Pattern 3: Hadoop (cold, high latency)
Pattern 4: Hadoop (warm, low latency)
Pattern 5: Spark (cold, low latency)
Pattern 6: Stream processing (hot, low latency)
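The six patterns can be read as a lookup from (data temperature, processing latency) to a framework. A hypothetical sketch of that mapping, with the pattern names mirroring the slide:

```python
def pick_pattern(temperature: str, latency: str) -> str:
    """Map (data temperature, processing latency) to the patterns above.
    The lookup itself is illustrative, not a prescriptive rule."""
    table = {
        ("cold", "high"): "Redshift or Hadoop",  # patterns 1 & 3
        ("warm", "low"):  "DynamoDB or Hadoop",  # patterns 2 & 4
        ("cold", "low"):  "Spark",               # pattern 5
        ("hot",  "low"):  "Stream processing",   # pattern 6
    }
    return table.get((temperature, latency), "no listed pattern")

# pick_pattern("hot", "low") -> "Stream processing"
```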
Putting it All Together
What to get out of this talk
• Non-technical:
– Big Data processing stages: ingest, store, process, visualize
– Hot vs. Cold data
– Low latency processing vs. high latency processing
• Technical:
– Concepts above
– Big Data reference architectures and design patterns
Questions?