Big Data on AWS - Meetup (files.meetup.com/11363042/big_data_meetup.pdf)


Big Data on AWS

Peter-Mark Verwoerd, Solutions Architect

What to get out of this talk

• Non-technical:

– Big Data processing stages: ingest, store, process, visualize

– Hot vs. Cold data

– Low latency processing vs. high latency processing

• Technical:

– Concepts above

– Big Data reference architectures and design patterns

The World is Producing Ever-Larger Volumes of Big Data

GB → TB → PB → EB → ZB

• IT: application server logs, IT infrastructure logs, metering, audit logs, change logs

• Web sites / mobile apps / ads: clickstream, user engagement

• Sensor data: weather, smart grids, wearables

• Social media, user content: 450MM+ tweets/day

Big Data

• Hourly server logs: how your systems were misbehaving an hour ago

• Weekly / monthly bill: what you spent this past billing cycle

• Daily customer-preferences report from your web site's clickstream: tells you what deal or ad to try next time

• Daily fraud reports: tell you if there was fraud yesterday

Real-time Big Data

• CloudWatch metrics: what just went wrong now

• Real-time spending alerts/caps: guaranteeing you can't overspend

• Real-time analysis: tells you what to offer the current customer now

• Real-time detection: blocks fraudulent use now

Big Data : Best Served Fresh

The Challenge

Data → Big Data → Real-time Big Data = plethora of tools

The Zoo

Apache Kafka, Amazon Kinesis, Apache Flume, Storm, Apache Spark, Apache Spark Streaming, Hadoop/EMR, Redshift, S3, DynamoDB, Hive, Pig, Shark, HDFS, Impala, partners (Flume, Sqoop, HParser), ...?

Simplify

Data → Ingest → Store → Process → Visualize → Answers

• Ingest: Kinesis, Flume, Scribe, Kafka

• Store: HDFS, DynamoDB, Redshift, S3

• Process: Storm, Spark, Shark, Spark Streaming, Hive/Pig, Hadoop/EMR

• Visualize: Jaspersoft, Tableau

Ingest

Data → Ingest

• The act of collecting and storing data

Why Data Ingest Tools?

• Collect random, high-velocity data

– Many different sources

– High transactions per second (TPS)

• Collecting random, high-velocity data is a challenging task

– Hard to durably store data at scale

– Hard to keep highly available

– Hard to scale

Why Data Ingest Tools?

• Data ingest tools convert random streams of data into a smaller set of sequential streams

– Sequential streams are easier to process

– Easier to scale

– Easier to persist
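The idea of fanning many unordered sources into a few sequential streams can be sketched in plain Python. This is an illustrative toy, not Kinesis or Kafka code: `shard_for` and `ingest` are hypothetical names, and the hash-partitioning mimics (under assumption) how those tools route records with the same partition key to the same shard.

```python
import hashlib
from collections import defaultdict

def shard_for(partition_key: str, num_shards: int) -> int:
    """Deterministically map a partition key onto one of a fixed
    number of shards, as Kinesis/Kafka-style ingest tools do."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def ingest(events, num_shards=4):
    """Collect (key, payload) events from many random sources into a
    few sequential per-shard streams. Events with the same key always
    land on the same shard, so each shard preserves per-key order and
    is easy to process, scale, and persist."""
    shards = defaultdict(list)
    for key, payload in events:
        shards[shard_for(key, num_shards)].append(payload)
    return shards

events = [("user-1", "click"), ("user-2", "view"), ("user-1", "buy")]
shards = ingest(events)
```

Because the shard assignment is a pure hash of the key, adding more producers never reorders a single user's events within their shard.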

Data → Kafka or Kinesis → Processing

Data Ingest Tools

• Facebook Scribe: data collector

• Amazon Kinesis: data collector

• Apache Kafka: data collector

• Apache Flume: data movement and transformation

Partners – Data Load and Transformation

Big Data Edition

Flume, Sqoop

HParser

Storage

Data → Ingest → Store

Structured – Complex Query

• SQL

– Amazon RDS (MySQL, Oracle, SQL Server, Postgres)

• Data Warehouse

– Amazon Redshift

• Search

– Amazon CloudSearch

Unstructured – Custom Query

• Hadoop/HDFS

– Amazon Elastic MapReduce (EMR)

Structured – Simple Query

• NoSQL

– Amazon DynamoDB

• Cache

– Amazon ElastiCache (Memcached, Redis)

Unstructured – No Query

• Cloud Storage

– Amazon S3

– Amazon Glacier

Storage options form a spectrum. Moving from Amazon ElastiCache and Amazon DynamoDB, through Amazon RDS and Amazon Redshift, to Amazon EMR, Amazon S3, and Amazon Glacier:

• Request rate: high → low

• Cost/GB: high → low

• Latency: low → high

• Data volume: low → high

• Structure: high → low

|                    | ElastiCache  | DynamoDB          | RDS               | CloudSearch  | Redshift           | EMR (Hive)    | S3                        | Glacier             |
|--------------------|--------------|-------------------|-------------------|--------------|--------------------|---------------|---------------------------|---------------------|
| Average latency    | ms           | ms                | ms, sec           | ms, sec      | sec, min           | sec, min, hrs | ms, sec, min (~size)      | hrs                 |
| Data volume        | GB           | GB–TB (no limit)  | GB–TB (3 TB max)  | GB–TB        | TB–PB (1.6 PB max) | GB–PB (~nodes)| GB–PB (no limit)          | GB–PB (no limit)    |
| Item size          | B–KB         | KB (64 KB max)    | KB (~row size)    | KB (1 MB max)| KB (64 K max)      | KB–MB         | KB–GB (5 TB max)          | GB (40 TB max)      |
| Request rate       | Very high    | Very high         | High              | High         | Low                | Low           | Low–very high (no limit)  | Very low (no limit) |
| Cost ($/GB/month)  | $$           | ¢¢                | ¢¢                | $            | ¢                  | ¢             | ¢                         | ¢                   |
| Durability         | Low–moderate | Very high         | High              | High         | High               | High          | Very high                 | Very high           |

Process

Data → Ingest → Store → Process

• Answering questions about data

• Questions

– Analytics: think SQL/data warehouse

– Classification: think sentiment analysis

– Prediction: think page-view prediction

– Etc.

Processing Frameworks

• Generally come in two major types

– Batch processing

– Stream processing

Processing Frameworks

• Batch Processing

– Take a large amount (>100 TB) of cold data and ask questions

– Takes hours to get answers back
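The batch model above can be sketched as a miniature MapReduce-style job, the pattern Hadoop/EMR runs at scale. This is an illustrative sketch, not Hadoop code: `map_phase`, `reduce_phase`, and `batch_job` are hypothetical names standing in for the framework's map, shuffle, and reduce stages.

```python
from collections import Counter
from itertools import chain

def map_phase(record: str):
    """Map: emit (word, 1) pairs for one log record."""
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    """Reduce: sum counts per key, as the framework would do
    after shuffling all mapper output by key."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def batch_job(records):
    """Run the whole (cold) dataset through map then reduce."""
    return reduce_phase(chain.from_iterable(map_phase(r) for r in records))

# On real clusters the records would be TBs of cold data in HDFS/S3;
# here a few strings show the shape of the computation.
result = batch_job(["error disk full", "error timeout", "ok"])
```

The key property is that the answer is only available after the entire dataset has been scanned, which is why batch answers take hours on large inputs.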

Example: Generating Monthly AWS Billing Reports

Processing Frameworks

• Stream Processing (a.k.a. real-time)

– Take a small amount of hot data and ask questions

– Takes a short amount of time to get your answer back

Example: CloudWatch 1-minute metrics
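A minimal sketch of the stream-processing idea, assuming a tumbling-window aggregation like the 1-minute metric example: events are folded into fixed windows as they arrive, so each window's answer is ready moments after the window closes rather than hours later. The function name and event shape are hypothetical.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Aggregate (timestamp_seconds, value) events into fixed
    non-overlapping windows, e.g. per-minute metric sums."""
    windows = defaultdict(int)
    for ts, value in events:
        # Snap each event's timestamp to the start of its window.
        window_start = ts // window_seconds * window_seconds
        windows[window_start] += value
    return dict(windows)

# Three metric datapoints: two in minute 0, one in minute 1.
metrics = tumbling_window_counts([(0, 1), (59, 2), (60, 5)])
```

Real stream processors (Storm, Spark Streaming) apply the same windowing continuously over unbounded input instead of a finished list.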

Processing Frameworks

• Hadoop/EMR: batch processing

• Spark: batch processing

• Spark Streaming: stream processing

• Storm: stream processing

• Redshift: batch processing

• Impala

Partners – Advanced Analytics

Visualize

Ingest Store ProcessData Visualize

Activities of Data Visualization Users

• Which country consumes the most oil?

• What countries are oil exporters?

• Is there a trend of increasing oil consumption over time?

• Order countries by oil consumption/production?

• Is there a cluster of oil producers?

• What is the oil consumption of the USA per day?

• What is the average oil consumption per day of Europe?

• Are there any outliers?

• What is the range of oil production?

• What is the distribution of oil-producing countries?

Partners – BI & Data Visualization

Putting it all together (coupled architecture)

• Ingest/Store and processing tightly coupled

• Examples:

– S3 + EMR/Hadoop

– HDFS + EMR/Hadoop

– S3 + Redshift

Putting it all together (coupled architecture)

• Coupled systems provide less flexibility

– Cold vs. hot data

– High-latency vs. low-latency processing

• Example

– EMR+HDFS/S3

• Cold: Can handle processing 100 records/sec

• Hot: processing 1000000 records/sec ??

– Redshift + S3

• High latency: Generate reports once a day

• Low latency: Generate reports every minute

Putting it all together (de-coupled architecture)

• Multi-tier data processing architecture

– Similar to multi-tier web-application architectures

• Ingest & Store de-coupled from Processing

– Concept of “databus”

Data → Databus → Process → Answers

Putting it all together (de-coupled architecture)

• Ingest tools write to multiple data stores within “data-bus”

• Processing frameworks (Hadoop, Spark, etc) consume from “databus”

• Consumers can decide which data store to read from depending on their data processing requirements

Data → Ingest (Kafka) → Store (Kafka, S3, HDFS) → Process → Answers
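A toy sketch of the databus idea, under the assumption that ingest fans every record out to all registered stores and each consumer reads from whichever store matches its latency and volume needs (e.g. a Kafka-like buffer for hot reads, S3/HDFS for batch). The `Databus` class and store names are hypothetical; real stores would of course not be Python lists.

```python
class Databus:
    """In-memory stand-in for a multi-store 'databus': writes go to
    every store, reads come from whichever store the consumer picks."""

    def __init__(self, store_names):
        self.stores = {name: [] for name in store_names}

    def publish(self, record):
        # Ingest writes the record to all stores in the bus.
        for store in self.stores.values():
            store.append(record)

    def read(self, store_name):
        # Each consumer chooses its store by processing requirement.
        return list(self.stores[store_name])

bus = Databus(["kafka", "s3", "hdfs"])
bus.publish({"user": "u1", "event": "click"})
hot_view = bus.read("kafka")   # stream processor reads here
cold_view = bus.read("s3")     # nightly batch job reads here
```

Because ingest and processing only meet at the bus, a new consumer (say, a Spark job over HDFS) can be added without touching the producers, which is the decoupling the multi-tier architecture is after.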

Data temperature & processing latency

Pattern 1: Redshift (cold and high)

Pattern 2: DynamoDB (warm and low)

Pattern 3: Hadoop (cold and high)

Pattern 4: Hadoop (warm and low)

Pattern 5: Spark (cold and low)

Pattern 6: Stream processing (hot and low)
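The six patterns reduce to a lookup on (data temperature, required processing latency). A minimal sketch of that decision table; the pattern names are the talk's, while `choose_pattern` and the grouping of patterns sharing a key are illustrative assumptions.

```python
def choose_pattern(temperature: str, latency: str) -> str:
    """Map data temperature ('hot'/'warm'/'cold') and required
    processing latency ('low'/'high') to the matching pattern(s)."""
    patterns = {
        ("cold", "high"): "Redshift or Hadoop (patterns 1 and 3)",
        ("warm", "low"):  "DynamoDB or Hadoop (patterns 2 and 4)",
        ("cold", "low"):  "Spark (pattern 5)",
        ("hot", "low"):   "Stream processing (pattern 6)",
    }
    return patterns.get((temperature, latency), "no matching pattern")

# Fresh data needing answers now -> stream processing.
choice = choose_pattern("hot", "low")
```

The table makes the talk's core trade-off explicit: picking a tool is less about the tool and more about where your data sits on the temperature/latency grid.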

Putting it All Together

What to get out of this talk

• Non-technical:

– Big Data processing stages: ingest, store, process, visualize

– Hot vs. Cold data

– Low latency processing vs. high latency processing

• Technical:

– Concepts above

– Big Data reference architectures and design patterns

Questions?