Data Volume, Velocity, Variety
• 2.7 zettabytes (ZB) of data exist in the digital universe today
  – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
[Chart: data volume growth from GB through TB, PB, and EB to ZB, 1990–2020]
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly/monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your web site’s click stream: tells you what deal or ad to try next time
• Daily fraud reports: tells you if there was fraud yesterday
Real-time Big Data
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
Big Data: Best Served Fresh
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Data Analysis Gap
[Chart: generated data vs. data available for analysis, 1990–2020; the gap in data volume widens over time]
Big Data
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Time to results is key
• Hard to configure/manage

AWS Cloud
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Parallel compute clusters from a single data source
• Managed services
The Zoo
Hive, Pig, Shark, Impala, Apache Kafka, Storm, Hadoop/EMR, DynamoDB, Apache Spark, Amazon Kinesis, Apache Flume, HDFS, Redshift, Apache Spark Streaming, S3 … ?
Simplify
Data → Ingest → Store → Process → Visualize → Answers

Ingest:     Kafka, Kinesis, Flume, Scribe
Store:      S3, DynamoDB, HDFS, Redshift
Process:    Hive/Pig, Hadoop/EMR, Shark, Storm, Spark, Spark Streaming
Visualize:  Tableau, Jaspersoft
AWS Big Data Portfolio
Collect/Ingest:    Kinesis
Store:             S3, Glacier, DynamoDB, RDS
Process/Analyze:   EMR, EC2, Redshift
Visualize/Report:  Redshift, Data Pipeline
Why Data Ingest Tools?
• Data ingest tools convert random streams of data into a smaller set of sequential streams
  – Sequential streams are easier to process
  – Easier to scale
  – Easier to persist
[Diagram: many producers feed Kafka or Kinesis, which fans out to a set of processing consumers]
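The fan-in idea above can be sketched in a few lines. This is an illustration of the routing property, not any tool's real implementation: events are routed by a stable hash of their key, so all events for one key land in the same sequential stream in arrival order.

```python
import hashlib
from collections import defaultdict

def partition_events(events, num_streams):
    """Fan a random mix of (key, payload) events into a fixed set of
    sequential streams, the way Kafka topic partitions or Kinesis shards
    do conceptually. A stable hash of the key picks the stream, so one
    key's events stay together and ordered."""
    streams = defaultdict(list)
    for key, payload in events:
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        streams[h % num_streams].append((key, payload))
    return dict(streams)
```

A consumer can then process each stream independently, which is what makes this layout easier to scale and persist.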
Data Ingest Tools
• Facebook Scribe: data collector
• Amazon Kinesis: data collector
• Apache Kafka: data collector
• Apache Flume: data movement and transformation
Partners for data load and transformation: Big Data Edition, HParser, Flume, Sqoop
Amazon Kinesis
• Real-time processing of streaming data
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations
• Inexpensive: $0.028 per million puts
[Diagram: millions of sources producing 100s of terabytes per hour send to a front end (authentication, authorization); durable, highly consistent storage replicates data across three data centers (Availability Zones); an ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, aggregate analysis in Hadoop or a data warehouse, and aggregation/archival to S3]
Amazon Kinesis Architecture
Kinesis Stream: managed ability to capture and store data
• Streams are made of Shards
• Each Shard ingests data up to 1 MB/sec, and up to 1,000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing Shards
• Replay data inside the 24-hour window
Sending & Reading Data from Kinesis Streams
Sending: HTTP POST, AWS SDK, LOG4J, Flume, Fluentd
Reading: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing
o Java client library, source available on GitHub
o Build & deploy your app with the KCL on your EC2 instance(s)
o The KCL is the intermediary between your application & the stream
  – Automatically starts a Kinesis Worker for each shard
  – Simplifies reading by abstracting individual shards
  – Increase/decrease Workers as the number of shards changes
  – Checkpoints to keep track of a Worker’s location in the stream; restarts Workers if they fail
o Integrates with Auto Scaling groups to redistribute workers to new instances
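The checkpoint/restart behavior is worth seeing concretely. The real KCL is a Java library; the sketch below is a toy Python model of one shard's worker, with names of our own choosing. It shows why delivery is at-least-once: a crash can land after a record is processed but before the checkpoint is written, so a restarted worker reprocesses that record.

```python
def run_worker(shard_records, checkpoints, shard_id, process, crash_at=None):
    """Toy model of a KCL-style worker for one shard (illustration only).
    Resumes from the last checkpoint, processes records in order, and
    checkpoints after each record. crash_at simulates a failure after
    processing record N but before checkpointing it, which is exactly
    the window that makes real processing at-least-once."""
    position = checkpoints.get(shard_id, 0)
    while position < len(shard_records):
        process(shard_records[position])
        if crash_at is not None and position == crash_at:
            return  # crashed: record was processed but never checkpointed
        position += 1
        checkpoints[shard_id] = position  # checkpoint: next record to read
```

Running a worker, crashing it, and restarting it shows each record handled at least once, with the crashed-on record handled twice.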
Putting Data into Kinesis
Simple PUT interface to store data in Kinesis
• Producers use a PUT call to store data in a Stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key is supplied by the producer and used to distribute PUTs across Shards
• Kinesis MD5-hashes the supplied partition key over the hash key range of a Shard
• A unique sequence number is returned to the producer upon a successful PUT call
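The partition-key-to-shard mapping can be modeled locally. This sketch assumes shards split the 128-bit hash-key range evenly, which holds for a fresh stream but not necessarily after shard splits and merges:

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # partition keys are MD5-hashed onto a 128-bit range

def shard_for_key(partition_key, num_shards):
    """Conceptual model of how a PutRecord's partition key selects a
    shard: MD5-hash the key, then find the shard whose hash-key range
    contains the result (even splits assumed here)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // HASH_KEY_SPACE
```

Because the hash is deterministic, all records for one partition key always land in the same shard, which is what preserves per-key ordering.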
Storage
Structured – Complex Query
• SQL
– Amazon RDS (MySQL, Oracle, SQL Server)
• Data Warehouse
– Amazon Redshift
• Search
– Amazon CloudSearch
Unstructured – Custom Query
• Hadoop/HDFS
– Amazon Elastic MapReduce (EMR)
Structured – Simple Query
• NoSQL
– Amazon DynamoDB
• Cache
– Amazon ElastiCache (Memcached, Redis)
Unstructured – No Query
• Cloud Storage
– Amazon S3
– Amazon Glacier
Amazon S3
• Amazon S3 is for storing Objects (like ‘files’)
• Objects are stored in Buckets
• A Bucket keeps data in a single Region
• Highly durable, highly available
• Secure
Why is Amazon S3 good for Big Data?
• No limit on the number of Objects
• Object size up to 5TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, Lifecycle Policies and Glacier Integration
Amazon S3 Best Practices
• Use random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high throughput GETs and PUTs
• Leverage the high durability, high throughput design of Amazon S3 for
backup and as a common storage sink
• Durable sink between data services
• Supports de-coupling and asynchronous delivery
• Consider RRS for lower cost, lower durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range get for faster reads
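The "random hash prefix" practice in the first bullet can be sketched as follows. Deriving the prefix from the key itself (rather than true randomness) is our assumption here; it keeps the key reproducible so readers can compute it too:

```python
import hashlib

def hashed_key(natural_key, prefix_len=4):
    """Prepend a short hex hash so keys spread across S3's key-space
    partitions instead of clustering under a date or sequence prefix.
    The natural key survives intact after the first '/'."""
    prefix = hashlib.md5(natural_key.encode("utf-8")).hexdigest()[:prefix_len]
    return prefix + "/" + natural_key
```

For example, daily log objects that would otherwise all share a `logs/2014/03/...` prefix get scattered across many different leading prefixes.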
Aggregate All Data in S3
Surrounded by a collection of the right tools: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Cassandra, Storm

S3 Can Expand Along With Growing Data Volumes
The same tools continue to surround Amazon S3 as it grows.
Amazon DynamoDB
• Fully managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent low-latency performance
• Any throughput rate
• No storage limits
DynamoDB: A Flexible Data Model
• A Table is a collection of Items
• An Item is an arbitrary collection of Attributes (name-value pairs)
• Except for the required primary key Attribute, a Table is schema-less
• An Item can have any number of attributes (64 KB max item size)
• No limit on the number of Items per Table
DynamoDB: Access and Query Model
• Two primary key options
  – Hash key: key lookups: “Give me the status for user ‘abc’”
  – Composite key (hash with range): “Give me all the status updates for user ‘abc’ that occurred within the past 24 hours”
• Support for multiple data types: string, number, binary, or sets of strings, numbers, or binaries
• Supports both strong and eventual consistency
  – Choose your consistency level when you make the API call
  – Different parts of your app can make different choices
• Global Secondary Indexes
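The two query shapes can be modeled with a toy in-memory table. This is not the DynamoDB API, just a sketch of its semantics over a list of (hash_key, range_key, attributes) tuples:

```python
def query(items, hash_key, range_start=None, range_end=None):
    """Toy model of the two DynamoDB query shapes above: a hash-only call
    is a key lookup ('status for user abc'); adding range bounds narrows
    to a slice ('updates within the past 24 hours'). Results come back
    sorted by range key, as DynamoDB returns them."""
    hits = [it for it in items if it[0] == hash_key]
    if range_start is not None:
        hits = [it for it in hits if range_start <= it[1] <= range_end]
    return sorted(hits, key=lambda it: it[1])
```

With timestamps as the range key, the "past 24 hours" query is just a range bound of (now - 86400, now).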
Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data
• Use conditional/OCC updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions
Example: weekly time-series tables, each with Event_id (hash key), Timestamp (range key), and Attribute1 … AttributeN:
Events_table_2012
Events_table_2012_05_week1
Events_table_2012_05_week2
Events_table_2012_05_week3
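A small helper (our own, purely illustrative) shows how the per-week table names above can be generated; writes go to the current week's table, and old tables can be archived or dropped wholesale:

```python
from datetime import date

def weekly_table_name(base, day):
    """Name a table per calendar week, mirroring the table-per-period
    pattern above: days 1-7 map to week1, 8-14 to week2, and so on."""
    week = (day.day - 1) // 7 + 1
    return "%s_%d_%02d_week%d" % (base, day.year, day.month, week)
```

For May 3, 2012 this yields Events_table_2012_05_week1, matching the example tables.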
Amazon RDS
• Built-in Multi-AZ for HA
• Scale up to 3TB and 30,000 IOPS
• Read replicas; cross region backups
Amazon RDS Best Practices
• Use the right DB instance class
• Use EBS-optimized instances
– Example: db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge,
db.cr1.8xlarge
• Use provisioned IOPS
• Use multi-AZ for high availability
• Use read replicas for
– Scaling reads
– Schema changes
– Additional failure recovery
Process
• Answering questions about data
• Questions
  – Analytics: think SQL/data warehouse
  – Classification: think sentiment analysis
  – Prediction: think page-view prediction
  – Etc.
Processing Frameworks
• Batch processing
  – Take a large amount (>100 TB) of cold data and ask questions
  – Takes hours to get answers back
  – Example: hourly reports
• Stream processing (a.k.a. real-time)
  – Take a small amount of hot data and ask questions
  – Takes a short amount of time to get your answer back
  – Example: 1-minute metrics
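The "1-minute metrics" idea is just continuous bucketing. A minimal sketch (not tied to any particular framework) of the rollup a stream processor would emit as events arrive:

```python
from collections import defaultdict

def one_minute_counts(events):
    """Roll a stream of (timestamp_seconds, metric_name) events into
    per-minute counts -- the kind of answer a stream processor produces
    continuously, versus a batch job reporting hours later."""
    buckets = defaultdict(int)
    for ts, metric in events:
        buckets[(metric, ts - ts % 60)] += 1  # keyed by bucket start time
    return dict(buckets)
```

A real stream processor would emit and expire these buckets incrementally instead of holding them all, but the windowing logic is the same.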
Processing Frameworks
• Batch Processing
– Hadoop/EMR
– Redshift
– Spark
• Stream Processing
– Spark Streaming
– Storm
Amazon Redshift
• Columnar data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year
Amazon Redshift architecture
• Leader node: SQL endpoint (JDBC/ODBC)
  – Stores metadata
  – Coordinates query execution
• Compute nodes: local, columnar storage, interconnected via 10 GigE (HPC)
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
  – DW1: HDD; scale from 2 TB to 1.6 PB
  – DW2: SSD; scale from 160 GB to 256 TB
Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key, or largest dimension; group-by column
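The COPY bullets combine into one statement. The helper below is our own convenience; COPY, MANIFEST, GZIP, and IAM_ROLE are real Redshift options, while the bucket path and role ARN in the usage note are placeholders:

```python
def build_copy(table, manifest_url, iam_role_arn):
    """Assemble a Redshift COPY that follows the practices above: load
    via a manifest listing multiple gzip-compressed parts in S3."""
    return (
        "COPY {t} FROM '{m}' IAM_ROLE '{r}' MANIFEST GZIP;"
        .format(t=table, m=manifest_url, r=iam_role_arn)
    )
```

For example, `build_copy("events", "s3://my-bucket/load/events.manifest", "arn:aws:iam::123456789012:role/RedshiftCopy")` (hypothetical resources) produces a single statement that lets every compute node pull files in parallel.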
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
You can easily resize the cluster, and launch parallel clusters using the same data.
Amazon Elastic MapReduce Best Practices
• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job; not one size fits all
• Leverage resizing and spot to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code
[Diagram: resizing a job flow cuts its duration from 14 hours to 7 hours]
Data Characteristics: Hot, Warm, Cold

              Hot        Warm      Cold
Volume        MB–GB      GB–TB     PB
Item size     B–KB       KB–MB     KB–TB
Latency       ms         ms, sec   min, hrs
Durability    Low–High   High      Very High
Request rate  Very High  High      Low
Cost/GB       $$–$       $–¢¢      ¢

[Chart axes: request rate high → low; cost/GB high → low; latency low → high; data volume low → high; structure low → high]
Services across the temperature spectrum: Amazon ElastiCache, Amazon DynamoDB, Amazon CloudSearch, Amazon RDS, Amazon S3, Amazon EMR, Amazon Redshift, Amazon Glacier

             Average latency       Data volume          Item size           Request rate             Cost ($/GB/month)  Durability
ElastiCache  ms                    GB                   B–KB                Very High                $$                 Low–Moderate
DynamoDB     ms                    GB–TBs (no limit)    B–KB (64 KB max)    Very High                ¢¢                 Very High
RDS          ms, sec               GB–TB (3 TB max)     KB (~row size)      High                     ¢¢                 High
CloudSearch  ms, sec               GB–TB                KB (1 MB max)       High                     $                  High
Redshift     sec, min              TB–PB (1.6 PB max)   KB (64 K max)       Low                      ¢                  High
EMR (Hive)   sec, min, hrs         GB–PB (~nodes)       KB–MB               Low                      ¢                  High
S3           ms, sec, min (~size)  GB–PB (no limit)     KB–GB (5 TB max)    Low–Very High (no limit) ¢                  Very High
Glacier      hrs                   GB–PB (no limit)     GB (40 TB max)      Very Low (no limit)      ¢                  Very High
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
Request rate: 300 writes/sec · Object size: 2,048 bytes · Total size: 1,483 GB/month · Objects per month: 777,600,000
Amazon S3 or Amazon DynamoDB?

            Request rate (writes/sec)  Object size (bytes)  Total size (GB/month)  Objects per month
Scenario 1  300                        2,048                1,483                  777,600,000   → use Amazon DynamoDB
Scenario 2  300                        32,768               23,730                 777,600,000   → use Amazon S3
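The arithmetic behind this choice can be sketched directly. The prices below are illustrative placeholders (roughly the era's list prices), not a current price sheet, and the helper names are our own:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

def s3_put_cost(writes_per_sec, price_per_1000_puts=0.005):
    """For many small objects, S3's per-request charge dominates the
    storage charge. Price is an illustrative assumption."""
    puts = writes_per_sec * SECONDS_PER_MONTH
    return puts / 1000.0 * price_per_1000_puts

def dynamodb_write_cost(writes_per_sec, item_bytes, price_per_wcu_hour=0.00065):
    """DynamoDB bills provisioned write capacity: one write unit covers
    a 1 KB write, so a 32 KB item consumes 32 units per write. Price is
    an illustrative assumption."""
    units_per_write = max(1, -(-item_bytes // 1024))  # ceil(bytes / 1 KB)
    wcu = writes_per_sec * units_per_write
    return wcu * 24 * 30 * price_per_wcu_hour
```

With these assumed prices, 300 writes/sec of 2 KB objects is cheaper in DynamoDB (request cost dominates S3), while at 32 KB per object the write-unit multiplier flips the answer to S3, which is consistent with the two scenarios above.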
Putting it all together (coupled architecture)
• Coupled systems provide less flexibility
  – Cold data vs. hot data
  – High-latency vs. low-latency processing
• Examples
  – EMR + HDFS/S3
    • Cold: can handle processing 100 records/sec
    • Hot: processing 1,000,000 records/sec??
  – Redshift + S3
    • High latency: generate reports once a day
    • Low latency: generate reports every minute
Putting it all together (de-coupled architecture)
• Multi-tier data processing architecture
  – Similar to multi-tier web-application architectures
• Ingest & store de-coupled from processing
  – Concept of a “databus”: Data → Ingest → Store (databus) → Process → Answers
• Ingest tools write to multiple data stores within the databus (e.g., Kafka, S3, HDFS)
• Processing frameworks (Hadoop, Spark, etc.) consume from the databus
• Consumers can decide which data store to read from depending on their data processing requirements
Data temperature & processing latency
[Diagram: an ingest tool writes data to a warm store and a cold store along a temperature axis (hot → cold); processing frameworks at different points on the latency axis (low → high) read from the appropriate store and produce answers]
Pattern 1: Redshift (cold & high latency)
Data → Kinesis/Kafka → S3 → Redshift → Answers
• Daily fraud report
• Weekly / monthly bill
Pattern 2: Hadoop (cold & high latency)
Data → Kinesis/Kafka → S3 and NoSQL/DynamoDB → EMR/Hadoop → Answers
Pattern 3: DynamoDB (warm & low latency)
Data → Kinesis/Kafka → NoSQL/DynamoDB → DynamoDB app → Answers (the Redshift path remains for cold, high-latency answers)
• Hourly alert
• Real-time spending alerts/caps
Pattern 4: Hadoop (warm & low latency)
Data → Kinesis/Kafka → NoSQL/DynamoDB → EMR/Hadoop → Answers (alongside the S3 → EMR/Hadoop cold path)
Pattern 5: Spark (cold & low latency)
Data → Kinesis/Kafka → S3 → Spark on EMR → Answers (alongside the EMR/Hadoop and NoSQL/DynamoDB paths)
Pattern 6: Stream processing (hot & low latency)
Data → Kinesis/Kafka → Spark Streaming/Storm → Answers, with HDFS/NoSQL/DynamoDB as backing stores and Impala for low-latency queries (alongside the EMR/Hadoop and Spark paths)
Overall Reference Architecture: Fitting it all together
[Diagram: data flows from Kinesis/Kafka into S3 and NoSQL/DynamoDB/HDFS; along the latency (low → high) and temperature (hot → cold) axes, Spark Streaming/Storm and Impala serve hot, low-latency answers; Spark and EMR/Hadoop serve the middle; EMR/Hadoop and Redshift serve cold, high-latency analysis]
Reference Architecture
Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon S3, Amazon Glacier, and Amazon Redshift, orchestrated by AWS Data Pipeline
A Video Streaming App – Discovery
[Architecture: CloudFront, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3, Amazon Glacier]
A look at how it works
Data analyzed using EMR: months of user history and common misspellings (Westen, Wistin, Westan, Whestin) yield automatic spelling corrections
Months of user search data: search terms, misspellings, final click-throughs
Yelp web site log data goes into Amazon S3.
All 200 nodes of the Amazon EMR Hadoop cluster simultaneously look for common misspellings (Westen, Wistin, Westan).
A map of common misspellings and suggested corrections is loaded back into Amazon S3.
Then the cluster is shut down; Yelp only pays for the time they used it.
Each of Yelp’s 80 engineers can do this whenever they have a big data problem. Yelp spins up over 250 Hadoop clusters per week in EMR.
Data Innovation Meets Action at Scale at NASDAQ OMX
• NASDAQ’s technology powers more than 70 marketplaces in 50 countries
• NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• NASDAQ owns & operates 26 markets, including 3 clearinghouses & 5 central securities depositories
• More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion
• NASDAQ powers 1 in 10 of the world’s securities transactions
NASDAQ’s Big Data Challenge
• Archiving Market Data
– A classic “Big Data” problem
• Power Surveillance and Business Intelligence/Analytics
• Minimize Cost
– Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
SIP Total Monthly Message Volumes: OPRA, UQDF and CQS
OPRA annual increase: 69% · CQS annual increase: 10% · UQDF annual decrease: 6%

OPRA total monthly message volume and average daily volume:
Date    Total Monthly Volume  Average Daily Volume
Aug-12  80,600,107,361        3,504,352,494
Sep-12  77,303,404,427        4,068,600,233
Oct-12  98,407,788,187        4,686,085,152
Nov-12  104,739,265,089       4,987,584,052
Dec-12  81,363,853,339        4,068,192,667
Jan-13  82,227,243,377        3,915,583,018
Feb-13  87,207,025,489        4,589,843,447
Mar-13  93,573,969,245        4,678,698,462
Apr-13  123,865,614,055       5,630,255,184
May-13  134,587,099,561       6,117,595,435
Jun-13  162,771,803,250       8,138,590,163
Jul-13  120,920,111,089       5,496,368,686
Aug-13  136,237,441,349       6,192,610,970

UQDF and CQS total monthly message volume and combined average daily volume:
Date    UQDF           CQS             Combined Avg Daily
Aug-12  2,317,804,321  8,241,554,280   459,102,548
Sep-12  1,948,330,199  7,452,279,225   494,768,917
Oct-12  1,016,336,632  7,452,279,225   403,267,422
Nov-12  2,148,867,295  9,552,313,807   557,199,100
Dec-12  2,017,355,401  8,052,399,165   503,487,728
Jan-13  2,099,233,536  7,474,101,082   455,873,077
Feb-13  1,969,123,978  7,531,093,813   500,011,463
Mar-13  2,010,832,630  7,896,498,260   495,366,545
Apr-13  2,447,109,450  9,805,224,566   556,924,273
May-13  2,400,946,680  9,430,865,048   537,809,624
Jun-13  2,601,863,331  11,062,086,463  683,197,490
Jul-13  2,142,134,920  8,266,215,553   473,106,840
Aug-13  2,188,338,764  9,079,813,726   512,188,750

[Chart: NASDAQ Exchange Daily Peak Messages]
Market Data Is Big Data
Charts courtesy of the Financial Information Forum
NASDAQ’s Legacy Solution
• On-premises MPP DB
– Relatively expensive, finite storage
– Required periodic additional expenses to add more storage
– Ongoing IT (administrative) human costs
• Legacy BI tool
– Requires developer involvement for new data sources, reports,
dashboards, etc.
New Solution: Amazon Redshift
• Cost Effective
– Redshift is 43% of the cost of legacy
• Assuming equal storage capacities
– Doesn’t include IT ongoing costs!
• Performance
– Outperforms NASDAQ’s legacy BI/DB solution
– Insert 550K rows/second on a 2 node 8XL cluster
• Elastic
– NASDAQ can add additional capacity on demand, easy to grow their cluster
New Solution: Pentaho BI/ETL
• Amazon Redshift partner: http://aws.amazon.com/redshift/partners/pentaho/
• Self service
  – Tools empower BI users to integrate new data sources, create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective
Net Result
• New solution is cheaper, faster, and offers capabilities that NASDAQ
didn’t have before
– Empowers NASDAQ’s business users to explore data like they never
could before
– Reduces IT and development as bottlenecks
– Margin improvement (expense reduction and supports business
decisions to grow revenue)
AWS is here to help
Solution Architects · Professional Services · Premium Support · AWS Partner Network (APN)

AWS Architecture Diagrams: processing large amounts of parallel data using a scalable cluster
https://aws.amazon.com/architecture/

Big Data Case Studies: learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data

AWS Marketplace: http://aws.amazon.com/marketplace

AWS Public Data Sets: free access to big data sets
aws.amazon.com/publicdatasets

AWS Training & Events: webinars, bootcamps, and self-paced labs
https://aws.amazon.com/training · aws.amazon.com/events