Date post: | 08-Sep-2014 |
Category: | Technology |
Upload: | amazon-web-services |
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Big Data Use Cases and
Solutions in the AWS Cloud
Ben Butler, @bensbutler, Sr. Mgr., Big Data & HPC
July 10, 2014
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Big Data: Unconstrained data growth
95% of the 1.2 zettabytes
of data in the digital
universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with estimates
of compound annual growth rate (CAGR) at 62%
Source: IDC
[Scale graphic: GB, TB, PB, EB, ZB]
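The growth figures above can be turned into a rough projection. The sketch below is only illustrative: it assumes the 62% CAGR applies to the unstructured share (95% of 1.2 zettabytes) and stays constant, which the slide does not claim. The function names are hypothetical.

```python
import math


def project_volume(initial_zb: float, cagr: float, years: int) -> float:
    """Volume after `years` of compound growth; cagr=0.62 means 62% per year."""
    return initial_zb * (1.0 + cagr) ** years


def doubling_time_years(cagr: float) -> float:
    """Years needed for the volume to double at a constant compound rate."""
    return math.log(2.0) / math.log(1.0 + cagr)


if __name__ == "__main__":
    # Assumption: starting point is 95% of 1.2 ZB, both taken from the slide.
    unstructured_zb = 0.95 * 1.2
    for year in range(6):
        volume = project_volume(unstructured_zb, 0.62, year)
        print(f"year {year}: {volume:.2f} ZB")
    print(f"doubling time: {doubling_time_years(0.62):.2f} years")
```

At a constant 62% rate the volume doubles in roughly a year and a half, which is the practical meaning of "explosive" here.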
The amount of information generated during the first day of
a baby’s life today is equivalent to 70 times the information
contained in the Library of Congress
Generation: lower cost, higher throughput
Collection & storage, analytics & computation, collaboration & sharing: highly constrained
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Chart: Data volume gap, 1990 to 2020. Series: generated data vs. data available for analysis]
Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= Remove constraints
Accelerated
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Big Data: Technologies and techniques for working
productively with data, at any scale.
Big data and AWS Cloud computing
Big data: Variety, volume, and velocity requiring new tools
Cloud computing: Variety of compute, storage, and networking options
Big data: Potentially massive datasets
Cloud computing: Massive, virtually unlimited capacity
Big data: Iterative, experimental style of data manipulation and analysis
Cloud computing: Iterative, experimental style of infrastructure deployment/usage
Big data: Frequently not steady-state workload; peaks and valleys
Cloud computing: At its most efficient with highly variable workloads
Big data: Absolute performance not as critical as “time to results”; shared resources are a bottleneck
Cloud computing: Parallel compute projects allow each workgroup to have more autonomy, get faster results
One tool to
rule them all
Use the right tools
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Amazon Elastic MapReduce
Store anything
Object storage
Scalable
99.999999999% durability
Amazon S3
Real-time processing
High throughput; elastic
Easy to use
Integrations: EMR, S3, Redshift, DynamoDB
Amazon Kinesis
NoSQL Database
Seamless scalability
Zero admin
Single digit millisecond latency
Amazon DynamoDB
Relational data warehouse
Massively parallel
Petabyte scale
Fully managed
$1,000/TB/Year
Amazon Redshift
Try Amazon Redshift with BI & ETL for Free!
aws.amazon.com/redshift/free-trial
2 months | 750 hours/month | dw2.large SSD instance
160GB of compressed storage per node
Try BI & ETL for free from nine partners at
aws.amazon.com/redshift/partners
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3, DynamoDB, and Kinesis
Amazon Elastic MapReduce
ODBC and JDBC drivers now available for Amazon EMR
Amazon EMR now ships with ODBC and JDBC drivers for Hive, Impala, and HBase
Easier to use popular BI tools like Microsoft Excel, Tableau, MicroStrategy, and QlikView
The right tools.
At the right scale.
At the right time.
[Architecture diagram, built up over several slides. Components: Data Sources; Amazon Kinesis; AWS Data Pipeline; Amazon S3; Amazon DynamoDB; Amazon Redshift; Amazon RDS; Amazon EMR; HDFS. Layer labels: Data management; Hadoop Ecosystem analytical tools]
Free steak campaign
Disaster recovery
Web site & media sharing
Facebook app
Ground campaign
SAP & SharePoint
Marketing web site
Business line of sight
Consumer social app
IT operations
Mars exploration ops
Interactive TV apps
Media streaming
Consumer social app
Facebook page
Securities Trading Data Archiving
Financial markets analytics
Web and mobile apps
Big data analytics
Digital media
Ticket pricing optimization
Streaming webcasts
Mobile analytics
Consumer social app
Core IT and media
Customer Use Cases of Big Data
Dropcam is the biggest inbound video service
on the Web
More data uploaded per
minute than YouTube
Petabytes of data
processed every month
Billions of motion events
detected
4 months to production
300% speed gain
$500k - $1M in CAPEX saved
500MM tweets/day = ~ 20.8MM tweets/hr
2k/tweet is ~12MB/sec, need 6 shards, ~1TB/day
$0.015/hour per shard, $0.028/million PUTS
Kinesis cost is $0.765/hour
Redshift cost is $0.850/hour (for a 2TB dw1.xlarge)
Total: $1.615/hour
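The arithmetic on this slide can be checked with a short script. This is a sketch under stated assumptions, not the presenter's own calculation: it reads "2k/tweet" as 2,000 bytes and assumes each shard accepts up to 1 MB/s and 1,000 records/s of writes, limits that are not stated on the slide. All prices are the slide's own.

```python
import math

TWEETS_PER_DAY = 500_000_000      # from the slide
BYTES_PER_TWEET = 2_000           # "2k/tweet", read here as 2,000 bytes
SHARD_PRICE_PER_HOUR = 0.015      # from the slide
PUT_PRICE_PER_MILLION = 0.028     # from the slide
REDSHIFT_PRICE_PER_HOUR = 0.850   # from the slide (2TB dw1.xlarge)

# Assumed per-shard write limits; not stated on the slide.
SHARD_MAX_BYTES_PER_SEC = 1_000_000
SHARD_MAX_RECORDS_PER_SEC = 1_000


def estimate() -> dict:
    """Recompute throughput, shard counts, and hourly cost from the inputs."""
    tweets_per_sec = TWEETS_PER_DAY / 86_400
    tweets_per_hour = TWEETS_PER_DAY / 24
    bytes_per_sec = tweets_per_sec * BYTES_PER_TWEET

    # A stream needs enough shards to satisfy both limits.
    shards_by_records = math.ceil(tweets_per_sec / SHARD_MAX_RECORDS_PER_SEC)
    shards_by_bytes = math.ceil(bytes_per_sec / SHARD_MAX_BYTES_PER_SEC)
    shards = max(shards_by_records, shards_by_bytes)

    put_cost = tweets_per_hour / 1_000_000 * PUT_PRICE_PER_MILLION
    kinesis_cost = shards * SHARD_PRICE_PER_HOUR + put_cost
    return {
        "mb_per_sec": bytes_per_sec / 1_000_000,
        "tb_per_day": TWEETS_PER_DAY * BYTES_PER_TWEET / 1e12,
        "shards_by_records": shards_by_records,
        "shards_by_bytes": shards_by_bytes,
        "kinesis_per_hour": kinesis_cost,
        "total_per_hour": kinesis_cost + REDSHIFT_PRICE_PER_HOUR,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions the record-rate limit alone gives 6 shards, matching the slide, while the byte-rate limit gives 12. The slide's $0.765/hour Kinesis figure is closer to the 12-shard result than to the 6-shard one, so the shard count is worth re-deriving for a real workload.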
Cost &
Scale
“THANKS TO AMAZON WEB SERVICES, WE CAN DELIGHT OUR PLAYERS WORLDWIDE.”
Sami Yliharju | Services Lead
The Climate Corporation - Weather Insurance for Farms
Challenge: Volatile weather is deadly to crops like grapes
Solution: Built a predictive model based on freely available data:
• 60 years of crop data,
• 14 TB of soil data, and
• 1M government Doppler radar points
• 50 EMR clusters process new data as it comes into S3 each day, continuously updating the model.
150B Soil Observations
3M Daily Weather Measurements
850K Precision Rainfall Grids Tracked
200 TB in Amazon S3
Foursquare…
33 million users
1.3 million businesses
…generates a lot of data
3.5 billion check-ins, 15M+ venues, terabytes of log data
Uses EMR for
Evaluation of new features
Machine learning
Exploratory analysis
Daily customer usage reporting
Long-term trend analysis
Benefits of Amazon EMR
Ease-of-Use: “We have decreased the processing time for urgent data-analysis”
Flexibility: To deal with changing requirements & dynamically expand reporting clusters
Costs: “We have reduced our analytics costs by over 50%”
[Charts: Who is checking in? Distribution by gender (Female, Male) and by age (0 to 80)]
[Chart: When do people go to a place? Venues: Gorilla Coffee, Gray's Papaya, Amorino; days: Thursday to Sunday]
User Sign-ups
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon DynamoDB
Amazon RDS
Amazon Redshift
AWS Direct Connect
AWS Storage Gateway
AWS Import/Export
Amazon Glacier
S3
Amazon Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2
Amazon EMR
Amazon Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Redshift
Amazon DynamoDB
Amazon RDS
S3
Amazon EC2
Amazon EMR
Amazon CloudFront
AWS CloudFormation
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
DataXu in the Cloud
Yekesa Kosuru, V.P. Technology
July 10th 2014
What is DataXu?
• Digital Marketing Platform, Ad Tech Platform
• Real-time Multivariate Decision System
• 5th Fastest Growing Private Company in U.S. (Inc. 500)
• Optimize Digital Marketing Campaigns
– …put the right ad campaign in front of the right customer
– …find customers who left their site without converting
– …find more customers who are likely to convert
– …offer insight into the who, why, when, and where of respondents
• 950,000 times per second
Big Data, Little Decisions
The Evolution of Real-Time Decision Systems
[Chart: decision impact (also proportional to risk) vs. decision rate; volume ~ value]
1. 1990’s – “Should we advertise on the Super Bowl? Should we run direct mail this qtr.?” Batch mode
2. 2000’s – “How often can we run a permission-based email mktg. campaign?” Rules-based alerts
3. 2010’s – Millions of decisions and actions taken, all in less than a blink of an eye
Real Time Bidding
1. User opens browser
2. Goes to sports site
3. Site auctions ads, e.g
4. DataXu bids (others bid too)
5. DataXu wins bid
6. Ad shown, page loads
Quick Statistics
• 950K bid requests per second
• Billions of impressions per month, Petabyte of data
• 100 ms round trip response time
• 100+TB of warehouse data
• 3000+ Servers powering the platform
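To get a feel for these numbers, the request rate and response time can be combined with Little's law (average requests in flight = arrival rate × time in system). This is a back-of-the-envelope sketch, assuming the 950K requests per second is sustained and that every request takes the full 100 ms. Both are simplifications, and the function names are hypothetical.

```python
def in_flight_requests(requests_per_sec: float, response_time_sec: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_sec * response_time_sec


def requests_per_month(requests_per_sec: int, days: int = 30) -> int:
    """Total requests over `days` at a sustained rate (an upper-bound style estimate)."""
    return requests_per_sec * 86_400 * days


if __name__ == "__main__":
    rate = 950_000   # bid requests per second, from the slide
    latency = 0.100  # 100 ms round trip, from the slide
    print(f"requests in flight: {in_flight_requests(rate, latency):,.0f}")
    print(f"requests per 30-day month: {requests_per_month(rate):,}")
```

The concurrency figure is the number of bid requests the platform must hold open at any instant, which is one reason elastic capacity matters for this workload.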
Why AWS
• Automation, API
• Costs, Pay As You Go
• Auto Scaling (elasticity – up and down)
• All Data in One Place (S3 foundational store)
• Improved Testability
• Security, Privacy
• Disaster Recovery and Business Continuity
DataXu Stack
[Architecture diagram. Components: Campaign Management; Business Intelligence; Data Mart; Interactive Queries; Batch Queries; Real Time Bidding System; Activity Logs; 1st Party; 3rd Party; Distributed Log Ingestion; S3/HDFS Warehouse; CDN; User Profiles; Campaign Metadata; ETL; Attribution; Machine Learning; Spend; Decision System; Audience Calculation; Uniques/Segment]
Big Velocity: 950K TPS
Big Volume: Petabyte of Data
Big Variety: Data Providers
High Level Deployment
[Deployment diagram. Components: On Premise; SSL; Meta; Amazon S3; Route 53; Elastic Load Balancing; two Availability Zones; EC2; Auto Scaling groups; Volumes; AMI; RTB System; Log Ingestion System; Machine Learning System; EMR; CloudWatch]
Traditional Hadoop vs EMR
• Traditional Hadoop
– Anticipate and provision for peaks
– Can't de-couple storage and compute
– 75% of the cluster is idle
– Data Duplication/Multiple Clusters
• EMR to the rescue
• Monthly savings of 72% using EMR
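One rough way to see why an idle cluster is expensive is to compare paying for a fixed, peak-sized cluster all month with paying only for the hours it is busy. The sketch below is a simplified model, not how the 72% figure was measured: the node count is hypothetical, the $0.34 hourly price is borrowed from the m1.xlarge figure later in this deck, and storage costs, service surcharges, and cluster start-up time are ignored.

```python
def always_on_cost(nodes: int, price_per_node_hour: float, hours: float) -> float:
    """Cost of keeping a peak-sized cluster running for the whole period."""
    return nodes * price_per_node_hour * hours


def pay_per_use_cost(nodes: int, price_per_node_hour: float, hours: float,
                     busy_fraction: float) -> float:
    """Cost when nodes are only paid for during the busy share of the period."""
    return always_on_cost(nodes, price_per_node_hour, hours) * busy_fraction


def savings_fraction(baseline: float, alternative: float) -> float:
    """Relative saving of `alternative` compared with `baseline`."""
    return 1.0 - alternative / baseline


if __name__ == "__main__":
    nodes, price, hours = 100, 0.34, 720  # hypothetical cluster, 30-day month
    fixed = always_on_cost(nodes, price, hours)
    elastic = pay_per_use_cost(nodes, price, hours, busy_fraction=0.25)
    print(f"always on:   ${fixed:,.0f}")
    print(f"pay per use: ${elastic:,.0f}")
    print(f"savings:     {savings_fraction(fixed, elastic):.0%}")
```

With 75% of the cluster idle, the idealized saving is 75%; real savings would be somewhat lower once the ignored costs are included.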
S3 Provides Linearly Scalable Bandwidth
• Big-volume workloads combine several datasets and terabytes of data
• Aggregate bandwidth matters
• S3 scales pretty linearly
S3 Streaming Performance (m1.xlarge @ $0.34/hr)
100 VMs; 9.6GB/s; $34/hr
350 VMs; 28.7GB/s; $119/hr
34 secs per terabyte
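The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the slide's numbers. The one assumption is that a terabyte is counted as 1,000 GB, and the helper names are hypothetical.

```python
PRICE_PER_VM_HOUR = 0.34  # m1.xlarge, from the slide

# The two measurements reported on the slide.
SMALL_RUN = {"vms": 100, "gb_per_sec": 9.6}
LARGE_RUN = {"vms": 350, "gb_per_sec": 28.7}


def cluster_cost_per_hour(vms: int) -> float:
    """Hourly cost of the VM fleet at the slide's per-VM price."""
    return vms * PRICE_PER_VM_HOUR


def per_vm_mb_per_sec(run: dict) -> float:
    """Average streaming throughput each VM contributes, in MB/s."""
    return run["gb_per_sec"] * 1000 / run["vms"]


def seconds_per_terabyte(gb_per_sec: float, gb_per_tb: float = 1000.0) -> float:
    """Time to stream one terabyte at the given aggregate bandwidth."""
    return gb_per_tb / gb_per_sec


def scaling_efficiency(small: dict, large: dict) -> float:
    """Bandwidth growth divided by fleet growth; 1.0 would be perfectly linear."""
    bandwidth_ratio = large["gb_per_sec"] / small["gb_per_sec"]
    fleet_ratio = large["vms"] / small["vms"]
    return bandwidth_ratio / fleet_ratio


if __name__ == "__main__":
    for run in (SMALL_RUN, LARGE_RUN):
        print(f"{run['vms']} VMs: ${cluster_cost_per_hour(run['vms']):.0f}/hr, "
              f"{per_vm_mb_per_sec(run):.0f} MB/s per VM")
    print(f"seconds per TB at 350 VMs: "
          f"{seconds_per_terabyte(LARGE_RUN['gb_per_sec']):.1f}")
    print(f"scaling efficiency: {scaling_efficiency(SMALL_RUN, LARGE_RUN):.0%}")
```

The hourly costs reproduce the slide's $34 and $119, and the time per terabyte comes out just under 35 seconds, in line with the slide's "34 secs". Going from 100 to 350 VMs retains about 85% of perfectly linear scaling, which fits the "pretty linearly" description.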
Getting Started with
Big Data on AWS
AWS is here to help
Solution Architects
Professional Services
Premium Support
AWS Partner
Network (APN)
aws.amazon.com/partners/competencies/big-data
Partner with an AWS Big Data expert
AWS Architecture Diagrams
Processing large amounts of parallel data using a scalable cluster
https://aws.amazon.com/architecture/
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Marketplace
AWS Online Software Store
Shop the big data category
aws.amazon.com/marketplace
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
AWS Grants Program
AWS in Education
aws.amazon.com/grants
AWS Big Data Test Drives
APN Partner-provided labs
aws.amazon.com/testdrive/bigdata
AWS Training & Events
Webinars, Bootcamps, and Self-Paced Labs
https://aws.amazon.com/training
aws.amazon.com/events
Big Data on AWS
Course on Big Data
aws.amazon.com/training/course-descriptions/bigdata
reinvent.awsevents.com
aws.amazon.com/big-data
Thank you!
Ben Butler, @bensbutler, Sr. Mgr., Big Data
July 10, 2014 – http://aws.amazon.com/big-data