Date posted: 14-Jan-2015
AWS Summit 2013, Milan, 31 October 2013
Hakan Gurel
Solutions Architecture
DATA ANALYSIS ON AWS
THE COST OF GENERATING DATA IS FALLING
THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT
GENERATE STORE ANALYZE SHARE
Lower cost, higher throughput
Highly constrained
[Chart: DATA VOLUME, generated data vs. data available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
GENERATE STORE ANALYZE SHARE
ACCELERATE
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ PAY FOR ONLY WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE STORE ANALYZE SHARE
AWS Import / Export
AWS Direct Connect
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
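As a rough illustration of the "Multipart upload to S3" item above, the sketch below plans the byte ranges of the parts for one large object. It does not call AWS. The function name plan_parts and the 8 MiB default are assumptions made for this example; the 5 MiB minimum part size and the 10,000-part limit are the documented S3 limits. A real upload would pass each range to an AWS SDK.

```python
# Sketch: plan the byte ranges for an S3-style multipart upload.
# plan_parts and its default part size are hypothetical, not an AWS SDK API.

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return a list of (part_number, first_byte, last_byte) tuples."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        last_byte = min(offset + part_size, total_size) - 1
        parts.append((number, offset, last_byte))
        offset = last_byte + 1
        number += 1

    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


# Example: a 20 MiB object split into 8 MiB parts (the last part is shorter).
example_parts = plan_parts(20 * 1024 * 1024)
```

Because the parts are independent byte ranges, they can be sent in parallel and retried individually.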
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AMAZON S3: SIMPLE STORAGE SERVICE
AMAZON DYNAMODB: HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
DURABLE & AVAILABLE: CONSISTENT, DISK-ONLY WRITES (SSD)
LOW LATENCY: AVERAGE READS < 5 MS, WRITES < 10 MS
NO ADMINISTRATION
500,000 WRITES PER SECOND DURING SUPER BOWL
AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…
AMAZON REDSHIFT
A Whole Lot Simpler
A Lot Cheaper
A Lot Faster
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage
30 MINUTES DOWN TO 12 SECONDS
AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
Extra Large Node (HS1.XL): Single Node (2 TB) or Cluster of 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL): Cluster of 2-100 Nodes (32 TB – 1.6 PB)
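The cluster sizes quoted above follow from per-node storage times node count. The sketch below reproduces that arithmetic. The per-node capacities and node limits come from the slides; the dictionary layout and the function name cluster_capacity_tb are illustrative assumptions.

```python
# Sketch: Redshift cluster capacity = compressed storage per node x node count.
# Figures are taken from the slides; names and layout are hypothetical.

NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, nodes):
    """Return the compressed storage in TB for a multi-node cluster."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= nodes <= spec["max_nodes"]:
        raise ValueError(
            f"{node_type} clusters take {spec['min_nodes']}-{spec['max_nodes']} nodes"
        )
    return spec["tb_per_node"] * nodes


smallest_xl = cluster_capacity_tb("HS1.XL", 2)      # 4 TB
largest_xl = cluster_capacity_tb("HS1.XL", 32)      # 64 TB
smallest_8xl = cluster_capacity_tb("HS1.8XL", 2)    # 32 TB
largest_8xl = cluster_capacity_tb("HS1.8XL", 100)   # 1600 TB, i.e. 1.6 PB
```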
CREATE A DATA WAREHOUSE IN MINUTES
JDBC/ODBC
| | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price Per TB |
| On-Demand | $0.850 | $0.425 | $3,723 |
| 1 Year Reservation | $0.500 | $0.250 | $2,190 |
| 3 Year Reservation | $0.228 | $0.114 | $999 |
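The "effective" columns in the pricing figures above can be derived from the hourly node price. The sketch assumes a 2 TB HS1.XL node (as stated earlier in the deck) and a 365-day year. The function name effective_prices is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sketch: derive per-TB prices from the hourly price of one HS1.XL node.
# Assumptions: 2 TB per node (from the slides) and 24 x 365 hours per year.
NODE_TB = 2
HOURS_PER_YEAR = 24 * 365

HOURLY_NODE_PRICE = {
    "On-Demand": Decimal("0.850"),
    "1 Year Reservation": Decimal("0.500"),
    "3 Year Reservation": Decimal("0.228"),
}


def effective_prices(hourly_node_price):
    """Return (hourly price per TB, annual price per TB in whole dollars)."""
    per_tb_hour = hourly_node_price / NODE_TB
    per_tb_year = (per_tb_hour * HOURS_PER_YEAR).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    return per_tb_hour, per_tb_year


price_table = {name: effective_prices(p) for name, p in HOURLY_NODE_PRICE.items()}
```

Decimal is used instead of float so the rounding to whole dollars is exact.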
DATA WAREHOUSING DONE THE AWS WAY
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
Easy to provision and scale up massively
USAGE SCENARIOS
Cloud ETL for Big Data
[Diagram: S3, EMR, Redshift, Reporting and BI]
• Maintain online SQL access to historical logs
• Transformation and enrichment with EMR
• Longer history ensures better insight
Live archive for (structured) Big Data
[Diagram: DynamoDB (OLTP Web Apps), Redshift (Reporting and BI)]
• Direct integration with copy command
• High velocity data
• Data ages into Redshift
• Low cost, high scale option for new apps
Reporting Warehouse
[Diagram: RDBMS (OLTP, ERP), Redshift (Reporting and BI)]
• Accelerated operational reporting
• Support for short-time use cases
• Data compression, index redundancy
On-Premises Integration
[Diagram: RDBMS (OLTP, ERP), Redshift (Reporting & BI)]
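The "direct integration with copy command" point in the live archive scenario refers to Redshift's COPY statement, which can read from a DynamoDB table. The sketch below only builds the SQL text. The helper name build_dynamodb_copy and all table and role names are hypothetical, and the inputs are assumed to be trusted identifiers, not user input.

```python
# Sketch: build a Redshift COPY statement that loads from a DynamoDB table.
# build_dynamodb_copy and every name passed to it are hypothetical examples.


def build_dynamodb_copy(target_table, dynamodb_table, iam_role_arn, read_ratio=50):
    """Return the COPY statement as a string; nothing is executed here."""
    # READRATIO is the percentage of the DynamoDB table's provisioned read
    # throughput that the load may consume (valid range 1-200).
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {target_table} "
        f"FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"READRATIO {read_ratio};"
    )


copy_sql = build_dynamodb_copy(
    "web_events",                                         # hypothetical Redshift table
    "WebEvents",                                          # hypothetical DynamoDB table
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole", # placeholder role ARN
)
```

The resulting string could be run from any SQL client connected to the cluster.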
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon Elastic MapReduce
AMAZON EC2: ELASTIC COMPUTE CLOUD
CLUSTER GPU QUADRUPLE EXTRA LARGE
2x Intel Xeon X5570, quad-core Nehalem architecture
2x NVIDIA Tesla Fermi M2050 GPUs
22 GB of memory, 1.7 TB of storage
ON A SINGLE INSTANCE
COMPUTE TIME: 4h
COST: 4h x $2.1 = $8.4
ON MULTIPLE INSTANCES
COMPUTE TIME: 1h
COST: 1h x 4 x $2.1 = $8.4
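The comparison above can be checked with a few lines of arithmetic. The sketch assumes the job parallelizes perfectly across instances and that billing is per instance-hour at the $2.1 rate from the slide. The function name job_cost is hypothetical.

```python
from decimal import Decimal

# Sketch: cost of a job = instances x hours per instance x hourly rate.
# Assumes perfect parallel speed-up; the rate is the one quoted on the slide.
HOURLY_RATE = Decimal("2.1")


def job_cost(instances, hours_per_instance, hourly_rate=HOURLY_RATE):
    """Return the total cost in dollars."""
    return instances * hours_per_instance * hourly_rate


single_instance_cost = job_cost(instances=1, hours_per_instance=4)
four_instance_cost = job_cost(instances=4, hours_per_instance=1)
```

Both runs consume four instance-hours, so the cost is the same while the result arrives four times sooner.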
For 3 hours: $4828.85/hr instead of $20+ MILLIONS in infrastructure
AMAZON ELASTIC MAPREDUCE: HADOOP AS A SERVICE
• A FRAMEWORK
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS
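The three framework steps above (split, process, gather) can be illustrated with a word count in plain Python. This is a single-process sketch of the idea, not Hadoop or the Elastic MapReduce API. All function names are made up for the example.

```python
from collections import Counter
from functools import reduce

# Sketch of the split / process / gather pattern as an in-memory word count.


def split_into_pieces(records, pieces):
    """Split the input into `pieces` roughly equal lists."""
    return [records[i::pieces] for i in range(pieces)]


def process_piece(piece):
    """Map step: count words in one piece, independently of the others."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts


def gather(partial_counts):
    """Reduce step: merge the per-piece counts into one result."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


def word_count(records, pieces=3):
    return gather(process_piece(p) for p in split_into_pieces(records, pieces))


totals = word_count(["a b", "b c", "c c"])
```

In a real cluster each piece would be processed on a different node; here the pieces are simply handled one after another.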
[Diagram sequence: Corporate Data Center and Elastic Data Center]
Step 1. Application data and logs for analysis pushed to S3
Step 2. Amazon Elastic MapReduce name node to control analysis
Step 3. Hadoop cluster started by Elastic MapReduce
Step 4. Adding many hundreds or thousands of nodes
Step 5. Disposed of when job completes
Step 6. Results of analysis pulled back into your systems
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
GENERATE STORE ANALYZE SHARE
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage compute resources
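Two of the ideas listed above, mapping data dependencies and retrying failed steps, can be sketched in a few lines. This is a generic illustration using the standard library's graphlib module (Python 3.9+), not the AWS Data Pipeline API. The function run_pipeline and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Sketch: run tasks in dependency order, retrying each one a few times.
# This is a generic illustration, not the AWS Data Pipeline API.


def run_pipeline(tasks, dependencies, max_attempts=3):
    """tasks: name -> callable; dependencies: name -> set of prerequisite names.

    Returns the list of task names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
        completed.append(name)
    return completed


# Example: "transform" fails once with a transient error, then succeeds.
attempts = {"transform": 0}


def flaky_transform():
    attempts["transform"] += 1
    if attempts["transform"] < 2:
        raise RuntimeError("transient failure")


run_order = run_pipeline(
    {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None},
    {"transform": {"extract"}, "load": {"transform"}},
)
```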
GENERATE: AWS Import / Export, AWS Direct Connect
STORE: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2
ANALYZE: Amazon EC2, Amazon Elastic MapReduce
SHARE: Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2
AWS Data Pipeline
FROM DATA TO ACTIONABLE INFORMATION
Stefano Rodighiero
Words matter
MXM FACTS
7+ million lyrics catalogue in more than 50 distinct languages
Music Discography Metadata: Lyrics, Artists, Albums, Songs, Biographies, Worldwide Charts
Synced lyrics! Updated daily, with more than 1 million artists and more than 20 million music tracks
Currently musiXmatch is the only lyrics platform allowed for worldwide licensing and has deals with top Music Publishers: Warner Chappell, Universal, BMG, EMI Publishing, Sony ATV, Peer Music, ...
OUR DATA
SYNCED LYRICS
MUSIC METADATA: RECORDING & PUBLISHING
CONTENT USAGE: REPORTING & ANALYTICS
OTHER SOURCES
DATA ANALYSIS @ MXM
DATAFLOW
[Diagram: Publishing catalogue, "Unrolling", Frontend, Filter/normalization, Hive, Post process, Batch, Redis (real time analytics), Analytics, Redshift]
BATCH REPORTING
[Diagram highlight: Hive, Post process, Batch]
Step 1. Aggregation of views by country, application and content type
Step 2. Join with a 500M+ rows table
It takes approx. 1 hour with 5 c1.xlarge instances
It used to take days with traditional techniques!
SQL interface makes it easier to review and share the process
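The two batch steps above map naturally onto a GROUP BY followed by a JOIN. The sketch below runs that shape of query on a tiny in-memory SQLite database so it can be tried locally. The table names, column names, join key (track_id) and sample rows are all invented for illustration, and the real job ran on Hive, not SQLite.

```python
import sqlite3

# Sketch: Step 1 (aggregate views) and Step 2 (join with a catalogue table).
# Tables, columns, the track_id join key and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE views (track_id INTEGER, country TEXT, application TEXT, content_type TEXT);
    CREATE TABLE catalogue (track_id INTEGER PRIMARY KEY, publisher TEXT);
    INSERT INTO views VALUES (1, 'IT', 'mobile', 'lyrics');
    INSERT INTO views VALUES (1, 'IT', 'mobile', 'lyrics');
    INSERT INTO views VALUES (2, 'US', 'web', 'lyrics');
    INSERT INTO catalogue VALUES (1, 'Publisher A');
    INSERT INTO catalogue VALUES (2, 'Publisher B');
    """
)

REPORT_SQL = """
WITH aggregated AS (
    -- Step 1: aggregate views by country, application and content type
    SELECT track_id, country, application, content_type, COUNT(*) AS view_count
    FROM views
    GROUP BY track_id, country, application, content_type
)
-- Step 2: join the aggregate with the (much larger) catalogue table
SELECT c.publisher, a.country, a.application, a.content_type, a.view_count
FROM aggregated AS a
JOIN catalogue AS c ON c.track_id = a.track_id
ORDER BY c.publisher, a.country
"""

report_rows = conn.execute(REPORT_SQL).fetchall()
conn.close()
```

Keeping the whole report as one SQL statement is what makes the process easy to review and share.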
INTERACTIVE ANALYTICS
[Diagram highlight: Redis (real time analytics), Analytics, Redshift, Interactive]
SQL interface like Hive, accessible with any PostgreSQL client...
...but faster!
Flexible costs
With Redshift doing all the heavy lifting, it's easier to build analytics tools
DATAFLOW
[Diagram: Publishing catalogue, "Unrolling", Frontend proxy, Filter/normalization, Hive, Post process, Batch, Redis (real time analytics), Analytics, Redshift, Interactive]
THANK YOU!
MUSIXMATCH