Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift
Welcome
Maya Cabassi
Partner Marketing Manager
Amazon Web Services
Webinar Overview
Submit Your Questions using the Q&A tool.
A copy of today’s presentation will be made available on:
AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/
AWS Webinar Channel on YouTube: http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw
Introducing
Tina Adams, Senior Product Manager, Amazon Web Services
Keenan Rice, VP, Marketing & Alliances, Looker
Justin Rosenthal, Chief Technology Officer, MessageMe
What We’ll Cover
• Overview of Amazon Redshift data warehouse
• How Looker integrates with Amazon Redshift to enable big data analytics in the cloud
• How MessageMe turns application metrics stored in Amazon Redshift into actionable insights with Looker BI
• Q&A
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Tina Adams | [email protected]
Senior Product Manager
We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
Data warehousing done the AWS way
• Easy to provision
• Pay as you go, no up front costs
• Fast, cheap, easy to use
• SQL
Deploy
Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business; provision in minutes
Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Channel
Amazon Redshift Customers
Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
Unload Encrypted Files
DUB (4/25)
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
4 byte UTF-8 (7/18)
Statement Timeout (7/22)
SHA1 Builtin (7/15)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts (11/13)
SOC1/2/3 (5/8)
Sharing snapshots (7/18)
Resource Level IAM (8/9)
PCI (8/22)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon S3, DynamoDB, EMR/HDFS/SSH
– Kinesis integration
• Hardware optimized for data processing
• Scale while remaining online from a single node to a 100 node 1.6 PB cluster
10 GigE
(HPC)
Ingestion Backup Restore
JDBC/ODBC
Amazon Redshift is priced to let you analyze all your data
Plan                 Effective Hourly Price (single node)   Effective Hourly Price per TB   Effective Annual Price per TB
On-Demand            $0.850                                 $0.425                          $3,723
1 Year Reservation   $0.500                                 $0.250                          $2,190
3 Year Reservation   $0.228                                 $0.114                          $999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
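The figures in the pricing table can be reproduced with a little arithmetic. The sketch below is illustrative only: the 2 TB per node figure is an assumption inferred from the table itself ($0.850 per node-hour divided by $0.425 per TB-hour), and the function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

# Effective hourly prices per node, taken from the pricing table above.
HOURLY_PRICE_PER_NODE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

# Assumption: 2 TB of storage per node, inferred from $0.850 / $0.425 = 2.
TB_PER_NODE = 2.0


def annual_price_per_tb(hourly_price_per_node: float) -> float:
    """Effective annual price per TB for a node that runs all year."""
    return hourly_price_per_node / TB_PER_NODE * HOURS_PER_YEAR


def hourly_cluster_cost(node_count: int, hourly_price_per_node: float) -> float:
    """Simple pricing: number of nodes x cost per hour (no leader node charge)."""
    return node_count * hourly_price_per_node


if __name__ == "__main__":
    for plan, price in HOURLY_PRICE_PER_NODE.items():
        print(f"{plan}: ${annual_price_per_tb(price):,.0f} per TB per year")
```

Under these assumptions the 3 Year Reservation row works out to roughly $999 per TB per year, which matches the "less than $1,000/TB/Year" headline.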
Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
• SOC1/2/3, PCI level 1, and others coming soon
10 GigE (HPC)
Ingestion Backup Restore
Customer VPC
Internal Security Group
JDBC/ODBC
Amazon Redshift integrates with multiple data sources
Amazon EMR
Amazon DynamoDB
Amazon RDS
Amazon Redshift
Amazon S3
Amazon Kinesis
JDBC
ODBC
Corporate Datacenter
Analytics For Today’s Data-Driven Organizations
Keenan Rice, Vice President, Marketing & Alliances
1.28.14
The New Data Landscape
The Missed Innovation Cycle
The Next Generation
Innovative Customers
MessageMe Intro
New Data Landscape
Business Data
Data Analysts Data Modeling
New MPP Databases
Ridiculous Quantities of
Event & Business Data
ETL Data Warehouse
Business Users Limited data discovery
Data Analysts New Breed of Data Experts
Business Users New Curious Generation
Expect Immediate Results
Back to hand-coding SQL
Missed Innovation Cycle
BI is a relic of the old (expensive) data landscape
Event & Business
Application Data
Traditional
BI
New MPP databases
BI Software Heavy desktop apps
No reusability
One-time-use queries
No direct data access
Cubes / Simple models
Data Analysts New Breed of Data Experts
Business Users New Curious Generation
Expect Immediate Results
BI Software Web Based App
Data Analysts Flexible Delivery
Agile Modeling
Load
Transform Query
Looker — The Next Generation Modern analytics, built for the new data landscape
Business Users High-Resolution Discovery
Sharing & Collaboration
Looker Inside
BI Software Web Based App
• Near real-time access to your Redshift data
• Exploit the computing power of the AWS cloud and Redshift
• No need to re-architect or cube data
Load
Transform Query
Copy
Business Users High-Resolution Discovery
Sharing & Collaboration
Looker Intelligence
BI Software Web Based App
• Extend the power of your data analysts
• Fold data as complex as necessary without any database effort
• Use Git for agile team development
Transform Query
Data Analysts Flexible Delivery
Agile Modeling
Copy
Looker Everywhere
BI Software Web Based App
• Powerful data discovery for anyone
• Share, save, and collaborate
• Access all the data, in an interactive web application
Business Users High-Resolution Discovery
Sharing & Collaboration
Transform Query
© 2014 Looker Inc. All Rights Reserved.
A New Perspective Changing the way organizations make decisions
Founded in Santa Cruz, California 2012
$18M Redpoint, First Round Capital & Pivot North
1200 Hours per month spent in Looker per customer
50+ Customers changing how they run their businesses
Lloyd Tabb: Created first app server (Netscape), founder Mozilla.org, LiveOps, etc.
Frank Bien: Proven software exec: Greenplum, EMC
Marc Randolph: Founder and first CEO Netflix
Who’s Lookering? Data-driven organizations realizing the power of Looker
Powering Analytics @ MessageMe
1. Redshift + Looker
2. Example Looker Report & Model
3. MessageMe Data Storage
4. Analytics Strategies
5. DynamoDB → Redshift
Empower your team to answer their own questions.
Redshift + Looker
• What types of Stickers are sent the most?
• How do event/holiday themed-packs perform?
Internal dashboards and Looker link-sharing are commonplace.
Looker makes the data accessible and Redshift makes it fast.
• Which SMS provider is most cost-effective?
Data Storage: Why Redshift?
At Launch:
• DynamoDB for all application data
• MySQL for all statistics data
RDS Config (March 2013)
Master: db.m1.xlarge (15GB)
Slave: db.m1.xlarge (15GB)
RDS Config (April 2013)
Master: db.m1.xlarge (15GB)
Slave: db.m2.4xlarge (68GB)
90% of writes were via LOAD DATA INFILE, so write IOPS were not a problem.
However, index sizes were growing quickly…
Data Storage: Why Redshift?
MySQL Status (April 2013)
Table     Engine   Index Width      Row Count    Index Size
event     InnoDB   48 Bytes / Row   ~3 Billion   144 GB
message   InnoDB   32 Bytes / Row   ~2 Billion   64 GB
Slave: db.m2.4xlarge (68GB)
We could put data in, but we couldn’t get it back out!
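The index sizes on this slide follow directly from index width times row count. The sketch below reproduces that arithmetic and compares the total with the slave's memory. Reading the gap as the reason reads became slow is an interpretation of the slide, not something it states outright.

```python
GB = 10 ** 9  # decimal gigabytes, which is what the slide's figures imply

# (index width in bytes per row, approximate row count), from the slide above.
TABLES = {
    "event": (48, 3_000_000_000),
    "message": (32, 2_000_000_000),
}

SLAVE_MEMORY_GB = 68  # db.m2.4xlarge


def index_size_gb(bytes_per_row: int, row_count: int) -> float:
    """Approximate index size in GB: width per row times number of rows."""
    return bytes_per_row * row_count / GB


if __name__ == "__main__":
    total = 0.0
    for name, (width, rows) in TABLES.items():
        size = index_size_gb(width, rows)
        total += size
        print(f"{name}: {size:.0f} GB of index")
    print(f"total: {total:.0f} GB of index vs {SLAVE_MEMORY_GB} GB of instance memory")
```

With these numbers the two indexes together come to about 208 GB, roughly three times the memory of the largest instance in the April 2013 configuration.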
Possible Solutions
1. Summarize
• PRO: Data compression
• CON: Data loss
2. Shard
• PRO: No data loss
• CON: Difficult to query
3. Redshift?
Data Storage: Why Redshift?
Data Storage: Current System
Redshift (90%)
• Append-only tables
• Delayed, bulk inserts OK
Examples:
• `event`
• `message`
• `user_demographic`
MySQL (10%)
• Inline inserts
• Non-negotiable uniqueness requirements (ON DUPLICATE KEY UPDATE)
Examples:
• `purchase`
• `user_demographic`
Analytics Strategies w/ Billions of Rows
Deep-dive queries w/ row-level specifics
Super fast top-line metrics, aggregates
vs.
You get this out-of-the-box with Redshift
How do we get these, really fast?
1. Summarization
2. Cached Derived Tables
How many doodles were sent each day in the US since we launched?
100 seconds vs. 3 seconds
Analytics Strategies: Summarization
message
Columns
`id`
`sender_id`
`recipient_user_id`
`recipient_room_id`
`message_type`
`country`
`os_family`
`os_version`
`app_version`
`timestamp`
Rows / Day 10-100,000,000
sm_message
Columns
`send_hour`
`recipient_type`
`message_type`
`country`
`os_family`
`send_count`
Rows / Day 10-100,000
1,000:1 Compression
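The summarization step above can be sketched as a single GROUP BY that rolls row-level `message` data up into hourly `sm_message` counts. This is a minimal illustration that uses Python's built-in sqlite3 module in place of Redshift. The sample rows, the rule for deriving `recipient_type`, and the timestamp format are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
    id INTEGER PRIMARY KEY,
    sender_id INTEGER,
    recipient_user_id INTEGER,   -- assumed: set for one-to-one messages
    recipient_room_id INTEGER,   -- assumed: set for room messages
    message_type TEXT,
    country TEXT,
    os_family TEXT,
    os_version TEXT,
    app_version TEXT,
    timestamp TEXT               -- 'YYYY-MM-DD HH:MM:SS' in this sketch
);
""")

# Made-up sample rows, for illustration only.
sample_rows = [
    (1, 10, 20, None, "doodle", "US", "iOS", "7.0", "1.5", "2014-01-28 09:05:00"),
    (2, 11, 21, None, "doodle", "US", "iOS", "7.0", "1.5", "2014-01-28 09:40:00"),
    (3, 12, None, 7, "text", "US", "Android", "4.4", "1.5", "2014-01-28 09:59:00"),
    (4, 10, 22, None, "doodle", "US", "iOS", "7.0", "1.5", "2014-01-29 18:15:00"),
]
conn.executemany("INSERT INTO message VALUES (?,?,?,?,?,?,?,?,?,?)", sample_rows)

# Roll row-level data up to one row per hour and dimension combination.
conn.execute("""
CREATE TABLE sm_message AS
SELECT
    strftime('%Y-%m-%d %H:00:00', timestamp) AS send_hour,
    CASE WHEN recipient_room_id IS NOT NULL THEN 'room' ELSE 'user' END
        AS recipient_type,
    message_type,
    country,
    os_family,
    COUNT(*) AS send_count
FROM message
GROUP BY send_hour, recipient_type, message_type, country, os_family
""")

# "How many doodles were sent each day in the US?" answered from the summary.
doodles_per_day = conn.execute("""
SELECT substr(send_hour, 1, 10) AS send_day, SUM(send_count)
FROM sm_message
WHERE message_type = 'doodle' AND country = 'US'
GROUP BY send_day
ORDER BY send_day
""").fetchall()

summary_row_count = conn.execute("SELECT COUNT(*) FROM sm_message").fetchone()[0]
print(doodles_per_day, summary_row_count)
```

The top-line question only ever touches the small summary table, while the full `message` table stays available for deep-dive queries.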
Some important queries will be complex and demand row-specific data.
Summarizing is not an option. What to do?
Analytics Strategies: Cached Derived Tables
…build Cached Derived Tables
• Turn long-running, complex queries into flat tables
Analytics Strategies: Cached Derived Tables
sm_retention_day
`join_day`
`nday`
`country`
`os_family`
`os_version`
`traffic_source`
`active_users`
`signups`
SELECT
  …
INTO TABLE `sm_retention_day`
FROM (
  SELECT
    …
  FROM `user`
  JOIN `message`
  JOIN `user_source`
), (
  SELECT
    …
  FROM `user`
  JOIN `user_source`
)
Example: Retention by Cohort
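A much-simplified version of the retention example can be sketched as follows. This is not the query from the slide, whose column lists and join conditions are elided. It is a hypothetical reduction that uses Python's built-in sqlite3 module, drops the `user_source` join, and assumes `join_day` and `send_day` columns that the slide does not show. The pattern is the same: run the expensive joins once and store the result as a flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, trimmed-down schemas with made-up sample data.
CREATE TABLE user (id INTEGER PRIMARY KEY, join_day TEXT, country TEXT);
CREATE TABLE message (id INTEGER PRIMARY KEY, sender_id INTEGER, send_day TEXT);

INSERT INTO user VALUES
    (1, '2014-01-01', 'US'),
    (2, '2014-01-01', 'US'),
    (3, '2014-01-02', 'US');
INSERT INTO message VALUES
    (1, 1, '2014-01-01'),
    (2, 1, '2014-01-02'),
    (3, 2, '2014-01-01'),
    (4, 3, '2014-01-02'),
    (5, 3, '2014-01-02');
""")


def rebuild_retention_table(conn: sqlite3.Connection) -> None:
    """Materialize the long-running cohort query as a flat table."""
    conn.executescript("""
    DROP TABLE IF EXISTS sm_retention_day;
    CREATE TABLE sm_retention_day AS
    SELECT
        active.join_day,
        active.nday,
        active.country,
        active.active_users,
        cohort.signups
    FROM (
        -- Users active N days after joining, per cohort.
        SELECT
            u.join_day,
            CAST(julianday(m.send_day) - julianday(u.join_day) AS INTEGER) AS nday,
            u.country,
            COUNT(DISTINCT u.id) AS active_users
        FROM user AS u
        JOIN message AS m ON m.sender_id = u.id
        GROUP BY u.join_day, nday, u.country
    ) AS active
    JOIN (
        -- Cohort size: signups per join day.
        SELECT join_day, country, COUNT(*) AS signups
        FROM user
        GROUP BY join_day, country
    ) AS cohort
      ON cohort.join_day = active.join_day
     AND cohort.country = active.country;
    """)


rebuild_retention_table(conn)
retention_rows = conn.execute("""
SELECT join_day, nday, active_users, signups
FROM sm_retention_day
ORDER BY join_day, nday
""").fetchall()
print(retention_rows)
```

Reports then read from `sm_retention_day` directly, and the table is rebuilt on whatever schedule the data freshness requirements allow.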
• Stats tables are homogeneous and compact
• Application data can be heterogeneous and heavy
– Mixture of numbers, strings, binary, etc.
DynamoDB → Redshift
How many users signed up this week with a .edu email address?
COPY dynamodb://user
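The slide shows only the source of the load. A fuller COPY statement might look like the one built below. This is a sketch under stated assumptions: the Redshift target table is assumed to be named `user` as well (quoted here), the credentials string is a placeholder, the READRATIO value of 50 is arbitrary, and the `email` and `signup_time` columns in the follow-up query are hypothetical. The helper only assembles SQL text and does no escaping, so it is not suitable for untrusted input.

```python
def build_dynamodb_copy(redshift_table: str, dynamodb_table: str,
                        credentials: str, read_ratio: int) -> str:
    """Return a Redshift COPY statement that loads from a DynamoDB table.

    read_ratio is the percentage of the DynamoDB table's provisioned read
    throughput that the load is allowed to consume.
    """
    if read_ratio <= 0:
        raise ValueError("read_ratio must be a positive percentage")
    return (
        f'COPY "{redshift_table}"\n'
        f"FROM 'dynamodb://{dynamodb_table}'\n"
        f"CREDENTIALS '{credentials}'\n"
        f"READRATIO {read_ratio};"
    )


# Placeholder only; real keys should never be hard-coded.
PLACEHOLDER_CREDENTIALS = (
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>"
)

copy_statement = build_dynamodb_copy(
    "user", "user", PLACEHOLDER_CREDENTIALS, read_ratio=50
)

# Hypothetical follow-up query; the email and signup_time columns are assumed.
EDU_SIGNUPS_THIS_WEEK = """
SELECT COUNT(*)
FROM "user"
WHERE email LIKE '%.edu'
  AND signup_time >= DATE_TRUNC('week', GETDATE());
"""

if __name__ == "__main__":
    print(copy_statement)
    print(EDU_SIGNUPS_THIS_WEEK)
```

Once the application data is in Redshift, a question like the .edu signup count becomes an ordinary SQL query rather than a scan over DynamoDB.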
Questions
Contacts:
Looker: http://www.looker.com/request-demo
MessageMe: https://messageme.com
AWS: aws.amazon.com/contact-us [email protected]
We’d like your feedback. Please respond to a short survey.
https://aws.asia.qualtrics.com/SE/?SID=SV_1yUN9wjaZX960kd