AWS Partner Webcast: Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift

Welcome

Maya Cabassi

Partner Marketing Manager

Amazon Web Services

Webinar Overview

Submit Your Questions using the Q&A tool.

A copy of today’s presentation will be made available on:

AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/

AWS Webinar Channel on YouTube: http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw





Introducing

Tina Adams, Senior Product Manager, Amazon Web Services

Keenan Rice, VP, Marketing & Alliances, Looker

Justin Rosenthal, Chief Technology Officer, MessageMe

What We’ll Cover

• Overview of Amazon Redshift data warehouse

• How Looker integrates with Amazon Redshift to enable big data analytics in the cloud

• How MessageMe turns application metrics stored in Amazon Redshift into actionable insights with Looker BI

• Q&A

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Tina Adams | [email protected]

Senior Product Manager

We set out to build…

A fast and powerful, petabyte-scale data warehouse that is:

A Lot Faster

A Lot Cheaper

A Lot Simpler

Amazon Redshift

Data warehousing done the AWS way

• Easy to provision

• Pay as you go, no up front costs

• Fast, cheap, easy to use

• SQL

Deploy

Common Customer Use Cases

Traditional Enterprise DW

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business; provision in minutes

Companies with Big Data

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS Companies

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW & SW costs by an order of magnitude

Channel

Amazon Redshift Customers

Feature Delivery

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

Unload Encrypted Files

DUB (4/25)

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

4 byte UTF-8 (7/18)

Statement Timeout (7/22)

SHA1 Builtin (7/15)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts (11/13)

SOC1/2/3 (5/8)

Sharing snapshots (7/18)

Resource Level IAM (8/9)

PCI (8/22)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)

EIP Support for VPC Clusters (12/28)

Amazon Redshift architecture

• Leader Node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute Nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3

– Parallel load from Amazon S3, DynamoDB, EMR/HDFS/SSH

– Kinesis integration

• Hardware optimized for data processing

• Scale while remaining online from a single node to a 100 node 1.6 PB cluster

[Architecture diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
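The split between a leader node that coordinates and compute nodes that work in parallel can be pictured with a toy scatter-gather sketch. This is not how Redshift is implemented. The per-node rows, the function names, and the use of threads are all assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows stored on one compute node.
NODE_ROWS = [
    [("US", 3), ("DE", 1)],
    [("US", 2), ("FR", 4)],
    [("DE", 5), ("US", 1)],
]


def compute_node_sum(rows):
    """Work done on one compute node: aggregate only its local rows."""
    partial = {}
    for country, count in rows:
        partial[country] = partial.get(country, 0) + count
    return partial


def leader_sum(node_rows):
    """Work done on the leader: fan the query out, then merge partial results."""
    with ThreadPoolExecutor(max_workers=len(node_rows)) as pool:
        partials = list(pool.map(compute_node_sum, node_rows))
    total = {}
    for partial in partials:
        for country, count in partial.items():
            total[country] = total.get(country, 0) + count
    return total


print(leader_sum(NODE_ROWS))
```

Each node only touches its own rows, so adding nodes spreads the scan; the leader's merge step stays small because it only sees one partial result per node.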

Amazon Redshift is priced to let you analyze all your data

                      Effective Hourly       Effective Hourly   Effective Annual
                      Price (single node)    Price per TB       Price per TB
On-Demand             $ 0.850                $ 0.425            $ 3,723
1 Year Reservation    $ 0.500                $ 0.250            $ 2,190
3 Year Reservation    $ 0.228                $ 0.114            $ 999

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go
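The per-TB figures in the pricing slide follow from simple arithmetic, sketched below. The 2 TB node size is an assumption inferred from the ratio of the single-node price to the per-TB price, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

# Hourly single-node prices as listed on the pricing slide.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

# Assumption: 2 TB per node, inferred from 0.850 / 0.425.
TB_PER_NODE = 2


def hourly_cluster_cost(nodes, plan):
    """Simple pricing: number of nodes x cost per hour (no charge for the leader)."""
    return nodes * HOURLY_NODE_PRICE[plan]


def annual_price_per_tb(plan):
    """Hourly price per TB multiplied by the hours in a year."""
    return HOURLY_NODE_PRICE[plan] / TB_PER_NODE * HOURS_PER_YEAR


for plan in HOURLY_NODE_PRICE:
    print(plan, round(annual_price_per_tb(plan)))
```

Under these assumptions the three plans round to the annual per-TB prices quoted on the slide, including the sub-$1,000 figure for the 3 year reservation.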

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated

– All blocks on disks and in Amazon S3 encrypted

– HSM/CloudHSM

• No direct access to compute nodes

• Amazon VPC support

• SOC1/2/3, PCI level 1, and others coming soon

[Security diagram labels: Customer VPC; Internal Security Group; JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Amazon Redshift integrates with multiple data sources

Amazon EMR

Amazon DynamoDB

Amazon RDS

Amazon Redshift

Amazon S3

Amazon Kinesis

JDBC

ODBC

Corporate Datacenter

Analytics For Today’s Data-Driven Organizations


Keenan Rice, Vice President, Marketing & Alliances

1.28.14

The New Data Landscape

The Missed Innovation Cycle

The Next Generation

Innovative Customers

MessageMe Intro


New Data Landscape

Business Data

Data Analysts Data Modeling

New MPP Databases

Ridiculous Quantities of

Event & Business Data

ETL Data Warehouse

Business Users Limited data discovery

Data Analysts New Breed of Data Experts

Business Users New Curious Generation

Expect Immediate Results

Back to hand-coding SQL

Missed Innovation Cycle

BI is a relic of the old (expensive) data landscape

Event & Business

Application Data

Traditional BI

New MPP databases

BI Software Heavy desktop apps

No reusability

One-time-use queries

No direct data access

Cubes / Simple models

Data Analysts New Breed of Data Experts

Business Users New Curious Generation

Expect Immediate Results


BI Software Web Based App

Data Analysts Flexible Delivery

Agile Modeling

Load

Transform Query

Looker — The Next Generation Modern analytics, built for the new data landscape

Business Users High-Resolution Discovery

Sharing & Collaboration




Agile Modeling

Looker Inside


• Near real-time access to your Redshift data

• Exploit the computing power of the AWS cloud and Redshift

• No need to re-architect or cube data

Load

Transform Query

Copy



Looker Intelligence


• Extend the power of your data analysts

• Fold data as complex as necessary without any database effort

• Use Git for agile team development

Transform Query


Agile Modeling

Copy


Agile Modeling

Looker Everywhere


• Powerful data discovery for anyone

• Share, save, and collaborate

• Access all the data, in an interactive web application



Transform Query

© 2014 Looker Inc. All Rights Reserved.

A New Perspective

Changing the way organizations make decisions

• 2012: Founded in Santa Cruz, California

• $18M: Redpoint, First Round Capital & Pivot North

• 1200: Hours per month spent in Looker per customer

• 50+: Customers changing how they run their businesses

• Lloyd Tabb: Created first app server (Netscape), founder Mozilla.org, LiveOps, etc.

• Frank Bien: Proven software exec: Greenplum, EMC

• Marc Randolph: Founder and first CEO, Netflix


Who’s Lookering?

Data-driven organizations realizing the power of Looker

Powering Analytics @ MessageMe

1. Redshift + Looker

2. Example Looker Report & Model

3. MessageMe Data Storage

4. Analytics Strategies

5. DynamoDB → Redshift

Redshift + Looker

Empower your team to answer their own questions.

• What types of Stickers are sent the most?

• How do event/holiday-themed packs perform?

• Which SMS provider is most cost-effective?

Internal dashboards and Looker link-sharing are commonplace.

Looker makes the data accessible and Redshift makes it fast.

Data Storage: Why Redshift?

At Launch:

• DynamoDB for all application data

• MySQL for all statistics data

RDS Config (March 2013)

Master: db.m1.xlarge (15GB)

Slave: db.m1.xlarge (15GB)

RDS Config (April 2013)

Master: db.m1.xlarge (15GB)

Slave: db.m2.4xlarge (68GB)

90% of writes were via LOAD DATA INFILE, so write IOPS were not a problem.

However, index sizes were growing quickly…

MySQL Status (April 2013)

Table     Engine   Index Width      Row Count    Index Size
event     InnoDB   48 Bytes / Row   ~3 Billion   144 GB
message   InnoDB   32 Bytes / Row   ~2 Billion   64 GB

Slave: db.m2.4xlarge (68GB)

We could put data in, but we couldn’t get it back out!
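The index figures on the MySQL status slide are the product of index width and row count, and a quick check shows why reads stalled: the two indexes together are roughly three times the slave's memory. The sketch below assumes decimal gigabytes (10^9 bytes), which is what the slide's numbers imply; the function name is illustrative.

```python
def index_size_gb(bytes_per_row, rows):
    """Index size in decimal GB (10**9 bytes), matching the slide's figures."""
    return bytes_per_row * rows / 1e9


event_index_gb = index_size_gb(48, 3_000_000_000)    # `event` table
message_index_gb = index_size_gb(32, 2_000_000_000)  # `message` table

SLAVE_RAM_GB = 68  # db.m2.4xlarge
total_index_gb = event_index_gb + message_index_gb
fits_in_memory = total_index_gb <= SLAVE_RAM_GB

print(event_index_gb, message_index_gb, total_index_gb, fits_in_memory)
```

Once the indexes no longer fit in memory, lookups spill to disk, which is consistent with the speaker's remark about not being able to get the data back out.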

Possible Solutions

1. Summarize

• PRO: Data compression

• CON: Data loss

2. Shard

• PRO: No data loss

• CON: Difficult to query

3. Redshift?


Data Storage: Current System

Redshift (90%)

• Append-only tables

• Delayed, bulk inserts OK

Examples:

• `event`

• `message`

• `user_demographic`

MySQL (10%)

• Inline inserts

• Non-negotiable uniqueness requirements (ON DUPLICATE KEY UPDATE)

Examples:

• `purchase`

• `user_demographic`
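The difference between the two write patterns can be sketched with an in-memory SQLite database. This is a stand-in, not the MessageMe setup: SQLite's `ON CONFLICT ... DO UPDATE` (available from SQLite 3.24) plays the role of MySQL's `ON DUPLICATE KEY UPDATE`, and every column name below is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Append-only pattern: rows are only ever added, so bulk loads are safe.
conn.execute("CREATE TABLE event (user_id INTEGER, name TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?)",
    [(42, "open_app", "2014-01-01 09:00:00"), (42, "open_app", "2014-01-01 09:00:00")],
)

# Uniqueness pattern: one row per key, later writes update in place.
conn.execute(
    "CREATE TABLE purchase (receipt_id TEXT PRIMARY KEY, user_id INTEGER, status TEXT)"
)


def record_purchase(receipt_id, user_id, status):
    """Insert a purchase, or update its status if the receipt already exists."""
    conn.execute(
        "INSERT INTO purchase (receipt_id, user_id, status) VALUES (?, ?, ?) "
        "ON CONFLICT(receipt_id) DO UPDATE SET status = excluded.status",
        (receipt_id, user_id, status),
    )


record_purchase("r-1", 42, "pending")
record_purchase("r-1", 42, "completed")

event_rows = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
purchase_rows = conn.execute("SELECT receipt_id, status FROM purchase").fetchall()
print(event_rows, purchase_rows)
```

The append-only table happily keeps both identical events, while the keyed table ends up with a single, updated row per receipt.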

Analytics Strategies w/ Billions of Rows

Deep-dive queries w/ row-level specifics

Super fast top-line metrics, aggregates

vs.

You get this out-of-the-box with Redshift

How do we get these, really fast?

1. Summarization

2. Cached Derived Tables

How many doodles were sent each day in the US since we launched?

100 seconds vs. 3 seconds

Analytics Strategies: Summarization

message

Columns:

`id`

`sender_id`

`recipient_user_id`

`recipient_room_id`

`message_type`

`country`

`os_family`

`os_version`

`app_version`

`timestamp`

Rows / Day: 10-100,000,000

sm_message

Columns:

`send_hour`

`recipient_type`

`message_type`

`country`

`os_family`

`send_count`

Rows / Day: 10-100,000

1,000:1 Compression
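A rollup from `message` into `sm_message` could look roughly like the sketch below, run here against in-memory SQLite so it is self-contained. The sample rows are invented, and deriving `recipient_type` from whether `recipient_room_id` is set is an assumption; the slides do not say how that column is computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE message (
        id INTEGER PRIMARY KEY, sender_id INTEGER,
        recipient_user_id INTEGER, recipient_room_id INTEGER,
        message_type TEXT, country TEXT, os_family TEXT,
        os_version TEXT, app_version TEXT, "timestamp" TEXT
    )
""")
conn.executemany(
    "INSERT INTO message VALUES (?,?,?,?,?,?,?,?,?,?)",
    [
        (1, 10, 20, None, "doodle", "US", "iOS", "7.0", "1.2", "2014-01-01 09:05:00"),
        (2, 11, 21, None, "doodle", "US", "iOS", "7.0", "1.2", "2014-01-01 09:40:00"),
        (3, 12, None, 5, "text", "US", "Android", "4.4", "1.2", "2014-01-01 09:59:00"),
        (4, 10, 22, None, "doodle", "US", "iOS", "7.1", "1.3", "2014-01-02 18:00:00"),
    ],
)

# Collapse row-level messages into one row per hour and dimension combination.
conn.execute("""
    CREATE TABLE sm_message AS
    SELECT strftime('%Y-%m-%d %H:00:00', "timestamp") AS send_hour,
           CASE WHEN recipient_room_id IS NULL THEN 'user' ELSE 'room' END
               AS recipient_type,
           message_type, country, os_family,
           COUNT(*) AS send_count
    FROM message
    GROUP BY send_hour, recipient_type, message_type, country, os_family
""")

# Top-line question answered from the summary table instead of the raw rows.
doodles_per_day = conn.execute("""
    SELECT date(send_hour) AS day, SUM(send_count)
    FROM sm_message
    WHERE message_type = 'doodle' AND country = 'US'
    GROUP BY day
    ORDER BY day
""").fetchall()
summary_rows = conn.execute("SELECT COUNT(*) FROM sm_message").fetchone()[0]
print(summary_rows, doodles_per_day)
```

The daily doodle count now scans the small summary table only. The trade-off is that dropped columns such as `os_version` and `sender_id` can no longer be queried from it.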

Some important queries will be complex and demand row-specific data.

Summarizing is not an option. What to do?

Analytics Strategies: Cached Derived Tables

…build Cached Derived Tables

• Turn long-running, complex queries into flat tables

Analytics Strategies: Cached Derived Tables

sm_retention_day

`join_day`

`nday`

`country`

`os_family`

`os_version`

`traffic_source`

`active_users`

`signups`

SELECT
  …
INTO TABLE `sm_retention_day`
FROM (
  SELECT
    …
  FROM `user`
  JOIN `message`
  JOIN `user_source`
), (
  SELECT
    …
  FROM `user`
  JOIN `user_source`
)

Example: Retention by Cohort
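The slide elides the actual query, so the following is only a minimal sketch of the idea: run an expensive cohort join once and store the result as a flat table. It uses in-memory SQLite, a reduced and hypothetical schema (no `user_source`, country, or OS columns), and an explicit join between the two subqueries rather than the comma form shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, join_day TEXT);
    CREATE TABLE message (sender_id INTEGER, sent_day TEXT);
    INSERT INTO user VALUES (1, '2014-01-01'), (2, '2014-01-01'), (3, '2014-01-02');
    INSERT INTO message VALUES
        (1, '2014-01-01'), (1, '2014-01-02'), (2, '2014-01-01'),
        (3, '2014-01-02'), (3, '2014-01-03'), (3, '2014-01-03');
""")


def rebuild_sm_retention_day(conn):
    """Recompute the cached table; meant to run on a schedule, not per query."""
    conn.executescript("""
        DROP TABLE IF EXISTS sm_retention_day;
        CREATE TABLE sm_retention_day AS
        SELECT a.join_day, a.nday, a.active_users, s.signups
        FROM (
            -- Users active N days after joining, per join-day cohort.
            SELECT u.join_day,
                   CAST(julianday(m.sent_day) - julianday(u.join_day) AS INTEGER)
                       AS nday,
                   COUNT(DISTINCT u.id) AS active_users
            FROM user u
            JOIN message m ON m.sender_id = u.id
            GROUP BY u.join_day, nday
        ) a
        JOIN (
            -- Cohort size: signups per join day.
            SELECT join_day, COUNT(*) AS signups FROM user GROUP BY join_day
        ) s ON s.join_day = a.join_day;
    """)


rebuild_sm_retention_day(conn)
retention = conn.execute(
    "SELECT join_day, nday, active_users, signups "
    "FROM sm_retention_day ORDER BY join_day, nday"
).fetchall()
print(retention)
```

Dashboards then read `sm_retention_day` directly, and retention for a cohort is `active_users / signups` at each `nday`.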

DynamoDB → Redshift

• Stats tables are homogeneous and compact

• Application data can be heterogeneous and heavy

– Mixture of numbers, strings, binary, etc.

How many users signed up this week with a .edu email address?

COPY dynamodb://user
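The slide abbreviates the load command. As a hedged sketch, the helper below builds a fuller Redshift `COPY` statement using the `dynamodb://` source, `IAM_ROLE`, and `READRATIO` clauses from the Redshift documentation; the role ARN and table names are placeholders, and the 2014 deck may well have used access-key credentials instead. The `.edu` question is then answered against an in-memory SQLite stand-in with hypothetical `email` and `created_at` columns.

```python
import sqlite3


def dynamodb_copy_sql(target_table, dynamodb_table, iam_role, read_ratio=50):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    read_ratio is the percentage of the DynamoDB table's provisioned read
    throughput the load is allowed to consume.
    """
    return (
        f"COPY {target_table} FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role}' READRATIO {read_ratio};"
    )


# Placeholder role ARN; substitute a real one before running against Redshift.
copy_sql = dynamodb_copy_sql(
    "user", "user", "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
)
print(copy_sql)

# Local stand-in for the loaded table, with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, email TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [
        (1, "a@mit.edu", "2014-01-27"),
        (2, "b@gmail.com", "2014-01-28"),
        (3, "c@ucsc.edu", "2014-01-29"),
        (4, "d@stanford.edu", "2014-01-10"),
    ],
)

week_start = "2014-01-27"
edu_signups = conn.execute(
    "SELECT COUNT(*) FROM user WHERE email LIKE '%.edu' AND created_at >= ?",
    (week_start,),
).fetchone()[0]
print(edu_signups)
```

Once application data sits next to the stats tables, questions like this become a single SQL filter rather than a scan over DynamoDB items.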

Questions

Contacts:

Looker: http://www.looker.com/request-demo

MessageMe: https://messageme.com

AWS: http://aws.amazon.com/contact-us/ [email protected]

We’d like your feedback. Please respond to a short survey.

https://aws.asia.qualtrics.com/SE/?SID=SV_1yUN9wjaZX960kd