
Big Data on AWS

Siva Raghupathy

Principal Solutions Architect

Amazon Web Services

Data Volume, Velocity, Variety

• 2.7 zettabytes (ZB) of data exist in the digital universe today
  – 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data

[Chart: data volume growing from GB through TB, PB, and EB toward ZB, 1990–2020]

Big Data

• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly/monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your website’s click stream: tells you what deal or ad to try next time
• Daily fraud reports: tell you if there was fraud yesterday

Real-time Big Data

• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now

Big Data: Best Served Fresh

Sources:
Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”
IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”

[Chart: Data Analysis Gap, 1990–2020: generated data grows faster than the data available for analysis]

Big Data
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Time to results is key
• Hard to configure/manage

AWS Cloud
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Parallel compute clusters from a single data source
• Managed services
• Partners

The Zoo

Hive, Pig, Shark, Impala, Apache Kafka, Storm, Hadoop/EMR, DynamoDB, Apache Spark, Amazon Kinesis, Apache Flume, HDFS, Redshift, Spark Streaming, S3, …?

Simplify

Data → Ingest → Store → Process → Visualize → Answers

• Ingest: Kafka, Kinesis, Flume, Scribe
• Store: S3, DynamoDB, HDFS, Redshift
• Process: Hive/Pig on Hadoop/EMR, Shark/Spark, Storm, Spark Streaming
• Visualize: Tableau, Jaspersoft

AWS Big Data Portfolio

• Collect / Ingest: Kinesis
• Store: S3, Glacier, DynamoDB, RDS
• Process / Analyze: EMR, EC2, Redshift
• Visualize / Report
• AWS Data Pipeline: moves data between these services

Ingest: the act of collecting and storing data

Why data ingest tools?

• Data ingest tools convert many random streams of data into a smaller set of sequential streams
  – Sequential streams are easier to process
  – Easier to scale
  – Easier to persist

[Diagram: many random data streams feed Kafka or Kinesis, which fans out sequential streams to multiple processing applications]

Data Ingest Tools

• Facebook Scribe – data collector
• Amazon Kinesis – data collector
• Apache Kafka – data collector
• Apache Flume – data movement and transformation

Partners – Data Load and Transformation

• Big Data Edition, HParser
• Flume, Sqoop

Amazon Kinesis

• Real-time processing of streaming data
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations
• Inexpensive: $0.028 per million puts

Amazon Kinesis Architecture

• Millions of sources producing 100s of terabytes per hour send data through a front end that handles authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• The ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, aggregation and archival to S3, and aggregate analysis in Hadoop or a data warehouse

Kinesis Stream: managed ability to capture and store data

• Streams are made of Shards
• Each Shard ingests up to 1 MB/sec and up to 1,000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing Shards
• Replay data inside the 24-hour window
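
A minimal sketch of these knobs, assuming Python with boto3 and AWS credentials already configured (the stream name and shard count are made up): it creates a stream and later grows its capacity by splitting a shard.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream with 2 shards: ~2 MB/sec ingest, ~4 MB/sec egress.
kinesis.create_stream(StreamName="clickstream", ShardCount=2)
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")

# Scale out by splitting one shard at the midpoint of its hash-key range.
desc = kinesis.describe_stream(StreamName="clickstream")
shard = desc["StreamDescription"]["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="clickstream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)
```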

Sending & Reading Data from Kinesis Streams

• Sending: HTTP POST, AWS SDK, LOG4J appender, Flume, Fluentd
• Reading: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce

Building Kinesis Processing Apps: Kinesis Client Library

Client library for fault-tolerant, at-least-once, continuous processing

o Java client library, source available on GitHub
o Build & deploy your app with the KCL on your EC2 instance(s)
o The KCL is an intermediary between your application & the stream
  – Automatically starts a Kinesis Worker for each shard
  – Simplifies reading by abstracting individual shards
  – Increase/decrease Workers as the number of shards changes
  – Checkpoints to keep track of a Worker’s location in the stream; restarts Workers if they fail
o Integrates with Auto Scaling groups to redistribute workers to new instances
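
For comparison, the sketch below reads a stream with the low-level Get* APIs that the KCL abstracts away (Python with boto3; a single-shard illustration with a made-up stream name; a real consumer would checkpoint its position and cover every shard, which is exactly what the KCL automates).

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Start reading one shard from its oldest available record (TRIM_HORIZON).
shard_id = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay well under the per-shard read limits
```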

Putting Data into Kinesis

Simple Put interface to store data in Kinesis

• Producers use a PUT call to store data in a Stream
• PutRecord {Data, PartitionKey, StreamName}
• A Partition Key, supplied by the producer, is used to distribute the PUTs across Shards
• Kinesis MD5-hashes the supplied partition key to map it into a Shard’s hash key range
• A unique Sequence Number is returned to the Producer upon a successful PUT call
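
A minimal producer sketch (Python with boto3; the stream name and payload are made up) showing the PutRecord call and the shard ID and sequence number it returns:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user": "abc", "action": "click", "ts": 1408000000}

# PutRecord {Data, PartitionKey, StreamName}: the partition key is hashed
# to pick the shard; the response carries the shard ID and sequence number.
resp = kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user"],
)
print(resp["ShardId"], resp["SequenceNumber"])
```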

Storage

Structured – Complex Query
• SQL: Amazon RDS (MySQL, Oracle, SQL Server)
• Data warehouse: Amazon Redshift
• Search: Amazon CloudSearch

Unstructured – Custom Query
• Hadoop/HDFS: Amazon Elastic MapReduce (EMR)

Structured – Simple Query
• NoSQL: Amazon DynamoDB
• Cache: Amazon ElastiCache (Memcached, Redis)

Unstructured – No Query
• Cloud storage: Amazon S3, Amazon Glacier

Amazon S3

• Store anything
• Object storage
• Scalable
• Designed for 99.999999999% durability
• Extremely low cost: $0.03/GB-month ($30.72/TB-month)
• Data layer for many AWS Big Data tools

Amazon S3

• Amazon S3 is for storing Objects (like ‘files’)

• Objects are stored in Buckets

• A Bucket keeps data in a single Region

• Highly durable, highly available

• Secure

Why is Amazon S3 good for Big Data?

• No limit on the number of Objects

• Object size up to 5TB

• Central data storage for all systems

• High bandwidth

• 99.999999999% durability

• Versioning, Lifecycle Policies and Glacier Integration

Amazon S3 Best Practices

• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink
  – Durable sink between data services
  – Supports de-coupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GET for faster reads
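
A sketch of two of these practices, assuming Python with boto3 and made-up bucket and file names: a short random hash prefix on the key (so keys don’t pile up under one hot prefix) and a managed upload configured for multipart and parallel threads.

```python
import hashlib
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

def hashed_key(natural_key: str) -> str:
    # Prepend a few hex chars of a hash so keys spread across prefixes.
    prefix = hashlib.md5(natural_key.encode()).hexdigest()[:4]
    return f"{prefix}/{natural_key}"

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    max_concurrency=10,                    # upload parts in parallel
)

s3.upload_file(
    Filename="clickstream-2014-08-01.gz",
    Bucket="my-analytics-bucket",
    Key=hashed_key("logs/clickstream-2014-08-01.gz"),
    Config=config,
)
```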

Aggregate All Data in S3, Surrounded by a Collection of the Right Tools

Amazon S3 at the center, with EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Cassandra, and Storm around it

S3 Can Expand Along With Growing Data Volumes

Amazon S3 remains the central store as those same tools scale around it

Amazon DynamoDB

• Fully managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent, low-latency performance
• Any throughput rate
• No storage limits

DynamoDB: A Flexible Data Model

• A Table is a collection of Items
• An Item is an arbitrary collection of Attributes (name–value pairs)
• Except for the required primary key Attribute, a Table is schema-less
• An Item can have any number of attributes (64 KB max item size)
• No limit on the number of Items per Table

DynamoDB: Access and Query Model

• Two primary key options
  – Hash key: key lookups: “Give me the status for user abc”
  – Composite key (hash with range): “Give me all the status updates for user ‘abc’ that occurred within the past 24 hours”
• Support for multiple data types: string, number, binary, or sets of strings, numbers, or binaries
• Supports both strong and eventual consistency
  – Choose your consistency level when you make the API call
  – Different parts of your app can make different choices
• Global Secondary Indexes
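
A minimal sketch of the composite-key query above, assuming Python with boto3 and a hypothetical status_updates table whose hash key is user_id and whose range key is a numeric timestamp:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("status_updates")  # hypothetical table

# "Give me all the status updates for user 'abc' from the past 24 hours."
cutoff = int(time.time()) - 24 * 60 * 60
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("abc") & Key("timestamp").gte(cutoff)
)
for item in resp["Items"]:
    print(item)
```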

DynamoDB: High Availability and Durability

What does DynamoDB handle for me?

Amazon DynamoDB Best Practices

• Keep item size small
• Store metadata in Amazon DynamoDB and large blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a table per day, week, month, etc. for storing time-series data (see the example tables below)
• Use conditional / OCC (optimistic concurrency control) updates
• Use a hash-range key to model
  – 1:N relationships
  – Multi-tenancy
• Avoid hot keys and hot partitions

Events_table_2012
  Event_id (hash key) | Timestamp (range key) | Attribute1 … AttributeN

Events_table_2012_05_week1
  Event_id (hash key) | Timestamp (range key) | Attribute1 … AttributeN

Events_table_2012_05_week2
  Event_id (hash key) | Timestamp (range key) | Attribute1 … AttributeN

Events_table_2012_05_week3
  Event_id (hash key) | Timestamp (range key) | Attribute1 … AttributeN
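
A short sketch of a conditional (OCC-style) write against one of these weekly tables, assuming Python with boto3 and made-up item values: the put only succeeds if no item with that key exists yet, so retried producers cannot overwrite existing data.

```python
import boto3
from botocore.exceptions import ClientError
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Events_table_2012_05_week1")  # hypothetical table

try:
    # Conditional put: only write if this event has not been recorded yet.
    table.put_item(
        Item={"Event_id": "evt-123", "Timestamp": 1338508800, "status": "new"},
        ConditionExpression=Attr("Event_id").not_exists(),
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("event already recorded; skipping duplicate write")
    else:
        raise
```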

Amazon RDS

• Relational databases
• Fully managed; zero admin
• MySQL, PostgreSQL, Oracle & SQL Server
• Built-in Multi-AZ for HA
• Scale up to 3 TB and 30,000 IOPS
• Read replicas; cross-region backups

Amazon RDS Best Practices

• Use the right DB instance class
• Use EBS-optimized instances
  – Example: db.m1.large, db.m1.xlarge, db.m2.2xlarge, db.m2.4xlarge, db.cr1.8xlarge
• Use Provisioned IOPS
• Use Multi-AZ for high availability
• Use read replicas for
  – Scaling reads
  – Schema changes
  – Additional failure recovery
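
A sketch of provisioning along these lines, assuming Python with boto3 (the instance identifier, credentials, and sizes are made up, and available instance classes and engine versions vary by region and over time):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Multi-AZ MySQL instance with Provisioned IOPS storage.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-db",
    DBInstanceClass="db.m1.xlarge",   # EBS-optimized class from the slide
    Engine="mysql",
    AllocatedStorage=1000,            # GB
    Iops=10000,                       # Provisioned IOPS
    MultiAZ=True,                     # synchronous standby in another AZ
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
)

# Read replica for scaling reads.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="analytics-db-replica-1",
    SourceDBInstanceIdentifier="analytics-db",
)
```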

Process and Analyze

Process: answering questions about data

• Questions
  – Analytics: think SQL / data warehouse
  – Classification: think sentiment analysis
  – Prediction: think page-view prediction
  – Etc.

Processing Frameworks

• Generally come in two major types: batch processing and stream processing

• Batch processing
  – Take a large amount (>100 TB) of cold data and ask questions
  – Takes hours to get answers back
  – Example: hourly reports

• Stream processing (a.k.a. real-time)
  – Take a small amount of hot data and ask questions
  – Takes a short amount of time to get your answer back
  – Example: 1-minute metrics

Processing Frameworks

• Batch processing: Hadoop/EMR, Redshift, Spark
• Stream processing: Spark Streaming, Storm

Partners – Advanced Analytics (scientific, algorithmic, predictive, etc.)

Amazon Redshift

• Columnar data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year

Amazon Redshift architecture

• Leader node – SQL endpoint (JDBC/ODBC)
  – Stores metadata
  – Coordinates query execution
• Compute nodes – local, columnar storage
  – Execute queries in parallel over a 10 GigE (HPC) interconnect
  – Load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
  – DW1: HDD; scale from 2 TB to 1.6 PB
  – DW2: SSD; scale from 160 GB to 256 TB

Amazon Redshift Best Practices

• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
  – Split your data into multiple files
  – Use GZIP or LZOP compression
  – Use a manifest file
• Choose a proper sort key
  – Range or equality predicates in the WHERE clause
• Choose a proper distribution key
  – Join column, foreign key or largest dimension, GROUP BY column
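
A sketch of such a load, assuming Python with psycopg2 connected to the cluster endpoint; the table, bucket, manifest, and IAM role are made up, and the MANIFEST and GZIP options follow the Redshift COPY syntax (older clusters used a CREDENTIALS clause instead of IAM_ROLE).

```python
import psycopg2

# Connect to the Redshift leader node (the JDBC/ODBC-style SQL endpoint).
conn = psycopg2.connect(
    host="analytics.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="change-me-please",
)

copy_sql = """
    COPY clickstream_events
    FROM 's3://my-analytics-bucket/clickstream/load.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST
    GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the listed files in parallel
```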

Amazon Elastic MapReduce (EMR)

• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and Spot pricing
• Tight integration with S3, DynamoDB, and Kinesis

How Does EMR Work?

1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
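
Step 3 might look like the following sketch, assuming Python with boto3 (cluster name, bucket paths, instance types, release label, and IAM role names are made up and vary by account and region):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster that runs one Hive step and then terminates.
resp = emr.run_job_flow(
    Name="daily-clickstream-report",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: shut down when done
    },
    Steps=[{
        "Name": "hourly-report",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-analytics-bucket/scripts/report.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-analytics-bucket/emr-logs/",
)
print("ClusterId:", resp["JobFlowId"])
```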

How Does EMR Work?

• You can easily resize the cluster
• And launch parallel clusters using the same data in S3

How Does EMR Work?

• Use Spot nodes to save time and money
• The Hadoop ecosystem works inside of EMR

Amazon Elastic MapReduce Best Practices

• Choose between transient and persistent clusters for best TCO
• Leverage Amazon S3 integration for highly durable and interim storage
• Right-size cluster instances based on each job – not one size fits all
• Leverage resizing and Spot to add and remove capacity cost-effectively
• Tuning cluster instances can be easier than tuning Hadoop code

[Diagram: the same job flow, duration 14 hours vs. 7 hours, depending on cluster size]

Visualize

Partners – BI & Data Visualization

Putting All the AWS Data Tools Together & Common Design Patterns

One tool to rule them all
http://lambda-architecture.net

Data Characteristics: Hot, Warm, Cold

             | Hot       | Warm    | Cold
Volume       | MB–GB     | GB–TB   | PB
Item size    | B–KB      | KB–MB   | KB–TB
Latency      | ms        | ms, sec | min, hrs
Durability   | Low–High  | High    | Very High
Request rate | Very High | High    | Low
Cost/GB      | $$–$      | $–¢¢    | ¢

[Chart: services arranged along the hot-to-cold spectrum (Amazon ElastiCache, Amazon DynamoDB, Amazon CloudSearch, Amazon RDS, Amazon S3, Amazon EMR, Amazon Redshift, Amazon Glacier) as request rate and cost/GB fall from high to low and latency, data volume, and structure rise from low to high]

Service           | Average latency      | Data volume        | Item size        | Request rate               | Cost ($/GB/month) | Durability
ElastiCache       | ms                   | GB                 | B–KB             | Very High                  | $$                | Low – Moderate
Amazon DynamoDB   | ms                   | GB–TB (no limit)   | B–KB (64 KB max) | Very High                  | ¢¢                | Very High
Amazon RDS        | ms, sec              | GB–TB (3 TB max)   | KB (~row size)   | High                       | ¢¢                | High
CloudSearch       | ms, sec              | GB–TB              | KB (1 MB max)    | High                       | $                 | High
Amazon Redshift   | sec, min             | TB–PB (1.6 PB max) | KB (64 K max)    | Low                        | ¢                 | High
Amazon EMR (Hive) | sec, min, hrs        | GB–PB (~nodes)     | KB–MB            | Low                        | ¢                 | High
Amazon S3         | ms, sec, min (~size) | GB–PB (no limit)   | KB–GB (5 TB max) | Low – Very High (no limit) | ¢                 | Very High
Amazon Glacier    | hrs                  | GB–PB (no limit)   | GB (40 TB max)   | Very Low (no limit)        | ¢                 | Very High

Cost Conscious Design

Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300                       | 2,048               | 1,483                 | 777,600,000


Amazon S3 or Amazon DynamoDB?

           | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month | Use
Scenario 1 | 300                       | 2,048               | 1,483                 | 777,600,000       | Amazon DynamoDB
Scenario 2 | 300                       | 32,768              | 23,730                | 777,600,000       | Amazon S3
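
The derived columns in the table follow from simple arithmetic; a quick check in Python, assuming a 30-day month of steady writes (which is how the 777.6 million objects figure is obtained):

```python
# Objects per month and storage per month for the two scenarios above,
# assuming a steady 300 writes/sec over a 30-day month.
writes_per_sec = 300
seconds_per_month = 30 * 24 * 60 * 60          # 2,592,000

objects_per_month = writes_per_sec * seconds_per_month
print(objects_per_month)                        # 777,600,000

for object_size_bytes in (2_048, 32_768):
    total_gib = objects_per_month * object_size_bytes / 1024**3
    print(object_size_bytes, round(total_gib))  # ~1,483 GB and ~23,730 GB
```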

Putting it all together (coupled architecture)

• Coupled systems provide less flexibility
  – Cold data vs. hot data
  – High-latency processing vs. low-latency processing
• Examples
  – EMR + HDFS/S3
    • Cold: can handle processing 100 records/sec
    • Hot: processing 1,000,000 records/sec??
  – Redshift + S3
    • High latency: generate reports once a day
    • Low latency: generate reports every minute

Putting it all together (de-coupled architecture)

• Multi-tier data processing architecture
  – Similar to multi-tier web-application architectures
• Ingest & store de-coupled from processing
  – Concept of a “databus”: Data → Ingest → Store (databus) → Process → Answers
• Ingest tools write to multiple data stores within the “databus” (e.g., Kafka, S3, HDFS)
• Processing frameworks (Hadoop, Spark, etc.) consume from the “databus”
• Consumers can decide which data store to read from depending on their data-processing requirements

Data temperature & processing latency

[Diagram: an ingest tool feeds a warm store and a cold store; processing layers read from each and produce answers. Data temperature runs from hot to cold, processing latency from low to high.]

Pattern 1: Redshift (cold & high latency)

Kinesis/Kafka → S3 → Redshift → Answers

• Daily fraud report
• Weekly / monthly bill

Pattern 2: Hadoop (cold & high latency)

Kinesis/Kafka → S3 and NoSQL/DynamoDB → EMR/Hadoop → Answers

Pattern 3: DynamoDB (warm & low latency)

Kinesis/Kafka → NoSQL/DynamoDB → DynamoDB app → Answers (with Redshift still serving the cold path)

• Hourly alert
• Real-time spending alerts/caps

Pattern 4: Hadoop (warm & low latency)

Kinesis/Kafka → S3 → EMR/Hadoop → Answers, and Kinesis/Kafka → NoSQL/DynamoDB → EMR/Hadoop → Answers

Pattern 5: Spark (cold & low latency)

Kinesis/Kafka → S3 and NoSQL/DynamoDB → EMR/Hadoop with Spark → Answers

Pattern 6: Stream processing (hot & low latency)

Kinesis/Kafka → Spark Streaming/Storm → Answers for the hot path, alongside S3, HDFS/NoSQL/DynamoDB, EMR/Hadoop, Spark, and Impala for warmer and colder data

Overall Reference Architecture: Fitting It All Together

[Diagram: Kinesis/Kafka ingests into S3 and NoSQL/DynamoDB/HDFS. Hot, low-latency data is processed by Spark Streaming/Storm and Impala; warm data by Spark and EMR/Hadoop; cold, high-latency data by EMR/Hadoop and Redshift. Data temperature runs hot to cold, latency low to high.]

Use Case: A Video Streaming Application

Reference architecture: Amazon RDS, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache, Amazon EMR, Amazon S3, Amazon Glacier, Amazon Redshift, AWS Data Pipeline

Use Case: A Video Streaming App – Upload

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch

A Video Streaming App – Discovery

Amazon CloudFront, Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3, Amazon Glacier

Use Case: A Video Streaming App – Recs

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon EMR

Use Case: A Video Streaming App – Analytics

Amazon EMR, Amazon S3, Amazon Glacier, Amazon Redshift

Customer Use Cases

Yelp: autocomplete search, recommendations, automatic spelling corrections

A look at how automatic spelling corrections work

Data analyzed using EMR: months of user search data and history, including search terms, common misspellings (e.g., Westen, Wistin, Westan, Whestin), and final click-throughs

• Yelp web site log data goes into Amazon S3
• Amazon Elastic MapReduce spins up a 200-node Hadoop cluster
• All 200 nodes of the cluster simultaneously look for common misspellings (Westen, Wistin, Westan, …)
• A map of common misspellings and suggested corrections is loaded back into Amazon S3
• Then the cluster is shut down; Yelp only pays for the time they used it
• Each of Yelp’s 80 engineers can do this whenever they have a big data problem: Yelp spins up over 250 Hadoop clusters per week in EMR

Data Innovation Meets Action at Scale at NASDAQ OMX

• NASDAQ’s technology powers more than 70 marketplaces in 50 countries
• NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• NASDAQ owns & operates 26 markets, including 3 clearinghouses & 5 central securities depositories
• More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion
• NASDAQ powers 1 in 10 of the world’s securities transactions

NASDAQ’s Big Data Challenge

• Archiving market data
  – A classic “Big Data” problem
• Power surveillance and business intelligence/analytics
• Minimize cost
  – Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service

Financial Information Forum; redistribution without permission from FIF prohibited; email: [email protected]

SIP Total Monthly Message Volumes: OPRA, UQDF and CQS
OPRA annual increase: 69%; CQS annual increase: 10%; UQDF annual decrease: 6%

OPRA
Date   | Total Monthly Message Volume | Average Daily Volume
Aug-12 | 80,600,107,361               | 3,504,352,494
Sep-12 | 77,303,404,427               | 4,068,600,233
Oct-12 | 98,407,788,187               | 4,686,085,152
Nov-12 | 104,739,265,089              | 4,987,584,052
Dec-12 | 81,363,853,339               | 4,068,192,667
Jan-13 | 82,227,243,377               | 3,915,583,018
Feb-13 | 87,207,025,489               | 4,589,843,447
Mar-13 | 93,573,969,245               | 4,678,698,462
Apr-13 | 123,865,614,055              | 5,630,255,184
May-13 | 134,587,099,561              | 6,117,595,435
Jun-13 | 162,771,803,250              | 8,138,590,163
Jul-13 | 120,920,111,089              | 5,496,368,686
Aug-13 | 136,237,441,349              | 6,192,610,970

UQDF and CQS
Date   | UQDF Total Monthly Volume | CQS Total Monthly Volume | Combined Average Daily Volume
Aug-12 | 2,317,804,321             | 8,241,554,280            | 459,102,548
Sep-12 | 1,948,330,199             | 7,452,279,225            | 494,768,917
Oct-12 | 1,016,336,632             | 7,452,279,225            | 403,267,422
Nov-12 | 2,148,867,295             | 9,552,313,807            | 557,199,100
Dec-12 | 2,017,355,401             | 8,052,399,165            | 503,487,728
Jan-13 | 2,099,233,536             | 7,474,101,082            | 455,873,077
Feb-13 | 1,969,123,978             | 7,531,093,813            | 500,011,463
Mar-13 | 2,010,832,630             | 7,896,498,260            | 495,366,545
Apr-13 | 2,447,109,450             | 9,805,224,566            | 556,924,273
May-13 | 2,400,946,680             | 9,430,865,048            | 537,809,624
Jun-13 | 2,601,863,331             | 11,062,086,463           | 683,197,490
Jul-13 | 2,142,134,920             | 8,266,215,553            | 473,106,840
Aug-13 | 2,188,338,764             | 9,079,813,726            | 512,188,750

[Chart: NASDAQ Exchange Daily Peak Messages, ranging from 0 to 600,000,000]

Market data is Big Data. Charts courtesy of the Financial Information Forum.

NASDAQ’s Legacy Solution

• On-premises MPP database
  – Relatively expensive, finite storage
  – Required periodic additional expenses to add more storage
  – Ongoing IT (administrative) human costs
• Legacy BI tool
  – Requires developer involvement for new data sources, reports, dashboards, etc.

New Solution: Amazon Redshift

• Cost effective
  – Redshift is 43% of the cost of the legacy system (assuming equal storage capacities)
  – Doesn’t include ongoing IT costs!
• Performance
  – Outperforms NASDAQ’s legacy BI/DB solution
  – Inserts 550K rows/second on a 2-node 8XL cluster
• Elastic
  – NASDAQ can add additional capacity on demand; easy to grow their cluster

New Solution: Pentaho BI/ETL

• Amazon Redshift partner: http://aws.amazon.com/redshift/partners/pentaho/
• Self service
  – Tools empower BI users to integrate new data sources and create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective

Net Result

• The new solution is cheaper, faster, and offers capabilities that NASDAQ didn’t have before
  – Empowers NASDAQ’s business users to explore data like they never could before
  – Reduces IT and development as bottlenecks
  – Margin improvement (expense reduction, and supports business decisions to grow revenue)

Q&A

AWS is here to help

• Solutions Architects
• Professional Services
• Premium Support
• AWS Partner Network (APN)

Partner with an AWS Big Data expert: aws.amazon.com/partners/competencies/big-data

AWS Architecture Diagrams
Processing large amounts of parallel data using a scalable cluster
https://aws.amazon.com/architecture/

Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data

AWS Marketplace
AWS online software store; shop the big data category
aws.amazon.com/marketplace

AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets

AWS Grants Program
AWS in Education
aws.amazon.com/grants

AWS Big Data Test Drives
APN partner-provided labs
aws.amazon.com/testdrive/bigdata

AWS Training & Events
Webinars, bootcamps, and self-paced labs
aws.amazon.com/events
https://aws.amazon.com/training

Big Data on AWS
Course on big data
aws.amazon.com/training/course-descriptions/bigdata

reinvent.awsevents.com
aws.amazon.com/big-data

