
Getting to 1.5M Ads/sec: How DataXu manages Big Data

Date post: 17-Jul-2015
Upload: qubole

Getting to 1.5M Ads per Second: How DataXu Manages Big Data

AWS, DataXu, Qubole

March 30th, 2015

Today’s speakers

Yekesa Kosuru

VP of Engineering,

DataXu

Ashish Dubey

Solutions Architect,

Qubole

Scott Ward

Solutions Architect,

AWS

Agenda

• AWS: Big Data, Technologies & Techniques for working productively with Data at any scale

• Qubole: Big Data Delivered as a Service

• DataXu: Leveraging Big Data to Understand & Engage Customers

Housekeeping

• The recording link will be distributed to all registrants via email after the webinar next week

• Please submit your questions and comments using the Chat with Presenters box located at the bottom left corner of your screen


Big Data

Technologies and techniques for working productively with data, at any scale.

Creating Value from Data Assets

• Recommendations, Collective Intelligence: Machine Learning
• Visualization: Dashboards, Business Intelligence
• Measuring Functionality and Services: Ad Hoc Queries, A/B Testing
• Hypothesis Testing & Predictions: Statistical Analysis
• Learning from Social Media Conversations: Sentiment Analysis

Big Data Lifecycle

Generation → Collection & storage → Analytics & computation → Collaboration & sharing

Big Data + AWS

Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis

AWS Data Services

Data

• Variety: Structured, Unstructured, Text, Binary
• Volume: Gigabytes, Terabytes, Petabytes
• Velocity: Millisecond, Second, Minute, Hour, Day

Services

• Instance Storage: EC2, EBS
• SQL Stores: Redshift, RDS
• Hadoop: EMR
• NoSQL: DynamoDB
• Stream: Kinesis
• Storage Services: S3, CloudFront, Glacier
• Caching: ElastiCache
• Orchestrate: Data Pipeline

Amazon Elastic MapReduce: Hosted Hadoop Framework

• Easy to use and fully managed

• Secure

• Resizable clusters to support processing needs

• Support for EC2 spot instances

• Use many query tools to support analysis of your data

– Hive, Pig, HBase, Spark, BI Tools, etc.

• EMR-FS for an S3 backed data store.

• Direct integration with other AWS data stores

– S3, Redshift, DynamoDB

Amazon EMR Architecture

[Figure: Amazon EMR architecture. Labels: Master instance group, Core instance group, Task instance group, HDFS, Amazon S3, Amazon Redshift, Amazon DynamoDB]
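
As a rough illustration of the three instance groups and the spot-instance support mentioned above, the sketch below builds the kind of instance-group list that the EMR `RunJobFlow` API accepts. The field names follow that API; the helper name `build_instance_groups`, the instance types, the counts, and the bid price are made-up values for illustration, and nothing is sent to AWS.

```python
# Hypothetical EMR instance-group layout: on-demand master and core groups,
# plus an optional spot-priced task group. Task nodes do not host HDFS, so
# that group can shrink or disappear without losing HDFS blocks.
# Instance types, counts, and the bid price are illustrative assumptions.

def build_instance_groups(core_count, task_count, spot_bid_usd):
    """Return a list of instance-group dicts in the shape RunJobFlow expects."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": core_count},
    ]
    if task_count > 0:
        groups.append(
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": f"{spot_bid_usd:.2f}",
             "InstanceType": "m3.xlarge", "InstanceCount": task_count})
    return groups

groups = build_instance_groups(core_count=4, task_count=10, spot_bid_usd=0.10)
for g in groups:
    print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

Keeping the master and core groups on-demand and putting only task nodes on spot is one common way to combine the "resizable clusters" and "spot instances" bullets; the slides do not prescribe a particular split.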

EMR Security

• Security groups for master and slave instances

• Instances launch in your VPC

• Encrypt data in S3

• Control who can access S3 data

• API requests must be signed


Amazon Redshift: Petabyte Scale Data Warehouse

• Fully managed data warehouse solution

• Able to achieve petabyte scale at $1000 per TB per year

• Integrates with existing data warehouse tools

• Scales through columnar storage and parallel query execution

• Load data directly from S3

• Integration with Amazon EMR
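
To put the "$1000 per TB per year" figure in perspective, here is a small back-of-envelope calculation. It treats the quoted rate as a flat price and assumes decimal units (1 PB = 1000 TB); actual pricing depends on node type and reservation terms, which the slide does not break down.

```python
# Back-of-envelope Redshift cost using the slide's "$1000 per TB per year".
# Assumptions: the rate is flat at every size, and 1 PB = 1000 TB.

PRICE_PER_TB_YEAR_USD = 1000  # figure quoted on the slide

def yearly_cost_usd(terabytes):
    """Estimated yearly cost for a warehouse holding `terabytes` of data."""
    return terabytes * PRICE_PER_TB_YEAR_USD

for label, tb in [("10 TB", 10), ("256 TB", 256), ("1 PB", 1000)]:
    print(f"{label:>7}: ${yearly_cost_usd(tb):,} per year")
```

At that rate a full petabyte works out to about $1 million per year.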

Amazon Redshift Architecture

• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH

• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB

[Figure: Amazon Redshift architecture diagram. Labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]

Amazon Redshift Security

• SSL to secure data in transit

• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM

• No direct access to compute nodes

• Amazon VPC support

[Figure: Amazon Redshift security diagram. Labels: Customer VPC, Internal, Security Group, JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]


2014 Usage Statistics for Qubole on AWS:

• Total QCUH processed in 2014 = 40.6 million

• Total nodes managed in 2014 = 2.5 million

• Total PB processed in 2014 = 519

Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers

Platform components: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems

• SDK

• API

• Analysis

• Security

• Job Scheduler

• Data Governance

• Analytics templates

• Monitoring

• Support

• Collaboration

• Workflow & Map/Reduce

• Auto Scaling

• Cloud Optimization

• Data Connectors

• YARN

• Presto & Hive

• Spark & Pig

Hadoop Ecosystem (Apache Open Source)

Qubole Cluster Settings

Qubole Cluster Set up with AWS Credentials

Qubole Query Types

Qubole Dashboard


DataXu Introduction

Disruptive on-demand software platform relied upon by the world’s leading brands

A petabyte scale marketing cloud that enables Fortune 500 brands to manage data, insight and action to maximize Marketing ROI

The industry’s #1 rated programmatic marketing technology spun out of MIT by the founders

One of the fastest growing companies in the Inc. 500


DataXu Quick Statistics

Big data + Real time decisions

Big Data Processing

13 petabytes of data

20 terabytes/day consumer data intake

Real-Time Decisioning

42 billion decisions per second

1,500,000 Inbound Queries Per Second

Dozens of algorithms across mobile, social, native, display, video and TV

Predictive Modeling

Executing 10,000+ investments simultaneously

10M variables considered per investment decision using next gen machine learning

Enterprise-Cloud Infrastructure

14 data centers

35,000+ CPU cores

Patent portfolio for real-time decision systems

Exclusive license from MIT to Algebra Of Systems IPR
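
For a sense of scale, the headline rates can be converted to per-day and per-second figures. The sketch below assumes both rates are sustained around the clock and uses decimal units (1 TB = 10^12 bytes); the slide does not say whether 1,500,000 queries per second is a peak or an average, so the daily total is an upper-bound style estimate rather than a reported number.

```python
# Converts the slide's headline figures into per-day and per-second rates.
# Assumptions: rates hold for a full 24 hours, and 1 TB = 1e12 bytes.

SECONDS_PER_DAY = 24 * 60 * 60

queries_per_second = 1_500_000       # inbound queries per second, from the slide
intake_bytes_per_day = 20 * 10**12   # 20 terabytes/day, from the slide

queries_per_day = queries_per_second * SECONDS_PER_DAY
intake_mb_per_second = intake_bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{queries_per_day / 1e9:.1f} billion queries/day if sustained")
print(f"{intake_mb_per_second:.1f} MB/s average intake")
```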


Programmatic buying exploits real time signals to drive greater ROI.

• Analyze: Analyze the attributes available at bidding time (Context, Geo, O.S., Time, Demo, Etc.)

• Appraise: Assess the value of each impression to determine a bid price and the creative to serve

• Optimize: Learn from served impressions to adjust future bidding and creative delivery
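
The analyze, appraise, optimize cycle can be sketched as a toy feedback loop. This is not DataXu's algorithm: the segment key (geo plus O.S.), the smoothed response-rate estimate, the function names, and every constant below are assumptions chosen only to show how served impressions can feed back into later bids.

```python
# Toy analyze / appraise / optimize loop. This is NOT DataXu's algorithm:
# the segment key, the smoothed response-rate estimate, and all constants
# are illustrative assumptions.
from collections import defaultdict

PRIOR_RESPONSES, PRIOR_IMPRESSIONS = 1, 100   # assumed prior: 1% response rate
VALUE_PER_RESPONSE_USD = 2.00                 # assumed value of one response

stats = defaultdict(lambda: [0, 0])           # segment -> [responses, impressions]

def analyze(request):
    """Pick the attributes available at bidding time (here: geo and O.S.)."""
    return (request["geo"], request["os"])

def appraise(segment):
    """Bid the expected value of one impression for this segment."""
    responses, impressions = stats[segment]
    rate = (responses + PRIOR_RESPONSES) / (impressions + PRIOR_IMPRESSIONS)
    return rate * VALUE_PER_RESPONSE_USD

def optimize(segment, responded):
    """Record a served impression so future bids reflect the outcome."""
    stats[segment][0] += int(responded)
    stats[segment][1] += 1

segment = analyze({"geo": "US", "os": "iOS", "time": "09:00"})
bid_before = appraise(segment)
for outcome in [True, False, True, False]:    # pretend four impressions were served
    optimize(segment, outcome)
bid_after = appraise(segment)
print(f"bid before: ${bid_before:.4f}, after: ${bid_after:.4f}")
```

Because two of the four simulated impressions got a response, the estimated rate for the segment rises and the next bid is higher than the first one. A production system would also choose the creative and enforce budgets, which this sketch leaves out.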


DataXu in the Cloud

• On-premise and Cloud

• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity


DataXu Data Flows in AWS

[Figure: data flow diagram. Stages: Producers, Continuous Processing, Storage, Analytics. Components shown: CDN, Real Time Bidding, Retargeting Platform, Qubole, Kinesis, S3, Redshift, Machine Learning, Streaming, Data Collection. Users: Analysts, Data Scientists, Engineers]
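
The diagram shows producers feeding Kinesis but not how records are written. As one possible shape for that step, the sketch below groups events into batches sized for the Kinesis `PutRecords` call, which accepts at most 500 records per request. Nothing is sent to AWS; the event fields, the helper names, and the choice of `user_id` as the partition key are assumptions for illustration.

```python
# Groups events into batches sized for Kinesis PutRecords (max 500 records
# per call). Nothing is sent to AWS here. The event fields and the use of
# user_id as the partition key are illustrative assumptions.
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords per-request record limit

def to_record(event):
    """Shape one event as a PutRecords entry (Data bytes plus PartitionKey)."""
    return {"Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["user_id"])}

def batches(events, size=MAX_RECORDS_PER_CALL):
    """Yield lists of at most `size` records, preserving event order."""
    batch = []
    for event in events:
        batch.append(to_record(event))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

events = [{"user_id": i % 7, "type": "impression", "seq": i} for i in range(1200)]
batch_sizes = [len(b) for b in batches(events)]
print(batch_sizes)
```

Each batch could then be passed as the `Records` argument of a `put_records` call in an AWS SDK. The sketch ignores the per-request size limit and retry handling for partially failed batches, both of which a real producer would need.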


Why Qubole

Managed Service

• Auto Scaling

• Spot Pricing

• No Opex

• Redundant Clusters

• Data Security

Single Unified Interface

• Rich Unified Experience

• Data Discovery tool

• Query Templates

• Administration and Monitoring

Performance Optimizations

• Overall better performance than other Hadoop clusters in the cloud

Automation

• Workflow, Scheduler

• SDK

Support

• 24 X 7 deep expertise support


Unified Experience

Ease of use for anyone

Users: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers


Qubole Cluster Best Practices

• Use VPC, pick AZs appropriately to match reservations

• Use hybrid spot pricing strategy

• Use tags for better reporting

• Seek Qubole help for cluster tuning
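
The slide does not define "hybrid spot pricing strategy". One plausible reading is a cluster that keeps a fixed share of on-demand nodes and fills the rest with spot nodes. The sketch below computes the blended hourly cost under that reading; the function name, the node prices, and the 50% on-demand share are made-up numbers.

```python
# Blended hourly cost of a cluster that mixes on-demand and spot nodes.
# "Hybrid" is interpreted here as a fixed on-demand share, which is an
# assumption; the prices and the 50% share are illustrative only.

def blended_hourly_cost(nodes, on_demand_share, on_demand_price, spot_price):
    """Return (on_demand_nodes, spot_nodes, hourly_cost_usd)."""
    on_demand_nodes = round(nodes * on_demand_share)
    spot_nodes = nodes - on_demand_nodes
    cost = on_demand_nodes * on_demand_price + spot_nodes * spot_price
    return on_demand_nodes, spot_nodes, cost

all_on_demand = blended_hourly_cost(20, 1.0, 0.28, 0.05)
hybrid = blended_hourly_cost(20, 0.5, 0.28, 0.05)
print(f"all on-demand: ${all_on_demand[2]:.2f}/hour")
print(f"hybrid 50/50:  ${hybrid[2]:.2f}/hour")
```

The on-demand share bounds how much capacity can be lost if spot prices rise above the bid, so the share is a trade-off between cost and stability rather than a value to minimize.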


Data Security & Privacy

• AWS offers comprehensive data security

• Security & Privacy

– VPC

– IAM Policies, Users, Roles

– S3 Buckets, Bucket Policies & HTTPS

– Security Groups, Whitelist IP CIDR

– Key Management Service & CloudHSM

– Server Side and Client Side Encryption
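
As an example of the "Bucket Policies & HTTPS" item, the sketch below builds an S3 bucket policy that denies any request not made over HTTPS, using the `aws:SecureTransport` condition key. The bucket name and the helper name `https_only_policy` are hypothetical, and the slides do not show DataXu's actual policies.

```python
# Builds an S3 bucket policy that denies any request not made over HTTPS,
# using the aws:SecureTransport condition key. The bucket name is hypothetical.
import json

def https_only_policy(bucket):
    """Return a bucket policy document (as a dict) that rejects plain-HTTP access."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }

policy = https_only_policy("example-analytics-bucket")
print(json.dumps(policy, indent=2))
```

An explicit deny of this kind takes precedence over any allow granted elsewhere, so it can be layered on top of existing IAM policies without rewriting them.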


Right tool for right workload

Use Case / Technology

• Large scale ETL

• Interactive Discovery Queries

• Machine Learning/Real time queries

• High Performance DW Queries/Reporting backend

Questions?

DataXu: Yekesa Kosuru, www.dataxu.com

Qubole: Ashish Dubey, www.qubole.com

AWS: Scott Ward, aws.amazon.com

