The evolution of advertising technology and the importance of personalization

EXPECTATIONS

AGENDA

§ Introduction

§ Data Collection

§ Data Consumption

§ Data Analysis

§ Exposing Data

§ Q & A

DharmicData, Data Center of Excellence

Data Strategy

Data Management

Platform

Data-Driven Solutions

KPIs and experimentation

Transforming the whole Value Chain

http://www.dharmicdata.com @dharmicdata @fsroque @moshtan

Data is everywhere

(BIG) DATA

Contains information

Extracting information allows us to act proactively

Overcome problems

Optimize returns

The more data we can collect and use, the more information we’ll generate, and the better we’ll operate

(BIG) DATA

Varying perceptions of “BIG DATA”

BIG DATA

“It’s about being smarter with your data?”

“It means making faster decisions?”

“It simply means more data?”

“It’s about cheaper storage technology?”

“It’s all about social media?”

What is (Big) Data?

Volume Velocity

Variety Value

Big Data

Data in motion: Enabling real-time decisions

Data in many forms: Structured, unstructured, text, multimedia

Data in numbers: Extracting business insights and revenue from data

Data at scale: Terabytes to petabytes of data (or when your work processes dictate)

Trends driving the importance of Big Data

Customer-centric outcomes

Operational optimization

Risk/Financial management

New business model (14%)

Employee Collaboration

Businesses' Big Data objectives

“Analytics: The real-world use of big data”. IBM Institute for Business Value study, October 2012

• Everything is digitized

• Advanced analytics technologies

• Customer-centricity, ‘smarter’ solutions

Data is everywhere, and should be accessible

PIPELINES

SENSORS

SOCIAL INTERACTIONS

BEHAVIOR

CONSUMPTION

SOLUTION’S QUALITY

CAPTURE DATA $$$

Capturing most Data

PIPELINES

Turn it into valuable information

A pipeline ties several data processing steps together
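To make the idea of tying steps together concrete, here is a minimal sketch in plain Python. The step names (parse, only_clicks, add_flag) and the sample records are hypothetical and only serve as illustration; a real pipeline would read from actual sources.

```python
from functools import reduce

def parse(lines):
    # Step 1: turn raw "user,action" strings into records.
    for line in lines:
        user, action = line.split(",")
        yield {"user": user, "action": action}

def only_clicks(records):
    # Step 2: keep only the click events.
    return (r for r in records if r["action"] == "click")

def add_flag(records):
    # Step 3: enrich each record with a hypothetical "billable" flag.
    for r in records:
        yield {**r, "billable": True}

def pipeline(*steps):
    # Chain the steps so the output of one feeds the next.
    def run(source):
        return reduce(lambda data, step: step(data), steps, source)
    return run

run = pipeline(parse, only_clicks, add_flag)
result = list(run(["ann,click", "bob,view", "cy,click"]))
```

Because every step consumes and produces an iterable, steps can be added, removed or reordered without touching the others.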

ISSUES DEALING WITH DATA

From batches to pipelines

#1 TRANSPORTING DATA BETWEEN SYSTEMS

DATA INTEGRATION (~ETL)

RELATIONAL DATABASES

HADOOP

SEARCH AND INDEXING

MONITORING

KEY-VALUE STORES

http://www.confluent.io/blog/stream-data-platform-1/

THE SPAGHETTI MONSTER

#2 NEED FOR RICH ANALYTICAL DATA PROCESSING

VERY LOW LATENCY DATA PROCESSING

STREAM PROCESSING

REAL-TIME ANALYTICS

DATA CLOSE TO PROCESSING

MORE ISSUES

• Lossy and high-latency connections

• Segmented (siloed) data sources

• Batched database migrations, data insertions, etc.

• Unscalable, tightly connected systems

• ‘Duct taped connections’

• Unreliable data – leading to a lot of QA

• No room for data processing outside of batch, data archival or ad hoc processing

STREAM DATA PLATFORM TO THE RESCUE

UNIVERSAL DATA PIPELINE

CONTINUOUS FEEDS OF WELL-FORMED DATA

THE STREAM DATA PLATFORM

REAL-TIME STREAM PROCESSING

STREAM

User profile

Enrich user profile

Store in db

Predict user behavior

Target user
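The steps on this slide can be sketched as a small processing loop. This is only an illustration under stated assumptions: the in-memory dict stands in for the database, and predict_will_buy is a hypothetical rule standing in for a trained model.

```python
profiles = {}  # stand-in for the database (assumption)

def enrich(profile, event):
    # Enrich the user profile with what this event tells us.
    updated = dict(profile)
    updated["clicks"] = updated.get("clicks", 0) + int(event["type"] == "click")
    updated["last_seen"] = event["ts"]
    return updated

def predict_will_buy(profile):
    # Hypothetical rule standing in for a real prediction model.
    return profile.get("clicks", 0) >= 2

def process(events):
    targeted = []
    for event in events:
        user = event["user"]
        profile = enrich(profiles.get(user, {}), event)
        profiles[user] = profile            # store in db
        if predict_will_buy(profile) and user not in targeted:
            targeted.append(user)           # target user
    return targeted

events = [
    {"user": "ann", "type": "click", "ts": 1},
    {"user": "bob", "type": "view", "ts": 2},
    {"user": "ann", "type": "click", "ts": 3},
]
targets = process(events)
```

Each event updates the profile as it arrives, so the prediction always runs on the latest state rather than on a nightly batch.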

WHAT DOES A STREAM DATA PLATFORM NEED TO DO?

FAST?

HIGH THROUGHPUT?

SCALE WELL?

KEY REQUIREMENTS FOR A STREAM DATA PLATFORM

• Reliable, no data loss

• High Throughput to handle large event data

• Persist data for longer periods, to enable batch-based workflows

• Low latency data for real-time applications

• Central system

• Close integration with stream processing systems

STREAM DATA PLATFORM RELATED TO EXISTING THINGS

STREAM DATA PLATFORM vs. ENTERPRISE MESSAGING SYSTEM

Enterprise messaging system: one-off deployment; limited storage capacity

Stream data platform: central data hub; large log history

STREAM DATA PLATFORM vs. DATA INTEGRATION TOOLS

Data integration tools: disparate tools and deployments; many routine data-cleanup steps

Stream data platform: a true platform; stream abstraction and data locality make it easier to tap into and build applications around a stream

STREAM DATA PLATFORM vs. ENTERPRISE SERVICE BUSES

Enterprise service buses: transformation logic is embedded; processing tasks need to be agreed with multiple stakeholders

Stream data platform: data transformation is decoupled from the stream; individual teams can use and reuse streams, no bottleneck

STREAM DATA PLATFORM and DATA WAREHOUSES AND HADOOP

Quickly flow data

Publish results

Long-term storage

A BIG IDEA

The democratization of “data” => Making data available through more of the organization

The democratization of the “cluster” => Making data + resources available through more of the organization

In the Hypervisor world, low utilization has been widely observed

A McKinsey study in 2008 pegged data-center utilization at roughly 6 percent.

A Gartner report from 2012 put the industry-wide utilization rate at 12 percent.

A 2011 Accenture paper sampling Amazon EC2 machines found 7 percent utilization over the course of a week.

The business case for data Warehouse Scale Computing

Arguments for WSC

Rather than running several specialized clusters, each at relatively low utilization rates, run many mixed workloads on a shared cluster. Obvious benefits are realized in terms of:

• scalability, elasticity, fault tolerance, performance, utilization

• reduced equipment capex, Ops overhead, etc.

• reduced licensing, eliminating the need for VMs or potential vendor lock-in

• reduced time for engineers to ramp up new services at scale

• reduced latency between batch and services, enabling new high-ROI use cases

• Dev/Test apps can run safely on a Production cluster

• easier deployment

Prior Practice

• Low utilization rates

• Longer time to ramp up new services

• Even more machines to manage

• Substantial performance decrease

• VM licensing costs and specific data center vendor economics

• Failures make static partitioning more complex to manage

Current Practice: WSC

“We wanted people to be able to program for the datacenter just like they program for their laptop.” - Ben Hindman, co-creator of Mesos

WAREHOUSE SCALE COMPUTING

Server management and granular resource allocation

External scalability, horizontal scalability, health checks, monitoring, and scheduling

MESOS, MARATHON, CHRONOS

WHERE TO FIND DATA?

Step 0. Find Data

Step 1. Ingest Data

GETTING DATA

REQUEST/RESPONSE

STREAMING

REQUEST/RESPONSE

Classic: the client just asks the third party

Issue a request, return a response

- What if the service returns a lot of data?

- What if the service generates data very fast?

- What if the data points will only be sent once?

STREAMING

A permanent connection is made between the service and the consumer

The data flows continuously through the pipe. The consumer subscribes to the service

What if the incoming data rate is too high?
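One possible answer to that question, offered only as a sketch: the consumer keeps a bounded buffer and discards the oldest items when data arrives faster than it is drained. The class names (Service, BufferedConsumer) are hypothetical, and dropping the oldest item is an assumption; blocking the producer or sampling are equally valid policies.

```python
from collections import deque

class BufferedConsumer:
    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.dropped = 0

    def on_data(self, item):
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # count what we are about to lose
        self.buffer.append(item)

    def drain(self):
        # Hand over everything buffered so far and start fresh.
        items = list(self.buffer)
        self.buffer.clear()
        return items

class Service:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, consumer):
        self.subscribers.append(consumer)

    def emit(self, item):
        # Data flows to every subscribed consumer as soon as it exists.
        for consumer in self.subscribers:
            consumer.on_data(item)

service = Service()
consumer = BufferedConsumer(capacity=3)
service.subscribe(consumer)
for i in range(5):
    service.emit(i)
received = consumer.drain()
```

Tracking the number of dropped items makes the data loss visible, which matters when reliability is a requirement.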

MICROSERVICES ARCHITECTURE

Thin collection layer:

• Pass the data to the next layer (the queue)

• Scalable vertically (increase req. rate)

• Scalable horizontally (support fast data)

• Extendable (capture new sources and types of data)
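A thin collection layer along these lines might look like the sketch below. The Collector class, the source names and the parsers are hypothetical; the standard-library queue stands in for the queuing tier, and rejecting data when the queue is full is an assumption rather than the authors' design.

```python
import json
import queue

class Collector:
    def __init__(self, out_queue):
        self.out_queue = out_queue
        self.parsers = {}  # source name -> parser function

    def register(self, source, parser):
        # Extendable: supporting a new source only needs a new parser.
        self.parsers[source] = parser

    def collect(self, source, payload):
        parser = self.parsers.get(source)
        if parser is None:
            return False  # unknown source, nothing is queued
        try:
            # Thin layer: parse, wrap, and pass on to the next layer.
            self.out_queue.put_nowait({"source": source, "data": parser(payload)})
        except queue.Full:
            return False  # the next layer cannot keep up
        return True

q = queue.Queue(maxsize=2)
collector = Collector(q)
collector.register("clicks", json.loads)
collector.register("sensor", lambda p: {"reading": float(p)})

accepted = [
    collector.collect("clicks", '{"user": "ann"}'),
    collector.collect("sensor", "21.5"),
    collector.collect("clicks", '{"user": "bob"}'),  # queue is full
    collector.collect("unknown", "x"),               # no parser registered
]
```

The collector holds no state about the data itself, which is what makes it cheap to run many copies side by side.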

DMPs and Data Collection

What is a data management platform?

In simple terms, a data management platform is a data warehouse. It’s a piece of software that sucks up, sorts and houses information, and spits it out in a way that’s useful for marketers, publishers and other businesses.

Data Collection at DD

EVENTS

Orders, Sales, Clicks

Sensor Data

Databases?

SIGNALS – CAPTURING USER BEHAVIOR

“Big Data, the future of logistics”. Luxembourg-Poland Business Club, KPMG

QUEUES

PUB/SUB

“… put them (messages) on a software bus where all processes can see them”

- Gartner
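As a rough illustration of the publish/subscribe idea in the quote, the sketch below implements an in-process bus. The Bus class and the topic names are hypothetical; a production bus would add persistence, delivery guarantees and network transport.

```python
from collections import defaultdict

class Bus:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Every subscriber of the topic sees the message.
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)

bus = Bus()
billing, audit = [], []
bus.subscribe("orders", billing.append)
bus.subscribe("orders", audit.append)

delivered = bus.publish("orders", {"id": 1})
ignored = bus.publish("clicks", {"id": 2})  # nobody subscribed
```

The publisher never needs to know who is listening, so new consumers can be added without changing the producing side.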

A COMMIT LOG

[Figure: a log of records at offsets 0 to 9, from the 1st record to the next record written]

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally ordered sequence of records, ordered by time.
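The abstraction is small enough to sketch in a few lines. The CommitLog class below is a hypothetical in-memory illustration, not any particular product's API: records can only be appended, and each one gets the next offset.

```python
class CommitLog:
    def __init__(self):
        self._records = []

    def append(self, record):
        # Records are only ever added at the end; the offset is the position.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Readers choose where to start; reading never removes anything.
        return self._records[offset:offset + max_records]

log = CommitLog()
offsets = [log.append(r) for r in ["a", "b", "c"]]
tail = log.read(1)
```

Because reading does not consume records, several independent readers can each keep their own offset into the same log.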

A DISTRIBUTED COMMIT LOG

[Figure: Partition 0, Partition 1 and Partition 2, each holding records at offsets 0 to 9, ordered from old to new; writes are appended]

Is partitioned and replicated across multiple nodes.

Scalable

Time (duration configurable)
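Partitioning can be sketched as routing each record to one of several independent logs by a hash of its key. The PartitionedLog class is hypothetical, replication and time-based retention are left out, and the choice of CRC32 as the hash is an assumption made so the routing is deterministic.

```python
import zlib

class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Same key, same partition: per-key ordering is preserved.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

log = PartitionedLog(3)
first = log.append("ann", "click")
second = log.append("ann", "order")
other = log.append("bob", "click")
```

Ordering is only guaranteed within a partition, which is the trade-off that lets the partitions live on different nodes.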

SPARK STREAMING

1. Receive streaming data from data sources

2. Process the data

3. Output the results downstream

Architecture of SPARK Streaming: Discretized Streams

batches (RDDs)

Receiver

Records are processed in batches, and each batch is an RDD, a partitioned dataset.
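The batching idea can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: the discretize function and the sample records are hypothetical, and intervals with no records are simply skipped, which is a simplification.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    # Group (timestamp, value) records into consecutive time-based batches.
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // batch_interval)].append(value)
    return [batches[i] for i in sorted(batches)]

records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = discretize(records, batch_interval=1.0)

# Each batch can now be processed with ordinary batch logic.
counts = [len(batch) for batch in batches]
```

Treating a stream as a series of small batches means the same processing code can serve both streaming and batch workloads, at the cost of latency no lower than the batch interval.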

Event-driven Applications

Stream processing

Real-time event data

Streaming data platforms

Event-data Can Be Thought of as Event-Streams

Evolution of Traditional ETL

Production DB, Standby DB

Periodic full backups

Frequent diffs

Even more frequent diffing

What we are left with is a continuous sequence of single-row changes

[Chart axis: amount of data]
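The end point of this progression, a sequence of single-row changes, can be sketched as follows. The function names and the sample tables are hypothetical; real systems usually read changes from the database's own log rather than comparing snapshots.

```python
def row_changes(old, new):
    # Express the difference between two snapshots as single-row changes.
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("insert", key, new[key]))
        elif key not in new:
            changes.append(("delete", key, None))
        elif old[key] != new[key]:
            changes.append(("update", key, new[key]))
    return changes

def apply_changes(table, changes):
    # Replaying the changes on a copy brings it in sync with the source.
    result = dict(table)
    for op, key, row in changes:
        if op == "delete":
            result.pop(key, None)
        else:
            result[key] = row
    return result

before = {1: {"name": "ann"}, 2: {"name": "bob"}}
after = {1: {"name": "ann b."}, 3: {"name": "cy"}}

changes = row_changes(before, after)
standby = apply_changes(before, changes)
```

Shipping only the changes keeps the standby current without ever moving the full table again.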

100s of users, 1,000s of users, 10,000s of users, 100,000s of users

Scalable micro-services

Micro-service

WORKSHOP

• Split into groups of 3-4

• Discuss the question:

• “How can Big Data, pipelining, streaming, and real-time computing apply to my current or future assignments?”

• One lucky group will be randomly selected to present :)

THE EXPLORATORY PHASE

EXPLORATORY PHASE

Unavoidable

Understand the data you are working with

Computationally Expensive

Lots of retries: choosing the model down the line will involve trial and error

PIPELINING (UNIFIED DATA SILOS)

Productizing Data Science

MODELING, DEPLOYING, CODING

Finding data

Parsing structures

Cleaning

Reducing

Learning

Predicting

Connect to prod data

Tuning training parameters

Create prediction service

Generate deployable model

Connect to prod infrastructure

Integration with existing ENV

Allocate and schedule resources

Ensure availability

Extended Pipeline

Stages: COLLECT, MODEL, CODE, DEPLOY, CREATE SERVICES, INTEGRATE APPLICATION

Tiers: COLLECTION TIER, QUEUING TIER, IN-MEMORY TIER, COMPUTING TIER, RESOURCE MANAGER TIER, SERVICE TIER

CREATING SERVICES

Abstracts access to prepared views

Exposes prediction capabilities

Highly horizontally scalable

Scaling a micro-services cluster → cheaper than scaling a computing cluster

“Extra” Coding phase

EXPLORATORY PHASE – NEED FOR SPEED

We can’t afford to lose time to an inefficient toolset

Interactivity and reactivity to find the optimal result and move forward

NOTEBOOK

REPL evolution

DASHBOARD

SPARK NOTEBOOK

http://spark-notebook.io/

Spark + Scala

Exploration of Big Data

NOTEBOOK DEMO

DASHBOARD

http://redash.io/

Connect to any DB

(Custom) HDFS integration via Drill

Interactive

SQL-like querying

DASHBOARD DEMO

Exposing Views on data

The data science pipeline now has to include a way for results to be consumed by third parties (service-oriented architecture)

What are the results?

Intermediate results and the model need to be exposed

Having services for views (APIs) allows us to abstract the way they are created.

APIs

Expose a stream

Expose intermediate results

Expose models

STREAM

Events Current events/sec Increment counter

API

Total number of events

Average events/sec

# occurrences of specific event

Event details
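The views listed on this slide could be backed by a small service object like the one below. This is a sketch under assumptions: the StreamStats class, its method names and the event fields (id, type, ts) are hypothetical, and a real service would sit behind an HTTP endpoint.

```python
from collections import Counter

class StreamStats:
    def __init__(self):
        self.total = 0
        self.by_type = Counter()
        self.first_ts = None
        self.last_ts = None
        self.events = {}

    def record(self, event):
        # Called for every event on the stream: increment the counters.
        self.total += 1
        self.by_type[event["type"]] += 1
        if self.first_ts is None:
            self.first_ts = event["ts"]
        self.last_ts = event["ts"]
        self.events[event["id"]] = event

    def total_events(self):
        return self.total

    def average_per_sec(self):
        if self.total == 0:
            return 0.0
        elapsed = self.last_ts - self.first_ts
        return self.total / elapsed if elapsed > 0 else float(self.total)

    def occurrences(self, event_type):
        return self.by_type[event_type]

    def details(self, event_id):
        return self.events.get(event_id)

stats = StreamStats()
for e in [
    {"id": "e1", "type": "click", "ts": 0},
    {"id": "e2", "type": "order", "ts": 1},
    {"id": "e3", "type": "click", "ts": 4},
]:
    stats.record(e)
```

Callers only see the four query methods, so the way the counters are maintained can change without affecting them.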

EVENT-DRIVEN APPLICATIONS

Monitor and respond. Real-time
