
BDM39: How MindGeek's Ad Network uses Big Data Technologies to push Billions of Impressions per Day...

Date post: 17-Aug-2015
Upload: big-data-montreal
Big Data @ MindGeek Big Data Montreal #39 Olivier H. Beauchesne, Lead Data Scientist
Transcript

Big Data @ MindGeek

Big Data Montreal #39 Olivier H. Beauchesne, Lead Data Scientist

Plan

- Who is MindGeek?
- Data Processing Challenges
- Big Data as a Service
- Lightning: Our data stack
  - Goal
  - Technologies
  - Architecture
  - Customizations
- Uses
  - Ads Data Processing (including real-time processing)
  - User Profiling
  - Anomaly Detection
  - Fraud Detection
- Q&A

04/18/2023

Who is MindGeek?

- Founded in 2004
- More than 1,000 employees
- Specialized in high-traffic, high-volume websites
- Active in Content Delivery, Streaming Media and Online Advertising
- Behind several of the biggest, most trafficked sites online
  - 100M+ daily visitors
  - More than 3 billion ad impressions served every day

You might have heard of our brands:


Data Processing Challenges

- TrafficJunky.net, our ad platform, has existed for a number of years
  - Generates >10TB of structured data per day
  - Cannot go down, ever
- RDBMS for everything
  - Administration interface
  - Statistics
  - Aggregation
- Throwing money at the problem gets old
  - Buy bigger servers
  - Build more features
  - Repeat
- A new approach was needed
  - While supporting the legacy system, adding new features and building the new system

Big Data as a Service

- Lots of acquisitions in the past years
  - Several different technology stacks/architectures
  - Several companies in a company
- Create a company-wide data transport, transformation, aggregation and storage service
  - Dedicated experts
  - Developers, Data Scientists and Business Analysts can concentrate on their knowledge domain
  - Cost reduction
  - Raises the Bus Factor
  - No more weekend emails (at least, for me)!

Lightning!

Goal
- A real-time, highly available, scalable end-to-end solution to transport, process, store and report on any business-critical data

Technologies used
- Transportation: Kafka, Flume, ZeroMQ, scp
- Processing: Hadoop (YARN), Samza, Tez
- Storage: HDFS
- Reporting: PostgreSQL, Hue, Hive, TJ Analytics, etc.
- Management: ZooKeeper, custom software to monitor, configure and submit jobs

Lightning!

Samza & Kafka as the core infrastructure
- Stream based
  - Aggregate and discard
  - Online algorithms
  - Low-latency decisions (in the same session)
- Scalable
  - Spinning up new instances is fast
  - If something crashes, we can rewind the feed and start over
- High availability
  - Fits our problem domain
- Disaster recovery
  - Multiple data centers
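
The "aggregate and discard" and "online algorithms" points can be illustrated with a minimal sketch. It is not MindGeek's code: it uses Welford's online algorithm (a standard technique, chosen here for illustration) to keep a running mean and variance per key without storing raw events. The feed contents, the `zone-a`/`zone-b` keys and the `consume` helper are hypothetical. A plain Python list stands in for a replayable Kafka topic, so "rewinding the feed" is just consuming again from an earlier offset.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one event has been seen.
        return self.m2 / self.count if self.count else 0.0


def consume(feed, start_offset=0):
    """Fold a replayable feed into per-key stats; raw events are not kept."""
    stats = {}
    for key, value in feed[start_offset:]:
        stats.setdefault(key, RunningStats()).update(value)
    return stats


# Hypothetical feed of (zone_id, response_time_ms) events.
feed = [("zone-a", 10.0), ("zone-b", 4.0), ("zone-a", 14.0), ("zone-a", 12.0)]
stats = consume(feed)

# After a simulated crash, replaying from offset 0 rebuilds the same state.
recovered = consume(feed, start_offset=0)
```

Because the state is a pure function of the feed, recovery needs no backup of the aggregates themselves, only the ability to replay the log.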

Lightning – Architecture


Lightning – Customization

- Health Monitoring
- Kafka Administration

Lightning – Customization

- Samza Worker Upload/Configuration/Administration
- Cluster/Job Monitoring and Debugging

Ad Impressions Data Processing

[Diagram. Labels: Delivery Servers, Kafka Topic, Write to HDFS/Hive, Hive, Pre-Aggregation, Aggregation, Data Mart, In-Memory Database, Live Database, BI Interface: TJ Analytics, HUE, Data Science Deployment. Audiences: Sales, Data Science, Engineers.]


Detailed processing for the Data Mart

[Diagram. Labels: Delivery Servers, Kafka Topic, Filtering, Event Aggregation, Meta-Aggregation, Meta-data Update, Output Kafka Topic, Data Mart, Persistence Layer, TJ Analytics. Annotations: process function, called at every event; window function, called every minute; window function, called every 15 minutes.]
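
The split between a per-event process function and timed window functions can be sketched as follows. This is a plain-Python imitation of the pattern, not the Samza API and not the actual TrafficJunky job. The class name, the event fields (`type`, `campaign_id`, `country`), the filter rule, and the assignment of each timer to a stage are all assumptions made for illustration.

```python
from collections import defaultdict


class DataMartTask:
    """Imitates a process()/window() task; all names here are hypothetical."""

    def __init__(self):
        self.minute_counts = defaultdict(int)   # cleared by the 1-minute window
        self.quarter_counts = defaultdict(int)  # cleared by the 15-minute window

    def process(self, event):
        """Called for every event: filter, then aggregate and discard."""
        if event.get("type") != "impression":   # assumed filter rule
            return
        key = (event["campaign_id"], event["country"])
        self.minute_counts[key] += 1

    def window_minute(self):
        """Called every minute: fold event counts into the meta-aggregate."""
        for key, n in self.minute_counts.items():
            self.quarter_counts[key] += n
        self.minute_counts.clear()

    def window_quarter_hour(self, emit):
        """Called every 15 minutes: emit one row per key, then reset."""
        for (campaign_id, country), n in sorted(self.quarter_counts.items()):
            emit({"campaign_id": campaign_id, "country": country, "impressions": n})
        self.quarter_counts.clear()


task = DataMartTask()
events = [
    {"type": "impression", "campaign_id": 7, "country": "CA"},
    {"type": "click", "campaign_id": 7, "country": "CA"},
    {"type": "impression", "campaign_id": 7, "country": "CA"},
    {"type": "impression", "campaign_id": 9, "country": "FR"},
]
for event in events:
    task.process(event)
task.window_minute()

rows = []  # stands in for the output Kafka topic
task.window_quarter_hour(rows.append)
```

In a real deployment the stream framework's scheduler would invoke the window functions; here they are called by hand so the flow is easy to follow.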


User Profiling

[Diagram. Labels: Delivery Servers (two groups), Kafka Topic, Partitioning by hashed GUID, Content-based Profiling, Secret Sauce, Output Kafka Topic, Fast Storage, User Profile API. Annotation: process function, called on every event.]
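
Partitioning by hashed GUID can be sketched as below. The point of the technique is that every event for a given user lands on the same worker, so that worker can keep the user's profile in local state. The hash function (SHA-256), the partition count, the `ProfilingWorker` class and the `guid`/`category` event fields are assumptions for illustration; the slides do not say how profiles are actually computed.

```python
import hashlib
from collections import Counter, defaultdict


def partition_for(guid, num_partitions):
    """Map a GUID to a partition using a hash that is stable across processes."""
    digest = hashlib.sha256(guid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class ProfilingWorker:
    """Hypothetical worker that owns the profiles of one partition."""

    def __init__(self):
        self.profiles = defaultdict(Counter)

    def process(self, event):
        # Called on every event: count content categories seen per user.
        self.profiles[event["guid"]][event["category"]] += 1


NUM_PARTITIONS = 4
workers = [ProfilingWorker() for _ in range(NUM_PARTITIONS)]
events = [
    {"guid": "user-1", "category": "sports"},
    {"guid": "user-2", "category": "travel"},
    {"guid": "user-1", "category": "sports"},
    {"guid": "user-1", "category": "music"},
]
for event in events:
    workers[partition_for(event["guid"], NUM_PARTITIONS)].process(event)

owner = workers[partition_for("user-1", NUM_PARTITIONS)]
```

Python's built-in `hash()` is deliberately avoided for strings because it is randomized per process, which would send the same GUID to different partitions after a restart.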

Anomaly Detection

[Diagram. Labels: Delivery Servers, Kafka Topic, Correlator, Template Storage, Event Storage, Template building, Pre-aggregation, BI Interface: TJ Analytics, Alerting System. Audiences: Sales, Data Science, Engineers.]
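
The slides do not define what a "template" is or how the correlator works. As one possible reading, the sketch below assumes a template is a per-time-slot baseline (mean and standard deviation) built from history, and that incoming values are flagged when they deviate too far from it. The slot naming, the threshold and both function names are hypothetical.

```python
from statistics import mean, pstdev


def build_template(history):
    """Turn {slot: [past values]} into {slot: (mean, stddev)} baselines."""
    return {slot: (mean(values), pstdev(values)) for slot, values in history.items()}


def is_anomalous(template, slot, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations off its baseline."""
    if slot not in template:
        return False  # no baseline yet, so nothing to compare against
    mu, sigma = template[slot]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical history: impressions per minute seen on past Mondays at 09:00.
history = {"mon-09": [100, 110, 90, 105, 95]}
template = build_template(history)

normal = is_anomalous(template, "mon-09", 104)
spike = is_anomalous(template, "mon-09", 200)
```

A fixed z-score threshold is the simplest choice; a production system would likely need seasonality handling and protection against anomalies polluting the baseline.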

Anomaly Detection – Interface


Fraud Detection

[Diagram. Labels: Paysite Login Pages, HTTPS, Login Page Analysis, Event Aggregator, API, Kafka Topic, Flume, Samza Worker, Knowledge DB, Results DB, Billing Software, PostgreSQL, DynamoDB.]

Fraud Detection – Interface


Q&A

Thanks! We’re hiring
