Startup Safary | Fight against robots with enbrite.ly data platform

Post on 13-Apr-2017


Fight against robots with enbrite.ly data platform

Joe MÉSZÁROS, lead software engineer

@joemesz

joemeszaros

Who are we?

Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.

Ad display fraud (ad stacking, pixel stuffing)

Ad viewability

Brand safety: detecting traffic that comes from unwanted categories (e.g. adult), countries and single domains

39%

Anti fraud detection

DATA COLLECTION

DATA PROCESSING

ANALYZE

ANTI FRAUD

VIEWABILITY

BRAND SAFETY

REPORT + API

What do we do?

How do we do it?

DATA PLATFORM

...so we need to analyze vast amounts of data

Infrastructure + Big Data technologies = enbrite.ly data platform

Amazon Web Services (AWS)

● Most popular cloud service provider
● ~70 services, 13 geographical "regions"
● Amazon Big Data = Elastic MapReduce
● BUT do not trust the BIG guy (API problem)

https://aws.amazon.com/

Apache Hadoop

● de facto Big Data technology
● open source software
● distributed storage (HDFS) + data processing (MapReduce)
● ecosystem: many additional software projects

http://hadoop.apache.org/ | https://github.com/apache/hadoop

Apache Spark

● large-scale data processing engine
● open source software (popular)
● modules: core, sql, streaming, graph, ML
● faster than Hadoop MapReduce

http://spark.apache.org/ | https://github.com/apache/spark

Data platform in numbers

20+ node cluster

16 services

110 servers

0.5 - 4 TB / day

100+ TB on S3

How do we do it?

DATA COLLECTION

How do we do it?

DATA PROCESSING

Let me tell you a short story...

Real world example

You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!

Real world example

THE IDEA: Analyse events which are too hasty and deviate from regular, humanlike profiles: too many clicks in a defined timeframe.

INPUT: Collected events on Amazon S3

OUTPUT: Invalid sessions
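The "too many clicks in a defined timeframe" idea can be tried on a single machine before moving it to a cluster. What follows is only a minimal plain-Java sketch, not the code from the talk: the class name ClickBurstDetector, the method hasBurst and the parameters windowMillis and maxClicks are all hypothetical names chosen for illustration, and the window is assumed to be half-open (two clicks exactly windowMillis apart do not share a window).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the enbrite.ly code base.
final class ClickBurstDetector {

    /**
     * Returns true if some time window shorter than windowMillis
     * contains more than maxClicks click timestamps.
     */
    static boolean hasBurst(List<Long> clickTimestamps, long windowMillis, int maxClicks) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        // Work on a sorted copy so the caller's list is left untouched.
        List<Long> sorted = new ArrayList<>(clickTimestamps);
        Collections.sort(sorted);

        int start = 0;
        for (int end = 0; end < sorted.size(); end++) {
            // Move the left edge until the window spans less than windowMillis.
            while (sorted.get(end) - sorted.get(start) >= windowMillis) {
                start++;
            }
            // Number of clicks currently inside the window.
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> humanLike = Arrays.asList(0L, 4_000L, 9_500L, 15_000L);
        List<Long> botLike = Arrays.asList(0L, 100L, 200L, 300L, 400L);
        System.out.println(hasBurst(humanLike, 1_000L, 3)); // false
        System.out.println(hasBurst(botLike, 1_000L, 3));   // true
    }
}
```

The sliding window visits every timestamp once after sorting, so the cost per session is dominated by the sort.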

How to solve it?

Step 1: sessionize events

Step 2: detect too many clicks

code: https://github.com/enbritely/startup-safary

Step 1: event to session

// configure Spark application
// read events from HDFS
JavaRDD<Event> events = lines.map(Converter::jsonToEvent);

// keep only the click events
JavaRDD<Event> clicks = events.filter(e -> e.type.equals("click"));

// group the clicks by session id (groupBy yields an Iterable per key)
JavaPairRDD<String, Iterable<Event>> grouped = clicks.groupBy(Event::sessionId);

// build one Session per group; values() drops the session id key
JavaRDD<Session> sessions = grouped.mapValues(sessionizer).values();

Application code : https://github.com/enbritely/startup-safary

Step 1: event to session

// Sessionizer
Function<Iterable<Event>, Session> sessionizer = unorderedEvents -> {
    List<Event> clickOrdered = sortyByTimestamp(unorderedEvents);
    // every event in a group shares one session id, and groups are never empty
    Session session = new Session(clickOrdered.get(0).sessionId());
    for (Event event : clickOrdered) {
        session.addClick(event.getTimestamp());
    }
    return session;
};
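The same sessionizing logic can be exercised without Spark, which is handy for unit tests. The sketch below is an assumption-laden stand-in rather than the application code: LocalSessionizer, its nested Event class and the sessionize method are hypothetical, and a session is reduced to a sorted list of click timestamps instead of the Session type used in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine stand-in for the filter + groupBy + sessionizer steps.
final class LocalSessionizer {

    // Minimal event holding only the fields this sketch needs (assumed shape).
    static final class Event {
        final String sessionId;
        final String type;
        final long timestamp;

        Event(String sessionId, String type, long timestamp) {
            this.sessionId = sessionId;
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    /** Returns session id -> click timestamps in ascending order. */
    static Map<String, List<Long>> sessionize(List<Event> events) {
        Map<String, List<Long>> sessions = new TreeMap<>();
        for (Event e : events) {
            // Same filter as in the Spark job: only clicks are kept.
            if (!"click".equals(e.type)) {
                continue;
            }
            // Same role as groupBy: collect events under their session id.
            sessions.computeIfAbsent(e.sessionId, id -> new ArrayList<>()).add(e.timestamp);
        }
        // Same role as the sessionizer: order each session's clicks by time.
        for (List<Long> timestamps : sessions.values()) {
            Collections.sort(timestamps);
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("s1", "click", 30L),
                new Event("s2", "click", 5L),
                new Event("s1", "pageview", 20L),
                new Event("s1", "click", 10L));
        System.out.println(sessionize(events)); // {s1=[10, 30], s2=[5]}
    }
}
```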


Step 2: apply heuristic

JavaRDD<String> badSessions = sessions
    .filter(s -> s.getClickCount() > threshold)
    .map(s -> s.sessionId + ":" + s.clickCount);

// save output to HDFS
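For a quick local check of the heuristic, the filter and the "sessionId:clickCount" output format can be reproduced on an in-memory map. This is only an illustrative sketch: ThresholdHeuristic and badSessions are hypothetical names, and the input is assumed to be a precomputed map from session id to click count rather than an RDD of Session objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine version of the "too many clicks" filter.
final class ThresholdHeuristic {

    /** Returns "sessionId:clickCount" for every session above the threshold. */
    static List<String> badSessions(Map<String, Integer> clickCounts, int threshold) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : clickCounts.entrySet()) {
            // Strictly greater than, matching the filter in the Spark job.
            if (entry.getValue() > threshold) {
                bad.add(entry.getKey() + ":" + entry.getValue());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, Integer> clickCounts = new LinkedHashMap<>();
        clickCounts.put("s1", 3);
        clickCounts.put("s2", 250);
        clickCounts.put("s3", 50);
        System.out.println(badSessions(clickCounts, 50)); // [s2:250]
    }
}
```

A fixed threshold is the simplest possible rule; choosing its value is a tuning decision that depends on the traffic being analysed.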

Live demo!

● 4 node EMR (Hadoop) Cluster
● Apache Spark 1.6.1
● 1 GB input events

build app : create-cluster : events S3 -> HDFS : submit app

Congratulations!

MISSION COMPLETED

YOU just saved the world with a simple idea within ~10 minutes.

WE ARE HIRING!

working @exPrezi office, K9

check out the company in Forbes :-)

amazing company culture

BUT the real reason ….

WE ARE HIRING!

… is our mood manager, Bigyó :)

BEYOND enbrite.ly

...our investor and event sponsor is looking for talented guys

Joe MÉSZÁROS, lead software engineer
joe@enbrite.ly

@joemesz @enbritely

joemeszaros enbritely

THANK YOU!

?QUESTIONS?