Date posted: 16-Mar-2018
Category: Science
Uploaded by: vlad-sandulescu
DATA:CPH MEETUP VLAD [email protected] - Copenhagen, Denmark
TODAY
Adform
Our group and how we work
Forecasting website traffic problem in our context
Timeseries forecasting and stationarity
One real example
Silver bullet resources
ADFORM PLATFORM
+21000 Advertisers
+1000 Agencies
+400 Publishers
+600 Employees
16 Countries
WITH GLOBAL INFRASTRUCTURE
50bn daily transactions
6 data centers
1m bid requests/sec
THE DATA
User visits website: data collection
unique CookieId + visit timestamp
geographical data from IP
device data - type, OS, ISP
browser data - type, language
user profile data from publisher = cookie data: gender, income, work status, etc.
price the user profile sells for on RTB (= Ads Stock Exchange)
running contracts - matches campaign goals
RESEARCH SUPPLY GROUP
building data-driven products for the publishers:
traffic forecasting
guaranteed delivery
inventory availability
delivery forecasting
yield optimization using RTB (= Ads Stock Exchange)
pricing recommendations
audience extension
OUR LITTLE CAVE
we love and
prototype => simulate data or sample
model validation => real data (as much of it)
POCs + tips to scale them = happy dev team!
Dev team research flows, fast data structures and scaling
scalability: (Scala)
deployment:
monitoring:
MODEL TO PRODUCTION
reuse the POC code or rewrite?
run fully offline or online?
how often do we need to refresh the model?
how scalable is it, do we know its bounds?
one model or multiple models?
how can we sneak a new model in unnoticed?
what is our baseline?
is it easy to tune the model?
is the model code well separated from the rest?
time to market -
“Anything is better than nothing”
“We don’t need to think of everything upfront…it just needs to work in most cases”
… (regression through all these points!)
CONCRETE CASE: FORECASTING
how many mobile users from Copenhagen will visit my website next week?
how about next Tuesday from 10 AM to 1 PM for this specific banner placement?
how many English-speaking users who are not using mobile (so tablet, desktop, TV, etc.) from Vesterbro will I get 3 weeks from now, between 8 PM and 9 PM?
why is this useful?
if I know what users I can expect, I can better organize my inventory
how much I can sell as guaranteed
forecast delivery of impressions
yield optimization using RTB markets
MULTI-DIMENSIONALITY AND CNF LOGIC
a user visit usually has many dimensions, and each dimension has tens or hundreds of values: geo features (country, region, zip codes, etc.), device type (mobile, desktop, etc.), browser type, ISP, etc.
From DK, but not from CPH, using mobile, language English or Danish, using Firefox or Safari, not from Telia
DK /\ (not CPH) /\ mobile /\ (English \/ Danish) /\ (Firefox \/ Safari) /\ (not Telia)
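A minimal sketch of how such a CNF query could be represented and checked against a single visit. The dimension names (`country`, `city`, `device`, `language`, `browser`, `isp`), the tuple layout and the sample visit are assumptions made for illustration; the slides do not show the actual data model.

```python
from typing import Dict, List, Tuple

# A literal is (dimension, value, negated). A clause is an OR of literals,
# and a query is an AND of clauses (conjunctive normal form).
Literal = Tuple[str, str, bool]
Clause = List[Literal]
Query = List[Clause]


def literal_holds(visit: Dict[str, str], literal: Literal) -> bool:
    """True if the visit satisfies one (possibly negated) literal."""
    dimension, value, negated = literal
    matches = visit.get(dimension) == value
    return not matches if negated else matches


def matches_query(visit: Dict[str, str], query: Query) -> bool:
    """Every clause must hold; a clause holds if any of its literals does."""
    return all(any(literal_holds(visit, lit) for lit in clause) for clause in query)


# DK /\ (not CPH) /\ mobile /\ (English \/ Danish) /\ (Firefox \/ Safari) /\ (not Telia)
QUERY: Query = [
    [("country", "DK", False)],
    [("city", "CPH", True)],
    [("device", "mobile", False)],
    [("language", "English", False), ("language", "Danish", False)],
    [("browser", "Firefox", False), ("browser", "Safari", False)],
    [("isp", "Telia", True)],
]

# Hypothetical visit, invented for this example.
sample_visit = {
    "country": "DK", "city": "Aarhus", "device": "mobile",
    "language": "Danish", "browser": "Safari", "isp": "TDC",
}
print(matches_query(sample_visit, QUERY))
```

Keeping each clause as a flat list makes negation and OR cheap to evaluate per visit, which matters when every dimension has tens or hundreds of values.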
SEGMENTS INDEPENDENCE ASSUMPTION
country shares: Denmark 0.5, Germany 0.25, France 0.25
device shares: mobile 0.5, other 0.5
Query: all users for Denmark and mobile
ratio(Denmark) * ratio(mobile)
global forecast, adjusted by 0.5 x 0.5 = 0.25
truly independent? can we do better?
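The adjustment under the independence assumption can be sketched as below. The function name `independent_forecast` is hypothetical, and the global forecast of 1M users is borrowed from the slides' own error example; only the two 0.5 ratios come from this slide.

```python
from math import prod
from typing import Sequence


def independent_forecast(global_forecast: float, ratios: Sequence[float]) -> float:
    """Scale a global forecast by the product of per-segment ratios.

    This treats the segments as statistically independent, which is
    exactly the assumption the slide questions.
    """
    return global_forecast * prod(ratios)


ratio_denmark = 0.5   # share of all users coming from Denmark
ratio_mobile = 0.5    # share of all users on mobile
global_users = 1_000_000  # assumed global forecast for the period

estimate = independent_forecast(global_users, [ratio_denmark, ratio_mobile])
print(estimate)  # 250000.0
```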
SEGMENTS INDEPENDENCE ASSUMPTION
[Figure: country shares for Denmark, Germany and France, with the mobile share inside Denmark at 0.75; a second share of 0.25 is also shown]
actual share: 0.5 x 0.75 = 0.375
error = 0.375 - 0.25 = 0.125 (x 1M forecasted users) => 125,000 users
offline precomputation of the fractions of one segment relative to another is simply infeasible: way too many combinations
(get a query -> parse it -> compute forecasts for the parsed query -> return forecasts) < 200ms
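The arithmetic above, written out as a small sketch. The variable names are made up for illustration; the numbers are the ones on the slide.

```python
# Shares taken from the slide's example.
share_denmark = 0.5             # Denmark's share of all users
share_mobile_overall = 0.5      # mobile's share of all users
share_mobile_in_denmark = 0.75  # mobile's share among Danish users

# Independence assumption: multiply the two marginal shares.
assumed_share = share_denmark * share_mobile_overall

# What actually happens: use the conditional share inside Denmark.
actual_share = share_denmark * share_mobile_in_denmark

forecasted_users = 1_000_000
error_share = actual_share - assumed_share
error_users = error_share * forecasted_users

print(assumed_share, actual_share, error_share, error_users)
# 0.25 0.375 0.125 125000.0
```

Storing a conditional share for every pair (or larger combination) of segment values would remove this error, but the number of combinations is what makes offline precomputation infeasible.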
TIMESERIES FORECASTING GIST
Timeseries = observations collected at constant time intervals
Timeseries:
time dependent => the independence assumption of the observations does not hold
seasonality, trends => variations within specific time windows
A stationary time series:
is one whose properties do not depend on the time at which the series is observed
is easier to predict, since its statistical properties will be the same in the future as they are now
TIMESERIES FORECASTING GIST
series mean should be constant
variance should be constant
covariance of the i-th term and the (i + k)-th term should not depend on time
(*)
* analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling
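A rough way to eyeball the three conditions is to compute the statistics on separate windows of the series and see whether they drift. This is only an informal sketch, not a formal statistical test; the helper names and the two toy series are invented for the example.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def window_stats(series: Sequence[float], window: int) -> List[Tuple[float, float]]:
    """(mean, variance) of consecutive non-overlapping windows."""
    chunks = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
    return [(mean(chunk), pvariance(chunk)) for chunk in chunks]


def mean_drift(series: Sequence[float], window: int) -> float:
    """Spread between the largest and smallest window mean."""
    means = [m for m, _ in window_stats(series, window)]
    return max(means) - min(means)


def autocovariance(series: Sequence[float], lag: int) -> float:
    """Sample covariance between term i and term i + lag."""
    m = mean(series)
    n = len(series)
    return sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag)) / n


# Toy data: a repeating pattern (constant mean) and a pure trend (drifting mean).
periodic = [1, 2, 3, 4] * 6
trending = list(range(24))

print(mean_drift(periodic, 4))  # 0.0
print(mean_drift(trending, 4))  # 20.0

# Lag-k covariance on each half; for a stationary series these should be similar.
half = len(periodic) // 2
print(autocovariance(periodic[:half], 1), autocovariance(periodic[half:], 1))
```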
LOOK AT THE DATA
[Figure: observed visits (obsVisits) over time, Feb 04 to Feb 16, actual vs. forecasted; y-axis from 0e+00 to 6e+05]
GETTING TO YOUR MODEL
which model should you choose? ETS, STL, ARIMA?
publishers have different time series patterns
one case: we observed strong daily and weekly patterns, so we focused on a model which supports multiple seasonalities (TBATS)
exploratory analysis of the time series correlations will (hopefully!) point you to the right model
it is still hard to pinpoint which model works best, so you have to experiment with different types of models and see which gives the smallest error
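The "experiment and keep the smallest error" loop can be sketched with two simple baselines. These stand in for real candidates such as ETS, ARIMA or TBATS, which are not implemented here; the toy series, the period of 4 and the choice of mean absolute error are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence

Forecaster = Callable[[Sequence[float], int], List[float]]


def naive(history: Sequence[float], horizon: int) -> List[float]:
    """Repeat the last observed value."""
    return [history[-1]] * horizon


def seasonal_naive(period: int) -> Forecaster:
    """Repeat the last full season of length `period`."""
    def forecast(history: Sequence[float], horizon: int) -> List[float]:
        return [history[-period + (h % period)] for h in range(horizon)]
    return forecast


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pick_best(train: Sequence[float], test: Sequence[float],
              candidates: Dict[str, Forecaster]) -> Dict[str, float]:
    """Holdout error per candidate, sorted from best to worst."""
    errors = {name: mae(test, f(train, len(test))) for name, f in candidates.items()}
    return dict(sorted(errors.items(), key=lambda item: item[1]))


# Toy series with a strong repeating pattern of length 4.
series = [10, 20, 60, 30] * 6
train, test = series[:-4], series[-4:]

errors = pick_best(train, test, {"naive": naive, "seasonal_naive": seasonal_naive(4)})
print(errors)  # {'seasonal_naive': 0.0, 'naive': 15.0}
```

The same loop also gives a baseline to beat: a more complex model is only worth shipping if its holdout error is clearly below the seasonal-naive one.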
RESOURCES
Forecasting Yoda = Hyndman and his silver bullet is https://www.otexts.org/fpp
fast, super basic, easy to read intro: analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling
@adforminsider