Customer flow forecasting with GBRT: the benefits of
adopting a customized machine learning approach
Shaohui Ma, Nanjing Audit University, Nanjing, 211815, China
and Robert Fildes, Lancaster Centre for Marketing Analytics and Forecasting
Department of Management Science, Lancaster University
THE TIME SERIES FORECASTING PROBLEM - THE SEARCH FOR THE HOLY GRAIL
Two distinct problems when choosing a method
• A single series
• Multiple (related) series
Two solutions
• Aggregate selection
–The same model/method applied whatever the context
• Individual selection
–The method is selected depending on the series characteristics and the pool of (related) series
FORECASTING COMPETITIONS - WHAT ARE THEY FOR?
The grail: a class of methods which dominates alternatives
• Benchmark how ‘simple’ methods compare to a model class
–E.g. how badly Holt-Winters compares to ARIMA (Newbold and Granger, 1974)
• Evaluate and compare different methods
–“allow practitioners to select the most appropriate method(s) for their forecasting needs” (Makridakis et al., 2019)
• Introduce and test new methods
–Machine Learning methods introduced in the M3 (2001) competition: one neural net variant
–Theta
–Extended to open ML submissions (M4, 2019)
The Characteristics of Forecasting Competitions
1. Specify the population of relevant time series.
2. Define the forecasting task precisely.
Lead time
Information set
3. Specify the forecasting methods to be considered
• Include standard benchmarks
• Include current practice.
4. Define a range of performance measures (linked to value)
5. Specify the data to be used in training the methods.
6. Calculate error measures and choose best method.
History of Forecasting Competitions
| Contribution | Methods | Objectives & comments | Series |
|---|---|---|---|
| Nottingham: Reid, Granger: 1974 | B-J, ES, AR, Combining | To assess the loss from using automatic methods; to assess relative performance | 106 |
| Makridakis & Hibon: 1979 | 13 core methods + seasonal & combining | Establishing the conditions under which one method outperforms alternatives; help forecasters choose | 111 |
| M1: 1982 | 14 core methods, 3 new (damped trend), 2 dropped | | 1001 |
| Meese, Geweke: 1984 | ARIMA, AR, ARARMA | Examine effects of pre-processing, e.g. detrending, deseasonalizing | 150 macro series |
| M3: 2000 | Core methods + Theta + NN + Rule based + automatic ARIMA | Final attempt to settle the accuracy issue | 3003 |
| M4: 2019 | As above + ML: 50 individual submissions, 61 in total | Learn how to improve accuracy to improve the theory and practice of forecasting; include prediction intervals | 100K |
In addition: competitions on homogeneous data: telecommunications, energy, tourism
The ML results
• In M3 the NN implementation performed poorly
• In Crone et al. (IJF, 2011) NN and machine learning methods performed poorly on a subset of the M3 data (chosen for series length)
• In Makridakis et al. (PLoS ONE, 2018) ML methods ‘out-of-the-box’ performed poorly on a subset of M3
–Despite an earlier evaluation which excluded standard benchmarks
• In M4 a hybrid deep learning neural network model which could
learn cross-sectional patterns won the competition
© 2017 Wessex Press, Inc. Principles of Business Forecasting 2e (Ord, Fildes, Kourentzes) • Chapter 12: Putting Forecasting Methods to Work
Objections to forecasting competitions
• Lack of clarity of the objectives
–Results are aggregate, tells us little about how to tackle a specific
problem
• No defined population of time series
–Statistical significance
• There is a single optimal method
–Combining is necessarily sub-optimal (but it works!)
• Competence or otherwise of the contributors (applied to
all competitions from ARIMA to ML)
• Error measures
–Aggregation over time and over series
–Use of a single time origin
A new competition – a case study of mobile payment data
• Clear objectives
–To provide (small) retailers with short-term forecasts of 1-14 days to improve their planning
• A clear population of time series
–Retailers using a mobile payment platform
• Not particularly homogeneous: geography, shop type
• Range of methods considered
–Single series statistical benchmarks and ML methods compared to pooled methods
• Pooling even heterogeneous series can improve parameter estimation
• Rigorous evaluation
–Range of problem relevant error measures considered
The practical and research question: can a ML method outperform statistical benchmarks in a particular context?
BACKGROUND: MOBILE PAYMENT DATA COLLECTION PROCESS AND CUSTOMER FLOW
FORECASTING
CUSTOMER FLOW FORECASTING WITH THIRD-PARTY MOBILE PAYMENT DATA
The practical question: can we help millions of small businesses improve their operations by providing professional customer flow forecasts based on third-party payment data?
INNOVATIONS – THE RESEARCH ISSUES
• a novel application in retailing using newly emerging mobile payment ‘big’ data.
• identifies a set of important predictors for forecasting daily customer flows
• explores the benefit of complex models using data pooling for forecasting many time series.
• develops a general solution for forecasting many time series based on regression trees (Gradient Boosting Regression Trees).
• proposes a new strategy to generate multi-step ahead forecasts
• provides experimental comparisons on various forecasting strategies for generating multiple steps ahead
Overall,
• To demonstrate rigorously the effectiveness of an ML method
METHODOLOGICAL FRAMEWORK
Many possible drivers
Data is messy
Alternative strategies for multi-step forecasting
Estimation
TIME SERIES TRANSFORMATION FOR H-STEP AHEAD FORECASTING
• Pseudo MIMO strategy: the vector of lead-time forecasts is jointly estimated
• Direct strategy: x_{t+k} is based on the information available up to time t
• Recursive strategy: x_{t+k} is based on x_{t+k-1}
X_{i,t} is the number of customers visiting store i on day t
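As a rough illustration of how the strategies differ in the way a series is turned into training examples, the sketch below builds (features, target) pairs for a single store. It uses only lagged flows as features, and the function names are hypothetical. The joint-target pairing is a generic MIMO-style layout, not the authors' pseudo MIMO procedure, whose details are not given on the slide.

```python
from typing import Callable, List, Tuple


def direct_pairs(x: List[float], n_lags: int, k: int) -> List[Tuple[List[float], float]]:
    """Direct strategy: a separate dataset per lead time k; the target is x[t+k]."""
    pairs = []
    for t in range(n_lags - 1, len(x) - k):
        features = x[t - n_lags + 1 : t + 1]  # the last n_lags values known at time t
        pairs.append((features, x[t + k]))
    return pairs


def joint_pairs(x: List[float], n_lags: int, horizon: int) -> List[Tuple[List[float], List[float]]]:
    """MIMO-style layout: the target is the whole vector x[t+1], ..., x[t+horizon]."""
    pairs = []
    for t in range(n_lags - 1, len(x) - horizon):
        features = x[t - n_lags + 1 : t + 1]
        pairs.append((features, x[t + 1 : t + horizon + 1]))
    return pairs


def recursive_forecast(
    x: List[float], n_lags: int, horizon: int, one_step: Callable[[List[float]], float]
) -> List[float]:
    """Recursive strategy: apply a one-step model repeatedly, feeding forecasts back in."""
    window = list(x[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        f = one_step(window)
        forecasts.append(f)
        window = window[1:] + [f]  # the forecast replaces the oldest lag
    return forecasts
```

The recursive strategy needs only one fitted model but lets forecast errors accumulate through the fed-back values, whereas the direct strategy fits one model per lead time.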
FEATURE EXTRACTION
• Lags of the customer flow
–Lags 1 to L
• Local dynamics
–Moving 20th/50th/80th percentiles and standard deviations over the last 1 to w = W/7 weeks
• Global summaries
–The ratio between the mean of the flow on the day of the week (Monday to Saturday) and the global mean
• Store specific characteristics
–City, category, comments, average payments
• Seasonality
–Day of week
• Calendar events
–Holidays, the day before/after a holiday
• Weather
–Temperature, wind strength, precipitation
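A minimal sketch of how the lag, local-dynamics, global-summary and day-of-week features listed above might be computed for one store when forecasting day t+1. The function names, the feature names and the percentile interpolation rule are all assumptions made for illustration. Store characteristics, calendar events and weather would be joined from other tables and are omitted here.

```python
import statistics
from typing import Dict, List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def build_features(
    flow: List[float], weekday: List[int], t: int, n_lags: int, window_weeks: int
) -> Dict[str, float]:
    """Features for predicting day t+1 from data up to day t (assumes enough history)."""
    feats: Dict[str, float] = {}
    # Lags 1..L of the customer flow, relative to the target day t+1
    for lag in range(1, n_lags + 1):
        feats[f"lag_{lag}"] = flow[t - lag + 1]
    # Local dynamics: moving percentiles and standard deviation over the last w weeks
    recent = flow[t - 7 * window_weeks + 1 : t + 1]
    for q in (20, 50, 80):
        feats[f"p{q}_{window_weeks}w"] = percentile(recent, q)
    feats[f"sd_{window_weeks}w"] = statistics.stdev(recent)
    # Global summary: mean flow on the target's day of week relative to the global mean
    target_dow = (weekday[t] + 1) % 7
    history = flow[: t + 1]
    same_dow = [v for v, d in zip(history, weekday) if d == target_dow]
    feats["dow_ratio"] = statistics.mean(same_dow) / statistics.mean(history)
    # Seasonality: day of week of the target day, coded 0..6
    feats["target_dow"] = float(target_dow)
    return feats
```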
FEATURE SELECTION PROCESS
• Core features
–First, with a preset group of core features included (such as day of week and calendar events) and a maximum estimation window W, the optimal number of lags (L≤W) is determined
• Local dynamics
–Then the optimal window width w (up to W/7 weeks) is determined to construct local indicators capturing the local dynamics, e.g. moving median over the last 3 weeks
• Outside factors
–Weather, shop type
The first two parts are based on a forward selection process, and the third part is based on an individual evaluation.
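The two forward-selection stages can be sketched as below. This is an illustration only: `validation_error` is a hypothetical callable that would refit the model and return a validation error for a given number of lags and window width, and the rule of stopping at the first candidate that fails to improve is an assumption, not the procedure used in the study.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_improving_run(
    candidates: Iterable[int], error_of: Callable[[int], float]
) -> Tuple[Optional[int], float]:
    """Walk the candidates in order and stop at the first one that does not improve."""
    best, best_err = None, float("inf")
    for c in candidates:
        err = error_of(c)
        if err < best_err:
            best, best_err = c, err
        else:
            break
    return best, best_err


def select_lags_then_window(
    max_window_days: int, validation_error: Callable[[int, Optional[int]], float]
) -> Tuple[Optional[int], Optional[int]]:
    """Stage 1 picks the number of lags L; stage 2 picks the window width w in weeks."""
    # Stage 1: core features plus lags 1..L, no local summaries (w is None)
    n_lags, err_lags = first_improving_run(
        range(1, max_window_days + 1), lambda L: validation_error(L, None)
    )
    # Stage 2: keep L fixed and add local summaries over w = 1..W/7 weeks
    w, err_w = first_improving_run(
        range(1, max_window_days // 7 + 1), lambda weeks: validation_error(n_lags, weeks)
    )
    if err_w >= err_lags:
        w = None  # local summaries did not help, so leave them out
    return n_lags, w
```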
DATA
• A random sample of 2000 stores from a leading mobile payment platform in China, comprising a log of 19.6 million platform payments from July 2015 to October 2016.
Average daily customer flow over 2000 stores
EXPERIMENTAL DESIGN
• The training set spans 401 days
• Two test sets, each consisting of 42 days of customer flow data
• Rolling origin, advanced every two weeks
• Operational forecast lead times: 1 day and 1-14 days
• Evaluation metrics
–sMAPE, symmetric Mean Absolute Percentage Error
–MdAPE, Median Absolute Percentage Error
–AvgRelMAE, Average Relative Mean Absolute Error
–MPE, Mean Percentage Error
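The rolling design can be sketched as a list of forecast origins. The sketch assumes that the two 42-day test sets follow the 401 training days consecutively and that each origin produces forecasts for the next 14 days; the function name and the zero-based day indexing are illustrative, and whether models are re-estimated at each origin is not stated on the slide.

```python
from typing import List, Tuple


def rolling_origins(
    n_train: int, n_test: int, step: int, horizon: int
) -> List[Tuple[int, List[int]]]:
    """Return (origin, target_days) pairs; origin is the index of the last observed day."""
    splits = []
    origin = n_train - 1            # zero-based index of the last training day
    last_day = n_train + n_test - 1  # zero-based index of the last test day
    while origin + horizon <= last_day:
        targets = list(range(origin + 1, origin + horizon + 1))
        splits.append((origin, targets))
        origin += step
    return splits


# 401 training days, 2 x 42 test days, origin moved forward 14 days at a time
splits = rolling_origins(n_train=401, n_test=84, step=14, horizon=14)
```

Under these assumptions the design yields six forecast origins, each covering a distinct two-week block of the test period.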
BENCHMARK MODELS
• Time series models
–Last Week, Naïve, automated ETS, Theta, automatic ARIMA
• Pooled models
–Lasso regression
–Random Forest (RF)
  • Decision trees to split the sample
  • Different ‘weak learner’ models developed for each split
  • Bootstrap aggregation of ‘weak learners’
  • Performed well compared to other ML methods in Kaggle
–Gradient Boosted Regression Tree (GBRT), implemented with ‘Xgboost’: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
  • An additive model that adds weak learners to minimize the loss function
  • Trees are parameterized
  • An iterative functional gradient descent algorithm adds each new tree (weak learner)
  • One of the most preferred choices in data analytics competitions
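To make the additive, gradient-descent idea behind GBRT concrete, here is a toy sketch for squared-error loss using one-split trees (stumps). It is not XGBoost, which adds regularization, second-order information and many engineering refinements; all names and default values below are illustrative assumptions.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]  # (feature index, threshold, left value, right value)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimizes the squared error of the residuals."""
    best, best_sse = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best_sse:
                best, best_sse = (j, thr, lm, rm), sse
    if best is None:  # no valid split: fall back to a constant prediction
        m = sum(residuals) / len(residuals)
        best = (0, float("inf"), m, m)
    return best


def predict_stump(stump: Stump, row: List[float]) -> float:
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gbrt(X, y, n_trees: int = 100, learning_rate: float = 0.1):
    """Each new tree is fitted to the residuals, the negative gradient of squared loss."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, trees


def predict_gbrt(model, row: List[float]) -> float:
    base, learning_rate, trees = model
    return base + learning_rate * sum(predict_stump(s, row) for s in trees)
```

In the pooled setting, the rows of `X` would be feature vectors stacked across all stores and days, so that one model is estimated from the whole pool.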
RESULTS: FEATURE SELECTION
Nemenyi test results for rank of models with various lags and window widths for constructing local summaries at 5% significance level. The critical distance for the Nemenyi test is 0.106 and 0.036 respectively.
RANK OF FEATURE IMPORTANCE ON SELECTED FEATURES (XGBOOST)
ONE-DAY AHEAD FORECASTS: ACCURACY
T1 = Test set 1, T2 = Test set 2, T1&2 = Test sets 1 and 2 combined
| Method | T1 sMAPE | T1 MdAPE | T1 AvgRelMAE | T1 MPE | T2 sMAPE | T2 MdAPE | T2 AvgRelMAE | T2 MPE | T1&2 sMAPE | T1&2 MdAPE | T1&2 AvgRelMAE | T1&2 MPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Last Week | 0.130 | 0.082 | 1.000 | 6.330 | 0.150 | 0.103 | 1.000 | -0.781 | 0.138 | 0.091 | 1.000 | -0.973 |
| Naive | 0.120 | 0.080 | 0.943 | 4.322 | 0.112 | 0.077 | 0.776 | 0.062 | 0.115 | 0.079 | 0.833 | 0.033 |
| ETS | 0.103 | 0.066 | 0.789 | 2.309 | 0.099 | 0.069 | 0.674 | -2.734 | 0.099 | 0.067 | 0.713 | -2.360 |
| Theta | 0.106 | 0.069 | 0.831 | 5.093 | 0.100 | 0.069 | 0.689 | -0.838 | 0.102 | 0.068 | 0.741 | -0.186 |
| ARIMA | 0.105 | 0.071 | 0.819 | 2.324 | 0.108 | 0.078 | 0.731 | -2.788 | 0.105 | 0.074 | 0.758 | -2.457 |
| Lasso | 0.114 | 0.076 | 0.865 | 0.678 | 0.109 | 0.078 | 0.726 | -0.047 | 0.110 | 0.077 | 0.777 | 0.083 |
| RF | 0.104 | 0.064 | 0.745 | -5.463 | 0.096 | 0.067 | 0.637 | -5.684 | 0.097 | 0.065 | 0.674 | -5.326 |
| GBRT | 0.097 | 0.057 | 0.676 | -4.314 | 0.088 | 0.061 | 0.583 | -3.826 | 0.090 | 0.058 | 0.614 | -3.713 |
Nemenyi test results at 5% significance level on store level one-step ahead forecasts over the whole test periods.
• For all error measures GBRT is significantly better
• Bias?
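For reference, the four error measures used in the accuracy tables can be sketched as follows. The definitions follow common usage and are assumptions about the exact variants used in the study: sMAPE and MdAPE are returned as fractions, MPE as a percentage with the sign convention (actual minus forecast), and AvgRelMAE as the geometric mean across stores of each store's MAE relative to a benchmark (the Last Week rows equal 1.000 in the tables). Zero actuals would need special handling that is omitted here.

```python
import math
import statistics
from typing import List


def smape(actual: List[float], forecast: List[float]) -> float:
    """Symmetric MAPE as a fraction: mean of 2|a - f| / (|a| + |f|)."""
    return statistics.mean([2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)])


def mdape(actual: List[float], forecast: List[float]) -> float:
    """Median absolute percentage error as a fraction."""
    return statistics.median([abs(a - f) / abs(a) for a, f in zip(actual, forecast)])


def mpe(actual: List[float], forecast: List[float]) -> float:
    """Mean percentage error in percent; positive means under-forecasting (assumed convention)."""
    return 100.0 * statistics.mean([(a - f) / a for a, f in zip(actual, forecast)])


def mae(actual: List[float], forecast: List[float]) -> float:
    return statistics.mean([abs(a - f) for a, f in zip(actual, forecast)])


def avg_rel_mae(
    actuals: List[List[float]], forecasts: List[List[float]], benchmarks: List[List[float]]
) -> float:
    """Geometric mean over series of MAE(method) / MAE(benchmark); below 1 beats the benchmark."""
    logs = [
        math.log(mae(a, f) / mae(a, b))
        for a, f, b in zip(actuals, forecasts, benchmarks)
    ]
    return math.exp(statistics.mean(logs))
```

Because AvgRelMAE averages ratios within each store before pooling, it is not dominated by stores with large customer flows, unlike a pooled MAE.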
ONE-DAY AHEAD FORECASTS: INDIVIDUAL VS. POOLING
The store level one-step ahead forecasting accuracy differences between GBRT and five time series models over the whole test periods.
[Figure: y-axis AvgRelMAE; x-axis baseline models: Last week, Naive, Theta, ARIMA, ETS]
1-14 DAYS AHEAD FORECASTS: EVALUATION
T1 = Test set 1, T2 = Test set 2, T1&2 = Test sets 1 and 2 combined
| Method | T1 sMAPE | T1 MdAPE | T1 AvgRelMAE | T1 MPE | T2 sMAPE | T2 MdAPE | T2 AvgRelMAE | T2 MPE | T1&2 sMAPE | T1&2 MdAPE | T1&2 AvgRelMAE | T1&2 MPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Last Week | 0.135 | 0.086 | 1.000 | 1.245 | 0.150 | 0.102 | 1.000 | -0.380 | 0.141 | 0.092 | 1.000 | -2.179 |
| Naive | 0.154 | 0.102 | 1.138 | -2.182 | 0.159 | 0.111 | 1.033 | -5.901 | 0.155 | 0.105 | 1.082 | -6.203 |
| ETS | 0.128 | 0.082 | 0.939 | 0.426 | 0.124 | 0.086 | 0.816 | -2.713 | 0.125 | 0.083 | 0.874 | -3.912 |
| Theta | 0.132 | 0.085 | 0.992 | 4.612 | 0.126 | 0.087 | 0.840 | 0.200 | 0.128 | 0.085 | 0.913 | -0.316 |
| ARIMA | 0.136 | 0.091 | 1.009 | 1.031 | 0.135 | 0.097 | 0.891 | -2.648 | 0.134 | 0.094 | 0.948 | -3.652 |
| Lasso-Recursive | 0.140 | 0.096 | 1.052 | 6.609 | 0.129 | 0.096 | 0.850 | 1.480 | 0.133 | 0.095 | 0.940 | 0.659 |
| RF-Recursive | 0.145 | 0.100 | 1.093 | 8.639 | 0.139 | 0.101 | 0.924 | 3.428 | 0.140 | 0.100 | 0.994 | 1.919 |
| GBRT-Recursive | 0.113 | 0.073 | 0.827 | 4.690 | 0.114 | 0.079 | 0.758 | -1.105 | 0.112 | 0.076 | 0.794 | -2.027 |
| Lasso-Direct | 0.131 | 0.085 | 0.980 | 9.229 | 0.126 | 0.091 | 0.835 | 2.764 | 0.127 | 0.088 | 0.898 | 1.731 |
| RF-Direct | 0.128 | 0.086 | 0.944 | -3.236 | 0.126 | 0.092 | 0.835 | -8.080 | 0.126 | 0.089 | 0.881 | -8.663 |
| GBRT-Direct | 0.113 | 0.072 | 0.814 | -0.273 | 0.105 | 0.073 | 0.699 | -3.659 | 0.107 | 0.072 | 0.753 | -4.425 |
| Lasso-pMIMO | 0.138 | 0.095 | 1.013 | 3.077 | 0.123 | 0.088 | 0.810 | 0.782 | 0.128 | 0.091 | 0.901 | 0.803 |
| RF-pMIMO | 0.124 | 0.078 | 0.882 | -4.343 | 0.113 | 0.082 | 0.749 | -8.174 | 0.116 | 0.081 | 0.809 | -6.539 |
| GBRT-pMIMO | 0.116 | 0.068 | 0.793 | -3.770 | 0.101 | 0.069 | 0.663 | -5.496 | 0.106 | 0.069 | 0.723 | -4.666 |
GBRT – Direct and pMIMO significantly better
• longer term accuracy
• multi-step ahead forecasting strategies
MULTI-DAY AHEAD FORECASTS: INDIVIDUAL VS. POOLING
The store level multi-step ahead forecasting accuracy differences between GBRT with different strategies and five time series models over the whole test periods.
[Figure: three panels (Direct, pMIMO, Recursive); y-axis AvgRelMAE; x-axis baseline models: Last week, Naive, Theta, ARIMA, ETS]
The forecasting accuracy comparisons across horizons 1-14 for the three strategies with GBRT over six rolling test periods.
• pMIMO best for longer horizons
TESTING THE ROBUSTNESS OF MULTIPLE HORIZON FORECASTING STRATEGIES
CONCLUSIONS FROM A FORECASTING COMPETITION - WHAT THIS CASE STUDY TELLS US (FILDES ET AL.,1998)
a. Statistically sophisticated or complex methods do not typically produce more accurate forecasts than simpler ones.
–This case study: complex ML methods work best
b. The rankings of the performance of the various methods vary with the error measures used.
–This case study: no support
c. The relative performance of the various methods depends upon the length of the forecasting horizon.
–This case study: limited support
d. The characteristics of the data series are important factors in determining relative performance.
–Develop methods to ‘fit’ the characteristics of your data series
–This case study: pooling proves effective; data features are important
e. Comparisons based on a single time series and a single forecast origin are unreliable; multiple time series and forecast origins are recommended.
–This case study: many time series and a rolling origin were used
f. Replicability of results
–This case study: availability of code? New applications
CASE STUDY CONCLUSIONS
• Customer flow forecasting based on a large pool of stores that includes a variety of categories can generate more accurate forecasts than methods based on each store individually
• Complex tree methods (both GBRT & RF) perform very well under data pooling in forecasting many time series
• When using the right forecasting strategy, GBRT performs well for both one-step and multi-step ahead forecasting tasks
• Demonstrates the potential of ML methods in a realistic context.
WHY DOES AN ML METHOD WORK HERE?
• Data characteristics: Time series in the pool are closely correlated due to shopping patterns
• Data pooling: Many applications of ML are highly parameterized and have overfitted despite the use of cross-validation; pooling across time series overcomes this problem
• Models: GBRT and RF are powerful ML models that can capture complex cross-sectional patterns
• Multi-horizon forecasting strategy: right strategy is needed to generate multi-step forecasts
QUESTIONS AND COMMENTS
1-14 DAYS AHEAD FORECASTS: NEMENYI TEST
Nemenyi test results at 5% significance level on store level 1-14 days ahead forecasts over the whole test periods. The critical distance for the Nemenyi test is 0.443.
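The critical distance quoted above can be checked with a short sketch. It assumes the usual Nemenyi formula CD = q_alpha * sqrt(k(k+1)/(6N)), with k = 14 methods (the rows of the 1-14 days ahead table), N = 2000 stores, and a tabulated 5% critical value of about 3.354 for k = 14. The critical value and the function names are assumptions, not taken from the slides.

```python
import math
from typing import List


def ranks_with_ties(row: List[float]) -> List[float]:
    """Rank the methods on one series (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda m: row[m])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for p in range(i, j + 1):
            ranks[order[p]] = avg
        i = j + 1
    return ranks


def mean_ranks(errors: List[List[float]]) -> List[float]:
    """errors[s][m] is the error of method m on series s; returns the mean rank per method."""
    n_methods = len(errors[0])
    totals = [0.0] * n_methods
    for row in errors:
        for m, r in enumerate(ranks_with_ties(row)):
            totals[m] += r
    return [t / len(errors) for t in totals]


def nemenyi_cd(k: int, n: int, q_alpha: float) -> float:
    """Critical distance: two methods differ if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


cd = nemenyi_cd(k=14, n=2000, q_alpha=3.354)  # q_alpha is an assumed tabulated value
```

Under these assumptions the sketch gives a critical distance of about 0.44, in line with the 0.443 reported on the slide.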