Mathematical modeling of garden watering demand
Ana Rosa da Cruz Lopes Marques
Thesis to obtain the Master of Science Degree in
Mathematics and Applications
Supervisors: Prof. Maria da Conceição Esperança Amado
Dr. Dália Susana dos Santos da Cruz Loureiro
Examination Committee
Chairperson: Prof. António Manuel Pacheco Pires
Supervisor: Prof. Maria da Conceição Esperança Amado
Members of the Committee: Prof. Isabel Maria Alves Rodrigues
Magister Maria Regina Guerreiro Casimiro
July 2018
Acknowledgments
I would first like to express my sincere gratitude to my advisor, Prof. Conceição Amado, for her continuous
support, motivation, guidance and immense knowledge. Her guidance helped me throughout the
research and writing of this thesis. I could not have imagined having a better advisor for my thesis.
Besides my advisor, I would like to thank my co-advisor, Dr. Dália Loureiro, for welcoming me into NES
(Núcleo de Engenharia Sanitária). I am gratefully indebted to her for her very valuable comments on
this thesis.
I would like to thank Engr. Regina Casimiro and Engr. Pedro Pascoal for their availability.
Finally, I must express my gratitude to my parents and to my brother for providing me with unfailing
support and encouragement throughout my years of study. This accomplishment would not have been
possible without them.
Resumo
O aumento do turismo em regiões costeiras e o problema da intrusão salina nos aquíferos, que leva
ao fecho de furos usados para rega de jardins, causam uma pressão sobre o fornecimento de água na
região em que se situa o caso de estudo. A esta situação somam-se as consequências das mudanças
climáticas, o que torna desafiante prever cenários de consumo a médio e longo prazo. Este estudo
tem como objetivos caracterizar, modelar e prever o consumo de água para rega numa região costeira
turística. Isto é possível devido à situação particular da existência de dois contadores de água nos lotes
em estudo: um que mede o consumo de água no interior e outro que mede o consumo no exterior.
Aplicamos um algoritmo de clustering para agrupar os consumidores segundo o padrão de consumo.
Para cada cluster, propomos um modelo aditivo generalizado. Para além disso, testamos um método
de desagregação do consumo total em uso de água interior e uso de água exterior.
Palavras-chave: Rega, Consumo de água no exterior, Clustering de séries temporais, Modelos
Aditivos Generalizados, Desagregação de consumo de água
Abstract
An increase in tourism in coastal regions and the saltwater intrusion problem in the aquifers, which
will cause the closure of boreholes used to water gardens, put pressure on the water supply of
the region under study. This situation, along with climate change, makes it challenging to envisage
medium- and long-term consumption scenarios. This study aims to characterize, model and forecast
the garden watering demand in a coastal tourist region. This is possible because
the lots under study have two water meters: one to measure indoor water use and another to measure
outdoor water use.
We apply a clustering algorithm to group the customers by similarity of consumption pattern. For each
cluster, we propose a generalized additive model. Furthermore, we test a method to disaggregate the
total water use into indoor and outdoor use.
Keywords: Garden watering, Outdoor water use, Time series clustering, Generalized Additive
Models, Disaggregation of water consumption
Contents
Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Outline
2 State-of-the-art
3 Methodology
3.1 Time series basic concepts
3.1.1 Stationarity
3.1.2 Autocovariance, Autocorrelation and Partial Autocorrelation Functions
3.1.3 White Noise
3.1.4 Differencing
3.1.5 Variance Stabilizing Transformations
3.1.6 Cross-correlation function
3.2 Time series Models
3.2.1 Linear Stationary Models
3.2.2 Non-stationary Linear Models
3.2.3 Model Identification
3.2.4 Diagnostic
3.2.5 Forecast
3.3 Time series clustering
3.3.1 Hierarchical Clustering
3.3.2 Distances
3.3.3 Comparing clustering methods
3.4 Models
3.4.1 Generalized Linear Models
3.4.2 Generalized Additive Models
3.4.3 Mixed Models - GAMMs
3.5 Disaggregation of water consumption
3.5.1 Classification algorithm: K-Nearest Neighbors (KNN)
3.5.2 Method for disaggregation of water consumption
4 Results and Discussion
4.1 Case study description and data processing
4.2 Exploratory Analysis
4.3 Time Series Clustering
4.3.1 Hierarchical clustering
4.3.2 Choosing the best number of clusters
4.3.3 Discussion of the clustering results
4.4 Modeling garden watering demand using GAM
4.4.1 Explanatory variables selection
4.4.2 Modeling
4.4.3 Analysis of the Residuals
4.4.4 Forecast
4.5 Daily disaggregation of water consumption
5 Conclusions
5.1 Achievements
5.2 Future Work
Bibliography
A Results of the Clustering
A.1 Exploratory analysis
B Additional forecast results
C Additional daily disaggregation of consumption results
List of Tables
3.1 Summary of the properties of the stationary models (Source: Bisgaard and Kulahci [26]).
4.1 Information regarding the extreme observations of the 57 outdoor water meters.
4.2 Comparison of the values of the four indexes for the best number of clusters for Ward Method and Complete Linkage with periodogram based distance when using the Standard normalization.
4.3 Best number of clusters according to each index using complete linkage method with periodogram based distance.
4.4 Size of each cluster.
4.5 Summary of the outdoor areas per cluster.
4.6 Size of each cluster.
4.7 Summary of the outdoor areas per cluster (final clusters).
4.8 Summary of the estimated watered areas per cluster (final clusters).
4.9 Summary of the building areas per cluster (final clusters).
4.10 Average ratio between outdoor area and lot area per cluster (final clusters).
4.11 Mean estimated pool volume per cluster.
4.12 Monthly peak factor per cluster for 2015 and 2016.
4.13 Mean monthly ratio between the garden watering and total water consumption per cluster for the months of August, September, October and November and years 2015 and 2016.
4.14 Group size for the test data set (N = 41).
4.15 KNN classification results of the groups' representative series according to the clusters obtained for the water consumption for garden watering data set.
4.16 Size of each group.
List of Figures
4.1 Mean daily water consumption for garden watering of the 57 water meters and mean daily temperature from 01/01/2015 to 30/11/2017.
4.2 Daily accumulated precipitation from January 2015 to November 2017.
4.3 Boxplot of the monthly consumptions of the 57 water meters between January 2015 and November 2017.
4.4 Boxplot of monthly consumptions of the 57 time series and grouped by year (2015, 2016 and 2017).
4.5 Monthly consumption in November for three years (2015, 2016 and 2017).
4.6 Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area for the 57 water meters.
4.7 Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area for the 57 water meters.
4.8 Scatterplots of the mean daily consumption versus a) outdoor area, b) estimated watered area.
4.9 Median monthly indoor consumption and median monthly water consumption for garden watering of the 57 water meters between January 2015 and November 2017.
4.10 Mean daily pattern of indoor and water consumption for garden watering of the 57 water meters between 01/01/2015 and 30/11/2017.
4.11 The number of clusters versus Dunn index.
4.12 The number of clusters versus Entropy.
4.13 The number of clusters versus Gamma index.
4.14 The number of clusters versus Silhouette index.
4.15 Partition of the 57 time series in 5 clusters.
4.16 Representative series of Cluster 1 between 01/01/2015 and 31/07/2017.
4.17 Normalized monthly consumption aggregated by the median for each cluster between January 2015 and July 2017.
4.18 Boxplot of the outdoor area per cluster.
4.19 Boxplot of the mean daily water consumption for garden watering per cluster.
4.20 Boxplot of the normalized monthly consumption of the new Cluster 1 between January 2015 and July 2017.
4.21 Boxplot per month of the year of the normalized monthly consumption of the new Cluster 1.
4.22 Boxplot per day of the week of the new Cluster 1.
4.23 Daily pattern per month of the new Cluster 1.
4.24 Boxplot of the outdoor area per cluster (final clusters).
4.25 Boxplot of the estimated garden area per cluster (final clusters).
4.26 Boxplot of the building area per cluster (final clusters).
4.27 Scatterplot of the mean daily consumption versus outdoor area grouped by cluster (final clusters).
4.28 Representative series Mean of the new Cluster 1 between 01/01/2015 and 31/07/2017.
4.29 Sample ACF of the response variable.
4.30 Sample PACF of the response variable.
4.31 CCF between the differentiated mean temperature and the differentiated representative series.
4.32 CCF between the differentiated maximum temperature and the differentiated representative series.
4.33 CCF between the differentiated minimum temperature and the differentiated representative series.
4.34 CCF between the accumulated precipitation and the differentiated representative series.
4.35 Histogram of the residuals of Model 1.
4.36 QQ-Plot of the residuals of Model 1.
4.37 Residuals versus the linear predictor of Model 1.
4.38 Daily forecast of the model of representative series Mean (Model 1, Equation 4.4) of Cluster 1 and the real aggregated values by the mean, both in the original scale between 10/08/2017 and 30/11/2017.
4.39 Daily forecast of Model 1 (Equation 4.4), Model 2 (Equation 4.5) and Model 3 (Equation 4.6) of Cluster 1 and the real aggregated values by the mean in the original scale between 16/08/2017 and 30/11/2017.
4.40 Daily forecast of the model of representative series Mean (Model 4, Equation 4.8) of Cluster 2 and the real aggregated values by the mean in the original scale between 27/08/2017 and 30/11/2017.
4.41 Daily forecast of Model 4 (Equation 4.8), Model 5 (Equation 4.9) and Model 6 (Equation 4.10) of Cluster 2 and the real aggregated values by the mean in the original scale between 19/08/2017 and 16/11/2017.
4.42 Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 1 between January 2015 and July 2017.
4.43 Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 2 between January 2015 and July 2017.
4.44 Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 3 between January 2015 and July 2017.
4.45 Boxplot of the outdoor area per group for the test data set (N = 41).
4.46 Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
4.47 Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.
4.48 Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
4.49 Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.
A.1 Representative series Q95% of Cluster 1 between 01/01/2015 and 31/07/2017.
A.2 Representative series Q25% of Cluster 1 between 01/01/2015 and 31/07/2017.
A.3 Representative series Mean of Cluster 2 between 01/01/2015 and 31/07/2017.
A.4 Representative series Q95% of Cluster 2 between 01/01/2015 and 31/07/2017.
A.5 Representative series Q25% of Cluster 2 between 01/01/2015 and 31/07/2017.
A.6 Hourly pattern per month of Cluster 2.
A.7 Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 2.
A.8 Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 4.
A.9 Boxplot per day of the week of the normalized consumptions of the members of Cluster 4.
A.10 Representative series Mean of Cluster 3 between 01/01/2015 and 31/07/2017.
A.11 Representative series Q95% of Cluster 3 between 01/01/2015 and 31/07/2017.
A.12 Representative series Q25% of Cluster 3 between 01/01/2015 and 31/07/2017.
A.13 Hourly pattern per month of Cluster 3.
A.14 Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 3.
A.15 Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 3.
A.16 Boxplot per day of the week of the normalized consumptions of the members of Cluster 3.
B.1 Forecast of the model of representative series Q95% for the interval 07/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAPE is equal to 19.712%.
B.2 Forecast of the model of representative series Q25% for the interval 16/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAE is equal to 0.898.
B.3 Forecast of the model of representative series Q95% for the interval 19/08/2017 - 16/11/2017 of Cluster 2 in the original scale. The MAPE is equal to 30.024%.
B.4 Forecast of the model of representative series Q25% for the interval 14/08/2017 - 30/11/2017 of Cluster 2 in the original scale. The MAE is equal to 2.098.
B.5 Forecast of the model of representative series Mean for the interval 22/08/2017 - 30/11/2017 of Cluster 3 in the original scale. The MAPE is equal to 17.444%.
B.6 Forecast of the model of representative series Q95% for the interval 11/08/2017 - 25/11/2017 of Cluster 3 in the original scale. The MAPE is equal to 34.578%.
B.7 Forecast of the model of representative series Q25% for the interval 12/08/2017 - 08/11/2017 of Cluster 3 in the original scale. The MAE is equal to 0.769.
B.8 Forecast and band intervals for Cluster 3 from 22/08/2017 until 08/11/2017 in the original scale.
C.1 Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
C.2 Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
C.3 Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.
C.4 Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.
C.5 Estimates of the total consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale. The MAPE was equal to 66.41%.
C.6 Estimates of the garden watering and domestic consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale.
C.7 Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale. The MAPE was equal to 44.05%.
C.8 Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale.
Chapter 1
Introduction
In this Chapter, the motivation of this dissertation is presented in Section 1.1. In Section 1.2, the main
goals set for this study, as well as the adopted approach are discussed and a summarised description
of the case study is presented. Lastly, the structure of the dissertation is described in Section 1.3.
1.1 Motivation
With rapid population growth worldwide, urban water systems must keep up with increasing water
demand. In addition, the increase in tourism in certain regions puts pressure on their water
supplies, creating a need for studies that help prepare a sustainable future. Without enough water,
tourism in these regions is compromised. The importance of implementing water-saving measures
rises as water sources are affected by the growing extraction rate (Danilenko et al. [1]). With
climate change, the average global temperature is rising, precipitation is decreasing in
many regions, which are becoming drier, and periods of drought follow. At the same time, high-precipitation
events and flooding are becoming more frequent in other regions. Thus, effective water demand management
becomes increasingly important.
Indoor water use in residential areas, that is, water use inside the houses, generally remains the
same throughout the year (Makwiza [2]). However, outdoor water use, the total amount of water
people use outside their houses, can change significantly with the weather,
including an increase in garden irrigation during drier seasons. An important part of water conservation
strategies is a better understanding of outdoor water use: how much water is used
outdoors and for what purpose. There is a high potential to improve water savings in outdoor
water use in residential areas.
During periods of drought, restrictions on water usage can be implemented, which typically target
outdoor water uses, such as watering gardens, washing vehicles and refilling pools (Syme et al. [3], Root
and Survis [4]). These restrictions can also be complemented with a rise in water prices (Randolph and
Troy [5]). Situations where restrictions are necessary are becoming more frequent. In the summer of
2017, severe restrictions on water use were necessary in certain cities of Portugal, since the
basic water needs of the population were at risk (www.publico.pt [6]). The dams, which are the main
water sources of these cities, had extremely low water levels. Thus, there is a pressing need
to study outdoor water use in residential areas to better understand how the water is being used
and how much of it can be saved in the future. Moreover, outdoor water use may
represent a significant part of the total water use in a water supply system, influencing its operational
capacity. In addition, predicting how outdoor water consumption will evolve is crucial to plan a
sustainable rehabilitation of the water supply systems. Based on the research carried out for this study,
this topic is not yet sufficiently explored. Finally, such studies are crucial to educate the general
population and encourage people to adopt conservation measures.
1.2 Objectives
The main goal of this dissertation is to study and characterize the daily outdoor water demand in a
coastal tourist region, including its seasonality, its relation to the size of the outdoor area of the lot
and which other variables influence outdoor water use the most. In this region, gardens occupy
the majority of the outdoor areas; hence, garden watering is the most important
component of outdoor water use and other possible outdoor water uses have little significance. Thus,
the terms outdoor water demand and garden watering demand are used interchangeably throughout
this dissertation. Another main objective is to forecast future daily values of the outdoor water
demand, for which we will build a predictive model. Since this study includes data from several clients,
it is not practical to build a model for each one. Thus, we want to find groups of clients with
similar behaviours using clustering, allowing us to build one model for each group. For this, we need to
investigate which clustering method and which similarity measure are best suited.
With the right grouping of the clients, we are able to proceed to the modeling. The models we
study can include external variables. This is an important feature, since
weather variables, such as daily average temperature and daily accumulated
precipitation, can have an important influence on the consumption. In order to include the weather
variables in the model, we will study the relation between them and the garden watering consumption.
Then, with models that explain the consumption of each cluster, we are able to predict future
values. With each group characterized, it is possible to place a new client in the group that is most
likely to have the same consumption pattern and use the respective model to predict future values.
Additionally, a secondary goal is to disaggregate indoor and outdoor water use for the cases where
there is a single water meter for the lot, using the garden watering demand models obtained. By doing
this, the water utility company can estimate the share of outdoor water use in
the total water use. A better understanding of indoor and outdoor water use is also important for the
management of wastewater drainage networks. Outdoor water use consists mainly of garden watering,
and this water is not returned to the wastewater network, contrary to indoor water uses
such as showers/baths, washing clothes and dishes, where a significant part
of the water is collected through the wastewater drainage system.
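As a rough illustration of the disaggregation idea, the sketch below splits a metered daily total into outdoor and indoor estimates. It is not the method developed in this dissertation: it assumes, hypothetically, that a garden watering model has already produced a daily outdoor prediction, caps that prediction between zero and the metered total, and treats the remainder as indoor use. The function name and the example values are illustrative only.

```python
def disaggregate_daily_use(total_use, predicted_outdoor):
    """Split daily total use into (outdoor, indoor) estimates.

    Assumption for illustration: the outdoor estimate is the model prediction
    capped to the interval [0, total], and the indoor estimate is the remainder.
    """
    outdoor, indoor = [], []
    for total, predicted in zip(total_use, predicted_outdoor):
        # A prediction cannot be negative or exceed what the single meter recorded.
        outdoor_estimate = min(max(predicted, 0.0), total)
        outdoor.append(outdoor_estimate)
        indoor.append(total - outdoor_estimate)
    return outdoor, indoor


# Hypothetical daily totals and model predictions for one lot.
total_use = [2.0, 1.5, 0.5]
predicted_outdoor = [1.25, 2.0, -0.25]
outdoor, indoor = disaggregate_daily_use(total_use, predicted_outdoor)
```

Capping the prediction keeps both components non-negative and guarantees that they add up to the metered total, which is the only quantity observed when a lot has a single meter.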
For this study, hourly water consumption data of several clients for a period of almost 3 years are available.
These data will be aggregated to daily consumption, since we wish to work with daily values. For each client,
we have at our disposal the lot area, building area (floor area of the house) and outdoor area. We also
have available the mean, maximum and minimum daily temperature and daily accumulated precipitation
for the period under study.
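The aggregation from hourly to daily values can be sketched as follows. This is a minimal example under stated assumptions: the readings are taken to be (timestamp, volume) pairs, the function name is hypothetical, and the sample values are invented for illustration rather than taken from the case study.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly_to_daily(readings):
    """Sum hourly consumption readings into daily totals.

    `readings` is an iterable of (timestamp, volume) pairs, where the timestamp
    is a datetime and the volume is the consumption recorded in that hour.
    Returns a dict mapping each calendar date to the total volume of that day.
    """
    daily = defaultdict(float)
    for timestamp, volume in readings:
        daily[timestamp.date()] += volume
    # Sort by date so the result can be read as a daily time series.
    return dict(sorted(daily.items()))


# Hypothetical hourly readings for one outdoor meter.
hourly = [
    (datetime(2015, 1, 1, 6), 0.25),
    (datetime(2015, 1, 1, 7), 0.5),
    (datetime(2015, 1, 2, 6), 0.75),
]
daily_totals = aggregate_hourly_to_daily(hourly)
```

Days with no readings do not appear in the result, so any gap handling (for example, marking missing days) would have to be added separately.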
This study focuses on data collected from residential lots in a coastal region with strong tourism
in the south of Portugal. A strong seasonal variability is present, driven by the tourist influx and
the garden watering needs caused by dry weather and high temperatures in the summer. In
this region, we encounter a particular situation: a group of lots with two high-resolution water
meters, one that measures exclusively indoor water uses and another that measures exclusively outdoor
water uses, thus creating an exceptional opportunity to study outdoor water demand. The use of two
separate water meters is
not a widespread practice in Portugal. However, it is recommended where outdoor
water use is very significant and better management of this component is necessary (for example, through a
differentiated tariff).
A current problem in the region under study is saltwater intrusion in the aquifers, which contaminates
the borehole water. This leads to the closure of boreholes that are used as a source to water
gardens. Saltwater intrusion is expected to increase in the region; therefore,
more gardens will be watered from the mains supply. It is especially important to determine whether the
current water supply system can provide enough water and an adequate service level if all the boreholes
are closed. This is all the more pertinent since it is expected that eventually all the boreholes will be
closed. The situation described makes it challenging to envisage medium- and long-term consumption
scenarios. It is therefore necessary to study the consumption habits in order to know whether the water
supply system can meet the water demand, for future planning purposes.
The results obtained from the study will be important to improve the management of the water supply
network. This study is relevant not only for residential consumers and water utility companies, but also for large
consumers and municipalities. Large consumers, such as hotels and airports, may have a significant
outdoor water use due to garden watering, pools and street cleaning. In municipalities, a significant
part of non-revenue water corresponds to garden watering and improving efficiency is crucial to their
economic and environmental sustainability.
1.3 Thesis Outline
This dissertation is divided into five chapters. In Chapter 2, the variables that may
influence outdoor consumption are assessed, as well as the methods of analysis adopted in the literature.
The methods used to cluster time series and to model the data are described in Chapter 3. In Chapter
4, we focus on the exploratory analysis, the work developed towards modeling the data, the forecast
results obtained and their analysis. Also in Chapter 4, we discuss the method used to disaggregate
the total water use into indoor and outdoor water use. Chapter 5 is dedicated to the conclusions of the
dissertation, what was achieved and suggestions for future work.
Chapter 2
State-of-the-art
The research for this Chapter covers the assessment of the variables that influence the indoor and
outdoor water use, as well as the methods of analysis adopted. Some of the models that have been used
to model the total water use are mentioned. Two studies that analyse the outdoor water use are examined.
The approaches used to characterize the end-use of water within a household are also assessed.
Lastly, the approaches used in these studies that can be adapted to this thesis are presented.
Different kinds of studies can be made regarding residential water use,
such as a study of consumer habits, an assessment of the variables that influence water consumption
the most, the modeling of water demand and the prediction of future values, among others. Water
consumption within a household can occur inside the house, for example due to washing machines,
showers, toilets, taps, dishwashers and evaporative air conditioning systems, or in the outdoor space,
including garden watering and water consumption related to swimming pools (Loh et al. [7]). There are
many possible variables that can influence the total water demand in residential areas, such as water
price, income, education, sustainability concern, temperature, precipitation, house size, housing typology,
outdoor space size, garden typology and presence of a pool, among others (House-Peters and Chang
[8]).
The importance of garden watering varies from country to country according to the meteorological
conditions. Also, the weight of the outdoor water use in the total water consumption will be different
in different climates and according to different consumption habits. Thus, the importance given to
water management and to studying and understanding water use is also expected to vary. For example, in
Australia, the water scarcity problem is extremely important in certain regions. Therefore, studies related
to consumption patterns and forecasts are quite relevant, and it is important to understand and monitor the
outdoor water use in residential areas. In a study conducted with data collected in a residential
area in Perth, Australia, between 1998 and 2001 (Loh et al. [7]), it was estimated that the outdoor
water use accounted for 56% of the total water use of a single detached residential household and almost
all of this water was used to water the garden. Moreover, the authors did not find a relationship between
the watered area of the outdoor space and the outdoor water consumption. This study also
verified that houses with a borehole use less water from the public supply system in the outdoor space
than houses without a borehole.
Though studies have been made modeling and forecasting water demand (Ghiassi et al. [9], Caiado
[10], Gato et al. [11] ), commonly using time series models or Artificial Neural Networks, not many have
focused solely on modeling the outdoor water use. There are references to the water use in private
residential gardens in studies from a point of view of individual habits and environmental awareness
(Randolph and Troy [5]). In the water use literature, regression models are most commonly used,
as well as time series analysis (Makwiza [2], House-Peters and Chang [8]). Meteorological variables
such as temperature and precipitation are included in regression models (Chang et al. [12]). It appears
that no mathematical study has yet been conducted that focuses solely on the garden watering demand
in residential homes. In particular, there is little research specifically on the garden
watering demand in residential homes in Portugal.
Syme et al. [3] performed a study to better understand and predict the monthly water consumption
in outdoor areas of residential homes. This study was conducted using estimates of external water
use, such as on gardens or swimming pools, for 397 houses in Perth, Australia. Monthly
consumption data covering 1 year and 5 months were used. To estimate the outdoor water use throughout the year,
the authors assumed that, during the winter months, it is not necessary to water the gardens due to
precipitation. This implies that only the indoor water use is registered during the winter months. The
outdoor water use in the summer was then estimated by the difference between the total water use
in the summer and the total water use in the winter. In this study, socio-demographic variables were
considered, including income, lot size, presence of swimming pool, interest in gardening, importance of
garden and green spaces in their personal life, attitudes towards water conservation, type of equipment
used to water the lawn, among others. A questionnaire covering the variables mentioned was
given to each of the clients. Syme et al. [3] applied a Structural Equation Model with latent variables,
which is commonly used in social sciences. The authors found that lots with larger sizes used
more water, lots with a swimming pool tended to use more water as well and the presence of more
sophisticated watering systems usually implied the use of more water. Also, it was concluded that the
lifestyle preferences, garden interest and garden use for leisure had an impact on the outdoor water
use. Moreover, it was also concluded that, when determining outdoor water use, the socio-demographic
variables were just as important as the consumer’s attitudes towards garden and gardening.
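The winter-baseline idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the choice of winter months and the monthly volumes are hypothetical, and the baseline is taken here as the mean of the winter months, which is one possible reading of the difference between summer and winter use.

```python
from statistics import mean

def estimate_outdoor_use(monthly_total, winter_months):
    """Estimate outdoor use per month as total use minus a winter (indoor) baseline.

    monthly_total: dict mapping month number to total consumption.
    winter_months: months assumed to have no garden watering (an assumption).
    """
    indoor_baseline = mean(monthly_total[m] for m in winter_months)
    # Negative differences are clipped to zero, since a volume cannot be negative.
    return {m: max(v - indoor_baseline, 0.0) for m, v in monthly_total.items()}

# Hypothetical monthly totals in cubic metres (illustrative values, not study data).
totals = {1: 10.0, 2: 10.0, 3: 12.0, 7: 30.0, 8: 34.0, 12: 10.0}
outdoor = estimate_outdoor_use(totals, winter_months=[12, 1, 2])
print(outdoor[7], outdoor[8])  # 20.0 24.0
```

The estimate is only as good as the assumption that no outdoor use occurs in the baseline months.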
Jain et al. [13] proposed Artificial Neural Networks to model the water demand at the Indian Institute
of Technology. The authors assumed that the majority of the water consumption at the Indian Institute
of Technology was to water the lawns and gardens. For this study, the weekly water demand at the
Institute and campus was used, as well as the weekly accumulated rainfall and weekly average of the
daily maximum temperature. Furthermore, the authors verified that the occurrence of rainfall was a
more significant variable than the amount of rainfall, since "people may not want to water their
lawns/gardens on a rainy day regardless of the amount of rainfall". The authors found a correlation
between the weekly water consumption and the weekly average of the daily maximum temperature, as
well as a correlation between the weekly water consumption at two consecutive weeks. However, they
found that there was no correlation between the weekly water consumption and the weekly total rainfall.
It was concluded that the water demand at the Institute of Technology in Kanpur and its campus is a
"dynamic process driven by the temperature and interrupted by the occurrence of rainfall".
In order to characterize the end-use of water within a household, that is, when it was used, for
example, by the washing machine or in the shower, smart metering is usually used (Fontdecaba et al.
[14], Gurung et al. [15]). Smart meters, which are considerably expensive, collect data automatically
and communicate readings in real time, or nearly real time. There are references that explore different
disaggregation methods. Makwiza and Jacobs [16] conducted a study in which microphones were used
to record sound when an outdoor tap was being used, thus capturing outdoor water use events. The data
was collected in homes located in the City of Lilongwe, Malawi, between December 2014 and January
2015 and later between May 2015 and July 2015. This technique had already been used to capture
water use within the homes (Chen et al. [17], Fogarty et al. [18]), so the goal of the authors was to verify
the validity of this low-cost method to capture the outdoor water use in residential homes. This method
made it possible to identify the start and end of outdoor water use events; however, it could not
accurately report the volume of water used.
Generalized Additive Models have been successfully used to model and forecast short-term electricity
load. Pierrot and Goude [19] applied Generalized Additive Models to hourly electricity load data,
including meteorological data as explanatory variables (temperature, cloud cover and wind speed). The
models exhibited a good performance in terms of prediction accuracy. Ba et al. [20] also used Generalized
Additive Models to model and forecast half-hourly load data.
In this project, we intend to understand the relation between the garden watering demand and
the size of the outdoor space, as in Loh et al. [7]. Also, we will verify the correlation between temperature
and outdoor water use, as well as between accumulated precipitation and outdoor water use. Based
on Pierrot and Goude [19] and Ba et al. [20], we apply Generalized Additive Models to garden watering
demand, which, to our knowledge, has not been done yet. As in Jain et al. [13], we will verify whether the
occurrence of precipitation is a more significant variable in the models than the accumulated precipitation. We will also
discuss the method used in Syme et al. [3] to disaggregate the total water use into indoor and outdoor
use for the case of our study.
Chapter 3
Methodology
In Section 3.1, some basic concepts of time series are presented, which will be needed throughout
the entire thesis. Some of the time series analysis classical models are presented in Section 3.2. In
Section 3.3, the clustering methods used in this project to group time series are described. These
make it possible to partition the clients according to their similarity and to build one model for each group,
instead of building one model for each client, which is impractical. In Section 3.4, Generalized Additive
Models, which were the models used in this project to fit the garden watering demand, are studied. Also,
Generalized Additive Mixed Models are described. Lastly, in Section 3.5, a method of water consumption
disaggregation into indoor and outdoor use is discussed, which will be applied to a set of clients with a
single water meter that measures both the indoor and outdoor water use.
3.1 Time series basic concepts
A time series is a collection of observations obtained through repeated measurements over time. The
objectives of studying time series include understanding the physical characteristics that generate them
and predicting future values (Wei [21]).
To give a formal definition of time series, a stochastic process must be defined first.
Definition 3.1.1. A stochastic process Z = {Z(t), t ∈ T} is a collection of random variables, that is, for
each t in the index set T , Z(t) is a random variable. Usually, t is interpreted as time, therefore, Z(t)
is the state of the process at time t. If the index set T is a countable set, Z is called a discrete time
stochastic process and if T is continuous, Z is a continuous time stochastic process.
Definition 3.1.2. A stochastic process Z = {Z(t), t ∈ T} with values in R is a time series if T ⊆ R is
discrete.
A time series can be decomposed into trend (Tt), seasonal (St) and irregular or noise (εt)
components (Pires [22]). This decomposition can be additive,
Zt = Tt + St + εt (3.1)
or multiplicative,
Zt = Tt × St × εt (3.2)
The additive decomposition is usually chosen. The decomposition can also include a cyclic component.
3.1.1 Stationarity
For the following definitions, a finite set of random variables $\{Z_{t_1}, Z_{t_2}, \dots, Z_{t_n}\}$ from a stochastic process
$\{Z(t) : t \in \mathbb{Z}\}$ is considered. Its n-dimensional joint distribution function is denoted by $F(Z_{t_1}, \dots, Z_{t_n})$.
Definition 3.1.3. A process is strongly stationary (or strictly stationary) if $F(Z_{t_1}, \dots, Z_{t_n}) = F(Z_{t_1+k}, \dots, Z_{t_n+k})$
for any finite set of indices $\{t_1, t_2, \dots, t_n\} \subset \mathbb{Z}$ with $n \in \mathbb{Z}^+$ and any $k \in \mathbb{Z}$.
Definition 3.1.4. A process is first order stationary (or stationary on average) if $F(Z_{t_1}) = F(Z_{t_1+k})$ for
any $t_1, k, t_1 + k \in \mathbb{Z}$, that is, if the distribution function of dimension 1 is time invariant.
Definition 3.1.5. A process is second order stationary (or weakly stationary) if $F(Z_{t_1}, Z_{t_2}) = F(Z_{t_1+k}, Z_{t_2+k})$
for any $t_1, t_2, k, t_1 + k, t_2 + k \in \mathbb{Z}$.
Definition 3.1.6. A process is rth order stationary if $F(Z_{t_1}, \dots, Z_{t_n}) = F(Z_{t_1+k}, \dots, Z_{t_n+k})$ for any $n \leq r$
and $k, t_1, t_2, \dots, t_n \in \mathbb{Z}$.
Some relations regarding stationarity are worth mentioning, such as:
- A higher order of stationarity implies a lower order of stationarity.
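- Second order stationarity does not imply strong stationarity.
- Strong stationarity does not imply second order stationarity in the moment sense (constant mean and
variance), since the first two moments may not exist; an example is a sequence of independent and
identically distributed Cauchy random variables.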
3.1.2 Autocovariance, Autocorrelation and Partial Autocorrelation Functions
Consider a weakly stationary process Zt, whose mean, E[Zt] = µ, and variance, Var(Zt) = σ², are
constant.
The covariance, Cov(Zt, Zs), is a function that depends only on the difference |t − s|, ∀s, t ∈ Z. The
covariance between Zt and Zt+k can be written as (Wei [21]):
γk = Cov(Zt, Zt+k) = E [(Zt − µ)(Zt+k − µ)] (3.3)
γk is called the autocovariance function and it measures the linear dependence between two random
variables. This function has the following properties for a stationary process:
1. γ0 = Var(Zt);
2. |γk| ≤ γ0;
3. γk = γ−k for all k, i.e., γk is an even function.
The correlation between Zt and Zt+k is given by:
\[ \rho_k = \frac{Cov(Z_t, Z_{t+k})}{\sqrt{Var(Z_t)}\sqrt{Var(Z_{t+k})}} = \frac{\gamma_k}{\gamma_0} \tag{3.4} \]
This function, ρk, is called the autocorrelation function (ACF) of Zt and it satisfies the following
properties:
1. ρ0 = 1;
2. |ρk| ≤ 1; a value close to 1 indicates a very strong positive correlation between Zt and
Zt+k, and a value close to −1 indicates a very strong negative correlation;
3. ρk = ρ−k for all k, i.e., it is an even function.
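In practice, the autocorrelation function is estimated from data. The following is a minimal sketch of a sample ACF, assuming the common estimator that divides the lagged cross-products by n at every lag; the function name and the example series are hypothetical.

```python
def sample_acf(z, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of the sequence z."""
    n = len(z)
    z_bar = sum(z) / n
    dev = [v - z_bar for v in z]
    # Sample autocovariance at lag k, using the divisor n at every lag
    gamma = [sum(dev[t] * dev[t + k] for t in range(n - k)) / n
             for k in range(max_lag + 1)]
    return [g / gamma[0] for g in gamma]

# Hypothetical series, for illustration only
series = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
r = sample_acf(series, max_lag=3)
print(r[0])  # 1.0, in agreement with property 1
```

With this estimator the properties listed above (r_0 = 1 and |r_k| ≤ 1) also hold for the sample values.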
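Both the autocovariance function and the autocorrelation function are positive semidefinite, that is, for
any set of time points t1, ..., tn:
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \gamma_{|t_i - t_j|} \geq 0, \quad \forall \alpha_1, \alpha_2, \dots, \alpha_n \in \mathbb{R} \tag{3.5} \]
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \rho_{|t_i - t_j|} \geq 0, \quad \forall \alpha_1, \alpha_2, \dots, \alpha_n \in \mathbb{R} \tag{3.6} \]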
This is a necessary condition for a function to be an autocovariance function or an autocorrelation
function of a process.
Consider a weakly stationary process Zt with null mean. Its partial autocorrelation function (PACF),
$\phi_{kk}$, represents the coefficient of partial correlation between $Z_t$ and $Z_{t+k}$ after removing the linear
dependence on the variables $Z_{t+1}, Z_{t+2}, \dots, Z_{t+k-1}$. The partial autocorrelation function is calculated in
the following way (Wei [21]):
\[ \phi_{11} = \rho_1 \tag{3.7} \]
\[ \phi_{22} = \frac{\begin{vmatrix} 1 & \rho_1 \\ \rho_1 & \rho_2 \end{vmatrix}}{\begin{vmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{vmatrix}} = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2} \tag{3.8} \]
And, more generally:
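\[ \phi_{kk} = \frac{\begin{vmatrix}
1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_1 \\
\rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & \rho_k
\end{vmatrix}}{\begin{vmatrix}
1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_{k-1} \\
\rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_{k-2} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
\rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & 1
\end{vmatrix}} \tag{3.9} \]
The denominator is the determinant of the $k \times k$ autocorrelation matrix, and the numerator is the same
determinant with the last column replaced by $(\rho_1, \rho_2, \dots, \rho_k)^T$.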
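Evaluating the determinants in Equation 3.9 directly becomes costly as k grows. A commonly used alternative is the Durbin-Levinson recursion, which gives the same values of φkk from the autocorrelations. The sketch below is an illustration under that assumption (the recursion is not taken from the text above), and the function name is hypothetical.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations phi_11, ..., phi_KK from rho = [rho_1, ..., rho_K]."""
    r = [1.0] + list(rho)            # r[k] = rho_k, with rho_0 = 1
    pacf = []
    prev = []                        # prev[j-1] holds phi_{k-1, j}
    for k in range(1, len(rho) + 1):
        num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
        phi_kk = num / den
        # phi_{k, j} = phi_{k-1, j} - phi_kk * phi_{k-1, k-j}
        cur = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# For rho_1 = 0.6 and rho_2 = 0.2, Equation 3.8 gives (0.2 - 0.36) / (1 - 0.36) = -0.25
p = pacf_from_acf([0.6, 0.2])
```

For k = 2 the recursion reduces exactly to Equation 3.8, which gives a simple check of the implementation.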
3.1.3 White Noise
A time series {Zt, t ∈ Z} is said to be a white noise series if it is a sequence of uncorrelated random
variables from a fixed distribution with constant mean, E(Zt) = µ (usually assumed to be zero),
constant variance, Var(Zt) = σ², and γk = Cov(Zt, Zt+k) = 0 for any k ≠ 0. This is denoted by
{Zt, t ∈ Z} ∼ WN(µ, σ²). A white noise process is weakly stationary and its autocovariance function is given by:
\[ \gamma_k = \begin{cases} \sigma^2, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.10} \]
Its autocorrelation function is given by:
\[ \rho_k = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.11} \]
And its partial autocorrelation function:
\[ \phi_{kk} = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.12} \]
A white noise series is said to be Gaussian if $(Z_{t_1}, Z_{t_2}, \dots, Z_{t_n})$ has a multivariate normal distribution,
$\forall n \geq 1$, $t_1, t_2, \dots, t_n \in \mathbb{Z}$. In this case, weak stationarity implies strong stationarity.
3.1.4 Differencing
Data collected in a real life situation will usually not be stationary in the mean, that is, the mean will not
be constant over time. It is possible to make a series stationary in the mean by applying an operator,
which will be defined next.
The backward shift operator, B, is defined by
\[ B Z_t = Z_{t-1} \tag{3.13} \]
Hence, $B^m Z_t = Z_{t-m}$. The backward difference operator, ∇, is defined as follows
\[ \nabla Z_t = Z_t - Z_{t-1} = (1 - B) Z_t \tag{3.14} \]
For a higher order of the backward difference operator, $\nabla^k Z_t = \nabla(\nabla^{k-1} Z_t)$, for k ≥ 2. For example,
\[ \nabla^2 Z_t = \nabla(\nabla Z_t) = \nabla Z_t - \nabla Z_{t-1} = Z_t - 2Z_{t-1} + Z_{t-2} \tag{3.15} \]
By applying the difference operator, the trend of a series is removed. If a time series Zt has a linear
trend, then ∇Zt has no trend. In the case of a series with a non-linear trend, in order to remove it, the
differences should be built successively, i.e., first differences, second differences, until the time series
no longer possesses a trend (Pires [22]).
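The effect of successive differencing can be checked on small deterministic examples. The sketch below is illustrative only; the helper name and the two toy series are assumptions.

```python
def difference(z, order=1):
    """Apply the backward difference operator `order` times."""
    for _ in range(order):
        z = [z[t] - z[t - 1] for t in range(1, len(z))]
    return z

linear = [2 + 3 * t for t in range(6)]       # series with a linear trend
quadratic = [t ** 2 for t in range(6)]       # series with a quadratic trend
print(difference(linear))                    # [3, 3, 3, 3, 3]
print(difference(quadratic, order=2))        # [2, 2, 2, 2]
```

Each application of the operator shortens the series by one observation, which matters for short series.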
To verify whether a series is stationary in the mean, unit root tests can be performed, such as the
Augmented Dickey-Fuller (ADF) Test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test (Hyndman
and Athanasopoulos [23]). The first is one of the most popular unit root tests and its null hypothesis
states that the data are not stationary. A small p-value suggests the data are stationary. This test
estimates the parameters of the following regression model
\[ Z'_t = \phi Z_{t-1} + \beta_1 Z'_{t-1} + \dots + \beta_k Z'_{t-k} \tag{3.16} \]
where $Z'_t = Z_t - Z_{t-1}$ is the first-differenced series and k is the number of lags included in the regression. The estimated
coefficient $\hat{\phi}$ should be approximately zero if the series requires differencing; if it does not, the
coefficient is smaller than zero.
The null hypothesis of the KPSS test is that the data are stationary. A small p-value indicates that
the series is not stationary and differencing is required. This test starts with the model
\[ Z_t = \delta t + \mu_t + u_t \tag{3.17} \]
\[ \mu_t = \mu_{t-1} + \varepsilon_t \tag{3.18} \]
where $\delta t$ is the deterministic trend component, $u_t$ is a stationary process, $\mu_t$ is the random walk term and $\varepsilon_t$ is an
independent and identically distributed process with mean equal to zero and variance $\sigma^2$. If the variance
$\sigma^2$ is equal to zero, then the random walk term is constant. So, the null hypothesis is that $\sigma^2$ is equal to
zero. The test statistic is
\[ KPSS = \frac{\sum_{t=1}^{T} S_t^2}{s^2 T^2} \tag{3.19} \]
where T is the sample size, $S_t = \sum_{i=1}^{t} e_i$, $e_i$ are the residuals of a regression model on $Z_t$ and $s^2$ is the
Newey-West estimate of the long-run variance (Zivot [24]).
3.1.5 Variance Stabilizing Transformations
In practice, many time series are not stationary in the variance, and these can be transformed into
stationary time series using the proper techniques.
Considering a non-stationary time series {Zt, t ∈ Z} with finite mean and variance, the following
transformation was introduced by Box and Cox ([25]):
\[ T_\lambda(Z_t) = \begin{cases} \dfrac{Z_t^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \ln(Z_t), & \lambda = 0 \end{cases} \tag{3.20} \]
where Zt is a positive time series and Tλ(Zt) is called the transformed series. In the case of a
non-positive time series, a positive constant can be added to the series. Usually, it is considered that
λ takes values in the interval [−1, 1]. To find its optimal value, one can evaluate the residual mean
square error on a series of λ values.
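A direct implementation of Equation 3.20 is sketched below for illustration. The function name is hypothetical and the selection of λ is left out.

```python
import math

def box_cox(z, lam):
    """Box-Cox transformation of a strictly positive series (Equation 3.20)."""
    if any(v <= 0 for v in z):
        raise ValueError("the series must be strictly positive; add a constant first")
    if lam == 0:
        return [math.log(v) for v in z]
    return [(v ** lam - 1.0) / lam for v in z]

print(box_cox([1.0, 4.0], 0.5))   # [0.0, 2.0]
print(box_cox([1.0], 0))          # [0.0]
```

For λ = 1 the transformation only shifts the series by one unit, so the variance structure is unchanged.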
3.1.6 Cross-correlation function
The cross-correlation function measures the strength and direction of the correlation between
two random variables (Wei [21]). Consider two stochastic processes Xt and Yt with means µx and µy,
respectively, and standard deviations σx and σy, respectively. The cross-covariance function between Xt
and Yt is given by
\[ \gamma_{xy}(k) = E[(X_{t-k} - \mu_x)(Y_t - \mu_y)] \tag{3.21} \]
for k = 0, ±1, ±2, .... The cross-correlation function (CCF) is calculated by the following formula
\[ \rho_{xy}(k) = \frac{\gamma_{xy}(k)}{\sigma_x \sigma_y} \tag{3.22} \]
for k = 0, ±1, ±2, .... The cross-correlation function is a dimensionless quantity and, in general, it is not
symmetric around 0, that is, $\rho_{xy}(k) \neq \rho_{xy}(-k)$. Since $\gamma_{xy}(k) = E[(X_{t-k} - \mu_x)(Y_t - \mu_y)] = E[(Y_t - \mu_y)(X_{t-k} - \mu_x)] = \gamma_{yx}(-k)$,
it is verified that $\rho_{xy}(k) = \rho_{yx}(-k)$. It is therefore important to examine both negative and positive
lags of the CCF.
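A sample version of Equations 3.21 and 3.22 can be sketched as follows. The divisor n and the function name are assumptions made for illustration, and the two series are hypothetical, with y built as x delayed by one time step.

```python
import math

def sample_ccf(x, y, k):
    """Sample cross-correlation between x_{t-k} and y_t (k may be negative)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((v - x_bar) ** 2 for v in x) / n)
    s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / n)
    # Keep only the time points t for which both t and t - k are valid indices
    ts = range(max(k, 0), min(n, n + k))
    c = sum((x[t - k] - x_bar) * (y[t] - y_bar) for t in ts) / n
    return c / (s_x * s_y)

x = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0]
y = [0.0, 1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0]   # x delayed by one step
left = sample_ccf(x, y, 1)
right = sample_ccf(y, x, -1)
print(abs(left - right) < 1e-12)   # True: rho_xy(k) = rho_yx(-k)
```

Comparing `sample_ccf(x, y, k)` over several positive and negative k shows at which lag one series leads the other.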
3.2 Time series Models
In this Section, some of the classical time series analysis models are discussed. These can be stationary
or non-stationary models. Furthermore, the steps that can be taken to identify a good model that fits the
data in study are also discussed, as well as points to consider in the model diagnostic (Subsections 3.2.3
and 3.2.4). In Subsection 3.2.5, the minimum mean square error forecasts for stationary and
non-stationary time series models are presented.
3.2.1 Linear Stationary Models
Consider a time series {Zt, t ∈ Z}.
Autoregressive Processes (AR)
In situations where the values of a time series depend on the previous values plus a random shock,
autoregressive processes are useful to describe them (Wei [21]). A model is fit to the variable using
a linear combination of past values of the same variable. Zt is an autoregressive process of order p,
denoted by AR(p), if
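\[ Z_t = \phi_1 Z_{t-1} + \dots + \phi_p Z_{t-p} + \varepsilon_t = \sum_{i=1}^{p} \phi_i Z_{t-i} + \varepsilon_t \tag{3.23} \]
where $\{\varepsilon_t, t \in \mathbb{Z}\} \sim WN(0, \sigma_\varepsilon^2)$. This can also be written using the backward shift operator B:
\[ Z_t = \phi_1 B Z_t + \phi_2 B^2 Z_t + \dots + \phi_p B^p Z_t + \varepsilon_t \tag{3.24} \]
\[ \phi_p(B) Z_t = (1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p) Z_t = \varepsilon_t \tag{3.25} \]
where $\phi_p(B)$ is called the characteristic polynomial.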
A process is invertible if it possesses an autoregressive representation. An AR(p) process is always
invertible, since $\sum_{j=1}^{p} |\phi_j| < \infty$. However, it is not necessarily stationary: for stationarity, the roots
of the characteristic polynomial must lie outside of the unit circle.
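For low orders, the root condition can be checked by hand or with a few lines of code. The sketch below covers the AR(1) and AR(2) cases only; the function name is hypothetical and the coefficient values are illustrative.

```python
import cmath

def ar2_is_stationary(phi1, phi2):
    """Check whether the roots of 1 - phi1*z - phi2*z^2 = 0 lie outside the unit circle."""
    if phi2 == 0:
        # AR(1): the single root is z = 1/phi1 (there is no root when phi1 == 0)
        return abs(phi1) < 1
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    # Quadratic formula applied to -phi2*z^2 - phi1*z + 1 = 0
    roots = [(phi1 + disc) / (-2 * phi2), (phi1 - disc) / (-2 * phi2)]
    return all(abs(z) > 1 for z in roots)

print(ar2_is_stationary(0.5, 0.3))   # True
print(ar2_is_stationary(1.2, 0.3))   # False
```

Using `cmath` keeps the check valid when the roots are complex, which is the case that produces damped sine waves in the ACF.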
Moving Average Processes (MA)
These processes are useful to describe situations in which events have an immediate effect that lasts for
short periods of time (Wei [21]). Zt is a moving average process of order q, which is denoted by MA(q),
if
Zt = εt − θ1εt−1 − ...− θqεt−q (3.26)
Again, it is possible to rewrite the expression using the backward shift operator as follows
θq(B)εt = Zt (3.27)
where $\theta_q(B) = 1 - \theta_1 B - \dots - \theta_q B^q$.
A moving average process is always stationary, since $1 + \theta_1^2 + \dots + \theta_q^2 < \infty$, but it is not necessarily
invertible. It is invertible if the roots of $\theta_q(B) = 0$ lie outside of the unit circle.
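The short memory of a moving average process can be seen in its theoretical autocorrelations, which are zero beyond lag q. The sketch below computes them from the coefficients in Equation 3.26, using the standard autocovariance formula for a finite linear filter of white noise; the function name is hypothetical.

```python
def ma_acf(theta, max_lag):
    """Theoretical ACF of Z_t = e_t - theta_1 e_{t-1} - ... - theta_q e_{t-q}."""
    c = [1.0] + [-th for th in theta]        # coefficients of e_t, e_{t-1}, ..., e_{t-q}
    q = len(theta)
    # Autocovariances divided by the white noise variance; zero beyond lag q
    gamma = [sum(c[i] * c[i + k] for i in range(q - k + 1)) if k <= q else 0.0
             for k in range(max_lag + 1)]
    return [g / gamma[0] for g in gamma]

r_ma1 = ma_acf([0.5], max_lag=3)
print(r_ma1)   # [1.0, -0.4, 0.0, 0.0]
```

For an MA(1) process this reproduces the known value ρ1 = −θ1 / (1 + θ1²).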
Autoregressive Moving Average Processes (ARMA)
A process Zt is a mixed autoregressive moving average process, ARMA(p, q), if
φp(B)Zt = θq(B)εt (3.28)
where $\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p$, $\theta_q(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ and $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$. It is assumed
that $\phi_p(B) = 0$ and $\theta_q(B) = 0$ have no roots in common, and $\theta_q(B)/\phi_p(B)$ is called the ARMA polynomial. Also,
if the roots of $\theta_q(B) = 0$ lie outside of the unit circle, the process is invertible, and if the roots of $\phi_p(B) = 0$
lie outside of the unit circle, the process is stationary.
Note that an autoregressive process of order p is a special case of an ARMA process with order q
equal to zero, and a moving average process of order q is an ARMA process with order p equal to zero.
A stationary and invertible ARMA process can have a pure autoregressive representation, as well as
a pure moving average representation.
In Table 3.1, a summary of the behaviours of the ACF and PACF of the AR(p), MA(q) and ARMA(p, q)
processes is presented. This is useful for identifying models, as well as the orders p and q
of the models.
Table 3.1: Summary of the properties of the stationary models (Source: Bisgaard and Kulahci [26]).
     | AR(p) | MA(q) | ARMA(p, q)
ACF  | Infinite damped exponentials and/or damped sine waves; tails off | Cuts off after lag q | Infinite damped exponentials and/or damped sine waves; tails off
PACF | Cuts off after lag p | Infinite damped exponentials and/or damped sine waves; tails off | Infinite damped exponentials and/or damped sine waves; tails off
There is a dual relationship between AR and MA processes, which can be summarized in the follow-
ing properties
• A stationary AR process of finite order is equivalent to an infinite order MA process.
• An invertible MA process of finite order is equivalent to an infinite order AR process.
• The duality of the respective ACF and PACF functions is also present, as can be seen in Table 3.1.
3.2.2 Non-stationary Linear Models
The previous models are based on the stationarity assumption; however, in many practical situations
time series are non-stationary.
Autoregressive Integrated Moving Average Processes (ARIMA)
An ARIMA(p, d, q) process has the following representation:
\[ \phi_p(B)(1 - B)^d Z_t = \theta_0 + \theta_q(B)\varepsilon_t \tag{3.29} \]
where $\theta_0$ is a real number, $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$, and $\phi_p(z) = 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p$ and
$\theta_q(z) = 1 - \theta_1 z - \theta_2 z^2 - \dots - \theta_q z^q$ do not have any common roots. These models can be transformed into stationary
models by applying the simple difference operator. For example, an ARIMA(p, d, q) series can be
studied in the framework of the ARMA(p, q) models if this operator is applied d times to the series.
Note that the ARMA(p, q) models are a special case of the ARIMA(p, d, q) models when d = 0.
Seasonal Autoregressive Integrated Moving Average Processes (SARIMA)
A seasonal event is an event that repeats after a regular period of time, and the smallest such period
is called the seasonal period (Wei [21]). The ARIMA models can be extended to model
seasonal time series.
The lag-S difference operator, $\nabla_S$, is defined by
\[ \nabla_S X_t = X_t - X_{t-S} = (1 - B^S) X_t \tag{3.30} \]
For non-negative integers d and D, $X_t$ is said to be a $SARIMA(p, d, q) \times (P, D, Q)_S$ process if it has
the following representation
\[ \Phi(B^S)\phi(B)(1 - B^S)^D (1 - B)^d X_t = \Theta(B^S)\theta(B)\varepsilon_t \tag{3.31} \]
where $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$, and the polynomials φ(.) and θ(.) have no common roots and all their roots lie
outside the unit circle. The polynomials Φ(.) and Θ(.) satisfy the same properties.
3.2.3 Model Identification
In time series analysis, the first step is to identify one or more possible models, then comes estimation
of the parameters and finally the evaluation and diagnostic of the model.
When dealing with real data, the ACF and PACF are not known, so it is necessary to estimate and
compare them with the ”theoretical” functions of each model. For that, Table 3.1 can be quite helpful.
The identification of a model is never exact, since no method identifies it unambiguously; the critical
judgement of the person performing the study is required. At this stage, graphical analysis is very
important, as is the model diagnostic.
According to Wei [21], one can follow several steps to identify a model:
Step 1 Plot the time series. By analysing the plot, it is possible to see, for example, whether the
series has a trend, outliers or non-constant variance. After this, apply the necessary transformations
to the data. One of the most common is the Box-Cox transformation, which is
applied in the case of non-constant variance.
Step 2 Estimate and examine the sample autocorrelation function and the sample partial autocorrelation
function, to investigate whether it is necessary to apply the difference operator. For example, when the
sample autocorrelation function decays very slowly and the sample partial autocorrelation function
is zero for lags k > 1, usually the first differences are applied, (1 − B)Zt.
Step 3 After applying the transformations from the previous steps, estimate and examine once again the
sample autocorrelation function and the sample partial autocorrelation function in order to determine
the values of p and q. To do so, compare these functions with the theoretical
functions of the models (AR, MA, ARMA) and find a match. Table 3.1 is a useful aid
in this step.
Step 4 Test whether the deterministic trend term $\theta_0$ should be included when d > 0. The sample mean $\bar{W}$
of the differenced series, $W_t = (1 - B)^d Z_t$, is compared with its approximate standard error,
$S_{\bar{W}}$.
At this stage, more than one possible model is being considered, thus the goal is to select the best
model in order to proceed with the analysis. For this, certain measures can be used, such as the Akaike
information criterion (AIC) or the Bayesian information criterion (BIC) (Box et al. [27]). These measures can
be respectively calculated by the following formulas
\[ AIC(M) = n \ln(\hat{\sigma}_\varepsilon^2) + 2M \tag{3.32} \]
\[ BIC(M) = n \ln(\hat{\sigma}_\varepsilon^2) - (n - M)\ln\left(1 - \frac{M}{n}\right) + M \ln(n) + M \ln\left[\left(\frac{\hat{\sigma}_z^2}{\hat{\sigma}_\varepsilon^2} - 1\right) \Big/ M\right] \tag{3.33} \]
where M is the number of parameters of the model, n is the number of observations, $\hat{\sigma}_\varepsilon^2$ is the maximum likelihood estimator of $\sigma_\varepsilon^2$
and $\hat{\sigma}_z^2$ is the sample variance of the series.
3.2.4 Diagnostic
Once the ”best” model is identified, its parameters should be estimated and it is necessary to check if
the initial assumptions are satisfied, namely:
• the error term $\varepsilon_t$ follows a normal distribution. For this, the histogram of the residuals, $\hat{\varepsilon}_t$, and the
QQ-Plot can be analysed and a goodness of fit test can be performed.
• the variance of the $\varepsilon_t$ is constant. Examine the plot of the residuals or check the effect of the
Box-Cox transformation for several λ.
• the $\varepsilon_t$ are white noise. Analyse the plots of the sample ACF and sample PACF of the residuals and, additionally, a
portmanteau test, such as the Ljung-Box test, can be performed.
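As an illustration of the last check, the Ljung-Box statistic can be computed from the residual autocorrelations as Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1, ..., K. The formula is the standard one and is not given in the text above; the function name and the residuals are hypothetical, and the comparison with a chi-square quantile is left out.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1}^{K} r_k^2 / (n - k)."""
    n = len(residuals)
    e_bar = sum(residuals) / n
    dev = [e - e_bar for e in residuals]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at lag k
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

# Strongly alternating residuals are clearly not white noise
q_stat = ljung_box_q([1.0, -1.0] * 5, max_lag=1)
```

Large values of Q indicate that the residuals are still autocorrelated, which suggests that the fitted model is inadequate.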
3.2.5 Forecast
One of the main goals of time series analysis is to predict future values. When obtaining predictions
of future values, the goal is to produce values with the smallest possible error. This Section
discusses how to predict using the minimum mean square error forecasts for the different models
presented in Subsections 3.2.1 and 3.2.2, following Wei [21].
Consider that, at time t = n, the observations $Z_n, Z_{n-1}, Z_{n-2}, \dots$ are available and the objective is to
forecast the value $Z_{n+l}$, l steps ahead, with l > 0.
Forecast stationary time series
Consider the case of a stationary ARMA model with representation
φ(B)Zt = θ(B)εt (3.34)
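Note that, since the model is assumed to be stationary, it has a purely moving average
representation
\[ Z_t = \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \dots \tag{3.35} \]
with $\psi_0 = 1$. Considering t = n + l,
\[ Z_{n+l} = \sum_{j=0}^{\infty} \psi_j \varepsilon_{n+l-j} \tag{3.36} \]
Since each $Z_j$ can be written in the form of Equation 3.35, the minimum mean square
error forecast of $Z_{n+l}$, denoted by $\hat{Z}_n(l)$, can be defined as
\[ \hat{Z}_n(l) = \psi^*_l \varepsilon_n + \psi^*_{l+1} \varepsilon_{n-1} + \psi^*_{l+2} \varepsilon_{n-2} + \dots \tag{3.37} \]
where the $\psi^*_j$ are to be determined.
The goal is to forecast $Z_{n+l}$ as a linear combination of the observations $Z_n, Z_{n-1}, Z_{n-2}, \dots$ with
minimum mean square prediction error (MSPE), which is given by
\[ P_n(l) = E(Z_{n+l} - \hat{Z}_n(l))^2 \tag{3.38} \]
This can be rewritten as
\[ E(Z_{n+l} - \hat{Z}_n(l))^2 = \sigma_\varepsilon^2 \sum_{j=0}^{l-1} \psi_j^2 + \sigma_\varepsilon^2 \sum_{j=0}^{\infty} [\psi_{l+j} - \psi^*_{l+j}]^2 \tag{3.39} \]
The previous expression is minimized when $\psi_{l+j} = \psi^*_{l+j}$. Therefore, Equation 3.37 can be rewritten as
\[ \hat{Z}_n(l) = \psi_l \varepsilon_n + \psi_{l+1} \varepsilon_{n-1} + \psi_{l+2} \varepsilon_{n-2} + \dots \tag{3.40} \]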
Now, using Equation 3.36 and the following property
\[ E(\varepsilon_{n+j} \mid Z_n, Z_{n-1}, \dots) = \begin{cases} 0, & j > 0 \\ \varepsilon_{n+j}, & j \leq 0 \end{cases} \tag{3.41} \]
it can be written:
\[ E(Z_{n+l} \mid Z_n, Z_{n-1}, \dots) = \psi_l \varepsilon_n + \psi_{l+1} \varepsilon_{n-1} + \psi_{l+2} \varepsilon_{n-2} + \dots \tag{3.42} \]
The right-hand side of the previous equation is equal to the right-hand side of Equation 3.40. Thus,
the minimum mean square error forecast of $Z_{n+l}$, or the l-step ahead forecast of $Z_{n+l}$ at the forecast
origin n, is equal to
\[ \hat{Z}_n(l) = E(Z_{n+l} \mid Z_n, Z_{n-1}, Z_{n-2}, \dots) \tag{3.43} \]
The forecast error, $e_n(l)$, is given by
\[ e_n(l) = Z_{n+l} - \hat{Z}_n(l) = \sum_{j=0}^{l-1} \psi_j \varepsilon_{n+l-j} \tag{3.44} \]
The forecast is unbiased, since $E(e_n(l) \mid Z_t, t \leq n) = 0$, and its error variance is given by
\[ Var(e_n(l)) = \sigma_\varepsilon^2 \sum_{j=0}^{l-1} \psi_j^2 \tag{3.45} \]
Considering that $Z_t$ is a normal process and that $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution,
the $(1 - \alpha) \times 100\%$ forecast limits are given by
\[ \hat{Z}_n(l) \pm z_{\alpha/2} \left[1 + \sum_{j=1}^{l-1} \psi_j^2\right]^{1/2} \sigma_\varepsilon \tag{3.46} \]
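For a concrete case, consider a zero-mean stationary AR(1) process, for which the weights are ψj = φ^j and the forecast reduces to φ^l Zn. The sketch below evaluates Equations 3.43, 3.45 and 3.46 under these assumptions; the function name and the numerical values are hypothetical, and 1.96 is used as the approximate quantile for 95% limits.

```python
import math

def ar1_forecast(z_n, phi, sigma_eps, l, z_quantile=1.96):
    """l-step-ahead forecast and forecast limits for a zero-mean AR(1) process."""
    psi = [phi ** j for j in range(l)]               # psi_0, ..., psi_{l-1}
    forecast = phi ** l * z_n                         # conditional expectation of Z_{n+l}
    std_err = sigma_eps * math.sqrt(sum(p * p for p in psi))
    return forecast, forecast - z_quantile * std_err, forecast + z_quantile * std_err

point, lower, upper = ar1_forecast(z_n=2.0, phi=0.5, sigma_eps=1.0, l=2)
print(point)           # 0.5
print(upper - lower)   # width of the 95% limits, which grows with l
```

As l increases, the forecast tends to the process mean and the limits widen towards those implied by the unconditional variance.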
Forecast non-stationary time series
Consider a non-stationary ARIMA(p, d, q) model, with d 6= 0
φ(B)(1−B)dZt = θ(B)εt (3.47)
Where φ(B) = (1−φ1B− ...−φpBp) is a stationary autoregressive operator and θ(B) = (1−φ1B−
...− θqBq) is an invertible moving average operator.
Since the model is invertible, it can be rewritten in an AR representation. So, the AR representation
of the model at time t+ l is given by
π(B)Zt+l = εt+l (3.48)
Where
20
φ(B) = 1−∞∑j=1
φjBj =
φ(B)(1−B)d
θ(B)(3.49)
Or, it can also be written
Zt+l =
∞∑j=0
πjZt+l−j + εt+l (3.50)
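By moving all terms of Equation 3.50 to one side and applying the operator $1 + \psi_1 B + \dots + \psi_{l-1}B^{l-1}$, Equation 3.51 is obtained.
$$\sum_{j=0}^{\infty}\sum_{k=0}^{l-1}\pi_j\psi_k Z_{t+l-j-k} + \sum_{k=0}^{l-1}\psi_k\varepsilon_{t+l-k} = 0 \tag{3.51}$$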
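where $\pi_0 = -1$ and $\psi_0 = 1$. It can be shown that
$$\sum_{j=0}^{\infty}\sum_{k=0}^{l-1}\pi_j\psi_k Z_{t+l-j-k} = \pi_0 Z_{t+l} + \sum_{m=1}^{l-1}\sum_{i=0}^{m}\pi_{m-i}\psi_i Z_{t+l-m} + \sum_{j=1}^{\infty}\sum_{i=0}^{l-1}\pi_{l-1+j-i}\psi_i Z_{t-j+1} \tag{3.52}$$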
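By choosing the weights $\psi_i$ such that
$$\sum_{i=0}^{m}\pi_{m-i}\psi_i = 0, \quad \text{for } m = 1, 2, \dots, l-1 \tag{3.53}$$
the middle term on the right-hand side of Equation 3.52 vanishes and the expression in Equation 3.54 is reached.
$$Z_{t+l} = \sum_{j=1}^{\infty}\pi^{(l)}_j Z_{t-j+1} + \sum_{i=0}^{l-1}\psi_i\varepsilon_{t+l-i} \tag{3.54}$$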
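where $\pi^{(l)}_j = \sum_{i=0}^{l-1}\pi_{l-1+j-i}\psi_i$. Therefore, given $Z_t$ for $t \leq n$,
$$\hat{Z}_n(l) = E(Z_{n+l} \mid Z_t, t \leq n) = \sum_{j=1}^{\infty}\pi^{(l)}_j Z_{n-j+1} \tag{3.55}$$
since $E(\varepsilon_{n+j} \mid Z_t, t \leq n) = 0$ for $j > 0$.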
The forecast error is then given by
$$e_n(l) = Z_{n+l} - \hat{Z}_n(l) = \sum_{j=0}^{l-1}\psi_j\varepsilon_{n+l-j} \tag{3.56}$$
The weights $\psi_j$ can be calculated recursively from the $\pi_j$ weights in the following manner
$$\psi_j = \sum_{i=0}^{j-1}\pi_{j-i}\psi_i, \quad j = 1, 2, \dots, l-1 \tag{3.57}$$
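The recursion in Equation 3.57 can be sketched in a few lines. The function name `psi_from_pi` and the padding convention for the list `pi` are assumptions made for this illustration only. The example uses a random walk, $(1-B)Z_t = \varepsilon_t$, for which $\pi_1 = 1$, $\pi_j = 0$ for $j > 1$, and every $\psi_j$ equals 1.

```python
def psi_from_pi(pi, lead):
    """Return [psi_0, ..., psi_{lead-1}] from the AR weights pi_1, pi_2, ...

    pi[0] is unused padding, so that pi[j] corresponds to pi_j in
    Equation 3.57. The list must contain at least `lead` entries.
    """
    psi = [1.0]  # psi_0 = 1
    for j in range(1, lead):
        psi.append(sum(pi[j - i] * psi[i] for i in range(j)))
    return psi


# Random walk: pi_1 = 1 and all other pi_j = 0.
random_walk_pi = [0.0, 1.0, 0.0, 0.0, 0.0]
random_walk_psi = psi_from_pi(random_walk_pi, lead=4)

# AR(1) with coefficient 0.5: pi_1 = 0.5, so psi_j = 0.5 ** j.
ar1_psi = psi_from_pi([0.0, 0.5, 0.0, 0.0], lead=4)
```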
Forecast evaluation
To evaluate the performance of a model, one can use certain measures to verify the quality of the
predictions. Also, this can be a way to compare different models to aid in the selection of a model. Let el
denote the one-step prediction error, that is, the difference between the real value, Zl, and the predicted
value at time l, l = j + 1, ..., n− 1.
Although many measures can be used, this project focused on two of them. The mean absolute percentage error can be calculated by the formula
$$\mathrm{MAPE} = \left(\frac{1}{n-j}\sum_{k=j}^{n-1}\left|\frac{e_k}{Z_{k+1}}\right|\right) 100\% \tag{3.58}$$
This measure is scale independent. However, it is not adequate if the time series takes values equal to or close to zero. Therefore, in those cases, another measure was also used, the mean absolute error, which is calculated by
$$\mathrm{MAE} = \frac{1}{n-j}\sum_{k=j}^{n-1}|e_k| \tag{3.59}$$
The model with the lowest value for these measures is preferred over the others.
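The two measures can be computed as in the following minimal sketch. The function names and the toy values are hypothetical. Each prediction is paired with the observation it targets, which mirrors the pairing of $e_k$ with $Z_{k+1}$ in Equation 3.58.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (Equation 3.58)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def mae(actual, predicted):
    """Mean absolute error (Equation 3.59)."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Toy values, not taken from the study.
actual = [2.0, 4.0, 5.0]
predicted = [1.0, 5.0, 5.0]
mape_value = mape(actual, predicted)
mae_value = mae(actual, predicted)
```

Note that `mape` divides by the observed value, so it fails (or becomes unstable) when an observation is zero or close to zero, which is the limitation mentioned above.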
3.3 Time series clustering
Clustering is a technique used to group objects in terms of similarity. No class information is known in advance (unsupervised learning). The objects within the same cluster will be close to each other in terms of distance (they will share similar data features) and far from the members of the other clusters.
When working with time series data, clustering is used to identify patterns in the time series. Time series are a dynamic type of data due to their dependence on time. The choice of dissimilarity measure for time series is still controversial and an open research topic. However, Dynamic Time Warping (DTW) is one of the most used (Aghabozorgi et al. [28]). The periodogram based dissimilarity has also been used as a distance measure between time series (Caiado et al. [29]), as has the dissimilarity index combining temporal correlation and raw values behaviours (Chouakria and Nagabhushan [30]).
Hierarchical clustering has some advantages over other types of clustering, namely that the number of clusters is not required as an initial parameter and that the results are presented in an intuitive dendrogram (Pereira and de Mello [31]). Hierarchical clustering was used in this study and is therefore described in more detail.
3.3.1 Hierarchical Clustering
In hierarchical clustering, each point is initially placed in its own cluster and the two clusters with the lowest dissimilarity value are successively merged until all points belong to a single cluster (Giudici [32]). Along with the hierarchical clustering, a dendrogram is built. A dendrogram is a tree-like structure, where the initial clusters, which contain only one point, are the leaves. At each step of the algorithm, one branch is drawn on the tree to represent the merge of two clusters. The final cluster that contains all points is represented by the root of the tree.
Different dissimilarity measures between clusters can be considered for this algorithm, depending on the choice of method. Although there exists a wide variety of methods, four will be discussed: Single Linkage, Average Linkage, Complete Linkage and Ward’s Method. All of these are agglomerative methods, that is, the clusters are built from the leaves to the root. Divisive methods, which were not used in the project, build the clusters from the root to the leaves.
In Single Linkage, the distance between two clusters is defined as the minimum distance between the observations of the two clusters. Complete Linkage defines the distance between two clusters as the maximum distance between a point of one cluster and a point of the other cluster. Average Linkage computes the distances between each point of one cluster and each point of the second cluster and takes the average of these distances as the dissimilarity between the clusters. Ward’s Method uses a cost function: at each step, the two clusters whose merger leads to the smallest increase of the cost function are merged.
In practice, the choice of method is not straightforward, since no single method yields good results for all types of data. Therefore, it is necessary to try different methods in order to make the best choice.
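The three linkage rules based on pairwise distances can be illustrated with the following minimal sketch. The function name `linkage_distance` and the toy data are hypothetical, and Ward’s Method is left out because its cost function is not specified here.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="single"):
    """Dissimilarity between two clusters under a given linkage rule."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


def absolute(a, b):
    """Toy distance between two one-dimensional points."""
    return abs(a - b)


left, right = [0.0, 1.0], [4.0, 6.0]
single = linkage_distance(left, right, absolute, "single")
complete = linkage_distance(left, right, absolute, "complete")
average = linkage_distance(left, right, absolute, "average")
```

For these two clusters the pairwise distances are 4, 6, 3 and 5, so the three rules give different values for the same pair of clusters, which is why different methods can lead to different dendrograms.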
3.3.2 Distances
Choosing a distance is an important step in clustering, since different distances can lead to different
results. Three dissimilarity measures were considered in this study, in order to assess which one would
be better suited for the data. The dissimilarity measure that ended up being used was the Periodogram
based distance.
Dynamic Time Warping (DTW):
Dynamic Time Warping, proposed by Berndt and Clifford [33], is widely used with time series data sets and is generally regarded as more robust than the Euclidean Distance to shifts along the time axis. DTW makes it possible to compare two time series that are similar in shape but misaligned along the time axis.
To calculate the DTW distance between two realizations of time series, a series of steps must be followed (Pereira and de Mello [31]). Consider two realizations of time series, $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and $\mathbf{y} = (y_1, y_2, \dots, y_m)$:
• Compute the distance matrix $(d_{ij})_{n \times m}$, where $d_{ij} = d(x_i, y_j) = (x_i - y_j)^2$ is the distance between points. Each element $d_{ij}$ corresponds to an alignment of points $x_i$ and $y_j$.
• Create a warping path $W$ in the distance matrix that starts in entry $(1, 1)$ and ends in entry $(n, m)$. This path defines a mapping between $\mathbf{x}$ and $\mathbf{y}$. Each element $w_k$ of $W$ has to be adjacent to $w_{k-1}$. Also, given $w_k = (a, b)$, then $w_{k-1} = (c, d)$ with $a \geq c$ and $b \geq d$, i.e., the points in $W$ have to be monotonically spaced in time. With this, $W = (w_1, w_2, \dots, w_K)$ with $\max(n, m) \leq K \leq n + m - 1$.
• Select the path that minimizes the warping cost:
$$DTW(\mathbf{x}, \mathbf{y}) = \min\left(\sqrt{\sum_{k=1}^{K} w_k}\right) \tag{3.60}$$
where, inside the sum, $w_k$ stands for the entry $d_{ij}$ of the distance matrix at the $k$th element of the path.
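The steps above can be sketched with the usual dynamic programming recursion, which finds the minimum cumulative cost without enumerating all warping paths. This is a minimal illustration, not the implementation used in the study; the function name `dtw_distance` is hypothetical.

```python
from math import inf, sqrt


def dtw_distance(x, y):
    """DTW distance with squared pointwise differences (Equation 3.60)."""
    n, m = len(x), len(y)
    # cost[i][j]: minimum cumulative cost of a warping path that ends at
    # the alignment of x[i - 1] with y[j - 1].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # A path element must be adjacent to the previous one and
            # may not move backwards in either series.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return sqrt(cost[n][m])


# The second series repeats one value, so it only differs by a time shift.
shifted = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

Because the square root is monotone, taking the square root of the minimum cumulative cost gives the same value as minimizing the square root over all paths.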
Dissimilarity Index Combining Temporal Correlation and Raw Values Behaviours (CORT):
This distance combines the temporal correlation between two series with the distance between their raw values (Chouakria and Nagabhushan [30]). Consider again two time series, $\mathbf{x}$ and $\mathbf{y}$, both with length $n$. This dissimilarity index is given by:
$$d(\mathbf{x}, \mathbf{y}) = \Phi[CORT(\mathbf{x}, \mathbf{y})]\,\delta(\mathbf{x}, \mathbf{y}) \tag{3.61}$$
where $\Phi(u)$ is an adaptive tuning function given by Equation 3.62, $CORT(\mathbf{x}, \mathbf{y})$ is a temporal correlation coefficient given by Equation 3.63 and $\delta(\mathbf{x}, \mathbf{y})$ is a dissimilarity measure between the raw values of $\mathbf{x}$ and $\mathbf{y}$, for example, the Euclidean Distance or the DTW distance.
$$\Phi(u) = \frac{2}{1 + \exp(ku)}, \quad \text{with } k \geq 0 \tag{3.62}$$
$$CORT(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{n-1}(x_{i+1} - x_i)(y_{i+1} - y_i)}{\sqrt{\sum_{i=1}^{n-1}(x_{i+1} - x_i)^2}\,\sqrt{\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2}} \tag{3.63}$$
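Equations 3.61 to 3.63 can be illustrated as follows. This is a minimal sketch in which the function names, the default value of `k` and the choice of the Euclidean Distance for $\delta$ are assumptions made for the example.

```python
from math import exp, sqrt


def cort(x, y):
    """Temporal correlation coefficient of Equation 3.63.

    Undefined (division by zero) if either series is constant.
    """
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    numerator = sum(p * q for p, q in zip(dx, dy))
    denominator = sqrt(sum(p * p for p in dx)) * sqrt(sum(q * q for q in dy))
    return numerator / denominator


def euclidean(x, y):
    """Euclidean distance between the raw values of two series."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cort_dissimilarity(x, y, k=2.0, delta=euclidean):
    """Dissimilarity index of Equation 3.61 with tuning function 3.62."""
    tuning = 2.0 / (1.0 + exp(k * cort(x, y)))
    return tuning * delta(x, y)


rising = [1.0, 2.0, 3.0]
same_direction = cort(rising, [2.0, 4.0, 6.0])
opposite_direction = cort(rising, [3.0, 2.0, 1.0])
```

With $k = 0$ the tuning function equals 1, so the index reduces to $\delta$ alone; larger values of $k$ give more weight to the temporal correlation.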
Periodogram Based Dissimilarity:
This dissimilarity measure takes into account the distance between the periodogram coefficients of
two series (Caiado et al. [29] and Shumway and Stoffer [34]). To define the periodogram function, as
well as the dissimilarity measure, it is necessary to introduce some concepts.
The Discrete Fourier Transform (DFT) represents a discrete-time signal as a combination of periodic components. For a sequence $\mathbf{x} = (x_1, x_2, \dots, x_n)$, define the DFT as $d(\omega_0), d(\omega_1), \dots, d(\omega_{n-1})$, where:
$$d(\omega_j) = n^{-1/2}\sum_{t=1}^{n} x_t \exp(-2\pi i \omega_j t) \tag{3.64}$$
for $j = 0, 1, \dots, n-1$, where $\omega_j = j/n$ are called the Fourier frequencies.
The periodogram of $\mathbf{x}$ is defined as the squared modulus of the DFT:
$$I_x(\omega_j) = |d(\omega_j)|^2 \tag{3.65}$$
Let $I_x(\omega_j)$ and $I_y(\omega_j)$ be the periodograms of $\mathbf{x}$ and $\mathbf{y}$, respectively. One periodogram based distance is given by:
$$d_{LNP}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{\lfloor n/2 \rfloor}\left[\log NI_x(\omega_j) - \log NI_y(\omega_j)\right]^2} \tag{3.66}$$
where $NI_x(\omega_j)$ and $NI_y(\omega_j)$ are the normalized periodograms, i.e., $NI_x(\omega_j) = I_x(\omega_j)/\sigma_x$ and $NI_y(\omega_j) = I_y(\omega_j)/\sigma_y$, with $\sigma_x$ and $\sigma_y$ being the sample variances of $\mathbf{x}$ and $\mathbf{y}$, respectively.
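Equations 3.64 to 3.66 can be sketched directly from their definitions. The function names are hypothetical, and the sample variance is computed with divisor $n$ (`statistics.pvariance`), which is an assumption; for two series of the same length the choice of divisor cancels in the difference of logarithms.

```python
import cmath
from math import log, pi, sqrt
from statistics import pvariance


def periodogram(x):
    """Periodogram ordinates I(w_k) for k = 1, ..., floor(n / 2)."""
    n = len(x)
    ordinates = []
    for k in range(1, n // 2 + 1):
        omega = k / n  # Fourier frequency
        dft_sum = sum(
            x_t * cmath.exp(-2j * pi * omega * t) for t, x_t in enumerate(x, start=1)
        )
        ordinates.append(abs(dft_sum) ** 2 / n)  # |n^(-1/2) * sum|^2
    return ordinates


def lnp_distance(x, y):
    """Log-normalized periodogram distance of Equation 3.66.

    Both series must have the same length and no zero periodogram ordinate.
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    norm_x = [value / pvariance(x) for value in periodogram(x)]
    norm_y = [value / pvariance(y) for value in periodogram(y)]
    return sqrt(sum((log(a) - log(b)) ** 2 for a, b in zip(norm_x, norm_y)))


series_a = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
series_b = [6.0, 1.0, 5.0, 2.0, 4.0, 3.0]
doubled_a = [2.0 * value for value in series_a]
```

The normalization by the variance makes the distance insensitive to the scale of the series: `series_a` and `doubled_a` have the same normalized periodogram.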
3.3.3 Comparing clustering methods
The goal when applying a clustering algorithm is to find groups that are internally similar and cohesive and, at the same time, different from the other groups (Giudici [32]). Therefore, it is important to have measures to compare how well the clustering results of each method fit the data. There are internal measures to assess the similarity of the members of each cluster, as well as external measures to evaluate how different the clusters are from each other. Once the clustering results are obtained, a number of values can be examined, such as the average within-cluster distance and the separation between clusters, among others.
These measures are also important to decide on the best number of clusters for the dataset, since in hierarchical clustering the number of clusters is not given by the user. The number of clusters can vary from 2 up to a value m, with m less than or equal to the number of observations, and several indexes or measures can be used to decide on the best number of clusters.
For this project, four indexes were used to select the optimal number of clusters.
Dunn Index
To calculate this index, the distances between the points in each cluster and the points in the remaining clusters are computed. The minimum of these distances is taken as the inter-cluster separation, min.separation. Then, for all the clusters, the distances between points belonging to the same cluster are computed and the maximum value is taken as the maximum diameter, max.diameter. The Dunn Index is then calculated by
$$D = \frac{\text{min.separation}}{\text{max.diameter}} \tag{3.67}$$
If the clusters are quite different from each other, then the distance between them must be large and
if the objects within each cluster are similar, then the diameter of the clusters is expected to be small.
Therefore, this index should be maximized.
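A minimal sketch of the Dunn Index as described above is given below. The function name `dunn_index` and the toy clusters are hypothetical, and the sketch assumes at least two clusters, at least one of which has two or more points.

```python
from itertools import combinations, product


def dunn_index(clusters, dist):
    """Dunn Index (Equation 3.67) for a list of clusters of points."""
    # Smallest distance between points that lie in different clusters.
    min_separation = min(
        dist(a, b)
        for first, second in combinations(clusters, 2)
        for a, b in product(first, second)
    )
    # Largest distance between points that lie in the same cluster.
    max_diameter = max(
        dist(a, b) for cluster in clusters for a, b in combinations(cluster, 2)
    )
    return min_separation / max_diameter


def absolute(a, b):
    """Toy distance between two one-dimensional points."""
    return abs(a - b)


toy_clusters = [[0.0, 1.0], [4.0, 6.0]]
toy_dunn = dunn_index(toy_clusters, absolute)
```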
Entropy
Entropy is another measure of the performance of a clustering algorithm. It measures the degree to which each cluster consists of objects of a single class (Giudici [32]). Consider $n$ observations and $K$ clusters, and let $m_i$ be the number of objects in cluster $i$ and $m_{ij}$ the number of objects of class $j$ in cluster $i$. First, the probability that an observation of cluster $i$ belongs to class $j$ is estimated by
$$p_{ij} = \frac{m_{ij}}{m_i} \tag{3.68}$$
With this, the entropy of cluster $i$ can be calculated as
$$e_i = -\sum_{j=1}^{L} p_{ij}\log_2 p_{ij} \tag{3.69}$$
where $L$ is the number of classes. The total entropy of the cluster set can be computed by
$$e = \sum_{i=1}^{K}\frac{m_i}{n}e_i \tag{3.70}$$
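Equations 3.68 to 3.70 can be illustrated with the following minimal sketch, in which the function name and the toy labels are hypothetical. Classes that do not occur in a cluster are simply skipped, which corresponds to the usual convention $0 \log_2 0 = 0$.

```python
from collections import Counter
from math import log2


def clustering_entropy(cluster_labels, class_labels):
    """Total entropy of a clustering (Equation 3.70)."""
    n = len(cluster_labels)
    total = 0.0
    for cluster in set(cluster_labels):
        members = [c for k, c in zip(cluster_labels, class_labels) if k == cluster]
        m_i = len(members)
        # Entropy of this cluster (Equation 3.69), with p_ij = m_ij / m_i.
        e_i = -sum((m_ij / m_i) * log2(m_ij / m_i) for m_ij in Counter(members).values())
        total += (m_i / n) * e_i
    return total


pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

A clustering whose clusters each contain a single class has entropy 0, while clusters that mix two classes evenly have entropy 1, so lower values indicate purer clusters.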
Gamma (The Baker-Hubert Gamma index)
To understand how to calculate the Gamma index, the concept of concordant pairs must be defined. Let $A$ and $B$ be two vectors of the same size with elements $a_i$ and $b_i$, respectively. If, for two indices $i$ and $j$, $a_i < a_j$ and $b_i < b_j$, then the pair $\{i, j\}$ is said to be concordant (Desgraupes [35]). The number of concordant pairs $\{i, j\}$ is denoted by $s^+$ and the number of discordant pairs is denoted by $s^-$. Note that the pairs where there is equality are not considered. The Gamma index is calculated by the formula
$$\Gamma = \frac{s^+ - s^-}{s^+ + s^-} \tag{3.71}$$
The index takes values from −1 to 1 and it should be maximized.
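The counting of concordant and discordant pairs can be sketched as follows. The function name `gamma_index` is hypothetical, and the sketch treats a pair as concordant when both vectors order its two elements in the same way. In the clustering setting, $A$ would typically hold the distances between pairs of points and $B$ an indicator of whether the two points lie in different clusters; that reading is an assumption about how the index is applied and is not spelled out above.

```python
from itertools import combinations


def gamma_index(a, b):
    """Gamma index (Equation 3.71) of two vectors of the same size."""
    s_plus = s_minus = 0
    for i, j in combinations(range(len(a)), 2):
        product_of_differences = (a[i] - a[j]) * (b[i] - b[j])
        if product_of_differences > 0:
            s_plus += 1  # concordant pair
        elif product_of_differences < 0:
            s_minus += 1  # discordant pair
        # Pairs with equality in either vector are not counted.
    return (s_plus - s_minus) / (s_plus + s_minus)


same_order = gamma_index([1, 2, 3], [1, 2, 3])
reverse_order = gamma_index([1, 2, 3], [3, 2, 1])
with_ties = gamma_index([1, 2, 3], [0, 0, 1])
```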
Silhouette Method
The Silhouette coefficient of a cluster is calculated by taking the average value of the Silhouette coeffi-
cient of all the points in the cluster (Giudici [32]). The Silhouette coefficient varies between -1 and 1. If
the coefficient value of one observation is close to 1, it indicates that the observation is well placed in
its cluster. If the value is close to -1, then the observation is poorly grouped. To compute the coefficient value of a single observation $i$, one starts by calculating the average distance from this point to all other points in the same cluster, $a_i$. Then, for each cluster that does not contain observation $i$, the average distance from $i$ to all the points of that cluster is computed and the minimum of these averages, $b_i$, is kept. The coefficient value for point $i$ is equal to
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{3.72}$$
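The computation of $a_i$, $b_i$ and $s_i$ can be illustrated with the following minimal sketch. The function name and argument layout are hypothetical, and the point is assumed to share its cluster with at least one other point.

```python
def silhouette(point, same_cluster, other_clusters, dist):
    """Silhouette coefficient of a single point (Equation 3.72).

    same_cluster: the other members of the point's cluster (point excluded).
    other_clusters: list of the clusters that do not contain the point.
    """
    a_i = sum(dist(point, q) for q in same_cluster) / len(same_cluster)
    b_i = min(
        sum(dist(point, q) for q in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b_i - a_i) / max(a_i, b_i)


def absolute(a, b):
    """Toy distance between two one-dimensional points."""
    return abs(a - b)


well_placed = silhouette(0.0, [1.0], [[4.0, 6.0]], absolute)
badly_placed = silhouette(4.0, [0.0], [[5.0, 6.0]], absolute)
```

In the first call $a_i = 1$ and $b_i = 5$, which gives a coefficient of 0.8; in the second call the point is closer to the other cluster than to its own, so the coefficient is negative.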
3.4 Models
In this Section, the notes of Wood [36] are followed. In order to present the Generalized Additive Models,
first the Generalized Linear Models must be described. Then, the Generalized Additive Models will be
briefly discussed, followed by a brief mention of interactions between explanatory variables. Lastly, the
Mixed Models are presented.
3.4.1 Generalized Linear Models
Suppose that $Y$ is a response random variable and $X_1, X_2, \dots, X_p$ is a set of explanatory variables. In regression models, the general idea is to predict $Y$ from $X_1, X_2, \dots, X_p$. Generalized linear models (GLM) allow the response variable to follow any distribution from the exponential family, not just the normal distribution (as in linear regression models), and allow for a degree of non-linearity in the model structure (Wood [36]). Some distributions in the exponential family are the Poisson, Binomial, Gamma and Normal distributions. These models consider a smooth monotonic link function, $g(\cdot)$, and the mean $\mu_i = E(Y_i \mid X = x_i)$ of the response, where the $Y_i$ are assumed to be independent and to follow a distribution of the exponential family. The general form of the model is given by Equation 3.73,
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \quad i = 1, 2, \dots, n, \tag{3.73}$$
where $\beta = (\beta_0, \beta_1, \dots, \beta_p)$ is a vector of unknown parameters.
3.4.2 Generalized Additive Models
In Generalized Linear Models, the link function $g(\cdot)$ is used to relate the conditional mean $\mu_i$ to the linear predictor. However, nothing forces that relationship to be linear; it can, more generally, be additive. In generalized additive models, smooth functions, $f(\cdot)$, of the explanatory variables are used, so that non-linear predictors are related to the expected value. Arbitrary smooth functions can be used, for instance splines, which are real functions defined piecewise by polynomial functions; the places where the pieces connect are called knots. The form of a generalized additive model, with $f_i$, $i = 1, \dots, p$, univariate smooth functions, is given by Equation 3.74.
$$g(\mu) = \beta_0 + f_1(x_1) + \dots + f_p(x_p) \tag{3.74}$$
To introduce the idea of smooth functions, consider a linear model with one smooth function of one explanatory variable,
$$y_i = f(x_i) + \varepsilon_i \tag{3.75}$$
where $y_i$ is the response variable, $x_i$ an explanatory variable, $f$ a univariate smooth function and the $\varepsilon_i$ are independent and identically distributed $N(0, \sigma^2)$ random variables. For simplicity, suppose that $x_i \in [0, 1]$.
The aim is to estimate $f$, and for that it needs to be represented in such a way that Equation 3.75 becomes a linear model. To this end, it is assumed that $f$ is a sum of basis functions $b_i(x)$ multiplied by the corresponding regression coefficients $\beta_i$, where $b_i(x)$ is the $i$th basis function of a chosen basis that defines the space of functions to which $f$ belongs. Therefore, $f$ can be written as follows
$$f(x) = \sum_{i=1}^{q} b_i(x)\beta_i \tag{3.76}$$
where $q$ is the basis dimension. With this representation, $f$ is said to be modeled by regression splines, and substituting Equation 3.76 into Equation 3.75 yields a linear model. Some examples of smoothing bases include thin plate regression splines, cubic regression splines, cyclic cubic regression splines and P-splines.
To control the smoothness of a spline, penalized regression splines can be used. The model can be fit by minimizing
$$\| y - X\beta \|^2 + \lambda\int_0^1 [f''(x)]^2 dx \tag{3.77}$$
where $X$ is the model matrix with entries $X_{ij} = b_j(x_i)$ and $\lambda$ is the smoothing parameter, which controls the trade-off between closeness of fit to the data and smoothness of the model. If $\lambda$ is set to 0, the result is an un-penalized regression spline estimate of $f$. If $\lambda \to \infty$, the result is a straight line estimate. Since $f$ is linear in the parameters, the integral of the squared second derivative in Equation 3.77 can be written as
$$\int_0^1 [f''(x)]^2 dx = \beta^T S \beta \tag{3.78}$$
where $S$ is a matrix of known coefficients. Therefore, the problem becomes to minimize the following expression with respect to $\beta$
$$\| y - X\beta \|^2 + \lambda\beta^T S \beta \tag{3.79}$$
The estimate of the regression coefficients is then given by
$$\hat{\beta} = (X^T X + \lambda S)^{-1} X^T y \tag{3.80}$$
and the hat matrix for the model, $H$, is given by
$$H = X(X^T X + \lambda S)^{-1} X^T \tag{3.81}$$
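The step from the penalized criterion to the estimator and the hat matrix can be filled in as follows. This is a short sketch that assumes $S$ is symmetric and $X^T X + \lambda S$ is invertible; neither assumption is stated explicitly above.

```latex
\begin{align*}
Q(\beta) &= (y - X\beta)^T (y - X\beta) + \lambda \beta^T S \beta \\
\frac{\partial Q}{\partial \beta} &= -2 X^T (y - X\beta) + 2 \lambda S \beta = 0 \\
(X^T X + \lambda S)\,\hat{\beta} &= X^T y \\
\hat{\beta} &= (X^T X + \lambda S)^{-1} X^T y \\
\hat{y} = X\hat{\beta} &= X (X^T X + \lambda S)^{-1} X^T y = H y
\end{align*}
```

The last line shows why $H$ is called the hat matrix: it maps the observed responses $y$ to the fitted values $\hat{y}$.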
It is then important to choose an optimal smoothing parameter, $\lambda$, that is, one that leads to a spline estimate of $f$, $\hat{f}$, as close as possible to the true $f$, as well as to choose the basis dimension.
To choose the smoothing parameter $\lambda$, consider the notation $\hat{f}_i = \hat{f}(x_i)$ and $f_i = f(x_i)$. The parameter $\lambda$ can be chosen to minimize the following criterion:
$$M = \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i - f_i)^2 \tag{3.82}$$
$M$ cannot be used directly, because $f$ is unknown. However, an estimate of $E(M) + \sigma^2$ can be obtained. Let $\hat{f}^{[-i]}$ denote the model fitted to all data except $y_i$, and let $\hat{f}_i^{[-i]} = \hat{f}^{[-i]}(x_i)$. The Ordinary Cross Validation (OCV) score is defined by
$$\upsilon_0 = \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - y_i)^2 \tag{3.83}$$
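This score is the average of the squared differences between each left-out observation and its predicted value. If $y_i$ is replaced by $f_i + \varepsilon_i$ in Equation 3.83, then the following is obtained
$$\upsilon_0 = \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - f_i - \varepsilon_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\left[(\hat{f}_i^{[-i]} - f_i)^2 - 2(\hat{f}_i^{[-i]} - f_i)\varepsilon_i + \varepsilon_i^2\right] \tag{3.84}$$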
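Taking the expectation of Equation 3.84 and knowing that $E(\varepsilon_i) = 0$, that $E(\varepsilon_i^2) = \sigma^2$ and that $\varepsilon_i$ and $\hat{f}_i^{[-i]}$ are independent, the following equation is obtained
$$E(\upsilon_0) = \frac{1}{n}E\left(\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - f_i)^2\right) + \sigma^2 \tag{3.85}$$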
Now, $\hat{f}^{[-i]} \approx \hat{f}$, with equality in the large sample limit, so $E(\upsilon_0) \approx E(M) + \sigma^2$, also with equality in the large sample limit. Therefore, since the ideal would be to minimize $M$, choosing $\lambda$ to minimize $\upsilon_0$ is a reasonable approach. This process is called the Ordinary Cross Validation (OCV) method.
Calculating $\upsilon_0$ by leaving out one observation at a time and refitting the model $n$ times is, however, computationally expensive, but it can be shown that
$$\upsilon_0 = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_i)^2}{(1 - H_{ii})^2} \tag{3.86}$$
where $\hat{f}$ is the estimate from fitting to all the data and $H$ is the model hat matrix, so that $\upsilon_0$ can be computed from a single fit. In practice, the weights $1 - H_{ii}$ are replaced by the mean weight $\mathrm{tr}(I - H)/n$, where $\mathrm{tr}(\cdot)$ indicates the trace of a matrix. With this, the Generalized Cross Validation (GCV) score is obtained.
$$\upsilon_g = \frac{n\sum_{i=1}^{n}(y_i - \hat{f}_i)^2}{[\mathrm{tr}(I - H)]^2} \tag{3.87}$$
Therefore, GCV is used to choose the $\lambda$ that minimizes $\upsilon_g$.
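The equivalence between the leave-one-out definition of the OCV score and the single-fit formula can be checked numerically on a very simple smoother. The sketch below deliberately swaps the penalized spline for the constant (mean) model, whose hat matrix has all entries equal to $1/n$; this substitution, the function names and the toy data are assumptions made only to keep the example short.

```python
def ocv_brute_force(y):
    """OCV score obtained by refitting the mean model without each y_i."""
    n = len(y)
    total = 0.0
    for i in range(n):
        rest = y[:i] + y[i + 1:]
        prediction = sum(rest) / len(rest)  # model fitted without y_i
        total += (prediction - y[i]) ** 2
    return total / n


def ocv_single_fit(y):
    """OCV score from one fit, using the hat matrix diagonal (Equation 3.86)."""
    n = len(y)
    fitted = sum(y) / n  # fit to all the data
    h_ii = 1.0 / n  # diagonal of the hat matrix of the mean model
    return sum(((y_i - fitted) / (1.0 - h_ii)) ** 2 for y_i in y) / n


def gcv(y):
    """GCV score (Equation 3.87) for the mean model, where tr(I - H) = n - 1."""
    n = len(y)
    fitted = sum(y) / n
    return n * sum((y_i - fitted) ** 2 for y_i in y) / (n - 1.0) ** 2


toy_y = [2.0, 4.0, 9.0, 5.0]
brute = ocv_brute_force(toy_y)
single = ocv_single_fit(toy_y)
generalized = gcv(toy_y)
```

For this model all $H_{ii}$ are equal, so the GCV score coincides with the OCV score; for a penalized spline the two generally differ.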
Interactions
Interactions between multiple explanatory variables can be important to the model and with GAM there are four main ways to include them. First, there is the product of two explanatory variables, $x_1 \times x_2$. Second, it is possible to apply a smooth function to one of the variables, $f_1(x_1) \times x_2$. Third, the same smooth function can be used for both variables, $f_1(x_1) \times f_1(x_2)$, which can also be denoted by $f_1(x_1, x_2)$. Such terms are invariant to rotation of the space of explanatory variables, that is, they produce an isotropic smooth. This is appropriate when the quantities are measured in the same units, for example, spatial coordinates. Lastly, there are tensor product interactions, in which a different smoothing basis can be used for each variable and each is penalized in its own way, $f_1(x_1) \otimes f_2(x_2)$. Tensor product interactions can be written as
$$f_{12}(x_1, x_2) = \sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}\, b_{1i}(x_1)\, b_{2j}(x_2) \tag{3.88}$$
where $b_{1i}$ and $b_{2j}$ are the basis functions, $I$ and $J$ are the basis dimensions and $\delta$ is a vector of unknown parameters. These interactions are invariant to linear rescaling of the explanatory variables and are appropriate when the quantities are measured in different units or when it is necessary to have different degrees of smoothness relative to different explanatory variables.
3.4.3 Mixed Models - GAMMs
Generalized additive models can be represented as mixed models with the smooth terms as random
effects. First, a brief description of linear mixed models and generalized linear mixed models will be
made.
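In general, a linear mixed model extends the linear model in Equation 3.89 to the model in Equation 3.90.
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, I\sigma^2) \tag{3.89}$$
$$y = X\beta + Zb + \varepsilon, \quad b \sim N(0, \psi), \quad \varepsilon \sim N(0, \Lambda\sigma^2) \tag{3.90}$$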
where the vector $b$ contains the random effects, $Z$ is a model matrix for the random effects and $\Lambda$ is a positive definite matrix. Often, $\Lambda$ is the identity matrix.
The generalized linear mixed models (GLMM) follow from the linear mixed models and have the following structure
$$g(\mu_i^b) = X_i\beta + Z_i b \tag{3.91}$$
where $\mu^b = E(y \mid b)$ and $b$ follows a normal distribution with zero mean vector and covariance matrix $\psi$, which is usually parameterized in terms of a parameter vector $\theta$. The random variables $y_i \mid b$ are independent and follow a distribution from the exponential family.
Now, the generalized additive mixed models (GAMM) can be defined as follows:
$$y_i = X_i\beta + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + \dots + Z_i b + \varepsilon_i, \tag{3.92}$$
where $X_i$ is the $i$th row of a fixed effects model matrix, the $f_j$ are smooth functions of the explanatory variables, $Z_i$ is the $i$th row of a random effects model matrix, $b \sim N(0, \psi)$ is a vector of random effects coefficients, $\psi$ is a positive definite covariance matrix and $\varepsilon \sim N(0, \Lambda)$ is a residual error vector.
3.5 Disaggregation of water consumption
In this study, some clients have two water meters: one measures the indoor water use and the other measures the outdoor water use. However, most of the clients have a single water meter that measures both indoor and outdoor water use. A secondary goal of this dissertation was to test a method to disaggregate the total water use of clients with a single water meter into indoor and outdoor water use. In this Section, the steps of a possible method for the disaggregation of water consumption are described. This method uses time series clustering, discussed in Section 3.3, a classification algorithm, K-Nearest Neighbors, which is described in Subsection 3.5.1, and garden watering demand models (Generalized Additive Models).
Note that the method relies heavily on the garden watering demand models, as well as on the results obtained for the data set studied when fitting them.
3.5.1 Classification algorithm: K-Nearest Neighbors (KNN)
K-Nearest Neighbors is one of the most commonly used methods, with an easy interpretation and applications in classification and regression problems (Giudici [32]). In this study, the KNN algorithm was applied to predict the class of each time series in a set. The dissimilarity measure used was the same as in the time series clustering algorithm, the periodogram based distance (Subsection 3.3.2).
Consider a training set composed of observations $(x, y)$ of the explanatory variables $X$ and the label variable $Y$. KNN can be used to predict a value of the class variable $Y$, $y_0$, when the values of the explanatory variables, $x_0$, are known. The set of such instances, for which only the explanatory variables are known, is called the test set.
The steps to be taken in the KNN algorithm for each instance in the test set are as follows:
1. Specify a positive integer k. This indicates the number of nearest neighbors to take a vote from.
2. Calculate the distance between the instance and each element in the training set using the chosen
distance.
3. Sort the calculated distances in ascending order.
4. Select the k top entries that are closest to the sample.
5. Find the most common classification among these k entries. This is the predicted class of the
instance.
When the k chosen is equal to 1, the algorithm is denoted as 1-NN. In this case, the new instance is
assigned the same class as its nearest neighbor.
Note that KNN performs better if the data is on the same scale, thus the data can be normalized
before applying the method.
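The five steps of the algorithm can be sketched as follows. The function name `knn_predict`, the toy training set and the absolute-difference distance are hypothetical; in the study the instances are time series and the distance is the periodogram based one.

```python
from collections import Counter


def knn_predict(instance, training_set, dist, k=1):
    """Predict the class of an instance by majority vote of its k nearest neighbors.

    training_set is a list of (x, label) pairs.
    """
    # Steps 2 and 3: compute the distances and sort them in ascending order.
    ranked = sorted(training_set, key=lambda pair: dist(instance, pair[0]))
    # Step 4: keep the k closest entries.
    nearest = ranked[:k]
    # Step 5: most common label; ties go to the label seen first among
    # the nearest entries.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def absolute(a, b):
    """Toy distance between two one-dimensional points."""
    return abs(a - b)


training = [(0.0, "low"), (1.0, "low"), (5.0, "high"), (6.0, "high"), (7.0, "high")]
one_nn = knn_predict(0.4, training, absolute, k=1)
three_nn = knn_predict(4.0, training, absolute, k=3)
```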
3.5.2 Method for disaggregation of water consumption
In this Subsection, the steps involved in the method for disaggregating the mean total consumption of a set of clients that have a single water meter are described. Let the set of clients that have a single water meter, which measures both indoor and outdoor water use, be denoted by single water meter set.
This method takes advantage of the results obtained while modeling the garden watering demand. In order to model the daily garden watering consumption, time series clustering (discussed in Section 3.3) was applied to the set of clients that have two water meters. Let the partition of the set of clients that have two water meters (one indoor and one outdoor) be denoted by C, with size m. So, m models were built, one for each cluster. This method assumes that the majority of the monthly total water use is due to outdoor water use. Thus, the daily garden watering demand models are used in this method to estimate the total daily water use of the single water meter clients.
The steps taken in this method are as follows:
Step 1 Apply the chosen clustering algorithm with the chosen distance to the normalized single water meter set. Then, choose the best number of clusters k. Let the clusters of the single water meter set be denoted by G.
Step 2 Having chosen the optimal number k of clusters, build the representative series for each cluster. The representative series of a cluster is calculated by taking, at each time point t, the mean of all the time series in that cluster at time t. Then, this series is normalized.
Step 3 Consider the training set composed of the normalized representative series of the garden watering consumption clusters, where each series represents its own class, and the test set comprised of the normalized representative series of the new single water meter clusters. Apply 1-NN (1-Nearest Neighbor) with this training set and test set.
Step 4 According to the classification results of 1-NN, one of the m models of the garden watering consumption is used to estimate the total daily consumption for each of the k clusters.
Step 5 Calculate the percentage that the monthly outdoor use represents in the monthly total water use for each of the m clusters in C. Using the appropriate percentage values, estimate the future daily outdoor water use by taking a percentage of the estimates obtained in Step 4. For the daily estimates in the same month, the same percentage value is used.
Step 6 The estimates of the indoor water use are obtained as the difference between the estimates of the total consumption (Step 4) and the estimates of the outdoor water consumption (Step 5).
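The arithmetic in Steps 2, 5 and 6 can be illustrated with the following minimal sketch. The function names and the numbers are hypothetical, the outdoor share is expressed as a fraction rather than a percentage, and the normalization in Step 2 is omitted because its exact form is not specified here.

```python
def representative_series(cluster):
    """Pointwise mean of the equally long time series in a cluster (Step 2)."""
    return [sum(values) / len(values) for values in zip(*cluster)]


def split_total(total_estimates, outdoor_share):
    """Split daily total estimates into outdoor and indoor parts (Steps 5 and 6).

    outdoor_share is the fraction of the monthly total attributed to outdoor
    use; the same value is applied to every day of that month.
    """
    outdoor = [outdoor_share * total for total in total_estimates]
    indoor = [total - out for total, out in zip(total_estimates, outdoor)]
    return outdoor, indoor


# Two toy series in one cluster and two days of estimated total consumption.
mean_series = representative_series([[1.0, 2.0], [3.0, 6.0]])
outdoor_use, indoor_use = split_total([10.0, 20.0], outdoor_share=0.75)
```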
Chapter 4
Results and Discussion
In this Chapter, the clustering results, the models obtained and the forecast results are shown and discussed. In Section 4.1, the case study is described, as well as the initial treatment applied to the data. In Section 4.2, the exploratory work is shown. The results of the time series clustering are discussed in Section 4.3. In Section 4.4, the steps to fit the Generalized Additive Models are explained and the forecast results are presented. Lastly, in Section 4.5, preliminary work on the disaggregation of the consumption into indoor and outdoor water use for clients with a single water meter is presented and the results obtained are discussed.
4.1 Case study description and data processing
In this Section, the data available for this dissertation is described. The steps taken in data treatment
are explained, including how the missing values were dealt with. The process to select the water meters
to be used in this study is explained. Furthermore, the meteorological variables are plotted for the period
in study to understand the meteorological conditions of the region.
We received several pieces of information associated with 73 lots, including the total area of the lot, building area, outdoor area and housing typology (apartment or detached house). The outdoor area can comprise grass, trees, small bushes, an assortment of small plants, pavements, annexes, as well as a swimming pool. Each of these 73 lots has two water meters associated: one exclusive to the indoor water consumption and the other exclusive to the outdoor consumption. The first measures all the water consumed in the house by kitchen and bathroom sinks, dishwasher, washing machine, showers, bathtubs, toilet flush and possibly refrigerators. The second water meter counts the external uses of water, which may include garden watering, filling and maintaining the pool, washing of pavements and vehicles, as well as the maintenance of decorative fountains. It is important to mention that all of the lots have an exterior swimming pool. Since the majority of the outdoor water use is due to garden watering, throughout the dissertation we use the terms outdoor water use and garden watering demand interchangeably. Note that these 73 clients belong to a set of almost 3000 managed by this water utility company.
The time series data set provided contained the hourly water consumption from 01/01/2015 to
31/07/2017 of the two water meters from each lot. Subsequently, we were also given hourly water
consumption from 01/08/2017 to 30/11/2017 to be used as a test set to validate the models. For the first
stage of the study, we focused only on the consumption of the outdoor water meters, with the goal of
modeling the water consumption for garden watering. We aggregated the data to daily consumption to
build the models, since it would not be possible to model the hourly data, due to the many variations in
the patterns of each time series.
When a water meter has a data collection problem, it will not provide records for the whole day. Also,
the water meter has an indicator that registers the accumulation of water consumption since the day it
started working. Even when the water meter does not register the entries for one day, the indicator will
have those values accumulated. To fill in the missing observations, we made use of this indicator. For
example, if only one day is missing in a time series, we extract the real value of water consumed in that
day by the indicator. However, if we have n consecutive days missing, we use the indicator to know how
much water was spent over this period and divide that amount evenly over the n days.
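The gap-filling rule described above can be sketched as follows. The function name `fill_gap` and the indicator readings are hypothetical. The sketch assumes that the cumulative indicator is read at the end of the last recorded day before the gap and at the end of the last missing day.

```python
def fill_gap(indicator_before, indicator_after, n_missing):
    """Spread the water consumed during a gap evenly over the missing days.

    indicator_before: cumulative reading at the end of the last recorded day.
    indicator_after: cumulative reading at the end of the last missing day.
    """
    consumed = indicator_after - indicator_before
    return [consumed / n_missing] * n_missing


# One missing day: the real consumption of that day is recovered.
single_day = fill_gap(120.0, 121.5, n_missing=1)
# Three consecutive missing days: 6 m3 are divided evenly.
three_days = fill_gap(120.0, 126.0, n_missing=3)
```

With a single missing day the filled value is exact, whereas for longer gaps only the total over the gap is exact and the daily pattern within it is lost.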
Knowing beforehand that there would be extreme observations in the time series that we wanted to explore and study, we did not apply any outlier detection algorithm to this data set. These observations are an important aspect of the real consumption pattern, yet they would be classified as outliers. This is information that we did not wish to replace and lose. A more thorough analysis of these events in each time series allows us to better understand the water consumption behaviour of the clients, and in order to do this each time series had to be examined individually. We were expecting to see in the daily consumption of each client the moment when the swimming pool was filled, around the months of April or May, since it implies a large volume of water. However, it was verified that not all the time series have significant peaks and that the ones that do can present peaks in the consumption at any time of the year. We needed to further study each peak to understand if it corresponded, for example, to the renovation of the water in the swimming pool, to a human error or to a leakage. Section 4.2 describes how we identified the cause of each peak and gives more information about the renovation of water in the swimming pools.
After some exploratory work, we ended up using only 57 time series in the study. The selection process took place over several stages. Firstly, bearing in mind the reference given by the water utility company that the expected average consumption for garden watering varies between 3 and 5 l/(m2.day), we identified clients that presented a consistently low consumption. It is suspected that these lots have a borehole installed, which means that the garden is not watered using the supply network. Therefore, these lots were not included in the study. Another case included clients that showed a significant change in consumption during the period considered and also presented a consistently low consumption, which made us suspect that this change was caused by the installation of a borehole. Furthermore, when inspecting each time series, some stood out with an unexpected behaviour and, since it was not possible to find a cause for it, they were excluded as well. Note that out of the 57 lots selected, only one is an apartment. Therefore, the results that were obtained can be used for detached houses, but cannot be generalized to apartments.
[Figure 4.1 axes: Time (days); mean consumption (m3/day); mean temperature (ºC)]
Figure 4.1: Mean daily water consumption for garden watering of the 57 water meters and mean daily temperature from 01/01/2015 to 30/11/2017.
Furthermore, we downloaded from the website Weather Underground (www.wunderground.com [37]) the
meteorological conditions, for the time period considered, of the location nearest to where the data
was collected. This source gave us the mean, maximum and minimum daily temperature
and the accumulated daily precipitation. We intend to understand whether these variables
are correlated with the water consumption for garden watering and, if so, include them in the model. In
Figure 4.1, the mean daily water consumption for garden watering of the 57 water meters is plotted with
the mean daily temperature in a plot with two y-axes. The daily accumulated precipitation from 01/01/2015 to
30/11/2017 is plotted in Figure 4.2.
[Figure 4.2 axes: Time (days) vs. mm/day]
Figure 4.2: Daily accumulated precipitation from January 2015 to November 2017.
Moreover, we received the hourly water consumption of lots with just one water meter, that is, the
indoor water use and outdoor water use are measured together by one water meter. For these lots, we
also have the lot, building and outdoor areas. The second goal of this study is to disaggregate the daily
consumption of these clients to find out how much of this consumption is residual water.
4.2 Exploratory Analysis
We proceeded to perform some exploratory analysis to better understand the data. This study is focused
on the water consumption for garden watering; therefore, we looked into the seasonality in the data, how
the consumption of each client relates to their respective outdoor area and what share of the total
water consumption it represents. Moreover, we closely analysed the
extreme values present in the data and attempted to relate them to the renovation of the pool water.
To explore the seasonality in the data, we built a boxplot per month of the aggregated monthly
consumption, shown in Figure 4.3. There is consumption throughout the entire year, that is, even in the
winter months, the gardens are watered. There is a clear yearly seasonality: the consumption is higher
in the summer months and lower in the winter months. We note that there are extreme observations
every month. The highest value recorded occurred in June 2016, when one client consumed over 1500 m3.
Besides this, there are four other observations with a high value, around 1000 m3, and three
of them belong to the same client as the highest observation. In addition, in December 2015 there is a
slight increase in the median with regard to November, followed by a decrease in January 2016.
The following year, there is an increase in the median in January, followed by a decrease in February.
[Figure 4.3 axes: Time (months) vs. m3/month]
Figure 4.3: Boxplot of the monthly consumptions of the 57 water meters between January 2015 and November 2017.
The variability is higher in June, July, August and September and lower in January, February and
December over these three years. Note that the median value starts to increase in March and April,
as well as the variability. Furthermore, we notice that the month of May is similar in the median and
variability in the years of 2015 and 2017, but it is quite different in 2016 (Figure 4.4). The median is
significantly lower and it presents less variability. This is possibly due to the unusually high amount of
precipitation in May 2016 (Figure 4.2), that could have led to a lower consumption to water the gardens
in this month.
In Figure 4.5, we see the changes in the month of November over the three years in more detail.
November 2016 does not present many changes when compared to November 2015;
however, in 2017 the median value for this month increases significantly. There is also a higher
variability in November 2017. The months of October and November were significantly drier in 2017
(Figure 4.2) and that could have caused the increase in the consumption in November 2017.
[Figure 4.4 axes: Months (Jan to Dec) vs. m3/month, grouped by year (2015, 2016, 2017)]
Figure 4.4: Boxplot of monthly consumptions of the 57 time series and grouped by year (2015, 2016 and 2017).
As we are dealing with water consumption for garden watering, it is important to take into account the
outdoor area of each lot and how the two relate. As mentioned in Section 4.1, the outdoor area of each lot
can consist of grass, small bushes, pavements, amongst others, which means that the actual watered
area does not correspond to the outdoor area: the watered area of each lot is smaller than the outdoor
area. As a reference, the mean size of the outdoor areas of the 57 lots is 1547 m2 and the mean lot
size is 1898.5 m2. We computed the mean daily consumption of each outdoor water meter and plotted it
against the outdoor area of each lot, as shown in Figure 4.6.
[Figure 4.5 axes: Months (Nov) vs. m3/month, grouped by year (2015, 2016, 2017)]
Figure 4.5: Monthly consumption in November for three years (2015, 2016 and 2017).
[Figure 4.6 axes: Outdoor Area (m2) vs. m3/day]
Figure 4.6: Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area for the 57 water meters.
We expected that the larger the outdoor area, the higher the mean daily consumption,
i.e., a linear relationship. However, there are some points that are far from the linear regression line in
blue, as can be seen in Figure 4.6. There is a group with large outdoor areas, greater than 2000 m2, that is
above the line. Also, there are a few points below the line, which seems to indicate that these clients
use less water than expected. This could be because their actual watered area is considerably smaller
than the outdoor area or their garden typology requires less water. For example, the lot with the largest
outdoor area, over 4000 m2, has a mean daily consumption value below the line. Since it has
such a large outdoor area, it would be expected to have a high mean daily consumption; however, it
has a lower value than expected. Possibly, not all of the outdoor area is looked after. Additionally, note
that the majority of the mean daily consumption values lie between 2.5 m3 and 7.5 m3.
Moreover, Loh et al. [7] did not find any relationship between the watered area of the outdoor space and
the outdoor water consumption, which is not our case, since there is a linear trend, meaning that the
larger the outdoor area, the higher the mean daily consumption.
To know if there was a meaningful relation between the mean daily consumption and the watered
areas, we resorted to Google Maps and took measurements of the non-watered areas, that is, swimming
pools and their surrounding pavements, exterior garages, car entries and front entry pavements. This
way, we got a value that is closer to the real watered area of each lot. It was not possible to get this
estimated watered area for a few lots for which Google Maps did not give the exact location. Moreover,
it was not possible to collect the garden typology description of each watered area. However, it was
possible to assess that all but one of the 57 lots have a grass area, small bushes, small patches of plants and
trees. The remaining lot did not have a grass area, only bushes, small plants and trees.
The mean daily consumption of each outdoor water meter plotted against the estimated watered area
can be found in Figure 4.7. In this plot, there is still a group of points above the regression line with
corresponding estimated watered areas higher than 1500 m2. There is now also a group of points that
is further from the regression line, but with smaller estimated watered areas.
[Figure 4.7 axes: Estimated Watered Area (m2) vs. m3/day]
Figure 4.7: Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area for the 57 water meters.
To better understand the consumption behaviour of each client, that is, which ones
spend more or less water than expected for their respective outdoor areas and which ones spend within
the expected values, we computed the mean daily consumption in litres per square meter
of outdoor area for each client (l/(m2.day)) and plotted it against the outdoor area. This scatterplot
can be found in Figure 4.8 (a). The majority of the points are concentrated between 3 l/(m2.day) and
5 l/(m2.day), which is the reference for a reasonable consumption, as mentioned in Section 4.1. It is
then easy to identify the clients that are consistently using a high volume of water to water their gardens.
This includes the client with the smallest outdoor area, who consumes an extremely high mean quantity
of water per day. Also, there are a few clients that consume on average less than 3 l/(m2.day).
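The unit conversion and the comparison with the reference range can be sketched as follows. This is only an illustration: the function names, the labels and the three clients are hypothetical, and the thresholds are simply the 3 and 5 l/(m2.day) reference values quoted above.

```python
def litres_per_m2_day(mean_daily_m3, area_m2):
    """Convert a mean daily consumption in m3/day to l/(m2.day)."""
    if area_m2 <= 0:
        raise ValueError("area_m2 must be positive")
    return mean_daily_m3 * 1000.0 / area_m2  # 1 m3 = 1000 l


def classify(rate, low=3.0, high=5.0):
    """Label a client against the 3 to 5 l/(m2.day) reference range."""
    if rate < low:
        return "low"
    if rate > high:
        return "high"
    return "reference"


# Hypothetical clients: (mean daily consumption in m3/day, outdoor area in m2)
clients = {"A": (6.0, 1500.0), "B": (2.0, 1000.0), "C": (9.0, 1200.0)}
labels = {
    name: classify(litres_per_m2_day(volume, area))
    for name, (volume, area) in clients.items()
}
```

Passing the estimated watered area instead of the outdoor area gives the second version of the indicator; since the watered area is smaller, the resulting rate can only increase.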
[Figure 4.8 axes: (a) Outdoor Area (m2) vs. l/(m2.day); (b) Estimated Watered Area (m2) vs. l/(m2.day)]
(a) Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area.
(b) Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area.
Figure 4.8: Scatterplots of the mean daily consumption versus a) outdoor area, b) estimated watered area.
In Figure 4.8 (b), we plot the mean daily consumption in litres per square meter of estimated watered
area against the estimated watered areas. In this case, the majority of the points are concentrated
between 3 l/(m2.day) and 6 l/(m2.day). Here, it stands out that there is a group of points with values
above 7.5 l/(m2.day). These clients consistently consume a high volume of water per day with regard to
their respective estimated watered areas. They are the same clients as the ones identified in Figure 4.8
(a) as being big consumers, that is, with a mean daily consumption per square meter of outdoor area
above 5 l, with the addition of one client. Also, we notice that all of the big consumers have a smaller
estimated watered area, below 1500 m2. In addition, this plot reveals more information about the
behaviour of the 16 clients that have a low average consumption per square meter of outdoor area, below
3 l/(m2.day). With these new values computed considering the estimated watered area, only 8 clients
have a mean daily consumption per square meter below 3 l. That is, of the 16 clients that were being
considered as low consumers, only 8 remain so. We note that one of these low consumers
is the only lot in this set that has no grass area and that there were six clients in total for which we could not
estimate the watered area. Also, the client with the largest outdoor area continues to be considered
a low consumer.
Regarding the low consumers, it can be important to understand which practices they adopt to have
these lower consumptions, for example, their consumption habits and garden typology. On the other
hand, the big consumers are potential clients to be targets of awareness-raising campaigns to reduce
the consumption.
It is also important to understand the weight of the water consumption for garden watering in the
total monthly consumption. In Figure 4.9, the median water consumption for garden watering per month
and the median indoor consumption per month of the 57 clients are presented in a stacked bar plot. On
top of every bar, the percentage of water consumed for garden watering in that month is presented. It
is very clear that the consumption for garden watering represents the majority of the total consumption
per month. For these clients, the weight of the garden watering in the total consumption is much more
significant than what Loh et al. [7] found in Perth, Australia. That study, conducted between 1998 and
2001, found that 56% of the total water consumption was due to outdoor water use.
[Figure 4.9 axes: Time (months) vs. m3/month; stacked bars by type (Indoor, Garden watering), with the percentage of garden watering above each bar]
Figure 4.9: Median monthly indoor consumption and median monthly water consumption for garden watering of the 57 water meters between January 2015 and November 2017.
In Figure 4.10, the mean daily indoor consumption and the mean water consumption for garden
watering of the 57 clients are plotted in a plot with two y-axes, in order to compare the patterns. We note that
there is also a seasonality in the indoor consumption: it is higher in the summer months and lower in the
winter months. However, the difference between these two periods is not as strong as in the mean water
consumption for garden watering.
[Figure 4.10 axes: Time (days); mean garden watering (m3/day); mean indoor (m3/day)]
Figure 4.10: Mean daily pattern of indoor and water consumption for garden watering of the 57 water meters between 01/01/2015 and 30/11/2017.
As mentioned, the 57 lots we worked with to build the model all have an exterior swimming pool. We
wanted to know more about the renovation of the water of the pools, namely at what time of the year
it occurs, how often, how long it takes on average to fill a pool and what the average water flow
(m3/h) is.
To identify the filling of a pool in the consumption, it was necessary to look at each daily time series
individually. It was very clear which time series had a significant peak. In order to have more certainty
that a peak in the consumption corresponds to the renovation of the water of the swimming pool, we used
Google Maps to measure the surface area of the pool of each lot. By considering a standard residential
swimming pool depth, we estimated the volume of the pool in each lot. With this, we can compare the
quantity of water spent by a client during the peak with the respective estimated pool volume. However,
it was not possible to perform this estimation for a few of the pools, since we did not know the exact
location of a few lots.
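The comparison described above can be sketched as follows. The depth of 1.5 m and the 25% tolerance are hypothetical values chosen only for illustration, since neither the standard depth adopted nor an acceptance criterion is stated here; the function names are also hypothetical.

```python
def estimated_pool_volume(surface_m2, depth_m=1.5):
    """Estimate the pool volume (m3) as surface area times an assumed mean depth.

    depth_m = 1.5 is a placeholder, not the value used in the study.
    """
    return surface_m2 * depth_m


def compatible_with_pool_filling(peak_volume_m3, pool_volume_m3, tolerance=0.25):
    """True when the peak volume lies within +/- tolerance of the pool volume.

    The 25% tolerance is an illustrative assumption.
    """
    return abs(peak_volume_m3 - pool_volume_m3) <= tolerance * pool_volume_m3


volume = estimated_pool_volume(50.0)  # hypothetical pool with 50 m2 of surface
plausible = compatible_with_pool_filling(80.0, volume)
implausible = compatible_with_pool_filling(20.0, volume)
```

A peak whose volume is far from the estimated pool volume would then be examined for other causes, such as a leakage or a human error.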
Table 4.1: Information regarding the extreme observations of the 57 outdoor water meters.
Mean estimated pool volume                          75.85 m3
Median estimated pool volume                        71.30 m3
Average duration of pool filling                    30.90 hours
Average water flow during pool filling              3.2 m3/h
Events caused possibly by pool filling              12
Estimated volume of water spent filling pools       950.760 m3
Events caused possibly by filling of a reservoir    14
Events with unknown cause                           16
In addition, we inspected more closely the peaks that were ruled out as being caused by the filling
of a pool. In some cases, we see a continuous consumption with the same water flow that begins at the
end of the afternoon and stops in the morning of the next day. We believe that these cases happened
due to human error.
For other peaks in the consumption during the Winter months, it was not possible to discover the cause.
The same holds for some peaks that presented a variable water flow, showing an erratic pattern.
4.3 Time Series Clustering
In this Section, the clustering results of the 57 time series are described. We discuss the different
clustering algorithms and dissimilarity measures used to group the clients by consumption pattern, in
order to find the algorithm and dissimilarity measure that best fit the data. Moreover, some exploratory
work was done to understand how different are the groups obtained and in what way they are different.
So far, we have been working with the whole data set, from 01/01/2015 to 30/11/2017, but from this
point on we worked with the data until 31/07/2017, leaving the last four months as a test set. The aim
was to group the series by consumption pattern and not by scale. For this reason, it was necessary to
normalize the time series before applying the clustering algorithms. The goal is to build a model for each
one of the resulting clusters.
We applied three normalizations to each time series, the Standard, the "Median-Mad" and the "Min-Max",
whose formulas are displayed in Equations 4.1, 4.2 and 4.3, respectively.
\[ y_{t_i} = \frac{x_{t_i} - \operatorname{mean}_i(x_{t_1}, \dots, x_{t_n})}{\operatorname{sd}_i(x_{t_1}, \dots, x_{t_n})} \tag{4.1} \]
\[ y_{t_i} = \frac{x_{t_i} - \operatorname{median}_i(x_{t_1}, \dots, x_{t_n})}{\operatorname{mad}_i(x_{t_1}, \dots, x_{t_n})} \tag{4.2} \]
\[ y_{t_i} = \frac{x_{t_i} - \min_i(x_{t_1}, \dots, x_{t_n})}{\max_i(x_{t_1}, \dots, x_{t_n}) - \min_i(x_{t_1}, \dots, x_{t_n})} \tag{4.3} \]
where $Y_t = (y_{t_1}, \dots, y_{t_n})$ is the normalized time series, $X_t = (x_{t_1}, \dots, x_{t_n})$ is the original time series,
sd represents the standard deviation and mad stands for median absolute deviation.
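A minimal sketch of Equations 4.1 to 4.3, using only the Python standard library, is given below. Two choices are assumptions of this sketch rather than statements about the study: the sample standard deviation is used for sd, and mad is the unscaled median of the absolute deviations from the median.

```python
import statistics


def standard(x):
    """Equation 4.1: subtract the mean and divide by the standard deviation."""
    m, s = statistics.mean(x), statistics.stdev(x)  # sample sd (assumption)
    return [(v - m) / s for v in x]


def median_mad(x):
    """Equation 4.2: subtract the median and divide by the median absolute deviation."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])  # unscaled MAD (assumption)
    return [(v - med) / mad for v in x]


def min_max(x):
    """Equation 4.3: rescale the series to the interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


# Hypothetical series with one extreme value
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y_standard = standard(x)
y_median_mad = median_mad(x)
y_min_max = min_max(x)
```

All three functions divide by a measure of spread, so a constant series (or, for the "Median-Mad" case, a series whose mad is zero) would need separate handling.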
Next, we show the hierarchical method and distance chosen, the decision of the best number of
clusters, along with a discussion of the clustering results comparing the clusters.
4.3.1 Hierarchical clustering
We applied Ward Method, Single Linkage, Average Linkage and Complete Linkage. However, Single
and Average Linkage gave consistently poor results for all the distances, so we only consider here
Complete Linkage and Ward Method. In addition, the normalization given by Equation 4.2 would lead
to poor clustering results, separating just one time series in one cluster and all the other time series
in another. Also, better results were obtained with the Standard normalization, when compared to the
results obtained with the ”Min-Max” normalization. For that reason, we will only discuss results obtained
with the Standard normalization.
We applied the two clustering algorithms to the normalized set of time series with three different
distances: Dynamic Time Warping (DTW), the Dissimilarity Index Combining Temporal Correlation and Raw
Values Behaviours, and the Periodogram Based Dissimilarity. With both clustering algorithms, the distance
that led to better results was the periodogram based dissimilarity. Complete Linkage performed slightly
better than the Ward Method, when comparing the Dunn, Entropy, Gamma and Silhouette indexes.
These values can be found in Table 4.2. For Complete Linkage, we chose 5 as the best number of
clusters. The steps that led to this choice are explained in Subsection 4.3.2.
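To make the idea of a periodogram based dissimilarity concrete, the sketch below computes the Euclidean distance between the periodogram ordinates of two series of equal length. This is one common form of the measure and is shown as an assumption: the exact variant used in the study (raw, normalized or logarithmic periodograms) is not specified in this section, and the function names are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Periodogram ordinates at the Fourier frequencies 2*pi*j/n, j = 1..n//2."""
    n = len(x)
    ordinates = []
    for j in range(1, n // 2 + 1):
        w = 2 * math.pi * j / n
        s = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(x))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def periodogram_distance(x, y):
    """Euclidean distance between the periodograms of two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    px, py = periodogram(x), periodogram(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(px, py)))


# Hypothetical series: a has period 4, b is a shifted by one step, c has period 8
a = [0.0, 1.0, 0.0, -1.0] * 4
b = a[1:] + a[:1]
c = [math.sin(2 * math.pi * t / 8) for t in range(16)]
d_ab = periodogram_distance(a, b)  # close to zero: same cycle, different phase
d_ac = periodogram_distance(a, c)  # clearly positive: different cycle length
```

Because the periodogram ignores phase, two series with the same cycles but shifted in time are considered close, which suits a grouping by consumption pattern. The direct sum used here costs O(n^2) per series; for long daily series a fast Fourier transform would normally be used instead.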
Table 4.2: Comparison of the values of the four indexes for the best number of clusters for Ward Method and Complete Linkage with periodogram based distance when using the Standard normalization.
                   Number of clusters   Dunn    Entropy   Gamma   Silhouette
Ward Method        5                    0.251   1.590     0.781   0.218
Complete Linkage   5                    0.321   1.600     0.794   0.201
[Figures 4.11 to 4.14 axes: Number of Clusters (2 to 8) vs. index value]
Figure 4.11: The number of clusters versus Dunn index.
Figure 4.12: The number of clusters versus Entropy.
Figure 4.13: The number of clusters versus Gamma index.
Figure 4.14: The number of clusters versus Silhouette index.
4.3.2 Choosing the best number of clusters
To choose the best number of clusters, we used the indexes Dunn, Entropy, Gamma and Silhouette. We
computed the values of the different indexes for k clusters, k varying between 2 and 8. The plots for the
four indexes are presented below in Figure 4.11, Figure 4.12, Figure 4.13 and Figure 4.14.
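As an example of how one of these indexes is obtained, the sketch below computes the Dunn index from a distance matrix and a vector of cluster labels, taking the smallest distance between members of different clusters divided by the largest distance between members of the same cluster. Other definitions of the between-cluster distance and of the cluster diameter exist, so this form is an assumption of the sketch, as is the function name.

```python
def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    n = len(labels)
    between, within = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                within.append(dist[i][j])
            else:
                between.append(dist[i][j])
    if not between or not within or max(within) == 0:
        raise ValueError("need at least two clusters and one non-trivial cluster")
    return min(between) / max(within)


# Hypothetical example: four points on a line forming two compact groups
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
value = dunn_index(dist, [0, 0, 1, 1])
```

Higher values indicate compact and well separated clusters, which is why the maximum over k is sought for this index.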
Table 4.3: Best number of clusters according to each index using complete linkage method with periodogram based distance.
                          Dunn   Entropy   Gamma   Silhouette
Best number of clusters   8      2         8       2
Even though the maximum values of both the Dunn and Gamma index occur with 8 clusters, they
also have high values for 5 clusters and the difference is not very relevant. For both the Entropy and
Silhouette, the best number of clusters is 2. Since it is not relevant to divide the set of time series into
two groups, we chose 5 as the best number of clusters. The partition of the dendrogram into 5 clusters
can be found in Figure 4.15 and the size of each cluster is specified in Table 4.4.
Table 4.4: Size of each cluster.
Cluster   1    2    3   4   5
Size      20   11   7   6   13
[Figure 4.15: dendrogram of the series V1 to V57; y-axis: Height; leaves coloured by cluster (1 to 5)]
Figure 4.15: Partition of the 57 time series in 5 clusters.
4.3.3 Discussion of the clustering results
Once we had the clustering results, we intended to understand better the differences between each
cluster, what characterized them and how different they are from each other. For that, we resorted to
several plots.
First, we calculated a representative series for each one of the 5 clusters. The representative series of
a cluster is calculated by taking, at each time point t, the mean of all the time series in that cluster at time
t. Then, this series is normalized with the Standard normalization. As an example, the representative
series of Cluster 1 is shown in Figure 4.16.
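The construction of a representative series can be sketched as follows. It is assumed here that all series in the cluster have the same length and are aligned in time, and that the sample standard deviation is used in the Standard normalization; the function name and the data are hypothetical.

```python
import statistics


def representative_series(cluster):
    """Pointwise mean of the series in a cluster, followed by Standard normalization."""
    lengths = {len(series) for series in cluster}
    if len(lengths) != 1:
        raise ValueError("all series must have the same length")
    # Mean over the cluster members at each time point t
    mean_series = [statistics.mean(values) for values in zip(*cluster)]
    m, s = statistics.mean(mean_series), statistics.stdev(mean_series)
    return [(v - m) / s for v in mean_series]


# Hypothetical cluster with two short series
cluster = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
rep = representative_series(cluster)
```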
[Figure 4.16 axes: Time (days) vs. Value]
Figure 4.16: Representative series of Cluster 1 between 01/01/2015 and 31/07/2017.
For each cluster, we aggregated the consumption to monthly consumption, then normalized each
time series (with Standard Normalization) and aggregated by the median in order to compare the pattern.
In Figure 4.17, the plot of the normalized monthly consumption per cluster is presented. The patterns
are similar for all clusters and there is not a clear difference between each cluster.
[Figure 4.17 axes: Time (months) vs. Value, one series per cluster (1 to 5)]
Figure 4.17: Normalized monthly consumption aggregated by the median for each cluster between January 2015 and July 2017.
In Figure 4.18, the boxplot of the outdoor area per cluster is presented. Cluster 4 has a higher
median value, followed by Cluster 5, while Clusters 1, 2 and 3 have smaller values that are very close to
each other. It seems that, even though the outdoor area was not used to group the series, it is implicit in
the clusters. Cluster 4 has members with larger outdoor areas, suggesting that some of the clients with
bigger lots have a similar consumption pattern. A summary of the outdoor
areas by cluster is presented in Table 4.5.
[Figure 4.18 axes: Cluster (1 to 5) vs. Area (m2)]
Figure 4.18: Boxplot of the outdoor area per cluster.
Table 4.5: Summary of the outdoor areas per cluster.
Outdoor area (m2)   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Minimum             429.9       158         775.8       901.5       680.8
Median              1264        1318        1361        1839        1514
Mean                1396        1406        1473        2096        1684
Maximum             2850        2494        2594        4185        3417
The boxplot of the mean daily consumption of each outdoor water meter per cluster is in Figure 4.19.
Note that these values are not normalized, since we intended to understand the consumption scale of
each cluster. In this plot, we see again that Cluster 4 stands out by having the highest median value.
Clusters 1 and 3 have similar medians, while Cluster 2 has a lower median value and Cluster 5 has the
lowest median value.
[Figure 4.19 axes: Cluster (1 to 5) vs. Value (m3)]
Figure 4.19: Boxplot of the mean daily water consumption for garden watering per cluster.
The members of Clusters 1, 2 and 3 have some similarities between them that separate them from
Cluster 4 and Cluster 5. Also, Cluster 4 is very different from the others. Additionally, the representative
series of Clusters 1, 2 and 3 did not seem to be very different. Therefore, we decided to join the first
3 clusters into a new Cluster 1, hence avoiding building five different models. In Figure 4.20, the
boxplot per month of the normalized monthly consumption of the new Cluster 1 is presented. In this
figure, the yearly seasonality is very clear: the summer months (June, July and August) correspond
to higher values and the months of December, January and February correspond to lower values. In
Figure 4.21, we show the boxplot per month of the year of the normalized monthly consumption of the
same cluster, where we can see that July and August are quite similar to each other in terms of median
value and variability, while June has a lower median value with respect to these months. Also, January,
February and December have median values close to each other.
We also verified that there is no difference between weekdays and weekend days: as can be seen in
Figure 4.22, the median value remains approximately the same for all days of the week.
We looked at the daily pattern per month of each cluster. In Figure 4.23, we show the plot for Cluster
1. We can see that there are two peaks during the day, one around 5 a.m. and the other around 10 p.m.
In the months of June, July, August and September, these peaks are much more pronounced than in the
months of January, February, November and December. During daytime hours, between 8 a.m. and 7 p.m.,
the values are much lower and in the months of January, February, November and December
they are close to zero. These results are particularly important for the water utility company to define
the period of the day for the real loss analysis. Usually, this period is defined during the night, when there is
approximately no indoor consumption. However, in a region with a very significant outdoor water use,
the period to monitor real losses should be during the day.
[Figure 4.20 axes: Time (months) vs. Value]
Figure 4.20: Boxplot of the normalized monthly consumption of the new Cluster 1 between January 2015 and July 2017.
[Figure 4.21 axes: Months (Jan to Dec) vs. Value]
Figure 4.21: Boxplot per month of the year of the normalized monthly consumption of the new Cluster 1.
[Figure 4.22 axes: Day of the Week (Monday to Sunday) vs. Value]
Figure 4.22: Boxplot per day of the week of the new Cluster 1.
[Figure 4.23 axes: Hour (0 to 23) vs. Value, one line per month (1 to 12)]
Figure 4.23: Daily pattern per month of the new Cluster 1.
Furthermore, while building the models we encountered difficulties in finding a good model for Cluster
4. For that reason, we decided to apply clustering to the 6 members of this cluster. With this, we found
that 2 members were in fact quite different from the other 4 members, as well as different from each
other. Therefore, these 2 members were no longer considered for building the models. In Table 4.6, the
size of the final clusters is indicated.
Table 4.6: Size of each cluster.
Cluster   1    2   3
Size      38   4   13
In Figure 4.24, the boxplot of the outdoor area per cluster for the final 3 clusters is shown. Cluster
2 is significantly different from the other 2 clusters, having the highest median, 1815 m2, and highest
variability. As for Cluster 1 and Cluster 3, the outdoor areas of these two clusters do not differ as much:
the median value of Cluster 3, equal to 1514 m2, is slightly higher than the median of Cluster 1, 1300 m2.
The summary of the outdoor areas per cluster can be seen in Table 4.7.
Table 4.7: Summary of the outdoor areas per cluster (final clusters).
Outdoor area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum             158         901.5       680.8
Median              1300        1815        1514
Mean                1413        2171        1684
Maximum             2850        4155        3417
[Figure 4.24 axes: Cluster (1 to 3) vs. m2]
Figure 4.24: Boxplot of the outdoor area per cluster (final clusters).
[Figure 4.25 axes: Cluster (1 to 3) vs. m2]
Figure 4.25: Boxplot of the estimated garden area per cluster (final clusters).
Table 4.8: Summary of the estimated watered areas per cluster (final clusters).
Estimated watered area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum                       158         554.5       667.4
Median                        1005        1488        1179
Mean                          1082        1708        1354
Maximum                       2415        3301        2853
In Figure 4.26, the boxplot of the building area per cluster is presented. Again, Cluster 2 stands out
with the highest median value, equal to 695.2 m2. For Cluster 1 and Cluster 3, the median values are very
close to each other, being equal to 409.3 m2 and 424.4 m2, respectively.
Table 4.9: Summary of the building areas per cluster (final clusters).
Building area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum              177.4       468.1       259.7
Median               409.3       695.2       424.4
Mean                 405.9       664         431.4
Maximum              638         797.6       570.1
[Figure 4.26 axes: Cluster (1 to 3) vs. m2]
Figure 4.26: Boxplot of the building area per cluster (final clusters).
In Table 4.10, we see that the average ratio between the outdoor area and the lot area of each cluster
is quite similar for all three. Therefore, this measure cannot be used to differentiate the clusters.
Table 4.10: Average ratio between outdoor area and lot area per cluster (final clusters).
Cluster 1 2 3
Percentage 79.3% 78.1% 80.8%
The time series were not clustered by the scale of the consumption, as we can see in Figure 4.27,
where the mean daily consumption of each water meter is plotted against the outdoor area. All the
clusters have members with high, average and low mean daily consumption.
[Figure 4.27 axes: Area (m2) vs. Value (m3), points grouped by cluster (1 to 3)]
Figure 4.27: Scatterplot of the mean daily consumption versus outdoor area grouped by cluster (final clusters).
We also looked at the mean estimated pool volume per cluster, to investigate if there were significant
differences in pool size between the clusters. In Table 4.11, the mean estimated pool volume per cluster
is shown.
Table 4.11: Mean estimated pool volume per cluster.
Cluster 1 2 3
Volume (m3) 74.5 58.4 77.8
Note that it was not possible to estimate the pool volume for certain lots, as explained in Section 4.2. The mean estimated pool volumes of Cluster 1 and Cluster 3 are not too far apart, while the value of Cluster 2 is noticeably lower. However, Cluster 2 has only 4 members and the pool volume could not be estimated for 2 of them, so its mean is based on only 2 lots.
Moreover, we analysed the monthly peak factor of each cluster for the years 2015 and 2016; the values are presented in Table 4.12. The monthly peak factor is the ratio of the maximum monthly consumption observed during a year to the average monthly consumption of the same year. In 2015, all of the clusters had their highest monthly consumption in July, with no substantial difference between the monthly peak factors of the three clusters. In 2016, Clusters 1 and 3 had their highest monthly consumption in August and Cluster 2 in July. Again, the monthly peak factors of the three clusters do not differ substantially.
Table 4.12: Monthly peak factor per cluster for 2015 and 2016.
Cluster   1      1        2      2      3      3
Year      2015   2016     2015   2016   2015   2016
Month     July   August   July   July   July   August
Value     1.88   2.07     2.26   2.26   2.00   2.17
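The monthly peak factor defined above can be computed directly from the twelve monthly totals of a year. The following is a minimal sketch in Python; the function name monthly_peak_factor and the example values are hypothetical and are not taken from the data set of this study.

```python
def monthly_peak_factor(monthly_consumption):
    """Return (peak factor, zero-based index of the peak month) for one year."""
    if not monthly_consumption:
        raise ValueError("at least one monthly value is required")
    # Average monthly consumption of the year.
    average = sum(monthly_consumption) / len(monthly_consumption)
    # Maximum monthly consumption observed during the same year.
    peak = max(monthly_consumption)
    return peak / average, monthly_consumption.index(peak)


# Hypothetical monthly totals (m3), January to December; not thesis data.
example_year = [40, 45, 60, 90, 120, 150, 188, 170, 130, 80, 50, 40]
peak_factor, peak_month_index = monthly_peak_factor(example_year)
print(f"peak factor = {peak_factor:.2f}, peak month index = {peak_month_index}")
```

In this sketch the peak month is returned as a zero-based index, so index 6 corresponds to July.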
4.4 Modeling garden watering demand using GAM
In this section, we show the steps taken to build the GAM models for each of the 3 clusters. We discuss the candidate explanatory variables and the process that led to the final models. Finally, we present and discuss the future values predicted for 2 of the 3 clusters. The data had already been split into train and test sets, as mentioned in Section 4.3.
We fitted three GAM models to each cluster, one for each of three representative series: the aggregation by the mean (representative series Mean), the aggregation by the 95% quantile (representative series Q95%) and the aggregation by the 25% quantile (representative series Q25%). In Figure 4.28, the representative series Mean of Cluster 1 is plotted. Fitting a model to each of these representative series gives more information when forecasting future values: the predictions obtained from the models of representative series Q95% and Q25% serve as consumption intervals. That is, we followed a non-parametric approach instead of computing prediction intervals.
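The three representative series described above can be sketched as a day-by-day aggregation over the members of a cluster. This is only an illustration: the quantile uses linear interpolation between order statistics, which is an assumption (the interpolation rule used in this study is not stated here), and the function names and data are hypothetical.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed rule)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def representative_series(cluster_series, aggregate):
    """Aggregate the series of one cluster at each time point t."""
    return [aggregate(day_values) for day_values in zip(*cluster_series)]


# Hypothetical cluster with three water meters observed over four days.
cluster = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
]
series_mean = representative_series(cluster, lambda v: sum(v) / len(v))
series_q95 = representative_series(cluster, lambda v: quantile(v, 0.95))
series_q25 = representative_series(cluster, lambda v: quantile(v, 0.25))
```

All series in a cluster are assumed to have the same length and to be aligned on the same days.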
Note that we also tried removing the extreme consumptions mentioned in Section 4.2 from each time series, in order to verify whether the models would yield better results. However, these models performed worse than the models built with the original data, having a higher MAPE value. Therefore, the models were built with the original data, without removing any of the extreme observations.
We present in detail the steps taken to build the model for representative series Mean of Cluster 1. A model for the series aggregated by the median was also built; however, it led to poorer results, which are therefore not shown. For comparison, we show and comment on the models and results obtained for both Cluster 1 and Cluster 2 in Section 4.4.4. In Appendix B, we present the forecast results for the models of Cluster 3.
Figure 4.28: Representative series Mean of the new Cluster 1 between 01/01/2015 and 31/07/2017.
We began by checking whether the representative series Mean of Cluster 1 is stationary. For that, the KPSS test was used and gave a p-value equal to 0.04475. Considering a significance level of 5%, the null hypothesis, which states that the series is stationary, should be rejected; thus the test suggests that the series is not stationary. Therefore, we applied the difference operator once. To determine whether it was necessary to apply a Box-Cox transformation to the data, we tried this transformation with different λ values and checked the sample variance of the resulting series. Since none of the transformations led to a lower sample variance, no Box-Cox transformation was applied.
4.4.1 Explanatory variables selection
To verify whether a past lag of the response variable should be included in the model, the sample Autocorrelation Function (ACF) and sample Partial Autocorrelation Function (PACF) of the differenced series were computed; they are shown in Figure 4.29 and Figure 4.30. Both functions present a significant spike at lag 1, as well as significant spikes around lags 7, 14 and 21. Note that the ACF and PACF are symmetric with respect to the y-axis, as mentioned in Section 3.1, thus there are also significant spikes at lags −1, −7, −14 and −21. A significant spike in the ACF and PACF at lag −1 means that the response variable at time t is correlated with the response variable at time t − 1, in our case the previous day. This seems to indicate that some seasonality is present in the data and that lag −1 or lag −7 of the response variable might be needed in the model.
Figure 4.29: Sample ACF of the response variable.
Figure 4.30: Sample PACF of the response variable.
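The check for significant spikes in the sample ACF can be sketched as follows. This is a minimal illustration, assuming the usual approximate 95% bounds of ±1.96/√n for a white-noise series; the function names and the synthetic weekly series are hypothetical.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    total = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / total
        for lag in range(max_lag + 1)
    ]


def significant_lags(acf_values, n):
    """Lags (excluding 0) whose autocorrelation lies outside +/- 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    return [lag for lag, value in enumerate(acf_values) if lag > 0 and abs(value) > bound]


# Synthetic series with a spike every 7 days (not thesis data).
weekly = [1.0 if t % 7 == 0 else 0.0 for t in range(70)]
acf = sample_acf(weekly, max_lag=21)
spikes = significant_lags(acf, len(weekly))
```

For a series with a purely weekly pattern such as this one, only the multiples of 7 are flagged, which mirrors the spikes around lags 7, 14 and 21 described above.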
We proceeded to compute the cross-correlations between the response variable and the meteorological variables: mean, maximum and minimum daily temperatures and daily accumulated precipitation. Keep in mind that the temperature series were differenced once, since they were not stationary. In Figure 4.31, which shows the sample CCF between the differences of the mean temperature and the response variable, the most significant lags are −6 and 16, with correlation values equal to 0.095 and −0.085, respectively. In Figure 4.32, where the sample CCF between the differences of the maximum temperature and the response variable is presented, there are some significant lags, namely lags 12, 13 and 16, with values equal to 0.088, −0.099 and 0.088, respectively. In Figure 4.33, where the sample CCF between the differences of the minimum temperature and the response variable is shown, lag 24 is the most significant one, with a value equal to 0.105. In the end, none of the temperature variables were used in the model, since they were not significant in the model and, when present, did not result in better forecasts. As for the precipitation, in Figure 4.34, lag 0 and lag −1 are the most significant, with cross-correlation values equal to −0.143 and −0.133, respectively. Both lags of the precipitation were tested when building the models, but lag −1 was the one used in the model.
A variable for the event of precipitation was also tested when building the models, to verify whether it improved on the variable of precipitation quantity. The event of precipitation, EventPrecip_t, is equal to 1 if the value of precipitation was higher than zero on day t and 0 if there was no occurrence of precipitation. Jain et al. [13] verified that the occurrence of rainfall was a more significant variable than the amount of rainfall. In our case, however, including the event of precipitation generally did not improve the forecast accuracy; this variable was kept in only one model, where it did improve the forecast accuracy.
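The sample cross-correlation at a given lag and the precipitation event indicator can be sketched as below. The convention assumed here is that lag k relates the explanatory series at time t + k to the response at time t, so that lag −1 pairs yesterday's precipitation with today's response; the function names and data are hypothetical.

```python
import math


def cross_correlation(x, y, lag):
    """Sample correlation between x[t + lag] and y[t] (assumed lag convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    scale = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    total = sum(
        (x[t + lag] - mean_x) * (y[t] - mean_y)
        for t in range(n)
        if 0 <= t + lag < n
    )
    return total / scale


def precipitation_event(precipitation):
    """1 if it rained on day t, 0 otherwise."""
    return [1 if amount > 0 else 0 for amount in precipitation]


# Hypothetical data: consumption drops on the day after each rainy day.
precipitation = [0, 5, 0, 0, 3, 0, 0, 0]
response = [0, 0, -1, 0, 0, -1, 0, 0]
ccf_lag_minus_1 = cross_correlation(precipitation, response, -1)
ccf_lag_0 = cross_correlation(precipitation, response, 0)
events = precipitation_event(precipitation)
```

In this toy example the cross-correlation at lag −1 is negative, which is the direction one would expect if rain reduces watering on the following day.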
Figure 4.31: CCF between the differenced mean temperature and the differenced representative series.
Figure 4.32: CCF between the differenced maximum temperature and the differenced representative series.
Figure 4.33: CCF between the differenced minimum temperature and the differenced representative series.
Figure 4.34: CCF between the accumulated precipitation and the differenced representative series.
The Month variable, which takes values from 1 to 12, is also included in the model; it represents the yearly seasonality present in the data.
Furthermore, we include in the model the Impulse variable. Impulse is equal to 1 only on the first consecutive days of rain in October and 0 otherwise. It can be seen as a variable that represents the transition from the summer to the winter season, which brings a more sudden change in the mean consumption than the transition from winter to summer. Since this binary variable represents an event that happens once every year, it needs to be represented in the model.
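One possible way to construct the Impulse variable is sketched below. The interpretation is an assumption: Impulse is set to 1 on every day of the first run of consecutive rainy days in October of each year, and the function name and data are hypothetical.

```python
from datetime import date, timedelta


def impulse_variable(dates, precipitation):
    """1 on the first run of consecutive rainy October days of each year, else 0.

    Assumes `dates` is sorted and contains consecutive days.
    """
    impulse = [0] * len(dates)
    finished_years = set()  # years whose first October rain run has already ended
    in_run = False
    for i, (day, amount) in enumerate(zip(dates, precipitation)):
        rainy_october_day = day.month == 10 and amount > 0
        if rainy_october_day and day.year not in finished_years:
            impulse[i] = 1
            in_run = True
        elif in_run:
            # The run ended on the previous day; later rain in that year is ignored.
            finished_years.add(dates[i - 1].year)
            in_run = False
    return impulse


# Hypothetical precipitation (mm) for 1 to 10 October 2016.
days = [date(2016, 10, 1) + timedelta(days=k) for k in range(10)]
rain = [0, 0, 2, 5, 1, 0, 0, 4, 0, 0]
impulse = impulse_variable(days, rain)
```

Here only 3, 4 and 5 October are flagged; the rain on 8 October belongs to a later run and is ignored.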
The Trend variable captures the trend of the representative series, which is calculated by taking the
trend component of the STL decomposition of the representative series.
4.4.2 Modeling
In order to find a model that fits the data and predicts values as close as possible to the real values, we built several models with different combinations of the variables discussed in Subsection 4.4.1 and of interactions between them. At this stage, the following variables were considered: lags −1 and −7 of the response variable, lags −6 and 16 of the differenced mean temperature, lags 12, 13 and 16 of the differenced maximum temperature, lag 24 of the differenced minimum temperature, lags 0 and −1 of the daily accumulated precipitation, Month, Trend and Impulse. We also compared models with a smooth function f(.) applied to a certain variable against models without the smooth function applied to the same variable, to verify which variables require a smooth function. To compare the forecast accuracy of the models, we used the MAPE (Mean Absolute Percentage Error) and chose the model with the lowest MAPE value as the best one.
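The selection rule above can be sketched as follows. The MAPE is written here in its usual form, the mean of |actual − predicted| / |actual| expressed as a percentage, which is assumed to agree with Equation 3.58; the candidate names and values are hypothetical.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test-set values and forecasts of two candidate models.
actual = [10.0, 8.0, 5.0]
candidates = {
    "candidate_A": [9.0, 8.0, 6.0],
    "candidate_B": [12.0, 6.0, 5.0],
}
scores = {name: mape(actual, forecast) for name, forecast in candidates.items()}
best_model = min(scores, key=scores.get)  # model with the lowest MAPE
```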
We show the "best" models fitted to the three representative series of Cluster 1. The models of representative series Mean, Q95% and Q25% are indicated as Model 1, Model 2 and Model 3, respectively.
Model 1: y_t = y_{t-1} + y_{t-7} + Prec_{t-1} + f(Month) + Trend + Impulse + Impulse × Trend    (4.4)
Model 2: y_t = f_1(Month) + f_{23}(y_{t-3}, Month) + Trend + Impulse + Impulse × Trend    (4.5)
Model 3: y_t = f_1(y_{t-1}) + y_{t-7} + Trend + Month + Trend × Month + Impulse + Impulse × Trend + f_2(Month)    (4.6)
where y_{t-i} represents lag −i of the response variable, f(.) represents smooth functions, and Month, Trend and Impulse are as explained in Subsection 4.4.1.
Model 1 (Equation 4.4), which fits the differenced representative series Mean (response variable y_t), was built with the following terms: lags −1 and −7 of the response variable (y_{t-1} and y_{t-7}, respectively); lag −1 of the precipitation variable, Prec_{t-1}; a smooth function applied to the Month variable, representing the seasonality present in the data; the Trend variable, representing the trend present in the data; the Impulse variable, representing the first days of consecutive rain in October; and an interaction between Impulse and Trend, representing the shift in the values that occurs in those first days of consecutive rain in October.
4.4.3 Analysis of the Residuals
Once the "best" GAM model, Model 1 (Equation 4.4), was chosen, it was necessary to analyse its residuals, namely to check their stationarity and whether they follow a Normal distribution. The KPSS test applied to the residuals gave a p-value greater than 0.10, which suggests that they are stationary at the usual significance levels (1%, 5% and 10%). Figure 4.35 and Figure 4.36 show the histogram and the QQ-plot of the residuals, respectively. In Figure 4.35, the pattern is similar to a bell shape around zero, with a slight negative skew, probably due to the extreme observations. In Figure 4.36, the residuals follow the straight line, deviating only in the tails. Thus, both plots indicate that the residuals seem to follow a Normal distribution.
The plot of the residuals versus the linear predictor is shown in Figure 4.37. The points appear to be randomly distributed around zero without any clear pattern, which suggests that the residuals are uncorrelated with the linear predictor.
Figure 4.35: Histogram of the residuals of Model 1.
Figure 4.36: QQ-Plot of the residuals of Model 1.
Figure 4.37: Residuals versus the linear predictor of Model 1.
4.4.4 Forecast
We used the chosen model to predict values from 10 August 2017 until 30 November 2017. These predictions were compared with the actual values in the test set. Note that, to compute the predictions, the values of 2016 were used for the lags of the response variable, as well as the trend of 2016 for the variable Trend, since it was the most recent data period available.
In order to show the forecast results in the original scale, we must reverse the transformations applied to the data. First, the differences were inverted, using the last observation in the train set, 31/07/2017, as the initial point. Then, the normalization (shown in Equation 4.1) was reversed, using the formula shown in Equation 4.7. The daily forecasts in the original scale between 10/08/2017 and 30/11/2017 can be found in Figure 4.38.
X_t = Y_t × sd(X_t) + mean(X_t)    (4.7)
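The two back-transformations can be sketched as a cumulative sum started at the last training observation, followed by Equation 4.7. This is a minimal illustration with hypothetical names and data; for simplicity, the mean and standard deviation are computed over the whole toy series, which is an assumption about how the normalization constants are obtained.

```python
import statistics


def invert_differences(differences, last_observation):
    """Rebuild levels from first differences, starting at the last training value."""
    levels = []
    current = last_observation
    for difference in differences:
        current += difference
        levels.append(current)
    return levels


def denormalize(normalized, mean, sd):
    """Equation 4.7: X_t = Y_t * sd + mean."""
    return [y * sd + mean for y in normalized]


# Hypothetical series: the first three values play the role of the train set.
original = [4.0, 6.0, 5.0, 7.0, 8.0]
mean, sd = statistics.mean(original), statistics.stdev(original)
normalized = [(x - mean) / sd for x in original]
last_train_value = normalized[2]
# Perfect "forecasts" of the differenced, normalized series, for the round-trip check.
predicted_differences = [normalized[3] - normalized[2], normalized[4] - normalized[3]]
recovered = denormalize(
    invert_differences(predicted_differences, last_train_value), mean, sd
)
```

With exact differences, the round trip recovers the last two values of the original series up to floating-point error.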
To analyse the accuracy of the model, we calculated the MAPE, which gave a value of 9.959%. The MAPE is calculated according to Equation 3.58, using the predictions in the original scale and the real aggregated values; the lower the percentage, the better the forecast accuracy of the model.
Figure 4.38: Daily forecast of the model of representative series Mean (Model 1, Equation 4.4) of Cluster 1 and the real aggregated values by the mean, both in the original scale, between 10/08/2017 and 30/11/2017.
In Figure 4.39, the daily forecasts from 16/08/2017 until 30/11/2017 of the models of representative series Mean, Q95% and Q25% are shown, as well as the real aggregated values by the mean. The forecasts of the models of representative series Q95% and Q25% (aggregated by the 95% quantile and by the 25% quantile, respectively) can be seen as consumption intervals for the forecasts of the mean consumption. The model of representative series Q95% had a MAPE value equal to 19.71%. For the model of representative series Q25%, the MAE (Mean Absolute Error), calculated by Equation 3.59, was used instead, since this series takes values close to zero, for which percentage errors are unstable. Its value was equal to 0.867.
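The reason for switching to the MAE when the observed values are close to zero can be illustrated with a small sketch. The MAE is written in its usual form, the mean of |actual − predicted|, assumed to agree with Equation 3.59; the values are hypothetical.

```python
def mae(actual, predicted):
    """Mean Absolute Error, in the units of the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percentage_errors(actual, predicted):
    """Individual absolute percentage errors, in percent."""
    return [100 * abs((a - p) / a) for a, p in zip(actual, predicted)]


# Hypothetical low-consumption series: every forecast is off by the same 0.5 m3.
actual = [0.02, 0.5, 1.0]
predicted = [0.52, 1.0, 1.5]
error_mae = mae(actual, predicted)
errors_pct = percentage_errors(actual, predicted)
```

The absolute error is the same on the three days, but the percentage error of the first day is far larger than the others because the observed value is close to zero, so a mean of percentage errors would be dominated by that single day.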
Figure 4.39: Daily forecast of Model 1 (Equation 4.4), Model 2 (Equation 4.5) and Model 3 (Equation 4.6) of Cluster 1 and the real aggregated values by the mean in the original scale between 16/08/2017 and 30/11/2017.
The models of Cluster 2 performed worse than the models of Cluster 1 or Cluster 3. Note that the forecast interval is not the same for all the models, because of the lags of the explanatory variables used in each one. Below, the models that were fitted to the 3 representative series are presented, respectively. Note that, in R, the gamm function from the package mgcv fits generalized additive mixed models to the data and allows the residuals of the model to be fitted with an ARMA model.
Model 4: y_t = f_{12}(y_{t-1}, Prec_{t-18}) + f_3(y_{t-2}) + f_4(Month) + Trend, with residuals ε ∼ ARMA(2, 1)    (4.8)
Model 5: y_t = f_1(y_{t-2}) + f_2(DiffMinTemp_{t+14}) + EventPrec_{t-18} + Trend + Month + Trend × Month, with residuals ε ∼ ARMA(2, 1)    (4.9)
Model 6: y_t = β + f_1(DiffMaxTemp_{t-13}) + f_2(y_{t-1}) + y_{t-6} + Trend + Month + Trend × Month + f(Month)    (4.10)
The model of representative series Mean had a MAPE equal to 35.327% for the forecast interval between 27/08/2017 and 30/11/2017; its forecasts are shown in Figure 4.40. The model of representative series Q95% had a MAPE equal to 30.024% and the model of representative series Q25% had a MAE equal to 2.098.
The forecast interval bands between 19/08/2017 and 16/11/2017 obtained by the models of Cluster 2 are shown in Figure 4.41. As can be seen there, the real aggregated mean of this cluster is not always contained in the interval bands: some predictions of the representative series Q95% model are lower than the real aggregated mean.
Figure 4.40: Daily forecast of the model of representative series Mean (Model 4, Equation 4.8) of Cluster 2 and the real aggregated values by the mean in the original scale between 27/08/2017 and 30/11/2017.
Figure 4.41: Daily forecast of Model 4 (Equation 4.8), Model 5 (Equation 4.9) and Model 6 (Equation 4.10) of Cluster 2 and the real aggregated values by the mean in the original scale between 19/08/2017 and 16/11/2017.
When a new construction begins in the area, the only information available about the new client is the lot, outdoor and building areas. There is no a priori information about the behaviour or consumption pattern of the new client. Thus, we can use the outdoor area to determine the cluster whose model can be used to predict a possible consumption pattern for this new client, and the boxplot in Figure 4.24 to decide which model to use. For example, for a new lot that has an outdoor area of 2000 m2, the model of Cluster 2 should be used. If the lot has an outdoor area of 1200 m2, then the models of both Cluster 1 and Cluster 3 can be used and we can take the average of their predictions.
In addition, these models may be used to identify clients that may have a borehole. By taking advantage of the interval bands, we can check whether a certain client is consistently below the values of the interval. If that is the case, then the client has a suspiciously low water use when compared with the set of clients used to build the model, and there is the possibility that the client has a borehole.
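The screening idea above can be sketched as the share of days on which a client's consumption falls below the lower band (the forecasts of the representative series Q25% model). The threshold of 90% of the days used to define "consistently" is a hypothetical choice, as are the function names and data.

```python
def share_of_days_below_band(client_consumption, lower_band):
    """Fraction of days on which the client is below the lower interval band."""
    below = sum(1 for value, low in zip(client_consumption, lower_band) if value < low)
    return below / len(lower_band)


def possible_borehole(client_consumption, lower_band, threshold=0.9):
    """Flag a client that is below the band on at least `threshold` of the days.

    The default threshold is a hypothetical value, not taken from this study.
    """
    return share_of_days_below_band(client_consumption, lower_band) >= threshold


# Hypothetical lower band (m3/day) and two hypothetical clients.
lower_band = [2.0, 2.0, 3.0, 3.0, 2.0]
client_a = [0.1, 0.2, 0.1, 0.3, 0.2]
client_b = [2.5, 1.8, 3.5, 4.0, 2.2]
flag_a = possible_borehole(client_a, lower_band)
flag_b = possible_borehole(client_b, lower_band)
```

A flag of this kind only marks a client for further inspection; it does not by itself establish that a borehole exists.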
4.5 Daily disaggregation of water consumption
A secondary goal of this dissertation was to disaggregate the consumption of the lots that have a single water meter, that is, to estimate how much of their total consumption corresponds to indoor consumption and how much corresponds to garden watering. In this section, we present the method that was used to disaggregate the daily water consumption in the period between August and November 2017, which relies on the models of the garden watering demand. Moreover, the results obtained by this method are shown.
Since we wish to use the results obtained from modeling the garden watering demand, we began by examining the weight of the outdoor water use in the total monthly consumption of each of the 3 clusters discussed in Subsection 4.3.3. Figures 4.42, 4.43 and 4.44 show stacked bar plots for Clusters 1, 2 and 3, respectively, where the mean monthly total consumption is separated into indoor and outdoor consumption. On top of each bar, the percentage of the total consumption that corresponds to outdoor consumption is indicated. The percentages do not differ much between clusters. As expected from Figure 4.9, garden watering represents the majority of the mean monthly consumption in all clusters. For example, for Cluster 1, the percentage was greater than or equal to 90% between March and September in 2015, between April and September in 2016, and from April through July in 2017.
Figure 4.42: Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 1 between January 2015 and July 2017.
Figure 4.43: Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 2 between January 2015 and July 2017.
Figure 4.44: Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 3 between January 2015 and July 2017.
In Table 4.13, the mean monthly ratio between the garden watering and the total water consumption per cluster is presented for the months of August to November, since we performed the disaggregation method for the same months. This method uses the garden watering demand models, which were fitted to train sets within the period from 01/01/2015 until 31/07/2017 and were tested over the months of August to November 2017. For this reason, the disaggregation method was applied between August 2017 and November 2017.
Table 4.13: Mean monthly ratio between the garden watering and total water consumption per cluster for the months of August, September, October and November and the years 2015 and 2016.
Cluster 1 Cluster 2 Cluster 3
August 0.93 0.89 0.925
September 0.945 0.90 0.94
October 0.88 0.905 0.925
November 0.79 0.865 0.88
As mentioned in Section 4.1, the water utility company that provided the data has almost 3000 clients, of which only 73 have two water meters. So, the majority of the clients have a single water meter that measures both indoor and outdoor water use, and their data can be used to test this method. To select the clients that form this new set, we followed certain criteria. Since we used the garden watering demand models, we wanted the new set to be similar to the water consumption for garden watering data set of 57 clients. First, the clients needed to have data available from 01/01/2015, since the training period used to build the garden watering demand models begins on that date. Second, the clients needed to have a detached house, since this is the housing typology of all of the clients in the water consumption for garden watering data set (with the exception of one apartment). Third, Figure 4.24 shows that the majority of the clients in that data set have an outdoor area between approximately 1100 m2 and 2300 m2, so having an outdoor area in this range was another criterion. Finally, it was important to exclude clients with a very low mean daily water consumption relative to the size of their outdoor area, because it is possible that these clients have a borehole. If a client has a borehole, it will be used to water the garden and the values registered by the water meter will be mainly indoor water use, so the water consumption cannot be disaggregated into indoor and outdoor water use. So, if a client presented a mean daily consumption close to zero, it was not selected.
We ended up with a set of 41 clients that have a single water meter, which we call the single water meter set. The method discussed in Section 3.5 was tested on this set. We outline the steps of the method before discussing some of the results; additional results are shown in Appendix C.
Step 1 We applied the Complete Linkage clustering algorithm with the periodogram-based distance to the normalized single water meter set (normalized with the Standard Normalization, Equation 4.1). Then, we chose the best number of clusters k.
Step 2 Having chosen the optimal number k of single water meter clusters, we built the representative series of each cluster by taking, at each time point t, the mean of all the time series in that cluster at time t. Each representative series was then normalized with the Standard Normalization.
Step 3 We considered a train set composed of the normalized representative series Mean of the 3 water consumption for garden watering clusters, where each series represents its own class. For the test set, we considered the normalized representative series of the k single water meter clusters. We then applied 1-NN (1-Nearest Neighbor) to classify the series of the test set.
Step 4 According to the classification results of 1-NN, the garden watering demand model of representative series Mean of the assigned cluster (discussed in Section 4.4) was used to predict estimates of the total daily consumption for each of the k clusters.
Step 5 Using the appropriate values in Table 4.13, we estimated the future outdoor water use as a fraction of the estimates obtained in Step 4. For all the daily estimates in the same month, we used the same value.
Step 6 The estimates of the indoor water use were obtained as the difference between the estimates of the total consumption (Step 4) and the estimates of the outdoor water consumption (Step 5).
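Steps 5 and 6 amount to a simple split of each daily total estimate. The sketch below uses the Cluster 1 ratios from Table 4.13; the function name and the daily totals are hypothetical.

```python
# Mean monthly share of garden watering in total consumption, Cluster 1 (Table 4.13).
GARDEN_SHARE_CLUSTER_1 = {8: 0.93, 9: 0.945, 10: 0.88, 11: 0.79}


def disaggregate(total_estimates, months, garden_share):
    """Split daily total estimates into outdoor (Step 5) and indoor (Step 6) parts."""
    outdoor = [total * garden_share[month] for total, month in zip(total_estimates, months)]
    indoor = [total - out for total, out in zip(total_estimates, outdoor)]
    return outdoor, indoor


# Hypothetical daily total estimates (m3/day) and the month of each day.
totals = [10.0, 8.0, 5.0]
months = [8, 10, 11]
outdoor, indoor = disaggregate(totals, months, GARDEN_SHARE_CLUSTER_1)
```

All days of the same month share the same ratio, as stated in Step 5, so the split changes only at month boundaries.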
In Step 1, we used the same clustering algorithm and the same distance that were used when clustering the set of 57 exclusively outdoor water meters, as well as the same indices to choose the optimal number of clusters, described in Section 4.3. Also, in Step 3, when applying 1-NN, the periodogram-based distance was used once more, since we wanted to classify the normalized series by similarity of pattern and we had already assessed in Section 4.3 that this distance was the best for that purpose.
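The 1-NN classification with a periodogram-based distance can be sketched as follows. The distance is taken here as the Euclidean distance between the periodogram ordinates of two series of equal length, which is an assumption about the exact definition used in this study; the function names, class labels and series are hypothetical.

```python
import cmath
import math


def periodogram(series):
    """Periodogram ordinates at the Fourier frequencies 2*pi*j/n, j = 1..n//2."""
    n = len(series)
    ordinates = []
    for j in range(1, n // 2 + 1):
        omega = 2 * math.pi * j / n
        dft = sum(x * cmath.exp(-1j * omega * t) for t, x in enumerate(series))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates


def periodogram_distance(a, b):
    """Euclidean distance between periodograms (assumed definition; equal lengths)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(periodogram(a), periodogram(b))))


def classify_1nn(train, query):
    """Return the label of the training series closest to the query series."""
    return min(train, key=lambda label: periodogram_distance(train[label], query))


# Hypothetical training series (one per class) and a phase-shifted query.
n = 28
train = {
    "weekly": [math.sin(2 * math.pi * t / 7) for t in range(n)],
    "fortnightly": [math.sin(2 * math.pi * t / 14) for t in range(n)],
}
query = [math.sin(2 * math.pi * (t + 2) / 7) for t in range(n)]
label = classify_1nn(train, query)
```

Because the periodogram discards phase, the shifted weekly series is matched to the weekly class, which is the sense in which this distance compares series by similarity of pattern.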
By applying Complete Linkage with the periodogram-based distance, and using the Dunn Index, Entropy, Silhouette and Gamma to choose the number of clusters, the test set was partitioned into 5 clusters. We name these clusters Group 1, Group 2, Group 3, Group 4 and Group 5, to avoid confusion with the clusters obtained in Section 4.3. The sizes of these groups are shown in Table 4.14. Note that Group 5 has only one member and was not taken into consideration from this point on. In Figure 4.45, the boxplot of the outdoor area per group is presented. Group 2 has the lowest median value, equal to 1478 m2, and Group 3 has the highest, equal to 1644 m2; however, there is no clear distinction between the groups.
Table 4.14: Group size for the test data set (N = 41).
Group 1 2 3 4 5
Size 9 11 11 9 1
Figure 4.45: Boxplot of the outdoor area per group for the test data set (N = 41).
Proceeding with 1-NN, the representative series of each group was classified according to its similarity to the representative series Mean of Cluster 1, Cluster 2 and Cluster 3, which were obtained for the water consumption for garden watering data set and are described in Section 4.3. The classification results are shown in Table 4.15. This means, for example, that the garden watering demand model of representative series Mean of Cluster 1 will be used to estimate future values of the total consumption of Group 1. In the same way, the model of representative series Mean of Cluster 3 will be used to estimate future values of the total consumption of Group 2. In all cases, we performed the daily estimation between 22/08/2017 and 30/11/2017.
Table 4.15: 1-NN classification results of the groups' representative series according to the clusters obtained for the water consumption for garden watering data set.
Group 1 2 3 4
Classification Cluster 1 Cluster 3 Cluster 3 Cluster 1
As seen in Figure 4.9 and Figures 4.42, 4.43 and 4.44, the outdoor water use of the set of 57 clients studied represents the majority of the total consumption, which is why in Step 4 we use the forecasts of the garden watering demand models as an estimate of the total consumption.
Figure 4.46 and Figure 4.47 show the predictions of the garden watering demand models, used as
estimates of the total consumption, along with the respective real total consumption for Group 1 and
Group 2. To evaluate the accuracy of the models, we calculate the MAPE (Mean Absolute
Percentage Error), Equation 3.58; recall that the lower the MAPE value, the better. For
Group 1, a MAPE of 28.40% was obtained, and for Group 2 the MAPE was 16.25%, the best value
of all four groups. For Group 3 and Group 4, MAPE values of 37.81% and 26.22% were obtained,
respectively. The corresponding plots for Group 3 and Group 4 can be found in Appendix C.
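As a reminder of how this measure is computed, the following sketch assumes the usual definition of the MAPE, that is, the mean of the absolute errors relative to the observed values, expressed as a percentage. The function name and the numbers are illustrative only.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (usual definition assumed)."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an observed value is zero")
    relative_errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative_errors) / len(actual)

# Hypothetical daily totals in m3/day (not data from this study).
real = [10.0, 8.0, 12.0, 5.0]
estimates = [9.0, 10.0, 12.0, 4.0]
print(round(mape(real, estimates), 2))
```

Note that days with zero observed consumption must be handled separately, since the relative error is not defined for them.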
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure 4.46: Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure 4.47: Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.
We can then proceed to Step 5 and Step 6 to obtain the disaggregated values from the estimates of
the total consumption. Figure 4.48 and Figure 4.49 show the estimates of the consumption
disaggregation for Group 1 and Group 2, respectively. The estimates of garden watering are shown in
green, the estimates of the indoor water use in red and the real total consumption in blue.
Again, the respective plots of Group 3 and Group 4 are shown in Appendix C.
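The two disaggregation steps can be sketched as follows: the garden watering estimate is taken as a percentage of the estimated total, and the indoor estimate is the remainder. The share of 0.8 used below is a hypothetical placeholder and not a value from Table 4.13; the function name is also an assumption made for illustration.

```python
def disaggregate(total_estimates, garden_share):
    """Split estimated totals into garden watering and indoor estimates.

    garden_share is the fraction of the total attributed to garden watering
    (hypothetical here; in this work the percentages come from Table 4.13).
    """
    if not 0.0 <= garden_share <= 1.0:
        raise ValueError("garden_share must be between 0 and 1")
    garden = [garden_share * t for t in total_estimates]
    # Indoor consumption is the difference between the total and the garden watering.
    indoor = [t - g for t, g in zip(total_estimates, garden)]
    return garden, indoor

# Hypothetical estimated daily totals in m3/day.
totals = [10.0, 5.0, 7.5]
garden, indoor = disaggregate(totals, garden_share=0.8)
print(garden, indoor)
```

By construction, the two estimates always add up to the estimated total, so any error in the total estimate is shared between the two components.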
[Plot omitted. Axes: Time (days) vs. m3/day; series: Garden watering estimates, Indoor consumption estimates, Real (total).]
Figure 4.48: Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Garden watering estimates, Indoor consumption estimates, Real (total).]
Figure 4.49: Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.
With this method, we were able to obtain satisfactory estimates of the total consumption, which in
turn allow good estimates of the indoor and outdoor water use of clients whose lot has a single water
meter and characteristics similar to those of the 57 lots used to build the garden watering demand models.
We now explain another method that was explored but gave less satisfactory results.
The first step of this method is the same as Step 1 of the method already discussed; therefore, we
again have 4 groups. The second step is to further separate the clusters obtained according to
the outdoor areas of their members. With this method, we also wanted to use the garden watering
demand models to estimate the total consumption, but with the outdoor area as the determinant for
choosing which model to use. Using Figure 4.18 as guidance, we separate each cluster into two groups:
one whose members have outdoor areas between 1100 m² and 1600 m² and another with areas between
1600 m² and 2300 m², as shown in Table 4.16. The groups with smaller outdoor areas are denoted
by S and the groups with larger areas by L. The groups with larger outdoor
areas use the model of Cluster 2, and the groups with smaller outdoor areas use the models of
both Cluster 1 and Cluster 3, by taking the mean of their results. The last steps are, as before, to
estimate the water consumption for garden watering by taking a percentage of the estimates of the
total consumption, according to Table 4.13, and then to estimate the indoor consumption as the
difference between the estimates of the total consumption and the estimates of the water
consumption for garden watering.
Table 4.16: Size of each group.
Cluster    1      2      3      4
Subgroup   S  L   S  L   S  L   S  L
Size       5  4   7  4   5  6   5  4
For some of the groups, reasonably good results were obtained; in other cases, however, the results
were very poor. In particular, inadequate results were obtained for all the larger-area groups, so the
consumption pattern of these groups seems to differ from that of Cluster 2.
Overall, the results of this method were poorer than those of the first method discussed. For
example, when estimating the total water consumption of Group 1 with outdoor area between 1600 m²
and 2300 m² (Group 1 L), the MAPE was equal to 63.41%. This shows that using the outdoor area as the
only determinant is not sufficient and that better results are obtained when the consumption pattern
is taken into account. The results for certain groups are shown in Appendix C.
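The area-based choice of model in this second method can be sketched as below. The sketch assumes that an area of exactly 1600 m² is assigned to the larger-area group, since the text does not state how the boundary is handled; the function name and the forecast values are hypothetical.

```python
def forecast_by_area(outdoor_area_m2, forecasts):
    """Choose the forecast according to the outdoor area of the lot.

    forecasts maps "Cluster 1", "Cluster 2" and "Cluster 3" to lists of daily
    forecasts of equal length. The boundary at 1600 m2 is an assumption.
    """
    if 1600 <= outdoor_area_m2 <= 2300:
        # Larger outdoor areas: use the model of Cluster 2.
        return list(forecasts["Cluster 2"])
    if 1100 <= outdoor_area_m2 < 1600:
        # Smaller outdoor areas: mean of the models of Cluster 1 and Cluster 3.
        return [
            (a + b) / 2.0
            for a, b in zip(forecasts["Cluster 1"], forecasts["Cluster 3"])
        ]
    raise ValueError("outdoor area outside the range covered by the models")

# Hypothetical daily forecasts in m3/day (not results from this study).
example_forecasts = {
    "Cluster 1": [2.0, 4.0],
    "Cluster 2": [9.0, 9.0],
    "Cluster 3": [4.0, 6.0],
}
print(forecast_by_area(1400, example_forecasts))
print(forecast_by_area(2000, example_forecasts))
```

The rule ignores the consumption pattern entirely, which is the limitation that the poor results for the larger-area groups point to.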
Let us consider the estimation idea of Syme et al. [3], as described in Chapter 2. In that paper,
the authors estimated the outdoor water consumption as the difference between the consumption in
summer months and the consumption in winter months. If we attempt to use this idea, the first question
that arises is how to select the "summer months" and "winter months". As already discussed,
the amount of precipitation and the periods in which rainfall occurred changed in 2017
compared to 2016 and 2015. Thus, it is not clear how to define the same "summer" and "winter"
months for different years. Moreover, this study focuses on a touristic region, in which
the clients are not expected to reside permanently in the homes. The clients are expected to be in the
houses, for example, during the usual period of summer holidays, June and August, and possibly during
the Easter and Christmas holidays, so there will be indoor water consumption only
during the periods when the clients are in the homes. Also, as we have already seen, in this region
the gardens are watered even during the "winter months". Thus, the approach used in Syme et al. [3] is
not appropriate for our case.
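The sensitivity of this seasonal-difference idea to the choice of months can be seen in a small sketch. It assumes that the estimate is the mean monthly consumption over the chosen summer months minus the mean over the chosen winter months, which is a simplified reading of the approach; the monthly values are hypothetical.

```python
def seasonal_difference(monthly_consumption, summer_months, winter_months):
    """Estimate outdoor use as mean summer consumption minus mean winter consumption."""
    summer = [monthly_consumption[m] for m in summer_months]
    winter = [monthly_consumption[m] for m in winter_months]
    return sum(summer) / len(summer) - sum(winter) / len(winter)

# Hypothetical monthly totals in m3 for one lot, keyed by month number.
monthly = {1: 30.0, 2: 28.0, 6: 60.0, 7: 70.0, 8: 66.0, 11: 40.0, 12: 32.0}

# Two plausible definitions of "winter months" give different estimates.
estimate_a = seasonal_difference(monthly, [6, 7, 8], [12, 1, 2])
estimate_b = seasonal_difference(monthly, [6, 7, 8], [11, 12, 1])
print(round(estimate_a, 2), round(estimate_b, 2))
```

In addition, whenever gardens are watered during the winter months, the winter baseline already contains outdoor use, so the difference underestimates the garden watering.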
Chapter 5
Conclusions
This chapter states the achievements of this study in Section 5.1 and outlines some ideas for
future work in Section 5.2.
5.1 Achievements
In this dissertation, our aim was to study, model and forecast garden watering demand in a coastal
touristic area. To that end, we used data collected between 01/01/2015 and 31/07/2017 from 57 water
meters that measure exclusively outdoor use.
We were able to verify that the relationship between the outdoor area of a lot and its mean
daily outdoor water use is not linear. The characterization of the outdoor area typology of
each lot could be important information; however, it was not available at the time of this study. We
therefore estimated the actual watered area of each lot, using pictures available on Google Maps, and
confirmed that these values reveal more information about the clients' water use.
The first step taken was the clustering of the time series. Building a model for each client
is not practical, so grouping similar time series is an important step. The time series were
normalized before the clustering algorithm was applied, in order to group them by pattern; had we
clustered the original series, the result would have been a grouping by scale, which was not of
interest to us. In this way, we identified 5 groups according to the consumption pattern in the set of
clients, using the Complete Linkage hierarchical clustering algorithm with the periodogram-based
distance. However, after some exploratory work on the characteristics of each cluster, we decided to
join 3 clusters into one and to apply the same clustering algorithm again to one of the clusters,
since its members were quite different from each other.
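The clustering step summarized above can be illustrated with a minimal sketch in plain Python. It assumes one common form of the periodogram-based distance, namely the Euclidean distance between the periodogram ordinates of the z-normalized series, and a naive complete linkage loop; the software actually used in this work may differ in scaling and implementation details, and the toy series are hypothetical.

```python
import cmath
import math

def z_normalize(series):
    """Zero mean, unit variance, so that series are grouped by pattern and not by scale."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series] if std > 0 else [0.0] * n

def periodogram(series):
    """Periodogram ordinates at the Fourier frequencies 2*pi*k/n, k = 1..n//2."""
    n = len(series)
    ordinates = []
    for k in range(1, n // 2 + 1):
        dft = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates

def periodogram_distance(a, b):
    """Euclidean distance between the periodograms of the normalized series."""
    pa, pb = periodogram(z_normalize(a)), periodogram(z_normalize(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))

def complete_linkage(series_list, n_clusters):
    """Agglomerative clustering; cluster distance is the largest pairwise distance."""
    m = len(series_list)
    dist = {(i, j): periodogram_distance(series_list[i], series_list[j])
            for i in range(m) for j in range(i + 1, m)}
    clusters = [[i] for i in range(m)]

    def linkage(a, b):
        return max(dist[min(i, j), max(i, j)] for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)          # merge the two closest clusters
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy series: two share a slow cycle, two share a fast cycle, at different scales.
slow = [math.sin(2 * math.pi * t / 12) for t in range(12)]
fast = [math.sin(2 * math.pi * t / 4) for t in range(12)]
series = [slow, [5 * x + 10 for x in slow], fast, [3 * x + 2 for x in fast]]
print(complete_linkage(series, 2))
```

In the toy example, the series with the same cycle end up in the same cluster even though their scales differ, which is the effect that normalization is meant to achieve.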
We proceeded to build Generalized Additive Models for each of the three clusters. With these
models, it was possible to use the weather variables (mean, minimum and maximum daily temperature
and daily accumulated precipitation) as explanatory variables. One important explanatory variable used
in the models was Impulse, which explained the abrupt shift in consumption during the first consecutive
days of rain in the month of October. We also attempted to use one of the classical time series
models, SARIMA; however, the forecast values obtained from it were far from satisfactory, so
the Generalized Additive Models were more adequate for the data set.
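To make the role of such an indicator concrete, the sketch below builds a 0/1 variable that marks the first run of consecutive rainy days in October of each year. This is only one possible reading of the Impulse variable and is an assumption made here for illustration; the exact construction used in the models is the one given in the modeling chapter. The function name, the rain threshold (any precipitation above zero) and the data are hypothetical.

```python
from datetime import date, timedelta

def impulse_indicator(dates, precipitation_mm):
    """Return 1 for days in the first spell of consecutive rainy October days, else 0.

    Assumes daily, chronologically ordered observations and treats a day as
    rainy when its accumulated precipitation is above zero (an assumption).
    """
    indicator = [0] * len(dates)
    started = set()   # years whose first October rain spell has started
    finished = set()  # years whose first October rain spell has ended
    for i, (day, rain) in enumerate(zip(dates, precipitation_mm)):
        year = day.year
        if day.month != 10 or year in finished:
            if year in started:
                finished.add(year)  # spell ran until the end of October
            continue
        if rain > 0:
            indicator[i] = 1
            started.add(year)
        elif year in started:
            finished.add(year)      # first dry day closes the spell
    return indicator

# Hypothetical daily precipitation for 1 to 8 October 2017.
days = [date(2017, 10, 1) + timedelta(days=k) for k in range(8)]
rain = [0.0, 0.0, 3.0, 5.0, 0.0, 2.0, 0.0, 0.0]
print(impulse_indicator(days, rain))
```

An indicator of this kind enters the model as an ordinary explanatory variable, allowing a level shift on the flagged days without affecting the smooth terms for the weather variables.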
Three models were built for each cluster, which makes it possible to provide a consumption interval
for the forecasts of the mean values of each cluster. After forecast evaluation, it was verified that
the models of one of the clusters (Cluster 2) did not achieve satisfactory results. The models of the
remaining two clusters presented good forecast accuracy, the best being the models of Cluster 1.
These models can be used, for example, when a new lot is being built and it is necessary to
estimate the outdoor water use of the new client, or to estimate the outdoor water consumption of an
existing client that will close the borehole in the lot. In both cases, the information about the lot
area, outdoor area and building area is available. Thus, the outdoor area is used as guidance to decide
which model to use. For an outdoor area between around 1600 m² and 2300 m², the models of Cluster 2
are used to predict future daily values. For an outdoor area between around 1100 m² and 1600 m², an
average of the predicted values of the models of Cluster 1 and Cluster 3 is considered.
The results obtained will be important to improve the water supply network management of the water
utility company. The predictive garden watering demand models are also important for future planning
in the case more clients close their boreholes and connect to the mains water. Additionally, this study is
also of interest to the management of outdoor areas of large consumers, such as hotels.
A secondary goal of this study was to identify a method to disaggregate the daily water consumption
of meters that measure both indoor and outdoor water use. For this, 41 lots with only one meter and
outdoor areas between 1100 m² and 2300 m² were selected. By clustering these time series, we obtained
4 groups, and by looking at the similarity between their representative series and those of the
clusters of the 57 meters that measure exclusively outdoor use, we were able to classify these 4 groups.
We used the garden watering demand models that we built to estimate the total consumption,
since we verified that garden watering represented the majority of the total consumption. With the
method presented, we were able to obtain satisfactory estimates of the total consumption, allowing good
estimates of the indoor and outdoor water use of lots with only one water meter.
This method can be helpful in future water management planning. By providing estimates of average
indoor consumption, this information can be important in sewage system planning. Also, understanding
the weight of the indoor and outdoor water use in the total consumption may help in a future billing
change. Furthermore, it can be important in the decision making of installing new meters that measure
exclusively indoor use and exclusively outdoor use in more lots.
5.2 Future Work
Regarding suggestions for future work, there are interesting possibilities that can derive from this
study.
The disaggregation problem can be approached from a different point of view.
In this study, we attempted to disaggregate the daily consumption, which allowed us to use the
garden watering demand models that were built. However, it might be possible to do this with hourly
observations. Knowing that the gardens are generally watered at night, from around 4 a.m. until 6 a.m.,
it might be possible to identify the type of consumption according to the pattern within a day.
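A first exploratory quantity for such an hourly analysis could be the share of the daily consumption that falls in the night watering window. The sketch below assumes 24 hourly readings per day and a window covering the hours starting at 4 a.m. and 5 a.m.; the function name and the profile are hypothetical, and this is only a possible starting point rather than a disaggregation method.

```python
def night_window_share(hourly_consumption, start_hour=4, end_hour=6):
    """Fraction of the daily consumption observed between start_hour and end_hour.

    hourly_consumption holds 24 values, where index h is the consumption
    during the hour starting at h:00. The default window is an assumption
    based on gardens being watered from around 4 a.m. until 6 a.m.
    """
    if len(hourly_consumption) != 24:
        raise ValueError("expected 24 hourly values")
    total = sum(hourly_consumption)
    if total == 0:
        return 0.0
    return sum(hourly_consumption[start_hour:end_hour]) / total

# Hypothetical hourly profile in m3: small indoor use all day, watering at night.
profile = [0.1] * 24
profile[4] = 2.0
profile[5] = 2.0
print(round(night_window_share(profile), 3))
```

A day with a high share in this window would be a candidate for consumption dominated by garden watering, whereas consumption spread over the day would point to indoor use.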
In this study, we modeled the data collected from exclusive outdoor water meters that have a
corresponding exclusive indoor water meter. By adding the values of each outdoor meter to those of the
respective indoor meter, the total water consumption of each lot is obtained, so it is possible to
model the total water consumption. It would then be possible to use such a model together with the
garden watering demand models to obtain estimates of the indoor water consumption.
In addition, within the scope of this project, if a thorough study of the garden typology of each lot
were made (i.e., measurement of the area of lawn, the types of small plants, trees or bushes present
in the lot's outdoor area and the space they occupy), it would be possible to better understand
the relation between the outdoor water use and the real watered area. It would also provide an
understanding of how the presence of different types of plants or trees affects the water consumption.
Furthermore, the garden watering demand models may be used to identify clients with a borehole.
The models can be used to identify possible boreholes in the set of almost 3000 clients managed by the
water utility company. This is of high importance for future planning of the water utility company, since
it is expected the boreholes will eventually be closed due to saltwater intrusion, and the clients with a
borehole will connect to the mains water supply system.
Bibliography
[1] A. Danilenko, E. D. M., and Jacobsen. Climate change and urban water utilities: challenges and
opportunities. Water Working Notes No 24, Water Sector Board, Sustainable Development Network.
World Bank, Washington DC, (50), 2010.
[2] C. Makwiza. Estimating outdoor water use allowing for the possible impacts of climate change.
PhD thesis, Faculty of Engineering at Stellenbosch University, March 2018.
[3] G. J. Syme, Q. Shao, et al. Predicting and Understanding Home Garden Water Use. Landscape
and Urban Planning, 68:121–128, May 2004.
[4] T. Root and Survis. Human water climate interactions in the context of managing Florida’s water
supplies. 43:4–16, 01 2012.
[5] B. Randolph and P. Troy. Understanding Water Consumption in Sydney. 2007.
[6] Publico. https://www.publico.pt/2018/06/04/sociedade/entrevista/
entrevista-godinho-1832382. Accessed: 2018-06-04.
[7] M. Loh, P. Coghlan, and W. Australia. Domestic water use study : in Perth, Western Australia,
1998-2001 / Michael Loh, Peter Coghlan. Water Corporation [West Leederville, W.A.], 2003.
[8] L. A. House-Peters and H. Chang. Urban water demand modeling: Review of concepts, methods,
and organizing principles. Water Resources Research, 47(15), 2011.
[9] M. Ghiassi, D. K. Zimbra, and H. Saidane. Urban Water Demand Forecasting with a Dynamic
Artificial Neural Network Model. Journal of Water Resources Planning and Management, 134(2):
138–146, 2008.
[10] J. Caiado. Performance of combined double seasonal univariate time series models for forecasting
water demand. CEMAPRE Working Papers 0903, Centre for Applied Mathematics and Economics
(CEMAPRE), School of Economics and Management (ISEG), Technical University of Lisbon, May
2009.
[11] S. Gato, N. Jayasuriya, and P. Roberts. Forecasting Residential Water Demand: Case Study.
Journal of Water Resources Planning and Management, 133:309–319, 2007.
[12] H. Chang, S. Praskievicz, and H. Parandvash. Sensitivity of urban water consumption to weather
and climate variability at multiple temporal scales: The case of portland, oregon. International
Journal of Geospatial and Environmental Research, 1(1):1–19, 2014.
[13] A. Jain, A. K. Varshney, and U. C. Joshi. Short-term Water Demand Forecast Modelling at IIT
Kanpur Using Artificial Neural Networks. Water Resources Management, 15:299–321, 2001.
[14] S. Fontdecaba, J. A. Sanchez-Espigares, L. Marco-Almagro, X. Tort-Martorell, F. Cabrespina, and
J. Zubelzu. An Approach to Disaggregating Total Household Water Consumption into Major End-
Uses. Water Resources Management, 27(7):2155–2177, May 2013.
[15] T. R. Gurung, R. Stewart, C. Beal, and A. Sharma. Smart water meter data for improved water
demand modelling of diversified water supply schemes. 02 2015.
[16] C. Makwiza and H. E. Jacobs. Sound recording to characterize outdoor tap water use events.
Journal of Water Supply: Research and Technology - Aqua, 2017.
[17] J. Chen, A. H. Kam, J. Zhang, N. Liu, and L. Shue. Bathroom activity monitoring based on sound.
In H. W. Gellersen, R. Want, and A. Schmidt, editors, Pervasive Computing, pages 47–61. Springer
Berlin Heidelberg, 2005.
[18] J. Fogarty, C. Au, and S. E. Hudson. Sensing from the basement: A feasibility study of unobtrusive
and low-cost home activity recognition. In Proceedings of the 19th Annual ACM Symposium on
User Interface Software and Technology, UIST ’06, pages 91–100, 2006.
[19] A. Pierrot and Y. Goude. Short-term electricity load forecasting with generalized additive models.
In Conference: Proceedings of ISAP power, pages 593–600, 2011.
[20] A. Ba, M. Sinn, Y. Goude, and P. Pompey. Adaptive Learning of Smoothing Functions: Application to
Electricity Load Forecasting. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 2510–2518. Curran Associates,
Inc., 2012.
[21] W. W. S. Wei. Time Series Analysis: Univariate and Multivariate Methods. Pearson Addison Wesley,
2nd edition, 2006.
[22] A. P. Pires. Notas de Series Temporais. March 2001.
[23] R. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts: Melbourne,
Australia, 2013. URL http://otexts.org/fpp/.
[24] E. Zivot. Time Series Econometrics - Lecture notes. 2006.
[25] G. P. E. Box and D. R. Cox. An Analysis of Transformations. Journal of the Royal Statistical Society,
26(2):211–252, 1964.
[26] S. Bisgaard and M. Kulahci. Time Series Analysis and Forecasting by Example. John Wiley and
Sons, Inc., 2011.
[27] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall,
3rd edition, 1994.
[28] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah. Time series clustering - A decade review.
Information Systems, 53:16 – 38, October 2015.
[29] J. Caiado, N. Crato, and D. Pena. A periodogram-based metric for time series classification.
Comput. Stat. Data Anal., 50(10):2668–2684, June 2006.
[30] A. D. Chouakria and P. N. Nagabhushan. Adaptive dissimilarity index for measuring time series
proximity. Advances in Data Analysis and Classification, 1(1):5–21, Mar 2007.
[31] C. M. M. Pereira and R. F. de Mello. Common dissimilarity measures are inappropriate for time
series clustering. RITA, 20:25–48, 2013.
[32] P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. John Wiley and
Sons, Inc., 2003.
[33] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In
Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94,
pages 359–370. AAAI Press, 1994.
[34] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in
Statistics). Springer-Verlag New York, Inc., 2005.
[35] B. Desgraupes. Clustering Indices. University of Paris Ouest - Lab Modal’X, pages 1–34, 2013.
[36] S. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 1st edition,
2006.
[37] Weather underground. https://www.wunderground.com/. Accessed: 2017-12-01.
[38] R. Hyndman, K. Smith, and X. Wang. Characteristic-Based Clustering for Time Series Data. Data
Mining and Knowledge Discovery, 13:335–364, November 2006.
[39] P. Montero and J. A. Vilar. TSclust: An R Package for Time Series Clustering. Journal of Statistical
Software, 62(1), November 2014.
[40] M. Charrad, N. Ghazzali, V. Boiteau, and A. Niknafs. NbClust: An R Package for Determining the
Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), October 2014.
[41] H. Jacobs and J. Haarhoff. Structure and data requirements of an end-use model for residential
water demand and return flow. Water SA, 30(3):293–304, 2004.
[42] S. Gato-Trinidad, N. Jayasuriya, and P. Roberts. Understanding urban residential end uses of water.
Water Science and Technology, 64(1):36–42, 2011.
Appendix A
Results of the Clustering
In this appendix we present further exploratory analysis on the clustering results obtained.
A.1 Exploratory analysis
Figure A.3 and Figure A.10 show the representative series Mean of Cluster 2 and Cluster 3.
Figure A.1, Figure A.4 and Figure A.11 show the representative series Q95% of the clusters, and
Figure A.2, Figure A.5 and Figure A.12 show the representative series Q25% of the clusters.
Figure A.9 and Figure A.16 show the boxplots per day of the week. Figure A.7 and Figure A.14 present
the boxplots per month of the aggregated monthly consumptions of the members of Cluster 4 and of
the members of Cluster 5, respectively. Figure A.8 and Figure A.15 show the boxplots per
month of the year for both clusters.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.1: Representative series Q95% of Cluster 1 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.2: Representative series Q25% of Cluster 1 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.3: Representative series Mean of Cluster 2 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.4: Representative series Q95% of Cluster 2 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.5: Representative series Q25% of Cluster 2 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Hour vs. Value; one line per month (1 to 12).]
Figure A.6: Hourly pattern per month of Cluster 2.
[Boxplot omitted. Axes: Time (months), January 2015 to July 2017, vs. Value.]
Figure A.7: Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 2.
[Boxplot omitted. Axes: Months (Jan to Dec) vs. Value.]
Figure A.8: Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 4.
[Boxplot omitted. Axes: Day of the Week (Monday to Sunday) vs. Value.]
Figure A.9: Boxplot per day of the week of the normalized consumptions of the members of Cluster 4.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.10: Representative series Mean of Cluster 3 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.11: Representative series Q95% of Cluster 3 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Time (days) vs. Value.]
Figure A.12: Representative series Q25% of Cluster 3 between 01/01/2015 and 31/07/2017.
[Plot omitted. Axes: Hour vs. Value; one line per month (1 to 12).]
Figure A.13: Hourly pattern per month of Cluster 3.
[Boxplot omitted. Axes: Time (months), January 2015 to July 2017, vs. Value.]
Figure A.14: Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 3.
[Boxplot omitted. Axes: Months (Jan to Dec) vs. Value.]
Figure A.15: Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 3.
[Boxplot omitted. Axes: Day of the Week (Monday to Sunday) vs. Value.]
Figure A.16: Boxplot per day of the week of the normalized consumptions of the members of Cluster 3.
Appendix B
Additional forecast results
In this Appendix, the results of the models of the representative series Q95% and Q25% of Clusters 1
and 2 are shown. The forecast results obtained from the models of the representative series Mean, Q95%
and Q25% of Cluster 3 are presented in Figure B.5, Figure B.6 and Figure B.7, respectively. In addition,
Figure B.8 shows the forecasts of the models of the representative series Q95% and Q25% as consumption
intervals around the forecasts of the model of the representative series Mean.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure B.1: Forecast of the model of representative series Q95% for the interval 07/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAPE is equal to 19.712%.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure B.2: Forecast of the model of representative series Q25% for the interval 16/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAE is equal to 0.898.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure B.3: Forecast of the model of representative series Q95% for the interval 19/08/2017 - 16/11/2017 of Cluster 2 in the original scale. The MAPE is equal to 30.024%.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure B.4: Forecast of the model of representative series Q25% for the interval 14/08/2017 - 30/11/2017 of Cluster 2 in the original scale. The MAE is equal to 2.098.
[Plot omitted. Axes: Time (days) vs. m3/day; series: Predictions, Real.]
Figure B.5: Forecast of the model of representative series Mean for the interval 22/08/2017 - 30/11/2017 of Cluster 3 in the original scale. The MAPE is equal to 17.444%.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure B.6: Forecast of the model of representative series Q95% for the interval 11/08/2017 - 25/11/2017 of Cluster 3 in the original scale. The MAPE is equal to 34.578%.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure B.7: Forecast of the model of representative series Q25% for the interval 12/08/2017 - 08/11/2017 of Cluster 3 in the original scale. The MAE is equal to 0.769.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions (Mean), Predictions (Q1 25%), Predictions (Q3 95%), Real (Mean).]
Figure B.8: Forecast and band intervals for Cluster 3 from 22/08/2017 until 8/11/2017 in the original scale.
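The captions in this appendix report either the MAE or the MAPE of each forecast. As a reminder of how these two measures are computed, the following is a minimal sketch using the standard definitions; the function names and the sample values are illustrative assumptions and are not taken from the thesis data or code. Note that the MAPE divides by the observed value, so it is undefined whenever an observation equals zero, whereas the MAE is expressed in the units of the series (here m³/day).

```python
from typing import Sequence


def _check(real: Sequence[float], pred: Sequence[float]) -> None:
    # Both series must cover the same, non-empty forecast interval.
    if len(real) == 0 or len(real) != len(pred):
        raise ValueError("series must be non-empty and of equal length")


def mae(real: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute error, in the units of the series."""
    _check(real, pred)
    return sum(abs(r - p) for r, p in zip(real, pred)) / len(real)


def mape(real: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    _check(real, pred)
    if any(r == 0 for r in real):
        raise ValueError("MAPE is undefined when an observed value is zero")
    return 100.0 * sum(abs((r - p) / r) for r, p in zip(real, pred)) / len(real)


if __name__ == "__main__":
    # Made-up daily values in m3/day, for illustration only.
    real = [4.0, 5.0, 8.0]
    pred = [3.0, 5.5, 10.0]
    print(f"MAE  = {mae(real, pred):.3f} m3/day")
    print(f"MAPE = {mape(real, pred):.3f} %")
```

Because the MAE depends on the scale of the series, it should only be compared across forecasts of series with similar magnitude; the MAPE is scale-free but is inflated by observations close to zero.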
Appendix C
Additional daily disaggregation of consumption results
This appendix presents additional results of the daily disaggregation of consumption. For Groups 3 and 4, the estimates of the total consumption are shown in Figure C.1 and Figure C.3, respectively, and the estimates of the garden watering and domestic consumption are shown in Figure C.2 and Figure C.4, respectively.
Some results of the second disaggregation method discussed in Section 4.5 are also shown. Figure C.5 shows the estimates of the total consumption for Group 1 Large, the group formed by the members of Group 1 with an exterior area larger than 1600 m². The estimates of the garden watering and domestic consumption for this group are presented in Figure C.6. Figure C.7 shows the estimates of the total consumption for Group 3 Small, the group formed by the members of Group 3 with an exterior area smaller than 1600 m². The estimates of the garden watering and domestic consumption for this group are presented in Figure C.8.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure C.1: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Domestic consumption estimates, Garden watering estimates, Real (total).]
Figure C.2: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure C.3: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Domestic consumption estimates, Garden watering estimates, Real (total).]
Figure C.4: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure C.5: Estimates of the total consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale. The MAPE was equal to 66.41%.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Garden watering estimates, Indoor consumption estimates, Real (total).]
Figure C.6: Estimates of the garden watering and domestic consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Predictions, Real.]
Figure C.7: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale. The MAPE was equal to 44.05%.
[Plot omitted. x-axis: Time (days); y-axis: m³/day; series: Garden watering estimates, Indoor consumption estimates, Real (total).]
Figure C.8: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale.