
Mathematical modeling of garden watering demand

Ana Rosa da Cruz Lopes Marques

Thesis to obtain the Master of Science Degree in

Mathematics and Applications

Supervisors: Prof. Maria da Conceição Esperança Amado

Dr. Dália Susana dos Santos da Cruz Loureiro

Examination Committee

Chairperson: Prof. António Manuel Pacheco Pires

Supervisor: Prof. Maria da Conceição Esperança Amado

Members of the Committee: Prof. Isabel Maria Alves Rodrigues

Magister Maria Regina Guerreiro Casimiro

July 2018


Acknowledgments

I would first like to express my sincere gratitude to my advisor, Prof. Conceição Amado, for her continuous support, motivation, guidance and immense knowledge. Her guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor for my thesis.

Besides my advisor, I would like to thank my co-advisor, Dr. Dália Loureiro, for welcoming me into NES (Núcleo de Engenharia Sanitária); I am gratefully indebted to her for her very valuable comments on this thesis.

I would like to thank Engr. Regina Casimiro and Engr. Pedro Pascoal for their availability.

Finally, I must express my gratitude to my parents and to my brother for providing me with unfailing support and encouragement throughout my years of study. This accomplishment would not have been possible without them.


Resumo

The increase in tourism in coastal regions and the problem of saltwater intrusion in the aquifers, which leads to the closure of boreholes used for garden watering, put pressure on the water supply of the region where the case study is located. Added to this are the consequences of climate change, which make it challenging to forecast medium- and long-term consumption scenarios. This study aims to characterize, model and forecast water consumption for garden watering in a coastal touristic region. This is possible due to the particular situation of the lots under study having two water meters: one that measures indoor water consumption and another that measures outdoor consumption.

We apply a clustering algorithm to group the consumers according to their consumption pattern. For each cluster, we propose a generalized additive model. In addition, we test a method to disaggregate the total consumption into indoor and outdoor water use.

Keywords: Garden watering, Outdoor water consumption, Time series clustering, Generalized Additive Models, Disaggregation of water consumption


Abstract

An increase in tourism in coastal regions and saltwater intrusion in the aquifers, which leads to the closure of boreholes used to water gardens, put pressure on the water supply of the region under study. This situation, along with climate change, makes it challenging to envisage medium- and long-term consumption scenarios. This study aims to characterize, model and forecast the garden watering demand in a coastal touristic region. This is possible because the lots under study have two water meters: one to measure indoor water use and another to measure outdoor water use.

We apply a clustering algorithm to group the customers by similarity of consumption pattern. For each cluster, we propose a generalized additive model. Furthermore, we test a method to disaggregate the total water use into indoor and outdoor use.

Keywords: Garden watering, Outdoor water use, Time series clustering, Generalized Additive

Models, Disaggregation of water consumption


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Outline
2 State-of-the-art
3 Methodology
3.1 Time series basic concepts
3.1.1 Stationarity
3.1.2 Autocovariance, Autocorrelation and Partial Autocorrelation Functions
3.1.3 White Noise
3.1.4 Differencing
3.1.5 Variance Stabilizing Transformations
3.1.6 Cross-correlation function
3.2 Time series Models
3.2.1 Linear Stationary Models
3.2.2 Non-stationary Linear Models
3.2.3 Model Identification
3.2.4 Diagnostic
3.2.5 Forecast
3.3 Time series clustering
3.3.1 Hierarchical Clustering
3.3.2 Distances
3.3.3 Comparing clustering methods
3.4 Models
3.4.1 Generalized Linear Models
3.4.2 Generalized Additive Models
3.4.3 Mixed Models - GAMMs
3.5 Disaggregation of water consumption
3.5.1 Classification algorithm: K-Nearest Neighbors (KNN)
3.5.2 Method for disaggregation of water consumption
4 Results and Discussion
4.1 Case study description and data processing
4.2 Exploratory Analysis
4.3 Time Series Clustering
4.3.1 Hierarchical clustering
4.3.2 Choosing the best number of clusters
4.3.3 Discussion of the clustering results
4.4 Modeling garden watering demand using GAM
4.4.1 Explanatory variables selection
4.4.2 Modeling
4.4.3 Analysis of the Residuals
4.4.4 Forecast
4.5 Daily disaggregation of water consumption
5 Conclusions
5.1 Achievements
5.2 Future Work
Bibliography
A Results of the Clustering
A.1 Exploratory analysis
B Additional forecast results
C Additional daily disaggregation of consumption results


List of Tables

3.1 Summary of the properties of the stationary models (Source: Bisgaard and Kulahci [26]).
4.1 Information regarding the extreme observations of the 57 outdoor water meters.
4.2 Comparison of the values of the four indexes for the best number of clusters for Ward Method and Complete Linkage with periodogram based distance when using the Standard normalization.
4.3 Best number of clusters according to each index using complete linkage method with periodogram based distance.
4.4 Size of each cluster.
4.5 Summary of the outdoor areas per cluster.
4.6 Size of each cluster.
4.7 Summary of the outdoor areas per cluster (final clusters).
4.8 Summary of the estimated watered areas per cluster (final clusters).
4.9 Summary of the building areas per cluster (final clusters).
4.10 Average ratio between outdoor area and lot area per cluster (final clusters).
4.11 Mean estimated pool volume per cluster.
4.12 Monthly peak factor per cluster for 2015 and 2016.
4.13 Mean monthly ratio between the garden watering and total water consumption per Cluster for the months of August, September, October and November and years 2015 and 2016.
4.14 Group size for the test data set (N = 41).
4.15 KNN classification results of the Groups' representative series according to the clusters obtained for the water consumption for garden watering data set.
4.16 Size of each group.


List of Figures

4.1 Mean daily water consumption for garden watering of the 57 water meters and mean daily temperature from 01/01/2015 to 30/11/2017.
4.2 Daily accumulated precipitation from January 2015 to November 2017.
4.3 Boxplot of the monthly consumptions of the 57 water meters between January 2015 and November 2017.
4.4 Boxplot of the monthly consumptions of the 57 time series, grouped by year (2015, 2016 and 2017).
4.5 Monthly consumption in November for three years (2015, 2016 and 2017).
4.6 Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area for the 57 water meters.
4.7 Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area for the 57 water meters.
4.8 Scatterplots of the mean daily consumption versus a) outdoor area, b) estimated watered area.
4.9 Median monthly indoor consumption and median monthly water consumption for garden watering of the 57 water meters between January 2015 and November 2017.
4.10 Mean daily pattern of indoor consumption and water consumption for garden watering of the 57 water meters between 01/01/2015 and 30/11/2017.
4.11 The number of clusters versus Dunn index.
4.12 The number of clusters versus Entropy.
4.13 The number of clusters versus Gamma index.
4.14 The number of clusters versus Silhouette index.
4.15 Partition of the 57 time series into 5 clusters.
4.16 Representative series of Cluster 1 between 01/01/2015 and 31/07/2017.
4.17 Normalized monthly consumption aggregated by the median for each cluster between January 2015 and July 2017.
4.18 Boxplot of the outdoor area per cluster.
4.19 Boxplot of the mean daily water consumption for garden watering per cluster.
4.20 Boxplot of the normalized monthly consumption of the new Cluster 1 between January 2015 and July 2017.


4.21 Boxplot per month of the year of the normalized monthly consumption of the new Cluster 1.
4.22 Boxplot per day of the week of the new Cluster 1.
4.23 Daily pattern per month of the new Cluster 1.
4.24 Boxplot of the outdoor area per cluster (final clusters).
4.25 Boxplot of the estimated garden area per cluster (final clusters).
4.26 Boxplot of the building area per cluster (final clusters).
4.27 Scatterplot of the mean daily consumption versus outdoor area grouped by cluster (final clusters).
4.28 Representative series Mean of the new Cluster 1 between 01/01/2015 and 31/07/2017.
4.29 Sample ACF of the response variable.
4.30 Sample PACF of the response variable.
4.31 CCF between the differenced mean temperature and the differenced representative series.
4.32 CCF between the differenced maximum temperature and the differenced representative series.
4.33 CCF between the differenced minimum temperature and the differenced representative series.
4.34 CCF between the accumulated precipitation and the differenced representative series.
4.35 Histogram of the residuals of Model 1.
4.36 QQ-Plot of the residuals of Model 1.
4.37 Residuals versus the linear predictor of Model 1.
4.38 Daily forecast of the model of representative series Mean (Model 1, Equation 4.4) of Cluster 1 and the real aggregated values by the mean, both in the original scale between 10/08/2017 and 30/11/2017.
4.39 Daily forecast of Model 1 (Equation 4.4), Model 2 (Equation 4.5) and Model 3 (Equation 4.6) of Cluster 1 and the real aggregated values by the mean in the original scale between 16/08/2017 and 30/11/2017.
4.40 Daily forecast of the model of representative series Mean (Model 4, Equation 4.8) of Cluster 2 and the real aggregated values by the mean in the original scale between 27/08/2017 and 30/11/2017.
4.41 Daily forecast of Model 4 (Equation 4.8), Model 5 (Equation 4.9) and Model 6 (Equation 4.10) of Cluster 2 and the real aggregated values by the mean in the original scale between 19/08/2017 and 16/11/2017.
4.42 Mean monthly indoor consumption and mean monthly water consumption for garden watering of Cluster 1 between January 2015 and July 2017.






4.45 Boxplot of the outdoor area per group for the test data set (N = 41).
4.46 Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
4.47 Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.
4.48 Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.
4.49 Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.

A.1 Representative series Q95% of Cluster 1 between 01/01/2015 and 31/07/2017.
A.2 Representative series Q25% of Cluster 1 between 01/01/2015 and 31/07/2017.
A.3 Representative series Mean of Cluster 2 between 01/01/2015 and 31/07/2017.
A.5 Representative series Q25% of Cluster 2 between 01/01/2015 and 31/07/2017.
A.6 Hourly pattern per month of Cluster 2.
A.7 Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 2.
A.8 Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 4.
A.9 Boxplot per day of the week of the normalized consumptions of the members of Cluster 4.
A.10 Representative series Mean of Cluster 3 between 01/01/2015 and 31/07/2017.
A.13 Hourly pattern per month of Cluster 3.
A.14 Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 3.
A.15 Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 3.
A.16 Boxplot per day of the week of the normalized consumptions of the members of Cluster 3.
B.1 Forecast of the model of representative series Q95% for the interval 07/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAPE is equal to 19.712%.
[...] of Cluster 1 in the original scale. The MAE is equal to 0.898.






B.5 Forecast of the model of representative series Mean for the interval 22/08/2017 - 30/11/2017 [...]
B.8 Forecast and band intervals for Cluster 3 from 22/08/2017 until 8/11/2017 in the original scale.
C.1 Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
C.2 Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.
[...] consumption of Group 4 in the original scale.
[...] 30/11/2017 and the real total consumption of Group 4 in the original scale.
[...] consumption of Group 1 Large in the original scale. The MAPE was equal to 66.41%.
[...] 30/11/2017 and the real total consumption of Group 1 Large in the original scale.
[...] consumption of Group 3 Small in the original scale. The MAPE was equal to 44.05%.
[...] 30/11/2017 and the real total consumption of Group 3 Small in the original scale.


Chapter 1

Introduction

This Chapter presents the motivation for this dissertation in Section 1.1. Section 1.2 discusses the main goals set for this study and the adopted approach, and gives a summarised description of the case study. Lastly, Section 1.3 describes the structure of the dissertation.

1.1 Motivation

With rapid population growth worldwide, urban water systems must keep up with increasing water demand. In addition, the increase in tourism in certain regions puts pressure on their water supplies, creating a need for studies that help prepare a sustainable future. Without enough water, tourism in these regions is compromised. The importance of implementing water saving measures rises as water sources are affected by the growing extraction rate (Danilenko et al. [1]). With climate change, the average global temperature is increasing, precipitation is decreasing in many regions and these regions are becoming drier, causing periods of drought. At the same time, high-precipitation events and flooding are becoming more frequent in other regions. Thus, effective water demand management becomes increasingly important.

Indoor water use, that is, water use inside the house, generally remains the same throughout the year in residential areas (Makwiza [2]). However, outdoor water use, that is, the total amount of water people use outside their house, can change significantly with the weather, including an increase in garden irrigation during drier seasons. An important part of any water conservation strategy is a better understanding of outdoor water use: how much water is used outdoors and to what end. There is a high potential for water savings in the outdoor water use of residential areas.

During periods of drought, restrictions on water usage can be implemented, which typically target outdoor water uses such as watering gardens, washing vehicles and refilling pools (Syme et al. [3], Root and Survis [4]). These restrictions can also be complemented with a rise in water prices (Randolph and Troy [5]). Situations where restrictions are necessary are becoming more frequent. In the summer of 2017, severe restrictions on water use were necessary in certain cities of Portugal, since the basic water needs of the population were at risk (www.publico.pt [6]). The dams, which are the main water sources of these cities, had extremely low water levels. Thus, there is a pressing need to study outdoor water use in residential areas, to better understand how the water is being used and how much of it can be saved in the future. Moreover, outdoor water use may represent a significant part of the total water use in a water supply system, influencing its operational capacity. In addition, predicting how outdoor water consumption will evolve is crucial to plan a sustainable rehabilitation of the water supply systems. According to the literature reviewed for this study, this topic is not yet sufficiently explored. Finally, these studies are crucial to educate the general population and encourage people to adopt conservation measures.

1.2 Objectives

The main goal of this dissertation is to study and characterize the daily outdoor water demand in a coastal touristic region, namely its seasonality, its relation to the size of the outdoor area of the lot and which other variables influence outdoor water use the most. In this region, gardens occupy the majority of the outdoor areas; hence, garden watering is the most important component of outdoor water use, and other possible outdoor water uses have little significance. Thus, the terms outdoor water demand and garden watering demand are used interchangeably throughout this dissertation. Another main objective is to forecast future daily values of the outdoor water demand, for which we will build a predictive model. Since this study comprises data from several clients, it is not practical to build a model for each one. Thus, we want to find groups of clients with similar behaviours using clustering, allowing us to build one model for each group. For this, we need to investigate which clustering method and which similarity measure are best suited.

With the right grouping of the clients, we are able to proceed to the modeling. The models we will study can include external variables. We require this feature, since weather variables, such as daily average temperature and daily accumulated precipitation, can have an important influence on consumption. In order to include the weather variables in the model, we will study the relation between them and the garden watering consumption. Then, with models that explain the consumption of each cluster, we are able to predict future values. With each group characterized, it is possible to place a new client in the group that is likely to have the same consumption pattern and use the respective model to predict future values.

Additionally, a secondary goal is to disaggregate indoor and outdoor water use in cases where there is a single water meter for the lot, using the garden watering demand models obtained. In this way, the water utility company can estimate the share of the total water use that corresponds to outdoor water use. A better understanding of indoor and outdoor water use is also important for the management of wastewater drainage networks. Outdoor water use consists mainly of garden watering, and this water is not returned to the wastewater network, contrary to indoor water uses such as showers/baths and washing clothes and dishes, where a significant part of the water is collected through the wastewater drainage system.


For this study, hourly water consumption data of several clients are available for a period of almost 3 years. These will be aggregated to daily consumption, since we wish to work with daily values. For each client, we have the lot area, the building area (floor area of the house) and the outdoor area. We also have the mean, maximum and minimum daily temperature and the daily accumulated precipitation for the period under study.
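The aggregation of hourly readings into daily values mentioned above can be sketched as follows. This is only a minimal illustration in Python, not the code used in this dissertation; the function name `aggregate_to_daily`, the `(timestamp, volume)` record layout and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_to_daily(hourly_readings):
    """Sum hourly consumption readings into daily totals.

    `hourly_readings` is an iterable of (timestamp, volume) pairs, where
    `timestamp` is a datetime and `volume` is the water consumed in that hour.
    This record layout is an assumption made for illustration only.
    """
    daily_totals = defaultdict(float)
    for timestamp, volume in hourly_readings:
        # Group every reading under its calendar day.
        daily_totals[timestamp.date()] += volume
    # Return the days in chronological order.
    return dict(sorted(daily_totals.items()))


# Hypothetical readings for one water meter (values are illustrative).
readings = [
    (datetime(2015, 1, 1, 6), 0.25),
    (datetime(2015, 1, 1, 7), 0.5),
    (datetime(2015, 1, 2, 6), 0.75),
]
daily = aggregate_to_daily(readings)
```

Summing within each calendar day keeps the daily series aligned with the daily weather variables, which are also recorded per day.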

This study focuses on data collected from residential lots in a coastal and strongly touristic region in the south of Portugal. A strong seasonal variability is present, driven by the tourist influx and by garden watering needs, due to dry weather and high temperatures in the summer. In this region, we encounter a particular situation: a group of lots with two high-resolution water meters, one that measures exclusively indoor water uses and another that measures exclusively outdoor water uses, which creates an exceptional opportunity to study outdoor water demand. The use of two separate water meters is not a widespread practice in Portugal. However, it is recommended in cases where the outdoor water use is very significant and a better management of this component is necessary (for example, through a differentiated tariff).

A current problem in the region under study is saltwater intrusion in the aquifers, which contaminates the borehole water. This leads to the closure of boreholes that are used as a source to water gardens. Saltwater intrusion is expected to increase in the region, so more gardens will be watered with mains water. It is especially important to determine whether the current water supply system can provide enough water at an adequate service level if all the boreholes are closed. This is all the more pertinent since, in the limit, all the boreholes are expected to be closed. This situation makes it challenging to envisage medium- and long-term consumption scenarios. It is therefore necessary to study consumption habits, in order to know whether the water supply system will be able to meet the water demand, for future planning purposes.

The results obtained from the study will be important to improve the management of the water supply network. This study is relevant not only for residential consumers and water utility companies, but also for large consumers and municipalities. Large consumers, such as hotels and airports, may have a significant outdoor water use due to garden watering, pools and street cleaning. In municipalities, a significant part of non-revenue water corresponds to garden watering, and improving efficiency is crucial to their economic and environmental sustainability.

1.3 Thesis Outline

This dissertation is divided into 5 Chapters. Chapter 2 assesses the variables that may influence outdoor consumption, as well as the methods of analysis adopted. The methods used to cluster time series and to model the data are described in Chapter 3. In Chapter 4, we focus on the exploratory analysis, the work developed towards modeling the data, the forecast results obtained and their analysis. Also in Chapter 4, we discuss the method used to disaggregate the total water use into indoor and outdoor water use. Chapter 5 is dedicated to the conclusions of the dissertation: what was achieved and suggestions for future work.


Chapter 2

State-of-the-art

The research for this Chapter covers the assessment of the variables that influence the indoor and outdoor water use, as well as the methods of analysis adopted. Some of the models that have been used to model the total water use are mentioned. Two studies that analyse outdoor water use are examined. The approaches used to characterize the end-use of water within a household are assessed. Lastly, the approaches used in these studies that can be adapted to this thesis are presented.

There are different kinds of studies that can be made of residential water use, such as a study of consumer habits, an assessment of the variables that influence water consumption the most, the modeling of water demand and the prediction of future values, among others. Water consumption within a household can occur inside the house, for example due to washing machines, showers, toilets, taps, dishwashers and evaporative air conditioning systems, or in the outdoor space, including garden watering and water consumption related to swimming pools (Loh et al. [7]). Many variables can influence the total water demand in residential areas, such as water price, income, education, sustainability concern, temperature, precipitation, house size, housing typology, outdoor space size, garden typology and presence of a pool, among others (House-Peters and Chang [8]).

The importance of garden watering varies from country to country according to the meteorological conditions. The weight of outdoor water use in total water consumption also differs between climates and consumption habits. Thus, the importance given to water management and to studying and understanding water use is also expected to vary. For example, in Australia, water scarcity is an extremely important problem in certain regions, so studies of consumption patterns and forecasts are quite relevant, and it is important to understand and monitor outdoor water use in residential areas. In a study conducted with data collected in a residential area in Perth, Australia, between 1998 and 2001 (Loh et al. [7]), it was estimated that outdoor water use accounted for 56% of the total water use of a single detached residential household and that almost all of this water was used to water the garden. Moreover, the authors did not find a relationship between the watered area of the outdoor space and the outdoor water consumption. This study also found that houses with a borehole use less water from the public supply system in the outdoor space than houses without a borehole.

Although water demand has been modeled and forecast in several studies (Ghiassi et al. [9], Caiado [10], Gato et al. [11]), commonly using time series models or Artificial Neural Networks, not many have focused solely on modeling outdoor water use. Water use in private residential gardens has been addressed from the point of view of individual habits and environmental awareness (Randolph and Troy [5]). In the water use literature, regression models and time series analysis are the most commonly used methods (Makwiza [2], House-Peters and Chang [8]). Meteorological variables such as temperature and precipitation are included in regression models (Chang et al. [12]). It seems that a mathematical study focused solely on the garden watering demand in residential homes has not yet been conducted. In particular, there is little research specifically on the garden watering demand in residential homes in Portugal.

Syme et al. [3] performed a study to better understand and predict the monthly water consumption

in outdoor areas of residential homes. This study was conducted using estimates of external water

use, such as on gardens or swimming pools, for 397 houses in Perth, Australia. Monthly consumption data covering 1 year and 5 months were used. To estimate the outdoor water use throughout the year,

the authors assumed that, during the winter months, it is not necessary to water the gardens due to

precipitation. This implies that only the indoor water use is registered during the winter months. The

outdoor water use in the summer was then estimated by the difference between the total water use

in the summer and the total water use in the winter. In this study, socio-demographic variables were

considered, including income, lot size, presence of swimming pool, interest in gardening, importance of

garden and green spaces in their personal life, attitudes towards water conservation, type of equipment

used to water the lawn, among others. Each client answered a questionnaire covering the variables mentioned. Syme et al. [3] applied a Structural Equation Model with latent variables, which is commonly used in social sciences. The authors found that larger lots used

more water, lots with a swimming pool tended to use more water as well and the presence of more

sophisticated watering systems usually implied the use of more water. Also, it was concluded that the

lifestyle preferences, garden interest and garden use for leisure had an impact on the outdoor water

use. Moreover, it was also concluded that, when determining outdoor water use, the socio-demographic

variables were just as important as the consumer’s attitudes towards garden and gardening.
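The winter-baseline idea described above can be sketched as follows. This is only an illustration under stated assumptions: the monthly figures and the choice of winter months are hypothetical, and using the mean winter consumption as the indoor baseline (with negative differences clipped to zero) is a simplification of ours, not necessarily the exact procedure of Syme et al. [3].

```python
def estimate_outdoor_use(monthly_total, winter_months):
    """Estimate outdoor use as total use minus a winter (indoor-only) baseline.

    monthly_total: dict mapping a month label to total consumption (hypothetical data).
    winter_months: month labels assumed to contain indoor use only (an assumption).
    """
    winter_values = [monthly_total[m] for m in winter_months]
    indoor_baseline = sum(winter_values) / len(winter_values)
    # Outdoor use cannot be negative, so differences below zero are clipped.
    return {m: max(v - indoor_baseline, 0.0) for m, v in monthly_total.items()}


# Hypothetical monthly totals for a southern-hemisphere household.
totals = {"Jan": 40.0, "Jun": 15.0, "Jul": 13.0, "Aug": 14.0}
outdoor = estimate_outdoor_use(totals, ["Jun", "Jul", "Aug"])
```

The estimate relies entirely on the assumption that no watering occurs in winter; where that assumption fails, outdoor use is underestimated.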

Jain et al. [13] proposed Artificial Neural Networks to model the water demand at the Indian Institute

of Technology. The authors assumed that the majority of the water consumption at the Indian Institute

of Technology was to water the lawns and gardens. For this study, the weekly water demand at the

Institute and campus was used, as well as the weekly accumulated rainfall and weekly average of the

daily maximum temperature. Furthermore, the authors found that the occurrence of rainfall was a more significant variable than the amount of rainfall, since "people may not want to water their lawns/gardens on a rainy day regardless of the amount of rainfall". The authors found a correlation

between the weekly water consumption and the weekly average of the daily maximum temperature, as

well as a correlation between the weekly water consumption at two consecutive weeks. However, they

found that there was no correlation between the weekly water consumption and the weekly total rainfall.


It was concluded that the water demand at the Indian Institute of Technology in Kanpur and its campus is a "dynamic process driven by the temperature and interrupted by the occurrence of rainfall".

In order to characterize the end-use of water within a household, that is, when it was used, for

example, by the washing machine or in the shower, smart metering is usually used (Fontdecaba et al.

[14], Gurung et al. [15]). Smart meters, which are considerably expensive, collect data automatically

and communicate readings in real time, or nearly real time. There are references that explore different

disaggregation methods. Makwiza and Jacobs [16] conducted a study in which microphones were used

to record sound when an outdoor tap was being used, thus capturing outdoor water use events. The data were collected in homes located in the city of Lilongwe, Malawi, between December 2014 and January

2015 and later between May 2015 and July 2015. This technique had already been used to capture

water use within the homes (Chen et al. [17], Fogarty et al. [18]), so the goal of the authors was to verify

the validity of this low-cost method to capture the outdoor water use in residential homes. This method made it possible to identify the start and end of outdoor water use events; however, it could not accurately report the volume of water used.

Generalized Additive Models have been successfully used to model and forecast short-term electricity load. Pierrot and Goude [19] applied Generalized Additive Models to hourly electricity load data, including meteorological data as explanatory variables (temperature, cloud cover and wind speed). The models performed well in terms of prediction accuracy. Ba et al. [20] also used Generalized Additive Models to model and forecast half-hourly load data.

In this project, we intend to understand the relation between the garden watering demand and the size of the outdoor space, as in Loh et al. [7]. We will also examine the correlation between temperature and outdoor water use, as well as between accumulated precipitation and outdoor water use. Based on Pierrot and Goude [19] and Ba et al. [20], we apply Generalized Additive Models to garden watering demand, which, to our knowledge, has not been done yet. Following Jain et al. [13], we will check whether the occurrence of precipitation is a more significant variable in the models than the accumulated precipitation. We will also discuss the method used in Syme et al. [3] to disaggregate the total water use into indoor and outdoor use for our case study.


Chapter 3

Methodology

In Section 3.1, some basic concepts of time series are presented, which will be needed throughout the thesis. Some classical time series models are presented in Section 3.2. In Section 3.3, the clustering methods used in this project to group time series are described. Clustering makes it possible to partition the clients according to their similarity and to build one model for each group, instead of one model for each client, which is impractical. In Section 3.4, Generalized Additive Models, which were the models used in this project to fit the garden watering demand, are studied. Generalized Additive Mixed Models are also described. Lastly, in Section 3.5, a method for disaggregating water consumption into indoor and outdoor use is discussed, which will be applied to a set of clients with a single water meter that measures both the indoor and outdoor water use.

3.1 Time series basic concepts

A time series is a collection of observations obtained through repeated measurements over time. The objectives of studying time series include understanding the physical characteristics that generate them and predicting future values (Wei [21]).

To give a formal definition of time series, a stochastic process must be defined first.

Definition 3.1.1. A stochastic process Z = {Z(t), t ∈ T} is a collection of random variables, that is, for

each t in the index set T , Z(t) is a random variable. Usually, t is interpreted as time, therefore, Z(t)

is the state of the process at time t. If the index set T is a countable set, Z is called a discrete time

stochastic process and if T is continuous, Z is a continuous time stochastic process.

Definition 3.1.2. A stochastic process Z = {Z(t), t ∈ T} with values in R is a time series if T ⊆ R is

discrete.

A time series can be decomposed into a trend component ($T_t$), a seasonal component ($S_t$) and an irregular or noise component ($\varepsilon_t$) (Pires [22]). This decomposition can be additive,

$$Z_t = T_t + S_t + \varepsilon_t \tag{3.1}$$

or multiplicative,

$$Z_t = T_t \times S_t \times \varepsilon_t \tag{3.2}$$

The additive decomposition is usually chosen. The decomposition can also include a cyclic component.

3.1.1 Stationarity

For the following definitions, a finite set of random variables $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_n}\}$ from a stochastic process $\{Z(t) : t \in \mathbb{Z}\}$ is considered. Its $n$-dimensional joint distribution function is denoted by $F(Z_{t_1}, \ldots, Z_{t_n})$.

Definition 3.1.3. A process is strongly stationary (or strictly stationary) if $F(Z_{t_1}, \ldots, Z_{t_n}) = F(Z_{t_1+k}, \ldots, Z_{t_n+k})$ for any finite set of indices $\{t_1, t_2, \ldots, t_n\} \subset \mathbb{Z}$ with $n \in \mathbb{Z}^+$ and any $k \in \mathbb{Z}$.

Definition 3.1.4. A process is first order stationary (or stationary on average) if $F(Z_{t_1}) = F(Z_{t_1+k})$ for any $t_1, k \in \mathbb{Z}$, that is, if the one-dimensional distribution function is time invariant.

Definition 3.1.5. A process is second order stationary (or weakly stationary) if $F(Z_{t_1}, Z_{t_2}) = F(Z_{t_1+k}, Z_{t_2+k})$ for any $t_1, t_2, k \in \mathbb{Z}$.

Definition 3.1.6. A process is $r$th order stationary if $F(Z_{t_1}, \ldots, Z_{t_n}) = F(Z_{t_1+k}, \ldots, Z_{t_n+k})$ for any $n \le r$ and $k, t_1, t_2, \ldots, t_n \in \mathbb{Z}$.

Some relations regarding stationarity are worth mentioning, such as:

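- A higher order of stationarity implies every lower order of stationarity.

- Second order stationarity does not imply strong stationarity.

- Strong stationarity does not imply weak stationarity in the usual moment-based sense (constant mean and autocovariance depending only on the lag) unless the first two moments are finite; a sequence of independent Cauchy random variables is a standard counterexample.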

3.1.2 Autocovariance, Autocorrelation and Partial Autocorrelation Functions

Consider a weakly stationary process $Z_t$, whose mean, $E[Z_t] = \mu$, and variance, $\mathrm{Var}(Z_t) = \sigma^2$, are constant.

The covariance, $\mathrm{Cov}(Z_t, Z_s)$, is a function that depends only on the difference $|t - s|$, for all $s, t \in \mathbb{Z}$. The covariance between $Z_t$ and $Z_{t+k}$ can be written as (Wei [21]):

$$\gamma_k = \mathrm{Cov}(Z_t, Z_{t+k}) = E[(Z_t - \mu)(Z_{t+k} - \mu)] \tag{3.3}$$

$\gamma_k$ is called the autocovariance function and it measures the linear dependence between two random variables. This function has the following properties for a stationary process:

1. $\gamma_0 = \mathrm{Var}(Z_t)$;

2. $|\gamma_k| \le \gamma_0$;

3. $\gamma_k = \gamma_{-k}$ for all $k$, i.e., $\gamma_k$ is an even function.

The correlation between $Z_t$ and $Z_{t+k}$ is given by:

$$\rho_k = \frac{\mathrm{Cov}(Z_t, Z_{t+k})}{\sqrt{\mathrm{Var}(Z_t)}\sqrt{\mathrm{Var}(Z_{t+k})}} = \frac{\gamma_k}{\gamma_0} \tag{3.4}$$

This function, $\rho_k$, is called the autocorrelation function (ACF) of $Z_t$ and it satisfies the following properties:

1. $\rho_0 = 1$;

2. $|\rho_k| \le 1$; a value close to 1 indicates a very strong positive correlation between $Z_t$ and $Z_{t+k}$, and a value close to $-1$ indicates a very strong negative correlation;

3. $\rho_k = \rho_{-k}$ for all $k$, i.e., it is an even function.

Both the autocovariance function and the autocorrelation function are positive semidefinite, that is, for any set of time points $t_1, \ldots, t_n$:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \gamma_{|t_i - t_j|} \ge 0, \quad \forall \alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R} \tag{3.5}$$

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \rho_{|t_i - t_j|} \ge 0, \quad \forall \alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R} \tag{3.6}$$

This is a necessary condition for a function to be an autocovariance function or an autocorrelation function of a process.

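Consider a weakly stationary process $Z_t$ with zero mean. Its partial autocorrelation function (PACF), $\phi_{kk}$, is the coefficient of partial correlation between $Z_t$ and $Z_{t+k}$ after removing the linear dependence on the variables $Z_{t+1}, Z_{t+2}, \ldots, Z_{t+k-1}$. The partial autocorrelation function is calculated in the following way (Wei [21]):

$$\phi_{11} = \rho_1 \tag{3.7}$$

$$\phi_{22} = \frac{\begin{vmatrix} 1 & \rho_1 \\ \rho_1 & \rho_2 \end{vmatrix}}{\begin{vmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{vmatrix}} = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2} \tag{3.8}$$

And, more generally:

$$\phi_{kk} = \frac{\begin{vmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_1 \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ \rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & \rho_k \end{vmatrix}}{\begin{vmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_{k-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_{k-2} \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ \rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & 1 \end{vmatrix}} \tag{3.9}$$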
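As an illustration of how these quantities might be computed in practice, the sketch below estimates the sample ACF of a series and derives the PACF from it. Note that it does not evaluate the determinants directly: it uses the Durbin-Levinson recursion, a standard alternative that yields the same $\phi_{kk}$ values. The function names are hypothetical and chosen only for this example.

```python
def sample_acf(z, max_lag):
    """Sample autocorrelations rho_0, ..., rho_max_lag of the sequence z."""
    n = len(z)
    mean = sum(z) / n
    gamma0 = sum((x - mean) ** 2 for x in z) / n
    acf = []
    for k in range(max_lag + 1):
        gamma_k = sum((z[t] - mean) * (z[t + k] - mean) for t in range(n - k)) / n
        acf.append(gamma_k / gamma0)
    return acf


def pacf_from_acf(rho):
    """Partial autocorrelations phi_11, ..., phi_KK from rho = [rho_0, ..., rho_K].

    Durbin-Levinson recursion (swapped in for the determinant ratios).
    """
    max_lag = len(rho) - 1
    pacf = []
    prev = []  # holds phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}, for j = 1, ..., k-1
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
ar1_pacf = pacf_from_acf([1.0, 0.5, 0.25, 0.125])
```

For the AR(1) autocorrelations used here, the PACF is non-zero only at lag 1, which agrees with Equation 3.8, since $\rho_2 = \rho_1^2$.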

3.1.3 White Noise

A time series $\{Z_t, t \in \mathbb{Z}\}$ is said to be a white noise series if it is a sequence of uncorrelated random variables from a fixed distribution with constant mean, $E(Z_t) = \mu$ (usually assumed to be zero), constant variance, $\mathrm{Var}(Z_t) = \sigma^2$, and $\gamma_k = \mathrm{Cov}(Z_t, Z_{t+k}) = 0$ for any $k \neq 0$. This is denoted by $\{Z_t, t \in \mathbb{Z}\} \sim WN(\mu, \sigma^2)$. A white noise process is weakly stationary and its autocovariance function is given by:

$$\gamma_k = \begin{cases} \sigma^2, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.10}$$

Its autocorrelation function is given by:

$$\rho_k = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.11}$$

And its partial autocorrelation function:

$$\phi_{kk} = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{3.12}$$

A white noise series is said to be Gaussian if $(Z_{t_1}, Z_{t_2}, \ldots, Z_{t_n})$ has a multivariate normal distribution for all $n \ge 1$ and $t_1, t_2, \ldots, t_n \in \mathbb{Z}$. In this case, weak stationarity implies strong stationarity.

3.1.4 Differencing

Data collected in a real-life situation will usually not be stationary in the mean, that is, the mean will not be constant over time. A series can be made stationary in the mean by applying an operator, which is defined next.

The backward shift operator, $B$, is defined by

$$B Z_t = Z_{t-1} \tag{3.13}$$

Hence, $B^m Z_t = Z_{t-m}$. The backward difference operator, $\nabla$, is defined as follows

$$\nabla Z_t = Z_t - Z_{t-1} = (1 - B) Z_t \tag{3.14}$$

For higher orders of the backward difference operator, $\nabla^k Z_t = \nabla(\nabla^{k-1} Z_t)$, for $k \ge 2$. For example,

$$\nabla^2 Z_t = \nabla(\nabla Z_t) = \nabla Z_t - \nabla Z_{t-1} = Z_t - 2 Z_{t-1} + Z_{t-2} \tag{3.15}$$

Applying the difference operator removes the trend of a series. If a time series $Z_t$ has a linear trend, then $\nabla Z_t$ has no trend. For a series with a non-linear trend, differences should be taken successively, i.e., first differences, then second differences, and so on, until the time series no longer has a trend (Pires [22]).
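A minimal sketch of the backward difference operator is given below, assuming the series is stored as a plain Python list; the function name `difference` is hypothetical. It shows that first differences remove a linear trend and that second differences remove a quadratic one.

```python
def difference(z, order=1):
    """Apply the backward difference operator `order` times to the list z."""
    for _ in range(order):
        # Each pass computes Z_t - Z_{t-1} and shortens the series by one value.
        z = [z[t] - z[t - 1] for t in range(1, len(z))]
    return z


linear = [2 * t + 1 for t in range(6)]   # linear trend
quadratic = [t * t for t in range(6)]    # quadratic trend

first_diff = difference(linear)          # constant series: trend removed
second_diff = difference(quadratic, 2)   # constant series after two differences
```

Each application of the operator shortens the series by one observation, which should be kept in mind when aligning differenced series with explanatory variables.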

To check whether a series is stationary in the mean, unit root tests can be performed, such as the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Hyndman and Athanasopoulos [23]). The former is one of the most popular unit root tests and its null hypothesis states that the data are not stationary, so a small p-value suggests that the data are stationary. This test estimates the parameters of the following regression model

$$Z'_t = \phi Z_{t-1} + \beta_1 Z'_{t-1} + \ldots + \beta_k Z'_{t-k} \tag{3.16}$$

where $Z'_t = Z_t - Z_{t-1}$ is the first-differenced series and $k$ is the number of lags included in the regression. The estimated coefficient $\hat{\phi}$ should be approximately zero if the series requires differencing; if it does not, the coefficient is smaller than zero.

The null hypothesis of the KPSS test is that the data are stationary. A small p-value indicates that the series is not stationary and that differencing is required. This test starts with the model

$$Z_t = \delta t + \mu_t + u_t \tag{3.17}$$

$$\mu_t = \mu_{t-1} + \varepsilon_t \tag{3.18}$$

where $\delta t$ is the deterministic trend component, $u_t$ is a stationary process, $\mu_t$ is the random walk term and $\varepsilon_t$ is an independent and identically distributed process with mean zero and variance $\sigma^2$. If $\sigma^2 = 0$, the random walk term is constant, so the null hypothesis is that $\sigma^2 = 0$. The test statistic is

$$KPSS = \frac{\sum_{t=1}^{T} S_t^2}{s^2 T^2} \tag{3.19}$$

where $T$ is the sample size, $S_t = \sum_{i=1}^{t} e_i$, $e_i$ are the residuals of a regression model on $Z_t$ and $s^2$ is the Newey-West estimate of the long-run variance (Zivot [24]).


3.1.5 Variance Stabilizing Transformations

In practice, many time series are not stationary in the variance, and these can be transformed into stationary time series using appropriate techniques.

For a non-stationary time series $\{Z_t, t \in \mathbb{Z}\}$ with finite mean and variance, the following transformation was introduced by Box and Cox [25]:

$$T_\lambda(Z_t) = \begin{cases} \dfrac{Z_t^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \ln(Z_t), & \lambda = 0 \end{cases} \tag{3.20}$$

where $Z_t$ is a positive time series and $T_\lambda(Z_t)$ is called the transformed series. If the series takes non-positive values, a positive constant can be added to it first. Usually, $\lambda$ is taken in the interval $[-1, 1]$. To find its optimal value, one can evaluate the residual mean square error over a grid of $\lambda$ values.
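The transformation in Equation 3.20 can be sketched in a few lines. The function name `box_cox` is hypothetical, and the sketch covers only the transformation itself, not the selection of $\lambda$.

```python
import math


def box_cox(z, lam):
    """Box-Cox transform of a strictly positive series z for a given lambda."""
    if any(x <= 0 for x in z):
        raise ValueError("the series must be strictly positive")
    if lam == 0:
        # Limiting case of (z**lam - 1) / lam as lam tends to zero.
        return [math.log(x) for x in z]
    return [(x ** lam - 1.0) / lam for x in z]


square_root_case = box_cox([1.0, 4.0, 9.0], 0.5)
log_case = box_cox([1.0, 10.0, 100.0], 0)
```

With $\lambda = 1$ the series is only shifted by one unit, so the transformation leaves its shape unchanged; smaller values of $\lambda$ compress large observations more strongly.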

3.1.6 Cross-correlation function

The cross-correlation function makes it possible to assess the strength and direction of the correlation between two stochastic processes (Wei [21]). Consider two stochastic processes $X_t$ and $Y_t$ with means $\mu_x$ and $\mu_y$, respectively, and standard deviations $\sigma_x$ and $\sigma_y$, respectively. The cross-covariance function between $X_t$ and $Y_t$ is given by

$$\gamma_{xy}(k) = E[(X_{t-k} - \mu_x)(Y_t - \mu_y)] \tag{3.21}$$

for $k = 0, \pm 1, \pm 2, \ldots$. The cross-correlation function (CCF) is given by

$$\rho_{xy}(k) = \frac{\gamma_{xy}(k)}{\sigma_x \sigma_y} \tag{3.22}$$

for $k = 0, \pm 1, \pm 2, \ldots$. The cross-correlation function is a dimensionless quantity and, in general, it is not symmetric around 0, that is, $\rho_{xy}(k) \neq \rho_{xy}(-k)$. Since $\gamma_{xy}(k) = E[(X_{t-k} - \mu_x)(Y_t - \mu_y)] = E[(Y_t - \mu_y)(X_{t-k} - \mu_x)] = \gamma_{yx}(-k)$, it follows that $\rho_{xy}(k) = \rho_{yx}(-k)$. It is therefore important to examine both negative and positive lags of the CCF.
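As a sketch of how the CCF might be estimated from data, the function below pairs $X_{t-k}$ with $Y_t$, following the convention of Equation 3.21. The function name `sample_ccf`, the divisor $n$ in the covariance estimate and the toy series are assumptions made for this example.

```python
def sample_ccf(x, y, k):
    """Sample cross-correlation between x_{t-k} and y_t (k may be negative)."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
    sd_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
    # Keep only the time points t for which both x[t - k] and y[t] exist.
    pairs = [(x[t - k], y[t]) for t in range(n) if 0 <= t - k < n]
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs) / n
    return cov / (sd_x * sd_y)


# Toy example in which y is x delayed by one time step.
x = [0, 1, 0, 0, 0, 0]
y = [0, 0, 1, 0, 0, 0]
lag_plus_one = sample_ccf(x, y, 1)     # strong positive value
lag_minus_one = sample_ccf(x, y, -1)   # negative value: the CCF is not symmetric
```

In the toy example the correlation is concentrated at lag $k = 1$, reflecting that `y` responds to `x` one step later; this is the kind of delayed relation one might look for between a meteorological variable and water use.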

3.2 Time series Models

In this Section, some of the classical time series models are discussed. These can be stationary or non-stationary models. The steps that can be taken to identify a good model for the data under study are also discussed, as well as points to consider in the model diagnostic (Subsections 3.2.3 and 3.2.4). In Subsection 3.2.5, the minimum mean square error forecasts for stationary and non-stationary time series models are presented.


3.2.1 Linear Stationary Models

Consider $\{Z_t, t \in \mathbb{Z}\}$ as a time series.

Autoregressive Processes (AR)

In situations where the values of a time series depend on its previous values plus a random shock, autoregressive processes are useful to describe them (Wei [21]). The variable is modeled as a linear combination of its own past values. $Z_t$ is an autoregressive process of order $p$, denoted by $AR(p)$, if

$$Z_t = \phi_1 Z_{t-1} + \ldots + \phi_p Z_{t-p} + \varepsilon_t = \sum_{i=1}^{p} \phi_i Z_{t-i} + \varepsilon_t \tag{3.23}$$

where $\{\varepsilon_t, t \in \mathbb{Z}\} \sim WN(0, \sigma_\varepsilon^2)$. This can also be written using the backward shift operator $B$:

$$Z_t = \phi_1 B Z_t + \phi_2 B^2 Z_t + \ldots + \phi_p B^p Z_t + \varepsilon_t \tag{3.24}$$

$$\phi_p(B) Z_t = (1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p) Z_t = \varepsilon_t \tag{3.25}$$

where $\phi_p(B)$ is called the characteristic polynomial.

A process is invertible if it possesses an autoregressive representation. An $AR(p)$ process is therefore always invertible, since $\sum_{j=1}^{p} |\phi_j| < \infty$. However, it is not necessarily stationary: for that, the roots of $\phi_p(B) = 0$ must lie outside the unit circle.
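As a brief worked illustration of the stationarity condition (a standard textbook case, not taken from the data of this study), consider the AR(1) process:

```latex
Z_t = \phi_1 Z_{t-1} + \varepsilon_t
\quad\Longleftrightarrow\quad
(1 - \phi_1 B) Z_t = \varepsilon_t .
% The characteristic equation 1 - \phi_1 B = 0 has the single root
B = \frac{1}{\phi_1},
% which lies outside the unit circle exactly when
|\phi_1| < 1 .
```

Hence an AR(1) process is stationary only if $|\phi_1| < 1$; for $\phi_1 = 1$ the root lies on the unit circle and the process is a random walk.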

Moving Average Processes (MA)

These processes are useful to describe situations in which events have an immediate effect that lasts for short periods of time (Wei [21]). $Z_t$ is a moving average process of order $q$, denoted by $MA(q)$, if

$$Z_t = \varepsilon_t - \theta_1 \varepsilon_{t-1} - \ldots - \theta_q \varepsilon_{t-q} \tag{3.26}$$

Again, the expression can be rewritten using the backward shift operator as follows

$$\theta_q(B) \varepsilon_t = Z_t \tag{3.27}$$

where $\theta_q(B) = 1 - \theta_1 B - \ldots - \theta_q B^q$.

A moving average process is always stationary, since $1 + \theta_1^2 + \ldots + \theta_q^2 < \infty$, but it is not necessarily invertible. It is invertible if the roots of $\theta_q(B) = 0$ lie outside the unit circle.


Autoregressive Moving Average Processes (ARMA)

A process $Z_t$ is a mixed autoregressive moving average process, $ARMA(p, q)$, if

$$\phi_p(B) Z_t = \theta_q(B) \varepsilon_t \tag{3.28}$$

where $\phi_p(B) = 1 - \phi_1 B - \ldots - \phi_p B^p$, $\theta_q(B) = 1 - \theta_1 B - \ldots - \theta_q B^q$ and $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$. It is assumed that $\phi_p(B) = 0$ and $\theta_q(B) = 0$ have no roots in common, and $\theta_q(B)/\phi_p(B)$ is called the ARMA polynomial. If the roots of $\theta_q(B) = 0$ lie outside the unit circle, the process is invertible, and if the roots of $\phi_p(B) = 0$ lie outside the unit circle, the process is stationary.

Note that an autoregressive process of order $p$ is the special case of an ARMA process with $q$ equal to zero, and a moving average process of order $q$ is an ARMA process with $p$ equal to zero.

A stationary and invertible ARMA process has both a pure autoregressive representation and a pure moving average representation.

Table 3.1 summarizes the behaviour of the ACF and PACF of the $AR(p)$, $MA(q)$ and $ARMA(p, q)$ processes. This is useful when identifying models, as well as the orders $p$ and $q$ of the models.

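Table 3.1: Summary of the properties of the stationary models (Source: Bisgaard and Kulahci [26]).

Function | AR(p) | MA(q) | ARMA(p, q)

ACF | Infinite damped exponentials and/or damped sine waves; tails off | Cuts off after lag q | Infinite damped exponentials and/or damped sine waves; tails off

PACF | Cuts off after lag p | Infinite damped exponentials and/or damped sine waves; tails off | Infinite damped exponentials and/or damped sine waves; tails off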

There is a dual relationship between AR and MA processes, which can be summarized in the following properties:

• A stationary AR process of finite order is equivalent to an infinite order MA process.

• An invertible MA process of finite order is equivalent to an infinite order AR process.

• The duality of the respective ACF and PACF functions is also present, as can be seen in Table 3.1.

3.2.2 Non-stationary Linear Models

The previous models are based on the stationarity assumption; however, in many practical situations, the time series are non-stationary.

Autoregressive Integrated Moving Average Processes (ARIMA)

An $ARIMA(p, d, q)$ process has the following representation:

$$\phi_p(B)(1 - B)^d Z_t = \theta_0 + \theta_q(B) \varepsilon_t \tag{3.29}$$

where $\theta_0$ is a real number, $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$, and $\phi_p(z) = 1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p$ and $\theta_q(z) = 1 - \theta_1 z - \theta_2 z^2 - \ldots - \theta_q z^q$ do not have any common roots. These models can be transformed into stationary models by applying the simple difference operator. For example, an $ARIMA(p, d, q)$ series can be studied within the framework of the $ARMA(p, q)$ models if the difference operator is applied $d$ times to the series. Note that the $ARMA(p, q)$ models are the special case of the $ARIMA(p, d, q)$ models with $d = 0$.

Seasonal Autoregressive Integrated Moving Average Processes (SARIMA)

A seasonal event is an event that repeats after a regular period of time, and the smallest such period is called the seasonal period (Wei [21]). The ARIMA models can be extended to model seasonal time series.

The lag-$S$ difference operator, $\nabla_S$, is defined by

$$\nabla_S X_t = X_t - X_{t-S} = (1 - B^S) X_t \tag{3.30}$$

For non-negative integers $d$ and $D$, $X_t$ is said to be a $SARIMA(p, d, q) \times (P, D, Q)_S$ process if it has the following representation

$$\Phi(B^S) \phi(B) (1 - B^S)^D (1 - B)^d X_t = \Theta(B^S) \theta(B) \varepsilon_t \tag{3.31}$$

where $\varepsilon_t \sim WN(0, \sigma_\varepsilon^2)$, and the polynomials $\phi(\cdot)$ and $\theta(\cdot)$ have no common roots and no roots inside or on the unit circle. The polynomials $\Phi(\cdot)$ and $\Theta(\cdot)$ satisfy the same properties.

3.2.3 Model Identification

In time series analysis, the first step is to identify one or more possible models; the parameters are then estimated and, finally, the model is evaluated and diagnosed.

When dealing with real data, the ACF and PACF are not known, so it is necessary to estimate them and compare the estimates with the theoretical functions of each model. For that, Table 3.1 can be quite helpful.

The identification of a model is never exact, since there is no method that does so automatically; it requires the critical judgment of the person performing the study. At this stage, graphical analysis is very important, as is the model diagnostic.

According to Wei [21], the following steps can be used to identify a model:

Step 1 Plot the time series. By analysing the plot, it is possible to see, for example, whether the series has a trend, outliers or non-constant variance. Then apply the necessary transformations to the data. One of the most common is the Box-Cox transformation, which is applied in the case of non-constant variance.

Step 2 Estimate and examine the sample autocorrelation function and the sample partial autocorrelation function to investigate whether it is necessary to apply the difference operator. For example, when the sample autocorrelation function decays very slowly and the sample partial autocorrelation function is zero for lags $k > 1$, first differences, $(1 - B) Z_t$, are usually applied.

Step 3 After applying the transformations from the previous steps, estimate and examine the sample autocorrelation function and the sample partial autocorrelation function once again in order to determine the values of $p$ and $q$. To do so, compare these functions with the theoretical functions of the models (AR, MA, ARMA) and find a match. Table 3.1 is a useful aid in this step.

Step 4 Test whether the deterministic trend term $\theta_0$ should be included when $d > 0$. The sample mean $\bar{W}$ of the differenced series, $W_t = (1 - B)^d Z_t$, is compared with its approximate standard error, $S_{\bar{W}}$.

At this stage, more than one candidate model is being considered, so the goal is to select the best model with which to continue the analysis. For this purpose, measures such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) can be used (Box et al. [27]). These measures are calculated, respectively, by

$$AIC(M) = n \ln(\hat{\sigma}_\varepsilon^2) + 2M \tag{3.32}$$

$$BIC(M) = n \ln(\hat{\sigma}_\varepsilon^2) - (n - M) \ln\left(1 - \frac{M}{n}\right) + M \ln(n) + M \ln\left[\left(\frac{\hat{\sigma}_z^2}{\hat{\sigma}_\varepsilon^2} - 1\right) \Big/ M\right] \tag{3.33}$$

where $M$ is the number of parameters of the model, $\hat{\sigma}_\varepsilon^2$ is the maximum likelihood estimator of $\sigma_\varepsilon^2$ and $\hat{\sigma}_z^2$ is the sample variance of the series.

3.2.4 Diagnostic

Once the "best" model is identified, its parameters should be estimated and it is necessary to check whether the initial assumptions are satisfied, namely:

• the error term $\varepsilon_t$ follows a normal distribution. For this, the histogram of the residuals, $\hat{\varepsilon}_t$, and the QQ-plot can be analysed and a goodness-of-fit test can be performed.

• the variance of $\varepsilon_t$ is constant. Examine the plot of the residuals or check the effect of the Box-Cox transformation for several values of $\lambda$.

• the $\varepsilon_t$ are white noise. Analyse the plots of the sample ACF and sample PACF of the residuals; additionally, a portmanteau test, such as the Ljung-Box test, can be performed.
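To make the last check more concrete, the sketch below computes the Ljung-Box statistic, $Q = n(n+2) \sum_{k=1}^{K} \hat{\rho}_k^2 / (n - k)$, from a list of residuals. The function name is hypothetical and the p-value is deliberately left out: in practice $Q$ would be compared with a chi-square quantile with $K$ minus the number of estimated parameters degrees of freedom, using a statistical library.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q based on the first max_lag sample autocorrelations."""
    n = len(residuals)
    mean = sum(residuals) / n
    c0 = sum((e - mean) ** 2 for e in residuals)
    total = 0.0
    for k in range(1, max_lag + 1):
        ck = sum((residuals[t] - mean) * (residuals[t + k] - mean)
                 for t in range(n - k))
        rho_k = ck / c0  # sample autocorrelation of the residuals at lag k
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Strongly alternating residuals give a large Q, pointing away from white noise.
q_value = ljung_box_q([1, -1, 1, -1, 1, -1], 1)
```

Large values of $Q$ indicate that the residual autocorrelations are jointly too large for the residuals to be regarded as white noise.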


3.2.5 Forecast

One of the main goals of time series analysis is to predict future values, with the smallest possible error. This Section discusses how to obtain the minimum mean square error forecasts for the different models presented in Subsections 3.2.1 and 3.2.2, following Wei [21].

Suppose that, at time $t = n$, the observations $Z_n, Z_{n-1}, Z_{n-2}, \ldots$ are available and the objective is to forecast the value $l$ steps ahead, $Z_{n+l}$, with $l > 0$.

Forecast stationary time series

Consider the case of a stationary ARMA model with representation

$$\phi(B) Z_t = \theta(B) \varepsilon_t \tag{3.34}$$

Since the model is assumed to be stationary, it has a pure moving average representation,

$$Z_t = \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \ldots \tag{3.35}$$

with $\psi_0 = 1$. Taking $t = n + l$,

$$Z_{n+l} = \sum_{j=0}^{\infty} \psi_j \varepsilon_{n+l-j} \tag{3.36}$$

Since each $Z_j$ can be written in the form (3.35), the minimum mean square error forecast of $Z_{n+l}$, $\hat{Z}_n(l)$, can be written as

$$\hat{Z}_n(l) = \psi^*_l \varepsilon_n + \psi^*_{l+1} \varepsilon_{n-1} + \psi^*_{l+2} \varepsilon_{n-2} + \ldots \tag{3.37}$$

where the weights $\psi^*_j$ are to be determined.

The goal is to forecast $Z_{n+l}$ as a linear combination of the observations $Z_n, Z_{n-1}, Z_{n-2}, \ldots$ with minimum mean square prediction error (MSPE), which is given by

$$P_n(l) = E(Z_{n+l} - \hat{Z}_n(l))^2 \tag{3.38}$$

This can be rewritten as

$$E(Z_{n+l} - \hat{Z}_n(l))^2 = \sigma_\varepsilon^2 \sum_{j=0}^{l-1} \psi_j^2 + \sigma_\varepsilon^2 \sum_{j=0}^{\infty} [\psi_{l+j} - \psi^*_{l+j}]^2 \tag{3.39}$$

The previous expression is minimized when $\psi^*_{l+j} = \psi_{l+j}$. Therefore, Equation 3.37 can be rewritten as

$$\hat{Z}_n(l) = \psi_l \varepsilon_n + \psi_{l+1} \varepsilon_{n-1} + \psi_{l+2} \varepsilon_{n-2} + \ldots \tag{3.40}$$


Now, using Equation 3.36 and the following property

$$E(\varepsilon_{n+j} \mid Z_n, Z_{n-1}, \ldots) = \begin{cases} 0, & j > 0 \\ \varepsilon_{n+j}, & j \le 0 \end{cases} \tag{3.41}$$

it can be written:

$$E(Z_{n+l} \mid Z_n, Z_{n-1}, \ldots) = \psi_l \varepsilon_n + \psi_{l+1} \varepsilon_{n-1} + \psi_{l+2} \varepsilon_{n-2} + \ldots \tag{3.42}$$

The right-hand side of the previous equation is equal to the right-hand side of Equation 3.40. Thus, the minimum mean square error forecast of $Z_{n+l}$, or the $l$-step ahead forecast of $Z_{n+l}$ at the forecast origin $n$, is equal to

$$\hat{Z}_n(l) = E(Z_{n+l} \mid Z_n, Z_{n-1}, Z_{n-2}, \ldots) \tag{3.43}$$

The forecast error, $e_n(l)$, is given by

$$e_n(l) = Z_{n+l} - \hat{Z}_n(l) = \sum_{j=0}^{l-1} \psi_j \varepsilon_{n+l-j} \tag{3.44}$$

The forecast is unbiased, since $E(e_n(l) \mid Z_t, t \le n) = 0$, and its error variance is given by

$$\mathrm{Var}(e_n(l)) = \sigma_\varepsilon^2 \sum_{j=0}^{l-1} \psi_j^2 \tag{3.45}$$

Assuming that $Z_t$ is a normal process and that $z_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution, the $(1 - \alpha) \times 100\%$ forecast limits are given by

$$\hat{Z}_n(l) \pm z_{\alpha/2} \left[1 + \sum_{j=1}^{l-1} \psi_j^2\right]^{1/2} \sigma_\varepsilon \tag{3.46}$$
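The forecast and its limits can be illustrated with the simplest stationary case, a zero-mean AR(1) process, for which $\psi_j = \phi^j$ and the $l$-step ahead forecast reduces to $\phi^l Z_n$. The sketch below is an illustration under these assumptions; the function name, the parameter values and the use of 1.96 as the approximate normal quantile for 95% limits are choices made for this example.

```python
def ar1_forecast(z_n, phi, sigma, horizon, z_quantile=1.96):
    """Forecasts and limits for a zero-mean stationary AR(1) process.

    Returns a list of (forecast, lower limit, upper limit) for l = 1, ..., horizon.
    """
    results = []
    for l in range(1, horizon + 1):
        point = phi ** l * z_n
        # With psi_j = phi**j, Var(e_n(l)) = sigma^2 * sum_{j=0}^{l-1} phi^(2j).
        variance = sigma ** 2 * sum(phi ** (2 * j) for j in range(l))
        half_width = z_quantile * variance ** 0.5
        results.append((point, point - half_width, point + half_width))
    return results


forecasts = ar1_forecast(z_n=2.0, phi=0.5, sigma=1.0, horizon=3)
```

As the horizon grows, the point forecast decays towards the process mean and the limits widen towards those implied by the unconditional variance of the process.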

Forecast non-stationary time series

Consider a non-stationary $ARIMA(p, d, q)$ model, with $d \neq 0$,

$$\phi(B)(1 - B)^d Z_t = \theta(B) \varepsilon_t \tag{3.47}$$

where $\phi(B) = 1 - \phi_1 B - \ldots - \phi_p B^p$ is a stationary autoregressive operator and $\theta(B) = 1 - \theta_1 B - \ldots - \theta_q B^q$ is an invertible moving average operator.

Since the model is invertible, it can be rewritten in an AR representation. The AR representation of the model at time $t + l$ is given by

$$\pi(B) Z_{t+l} = \varepsilon_{t+l} \tag{3.48}$$

where


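$$\pi(B) = 1 - \sum_{j=1}^{\infty} \pi_j B^j = \frac{\phi(B)(1 - B)^d}{\theta(B)} \tag{3.49}$$

Equivalently,

$$Z_{t+l} = \sum_{j=1}^{\infty} \pi_j Z_{t+l-j} + \varepsilon_{t+l} \tag{3.50}$$

Applying the operator $1 + \psi_1 B + \ldots + \psi_{l-1} B^{l-1}$ to Equation 3.50 gives

$$\sum_{j=0}^{\infty} \sum_{k=0}^{l-1} \pi_j \psi_k Z_{t+l-j-k} + \sum_{k=0}^{l-1} \psi_k \varepsilon_{t+l-k} = 0 \tag{3.51}$$

where $\pi_0 = -1$ and $\psi_0 = 1$. It can be shown that

$$\sum_{j=0}^{\infty} \sum_{k=0}^{l-1} \pi_j \psi_k Z_{t+l-j-k} = \pi_0 Z_{t+l} + \sum_{m=1}^{l-1} \sum_{i=0}^{m} \pi_{m-i} \psi_i Z_{t+l-m} + \sum_{j=1}^{\infty} \sum_{i=0}^{l-1} \pi_{l-1+j-i} \psi_i Z_{t-j+1} \tag{3.52}$$

By choosing the weights $\psi$ such that

$$\sum_{i=0}^{m} \pi_{m-i} \psi_i = 0, \quad \text{for } m = 1, 2, \ldots, l - 1 \tag{3.53}$$

the following expression is obtained:

$$Z_{t+l} = \sum_{j=1}^{\infty} \pi_j^{(l)} Z_{t-j+1} + \sum_{i=0}^{l-1} \psi_i \varepsilon_{t+l-i} \tag{3.54}$$

where $\pi_j^{(l)} = \sum_{i=0}^{l-1} \pi_{l-1+j-i} \psi_i$. Therefore, taking $t = n$ and conditioning on the observations $Z_t$, $t \le n$,

$$\hat{Z}_n(l) = E(Z_{n+l} \mid Z_t, t \le n) = \sum_{j=1}^{\infty} \pi_j^{(l)} Z_{n-j+1} \tag{3.55}$$

since $E(\varepsilon_{n+j} \mid Z_t, t \le n) = 0$ for $j > 0$.

The forecast error is then given by

$$e_n(l) = Z_{n+l} - \hat{Z}_n(l) = \sum_{j=0}^{l-1} \psi_j \varepsilon_{n+l-j} \tag{3.56}$$

The weights $\psi_j$ can be calculated recursively from the $\pi_j$ weights in the following manner

$$\psi_j = \sum_{i=0}^{j-1} \pi_{j-i} \psi_i, \quad j = 1, 2, \ldots, l - 1 \tag{3.57}$$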
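The recursion for the $\psi$ weights is easy to sketch in code. The function below assumes the $\pi_j$ weights are supplied as a finite list $[\pi_1, \pi_2, \ldots]$ and treats any weight beyond the list as zero; the function name and this truncation convention are assumptions made for the example.

```python
def psi_weights(pi, n_weights):
    """Return [psi_0, ..., psi_{n_weights}] computed from pi = [pi_1, pi_2, ...]."""
    def pi_at(j):
        # pi_j for j >= 1; weights beyond the supplied list are taken as zero.
        return pi[j - 1] if 1 <= j <= len(pi) else 0.0

    psi = [1.0]  # psi_0 = 1
    for j in range(1, n_weights + 1):
        psi.append(sum(pi_at(j - i) * psi[i] for i in range(j)))
    return psi


# Random walk, (1 - B) Z_t = eps_t: pi_1 = 1 and all other pi_j are zero.
random_walk_psi = psi_weights([1.0], 4)

# AR(1) with coefficient 0.5: pi_1 = 0.5, which gives psi_j = 0.5 ** j.
ar1_psi = psi_weights([0.5], 3)
```

For the random walk every $\psi_j$ equals 1, so by Equation 3.56 the forecast error variance grows linearly with the horizon $l$, in contrast with the bounded variance of the stationary case.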


Forecast evaluation

To evaluate the performance of a model, one can use certain measures to verify the quality of the predictions. These measures can also be used to compare different models and so aid in the selection of a model. Let $e_k$ denote the one-step prediction error, that is, the difference between the real value, $Z_{k+1}$, and the value predicted for time $k + 1$, for $k = j, \dots, n-1$.

Though there are many measures that can be used, this project focused on two. The mean absolute percentage error can be calculated by the formula

$$\mathrm{MAPE} = \left(\frac{1}{n-j}\sum_{k=j}^{n-1}\left|\frac{e_k}{Z_{k+1}}\right|\right)100\% \tag{3.58}$$

This measure is scale independent. However, it is not adequate to use if the time series takes values

equal or close to zero. Therefore, in those cases, another measure was also used, the mean absolute

error, which is calculated by

$$\mathrm{MAE} = \frac{1}{n-j}\sum_{k=j}^{n-1}|e_k| \tag{3.59}$$

The model that has the lower value for these measures will be preferred over the others.
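The two measures can be sketched as follows. This is an illustrative implementation, not the code used in the study; the function names and the example values are hypothetical, and each list pairs an observed value with its one-step prediction.

```python
def mape(actual, predicted):
    """Mean absolute percentage error (Equation 3.58), in percent."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def mae(actual, predicted):
    """Mean absolute error (Equation 3.59), in the units of the series."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical observed values and one-step predictions
observed = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
example_mape = mape(observed, forecast)
example_mae = mae(observed, forecast)

# A series with a value close to zero: a tiny absolute error
# still produces a very large percentage error.
near_zero_mape = mape([0.01, 10.0], [0.02, 10.0])
near_zero_mae = mae([0.01, 10.0], [0.02, 10.0])
```

The second example shows why MAPE is not adequate when the series takes values close to zero: an absolute error of 0.01 on the first observation alone contributes a 100% error.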

3.3 Time series clustering

Clustering is a technique used to group objects in terms of similarity. No class information is known in advance (unsupervised learning). The objects within the same cluster will be close to each other in terms of distance (they will share similar data features) and far from the members of the other clusters.

When working with time series data, clustering is used to identify patterns in the time series. Time series are a dynamic type of data due to their dependence on time. The choice of dissimilarity measure for time series is still controversial and an open research topic; however, Dynamic Time Warping (DTW) is one of the most used (Aghabozorgi et al. [28]). The periodogram based dissimilarity has been used as a distance measure between time series (Caiado et al. [29]), as well as the dissimilarity index combining temporal correlation and raw values behaviours (Chouakria and Nagabhushan [30]).

Hierarchical clustering has some advantages over other types of clustering, namely that the number of clusters is not required as an initial parameter and that the results are presented in an intuitive dendrogram (Pereira and de Mello [31]). Hierarchical clustering was used in this study and is therefore described in more detail.

3.3.1 Hierarchical Clustering

In hierarchical clustering, each point is initially placed in its own cluster and the two clusters with the lowest dissimilarity value are successively merged until all points are merged into one cluster (Giudici [32]). Along with the hierarchical clustering, a dendrogram is built. A dendrogram is a tree-like structure in which the initial clusters, which contain only one point each, are the leaves. At each step of the algorithm, one branch is drawn on the tree to represent the merge of two clusters. The final cluster, which contains all points, is represented by the root of the tree.

Different dissimilarity measures between clusters can be considered for this algorithm, depending on the choice of method. Although a wide variety of methods exists, four will be discussed: Single Linkage, Average Linkage, Complete Linkage and Ward’s Method. All of these are agglomerative methods, that is, the clusters are built from the leaves to the root. Divisive methods, which were not used in the project, instead build the clusters from the root to the leaves.

In Single Linkage, the distance between two clusters is defined as the minimum distance between the observations of the two clusters. Complete Linkage defines the distance between two clusters as the maximum distance between each point of one cluster and each point of the other cluster. In Average Linkage, the distances between each point of one cluster and each point of the second cluster are calculated and the average of these distances is taken as the distance between the clusters. Ward’s Method uses a cost function, typically the total within-cluster sum of squares, and at each step merges the two clusters whose merger yields the smallest increase of the cost function.

In practice, the choice of method is not straightforward, since no single method yields good results for all types of data. Therefore, it is necessary to try different methods in order to make the best choice.
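The linkage rules above can be sketched with a naive agglomerative loop. This is a minimal illustration on one-dimensional points, assuming the absolute difference as the distance between points; the function names are hypothetical, Ward’s Method is left out for brevity, and the loop is far less efficient than library implementations.

```python
from itertools import combinations


def pairwise_distances(c1, c2, dist):
    return [dist(p, q) for p in c1 for q in c2]


def single_linkage(c1, c2, dist):
    return min(pairwise_distances(c1, c2, dist))


def complete_linkage(c1, c2, dist):
    return max(pairwise_distances(c1, c2, dist))


def average_linkage(c1, c2, dist):
    d = pairwise_distances(c1, c2, dist)
    return sum(d) / len(d)


def agglomerate(points, linkage, dist=lambda p, q: abs(p - q)):
    """Merge clusters from the leaves to the root; return the merge heights."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    heights = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the lowest dissimilarity
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist),
        )
        heights.append(linkage(clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return heights


points = [0, 1, 5, 6, 20]
single_heights = agglomerate(points, single_linkage)
complete_heights = agglomerate(points, complete_linkage)
average_heights = agglomerate(points, average_linkage)
```

The merge heights are the values at which branches would be drawn in the dendrogram. On this toy data the three rules agree on the order of the merges but not on the heights, which is one reason why different methods can suggest different numbers of clusters.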

3.3.2 Distances

Choosing a distance is an important step in clustering, since different distances can lead to different

results. Three dissimilarity measures were considered in this study, in order to assess which one would

be better suited for the data. The dissimilarity measure that ended up being used was the Periodogram

based distance.

Dynamic Time Warping (DTW):

Dynamic Time Warping, proposed by Berndt and Clifford [33], is widely used with time series data sets and has been shown to be more robust than the Euclidean distance. DTW allows the comparison of two time series that are similar in shape but misaligned along the time axis.

In order to calculate the DTW distance between two realizations of time series, one must follow a series of steps (Pereira and de Mello [31]). Consider two realizations of time series, $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_m)$:

• Compute the distance matrix $(d_{ij})_{n \times m}$, where $d_{ij} = d(x_i, y_j) = (x_i - y_j)^2$ is the distance between points. Each element $d_{ij}$ is an alignment of points $x_i$ and $y_j$.

• Create a warping path $W$ in the distance matrix that starts in entry $(1, 1)$ and ends in entry $(n, m)$. This path defines a mapping between $x$ and $y$. Each element $w_k$ of $W$ has to be adjacent to $w_{k-1}$. Also, given $w_k = (a, b)$, then $w_{k-1} = (c, d)$ with $a \ge c$ and $b \ge d$, i.e., the points in $W$ have to be monotonically spaced in time. With this, $W = (w_1, w_2, \dots, w_K)$ with $\max(n, m) \le K < n + m - 1$.

• Select the path that minimizes the warping cost:

$$DTW(x, y) = \min\left(\sqrt{\sum_{k=1}^{K} w_k}\right) \tag{3.60}$$
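The steps above can be sketched with the usual dynamic programming recursion, which finds the minimum-cost warping path without enumerating all paths. This is an illustrative implementation under the squared point distance of the first step; the function name is hypothetical.

```python
from math import inf, sqrt


def dtw_distance(x, y):
    """DTW distance of Equation 3.60 with d(x_i, y_j) = (x_i - y_j) ** 2."""
    n, m = len(x), len(y)
    # cost[i][j]: minimal accumulated distance of a warping path that
    # aligns the first i points of x with the first j points of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # A path element must be adjacent to the previous one and
            # can never move backwards in time
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return sqrt(cost[n][m])


# Same shape, shifted by one time step: the Euclidean distance is
# sqrt(2), while DTW aligns the two series exactly
shifted = dtw_distance([0, 0, 1, 2], [0, 1, 2, 2])
# Series of different lengths can be compared as well
different_lengths = dtw_distance([0, 0, 1, 2], [0, 1, 2])
```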

Dissimilarity Index Combining Temporal Correlation and Raw Values Behaviours (CORT):

This distance combines the temporal correlation between two series with the distance between their raw values (Chouakria and Nagabhushan [30]). Consider again two time series, $x$ and $y$, both with length $n$. This dissimilarity index is given by:

$$d(x, y) = \Phi[CORT(x, y)]\,\delta(x, y) \tag{3.61}$$

Where $\Phi(u)$ is an adaptive tuning function given by Equation 3.62, $CORT(x, y)$ is a temporal correlation coefficient given by Equation 3.63 and $\delta(x, y)$ is a dissimilarity measure between the raw values of $x$ and $y$, for example, the Euclidean distance or the DTW distance.

$$\Phi(u) = \frac{2}{1 + \exp(ku)}, \quad \text{with } k \ge 0 \tag{3.62}$$

$$CORT(x, y) = \frac{\sum_{i=1}^{n-1}(x_{i+1} - x_i)(y_{i+1} - y_i)}{\sqrt{\sum_{i=1}^{n-1}(x_{i+1} - x_i)^2}\,\sqrt{\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2}} \tag{3.63}$$
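A minimal sketch of Equations 3.61 to 3.63 follows, assuming the Euclidean distance for δ and a default k = 2; both choices, as well as the function names, are assumptions made for illustration.

```python
from math import exp, sqrt


def cort(x, y):
    """Temporal correlation coefficient of Equation 3.63.

    Undefined (division by zero) if either series is constant.
    """
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return numerator / denominator


def phi(u, k=2.0):
    """Adaptive tuning function of Equation 3.62."""
    return 2.0 / (1.0 + exp(k * u))


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cort_dissimilarity(x, y, k=2.0, delta=euclidean):
    """Dissimilarity index of Equation 3.61."""
    return phi(cort(x, y), k) * delta(x, y)


x = [1.0, 2.0, 4.0]
same_direction = cort(x, [2.0, 4.0, 8.0])       # both series always move together
opposite_direction = cort(x, [-1.0, -2.0, -4.0])  # they always move in opposite ways
```

With k = 0 the tuning function equals 1 and the index reduces to δ. As k grows, series whose increments move together (CORT close to 1) are pulled closer, while series that move in opposite directions (CORT close to -1) are pushed apart, up to twice δ.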

Periodogram Based Dissimilarity:

This dissimilarity measure takes into account the distance between the periodogram coefficients of

two series (Caiado et al. [29] and Shumway and Stoffer [34]). To define the periodogram function, as

well as the dissimilarity measure, it is necessary to introduce some concepts.

The Discrete Fourier Transform (DFT) represents a discrete time signal as a periodic Fourier series. For a sequence $x = (x_1, x_2, \dots, x_n)$, define the DFT as $d(\omega_0), d(\omega_1), \dots, d(\omega_{n-1})$, where:

$$d(\omega_j) = n^{-1/2}\sum_{t=1}^{n} x_t \exp(-2\pi i \omega_j t) \tag{3.64}$$

For $j = 0, 1, \dots, n-1$, where $\omega_j = j/n$ are called the Fourier frequencies.

The periodogram of $x$ is defined as the squared modulus of the DFT:

$$I_x(\omega_j) = |d(\omega_j)|^2 \tag{3.65}$$

Let $I_x(\omega_j)$ and $I_y(\omega_j)$ be the periodograms of $x$ and $y$, respectively. One periodogram based distance is given by:

$$d_{LNP}(x, y) = \sqrt{\sum_{j=1}^{\lfloor n/2 \rfloor}\left[\log NI_x(\omega_j) - \log NI_y(\omega_j)\right]^2} \tag{3.66}$$

Where $NI_x(\omega_j)$ and $NI_y(\omega_j)$ are the normalized periodograms, i.e., $NI_x(\omega_j) = I_x(\omega_j)/\sigma_x$ and $NI_y(\omega_j) = I_y(\omega_j)/\sigma_y$, with $\sigma_x$ and $\sigma_y$ being the sample variance of $x$ and $y$, respectively.
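Equations 3.64 to 3.66 can be sketched directly with complex arithmetic. This is an illustrative, unoptimized implementation (a fast Fourier transform would normally be used); the function names are hypothetical, the sample variance uses the n - 1 denominator as an assumption, and both series are assumed to have the same length and no zero periodogram ordinate, since the logarithm would then be undefined.

```python
import cmath
from math import log, pi, sqrt


def periodogram(x):
    """Periodogram ordinates I_x(w_j) for j = 1, ..., floor(n / 2)."""
    n = len(x)
    ordinates = []
    for j in range(1, n // 2 + 1):
        omega = j / n  # Fourier frequency
        d = sum(
            x_t * cmath.exp(-2j * pi * omega * t) for t, x_t in enumerate(x, start=1)
        ) / sqrt(n)
        ordinates.append(abs(d) ** 2)
    return ordinates


def sample_variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / (len(x) - 1)


def d_lnp(x, y):
    """Log-normalized periodogram distance of Equation 3.66."""
    var_x, var_y = sample_variance(x), sample_variance(y)
    return sqrt(
        sum(
            (log(ix / var_x) - log(iy / var_y)) ** 2
            for ix, iy in zip(periodogram(x), periodogram(y))
        )
    )


x = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 2.5]
z = [2.0, 1.0, 4.0, 3.5, 6.0, 5.0, 8.0, 9.0]
doubled = [2.0 * v for v in x]
```

Because each periodogram is divided by the variance of its own series, rescaling a series does not change the distance: `d_lnp(x, doubled)` is zero up to rounding. The measure therefore compares the shape of the cyclical behaviour rather than the consumption level.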

3.3.3 Comparing clustering methods

The goal when applying a clustering algorithm is to find groups that are internally similar and cohesive and, at the same time, different from other groups (Giudici [32]). Therefore, it is important to have measures to compare how well the clustering results of each method fit the data. There are internal measures to assess the similarity of the members of each cluster, as well as external measures to evaluate how different the clusters are from each other. Once the clustering results are obtained, a number of values can be examined, such as the average within-cluster distance and the separation between clusters, among others.

These measures are also important for deciding the best number of clusters for the data set, since in hierarchical clustering the number of clusters is not given by the user. The number of clusters can vary from 2 up to a value m, with m smaller than or equal to the number of observations, and several indexes or measures can be used to decide the best number of clusters.

For this project, four indexes were used to select the optimal number of clusters.

Dunn Index

To calculate this index, the distance between the points in each cluster and the points in the remaining clusters is computed. The minimum of these distances is taken as the inter-cluster separation, min.separation. Then, for all the clusters, the distances between the points belonging to the same cluster are computed and the maximum value is taken as the maximum diameter, max.diameter. The Dunn Index is then calculated by

$$D = \frac{\text{min.separation}}{\text{max.diameter}} \tag{3.67}$$

If the clusters are quite different from each other, then the distance between them must be large and

if the objects within each cluster are similar, then the diameter of the clusters is expected to be small.

Therefore, this index should be maximized.
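A minimal sketch of the Dunn Index on one-dimensional points, assuming the absolute difference as the distance; the function name and the toy clusters are hypothetical. The index is undefined when every cluster contains a single point, since the maximum diameter is then zero.

```python
def dunn_index(clusters, dist=lambda p, q: abs(p - q)):
    """Dunn Index of Equation 3.67 for a list of clusters (lists of points)."""
    # Smallest distance between two points that lie in different clusters
    min_separation = min(
        dist(p, q)
        for a, cluster_a in enumerate(clusters)
        for cluster_b in clusters[a + 1:]
        for p in cluster_a
        for q in cluster_b
    )
    # Largest distance between two points that lie in the same cluster
    max_diameter = max(dist(p, q) for c in clusters for p in c for q in c)
    return min_separation / max_diameter


well_separated = dunn_index([[0, 1], [10, 11]])
poorly_separated = dunn_index([[0, 4], [5, 6]])
```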

Entropy

Entropy is another measure to evaluate the performance of a clustering algorithm. It measures the degree to which each cluster consists of objects of a single class (Giudici [32]). Consider $n$ observations and $K$ clusters, and let $m_i$ be the number of objects in cluster $i$ and $m_{ij}$ the number of objects of class $j$ in cluster $i$. First, the probability that an observation of cluster $i$ belongs to class $j$ is estimated.

$$p_{ij} = \frac{m_{ij}}{m_i} \tag{3.68}$$

With this, the entropy of cluster $i$ can be calculated.

$$e_i = -\sum_{j=1}^{L} p_{ij}\log_2 p_{ij} \tag{3.69}$$

Where $L$ is the number of classes. The total entropy of the cluster set can be computed by

$$e = \sum_{i=1}^{K}\frac{m_i}{n} e_i \tag{3.70}$$
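Equations 3.68 to 3.70 can be sketched as follows, with each cluster given as the list of class labels of its members. The function names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(labels):
    """Entropy e_i of one cluster (Equation 3.69)."""
    m_i = len(labels)
    # Counter only holds classes that occur, so p_ij is never zero here
    return -sum((m_ij / m_i) * log2(m_ij / m_i) for m_ij in Counter(labels).values())


def total_entropy(clusters):
    """Total entropy e of the cluster set (Equation 3.70)."""
    n = sum(len(labels) for labels in clusters)
    return sum(len(labels) / n * cluster_entropy(labels) for labels in clusters)


pure = total_entropy([["a", "a"], ["b", "b"]])
mixed = total_entropy([["a", "b"], ["a", "b"]])
partly_mixed = total_entropy([["a", "a", "b", "b"], ["c", "c"]])
```

Clusters made of a single class give an entropy of 0, and the value grows as classes are mixed within clusters, so lower values correspond to purer clusters.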

Gamma (The Baker-Hubert Gamma index)

To understand how to calculate the Gamma index, the concept of concordant pairs must be defined. Let $A$ and $B$ be two vectors of the same size with elements $a_i$ and $b_i$, respectively. If, for two indices $i$ and $j$, $a_i < a_j$ and $b_i < b_j$, then the pair $\{i, j\}$ is said to be concordant (Desgraupes [35]). The number of concordant pairs $\{i, j\}$ is denoted by $s^+$ and the number of discordant pairs is denoted by $s^-$. Note that the pairs where there is equality are not considered. The Gamma index is calculated by the formula

$$\Gamma = \frac{s^+ - s^-}{s^+ + s^-} \tag{3.71}$$

The index takes values from −1 to 1 and it should be maximized.
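The counting of concordant and discordant pairs can be sketched as below. The function name is hypothetical. The usage example assumes one common way of applying the index to a clustering, which the text above does not spell out: A holds the distances between all pairs of points and B holds 0 when the two points share a cluster and 1 otherwise.

```python
from itertools import combinations


def gamma_index(a, b):
    """Gamma index of Equation 3.71 for two vectors of the same size."""
    s_plus = s_minus = 0
    for i, j in combinations(range(len(a)), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:    # a and b are ordered the same way: concordant
            s_plus += 1
        elif product < 0:  # ordered in opposite ways: discordant
            s_minus += 1
        # pairs with an equality are not counted
    return (s_plus - s_minus) / (s_plus + s_minus)


# Points 0, 1, 10, 11 split into clusters {0, 1} and {10, 11}.
# Pairs of points, in order: (0,1) (0,10) (0,11) (1,10) (1,11) (10,11)
pair_distances = [1, 10, 11, 9, 10, 1]
in_different_clusters = [0, 1, 1, 1, 1, 0]
clustering_gamma = gamma_index(pair_distances, in_different_clusters)
```

Under that assumption, the index reaches 1 when every within-cluster distance is smaller than every between-cluster distance, as in this example.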

Silhouette Method

The Silhouette coefficient of a cluster is calculated by taking the average value of the Silhouette coefficient of all the points in the cluster (Giudici [32]). The Silhouette coefficient varies between -1 and 1. If the coefficient value of one observation is close to 1, it indicates that the observation is well placed in its cluster. If the value is close to -1, then it means the observation is poorly grouped. To compute the coefficient value of a single observation $i$, one must start by calculating the average distance from this point to all other points in the same cluster, $a_i$. Then, for all the clusters in which observation $i$ is not contained, compute the average distance to all the points and save the minimum average value, $b_i$. The coefficient value for point $i$ will be equal to

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{3.72}$$
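A minimal sketch of Equation 3.72 on one-dimensional points, assuming the absolute difference as the distance; the function names are hypothetical. The coefficient is left undefined here for clusters with a single point, for which $a_i$ cannot be computed.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def silhouette_point(i, c, clusters, dist=lambda p, q: abs(p - q)):
    """Silhouette coefficient of the point at position i of cluster number c."""
    point = clusters[c][i]
    same_cluster = [q for k, q in enumerate(clusters[c]) if k != i]
    a = mean(dist(point, q) for q in same_cluster)
    # Smallest average distance to the points of any other cluster
    b = min(
        mean(dist(point, q) for q in other)
        for k, other in enumerate(clusters)
        if k != c
    )
    return (b - a) / max(a, b)


def silhouette_cluster(c, clusters, dist=lambda p, q: abs(p - q)):
    """Average Silhouette coefficient of the points of cluster number c."""
    return mean(silhouette_point(i, c, clusters, dist) for i in range(len(clusters[c])))


clusters = [[0, 1], [10, 11]]
first_point = silhouette_point(0, 0, clusters)
first_cluster = silhouette_cluster(0, clusters)
```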

3.4 Models

In this Section, the notes of Wood [36] are followed. In order to present the Generalized Additive Models,

first the Generalized Linear Models must be described. Then, the Generalized Additive Models will be

briefly discussed, followed by a brief mention of interactions between explanatory variables. Lastly, the

Mixed Models are presented.

3.4.1 Generalized Linear Models

Suppose that $Y$ is a response random variable and $X_1, X_2, \dots, X_p$ is a set of explanatory variables. In regression models, the general idea is to predict $Y$ from $X_1, X_2, \dots, X_p$. The generalized linear models (GLM) allow the response variable to follow any distribution from the exponential family, not just the normal distribution (as in linear regression models), and allow for a degree of non-linearity in the model structure (Wood [36]). Some distributions in the exponential family are the Poisson, Binomial, Gamma and Normal distributions. These models consider a smooth monotonic link function, $g(\cdot)$, the response variable $Y$ and the mean $\mu_i = E(Y_i \mid X = x)$, where the $Y_i$ are assumed to be independent and to follow a distribution of the exponential family. The model’s general form is given by Equation 3.73.

$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \quad i = 1, 2, \dots, n \tag{3.73}$$

Where $\beta = (\beta_0, \beta_1, \dots, \beta_p)$ is a vector of unknown parameters.

3.4.2 Generalized Additive Models

In Generalized Linear Models, the link function $g(\cdot)$ is used to relate the conditional mean $\mu_i$ to the linear predictor. However, nothing requires that relationship to be linear; more generally, it can be additive. In the generalized additive models, the expected value is related to non-linear predictors through smooth functions, $f(\cdot)$, of the explanatory variables. Arbitrary smooth functions can be used, for instance, splines, which are real functions defined piecewise by polynomials; the places where the pieces connect are called knots. The form of a generalized additive model, with univariate smooth functions $f_i$, $i = 1, \dots, p$, is given by Equation 3.74.

$$g(\mu) = \beta_0 + f_1(x_1) + \dots + f_p(x_p) \tag{3.74}$$

To introduce the idea of smooth functions, consider a linear model with one smooth function of one explanatory variable.

$$y_i = f(x_i) + \varepsilon_i \tag{3.75}$$

Where $y_i$ is the response variable, $x_i$ an explanatory variable, $f$ a univariate smooth function and the $\varepsilon_i$ are independent and identically distributed $N(0, \sigma^2)$ random variables. For simplicity, suppose that $x_i \in [0, 1]$.

The aim is to estimate $f$, and for that it needs to be represented in such a way that Equation 3.75 becomes a linear model. To that end, $f$ is assumed to be a sum of basis functions $b_i(x)$ weighted by the corresponding regression coefficients $\beta_i$, where $b_i(x)$ is the $i$th basis function of a chosen basis that defines the space of functions to which $f$ belongs. Therefore, $f$ can be written as follows

$$f(x) = \sum_{i=1}^{q} b_i(x)\beta_i \tag{3.76}$$

Where $q$ is the basis dimension. With this representation, $f$ is said to be modeled by regression splines, and substituting Equation 3.76 into Equation 3.75 yields a linear model. Some examples of smoothing bases include thin plate regression splines, cubic regression splines, cyclic cubic regression splines and P-splines.

To control the smoothness of a spline, penalized regression splines can be used. The model can be fit by minimizing

$$\|y - X\beta\|^2 + \lambda\int_0^1 [f''(x)]^2\,dx \tag{3.77}$$

Where $\lambda$ is the smoothing parameter, which controls the trade-off between closeness of fit to the data and smoothness of the model. If $\lambda$ is chosen as 0, it will result in an un-penalized regression spline estimate for $f$. If $\lambda \to \infty$, then it will culminate in a straight line estimate. Since $f$ is linear in the parameters, the integral of the squared second derivative in Equation 3.77 can be written as in Equation 3.78.

$$\int_0^1 [f''(x)]^2\,dx = \beta^T S \beta \tag{3.78}$$

Where $S$ is a matrix of known coefficients. Therefore, the problem becomes the minimization of the following expression with respect to $\beta$

$$\|y - X\beta\|^2 + \lambda\beta^T S \beta \tag{3.79}$$

Then, the estimate of the regression coefficients can be obtained by

$$\hat{\beta} = (X^T X + \lambda S)^{-1} X^T y \tag{3.80}$$

Also, the hat matrix for the model, $H$, is given by

$$H = X(X^T X + \lambda S)^{-1} X^T \tag{3.81}$$

Then, it is important to choose an optimal smoothing parameter, $\lambda$, that is, one that leads to a spline estimate of $f$, $\hat{f}$, as close as possible to the true $f$, as well as to choose the basis dimension.

To choose the smoothing parameter $\lambda$, consider the notation $\hat{f}_i = \hat{f}(x_i)$ and $f_i = f(x_i)$. The parameter $\lambda$ can be chosen to minimize the following criterion:

$$M = \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i - f_i)^2 \tag{3.82}$$

$M$ cannot be used directly, because $f$ is unknown; however, an estimate of $E(M) + \sigma^2$ can be made. Let $\hat{f}^{[-i]}$ denote the model fitted to all data except $y_i$, and let $\hat{f}_i^{[-i]} = \hat{f}^{[-i]}(x_i)$. The Ordinary Cross Validation (OCV) score is defined by

$$\upsilon_0 = \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - y_i)^2 \tag{3.83}$$

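This score takes the average of the squared differences between each left-out point and its predicted value. If $y_i$ is replaced by $f_i + \varepsilon_i$ in Equation 3.83, then the following is obtained

$$\begin{aligned}\upsilon_0 &= \frac{1}{n}\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - f_i - \varepsilon_i)^2 \\ &= \frac{1}{n}\sum_{i=1}^{n}\left[(\hat{f}_i^{[-i]} - f_i)^2 - 2(\hat{f}_i^{[-i]} - f_i)\varepsilon_i + \varepsilon_i^2\right]\end{aligned} \tag{3.84}$$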

Taking the expectation of Equation 3.84 and knowing that $E(\varepsilon_i) = 0$ and that $\varepsilon_i$ and $\hat{f}_i^{[-i]}$ are independent, the following equation is obtained

$$E(\upsilon_0) = \frac{1}{n}E\left(\sum_{i=1}^{n}(\hat{f}_i^{[-i]} - f_i)^2\right) + \sigma^2 \tag{3.85}$$

Now, $\hat{f}^{[-i]} \approx \hat{f}$ with equality in the large sample limit, so $E(\upsilon_0) \approx E(M) + \sigma^2$, also with equality in the large sample limit. Therefore, if the ideal would be to minimize $M$, then choosing $\lambda$ in order to minimize $\upsilon_0$ is a reasonable approach. This process is called the Ordinary Cross Validation (OCV) method.

Computing $\upsilon_0$ from its definition is, however, computationally expensive, since it requires fitting the model $n$ times. It can be shown that

$$\upsilon_0 = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_i)^2}{(1 - H_{ii})^2} \tag{3.86}$$

Where $\hat{f}$ is the estimate from fitting to all the data and $H$ is the model hat matrix, so that $\upsilon_0$ can be computed from a single fit. In practice, the weights $1 - H_{ii}$ are replaced by the mean weight $\mathrm{tr}(I - H)/n$, where $\mathrm{tr}(\cdot)$ indicates the trace of a matrix. With this, the Generalized Cross Validation (GCV) score is obtained.

$$\upsilon_g = \frac{n\sum_{i=1}^{n}(y_i - \hat{f}_i)^2}{[\mathrm{tr}(I - H)]^2} \tag{3.87}$$

Therefore, GCV is used to choose the $\lambda$ that minimizes $\upsilon_g$.
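The penalized fit and the two cross validation scores can be sketched with a few lines of plain matrix code. This is an illustration only: the data are hypothetical, a quadratic polynomial basis stands in for a spline basis, and the small linear solver exists just to keep the example self-contained. None of it reproduces the software used in the study.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def penalized_matrix(X, S, lam):
    XtX = matmul(transpose(X), X)
    q = len(S)
    return [[XtX[r][c] + lam * S[r][c] for c in range(q)] for r in range(q)]


def fit(X, S, lam, y):
    """Coefficient estimate of Equation 3.80."""
    rhs = matmul(transpose(X), [[v] for v in y])
    return [row[0] for row in solve(penalized_matrix(X, S, lam), rhs)]


def hat_matrix(X, S, lam):
    """Hat matrix of Equation 3.81."""
    return matmul(X, solve(penalized_matrix(X, S, lam), transpose(X)))


def scores(X, S, lam, y):
    """Return (OCV from Equation 3.86, GCV from Equation 3.87)."""
    n = len(y)
    H = hat_matrix(X, S, lam)
    resid = [yi - sum(h * v for h, v in zip(row, y)) for yi, row in zip(y, H)]
    ocv = sum((r / (1 - H[i][i])) ** 2 for i, r in enumerate(resid)) / n
    trace = sum(1 - H[i][i] for i in range(n))
    gcv = n * sum(r * r for r in resid) / trace ** 2
    return ocv, gcv


def ocv_leave_one_out(X, S, lam, y):
    """OCV from its definition (Equation 3.83): refit n times."""
    total = 0.0
    for i in range(len(y)):
        beta = fit(X[:i] + X[i + 1:], S, lam, y[:i] + y[i + 1:])
        prediction = sum(b * v for b, v in zip(beta, X[i]))
        total += (prediction - y[i]) ** 2
    return total / len(y)


# Hypothetical data and the basis b(x) = (1, x, x^2) on [0, 1]
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [1.0, 1.3, 2.1, 2.0, 3.2, 3.9]
X = [[1.0, x, x * x] for x in xs]
# For f = b0 + b1 x + b2 x^2, the integral of f''(x)^2 over [0, 1] is 4 b2^2
S = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
ocv, gcv = scores(X, S, 0.5, y)
```

In this sketch the shortcut of Equation 3.86 and the leave-one-out definition agree up to rounding, and a very large $\lambda$ drives the coefficient of the quadratic term towards zero, which corresponds to the straight line estimate mentioned above. In practice, the scores would be evaluated over a grid of $\lambda$ values and the minimizer kept.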

Interactions

Interactions between multiple explanatory variables can be important to the model, and with GAM there are four main ways to include them. First, there is the multiplication of two independent variables, $x_1 \times x_2$. Second, it is possible to apply a smooth function to one variable, $f_1(x_1) \times x_2$. Third, the same smooth function can be used for both variables, $f_1(x_1) \times f_1(x_2)$, which can also be denoted by $f_1(x_1, x_2)$. Such smooths are invariant to rotation of the explanatory variable space, that is, they are isotropic. This is appropriate when the quantities are measured in the same units, for example, spatial coordinates. Lastly, there are tensor product interactions, in which different smoothing bases can be used for each variable and each is penalized separately, $f_1(x_1) \otimes f_2(x_2)$. Tensor product interactions can be written as

$$f_{12}(x_1, x_2) = \sum_{i=1}^{I}\sum_{j=1}^{J}\delta_{ij}\,b_{1i}(x_1)\,b_{2j}(x_2) \tag{3.88}$$

Where b1 and b2 are the basis functions, I and J are basis dimensions and δ is a vector of unknown

parameters. These interactions are invariant to linear rescaling of explanatory variables and appropriate

when the quantities are measured in different units or when it is necessary to have different degrees of

smoothness relative to different explanatory variables.

3.4.3 Mixed Models - GAMMs

Generalized additive models can be represented as mixed models with the smooth terms as random

effects. First, a brief description of linear mixed models and generalized linear mixed models will be

made.

In general, a linear mixed model extends the model in Equation 3.89 to the model in Equation 3.90.

$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, I\sigma^2) \tag{3.89}$$

$$y = X\beta + Zb + \varepsilon, \quad b \sim N(0, \psi), \quad \varepsilon \sim N(0, \Lambda\sigma^2) \tag{3.90}$$

Where vector b contains random effects, Z is a model matrix for the random effects and Λ is a

positive definite matrix. Usually, Λ can be the identity matrix.

The generalized linear mixed models (GLMM) follow from the linear mixed models and have the following structure

$$g(\mu_i^b) = X_i\beta + Z_i b \tag{3.91}$$

Where $\mu^b = E(y \mid b)$, $b$ follows a normal distribution with zero mean vector and covariance matrix $\psi$, which is usually parameterized in terms of a parameter vector $\theta$, and the random variables $y_i \mid b$ are independent and follow a distribution from the exponential family.

Now, the generalized additive mixed models (GAMM) can be defined as follows:

$$y_i = X_i\beta + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + \dots + Z_i b + \varepsilon_i \tag{3.92}$$

Where $X_i$ represents a row of a fixed effects model matrix, $f_j$ are smooth functions of the explanatory variables, $Z_i$ represents a row of a random effects model matrix, $b \sim N(0, \psi)$ is a vector of random effects coefficients, $\psi$ is a positive definite covariance matrix and $\varepsilon \sim N(0, \Lambda)$ is a residual error vector.

3.5 Disaggregation of water consumption

In this study, there are clients that have two water meters, one measures the indoor water use and

the other measures the outdoor water use. However, most of the clients have a single water meter that

measures both indoor and outdoor water use. A secondary goal of this dissertation was to test a method

to disaggregate the total water use of clients with a single water meter into indoor and outdoor water

use. In this Section, the steps taken in a possible disaggregation of water consumption method are

described. This method uses time series clustering, discussed in Section 3.3, a classification algorithm,

K-Nearest Neighbors, which will be described in Subsection 3.5.1, and garden watering demand models

(Generalized Additive Models).

Note that the method relies heavily on the results obtained for the data set studied when modeling the garden watering demand, as well as on the models themselves.

3.5.1 Classification algorithm: K-Nearest Neighbors (KNN)

K-Nearest Neighbors is one of the most commonly used methods, with easy interpretation and applications in classification and regression problems (Giudici [32]). In this study, the KNN algorithm was applied to predict the class of each time series in a set. The dissimilarity used was the same one used in the time series clustering algorithm, the periodogram based distance (Subsection 3.3.2).

Consider a training set composed of observations $(x, y)$ of the explanatory variables $X$ and the label variable $Y$. KNN can be used to predict a value $y_0$ of the class variable $Y$ when the values of the explanatory variables, $x_0$, are known. The set of instances for which only the explanatory variables are known is called the test set.

The steps to be taken in the KNN algorithm for each instance in the test set are as follows:

1. Specify a positive integer k. This indicates the number of nearest neighbors to take a vote from.

2. Calculate the distance between the instance and each element in the training set using the chosen

distance.

3. Sort the calculated distances in ascending order.

4. Select the k top entries that are closest to the sample.

5. Find the most common classification among these k entries. This is the predicted class of the

instance.

When the k chosen is equal to 1, the algorithm is denoted as 1-NN. In this case, the new instance is

assigned the same class as its nearest neighbor.

Note that KNN performs better if the data is on the same scale, thus the data can be normalized

before applying the method.
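The five steps can be sketched as follows, with the distance passed in as a function so that any dissimilarity (for example, the periodogram based distance) could be plugged in. The function name and the toy data are hypothetical, and ties in the vote are broken by whichever class was encountered first, which is an arbitrary choice.

```python
from collections import Counter


def knn_predict(train, instance, k, dist):
    """Predict the class of `instance` from `train`, a list of (x, label) pairs."""
    # Steps 2 to 4: compute the distances, sort them and keep the k closest
    neighbors = sorted(train, key=lambda pair: dist(pair[0], instance))[:k]
    # Step 5: majority vote among the k nearest neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def absolute_difference(p, q):
    return abs(p - q)


train = [(0, "a"), (1, "a"), (10, "b"), (11, "b"), (12, "b")]
three_nn = knn_predict(train, 2, k=3, dist=absolute_difference)
one_nn = knn_predict(train, 9, k=1, dist=absolute_difference)
```

With k = 1 the prediction is simply the class of the closest training instance, which is the 1-NN case described above.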

3.5.2 Method for disaggregation of water consumption

In this Subsection, the steps involved in disaggregating the mean total consumption of a set of clients that have a single water meter are described. Let the set of clients that have a single water meter, which measures both indoor and outdoor water use, be denoted by single water meter set.

This method takes advantage of the results obtained while modeling the garden watering demand. In order to model the daily garden watering consumption, time series clustering (discussed in Section 3.3) was applied to the set of clients that have two water meters. Let the partition of the set of clients that have two water meters (one indoor and one outdoor) be denoted by C, with size m. Accordingly, m models were built, one for each cluster. This method assumes that the majority of the monthly total water use is due to outdoor water use. Thus, the daily garden watering demand models are used in this method to estimate the total daily water use of the single water meter clients.

The steps taken in this method are as follows:

Step 1 Apply the chosen clustering algorithm with the chosen distance to the normalized single water meter set. Then, choose the best number of clusters k. Let the clusters of the single water meter set be denoted by G.

Step 2 Having chosen the optimal number k of clusters, build the representative series for each cluster. The representative series of a cluster is calculated by taking, at each time point t, the mean of all the time series in that cluster at time t. Then, this series is normalized.

Step 3 Consider the training set composed of the normalized representative series of the garden watering consumption clusters, where each series represents its own class, and the test set comprised of the normalized representative series of the new single water meter clusters. Apply 1-NN (1-Nearest Neighbor) with this training set and test set.

Step 4 According to the classification results of 1-NN, one of the m models of the garden watering consumption is used to estimate the total daily consumption for each of the k clusters.

Step 5 Calculate the percentage that the monthly outdoor use represents in the monthly total water use for each of the m clusters in C. Using the appropriate percentage values, estimate the future daily outdoor water use by taking a percentage of the estimates obtained in Step 4. For the daily estimates within the same month, the same percentage value is used.

Step 6 The estimates of the indoor water use are obtained as the difference between the estimates of the total consumption (Step 4) and the estimates of the outdoor water consumption (Step 5).
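The arithmetic of Steps 2, 5 and 6 can be sketched as below. The function names, the month keys and all numbers are hypothetical; the normalization of Step 2 is left out because its exact form is not specified here, and the clustering, classification and model prediction of Steps 1, 3 and 4 are assumed to have been carried out already.

```python
def representative_series(cluster):
    """Step 2 (before normalization): mean of the series at each time point."""
    return [sum(values) / len(values) for values in zip(*cluster)]


def split_total(total_daily, months, outdoor_share):
    """Steps 5 and 6: split daily total estimates into outdoor and indoor use.

    total_daily   -- daily total consumption estimates from Step 4
    months        -- month key of each day, e.g. "2017-08"
    outdoor_share -- fraction of the monthly total that is outdoor use,
                     taken from the matching two water meter cluster
    """
    # Every day of a month uses the same share
    outdoor = [t * outdoor_share[m] for t, m in zip(total_daily, months)]
    indoor = [t - o for t, o in zip(total_daily, outdoor)]
    return outdoor, indoor


mean_series = representative_series([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
outdoor, indoor = split_total(
    [10.0, 20.0],
    ["2017-08", "2017-09"],
    {"2017-08": 0.8, "2017-09": 0.5},
)
```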

Chapter 4

Results and Discussion

In this Chapter, the clustering results, the models obtained and the forecast results are shown and discussed. In Section 4.1, the case study is described, as well as the initial treatment applied to the data. In Section 4.2, the exploratory work is shown. The results of the time series clustering are discussed in Section 4.3. In Section 4.4, the steps for fitting the Generalized Additive Models are explained and the forecast results are presented. Lastly, in Section 4.5, preliminary work on the disaggregation of the consumption into indoor and outdoor water use for clients with a single water meter is presented and the results obtained are discussed.

4.1 Case study description and data processing

In this Section, the data available for this dissertation is described. The steps taken in data treatment

are explained, including how the missing values were dealt with. The process to select the water meters

to be used in this study is explained. Furthermore, the meteorological variables are plotted for the period

in study to understand the meteorological conditions of the region.

We received several pieces of information associated with 73 lots, including the total area of the lot, building area, outdoor area and housing typology (apartment or detached house). The outdoor area can comprise grass, trees, small bushes, an assortment of small plants, pavements, annexes, as well as a swimming pool. Each of these 73 lots has two water meters associated: one exclusive to the indoor water consumption and the other exclusive to the outdoor consumption. The first measures all the water consumed in the house by kitchen and bathroom sinks, dish washer, washing machine, showers, bathtubs, toilet flush and possibly refrigerators. The second water meter counts the external uses of water, which may include garden watering, filling and maintaining the pool, washing of pavements and vehicles, as well as the maintenance of decorative fountains. It is important to mention that all of the lots have an exterior swimming pool. Note that the majority of the outdoor water use is due to garden watering. Hence, throughout the dissertation we use the terms outdoor water use and garden watering demand interchangeably. Note that these 73 clients belong to a set of almost 3000 managed by this water utility company.

The time series data set provided contained the hourly water consumption from 01/01/2015 to

31/07/2017 of the two water meters from each lot. Subsequently, we were also given hourly water

consumption from 01/08/2017 to 30/11/2017 to be used as a test set to validate the models. For the first

stage of the study, we focused only on the consumption of the outdoor water meters, with the goal of

modeling the water consumption for garden watering. We aggregated the data to daily consumption to

build the models, since it would not be possible to model the hourly data, due to the many variations in

the patterns of each time series.

When a water meter has a data collection problem, it will not provide records for the whole day. Also,

the water meter has an indicator that registers the accumulation of water consumption since the day it

started working. Even when the water meter does not register the entries for one day, the indicator will

have those values accumulated. To fill in the missing observations, we made use of this indicator. For

example, if only one day is missing in a time series, we extract the real value of water consumed in that

day by the indicator. However, if we have n consecutive days missing, we use the indicator to know how

much water was spent over this period and divide that amount evenly over the n days.
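The filling rule can be sketched as follows. The data layout is an assumption made for illustration: `daily` holds the recorded daily consumption with `None` for missing days, and `cumulative` holds the indicator reading at the end of each recorded day. The function name is hypothetical, and gaps are assumed to have a recorded day on both sides.

```python
def fill_gaps(daily, cumulative):
    """Fill missing daily values using the accumulated indicator."""
    filled = list(daily)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < len(filled) and filled[i] is None:
            i += 1
        if start == 0 or i == len(filled):
            raise ValueError("a gap needs a recorded day on both sides")
        # Volume accumulated since the last recorded day, minus the
        # consumption of the first recorded day after the gap
        gap_volume = cumulative[i] - cumulative[start - 1] - filled[i]
        n_days = i - start
        for k in range(start, i):
            filled[k] = gap_volume / n_days  # spread evenly over the gap
    return filled


one_day_gap = fill_gaps([1.0, None, 2.0], [5.0, None, 11.0])
two_day_gap = fill_gaps([2.0, None, None, 3.0], [10.0, None, None, 19.0])
```

For a single missing day the rule recovers the exact value consumed, while for longer gaps only the total over the gap is exact and the daily values are an even split.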

Knowing beforehand that there will be extreme observations in the time series that we want to explore

and study, we did not apply any outlier detecting algorithm in this data set. These observations are an

important aspect of the real consumption’s pattern and they would be classified as outliers. This is

information that we did not wish to replace and lose. A more thorough analysis of these events in each

time series allows us to better understand the water consumption behaviour of the clients and in order

to do this each time series had to be examined individually. We were expecting to see in the daily

consumption of each client the moment when the swimming pool was filled, since it implies a large

volume of water, around the months of April or May. However, it was verified that not all the time series

have significant peaks and that the ones that do can present peaks in the consumption in any time of

the year. We needed to further study each peak to understand if it corresponded, for example, to the renovation of the water in the swimming pool, to a human error or to a leakage. Section 4.2 describes how we identified the cause of each peak and gives more information about the renovation of water in the swimming pools.

After some exploratory work, we ended up using only 57 time series in the study. The selection process took place over several stages. Firstly, bearing in mind the reference given by the water utility company that the expected average consumption for garden watering varies between 3 and 5 l/(m2.day), we identified clients that presented a consistently low consumption. It is suspected that these lots have a borehole installed, which means that the garden is not watered using the supply network. Therefore, these lots were not included in the study. Another case included clients that showed a significant alteration in the consumption during the period considered, also presenting a consistently low consumption, which made us suspect that this change was caused by the installation of a borehole. Furthermore, when inspecting each time series, some stood out with an unexpected behaviour and, since it was not possible to find a cause for it, they were excluded as well. Note that out of the 57 lots selected, only one is an apartment. Therefore, the results that were obtained can be used for detached houses, but cannot be generalized to apartments.

[Figure 4.1 plot, with two y-axes. Axes: Time (days), 2015 to 2018; mean consumption (m3/day); mean temperature (ºC).]

Figure 4.1: Mean daily water consumption for garden watering of the 57 water meters and mean daily temperature from 01/01/2015 to 30/11/2017.

Furthermore, we downloaded from the website Weather Underground (www.wunderground.com [37])
the meteorological conditions, during the time period considered, of the nearest possible location
to where the data was collected. This source gave us the mean, maximum and minimum daily
temperature and the accumulated daily precipitation. We intend to understand whether these variables
are correlated with the water consumption for garden watering and, if so, to include them in the model. In
Figure 4.1, the mean daily water consumption for garden watering of the 57 water meters is plotted with
the mean daily temperature in a two y-axis plot. The daily accumulated precipitation from 01/01/2015 to
30/11/2017 is plotted in Figure 4.2.

Figure 4.2: Daily accumulated precipitation from January 2015 to November 2017. [Plot: precipitation (mm/day) against time (days).]

Moreover, we received the hourly water consumption of lots with just one water meter, that is, the

indoor water use and outdoor water use are measured together by one water meter. For these lots, we


also have the lot, building and outdoor areas. The second goal of this study is to disaggregate the daily

consumption of these clients to find out how much of this consumption is residual water.

4.2 Exploratory Analysis

We proceeded to perform some exploratory analysis to better understand the data. This study is focused
on the water consumption for garden watering; therefore, we looked into the seasonality in the data, how
the consumption of each client relates to the respective outdoor area and what share of the total water
consumption it represents. Moreover, we closely analysed the
extreme values present in the data and attempted to relate them to the renovation of the pool water.

To explore the seasonality in the data, we built a boxplot per month of the aggregated monthly
consumption, shown in Figure 4.3. There is consumption throughout the entire year, that is, the gardens
are watered even in the winter months. There is a clear yearly seasonality: the consumption is higher
in the summer months and lower in the winter months. We note that there are extreme observations
every month. The highest value was recorded in June 2016, when one client consumed over 1500 m3.
Besides this, there are four other observations with a high value, around 1000 m3, and three
of them belong to the client with the highest observation. In addition, in December 2015 there is a
slight increase in the median with regard to November, followed by a decrease in January 2016.
The following year, there is an increase in the median in January, followed by a decrease in February.

Figure 4.3: Boxplot of the monthly consumptions of the 57 water meters between January 2015 and November 2017. [Plot: consumption (m3/month) against time (months).]

The variability is higher in June, July, August and September and lower in January, February and


December over these three years. Note that the median value starts to increase in March and April,

as well as the variability. Furthermore, we notice that the month of May is similar in the median and

variability in the years of 2015 and 2017, but it is quite different in 2016 (Figure 4.4). The median is

significantly lower and it presents less variability. This is possibly due to the unusually high amount of

precipitation in May 2016 (Figure 4.2), that could have led to a lower consumption to water the gardens

in this month.

In Figure 4.5, we see the changes in the month of November over the three years with more detail.

The month of November in 2016 does not present many changes when compared to the year 2015;

however, in 2017 the median value in this month increases significantly. Also, there is a higher

variability in November 2017. The months of October and November were significantly drier in 2017

(Figure 4.2) and that could have caused the increase in the consumption in November 2017.

Figure 4.4: Boxplot of monthly consumptions of the 57 time series, grouped by year (2015, 2016 and 2017). [Plot: consumption (m3/month) by month, January to December.]

As we are dealing with water consumption for garden watering, it is important to take into account the

outdoor area of each lot and how they relate. As mentioned in Section 4.1, the outdoor area of each lot

can consist of grass, small bushes, pavements, amongst others, which means that the actual watered

area does not correspond to the outdoor area. The watered area of each lot is smaller than the outdoor

area. As a reference, the mean size of the outdoor areas of the 57 lots is 1547 m2 and the mean lot

size is 1898.5 m2. We computed the mean daily consumption of each outdoor water meter and plotted it

against the outdoor area of each lot, as shown in Figure 4.6.

We expected that the larger the outdoor area, the higher the mean daily consumption,
i.e., a linear relationship. However, some points are far from the linear regression line in
blue, as can be seen in Figure 4.6. There is a group with large outdoor areas, above 2000 m2, that is
above the line. Also, there are a few points below the line, which seems to indicate that these clients
use less water than expected. This could be because their actual watered area is considerably smaller
than the outdoor area or because their garden typology requires less water. For example, the lot with the largest
outdoor area, over 4000 m2, has a mean daily consumption below the line. Since it has
such a large outdoor area, it would be expected to have a high mean daily consumption; however, it
has a lower value than expected. Possibly, not all of the outdoor area is looked after. Additionally, note
that the majority of the mean daily consumption values lie between 2.5 m3 and 7.5 m3.
Moreover, Loh et al. [7] did not find any relationship between the watered area of the outdoor space and
the outdoor water consumption, which is not our case, since there is a linear trend, meaning that the
larger the outdoor area, the higher the mean daily consumption.

Figure 4.5: Monthly consumption in November for three years (2015, 2016 and 2017). [Plot: consumption (m3/month) in November, grouped by year.]

Figure 4.6: Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area for the 57 water meters. [Plot: mean daily consumption (m3/day) against outdoor area (m2).]
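The comparison of each client with the regression line can be sketched as follows. This is only an illustration, assuming an ordinary least squares fit of mean daily consumption on outdoor area; the function names `fit_line` and `residuals` and the data values are hypothetical and are not taken from the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed minus fitted value: positive means the point is above the line."""
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


# Hypothetical data: outdoor area (m2) and mean daily consumption (m3/day).
areas = [500.0, 1000.0, 1500.0, 2000.0, 4000.0]
consumption = [2.5, 4.0, 6.5, 9.0, 8.0]

slope, intercept = fit_line(areas, consumption)
for area, res in zip(areas, residuals(areas, consumption, slope, intercept)):
    side = "above" if res > 0 else "below"
    print(f"{area:7.0f} m2: {side} the line by {abs(res):.2f} m3/day")
```

A positive residual corresponds to a client using more water than the linear trend suggests for the respective area, and a negative residual to a client using less.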


To know whether there was a meaningful relation between the mean daily consumption and the watered
areas, we resorted to Google Maps and took measurements of the non-watered areas, that is, swimming
pools and their surrounding pavements, exterior garages, car entries and front entry pavements. This
way, we obtained a value that is closer to the real watered area of each lot. It was not possible to get this
estimated watered area for a few lots for which Google Maps did not give the exact location. Moreover,
it was not possible to collect the garden typology description of each watered area. However, it was
possible to verify that the lots have a grass area, small bushes, small patches of plants and
trees, with a single exception: one of the considered lots has no grass area, only bushes, small plants and trees.
The mean daily consumption of each outdoor water meter plotted against the estimated watered area
can be found in Figure 4.7. In this plot, there is still a group of points above the regression line with
corresponding estimated watered areas larger than 1500 m2. There is now also a group of points that
is further from the regression line, but with smaller estimated watered areas.

Figure 4.7: Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area for the 57 water meters. [Plot: mean daily consumption (m3/day) against estimated watered area (m2).]

To better understand the consumption behaviour of each client, that is, which clients
use more or less water than expected for their respective outdoor areas and which ones stay within
the expected values, we computed the mean daily consumption in litres per square meter
of outdoor area for each client (l/(m2.day)) and plotted it against the outdoor area. This scatterplot
can be found in Figure 4.8 (a). The majority of the points are concentrated between 3 l/(m2.day) and
5 l/(m2.day), which is the reference for a reasonable consumption, as mentioned in Section 4.1. It is then
easy to identify the clients that are consistently using a high volume of water to water their gardens.
This includes the client with the smallest outdoor area, who consumes an extremely high mean quantity
of water per day. Also, there are a few clients that consume on average less than 3 l/(m2.day).
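The conversion from a mean daily consumption in m3/day to l/(m2.day) and the comparison with the 3 to 5 l/(m2.day) reference can be sketched as below. The function names and the example client are hypothetical; only the reference range and the mean outdoor area of 1547 m2 come from the text.

```python
def litres_per_m2_day(mean_daily_m3: float, area_m2: float) -> float:
    """Convert a mean daily consumption in m3/day into l/(m2.day)."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return 1000.0 * mean_daily_m3 / area_m2  # 1 m3 = 1000 l


def classify(rate: float, low: float = 3.0, high: float = 5.0) -> str:
    """Label a client relative to the 3 to 5 l/(m2.day) reference range."""
    if rate < low:
        return "low"
    if rate > high:
        return "high"
    return "reference"


# Hypothetical client: 6 m3/day on an outdoor area of 1547 m2 (the mean area).
rate = litres_per_m2_day(6.0, 1547.0)
print(f"{rate:.2f} l/(m2.day): {classify(rate)}")
```

The same function applies to the estimated watered area: replacing the outdoor area by a smaller watered area raises the rate, which is why some clients stop being classified as low consumers.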


(a) Scatterplot of the mean daily consumption of each outdoor water meter versus outdoor area. [Plot: consumption (l/(m2.day)) against outdoor area (m2).]

(b) Scatterplot of the mean daily consumption of each outdoor water meter versus estimated watered area. [Plot: consumption (l/(m2.day)) against estimated watered area (m2).]

Figure 4.8: Scatterplots of the mean daily consumption versus a) outdoor area, b) estimated watered area.

In Figure 4.8 (b), we plot the mean daily consumption in litres per square meter of estimated
watered area against the estimated watered area. In this case, the majority of the points are concentrated
between 3 l/(m2.day) and 6 l/(m2.day). Here, a group of points with values
above 7.5 l/(m2.day) stands out. These clients consistently consume a high volume of water per day with regard to
their respective estimated watered areas. They are the same clients identified in Figure 4.8
(a) as big consumers, that is, with a mean daily consumption per square meter of outdoor area
above 5 l, with the addition of one client. Also, we notice that all of the big consumers have a smaller
estimated watered area, below 1500 m2. In addition, this plot reveals more information about the
behaviour of the 16 clients that have a low average consumption per square meter of outdoor area, below
3 l/(m2.day). With the new values computed considering the estimated watered area, only 8 clients
have a mean daily consumption per square meter below 3 l. That is, of the 16 clients that were being
considered as low consumers, only 8 remain so. We note that one of these low consumers
is the only lot in this set that has no grass area, and that there were six clients in total for which we could not
estimate the watered area. Also, the client with the largest outdoor area continues to be considered
a low consumer.

Regarding the low consumers, it can be important to understand which practices they adopt to have

these lower consumptions, for example, their consumption habits and garden typology. On the other

hand, the big consumers are potential clients to be targets of awareness-raising campaigns to reduce

the consumption.

It is also important to understand the weight of the water consumption for garden watering in the

total monthly consumption. In Figure 4.9, the median water consumption for garden watering per month

and the median indoor consumption per month of the 57 clients are presented in a stacked bar plot. On

top of every bar, the percentage of water consumed for garden watering in that month is presented. It

is very clear that the consumption for garden watering represents the majority of the total consumption

per month. For these clients, the weight of the garden watering in the total consumption is much more

significant than what Loh et al. [7] found in Perth, Australia. That study, conducted between 1998 and

2001, found that 56% of the total water consumption was due to outdoor water use.


Figure 4.9: Median monthly indoor consumption and median monthly water consumption for garden watering of the 57 water meters between January 2015 and November 2017. [Stacked bar plot: consumption (m3/month) against time (months), by type (indoor, garden watering); the percentage labels on top of the bars range from 85 to 96.]

In Figure 4.10, the mean daily indoor consumption and the mean water consumption for garden
watering of the 57 clients are plotted in a two y-axis plot, in order to compare the patterns. We note that
there is also a seasonality in the indoor consumption: it is higher in the summer months and lower in the
winter months. However, the difference between these two periods is not as strong as in the mean water
consumption for garden watering.

Figure 4.10: Mean daily pattern of indoor and water consumption for garden watering of the 57 water meters between 01/01/2015 and 30/11/2017. [Plot with two y-axes, both in m3/day: mean garden watering and mean indoor consumption, against time (days).]

As mentioned, the 57 lots we worked with to build the model all have an exterior swimming pool. We
wanted to know more about the renovation of the pool water, namely at what time of the year
it occurs, how often, how long it takes on average to fill a pool and what the average water flow
(m3/h) is.

To identify the filling of a pool in the consumption, it was necessary to look at each daily time series

individually. It was very clear which time series had a significant peak. In order to have more certainty

that a peak in the consumption corresponds to the renovation of the water of the swimming pool, we used

Google Maps to measure the surface area of the pool of each lot. By considering a standard residential

swimming pool depth, we estimated the volume of the pool in each lot. With this, we can compare the

quantity of water spent by a client during the peak with the respective estimated pool volume. However,

it was not possible to perform this estimation for a few of the pools, since we did not know the exact

location of a few lots.
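The comparison between the volume of a peak and the estimated pool volume can be sketched as follows. The depth of 1.5 m and the 25% tolerance are hypothetical values chosen only for illustration, since the text does not state the standard depth or the matching criterion that were used; the function names are also hypothetical.

```python
def estimated_pool_volume(surface_m2: float, depth_m: float = 1.5) -> float:
    """Pool volume (m3) from the measured surface area and an assumed depth.

    The default depth of 1.5 m is a placeholder, not the value used in the study.
    """
    return surface_m2 * depth_m


def is_plausible_pool_filling(peak_m3: float, pool_volume_m3: float,
                              tolerance: float = 0.25) -> bool:
    """True if the peak volume is within a relative tolerance of the pool volume.

    The tolerance is a hypothetical matching criterion.
    """
    return abs(peak_m3 - pool_volume_m3) <= tolerance * pool_volume_m3


def mean_flow(volume_m3: float, hours: float) -> float:
    """Average water flow (m3/h) during an event of the given duration."""
    return volume_m3 / hours


# Hypothetical lot: pool surface of 50 m2 measured on a map.
volume = estimated_pool_volume(50.0)
print(volume, is_plausible_pool_filling(80.0, volume), mean_flow(80.0, 32.0))
```

A peak that fails this check is not necessarily unrelated to the pool, but it is a candidate for closer inspection.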

Table 4.1: Information regarding the extreme observations of the 57 outdoor water meters.

Mean estimated pool volume                          75.85 m3
Median estimated pool volume                        71.30 m3
Average duration of pool filling                    30.90 hours
Average water flow during pool filling              3.2 m3/h
Events caused possibly by pool filling              12
Estimated volume of water spent filling pools       950.760 m3
Events caused possibly by filling of a reservoir    14
Events with unknown cause                           16

In addition, we inspected more closely the peaks that were ruled out as being caused by the filling

of a pool. In some cases, we see a continuous consumption with the same water flow that begins at the

end of the afternoon and stops in the morning of the next day. We believe that these cases happened

due to human error.

For other peaks in the consumption during the winter months, as well as for some peaks that presented a

variable water flow with an erratic pattern, it was not possible to discover the cause.

4.3 Time Series Clustering

In this Section, the clustering results of the 57 time series are described. We discuss the different

clustering algorithms and dissimilarity measures used to group the clients by consumption pattern, in

order to find the algorithm and dissimilarity measure that best fit the data. Moreover, some exploratory

work was done to understand how different the groups obtained are and in what way they differ.

So far, we have been working with the whole data set, from 01/01/2015 to 30/11/2017, but from this

point on we worked with the data until 31/07/2017, leaving the last four months as a test set. The aim

was to group the series by consumption pattern and not by scale. For this reason, it was necessary to

normalize the time series before applying the clustering algorithms. The goal is to build a model for each

one of the resulting clusters.


We applied three normalizations to each time series, the Standard, the "Median-Mad" and the "Min-Max",
whose formulas are displayed in equations 4.1, 4.2 and 4.3, respectively.

$$ y_{t_i} = \frac{x_{t_i} - \mathrm{mean}_i(x_{t_1}, \dots, x_{t_n})}{\mathrm{sd}_i(x_{t_1}, \dots, x_{t_n})} \tag{4.1} $$

$$ y_{t_i} = \frac{x_{t_i} - \mathrm{median}_i(x_{t_1}, \dots, x_{t_n})}{\mathrm{mad}_i(x_{t_1}, \dots, x_{t_n})} \tag{4.2} $$

$$ y_{t_i} = \frac{x_{t_i} - \min_i(x_{t_1}, \dots, x_{t_n})}{\max_i(x_{t_1}, \dots, x_{t_n}) - \min_i(x_{t_1}, \dots, x_{t_n})} \tag{4.3} $$

where $Y_t = (y_{t_1}, \dots, y_{t_n})$ is the normalized time series, $X_t = (x_{t_1}, \dots, x_{t_n})$ is the original time series,
sd represents the standard deviation and mad stands for median absolute deviation.
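The three normalizations can be sketched in a few lines. This is a minimal illustration and not the code used in the study; it assumes the sample standard deviation (divisor n - 1) and an unscaled median absolute deviation, neither of which is specified in the text, and the function names are hypothetical.

```python
import statistics


def standard(x):
    """Equation 4.1: subtract the mean and divide by the standard deviation."""
    m = statistics.mean(x)
    s = statistics.stdev(x)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in x]


def median_mad(x):
    """Equation 4.2: subtract the median and divide by the MAD."""
    med = statistics.median(x)
    # Assumption: unscaled MAD, i.e. the median of the absolute deviations.
    mad = statistics.median([abs(v - med) for v in x])
    if mad == 0:
        raise ValueError("MAD is zero; the series cannot be normalized this way")
    return [(v - med) / mad for v in x]


def min_max(x):
    """Equation 4.3: rescale the series to the interval [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


# Hypothetical short series with one large value.
series = [1, 2, 3, 4, 10]
print(standard(series))
print(median_mad(series))
print(min_max(series))
```

All three transformations remove the scale of the series, so that the subsequent clustering compares consumption patterns rather than consumption levels.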

Next, we show the hierarchical method and distance chosen, the decision of the best number of

clusters, along with a discussion of the clustering results comparing the clusters.

4.3.1 Hierarchical clustering

We applied the Ward Method, Single Linkage, Average Linkage and Complete Linkage. However, Single
and Average Linkage gave consistently poor results for all the distances, so we only consider here
Complete Linkage and the Ward Method. In addition, the normalization given by Equation 4.2 led
to poor clustering results, separating just one time series into one cluster and all the other time series
into another. Also, better results were obtained with the Standard normalization than with the
"Min-Max" normalization. For that reason, we will only discuss results obtained
with the Standard normalization.

We applied the two clustering algorithms with three different distances, Dynamic Time Warping

(DTW), Dissimilarity Index Combining Temporal Correlation and Raw Values Behaviours and Periodogram

Based Dissimilarity for the normalized set of time series. With both clustering algorithms, the distance

that led to better results was the periodogram based dissimilarity. Complete Linkage performed slightly

better than the Ward Method, when comparing the Dunn, Entropy, Gamma and Silhouette indexes.

These values can be found in Table 4.2. For Complete Linkage, we chose 5 as the best number of

clusters. The steps that led to this choice are explained in Subsection 4.3.2.
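The combination of a periodogram based dissimilarity with Complete Linkage can be sketched as below. This is an illustration under stated assumptions: the dissimilarity is taken to be the Euclidean distance between raw periodograms, which may differ from the exact variant used in the study, and the agglomerative procedure is a naive implementation for small data sets. All function names and the four synthetic series are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Periodogram ordinates at the Fourier frequencies k/n, k = 1..n//2."""
    n = len(x)
    ordinates = []
    for k in range(1, n // 2 + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def distance_matrix(series):
    """Pairwise Euclidean distances between periodograms (assumed variant)."""
    pgrams = [periodogram(s) for s in series]
    return [[math.dist(p, q) for q in pgrams] for p in pgrams]


def complete_linkage(dist, k):
    """Merge clusters until k remain; the cluster distance is the largest
    pairwise distance between members (Complete Linkage)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Synthetic example: two series with period 12 and two with period 4.
n = 24
series = [
    [math.sin(2 * math.pi * t / 12) for t in range(n)],
    [math.cos(2 * math.pi * t / 12) for t in range(n)],
    [math.sin(2 * math.pi * t / 4) for t in range(n)],
    [math.cos(2 * math.pi * t / 4) for t in range(n)],
]
clusters = complete_linkage(distance_matrix(series), k=2)
print(clusters)
```

Because the periodogram discards phase information, series with the same dominant period are grouped together even when they are shifted in time, which is the behaviour wanted when clustering by pattern.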

Table 4.2: Comparison of the values of the four indexes for the best number of clusters for Ward Method and Complete Linkage with periodogram based distance when using the Standard normalization.

                   Number of clusters   Dunn    Entropy   Gamma   Silhouette
Ward Method        5                    0.251   1.590     0.781   0.218
Complete Linkage   5                    0.321   1.600     0.794   0.201


Figure 4.11: The number of clusters versus Dunn index. [Plot: Dunn index for 2 to 8 clusters.]

Figure 4.12: The number of clusters versus Entropy. [Plot: Entropy for 2 to 8 clusters.]

Figure 4.13: The number of clusters versus Gamma index. [Plot: Gamma index for 2 to 8 clusters.]

Figure 4.14: The number of clusters versus Silhouette index. [Plot: Silhouette index for 2 to 8 clusters.]

4.3.2 Choosing the best number of clusters

To choose the best number of clusters, we used the indexes Dunn, Entropy, Gamma and Silhouette. We

computed the values of the different indexes for k clusters, k varying between 2 and 8. The plots for the

four indexes are presented below in Figure 4.11, Figure 4.12, Figure 4.13 and Figure 4.14.
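As an illustration of how one of these indexes is evaluated for competing partitions, the sketch below computes the Dunn index from a distance matrix. It assumes the classical definition (smallest distance between points of different clusters divided by the largest cluster diameter); other variants exist and the text does not state which one was used. The function name and the toy data are hypothetical.

```python
def dunn_index(dist, clusters):
    """Classical Dunn index: minimum between-cluster distance divided by the
    maximum within-cluster diameter. Higher values indicate compact,
    well separated clusters."""
    min_between = min(
        dist[i][j]
        for a in range(len(clusters))
        for b in range(a + 1, len(clusters))
        for i in clusters[a]
        for j in clusters[b]
    )
    max_diameter = max(dist[i][j] for c in clusters for i in c for j in c)
    if max_diameter == 0:
        raise ValueError("all clusters are singletons; the index is undefined")
    return min_between / max_diameter


# Toy example: five points on a line, distances are absolute differences.
positions = [0, 1, 10, 11, 30]
dist = [[abs(a - b) for b in positions] for a in positions]

three = [[0, 1], [2, 3], [4]]
two = [[0, 1, 2, 3], [4]]
print(dunn_index(dist, three), dunn_index(dist, two))
```

In this toy example the partition into three clusters obtains the larger value, so the Dunn index alone would prefer it over the partition into two.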

Table 4.3: Best number of clusters according to each index using complete linkage method with periodogram based distance.

                          Dunn   Entropy   Gamma   Silhouette
Best number of clusters   8      2         8       2

Even though the maximum values of both the Dunn and Gamma index occur with 8 clusters, they

also have high values for 5 clusters and the difference is not very relevant. For both the Entropy and

Silhouette, the best number of clusters is 2. Since it is not relevant to divide the set of time series into

two groups, we chose 5 as the best number of clusters. The partition of the dendrogram into 5 clusters

can be found in Figure 4.15 and the size of each cluster is specified in Table 4.4.

Table 4.4: Size of each cluster.

Cluster   1    2    3   4   5
Size      20   11   7   6   13


Figure 4.15: Partition of the 57 time series in 5 clusters. [Dendrogram: height on the y-axis, leaves V1 to V57, coloured by cluster (1 to 5).]

4.3.3 Discussion of the clustering results

Once we had the clustering results, we intended to understand better the differences between each

cluster, what characterized them and how different they are from each other. For that, we resorted to

several plots.

First, we calculated a representative series for each one of the 5 clusters. The representative series of
a cluster is obtained by taking, at each time point t, the mean of all the time series in that cluster at time
t. Then, this series is normalized with the Standard normalization. As an example, the representative
series of Cluster 1 is shown in Figure 4.16.
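The construction of a representative series can be sketched as follows: a pointwise mean over the members of the cluster, followed by the Standard normalization. The function name is hypothetical, the sample standard deviation is an assumption, and the sketch makes no assumption on whether the input series are raw or already normalized.

```python
import statistics


def representative_series(cluster_series):
    """Pointwise mean of the series in a cluster, then Standard normalization."""
    # zip(*...) groups the values of all series observed at the same time point.
    means = [statistics.mean(values) for values in zip(*cluster_series)]
    m = statistics.mean(means)
    s = statistics.stdev(means)  # assumption: sample standard deviation
    return [(v - m) / s for v in means]


# Hypothetical cluster with two short series of equal length.
cluster = [[1, 2, 3], [3, 4, 5]]
print(representative_series(cluster))
```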

Figure 4.16: Representative series of Cluster 1 between 01/01/2015 and 31/07/2017. [Plot: normalized value against time (days).]

For each cluster, we aggregated the consumption to monthly consumption, then normalized each

time series (with Standard Normalization) and aggregated by the median in order to compare the pattern.

In Figure 4.17, the plot of the normalized monthly consumption per cluster is presented. The patterns

are similar for all clusters and there is not a clear difference between each cluster.

Figure 4.17: Normalized monthly consumption aggregated by the median for each cluster between January 2015 and July 2017. [Plot: normalized value against time (months), one line per cluster (1 to 5).]

In Figure 4.18, the boxplot of the outdoor area per cluster is presented. Cluster 4 has the highest
median value, followed by Cluster 5, while Clusters 1, 2 and 3 have smaller values that are very close to
each other. It seems that, even though the outdoor area was not used to group the series, it is implicit in
the clusters. Cluster 4 has members with larger outdoor areas, suggesting that some of the clients with
bigger lots have a similar consumption pattern. Table 4.5 presents a summary of the outdoor
areas by cluster.

Figure 4.18: Boxplot of the outdoor area per cluster. [Plot: area (m2) by cluster (1 to 5).]

Table 4.5: Summary of the outdoor areas per cluster.

Outdoor area (m2)   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Minimum             429.9       158         775.8       901.5       680.8
Median              1264        1318        1361        1839        1514
Mean                1396        1406        1473        2096        1684
Maximum             2850        2494        2594        4185        3417

The boxplot of the mean daily consumption of each outdoor water meter per cluster is in Figure 4.19.
Note that these values are not normalized, since we intended to understand the consumption scale of
each cluster. In this plot, we see again that Cluster 4 stands out by having the highest median value.
Clusters 1 and 3 have similar medians, Cluster 2 has a lower median value and Cluster 5 has the
lowest median value.

Figure 4.19: Boxplot of the mean daily water consumption for garden watering per cluster. [Plot: value (m3) by cluster (1 to 5).]

The members of Clusters 1, 2 and 3 have some similarities between them that separate them from
Cluster 4 and Cluster 5. Also, Cluster 4 is very different from the others. Additionally, the representative
series of Clusters 1, 2 and 3 did not seem to be very different. Therefore, we decided to join the first
3 clusters, creating a new Cluster 1 and hence avoiding building five different models. In Figure 4.20, the
boxplot per month of the normalized monthly consumption of the new Cluster 1 is presented. In this
figure, the yearly seasonality is very clear: the summer months (June, July and August) correspond
to higher values and the months of December, January and February correspond to lower values. In
Figure 4.21, we show the boxplot per month of the year of the normalized monthly consumption of the
same cluster, where we can see that July and August are quite similar to each other in terms of median
value and variability, while June has a lower median value relative to these months. Also, January,
February and December have median values close to each other.

We also verified that there is no difference between weekdays and weekend days: as can be seen in

Figure 4.22, the median value remains approximately the same for all days of the week.

We looked at the daily pattern per month of each cluster. In Figure 4.23, we show the plot for Cluster
1. We can see that there are two peaks during the day, one around 5 a.m. and the other around 10 p.m.
In the months of June, July, August and September, these peaks are much more pronounced than in the
months of January, February, November and December. During daytime hours, between 8 a.m. and 7 p.m.,
the values are much lower and in the months of January, February, November and December
they are close to zero. These results are particularly important for the water utility company when defining
the period of the day used for the real loss analysis. Usually, this period is set during the night, when there is
approximately no indoor consumption. However, in a region with a very significant outdoor water use,
the period to monitor real losses should be during the day.


Figure 4.20: Boxplot of the normalized monthly consumption of the new Cluster 1 between January 2015 and July 2017. [Plot: normalized value against time (months).]

Figure 4.21: Boxplot per month of the year of the normalized monthly consumption of the new Cluster 1. [Plot: normalized value by month, January to December.]

Figure 4.22: Boxplot per day of the week of the new Cluster 1. [Plot: value by day of the week, Monday to Sunday.]

Figure 4.23: Daily pattern per month of the new Cluster 1. [Plot: value against hour of the day, one line per month (1 to 12).]

Furthermore, while building the models we encountered difficulties in finding a good model for Cluster

4. For that reason, we decided to apply clustering to the 6 members of this cluster. With this, we found

that 2 members were in fact quite different from the other 4 members, as well as different from each

other. Therefore, these 2 members were no longer considered for building the models. In Table 4.6, the

size of the final clusters is indicated.

Table 4.6: Size of each cluster.

Cluster   1    2   3
Size      38   4   13

In Figure 4.24, the boxplot of the outdoor area per cluster for the final 3 clusters is shown. Cluster

2 is significantly different from the other 2 clusters, having the highest median, 1815m2, and highest

variability. As for Cluster 1 and Cluster 3, the outdoor areas of these two clusters do not differ as much:

the median value of Cluster 3, equal to 1514m2, is slightly higher than the median of Cluster 1, 1300m2.

The summary of the outdoor areas per cluster can be seen in Table 4.7.

Table 4.7: Summary of the outdoor areas per cluster (final clusters).

Outdoor area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum             158         901.5       680.8
Median              1300        1815        1514
Mean                1413        2171        1684
Maximum             2850        4155        3417

In Figure 4.26, the boxplot of the building area per cluster is presented. Again, Cluster 2 stands out
with the highest median value, equal to 695.2m2. For Cluster 1 and Cluster 3, the median values are very
close to each other, being equal to 409.3m2 and 424.4m2, respectively.

Figure 4.24: Boxplot of the outdoor area per cluster (final clusters). [Plot: area (m2) by cluster (1 to 3).]

Figure 4.25: Boxplot of the estimated garden area per cluster (final clusters). [Plot: area (m2) by cluster (1 to 3).]

Table 4.8: Summary of the estimated watered areas per cluster (final clusters).

Estimated watered area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum                       158         554.5       667.4
Median                        1005        1488        1179
Mean                          1082        1708        1354
Maximum                       2415        3301        2853

Table 4.9: Summary of the building areas per cluster (final clusters).

Building area (m2)   Cluster 1   Cluster 2   Cluster 3
Minimum              177.4       468.1       259.7
Median               409.3       695.2       424.4
Mean                 405.9       664         431.4
Maximum              638         797.6       570.1


Figure 4.26: Boxplot of the building area per cluster (final clusters). [Plot: area (m2) by cluster (1 to 3).]

In Table 4.10, we see that the average ratio between the outdoor area and the lot area of each cluster

is quite similar for all three. Therefore, this measure cannot be used to differentiate the clusters.

Table 4.10: Average ratio between outdoor area and lot area per cluster (final clusters).

Cluster 1 2 3

Percentage 79.3% 78.1% 80.8%

The time series were not clustered by the scale of the consumption, as we can see in Figure 4.27,

where the mean daily consumption of each water meter is plotted against the outdoor area. All the

clusters have members with high, average and low mean daily consumption.

Figure 4.27: Scatterplot of the mean daily consumption (m3) versus outdoor area (m2) grouped by cluster (final clusters).

We also looked at the mean estimated pool volume per cluster, to investigate if there were significant

differences in pool size between the clusters. In Table 4.11, the mean estimated pool volume per cluster

is shown.

Table 4.11: Mean estimated pool volume per cluster.

Cluster 1 2 3

Volume (m3) 74.5 58.4 77.8

Note that it was not possible to estimate the pool volume for certain lots, as explained in Section 4.2. The mean estimated pool volumes of Cluster 1 and Cluster 3 are close, while the value of Cluster 2 is noticeably lower. However, Cluster 2 has only 4 members, and the pool volume could not be estimated for 2 of them, so this difference should be interpreted with caution.

Moreover, we analysed the monthly peak factor for each cluster in 2015 and 2016; the values are presented in Table 4.12. The monthly peak factor is the ratio of the maximum monthly consumption observed during the year to the average monthly consumption of the same year. In 2015, all of the clusters had their highest monthly consumption in July, and the monthly peak factors of the three clusters did not differ substantially. In 2016, Cluster 1 and Cluster 3 had their highest monthly consumption in August and Cluster 2 in July. Again, the monthly peak factors of the three clusters were not substantially different.
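As an illustration, the monthly peak factor defined above can be computed as in the following minimal Python sketch. The function name and the monthly values are hypothetical and are not taken from the data set; only the definition (maximum monthly consumption divided by the average monthly consumption of the same year) comes from the text.

```python
from statistics import mean


def monthly_peak_factor(monthly_consumption):
    """Return (peak_month, peak_factor) for one year of monthly totals.

    peak_factor = maximum monthly consumption / average monthly consumption.
    """
    if len(monthly_consumption) != 12:
        raise ValueError("expected 12 monthly values")
    peak = max(monthly_consumption)
    peak_month = monthly_consumption.index(peak) + 1  # 1 = January
    return peak_month, peak / mean(monthly_consumption)


# Hypothetical monthly totals (m3) of one cluster, for illustration only.
example_year = [40, 45, 60, 90, 120, 150, 200, 190, 130, 80, 50, 45]
peak_month, factor = monthly_peak_factor(example_year)
print(peak_month, round(factor, 2))
```

In this hypothetical year the peak occurs in July, as it does for all clusters in 2015.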

Table 4.12: Monthly peak factor per cluster for 2015 and 2016.

Cluster   Year   Month    Value
1         2015   July     1.88
1         2016   August   2.07
2         2015   July     2.26
2         2016   July     2.26
3         2015   July     2.00
3         2016   August   2.17

4.4 Modeling garden watering demand using GAM

In this section, we show the steps taken to build the GAM models for each of the 3 clusters. We discuss the possible explanatory variables that can be used in the models and the process that led to the final models. Finally, we present and discuss the future values predicted for 2 of the 3 clusters.

As mentioned before, we wanted to build a model for each one of the clusters. The data had already been split into train and test sets, as mentioned in Section 4.3.

We fitted three GAM models to each cluster. For that, we computed three representative series for each cluster: the aggregation by the mean (representative series Mean), the aggregation by the 95% quantile (representative series Q95%) and the aggregation by the 25% quantile (representative series Q25%). In Figure 4.28, the representative series Mean of Cluster 1 is plotted. A GAM model was fitted to each one of these representative series, which gives more information when forecasting future values. The predictions obtained from the models of representative series Q95% and Q25% serve as consumption intervals. That is, we followed a non-parametric approach instead of computing prediction intervals.
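The aggregation used to build the three representative series can be sketched as follows. This is a minimal illustration, assuming that all series in a cluster have the same length and that the quantiles are computed by linear interpolation between order statistics (the interpolation rule is an assumption, since it is not stated here); the data are hypothetical.

```python
from statistics import mean


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def representative_series(cluster_series):
    """Aggregate the member series of a cluster day by day."""
    days = list(zip(*cluster_series))  # one tuple of member values per day
    return {
        "Mean": [mean(day) for day in days],
        "Q95%": [quantile(day, 0.95) for day in days],
        "Q25%": [quantile(day, 0.25) for day in days],
    }


# Hypothetical normalized daily series of five water meters (three days each).
cluster = [
    [0.0, 1.0, 2.0],
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
]
rep = representative_series(cluster)
print(rep["Mean"], rep["Q25%"], rep["Q95%"])
```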

Note that we also removed the extreme consumptions mentioned in Section 4.2 from each time series, to verify whether the models would yield better results. However, these models performed worse than the models built with the original data, having a higher MAPE value. Therefore, the models were built with the original data, without removing any of the extreme observations.

We present in detail the steps taken to build the model for representative series Mean of Cluster 1. A model for the series aggregated by the median was also built; however, it led to poorer results, which are therefore not shown. For comparison, we show and comment on the models and results obtained for both Cluster 1 and Cluster 2 in Section 4.4.4. In Appendix B, we present the forecast results for the models of Cluster 3.

Figure 4.28: Representative series Mean of the new Cluster 1 between 01/01/2015 and 31/07/2017 (x-axis: time in days; y-axis: value).

We began by checking whether the representative series Mean of Cluster 1 is stationary. For that, the KPSS test was used and gave a p-value equal to 0.04475. Considering a significance level of 5%, the null hypothesis that the series is stationary should be rejected, thus the test suggests that the series is not stationary. Therefore, we applied the difference operator once. To determine whether it was necessary to apply a Box-Cox transformation to the data, we tried this transformation with different λ values and checked the sample variance of the resulting series. Since none of the transformations led to a lower sample variance, no Box-Cox transformation was applied.
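The two transformations considered here can be sketched in Python as follows. The sketch assumes a strictly positive series for the Box-Cox transformation and uses hypothetical values and λ candidates; it only illustrates the procedure of differencing once and comparing sample variances across λ values.

```python
import math
from statistics import variance


def difference(series):
    """First difference: d_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def box_cox(series, lam):
    """Box-Cox transform of a strictly positive series."""
    if lam == 0:
        return [math.log(x) for x in series]
    return [(x ** lam - 1) / lam for x in series]


series = [2.0, 3.0, 5.0, 4.0, 6.0]  # hypothetical positive values
diffed = difference(series)

# Sample variance of the transformed series for a few candidate lambdas;
# lambda = 1 only shifts the series, so it leaves the variance unchanged.
variances = {lam: variance(box_cox(series, lam)) for lam in (0, 0.5, 1, 2)}
print(diffed, variances)
```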

4.4.1 Explanatory variables selection

Figure 4.29: Sample ACF of the response variable (lags 0 to 30).

Figure 4.30: Sample PACF of the response variable (lags 0 to 30).

To verify whether a past lag of the response variable should be included in the model, the sample Autocorrelation Function (ACF) and sample Partial Autocorrelation Function (PACF) of the differenced series were computed and are shown in Figure 4.29 and Figure 4.30. Both functions present a significant spike at lag 1, as well as significant spikes around lags 7, 14 and 21. Note that the ACF and PACF are symmetric with respect to the y-axis, as mentioned in Section 3.1, thus there are also significant spikes at lags −1, −7, −14 and −21. A significant spike at lag −1 means that the response variable at time t is correlated with the response variable at time t−1, that is, the previous day. The spikes around multiples of 7 seem to indicate that some weekly seasonality is present in the data, so lag −1 or lag −7 of the response variable might be needed in the model.
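To make the use of the sample ACF concrete, the sketch below computes it for a hypothetical series with a weekly pattern and lists the lags that fall outside the approximate 95% bounds ±1.96/√n. The series and the function names are assumptions made for illustration; the computations of this section are not reproduced here.

```python
from statistics import mean


def sample_acf(series, max_lag):
    """Sample autocorrelations r_k for k = 0..max_lag."""
    n = len(series)
    centre = mean(series)
    centred = [x - centre for x in series]
    c0 = sum(v * v for v in centred) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum(centred[t] * centred[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


def significant_lags(acf, n):
    """Positive lags whose autocorrelation exceeds the approximate 95% bounds."""
    bound = 1.96 / n ** 0.5
    return [k for k, r in enumerate(acf) if k > 0 and abs(r) > bound]


# Hypothetical daily series with a spike every 7 days.
weekly = [1.0 if t % 7 == 0 else 0.0 for t in range(70)]
acf = sample_acf(weekly, 21)
lags = significant_lags(acf, len(weekly))
print(lags)
```

For this artificial series only the multiples of 7 are flagged, which is the kind of pattern that motivates testing lag −7 of the response variable.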

We proceeded to compute the cross-correlations with the meteorological variables: mean, maximum and minimum daily temperatures and daily accumulated precipitation. Keep in mind that the temperature series were differenced once, since they were not stationary. In Figure 4.31, which shows the sample CCF between the differences of the mean temperature and the response variable, the most significant lags are −6 and 16, with correlation values equal to 0.095 and −0.085, respectively. In Figure 4.32, where the sample CCF between the differences of the maximum temperature and the response variable is presented, there are some significant lags, namely lags 12, 13 and 16, with values equal to 0.088, −0.099 and 0.088, respectively; however, none of these ended up in the final model. In Figure 4.33, where the sample CCF between the differences of the minimum temperature and the response variable is shown, lag 24 is the most significant one, with a value equal to 0.105. In fact, none of the temperature variables were used in the model, since they were not significant and, when present, did not result in better forecasts. As for the precipitation, in Figure 4.34, lag 0 and lag −1 are the most significant, with cross-correlation values equal to −0.143 and −0.133, respectively. Both lags of the precipitation were tested when building the models, but lag −1 was the one used in the model. A variable for the event of precipitation was also tested, to verify whether it improved on the variable of precipitation quantity. The event of precipitation, EventPrecip_t, is equal to 1 if the precipitation on day t was greater than zero and 0 if there was no occurrence of precipitation. Note that Jain et al. [13] found that the occurrence of rainfall was a more significant variable than the amount of rainfall; in our case, however, including the event of precipitation generally did not improve the forecast accuracy. This variable was used in only one model, where it did improve the forecast accuracy.
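A minimal sketch of the two ingredients discussed above, the event of precipitation and a lagged cross-correlation, is given below. The data are hypothetical, and the lag convention (correlation between x at time t + lag and y at time t, so that a negative lag refers to past values of x) is an assumption chosen to match the way lag −1 of the precipitation is used in the model.

```python
from statistics import mean


def event_precipitation(precip_mm):
    """Indicator equal to 1 if it rained on day t and 0 otherwise."""
    return [1 if p > 0 else 0 for p in precip_mm]


def cross_correlation(x, y, lag):
    """Sample correlation between x[t + lag] and y[t]."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    pairs = [(x[t + lag], y[t]) for t in range(n) if 0 <= t + lag < n]
    return sum((a - mx) * (b - my) for a, b in pairs) / (sx * sy)


# Hypothetical data: consumption drops the day after it rains.
precip = [0.0, 5.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
demand = [0.0, 0.0, -5.0, 0.0, 0.0, -3.0, 0.0, 0.0]

events = event_precipitation(precip)
ccf_lag_minus_1 = cross_correlation(precip, demand, -1)
ccf_lag_0 = cross_correlation(precip, demand, 0)
print(events, round(ccf_lag_minus_1, 3), round(ccf_lag_0, 3))
```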

Figure 4.31: CCF between the differenced mean temperature and the differenced representative series.

Figure 4.32: CCF between the differenced maximum temperature and the differenced representative series.

Figure 4.33: CCF between the differenced minimum temperature and the differenced representative series.

Figure 4.34: CCF between the accumulated precipitation and the differenced representative series.

Also included in the model is the Month variable, which takes values from 1 to 12 and represents the yearly seasonality present in the data.

Furthermore, we include in the model the Impulse variable. Impulse is equal to 1 only on the first consecutive days of rain in October and 0 otherwise. It can be seen as a variable that represents the transition from the summer to the winter season, which is a more sudden change in the mean consumption than the transition from winter to summer. This binary variable represents an event that happens once a year, every year, and therefore needs to be represented in the model.
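One possible way to build the Impulse variable is sketched below. It reads the definition as "the first run of consecutive rainy days in October of each year", which is our interpretation; the function name and the precipitation values are hypothetical.

```python
from datetime import date, timedelta


def impulse_variable(dates, precip_mm):
    """Return 1 on the first run of consecutive rainy October days of each year."""
    impulse = [0] * len(dates)
    finished_years = set()  # years whose first rainy spell has already ended
    in_spell = False
    for i, (day, rain) in enumerate(zip(dates, precip_mm)):
        rainy_october = day.month == 10 and rain > 0
        if rainy_october and day.year not in finished_years:
            impulse[i] = 1
            in_spell = True
        elif in_spell:
            # The first dry (or non-October) day closes the spell of that year.
            finished_years.add(dates[i - 1].year)
            in_spell = False
    return impulse


# Hypothetical precipitation (mm) from 29/09/2016 to 10/10/2016.
dates = [date(2016, 9, 29) + timedelta(days=i) for i in range(12)]
precip = [2, 0, 0, 0, 4, 1, 0, 3, 0, 0, 0, 0]
impulse = impulse_variable(dates, precip)
print(impulse)
```

Here only 3 and 4 October are flagged: the rain on 29 September is outside October, and the rain on 6 October comes after the first spell has ended.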

The Trend variable captures the trend of the representative series, which is calculated by taking the

trend component of the STL decomposition of the representative series.

4.4.2 Modeling

In order to find a model that fits the data and predicts values as close as possible to the real values, we built several models with different combinations of the variables discussed in Subsection 4.4.1 and interactions between them. At this stage, the following variables were considered: lags −1 and −7 of the response variable, lags −6 and 16 of the differenced mean temperature, lags 12, 13 and 16 of the differenced maximum temperature, lag 24 of the differenced minimum temperature, lags 0 and −1 of the daily accumulated precipitation, Month, Trend and Impulse. For each variable, we also compared a model with a smooth function f(.) applied to that variable against a model without it, to verify which variables require a smooth function. To compare the forecast accuracy of the models, we used the MAPE (Mean Absolute Percentage Error) and chose the model with the lowest MAPE value as the best one.
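The model selection criterion can be illustrated with the sketch below, which assumes the usual definition of the MAPE (mean of the absolute relative errors, in percent). The candidate names and values are hypothetical.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Hypothetical test values and forecasts of two candidate models.
actual = [10.0, 8.0, 5.0, 4.0]
candidates = {
    "model_a": [9.0, 8.0, 6.0, 4.0],
    "model_b": [12.0, 6.0, 5.0, 5.0],
}

scores = {name: mape(actual, forecast) for name, forecast in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the lowest MAPE
print(scores, best)
```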

We show the "best" models fitted to the three representative series of Cluster 1. The models of representative series Mean, Q95% and Q25% are indicated as Model 1, Model 2 and Model 3, respectively.

Model 1:
\[ y_t = y_{t-1} + y_{t-7} + \mathrm{Prec}_{t-1} + f(\mathrm{Month}) + \mathrm{Trend} + \mathrm{Impulse} + \mathrm{Impulse} \times \mathrm{Trend} \tag{4.4} \]

Model 2:
\[ y_t = f_1(\mathrm{Month}) + f_{23}(y_{t-3}, \mathrm{Month}) + \mathrm{Trend} + \mathrm{Impulse} + \mathrm{Impulse} \times \mathrm{Trend} \tag{4.5} \]

Model 3:
\[ y_t = f_1(y_{t-1}) + y_{t-7} + \mathrm{Trend} + \mathrm{Month} + \mathrm{Trend} \times \mathrm{Month} + \mathrm{Impulse} + \mathrm{Impulse} \times \mathrm{Trend} + f_2(\mathrm{Month}) \tag{4.6} \]

where y_{t-i} represents lag −i of the response variable, f(.) represents smooth functions, and Month, Trend and Impulse are as explained in Subsection 4.4.1.

Model 1 (Equation 4.4), which fits the differenced representative series Mean (response variable y_t), was built with the following terms: lags −1 and −7 of the response variable (y_{t-1} and y_{t-7}, respectively); lag −1 of the precipitation variable, Prec_{t-1}; a smooth function applied to the Month variable, representing the seasonality present in the data; the variable Trend, which represents the trend present in the data; the Impulse variable, which represents the first days of consecutive rain in October; and an interaction between Impulse and Trend, which represents the shift in the values that occurs in the first days of consecutive rain in October.

4.4.3 Analysis of the Residuals

Once the "best" GAM model, Model 1 (Equation 4.4), was chosen, it was necessary to analyse the residuals, namely to check their stationarity and whether they follow a Normal distribution. The KPSS test applied to the residuals gave a p-value greater than 0.10, which suggests that they are stationary at the usual significance levels (1%, 5% and 10%). In Figure 4.35 and Figure 4.36, we find the histogram of the residuals and the QQ-Plot, respectively. In Figure 4.35, the pattern is similar to a bell shape around zero, with a slight negative skew, probably due to the extreme observations. In Figure 4.36, the residuals follow the straight line, only deviating in the tails. Thus, both plots indicate that the residuals seem to follow a Normal distribution.

The plot of the residuals versus the linear predictor is shown in Figure 4.37. The points appear to

be randomly distributed around zero without any clear pattern. This indicates that the residuals are

uncorrelated.

Figure 4.35: Histogram of the residuals of Model 1.

Figure 4.36: QQ-Plot of the residuals of Model 1.

Figure 4.37: Residuals versus the linear predictor of Model 1.

4.4.4 Forecast

We used the chosen model to predict values from 10th August 2017 until 30th November 2017. These

predictions were compared to the actual values in the test set. Note that to compute the predictions,

the values of 2016 were used for the lags of the response variable, as well as the trend of 2016 for the

variable Trend, since it was the most recent data period available.

In order to show the forecast results in the original scale, we must reverse the transformations applied to the data. First, the differences were inverted, using the last observation in the train set, 31/07/2017, as the initial point. Then, the normalization applied (shown in Equation 4.1) was reversed, using the formula in Equation 4.7. The daily forecasts in the original scale between 10/08/2017 and 30/11/2017 can be found in Figure 4.38.

\[ X_t = Y_t \times \mathrm{sd}(X_t) + \mathrm{mean}(X_t) \tag{4.7} \]
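The two back-transformations can be sketched as follows, assuming that the mean and standard deviation used in the normalization are known and that the forecasts are first differences of the normalized series. All numerical values are hypothetical.

```python
from itertools import accumulate


def invert_differences(diff_forecasts, last_observation):
    """Undo first differencing: y_t = y_{t-1} + d_t, starting from the last train value."""
    return list(accumulate(diff_forecasts, initial=last_observation))[1:]


def denormalize(normalized, mean_x, sd_x):
    """Reverse the standard normalization: X_t = Y_t * sd(X) + mean(X)."""
    return [y * sd_x + mean_x for y in normalized]


# Hypothetical forecasts of the differenced, normalized series.
diff_forecasts = [0.5, -0.25, 0.25]
last_train_value = 1.0  # last normalized observation of the train set

normalized_forecasts = invert_differences(diff_forecasts, last_train_value)
original_scale = denormalize(normalized_forecasts, mean_x=4.0, sd_x=2.0)
print(normalized_forecasts, original_scale)
```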

To analyse the accuracy of the model, we calculated the MAPE, which gave a value of 9.959%. The MAPE is calculated according to Equation 3.58, using the predictions in the original scale and the real aggregated values. The lower the percentage value, the better the forecast accuracy of the model.

Figure 4.38: Daily forecast of the model of representative series Mean (Model 1, Equation 4.4) of Cluster 1 and the real aggregated values by the mean, both in the original scale (m3/day), between 10/08/2017 and 30/11/2017.

In Figure 4.39, the daily forecasts from 16/08/2017 until 30/11/2017 of the models of representative series Mean, Q95% and Q25% are shown, as well as the real aggregated values by the mean. The forecasts of the models of representative series Q95% and Q25% (aggregated by the 95% quantile and by the 25% quantile, respectively) can be seen as consumption intervals for the forecasts of the mean consumption. The model of representative series Q95% had a MAPE value equal to 19.71%. For the model of representative series Q25%, the MAE (Mean Absolute Error) measure, calculated by Equation 3.59, was used instead, since we are dealing with values close to zero. Its value was equal to 0.867.
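The reason for preferring the MAE when the observed values are close to zero can be seen in the small sketch below, which assumes the usual definitions of MAE and MAPE; the numbers are hypothetical.

```python
def mae(actual, predicted):
    """Mean Absolute Error, in the units of the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical low-consumption series: every forecast is off by 0.5 m3.
actual = [0.02, 0.5, 1.0]
predicted = [0.52, 1.0, 1.5]

mae_value = mae(actual, predicted)
mape_value = mape(actual, predicted)
print(mae_value, mape_value)
```

The absolute error is the same on every day, but the day with an observed value of 0.02 dominates the MAPE, which makes the percentage measure hard to interpret for a series such as Q25%.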

Figure 4.39: Daily forecast of Model 1 (Equation 4.4), Model 2 (Equation 4.5) and Model 3 (Equation 4.6) of Cluster 1 and the real aggregated values by the mean in the original scale (m3/day) between 16/08/2017 and 30/11/2017.

The models of Cluster 2 performed worse than the models of Cluster 1 or Cluster 3. Note that the forecast interval is not equal for all the models, due to the lags of the explanatory variables used. Below, the models that were fitted to the 3 representative series are presented: Model 4, Model 5 and Model 6 correspond to the representative series Mean, Q95% and Q25%, respectively. Note that, in R, the gamm function from package mgcv fits generalized additive mixed models to the data and allows the residuals of the model to be fitted with an ARMA model.

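Model 4:
\[ y_t = f_{12}(y_{t-1}, \mathrm{Prec}_{t-18}) + f_3(y_{t-2}) + f_4(\mathrm{Month}) + \mathrm{Trend}, \quad \text{with residuals } \varepsilon \sim \mathrm{ARMA}(2, 1) \tag{4.8} \]

Model 5:
\[ y_t = f_1(y_{t-2}) + f_2(\mathrm{DiffMinTemp}_{t+14}) + \mathrm{EventPrec}_{t-18} + \mathrm{Trend} + \mathrm{Month} + \mathrm{Trend} \times \mathrm{Month}, \quad \text{with residuals } \varepsilon \sim \mathrm{ARMA}(2, 1) \tag{4.9} \]

Model 6:
\[ y_t = \beta + f_1(\mathrm{DiffMaxTemp}_{t-13}) + f_2(y_{t-1}) + y_{t-6} + \mathrm{Trend} + \mathrm{Month} + \mathrm{Trend} \times \mathrm{Month} + f(\mathrm{Month}) \tag{4.10} \]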

The model of representative series Mean had a MAPE equal to 35.327% for the forecast interval between 27/08/2017 and 30/11/2017; its forecasts are shown in Figure 4.40. The model of representative series Q95% had a MAPE equal to 30.024% and the model of representative series Q25% had a MAE equal to 2.098.

The forecast interval bands between 19/08/2017 and 16/11/2017 obtained by the models of Cluster 2 are shown in Figure 4.41. As can be seen there, the real aggregated mean of this cluster is not always contained in the interval bands: some predictions of the representative series Q95% model are lower than the real aggregated mean.
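A simple way to quantify how often the real aggregated mean falls inside the interval bands is sketched below. The coverage measure and the function names are our own illustration and are not part of the analysis above; the values are hypothetical.

```python
def band_coverage(real, lower, upper):
    """Share of days on which the real value lies inside [lower, upper]."""
    inside = [lo <= r <= hi for r, lo, hi in zip(real, lower, upper)]
    return sum(inside) / len(inside)


def days_above_band(real, upper):
    """Indices of the days on which the real value exceeds the upper band."""
    return [t for t, (r, hi) in enumerate(zip(real, upper)) if r > hi]


# Hypothetical daily values (m3/day).
real_mean = [5.0, 6.0, 9.0, 4.0]
q25_forecast = [1.0, 1.0, 2.0, 1.0]  # lower band
q95_forecast = [8.0, 8.0, 8.0, 8.0]  # upper band

coverage = band_coverage(real_mean, q25_forecast, q95_forecast)
above = days_above_band(real_mean, q95_forecast)
print(coverage, above)
```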

Figure 4.40: Daily forecast of the model of representative series Mean (Model 4, Equation 4.8) of Cluster 2 and the real aggregated values by the mean in the original scale (m3/day) between 27/08/2017 and 30/11/2017.

Figure 4.41: Daily forecast of Model 4 (Equation 4.8), Model 5 (Equation 4.9) and Model 6 (Equation 4.10) of Cluster 2 and the real aggregated values by the mean in the original scale (m3/day) between 19/08/2017 and 16/11/2017.

If a new construction begins in the area, the only information available about the new client is the lot, outdoor and building areas. There is no a priori information about the behaviour or consumption pattern of the new client. Thus, we can use the outdoor area to determine the cluster whose model can be used to predict a possible consumption pattern for this new client. We can use the boxplot in Figure 4.24 to decide which model to use. For example, for a new lot that has an outdoor area of 2000m2, the model of Cluster 2 should be used. If the lot has an outdoor area of 1200m2, then both the models of Cluster 1 and Cluster 3 can be used, and we can take the average of their predictions.

In addition, these models may be used to identify clients that may have a borehole. By taking advantage of the interval bands, we can see whether a certain client is consistently below the values of the interval. If that is the case, then the client has a suspiciously low water use when compared to the mean consumption of the set of clients used to build the model, and there is the possibility that the client has a borehole.
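The screening rule described above could be sketched as follows. The threshold `min_share`, which defines "consistently", is a hypothetical parameter, and the values are illustrative; the client series and the lower band are assumed to be on the same scale.

```python
def consistently_below(client_series, lower_band, min_share=0.9):
    """Flag a client whose consumption is below the lower band on most days."""
    below = sum(c < lo for c, lo in zip(client_series, lower_band))
    return below / len(client_series) >= min_share


# Hypothetical lower band (forecasts of the Q25% model) over ten days.
lower_band = [2.0] * 10

suspect_client = [0.5] * 9 + [3.0]  # below the band on 9 of 10 days
regular_client = [2.5] * 10         # never below the band

suspect_flag = consistently_below(suspect_client, lower_band)
regular_flag = consistently_below(regular_client, lower_band)
print(suspect_flag, regular_flag)
```

A flagged client is only a candidate for further inspection, since a low consumption can have other causes than a borehole.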

4.5 Daily disaggregation of water consumption

A secondary goal of this dissertation was to disaggregate the consumption of the lots that have a single water meter, that is, to determine how much of their total consumption corresponds to indoor consumption and how much corresponds to garden watering. In this Section, we present the method that was used to disaggregate daily water consumption in the period between August and November 2017, which uses the models of the garden watering demand. Moreover, the results obtained by this method are shown.

Since we wish to use the results obtained from modeling the garden watering demand, we began by examining the weight of the outdoor water use in the total monthly consumption in each of the 3 clusters discussed in Subsection 4.3.3. In Figures 4.42, 4.43 and 4.44, we can see stacked bar plots for Clusters 1, 2 and 3, respectively, where the mean monthly total consumption is represented and separated into indoor and outdoor consumption. On top of each bar, the percentage of the total consumption that corresponds to outdoor consumption is indicated. The percentages do not differ much between clusters. As expected from the analysis of Figure 4.9, garden watering represents the majority of the mean monthly consumption in all clusters. For example, for Cluster 1, the percentage was greater than or equal to 90% between March and September 2015, between April and September 2016, and from April through July 2017.

Figure 4.42: Mean monthly indoor consumption and mean monthly water consumption for garden watering (m3/month) of Cluster 1 between January 2015 and July 2017.

[Figure 4.43: Stacked bar plot of the mean monthly indoor consumption and mean monthly water consumption for garden watering (m3/month) of Cluster 2, with the outdoor percentage on top of each bar.]



[Figure 4.44: Stacked bar plot of the mean monthly indoor consumption and mean monthly water consumption for garden watering (m3/month) of Cluster 3, with the outdoor percentage on top of each bar.]

In Table 4.13, the mean monthly ratio between the garden watering and the total water consumption per cluster is presented for the months from August to November, since the disaggregation method was performed for these months. This method uses the garden watering demand models, which were fitted to train sets within the period from 01/01/2015 until 31/07/2017 and were tested over the months August-November 2017. For this reason, the disaggregation method was applied between August 2017 and November 2017.

Table 4.13: Mean monthly ratio between the garden watering and total water consumption per cluster for the months of August, September, October and November and years 2015 and 2016.

Month       Cluster 1   Cluster 2   Cluster 3
August      0.93        0.89        0.925
September   0.945       0.90        0.94
October     0.88        0.905       0.925
November    0.79        0.865       0.88

As mentioned in Section 4.1, the water utility company that provided the data has almost 3000 clients, of which only 73 have two water meters. So, the majority of the clients have a single water meter that measures both indoor and outdoor water use and can be used to test this method. To select the clients that form this new set, we followed certain criteria. Since we used the garden watering demand models, we wanted the new set to be similar to the water consumption for garden watering data set of 57 clients. First of all, the clients needed to have data available from 01/01/2015, since the training period used to build the garden watering demand models begins on that date. Second, the clients needed to have a detached house, since this is the housing typology of all of the clients in the water consumption for garden watering data set (with the exception of one apartment). Furthermore, from Figure 4.24, we gather that the majority of the clients in the water consumption for garden watering data set have an outdoor area between approximately 1100m2 and 2300m2. So, having an outdoor area between 1100m2 and 2300m2 was another criterion when selecting clients for this new data set. Moreover, it was important to select clients that do not have a very low mean daily water consumption when compared to the size of the outdoor area, because it is possible that these clients have a borehole. If a client has a borehole, it will be used to water the garden and the values registered by the water meter will be mainly indoor water use, thus the water consumption cannot be disaggregated into indoor and outdoor water use. So, if a client presented a mean daily consumption close to zero, it was not selected.

We ended up with a set of 41 clients that have a single water meter, which we call the single water meter set. The method discussed in Section 3.5 was tested on this set. Some of the results are shown here, and additional results can be found in Appendix C.

We outline the steps taken in this method before discussing the results:

Step 1 We applied the Complete Linkage clustering algorithm with the periodogram based distance to the normalized single water meter set (normalized with the Standard Normalization, Equation 4.1). Then, we chose the best number of clusters k.

Step 2 Having chosen the optimal number k of single water meter clusters, we built the representative series for each cluster. The value of this series at each time point t is the mean of all the time series in that cluster at time t. Then, this series is normalized with the Standard Normalization.

Step 3 We considered a train set composed of the normalized representative series Mean of the 3 water consumption for garden watering clusters, where each series represents its own class. For the test set, we considered the normalized representative series of the k single water meter clusters. We then applied 1-NN (1-Nearest Neighbor) to classify the series of the test set.

Step 4 According to the classification results of 1-NN, one of the 3 garden watering demand models of representative series Mean (discussed in Section 4.4) was used to estimate the total daily consumption for each of the k clusters.

Step 5 Using the appropriate values in Table 4.13, we estimated the future outdoor water use by taking a percentage of the estimates obtained in Step 4. For the daily estimates within the same month, we used the same percentage value.

Step 6 The estimates of the indoor water use were obtained as the difference between the estimates of the total consumption (Step 4) and the estimates of the outdoor water consumption (Step 5).
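Steps 5 and 6 amount to a simple split of the estimated total consumption, as sketched below for Cluster 1 using the ratios of Table 4.13. The function name and the forecast values are hypothetical.

```python
# Mean monthly outdoor/total ratios of Cluster 1 (Table 4.13), keyed by month number.
OUTDOOR_RATIO_CLUSTER_1 = {8: 0.93, 9: 0.945, 10: 0.88, 11: 0.79}


def disaggregate(total_forecast, months, ratios):
    """Split daily total estimates into outdoor (Step 5) and indoor (Step 6) parts."""
    outdoor = [total * ratios[month] for total, month in zip(total_forecast, months)]
    indoor = [total - out for total, out in zip(total_forecast, outdoor)]
    return outdoor, indoor


# Hypothetical daily estimates of the total consumption (m3/day) and their months.
total_forecast = [10.0, 8.0]
months = [8, 11]  # one day in August, one day in November

outdoor, indoor = disaggregate(total_forecast, months, OUTDOOR_RATIO_CLUSTER_1)
print(outdoor, indoor)
```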

In Step 1, we used the same clustering algorithm and the same distance that were used when clustering the set of 57 exclusively outdoor water meters, as well as the same indices to choose the optimal number of clusters, described in Section 4.3. Also, in Step 3, when applying 1-NN, the periodogram based distance was used once more, since we wanted to classify the normalized series by similarity of pattern and we had already assessed in Section 4.3 that this distance was the best to do so.
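The classification in Step 3 can be sketched as follows. The sketch assumes that the periodogram based distance is the Euclidean distance between the periodogram ordinates of two series of equal length; the exact definition used in this work may differ, for example by a normalization. The series are hypothetical sinusoids.

```python
import cmath
import math


def periodogram(series):
    """Periodogram ordinates at the Fourier frequencies k = 1..n//2."""
    n = len(series)
    ordinates = []
    for k in range(1, n // 2 + 1):
        dft = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates


def periodogram_distance(a, b):
    """Euclidean distance between the periodograms of two equal-length series."""
    pa, pb = periodogram(a), periodogram(b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(pa, pb)))


def classify_1nn(train, query):
    """Return the label of the train series closest to the query series."""
    return min(train, key=lambda label: periodogram_distance(train[label], query))


n = 28
train = {
    "Cluster 1": [math.sin(2 * math.pi * t / 7) for t in range(n)],   # weekly pattern
    "Cluster 3": [math.sin(2 * math.pi * t / 14) for t in range(n)],  # two-week pattern
}
# Same weekly pattern as Cluster 1, shifted by two days.
group_series = [math.sin(2 * math.pi * (t + 2) / 7) for t in range(n)]

label = classify_1nn(train, group_series)
print(label)
```

Because the periodogram discards the phase, the shifted weekly series is matched to the weekly class, which is the sense in which this distance compares series by similarity of pattern.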

By applying Complete Linkage with the periodogram based distance, and using the Dunn Index, Entropy, Silhouette and Gamma to choose the number of clusters, the test set was partitioned into 5 clusters. We name these clusters Group 1, Group 2, Group 3, Group 4 and Group 5, to avoid confusion with the clusters obtained in Section 4.3. The sizes of these groups are shown in Table 4.14. Note that Group 5 has only one member and was not taken into consideration from this point on. In Figure 4.45, the boxplot of the outdoor area per group is presented. Group 2 has the lowest median value, equal to 1478m2, and Group 3 has the highest, equal to 1644m2; however, there is no clear distinction between the groups.

Table 4.14: Group size for the test data set (N = 41).

Group 1 2 3 4 5

Size 9 11 11 9 1

Figure 4.45: Boxplot of the outdoor area (m2) per group for the test data set (N = 41).

Proceeding with 1-NN, the representative series of each group was classified according to its similarity to the representative series Mean of Cluster 1, Cluster 2 and Cluster 3, which were obtained for the garden watering data set and are described in Section 4.3. The classification results are shown in Table 4.15. This means, for example, that the garden watering demand model of the representative series Mean of Cluster 1 is used to estimate future values of the total consumption of Group 1. In the same way, the model of the representative series Mean of Cluster 3 is used to estimate future values of the total consumption of Group 2. In all cases, we performed the daily estimation between 22/08/2017 and 30/11/2017.

Table 4.15: KNN classification results of the groups' representative series according to the clusters obtained for the garden watering data set.

Group 1 2 3 4

Classification Cluster 1 Cluster 3 Cluster 3 Cluster 1
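The classification step can be illustrated with a short Python sketch. It assumes that the periodogram-based distance is the Euclidean distance between the periodograms of the normalized series, in the spirit of [29]; the exact variant used in Section 4.3 may differ. The function names and the synthetic series are hypothetical and stand in for the real representative series.

```python
import cmath
import math
from statistics import mean, pstdev

def normalized_periodogram(series):
    """Periodogram of the z-scored series at the Fourier frequencies j = 1..n//2."""
    m, s = mean(series), pstdev(series)
    z = [(v - m) / s for v in series]
    n = len(z)
    ordinates = []
    for j in range(1, n // 2 + 1):
        w = 2 * math.pi * j / n
        dft = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(z))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates

def periodogram_distance(x, y):
    """Euclidean distance between the two periodograms (series of equal length)."""
    return math.dist(normalized_periodogram(x), normalized_periodogram(y))

def classify_1nn(series, references):
    """Return the label of the reference series closest to `series`."""
    return min(references, key=lambda label: periodogram_distance(series, references[label]))

# Synthetic stand-ins for the representative series (hypothetical data).
days = range(56)
references = {
    "Cluster 1": [math.sin(2 * math.pi * t / 7) for t in days],
    "Cluster 2": [math.sin(2 * math.pi * t / 14) for t in days],
    "Cluster 3": [math.sin(2 * math.pi * t / 28) for t in days],
}
# Same weekly pattern as "Cluster 1", but with a different level, scale and phase.
group = [10 + 3 * math.sin(2 * math.pi * t / 7 + 0.5) for t in days]
print(classify_1nn(group, references))  # prints "Cluster 1"
```

Because the periodogram ignores the phase and the series are normalized first, the synthetic group is matched to the reference with the same cycle length regardless of its level or amplitude, which is the behaviour wanted when classifying by pattern.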

As seen in Figure 4.9 and Figures 4.42, 4.43 and 4.44, the outdoor water use of the set of 57 clients studied represents the majority of the total consumption, which is why in Step 4 we use the forecasts of the garden watering demand models as an estimate of the total consumption.

In Figure 4.46 and Figure 4.47, the predictions of the garden watering demand models, used as estimates of the total consumption, are shown along with the respective real total consumption for Group 1 and Group 2. To evaluate the accuracy of the models, we calculate the MAPE (Mean Absolute Percentage Error), Equation 3.58; recall that the lower the MAPE, the better. For Group 1, a MAPE of 28.40% was obtained and for Group 2 the MAPE was 16.25%, the best value of all four groups. For Group 3 and Group 4, MAPE values of 37.81% and 26.22% were obtained, respectively. The corresponding plots for Group 3 and Group 4 can be found in Appendix C.
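For reference, the accuracy measure can be sketched in a few lines of Python. The sketch assumes that Equation 3.58 is the usual definition of the MAPE, i.e. the mean of the absolute errors relative to the observed values, expressed as a percentage. The function name and the sample values are illustrative only.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (usual definition assumed)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an observed value is zero")
    relative_errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)

# Illustrative values only (m3/day), not taken from the data set.
real = [10.0, 8.0, 5.0]
estimates = [9.0, 10.0, 5.0]
print(round(mape(real, estimates), 2))  # relative errors 0.10, 0.25, 0.00 -> 11.67
```

Note that days with zero observed consumption must be excluded or handled separately, since the relative error is undefined for them.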

[Figure: x-axis: Time (days); y-axis: m³/day; legend: Predictions, Real]

Figure 4.46: Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.

[Figure: x-axis: Time (days); y-axis: m³/day; legend: Predictions, Real]

Figure 4.47: Estimates of the total daily consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.

We can then proceed to Step 5 and Step 6 to obtain the disaggregated values from the estimates of the total consumption. Figure 4.48 and Figure 4.49 show the estimates of the consumption disaggregation for Group 1 and Group 2, respectively. The estimates of garden watering are shown in green, the estimates of the indoor water use in red and the real total consumption in blue. Again, the respective plots for Group 3 and Group 4 are shown in Appendix C.

[Figure: x-axis: Time (days); y-axis: m³/day; legend: Garden watering estimates, Indoor consumption estimates, Real (total)]

Figure 4.48: Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 1 in the original scale.

[Figure: x-axis: Time (days); y-axis: m³/day; legend includes Real (total)]

Figure 4.49: Estimates of the daily garden watering and daily indoor consumption between 22/08/2017 and 30/11/2017 and the real total daily consumption of Group 2 in the original scale.

With this method, we were able to obtain satisfactory estimates of the total consumption, allowing good estimates of the indoor and outdoor water use of clients whose lot has a single water meter and characteristics similar to those of the 57 lots studied to build the garden watering demand models.

We now explain another method that was explored, but that gave less satisfactory results.

The first step of this method is the same as Step 1 of the method already discussed; therefore, we again have 4 groups. The second step is to further separate the clusters obtained according to the outdoor areas of their members. With this method, we also wanted to use the garden watering demand models to estimate the total consumption, but using the outdoor area as the determinant to choose which model to use. Using Figure 4.18 as guidance, we separated each cluster into two groups: one whose members have outdoor areas between 1100 m² and 1600 m² and another with areas between 1600 m² and 2300 m², as shown in Table 4.16. The groups with smaller outdoor areas are denoted by S and the groups with larger areas by L. The groups with larger outdoor areas use the model of Cluster 2, and the groups with smaller outdoor areas use the models of both Cluster 1 and Cluster 3, by taking the mean of their results. The last steps are to estimate the water consumption for garden watering, again by taking a percentage of the estimates of the total consumption according to Table 4.13, and then, as in the previous method, to estimate the indoor consumption as the difference between the estimates of the total consumption and the estimates of the water consumption for garden watering.

Table 4.16: Size of each group.

Cluster   1       2       3       4
          S   L   S   L   S   L   S   L
Size      5   4   7   4   5   6   5   4

For some of the groups, reasonably good results were obtained; in other cases, however, the results were very poor. For all the larger area groups, inadequate results were obtained, so the consumption pattern of these groups seems to differ from that of Cluster 2.

With this method, the overall results were poorer than with the first method discussed. For example, when estimating the total water consumption of Group 1 with outdoor area between 1600 m² and 2300 m² (Group 1 L), the MAPE was equal to 63.41%. This shows that using the outdoor area as the only determinant is not sufficient and that better results are obtained when the consumption pattern is taken into account. In Appendix C, the results for certain groups are shown.

Let us consider the estimation idea of Syme et al. [3], as described in Chapter 2. In that paper, the authors estimated the outdoor water consumption as the difference between the consumption in summer months and the consumption in winter months. If we attempt to use this idea, the first question that arises is how to select the "summer months" and "winter months". As already discussed, the amount of precipitation and the periods in which rainfall occurred changed in 2017 when compared to 2016 and 2015. Thus, it is not clear how to define the same "summer" and "winter" months for different years. Moreover, this study focuses on a touristic region, in which the clients are not expected to reside in the homes year-round. The clients are expected to be in the houses, for example, during the usual period of summer holidays, June and August, and possibly during the Easter and Christmas holidays, so there will be water consumption inside the homes only during these periods. Also, as we have already seen, in this region the gardens are watered even during the "winter months". Thus, the approach used in Syme et al. [3] is not appropriate for our case.

Chapter 5

Conclusions

In this Chapter, the achievements obtained throughout this study are stated in Section 5.1. In Section 5.2, some ideas for future work are mentioned.

5.1 Achievements

In this dissertation our aim was to study, model and forecast garden watering demand in a coastal

touristic area. For that we used data collected between 01/01/2015 and 31/07/2017 from 57 water

meters that measure exclusively outdoor use.

We were able to verify that the relationship between the outdoor area of a lot and its mean daily outdoor water use is not linear. The characterization of the outdoor area typology of each lot could be important information; however, it was not available at the time of this study. We therefore estimated the actual watered area of each lot, using pictures available on Google Maps, and confirmed that these values reveal more information about the clients' water use.

The first step taken was the clustering of the time series. Building a model for each client is not practical, so grouping similar time series is an important step. The time series were normalized before the clustering algorithm was applied, in order to group them by pattern. Had we clustered the original series, the result would have been a grouping by scale, which was not of interest. In this way, we were able to identify 5 groups according to the consumption pattern in the set of clients, using the Complete Linkage hierarchical clustering algorithm with the periodogram-based distance. However, after some exploratory work on the characteristics of each cluster, we decided to join 3 clusters into one and to apply the same clustering algorithm again to one of the clusters, since its members were quite different from each other.
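The effect of normalizing before clustering can be illustrated with a small Python sketch of complete linkage. For brevity it uses the Euclidean distance (`math.dist`) instead of the periodogram-based distance used in this study, and the four toy series, as well as the function names, are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscore(series):
    """Normalize a series to zero mean and unit standard deviation."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s for v in series]

def complete_linkage(items, distance, k):
    """Agglomerative clustering with complete linkage, stopped at k clusters.

    Returns a list of clusters, each a list of indices into `items`.
    """
    d = [[distance(a, b) for b in items] for a in items]
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                link = max(d[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy series: 0 and 1 share one pattern, 2 and 3 share another,
# but 1 and 3 are on a much larger scale than 0 and 2.
a = [1, 2, 3, 2, 1, 2, 3, 2]
c = [3, 1, 3, 1, 3, 1, 3, 1]
series = [a, [10 * v for v in a], c, [5 * v for v in c]]

print(complete_linkage([zscore(s) for s in series], math.dist, 2))  # [[0, 1], [2, 3]]
print(complete_linkage(series, math.dist, 2))                       # [[0, 2, 3], [1]]
```

With normalization the series are paired by pattern; on the original scale the large series 1 is isolated and the remaining series are grouped together, which reflects scale rather than pattern.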

We proceeded to build Generalized Additive Models for each of the three clusters. With these models, it was possible to use the weather variables (mean, minimum and maximum daily temperature and daily accumulated precipitation) as explanatory variables. One important explanatory variable used in the models was Impulse, which explained the abrupt shift in consumption in the first consecutive days of rain in the month of October. We also attempted to use one of the classical time series models, SARIMA; however, the forecast values obtained were far from satisfactory, and the Generalized Additive Models proved more adequate for the data set.

Three models were built for each cluster, which makes it possible to provide a consumption interval for the forecasts of the mean values of each cluster. After forecast evaluation, it was verified that the models of one of the clusters (Cluster 2) did not achieve satisfactory results. The models of the remaining two clusters presented good forecast accuracy, the best being the models of Cluster 1.

These models can be used, for example, when a new lot is being built and it is necessary to estimate the outdoor water use of the new client, or to estimate the outdoor water consumption of an existing client that will close the borehole in the lot. In both cases, the information about the lot area, outdoor area and building area is available. Thus, to decide which model to use, the outdoor area is used as guidance. For an outdoor area between around 1600 m² and 2300 m², the models of Cluster 2 are used to predict future daily values. For an outdoor area between around 1100 m² and 1600 m², the average of the predicted values of the models of Cluster 1 and Cluster 3 is considered.
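The selection rule above can be written as a short Python sketch. The thresholds are the approximate values quoted in the text; assigning an area of exactly 1600 m² to the larger class is an assumption, since the text leaves the boundary open. The function name and the forecast values are hypothetical.

```python
def forecast_for_lot(outdoor_area_m2, cluster_forecasts):
    """Choose the daily forecasts for a lot from its outdoor area.

    cluster_forecasts: dict with keys "Cluster 1", "Cluster 2", "Cluster 3",
    each a list of daily forecasts (m3/day) of equal length.
    """
    if 1600 <= outdoor_area_m2 <= 2300:
        # Larger outdoor areas: models of Cluster 2.
        return list(cluster_forecasts["Cluster 2"])
    if 1100 <= outdoor_area_m2 < 1600:
        # Smaller outdoor areas: average of the models of Clusters 1 and 3.
        pairs = zip(cluster_forecasts["Cluster 1"], cluster_forecasts["Cluster 3"])
        return [(c1 + c3) / 2 for c1, c3 in pairs]
    raise ValueError("outdoor area outside the range covered by the models")

# Illustrative forecasts only, not model output.
forecasts = {
    "Cluster 1": [2.0, 4.0],
    "Cluster 2": [9.0, 8.0],
    "Cluster 3": [4.0, 6.0],
}
print(forecast_for_lot(1300, forecasts))  # [3.0, 5.0]
print(forecast_for_lot(2000, forecasts))  # [9.0, 8.0]
```

Raising an error outside the 1100 m² to 2300 m² range reflects that the models were built only from lots within that range.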

The results obtained will be important to improve the water supply network management of the water

utility company. The predictive garden watering demand models are also important for future planning

in the case more clients close their boreholes and connect to the mains water. Additionally, this study is

also of interest to the management of outdoor areas of large consumers, such as hotels.

A secondary goal of this study was to identify a method to disaggregate daily water consumption

of meters that measure both indoor and outdoor water use. For this, 41 lots with only one meter and

outdoor areas between 1100m2 and 2300m2 were selected. By clustering these time series, we obtained

4 groups and by looking at the similarity between their representative series and the ones from the

clusters of the 57 meters that measure exclusively outdoor use, we were able to classify these 4 groups.

We used the garden watering demand models that we built in order to estimate the total consumption,

since we verified that the garden watering represented the majority of the total consumption. With the

method presented, we were able to obtain satisfactory estimates of the total consumption, allowing good

estimates of indoor and outdoor water use of lots with only one water meter.

This method can be helpful in future water management planning. By providing estimates of average

indoor consumption, this information can be important in sewage system planning. Also, understanding

the weight of the indoor and outdoor water use in the total consumption may help in a future billing

change. Furthermore, it can be important in the decision making of installing new meters that measure

exclusively indoor use and exclusively outdoor use in more lots.

5.2 Future Work

Regarding suggestions for future work, there are interesting possibilities that can derive from this study.

The disaggregation problem can be looked at from a different point of view. In this study, we attempted to disaggregate the daily consumption, which allowed us to use the garden watering demand models that were built. However, it might be possible to do this with hourly observations. Knowing that the gardens are generally watered at night, from around 4 a.m. until 6 a.m., it might be possible to identify the type of consumption according to the pattern within a day.
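As a very rough illustration of this idea, the Python sketch below attributes all consumption recorded between 4 a.m. and 6 a.m. to garden watering and the rest to indoor use. This is a deliberately crude assumption, not a method developed in this study: indoor use inside the window and watering outside it are both ignored. The function name, the window parameters and the hourly values are hypothetical.

```python
def split_by_watering_window(hourly, start_hour=4, end_hour=6):
    """Split one day of hourly readings into (outdoor, indoor) totals.

    hourly: 24 values, hourly[h] being the consumption from h:00 to h+1:00.
    Consumption in [start_hour, end_hour) is counted as garden watering.
    """
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly readings")
    outdoor = sum(hourly[start_hour:end_hour])
    indoor = sum(hourly) - outdoor
    return outdoor, indoor

# Illustrative day: watering at 4 and 5 a.m., small indoor use at 8 a.m. and 8 p.m.
day = [0] * 24
day[4], day[5], day[8], day[20] = 3, 2, 1, 1
print(split_by_watering_window(day))  # (5, 2)
```

A more realistic approach would learn the typical within-day pattern of each type of use instead of relying on a fixed window.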

In this study, we modeled the data collected from exclusive outdoor water meters, that have a corre-

sponding exclusive indoor water meter. By adding the values of each outdoor meter with the respective

indoor meter, the total water consumption values of each lot are obtained. Thus, it is possible to model

the total water consumption. Then, it would be possible to use this together with the garden watering

demand models to obtain estimates of the indoor water consumption.

In addition, within the scope of this project, if a thorough study of the garden typology of each lot were made, i.e., measurement of the area of lawn, the types of small plants, trees or bushes present in the lot's outdoor area and the space occupied by them, it would be possible to better understand the relation between the outdoor water use and the real watered area. It would also provide an understanding of how the presence of different types of plants or trees affects the water consumption.

Furthermore, the garden watering demand models may be used to identify clients with a borehole.

The models can be used to identify possible boreholes in the set of almost 3000 clients managed by the

water utility company. This is of high importance for future planning of the water utility company, since

it is expected the boreholes will eventually be closed due to saltwater intrusion, and the clients with a

borehole will connect to the mains water supply system.


Bibliography

[1] A. Danilenko, E. Dickson, and M. Jacobsen. Climate change and urban water utilities: challenges and opportunities. Water Working Notes No 24, Water Sector Board, Sustainable Development Network. World Bank, Washington DC, (50), 2010.

[2] C. Makwiza. Estimating outdoor water use allowing for the possible impacts of climate change.

PhD thesis, Faculty of Engineering at Stellenbosch University, March 2018.

[3] G. J. Syme, Q. Shao, et al. Predicting and Understanding Home Garden Water Use. Landscape and Urban Planning, 68:121–128, May 2004.

[4] T. Root and Survis. Human water climate interactions in the context of managing Florida’s water

supplies. 43:4–16, 01 2012.

[5] B. Randolph and P. Troy. Understanding Water Consumption in Sydney. 2007.

[6] Publico. https://www.publico.pt/2018/06/04/sociedade/entrevista/entrevista-godinho-1832382. Accessed: 2018-06-04.

[7] M. Loh, P. Coghlan, and W. Australia. Domestic water use study : in Perth, Western Australia,

1998-2001 / Michael Loh, Peter Coghlan. Water Corporation [West Leederville, W.A.], 2003.

[8] L. A. House-Peters and H. Chang. Urban water demand modeling: Review of concepts, methods,

and organizing principles. Water Resources Research, 47(15), 2011.

[9] M. Ghiassi, D. K. Zimbra, and H. Saidane. Urban Water Demand Forecasting with a Dynamic

Artificial Neural Network Model. Journal of Water Resources Planning and Management, 134(2):

138–146, 2008.

[10] J. Caiado. Performance of combined double seasonal univariate time series models for forecasting

water demand. CEMAPRE Working Papers 0903, Centre for Applied Mathematics and Economics

(CEMAPRE), School of Economics and Management (ISEG), Technical University of Lisbon, May

2009.

[11] S. Gato, N. Jayasuriya, and P. Roberts. Forecasting Residential Water Demand: Case Study.

Journal of Water Resources Planning and Management, 133:309–319, 2007.


[12] H. Chang, S. Praskievicz, and H. Parandvash. Sensitivity of urban water consumption to weather

and climate variability at multiple temporal scales: The case of portland, oregon. International

Journal of Geospatial and Environmental Research, 1(1):1–19, 2014.

[13] A. Jain, A. K. Varshney, and U. C. Joshi. Short-term Water Demand Forecast Modelling at IIT

Kanpur Using Artificial Neural Networks. Water Resources Management, 15:299–321, 2001.

[14] S. Fontdecaba, J. A. Sanchez-Espigares, L. Marco-Almagro, X. Tort-Martorell, F. Cabrespina, and J. Zubelzu. An Approach to Disaggregating Total Household Water Consumption into Major End-Uses. Water Resources Management, 27(7):2155–2177, May 2013.

[15] T. R. Gurung, R. Stewart, C. Beal, and A. Sharma. Smart water meter data for improved water

demand modelling of diversified water supply schemes. 02 2015.

[16] C. Makwiza and H. E. Jacobs. Sound recording to characterize outdoor tap water use events.

Journal of Water Supply: Research and Technology - Aqua, 2017.

[17] J. Chen, A. H. Kam, J. Zhang, N. Liu, and L. Shue. Bathroom activity monitoring based on sound.

In H. W. Gellersen, R. Want, and A. Schmidt, editors, Pervasive Computing, pages 47–61. Springer

Berlin Heidelberg, 2005.

[18] J. Fogarty, C. Au, and S. E. Hudson. Sensing from the basement: A feasibility study of unobtrusive

and low-cost home activity recognition. In Proceedings of the 19th Annual ACM Symposium on

User Interface Software and Technology, UIST ’06, pages 91–100, 2006.

[19] A. Pierrot and Y. Goude. Short-term electricity load forecasting with generalized additive models.

In Conference: Proceedings of ISAP power, pages 593–600, 2011.

[20] A. Ba, M. Sinn, Y. Goude, and P. Pompey. Adaptive Learning of Smoothing Functions: Application to

Electricity Load Forecasting. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,

Advances in Neural Information Processing Systems 25, pages 2510–2518. Curran Associates,

Inc., 2012.

[21] W. W. S. Wei. Time Series Analysis: Univariate and Multivariate Methods. Pearson Addison Wesley,

2nd edition, 2006.

[22] A. P. Pires. Notas de Series Temporais. March 2001.

[23] R. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts: Melbourne,

Australia, 2013. URL http://otexts.org/fpp/.

[24] E. Zivot. Time Series Econometrics - Lecture notes. 2006.

[25] G. P. E. Box and D. R. Cox. An Analysis of Transformations. Journal of the Royal Statistical Society,

26(2):211–252, 1964.

[26] S. Bisgaard and M. Kulahci. Time Series Analysis and Forecasting by Example. John Wiley and

Sons, Inc., 2011.


[27] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall,

3rd edition, 1994.

[28] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah. Time series clustering - A decade review.

Information Systems, 53:16 – 38, October 2015.

[29] J. Caiado, N. Crato, and D. Pena. A periodogram-based metric for time series classification. Comput. Stat. Data Anal., 50(10):2668–2684, June 2006.

[30] A. D. Chouakria and P. N. Nagabhushan. Adaptive dissimilarity index for measuring time series

proximity. Advances in Data Analysis and Classification, 1(1):5–21, Mar 2007.

[31] C. M. M. Pereira and R. F. de Mello. Common dissimilarity measures are inappropriate for time

series clustering. RITA, 20:25–48, 2013.

[32] P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. John Wiley and

Sons, Inc., 2003.

[33] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS'94, pages 359–370. AAAI Press, 1994.

[34] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in

Statistics). Springer-Verlag New York, Inc., 2005.

[35] B. Desgraupes. Clustering Indices. University of Paris Ouest - Lab Modal’X, pages 1–34, 2013.

[36] S. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 1st edition,

2006.

[37] Weather underground. https://www.wunderground.com/. Accessed: 2017-12-01.

[38] R. Hyndman, K. Smith, and X. Wang. Characteristic-Based Clustering for Time Series Data. Data

Mining and Knowledge Discovery, 13:335–364, November 2006.

[39] P. Montero and J. A. Vilar. Tsclust: An R Package for Time Series Clustering. Journal of Statistical

Software, 62(1), November 2014.

[40] M. Charrad, N. Ghazzali, V. Boiteau, and A. Niknafs. Nbclust: An R Package for Determining the

Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), October 2014.

[41] H. Jacobs and J. Haarhoff. Structure and data requirements of an end-use model for residential

water demand and return flow. Water SA, 30(3):293–304, 2004.

[42] S. Gato-Trinidad, N. Jayasuriya, and P. Roberts. Understanding urban residential end uses of water.

Water Science and Technology, 64(1):36–42, 2011.


Appendix A

Results of the Clustering

In this appendix we present further exploratory analysis on the clustering results obtained.

A.1 Exploratory analysis

In Figure A.3 and Figure A.10, the representative series Mean of Cluster 2 and Cluster 3 are shown. In Figure A.1, Figure A.4 and Figure A.11, the representative series Q95% of the clusters are shown. In Figure A.2, Figure A.5 and Figure A.12, the representative series Q25% of the clusters are shown. In Figure A.9 and Figure A.16, the boxplots per day of the week are shown. In Figure A.7 and Figure A.14, the boxplots per month of the aggregated monthly consumptions of the members of Cluster 4 and of the members of Cluster 5 are presented, respectively. In Figure A.8 and Figure A.15, the boxplots per month of the year for both clusters are shown.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.1: Representative series Q95% of Cluster 1 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.2: Representative series Q25% of Cluster 1 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.3: Representative series Mean of Cluster 2 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.4: Representative series Q95% of Cluster 2 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.5: Representative series Q25% of Cluster 2 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Hour; y-axis: Value; legend: Month (1 to 12)]

Figure A.6: Hourly pattern per month of Cluster 2.

[Figure: boxplots; x-axis: Time (months), Jan 2015 to Jul 2017; y-axis: Value]

Figure A.7: Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 2.

[Figure: boxplots; x-axis: Months (Jan to Dec); y-axis: Value]

Figure A.8: Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 4.

[Figure: boxplots; x-axis: Day of the Week (Monday to Sunday); y-axis: Value]

Figure A.9: Boxplot per day of the week of the normalized consumptions of the members of Cluster 4.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.10: Representative series Mean of Cluster 3 between 01/01/2015 and 31/07/2017.

[Figure: x-axis: Time (days); y-axis: Value]

Figure A.11: Representative series Q95% of Cluster 3 between 01/01/2015 and 31/07/2017.


[Figure: x-axis: Time (days); y-axis: Value]

Figure A.12: Representative series Q25% of Cluster 3 between 01/01/2015 and 31/07/2017.


[Figure: x-axis: Hour; y-axis: Value; legend: Month (1 to 12)]

Figure A.13: Hourly pattern per month of Cluster 3.

[Figure: boxplots; x-axis: Time (months), Jan 2015 to Jul 2017; y-axis: Value]

Figure A.14: Boxplot per month of the normalized aggregated monthly consumptions of the members of Cluster 3.

[Figure: boxplots; x-axis: Months (Jan to Dec); y-axis: Value]

Figure A.15: Boxplot per month of the year of the normalized aggregated monthly consumptions of the members of Cluster 3.

[Figure: boxplots; x-axis: Day of the Week (Monday to Sunday); y-axis: Value]

Figure A.16: Boxplot per day of the week of the normalized consumptions of the members of Cluster 3.


Appendix B

Additional forecast results

This Appendix shows the results of the models of representative series Q95% and Q25% of Clusters 1 and 2. The forecast results obtained from the models of representative series Mean, Q95% and Q25% of Cluster 3 are presented in Figure B.5, Figure B.6 and Figure B.7, respectively. Finally, Figure B.8 shows the forecasts of the models of representative series Q95% and Q25% as a consumption interval around the forecasts of the model of representative series Mean.

[Plot: x-axis "Time (days)" (2017-08-05 to 2017-12-03), y-axis "m³/day" (10 to 20); legend: Predictions, Real.]

Figure B.1: Forecast of the model of representative series Q95% for the interval 07/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAPE is equal to 19.712%.


[Plot: x-axis "Time (days)" (2017-08-14 to 2017-12-04), y-axis "m³/day" (0 to 6); legend: Predictions, Real.]

Figure B.2: Forecast of the model of representative series Q25% for the interval 16/08/2017 - 30/11/2017 of Cluster 1 in the original scale. The MAE is equal to 0.898.
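The captions of Figure B.1 and Figure B.2 report forecast accuracy as MAPE and MAE. As a reminder of what these two measures compute, here is a minimal Python sketch based on their standard definitions. The function names and the toy numbers are illustrative assumptions; this is not the code or the data used in the thesis.

```python
def mae(real, predicted):
    """Mean absolute error, in the units of the series (e.g. m3/day)."""
    if len(real) != len(predicted) or len(real) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)


def mape(real, predicted):
    """Mean absolute percentage error, in percent."""
    if len(real) != len(predicted) or len(real) == 0:
        raise ValueError("series must be non-empty and of equal length")
    if any(r == 0 for r in real):
        # Each term divides by the real value, so zeros are not allowed.
        raise ValueError("MAPE is undefined when a real value is zero")
    return 100.0 * sum(abs((r - p) / r) for r, p in zip(real, predicted)) / len(real)


# Toy daily consumptions (hypothetical values, not taken from the thesis).
real = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

print("MAE :", mae(real, predicted))
print("MAPE:", mape(real, predicted))
```

Because MAPE divides each error by the real value, it is undefined for days with zero consumption and becomes very large when the real value is close to zero; MAE has no such restriction.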

[Plot, caption missing in source: x-axis "Time (days)" (2017-08-18 to 2017-11-18), y-axis "m³/day" (0 to 60); legend: Predictions, Real.]

[Plot, caption missing in source: x-axis "Time (days)" (2017-08-12 to 2017-12-02), y-axis "m³/day" (0.0 to 10.0); legend: Predictions, Real.]


[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (2 to 12); legend: Predictions, Real.]

Figure B.5: Forecast of the model of representative series Mean for the interval 22/08/2017 - 30/11/2017 of Cluster 3 in the original scale. The MAPE is equal to 17.444%.

[Plot, caption missing in source: x-axis "Time (days)" (2017-08-09 to 2017-11-29), y-axis "m³/day" (10 to 25); legend: Predictions, Real.]

[Plot, caption missing in source: x-axis "Time (days)" (2017-08-11 to 2017-11-11), y-axis "m³/day" (0 to 8); legend: Predictions, Real.]



[Plot: x-axis "Time (days)" (2017-08-22 to 2017-11-10), y-axis "m³/day" (0 to 20); legend: Predictions (Mean), Predictions (Q1 25%), Predictions (Q3 95%), Real (Mean).]

Figure B.8: Forecast and band intervals for Cluster 3 from 22/08/2017 until 8/11/2017 in the original scale.


Appendix C

Additional daily disaggregation of consumption results

In this Appendix, the estimates of the total consumption are shown for Groups 3 and 4 in Figure C.1 and Figure C.3, respectively. The estimates of the garden watering consumption and domestic consumption are presented in Figure C.2 for Group 3 and in Figure C.4 for Group 4.

Some results of the second disaggregation method discussed in Section 4.5 are also shown in this Appendix. Figure C.5 shows the estimates of the total consumption for Group 1 Large, the group formed by the members of Group 1 with an exterior area larger than 1600 m². The estimates of the garden watering consumption and domestic consumption for this group are presented in Figure C.6. Figure C.7 shows the estimates of the total consumption for Group 3 Small, the group formed by the members of Group 3 with an exterior area smaller than 1600 m². The estimates of the garden watering consumption and domestic consumption for this group are presented in Figure C.8.
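The subgroups named above (Group 1 Large, Group 3 Small) are defined by comparing each member's exterior area with 1600 m². A minimal Python sketch of such a split follows. The member identifiers and areas are hypothetical, the function name is an assumption, and sending an area of exactly 1600 m² to the "small" side is also an assumption, since the text only says "larger than" and "smaller than".

```python
AREA_THRESHOLD_M2 = 1600.0  # threshold quoted in the text


def split_by_exterior_area(members, threshold=AREA_THRESHOLD_M2):
    """Split {member_id: exterior_area_m2} into (large, small) id lists.

    Assumption: an area exactly equal to the threshold is placed in
    "small"; the source does not say how ties are handled.
    """
    large = sorted(m for m, area in members.items() if area > threshold)
    small = sorted(m for m, area in members.items() if area <= threshold)
    return large, small


# Hypothetical members of one group with their exterior areas in m2.
group_members = {"lot_a": 2100.0, "lot_b": 950.0, "lot_c": 1600.0, "lot_d": 1750.0}

large, small = split_by_exterior_area(group_members)
print("Large:", large)
print("Small:", small)
```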

[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (5 to 10); legend: Predictions, Real.]

Figure C.1: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.


[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (0.0 to 12.5); legend: Domestic consumption estimates, Garden watering estimates, Real (total).]

Figure C.2: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 in the original scale.

[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (5 to 15); legend: Predictions, Real.]

Figure C.3: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.

[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (0 to 15); legend: Domestic consumption estimates, Garden watering estimates, Real (total).]

Figure C.4: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 4 in the original scale.


[Plot: x-axis "Time (days)" (2017-08-26 to 2017-12-04), y-axis "m³/day" (0 to 15); legend: Predictions, Real.]

Figure C.5: Estimates of the total consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale. The MAPE was equal to 66.41%.

[Plot: x-axis "Time (days)" (2017-08-26 to 2017-12-04), y-axis "m³/day" (0 to 15); surviving legend entry: Real (total).]

Figure C.6: Estimates of the garden watering and domestic consumption between 27/08/2017 and 30/11/2017 and the real total consumption of Group 1 Large in the original scale.

[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (5 to 15); legend: Predictions, Real.]

Figure C.7: Estimates of the total consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale. The MAPE was equal to 44.05%.


[Plot: x-axis "Time (days)" (2017-08-17 to 2017-12-03), y-axis "m³/day" (0 to 15); surviving legend entry: Real (total).]

Figure C.8: Estimates of the garden watering and domestic consumption between 22/08/2017 and 30/11/2017 and the real total consumption of Group 3 Small in the original scale.
