Machine Learning for Sparse Time-Series Classification
An Application in Smart Metering

CARL RIDNERT

Degree Project in Vehicle Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2019

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Abstract

Smart meters are measuring devices that collect labeled time series of utility consumption from sub-meters and can automatically transmit this data between the customer, the utility companies and other companies that offer services such as consumption monitoring and data cleaning. The smart meters in some cases experience communication errors. One such error occurs when the information about which utility a sub-meter is measuring is lost. This information is important when the utility producers bill the customers for their usage.

The information has had to be collected manually, which is inefficient in terms of time and money. In this thesis a method for classifying the meters based on their raw time series data is investigated. The data used in the thesis comes from Metry AB and contains thousands of time series in five different classes. The task is complicated by the fact that the data has a high class imbalance, contains many missing values, and that the time series vary substantially in length.

The proposed method is based on partitioning the time series into slices of equal size and training a Deep Neural Network (DNN) together with a Bayesian Neural Network (BNN) to classify the slices. Prediction on a new time series is performed by predicting its individual slices followed by a voting procedure. The method is justified through a set of assumptions about the underlying stochastic process generating the time series, coupled with an analysis based on the multinomial distribution.

The results indicate that the models tend to perform worse on samples from the classes "water" and "hot water", with the worst performance on the "hot water" class. Over all classes the models achieve accuracies of around 60%; by excluding the "hot water" class it is possible to achieve accuracies of at least 70% on the data set. The models perform worse on time series that contain only a few good-quality slices; by considering only time series with many good-quality slices, accuracies of 70% are achieved for all classes and above 80% when excluding "hot water".

It is concluded that in order to further improve the classification performance, more data is needed. A drawback of the method is the increased number of hyper-parameters involved in the extraction of slices. However, the voting method seems promising enough to investigate further on other highly sparse data sets.


Sammanfattning

Smart meters are devices capable of automatically transmitting data from sub-meters measuring utility consumption between the customer and the companies producing the utilities. This has opened up a market for companies that take consumption data and offer services such as apps where the customer can view their consumption, as well as cleaning or interpolation of data. This communication has involved certain problems; one identified problem is that information about which utility has been measured is sometimes lost. This information is important and has previously had to be retrieved manually in one way or another, which is inefficient.

In this Master's thesis it is investigated whether that information can be recovered using only the raw data and classification algorithms. The data comes from Metry AB and contains thousands of time series from five different classes. The task is made harder by the fact that the data exhibits a large class imbalance, contains many missing data points, and that the time series vary greatly in length.

The proposed method is based on partitioning the time series into so called "slices" of equal size and training Deep Neural Networks (DNN) and Bayesian Neural Networks (BNN) on these. Classification of new time series is done by letting the models vote on slices from them and choosing the class that receives the most votes. The thesis contains a theoretical analysis of the voting process based on a multinomial distribution combined with various assumptions about the process generating these slices; this analysis aims to motivate the choice of method.

The results show that the models can be trained to classify the meters correctly to a certain degree and that the voting process tends to give better results than using only one slice per meter. It is shown that the performance is much worse for one specific class; by excluding that class the models achieve final accuracies of between 70-80%. Some differences between the BNN model and the DNN model in terms of accuracy are observed, but the differences are too small to draw any general conclusions about which classification algorithm is best.

The conclusions are that the method seems to work reasonably well on this type of data, but that more work is needed to understand when it works and how it can be improved; this is future work. The greatest opportunity for improvement for this particular application is identified to be collecting more data.


Acknowledgments

I would like to thank Hossein for taking time to discuss the various problems I have encountered throughout the thesis, offering much valuable feedback and motivating me to finish with a work I can be proud of. I would like to thank Hadi for his feedback during the early stages of the project and Adam at Metry AB for providing me with the dataset and offering his viewpoints on the project. I would also like to express my gratitude to all the people who I am fortunate enough to have as friends.

Lastly I would like to thank my father Thommy and my mother Eija, who have always been supportive and loving.


Contents

1 Introduction and Problem Statement
  1.1 Problem Statement
  1.2 Related Research
      1.2.1 Contribution
  1.3 Notation

2 Background
  2.1 Smart Metering
      2.1.1 History
      2.1.2 Technology
  2.2 Classification Algorithms
      2.2.1 Neural Networks
      2.2.2 Bayesian Neural Networks
      2.2.3 Voting Procedure

3 Data Set

4 Methodology
  4.1 Preprocessing Data
  4.2 Training

5 Results
  5.1 Conclusions

6 Discussion
  6.1 Future Work

7 Appendix A.


List of Figures

2.1 Computational graph for a Neural Network with two hidden layers.
2.2 Computational graph Batch Normalization.
2.3 Computation of true probability of correct prediction in voting procedure.
2.4 Computation of true probability of correct prediction when voting for P2 = [0.3, 0.7, 0, 0, ..., 0] and [0.3, 0.7 − ε, ε, 0, ..., 0] as a function of K.
2.5 Simulation Result.
3.1 Distribution of Length of Time Series.
3.2 Histogram of Missing Value Percentages.
4.1 Preprocessing of Data workflow.
5.2 Performance as a function of the Slice Size.
5.3 Increasing the minimum number of slices per meter in the voting process, all classes.
5.4 Performance as a function of the Slice Size, four classes.
5.5 Final Accuracy when increasing the minimum number of slices per meter in the voting process, four classes.
5.6 Entropy Distributions BNN.
5.7 Entropy Distributions DNN.
5.8 Voting Accuracy On Meters.
5.9 Final Accuracy when increasing the maximum number of slices per meter in the voting process, four classes.


List of Tables

2.1 Communication Protocols for Smart Meters.
2.2 Sub-meter types.
5.1 Confusion Matrix of NN model for five classes.
5.2 Final Voting Accuracies, (x) = DNN, x = BNN.
5.3 Effect of Using Entropy in Voting.


Chapter 1

Introduction and Problem Statement

Utility sub-meters are devices that measure different kinds of utility consumption such as water, gas, electricity or heating. The information from several sub-meters can be collected by a so called smart meter, which gathers the information and communicates the consumptions to the utility providers. The introduction of smart meters replaced old systems where the utility usage had to be manually read and broadcast, and has revolutionized traditional utility monitoring and management policies. The increased communication bandwidth that follows has opened up a market for companies to offer various services related to the utility usage, such as applications where the customer can track their consumption, enabling them to take actions towards minimizing their costs. It also enables the utility providers to handle peak loads and reduce generation and maintenance costs [1].

The smart meters play a part in the context of smart grids, which can be viewed as a digitalization strategy of utility grids with the end goal of saving energy and increasing the transparency of what customers are paying for. In this introductory chapter a problem connected to this communication of consumption data will be presented, followed by some related research and how the approach for a solution will contribute.

1.1 Problem Statement

This project is made in cooperation with Metry AB and the School of Electrical Engineering and Computer Science (EECS) at KTH. Metry is a company that aims to reduce environmental impact and increase savings in the energy sector through the gathering of data from utility meters. They offer services such as applications where the user can track their consumption, and data processing.

There are several communication stages between the utility meters and Metry, and it can happen that some information about the meter is lost or that the information was not present from the beginning. One such case is the loss of information about which utility the sub-meter data is measured on. This information is important for several purposes such as billing and finding malfunctions based on the meter data. An example would be a flow meter that is first used to measure hot water and then for some reason is changed to measure cold water. In order to obtain the correct label for the meter, some manual work has had to be done, such as contacting the customer and asking about the correct utility. This is of course a time consuming task which would be good to avoid. The idea of this Master's thesis is to investigate if it is possible to classify the time series of consumptions using deep learning techniques.

There are some challenges which the mathematical models will have to address.

• Missing Data: Some time series are highly sparse, meaning that they contain many missing data points.

• Varying Lengths: The length of the time series varies substantially; some contain years of data while some contain only a few days of data.

• Imbalanced Classes: There is a large imbalance in the number of meters for each class.

1.2 Related Research

In this chapter, previous literature connected to the problem statement is reviewed.

Early on, the task of time series classification required some form of feature selection; this is due to the continuity assumption of a time series and the fact that there are often too many features or data points in a time series. Finding the patterns that can be useful is a problem that has been worked on for some time. Some of the early works in the late 90's showed that feature selection for time series would outperform a simple Euclidean distance based approach for calculating the similarity of time series [2].

Later, a different feature selection technique using pruning of regression trees, creating piece-wise constant slices of the time series, was shown to outperform the previous feature selections [3]. However, the classifiers were still using Nearest Neighbor methods for quantifying the similarity.

In 2009, the method of learning time series shapelets was proposed in the context of data mining [4]; this work was partly based on the work by Geurts. The paper introduced the notion of shapelets, which coincides with the slice notation in this thesis; the key idea is that class-defining features can be found locally instead of globally. They proposed a method where the slices or shapelets are taken such that the information gain is maximized.


In 2012 some improvements to the shapelet transformation were proposed; the authors use an F-statistic instead of the information gain for evaluating the optimal split [5]. Unlike the previous paper, their proposed method is made to work with any classification problem, not restricting it to tree-based methods.

With the recent advancements in computation-intensive Machine Learning algorithms following the increase in data capacity and processing power, different methods have been considered. One idea is that Deep Neural Networks (DNN) could learn the features automatically for a large set of slices where relaxations of the selection are made. By doing this, one avoids the computationally intensive task of finding the optimal shapelets. It has been shown that a Convolutional Neural Network (CNN) model using data augmentation techniques, called Multi-Scale Convolutional Neural Network (MCNN), outperforms the wavelet methods and provided state-of-the-art accuracy on several data sets at the time [6]. The framework has three stages which automatically extract features at different time scales. The authors outlined a voting method where the extracted slices of different time scales were classified by the network and the final class was selected as the class receiving the highest number of classified slices. The idea of the voting procedure will be used in this thesis. The same paper gave an analysis and insight on the similarity between the convolutional filters and the Euclidean norm used in the learning time-series shapelets method; it was concluded that the norm could be regarded as a special case of their proposed method.

In a recent paper, some additional data augmentation techniques besides constructing slices were evaluated using a CNN classifier [7]; the authors augmented the data set by adding time-warped versions of the extracted slices and used the voting procedure. They showed a general improvement in accuracies on some standard time series data sets.

The MCNN framework has been criticized, for instance in the paper by Wang et al. [8]. The theoretical motivation behind the voting method is questioned and considered to be ad hoc. They showed instead that better performance was achieved using a Fully Convolutional Network (FCN) working directly on the time series as inputs, without any preprocessing except for feature scaling.

Another recent publication has made an overview and comparison of modern TSC algorithms [9]. The authors also provide criticism towards the previously proposed window slicing models, again criticizing the lack of theoretical justification for the method, and show that they perform worse than many other TSC algorithms on the UCR data [10].

For time series classification tasks on sparse data sets, the research is more sparse. In a paper from 2012 the authors laid forth an algorithm for classifying sparse time series using a supervised method of finding a matrix factorization of the original data. Given a sparse time series data set X they constructed a reconstruction X ≈ UV^T, where U is a latent representation matrix. They then connected the latent variables to the labels Y through a one-layer neural network as Y = σ(UW). They defined a loss function as a sum of two sub-losses together with a regularization term [11].

However, a drawback with this method is that testing requires the computation of the representation in the latent space, since it is this space that is used in the final classification step. This becomes a problem in the case where one obtains one new test sample at a time: one vector can be factorized in an infinite number of ways, and it is unclear how to transform the test samples so as to input them into the model. From the text the only reasonable conclusion is that the authors compute the reconstruction matrices jointly for all the samples, for both the training and testing sets.

In 2015 another method for classification of sparse time series was proposed. The authors use Gaussian process regression, representing each time series in the data set by the posterior Gaussian process it induces. The Gaussian process hyper-parameters are learned on the training set, and after this transformation a support vector machine (SVM) classifier is trained on the induced processes with the corresponding labels. For testing accuracies they compute the posteriors and feed them through the SVM to obtain predictions. Furthermore, they present some methods for reducing the computational complexity of the training phase using a random Fourier feature method for kernel approximation [12]. They show that the model outperforms some benchmark models working with interpolated time series on the UCR database.

There are some works related to the problem of classification under imbalanced data sets. In general, the methods for dealing with imbalanced classes can be divided into two categories: data-level approaches and algorithm-level approaches [13]. Examples of the first approach include modification of the data sets by either oversampling classes with few samples, undersampling classes with many samples, or a combination of both. A paper from 2002 showed that a clever combination of under- and oversampling leads to better classification accuracies compared to only using one method; they presented graphs suggesting that when increasing the undersampling of the majority classes, the performance on the majority class decreases but the minority-class performance increases [14].

In a paper by Bing Hu from 2013, an approach to the slice extraction in time series classification was made where the authors proposed an algorithm for iteratively building so called data dictionaries, collections of extracted sub-sequences with class labels [15]. Given this dictionary they used a nearest neighbor approach for classifying new slices. They showed an increased performance on several data sets without missing values compared to using randomly created data dictionaries.

Moreover, P. Schäfer and U. Leser used the slicing method together with a "bag of patterns" model; the method consists of (1) preprocessing the time series into a set of slices, (2) transforming each slice into a discrete-valued word, (3) creating a feature vector from the relative frequencies of words and (4) using a classification algorithm on these feature vectors [16].

Altogether, there are several ad hoc solutions for the classification of time series data that features either missing points or imbalanced classes. However, to the best of our knowledge, there is no work that addresses both problems jointly on a practical, real-world data set. This brings us to the next section, which describes how this thesis will contribute.

1.2.1 Contribution

Here the main approach will be based on deep learning methods. It has been shown that Gaussian processes can be approximated by wide neural networks with a high number of nodes in a hidden layer [17]. The Gaussian process method outlined in previous work will therefore not be considered; instead the work will follow a similar approach as the methods based on extracting slices/shapelets together with the voting procedure, but with other kinds of classifiers: a regular neural network and a Bayesian neural network. I will not follow the bag of patterns model but will instead rely on the neural network's capability of extracting the relevant features from the slices.

Indeed, the critique towards the proposed voting method in the paper describing the MCNN algorithm can be justified in the sense that no real arguments are made as to why the voting should work. Since both papers were working on relatively clean data sets with inputs of equal length and where the time series do not contain missing values, it can indeed be difficult to see why one would not simply use the entire time series as input. However, it is worth mentioning that the criticism does not take into account the advantage of window slicing/shapelet extraction together with the voting procedure when classifying sparse time series with large length differences. The method of slicing and voting is flexible because slices can be taken where data exists and the voting works on time series of different lengths.

The proven performance on standard data sets, together with the identified benefits on sparse data with large length variation and the inherent problems with the data set described above, justifies the evaluation of the slicing/voting method in this thesis.
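To make the voting idea concrete, the following minimal Python sketch performs a majority vote over slice-level class probabilities. The probabilities, class count and number of slices are made up for the example and do not correspond to the output of any model in this thesis.

    import numpy as np

    def vote_on_slices(slice_probs: np.ndarray) -> int:
        """Majority vote over per-slice predictions.

        slice_probs: array of shape (n_slices, n_classes) with class
        probabilities for each slice of one time series.
        Returns the index of the class receiving the most slice votes.
        """
        slice_votes = slice_probs.argmax(axis=1)                       # hard label per slice
        counts = np.bincount(slice_votes, minlength=slice_probs.shape[1])
        return int(counts.argmax())                                    # ties broken by lowest index

    # Example: three slices of one meter, five utility classes.
    probs = np.array([[0.1, 0.6, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.1, 0.1, 0.1],
                      [0.7, 0.1, 0.1, 0.05, 0.05]])
    print(vote_on_slices(probs))   # -> 1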

I will aim to contribute by:

• Evaluating time series classification on data coming from utility sub-meters, thereby taking a first step into the smart grid research area.

• Evaluating the slicing/voting method's performance on a sparse and imbalanced data set of time series with a large length variance, while providing some additional insight into why it might or might not work.

• Experimenting with Bayesian Neural Networks as classifiers, which has not been done in the above mentioned papers.


1.3 Notation

D: Denotes a data set

W: Denotes the weights in the neural network; for notational simplicity I will sometimes also include the batch normalization parameters in this notation.

R: Denotes the real numbers.

R^d: Denotes the Cartesian product of d copies of R, a d-dimensional vector space.

Ω: Denotes an event space.

F: Denotes a sigma algebra.

IID: Denotes Independent and Identically Distributed random variables.

#A: Denotes the cardinality of a set A.

I will for most parts follow the convention of bold lowercase letters for vectors, bold capital letters for matrices and regular letters for scalars in the various equations throughout this paper.

Definition 1: A time series is an ordered set of data points x = (x_1, x_2, ..., x_n) indexed by time.

Definition 2: A slice is a contiguous sub-sequence of a time series, x_i, ..., x_j with 1 ≤ i < j ≤ n, taken to have a fixed length j − i + 1.
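As a small illustration of Definition 2, the sketch below cuts a time series into non-overlapping fixed-length slices and drops slices containing missing values (NaN). This "keep only complete slices" rule is only one possible convention chosen for the example; the slice extraction actually used is described in the methodology chapter.

    import numpy as np

    def extract_slices(series: np.ndarray, slice_len: int) -> np.ndarray:
        """Cut a 1-D time series into non-overlapping slices of length slice_len,
        keeping only slices without missing values (NaN)."""
        n_full = len(series) // slice_len
        slices = series[:n_full * slice_len].reshape(n_full, slice_len)
        good = ~np.isnan(slices).any(axis=1)   # drop slices containing gaps
        return slices[good]

    x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0])
    print(extract_slices(x, 4))   # only the second slice [5, 6, 7, 8] survives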


Chapter 2

Background

2.1 Smart Metering

Smart metering is embedded within the technological advancement of utility grids called smart grids. At first, the term smart grid was mostly used in the context of power grids, which regards the production, distribution and consumption of electricity, but the same principles can be extended and applied to other utilities such as water, gas and heating. The technological advancements of utility grids are made to address and improve on areas in which the traditional grids are limited.

There are several ways of modernizing the utility grids; essentially, the smart grid is based upon digitalization and information control. The Energy Independence and Security Act signed by the US Congress in 2007 gave the first official statements that characterize and define a smart grid [18].

1. Increased use of digital information and controls technology to improve reliability, security, and efficiency of the electric grid.

2. Dynamic optimization of grid operations and resources, with full cyber-security.

3. Deployment and integration of distributed resources and generation, including renewable resources.

4. Development and incorporation of demand response, demand-side resources, and energy-efficiency resources.

5. Deployment of "smart" technologies (real-time, automated, interactive technologies that optimize the physical operation of appliances and consumer devices) for metering, communications concerning grid operations and status, and distribution automation.

6. Integration of "smart" appliances and consumer devices.

7. Deployment and integration of advanced electricity storage and peak-shaving technologies, including plug-in electric and hybrid electric vehicles, and thermal-storage air conditioning.

8. Provision to consumers of timely information and control options.

9. Development of standards for communication and interoperability of appliances and equipment connected to the electric grid, including the infrastructure serving the grid.

10. Identification and lowering of unreasonable or unnecessary barriers to adoption of smart grid technologies, practices, and services.

A smart meter is a measuring device that collects data from different kinds of utility meters and that is capable of communicating the data to the companies that produce the utility. The utilities considered in this thesis are electricity, water, district heating/cooling and gas. Before the introduction of smart meters, the utilities were measured by offline meters that displayed their readings on a screen in close proximity to the meter itself. Energy companies were therefore required to physically send workers to collect the readings from the meters.

Traditionally, the energy companies produced a certain amount of utility and distributed it over their customers without any significant feedback from individual customers regarding their consumption. The smart meters improve the feedback step and enable the users to transmit data with a high bandwidth to the utility companies using modern network architectures. Another key innovation is the two-way communication between customers and utility providers: the meters can not only send but also receive information, such as when the price of the utility is high. This can for instance be used in an appliance control scenario, such as running a dishwasher when the water price is low.

The idea is that the transparency in information that follows from the smart meters will make the trade between the companies and the customers more fair, in the sense that the customer knows what he or she is paying for and the company knows what they are billing for. Previously, consumption in residential buildings was taken as the average over the total consumption of all residents. Smart meters are capable of collecting information from many sub-meters installed in separate households, so that customers can be billed more accurately for their actual consumption.

As a consequence of the increased bandwidth of communication between the meters and the companies, close to real-time monitoring of the consumption is now possible. This has enabled companies to emerge that offer different services, such as the possibility for customers to view their consumption in an application.

One potential effect is that people can see during which hours they consume more and adjust their consumption in order to save money and energy. The companies can use the information to map the load cycles in the consumption and adjust their production according to demand.

2.1.1 History

A device for automatic telephonic communication systems was introduced and patented in 1972 by Theodor Paraskevakos [19]. The development and roll-out of smart meters has since then progressed in many countries. However, due to political and market reasons, the roll-out at a national level has differed between countries.

In the EU, Sweden and Italy became the pioneers in the implementation of smart metering. The Swedish government's 2003 approval of the requirement that energy companies monitor meters on a monthly basis effectively kick-started the development in Sweden [20]. By 2009 this requirement was satisfied, and due to this and technical advancements, energy companies started implementing smart meters to reduce the monitoring costs.

Many other European countries have since followed the same pattern, and today the European Union has a clear goal of having at least 80% of utility meters being smart by the end of 2020 wherever this is cost-effective, meaning that the benefit from smart meters outweighs the possible negative impact.

In a report for the European conference on smart metering deployment in the EU from 2014, the following predictions and evaluations were stated [21]:

• close to 200 million smart meters for electricity and 45 million for gas will be rolled out in the EU by 2020. This represents a potential investment of €45 billion

• by 2020, it is expected that almost 72% of European consumers will have a smart meter for electricity. About 40% will have one for gas

• the cost of installing a smart meter in the EU is on average between €200 and €250

• on average, smart meters provide savings of €160 for gas and €309 for electricity per metering point (distributed among consumers, suppliers, distribution system operators, etc.) as well as an average energy saving of 3%.

This further motivates the expansion of smart metering.

2.1.2 Technology

The technological achievements of smart meters are mainly due to the advancements in digital communication. A communication protocol is a set of rules governing the exchange of data between computers. There exist multiple different communication protocols that are used for smart metering; they define how the communication works between the utility companies' servers and the integrated computers in the meters. The following table gives an overview of the communication protocols that are typically used for smart meters as of today [22].

Table 2.1: Communication Protocols for Smart Meters.

Name        Comment
GSM/GPRS    Mobile phone network
IEC61107    Wired connection
Wavenis     Two-way wireless system
TCP/IP      Used on the Internet
WAN         Wide Area Network

The TCP/IP protocol is interesting in this sense because it is widely used worldwide in households for the communication between personal computing devices such as desktops, laptops, mobile phones and tablets, through what is referred to as the internet.

Smart meters using the TCP/IP protocol for the communication make deployment in households simpler: connecting the smart meter to the router enables communication with any server on the internet. Moreover, work has been done to investigate the usage of the Global System for Mobile Communication (GSM) as a dedicated network platform for smart meters [23].

Altogether, there are several different varieties of communication methods used for smart meters; there is no global standard and there are differences between countries. Moreover, several different protocols can be combined in the process of sending the information.

As previously mentioned, several individual sub-meters can be connected to a smart meter. These sub-meters work in different ways depending on what utility they are measuring. Below follows an overview of common meter types.

Power Meters typically work by measuring the average power over some time interval. The average power over an AC circuit can be computed as the average of the instantaneous power over some time interval T,

P_a = \frac{1}{T} \int_{t=0}^{T} v(t) I(t) \, dt.    (2.1)

Here v(t), I(t) denote the voltage and the current as functions of time; these quantities are reasonably easy to measure. Typically one is interested in the energy rather than the average power; the energy delivered during the time interval T is then obtained by multiplying the average power by T.
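As a small numerical illustration of (2.1), the following sketch approximates the average power from sampled voltage and current with the trapezoidal rule. The sinusoidal waveforms, amplitudes and sampling rate are invented purely for the example.

    import numpy as np

    T = 0.02                                    # one 50 Hz period, in seconds
    t = np.linspace(0.0, T, 1000)
    v = 325.0 * np.sin(2 * np.pi * 50 * t)      # voltage samples (V)
    i = 10.0 * np.sin(2 * np.pi * 50 * t)       # in-phase current samples (A)

    p = v * i                                   # instantaneous power
    P_a = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(t)) / T   # trapezoidal average
    energy = P_a * T                            # delivered energy (J)
    print(P_a, energy)                          # about 1625 W and 32.5 J here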


Flow Meters are used to measure quantities of fluid or gas utilities delivered to the customer.

There are several different principles for measuring flow, and they can be grouped into two categories: mass flow meters and volumetric flow meters. Mass flow meters measure the mass that is flowing through the system, while volumetric flow meters measure the volume using the velocity of the fluid and some cross section area of the measuring tube.

A common type is the Coriolis mass flow meter, which works by feeding the utility through a vibrating pipe and calculating the Coriolis force applied to the pipe by the fluid, using displacement meters mounted on the pipe. The density of the fluid is then used to provide a measure of the volume flow.

Another type is the thermal mass flow meter, which works by measuring a change in heat over some distance in a pipe. There are also magnetic flow meters and ultrasonic flow meters. I will not go into further detail about how the different kinds of flow meters work; the interested reader can find comprehensive information on the internet.

Temperature Meters are used in the measurement of district cooling and heating. They can be used together with the flow rate and other properties of the cooling or heating agent (usually water) in order to calculate the energy that is used for the cooling or the heating [24].

The following table depicts the relationship between some utilities and what kind of measuring devices are used. These are the utilities that define the classes in the classification problem this thesis deals with.

Table 2.2: Sub-meter types.

Utility       Cooling              Electricity   Heating              Hot Water   Water
Meter type    Temperature + Flow   Power Meter   Temperature + Flow   Flow        Flow


2.2 Classification Algorithms

Consider some probability space (Ω, F, P) equipped with a sigma algebra F and a probability measure P. On this probability space we can define a set of random variables {X_i}_{i=1}^{C} : Ω → R^d, each equipped with some distribution function. Associated with each random variable X_i is a label i representing the class that the random variable belongs to; this is denoted {Y_i}_{i=1}^{C} and is a deterministic mapping from each observation of X_i. We assume that the data are samples from these random variables and that the class tells us which of the C random variables a sample is taken from.

The classification task involves finding the best mapping between observations of the random variable X (the input) and the corresponding class labels. Assume a data set D = {x, y} = {x_1, ..., x_N, y_1, ..., y_N} is given, where each sample x_n ∈ R^d has a corresponding one-hot encoded label y_n ∈ R^C (see (2.5)). The labeled input/output pair can be viewed as a realization of the conditional random variable X_i together with its label. The model function f_W : R^d → R^C, parameterized by W, takes as input a sample from the input space and outputs a probability distribution, or likelihood, over the labels p(y_n|x_n). The goal in the classification task is to select W so as to maximize the likelihood function for the observed data. In the case of an infinite number of classes taking real values, the task is instead called regression analysis.

After inferring the optimal parameters of the model one can classify new samples by picking the class corresponding to the highest probability in the likelihood. The idea is that if the number of samples is large enough, the model should learn a good approximation of the true conditional distribution and thus be able to generalize to new data points.

The name Machine Learning (ML) comes from the procedure by which the parameter inference is done: one "learns" the model parameters by solving an optimization problem with respect to some subset of D called the training data, and then, assuming that new data come from a distribution similar to that of the training data, the model will be able to make correct predictions on their classes. When training a supervised machine learning model, maximum likelihood estimation means solving a (possibly highly non-convex) optimization problem:

\min_{W} \; l(W, x, y).    (2.2)

where l(W, x, y) is a loss function of the model parameters, the training data and the labels. The loss function is selected in such a way that it is minimized when the model assigns the correct label to every data point x_n ∈ x. A common choice of loss function in the classification setting is the cross entropy function. An assumption is that the classes are mutually exclusive, meaning that every sample belongs to exactly one class. The cross-entropy function is typically averaged over the data set, and for an individual sample it is defined as

H(y, \hat{y}) := -\sum_i y_i \log(\hat{y}_i).    (2.3)

Here y_i and \hat{y}_i are the true probability and the predicted probability, respectively, of the sample belonging to class i; this function is averaged over the data set to get the final loss. For a DNN the resulting likelihood is highly non-convex, which means that the cross entropy loss is often highly non-convex as well; reaching global optimality is therefore in general not possible, and one has to settle for local optimality. The optimization problem is solved using a first order algorithm, meaning that it uses first derivatives of the objective function. The most commonly used method is gradient descent, where one adjusts the parameters iteratively in the opposite direction of the gradient of the objective function. This method is described in the next sections, where two different models are outlined.
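For concreteness, the averaged cross entropy of (2.3) can be computed as in the following sketch; the one-hot labels and predicted probabilities are made up for the example.

    import numpy as np

    def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Average cross entropy for one-hot labels y_true and predicted
        class probabilities y_pred, both of shape (N, C)."""
        eps = 1e-12                                   # avoid log(0)
        return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1)))

    y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
    y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
    print(cross_entropy(y_true, y_pred))   # about 0.29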

The model's performance is typically evaluated by calculating the loss function on a disjoint set of test data D_t = {x_t, y_t}, D_t ∩ D = ∅, and calculating the fraction of correctly classified test samples.

2.2.1 Neural Networks

The neural network model family has its origins in the perceptron, which is a linear classifier introduced in 1957 by Rosenblatt [25]. The architecture was inspired by neurons in the brains of animals, their interplay and their role in processing information. A neural network consists of several stacked perceptrons, or nodes/neurons, organized in layers, where each node receives a summed weighted input from the nodes in the previous layer and feeds this through an activation function. This section explains the neural network in more detail.

Consider a non-linear function f(x; θ) equipped with a set of parameters or weights θ = {W, b}. The network, equipped with weights W = {W_1, W_2, ..., W_L} and biases b = {b_1, ..., b_L}, can be written as a composition of L functions f_i, i ∈ {1, ..., L}, where f(x; θ) = f_L ∘ ... ∘ f_1, i.e. f(x; θ) = f_L(f_{L−1}(... f_1(x))). One can interpret the network as being organized into L layers, where each function f_i takes as input the output from the previous layer and is of the form f_i = g_i(W_i f_{i−1} + b_i); here W_i is the matrix of weights connecting the output of the previous layer f_{i−1} to layer i and b_i is the bias vector.

The function g_i is the transfer function; it is applied element-wise to the vector of layer i and is usually the same for all layers except the final layer. This function makes the network non-linear; note that without a transfer function, f is just a linear transformation of the input data. A common choice of transfer function is the rectified linear unit (ReLU)

g_i(x) = \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{otherwise.} \end{cases}    (2.4)

The first layer is known as the input layer and is simply the vectorized input data placeholder; the last layer is known as the output layer. In a classification setting the output layer typically has the same number of nodes as the number of classes. If we let the output layer f_L's activation be a softmax function, then its nodes define a probability distribution over the class space, and classification can be made by selecting the outcome for which the probability of a given input is the highest. For a network with two hidden layers, the model can be visualized as a computational graph as in Figure 2.1. Here the arrows describe the order of computations, the nodes describe the information at that stage and the text outside the nodes describes the computation.

Figure 2.1: Computational graph for a Neural Network with two hidden layers.

Assume we have a set of classes [C] = {1, 2, ..., C} and further that the observation y_n belongs to a class c ∈ [C]; the labels y_n are represented as one-hot encoded vectors,

y_n = (y_n^1, ..., y_n^C), \quad \text{where } y_n^c = 1 \text{ and } y_n^i = 0, \; \forall i \neq c.    (2.5)

With the number of nodes in the last layer of the network equal to the number of classes, the output function maps to R^C, so that we can write f(x_n) = [f^1(x_n), ..., f^C(x_n)]. One can define the likelihood of the class labels by applying the softmax function to the output function, as in the following equation


p(y_n^i | f(x_n)) = \frac{e^{f^i(x_n)}}{\sum_{j=1}^{C} e^{f^j(x_n)}}.    (2.6)

The softmax function corresponds to g_L, the activation of the last layer; for all the hidden layers the activation was taken to be the ReLU. The class probabilities form a vector which is a likelihood function.
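The forward pass of the network in Figure 2.1, with ReLU hidden layers (2.4) and a softmax output (2.6), can be sketched as below. The layer sizes and random weights are arbitrary and only serve to illustrate the computation; they are not those of the models trained in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(z, 0.0)                 # equation (2.4), element-wise

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)      # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)   # equation (2.6)

    def forward(x, params):
        """Forward pass of a feed-forward net with two hidden ReLU layers
        and a softmax output."""
        (W1, b1), (W2, b2), (W3, b3) = params
        h1 = relu(x @ W1 + b1)
        h2 = relu(h1 @ W2 + b2)
        return softmax(h2 @ W3 + b3)              # class probabilities p(y|x)

    d, h, C = 8, 16, 5                            # input dim, hidden width, classes
    params = [(rng.normal(size=(d, h)) * 0.1, np.zeros(h)),
              (rng.normal(size=(h, h)) * 0.1, np.zeros(h)),
              (rng.normal(size=(h, C)) * 0.1, np.zeros(C))]
    x = rng.normal(size=(3, d))                   # a mini-batch of three samples
    print(forward(x, params).sum(axis=1))         # each row sums to 1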

The cross entropy loss for the one-hot encoding of the labels for a data set of N samples can be written as the average loss

L(x, y; W, b) = \frac{1}{N} \sum_{n=1}^{N} \sum_i -y_n^i \log p(y_n^i | f(x_n)).    (2.7)

As described previously, this is a measure of how well the data is mapped to the corresponding labels by the neural network. The goal is to minimize the loss function over the training data. Gradient descent is an optimization method used to find maxima and minima of functions with respect to their parameters θ = {W, b}. The update rule for gradient descent is to repeat the following update until convergence:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(x, y, \theta_t).    (2.8)

Here, η is the learning rate parameter. The choice of learning rate is important for the convergence characteristics: if the learning rate is too large, the optimal solution might be overshot and the parameter updates will never converge to an optimal solution; if it is too low, the updates will take longer to converge.

The optimization landscape is most often highly complicated with lots of local minima, and therefore finding a global optimum is intractable. The gradient descent method ensures local optimality; however, this local minimum is often far from the best option and one would like to find a better one.

To this end one can use techniques such as stochastic gradient descent, where the updates of the parameters are based on one sample at a time, approximating the gradient in (2.8). This significantly speeds up training in many cases and helps avoid getting stuck in local minima by introducing some noise into the updates. As long as the approximation of the gradient is good enough and the computational cost of computing the full gradient is high enough, this method works well. One can also use mini-batch gradient descent, where a batch, or set of samples, is used to approximate the true gradient of the loss function.

Another method is to utilize a different version of gradient descent called the ADAM optimizer, where the learning rates and momentum are varied for different weights; this optimizer has been shown to work well in practice [26]. The ADAM optimizer updates the learning rates and weight decay parameters while performing stochastic gradient descent on the loss function, according to Algorithm 1.

Algorithm 1 ADAM Optimizer Algorithm.

Require: α: step size.
Require: β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates.
Require: L(x, y; θ): stochastic objective function with parameters θ.
Require: θ_0: initial values of parameters.

    m_0 ← 0 (initialize first moment vector)
    v_0 ← 0 (initialize second moment vector)
    t ← 0 (initialize time step)
    while θ_t not converged do
        t ← t + 1
        g_t ← ∇_θ L(x, y; θ_{t−1}) (get loss gradients as described in Algorithm 3)
        m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t (update biased first moment estimate)
        v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t^2 (update biased second moment estimate)
        m̂_t ← m_t / (1 − β_1^t) (compute bias-corrected first moment estimate)
        v̂_t ← v_t / (1 − β_2^t) (compute bias-corrected second moment estimate)
        θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (update parameters)
    end while
    return θ_t (resulting parameters)
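A minimal implementation of one ADAM update, applied to a toy quadratic objective, is sketched below. The default hyper-parameters follow the values suggested in [26], except for the step size used in the toy loop, which is chosen only to make the example converge quickly.

    import numpy as np

    def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One ADAM update (Algorithm 1) for a parameter vector theta."""
        m = beta1 * m + (1 - beta1) * grad           # biased first moment
        v = beta2 * v + (1 - beta2) * grad**2        # biased second moment
        m_hat = m / (1 - beta1**t)                   # bias corrections
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Minimize the toy objective L(theta) = ||theta||^2; its gradient is 2*theta.
    theta = np.array([1.0, -2.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, 1001):
        grad = 2 * theta
        theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
    print(theta)   # both entries end up close to 0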

Back Propagation In order to compute the gradient of the cross entropy loss function with respect to all weights in the neural network, a commonly used technique is the backpropagation algorithm. This algorithm is based on the chain rule of calculus and is described below in the more general setting of computational graphs.

Backpropagation for the computation of partial derivatives is a quite general method and is not limited to the setting of neural networks. The reader will thus benefit from a first description in its more general form. To this end, consider some computational graph G with n nodes h_1, ..., h_n. We let h_n denote the output node in the graph and h_1, ..., h_p, p < n, be the set of input nodes. Note that this assignment of nodes is merely for notational simplicity; there is nothing stopping us from arranging the indices differently. The nodes in the computational graph denote tensor-valued variables.

In graphs, nodes can be connected by directed or undirected edges; in this setting we assume that the graph is directed, from the input towards the output. Consider some node h_i ∈ G. We define Parent(i) as the set of parents of node i; this is the set of indices corresponding to the nodes from which there is a directed edge connecting to node i.

We denote by L_0 the output layer and by L_k = Parent(L_{k−1}), k ≥ 1, the previous layers. Algorithm 2 computes all the partial derivatives with respect to the nodes in the computational graph.


Algorithm 2 Partial Derivatives in a Computational Graph.

    Start with inputs h_1, ..., h_p and make a forward pass through the network, computing h_1, ..., h_n.
    Set ∂h_n/∂h_n = 1 and L = {n}
    for j ∈ Parent(L) do
        Compute and store ∂h_n/∂h_j = Σ_{i : j ∈ Parent(i)} (∂h_n/∂h_i)(∂h_i/∂h_j)
        Set L = Parent(L)
    end for
    Return all partial derivatives {∂h_n/∂h_i}_i

In the context of neural networks we consider h_n to be the loss function and the other nodes h_i, i ∉ {1, ..., p, n}, to be the activations of the nodes in the hidden layers. We are not only interested in the partial derivatives with respect to the activations but also in those with respect to the parameters of the model. In order to obtain these we have to multiply by the partial derivative of the activations with respect to the parameters. For a more in-depth review of the backpropagation algorithm, see chapter 6.5 in [27].

For the computational graph associated with a feed-forward neural network we obtain the final backpropagation algorithm according to Algorithm 3.

Algorithm 3 Backpropagation For Feed Forward Neural Network.

Require: an input/output pair (x, y)
    Make a forward pass of the input to obtain the activations {h_k}_{k=1}^{L} and compute the gradient on the output layer:
    g ← ∇_{f(x)} L(x, y; θ)
    for k = L, L−1, ..., 1 do
        Multiply the gradient on the output of the layer with the partial derivative with respect to the activation function:
        g ← ∇_{h_k} L(x, y; θ) = g · ∂g(h_k)/∂h_k
        Compute the gradients of the pre-activations with respect to the weights and biases, and set:
        ∇_{W_k} L(x, y; θ) = g h_{k−1}
        ∇_{b_k} L(x, y; θ) = g
    end for
    Return the gradients ∇_{θ_k} L(x, y; θ), k = 1, ..., L

In the case of the cross entropy loss and a softmax function on the output layer, the partial derivatives can be computed as follows.

The gradient on the output layer for a single sample can be computed as

\frac{\partial L(x_n, y_n; \theta)}{\partial f^k(x_n)} = \frac{\partial}{\partial f^k(x_n)} \left( -\sum_i y_n^i \log \frac{e^{f^i(x_n)}}{\sum_j e^{f^j(x_n)}} \right)    (2.9)

= -y_n^k \left( 1 - \frac{e^{f^k(x_n)}}{\sum_j e^{f^j(x_n)}} \right) - \sum_{i \neq k} \frac{\partial}{\partial f^k(x_n)} \, y_n^i \log \frac{e^{f^i(x_n)}}{\sum_j e^{f^j(x_n)}}    (2.10)

= -y_n^k \left( 1 - \frac{e^{f^k(x_n)}}{\sum_j e^{f^j(x_n)}} \right) + \sum_{i \neq k} y_n^i \, \frac{e^{f^k(x_n)}}{\sum_j e^{f^j(x_n)}}    (2.11)

= \frac{e^{f^k(x_n)}}{\sum_j e^{f^j(x_n)}} - y_n^k,    (2.12)

where the last equality holds because the elements of y_n sum to one.
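The compact result (2.12) can be verified numerically with a finite-difference check; the logits and label below are arbitrary values chosen only for the illustration.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def loss(f, y):
        return -np.sum(y * np.log(softmax(f)))     # cross entropy of one sample

    f = np.array([1.5, -0.3, 0.2, 0.8, -1.0])      # network outputs f^k(x_n)
    y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])        # one-hot label

    analytic = softmax(f) - y                      # equation (2.12)
    eps = 1e-6
    numeric = np.array([(loss(f + eps * np.eye(5)[k], y)
                         - loss(f - eps * np.eye(5)[k], y)) / (2 * eps)
                        for k in range(5)])
    print(np.allclose(analytic, numeric, atol=1e-6))   # True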

For the ReLU, the partial derivative with respect to the activation function is

\frac{\partial g(h_k)}{\partial h_k} = \begin{cases} 1, & \text{if } h_k > 0 \\ 0, & \text{otherwise.} \end{cases}    (2.13)

Since the pre-activations are a summation over the weights multiplied by the outputs from the previous layer, the partial derivatives with respect to the weights and biases are given by

\frac{\partial (W h_{k-1} + b)}{\partial W} = h_{k-1}, \qquad \frac{\partial (W h_{k-1} + b)}{\partial b} = 1.    (2.14)

Batch Normalization Another technique that has been proven to improve the performance of some neural networks is the Batch Normalization algorithm, introduced by S. Ioffe and C. Szegedy in 2015 [28]. The algorithm works by adding a linear transformation of the output of a layer in a neural network before passing it through the activation function. The transform introduces two new parameters for each hidden node, a scale γ and a shift parameter β; these are trainable and updated via the backpropagation and gradient descent algorithms together with the network's weights.

Assuming we have some activations in a hidden layer of a mini-batch B = {x_1, ..., x_m} which are to be used for the gradient calculations, we can define the mini-batch mean as

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i,    (2.15)

and the mini-batch variance as

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2.    (2.16)

The batch normalization then transforms the activations and produces the output y as follows

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},    (2.17)

y = \gamma \cdot \hat{x} + \beta.    (2.18)
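Equations (2.15)-(2.18) amount to the following forward computation over a mini-batch. The random batch, the scale initialized to one and the shift initialized to zero are only illustrative choices for the example.

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """Batch normalization transform, equations (2.15)-(2.18).

        x: mini-batch of pre-activations, shape (m, n_hidden).
        gamma, beta: trainable scale and shift, shape (n_hidden,).
        """
        mu = x.mean(axis=0)                        # (2.15) mini-batch mean
        var = x.var(axis=0)                        # (2.16) mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)      # (2.17) normalize
        return gamma * x_hat + beta                # (2.18) scale and shift

    x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(32, 4))
    y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # ~0 mean, ~1 std per node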

The division by the variance and the multiplication by the scale parameter are done element-wise. We note that the addition of the trainable parameter β eliminates the need for a separate bias term, since it does the same job of adding a constant term to the pre-activations; the parameters of the model are then θ = {W_i}_{i=1}^{L}, {γ_i}_{i=1}^{L}, {β_i}_{i=1}^{L}.

The computational graph for the simple example network will thus change; this is visualized in Figure 2.2.

Figure 2.2: Computational graph Batch Normalization.

Here the nodes BN1 and BN2 represent the operations carried out in the Batch Normalization algorithm. The partial derivatives that have to be multiplied with the gradients with respect to that node in the backpropagation algorithm are given by

\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma    (2.19)

\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-(\sigma_B^2 + \epsilon)^{-3/2}}{2}    (2.20)

\frac{\partial L}{\partial \mu_B} = \left( \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2(x_i - \mu_B)    (2.21)

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}    (2.22)

\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i    (2.23)

\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}    (2.24)

One might ask what batch normalization actually does to the model and why it has been shown to improve training speed and accuracies of neural networks. In the original paper, the authors argued that the batch normalization algorithm reduces a so called internal covariate shift. They argued that because the output of a layer depends on the previous layers, a small error in the parameters will amplify throughout the network; they call this change in distribution a covariate shift and argue that by scaling the inputs to each layer one makes the distributions more equal and reduces the effects of the covariate shift.

However, in 2018 it was argued that the success of batch normalization does not come from a reduction in internal covariate shift but rather from a smoothing of the optimization landscape [29].

Dropout Overfitting is a common issue with deep learning architectures; it happens when a model performs much better on the training data than on the test data, in which case the model is said to generalize poorly. Usually this occurs because of a high model complexity. Methods for handling this problem are referred to as regularization techniques.

A simple regularization method named dropout was proposed in 2014 by N. Srivastava et al. [30]. The idea is, at each training epoch, to sub-sample the nodes in the network by dropping each node in the hidden layers with some probability. This dropout rate is a parameter of the model. At test time, one considers the full network with all hidden nodes. This method has been proven to improve classification results.
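A sketch of the dropout idea is given below. Note that this is the "inverted" formulation, which rescales the surviving activations during training so that the full network can be used unchanged at test time; the original paper instead rescales the weights at test time, but the two are equivalent in expectation.

    import numpy as np

    def dropout(h, drop_prob, training, rng):
        """Inverted dropout on a layer's activations h."""
        if not training or drop_prob == 0.0:
            return h
        mask = rng.random(h.shape) >= drop_prob        # keep each node with prob 1 - drop_prob
        return h * mask / (1.0 - drop_prob)            # rescale survivors

    rng = np.random.default_rng(0)
    h = np.ones((2, 8))
    print(dropout(h, drop_prob=0.5, training=True, rng=rng))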


Evaluation In order to verify the performance of the model, we consider a labeled test data set D_test = {x^t, y^t} = {x^t_1, ..., x^t_n, y^t_1, ..., y^t_n} which is disjoint from the training data set. We then feed the test data through the model, obtaining a series of predictions {p(y_i|x_i) = softmax(f(x^t_i; θ))}_{i=1}^{n}, and take y^{pred}_i to be the one-hot encoding of arg max p(y_i|x_i). The test accuracy can then be evaluated as

\text{Final Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{pred}_i = y^t_i\}}.    (2.26)

This measure gives an average accuracy over all the classes; thus it is hard to know if the model has a skewed performance on some classes.

A confusion matrix can be used to see the performance for individual classes. The rows of the confusion matrix represent the actual classes and the columns represent the predicted classes, so that an off-diagonal entry (i, j), i ≠ j, gives the number of samples that came from class i but were predicted as class j. The diagonal elements thus correspond to the number of correctly classified samples for each class.
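The final accuracy (2.26) and the confusion matrix can be computed as in the following sketch for integer-encoded class labels; the labels and predictions below are invented for the example.

    import numpy as np

    def final_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Equation (2.26): fraction of test samples whose predicted class
        matches the true class (labels given as integer class indices)."""
        return float(np.mean(y_true == y_pred))

    def confusion_matrix(y_true, y_pred, n_classes):
        """Rows are actual classes, columns are predicted classes."""
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm

    y_true = np.array([0, 0, 1, 2, 2, 2])
    y_pred = np.array([0, 1, 1, 2, 2, 0])
    print(final_accuracy(y_true, y_pred))          # 0.666...
    print(confusion_matrix(y_true, y_pred, 3))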

2.2.2 Bayesian Neural Networks

In classical probability theory, the philosophy behind the notion of probability is that the probability of an event is the long-term frequency of event occurrences. Consider an example which involves rolling a die: the frequentist approach is to view the probability of rolling a four as the expected quotient of fours to total rolls if the die is rolled many times. The frequentist does not utilize any prior belief that could be incorporated in the model.

On the other side of the spectrum we have Bayesian statisticians, who interpret probability as the degree of confidence in an event. To this end, Bayesians incorporate prior beliefs about the probability of events. For instance, a Bayesian answering the question of the probability of rolling a four will incorporate a prior probability over the parameters of the model: a Bayesian might have reasons to think that the die is not as likely to show a four and includes that prior knowledge in the probability. The way a Bayesian statistician incorporates this information is through Bayes' theorem, a well known result in probability theory. Assume some random variables X, W; Bayes' theorem states that

P(W|X) = \frac{P(X|W) P(W)}{P(X)}.   (2.27)

In Bayesian statistics this theorem is connected to the notion of prior beliefs about events in the following sense. Usually X describes some evidence, such as observations or data, while W describes some proposition related to the events, such as the probabilities of rolling certain numbers on a die. Bayesians use Bayes' theorem to incorporate uncertainties about these propositions; such unobserved variables are called latent variables, and given a prior distribution P(W) of the latent variables, Bayesians are interested in the posterior distribution of the latent variables p(W|X). Given new evidence, it is thus possible to infer the most likely parameters for the model.

In the deep neural network setting we assume that the weights are deterministic. In a real-world scenario, however, it is reasonable to believe that the weights are actually observations from some noisy distribution, which means that the weights we obtain when training the network are perhaps not optimal. The randomness could be attributed to several factors, such as the randomness of the optimization landscape in the real-world scenario or computational errors due to numerical data formats.

The idea behind Bayesian neural networks is to account for some of this randomness by introducing prior knowledge of the weights in the form of a distribution, thereby treating the weights as latent variables.

The Bayesian Neural Network model used in this thesis is defined as follows:

W_i \sim N(0, I), \quad i = 1, \ldots, L   (2.28a)

y_{logits} = f_{NN}(x, \{W_i\}_{i=1}^{L})   (2.28b)

y \sim \mathrm{Cat}(y_{logits})   (2.28c)

where Cat denotes the categorical distribution with C classes and corresponding distribution function

p(y) = p_1^{|y_1 = y|} \cdots p_C^{|y_C = y|}.   (2.29)

Here | · | denotes the Iverson bracket. Through the model, this defines a conditional likelihood p(y|x, W) for the data. In the following, I will denote W = \{W_1, \ldots, W_L\}. Lastly, f_{NN} is the function representing a feed-forward neural network with L layers, defined with the same notation as introduced in the previous part about the DNN.

Note:

• In the programming library that has been used in the project, the categorical distribution function accepts log probabilities for numerical stability reasons; this is why the subscript logits is used.

• The Batch Normalization parameters are not stochastic due to the struc-ture of the programming library. This means that they will appear asregular parameters in the likelihood in a similar way as the weights inthe DNN network. With the goal of achieving notational simplicity, I willnot include them in the equations describing the optimization problem inthe BNN setting. These parameters are optimized using gradient descentand backpropagation on the loss function defined in (2.39).
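To make the generative model in (2.28) concrete, here is a minimal NumPy sketch that draws one set of weights from the N(0, I) prior and pushes a single slice through the network to obtain class probabilities. The layer sizes and slice length are arbitrary example values, biases and the batch-normalization parameters are ignored, and this is not the Zhusuan implementation used in the thesis.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_bnn_forward(x, layer_sizes, rng):
    """One prior draw W_i ~ N(0, I) per layer followed by a forward pass, cf. (2.28)."""
    h = x
    for i in range(len(layer_sizes) - 1):
        W = rng.standard_normal((layer_sizes[i], layer_sizes[i + 1]))  # prior sample
        h = h @ W
        if i < len(layer_sizes) - 2:
            h = np.maximum(h, 0.0)               # ReLU on hidden layers
    return softmax(h)                             # parameters of Cat(y_logits)

rng = np.random.default_rng(0)
x = rng.random(24)                                # e.g. one 24-hour slice
probs = sample_bnn_forward(x, [24, 100, 100, 100, 5], rng)
y = rng.choice(len(probs), p=probs)               # one draw from the categorical likelihood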


In the Bayesian setting, with the addition of a prior on the network parameters, it is possible to evaluate the posterior distributions of the parameters. The joint posterior of the weights given the data is given by

p(W|x, y) = p(W | x_{1,\ldots,N}, y_{1,\ldots,N})   (2.30)

= \frac{p(W, x, y)}{p(x, y)}   (2.31)

= \frac{p(y|x, W) p(x, W)}{p(x, y)}   (2.32)

= \frac{p(y|x, W) p(W) p(x|W)}{p(y|x) p(x)}   (2.33)

= \frac{p(W) p(y|x, W)}{p(y|x)}.   (2.34)

The last equality holds because of the assumption that the inputs have no dependence on the parameters of the network. The normalizing constant p(y|x) is the true likelihood of the data and is hard to estimate. Because we made no stochastic assumptions about the weights in the DNN model, we could work with the true likelihood much more easily there. Since the model only gives the categorical distribution conditioned on observing the weights, one would have to compute the normalizing constant by integration with respect to the weights,

p(y|x) = \int_W p(y|x, W) \, p(W) \, dW.   (2.35)

In general, this integration is computationally intractable and one has to turn to other methods for computing the posterior p(W|x, y). There exist methods such as importance sampling, Markov chain Monte Carlo and variational inference that can approximate or sample from such intractable distributions. For this thesis, the method of choice is variational inference, mainly because of its computational speed advantage over the other methods [31].

The idea in variational inference is to choose a family of parametrized distributions q_\phi(W) = q_\phi(W_1, \ldots, W_L) and then find the parameters \phi that make the variational distribution as similar as possible to the posterior distribution. The similarity between the posterior and the variational distribution can be measured using the Kullback-Leibler (KL) divergence between them, which is given by

KL(q_\phi(W) \,\|\, p(W|x, y)) = E_{q_\phi}\Big[\log \frac{q_\phi(W)}{p(W|x, y)}\Big].   (2.36)

In order to estimate the posterior distribution of the weights in the network, a simple normal parametric family for the variational distribution is selected with the assumption that the weights are independent. This is called the mean-field approximation, and it implies that the joint distribution factorizes and that the variational distribution can be written as

q_\phi(W) = \prod_{i=1}^{L} q_{\phi_i}(W_i), \qquad q_{\phi_i}(W_i) \sim N(\mu_i, \sigma_i^2).   (2.37)

Here, \sim means that they have the same distribution function, so that the distribution q_\phi(W) is parameterized by \phi = \{\mu_i, \sigma_i^2\}_{i=1}^{L}. The goal is to maximize the log likelihood of the class labels given the model and data, \log p(y|x), which we saw was computationally intractable. In the DNN model, we saw that minimizing the loss function over the training data results in a maximized likelihood of the observations. The DNN is fully deterministic in the sense that one solves a deterministic optimization problem for a set of data. For the BNN the setting changes with the introduction of randomness in the weights, and the problem of maximizing the log likelihood therefore becomes more complicated since it requires integrating the joint likelihood with respect to the parameters.

Because of this we cannot maximize the class probabilities given the input directly. It can be shown, however, that it is possible to construct a lower bound to the likelihood that is easier to maximize. The lower bound is constructed by using the variational posterior distribution as follows:

\log p(y|x) = \log \int_W p(y, W|x) \, dW = \log \int_W p(y|x, W) p(W) \frac{q_\phi(W)}{q_\phi(W)} \, dW   (2.38a)

= \log E_{q_\phi}\Big[\frac{p(y|x, W) p(W)}{q_\phi(W)}\Big] \geq E_{q_\phi}[\log p(y|x, W) p(W) - \log q_\phi(W)]   (2.38b)

:= \mathcal{L}(x, y, \phi, W).

Here the fact that the logarithm is a concave function together with Jensen's inequality was used to obtain the bound; this well-known inequality states that

f(E[X]) \geq E[f(X)], \quad \text{when } f \text{ is concave}.   (2.39)

Now we start from (2.36) and show that the maximization of the lower bound of the likelihood implies a minimization of the KL divergence between the variational distribution and the true posterior.


KL(q_\phi(W) \,\|\, p(W|x, y)) = E_{q_\phi}\Big[\log \frac{q_\phi(W)}{p(W|x, y)}\Big]   (2.40)

= E_{q_\phi}[\log q_\phi(W)] - E_{q_\phi}[\log p(W|x, y)]   (2.41)

= E_{q_\phi}[\log q_\phi(W)] - E_{q_\phi}\Big[\log \frac{p(W, x, y)}{p(x, y)}\Big]   (2.42)

= E_{q_\phi}[\log q_\phi(W)] - E_{q_\phi}\Big[\log \frac{p(y|x, W) p(x, W)}{p(x, y)}\Big]   (2.43)

= -\big(E_{q_\phi}[\log p(y|x, W) p(W)] - E_{q_\phi}[\log q_\phi(W)]\big) + E_{q_\phi}\Big[\log \frac{p(x)}{p(x, y)}\Big]   (2.44)

= -\mathcal{L}(x, y, \phi, W) + \log \frac{p(x)}{p(x, y)}   (2.45)

Since p(x) and p(x, y) do not depend on W or \phi, one realizes that minimizing the KL divergence between the posterior distribution and the variational distribution is equivalent to maximizing the lower bound \mathcal{L}(x, y, \phi, W); this bound is called the evidence lower bound (ELBO).

The only probabilities that need to be evaluated are the log-joint conditional distribution of the N labels, \log p(y|x, W) = \sum_{n=1}^{N} \log p(y_n|x_n, W), and the variational distribution q_\phi(W).

We use a mini-batch of M samples to approximate the log-joint distribution instead of using the whole data set, in order to reduce the computational complexity. In the case of using only one sample per batch, we are performing stochastic gradient ascent.

\log p(y|x, W) \approx \frac{N}{M} \sum_{m=1}^{M} \log p(y_m|x_m, W).   (2.46)

When maximizing the ELBO using gradient ascent, one would need to com-pute the gradients of the ELBO with respect to the parameters φ in the varia-tional distribution.

A Monte Carlo gradient estimator could be constructed as follows

\nabla_\phi \mathcal{L}(x, y, \phi, W) = \nabla_\phi E_{q_\phi}[\log(p(y|x, W) p(W)) - \log q_\phi]   (2.47a)

= \int_W \nabla_\phi \big(H(W) q_\phi\big) \, dW - \int_W \nabla_\phi \big(\log(q_\phi) \, q_\phi\big) \, dW   (2.47b)

= \int_W H(W) q_\phi \nabla_\phi(\log q_\phi) - \nabla_\phi q_\phi - \nabla_\phi(q_\phi) \log q_\phi \, dW   (2.47c)

\approx \frac{1}{M} \sum_{m=1}^{M} H(W^{(m)}) \nabla_\phi \log q_\phi(W^{(m)}) - \nabla_\phi q_\phi(W^{(m)}) - \nabla_\phi(q_\phi(W^{(m)})) \log q_\phi(W^{(m)}), \quad W^{(m)} \sim q_\phi   (2.47d)

where H(W) denotes \log(p(y|x, W) p(W)).


It has been shown that a regular Monte Carlo estimate of this kind exhibits ahigh variance and is therefore unsuitable as a tool in the optimization [32].

One way of dealing with the issue is to use the Stochastic Gradient Variational Bayes (SGVB) estimator [33]. The SGVB estimator is essentially a reparametrization of the variational distribution which greatly reduces the variance that is introduced when calculating the Monte Carlo estimates of the expected value. The reparametrization of W \sim q_\phi(W) is performed by defining another random variable with the same distribution as W; recalling that q_{\phi_i}(W_i) \sim N(\mu_i, \sigma_i^2) and the fact that the product of Gaussian distribution functions is Gaussian, we have that q_\phi(W) is Gaussian with some mean and variance \mu, \sigma^2 [34]. The reparametrization is constructed as follows:

W = g_\phi(\varepsilon) \quad \text{with} \quad \varepsilon \sim p(\varepsilon)   (2.48)

where p(\varepsilon) is a noise distribution. The reparametrization enables one to rewrite the expectation of some function of W as

E_{q_\phi(W)}[f(W)] = E_{p(\varepsilon)}[f(g_\phi(\varepsilon))].   (2.49)

A simple choice of transformation function g that gives g_\phi(\varepsilon) the same distribution as W is to set g_\phi(\varepsilon) = \varepsilon \sigma + \mu with \varepsilon \sim N(0, I). We get the following expression for the ELBO under the variational distribution:

\mathcal{L}(x, y, \phi, W) = E_{q_\phi}[\log(p(y|x, W) p(W)) - \log q_\phi(W)]   (2.50a)

= E_{p(\varepsilon)}[\log(p(y|x, g_\phi(\varepsilon)) p(g_\phi(\varepsilon))) - \log q_\phi(g_\phi(\varepsilon))]   (2.50b)

\approx \frac{1}{M} \sum_{m=1}^{M} \log(p(y|x, g_\phi(\varepsilon^{(m)})) p(g_\phi(\varepsilon^{(m)}))) - \log q_\phi(g_\phi(\varepsilon^{(m)}))   (2.50c)

\varepsilon^{(m)} \sim N(0, I)   (2.50d)

The achievement of the reparametrization is that the differential of \mathcal{L}(x, y, \phi, W) can be taken directly inside the expected value, and thus an unbiased, low-variance Monte Carlo estimate of the gradient can be formed. This differential can be written as

\nabla_\phi E_{p(\varepsilon)}[\log(p(y|x, g_\phi(\varepsilon)) p(g_\phi(\varepsilon))) - \log q_\phi(g_\phi(\varepsilon))]   (2.51a)

= \nabla_\phi \int_\varepsilon \big(\log(p(y|x, g_\phi(\varepsilon)) p(g_\phi(\varepsilon))) - \log q_\phi(g_\phi(\varepsilon))\big) p(\varepsilon) \, d\varepsilon   (2.51b)

\approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\phi \big[\log(p(y|x, g_\phi(\varepsilon^{(m)})) p(g_\phi(\varepsilon^{(m)}))) - \log q_\phi(g_\phi(\varepsilon^{(m)}))\big]   (2.51c)

\varepsilon^{(m)} \sim N(0, I).


In order to maximize the ELBO, we get the following update rule for the vari-ational parameters

\phi_{i+1} = \phi_i + \eta \nabla_\phi \mathcal{L}(x, y, \phi_i, W).   (2.52)

Inference Having trained the model using the SGVB estimator and gradient ascent, we can make inferences on test data in the following way. We construct a posterior class probability for the label y^t of some test data sample x^t, given the training data D_{train} and the trained variational distribution q_\phi(W). Because the uncertainty about the weights in the network induces uncertainty about the predictions, we have to marginalize the predictive distribution over the distribution of the weights as

p(y^t|x^t, D_{train}) = \int_W p(y^t = i|x^t, W) \, p(W|D_{train}) \, dW   (2.53a)

\approx \int_W p(y^t = i|x^t, W) \, q_\phi(W) \, dW.   (2.53b)

By sampling weights W(m) from the variational posterior distribution we canobtain a Monte Carlo estimation of this integral.

p(y^t = i|x^t, D_{train}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y^t = i|x^t, W^{(m)}), \qquad W^{(m)} \sim q_\phi(W)   (2.54)

Each term in the summation is the output of the network given the sampled weights and the corresponding input. Since this yields a probability distribution over the classes, we take the argmax of this distribution as the final prediction. The procedure for computing the accuracy on a hold-out test set is the same as for the DNN.
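A minimal sketch of this Monte Carlo prediction step is given below; sample_weights and predict_probs are placeholder names standing in for a draw from the trained variational distribution and the network's softmax output, and the toy versions shown are assumptions for the example only.

import numpy as np

def bayesian_predict(x, sample_weights, predict_probs, n_samples=100):
    """Average the network output over weight samples from q_phi, cf. eq. (2.54)."""
    probs = np.zeros_like(predict_probs(x, sample_weights()))
    for _ in range(n_samples):
        W = sample_weights()                  # W^(m) ~ q_phi(W)
        probs += predict_probs(x, W)
    probs /= n_samples
    return int(np.argmax(probs)), probs       # predicted class and averaged likelihood

# Toy usage: a linear-softmax "network" whose weights are sampled around a fixed mean.
rng = np.random.default_rng(0)

def sample_weights():
    return 0.5 + 0.1 * rng.standard_normal((24, 5))

def predict_probs(x, W):
    z = x @ W
    e = np.exp(z - z.max())
    return e / e.sum()

label, avg_probs = bayesian_predict(rng.random(24), sample_weights, predict_probs)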

Entropy Suppose it is of interest to quantify the model's general uncertainty about its prediction. It is possible to define an entropy measure based on the likelihood vector from the model: given the model likelihood p(y|x) = [p(y = 1|x), p(y = 2|x), \ldots, p(y = C|x)] we can construct

S(p(y|x)) = -\sum_{i=1}^{C} p(y = i|x) \log(p(y = i|x)).   (2.55)

This function will tend to be larger when the mass of the likelihood is more evenly spread out, and closer to 0 when the mass of the likelihood is concentrated on one class. Thus, lower values of the entropy mean that the model is more certain about its prediction. It can be interesting to analyze whether the model tends to classify correctly when the entropy is low and misclassify when the entropy is higher [35]. This was shown to be the case in a paper regarding internet traffic classification using Bayesian neural networks [36].

For many purposes this can be an interesting quantity to look at. Given that the model tends to display lower entropy on correct predictions during training and testing, it is reasonable to expect this to hold true for new data as well. Thus, one should put less trust in predictions with a high entropy.
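The entropy in (2.55) is straightforward to compute from a likelihood vector; a small NumPy sketch follows (the clipping constant is an implementation detail added for numerical stability, not part of the definition).

import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Entropy of a predictive distribution over C classes, eq. (2.55).
    Low values indicate a confident (concentrated) prediction."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p))

print(prediction_entropy(np.array([0.97, 0.01, 0.01, 0.01])))  # confident, low entropy
print(prediction_entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # uniform, maximal entropy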

2.2.3 Voting Procedure

This section will give an introduction to the notion of the voting procedurewhich is a part of the methodology of this thesis. The motivation is that theproblems of sparsity and length variance can be handled by the partitioning ofeach time series into slices of equal size. If one selects these slices in a smartway, one can effectively augment the data set, avoid missing values and trainmodels to classify these slices with a fixed input size. This pre-processingtogether with the training is outlined in Chapter 4. Here, I will present themathematical analysis that forms a justification for the method.

Assume we are given a time series X and that we take a set of K slices (subsequences) \{x_1, \ldots, x_K\} from it; we wish to predict the class of X through the classification of its subsequences. Further assume that these slices, conditioned on the class, come from the same distribution and are independent. Then assume that some model f(x; \theta) has been trained on training data and is able to classify such slices reasonably well. The voting procedure in Algorithm 4 describes how to classify the time series based on the slices.

Algorithm 4 Voting Procedure Algorithm.

Require: Test time series X, slice size parameter m
Require: Trained model f(x; \theta), x \in R^m
  Partition X into K slices, \{x_i\}_{i=1}^{K} (Chapter 4.1)
  for i in range K do
    y^*_i \leftarrow f(x_i; \theta)   (obtain prediction for slice i)
  end for
  for j in number of classes do
    N_j \leftarrow \sum_{i=1}^{K} 1_{\{y^*_i = j\}}   (number of predictions of class j)
  end for
  if \arg\max_j N_j has more than one element then
    Prediction for X is a random draw from \arg\max_j N_j
  else
    Prediction for X is \arg\max_j N_j
  end if
  return Prediction for X

The probability of correct classification for a given slice can be estimated in a frequentist manner by looking at the performance of the model on training data; the probability of a slice being classified into each class is given as the average accuracy of the respective class. Assuming that these accuracies are high, intuition tells us that as we include more slices in the voting procedure, the model should make better decisions about the class of the meter.

The purpose of this section is to formalize this intuition mathematically. To this end, I will model the voting procedure through a multinomial distribution, whose suitability can be motivated by assumptions about the stochastic process that generates the time series; see the discussion chapter.

In the multinomial distribution we let N_i \sim \mathrm{Bin}(K, p_i) denote the random variable which represents the number of times an outcome of some experiment, repeated K times, ends up in class i. The \{N_i\}_{i=1}^{C} jointly form a multinomial distribution and are dependent random variables, since we require \sum_{i=1}^{C} N_i = K.

The distribution function of the N_i's can be written as

f(n_1, \ldots, n_C; K, p_1, \ldots, p_C) =
  \begin{cases} \frac{K!}{n_1! \cdots n_C!} \, p_1^{n_1} \times \cdots \times p_C^{n_C}, & \text{if } \sum_{i=1}^{C} n_i = K \\ 0, & \text{otherwise,} \end{cases}   (2.56)

where n_1, \ldots, n_C are non-negative integers.

Assume in the following without loss of generality that the correct class is 1and we want to find the probability that the correct class is chosen during thevoting.

We start with the probability that the correct class gets the largest number of votes,

P[N_1 > \max_{j \neq 1} N_j] = E\big[1_{\{N_1 > \max_{j \neq 1} N_j\}}\big].   (2.57)

Lemma 1. \lim_{K \to \infty} P[N_1 > \max_{j \neq 1} N_j] = 1 if p_1 > 0.5.

Proof: We have that

P[N_1 > \max_{j \neq 1} N_j] \geq P[N_1 > \sum_{j \neq 1} N_j]   (2.58)

\geq P[N_1 > \tfrac{K}{2}]   (2.59)

= P[\tfrac{N_1}{K} > \tfrac{1}{2}].   (2.60)

Now, since the marginal distribution of N_1 is a binomial distribution, N_1 can be written as a sum of K IID Bernoulli random variables with mean p_1. We can write N_1 = \sum_{i=1}^{K} X_i, and thus the law of large numbers implies

\lim_{K \to \infty} N_1/K = p_1, \quad P\text{-a.s.}   (2.61)

which gives that

\lim_{K \to \infty} P[\tfrac{N_1}{K} > \tfrac{1}{2}] = 1, \quad \text{iff } p_1 > 0.5,   (2.62)

which completes the proof.

In order to further investigate this probability we look at its explicit expression

P[N_1 > \max_{j \neq 1} N_j] = \sum_{n_1=1}^{K} \sum_{n_2=0}^{n_1-1} \cdots \sum_{n_C=0}^{n_1-1} f(n_1, \ldots, n_C; K, p_1, \ldots, p_C)   (2.63)

= \sum_{n_1=1}^{K} \sum_{n_2=0}^{n_1-1} \cdots \sum_{n_C=0}^{n_1-1}
  \begin{cases} \frac{K!}{n_1! \cdots n_C!} \, p_1^{n_1} \times \cdots \times p_C^{n_C}, & \text{if } \sum_{i=1}^{C} n_i = K \\ 0, & \text{otherwise.} \end{cases}   (2.64)

The goal is to investigate whether or not P[N_1 > \max_{j \neq 1} N_j] \geq p_1. We saw a proof of convergence to 1 for large K, so the statement holds when the number of slices is high.

In order to see the speed of convergence, the probability is evaluated numerically in the case of two, three and four classes for some reasonably small values of K. In Figure 2.3 the probability in (2.57) is computed for a series of probability vectors of two-class probabilities [p, 1-p, 0, \ldots, 0], three-class probabilities [p, 1-p-\varepsilon, \varepsilon, 0, \ldots, 0] and four-class probabilities [p, 1-p-\varepsilon_1-\varepsilon_2, \varepsilon_1, \varepsilon_2, 0, \ldots, 0]. As the values of the epsilons are increased, the probability vectors form sequences of vectors sorted in order of majorization.

The probabilities P[N_1 > \max_{j \neq 1} N_j] are in the three cases respectively

Two classes:

K! \sum_{n_1=1}^{K} \sum_{n_2=0}^{n_1-1} \frac{p^{n_1} (1-p)^{n_2}}{n_1! \, n_2!} \, 1_{\{n_1+n_2=K\}}   (2.65)

Three classes:

K! \sum_{n_1=1}^{K} \sum_{n_2, n_3=0}^{n_1-1} \frac{p^{n_1} (1-p-\varepsilon)^{n_2} \varepsilon^{n_3}}{n_1! \, n_2! \, n_3!} \, 1_{\{n_1+n_2+n_3=K\}}   (2.66)

Four classes:

K! \sum_{n_1=1}^{K} \sum_{n_2, n_3, n_4=0}^{n_1-1} \frac{p^{n_1} (1-p-\varepsilon_1-\varepsilon_2)^{n_2} \varepsilon_1^{n_3} \varepsilon_2^{n_4}}{n_1! \, n_2! \, n_3! \, n_4!} \, 1_{\{\sum_{i=1}^{4} n_i = K\}}   (2.67)
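These sums can be evaluated numerically by brute-force enumeration of the multinomial outcomes; the sketch below does this for small K and C (the probability vectors are example values, and the thesis's own numerical computations may have been implemented differently).

import itertools
from math import factorial

def prob_correct_vote(p, K):
    """P[N_1 > max_{j != 1} N_j] for N ~ Multinomial(K, p), by enumeration (eqs. 2.63-2.67)."""
    C = len(p)
    total = 0.0
    for counts in itertools.product(range(K + 1), repeat=C):
        if sum(counts) != K:
            continue                          # outside the multinomial support
        if counts[0] <= max(counts[1:]):
            continue                          # class 1 must strictly win the vote
        coef = factorial(K)
        pmf = 1.0
        for n_i, p_i in zip(counts, p):
            coef //= factorial(n_i)
            pmf *= p_i ** n_i
        total += coef * pmf
    return total

print(prob_correct_vote([0.7, 0.3], 11))          # two classes
print(prob_correct_vote([0.7, 0.2, 0.1], 11))     # three classes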


The results are depicted in Figure 2.3. Again one can note that as the proba-bility vectors get more spread out, the probability of correct classification bythe voting procedure increases, and that the probability converges to one. Wenotice that there is a major decrease in accuracy when K = 2.

Figure 2.3: Computation of true probability of correct prediction in votingprocedure.

Figure 2.4 shows that if the probability of correct classification is too low, thenas K increases the probability of the voting method yielding the correct classwill converge towards zero.


Figure 2.4: Computation of True probability of correct prediction when votingfor P2 = [0.3, 0.7, 0, 0, .., 0] and [0.3, 0.7− ε, ε, 0, .., 0] as a function of K.

By having a strict inequality in the probability, we are not taking into account the situation where two or more classes share the maximum number of votes. As described in Algorithm 4, the real voting procedure allows this equality to happen, in which case a decision rule to randomly pick among the tied classes is used. Thus, in order to model that situation, this decision rule has to be included.

The analysis of this probability, together with the decision rule to randomize in the case of two or more classes sharing the highest number of predictions, was initiated with a simulation where draws from multinomial random variables were generated for some fixed probability vectors. The simulation was performed using the procedure described in Algorithm 5.

The main motivation behind changing the number of classes but keeping theprobability of correct classification fixed is to try and establish a lower bound ora worst case scenario in which the probability of the voting being successfulis the lowest. It is intuitive that the worst case should be when all of theremaining probability mass is concentrated on one class, so if it holds thatthe voting is beneficial in the binary classification case then this result shouldextend to the multiple classification case.

The result of the simulation is depicted in Figure 2.5. One can observe that the intuition about the spread of the probability vectors does hold, since as the probability vectors become more and more spread out, the prediction accuracy improves. The accuracy also increases with an increasing K and seems to be strictly increasing when the remaining probability mass of the probability vectors is sufficiently spread out.


Algorithm 5 Simulation of Voting Procedure.

Require: N probability vectors p_1, \ldots, p_N of the same length L, sorted in order of majorization and with the first element of each fixed (the correct label is 0).
  for K = 1, \ldots, K_{max} do
    FinalAccuracy = 0
    for n = 0, 1, \ldots, N-1 do
      j = 0
      M \leftarrow set of samples from Mu(K, p_n)
      tmp \leftarrow temporary array storing the prediction of each sample
      for m \in M do
        j = j + 1
        if \#\arg\max m == 1 then
          tmp(j) \leftarrow \arg\max(m)
        else
          tmp(j) \leftarrow random choice in \arg\max m
        end if
      end for
      Get the estimated probability Q as \#\{tmp == 0\} / \#M
      FinalAccuracy[n, count] = Q
    end for
  end for
  return 2-D array Q where the rows correspond to the probability vectors and the columns correspond to the number of events/slices.

If the number of slices used in the voting is even, it is possible that two or more classes receive the same number of votes. The decision is then randomized between the classes that share the highest number of votes, which can lead to a decrease in performance; this can be seen in how the probability seems to "plateau" whenever K moves from an uneven number to an even number, and the effect seems much larger for the two-class situation. As K increases, our intuition tells us that the probability of having a single clear winner increases. This can be seen in the same figure, since the variations in the curve for two classes become smoother as K increases.


Figure 2.5: Simulation Result.

From this one can see that the performance of the voting procedure seems tobe monotonically increasing in K as the probability mass gets more spread outover the other classes given that K is selected such that it cannot be evenlydistributed among the classes. It is also possible to conclude that even forthe worst case scenario where all the mass is concentrated on two classes, oneobtains a significant performance increase in accuracy when increasing thenumber of votes K, and that as K goes to infinity the probability of correctclassification converges to one.

This analysis justifies the following conjecture, which could be interesting to prove as a step in a formal analysis of the convergence rate of the discussed probability.

Conjecture 1: Consider N = [N_1, N_2, \ldots, N_C] with N \sim \mathrm{Multinomial}(\{p_i\}_{i=1}^{C}, K), and assume that the probability of correct classification of a slice satisfies p_1 > 0.5. Then the case C = 2 gives a lower bound for the probability that the voting procedure gives the correct prediction.

Note: The normal approximation of the binomial distribution tells us that as K increases, the difference between P[N_1 \geq \max_{i \neq 1} N_i] and P[N_1 > \max_{i \neq 1} N_i] vanishes; this confirms the intuition that the probability of having two or more classes sharing the highest number of votes vanishes as K gets large.

Given the results of the simulation, it seems that the results on the effectiveness of the voting procedure can indeed be extended to the case where two or more classes are allowed to share the highest number of votes, together with the defined decision rule. This justifies the voting procedure's effectiveness under the given assumptions.

Moreover, the entropy could be used in the voting procedure by only voting on the slices for which the model outputs a likelihood with low entropy. A simple way of including the entropy in the decision is to compute it for all the available slices for a meter and sort the resulting vector in increasing order while keeping track of the corresponding slices. This forms a vector E of entropies, after which one can take the slices corresponding to the first x% of the arguments of the entropy vector. In this way x is a hyper-parameter in the model; the percentage of slices taken based on the entropy will be denoted E_x.
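A sketch of this selection rule is given below (NumPy; the entropies are assumed to have been computed beforehand with the entropy measure of Section 2.2.2, and x is the kept fraction E_x).

import numpy as np

def select_low_entropy_slices(slices, entropies, x):
    """Keep the fraction x of slices with the lowest predictive entropy."""
    order = np.argsort(entropies)                    # ascending entropy
    n_keep = max(1, int(np.ceil(x * len(slices))))   # always keep at least one slice
    return [slices[i] for i in order[:n_keep]]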


Chapter 3

Data Set

The data consists of 19120 time series containing hourly measurements of utility consumption, distributed over five classes according to the following table.

Name:   Cooling   Electricity   Heating   Hot Water   Water
N:      385       14166         2946      373         1250

There are three main problems with the data. The first problem, class imbalance, is evident from the table above. The second problem is that the time series are of varying lengths; some span a couple of days while others contain data from several months to years. The third problem is one of missing values: many of the time series contain missing values on a scale that is too large for simple interpolation to be effective.

The length differences and the fact that the time series are not taken over the same periods of time mean that the underlying matrix of time series would have a high rank, i.e. the time series are not linearly dependent. This means that matrix completion methods would be ineffective, and they were therefore not considered.

Figure 3.1 depicts the distribution of the lengths of the time series in the data set for the different classes. One can note that although the lengths appear to be concentrated towards the left-hand side of the plot, the time series have a substantial variation in length; furthermore, it seems that the class "Electricity" has the longest time series.


Figure 3.1: Distribution of Length of Time Series.

In Figure 3.2, the percentage of missing values across the meters in this balanced data set is depicted in the form of a histogram. It can be seen that there are a number of meters with a high number of missing values. For further information on what the sparsity looks like for individual classes, see Appendix B. There it can be seen that some meters are almost completely sparse and that the class "Hot Water" seems to have many such meters.

Figure 3.2: Histogram of Missing Value Percentages.


Chapter 4

Methodology

In this part, the methods for pre-processing the data and training the models are explained.

The Python code is available at my GitHub page [37]. Although the data set is not available, it should be possible to use the code with a different data set with only minor changes. The code is essentially split into the following files: "Excelmerge.py" takes raw Excel files as input and concatenates them into a single CSV file; "DataPlotting" performs the slice extraction and outputs another CSV file with the processed data set; "BNN.py" and "DNN.py" contain the code for the neural networks.

Due to the inherent problems with the data set, a standard approach of simply feeding the time series into a classification algorithm is ruled out. It is desirable for the model to be able to handle the classification of both long and short time series. The method is essentially split into three parts: in the first part the time series are pre-processed through a slice extraction method, in the second part classification algorithms are trained to classify these subsequences, and in the last part inference on new meters is based on a voting procedure over slices.

4.1 Preprocessing Data

The first step in the process of preparing the data for the training of the model is to extract the relevant data and merge the time series. Data from individual meters came in CSV-format files, including a column for the date and a column for the utility consumption measure. In order to handle the problem of class imbalance, one can essentially use either a data or an algorithmic approach. The data approach is to either over-sample classes with a small number of meters, or to under-sample the classes with a high number of meters. Because of the sparsity and length-variance issues, oversampling can be a challenge since it often relies on generative methods [14]; questions such as which candidate meters to use for generation and how to deal with the missing values before the oversampling would need to be dealt with. For this reason, I selected an under-sampling approach where 373 meters were selected randomly from each class, corresponding to the number of meters in the hot water class. The data was then joined and grouped into the respective classes while dropping the dates. Twenty percent of the meters in each class were set aside to form a testing set, and every preprocessing step for each of the disjoint data sets was performed completely independently.

The main idea for handling the issues with the data was to perform a slice extraction. This meant taking a set of slices of equal size from each time series, where each slice inherited the label from that time series, thereby forming a new data set. The number of slices taken from each meter was bounded, and the upper and lower bounds were hyper-parameters in the model. These hyper-parameters can be tuned, for instance, to control the size of the new data set and, in the voting for the test set, to control the possible number of slices to use for each time series.

As previously discussed, it is possible to address multiple problems simultaneously with the slicing method. Firstly, it effectively augments the data set by constructing multiple samples from each time series. Secondly, the missing data problem is avoided to a large extent, as one can choose to select slices that contain none or only a small number of missing values. There is moreover a possibility to interpolate the missing values on a per-slice basis to further increase the number of slices taken. Lastly, it eliminates the problem of the time series having different lengths, as the slice size is fixed and will be the same for all slices. The models become flexible with this strategy and can predict any time series as long as it contains at least one good slice.

The samples are selected iteratively, where each possible candidate for a slice has to fulfill criteria judging the quality of the slice in question. The selection of a good slice is a heuristic, but it is based on some key arguments. Two characteristics of slices are judged in the quality assessment, the first being the number of missing values that a candidate slice contains. It is reasonable to believe that a good candidate should contain few to no missing values. Slices that contained more than ten percent missing values were discarded; for slices with a missing-value percentage between 0 and 10%, the missing values were interpolated or extrapolated using quadratic methods.

The second criterion is that a candidate slice is considered of poor quality if it is constant; thus a variance criterion is imposed on the slices. Lastly, a slice is considered bad if it contains too many zeros; the argument is that meters that are turned off would be indistinguishable from each other. A model should work by distinguishing patterns in the time series, not by learning to classify based on the relative frequencies of turned-off meters in the data set. Thus, we accept slices only if they have less than a set percentage of zeros. An overview of the slice extraction methodology is presented in Figure 4.1.
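As an illustration, a quality check for one candidate slice along the lines described above could look as follows (NumPy sketch; the thresholds are placeholders for the hyper-parameters reported in Chapter 4.2, and missing values are assumed to be encoded as NaN).

import numpy as np

def is_good_slice(s, max_missing=0.1, min_variance=0.001, max_zeros=0.2):
    """Accept a candidate slice if it has few missing values, is not
    (near-)constant and does not consist mostly of zeros."""
    if np.mean(np.isnan(s)) > max_missing:
        return False
    observed = s[~np.isnan(s)]
    if observed.size == 0 or np.var(observed) < min_variance:
        return False
    if np.mean(observed == 0) > max_zeros:
        return False
    return True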


Figure 4.1: Preprocessing of Data workflow.

The feature scaling was done using min-max scaling, which normalizes the values of the slices over the feature space; see Chapter 2.3 in [38] for an in-depth explanation.
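For completeness, a minimal sketch of min-max scaling applied to a single slice is shown below; whether the scaling is applied per slice or per feature over the whole training set is a design choice, and [38] should be consulted for the exact formulation.

import numpy as np

def min_max_scale(s):
    """Map the values of a slice to the [0, 1] range."""
    lo, hi = np.min(s), np.max(s)
    return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)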

4.2 Training

Software The regular Neural Network was implemented in Tensorflow, whichis a well known API for defining and running backpropagation over computa-tional graphs.

For the Bayesian Neural network, a library called Zhusuan which is built ontensorflow was used. It adds support for building probabilistic models usingstochastic tensors without adding much more complexity to the code [39].

Some of the simulations were made in C and some results were plotted inMatLab.

Training The minimum and maximum number of slices taken per meter for the training set were selected to be 1 and 200; the training data set of slices was balanced in terms of classes but not in terms of meters. Having too large a number of slices per meter meant that, after balancing, most slices would come from the same meters, due to some meters having many more good candidate slices. This was not wanted, so the limit was set to 200; no hyper-parameter tuning was made for this parameter. The lower bound for the variance of a candidate slice was 0.001 and the maximum percentage of interpolated missing values per slice was 20%.


The hyper-parameters of the models were selected by trial and error; the parameter setup can be found in Table 4.2. The models used the Adam optimizer with a cross-entropy loss function for the DNN and the ELBO for the BNN, both using dropout and batch normalization after each hidden layer. The networks were trained on an Nvidia GTX 970 GPU, which enabled accelerated performance. During the training process, the training and test accuracies together with the loss functions were monitored.

Table 4.2: Hyper-parameter setup.

Hyper-parameter     DNN and BNN
Hidden Layers       3
Neurons Per Layer   100
Dropout rate        0.7
Learning Rate       0.001

The mean-field variational posterior for the training of the BNN was chosen as a normal distribution with fixed mean and a variable standard deviation, as described in the background. The learning rate was gradually reduced by 98% after each epoch, which helped in achieving a smooth convergence of the error rates. In both cases the ReLU was used as the activation function, and the final likelihood function was taken to be the softmax of the output of the last layer.

Voting The selection of slices to be used in the voting procedure was performed in the way described in the previous section. After some trial and error, the percentage of zeros to allow in a candidate slice was chosen as 20%. Different values for the minimum and maximum number of slices taken for each meter in the voting were used, for which the results are shown in the following section.

Moreover, we saw that it was possible to select slices used in the voting bythe entropy of their likelihood function. This was achieved by computing theentropy for all the slices in the voting, then selecting a certain percentage ofthe lowest values to take, just as described in the background chapter.


Chapter 5

Results

In this section some evaluations of the performance of the models are pre-sented and conclusions are drawn regarding the strategy presented in the pre-vious sections.

Meter Quality The number of good quality slices that could be taken from each meter showed a class dependence, where some classes contained meters with a higher number of good quality slices. The number of good quality slices that can be taken from each meter depends on the hyper-parameters, both the slice size and the parameters controlling the judgment of what a good slice is.

A general tendency for the classes "Hot Water" and "Water" to have many fewer good candidate slices than the other classes could be observed throughout the implementation of the classification algorithms in the project. This is illustrated in the figures below.

[Figures: distributions of the number of good quality slices per meter for the classes Cold Water, Hot Water, Electricity, Cooling and Heating.]

Performance Firstly the regular neural network was tested. Figure 5.2 showsthe accuracies of the models when the slice size parameter is varied and allother parameters are kept constant. It can be seen that the model achievesworse performance for a 12h slice size and that for the rest of the slice sizesthe accuracies are similar.

Figure 5.2: Performance as a function of the Slice Size.

It was evident that the models tended to perform worse on the "Hot Water" class. Several printouts of the confusion matrices of the slice accuracy confirmed this. Table 5.1 is one such realization which illustrates the phenomenon.

Table 5.1: Confusion matrix of the NN model for five classes (rows: actual class, columns: predicted class).

                Cooling   Electricity   Heating   Hot Water   Water
Cooling         1163      140           220       159         197
Electricity     229       880           319       185         266
Heating         205       250           1054      202         168
Hot Water       231       133           154       418         943
Water           237       291           188       364         799

The highlighted values in the confusion matrix indicate that the accuracy for the class Hot Water is around half as good as for the other classes and that most of the misclassified hot water meters are classified as water meters. To further analyze the performance of the model, recall the statements about the sparsity of the data, where it was shown that some meters are so sparse that only a few good candidate slices can be extracted from them. It is interesting to see how excluding those meters from the test set affects the performance; this is depicted in Figure 5.3.


Figure 5.3: Increasing the minimum number of slices per meter in the votingprocess, all classes.

It can be seen that as the minimum number of slices per meter is increased, thereby including "better" meters, the accuracy tends to increase. Thus, the models seem to perform worse on meters which contain few good candidate slices.

Based on the results seen in the confusion matrix, it was investigated howdropping the class ”Hot Water” affected the performance of the classificationalgorithms. Figure 5.4 depicts the performance of the models on this dataset as a function of the slice size hyper-parameter. The other hyper-parameterswere left unchanged from the testing on all classes. It can be seen that by drop-ping the class ”Hot Water” the models achieve about a ten percent increase inaccuracies.


Figure 5.4: Performance as a function of the Slice Size, four classes.

Figure 5.5 shows that the models are capable of achieving accuracies above80% when considering only meters that have several good candidate slices.

Figure 5.5: Final Accuracy when increasing the minimum number of slices permeter in the voting process, four classes.

The following results were computed for the data set where the hot water classwas dropped.

In order to see how the model becomes more certain about its predictions, and whether or not this is reflected in the actual performance, the entropy was computed for all the correctly and incorrectly classified samples. The distributions of the entropies for these two categories are presented in Figures 5.6 and 5.7.

Figure 5.6: Entropy distributions, BNN. (a) Epoch = 1. (b) Epoch = 20.

Figure 5.7: Entropy distributions, DNN. (a) Epoch = 1. (b) Epoch = 15.

One can note that the distribution of the entropy for correctly classified samples is concentrated at lower values than for the falsely classified slices, both for the DNN and the BNN model. This means that the models are more confident about their predictions when they are correct compared to when they are wrong. Thus, in a scenario where a new meter is to be predicted, one can be more confident that the model will predict correctly when the entropy is low; conversely, if the entropy is very high, there is reason to have doubts about the prediction. Note also that the entropy for the DNN seems to be lower than for the BNN.

As previously described the entropy of the likelihood function for a samplecould be used in the voting procedure by only considering slices that displaya low entropy. Below, the final accuracies on the test set is reported when theentropy and the number of slices per meter used in the voting procedure isvaried.


Table 5.2: Final Voting Accuracies, (x) = DNN, x = BNN.

Min #Slices   1            6            11           16
E1.0          67.0 (67.4)  71.2 (69.6)  75.0 (71.2)  75.8 (69.2)
E0.8          66.0 (64.2)  70.7 (67.0)  75.0 (69.4)  74.7 (71.4)
E0.5          64.2 (65.1)  68.1 (68.1)  72.5 (70.6)  70.3 (72.5)
E0.2          66.0 (67.4)  69.6 (71.2)  74.4 (73.8)  74.7 (73.6)

The results in Table 5.2 can be interpreted as the BNN seeming to perform better on the subset of meters that have more than the minimum number of slices available, i.e. the "good" meters. Moreover, it is difficult to conclude anything about how considering only lower-entropy slices in the voting affects the performance. By looking at the rows, it can be seen that the DNN seems to get worse performance by considering low-entropy slices in the voting, but that the BNN can benefit from it, especially as better-quality meters are considered (minimum number of slices increased).

When the number of slices taken from meters in the test set is fixed, the resulting accuracies are higher than in the case of including all the meters in the test set. The effect of only voting on slices with low entropy can be seen in Table 5.3. The table is reported for a test set with precisely 200 available slices per meter.

Table 5.3: Effect of Using Entropy in Voting.

E%     DNN    BNN
E1.0   80     82.3
E0.8   80     83.1
E0.5   79.2   82.3
E0.2   81.5   84.6
E0.1   80.8   81.5

Once again, the BNN seems to have a better performance than the DNN butthe effect of entropy in the voting cannot be concluded here.

Next, the results of the voting theory applied to different meters in the data set are presented. The results are obtained by considering all the available slices for one meter, partitioning them into groups of K slices each, and thereafter performing voting over each group in order to get a measure of the average voting accuracy for that meter as a function of K. The results of this analysis for some meters are depicted in Figure 5.8.


Figure 5.8: Voting Accuracy On Meters.

It can be seen that most meters used in the analysis seem to benefit from the voting procedure and follow a convergence similar to that seen in the theoretical results for the multinomial distribution, see Figure 2.3. The curves that converge to 1 seem to correspond to the meters for which the average accuracy for one slice is relatively high. This can be observed from the plot because the first data point corresponds to the case where K is equal to one, which is just a prediction for a single slice.

For the data points where the average accuracy for single slices is around 0.35-0.5, we observe that the accuracy is improved by voting over a larger set of slices, but that the convergence is slower. Lastly, we see that for some meters the voting severely decreases the accuracy of the predictions; we have a convergence to 0 percent accuracy, and the meters for which this happens seem to have a very low average accuracy for single slices.

In order to see the effect of changing only the upper bound for the number of slices taken in the voting procedure, a test was made where the upper bound was varied between 1 and 50. The results are shown for the DNN and the BNN in Figure 5.9. One can see that as the maximum number of slices taken per meter is increased, the accuracy seems to converge to around the value obtained in the slice size plot of Figure 5.4.

The BNN had one downside compared to the DNN in being slower to train, although the order of magnitude of the training time was about the same for this data set.


Figure 5.9: Final Accuracy when increasing the maximum number of slices permeter in the voting process, four classes.

5.1 Conclusions

In this thesis, a method for classification of sparse time series with a largevariance in lengths and an imbalanced data set was evaluated in the context ofsmart meters. The method consisted of a data augmentation by the extractionof slices from the time series, training neural network classifiers on this dataand a voting procedure on slices when classifying new time series.

The results indicate that the models tend to perform worse on the samples coming from the classes "water" and "hot water", and that the worst performance is on the "hot water" class. On all the classes the models achieve accuracies of around 60%; by excluding the "hot water" class it is possible to achieve accuracies of at least 70% on the data set. The models perform worse on time series that contain a small number of good quality slices; by considering only time series which have many good quality slices, accuracies of 70% are achieved for all classes and above 80% when excluding the hot water class.

It can be concluded that the voting procedure handles the problems with the original data set reasonably well and has the additional benefit of being capable of giving predictions for time series of different lengths, since it only requires around 24 hours of data as input. However, there are some limitations. The results in Figure 5.8 show that some meters are expected to be falsely classified as more samples are included in the voting.

The voting procedure does, however, increase the overall accuracy on the test set, and in this sense the method can be said to add value. Considering the entropy of the predictions when voting on the slices seems to work in some cases and not in others; the lack of decisive improvement is probably because of a trade-off between the number of slices and their quality. The results from the BNN are similar to the results of the DNN in terms of accuracies; the differences are probably due to the different hyper-parameter tunings. In terms of training time, the DNN was slightly faster.

Moreover the slice-method coupled with the voting adds several new hyper-parameters and consequently gives a more complicated model that has to beadapted to new data sets and where more things can go wrong. This is thedrawback of the method.

Given these results, I believe it can be concluded that a classification problem on a data set which has one or more of the following qualities:

• the samples are sparse and of varying lengths,

• every sample is made up of a collection of sub-samples generated from some random process, where the sub-samples can characterize the class well,

• the data set contains few samples,

has the potential to be successfully solved using the method of slicing and voting. Further work is needed to assess the method's performance on more similar data sets.


Chapter 6

Discussion

This chapter will be devoted to discussing the results and the drawbacks of themethod. Lastly some areas where it can be further evaluated and developedwill be outlined.

Motivation Behind Slice Method: As mentioned a couple of times through-out the report the main achievement by this method is to simultaneously han-dle the problems of sparsity and length variances. It also adds additionalflexibility by being able to classify new time series based on only a few hoursof data as well as many hours of data by varying the number of slices used inthe voting procedure.

The motivation for using the slices in the first place is a belief that the slices can be viewed as independent and identically distributed (IID) samples from a distribution that generates the time series through a stochastic process. Because of the independence, one can look at each slice as a separate data point from that class and transfer the problem to classifying the slices.

To give an example where this sort of assumption could hold, consider a coffee machine in a large office that is running a daily cycle starting in the morning and ending in the evening. This cycle of usage occurs since no one is working during the night. For most days, the pattern of usage is the same: most people grab coffee during the same hours in the morning, after lunch and during the afternoon. However, from day to day there are small variations in the usage due to several reasons; people might be sick, not working that day, or simply have no craving for coffee. Assuming these disturbances are many in number and independent, they can be modeled as random Gaussian noise ε by a central limit argument.

Thus, if we measure the usage of the coffee machine for N days, we end up with a vector or time series T = [s_1, 0, s_2, 0, \ldots, s_N], where the s_i's represent the usage time series on day i. Because of the above arguments, we can view \{s_i\}_{i=1}^{N} as realizations of the random variable S = \mu + \varepsilon, the characteristic usage pattern plus Gaussian noise. In this way we might view the time series as a stochastic process generating IID slices of data, and the problem of classifying T can be cast into classifying the s_i's.

The same reasoning might suggest that the usage of utilities can be partitioned into IID slices, where each slice captures the class-conditional features necessary for a machine learning model to separate the classes. An argument against the time series in this project having this property is that they could have long-term trends and seasonality, or that the noise in the patterns of usage is not independent. The assumptions about the time series are hard to either verify or dismiss, but the argument can be made and the methods based on it evaluated.

Discussion About the Data set: The performance of most classification algorithms is limited by the data on which they are trained and evaluated. In this project we considered a data set of very poor quality for several reasons. We saw in the results that the hot water class was the hardest to predict correctly. It also happened to be the class for which the fewest good quality slices were obtained. By training models on a larger and more balanced data set, the performance would probably be increased.

Accounting for the class imbalance by under-sampling was the approach taken in this thesis; oversampling the classes with few meters could also yield performance gains. However, this could be tricky because of the sparseness and different lengths of the time series; questions would arise such as which meters to select for re-sampling and how the new synthetic meters would be constructed when there are missing values present in the original data.

A problem with the under-sampling is that the data set becomes quite small, containing only 373 meters per class. As seen in the results, the test sets contained around 200 meters, depending on the parameters. This could mean that the averaging of the accuracies when training on different training/test partitions could display high variance. In order to be more confident in the validity of the results, one would have to average the test results over different training/test splits, ideally over all such partitions. Due to computational complexity and time constraints this was not possible; however, for every partition into training and testing for the different parameter runs, the random number generator ensured that the splits differed. There were several runs like this, and since no significant performance drops that could be related to this were noted, it cannot be concluded that this was a problem.

The role of Min and Max number of slices: The results showed that the accuracy increases as the minimum number of slices taken from a meter increases. As a consequence, the meters that had fewer than that number of good slices available were simply excluded from the testing. In a possible application, one can use this information and be more careful about the meters which have a low number of good quality slices.

For the training set, the minimum and maximum number of slices extracted from the meters are hyper-parameters and can be adjusted. The main issue that might arise from this hyper-parameter choice is related to the length and sparsity variance. If the maximum is set very high, then some meters will contribute much more than others to the new data set; if the slices have distributions conditioned on meters, this might violate the IID assumptions, and the models might generalize poorly to other meters if trained on slices from only a few meters. Consequently, there is a need to re-balance the data set of extracted slices, as described in the workflow in Figure 4.1. As the slices are balanced over the classes, some meters will be over-represented in the training data. In order to combat this, I tried not to set the maximum number of slices too high and finally settled for 200. Since the models achieved sufficient accuracies on the testing data, which of course was balanced in terms of meters, I did not spend much time optimizing over this hyper-parameter or thinking about how to account for the meter imbalance. It is unclear whether or not this is a problem; I would recommend that further analysis be made of how changing this hyper-parameter affects the resulting accuracies.

Comment on Entropy: As the results showed, using the entropy as a means of deciding which slices to include in the voting yielded little performance gain. I hypothesize that this is because there exists a trade-off between the gain of including more slices in the voting procedure and the per-slice accuracy gain from using low-entropy slices.

6.1 Future Work

There are many ways of continuing the work that has been made in this thesis.One interesting aspect would be to try the method on other data sets whichare similar to the one used here. Preferably this analysis should be madeon a publicly available data set so that the results are more easily replicated.It would be interesting to compare the performance of the proposed votingmethod and the Gaussian process regression method for sparse time series onsuch data sets.

Furthermore, there are many ways of tweaking the model, for instance by using different classifiers, hyper-parameter optimization and so forth. It is also possible to work with the generation of slices; perhaps it would be beneficial to further augment the data sets by clever transformations of slices, similar to the MCNN network. One could also try to find which slices should be included in training and voting using active learning methods, as described and applied to the well-known data set of handwritten digits (MNIST) in [40].


Another possibility is to use the data dictionary method explained in [15]. Although the setting in that paper is a bit different, their method of selecting slices could be implemented. Their method together with a modern classifier and the voting procedure could possibly lead to better results for this data set as well.

The different methods of handling class imbalances could definitely be inves-tigated. Perhaps oversampling at a time series-level instead of under-samplingthe high sample classes would yield better results. Such a method should takeinto account the sparsity and variance in lengths for the samples.

Lastly, some effort was put into proving Conjecture 1, without success. A formal proof would help to further solidify the arguments for the voting method. Perhaps a similar analysis could be made with some relaxations of the independence and identical distribution of the slices. These would also be further areas of work.


Chapter 7

Appendix A.

In this appendix I will present some plots related to the data set.

The following plots show the distributions of missing-value percentages over the time series. It is this information that is summed up in Figure 3.2.

[Figures: histograms of missing-value percentages for the classes Cold Water, Hot Water, Electricity, Cooling and Heating.]

The following figures depict the distributions of the percentage of zeros in thetime-series across the different classes.


(Figures: distributions of the percentage of zeros for the classes Water, Hot Water, Electricity, Cooling and Heating.)


Bibliography

[1] Alessandro Pitì, Giacomo Verticale, Cristina Rottondi, Antonio Capone, and Luca Lo Schiavo. The role of smart meters in enabling real-time energy services for households: The Italian case. Energies, 10(2):199, 2017.

[2] R. J. Alcock, Y. Manolopoulos, Data Engineering Laboratory, and Department of Informatics. Time-series similarity queries employing a feature-based approach. In 7th Hellenic Conference on Informatics, Ioannina, pages 27–29, 1999.

[3] Pierre Geurts. Pattern extraction for time series classification. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '01, pages 115–127, Berlin, Heidelberg, 2001. Springer-Verlag.

[4] Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 947–956, New York, NY, USA, 2009. ACM.

[5] Jason Lines, Luke M Davis, Jon Hills, and Anthony Bagnall. A shapelet transform for time series classification. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 289–297. ACM, 2012.

[6] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. CoRR, abs/1603.06995, 2016.

[7] Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, Riva del Garda, Italy, September 2016.

[8] Zhiguang Wang, Weizhong Yan, and Tim Oates. Time series classification from scratch with deep neural networks: A strong baseline. 2017 International Joint Conference on Neural Networks (IJCNN), May 2017.

[9] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time series classification: a review, 2018.


[10] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.

[11] Josif Grabocka, Alexandros Nanopoulos, and Lars Schmidt-Thieme. Classification of sparse time series via supervised matrix factorization. 2012.

[12] Steven Cheng-Xian Li and Benjamin M Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features.

[13] Yanmin Sun, Mohamed S. Kamel, Andrew K.C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.

[14] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[15] Bing Hu, Yanping Chen, and Eamonn J. Keogh. Time series classification under more realistic assumptions. In SDM, 2013.

[16] Patrick Schäfer and Ulf Leser. Fast and accurate time series classification with WEASEL. In CIKM, 2017.

[17] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

[18] Energy Independence and Security Act of 2007 (EISA-2007), Sec. 1301 (15 USC 17381): Statement of policy on modernization of electricity grid. https://www.gpo.gov/fdsys/pkg/PLAW-110publ140/html/PLAW-110publ140.htm. Accessed: 2018-12-10.

[19] Sensor monitoring device, patent US3842208A, T. Paraskevakos. https://patents.google.com/patent/US3842208. Accessed: 2018-12-10.

[20] Proposition 2002/03:85. https://www.riksdagen.se/sv/dokument-lagar/dokument/proposition/vissa-elmarknadsfragor_GQ0385. Accessed: 2018-12-10.

[21] European conference on smart metering deployment in the EU (26 June 2014). https://ec.europa.eu/energy/en/topics/market-and-consumers/smart-grids-and-meters. Accessed: 2018-12-10.

[22] Zdravko Liposcak and Marin Boskovic. Survey of smart metering communication technologies. Pages 1391–1400, July 2013.

[23] German Corrales Madueno, Cedomir Stefanovic, and Petar Popovski. How many smart meters can be deployed in a GSM cell? In Communications Workshops (ICC), 2013 IEEE International Conference on, pages 1263–1268. IEEE, 2013.

[24] Yassin Jomni, Jan van Deventer, and Jerker Delsing. Improving heat energy measurement in district heating substations using an adaptive algorithm. In International Conference on Flow Measurement: 14/09/2004–17/09/2004, pages 554–558, 2004.

[25] Frank Rosenblatt. The Perceptron – a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory, 1957.

[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.

[27] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[28] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

[29] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? ArXiv e-prints, May 2018.

[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[31] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[32] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.

[33] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ArXiv e-prints, December 2013.

[34] Paul Bromiley. Products and convolutions of Gaussian probability density functions.

[35] Claude E Shannon and Warren Weaver. The Mathematical Theory of Information. Urbana, IL, 1949.

[36] Tom Auld, Andrew W Moore, and Stephen F Gull. Bayesian neural networks for internet traffic classification. IEEE Transactions on Neural Networks, 18(1):223–239, 2007.

[37] Carl Ridnert. GitHub repository. https://github.com/Ridnert/ML_Utility_Submetering. Accessed: 2019-01-12.

[38] Bikesh Kumar Singh, Kesari Verma, and AS Thoke. Investigations on impact of feature normalization techniques on classifier's performance in breast tumor classification. International Journal of Computer Applications, 116(19), 2015.

[39] Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A library for Bayesian deep learning. arXiv preprint arXiv:1709.05870, 2017.

[40] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.

[41] Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A library for Bayesian deep learning. ArXiv e-prints, page arXiv:1709.05870, September 2017.

[42] Subhash Bagui and K. Mehra. Convergence of binomial to normal: multiple proofs. International Mathematical Forum, 12:399–411, 2017.


