
Information Sciences 299 (2015) 99–116


A weighted LS-SVM based learning system for time series forecasting ☆

http://dx.doi.org/10.1016/j.ins.2014.12.031
0020-0255/© 2014 Elsevier Inc. All rights reserved.

☆ This work was supported by the National Science Council under the Grants NSC-99-2221-E-110-064-MY3 and NSC-101-2622-E-110-011-CC3, and by the "Aim for the Top University Plan" of the National Sun Yat-Sen University and Ministry of Education.
⇑ Corresponding author.

E-mail address: [email protected] (S.-J. Lee).

Thao-Tsen Chen, Shie-Jue Lee ⇑
Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan

Article info

Article history:
Received 22 May 2014
Received in revised form 6 December 2014
Accepted 11 December 2014
Available online 17 December 2014

Keywords:
Time series forecasting
Multi-step forecasting
Nearest neighbor
Mutual information
Support vector machine

Abstract

Time series forecasting is important because it can often provide the foundation for decision making in a large variety of fields. Statistical approaches have been extensively adopted for time series forecasting in the past decades. Recently, machine learning techniques have drawn attention, and useful forecasting systems based on these techniques have been developed. In this paper, we propose a weighted Least Squares Support Vector Machine (LS-SVM) based approach for time series forecasting. Given a forecasting sequence, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information. Based on the training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. A number of experiments are conducted to demonstrate the effectiveness of the proposed approach for time series forecasting.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Time series forecasting concerns forecasting future events based on a series of observations taken previously at equally spaced time intervals [25]. It has played an important role in the decision making process in a variety of fields such as finance, power supply, and medical care [21]. One example is to predict stock exchange indices or closing stock prices in the stock market [9–12,23,43,60]. Another example is to predict the electricity demand to avoid producing extra electric power in the power supply industry [27,39,68]. If forecasting is done for one time step ahead into the future, it is called single-step or one-step forecasting. For the case of two or more time steps ahead, it is usually called multi-step forecasting [5,21].

Two approaches have been adopted for constructing time series forecasting models. The global modeling approach constructs a model which is independent of the target to be forecasted. For time series prediction, the conditions of the environment may vary as time goes on; a global model is not adaptive, and thus accuracy suffers. A local model constructed by the local modeling approach [28,29,45–47] is dependent on the target to be forecasted and therefore is adaptive. Local



models are usually characterized by using a small number of neighbors in the proximity of the forecasting sequence. Another issue in time series forecasting is determining the lags to be involved in the model. The lags have a large influence on the forecasting accuracy. For example, in steel making engineering [4,34], the furnace temperature changes two to eight hours after the materials are fed into the furnace, i.e., there is a time lag of two to eight hours for the temperature change. Furthermore, the lags may vary as time goes on, and adjusting them is required [59]. Lag selection methods using t-statistics of estimated coefficients or F-statistics of groups of coefficients to measure statistical significance have been proposed [26]. The idea of mutual information was adopted in lag selection for electric load time series forecasting [7]. Two strategies, direct and iterative, have been adopted for constructing time series forecasting models; the difference between them lies in whether or not the forecasts of previous steps are involved in the prediction of the current step.

Statistical methods have been extensively adopted for time series forecasting in the past decades; among them are the moving average, weighted moving average, Kalman filtering, exponential smoothing, regression analysis, the autoregressive moving average (ARMA), the autoregressive integrated moving average (ARIMA), and the autoregressive moving average with exogenous inputs (ARMAX) [6]. These methods are based on the assumption that a probability model generates the underlying time series data, with future values of the time series related to past values as well as to past errors. Box–Jenkins models [6], i.e., ARMA, ARIMA, and ARMAX, are quite flexible due to the inclusion of both autoregressive and moving average terms. Recently, machine learning techniques have drawn attention, and useful forecasting systems based on these techniques have been developed [49]. The multilayer perceptron, often simply called the neural network, is a popular network architecture in use for time series forecasting [13,20,24]; however, it often encounters the local minimum problem during learning, and the number of nodes in the hidden layer is usually difficult to decide. The k-nearest neighbor regression method is a nonparametric method which bases its prediction on the k nearest neighbors of the target to be forecasted [29]; however, it lacks the capability of adaptation, and the adopted distance measure may affect prediction performance. Classification and Regression Trees is a regression model based on a hierarchical tree-like partition of the input space [22,53,72]. Fuzzy theory has been incorporated for prediction in the stock market [9,19,55–57]; however, membership functions need to be determined, which is often a challenging task for the user, and fuzzy theory by itself offers no learning. Support vector regression can provide high accuracy by mapping into a high-dimensional feature space and including a penalty term in the error function [30,33,47,52,65,71]. Neuro-fuzzy modeling is a hybrid approach which takes advantage of both neural network techniques and fuzzy theory [1,31,32,43,50]; supervised learning algorithms are utilized to optimize the parameters of the produced forecasting models through learning.

Some of the existing forecasting methods can only be applied to univariate forecasting, others can only work for one-step forecasting, and yet others cannot produce adaptive models. In this paper, we propose a direct local modeling approach based on machine learning to derive forecasting models. Given a forecasting sequence, two steps are involved in our proposed approach. Firstly, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information; a measure taking trends into consideration is adopted to gauge the similarity between two sequences of data, and proper lags associated with relevant variables for forecasting are determined. Secondly, based on the extracted training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. The effectiveness of the proposed approach is demonstrated by the results obtained from a number of experiments conducted on real-world datasets.

The rest of this paper is organized as follows. Section 2 states the problem to be solved. Section 3 describes our proposed forecasting approach in detail. An illustrative example is given in Section 4. Experimental results are presented in Section 5. Some relevant issues are discussed in Section 6. Finally, concluding remarks are given in Section 7.

2. Problem statement

Consider a series of real-valued observations [69]:

$X_0, Y_0, X_1, Y_1, \ldots, X_t, Y_t$    (1)

taken at equally spaced time points $t_0, t_0 + \Delta t, t_0 + 2\Delta t, \ldots$ for some process $P$, where $Y_i$ denotes the value of the output variable (or dependent variable) observed at the time point $t_0 + i\Delta t$ and $X_i$ denotes the values of $n$ additional variables (or independent variables), $n \ge 0$, observed at the time point $t_0 + i\Delta t$. Time series forecasting is to estimate the value of $Y$ at some future time $t + s$, i.e., $Y_{t+s}$, by

$\hat{Y}_{t+s} = G(X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t)$    (2)

where $s \ge 1$ is called the horizon of prediction, $G$ is the forecasting function or model, $Y_{t-i}$ is the $i$th lag of $Y_t$, $X_{t-i}$ is the $i$th lag of $X_t$, and $q$ is the lag-span of prediction. For $s = 1$, it is called one-step forecasting; for $s > 1$, it is called multi-step forecasting. Also, if $n = 0$, it is univariate forecasting; otherwise, it is multivariate forecasting.

Forecasting $Y_{t+s}$ can be regarded as a function approximation task. Two strategies are usually adopted to construct forecasting models:


• Direct. Train on $X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t$ to predict $Y_{t+s}$ directly, for any $s \ge 1$.
• Iterative. Train to predict $Y_{t+1}$ only, but iterate to get $Y_{t+s}$ for any $s > 1$.

These strategies work identically for the case of $s = 1$. However, for the case of $s > 1$, the iterative strategy cannot work for multivariate prediction, since $X_{t+1}, \ldots, X_{t+s-1}$ are not available to obtain $\hat{Y}_{t+2}, \hat{Y}_{t+3}, \ldots, \hat{Y}_{t+s}$ successively in this case. The forecasting model for estimating $Y_{t+s}$ can be obtained by a global or a local modeling approach; these two approaches have their own pros and cons, as described in the previous section.
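To make the two strategies concrete, the following Python sketch contrasts them for the univariate case. Here `model_s` and `model_1` stand for any fitted regressors with a `predict` method; they are hypothetical placeholders, not objects defined in this paper.

```python
import numpy as np

def direct_forecast(model_s, lag_window):
    """Direct strategy: a model trained specifically for horizon s maps the
    current lag window to Y_{t+s} in one application."""
    return model_s.predict(lag_window)

def iterative_forecast(model_1, y_lags, s):
    """Iterative strategy (univariate only): a one-step model is applied s
    times, feeding each forecast back into the sliding lag window. For the
    multivariate case this breaks down, because the future X values the
    window would need are unavailable."""
    window = list(y_lags)
    y_hat = None
    for _ in range(s):
        y_hat = model_1.predict(np.asarray(window))
        window = window[1:] + [y_hat]  # slide the window over the new forecast
    return y_hat
```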

3. Proposed approach

Without loss of generality, we consider only $Y$ and one additional variable $X$, i.e., $n = 1$, in Eq. (1); the extension to the case of two or more additional variables is trivial. Our goal is to estimate $Y_{t+s}$, $s \ge 1$, from $X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t$. To do this, we try to learn from the observed data how

• $Y_{q+s}$ can be forecasted from $X_0, Y_0, X_1, Y_1, \ldots, X_{q-1}, Y_{q-1}, X_q, Y_q$;
• $Y_{q+s+1}$ can be forecasted from $X_1, Y_1, X_2, Y_2, \ldots, X_q, Y_q, X_{q+1}, Y_{q+1}$;
• $\ldots$;
• $Y_{t-1}$ can be forecasted from $X_{t-s-q-1}, Y_{t-s-q-1}, X_{t-s-q}, Y_{t-s-q}, \ldots, X_{t-s-2}, Y_{t-s-2}, X_{t-s-1}, Y_{t-s-1}$;
• $Y_t$ can be forecasted from $X_{t-s-q}, Y_{t-s-q}, X_{t-s-q+1}, Y_{t-s-q+1}, \ldots, X_{t-s-1}, Y_{t-s-1}, X_{t-s}, Y_{t-s}$.

Based on what we have learned, we intend to predict $Y_{t+s}$ from $X_{t-q}, Y_{t-q}, X_{t+1-q}, Y_{t+1-q}, \ldots, X_t, Y_t$ in a similar manner. For convenience, the sequence

$Q = [X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t]$    (3)

is called the forecasting sequence for forecasting $Y_{t+s}$. We adopt the direct local modeling approach, based on machine learning techniques, to derive the forecasting model for $Y_{t+s}$.

The proposed approach, as shown in Fig. 1, consists of two phases, training and forecasting. In the training phase, a forecasting model is derived; two steps, training pattern extraction and model derivation, are involved in this phase. A suitable set of training patterns is extracted from the observed data, and then a modified LS-SVM is adopted to derive the forecasting model. In the forecasting phase, the forecasted result is obtained from the derived model. Let $z = t - s + 1 - q$ and

$S_1 = [X_0, Y_0, X_1, Y_1, \ldots, X_{q-1}, Y_{q-1}, X_q, Y_q]$,
$S_2 = [X_1, Y_1, X_2, Y_2, \ldots, X_q, Y_q, X_{q+1}, Y_{q+1}]$,
$\ldots$
$S_{z-1} = [X_{z-2}, Y_{z-2}, X_{z-1}, Y_{z-1}, \ldots, X_{t-s-2}, Y_{t-s-2}, X_{t-s-1}, Y_{t-s-1}]$,
$S_z = [X_{z-1}, Y_{z-1}, X_z, Y_z, \ldots, X_{t-s-1}, Y_{t-s-1}, X_{t-s}, Y_{t-s}]$.    (4)

For convenience, we call $S = \{S_1, S_2, \ldots, S_z\}$ the neighbor set of $Q$ and each element in $S$ a neighbor of $Q$.
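A minimal sketch of how the neighbor set of Eq. (4) can be enumerated is given below; the function name and the list-of-lists representation are illustrative choices, not prescribed by the paper.

```python
def neighbor_set(X, Y, t, s, q):
    """Enumerate the neighbor set S of Eq. (4): every interleaved window
    [X_i, Y_i, ..., X_{i+q}, Y_{i+q}] whose s-step-ahead target Y_{i+q+s}
    still lies within the observed data. S[0] is S_1 and S[z-1] is S_z."""
    z = t - s + 1 - q
    S = []
    for i in range(z):
        window = []
        for j in range(i, i + q + 1):
            window.extend([X[j], Y[j]])
        S.append(window)
    return S
```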

Fig. 1. Training and forecasting in the proposed approach.


3.1. Training pattern extraction

In this step, a suitable set of training patterns is extracted from the observed data. Firstly, the most appropriate context for constructing the forecasting model for $Y_{t+s}$ is located. Then the proper lags to be involved in training are determined.

3.1.1. Finding k nearest neighbors

The most appropriate context for constructing the forecasting model for $Y_{t+s}$ is defined to be the $k$ nearest neighbors of $Q$ in $S$, where $k$ is a number specified by the user. We adopt a measure similar to that in [29] to compute the similarity between two sequences. Let $A_1$ and $A_2$ be two sequences:

$A_1 = [X_a, Y_a, X_{a+1}, Y_{a+1}, \ldots, X_{a+q-1}, Y_{a+q-1}, X_{a+q}, Y_{a+q}]$,
$A_2 = [X_b, Y_b, X_{b+1}, Y_{b+1}, \ldots, X_{b+q-1}, Y_{b+q-1}, X_{b+q}, Y_{b+q}]$,

and let $F_1$ and $F_2$ be the differential sequences of $A_1$ and $A_2$, respectively, defined as

$F_1 = [X_{a+1} - X_a, Y_{a+1} - Y_a, \ldots, X_{a+q} - X_{a+q-1}, Y_{a+q} - Y_{a+q-1}]$,
$F_2 = [X_{b+1} - X_b, Y_{b+1} - Y_b, \ldots, X_{b+q} - X_{b+q-1}, Y_{b+q} - Y_{b+q-1}]$.

Note that the differential sequence reveals the trends, i.e., the rising or falling conditions, of the original sequence: a negative entry in the differential sequence indicates a fall at the underlying point of the original sequence, while a positive entry indicates a rise. Let $NE(A_1, A_2)$ be the normalized Euclidean distance between $A_1$ and $A_2$, and $NE(F_1, F_2)$ be the normalized Euclidean distance between $F_1$ and $F_2$. The following measure [29]:

$NH(A_1, A_2) = \dfrac{NE(A_1, A_2) + NE(F_1, F_2)}{2}$    (5)

is adopted to locate the $k$ nearest neighbors of $Q$. We call $NH(A_1, A_2)$ the hybrid distance between $A_1$ and $A_2$.

We identify the $k$ nearest neighbors of $Q$ as follows. We calculate the hybrid distance between $Q$ and every neighbor in $S$; as a result, we have $z$ hybrid distances. The $k$ sequences with the shortest hybrid distances are taken to be the $k$ nearest neighbors of $Q$. For convenience, these $k$ nearest neighbors are expressed as

$[X_{t_1-q}, Y_{t_1-q}, X_{t_1-q+1}, Y_{t_1-q+1}, \ldots, X_{t_1-1}, Y_{t_1-1}, X_{t_1}, Y_{t_1}]$,
$[X_{t_2-q}, Y_{t_2-q}, X_{t_2-q+1}, Y_{t_2-q+1}, \ldots, X_{t_2-1}, Y_{t_2-1}, X_{t_2}, Y_{t_2}]$,
$\ldots$
$[X_{t_{k-1}-q}, Y_{t_{k-1}-q}, X_{t_{k-1}-q+1}, Y_{t_{k-1}-q+1}, \ldots, X_{t_{k-1}-1}, Y_{t_{k-1}-1}, X_{t_{k-1}}, Y_{t_{k-1}}]$,
$[X_{t_k-q}, Y_{t_k-q}, X_{t_k-q+1}, Y_{t_k-q+1}, \ldots, X_{t_k-1}, Y_{t_k-1}, X_{t_k}, Y_{t_k}]$    (6)

where $q \le t_i \le t - s$ for any $i$, $1 \le i \le k$.
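The hybrid distance of Eq. (5) and the neighbor search translate into the following Python sketch. Since the exact normalization used by the normalized Euclidean distance of [29] is not restated here, it is passed in as a user-supplied function `ne`; everything else follows the definitions above.

```python
import numpy as np

def differential(seq, stride=1):
    """Differential sequence of Section 3.1.1. For the interleaved
    [X, Y, X, Y, ...] windows of Eq. (4), use stride=2 so that X values are
    differenced with X values and Y values with Y values."""
    a = np.asarray(seq, dtype=float)
    return a[stride:] - a[:-stride]

def hybrid_distance(a1, a2, ne, stride=1):
    """NH(A1, A2) of Eq. (5): the average of the normalized Euclidean
    distance between the sequences and between their differential sequences."""
    return 0.5 * (ne(a1, a2) +
                  ne(differential(a1, stride), differential(a2, stride)))

def k_nearest_neighbors(Q, S, k, ne, stride=1):
    """Rank all z neighbors in S by hybrid distance to Q and keep the k
    closest. Returns the neighbors together with their positions in S."""
    d = [hybrid_distance(Q, Si, ne, stride) for Si in S]
    order = list(np.argsort(d)[:k])
    return [S[i] for i in order], order
```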

3.1.2. Lag selection

Next, we select certain lags out of $X_{t-q}, Y_{t-q}, X_{t+1-q}, Y_{t+1-q}, \ldots, X_{t-1}, Y_{t-1}, X_t$, and $Y_t$ for training. One simple idea is to use them all, in which case the number of lags is $2q + 2$. However, this is not necessarily a good idea. Usually, we do not know in advance how long the response time of the underlying system is; worse, the response time may vary along the way. If $q$ is not large enough, we might miss some inputs which are important to the system, so $q$ is usually set sufficiently large. Adopting all the lags, however, may lead to overfitting, and the prediction accuracy can be poor. To avoid this, we select only a subset of the $2q + 2$ candidate lags.

We adopt the concept of mutual information [38,54,58,61,62] for lag selection. Let the mutual information between two vectors $U$ and $V$ be denoted $MI(U, V)$. Intuitively, mutual information measures the information that $U$ and $V$ share, i.e., how much knowing one of these vectors reduces uncertainty about the other. If $MI(U, V)$ is large, there is likely a strong connection between $U$ and $V$. Mutual information is adopted, rather than alternatives such as correlation [42], mainly because of its capability of measuring non-linear relationships between the involved vectors.

From the $k$ nearest neighbors of Eq. (6) and their corresponding $s$-step-ahead outputs $Y_{t_1+s}, Y_{t_2+s}, \ldots, Y_{t_k+s}$, we form the following $2q + 3$ vectors:

$Z_1, Z_2, \ldots, Z_{2q-1}, Z_{2q}, Z_{2q+1}, Z_{2q+2}, H$    (7)

where

$Z_{2i+1} = [X_{t_1+i-q}\ \ X_{t_2+i-q}\ \ \ldots\ \ X_{t_k+i-q}]^T, \quad 0 \le i \le q$,
$Z_{2i+2} = [Y_{t_1+i-q}\ \ Y_{t_2+i-q}\ \ \ldots\ \ Y_{t_k+i-q}]^T, \quad 0 \le i \le q$,
$H = [Y_{t_1+s}\ \ Y_{t_2+s}\ \ \ldots\ \ Y_{t_k+s}]^T$.


We use a greedy approach with forward selection [14,59] to find a desired number of lags. Let $d$ be the number of lags to be found. Firstly, we calculate $MI(Z_i, H)$, $1 \le i \le 2q+2$. Let $MI(Z_{d_1}, H)$ be the largest, indicating that the most significant connection exists between $Z_{d_1}$ and $H$; therefore, $Z_{d_1}$ is selected. Next, we calculate $MI(\{Z_{d_1}, Z_i\}, H)$, $1 \le i \le 2q+2$, $i \ne d_1$. Let $MI(\{Z_{d_1}, Z_{d_2}\}, H)$ be the largest; therefore, $Z_{d_2}$ is also selected. Then we calculate $MI(\{Z_{d_1}, Z_{d_2}, Z_i\}, H)$, $1 \le i \le 2q+2$, $i \ne d_1$, $i \ne d_2$. This goes on until $d$ lags are found. The lags obtained are then used to determine the training patterns and the input for forecasting. For example, suppose $d = 3$ and $Z_{2q-1}$, $Z_{2q+1}$, and $Z_{2q+2}$ are selected. By referring to Eqs. (6) and (7), we extract the following $k$ training patterns:

$([X_{t_j-1}, X_{t_j}, Y_{t_j}], Y_{t_j+s}), \quad 1 \le j \le k.$    (8)

These training patterns are used to train a direct forecasting model in the model derivation step. Accordingly, the input for forecasting $Y_{t+s}$ is $[X_{t-1}, X_t, Y_t]$.
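The greedy forward selection can be sketched as follows. The paper does not fix a particular estimator for the mutual information between a set of lag vectors and $H$; the equal-width-binning histogram estimator below is only one simple, assumed stand-in for it.

```python
import numpy as np

def _discretize(v, bins=3):
    """Equal-width binning of a real-valued vector (an assumed, crude choice)."""
    v = np.asarray(v, dtype=float)
    edges = np.linspace(v.min(), v.max(), bins + 1)[1:-1]
    return np.digitize(v, edges)

def mi_set(Z_subset, H, bins=3):
    """Histogram estimate of MI({Z_d1, ..., Z_dm}; H): the selected lag
    vectors are fused into one joint discrete code, then the plug-in MI
    between that code and the discretized target is computed."""
    codes = np.zeros(len(H), dtype=int)
    for z in Z_subset:
        codes = codes * bins + _discretize(z, bins)
    h = _discretize(H, bins)
    mi = 0.0
    for c in np.unique(codes):
        for y in np.unique(h):
            p_xy = np.mean((codes == c) & (h == y))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(codes == c) * np.mean(h == y)))
    return mi

def greedy_lag_selection(Z, H, d, bins=3):
    """Forward selection of Section 3.1.2: at each round, add the candidate
    lag whose inclusion maximizes MI with the target H, until d lags are chosen."""
    selected, remaining = [], list(range(len(Z)))
    for _ in range(d):
        best = max(remaining,
                   key=lambda i: mi_set([Z[j] for j in selected] + [Z[i]], H, bins))
        selected.append(best)
        remaining.remove(best)
    return selected
```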

3.2. Model derivation

After we obtain the $k$ training patterns, we can proceed to derive the direct forecasting model. Let the training patterns be represented as $\{(\mathbf{x}_i, y_i)\}_{i=1}^{k}$, where $\mathbf{x}_i$ is the input and $y_i$ is the desired output of pattern $i$. Note that we have also derived the input $\mathbf{x}$ for forecasting $Y_{t+s}$. In the case of the previous example shown in Eq. (8), the first training pattern comprises $\mathbf{x}_1 = [X_{t_1-1}, X_{t_1}, Y_{t_1}]$ and $y_1 = Y_{t_1+s}$, the second comprises $\mathbf{x}_2 = [X_{t_2-1}, X_{t_2}, Y_{t_2}]$ and $y_2 = Y_{t_2+s}$, $\ldots$, and the $k$th comprises $\mathbf{x}_k = [X_{t_k-1}, X_{t_k}, Y_{t_k}]$ and $y_k = Y_{t_k+s}$; the input for forecasting $Y_{t+s}$ is $\mathbf{x} = [X_{t-1}, X_t, Y_t]$.

The Least Squares Support Vector Machine (LS-SVM) [52,65] is a powerful method for solving function estimation problems. However, all the errors induced in LS-SVM have the same weight [3]. Preferably, we would like to give more credit to a training pattern that is more accurately forecasted in the training process [64,67]; that is, a training pattern with a smaller error is allowed to have a bigger weight. We modify the traditional LS-SVM to meet this requirement, and the modified LS-SVM can be expressed as

$\min\ J(\boldsymbol{\omega}, \mathbf{e}) = \frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + \frac{1}{2}\gamma \sum_{j=1}^{k} g_j e_j^2$
subject to $y_i = \boldsymbol{\omega}^T \phi(\mathbf{x}_i) + b + e_i, \quad i = 1, 2, \ldots, k$    (9)

where $\boldsymbol{\omega}$ is the weight vector to be solved, $\gamma$ is the regularization parameter, which is a constant, $\phi$ maps the original input space to a high-dimensional feature space, $b$ is a real constant, and $e_1, e_2, \ldots, e_k$ are error variables. Note that each error $e_j$, $1 \le j \le k$, is weighted by $g_j$, which is defined as

$g_j = \exp\left(-\dfrac{g_{1,j} + g_{2,j}}{2}\right)$    (10)

where $g_{1,j}$ is the hybrid distance between $\mathbf{x}$ and $\mathbf{x}_j$, and $g_{2,j}$ is the normalized absolute difference between $y_j$ and the median of $\mathbf{y}$, i.e.,

$g_{1,j} = NH(\mathbf{x}, \mathbf{x}_j)$,    (11)

$g_{2,j} = \dfrac{|y_j - \mathrm{median}(\mathbf{y})|}{\max\{|y_j - \mathrm{median}(\mathbf{y})|\}}$.    (12)

Note that $\mathrm{median}(\mathbf{y})$ is the median of all the $Y$ values, and the denominator $\max\{|y_j - \mathrm{median}(\mathbf{y})|\}$ is the maximum of all $|y_j - \mathrm{median}(\mathbf{y})|$, $1 \le j \le k$. Essentially, $g_{1,j}$ concerns the closeness between $\mathbf{x}$ and $\mathbf{x}_j$: as $\mathbf{x}_j$ gets closer to $\mathbf{x}$, $g_{1,j}$ gets smaller and, consequently, $g_j$ gets larger. Furthermore, we consider not only the input but also the output in the weighting. The rationale behind this is that, for time series forecasting, it is not always the case that closer inputs lead to closer outputs.

Table 1
An example showing the relationship between inputs and outputs.

Neighbor (ranked by input distance)    Output distance
1                                      0.2364
2                                      0.1505
3                                      0.1009
4                                      0.0893
...                                    ...
148                                    2.0958
149                                    1.0462
150                                    0.8589


An example is shown in Table 1, which contains 150 pieces of data taken from the Poland Electricity Load dataset [51]. In this table, the left column gives the ranking of the input distances between one reference point and its 150 nearest neighbors; a smaller index indicates a neighbor closer to the reference point with respect to the input. The right column gives the output distances between the reference point and these nearest neighbors. As can be seen, closer inputs do not necessarily lead to closer outputs. For instance, consider neighbors 1 and 2: neighbor 1 is closer to the reference point in terms of the input distance, while neighbor 2 is closer in terms of the output distance. Therefore, we account for the contribution of the output by including the term $g_{2,j}$ of Eq. (12): as $y_j$ gets closer to the median, $g_{2,j}$ gets smaller and, consequently, $g_j$ gets larger. The median is used instead of the mean because of outliers, which often occur in practical applications; if some outputs have abnormal values, the median is less affected.
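The weighting of Eqs. (10)-(12) can be sketched in a few lines; `hybrid_distance` and the `ne` argument come from the sketch in Section 3.1.1, and the function name itself is an illustrative choice.

```python
import numpy as np

def pattern_weights(x, xs, ys, ne):
    """Weights g_j of Eqs. (10)-(12): a pattern gets a larger weight when its
    input is close to the forecasting input x (Eq. 11) and its output is
    close to the median output (Eq. 12)."""
    ys = np.asarray(ys, dtype=float)
    g1 = np.array([hybrid_distance(x, xj, ne) for xj in xs])   # Eq. (11)
    dev = np.abs(ys - np.median(ys))
    g2 = dev / dev.max()                                       # Eq. (12)
    return np.exp(-(g1 + g2) / 2.0)                            # Eq. (10)
```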

The solution to Eq. (9) can be derived by constructing the Lagrangian function:

$L(\boldsymbol{\omega}, b, \mathbf{e}, \boldsymbol{\alpha}) = J(\boldsymbol{\omega}, \mathbf{e}) - \sum_{i=1}^{k} \alpha_i (\boldsymbol{\omega}^T \phi(\mathbf{x}_i) + b + e_i - y_i)$    (13)

where $\alpha_i \in \mathbb{R}$, $1 \le i \le k$, are the Lagrange multipliers. By eliminating the variables $\boldsymbol{\omega}$ and $\mathbf{e}$, we have

$\begin{bmatrix} 0 & \mathbf{1}_k^T \\ \mathbf{1}_k & \boldsymbol{\Omega} + \operatorname{diag}\!\left(\frac{1}{\gamma g_1}, \ldots, \frac{1}{\gamma g_k}\right) \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix}$    (14)

where $\mathbf{y} = [y_1\ y_2\ \ldots\ y_k]^T$, $\mathbf{1}_k = [1\ 1\ \ldots\ 1]^T$, $\boldsymbol{\alpha} = [\alpha_1\ \alpha_2\ \ldots\ \alpha_k]^T$, and $\boldsymbol{\Omega}$ is the kernel matrix, defined by

$[\boldsymbol{\Omega}]_{i,j} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = \exp\!\left(-\left(\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|}{\sigma}\right)^2\right)$    (15)

for $i, j = 1, 2, \ldots, k$. From Eq. (14), the values of $\alpha_1, \alpha_2, \ldots, \alpha_k$ are obtained, and the forecasting model is given by

$y = G(\mathbf{x}) = \sum_{i=1}^{k} \alpha_i \phi(\mathbf{x})^T \phi(\mathbf{x}_i) + b = \sum_{i=1}^{k} \alpha_i \exp\!\left(-\left(\dfrac{\|\mathbf{x} - \mathbf{x}_i\|}{\sigma}\right)^2\right) + b$    (16)

which can then be used for forecasting $Y_{t+s}$.
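Since Eq. (14) is a plain $(k+1) \times (k+1)$ linear system, the modified LS-SVM can be solved with a generic linear-algebra routine, as sketched below under the diagonal regularization derived above and the Gaussian kernel of Eq. (15). Note that in the paper's experiments the hyper-parameters $\gamma$ and $\sigma^2$ are instead set automatically by the LS-SVM program of [44,65]; this sketch is not that program.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Eq. (15): Omega[i, j] = exp(-(||a_i - b_j|| / sigma)^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

def fit_weighted_lssvm(X, y, g, gamma, sigma):
    """Solve the KKT system of Eq. (14). With the error term e_j weighted by
    g_j, stationarity gives e_j = alpha_j / (gamma * g_j), so the diagonal
    regularization becomes 1 / (gamma * g_j) instead of the uniform 1 / gamma."""
    k = len(y)
    A = np.zeros((k + 1, k + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.diag(1.0 / (gamma * g))
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # b, alpha

def lssvm_predict(X_train, alpha, b, x, sigma):
    """Eq. (16): G(x) = sum_i alpha_i * K(x, x_i) + b."""
    kvec = rbf_kernel(np.atleast_2d(np.asarray(x, float)), X_train, sigma)[0]
    return float(kvec @ alpha + b)
```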

3.3. Summary of training and forecasting

Given a series of real-valued observations $X_0, Y_0, X_1, Y_1, \ldots, X_t, Y_t$, our goal is to estimate the value of $Y_{t+s}$ at some future time $t + s$, $s \ge 1$, based on these observations. Let the forecasting sequence $Q$ be $[X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t]$, where $q$ is a sufficiently large positive integer. We find the $k$ nearest neighbors of $Q$ as its most suitable context, then use mutual information to select the lags of the involved variables, from which $k$ training patterns are extracted from the observations. Based on the training patterns, the forecasting model is derived by the modified LS-SVM. Finally, $Y_{t+s}$ is estimated by applying the forecasting model. The whole process can be summarized as below.

procedure Proposed-Forecasting-Method
  Training phase:
    Find the k nearest neighbors of Q using Eq. (5);
    Decide the lags of the involved variables using mutual information;
    Extract the set of k training patterns;
    Derive the forecasting model by the modified LS-SVM of Eq. (9);
  Forecasting phase:
    Forecast Y_{t+s} by the derived model using Eq. (16);
end procedure
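Chaining the sketches from the previous subsections gives the following end-to-end outline of the procedure for the bivariate case ($n = 1$). All caveats noted earlier apply; in particular, `ne` is a user-supplied normalized Euclidean distance and the mutual information estimator is only a stand-in.

```python
import numpy as np

def forecast_Y(X, Y, t, s, q, k, d, gamma, sigma, ne):
    """Sketch of Proposed-Forecasting-Method: returns an estimate of Y_{t+s}."""
    # Forecasting sequence Q of Eq. (3) and neighbor set S of Eq. (4).
    Q = []
    for j in range(t - q, t + 1):
        Q.extend([X[j], Y[j]])
    S = neighbor_set(X, Y, t, s, q)
    # Step 1: k nearest neighbors (stride 2 because X and Y are interleaved).
    _, idx = k_nearest_neighbors(Q, S, k, ne, stride=2)
    # Each neighbor S[i] covers times i..i+q, so its target is Y[i+q+s].
    H = np.array([Y[i + q + s] for i in idx])
    Z = [np.array([S[i][m] for i in idx]) for m in range(2 * q + 2)]
    # Step 2: lag selection by greedy mutual information (Eq. (7)).
    lags = greedy_lag_selection(Z, H, d)
    X_train = np.array([[S[i][m] for m in lags] for i in idx])
    x = np.array([Q[m] for m in lags])
    # Step 3: weighted LS-SVM training (Eqs. (9)-(14)) and forecasting (Eq. (16)).
    g = pattern_weights(x, X_train, H, ne)
    b, alpha = fit_weighted_lssvm(X_train, H, g, gamma, sigma)
    return lssvm_predict(X_train, alpha, b, x, sigma)
```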

4. Example

Suppose we have a series of observations $X_0, Y_0, X_1, Y_1, \ldots, X_{19}, Y_{19}$, as shown in Table 2. We want to forecast the value of $Y_{21}$ based on the given data. For this case, we have $t = 19$ and $s = 2$. Let $q = 1$, $k = 6$, and $d = 2$. According to Eq. (3), the forecasting sequence $Q$ is

$Q = [X_{18}, Y_{18}, X_{19}, Y_{19}] = [1.140, 1.066, 1.174, 1.114]$


Table 2
Time series example.

Time index   X       Y       Hybrid distance
0            1.118   1.073   –
1            1.116   1.062   0.2469
2            1.005   0.963   0.7192
3            0.903   0.874   0.9059
4            1.099   1.048   0.8329
5            1.131   1.078   0.1278
6            0.969   1.084   0.6813
7            1.093   1.106   0.4349
8            1.125   1.104   0.2046
9            1.091   1.002   0.5280
10           0.943   0.897   0.9089
11           1.138   1.099   0.7967
12           1.149   1.137   0.1120
13           1.155   1.136   0.2153
14           1.158   1.130   0.2248
15           1.161   1.097   0.2697
16           1.065   0.990   0.6371
17           0.944   0.870   0.9207
18           1.140   1.066   –
19           1.174   1.114   –


and according to Eq. (4), the neighbor set $S$ contains the following 17 neighbors:

$S_1 = [X_0, Y_0, X_1, Y_1] = [1.118, 1.073, 1.116, 1.062]$,
$S_2 = [X_1, Y_1, X_2, Y_2] = [1.116, 1.062, 1.005, 0.963]$,
$\ldots$
$S_{16} = [X_{15}, Y_{15}, X_{16}, Y_{16}] = [1.161, 1.097, 1.065, 0.990]$,
$S_{17} = [X_{16}, Y_{16}, X_{17}, Y_{17}] = [1.065, 0.990, 0.944, 0.870]$.

Then we proceed with the training steps as follows.

• Finding the k nearest neighbors. We calculate the hybrid distances between $Q$ and $S_i$, $i = 1, 2, \ldots, 17$, by Eq. (5). The distances are shown in the fourth column of Table 2. For example, the hybrid distance between $Q$ and $S_{17}$ is 0.9207, shown in the row with time index 17, and the hybrid distance between $Q$ and $S_{16}$ is 0.6371, shown in the row with time index 16. The 6 nearest neighbors of $Q$ are $S_{12}, S_5, S_8, S_{13}, S_{14}$, and $S_1$. Therefore, we have $t_1 = 12$, $t_2 = 5$, $t_3 = 8$, $t_4 = 13$, $t_5 = 14$, and $t_6 = 1$ for Eq. (6).
• Lag selection. According to Eq. (7), we set

$Z_1 = [X_{11}, X_4, X_7, X_{12}, X_{13}, X_0]^T$,
$Z_2 = [Y_{11}, Y_4, Y_7, Y_{12}, Y_{13}, Y_0]^T$,
$Z_3 = [X_{12}, X_5, X_8, X_{13}, X_{14}, X_1]^T$,
$Z_4 = [Y_{12}, Y_5, Y_8, Y_{13}, Y_{14}, Y_1]^T$,
$H = [Y_{14}, Y_7, Y_{10}, Y_{15}, Y_{16}, Y_3]^T$.    (17)

Firstly, we compute $MI(Z_1, H) = 0.2500$, $MI(Z_2, H) = 0.2083$, $MI(Z_3, H) = 0.2083$, and $MI(Z_4, H) = 0.4583$. Since $MI(Z_4, H)$ is the largest, $Z_4$ is selected. Next, we compute $MI(\{Z_4, Z_1\}, H) = 0.1667$, $MI(\{Z_4, Z_2\}, H) = 0.1250$, and $MI(\{Z_4, Z_3\}, H) = 0.2917$. Since $MI(\{Z_4, Z_3\}, H)$ is the largest, $Z_3$ is selected. Then we stop. Note that $Z_3$ and $Z_4$ correspond to the third and fourth elements, respectively, of each nearest neighbor. Therefore, we extract the following 6 training patterns:

$\mathbf{x}_1 = [X_{12}, Y_{12}]$, $y_1 = Y_{14}$;
$\mathbf{x}_2 = [X_5, Y_5]$, $y_2 = Y_7$;
$\mathbf{x}_3 = [X_8, Y_8]$, $y_3 = Y_{10}$;
$\mathbf{x}_4 = [X_{13}, Y_{13}]$, $y_4 = Y_{15}$;
$\mathbf{x}_5 = [X_{14}, Y_{14}]$, $y_5 = Y_{16}$;
$\mathbf{x}_6 = [X_1, Y_1]$, $y_6 = Y_3$;    (18)

and the input for forecasting $Y_{21}$ is $\mathbf{x} = [X_{19}, Y_{19}]$.


• Model derivation. We use the extracted training patterns $\{(\mathbf{x}_i, y_i)\}_{i=1}^{6}$ to derive the forecasting model. Firstly, we create the modified LS-SVM of Eq. (9). The weights $g_j$, $j = 1, 2, \ldots, 6$, are calculated by Eq. (10). For example, $g_1$ is

$g_1 = \exp\left(-\dfrac{g_{1,1} + g_{2,1}}{2}\right) = 0.7326$

where

$g_{1,1} = NH(\mathbf{x}, \mathbf{x}_1) = 0.1120$

and

$g_{2,1} = \dfrac{|y_1 - \mathrm{median}(\mathbf{y})|}{\max\{|y_j - \mathrm{median}(\mathbf{y})|\}} = 0.5103.$

Then we solve for the Lagrange multipliers in Eq. (14) and obtain the optimal forecasting model of Eq. (16).

Finally, we apply $\mathbf{x} = [X_{19}, Y_{19}]$ to the forecasting model, and the forecast of $Y_{21}$ is

$\hat{Y}_{21} = 1.0917.$

5. Experimental results

We present the results of several experiments with real-world time series datasets to demonstrate the effectiveness of our proposed Local Modeling Approach (LMA), and compare it with other existing methods in this section. To evaluate the performance of each method, several metrics are adopted [49], including the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE), defined below:

$\mathrm{MAE} = \dfrac{\sum_{i=1}^{N_t} |y_i - \hat{y}_i|}{N_t},$    (19)

$\mathrm{MSE} = \dfrac{\sum_{i=1}^{N_t} (y_i - \hat{y}_i)^2}{N_t},$    (20)

$\mathrm{RMSE} = \sqrt{\dfrac{\sum_{i=1}^{N_t} (y_i - \hat{y}_i)^2}{N_t}},$    (21)

$\mathrm{MAPE} = \dfrac{100 \sum_{i=1}^{N_t} \frac{|y_i - \hat{y}_i|}{y_i}}{N_t}$    (22)

where $N_t$ is the number of testing data, and $y_i$ and $\hat{y}_i$ are the actual output value and the forecasted output value, respectively, for $i = 1, 2, \ldots, N_t$.
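For reference, Eqs. (19)-(22) translate directly into a few lines of Python; this helper is illustrative and is not the evaluation program used in the experiments.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and MAPE of Eqs. (19)-(22); y_true must be nonzero
    for MAPE to be well defined."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y_true))
    return mae, mse, rmse, mape
```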

Some points should be mentioned here. Firstly, model derivation in our approach is performed using the LS-SVM program provided by the authors of [65] on the website of [44]. Once the training patterns are obtained, they are fed into the LS-SVM program, and the forecasting model is derived automatically; the program can determine the settings of the hyper-parameters by itself, without user intervention. Secondly, not all of the metrics mentioned above are used in every experiment. To compare with other methods, we borrow the experimental results of these methods from the literature; since different metrics are adopted in different cited papers, the metrics used in each experiment can differ. Thirdly, training errors are not available in the cited papers; therefore, we present only the training errors of LMA in the experiments.

5.1. One-step forecasting

Four experiments are conducted on the datasets Poland Electricity [51], Laser [40], Sunspot [63], and TAIEX [66] to show the one-step forecasting performance of different methods. The first three datasets are used for univariate forecasting, while the last is used for multivariate forecasting.

5.1.1. Poland Electricity dataset

The Poland Electricity dataset [51] records the electricity load of Poland, covering 1500 days in the 1990s. Only one variable, the output, is involved in this dataset. The first 1000 values are used for training, and the remaining 500 values are used for testing. Some statistics [35] of the training data, including the minimum (Min), maximum (Max), Mean, Median, Mode, standard deviation (Std), and Range, are shown in Table 3. The MAE and RMSE obtained by several one-step forecasting methods are shown in Table 4. These methods include ANFIS [32], a neural network based on the MATLAB toolbox (NN-MAT) [24], ARMA [6], Sorjamaa et al.'s method [59], and our LMA. For LMA, we set $k = 150$, $q = 9$, and $d = 6$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 2.4 \times 10^{10}$ and $\sigma^2 \approx 3 \times 10^8$; these settings are determined in the training phase and are not changed in the testing phase. Note that the errors for Sorjamaa in the table are taken directly from [59], in which no training errors are listed.


Table 3
Some statistics of four datasets.

         Poland Electricity   Laser    Sunspot   EUNITE LoadMax   EUNITE Temperature
Min      0.6385               2        0         464              −14.225
Max      1.3490               255      154.400   876              26.525
Mean     0.9885               59.893   43.472    670.790          8.802
Median   0.9864               43       37.600    677              8.938
Mode     0.6385               14       11        799              20.250
Std      0.1632               49.279   34.267    93.541           8.686
Range    0.7105               253      154.400   412              40.750

Table 4
Performance of one-step forecasting on Poland Electricity.

           MAE       RMSE
ANFIS      0.2985    0.5225
NN-MAT     0.0639    0.0835
ARMA       0.3744    0.4807
Sorjamaa   0.1856    0.3351
LMA        0.0245    0.0454
           (0.0194)  (0.0311)

Fig. 2. One-step forecasted results by LMA and ARMA on Poland Electricity.


The parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA, and the best result on each metric in Table 4 is that of LMA. As can be seen from the table, LMA performs better than the others in both MAE and RMSE. Fig. 2 shows the forecasted results obtained by LMA and ARMA; for clarity, only the first 100 forecasted values are shown in this figure. From this figure, we can see that LMA provides a better match than ARMA.

5.1.2. Laser dataset

The Laser dataset [40] contains a series of chaotic laser data obtained in a physics laboratory experiment. Only one variable, the output, is involved in this dataset. The length of the time series is 10,093, but we use only the first 5700 observations: the first 5600 observations are used for training and the remaining are used for testing. Some statistics of the training data are shown in Table 3. The forecasted results obtained by LMA and ANFIS are shown pictorially in Fig. 3. For LMA, we set $k = 150$, $q = 9$, and $d = 7$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 2.4 \times 10^7$ and $\sigma^2 \approx 700$; these settings are determined in the training phase and are not changed in the testing phase. Table 5 shows the MAE and RMSE obtained by different methods. The parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA. Again, LMA performs better than the others in both MAE and RMSE.


Fig. 3. One-step forecasted results by LMA and ANFIS on Laser.

Table 5
Performance of one-step forecasting on Laser.

           MAE       RMSE
ANFIS      5.7257    14.4748
NN-MAT     3.4300    10.3494
ARMA       14.8915   27.5397
Sorjamaa   1.8288    4.2405
LMA        0.7236    0.9570
           (0.6031)  (0.8754)


5.1.3. Sunspot dataset

The Sunspot dataset [63] records observations of annual sunspot activity. The study of sunspot activity has practical importance to geophysicists, environmental scientists, and climatologists. The dataset records the annual number of sunspots from 1700 through 1987, giving a total of 288 observations. Only one variable, the output, is involved in this dataset. The data from 1700 to 1920 are used for training, and the data from 1921 to 1987 are used for testing. Some statistics of the training data are shown in Table 3. In this experiment, we compare our method with ARIMA [70], Multiple ANN [2], ARIMA and Neural Network [70], ANN (p, d, q) [36], and Generalized ANNs/ARIMA [37]. Table 6 shows the MAE, MSE, and MAPE obtained by these methods. Note that the numbers in this table, except for ANFIS and LMA, are borrowed from the cited papers; a dash in an entry indicates that no number is provided in the source. The parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA. For LMA, we set $k = 90$, $q = 11$, and $d = 9$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 3600$ and $\sigma^2 \approx 700$; these settings are determined in the training phase and are not changed in the testing phase. From this table, we can see that our method, LMA, provides the best performance. Fig. 4 shows the forecasted results obtained by LMA and ANFIS.

5.1.4. TAIEX dataset

In this experiment, we forecast the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) [66], with additional inputs from Dow Jones [15] and NASDAQ [48]. The period we work with is from 1999 through 2004. The data from January to October of each year are used for training, and the data from November to December are used for testing.

Table 6
Performance of one-step forecasting on Sunspot.

                               MAE      MSE       MAPE
ANFIS                          16.31    613.87    36.10
ARIMA in [70]                  13.03    306.08    –
Multiple ANN [2]               –        280.48    30.69
ARIMA & Neural Network [70]    12.78    280.16    –
ANN (p, d, q) [36]             12.18    234.21    –
Generalized ANNs/ARIMA [37]    11.45    218.64    –
LMA                            10.29    215.71    20.92
                               (7.88)   (101.97)  (15.37)


Fig. 4. One-step forecasted results by LMA and ANFIS on Sunspot.

Table 7
Some statistics of the TAIEX related data in 2004.

            Min    Max     Mean      Median    Mode     Std     Range
NASDAQ      1752   2140    1958.95   1960.26   1969.99  81.59   388
Dow Jones   9758   10,738  10263.81  10244.93  9757.81  224.80  980
TAIEX       5317   7034    6057.18   5949.26   5316.87  458.25  1717

Table 8
RMSE of one-step forecasting on TAIEX.

With Dow Jones:
                    1999     2000      2001      2002     2003     2004     Average
ANFIS               156.72   280.85    521.64    1871.23  1805.04  192.69   799.69
ARMAX               102.53   126.41    115.36    63.19    52.24    53.81    85.76
NN-MAT              103.78   132.46    116.27    67.24    55.01    54.10    88.14
Chen & Chang [9]    101.97   148.85    113.70    79.81    64.08    82.32    98.46
Chen et al. [10]    115.47   127.51    121.98    74.65    66.02    58.89    94.09
Chen & Chen [11]    99.87    122.75    117.18    68.45    53.96    52.55    85.79
T2NFS [43]          97.30    124.28    110.01    60.05    52.24    51.80    82.61
Chen et al. [12]    102.34   131.25    113.62    65.77    52.23    56.16    86.89
LMA                 100.73   123.65    117.59    62.96    52.17    53.44    85.09
                    (91.48)  (110.83)  (103.77)  (48.36)  (45.16)  (44.73)

With NASDAQ:
                    1999     2000      2001      2002     2003     2004     Average
ANFIS               437.58   3708.04   425.19    601.06   3323.86  328.36   1470.68
ARMAX               102.37   113.06    115.79    63.74    55.61    53.33    83.98
NN-MAT              103.71   131.31    118.17    67.89    53.59    53.38    88.01
Chen & Chang [9]    123.64   131.10    115.08    73.06    66.36    60.48    94.95
Chen et al. [10]    119.32   129.87    123.12    71.01    65.14    61.94    95.07
Chen & Chen [11]    102.60   119.98    114.81    69.07    53.16    53.57    85.53
T2NFS [43]          99.63    121.94    108.68    64.83    51.05    51.81    82.99
Chen et al. [12]    102.11   131.30    113.83    66.45    52.83    54.17    86.78
LMA                 102.84   123.56    115.26    65.59    51.59    51.44    85.05
                    (93.45)  (113.32)  (109.23)  (56.16)  (43.80)  (42.10)

With Dow Jones and NASDAQ:
                    1999     2000      2001      2002     2003     2004     Average
ANFIS               434.02   3490.57   656.78    1340.88  574.10   201.52   1116.31
ARMAX               106.57   121.10    113.40    66.88    52.89    51.82    85.44
NN-MAT              109.29   132.98    114.83    63.73    55.65    51.50    88.00
Chen & Chang [9]    106.34   130.13    113.33    72.33    60.29    68.07    91.75
Chen et al. [10]    116.64   123.62    123.85    71.98    58.06    57.73    91.98
Chen & Chen [11]    101.33   121.27    114.48    67.18    52.72    52.27    84.88
T2NFS [43]          99.04    120.90    103.84    58.10    52.49    51.73    81.02
LMA                 102.57   121.01    114.32    56.95    51.95    50.71    82.92
                    (94.29)  (114.23)  (110.27)  (54.10)  (45.32)  (46.98)

Number of zeros
in each year        32       33        11        8        8        8


Some statistics of the training data for 2004 are shown in Table 7. Table 8 shows the RMSE obtained by different methods; in this table, we also list the results of Chen [9], Chen et al. [11], T2NFS [43], and Chen et al. [10,12]. Results under three conditions are listed: with Dow Jones as the additional input, with NASDAQ as the additional input, and with both Dow Jones and NASDAQ as additional inputs. For LMA, we set $q = 14$, $d = 4$, and $k \approx 100$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 2 \times 10^6$ to $6 \times 10^6$ and $\sigma^2 \approx 1000$ to $3000$; these settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers in the second row of each LMA entry indicate the training errors of LMA. Our method achieves better RMSE in 2002, 2003, and 2004 than most of the other methods. However, LMA performs worse than [11,43] in 1999, 2000, and 2001. The reason is that in those years many zeros are contained in the Dow Jones and NASDAQ data, owing to their being asynchronous with TAIEX: if Dow Jones and NASDAQ were closed while TAIEX was open on some day, a zero was recorded in Dow Jones and NASDAQ for that day. As the last row of Table 8 shows, there are 32 such zeros in these two datasets in 1999. Because our method is influenced by the obtained nearest neighbors, these zeros are harmful to it; note that they also greatly deteriorate the performance of ANFIS. Nevertheless, our method still performs better than [9,10] in 1999, 2000, and 2001.

5.2. Multi-step forecasting

Two experiments are conducted on the datasets Laser [40] and EUNITE [16] to show the multi-step forecasting performance of different methods. The first dataset is used for univariate forecasting, while the second is used for multivariate forecasting.

5.2.1. Laser dataset

The division into training and testing data is identical to that for one-step forecasting. Tables 9 and 10 show the MAE and RMSE, respectively, obtained by different methods for 2-step to 12-step forecasting. For LMA, we set $k = 150$, $q = 9$, and $d = 7$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 3 \times 10^4$ to $7.5 \times 10^6$ and $\sigma^2 \approx 700$ to $2 \times 10^5$; these settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers listed in the LMA column of these two tables indicate the training errors of LMA. As can be seen, our method performs much better than the other methods in both MAE and RMSE. Fig. 5 shows some forecasted results obtained by LMA and NN-MAT; apparently, our forecasted results match the actual values more closely.

Table 9
MAE of multi-step forecasting on Laser.

          ANFIS     NN-MAT    ARMA      Sorjamaa   LMA
2-step    9.1072    7.0461    17.6998   1.8294     1.7388 (0.6298)
3-step    7.0787    6.7370    17.0023   3.1721     2.6654 (0.7722)
4-step    7.9532    9.5677    19.9313   4.5941     2.4529 (0.7682)
5-step    8.2511    7.3968    21.0612   3.8067     2.6971 (0.8238)
6-step    8.7864    7.5659    21.8979   4.6517     2.7371 (0.7889)
7-step    9.4033    9.5344    22.1063   5.0789     2.1663 (0.7915)
8-step    9.4587    9.5561    23.6841   8.3264     2.4619 (0.7720)
9-step    10.9792   10.9181   28.5120   6.4187     3.3041 (0.8603)
10-step   12.7650   10.0274   29.4743   6.9559     3.7495 (0.8772)
11-step   14.7686   9.5501    28.9017   19.0988    4.9000 (1.0343)
12-step   13.1312   13.8144   30.7653   17.5508    4.9966 (1.0965)

Table 10
RMSE of multi-step forecasting on Laser.

          ANFIS     NN-MAT    ARMA      Sorjamaa   LMA
2-step    22.5924   15.0670   33.2179   4.6673     4.2313 (0.9089)
3-step    17.9022   18.2621   33.2998   10.7440    7.4607 (1.3115)
4-step    20.1061   24.4235   36.2056   15.7759    6.5999 (1.3107)
5-step    19.5676   19.8554   37.9758   11.0067    8.3976 (1.4436)
6-step    18.4178   17.9245   38.6497   12.2136    7.8629 (1.3684)
7-step    20.5050   19.6204   38.9259   12.2059    5.9445 (1.3478)
8-step    21.3847   19.1559   39.2201   14.3293    5.1336 (1.3172)
9-step    24.8706   20.8771   44.2355   14.3791    7.7728 (1.4507)
10-step   24.7307   24.2657   45.2404   14.8786    9.9128 (1.3533)
11-step   29.2389   22.5536   43.4734   32.3007    12.6468 (1.6861)
12-step   27.3947   30.1944   43.9369   29.2913    12.7318 (1.8214)


Fig. 5. Multi-step forecasted results by LMA and NN-MAT on Laser: (a) 5-step forecasting; (b) 9-step forecasting.

Table 11
Performance of multi-step forecasting on EUNITE.

                                       MAPE    MAX ERROR  # of inputs
ARMAX                                  3.69    67.56      20
ANFIS                                  4.54    94.74      5
Backpropagation [17]                   5.05    111.89     49
Gain Scaling [17]                      4.87    137.78     49
Gain Scaling with SS [17]              2.19    55.95      49
Gain Scaling with CIS and SS [17]      2.77    70.99      20
Early Stopping [17]                    1.95    40.28      49
Early Stopping with SS [17]            2.13    50.90      49
Early Stopping with CIS and SS [17]    2.87    71.26      20
Extended Bayesian Training [17]        1.75    55.64      40
L2-SVM with CV [17]                    3.52    60.39      49
L2-SVM with CIS and CV [17]            2.87    67.17      20
L2-SVM Gradient Descent [17]           2.07    59.78      45
Benchmark [16]                         1.98    51.42      –
LMA                                    1.71    40.99      12
                                       (0.14)  (73.28)


Fig. 6. Multi-step forecasted results by LMA and ARMAX on EUNITE.

Table 12
Performance of LMA, with different settings of k, on Laser.

k     MAE     RMSE
110   1.0775  1.9369
120   1.0344  1.7951
130   1.0212  1.9197
140   0.9904  1.6181
150   0.7236  0.9570
160   0.9348  1.5870
170   0.9376  1.6058
180   0.9766  1.5502
190   1.0046  1.6440
200   1.0804  1.9369


5.2.2. EUNITE dataset

The EUNITE dataset was provided by the EUNITE network for a well-known competition [16]. It contains the electricity load demand recorded every half hour from January 1997 through December 1998; additional inputs include average daily temperatures and information about holidays. The goal of the competition was to predict the maximum daily values of the electrical load for January 1999 (31 values altogether). As specified in the competition, we use the maximum daily values of electrical load (LoadMax) and the average daily temperatures (Temperature) as inputs; the information about holidays is not used. Some statistics of the training data are shown in Table 3. Table 11 shows the results obtained by a number of different methods. The numbers listed in the rows with citations are taken from [17]; the benchmark is the winner of the competition [8]. In this table, MAX ERROR indicates the maximum absolute error over the 31 forecasts and '# of inputs' indicates the number of inputs involved in forecasting. For LMA, we set $k = 200$, $q = 35$, and $d = 12$. The hyper-parameters set automatically by the LS-SVM program for LMA are $\gamma \approx 400$ and $\sigma^2 \approx 40$; these settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA. The forecasted results obtained by LMA and ARMAX are shown pictorially in Fig. 6, where the horizontal axis indicates the number of steps ahead to be forecasted. Apparently, our forecasted results match the actual values more closely.

6. Discussion

We discuss some issues related to our proposed method LMA.

6.1. Effect of k

In our experiments, the value of $k$ is set by trial and error. Once set, $k$ is not changed during the testing stage. Here, we would like to show how much the value of $k$ affects the performance of LMA. Table 12 shows the MAE and RMSE obtained by LMA on Laser, with $k$ varying between 110 and 200; in this experiment, $q$ and $d$ are fixed at 9 and 7, respectively. Note that different settings result in a variation of errors, and it is not easy to determine a good $k$ for a given application.


One way is to get help from experts who have domain knowledge about the underlying application. Another is to utilize an optimization method, e.g., a genetic algorithm (GA); however, optimization may take time.

6.2. Comparison among different measures of similarity

We adopt the hybrid distance to measure the similarity between two sequences in our approach. We show here that using the hybrid distance to find the nearest neighbors is a good idea.

Suppose the distance between two sequences is measured by the time index, i.e., two sequences with close time indices are considered close to each other. For example, consider the example in Section 4: if measured by the time index, the 6 nearest neighbors of $Q$ would be $S_{17}, S_{16}, S_{15}, S_{14}, S_{13}$, and $S_{12}$. Table 13 shows the MAE and RMSE obtained by LMA on Laser, with $k$ varying between 130 and 200, where the $k$ nearest neighbors are determined by the time index. As before, $q$ and $d$ are fixed at 9 and 7, respectively. We can see that the performance is poor in both MAE and RMSE. This indicates that the time index is less appropriate than the hybrid distance for determining the nearest neighbors of $Q$ for LMA.

Alternatively, the similarity between two sequences can be measured by the Euclidean distance between the original sequences only. Let us give an example to explain why the hybrid distance is more effective. For simplicity, suppose only one variable $Y$ is involved. Consider three sequences $A_1$, $A_2$, and $A_3$:

$A_1 = [5, 4, 3, 2, 1]$, $A_2 = [5, 4, 2, 0, -1]$, $A_3 = [5, 3, 1, 2, 3]$.

We have

$NE(A_1, A_2) = NE(A_1, A_3) = \dfrac{3}{4.69} = 0.6397.$    (23)

If we measure the similarity based on $NE(A_1, A_2)$ and $NE(A_1, A_3)$, then $A_2$ and $A_3$ are equally similar to $A_1$. Now we consider the differential sequences:

$F_1 = [-1, -1, -1, -1]$, $F_2 = [-1, -2, -2, -1]$, $F_3 = [-2, -2, 1, 1]$

and have

$NE(F_1, F_2) = \dfrac{1.414}{3.7417} = 0.3779$, $NE(F_1, F_3) = \dfrac{3.162}{3.7417} = 0.8450.$    (24)

Therefore, we have

$NH(A_1, A_2) = \dfrac{0.6397 + 0.3779}{2} = 0.5088$, $NH(A_1, A_3) = \dfrac{0.6397 + 0.8450}{2} = 0.7424.$    (25)

By taking trends into account, we have $NH(A_1, A_2) < NH(A_1, A_3)$ and conclude that $A_2$ is closer to $A_1$. This is reasonable, since both $A_1$ and $A_2$ have a downward trend while $A_3$ has a down-then-up trend. Table 14 shows the MAE and RMSE of one-step forecasting on Laser obtained by LMA with different similarity measures; here we set $k = 150$, $q = 9$, and $d = 7$. In this table, ED-SEQ stands for the Euclidean distance between original sequences, ED-FOD for the Euclidean distance between differential sequences, and HD-SEQ for the hybrid distance. As can be seen, the hybrid distance works more effectively than the other two alternatives. Note that in this work we simply use the factor 1/2 in Eq. (5); alternatively, it could be specified by a domain expert or learned through an optimization method, e.g., GA.
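The arithmetic of Eqs. (23)-(25) can be checked with a few lines of numpy. The normalizing denominators 4.69 and 3.7417 are taken as given from the text, since the exact normalization of [29] is not reproduced here; the signs of the differential sequences are restored from the definition in Section 3.1.1.

```python
import numpy as np

A1, A2, A3 = [5, 4, 3, 2, 1], [5, 4, 2, 0, -1], [5, 3, 1, 2, 3]
F1, F2, F3 = np.diff(A1), np.diff(A2), np.diff(A3)
# F1 = [-1 -1 -1 -1], F2 = [-1 -2 -2 -1], F3 = [-2 -2  1  1]

ne_a12 = np.linalg.norm(np.subtract(A1, A2)) / 4.69    # 3/4.69   = 0.6397
ne_a13 = np.linalg.norm(np.subtract(A1, A3)) / 4.69    # 3/4.69   = 0.6397
ne_f12 = np.linalg.norm(F1 - F2) / 3.7417              # 1.414/.. = 0.3779
ne_f13 = np.linalg.norm(F1 - F3) / 3.7417              # 3.162/.. = 0.8450

nh_a12 = (ne_a12 + ne_f12) / 2                         # 0.5088
nh_a13 = (ne_a13 + ne_f13) / 2                         # 0.7424 -> A2 is closer to A1
```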

Table 13
Performance of LMA, with distance measured by time index, on Laser.

k     MAE      RMSE
130   18.2506  26.6031
140   18.1728  27.5728
150   12.3552  26.2939
160   14.5660  29.1388
170   10.6852  28.2920
180   10.1166  27.0055
190   10.2233  29.9817
200   10.4542  30.0145


Table 14
Performance of LMA, with different similarity measures, on Laser.

         MAE     RMSE
ED-SEQ   0.9668  1.5667
ED-FOD   1.4750  3.0873
HD-SEQ   0.7236  0.9570

Table 15
Performance of LMA, with different settings of d, on Laser.

d         MAE     RMSE
4         1.2883  2.4091
5         0.9426  1.6350
6         0.9479  1.4245
7         0.7236  0.9570
8         0.8869  1.4860
9         0.9609  1.9711
10 (all)  1.0468  2.1052

Table 16
Performance of LMA, with different settings of q, on Poland.

q    MAE     RMSE
6    0.0264  0.0477
7    0.0251  0.0462
8    0.0249  0.0460
9    0.0245  0.0454
10   0.0256  0.0480
11   0.0258  0.0487
12   0.0259  0.0507

Table 17
Comparison between LS-SVM and modified LS-SVM on Laser and Poland.

                  Laser              Poland
                  MAE     RMSE       MAE     RMSE
LS-SVM            0.8984  1.5171     0.0271  0.0532
Modified LS-SVM   0.7236  0.9570     0.0245  0.0454

Table 18
Comparison between LS-SVM and modified LS-SVM on EUNITE.

                  MAPE    MAX ERROR
LS-SVM            1.7653  45.6756
Modified LS-SVM   1.7100  40.9900


6.3. Effect of d and q

In our experiments, the values of $d$ and $q$ are set by trial and error. Once set, $d$ and $q$ are not changed during the testing stage. Here, we would like to show how much the values of $d$ and $q$ affect the performance of LMA. Note that determining good values for $d$ and $q$ is not an easy task; seeking advice from experts surely is helpful, and applying an optimization method, e.g., GA, is also useful but can be time-consuming. Table 15 shows the MAE and RMSE of one-step forecasting on Laser obtained by LMA with different settings of $d$. For this table, $k$ and $q$ are fixed at 150 and 9, respectively, so the total number of candidate lags is 10 (the Laser dataset is univariate). We can see that using all the candidate lags, i.e., $d = 10$, does not provide the best performance; instead, the best performance occurs at $d = 7$. Table 16 shows the MAE and RMSE of one-step forecasting on Poland obtained by LMA, with $q$ varying between 6 and 12. For this table, $k$ and $d$ are fixed at 150 and 6, respectively. Note that different settings result in only a small variation of errors.


6.4. Comparison between LS-SVM and modified LS-SVM

Finally, we compare the performance of LS-SVM and the modified LS-SVM adopted in LMA. Table 17 shows the results of one-step forecasting on Laser and Poland, and Table 18 shows the results of multi-step forecasting on EUNITE. For LS-SVM, no weights $g_j$ appear in Eq. (9). As can be seen, the modified LS-SVM provides better performance than LS-SVM.

7. Conclusion

We have presented a machine learning based local modeling approach for time series forecasting. Several steps are involved in our approach. Firstly, the $k$ neighbors which are most similar to the given forecasting sequence are located. Secondly, proper lags associated with the relevant variables for forecasting are determined. Thirdly, an optimal forecasting model is derived by applying a modified LS-SVM; the derived model can then be used for forecasting. Our proposed approach has several advantages: it can produce adaptive forecasting models, it works for univariate and multivariate cases, and it works for one-step as well as multi-step forecasting. Several experiments have been conducted, and the results have shown the effectiveness of the proposed approach for time series forecasting.

One drawback of the proposed approach is the demanding cost of computing the distance between the forecasting sequence and each of its neighbors. Some algorithms have been proposed [34,46] for fast distance computation. Another possibility is to divide the observed data into different groups in advance by a clustering technique, e.g., fuzzy clustering [41] or the self-organizing multi-layer perceptron [18]. For any forecasting sequence, the cluster to which it is most similar is identified, and the sequences contained in that cluster are regarded as the nearest neighbors. In this way, the computational complexity can be much reduced.

Acknowledgments

The authors are grateful to the anonymous reviewers, the Associate Editor, and the Editor-in-Chief for their comments, which were very helpful in improving the quality and presentation of the paper.

References

[1] M. Abdollahzade, A. Miranian, H. Hassani, H. Iranmanesh, A new hybrid enhanced local linear neuro-fuzzy model based on the optimized singular spectrum analysis and its application for nonlinear and chaotic time series forecasting, Inform. Sci. 295 (2015) 107–125.
[2] R. Adhikari, R.K. Agrawal, A homogeneous ensemble of artificial neural networks for time series forecasting, Int. J. Comput. Appl. 32 (7) (2011) 1–8.
[3] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66.
[4] S.K. Bag, ANN based prediction of blast furnace parameters, Inst. Eng. 68 (1) (2007) 37–42.
[5] Y. Bao, T. Xiong, Z. Hu, PSO-MISMO modeling strategy for multistep-ahead time series prediction, IEEE Trans. Cybernet. 44 (5) (2014) 655–668.
[6] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, Time Series Analysis: Forecasting and Control, Wiley, 2008.
[7] M. Bozic, M. Stojanovic, Z. Stajic, N. Floranovic, Mutual information-based inputs selection for electric load time series forecasting, Entropy 15 (2013) 926–942.
[8] B.-J. Chen, M.-W. Chang, C.-J. Lin, Load forecasting using support vector machines: a study on EUNITE competition 2001, IEEE Trans. Power Syst. 19 (4) (2004) 1821–1830.
[9] S.-M. Chen, TAIEX forecasting based on fuzzy time series and fuzzy variation groups, IEEE Trans. Fuzzy Syst. 19 (1) (2011) 1–12.
[10] S.-M. Chen, Y.-C. Chang, Multi-variable fuzzy forecasting based on fuzzy clustering and fuzzy rule interpolation techniques, Inform. Sci. 180 (24) (2010) 4772–4783.
[11] S.-M. Chen, H.-P. Chu, T.-W. Sheu, TAIEX forecasting using fuzzy time series and automatically generated weights of multiple factors, IEEE Trans. Syst. Man Cybernet. Part A 42 (6) (2012) 1485–1495.
[12] S.-M. Chen, G.M.T. Manalu, J.-S. Pan, H.-C. Liu, Fuzzy forecasting based on two-factors second-order fuzzy-trend logical relationship groups and particle swarm optimization techniques, IEEE Trans. Cybernet. 43 (3) (2013) 1102–1117.
[13] Y. Chen, B. Yang, J. Dong, Time-series prediction using a local linear wavelet neural network, Neurocomputing 69 (4–6) (2006) 449–465.
[14] S.F. Crone, N. Kourentzes, Feature selection for time series prediction – a combined filter and wrapper approach for neural networks, Neurocomputing 73 (10–12) (2010) 1923–1936.
[15] Dow Jones Web Site. <http://www.djindexes.com/>.
[16] EUNITE Data Set. <http://neuron.tuke.sk/competition/index.php>.
[17] V.H. Ferreira, A.P. Alves da Silva, Toward estimating autonomous neural network-based electric load forecasters, IEEE Trans. Power Syst. 22 (4) (2007) 1554–1562.
[18] B. Gas, Self-organizing multi-layer perceptron, IEEE Trans. Neural Netw. 21 (11) (2010) 1766–1779.
[19] F. Gaxiola, P. Melin, F. Valdez, O. Castillo, Interval type-2 fuzzy weight adjustment for backpropagation neural networks with application in time series prediction, Inform. Sci. 260 (2014) 1–14.
[20] M. Ghiassi, H. Saidane, A dynamic architecture for artificial neural networks, Neurocomputing 63 (2005) 397–413.
[21] J.G. De Gooijer, R.J. Hyndman, 25 years of time series forecasting, Int. J. Forecast. 22 (3) (2006) 443–473.
[22] J.A. Guajardo, R. Weber, J. Miranda, A model updating strategy for predicting time series with seasonal patterns, Appl. Soft Comput. 10 (1) (2010) 276–283.
[23] E. Guresen, G. Kayakutlu, T.U. Daim, Using artificial neural network models in stock market index prediction, Expert Syst. Appl. 38 (8) (2011) 10389–10397.
[24] M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design, PWS Pub. Co., 1995.
[25] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2011.
[26] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2008.
[27] H.S. Hippert, D.W. Bunn, R.C. Souza, Large neural networks for electricity load forecasting: are they overfitted?, Int. J. Forecast. 21 (3) (2005) 425–434.
[28] Z. Huang, M.-L. Shyu, k-NN based LS-SVM framework for long-term time series prediction, in: 2010 IEEE International Conference on Information Reuse and Integration, 2010, pp. 69–74.

Page 18: A Weighted LS-SVM Based Learning System for Time Series Forecasting

116 T.-T. Chen, S.-J. Lee / Information Sciences 299 (2015) 99–116

[29] Z. Huang, M.-L. Shyu, Recent trends in information reuse and integration, in: Long-term Time Series Prediction using k-NN based LS-SVM Frameworkwith Multi-value Integration, Springer, Vienna, 2012, pp. 191–209 (Chapter 9).

[30] K.-C. Hung, K.-P. Lin, Long-term business cycle forecasting through a potential intuitionistic fuzzy least-squares support vector regression approach,Inform. Sci. 224 (2013) 37–48.

[31] J.-S.R. Jang, Fuzzy modeling using generalized neural networks and Kalman filter algorithm, in: Proceedings of the Ninth National Conference onArtificial Intelligence (AAAI-91), 1991, pp. 762–767.

[32] J.-S.R. Jang, ANFIS: adaptive-network based fuzzy inference systems, IEEE Trans. Syst. Man Cybernet. 23 (3) (1993) 665–685.[33] Z. Ji, B. Wang, S. Deng, Z. You, Predicting dynamic deformation of retaining structure by LSSVR-based time series method, Neurocomputing 137 (2014)

165–172.[34] N. Kaneko, S. Matsuzaki, M. Ito, H. Oogai, K. Uchida, Application of improved local models of large scale database-based online modeling to prediction

of molten iron temperature of blast furnace, ISIJ Int. 50 (7) (2010) 939–945.[35] H. Kantz, Nonlinear Time Series Analysis, Cambridge University Press, 2003.[36] M. Khashei, M. Bijari, An artificial neural network ðp;d; qÞ model for time series forecasting, Expert Syst. Appl. 37 (1) (2010) 479–489.[37] M. Khashei, M. Bijari, Which methodology is better for combining linear and nonlinear models for time series forecasting?, J Ind. Syst. Eng. 4 (4) (2011)

265–285.[38] A. Kraskov, H. Stgbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (6) (2004) 066138.[39] A. Kusiak, H. Zheng, Z. Song, Short-term prediction of wind farm power: a data mining approach, IEEE Trans. Energy Convers. 24 (1) (2009) 125–136.[40] Laser Time Series Data Set. <http://www-psych.stanford.edu/andreas/Time-Series/SantaFe.html>.[41] S.-J. Lee, C.-S. Ouyang, A neuro-fuzzy system modeling with self-constructing rule generation and hybrid SVD-based learning, IEEE Trans. Fuzzy Syst.

11 (3) (2003) 341–353.[42] W. Li, Mutual information functions versus correlation functions, J. Stat. Phys. 60 (5–6) (1990) 823–837.[43] C.-F. Liu, C.-Y. Yeh, S.-J. Lee, Application of type-2 neuro-fuzzy modeling in stock price prediction, Appl. Soft Comput. 12 (4) (2012) 1348–1358.[44] LS-SVM Program. <http://www.esat.kuleuven.be/sista/lssvmlab/>.[45] J. McNames, A nearest trajectory strategy for time series prediction, in: Proceedings of the International Workshop on Advanced Black-Box Techniques

for Nonlinear Modeling, K.U. Leuven, Belgium, 1998, pp. 112–128.[46] J. McNames, B. Widrow, J.H. Friedman, J.P. How, Innovations in Local Modeling for Time Series Prediction, 1999. <http://web.cecs.pdx.edu/mcnames/

Publications/Dissertation.pdf>.[47] A. Miranian, M. Abdollahzade, Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time

series prediction, IEEE Trans. Neural Netw. Learn. Syst. 24 (2) (2013) 207–218.[48] NASDAQ Web Site. <http://www.nasdaq.com/>.[49] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 2004.[50] C.-S. Ouyang, W.-J. Lee, S.-J. Lee, A TSK-type neuro-fuzzy network approach to system modeling problems, IEEE Trans. Syst. Man Cybernet. – Part B:

Cybernet. 35 (4) (2005) 751–767.[51] Poland Data Set. <http://research.ics.aalto.fi/eiml/datasets.shtml>.[52] N.I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Comput. Intell. Mag. 4 (2) (2009) 24–38.[53] A. Sfetsos, C. Siriopoulos, Time series forecasting with a hybrid clustering scheme and pattern recognition, IEEE Trans. Syst. Man Cybernet. Part A 34 (3)

(2004) 399–405.[54] G. Silviu, Information Theory with Applications, McGraw-Hill, 1977.[55] O. Song, B.S. Chissom, Fuzzy time series and its models, Fuzzy Sets Syst. 54 (3) (1993) 269–277.[56] O. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part I, Fuzzy Sets Syst. 54 (1) (1993) 1–9.[57] O. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part II, Fuzzy Sets Syst. 62 (1) (1994) 1–8.[58] A. Sorjamaa, J. Hao, A. Lendasse, Mutual information and k-nearest neighbors approximator for time series prediction, Lect. Notes Comput. Sci. 3657

(2005) 553–558.[59] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (16–18) (2007) 2861–2869.[60] J.H. Stock, M.W. Watson, Introduction to Econometrics, Addison-Wesley, 2010.[61] H. Stogbauer, A. Kraskov, S.A. Astakhov, P. Grassberger, Least-dependent-component analysis based on mutual information, Phys. Rev. E 70 (2004)

066123.[62] M.B. Stojanovic, M.M. Bozic, M.M. Stankovic, Z.P. Stajic, A methodology for training set instance selection using mutual information in time series

prediction, Neurocomputing 141 (2014) 236–245.[63] Sunspot Data Set. <http://sidc.oma.be/sunspot-data/>.[64] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation,

Neurocomputing 48 (1–4) (2002) 85–105.[65] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing Company, 2002.[66] TAIEX Web Site. <http://www.tese.com.tw/en/products/indices/tsec/taiex.php>.[67] F.E.H. Tay, L.J. Cao, Modified support vector machines in financial time series forecasting, Int. J. Forecast. 48 (1) (2002) 69–84.[68] S.S. Torbaghan, A. Motamedi, H. Zareipour, L.A. Tuan, Mediumterm electricity price forecasting, in: North American Power Symposium (NAPS) 2012,

2012, pp. 1–8.[69] W.W.S. Wei, Time Series Analysis: Univariate and Multivariate Methods, Pearson, 2005.[70] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.[71] L. Zhang, W.-D. Zhou, P.-C. Chang, J.-W. Yang, F.-Z. Li, Iterated time series prediction with multiple support vector regression models, Neurocomputing

99 (2013) 411–422.[72] H. Zou, Y. Yang, Combining time series models for forecasting, Int. J. Forecast. 20 (1) (2004) 69–84.

