Analysis of Big Dependent Data in Economics and Finance
Ruey S. Tsay, Booth School of Business, University of Chicago
September 2016
Outline
1 Big data? Machine learning? Data science? What is in it for economics and finance?
2 Real-world data are often dynamically dependent
3 A simple example: methods for independent data may fail
4 Trade-off between simplicity and reality
5 Some methods useful for analyzing big dependent data in economics and finance
6 Examples
7 Concluding remarks
Big dependent data
1 Accurate information is the key to success in the competitive global economy. Information age.
2 What is big data? High dimension (many variables)? Large sample size? Both?
3 Not all big data sets are useful. Confounding & noise
4 Need to develop methods to efficiently extract useful information from big data
5 Know the limitations of big data
6 Issues emerging from big data: privacy? ethical issues?
7 Focus on methods for analyzing big dependent data in economics and finance
What are available?
Statistical methods:
1 Focus on sparsity (simplicity)
2 Various penalized regressions, e.g. Lasso and its extensions
3 Various dimension reduction methods and models
4 Common framework used: independent observations, with limited extensions to stationary data
Real data are often dynamically dependent!
Some useful concepts in analyzing big data:
1 Parsimony vs sparsity: parsimony does not imply sparsity
2 Simplicity vs reality: trade-off between feasibility & sophistication
Parsimonious, not sparse
A simple example
y_t = c + \sum_{i=1}^{k} \beta x_{it} + \varepsilon_t = c + \beta \sum_{i=1}^{k} x_{it} + \varepsilon_t,

where k is large, the x_{it} are not perfectly correlated, and the \varepsilon_t are iid N(0, \sigma^2).
The model has three parameters, so it is parsimonious, but it is not sparse because y_t depends on all explanatory variables.
In some applications, \sum_{i=1}^{k} x_{it} is a close approximation to the first principal component. For example, the level of interest rates is important to an economy.
The fused Lasso can handle this difficulty in some situations.
What is LASSO regression?
Model (assume mean-adjusted):

y_i = \sum_{j=1}^{p} \beta_j X_{j,i} + \varepsilon_i.

Matrix form, with X the design matrix:

Y = X\beta + \varepsilon.

Objective function (applicable, in particular, when p > T):

\hat{\beta}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / T + \lambda \|\beta\|_1 \right),

where \lambda \ge 0 is a penalty parameter, \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|, and \|Y - X\beta\|_2^2 = \sum_{i=1}^{T} (y_i - X_i'\beta)^2.
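As a quick illustration, a minimal R sketch of this penalized objective; the inputs X, Y, beta, and lambda are hypothetical placeholders, not objects from the talk.

```r
# A minimal sketch of the Lasso objective (illustrative only).
lasso_objective <- function(beta, X, Y, lambda) {
  T <- length(Y)
  rss <- sum((Y - X %*% beta)^2) / T   # ||Y - X beta||_2^2 / T
  penalty <- lambda * sum(abs(beta))   # lambda * ||beta||_1
  rss + penalty
}
```

In practice one minimizes this objective with a dedicated solver such as lars or glmnet rather than by hand.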
What is the big deal?
Sparsity: Using convexity, LASSO is equivalent to the constrained problem

\hat{\beta}_{opt}(R) = \arg\min_{\beta:\ \|\beta\|_1 \le R} \|Y - X\beta\|_2^2 / T.

Old friend: Ridge regression

\hat{\beta}_{Ridge}(\lambda) = \arg\min_{\beta} \left( \|Y - X\beta\|_2^2 / T + \lambda \|\beta\|_2^2 \right), or
\hat{\beta}(R) = \arg\min_{\beta:\ \|\beta\|_2^2 \le R} \|Y - X\beta\|_2^2 / T.

Special case p = 2: \|Y - X\beta\|_2^2 / T is quadratic in \beta; the constraint region \|\beta\|_1 \le R is a diamond, whereas \|\beta\|_2^2 \le R is a circle. Because the quadratic contours tend to touch the diamond at a corner, LASSO leads to sparsity.
Computation and extensions
1 Optimization: Least angle regression (lars) by Efron et al. (2004) makes the computation very efficient.
2 Extensions:
  Group lasso: Yuan and Lin (2006). Subsets of X have specific meaning, e.g. treatment.
  Elastic net: Zou and Hastie (2005). Uses a combination of L1 and L2 penalties.
  SCAD: Fan and Li (2001). Nonconcave penalized likelihood [smoothly clipped absolute deviation (SCAD)].
  Various Bayesian methods: the penalty function serves as the prior.
3 Packages available in R: lars, glmnet, gamlr, gbm and many others.
A simulated example
p = 300, T = 150, X iid N(0,1), εi iid N(0,0.25).
y_i = x_{3,i} + 2(x_{4,i} + x_{5,i} + x_{7,i}) - 2(x_{11,i} + x_{12,i} + x_{13,i} + x_{21,i} + x_{22,i} + x_{30,i}) + \varepsilon_i

1 How? R demonstration (see the sketch below)
2 Selection of \lambda? Cross-validation (10-fold), measuring prediction accuracy
3 The commands lars and cv.lars of the package lars
4 The commands glmnet and cv.glmnet of the package glmnet
5 Relationship between the two packages (in glmnet, alpha = 1 gives the Lasso)
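A hedged R sketch of this simulated example; the random seed and the use of lambda.min are illustrative choices made here, not taken from the talk.

```r
# Hedged sketch: simulate the example and fit the Lasso with 10-fold CV.
library(glmnet)
set.seed(1)
T <- 150; p <- 300
X <- matrix(rnorm(T * p), T, p)
beta <- numeric(p)
beta[3] <- 1
beta[c(4, 5, 7)] <- 2
beta[c(11, 12, 13, 21, 22, 30)] <- -2
y <- drop(X %*% beta) + rnorm(T, sd = 0.5)        # error variance 0.25
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # 10-fold CV for lambda
coef(cvfit, s = "lambda.min")                     # sparse coefficient estimates
```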
Lasso may fail for dependent data
1 Data generating model: scalar Gaussian autoregressive, AR(3), model

  x_t = 1.9 x_{t-1} - 0.8 x_{t-2} - 0.1 x_{t-3} + a_t,  a_t ~ N(0, 1).

  Generate 2000 observations. See Figure 1.
2 Big data setup
  Dependent x_t: t = 11, ..., 2000
  Regressors: X_t = [x_{t-1}, x_{t-2}, ..., x_{t-10}, z_{1t}, ..., z_{10,t}], where the z_{it} are iid N(0,1).
  Dimension = 20, sample size 1990.
3 Run the Lasso regression via the lars package of R. See Figure 2 for the results. Lag 3, x_{t-3}, was not selected.
Lasso fails in this case.
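A hedged R sketch of this experiment; the seed and minor implementation details are assumptions made here.

```r
# Hedged sketch: simulate the AR(3) recursively (the process has a unit root,
# so arima.sim cannot be used) and run the Lasso via lars.
library(lars)
set.seed(2)
n <- 2000
a <- rnorm(n + 3)
x <- numeric(n + 3)
for (t in 4:(n + 3)) {
  x[t] <- 1.9 * x[t - 1] - 0.8 * x[t - 2] - 0.1 * x[t - 3] + a[t]
}
x <- x[-(1:3)]
idx <- 11:n                                            # sample size 1990
X <- sapply(1:10, function(k) x[idx - k])              # ten lags of x
Z <- matrix(rnorm(length(idx) * 10), ncol = 10)        # ten iid noise regressors
fit <- lars(cbind(X, Z), x[idx], type = "lasso")
plot(fit)                                              # lag 3 is typically not selected
```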
Figure: Time plot of the simulated AR(3) time series with 2000 observations.
Figure: Results of Lasso regression for the AR(3) series (standardized coefficients plotted against |beta|/max|beta|).
OLS works if we entertain AR models
Run the linear regression using the first three variables of X_t. Fitted model:

x_t = 1.902 x_{t-1} - 0.807 x_{t-2} - 0.095 x_{t-3} + \varepsilon_t,  \hat{\sigma}_\varepsilon = 1.01.

All estimates are statistically significant, with p-values less than 2.22 × 10^{-5}.
The residuals are well behaved, e.g. Q(10) = 12.23 with p-value 0.20 (after adjusting the degrees of freedom).
Simple time series method works for dependent data.
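A hedged sketch of this step, reusing the simulated x and the index vector idx from the earlier sketch.

```r
# Hedged sketch: OLS fit of the AR(3) with the first three lags only.
ar3 <- lm(x[idx] ~ x[idx - 1] + x[idx - 2] + x[idx - 3])
summary(ar3)                                             # all three lags significant
Box.test(residuals(ar3), lag = 10, type = "Ljung-Box")   # Ljung-Box check of residuals
```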
Why does lasso fail?
Two possibilities:
1 Scaling effect: Lasso standardizes each variable in X_t. For unit-root non-stationary time series, standardization might wash out the dependence in the stationary part.
2 Multicollinearity: Unit-root time series have strong serial correlations. [The ACF approaches 1 at all lags.]
This artificial example highlights the difference between independent and dependent data.
Need to develop methods for big dependent data!
Possible solutions
1 Re-parameterization using time series properties
2 Use different penalties for different parameters
The first approach is easier.
For the particular time series, we can define \Delta x_t = (1 - B) x_t and \Delta^2 x_t = (1 - B)^2 x_t. Then

x_t = 1.9 x_{t-1} - 0.8 x_{t-2} - 0.1 x_{t-3} + a_t
    = x_{t-1} + \Delta x_{t-1} - 0.1 \Delta^2 x_{t-1} + a_t
    = double unit root + single unit root + stationary + a_t.

The coefficients of x_{t-1}, \Delta x_{t-1}, \Delta^2 x_{t-1} are 1, 1, and -0.1, respectively.
Different frameworks for LASSO
The X-matrix of conventional LASSO consists of

(x_{t-1}, x_{t-2}, ..., x_{t-10}, z_{1t}, ..., z_{10,t}),

where the z_{it} are iid N(0,1).
Under the re-parameterization, the X-matrix becomes

(x_{t-1}, \Delta x_{t-1}, \Delta^2 x_{t-1}, ..., \Delta^2 x_{t-8}, z_{1t}, ..., z_{10,t}).

These two X-matrices provide theoretically the same information. However, the first one has high multicollinearity, but the second one does not, especially after standardization.
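A hedged sketch of the re-parameterized design, continuing the earlier simulation (x, idx, and Z are reused; the alignment details are assumptions made here).

```r
# Hedged sketch: Lasso on the re-parameterized design matrix.
dx  <- c(NA, diff(x))                         # Delta x_t
d2x <- c(NA, NA, diff(x, differences = 2))    # Delta^2 x_t
X2  <- cbind(x[idx - 1], dx[idx - 1], sapply(1:8, function(k) d2x[idx - k]))
fit2 <- lars(cbind(X2, Z), x[idx], type = "lasso")
plot(fit2)   # the stationary part is no longer masked by standardization
```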
Figure: Comparison of β-estimates of lars results under the two X-matrix formulations.
Theoretical justification
Focus on the particular series x_t used. Some properties of the series are
1 T^{-4} \sum_{t=1}^{T} x_t^2 \Rightarrow \int_0^1 \bar{W}^2, where \bar{W}(s) = \int_0^s W(u) du and W(s) is the standard Brownian motion.
2 T^{-5/2} \sum_{t=1}^{T} x_t \Rightarrow \int_0^1 \bar{W}
3 T^{-3} \sum_{t=1}^{T} x_t \Delta x_t \Rightarrow \int_0^1 \bar{W} W
4 T^{-2} \sum_{t=1}^{T} (\Delta x_t)^2 \Rightarrow \int_0^1 W^2
Standardization may wash out the \Delta x_{t-1} and \Delta^2 x_{t-1} parts.
Examples of big dependent data
1 Daily returns of U.S. stocks
2 Demand for electricity in 30-minute intervals
3 Daily spreads of CDS (credit default swaps) of selected companies
4 Monthly unemployment rates of the 50 U.S. states
5 Interest rates of an economy
6 Air pollution measurements at multiple locations and health risk; complex spatio-temporal data in general
Figure: Sample sizes of U.S. daily stock returns in 2012 and 2013: mean 6681, range = (6593, 6774).
Time series plot

Figure: Densities of daily log returns of U.S. stocks in 2012 and 2013.
Figure: Empirical densities of electricity demand in 30-minute intervals by day of the week (Monday through Sunday), from July 6, 1997 to March 31, 2007, Adelaide, Australia.
Figure: Time plots of monthly U.S. state unemployment rates from 1976.1 to 2015.9.
Some statistical methods
Goal: Extract useful information, including pooling.
1 Classification and cluster analysis
  K-means
  Tree-based classification
  Model-based classification
2 Factor models & extensions
  Orthogonal factor model
  Approximate factor model
  Dynamic factor model
  Constrained factor models (column, row constraints):
  X_t = R f_t C + e_t
3 Generalizations of Lasso methods to dependent data, e.g. LASSO for nowcasting vs MIDAS
Constrained factor models
Column (variable) constraint only: Tsai & Tsay (2010). Let z_t be a k-dimensional time series,

z_t = H \omega f_t + \varepsilon_t,  t = 1, ..., T,

where H is a k × r known matrix, f_t is an m-dimensional common factor, and \omega is an r × m matrix of unknown loading parameters.
For observed data in matrix form,

Z = F \omega' H' + \varepsilon.
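A minimal simulation sketch of this structure; all dimensions and values below are illustrative assumptions, not numbers from the talk.

```r
# Hedged sketch: simulate data from the column-constrained factor model
# Z = F omega' H' + E.
set.seed(3)
T <- 132; k <- 10; r <- 3; m <- 2
H <- matrix(0, k, r)
H[1:4, 1] <- 1; H[5:7, 2] <- 1; H[8:10, 3] <- 1   # known group constraints
omega <- matrix(rnorm(r * m), r, m)               # unknown loadings
Fmat  <- matrix(rnorm(T * m), T, m)               # common factors (F in the slide notation)
E <- matrix(rnorm(T * k, sd = 0.5), T, k)         # idiosyncratic noise
Z <- Fmat %*% t(omega) %*% t(H) + E               # observed T x k panel
```

The block structure of H mirrors the three-industry illustration on the next slide.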
A simple illustration
Monthly log returns of 10 stocks from 2001 to 2011:
1 Semi-conductor: TXN, MU, INTC, TSM
2 Pharmaceutical: PFE, MRK, LLY
3 Investment bank: JPM, MS, GS
The constraints H = [h_1, h_2, h_3], where
h_1 = (1,1,1,1,0,0,0,0,0,0)'
h_2 = (0,0,0,0,1,1,1,0,0,0)'
h_3 = (0,0,0,0,0,0,0,1,1,1)'
Table: Estimation Results of Constrained and Orthogonal Factor Models

        Constrained Model: L = Hω          Orthogonal Model: PCA
Tick     L1     L2     L3   Σε,i            L1     L2     L3   Σε,i
TXN    0.76   0.26   0.27   0.28          0.79   0.20   0.32   0.24
MU     0.76   0.26   0.27   0.28          0.67   0.36   0.29   0.34
INTC   0.76   0.26   0.27   0.28          0.79   0.18   0.33   0.23
TSM    0.76   0.26   0.27   0.28          0.80   0.27   0.16   0.26
PFE    0.44  -0.68   0.10   0.34          0.49  -0.64  -0.03   0.35
MRK    0.44  -0.68   0.10   0.34          0.40  -0.69   0.23   0.31
LLY    0.44  -0.68   0.10   0.34          0.45  -0.70   0.06   0.31
JPM    0.74   0.06  -0.43   0.27          0.72   0.02  -0.35   0.36
MS     0.74   0.06  -0.43   0.27          0.76   0.05  -0.43   0.25
GS     0.74   0.06  -0.43   0.27          0.75   0.12  -0.50   0.18
e.v.   4.58   1.65   0.88                 4.63   1.68   0.93
Variability explained: 70.6%              Variability explained: 72.4%
Both row and column constraints
Tsai et al. (2016): T observations and k variables. Data matrix form:

Z = F_1 \omega_1' H' + G F_2 \omega_2' + G F_3 \omega_3' H' + E,

where G denotes a known T × m row-constraint matrix.
Figure: The census regions and divisions of the United States
Figure: Time plots of monthly housing starts (in logarithms) of 9 U.S. divisions: 1997-2006.
Figure: Time series plots of common factors for a DCF model of order (r, p, q) = (2, 2, 2) via maximum likelihood estimation.
Figure: Time series plots for G F_2 ω_2' of a fitted DCF model of order (2,2,2). Maximum likelihood estimation is used.
Figure: Time series plots for F_1 ω_1' H' of a fitted DCF model of order (2,2,2). Maximum likelihood estimation is used.
Figure: Time series plots for G F_3 ω_3' H' of a fitted DCF model of order (2,2,2). Maximum likelihood estimation is used.
Matrix-valued variables
Consider simultaneously n macroeconomic variables in k countries:

        U.S.      Italy     Spain     ···   Canada
GDP     X_{11,t}  X_{12,t}  X_{13,t}  ···   X_{1k,t}
Unem    X_{21,t}  X_{22,t}  X_{23,t}  ···   X_{2k,t}
CPI     X_{31,t}  X_{32,t}  X_{33,t}  ···   X_{3k,t}
...     ...       ...       ...             ...
M1      X_{n1,t}  X_{n2,t}  X_{n3,t}  ···   X_{nk,t}

On-going: only preliminary results are available. See Chen et al. (2016).
Classification
A possible approach: use a two-step procedure
1 Transform dependent big data into functions, e.g. probability densities
2 Apply classification methods to the functional data
The density functions of daily log returns of U.S. stocks serve as an example.
We can then classify the density functions to make statistical inference.
Illustration of classification
Cluster Analysis of density functions
Consider the time series of density functions {f_t(x)}.
For simplicity, assume the densities are evaluated at equally-spaced grid points {x_1 < x_2 < ... < x_N} ∈ D with increment \Delta x. The data we have become {f_t(x_i) | t = 1, ..., T; i = 1, ..., N}.
Using the Hellinger distance (HD), we consider two methods:
K-means
Tree-based classification
Hellinger distance of two density functions
Let f(x) and g(x) be two density functions on the common domain D ⊂ R. Assume both density functions are absolutely continuous w.r.t. the Lebesgue measure. The Hellinger distance (HD) between f(x) and g(x) is defined as

H(f, g)^2 = \frac{1}{2} \int_D \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx = 1 - \int_D \sqrt{f(x) g(x)}\, dx.

Basic properties:
1 H(f, g) ≥ 0
2 H(f, g) = 0 if and only if f(x) = g(x) almost surely.
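On an equally-spaced grid the distance can be approximated directly; a minimal R sketch, where the density vectors and the grid spacing dx are assumed inputs.

```r
# Hedged sketch: squared Hellinger distance between two densities evaluated
# on a common equally-spaced grid (Riemann approximation of the integral).
hellinger2 <- function(f, g, dx) {
  0.5 * sum((sqrt(f) - sqrt(g))^2) * dx
}
```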
K-means method
For a given K, the K-means method seeks a partition of the densities, say C_1, ..., C_K, such that
1 \bigcup_{k=1}^{K} C_k = {f_t(x)}
2 C_i \cap C_j = ∅ for i ≠ j
3 the sum of within-cluster variation V = \sum_{k=1}^{K} V(C_k) is minimized, where the within-cluster variation is

  V(C_k) = \sum_{t_1, t_2 \in C_k} H(f_{t_1}, f_{t_2})^2

It turns out this can easily be done by applying the K-means method with squared Euclidean distance to the square-root densities {\sqrt{f_t(x)}}.
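A hedged sketch of that shortcut; fmat is an assumed T × N matrix whose row t holds f_t evaluated on the grid, and nstart is an arbitrary choice made here.

```r
# Hedged sketch: K-means on the square-root densities, which (up to a constant)
# minimizes the squared-Hellinger within-cluster variation.
cluster_densities <- function(fmat, K) {
  kmeans(sqrt(fmat), centers = K, nstart = 25)$cluster
}
```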
Example of K-means
Consider the 48 density functions of half-hour demand for electricity on Monday in Adelaide, Australia.
With K = 4 clusters, we have

k   Elements (time index)         Calendar hours
1   17 to 44                      8:00 AM to 10:00 PM
2   15, 16, 45 to 48, 1, 2, 3     7:00-8:00 AM; 10:00 PM-1:30 AM
3   4, 5, 13, 14                  1:30-2:30 AM; 6:00-7:00 AM
4   6 to 12                       2:30-6:00 AM

Result: the clusters capture daily activities, namely, (1) active period, (2) transition period, (3) light sleeping period, and (4) sound sleeping period.
Figure: Density functions of half-hour electricity demand on Monday in Adelaide, Australia. The sample period is from July 6, 1997 to March 31, 2007.
Figure: Results of K-means cluster analysis based on the squared Hellinger distance for electricity demand on Monday. Different colors denote different clusters.
Tree-based classification
Let Z_t = (z_{1t}, ..., z_{pt})' denote p covariates. We use an iterative procedure to build a binary tree, starting with the root C_0 = {f_t(x)}.
1 For each covariate z_{it}, let z_{i(j)} be the j-th order statistic.
  1 Divide C_0 into two sub-clusters
    C_{i,j,1} = {f_t(x) | z_{it} ≤ z_{i(j)}};  C_{i,j,2} = {f_t(x) | z_{it} > z_{i(j)}}
  2 Compute the sum of within-cluster variations
    H(i, j) = V(C_{i,j,1}) + V(C_{i,j,2})
  3 Find the smallest j, say v_i, such that H(i, v_i) = min_j {H(i, j)}.
2 Select i ∈ {1, ..., p}, say I, such that H(I, v_I) = min_i {H(i, v_i)}.
3 Use covariate z_{It} with threshold v_I to grow two new leaves, i.e.
  C_{1,1} = C_{I,v_I,1},  C_{1,2} = C_{I,v_I,2}
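A hedged sketch of the single-split search just described, reusing the hellinger2 helper above; fmat (T × N densities), a covariate vector z, and the grid spacing dx are assumed inputs.

```r
# Hedged sketch of one split: within-cluster variation and the best threshold.
within_var <- function(idx, fmat, dx) {
  if (length(idx) < 2) return(0)
  s <- 0
  for (a in idx) for (b in idx) s <- s + hellinger2(fmat[a, ], fmat[b, ], dx)
  s / 2                               # each unordered pair counted once
}
best_split <- function(z, fmat, dx) {
  cuts <- sort(unique(z))
  obj <- sapply(cuts, function(cut) {
    left <- which(z <= cut); right <- which(z > cut)
    within_var(left, fmat, dx) + within_var(right, fmat, dx)
  })
  cuts[which.min(obj)]                # smallest threshold attaining the minimum
}
```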
Tree-based procedure continued
Next, consider C1,1 and C1,2 as the root of a branch and applythe same procedure with their associated covariates to findcandidate for growth.The only modification is as follows: When considering C1,1, wetreat C1,2 as a leaf in computing the sum of within-clustervariations. Similarly, when considering C1,2 for further division,we treat C1,1 as a leaf in computing the sum of within-clustervariations.This growth-procedure is iterated until the number of clusters Kis reached.
Example of tree-based classification
Consider the density functions of U.S. daily log stock returns in 2012 and 2013.
Using the first-differenced VIX index as the explanatory variable and K = 4, we obtain 4 clusters as follows:

(−∞, −0.73], (−0.73, 0.39], (0.39, 1.19], (1.19, ∞).

The cluster sizes are 104, 259, 86, and 53, respectively.
Note that a positive z_t signifies an increase in market volatility (uncertainty).
What drove the U.S. financial market?
Figure: Time plots of the market fear factor (VIX index) and its change series: 2012-2013.
Figure: Results of tree-based cluster analysis for the daily densities of log returns of U.S. stocks in 2012 and 2013. The first-differenced series of the VIX index is used as the explanatory variable. The numbers of elements in the clusters are 53, 86, 259, and 104, respectively. The cluster classification is given in the heading of each plot.
Model-based classification
Work directly on the observed multiple time series.
1 Postulate a general univariate model for all time series, e.g. an AR(p) model
2 Time series in a cluster follow the same model: pool the data to estimate the common parameters
3 Time series in different clusters follow different models
4 May be estimated by Markov chain Monte Carlo methods
5 May employ scale mixtures of normal innovations to handle outliers
This approach has been widely studied, e.g. Wang et al. (2013) and Fruehwirth-Schnatter (2011), among others.
Application
1 Apply to the monthly unemployment rates of the 50 U.S. states.
2 Use out-of-sample predictions to compare with other methods, including the lasso.
3 For 1-step to 5-step ahead predictions, the model-based method works well in comparison. Wang et al. (2013, JoF).
             RMSE×10^4                    MAE×10^4
Method     m=1   m=2   m=3   m=4        m=1   m=2   m=3   m=4
UAR       1616  1492  1791  2073        879   994  1268  1386
VAR       2676  2095  2129  2759       1349  1353  1506  1624
Lasso25   1798  1833  2063  2504       1245  1250  1332  1401
Lasso15   1714  1798  1855  2028       1186  1228  1296  1399
G-Lasso   1877  1865  1882  1905       1291  1290  1306  1327
LVAR      1550  1716  1806  1904       1065  1298  1210  1355
Pls10     1239  1531  1679  1873        909  1028  1263  1226
Pls30     1395  1651  1835  1890        933  1092  1281  1320
Pls50     1685  1871  2006  1967        940  1158  1304  1377
Pls70     1914  2040  2182  1953        996  1222  1362  1432
Pls100    2187  2279  2313  2123       1099  1342  1480  1552
Pcr10     1276  1829  2077  2108        890  1073  1247  1415
Pcr30     1577  1837  2049  1769        888  1093  1261  1321
Pcr50     1546  1805  2017  1759        880  1035  1209  1260
Pcr70     1594  1837  2049  1769        886  1042  1221  1283
Pcr100    1649  2117  2202  2163       1068  1243  1324  1421
MBC       1607  1703  1809  1961        885  1035  1225  1361
rMBC      1225  1481  1691  1839        873  1027  1193  1295

Table: Root mean squared errors (RMSE) and mean absolute errors (MAE) of 1-step to 4-step ahead out-of-sample forecasts for various models applied to the 50 state unemployment rates. The forecasting period is from January 2006 to September 2011. In the table, m denotes the forecasting horizon. The models used are: univariate AR(4) model (UAR), traditional VAR(4) model (VAR), VAR(4) with LASSO and s = 0.25 of the L1 norm (Lasso25), VAR(4) with LASSO and s = 0.15 of the L1 norm (Lasso15), group LASSO (G-Lasso), large vector autoregression of Song and Bickel (LVAR), partial least squares with the first k components (Plsk, k = 10, 30, 50, 70, 100), principal component regression with the first k components (Pcrk, k = 10, 30, 50, 70, 100), model-based clustering (MBC), and robust model-based clustering (rMBC).
Functional PCA: Singular value decomposition
1 A tool to study the time evolution of the return distributions
2 Data set: in this particular instance, each density function is evaluated at 512 points and we have

  Y = [Y_{it} = f_t(x_i) | i = 1, ..., N; t = 1, ..., T]_{512×502}

3 Perform the singular value decomposition

  Y = (N − 1) U D V',

  where Y denotes the column-mean adjusted data matrix, U is an N × N unitary matrix, D is an N × T rectangular diagonal matrix, and V is a T × T unitary matrix.
4 This is a simple form of functional PCA. [With large samples, smoothing of the PCs is not needed.]
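A hedged R sketch of this decomposition; fmat is an assumed N × T matrix with fmat[i, t] = f_t(x_i), and centering by the mean density across days is an assumption made here.

```r
# Hedged sketch: functional PCA of the daily return densities via the SVD.
Yc  <- fmat - rowMeans(fmat)               # remove the mean density across days
dec <- svd(Yc)                             # Yc = U %*% diag(d) %*% t(V)
pc_functions <- dec$u                      # columns: principal component functions
scores <- diag(dec$d) %*% t(dec$v)         # loadings of each day on each PC
plot(dec$d^2 / sum(dec$d^2), type = "h")   # proportion of variance (scree plot)
```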
Scree plot

Figure: Scree plot of the PCA for daily return densities in 2012 and 2013.
The first 6 PC functions

Figure: The first 6 PC functions for the daily log return densities in 2012 and 2013.
The next 6 PC functions

Figure: The 7th-12th PC functions for the daily log return densities in 2012 and 2013.
Meaning of PC functions? 1st

Figure: Mean density ± 1st PC: peak and tails; mean + standardized 1st PC (red).
Meaning of PC functions? 2nd

Figure: Mean density ± 2nd PC: midrange returns.
Meaning of PC functions? 3rd

Figure: Mean density ± 3rd PC: curvature.
Approximate factor models
f_t(x) = \sum_{i=1}^{p} \lambda_{t,i} g_i(x) + \varepsilon_t(x),

where g_i(x) denotes the i-th common factor and \varepsilon_t(x) is the noise function.
1 A generalization of the orthogonal factor model, but it allows the error functions to be correlated.
2 Only asymptotically identified under some regularity conditions.
3 FPCA provides a way to estimate approximate factor models.
Loadings of the first PC function

Figure: Scatter plot of the loadings vs changes in the VIX index. The red line denotes a lowess fit.
Functional PC via Thresholding
1 Zero appears to be a reasonable and natural threshold
2 Regime 1: dvix ≥ 0, with 244 days. [Volatile (bad) state]
3 Regime 2: dvix < 0, with 258 days. [Calm (good) state]
4 Perform PCA of the density functions for each regime.
5 The differences are clearly seen.
6 Leads to different approximate factor models for the density functions
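A hedged sketch of the regime split; dvix is an assumed length-T vector of VIX changes aligned with the columns of fmat from the earlier sketch.

```r
# Hedged sketch: split the days by the sign of the VIX change and redo the
# SVD-based functional PCA within each regime.
regime1 <- dvix >= 0                                       # volatile state
regime2 <- dvix < 0                                        # calm state
svd1 <- svd(fmat[, regime1] - rowMeans(fmat[, regime1]))
svd2 <- svd(fmat[, regime2] - rowMeans(fmat[, regime2]))
```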
Scree plots

Figure: Scree plots of the PCA for each regime (dvix ≥ 0 and dvix < 0).
The first 6 PC functions

Figure: The first 6 PC functions of the daily log return densities for each regime; the red line is for the calm state, Regime 2.
Approximate factor models
1 Use approximate factor models with the first 12 principal component functions
2 Compare the overall fits with and without thresholding
3 For Regime 1 (positive dvix): randomly select day 17
4 For Regime 2 (negative dvix): randomly select day 420
5 Check: (a) observed vs fitted densities and (b) residuals with and without thresholding
6 With 12 components, both approaches fare well, but thresholding provides improvements.
Comparison: day 17 (in Regime 1)

Figure: Density of day 17 and its fits. Top plot: observed (black), all (red), thresholded (blue). Bottom plot: approximation errors, all (black), thresholded (red).
Comparison: day 420 (in Regime 2)

Figure: Density of day 420 and its fits. Top plot: observed (black), all (red), thresholded (blue). Bottom plot: approximation errors, all (black), thresholded (red).
Lasso and beyond
1 Need to exploit parsimony, beyond sparsity
2 Need to take into account prior knowledge. We have accumulated a lot of knowledge in diverse scientific areas. How can we take advantage of this knowledge?
3 Variable selection is not sufficient. More importantly, what are the proper measurements to take? What questions can a given big data set answer?
An illustration
Every country has many interest rate series:
1 they have different maturities
2 they serve different financial purposes
3 What is the information embedded in those interest rate series?
Consider U.S. weekly constant maturity interest rates:
1 From January 8, 1982 to October 30, 2015
2 Maturities: 3m, 6m, 1y, 2y, 3y, 5y, 7y, 10y, and 30y*
Figure: Time plots of U.S. weekly interest rates with different maturities: 1/8/1982 to 10/30/2015.
Figure: Scree plot of U.S. weekly interest rates.
Figure: Time plots of the first four principal components of U.S. weekly interest rates.
Implication?
In lasso-type analysis,
1 should we use the interest rate series directly, even with the group lasso? This leads to sparsity.
2 should we apply PCA first and then use the PCs? This leads to parsimony. (See the sketch below.)
3 should we develop other possibilities? Fused lasso? Factor models?
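A hedged sketch of the PCA-first route; rates is an assumed T × 9 matrix holding the nine maturity series.

```r
# Hedged sketch: PCA of the weekly interest-rate panel, then work with the PCs.
pca <- prcomp(rates)
screeplot(pca)          # how many components carry the information?
pcs <- pca$x[, 1:4]     # the first four principal components as inputs
```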
Concluding Remark
1 Big dependent data appear in many applications
2 Methods developed for independent big data may fail
3 Statistical methods for big dependent data are relatively under-developed
4 Some new challenges emerge; new opportunities exist
5 Simple modifications of the traditional methods might work well
6 Both theory and methods require further research