Functional Data Analysis: Techniques and Applications
R. Todd Ogden and Jeff Goldsmith
March 17, 2014
Outline
- Examples, definitions, notation
- Display
- Smoothing
- Functional principal components analysis
- Regression with functional predictors and/or responses
What is functional data?
Some examples...
- Child height as a function of age.
- Knee angle as children go through a gait cycle.
- Systolic blood pressure at various ages for 150 subjects.
- Examples of the S in Shakespeare's signature.
- Reaching motions made by a stroke patient.
- Curvature and radius of the carotid artery.
- Brain images.
Recurring example: DTI
[Figure: tract profiles from diffusion tensor imaging (fractional anisotropy vs. distance along tract)]
What is functional data?
Something like a definition:
"Observations on subjects that you can imagine as X_i(s_i), where s_i is continuous"
Functional notation is conceptual; observations are made on a finite discrete grid.
Some characteristics of functional data
The following are sometimes associated with functional data:
- High dimensional
- Temporal and/or spatial structure
- Interpretability across subject domains
Discretization of functional data
- Conceptually, we regard functional data as being defined on a continuum, e.g., X_i(t), 0 ≤ t ≤ 1.
- In practice, functional data are observed at a finite number of points.
Discretization of functional data
Dense functional data: Often, observations fall on a fine regular grid, i.e., x_i = (X_i(1/N), X_i(2/N), ..., X_i(1)): spectral data, imaging data, accelerometry, ...
Sparse functional data: In other situations, the points at which observations are taken are irregular, often random: CD4 count, blood pressure, etc.
- In such cases, some kind of interpolation is necessary.
Functional data are technically multivariate data!
Why not just apply multivariate techniques (MANOVA, clustering, multiple regression, etc.)?
- Any technique for functional data should take into account the structure of the data: results from multivariate data analyses are generally permutation-invariant, but results from functional data analyses should not be!
- Methodological developments in FDA are often extensions of corresponding multivariate techniques.
Functional data are often observed with measurement error
- X_i(t) is smooth (and continuously defined) but we observe
  x_i = (X_i(1/N) + ε_1, X_i(2/N) + ε_2, ..., X_i(1) + ε_N)
- It is common to smooth the data before any analysis (a topic we'll revisit soon)
- In other situations, accounting for measurement error is built in to the analysis procedure.
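As a concrete (simulated) illustration of this observation model, the sketch below generates a smooth curve on the grid 1/N, ..., 1 and adds iid noise. The sine curve and the noise level are arbitrary assumptions for illustration, not anything from the slides.

```python
import numpy as np

# Hypothetical illustration: a smooth curve X(t) observed with iid
# measurement error on the regular grid t = 1/N, 2/N, ..., 1.
rng = np.random.default_rng(0)
N = 100
t = np.arange(1, N + 1) / N             # observation grid
X = np.sin(2 * np.pi * t)               # assumed "true" smooth curve
x = X + rng.normal(scale=0.1, size=N)   # observed values X_i(j/N) + ε_j
```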
Comparison across observations
In order for functional data to be comparable across observations (e.g., across subjects), they must be observed on the same domain, i.e., t must be the same for X_1(t) and X_2(t).
In many cases, this is straightforward:
- Spectral data
Problematic for some other situations:
- Growth curves (for adolescents, "growth spurts" may not line up)
- Brain imaging data (structure is somewhat different from subject to subject)
In such cases it is often possible to register the data, e.g., using landmarks or by warping.
Summary measures for functional data
Suppose we have functional data {X_i(t), t ∈ T, i = 1, ..., n}.
Mean: µ(t) = E X_i(t).
- The mean is itself functional
- Typically, we assume that the mean is smooth
- "Raw" estimator is the sample mean: X̄(t) = (1/n) Σ_i X_i(t)
- A typical estimator of µ would be a smoothed version of X̄(t) (more on this later).
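A minimal sketch of the "raw" pointwise sample mean, assuming the curves are stored as an n × N matrix (rows = subjects, columns = grid points); the data here are simulated for illustration.

```python
import numpy as np

# Raw estimator of the mean function: average across subjects at each
# grid point. X holds simulated curves; in practice it would hold data.
rng = np.random.default_rng(1)
n, N = 20, 50
t = np.linspace(0, 1, N)
X = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=(n, N))

mean_raw = X.mean(axis=0)   # X̄(t_j), one value per grid point
```

In practice this raw mean would then be smoothed, as the slide notes.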
Summary measures for functional data
Suppose we have functional data {X_i(t), t ∈ T, i = 1, ..., n}.
Variance: Σ(s, t) = Cov(X(s), X(t)) = E[(X(s) − µ(s))(X(t) − µ(t))]
- This is a (two-dimensional) surface.
- "Raw" estimator is the sample covariance Σ̂(s, t), computed pointwise over subjects
- Would need to smooth this as well.
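A minimal sketch of the raw covariance surface on a grid, again assuming curves stored as an n × N matrix (simulated here); each entry of the result is the sample covariance between two grid points.

```python
import numpy as np

# Raw estimator of the covariance surface Cov(s, t) on the grid:
# the N x N sample covariance of the columns of the data matrix.
rng = np.random.default_rng(2)
n, N = 30, 40
t = np.linspace(0, 1, N)
scores = rng.normal(size=(n, 1))
X = scores * np.sin(np.pi * t) + rng.normal(scale=0.1, size=(n, N))

Sigma = np.cov(X, rowvar=False)   # entry (j, k) estimates Cov(X(t_j), X(t_k))
```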
Summary measures for functional data
[Figure: fractional anisotropy vs. distance along tract for all subjects]
Summary measures for functional data
[Figure: estimated covariance surface Cov(s, t)]
Beyond iid functional data
Although the iid case is quite common, other situations are possible:
- Multilevel functional data:
  - {X_ij(t), t ∈ T, i = 1, ..., n, j = 1, ..., J_i}
  - Example: repeated motions in gesture data
- Longitudinal functional data:
  - {X_ij(t, v_j), t ∈ T, i = 1, ..., n, j = 1, ..., J_i}
  - Example: DTI data (multiple clinical visits)
Common problems in functional data analysis
Some issues arise regularly in FDA:
- Data display and summarization
- Smoothing and interpolation
- Patterns in variability: principal component analysis
- Regression (with functional predictors, outcomes, or both)
Data display
Lots of tools for displaying data:
- Spaghetti plots
- Rainbow plots
- 3D rainbow plots
- Examples for all using DTI data follow; R code is available online
Spaghetti plot
[Figure: spaghetti plot of fractional anisotropy vs. distance along tract]
2D rainbow plot
[Figure: 2D rainbow plot of fractional anisotropy vs. distance along tract]
3D rainbow plot
[Figure: 3D rainbow plot of fractional anisotropy vs. distance along tract and PASAT score]
Smoothing
Why do we need smoothing?
- Data are often observed with error
- There's a need to interpolate to a common grid
How are we going to do smoothing?
- Use a known set of basis functions
- Regress observed data onto the known basis
Some common basis functions: Splines
[Figure: Fourier basis and B-spline basis functions φ_k(s)]
- Continuous
- Easily defined derivatives
- Good for smooth data
Some common basis functions: Wavelets
[Figure: an example wavelet function]
- Formed from a single "mother wavelet" function: ψ_jk(t) = 2^{j/2} ψ(2^j t − k)
- Orthonormal basis
- Particularly good when there are jumps, spikes, peaks, etc.
- Wavelet representation is sparse
Minimize sum of squares
Suppose we want to smooth a curve Y_i(t) observed with error. We can use
Y_i(t) = Σ_{k=1}^{K} c_ik ψ_k(t).
We only need to estimate the subject-specific scores c_ik; minimize SSE_i with respect to the c_ik, where
SSE_i = Σ_j (Y_i(t_j) − Σ_{k=1}^{K} c_ik ψ_k(t_j))²
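A minimal sketch of this least-squares basis smoothing for one simulated noisy curve. A polynomial basis stands in for the spline or wavelet bases discussed above; the curve and noise level are illustrative assumptions.

```python
import numpy as np

# Smooth one noisy curve by regressing it onto K fixed basis functions:
# minimize SSE = Σ_j (y_j - Σ_k c_k ψ_k(t_j))² over the scores c_k.
rng = np.random.default_rng(3)
N, K = 60, 5
t = np.linspace(0, 1, N)
y = np.exp(-t) + rng.normal(scale=0.05, size=N)   # observations with error

Psi = np.vander(t, K, increasing=True)            # N x K matrix of ψ_k(t_j)
c, *_ = np.linalg.lstsq(Psi, y, rcond=None)       # least-squares scores
y_smooth = Psi @ c                                # fitted smooth curve
```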
Example
[Figure: a noisy tract profile y plotted against distance along tract]
Tuning
For any curve, many possible smooths are available:
- Depends on the spline basis
- Depends on the number of basis functions
- Depends on the estimation procedure
"Tuning" is the process of adjusting the smoother to the data at hand. This is often implicit.
Example
[Figure: fits to the same noisy tract profile under different tuning choices]
Penalization
Rather than choosing a smoother "by hand", we could use a lot of basis functions but explicitly penalize "wiggliness".
Leads to a penalized SSE:
SSE_i = Σ_j (Y_i(t_j) − Ψ(t_j) c_i)² + λ Pen(Ψ c_i)
- Common penalties are on the derivatives (enforcing smoothness)
- Need to choose the tuning parameter λ
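A minimal sketch of penalized smoothing. For simplicity the "basis" is one coefficient per grid point and the penalty is a squared second-difference penalty, a standard discrete analogue of ∫(f″)²; this is an illustrative choice, not the slides' exact estimator.

```python
import numpy as np

# Penalized smoothing: minimize ||y - f||² + λ ||D f||², where D is the
# second-difference operator. The closed-form solution solves
# (I + λ DᵀD) f = y.
rng = np.random.default_rng(4)
N = 80
t = np.linspace(0, 1, N)
y = np.cos(2 * np.pi * t) + rng.normal(scale=0.1, size=N)

D = np.diff(np.eye(N), n=2, axis=0)   # (N-2) x N second-difference matrix
lam = 10.0                            # tuning parameter λ
f_hat = np.linalg.solve(np.eye(N) + lam * D.T @ D, y)
```

Larger λ forces smaller second differences, i.e., a smoother fit.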
Data-driven basis
- Previous bases don't depend on the data; only the loadings do.
- FPCA gives a "data-driven" basis: it is constructed from the observed data.
- Looks pretty similar mathematically:
  Y_i(t) = Σ_{k=1}^{K} c_ik ψ_k(t).
- The difference is that the ψ_k aren't pre-specified.
Data-driven basis
So where do the basis functions ψ_k come from?
- Construct the covariance matrix Σ
- (Remove the main diagonal, smooth)
- The spectral decomposition of Σ produces the basis functions ψ_k
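A minimal FPCA sketch following these steps: eigendecomposition of the sample covariance of simulated curves on a grid gives data-driven basis functions. The optional diagonal-removal and smoothing step is skipped here for brevity.

```python
import numpy as np

# FPCA via spectral decomposition of the sample covariance matrix:
# eigenvectors give the basis functions ψ_k, eigenvalues give the
# variance each component explains.
rng = np.random.default_rng(5)
n, N = 50, 40
t = np.linspace(0, 1, N)
X = (rng.normal(size=(n, 1)) * np.sin(np.pi * t)
     + rng.normal(size=(n, 1)) * np.cos(np.pi * t)
     + rng.normal(scale=0.05, size=(n, N)))

Xc = X - X.mean(axis=0)               # center the curves
Sigma = Xc.T @ Xc / (n - 1)           # N x N sample covariance
evals, evecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
order = np.argsort(evals)[::-1]
psi = evecs[:, order]                 # columns = estimated ψ_k, largest first
var_explained = evals[order] / evals.sum()
```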
Data-driven basis
Some properties of FPCA:
- The ψ_k are orthonormal (mutually orthogonal, each with unit norm)
- Also the most parsimonious basis expansion for a given data set
- Basis functions are often interpretable: they describe the major directions of variability in the observed data
Example
[Figure: raw and smoothed estimates of the covariance surface Cov(s, t)]
Example
[Figure: 1st PC for FA (67.9% of variability) and 2nd PC for FA (9.8%), each shown as the mean curve plus and minus a multiple of the PC]
Data-driven vs Pre-specified
- Data-driven bases are the most parsimonious for a given dataset, but may not transfer to new data
- Data-driven bases often work better for sparse data (borrowing strength to derive basis functions)
- Pre-specified bases often have better analytical properties (easily computed derivatives, known forms)
Regression modeling with functional data
- Scalar on function regression
- Function on scalar regression
- Function on function regression
Scalar on function regression: Example scenarios
X = temperature (over time) for the year
Y = total rainfall for one year

X = NIR spectrum
Y = water content of a sample

X = brain image
Y = clinical outcome
Example data: DTI
x_i(s) = fractional anisotropy along the corticospinal tract
Y_i = measure of cognitive function
[Figure: corticospinal tract profiles, fractional anisotropy vs. s]
Linear scalar-on-function regression model
Given data ({x_1(s), s ∈ S}, Y_1), ..., ({x_n(s), s ∈ S}, Y_n), the scalar-on-function regression model is:
Y_i = α + ∫ x_i(s) β(s) ds + ε_i,  i = 1, ..., n
Interpretation of the "coefficient function" β:
- Where β(s) > 0, larger values of x_i(s) lead to higher predicted Y.
- Where β(s) < 0, larger values of x_i(s) lead to lower predicted Y.
- Where β(s) = 0, x_i(s) has no effect on Y.
Coefficient Interpretation
[Figure: an observed tract profile X_i(s), the coefficient function β(s), their product X_i(s)β(s) with the area under the curve shaded, and the resulting functional contribution ∫ X_i(s)β(s) ds]
Scalar-on-function regression: The need for regularization
But the function x_i(s) is only observed at N points!
- x_i = (x_i(1/N), x_i(2/N), ..., x_i(1))ᵀ
- β = (β(1/N), β(2/N), ..., β(1))ᵀ
The model becomes
Y_i = α + ∫ x_i(s) β(s) ds + ε_i ≈ α + (1/N) x_iᵀ β + ε_i
If we're not thinking "functionally", this is like doing regression with n observations and N predictors!
To get reasonable fits, we must regularize in some way.
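A minimal sketch of the discretized model with one simple form of regularization. A ridge penalty stands in for the basis expansions and roughness penalties the slides develop; the data are simulated, and the intercept α is omitted for simplicity.

```python
import numpy as np

# Discretized scalar-on-function regression: Y_i ≈ (1/N) x_iᵀ β, fit
# with a ridge penalty because there are N predictors but only n << N
# effective observations.
rng = np.random.default_rng(6)
n, N = 40, 100
s = np.linspace(0, 1, N)
X = rng.normal(size=(n, N)).cumsum(axis=1) / np.sqrt(N)   # rough curves
beta_true = np.sin(2 * np.pi * s)
Y = X @ beta_true / N + rng.normal(scale=0.05, size=n)

Z = X / N                                 # design matrix for the Riemann sum
lam = 1.0                                 # ridge tuning parameter
beta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ Y)
```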
Basis functions
Possible basis functions: splines, orthogonal polynomials, principal components, wavelets, etc.
Let
x_i(s) = Σ_{k=1}^{K} c_ik ψ_k(s)
β(s) = Σ_{k=1}^{K} θ_k ψ_k(s)
This is now a K-dimensional regression problem.
Scalar-on-function regression: Basis function representation
Y_i = α + ∫ x_i(s) β(s) ds + ε_i
    = α + ∫ (Σ_{ℓ=1}^{K} c_iℓ ψ_ℓ(s)) (Σ_{k=1}^{K} θ_k ψ_k(s)) ds + ε_i
    = α + Σ_{k=1}^{K} [Σ_{ℓ=1}^{K} c_iℓ (∫ ψ_ℓ(s) ψ_k(s) ds)] θ_k + ε_i
    = α + Σ_{k=1}^{K} z_ik θ_k + ε_i
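A minimal sketch of this reduction: expand x_i(s) and β(s) in the same K-dimensional basis, form z_ik = Σ_ℓ c_iℓ ∫ψ_ℓψ_k, and fit an ordinary K-dimensional regression. The polynomial basis and simulated scores are illustrative assumptions, and integrals are approximated by Riemann sums.

```python
import numpy as np

# Reduce scalar-on-function regression to K dimensions via a shared
# basis: Y ≈ Z θ with Z = C J, where J is the Gram matrix of the basis.
rng = np.random.default_rng(7)
n, N, K = 40, 200, 4
s = np.linspace(0, 1, N)
Psi = np.vander(s, K, increasing=True)        # ψ_k(s) on the grid
C = rng.normal(size=(n, K))                   # subject scores c_iℓ
X = C @ Psi.T                                 # curves x_i(s)
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
Y = (X * (Psi @ theta_true)).sum(axis=1) / N + rng.normal(scale=0.01, size=n)

J = Psi.T @ Psi / N                           # ∫ ψ_ℓ(s) ψ_k(s) ds (Riemann)
Z = C @ J                                     # z_ik
theta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
```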
How to choose K?
[Figure: coefficient function estimates β(t) for K = 2, K = 7, and K = 10]
Regularization with roughness penalties
Could choose α and β to minimize
Σ_{i=1}^{n} (Y_i − α − ∫ x_i(s) β(s) ds)² + λ ∫ (β″(s))² ds
- First term: (proportional to) the mean squared error (MSE); measures fidelity to the data (how well the model "fits" the data)
- Second term: measures the roughness of the coefficient function
Example fits with a range of tuning parameters
[Figure: estimates of β(t) for several values of λ]
How to choose λ?
The tuning parameter λ controls the tradeoff between these two terms.
- If λ is too large, it will result in smooth estimates at the expense of large MSE (underfitting).
- If λ is too small, the MSE will be small but the estimated β function will be very wiggly (overfitting).
- Neither one of these will provide good "out of sample" predictions.
Could choose λ by cross-validation:
CV(λ) = Σ_{i=1}^{n} (Y_i − α̂_λ^{(i)} − ∫ x_i(s) β̂_λ^{(i)}(s) ds)²
where the superscript (i) denotes estimates computed with observation i left out. Choose λ to minimize CV(λ).
Also: generalized cross-validation (GCV), restricted maximum likelihood (REML), ...
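A minimal leave-one-out cross-validation sketch for choosing λ, using the same ridge-regularized discretized fit as a stand-in estimator (again an illustrative choice, not the slides' exact penalized spline, and with the intercept omitted).

```python
import numpy as np

# Leave-one-out CV for λ: for each candidate λ, refit without each
# observation in turn and accumulate squared prediction errors.
rng = np.random.default_rng(8)
n, N = 30, 50
s = np.linspace(0, 1, N)
X = rng.normal(size=(n, N)).cumsum(axis=1) / np.sqrt(N)
Y = X @ np.sin(2 * np.pi * s) / N + rng.normal(scale=0.05, size=n)
Z = X / N

def loo_cv(lam):
    """CV(λ): sum of squared leave-one-out prediction errors."""
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b = np.linalg.solve(Z[keep].T @ Z[keep] + lam * np.eye(N),
                            Z[keep].T @ Y[keep])
        err += (Y[i] - Z[i] @ b) ** 2
    return err

lams = [0.01, 0.1, 1.0, 10.0]
cv = [loo_cv(l) for l in lams]
best_lam = lams[int(np.argmin(cv))]   # λ minimizing CV(λ)
```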
Function on scalar regression: Example scenarios
X = climate zone
Y = temperature (over time)

X = age
Y = activity level (over time)

X = diagnosis
Y = brain image
Canadian weather data
X = region (Arctic, Atlantic, Continental, Pacific)
Y = temperature (degrees Celsius) over time
[Figure: monthly temperature curves for Canadian weather stations]
Function on scalar regression
A "functional ANOVA" model:
Y_ij(s) = µ(s) + α_i(s) + ε_ij(s),  i = 1, ..., n
For identifiability, could constrain Σ_i α_i(s) = 0 for all s.
More generally, given data (x_1, {Y_1(s), s ∈ S}), ..., (x_n, {Y_n(s), s ∈ S}), where x_i is a p-vector, the function-on-scalar regression model is
Y_i(s) = x_iᵀ β(s) + ε_i(s),
where β(s) = (β_1(s), ..., β_p(s)).
Function on scalar regression: data representation
If the functional observations are observed at a grid of points, say, s_1, ..., s_N, then let
Y : n × N = [Y_i(s_j)], i = 1, ..., n; j = 1, ..., N.
We could also think about expressing the β functions on the same grid, i.e., let
B : p × N = [β_i(s_j)], i = 1, ..., p; j = 1, ..., N.
Expressing the ε's the same way and writing the X matrix as usual, the discrete version of the model becomes
Y = XB + E.
This has the same form as multivariate analysis of variance (MANOVA).
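A minimal sketch of this discretized model: with Y (n × N) and design X (n × p), the unpenalized fit is just column-wise least squares, B̂ = (XᵀX)⁻¹XᵀY. The design and coefficient functions below are simulated for illustration.

```python
import numpy as np

# Discretized function-on-scalar regression Y = XB + E: one ordinary
# least-squares fit applied jointly to every grid-point column of Y.
rng = np.random.default_rng(9)
n, N, p = 60, 25, 2
s = np.linspace(0, 1, N)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + scalar
B_true = np.vstack([np.sin(np.pi * s), 0.5 * s])        # p x N coefficients
Y = X @ B_true + rng.normal(scale=0.1, size=(n, N))

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)           # p x N estimate
```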
Function on scalar regression: basis function representation
Given basis functions ψ_1(s), ..., ψ_K(s), we could express
Y_i(s) = Σ_{k=1}^{K} c_ik ψ_k(s)
β_j(s) = Σ_{k=1}^{K} θ_jk ψ_k(s)
The model then becomes
C = XΘ + E
Fitting by penalizing roughness
Could choose β to minimize
Σ_{i=1}^{n} ∫ (Y_i(s) − x_iᵀ β(s))² ds + λ Σ_{j=1}^{p} ∫ (β″_j(s))² ds
More generally, in the discretized space, we could minimize
||Y − XB||² + λ Σ_{j=1}^{p} B_jᵀ P B_j,
where B_j is the jth row of B and P is a roughness penalty matrix.
Application to Canadian weather data
[Figure: estimated coefficient functions over the year, one panel per model term]
Function on function regression: Example scenarios
X = temperature (over time)
Y = precipitation (over time)

X = fractional anisotropy along corpus callosum tract
Y = fractional anisotropy along corticospinal tract

X = hip angle through a gait cycle
Y = knee angle through a gait cycle
Function on function regression: the model
Given functional data ({x_1(s), s ∈ S}, {Y_1(t), t ∈ T}), ..., ({x_n(s), s ∈ S}, {Y_n(t), t ∈ T}), the model could be expressed
Y_i(t) = ∫ β(s, t) x_i(s) ds + ε_i(t)
The coefficient function in this case is a (two-dimensional) surface.
Function on function regression: Example
[Figure: estimated bivariate coefficient surface β(s, t)]
Software
- refund package
- fda package
- fda.usc package
Stuff we haven't even mentioned
- Inference on functional model parameters
- Model selection, model building
- Alternative penalties
- Model diagnostics and goodness of fit
- "Generalized" versions of functional linear models
- Hierarchical models for functional data
- Supervised/unsupervised classification of functional data
- Functional "depth" and functional boxplots
- Many other topics ...
Useful references
- Ferraty and Vieu (2006). Nonparametric Functional Data Analysis. Springer.
- Ramsay and Silverman (2005). Functional Data Analysis, Second Edition. Springer.
- Ramsay and Silverman (2002). Applied Functional Data Analysis. Springer.
- Sørensen, Goldsmith, and Sangalli (2013). An introduction with medical applications to functional data analysis. Statistics in Medicine 32:5222-5240.