Computational Statistics Lectures 10-13: Smoothing and Nonparametric Inference
Dr Jennifer Rogers, Hilary Term 2017
Transcript
Page 1: Computational Statistics Lectures 10-13: Smoothing and ...

Computational Statistics
Lectures 10-13: Smoothing and Nonparametric Inference

Dr Jennifer Rogers

Hilary Term 2017

Page 2: Computational Statistics Lectures 10-13: Smoothing and ...

Background

Page 3: Computational Statistics Lectures 10-13: Smoothing and ...

Smoothing and nonparametric methods

- Approximating function that attempts to capture important patterns in datasets or images
- Leaves out noise
- Aids data analysis by making it possible to extract more information from the data
- Analyses are flexible and robust
- Should we always just use a nonparametric estimator?

Page 4: Computational Statistics Lectures 10-13: Smoothing and ...

Smoothing and nonparametric methods

No!

- There is no miracle!
- There is a price to pay for the gain in generality
- When we have clear evidence of a good parametric model for the data, we should use it
- Nonparametric estimators converge to the true curve more slowly than VALID parametric estimators
- But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve

So nonparametric methods have their place!

Page 5: Computational Statistics Lectures 10-13: Smoothing and ...

The regression problem

- Goal of regression → discover the relationship between two variables, X and Y
- Wish to find a curve m that passes "in the middle" of the points
- Observations (x_i, Y_i) for i = 1, ..., n
  - x_i is a real-valued variable
  - Y_i is a real-valued random response
- Y_i = m(x_i) + ε_i for i = 1, ..., n, with
  - E(ε_i | X_i) = 0
  - Var(ε_i | X_i) = σ²(X_i)
  - m(x) = E(Y | X = x)
- m(·): the regression function
  - Reflects the relationship between X and Y
  - Curve of interest, which "lies in the middle" of all the points
- Goal is to infer m(x) from the observations (x_i, Y_i)

Page 6: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Cosmic microwave background data

Page 7: Computational Statistics Lectures 10-13: Smoothing and ...

Example: FTSE stock market index

Page 8: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers

Page 9: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers

- In the parametric context, we assume we know the shape of m(·)
- Linear model: Y = α + βX + ε
- m(x) = E(Y | X = x) = α + βx
- We estimate α and β from the data
- Least squares estimator:

  (α̂, β̂) = argmin_{(α,β)} Σ_i (Y_i − α − βX_i)²

- Consider m(x_i) = 1 + 2x_i
- We can fit a linear model to the data and obtain α̂ = 0.9905 and β̂ = 2.0025

Page 10: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025

Page 11: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

Page 12: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

- Linear model: linear in the parameters!
- No higher order terms such as αβ or β²
- Not necessarily linear in X_i
- Examples of linear models:
  - Y_i = 8X_i³ − 13.6X_i² + 7.28X_i − 1.176 + ε
  - Y_i = cos(20X_i) + ε

Page 13: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

> lm(y2 ~ x + I(x^2) + I(x^3))

Page 14: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

> lm(y3 ~ I(cos(20*x)))
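The slides do not show how y2 and y3 were generated; the following is a hypothetical simulation consistent with the two example models above, assuming the same x grid and noise level as for y1:

# Hypothetical data-generating code for y2 and y3 (not part of the original slides)
set.seed(1)
x  <- seq(from = 0, to = 1, length.out = 1000)
y2 <- 8*x^3 - 13.6*x^2 + 7.28*x - 1.176 + rnorm(1000, 0, 0.2)
y3 <- cos(20*x) + rnorm(1000, 0, 0.2)
lm(y2 ~ x + I(x^2) + I(x^3))   # coefficients close to -1.176, 7.28, -13.6, 8
lm(y3 ~ I(cos(20*x)))          # coefficient close to 1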

Page 15: Computational Statistics Lectures 10-13: Smoothing and ...

Linear modelling

- Can still use linear modelling
- Requires knowledge of the functional form of the explanatory variables
- May not always be obvious
- Consider linear smoothers - much more general
- Obtain a non-trivial smoothing matrix even for just a single 'predictor' variable (p = 1)

Page 16: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers
- For some n × n matrix S, Ŷ = SY
- Fitted value Ŷ_i at design point x_i is a linear combination of the measurements:

  Ŷ_i = Σ_{j=1}^n S_ij Y_j

- Linear regression with p predictor variables:

  θ̂ = argmin_θ Σ_i (Y_i − X_i θ)²
    = argmin_θ (Y − Xθ)^T (Y − Xθ)
    = argmin_θ (Y^T Y − 2θ^T X^T Y + θ^T X^T Xθ)

- Differentiating with respect to θ and setting to zero:

  −2X^T Y + 2X^T Xθ = 0
  θ̂ = (X^T X)^{-1} X^T Y

Page 17: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers

θ̂ = (X^T X)^{-1} X^T Y

- Estimated (fitted) values Ŷ are Xθ̂:

  Ŷ = X(X^T X)^{-1} X^T Y = S Y,

- S is an n × n matrix
- It is the hat matrix, H, from linear regression
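A small sketch on simulated data (not from the slides) showing that the least squares fit is a linear smoother, with S the hat matrix:

# Linear regression as a linear smoother: yhat = S %*% y with S = X (X^T X)^{-1} X^T
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2*x + rnorm(n, 0, 0.2)
X <- cbind(1, x)                        # design matrix with intercept
S <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
yhat <- S %*% y
all.equal(as.numeric(yhat), as.numeric(fitted(lm(y ~ x))))   # TRUE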

Page 18: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers
- Degrees of freedom for linear regression:

  tr(X(X^T X)^{-1} X^T) = tr(X^T X(X^T X)^{-1}) = tr(I_p) = p

- How large is the expected residual sum of squares

  E(RSS) = E( Σ_{i=1}^n (Y_i − Ŷ_i)² ) ?

- If Ŷ = Xθ (the true mean), then E(RSS) = E(Σ_{i=1}^n ε_i²) = nσ²
- If Ŷ = SY, then Ŷ − Y = Sε − ε and

  E(RSS) = σ²(n − p)

A good estimator of σ² is thus

  σ̂² = RSS/(n − p) = RSS/(n − df)
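Continuing the simulated sketch above (same X, S, y), the trace of S recovers the degrees of freedom and RSS/(n − df) the usual variance estimate:

# Degrees of freedom and variance estimate for the linear smoother above
sum(diag(S))                        # = p = 2 (intercept + slope)
rss <- sum((y - yhat)^2)
rss / (n - sum(diag(S)))            # sigma^2 estimate, close to 0.2^2
summary(lm(y ~ x))$sigma^2          # the same value as reported by lm()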

Page 19: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers

Ŷ = m̂(x) = S Y,

- Serum data, taken in connection with diabetes research
- Y: log-concentration of a serum
- x: age of children in months
- Various ways in which S can be chosen:

Page 20: Computational Statistics Lectures 10-13: Smoothing and ...

Linear smoothers

Methods considered fall into two categories:
- Local regression, including
  - kernel estimators and
  - local polynomial regression.
- Penalized estimators, mainly
  - smoothing splines.

Page 21: Computational Statistics Lectures 10-13: Smoothing and ...

Local estimation: Kernel estimators

Page 22: Computational Statistics Lectures 10-13: Smoothing and ...

Histogram

- X ∼ F
- P(x ≤ X ≤ x + Δx) = ∫_x^{x+Δx} f_X(v) dv
- Thus, for any u ∈ [x, x + Δx]:

  P(x ≤ X ≤ x + Δx) ≈ Δx · f_X(u),

- This implies

  f_X(u) ≈ P(x ≤ X ≤ x + Δx) / Δx

  f̂_X(u) = #{X_i : x ≤ X_i ≤ x + Δx} / (nΔx)

This is the idea behind histograms.

Page 23: Computational Statistics Lectures 10-13: Smoothing and ...

Histogram

- Choose an origin t_0
- Choose a bin size, h
- Partition the real line into intervals I_k = [t_k, t_{k+1}] of equal length h
- Histogram estimator, for x in bin I_k:

  f̂_H(x) = #{X_i : t_k ≤ X_i ≤ t_{k+1}} / (nh).

Step function that depends heavily on both the origin, t_0, and the bin width, h

Page 24: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Old Faithful geyser
Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park

> hist(faithful$eruptions,probability = T)

Page 25: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Old Faithful geyser

What happens if we change the time origin and the bin width?

> hist(faithful$eruptions,breaks=seq(1.5,5.5,1),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(1.1,5.1,1),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(0.5,5.5,0.5),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(0.75,5.75,0.5),probability = T,xlim=c(1,5.5))

Page 26: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Old Faithful geyser

Page 27: Computational Statistics Lectures 10-13: Smoothing and ...

Density estimator
Can we do better? We can get rid of the time origin.

  f_X(x) = lim_{h→0} [F_X(x + h) − F_X(x)] / h = lim_{h→0} [F_X(x) − F_X(x − h)] / h

Combining the two expressions:

  f_X(x) = lim_{h→0} [F_X(x + h) − F_X(x − h)] / (2h) = lim_{h→0} P(x − h < X < x + h) / (2h)

which we can estimate using proportions:

  f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h),

with K(x) = (1/2) · I{|x| < 1}

Page 28: Computational Statistics Lectures 10-13: Smoothing and ...

Density estimator

- Similar to the histogram
- No longer have the origin, t_0
- More flexible
- Constructs a box of length 2h around each observation X_i
- Estimator is then the sum of the boxes at x
- Density that depends on the bandwidth, h

Page 29: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel estimators

- Put a smooth symmetric 'bump' of shape K around each observation
- Estimator at x is now the sum of the bumps at x
- We define

  f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h),

- K: 'kernel' function
- Estimator has the same properties as K
  - If K is continuous and differentiable → so is the estimator
  - Estimator is a density if K is a density
- Shape of K does not greatly influence the resulting estimator
- Estimator does depend heavily on h

Page 30: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel estimators

Page 31: Computational Statistics Lectures 10-13: Smoothing and ...

Kernels

A kernel is a real-valued function K(x) such that
- K(x) ≥ 0 for all x ∈ R,
- ∫ K(x) dx = 1,
- ∫ x K(x) dx = 0.

In practice, the choice of K does not influence the results much, but the value of h is crucial

Page 32: Computational Statistics Lectures 10-13: Smoothing and ...

Kernels

Commonly used kernels include
- Boxcar: K(x) = I(x)/2
- Gaussian: K(x) = (2π)^{-1/2} exp(−x²/2)
- Epanechnikov: K(x) = (3/4)(1 − x²)I(x)
- Biweight: K(x) = (15/16)(1 − x²)²I(x)
- Triweight: K(x) = (35/32)(1 − x²)³I(x)
- Uniform: K(x) = (1/2)I(x)

where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise
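As an illustration, a hand-rolled kernel density estimate with the Epanechnikov kernel can be compared with R's built-in density(); this is a minimal sketch, with h = 0.3 an arbitrary bandwidth rather than a recommendation:

# Kernel density estimate by hand, using the Epanechnikov kernel
K_epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
kde <- function(grid, X, h) sapply(grid, function(x0) mean(K_epan((x0 - X) / h)) / h)

X    <- faithful$eruptions
grid <- seq(1, 6, length.out = 400)
plot(grid, kde(grid, X, h = 0.3), type = "l", ylab = "density estimate")
# density() parametrises the bandwidth as the kernel's standard deviation,
# which for the Epanechnikov kernel on [-1, 1] is 1/sqrt(5), hence the rescaling
lines(density(X, kernel = "epanechnikov", bw = 0.3 / sqrt(5)), col = 2)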

Page 33: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Old Faithful geyser

Page 34: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel regression

Page 35: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel regression

Y = m(x) + ε

1. We want to estimate E(Y | X = x). Naive estimator:

   m̂(x) = (Σ_{i=1}^n Y_i) / n.

   Same for all x

2. Average the Y_i of only those X_i that are close to x (local average):

   m̂(x) = Σ_{i=1}^n Y_i · I{|X_i − x| < h} / Σ_{i=1}^n I{|X_i − x| < h}.

   h: bandwidth, determines the size of the neighbourhood around x

Page 36: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel regression

Y = m(x) + ε

3. Give a slowly decreasing weight to X_i as it gets farther from x, rather than giving the same weight to all observations close to x:

   m̂(x) = Σ_{i=1}^n Y_i W(x − X_i),

   W(·): weight function that decreases as |x − X_i| increases, with Σ_{i=1}^n W(x − X_i) = 1

Page 37: Computational Statistics Lectures 10-13: Smoothing and ...

Nadaraya-Watson estimator

  W(x − X_i) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h)

Hence, the Nadaraya-Watson kernel estimator is

  m̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h)

The estimated function values Ŷ_j = m̂(x_j) at the observed design points are given by

  Ŷ_j = Σ_i S_{ji} Y_i,  where  S_{ji} = K((x_j − x_i)/h) / Σ_k K((x_j − x_k)/h),

The kernel smoother is thus a linear smoother
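A minimal sketch (Gaussian kernel, simulated data, an arbitrary bandwidth h) that builds the Nadaraya-Watson smoother matrix S explicitly and confirms that it acts as a linear smoother:

# Nadaraya-Watson estimator written as a linear smoother: yhat = S %*% y
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, 0, 0.3)
h <- 0.05
K <- dnorm(outer(x, x, "-") / h)   # K((x_j - x_i)/h); row j corresponds to the point x_j
S <- K / rowSums(K)                # normalise so that each row sums to one
yhat <- S %*% y                    # fitted values m_hat(x_j)
range(rowSums(S))                  # all equal to 1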

Page 38: Computational Statistics Lectures 10-13: Smoothing and ...

Local least squares

We can rewrite the kernel regression estimator as

  m̂(x) = argmin_{m_x ∈ R} Σ_{i=1}^n K((x − x_i)/h)(Y_i − m_x)².

Exercise: this can be verified by solving d/dm_x Σ_{i=1}^n K((x − x_i)/h)(Y_i − m_x)² = 0

- Thus, for every fixed x, we have to search for the best local constant m_x such that the localized sum of squares is minimized
- Localization is here described by the kernel, which gives a large weight to those observations (x_i, Y_i) where x_i is close to the point x of interest.

The choice of the bandwidth h is crucial

Page 39: Computational Statistics Lectures 10-13: Smoothing and ...

Example: FTSE stock market index

Page 40: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Cosmic microwave background data

Page 41: Computational Statistics Lectures 10-13: Smoothing and ...

Choosing the bandwidth

Page 42: Computational Statistics Lectures 10-13: Smoothing and ...

Choosing the bandwidth

- Measure "success" of fit using mean squared error on new observations (MSE),

  MSE(h) = E[(Y − m̂_h(x))²],

- Splitting into noise, bias and variance:

  MSE(h) = Noise + Bias² + Variance

- Bias decreases as h ↓ 0
- Variance increases as h ↓ 0
- Choosing the bandwidth is a bias-variance trade-off.

Page 43: Computational Statistics Lectures 10-13: Smoothing and ...

Choosing the bandwidth

- But... we don't know MSE(h), as we just have the n observations and cannot generate new random observations Y
- First idea is to compute, for various values of the bandwidth h:
  - The estimator m̂_h for a training sample
  - The error

    n^{-1} Σ_{i=1}^n (Y_i − m̂_h(x_i))² = n^{-1} Σ_{i=1}^n (Y_i − Ŷ_i)²,

- Choose the bandwidth with the smallest training error

Page 44: Computational Statistics Lectures 10-13: Smoothing and ...

CMB data

Page 45: Computational Statistics Lectures 10-13: Smoothing and ...

Choosing the bandwidth

- We would choose a bandwidth close to h = 0, giving near perfect interpolation of the data, that is m̂(x_i) ≈ Y_i
- This is unsurprising
- Parametric context
  - Shape of the model is fixed
  - Minimising the MSE makes the parametric model as close as possible to the data
- Nonparametric setting
  - Don't have a fixed shape
  - Value of h dictates the model
  - Minimising the MSE → fitted model as close as possible to the data
  - Leads us to choose h as small as possible
  - Interpolation of the data
- Misleading result → only noise is fitted for very small bandwidths

Page 46: Computational Statistics Lectures 10-13: Smoothing and ...

Cross-validation

- Solution... don't use X_i to construct m̂(X_i)
- This is the idea behind cross-validation
  - Leave-one-out cross-validation
  - Least squares cross-validation
- For each value of h
  - For each i = 1, ..., n, compute the estimator m̂_h^{(−i)}(x), where m̂_h^{(−i)}(x) is computed without using observation i
  - The estimated MSE is then given by

    MSE(h) = n^{-1} Σ_i (Y_i − m̂_h^{(−i)}(x_i))²
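A brute-force version of leave-one-out CV for the Nadaraya-Watson estimator, reusing the simulated x, y and Gaussian kernel from the sketch above (slow but direct; the shortcut on the following slides removes the need to refit n times):

# Brute-force leave-one-out CV over a grid of bandwidths
nw <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)
  sum(w * y) / sum(w)
}
loocv <- function(h, x, y) {
  err <- sapply(seq_along(x), function(i) y[i] - nw(x[i], x[-i], y[-i], h))   # fit without observation i
  mean(err^2)
}
hs <- seq(0.01, 0.2, by = 0.01)
cv <- sapply(hs, loocv, x = x, y = y)
hs[which.min(cv)]                  # bandwidth minimising the estimated MSE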

Page 47: Computational Statistics Lectures 10-13: Smoothing and ...

CMB data

Page 48: Computational Statistics Lectures 10-13: Smoothing and ...

CMB data

A bandwidth of 44 minimises the estimated MSE

Page 49: Computational Statistics Lectures 10-13: Smoothing and ...

Cross-validation

- A drawback of LOO-CV is that it is expensive to compute
- The fit has to be recalculated n times (once for each left-out observation)
- We can avoid needing to calculate m̂^{(−i)}(x) for all i
- For some n × n matrix S, the linear smoother fulfils

  Ŷ = SY

The risk (MSE) under LOO-CV can subsequently be written as

  MSE(h) = n^{-1} Σ_{i=1}^n ( (Y_i − m̂_h(x_i)) / (1 − S_ii) )²

Page 50: Computational Statistics Lectures 10-13: Smoothing and ...

Cross-validation

- Do not need to recompute m̂_h while leaving out each of the n observations in turn
- Results can be obtained much faster by rescaling the residuals

  Y_i − m̂_h(x_i)

  by the factor 1/(1 − S_ii)

- S_ii is the i-th diagonal entry of the smoothing matrix

Page 51: Computational Statistics Lectures 10-13: Smoothing and ...

Generalized Cross-Validation

  MSE(h) = n^{-1} Σ_{i=1}^n ( (Y_i − m̂_h(x_i)) / (1 − S_ii) )²,

Replace S_ii by its average ν/n (where ν = Σ_i S_ii)

Choose the bandwidth h that minimizes

  GCV(h) = n^{-1} Σ_{i=1}^n ( (Y_i − m̂_h(x_i)) / (1 − ν/n) )².
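For a linear smoother the shortcut and GCV need only the residuals and the diagonal of S; a sketch reusing the Nadaraya-Watson smoother matrix built earlier (for which the shortcut agrees with the brute-force leave-one-out computation):

# LOO-CV shortcut and GCV from the smoother matrix S of the earlier sketch
res <- y - as.numeric(S %*% y)
loo <- mean((res / (1 - diag(S)))^2)            # leave-one-out shortcut
nu  <- sum(diag(S))                             # effective degrees of freedom
gcv <- mean((res / (1 - nu / length(y)))^2)     # generalized cross-validation
c(loo = loo, gcv = gcv)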

Page 52: Computational Statistics Lectures 10-13: Smoothing and ...

Local polynomial regression

Page 53: Computational Statistics Lectures 10-13: Smoothing and ...

Nadaraya-Watson kernel estimator

- Major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
- Bias is of large order at the boundaries

Page 54: Computational Statistics Lectures 10-13: Smoothing and ...

Local polynomial regression

- Even when a curve doesn't look like a polynomial
  - Restrict to a small neighbourhood of a given point, x
  - Approximate the curve by a polynomial in that neighbourhood
- Fit its coefficients using only observations X_i close to x (or rather, putting more weight on observations close to x)
- Repeat this procedure at every point x where we want to estimate m(x)

Page 55: Computational Statistics Lectures 10-13: Smoothing and ...

Local polynomial regression

- Kernel estimator approximates the data by taking local averages within small bandwidths
- Use local linear regression to obtain an approximation

Page 56: Computational Statistics Lectures 10-13: Smoothing and ...

Kernel regression estimator

Recall that the kernel regression estimator is the solution to:

  m̂(x) = argmin_{m(x) ∈ R} Σ_{i=1}^n K((x − x_i)/h)(Y_i − m(x))²

This is given by

  m̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h).

Thus estimation corresponds to the solution of a weighted sum of squares problem

Page 57: Computational Statistics Lectures 10-13: Smoothing and ...

Local polynomial regression

Using a Taylor series, we can approximate m(x), where x is close to a point x_0, using the following polynomial:

  m(x) ≈ m(x_0) + m^(1)(x_0)(x − x_0) + (m^(2)(x_0)/2!)(x − x_0)² + ... + (m^(p)(x_0)/p!)(x − x_0)^p
       = m(x_0) + β_1(x − x_0) + β_2(x − x_0)² + ··· + β_p(x − x_0)^p

where m^(k)(x_0) = k!β_k, provided that all the required derivatives exist

Page 58: Computational Statistics Lectures 10-13: Smoothing and ...

Local polynomial regression

- Use the data to estimate that polynomial of degree p which best approximates m(x_i) in a small neighbourhood around the point x
- Minimise with respect to β_0, β_1, ..., β_p the function:

  Σ_{i=1}^n {Y_i − β_0 − β_1(x_i − x) − ... − β_p(x_i − x)^p}² K((x − x_i)/h)

- Weighted least squares problem, where the weights are given by the kernel functions K((x − x_i)/h)
- As m^(k)(x) = k!β_k, we then have that m̂(x) = β̂_0, or

  m̂(x) = β̂_0
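Local polynomial regression at a single point is just a weighted least squares fit; a sketch with p = 1 and a Gaussian kernel, reusing the simulated x, y from the earlier kernel regression sketch (h = 0.05 is again an arbitrary choice), where the estimate is read off as the intercept:

# Local linear fit at one point x0: weighted least squares in (x - x0), m_hat(x0) = intercept
locallin <- function(x0, x, y, h) {
  w   <- dnorm((x - x0) / h)               # kernel weights
  fit <- lm(y ~ I(x - x0), weights = w)    # local polynomial of degree p = 1, centred at x0
  unname(coef(fit)[1])                     # beta0_hat = estimate of m(x0)
}
x0 <- 0.25
locallin(x0, x, y, h = 0.05)
sin(2 * pi * x0)                           # true value in the simulated example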

Page 59: Computational Statistics Lectures 10-13: Smoothing and ...

CMB data

Red: kernel smoother (p = 0)
Green: local linear regression (p = 1)

Page 60: Computational Statistics Lectures 10-13: Smoothing and ...

Boundary bias: kernel estimator

Let ℓ_i(x) = ω_i(x)/Σ_j ω_j(x), so that

  m̂(x) = Σ_i ℓ_i(x) Y_i

For the kernel smoother (p = 0), the bias of the linear smoother is thus

  Bias = E(m̂(x)) − m(x) = m′(x) Σ_i (x_i − x)ℓ_i(x) + (m′′(x)/2) Σ_i (x_i − x)²ℓ_i(x) + R,

Page 61: Computational Statistics Lectures 10-13: Smoothing and ...

Boundary bias: Kernel estimator

The first term in the expansion is equal to

  m′(x) Σ_i (x_i − x) K((x − x_i)/h) / Σ_j K((x − x_j)/h)

- vanishes if the design points x_i are centred symmetrically around x
- does not vanish if x sits at the boundary (all x_i − x will have the same sign)

Page 62: Computational Statistics Lectures 10-13: Smoothing and ...

Boundary bias: polynomial estimator

- m(x) is truly a local polynomial of degree p
- At least p + 1 points with positive weights in the neighbourhood of x

The bias will hence be of order

  Bias = E(m̂(x)) − m(x) = (m^(p+1)(x)/(p + 1)!) Σ_j (x_j − x)^{p+1} ℓ_j(x) + R

Why not choose p = 20?

Page 63: Computational Statistics Lectures 10-13: Smoothing and ...

Boundary bias: polynomial estimator

- Y_i = m(x_i) + σε_i with ε_i ∼ N(0, 1)
- Variance of the linear smoother, m̂(x) = Σ_j ℓ_j(x) Y_j, is

  Var(m̂(x)) = σ² Σ_j ℓ_j²(x) = σ²‖ℓ(x)‖²

- ‖ℓ(x)‖² tends to be large if p is large
- In practice, p = 1 is a good choice

Page 64: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

  m(x) = √(x(1 − x)) sin(2.1π/(x + 0.05)),   0 ≤ x ≤ 1.

Page 65: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

> library(KernSmooth)   # provides dpill() and locpoly()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)

Page 66: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

Page 67: Computational Statistics Lectures 10-13: Smoothing and ...

Penalised regression

Page 68: Computational Statistics Lectures 10-13: Smoothing and ...

Penalised regression

Regression model, i = 1, ..., n:

  Y_i = m(x_i) + ε_i,   E(ε_i) = 0

Estimating m by choosing m̂(x) to minimize

  Σ_{i=1}^n (Y_i − m̂(x_i))²

leads to
- a linear regression estimate if minimizing over all linear functions
- an interpolation of the data if minimizing over all functions.

Page 69: Computational Statistics Lectures 10-13: Smoothing and ...

Penalised regression

Estimate m by choosing m̂(x) to minimize

  Σ_{i=1}^n (Y_i − m̂(x_i))² + λJ(m̂),

J(m): roughness penalty

Typically

  J(m) = ∫ (m′′(x))² dx.

Parameter λ controls the trade-off between fit and penalty
- For λ = 0: interpolation
- For λ → ∞: linear least squares line

Page 70: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

Page 71: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

Page 72: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

- Kernel regression
  - Researcher isn't interested in actually calculating m̂(x) for a single location x
  - m̂(x) calculated on a sufficiently small grid of x-values
  - Curve obtained by interpolation
- Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
- Alternative approach
  - Represent the fit as a piecewise polynomial
  - Pieces connecting at points called knots
  - Once the knots are selected, an estimator can be computed globally
  - In a manner similar to that for a parametrically specified mean function

This is the idea behind splines

Page 73: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

IID sample (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) coming from the model

  Y_i = m(X_i) + ε_i

Want to estimate the mean of the variable Y with m(x) = E(Y | X = x)

A very naive estimator of E(Y | X = x) would be the sample mean of the Y_i:

  m̂(x) = (Σ_{i=1}^n Y_i)/n

Not very good (same for all x)

Page 74: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

Approximate m by piecewise polynomials, each on a small interval:

  m(x) = c_1       if x < ξ_1
         c_2       if ξ_1 ≤ x < ξ_2
         ...
         c_k       if ξ_{k−1} ≤ x < ξ_k
         c_{k+1}   if x ≥ ξ_k

Page 75: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

Use more general lines, which join at the ξs:

  m(x) = a_1 + b_1 x           if x < ξ_1
         a_2 + b_2 x           if ξ_1 ≤ x < ξ_2
         ...
         a_k + b_k x           if ξ_{k−1} ≤ x < ξ_k
         a_{k+1} + b_{k+1} x   if x ≥ ξ_k

The a's and b's are such that the lines join at each ξ

Page 76: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

Approximate m(x) by polynomials

  m(x) = Σ_{j=0}^p β_{1,j} x^j     if x < ξ_1
         Σ_{j=0}^p β_{2,j} x^j     if ξ_1 ≤ x < ξ_2
         ...
         Σ_{j=0}^p β_{k,j} x^j     if ξ_{k−1} ≤ x < ξ_k
         Σ_{j=0}^p β_{k+1,j} x^j   if x ≥ ξ_k

The βs are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives

Splines which are piecewise polynomials of degree p are called
- splines of order p + 1, or
- splines of degree p
- the ξ are the knots

Page 77: Computational Statistics Lectures 10-13: Smoothing and ...

Splines

Page 78: Computational Statistics Lectures 10-13: Smoothing and ...

Piecewise constant splines

Page 79: Computational Statistics Lectures 10-13: Smoothing and ...

Knots

How many knots should we have?
- Choose a lot of knots well spread over the data range → reduce the bias of the estimator
- If we make it too local → estimator will be too wiggly
- Overcome the bias problem without increasing the variance → take a lot of knots, but constrain their influence
- We can do this using penalised regression

Page 80: Computational Statistics Lectures 10-13: Smoothing and ...

Spline order

What order spline should we use?
- Increase the value of p → make the estimator m̂_p smoother (since it has p − 1 continuous derivatives)
- If p is too large → increase the number of parameters to estimate
- In practice it is rarely useful to take p > 3
  - p = 2: splines of order three, or quadratic splines
  - p = 3: splines of order 4, or cubic splines
- A p-th order spline is a piecewise polynomial of degree p − 1 with p − 2 continuous derivatives at the knots

Page 81: Computational Statistics Lectures 10-13: Smoothing and ...

Natural splines

- Natural spline: a spline that is linear beyond the boundary knots
- Why this constraint?
  - We usually have very few observations beyond the two extreme knots
  - Want to obtain an estimator of the regression curve there
  - Cannot reasonably estimate anything correctly there
  - Rather use a simplified model (e.g. linear)
  - Often gives more or less reasonable results

Page 82: Computational Statistics Lectures 10-13: Smoothing and ...

Natural cubic splines

ξ_1 < ξ_2 < ... < ξ_n: a set of ordered points, so-called knots, contained in an interval (a, b)

A cubic spline is a continuous function m such that
(i) m is a cubic polynomial over (ξ_1, ξ_2), ..., and
(ii) m has continuous first and second derivatives at the knots.

The minimizer of

  Σ_{i=1}^n (y_i − m̂(x_i))² + λ ∫ (m̂′′(x))² dx

is a natural cubic spline with knots at the data points

m̂(x) is called a smoothing spline

Page 83: Computational Statistics Lectures 10-13: Smoothing and ...

Natural cubic splines

- Sequence of values f_1, ..., f_n at specified locations x_1 < x_2 < ··· < x_n
- Find a smooth curve g(x) that passes through the points (x_1, f_1), (x_2, f_2), ..., (x_n, f_n)
- The natural cubic spline g is an interpolating function that satisfies the following conditions:
  (i) g(x_j) = f_j, j = 1, ..., n,
  (ii) g(x) is cubic on each subinterval (x_k, x_{k+1}), k = 1, ..., (n − 1),
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) g′′(x_1) = g′′(x_n) = 0.
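R's splinefun() with method = "natural" builds exactly this kind of interpolant; a small sketch on made-up points:

# Natural cubic spline interpolation through a handful of (x, f) points
xk <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
fk <- c(0, 0.9, 0.2, -0.5, 0.3, 0)
g  <- splinefun(xk, fk, method = "natural")   # g(x_j) = f_j, with g''(x_1) = g''(x_n) = 0
grid <- seq(0, 1, length.out = 200)
plot(grid, g(grid), type = "l")
points(xk, fk, pch = 19)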

Page 84: Computational Statistics Lectures 10-13: Smoothing and ...

B-splines

- Need a basis for natural polynomial splines
- Convenient is the so-called B-spline basis
- Data points a = ξ_0 < ξ_1 < ξ_2 < ... < ξ_n ≤ ξ_{n+1} = b in (a, b)
- There are n + 2 real values
  - The n ≥ 0 interior points are called 'interior knots' or 'control points'
  - And there are always two endpoints, ξ_0 and ξ_{n+1}
- When the knots are equidistant they are said to be 'uniform'

Page 85: Computational Statistics Lectures 10-13: Smoothing and ...

B-splines

Now define new knots τ as
- τ_1 ≤ ... ≤ τ_p = ξ_0 = a
- τ_{j+p} = ξ_j
- b = ξ_{n+1} = τ_{n+p+1} ≤ τ_{n+p+2} ≤ ... ≤ τ_{n+2p}

- p: order of the polynomial
- p + 1 is the order of the spline
- Append the lower and upper boundary knots ξ_0 and ξ_{n+1} p times
- Needed due to the recursive nature of B-splines

Page 86: Computational Statistics Lectures 10-13: Smoothing and ...

B-splines

Define recursively:
- For k = 0 and i = 1, ..., n + 2p

  B_{i,0}(x) = 1 if τ_i ≤ x < τ_{i+1}, and 0 otherwise

- For k = 1, 2, ..., p and i = 1, ..., n + 2p

  B_{i,k}(x) = (x − τ_i)/(τ_{i+k−1} − τ_i) · B_{i,k−1}(x) + (τ_{i+k} − x)/(τ_{i+k} − τ_{i+1}) · B_{i+1,k−1}(x)

The support of B_{i,k}(x) is [τ_i, τ_{i+k}]
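The splines package constructs B-spline bases by essentially this recursion; a minimal sketch of a cubic basis with three interior knots on [0, 1]:

# Cubic B-spline basis functions with interior knots at 0.25, 0.5, 0.75
library(splines)
xg <- seq(0, 1, length.out = 400)
B  <- bs(xg, knots = c(0.25, 0.5, 0.75), degree = 3, intercept = TRUE)
dim(B)        # 400 x 7: (number of interior knots) + degree + 1 basis functions
matplot(xg, B, type = "l", lty = 1, ylab = "B-spline basis functions")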

Page 87: Computational Statistics Lectures 10-13: Smoothing and ...

B-splines

Page 88: Computational Statistics Lectures 10-13: Smoothing and ...

Solving

- Solution depends on the regularization parameter λ
- Determines the amount of "roughness"
- Choosing λ isn't necessarily intuitive
- Degrees of freedom = trace of the smoothing matrix S
  - Sum of the eigenvalues

  S = B(B^T B + λΩ)^{-1} B^T

- Monotone relationship between df and λ
- Search for the value of λ that gives the desired df
  - df = 2 → linear regression
  - df = n → interpolate the data exactly
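In R, smooth.spline() exploits this monotone relationship: instead of supplying λ directly you can request a target df and let it search for the corresponding λ. A small sketch, assuming x and y are the Doppler data simulated on the earlier slide:

# Choosing the smoothing spline through a target effective degrees of freedom
fit_df <- smooth.spline(x, y, df = 20)   # searches for the lambda giving about 20 df
fit_df$lambda                            # the lambda that was found
fit_df$df                                # close to the requested 20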

Page 89: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

Page 90: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Doppler function

Could of course choose λ by LOO-CV or GCV

Page 91: Computational Statistics Lectures 10-13: Smoothing and ...

Cross validation

> plot(x, y)
> fitcv <- smooth.spline(x, y, cv=T)
> lines(fitcv, col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514   lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575

Page 92: Computational Statistics Lectures 10-13: Smoothing and ...

Generalised cross validation

> plot(x, y)
> fitgcv <- smooth.spline(x, y, cv=F)
> lines(fitgcv, col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504   lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571

Page 93: Computational Statistics Lectures 10-13: Smoothing and ...

Multivariate smoothing

Page 94: Computational Statistics Lectures 10-13: Smoothing and ...

Multivariate smoothing

- So far we have only considered univariate functions
- Suppose there are several predictors that we would like to treat nonparametrically
- Most 'interesting' statistical problems nowadays are high-dimensional with, easily, p > 1000
  - Biology: microarrays, gene maps, network inference
  - Finance: prediction from multivariate time series
  - Physics: climate models
- Can we just extend the methods and model functions R^p → R nonparametrically?

Page 95: Computational Statistics Lectures 10-13: Smoothing and ...

Curse of dimensionality

- One might consider multidimensional smoothers aimed at estimating:

  Y = m(x_1, x_2, ..., x_p)

- The methods considered rely on 'local' approximations
- Examine the behaviour of data points in the neighbourhood of the point of interest
- What is 'local' and a 'neighbourhood' if p → ∞ and n stays constant?

Page 96: Computational Statistics Lectures 10-13: Smoothing and ...

Curse of dimensionality

x = (x^(1), x^(2), ..., x^(p)) ∈ [0,1]^p.

To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length 0 < ℓ < 1 such that

  ℓ^p ≈ 0.05

  Dimension p   side length ℓ
  1             0.05
  2             0.22
  5             0.54
  10            0.74
  1000          0.997
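The side lengths in the table are ℓ = 0.05^{1/p} (rounded); one line of R reproduces them:

# Side length needed for a cube containing about 5% of the unit cube's volume
p <- c(1, 2, 5, 10, 1000)
signif(0.05^(1 / p), 3)    # 0.05 0.224 0.549 0.741 0.997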

Page 97: Computational Statistics Lectures 10-13: Smoothing and ...

Additive models

Require the function m: R^p → R to be of the form

  m_add(x) = µ + m_1(x^(1)) + m_2(x^(2)) + ... + m_p(x^(p))
           = µ + Σ_{j=1}^p m_j(x^(j)),   µ ∈ R

m_j(·): R → R is just a univariate nonparametric function, with

  E[m_j(x^(j))] = 0,   j = 1, ..., p

- Choice of smoother is left open
- Avoids the curse of dimensionality → 'less flexible'
- Functions can be estimated by 'backfitting'

Page 98: Computational Statistics Lectures 10-13: Smoothing and ...

Backfitting

Data x_i^(j), 1 ≤ i ≤ n and 1 ≤ j ≤ p

A linear smoother for variable j can be described by an n × n matrix S^(j), so that

  m̂_j = S^(j) Y,

- Y = (Y_1, ..., Y_n)^T: observed vector of responses
- m̂_j = (m̂_j(x_1^(j)), ..., m̂_j(x_n^(j))): regression fit
- S^(j): smoother with bandwidth estimated by LOO-CV or GCV

Page 99: Computational Statistics Lectures 10-13: Smoothing and ...

Backfitting

  m_add(x) = µ + Σ_{j=1}^p m_j(x^(j)),

Suppose µ̂ and m̂_k are given for all k ≠ j

  m̂_add(x_i) = ( µ̂ + Σ_{k≠j} m̂_k(x_i^(k)) ) + m̂_j(x_i^(j))

Now apply the smoother S^(j) to

  Y − ( µ̂ + Σ_{k≠j} m̂_k )

Cycle through all j = 1, ..., p to get

  m̂_add(x_i) = µ̂ + Σ_{j=1}^p m̂_j(x_i^(j)).

Page 100: Computational Statistics Lectures 10-13: Smoothing and ...

Backfitting

1. Use µ̂ ← n^{-1} Σ_{i=1}^n Y_i. Start with m̂_j ≡ 0 for all j = 1, ..., p
2. Cycle through the indices j = 1, 2, ..., p, 1, 2, ..., p, ...:

     m̂_j ← S^(j)( Y − µ̂1 − Σ_{k≠j} m̂_k ).

   Also normalize

     m̂_j(·) ← m̂_j(·) − n^{-1} Σ_{i=1}^n m̂_j(x_i^(j))

   and update µ̂ ← n^{-1} Σ_{i=1}^n ( Y_i − Σ_k m̂_k(x_i^(k)) ).

   Stop the iterations if the functions do not change very much.
3. Return

     m̂_add(x_i) ← µ̂ + Σ_{j=1}^p m̂_j(x_i^(j))
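A minimal backfitting sketch for an additive model with two predictors, using smooth.spline (with its default GCV bandwidth choice) as the univariate smoother. The data are simulated and this is not the slides' own code; in practice one would typically call gam() from the mgcv or gam packages instead:

# Backfitting for Y = mu + m1(x1) + m2(x2) + noise, smoothing with smooth.spline
set.seed(1)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, 0, 0.2)

mu <- mean(y)
m1 <- numeric(n); m2 <- numeric(n)
for (it in 1:20) {
  m1 <- predict(smooth.spline(x1, y - mu - m2), x1)$y   # smooth partial residuals on x1
  m1 <- m1 - mean(m1)                                   # normalise to mean zero
  m2 <- predict(smooth.spline(x2, y - mu - m1), x2)$y   # smooth partial residuals on x2
  m2 <- m2 - mean(m2)
  mu <- mean(y - m1 - m2)                               # update the intercept
}
fit_add <- mu + m1 + m2   # fitted additive model at the observations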

Page 101: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Ozone data

Page 102: Computational Statistics Lectures 10-13: Smoothing and ...

Example: Ozone data

Page 103: Computational Statistics Lectures 10-13: Smoothing and ...

Iteration 1

Page 104: Computational Statistics Lectures 10-13: Smoothing and ...

Iteration 2

Page 105: Computational Statistics Lectures 10-13: Smoothing and ...

Iteration 3

Page 106: Computational Statistics Lectures 10-13: Smoothing and ...

Iteration 7

