Computational Statistics
Lectures 10-13: Smoothing and Nonparametric Inference
Dr Jennifer Rogers
Hilary Term 2017
Background
Smoothing and nonparametric methods
- An approximating function that attempts to capture important patterns in datasets or images
- Leaves out noise
- Aids data analysis by extracting more information from the data
- Analyses are flexible and robust
- Should we then always just use a nonparametric estimator?
Smoothing and nonparametric methods
No!
- There is no miracle!
- There is a price to pay for the gain in generality
- When we have clear evidence of a good parametric model for the data, we should use it
- Nonparametric estimators converge to the true curve more slowly than valid parametric estimators
- But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve
So nonparametric methods have their place!
The regression problem
- Goal of regression: discover the relationship between two variables, X and Y
- We wish to find a curve m that passes "in the middle" of the points
- Observations (x_i, Y_i) for i = 1, ..., n
  - x_i is a real-valued variable
  - Y_i is a real-valued random response
- Y_i = m(x_i) + ε_i for i = 1, ..., n
  - E(ε_i | X_i) = 0
  - Var(ε_i | X_i) = σ²(X_i)
  - m(x) = E(Y | X = x)
- m(·): the regression function
  - Reflects the relationship between X and Y
  - The curve of interest, which "lies in the middle" of all the points
- The goal is to infer m(x) from the observations (x_i, Y_i)
Example: Cosmic microwave background data
Example: FTSE stock market index
Linear smoothers
Linear smoothers
- In the parametric context, we assume we know the shape of m(·)
- Linear model: Y = α + βX + ε
  - m(x) = E(Y | X = x) = α + βx
- We estimate α and β from the data
- Least squares estimator:

  (\hat{\alpha}, \hat{\beta}) = \arg\min_{(\alpha, \beta)} \sum_i (Y_i - \alpha - \beta X_i)^2

- Consider m(x_i) = 1 + 2x_i
- We can fit a linear model to simulated data and obtain \hat{\alpha} = 0.9905 and \hat{\beta} = 2.0025
Linear modelling
> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025
Linear modelling
Linear modelling
- Linear model: linear in the parameters!
  - No higher-order terms such as αβ or β²
- Not necessarily linear in X_i
- Examples of linear models:
  - Y_i = 8X_i³ − 13.6X_i² + 7.28X_i − 1.176 + ε
  - Y_i = cos(20X_i) + ε
Linear modelling
> lm(y2 ~ x + I(x^2) + I(x^3))
Linear modelling
> lm(y3 ~ I(cos(20*x)))
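As a sketch of the full session (assuming y2 and y3 are simulated from the two example models on the previous slide, using the same x and noise level as before; the simulation lines are not shown on the slides):

# assumed simulation of the two example responses
y2 <- 8*x^3 - 13.6*x^2 + 7.28*x - 1.176 + rnorm(1000, 0, 0.2)
y3 <- cos(20*x) + rnorm(1000, 0, 0.2)
lm(y2 ~ x + I(x^2) + I(x^3))   # should approximately recover the cubic coefficients
lm(y3 ~ I(cos(20*x)))          # should return a coefficient close to 1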
Linear modelling
- We can still use linear modelling
  - Requires knowledge of the functional form of the explanatory variables
  - May not always be obvious
- Consider linear smoothers instead - much more general
  - We obtain a non-trivial smoothing matrix even for just a single 'predictor' variable (p = 1)
Linear smoothers

- For some n × n matrix S, \hat{Y} = SY
- The fitted value \hat{Y}_i at design point x_i is a linear combination of the measurements:

  \hat{Y}_i = \sum_{j=1}^{n} S_{ij} Y_j
- Linear regression with p predictor variables:

  \hat{\theta} = \arg\min_\theta \sum_i (Y_i - X_i\theta)^2
              = \arg\min_\theta (Y - X\theta)^T (Y - X\theta)
              = \arg\min_\theta (Y^T Y - 2\theta^T X^T Y + \theta^T X^T X \theta)

- Differentiating with respect to θ and setting to zero:

  -2X^T Y + 2X^T X\theta = 0

  \hat{\theta} = (X^T X)^{-1} X^T Y
Linear smoothers
  \hat{\theta} = (X^T X)^{-1} X^T Y

- The estimated (fitted) values \hat{Y} are X\hat{\theta}:

  \hat{Y} = X(X^T X)^{-1} X^T Y = SY,

- S is an n × n matrix
  - It is the hat matrix, H, from linear regression
Linear smoothers

- Degrees of freedom for linear regression:

  tr(X(X^T X)^{-1} X^T) = tr(X^T X (X^T X)^{-1}) = tr(I_p) = p

- How large is the expected residual sum of squares,

  E(RSS) = E\left( \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \right) ?

- If \hat{Y} = X\theta (the true parameter), then E(RSS) = E(\sum_{i=1}^{n} \varepsilon_i^2) = n\sigma^2
- If \hat{Y} = SY, then \hat{Y} - Y = S\varepsilon - \varepsilon and

  E(RSS) = \sigma^2 (n - p)

A good estimator of \sigma^2 is thus

  \hat{\sigma}^2 = \frac{RSS}{n - p} = \frac{RSS}{n - df}
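A minimal sketch in R (reusing the simulated x and y1 from the earlier linear-modelling example) of the hat matrix, its trace and the resulting variance estimate:

# design matrix with intercept, using the simulated data from before
X <- cbind(1, x)
S <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
df <- sum(diag(S))                       # trace = number of parameters (2 here)
yhat <- S %*% y1                         # fitted values
sigma2 <- sum((y1 - yhat)^2) / (length(y1) - df)   # estimate of sigma^2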
Linear smoothers
  \hat{Y} = \hat{m}(x) = SY,

- Serum data, taken in connection with diabetes research
  - Y: log-concentration of a serum
  - x: age of children in months
- There are various ways in which S can be chosen:
Linear smoothers
The methods considered fall into two categories:

- Local regression, including
  - kernel estimators and
  - local polynomial regression.
- Penalized estimators, mainly
  - smoothing splines.
Local estimation: Kernel estimators
Histogram
- X ∼ F
- P(x ≤ X ≤ x + Δx) = \int_x^{x+\Delta x} f_X(v)\,dv
- Thus, for any u ∈ [x, x + Δx]:

  P(x ≤ X ≤ x + Δx) ≈ Δx · f_X(u),

- This implies

  f_X(u) ≈ \frac{P(x \le X \le x + \Delta x)}{\Delta x}

  \hat{f}_X(u) = \frac{\#\{X_i : x \le X_i \le x + \Delta x\}}{n\Delta x}
This is the idea behind histograms.
Histogram
- Choose an origin t_0
- Choose a bin size, h
- Partition the real line into intervals I_k = [t_k, t_{k+1}] of equal length h
- Histogram estimator (for x in bin I_k):

  \hat{f}_H(x) = \frac{\#\{X_i : t_k \le X_i \le t_{k+1}\}}{nh}.

A step function that depends heavily on both the origin, t_0, and the bin width, h
Example: Old Faithful geyser

Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park
> hist(faithful$eruptions,probability = T)
Example: Old Faithful geyser
What happens if we change the time origin and the bin width?
> hist(faithful$eruptions,breaks=seq(1.5,5.5,1),probability = T,xlim=c(1,5.5))
> hist(faithful$eruptions,breaks=seq(1.1,5.1,1),probability = T,xlim=c(1,5.5))
> hist(faithful$eruptions,breaks=seq(0.5,5.5,0.5),probability = T,xlim=c(1,5.5))
> hist(faithful$eruptions,breaks=seq(0.75,5.75,0.5),probability = T,xlim=c(1,5.5))
Example: Old Faithful geyser
Density estimator

Can we do better? We can get rid of the time origin.
  f_X(x) = \lim_{h \to 0} \frac{F_X(x + h) - F_X(x)}{h}
         = \lim_{h \to 0} \frac{F_X(x) - F_X(x - h)}{h}

Combining the two expressions:

  f_X(x) = \lim_{h \to 0} \frac{F_X(x + h) - F_X(x - h)}{2h}
         = \lim_{h \to 0} \frac{P(x - h < X < x + h)}{2h}

which we can estimate using proportions:

  \hat{f}_X(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),

with K(x) = \frac{1}{2} \cdot I\{|x| < 1\}
Density estimator
- Similar to the histogram
  - No longer depends on the origin, t_0
  - More flexible
- Constructs a box of length 2h around each observation X_i
- The estimator at x is then the sum of the boxes at x
- A density that depends on the bandwidth, h
Kernel estimators
- Put a smooth, symmetric 'bump' of shape K around each observation
- The estimator at x is now the sum of the bumps at x
- We define

  \hat{f}_X(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),

- K: the 'kernel' function
- The estimator inherits the properties of K
  - If K is continuous and differentiable, so is the estimator
  - The estimator is a density if K is a density
- The shape of K does not influence the resulting estimator much
- The estimator does, however, depend heavily on h
Kernel estimators
Kernels
A kernel is a real-valued function K(x) such that

- K(x) ≥ 0 for all x ∈ R,
- \int K(x)\,dx = 1,
- \int x K(x)\,dx = 0.

In practice, the choice of K does not influence the results much, but the value of h is crucial
Kernels
Commonly used kernels include

- Boxcar: K(x) = I(x)/2
- Gaussian: K(x) = (2π)^{-1/2} exp(-x²/2)
- Epanechnikov: K(x) = (3/4)(1 - x²) I(x)
- Biweight: K(x) = (15/16)(1 - x²)² I(x)
- Triweight: K(x) = (35/32)(1 - x²)³ I(x)
- Uniform: K(x) = (1/2) I(x)

where I(x) = 1 if |x| ≤ 1 and I(x) = 0 otherwise
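A minimal sketch of the kernel density estimator in R, assuming a Gaussian kernel and the Old Faithful eruption data used below (the function name kde is hypothetical); the built-in density() is shown for comparison:

# kernel density estimate at a grid of points, Gaussian kernel
kde <- function(x, data, h) {
  sapply(x, function(u) mean(dnorm((u - data) / h)) / h)
}

X <- faithful$eruptions
grid <- seq(1, 6, length.out = 200)
plot(grid, kde(grid, X, h = 0.2), type = "l",
     xlab = "Eruption duration (min)", ylab = "Density")
lines(density(X, bw = 0.2, kernel = "gaussian"), col = 2)  # built-in estimate, for comparison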
Example: Old Faithful geyser
Kernel regression
Kernel regression
Y = m(x) + ε
1. We want to estimate E(Y | X = x). Naive estimator:

   \hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i}{n}.

   The same for all x
2. Average the Y_i's of only those X_i's that are close to x (local average):

   \hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i \cdot I\{|X_i - x| < h\}}{\sum_{i=1}^{n} I\{|X_i - x| < h\}}.

   h: bandwidth, determines the size of the neighbourhood around x
Kernel regression
Y = m(x) + ε
3. Give a slowly decreasing weight to X_i as it gets further from x, rather than giving the same weight to all observations close to x:

   \hat{m}(x) = \sum_{i=1}^{n} Y_i W(x - X_i),

   W(·): a weight function that decreases with distance from x and satisfies \sum_{i=1}^{n} W(x - X_i) = 1
Nadaraya-Watson estimator
  W(x - X_i) = K\left(\frac{x - X_i}{h}\right) \Big/ \sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right)

Hence, the Nadaraya-Watson kernel estimator is

  \hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right)}

The estimated function values \hat{Y}_j = \hat{m}(x_j) at the observed design points are given by

  \hat{Y}_j = \sum_i S_{ji} Y_i, \quad \text{where } S_{ji} = \frac{K\left(\frac{x_j - x_i}{h}\right)}{\sum_k K\left(\frac{x_j - x_k}{h}\right)},

The kernel smoother is thus a linear smoother
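A minimal sketch of the Nadaraya-Watson estimator in R (assuming a Gaussian kernel; the function name nw and the simulated data are illustrative only):

# Nadaraya-Watson kernel regression estimate at a vector of points x0
nw <- function(x0, x, y, h) {
  sapply(x0, function(u) {
    w <- dnorm((u - x) / h)     # kernel weights
    sum(w * y) / sum(w)         # weighted local average
  })
}

set.seed(1)
x <- sort(runif(200)); y <- sin(2 * pi * x) + rnorm(200, 0, 0.3)
plot(x, y)
lines(x, nw(x, x, y, h = 0.05), col = 2)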
Local least squares
We can rewrite the kernel regression estimator as

  \hat{m}(x) = \arg\min_{m_x \in \mathbb{R}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) (Y_i - m_x)^2.

Exercise: this can be verified by solving

  \frac{d}{dm_x} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) (Y_i - m_x)^2 = 0

- Thus, for every fixed x, we have to search for the best local constant m_x such that the localized sum of squares is minimized
- Localization is described here by the kernel, which gives a large weight to those observations (x_i, Y_i) where x_i is close to the point x of interest.
The choice of the bandwidth h is crucial
Example: FTSE stock market index
Example: Cosmic microwave background data
Choosing the bandwidth
Choosing the bandwidth
- Measure the "success" of the fit using the mean squared error (MSE) on new observations,

  MSE(h) = E[(Y - \hat{m}_h(x))^2],

- Splitting into noise, bias and variance:

  MSE(h) = Noise + Bias² + Variance

- Bias decreases as h ↓ 0
- Variance increases as h ↓ 0
- Choosing the bandwidth is a bias-variance trade-off.
Choosing the bandwidth
- But... we don't know MSE(h), as we just have the n observations and cannot generate new random observations Y
- A first idea is to compute, for various values of the bandwidth h:
  - The estimator \hat{m}_h for a training sample
  - The training error

    n^{-1} \sum_{i=1}^{n} (Y_i - \hat{m}_h(x_i))^2 = n^{-1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,

- Then choose the bandwidth with the smallest training error
CMB data
Choosing the bandwidth
- We would choose a bandwidth close to h = 0, giving near-perfect interpolation of the data, that is \hat{m}(x_i) ≈ Y_i
- This is unsurprising
- Parametric context
  - The shape of the model is fixed
  - Minimising the MSE makes the parametric model as close as possible to the data
- Nonparametric setting
  - We don't have a fixed shape
  - The value of h dictates the model
  - Minimising the training MSE makes the fitted model as close as possible to the data
  - This leads us to choose h as small as possible
  - Interpolation of the data
- A misleading result: only noise is fitted for very small bandwidths
Cross-validation
- Solution: don't use X_i to construct \hat{m}(X_i)
- This is the idea behind cross-validation
  - Leave-one-out cross-validation
  - Least squares cross-validation
- For each value of h:
  - For each i = 1, ..., n, compute the estimator \hat{m}_h^{(-i)}(x), where \hat{m}_h^{(-i)}(x) is computed without using observation i
  - The estimated MSE is then given by

    \widehat{MSE}(h) = n^{-1} \sum_i (Y_i - \hat{m}_h^{(-i)}(x_i))^2
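A brute-force sketch of leave-one-out cross-validation in R, reusing the nw() Nadaraya-Watson function and the simulated x, y from the earlier sketch (the name loocv and the bandwidth grid are illustrative):

# leave-one-out CV score for one candidate bandwidth
loocv <- function(h, x, y) {
  errs <- sapply(seq_along(y), function(i) {
    yhat_i <- nw(x[i], x[-i], y[-i], h)   # fit without observation i
    (y[i] - yhat_i)^2
  })
  mean(errs)
}

hs <- seq(0.02, 0.3, by = 0.02)
cv <- sapply(hs, loocv, x = x, y = y)
hs[which.min(cv)]   # bandwidth minimising the estimated MSE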
CMB data
CMB data
A bandwidth of 44 minimises the estimated MSE
Cross-validation
- A drawback of LOO-CV is that it is expensive to compute
  - The fit has to be recalculated n times (once for each left-out observation)
- We can avoid needing to calculate \hat{m}^{(-i)}(x) for all i
- For some n × n matrix S, the linear smoother fulfils \hat{Y} = SY

The risk (MSE) under LOO-CV can subsequently be written as

  \widehat{MSE}(h) = n^{-1} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - S_{ii}} \right)^2
Cross-validation
- We do not need to recompute \hat{m}_h while leaving out each of the n observations in turn
- The same result can be obtained much faster by rescaling the residuals

  Y_i - \hat{m}_h(x_i)

  by dividing by the factor (1 - S_{ii})

- S_{ii} is the i-th diagonal entry of the smoothing matrix
Generalized Cross-Validation
  \widehat{MSE}(h) = n^{-1} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - S_{ii}} \right)^2,

Replace S_{ii} by its average ν/n (where ν = \sum_i S_{ii}).

Choose the bandwidth h that minimizes

  GCV(h) = n^{-1} \sum_{i=1}^{n} \left( \frac{Y_i - \hat{m}_h(x_i)}{1 - \nu/n} \right)^2.
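A sketch of the LOO shortcut and GCV criteria in R, assuming the smoothing matrix S for the Nadaraya-Watson smoother is built with a Gaussian kernel and reusing x, y and the candidate grid hs from the earlier sketches (the helper name smooth_matrix is illustrative):

# row-normalised kernel weight matrix: each row of S sums to 1
smooth_matrix <- function(x, h) {
  K <- outer(x, x, function(a, b) dnorm((a - b) / h))
  sweep(K, 1, rowSums(K), "/")
}

scores <- sapply(hs, function(h) {
  S <- smooth_matrix(x, h)
  yhat <- as.vector(S %*% y)
  loo <- mean(((y - yhat) / (1 - diag(S)))^2)        # LOO shortcut
  gcv <- mean(((y - yhat) / (1 - mean(diag(S))))^2)  # GCV
  c(loo = loo, gcv = gcv)
})
hs[apply(scores, 1, which.min)]   # minimising bandwidths for each criterion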
Local polynomial regression
Nadaraya-Watson kernel estimator
- A major disadvantage of the Nadaraya-Watson kernel estimator is boundary bias
- The bias is of large order at the boundaries
Local polynomial regression
- Even when a curve doesn't look like a polynomial:
  - Restrict attention to a small neighbourhood of a given point, x
  - Approximate the curve by a polynomial in that neighbourhood
- Fit its coefficients using only observations X_i close to x
  - (or rather, putting more weight on observations close to x)
- Repeat this procedure at every point x where we want to estimate m(x)
Local polynomial regression
- The kernel estimator approximates the data by taking local averages within small bandwidths
- Instead, use local linear regression to obtain the approximation
Kernel regression estimator
Recall that the kernel regression estimator is the solution to:

  \hat{m}(x) = \arg\min_{m(x) \in \mathbb{R}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) (Y_i - m(x))^2

This is given by

  \hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right)}.

Thus the estimate corresponds to the solution of a weighted sum of squares problem
Local polynomial regression
Using a Taylor series, we can approximate m(x), for x close to a point x_0, by the following polynomial:

  m(x) ≈ m(x_0) + m^{(1)}(x_0)(x - x_0) + \frac{m^{(2)}(x_0)}{2!}(x - x_0)^2 + \dots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p
       = m(x_0) + \beta_1 (x - x_0) + \beta_2 (x - x_0)^2 + \dots + \beta_p (x - x_0)^p

where m^{(k)}(x_0) = k! \beta_k, provided that all the required derivatives exist
Local polynomial regression
- Use the data to estimate the polynomial of degree p which best approximates m(x_i) in a small neighbourhood around the point x
- Minimise, with respect to β_0, β_1, ..., β_p, the function:

  \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1(x_i - x) - \dots - \beta_p(x_i - x)^p\}^2 K\left(\frac{x - x_i}{h}\right)

- A weighted least squares problem, where the weights are given by the kernel functions K((x - x_i)/h)
- As m^{(k)}(x) = k! \beta_k, we then have that m(x) = \beta_0, or

  \hat{m}(x) = \hat{\beta}_0
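A minimal sketch of local linear regression (p = 1) in R, assuming a Gaussian kernel and reusing the simulated x, y from before; in practice one would use locpoly() from the KernSmooth package, as in the Doppler example later. The function name locallin is illustrative:

# local linear fit at a single point x0: weighted least squares with kernel weights
locallin <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)                # kernel weights
  fit <- lm(y ~ I(x - x0), weights = w)   # regress on (x - x0)
  unname(coef(fit)[1])                    # beta_0 = m_hat(x0)
}

x0grid <- seq(min(x), max(x), length.out = 100)
mhat <- sapply(x0grid, locallin, x = x, y = y, h = 0.05)
plot(x, y); lines(x0grid, mhat, col = 2)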
CMB data
Red: Kernel smoother (p = 0)
Green: Local linear regression (p = 1)
Boundary bias: kernel estimator
Let \ell_i(x) = \omega_i(x) / \sum_j \omega_j(x), so that

  \hat{m}(x) = \sum_i \ell_i(x) Y_i

For the kernel smoother (p = 0), the bias of the linear smoother is thus

  Bias = E(\hat{m}(x)) - m(x) = m'(x) \sum_i (x_i - x)\ell_i(x) + \frac{m''(x)}{2} \sum_i (x_i - x)^2 \ell_i(x) + R,
Boundary bias: Kernel estimator
The first term in the expansion is equal to

  m'(x) \frac{\sum_i (x_i - x) K\left(\frac{x - x_i}{h}\right)}{\sum_j K\left(\frac{x - x_j}{h}\right)}

- This vanishes if the design points x_i are centred symmetrically around x
- It does not vanish if x sits at the boundary (all x_i - x will have the same sign)
Boundary bias: polynomial estimator
- Suppose m(x) is truly a local polynomial of degree p
- And there are at least p + 1 points with positive weights in the neighbourhood of x

The bias will hence be of order

  Bias = E(\hat{m}(x)) - m(x) = \frac{m^{(p+1)}(x)}{(p + 1)!} \sum_j (x_j - x)^{p+1} \ell_j(x) + R
Why not choose p = 20?
Boundary bias: polynomial estimator
- Y_i = m(x_i) + σε_i with ε_i ∼ N(0, 1)
- The variance of the linear smoother, \hat{m}(x) = \sum_j \ell_j(x) Y_j, is

  Var(\hat{m}(x)) = \sigma^2 \sum_j \ell_j^2(x) = \sigma^2 \|\ell(x)\|^2

- \|\ell(x)\|^2 tends to be large if p is large
- In practice, p = 1 is a good choice
Example: Doppler function
  m(x) = \sqrt{x(1 - x)} \sin\left(\frac{2.1\pi}{x + 0.05}\right), \quad 0 \le x \le 1.
Example: Doppler function
> library(KernSmooth)   # for locpoly() and dpill()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)
Example: Doppler function
Penalised regression
Penalised regression
Regression model, i = 1, ..., n:

  Y_i = m(x_i) + \varepsilon_i, \quad E(\varepsilon_i) = 0

Estimating m by choosing \hat{m}(x) to minimize

  \sum_{i=1}^{n} (Y_i - \hat{m}(x_i))^2

leads to

- the linear regression estimate if minimizing over all linear functions
- an interpolation of the data if minimizing over all functions.
Penalised regression
Estimate m by choosing \hat{m}(x) to minimize

  \sum_{i=1}^{n} (Y_i - \hat{m}(x_i))^2 + \lambda J(\hat{m}),

J(m): a roughness penalty. Typically

  J(m) = \int (m''(x))^2 \, dx.

The parameter λ controls the trade-off between fit and penalty

- For λ = 0: interpolation
- For λ → ∞: the linear least squares line
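A short illustration in R (a sketch reusing the Doppler x and y simulated in the example above; smooth.spline fits a penalised cubic smoothing spline and exposes λ directly, so the two extremes can be seen by varying it):

# effect of the penalty parameter lambda in a cubic smoothing spline
plot(x, y)
lines(smooth.spline(x, y, lambda = 1e-9), col = 2)  # small lambda: wiggly, close to interpolation
lines(smooth.spline(x, y, lambda = 10), col = 4)    # large lambda: approaches the least squares line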
Example: Doppler function
Splines
Splines
- Kernel regression
  - The researcher isn't usually interested in calculating \hat{m}(x) for a single location x
  - \hat{m}(x) is calculated on a sufficiently fine grid of x-values
  - The curve is obtained by interpolation
- Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
- Alternative approach
  - Represent the fit as a piecewise polynomial
  - Pieces connect at points called knots
  - Once the knots are selected, an estimator can be computed globally
  - In a manner similar to that for a parametrically specified mean function
This is the idea behind splines
Splines
IID sample (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) coming from the model

  Y_i = m(X_i) + \varepsilon_i

We want to estimate the mean of the variable Y with m(x) = E(Y | X = x).

A very naive estimator of E(Y | X = x) would be the sample mean of the Y_i:

  \hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i}{n}

Not very good (the same for all x)
Splines
Approximate m by piecewise polynomials, each on a small interval. The simplest case uses piecewise constants:

  \hat{m}(x) = \begin{cases} c_1 & \text{if } x < \xi_1 \\ c_2 & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ c_k & \text{if } \xi_{k-1} \le x < \xi_k \\ c_{k+1} & \text{if } x \ge \xi_k \end{cases}
Splines
Use more general lines, which join at the ξ's:

  \hat{m}(x) = \begin{cases} a_1 + b_1 x & \text{if } x < \xi_1 \\ a_2 + b_2 x & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ a_k + b_k x & \text{if } \xi_{k-1} \le x < \xi_k \\ a_{k+1} + b_{k+1} x & \text{if } x \ge \xi_k \end{cases}

The a's and b's are such that the lines join at each ξ
Splines
Approximate m(x) by piecewise polynomials:

  \hat{m}(x) = \begin{cases} \sum_{j=0}^{p} \beta_{1,j} x^j & \text{if } x < \xi_1 \\ \sum_{j=0}^{p} \beta_{2,j} x^j & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ \sum_{j=0}^{p} \beta_{k,j} x^j & \text{if } \xi_{k-1} \le x < \xi_k \\ \sum_{j=0}^{p} \beta_{k+1,j} x^j & \text{if } x \ge \xi_k \end{cases}

The β's are such that the polynomials join at each ξ and the approximation has p − 1 continuous derivatives.

Splines which are piecewise polynomials of degree p are called

- splines of order p + 1, or
- splines of degree p

ξ: the knots
Splines
Piecewise constant splines
Knots
How many knots should we have?

- Choosing a lot of knots, well spread over the data range, reduces the bias of the estimator
- But if we make it too local, the estimator will be too wiggly
- To overcome the bias problem without increasing the variance: take a lot of knots, but constrain their influence
- We can do this using penalised regression
Spline order
What order of spline should we use?

- Increasing the value of p makes the estimator \hat{m}_p smoother (since it has p − 1 continuous derivatives)
- If we take p too large, we increase the number of parameters to estimate
- In practice it is rarely useful to take p > 3
  - p = 2: splines of order three, or quadratic splines
  - p = 3: splines of order 4, or cubic splines
- A p-th order spline is a piecewise polynomial of degree p − 1 with p − 2 continuous derivatives at the knots
Natural splines
- Natural spline: a spline that is linear beyond the boundary knots is called a natural spline
- Why this constraint?
  - We usually have very few observations beyond the two extreme knots
  - We still want an estimator of the regression curve there
  - We cannot reasonably estimate anything complex there correctly
  - So we rather use a simplified model (e.g. linear)
  - This often gives more or less reasonable results
Natural cubic splines
Let ξ_1 < ξ_2 < ... < ξ_n be a set of ordered points, so-called knots, contained in an interval (a, b).

A cubic spline is a continuous function m such that

(i) m is a cubic polynomial over (ξ_1, ξ_2), ..., and
(ii) m has continuous first and second derivatives at the knots.

The solution to

  \sum_{i=1}^{n} (y_i - \hat{m}(x_i))^2 + \lambda \int (\hat{m}''(x))^2 \, dx

is a natural cubic spline with knots at the data points.

\hat{m}(x) is called a smoothing spline
Natural cubic splines
- Sequence of values f_1, ..., f_n at specified locations x_1 < x_2 < ... < x_n
- Find a smooth curve g(x) that passes through the points (x_1, f_1), (x_2, f_2), ..., (x_n, f_n)
- The natural cubic spline g is an interpolating function that satisfies the following conditions:
  (i) g(x_j) = f_j, j = 1, ..., n,
  (ii) g(x) is cubic on each subinterval (x_k, x_{k+1}), k = 1, ..., (n − 1),
  (iii) g(x) is continuous and has continuous first and second derivatives,
  (iv) g''(x_1) = g''(x_n) = 0.
B-splines
- We need a basis for natural polynomial splines
- Convenient is the so-called B-spline basis
- Data points a = ξ_0 < ξ_1 < ξ_2 < ... < ξ_n ≤ ξ_{n+1} = b in (a, b)
  - There are n + 2 real values
  - The n ≥ 0 interior values are called 'interior knots' or 'control points'
  - And there are always two endpoints, ξ_0 and ξ_{n+1}
- When the knots are equidistant they are said to be 'uniform'
B-splines
Now define new knots τ as

- τ_1 ≤ ... ≤ τ_p = ξ_0 = a
- τ_{j+p} = ξ_j
- b = ξ_{n+1} = τ_{n+p+1} ≤ τ_{n+p+2} ≤ ... ≤ τ_{n+2p}

- p: the order of the polynomial; p + 1 is the order of the spline
- We append the lower and upper boundary knots ξ_0 and ξ_{n+1} p times
- This is needed due to the recursive nature of B-splines
B-splines

Define recursively:

- For k = 0 and i = 1, ..., n + 2p:

  B_{i,0}(x) = \begin{cases} 1 & \tau_i \le x < \tau_{i+1} \\ 0 & \text{otherwise} \end{cases}

- For k = 1, 2, ..., p and i = 1, ..., n + 2p:

  B_{i,k}(x) = \frac{x - \tau_i}{\tau_{i+k-1} - \tau_i} B_{i,k-1}(x) + \frac{\tau_{i+k} - x}{\tau_{i+k} - \tau_{i+1}} B_{i+1,k-1}(x)

The support of B_{i,k}(x) is [\tau_i, \tau_{i+k}]
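A recursive sketch in R that follows the recursion exactly as written above (treating terms with a zero denominator as zero, and with an assumed uniform knot vector purely for illustration); in practice one would use bs() or splineDesign() from the splines package rather than this hand-rolled version:

# B_{i,k}(x) following the recursion above; zero-denominator terms are taken as 0
bspline <- function(x, i, k, tau) {
  if (k == 0) {
    return(as.numeric(tau[i] <= x & x < tau[i + 1]))
  }
  d1 <- tau[i + k - 1] - tau[i]
  d2 <- tau[i + k] - tau[i + 1]
  a <- if (d1 > 0) (x - tau[i]) / d1 * bspline(x, i, k - 1, tau) else 0
  b <- if (d2 > 0) (tau[i + k] - x) / d2 * bspline(x, i + 1, k - 1, tau) else 0
  a + b
}

# plot one basis function on a hypothetical uniform knot vector
tau <- seq(0, 1, by = 0.1)
xs <- seq(0, 1, length.out = 200)
plot(xs, sapply(xs, bspline, i = 3, k = 3, tau = tau), type = "l")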
B-splines
Solving
- The solution depends on the regularization parameter λ
  - It determines the amount of "roughness"
- Choosing λ isn't necessarily intuitive
- Degrees of freedom = trace of the smoothing matrix S
  - The sum of its eigenvalues

  S = B(B^T B + \lambda\Omega)^{-1} B^T

- There is a monotone relationship between df and λ
  - Search for the value of λ giving the desired df
- df = 2: linear regression
- df = n: interpolate the data exactly
Example: Doppler function
Example: Doppler function
Could of course choose λ by LOO-CV or GCV
Cross validation

> plot(x, y)
> fitcv <- smooth.spline(x, y, cv=T)
> lines(fitcv, col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575
Generalised cross validation

> plot(x, y)
> fitgcv <- smooth.spline(x, y, cv=F)
> lines(fitgcv, col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571
Multivariate smoothing
Multivariate smoothing
- So far we have only considered univariate functions
- Suppose there are several predictors that we would like to treat nonparametrically
- Most 'interesting' statistical problems nowadays are high-dimensional with, easily, p > 1000
  - Biology: microarrays, gene maps, network inference
  - Finance: prediction from multivariate time series
  - Physics: climate models
- Can we just extend the methods and model functions R^p → R nonparametrically?
Curse of dimensionality
- One might consider multidimensional smoothers aimed at estimating:

  Y = m(x_1, x_2, ..., x_p)

- The methods considered rely on 'local' approximations
  - They examine the behaviour of data points in the neighbourhood of the point of interest
- But what do 'local' and 'neighbourhood' mean if p → ∞ with n constant?
Curse of dimensionality
  x = (x^{(1)}, x^{(2)}, ..., x^{(p)}) ∈ [0, 1]^p.

To get 5% of all n sample points into a cube-shaped neighbourhood of x, we need a cube with side length 0 < ℓ < 1 such that

  ℓ^p ≥ 0.05,  i.e.  ℓ = 0.05^{1/p}

  Dimension p   Side length ℓ
            1            0.05
            2            0.22
            5            0.54
           10            0.74
         1000           0.997
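A one-line check of this table in R:

# side length of a cube containing 5% of the unit hypercube's volume, by dimension
p <- c(1, 2, 5, 10, 1000)
round(0.05^(1 / p), 3)
# 0.050 0.224 0.549 0.741 0.997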
Additive models
Require the function m : R^p → R to be of the form

  m_{add}(x) = \mu + m_1(x^{(1)}) + m_2(x^{(2)}) + \dots + m_p(x^{(p)})
             = \mu + \sum_{j=1}^{p} m_j(x^{(j)}), \quad \mu \in \mathbb{R}

where each m_j(·) : R → R is just a univariate nonparametric function with

  E[m_j(x^{(j)})] = 0, \quad j = 1, ..., p

- The choice of smoother is left open
- Avoids the curse of dimensionality at the price of being 'less flexible'
- The functions can be estimated by 'backfitting'
Backfitting
Data x_i^{(j)}, 1 ≤ i ≤ n and 1 ≤ j ≤ p.

A linear smoother for variable j can be described by an n × n matrix S^{(j)}, so that

  \hat{m}_j = S^{(j)} Y,

- Y = (Y_1, ..., Y_n)^T: the observed vector of responses
- \hat{m}_j = (\hat{m}_j(x_1^{(j)}), ..., \hat{m}_j(x_n^{(j)})): the regression fit
- S^{(j)}: a smoother with bandwidth estimated by LOO-CV or GCV
Backfitting
  m_{add}(x) = \mu + \sum_{j=1}^{p} m_j(x^{(j)}),

Suppose \hat{\mu} and \hat{m}_k are given for all k ≠ j:

  m_{add}(x_i) = \left( \mu + \sum_{k \ne j} m_k(x_i^{(k)}) \right) + m_j(x_i^{(j)})

The idea is now to apply the smoother S^{(j)} to

  Y - \left( \hat{\mu} + \sum_{k \ne j} \hat{m}_k \right)

Cycle through all j = 1, ..., p to get

  \hat{m}_{add}(x_i) = \hat{\mu} + \sum_{j=1}^{p} \hat{m}_j(x_i^{(j)}).
Backfitting
1. Set \hat{\mu} \leftarrow n^{-1} \sum_{i=1}^{n} Y_i. Start with \hat{m}_j \equiv 0 for all j = 1, ..., p.

2. Cycle through the indices j = 1, 2, ..., p, 1, 2, ..., p, ...:

   \hat{m}_j \leftarrow S^{(j)} \left( Y - \hat{\mu}\mathbf{1} - \sum_{k \ne j} \hat{m}_k \right).

   Also normalize

   \hat{m}_j(\cdot) \leftarrow \hat{m}_j(\cdot) - n^{-1} \sum_{i=1}^{n} \hat{m}_j(x_i^{(j)})

   and update \hat{\mu} \leftarrow n^{-1} \sum_{i=1}^{n} \left( Y_i - \sum_k \hat{m}_k(x_i^{(k)}) \right).

   Stop the iterations if the functions do not change very much.

3. Return

   \hat{m}_{add}(x_i) \leftarrow \hat{\mu} + \sum_{j=1}^{p} \hat{m}_j(x_i^{(j)})
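A minimal sketch of this backfitting loop in R, using smooth.spline as the univariate smoother for each coordinate (an assumed choice, since the algorithm leaves the smoother open) on illustrative simulated data with two additive components; all variable names are hypothetical:

set.seed(1)
n <- 300
X <- cbind(runif(n), runif(n))
Y <- sin(2 * pi * X[, 1]) + (X[, 2] - 0.5)^2 + rnorm(n, 0, 0.2)

mu <- mean(Y)
mhat <- matrix(0, n, ncol(X))              # fitted m_j(x_i^(j)), started at 0

for (iter in 1:20) {                       # fixed number of cycles for simplicity
  for (j in seq_len(ncol(X))) {
    partial <- Y - mu - rowSums(mhat[, -j, drop = FALSE])
    fit <- smooth.spline(X[, j], partial)  # smooth the partial residuals on x^(j)
    mhat[, j] <- predict(fit, X[, j])$y
    mhat[, j] <- mhat[, j] - mean(mhat[, j])   # normalize to mean zero
  }
  mu <- mean(Y - rowSums(mhat))            # update the intercept
}

fitted_add <- mu + rowSums(mhat)           # additive fit at the design points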
Example: Ozone data
Example: Ozone data
Iteration 1
Iteration 2
Iteration 3
Iteration 7