Data Mining 2015
SECTION 3  Modeling

When dealing with large amounts of data it may be possible to find a model for the data that will reduce it to a much more compact form. For example, if the (x, y) points lie on a straight line, we would only need the slope, intercept, and x values to reproduce the data. If we only wish to be able to determine a y value for any x value then we need only have the slope and intercept. In actual fact, the points may not lie exactly on the line but they may be close enough that a straight line is a close representation of the actual data and, as such, the slope and intercept will be adequate to represent the data. In reality our situation is generally more complicated than this, but we may still be able to find a model that will reduce the amount of information we need to retain.

In other cases, when looking at a large data set it is often difficult to assess its behaviour. For example, if the data is the minute by minute variation of the stock market it is hard to see any pattern to the data. On the other hand, the day to day observations may show some trends, while year to year observations will probably reveal even more. In order to understand our data we may find it helpful to ignore the minute detail and look at these trends.

One possible way of doing this might be to determine an interpolating polynomial for the data.
3.1 Newton's interpolatory divided-difference formula

x     f(x)     First Divided Difference                  Second Divided Difference
x0    f(x0)
               f[x0,x1] = (f(x1) - f(x0))/(x1 - x0)
x1    f(x1)                                              f[x0,x1,x2] = (f[x1,x2] - f[x0,x1])/(x2 - x0)
               f[x1,x2] = (f(x2) - f(x1))/(x2 - x1)
x2    f(x2)                                              f[x1,x2,x3] = (f[x2,x3] - f[x1,x2])/(x3 - x1)
               f[x2,x3] = (f(x3) - f(x2))/(x3 - x2)
x3    f(x3)
Pn(x) = f(x0) + (x - x0) f[x0,x1] + (x - x0)(x - x1) f[x0,x1,x2]
        + ... + (x - x0)(x - x1)...(x - x_{n-1}) f[x0,x1,...,xn]

      = f(x0) + Σ_{k=1}^{n} (x - x0)...(x - x_{k-1}) f[x0,...,xk]
Modeling 72 © Mills2015
Set up the folders for the code and data and read in the required files:
drive <- "D:"
code.dir <- paste(drive, "DATA/Data Mining R-Code", sep="/")
data.dir <- paste(drive, "DATA/Data Mining Data", sep="/")
source(paste(code.dir, "Conv_run.r", sep="/"))
source(paste(code.dir, "Newton_Interp.r", sep="/"))
The file Newton_Interp.r has the following:

#-----------------------------------------------------------------
# Polynomial interpolation (Newton's Divided Difference) given the
# x and y values.
# The y values are replaced by the differences (locally).
#-----------------------------------------------------------------
InterpNewton <- function(knot.x, knot.y) {
  n <- length(knot.x)
  for (k in 1:(n-1)) {
    knot.y[(k+1):n] <- (knot.y[(k+1):n] - knot.y[k])/(knot.x[(k+1):n] - knot.x[k])
  }
  knot.y
}
Horner's rule (nested multiplication) gives an efficient way to evaluate polynomials, i.e.

a0 + a1 x + a2 x^2 + ... + a_{n-1} x^{n-1} + an x^n
   = a0 + x(a1 + x(a2 + ... + x(a_{n-1} + x an)...))
#-----------------------------------------------------------------
# Use Horner's rule z*(c[n] + z*(c[n-1] + z*(...))) to evaluate a
# polynomial given the coefficients and knots and point(s)
#-----------------------------------------------------------------
HornerN <- function(coef, knot.x, z) {
  n <- length(knot.x)
  polyv <- coef[n]*rep(1, length(z))
  for (k in (n-1):1) {
    polyv <- (z - knot.x[k])*polyv + coef[k]
  }
  polyv
}
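As a quick check of the pair of routines (the knot values below are illustrative, not from the text), noise-free data from a cubic should be reproduced exactly, since four knots determine a cubic:

```r
# InterpNewton and HornerN repeated from above so this chunk runs on its own.
InterpNewton <- function(knot.x, knot.y) {
  n <- length(knot.x)
  for (k in 1:(n-1)) {
    knot.y[(k+1):n] <- (knot.y[(k+1):n] - knot.y[k])/(knot.x[(k+1):n] - knot.x[k])
  }
  knot.y
}
HornerN <- function(coef, knot.x, z) {
  n <- length(knot.x)
  polyv <- coef[n]*rep(1, length(z))
  for (k in (n-1):1) {
    polyv <- (z - knot.x[k])*polyv + coef[k]
  }
  polyv
}
knot.x <- c(0, 1, 2, 4)
knot.y <- knot.x^3 - 2*knot.x + 1      # noise-free data from a cubic
coef <- InterpNewton(knot.x, knot.y)   # divided-difference coefficients
HornerN(coef, knot.x, c(0.5, 3))       # 0.125 22, i.e. z^3 - 2z + 1 exactly
```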
To illustrate, consider the noisy sinusoid (Figure 1).

set.seed(1234, "default")        # Seed for random numbers
# Compute sine at 0, 0.1, 0.2, ... for plotting a noisy sine
x.rough <- seq(0, 6, by=0.1)
numb.rough <- length(x.rough)
y.rough <- sin(x.rough) + runif(numb.rough) - 0.5   # Add uniform noise
# Points at which to interpolate
x.in <- seq(0, 6, by=0.01)
The following allows us to reproduce results.
noisy.sine <- list(x.rough, y.rough, x.in)
save(noisy.sine, file = paste(data.dir, "noisySine.Rdata", sep="/"))

(the above must be commented out for any subsequent runs to allow the previous results to be retained) and we recover the data with

load(paste(data.dir, "noisySine.Rdata", sep="/"))
x.rough <- noisy.sine[[1]]
y.rough <- noisy.sine[[2]]
x.in <- noisy.sine[[3]]
plot(x.rough, y.rough, col="red", pch = 16)   # Plot the points
curve(sin, 0, 6, add=T, col="blue")           # Add in the sine
and do an interpolation (Figure 2).

# Use Newton Interpolation to get the coefficients
# and Horner's method to compute the polynomial.
y.rough.in <- HornerN(InterpNewton(x.rough, y.rough), x.rough, x.in)
lines(x.in, y.rough.in, col="green")
Figure 1. Noisy sinusoid Figure 2. Interpolation of noisy sine
While the interpolant passes through every data point (and hence represents the given data accurately), it has no predictive power. We need our estimate to give good results both for the data used in creating the model and for data from the same population not used in creating the model. Figure 2 is an extreme example of overfitting. The reason that interpolation is bad for data subject to error is seen in the simple case of data that forms a straight line, except that one point is a bit in error.
We can do an interpolation on a straight line (Figure 3)

x <- -5:5
y <- -5:5
x.by.1 <- seq(-6, 6, by=.1)
y.by.1 <- seq(-5, 5, by=.1)
y.in <- HornerN(InterpNewton(x, y), x, x.by.1)
plot(x, y, col="red", xlim=c(-6, 6), ylim=c(-6, 7))
lines(x.by.1, y.in, col="green")
lines(c(-5, 5), c(-5, 5), col="blue")
or with one point in error (Figure 4).

#-----------------------------------------------------------------
# Look at what happens with a slight deviation
#-----------------------------------------------------------------
x.blip <- -5:5
y.blip <- -5:5
x.blip.by.1 <- seq(-6, 6, by=.1)
y.blip.by.1 <- seq(-5, 5, by=.1)
y.blip[8] <- 2.5   # Move 1 point
y.blip
 [1] -5.0 -4.0 -3.0 -2.0 -1.0  0.0  1.0  2.5  3.0  4.0  5.0
y.blip.in <- HornerN(InterpNewton(x.blip, y.blip), x.blip, x.blip.by.1)
plot(x.blip, y.blip, col="red", pch=16, xlim=c(-6, 6), ylim=c(-6, 7))
lines(x.blip.by.1, y.blip.in, col="green")
lines(c(-5, 5), c(-5, 5), col="blue")
Figure 3. Interpolation on straight line    Figure 4. Interpolation on straight line with bump
The interpolation of the straight line looks good, as it should because, although we have it passing through 11 points, the algorithm gives a linear fit (construct a difference table). In the case of a slight deviation, we see that the interpolating polynomial has large oscillations near the ends - much larger than the original deviation from the line. We have large errors in predictions near the ends. This results from using a high degree polynomial. A better method is to use lower degree polynomials to do the fitting.
3.2 Cubic Splines
The cubic spline is a compromise. Obviously, if we just join together a bunch of cubic polynomials, each passing through 4 points, it would be little better than a piecewise linear fit. Instead we try to fit the cubics through 2 points and make use of the extra constants to make a smoother fit.
Consider a set of points on the plane

(x_0, f(x_0)), (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_n, f(x_n)).

We fit a cubic

S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3,   k = 0, 1, ..., n-1

between (x_k, f(x_k)) and (x_{k+1}, f(x_{k+1})), so that

S(x) = S_k(x)      for x_k <= x <= x_{k+1}
     = S_{k+1}(x)  for x_{k+1} <= x <= x_{k+2}.

At x = x_k, S_k(x_k) = f(x_k) = s_{k,0}. At x = x_{k+1}, S_k(x_{k+1}) = S_{k+1}(x_{k+1}), k = 0, 1, ..., n-1.

To 'use' the other constants, we require a 'smooth' transition from one segment to another - we specify that the slope and curvature (as well as the value) must match at the nodes or knots. We know that the slope at any node is
S_k'(x) = s_{k,1} + 2 s_{k,2}(x - x_k) + 3 s_{k,3}(x - x_k)^2

so

S_k'(x_k) = s_{k,1},   k = 0, 1, ..., n-1

and for the slopes to match we require

S_{k+1}'(x_{k+1}) = S_k'(x_{k+1})

i.e. (writing h_k = x_{k+1} - x_k)

s_{k+1,1} = s_{k,1} + 2 s_{k,2} h_k + 3 s_{k,3} h_k^2    (2)

The curvature at any node is

S_k''(x) = 2 s_{k,2} + 6 s_{k,3}(x - x_k)

so

S_k''(x_k) = 2 s_{k,2}   or   s_{k,2} = (1/2) S_k''(x_k),   k = 0, 1, ..., n-1

For the curvature to match at the node

S_{k+1}''(x_{k+1}) = S_k''(x_{k+1})

so

2 s_{k+1,2} = 2 s_{k,2} + 6 s_{k,3} h_k.
We need more information. This comes from considering the behaviour of S(x) at the end points. One possibility is

S''(x_0) = 0 = S''(x_n)

i.e. no curvature at the ends.

A second possibility is to match the slope of the spline with that of the function (perhaps estimated graphically) so that

S'(x_0) = f'(x_0) = s_{0,1}
S'(x_n) = f'(x_n)

It is not as obvious that this allows us to solve for the s_{k,j} but it does. Each time we add a node we get one more unknown and an additional equation. Hence we always have enough information to obtain a solution (we have no assurance that we can solve the general system at this point).
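The conditions above can be assembled into a small linear system and solved directly. The following is a sketch of the standard natural-spline construction (the helper and its names are illustrative, not code from the text): it solves the tridiagonal system for the second derivatives m_k = S''(x_k), with m_0 = m_n = 0, and then recovers the coefficients s_{k,j} of each piece.

```r
# Sketch (standard natural cubic spline construction; assumes >= 3 knots).
natural.spline <- function(x, y) {
  n <- length(x) - 1               # number of cubic pieces
  h <- diff(x)
  # Tridiagonal system for the interior m_1, ..., m_{n-1}; m_0 = m_n = 0
  A <- matrix(0, n-1, n-1)
  rhs <- numeric(n-1)
  for (k in 1:(n-1)) {
    A[k, k] <- 2*(h[k] + h[k+1])
    if (k > 1)   A[k, k-1] <- h[k]
    if (k < n-1) A[k, k+1] <- h[k+1]
    rhs[k] <- 6*((y[k+2]-y[k+1])/h[k+1] - (y[k+1]-y[k])/h[k])
  }
  m <- c(0, solve(A, rhs), 0)
  # Coefficients of S_k(x) = s0 + s1 (x-x_k) + s2 (x-x_k)^2 + s3 (x-x_k)^3
  k <- 1:n
  cbind(s0 = y[k],
        s1 = (y[k+1]-y[k])/h - h*(2*m[k]+m[k+1])/6,
        s2 = m[k]/2,
        s3 = (m[k+1]-m[k])/(6*h))
}
```

Evaluating each piece at its right-hand knot reproduces the data there, and the slopes and curvatures match where the pieces join.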
Cubic Spline Interpolant: Suppose that {(x_k, y_k)}_{k=0}^{n} are n+1 points, where a = x_0 < x_1 < ... < x_n = b. The function S(x) is called a cubic spline if there exist n cubic polynomials S_k(x) with coefficients s_{k,0}, s_{k,1}, s_{k,2}, and s_{k,3} that satisfy the properties:

1. S(x) = S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3 for x in [x_k, x_{k+1}] and k = 0, 1, ..., n-1;
2. S(x_k) = y_k, for each k = 0, 1, ..., n;
3. S_k(x_{k+1}) = S_{k+1}(x_{k+1}), for each k = 0, 1, ..., n-2;
4. S_k'(x_{k+1}) = S_{k+1}'(x_{k+1}), for each k = 0, 1, ..., n-2;
5. S_k''(x_{k+1}) = S_{k+1}''(x_{k+1}), for each k = 0, 1, ..., n-2.

From one point of view, we can consider free cubic splines as the best interpolant for curve fitting. If we let S be the natural cubic spline function that interpolates our function f at x_0 < x_1 < ... < x_n, and let f''(x) be continuous in the interval [a, b] that contains the x_0, x_1, ..., x_n, then

∫_a^b [S''(x)]^2 dx <= ∫_a^b [f''(x)]^2 dx

In other words, the average value of the curvature of S is never larger than the average value of the curvature of any function f passing through the same nodes.
The R spline routine returns the interpolated values at the number of positions indicated by n = #.

#--------------------
# Do a spline
#--------------------
plot(x.blip, y.blip, col="red", pch=16, xlim=c(-6, 6), ylim=c(-6, 7))
lines(spline(x.blip, y.blip, n = 201), col = "black")
Figure 5. Spline on straight line with bump
This gives a much improved approximation. The error dies out quickly.
The classical example for illustrating the behaviour of interpolation is Runge's function 1/(1 + x^2).
#-----------------------------------------------------------------
# Use Runge's function as an example
#-----------------------------------------------------------------
runge <- function(x) {
  1/(1 + x^2)
}
# Plot it at integers from -5 to 5
x.5.5 <- (-5:5)
y.5.5 <- sapply(x.5.5, runge)
x.5.5by.1 <- seq(-5, 5, by=.1)
# Get the interpolated values at -5, -4.9, -4.8, ...
y.in <- HornerN(InterpNewton(x.5.5, y.5.5), x.5.5, x.5.5by.1)
plot(x.5.5, y.5.5, col=3, ylim=c(-0.5, 2))
lines(x.5.5by.1, y.in, col="red")

#-----------------------------------------------------------------
# Do a spline
#-----------------------------------------------------------------
# Compute and plot spline (spline returns the interpolated values).
lines(spline(x.5.5, y.5.5, n = 201), col = "blue")
Figure 6. The red line is the interpolation; the blue line is the spline.    Figure 7. Computing a spline on a noisy Runge function.
#-----------------------------------------------------------------
# Put some noise on it and compute spline
#-----------------------------------------------------------------
y.noise <- sapply(x.5.5by.1, runge) + rnorm(101, 0, .1)
plot(x.5.5by.1, y.noise, col="green", pch=20, cex=1.3)
lines(spline(x.5.5by.1, y.noise, n = 201), col = "red")
curve(runge, -5, 5, add=T, col="blue")
The problem with splines on noisy data is “where do you put the knots?”
A better approach with noisy data may be to smooth it. A simple way of smoothing involves the use of a running mean, and a way of implementing the running mean is by a convolution. This is a polynomial multiplication of the form

w(k) = Σ_j u(j) v(k + 1 - j).

If both u and v have the same length n, then

w(1) = u(1)v(1)
w(2) = u(1)v(2) + u(2)v(1)
w(3) = u(1)v(3) + u(2)v(2) + u(3)v(1)
  ...
w(n-1) = u(1)v(n-1) + u(2)v(n-2) + ... + u(n-1)v(1)
w(n) = u(1)v(n) + u(2)v(n-1) + ... + u(n)v(1)
w(n+1) = u(2)v(n) + u(3)v(n-1) + ... + u(n)v(2)
  ...
w(2n-2) = u(n-1)v(n) + u(n)v(n-1)
w(2n-1) = u(n)v(n)
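These sums are exactly polynomial multiplication, which can be checked against base R (an aside, not part of the text's development): the help for convolve notes that convolve(x, rev(y), type = "open") performs polynomial multiplication.

```r
# Check w(k) = sum_j u(j) v(k + 1 - j) against base R's convolve().
u <- c(1, 2, 3)
v <- c(4, 5, 6)
# Direct evaluation of the sums above
w <- sapply(1:(length(u) + length(v) - 1), function(k) {
  j <- max(1, k + 1 - length(v)):min(k, length(u))
  sum(u[j] * v[k + 1 - j])
})
w                                            # 4 13 28 27 18
round(convolve(u, rev(v), type = "open"))    # same values
```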
#-----------------------------------------------------------------
# Illustrate convolution of vectors
#-----------------------------------------------------------------
bandwidth <- 5
(x <- 1:30)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
(y <- rep(1, bandwidth))
[1] 1 1 1 1 1

Pad x with leading and trailing zeros.

(x.0 <- c(rep(0, bandwidth-1), x, rep(0, bandwidth-1)))
 [1]  0  0  0  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
[26] 22 23 24 25 26 27 28 29 30  0  0  0  0
(x.1 <- c(rep(0, bandwidth-1), rep(1, length(x)), rep(0, bandwidth-1)))
 [1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
z <- rep(0, length(x))
for (i in 1:(length(x) + bandwidth - 1)) {
  t <- y*x.0[(1:bandwidth) + i - 1]
  n <- sum(y*x.1[(1:bandwidth) + i - 1])
  z[i] <- sum(t)/n
  cat("t = ", t, " n = ", n, " z = ", z[i], "\n")
}
t =  0 0 0 0 1    n =  1   z =  1
t =  0 0 0 1 2    n =  2   z =  1.5
t =  0 0 1 2 3    n =  3   z =  2
t =  0 1 2 3 4    n =  4   z =  2.5
t =  1 2 3 4 5    n =  5   z =  3
t =  2 3 4 5 6    n =  5   z =  4
t =  3 4 5 6 7    n =  5   z =  5
t =  4 5 6 7 8    n =  5   z =  6
t =  5 6 7 8 9    n =  5   z =  7
t =  6 7 8 9 10   n =  5   z =  8
t =  7 8 9 10 11  n =  5   z =  9
t =  8 9 10 11 12  n =  5  z =  10
t =  9 10 11 12 13  n =  5  z =  11
t =  10 11 12 13 14  n =  5  z =  12
t =  11 12 13 14 15  n =  5  z =  13
t =  12 13 14 15 16  n =  5  z =  14
t =  13 14 15 16 17  n =  5  z =  15
t =  14 15 16 17 18  n =  5  z =  16
t =  15 16 17 18 19  n =  5  z =  17
t =  16 17 18 19 20  n =  5  z =  18
t =  17 18 19 20 21  n =  5  z =  19
t =  18 19 20 21 22  n =  5  z =  20
t =  19 20 21 22 23  n =  5  z =  21
t =  20 21 22 23 24  n =  5  z =  22
t =  21 22 23 24 25  n =  5  z =  23
t =  22 23 24 25 26  n =  5  z =  24
t =  23 24 25 26 27  n =  5  z =  25
t =  24 25 26 27 28  n =  5  z =  26
t =  25 26 27 28 29  n =  5  z =  27
t =  26 27 28 29 30  n =  5  z =  28
t =  27 28 29 30 0   n =  4  z =  28.5
t =  28 29 30 0 0    n =  3  z =  29
t =  29 30 0 0 0     n =  2  z =  29.5
t =  30 0 0 0 0      n =  1  z =  30
Note that a vector of ones was used to give each point equal value. It would be easy to modify this to do a weighted convolution.
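For example, a weighted version (an illustrative sketch, not code from the text; f.wconv and the 1-2-3-2-1 kernel are made up here) simply replaces the vector of ones with a set of weights and normalizes by the weights that actually overlap the data:

```r
# Sketch of a weighted running smooth: same padding scheme as above,
# but with a 1-2-3-2-1 kernel so the centre point counts most.
f.wconv <- function(x, w) {
  bw <- length(w)
  x.0 <- c(rep(0, bw-1), x, rep(0, bw-1))
  x.1 <- c(rep(0, bw-1), rep(1, length(x)), rep(0, bw-1))
  z <- rep(0, length(x) + bw - 1)
  for (i in 1:(length(x) + bw - 1)) {
    t <- w * x.0[(1:bw) + i - 1]
    n <- sum(w * x.1[(1:bw) + i - 1])
    z[i] <- sum(t)/n
  }
  z
}
f.wconv(1:10, c(1, 2, 3, 2, 1))   # on linear data the interior values are unchanged
```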
Now create functions from the above to do the smoothing.
#-----------------------------------------------------------------
# Compute a running mean with a specified bandwidth
#-----------------------------------------------------------------
f.conv <- function(x, bandwidth) {
  y <- rep(1, bandwidth)
  # pad x with leading and trailing zeros
  x.0 <- c(rep(0, bandwidth-1), x, rep(0, bandwidth-1))
  x.1 <- c(rep(0, bandwidth-1), rep(1, length(x)), rep(0, bandwidth-1))
  z <- rep(0, length(x))
  for (i in 1:(length(x) + bandwidth - 1)) {
    t <- y*x.0[(1:bandwidth) + i - 1]
    n <- sum(y*x.1[(1:bandwidth) + i - 1])
    z[i] <- sum(t)/n
  }
  return(z)
}
f.run.mean <- function(x, bandwidth) {
  bandwidth <- floor(bandwidth/2)*2 + 1   # Use odd width
  f.conv(x, bandwidth)[(bandwidth/2 + 1):((bandwidth/2) + length(x))]
}

x <- 1:30
f.conv(x, 5)
 [1]  1.0  1.5  2.0  2.5  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0
[16] 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0
[31] 28.5 29.0 29.5 30.0
f.run.mean(x, 5)
 [1]  2.0  2.5  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0 15.0
[16] 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 28.5 29.0
Now try it on the noisy sinusoid from earlier on.
The expression oldpar <- par(mfrow = c(3,1)) saves the current display parameters and sets the new ones to show the plots in a 3 row by 1 column array. The par(oldpar) resets the display to the original. main = "5 point running mean" puts a title on the plot.
oldpar <- par(mfrow = c(3, 1))   # Split the plot into a stack of 3 plots
for (i in c(5, 15, 25)) {
  # Plot the points
  plot(x.rough, y.rough, col="red", pch=20,
       main = paste(i, "point running mean"))
  curve(sin, 0, 6, add=T, col="blue")   # Add in the sine
  YS <- f.run.mean(y.rough, i)
  lines(x.rough, YS, col="green")
}
par(oldpar)
Try the smooth also on Runge’s function.
oldpar <- par(mfrow = c(3, 1))   # Split the plot into a stack of 3 plots
for (i in c(5, 15, 25)) {
  # Plot the points
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main = paste(i, "point running mean"))
  curve(runge, -5, 5, add=T, col="blue")   # Add in the curve
  YS <- f.run.mean(y.noise, i)
  lines(x.5.5by.1, YS, col="green")
}
par(oldpar)
Figure 8. Smooth on noisy sine Figure 9. Smooth on noisy Runge
We see that the 5 point smooth follows well but is rough; the 15 point smooth follows well and is less rough; the 25 point smooth has flattened out much of the true structure.
There are better ways of smoothing. A couple of well known ones are LOcally Weighted Scatter plot Smoothing (LOWESS, also called LOESS) by Cleveland and Supersmoother by Friedman. They require concepts that we have not yet discussed, so they will be looked at later.
3.3 Regression

In many cases of analyzing data we want to predict the behaviour of 'future' cases based on the behaviour of the 'current' cases. In mathematics, we would typically fit a function that passes through the data points but, in general, data will have error or "noise" associated with it. Because the data points may be noisy, there is no reason to assume that the function should pass through all the data points and so we can try to get a function that "best" fits the data.
As in many situations with which we will deal, the concept of "best" will be taken as the smallest sum of squares of the deviations or errors (the difference between the actual and predicted results); this is called ordinary least squares (OLS).
Suppose that we have a set of readings of the form

x1  x2  x3  ...  xn
y1  y2  y3  ...  yn

and try to fit a straight line through these points to model Y in terms of X (i.e. we consider the case of approximation using a straight line).

We define the following variables

X = | 1  x1 |        Y = | y1 |
    | 1  x2 |            | y2 |
    | 1  x3 |            | y3 |
    | ...   |            | ...|
    | 1  xn |            | yn |

The statistical model is

Y = X | β0 | + ε = Xβ + ε
      | β1 |

where ε is assumed to be an error vector of independent random variables with E(ε) = 0 and common variance-covariance matrix Iσ^2. We wish to estimate the β values by using least squares. The estimated linear relationship is written as

ŷ_i = b0 + b1 x_i.
We obtain the estimator of β by minimizing the sum of squares of "error" given by

Q = Σ_{i=1}^{n} (y_i - ŷ_i)^2 = Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2

so we want

∂Q/∂b0 = ∂/∂b0 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2
       = Σ_{i=1}^{n} ∂/∂b0 (y_i - (b0 + b1 x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))(-1) = 0

and

∂Q/∂b1 = ∂/∂b1 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2
       = Σ_{i=1}^{n} ∂/∂b1 (y_i - (b0 + b1 x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))(-x_i) = 0.
These equations are called the normal equations and may be written as

Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} b0 + Σ_{i=1}^{n} b1 x_i = n b0 + b1 Σ_{i=1}^{n} x_i

Σ_{i=1}^{n} x_i y_i = Σ_{i=1}^{n} b0 x_i + Σ_{i=1}^{n} b1 x_i^2 = b0 Σ_{i=1}^{n} x_i + b1 Σ_{i=1}^{n} x_i^2.
Solving this system, we get
b0 = ȳ - b1 x̄

and

b1 = Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^{n} (x_i - x̄)^2

where

x̄ = (Σ_{i=1}^{n} x_i)/n ;   ȳ = (Σ_{i=1}^{n} y_i)/n
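These closed-form estimates can be checked directly against R's lm() (the small data set here is made up for illustration; it is not the noisy sine used elsewhere in the text):

```r
# Compute the OLS slope and intercept from the closed-form formulas
# and compare with lm() on illustrative data.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b1 <- sum((x - mean(x))*(y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1*mean(x)
c(b0, b1)
coef(lm(y ~ x))   # same intercept and slope
```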
If there is curvature we may wish to use polynomials of higher degree (say p) to model y_i. In general
Q = Σ_{i=1}^{n} (y_i - ŷ_i)^2 = Σ_{i=1}^{n} (y_i - P_p(x_i))^2
  = Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_p x_i^p))^2

so

∂Q/∂b0 = ∂/∂b0 Σ_{i=1}^{n} (y_i - P_p(x_i))^2 = Σ_{i=1}^{n} ∂/∂b0 (y_i - P_p(x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_p x_i^p))(-1)
       = 0

and
∂Q/∂b_k = ∂/∂b_k Σ_{i=1}^{n} (y_i - P_p(x_i))^2
        = Σ_{i=1}^{n} ∂/∂b_k (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_k x_i^k + ... + b_p x_i^p))^2
        = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_k x_i^k + ... + b_p x_i^p))(-x_i^k)
        = 0   for k = 1, ..., p

so

Σ_{i=1}^{n} y_i x_i^k = Σ_{i=1}^{n} (b0 x_i^k + b1 x_i^{k+1} + b2 x_i^{k+2} + ... + b_p x_i^{p+k})   for k = 1, ..., p
Combining these results we obtain the set of p + 1 normal equations (all sums over i = 1, ..., n):

Σ y_i       = b0 Σ 1     + b1 Σ x_i       + b2 Σ x_i^2     + ... + b_p Σ x_i^p
Σ y_i x_i   = b0 Σ x_i   + b1 Σ x_i^2     + b2 Σ x_i^3     + ... + b_p Σ x_i^{p+1}
  ...
Σ y_i x_i^p = b0 Σ x_i^p + b1 Σ x_i^{p+1} + b2 Σ x_i^{p+2} + ... + b_p Σ x_i^{2p}
One method for solving this involves defining (for example, when p = 2)

Y = | y1 |      and      X = | 1  x1  x1^2 |
    | y2 |                   | 1  x2  x2^2 |
    | y3 |                   | 1  x3  x3^2 |
    | y4 |                   | 1  x4  x4^2 |
    | y5 |                   | 1  x5  x5^2 |

from which

X^T X = | 5                     x1 + x2 + ... + x5          x1^2 + x2^2 + ... + x5^2 |
        | x1 + x2 + ... + x5    x1^2 + x2^2 + ... + x5^2    x1^3 + x2^3 + ... + x5^3 |
        | x1^2 + ... + x5^2     x1^3 + ... + x5^3           x1^4 + x2^4 + ... + x5^4 |
or, in general

X^T X = | n        Σ x_i     Σ x_i^2 |
        | Σ x_i    Σ x_i^2   Σ x_i^3 |
        | Σ x_i^2  Σ x_i^3   Σ x_i^4 |

X^T Y = | y1 + y2 + ... + y5                |     | Σ y_i       |
        | x1 y1 + x2 y2 + ... + x5 y5       |  =  | Σ y_i x_i   |
        | x1^2 y1 + x2^2 y2 + ... + x5^2 y5 |     | Σ y_i x_i^2 |

Then we can write, in matrix notation,

(X^T X) b = X^T Y

so

b = β̂ = (X^T X)^{-1} (X^T Y)

Note that the problem is now reduced to the solution of a system of equations.
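The matrix solution can be verified directly in R (an illustrative quadratic example; note that solve(A, b) is used rather than forming the inverse explicitly, which is numerically preferable):

```r
# Solve the normal equations (X^T X) b = X^T Y for a quadratic fit
# on illustrative noisy data, and compare with lm().
set.seed(1)
x <- seq(0, 6, by = 0.5)
y <- 1 + 2*x - 0.3*x^2 + rnorm(length(x), 0, 0.1)
X <- cbind(1, x, x^2)                  # design matrix
b <- solve(t(X) %*% X, t(X) %*% y)     # normal equations
as.vector(b)
coef(lm(y ~ x + I(x^2)))               # same coefficients
```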
We will apply these ideas to the same noisy sine as before. As a first approximation at a predictor, we might try taking the mean of the y values. Keep in mind that if the residuals arising from using the mean as an approximation are small enough for our calculations, then we do not need to try fitting the data using a non-flat straight line or a higher degree polynomial.
oldpar <- par(mfrow=c(2, 1))
mu <- mean(y.rough)
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by the mean")
curve(sin, 0, 6, add=T, col="blue")
lines(c(0, 6), c(mu, mu))
res.mean <- y.rough - mu
plot(x.rough, res.mean, pch=20, main="Residuals for approximation by the mean")
sum(res.mean*res.mean)
[1] 34.71713
par(oldpar)
Figure 10.
We see that the residuals have a pattern. This indicates that there is structure in the data not captured by the model (the mean value). Even so, if the error is within our tolerance we might decide that the model is acceptable. If not, we might try a least squares fit. We will use the noisy sine from before.
In doing this we will make use of the linear models routine in the stats package (lm(y ~ x)). It computes not only the regression coefficients, but also the fitted values and residuals.
(lm.1 <- lm(y.rough ~ x.rough))
Call:
lm(formula = y.rough ~ x.rough)

Coefficients:
(Intercept)      x.rough
     0.9543      -0.3036

oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by regression line")
curve(sin, 0, 6, add=T, col="blue")
abline(lm.1$coefficients[1], lm.1$coefficients[2], col="green")
res.line <- y.rough - (lm.1$coefficients[1] + x.rough*lm.1$coefficients[2])
plot(x.rough, res.line, pch=20, main="Residuals for approximation by regression line")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252 (better)
Figure 11.
Again, the residual plot shows structure.
We might also notice that there seem to be two sections to the data. Perhaps we could investigate the idea of breaking the region into two parts. In the next section, we will look at fitting the means before and after x = 3.
mid.pt <- min(x.rough) + sum(range(x.rough))/2
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by two means")
# Get means on both sides
l.index <- (x.rough < mid.pt)*(1:numb.rough)
r.index <- (x.rough >= mid.pt)*(1:numb.rough)
mu.2.l <- mean(y.rough[l.index])
mu.2.r <- mean(y.rough[r.index])
lines(c(0, 3, 3, 6), c(mu.2.l, mu.2.l, mu.2.r, mu.2.r))
curve(sin, 0, 6, add=T, col="blue")
res.2 <- c(y.rough[l.index]-mu.2.l, y.rough[r.index]-mu.2.r)
plot(x.rough, res.2, pch=20, main="Residuals for approximation by two means")
par(oldpar)
The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287 (an improvement)
Figure 12.
The residuals look better, and we have a lower sum of squares of error.
We might extend the idea of splitting the interval and do least squares fits that do not cover the entire interval at once but rather piecewise.
quart.pt <- min(x.rough) + sum(range(x.rough))/4
three.quart.pt <- min(x.rough) + sum(range(x.rough))*3/4
a.index <- (x.rough < quart.pt)*(1:numb.rough)
b.index <- ((x.rough >= quart.pt)&(x.rough < three.quart.pt))*(1:numb.rough)
c.index <- (x.rough >= three.quart.pt)*(1:numb.rough)
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by three regression lines")
curve(sin, 0, 6, add=T, col="blue")
# Get regression in the intervals.
lm.3.a <- lm(y.rough[a.index] ~ x.rough[a.index])
m <- lm.3.a$coefficients[2]
int <- lm.3.a$coefficients[1]
lines(c(0, quart.pt), c(int, int + m*quart.pt))
lm.3.b <- lm(y.rough[b.index] ~ x.rough[b.index])
m <- lm.3.b$coefficients[2]
int <- lm.3.b$coefficients[1]
lines(c(quart.pt, three.quart.pt), c(int + m*quart.pt, int + m*three.quart.pt))
lm.3.c <- lm(y.rough[c.index] ~ x.rough[c.index])
m <- lm.3.c$coefficients[2]
int <- lm.3.c$coefficients[1]
lines(c(three.quart.pt, 6), c(int + m*three.quart.pt, int + m*6))
# Residuals
res.3 <- c(lm.3.a$residuals, lm.3.b$residuals, lm.3.c$residuals)
plot(x.rough, res.3, pch=20, main="Residuals for three regression lines")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956 (a further improvement)
Figure 13.
The positions for the split points were chosen arbitrarily. We could use an optimization routine to find the knot locations that give the best fit.
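For a single split point, for example, R's optimize() could search for the location that minimizes the error sum of squares of the two-mean fit (a sketch; sse.split is a hypothetical helper, and the noisy sine is regenerated here so the chunk stands on its own):

```r
# Sketch: find the split point for the two-mean fit that minimizes
# the error sum of squares.
sse.split <- function(s, x, y) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
set.seed(1234, "default")
x.rough <- seq(0, 6, by = 0.1)
y.rough <- sin(x.rough) + runif(length(x.rough)) - 0.5
best <- optimize(sse.split, interval = c(0.5, 5.5), x = x.rough, y = y.rough)
best$minimum      # estimated split location
best$objective    # SSE at that split
```

Because the objective is piecewise constant in s, optimize() finds a local minimum; a grid search over the data points would guarantee the global one.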
Rather than use piecewise linear, suppose we try a quadratic fit.
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by quadratic regression")
X.rough.q <- cbind(x.rough, x.rough^2)
lm.q <- lm(y.rough ~ X.rough.q)
cat("The coefficients are ", lm.q$coefficients, "\n")
The coefficients are  0.6796526 -0.02429912 -0.04655698
T.q <- lm.q$coefficients[1] + x.in*lm.q$coefficients[2] + (x.in)^2*lm.q$coefficients[3]
lines(x.in, T.q, col="green")
curve(sin, 0, 6, add=T, col="blue")
plot(x.rough, lm.q$residuals, pch=20, main="Residuals for approximation by quadratic regression")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956
For quadratic = 16.26682 (slightly better than a straight line)
Figure 14.
This does not seem to give any improvement as there is still structure on the residual plot.
As a last attempt, we will try a cubic:

oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by cubic regression")
X.rough.cubic <- cbind(x.rough, x.rough^2, x.rough^3)
lm.cubic <- lm(y.rough ~ X.rough.cubic)
cat("The coefficients are ", lm.cubic$coefficients, "\n")
The coefficients are  -0.3902004 2.208157 -0.984476 0.1042132
T.cubic <- lm.cubic$coefficients[1] + x.in*lm.cubic$coefficients[2] +
           (x.in)^2*lm.cubic$coefficients[3] + (x.in)^3*lm.cubic$coefficients[4]
lines(x.in, T.cubic, col="green")
curve(sin, 0, 6, add=T, col="blue")
plot(x.rough, lm.cubic$residuals, pch=20, main="Residuals for approximation by cubic regression")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956
For quadratic = 16.26682
For cubic = 4.122846 (similar to the 3 linear regressions)
Figure 15.

This seems to give a small error sum of squares and the residual plot has little structure. Perhaps we should consider using a higher degree polynomial.
While it is true that if you have data with no noise you can use higher degree polynomials that will fit the data exactly, it is not a good idea. We will see later that significant problems can arise in fitting data in this way. We must also keep in mind that we wish to use the current data to make predictions on future data. Fitting too closely decreases the predictive strength of a model and, in fact, we usually apply a penalty for 'roughness' as a means of discouraging overfitting. There is a trade-off between a good fit on the information used to construct a model and the ability of the model to make good predictions.
The idea of regression can be applied to functions of several variables (as usual, for purposes of illustration, we will look at a two variable case).

numb <- 1000
max <- 6
set.seed(1234)
x <- sort(runif(numb)*max)
y <- runif(numb)*max
z <- sin(x)*y + runif(numb)
lm.1 <- lm(z ~ x + y)
lm.1$coefficients
(Intercept)           x           y
 3.36044749 -0.96380760  0.03079192
We will use the rgl package for displaying three dimensional data. The data is plotted by finding the array of {z_{i,j}} values above a rectangular grid of (x_i, y_j) values. We create the grid on the xy-plane from values along the x- and y-axes:

X.g <- seq(0, max, by=.1)
Y.g <- seq(0, max, by=.1)
g <- expand.grid(X.g, Y.g)

Compute the estimated z value at every point on the grid and then convert to a matrix:

lm.g <- lm.1$coefficients[1] + g[,1]*lm.1$coefficients[2] + g[,2]*lm.1$coefficients[3]
lm.g.m <- matrix(lm.g, nrow=length(X.g))
library(rgl)

Plot the points as small spheres and then plot the plane of estimated values:

plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.g.m, alpha=0.5)   # alpha gives the transparency
# Residuals
plot3d(x, y, lm.1$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.1$residuals*lm.1$residuals)
[1] 3006.138
Figure 16. Dark red points are below the plane Figure 17. Residuals showing structure
As before we can try a cubic fit:
X <- cbind(x, x^2, x^3)
lm.2 <- lm(z ~ X + y)
lm.2$coefficients
(Intercept)          Xx           X           X           y
-0.31595442  5.79108184 -2.63805108  0.27920336  0.01659746

lm.2.g <- (lm.2$coefficients[1] + g[,1]*lm.2$coefficients[2] + (g[,1])^2*lm.2$coefficients[3] +
           (g[,1])^3*lm.2$coefficients[4] + g[,2]*lm.2$coefficients[5])
lm.2.g.m <- matrix(lm.2.g, nrow=length(X.g))
plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.2.g.m, alpha=0.5)
# Residuals
plot3d(x, y, lm.2$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.2$residuals*lm.2$residuals)
[1] 1639.442
Figure 18. Figure 19. Residuals showing less structure
or a more complicated one:

X <- cbind(y*x, y*x^2, y*x^3)
lm.3 <- lm(z ~ X)
print(lm.3$coefficients)
(Intercept)          X1          X2          X3
 0.37762183  1.73255960 -0.83142514  0.08954253
lm.3.g <- (lm.3$coefficients[1] + g[,1]*g[,2]*lm.3$coefficients[2] +
           (g[,1])^2*g[,2]*lm.3$coefficients[3] + (g[,1])^3*g[,2]*lm.3$coefficients[4])
lm.3.g.m <- matrix(lm.3.g, nrow=length(X.g))

(In the following, the zlim=c(-7,7) gives a range of [-8,8] while zlim=c(-8,8) gives [-10,10].)

plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.3.g.m, alpha=0.5)
# Residuals
plot3d(x, y, lm.3$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.3$residuals*lm.3$residuals)
[1] 148.3741
Figure 20. Figure 21. Residuals
3.3.1 All Subsets Regression
Example from Neter, Kutner, Nachtsheim, Wasserman.
d.file <- paste(data.dir, "ch08ta01.dat", sep = "/")
d.temp <- matrix(scan(d.file), ncol=6, byrow=T)
d.data <- d.temp[, c(1:4, 6)]
names <- c("BloodClot", "Prog.Ind", "EnzyneFun", "LiverFun", "logSurvival")
dimnames(d.data) <- list(1:54, names)
df.data <- as.data.frame(d.data)
The following recursive routine creates all possible combinations of the predictors 1 to numb:
next.combo <- function(combo, numb, all.current, prev, ind) {
  i <- prev + 1
  while (i <= numb) {
    combo[ind, c(all.current, i)] <- c(all.current, i)
    ind <- ind + 1
    res <- next.combo(combo, numb, c(all.current, i), i, ind)
    combo <- res[[1]]
    ind <- res[[2]]
    i <- i + 1
  }
  list(combo, ind)
}
numb <- 4
# Determine the number of possible combinations
n.combo <- 0
for (j in 1:numb) {
  n.combo <- n.combo + choose(numb, j)
}
combo <- matrix(0, n.combo, numb)
res <- next.combo(combo, numb, {}, 0, 1)
# Puts them in a better order
(combos <- res[[1]][order(apply(res[[1]] != 0, 1, sum)), ])
      [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    0    2    0    0
 [3,]    0    0    3    0
 [4,]    0    0    0    4
 [5,]    1    2    0    0
 [6,]    1    0    3    0
 [7,]    1    0    0    4
 [8,]    0    2    3    0
 [9,]    0    2    0    4
[10,]    0    0    3    4
[11,]    1    2    3    0
[12,]    1    2    0    4
[13,]    1    0    3    4
[14,]    0    2    3    4
[15,]    1    2    3    4
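For comparison, the same enumeration can be built without recursion. This is an illustrative sketch (not the notes' code): `expand.grid` produces every 0/1 indicator row, and each row is then rescaled to the 0-or-column-index form used above.

```r
# A non-recursive sketch of subset enumeration (illustrative alternative
# to next.combo): each row of the 0/1 grid marks which predictors are in.
numb <- 4
grid <- as.matrix(expand.grid(rep(list(0:1), numb)))  # 2^numb rows
grid <- grid[rowSums(grid) > 0, ]                     # drop the empty subset
combos2 <- sweep(grid, 2, 1:numb, "*")                # 0 or the column index
combos2 <- combos2[order(rowSums(combos2 > 0)), ]     # order by subset size
dim(combos2)   # 15 rows: one per nonempty subset of 4 predictors
```

The recursive version has the advantage of filling the rows in a fixed order; this sketch relies on `order` (which is stable) to group subsets of the same size together.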
Create regression models for all possible subsets and plot the SSE.
xx <- 1:numb
plot(-1, 0, xlim=c(0, numb), ylim=c(0, 5), xlab="", ylab="")
lpm.all <- {}  # A null list
for (i in 1:n.combo) {
  comb <- d.data[, combos[i, ]]  # Simplify names
  lpm <- lm(df.data[, 5] ~ comb)
  # Put model in list
  lpm.all <- c(lpm.all, list(lpm))
  # Plot the results
  points(sum(combos[i, ] > 0), sum(lpm$residuals^2))
}
lpm.all[[1]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
(Intercept)         comb
    1.86399      0.05916

lpm.all[[2]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
(Intercept)         comb
   1.598811     0.009604

lpm.all[[15]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
  (Intercept)  combBloodClot   combProg.Ind  combEnzymeFun   combLiverFun
     0.488756       0.068520       0.009254       0.009475        0.00192
table <- matrix(0, n.combo+1, 4)
TSS <- sum((df.data[, 5] - mean(df.data[, 5]))^2)
table[1, 1] <- 53
table[1, 2] <- floor(100000*TSS)/100000  # 5 digits
table[1, 3] <- 0
table[1, 4] <- floor(100000*(table[1, 2]/(54-2)))/100000
row.names <- "None"
for (i in 1:n.combo) {
  temp1 <- format(combos[i, ])
  temp2 <- {}
  for (j in 1:length(temp1)) {
    if (temp1[j] != "0")
      temp2 <- paste(temp2, "X", temp1[j], sep="")
  }
  row.names <- c(row.names, temp2)
  table[i+1, 1] <- floor(100000*lpm.all[[i]]$df.residual)/100000
  table[i+1, 2] <- floor(100000*sum(lpm.all[[i]]$residuals^2))/100000
  table[i+1, 3] <- floor(100000*(1 - table[i+1, 2]/TSS))/100000
  table[i+1, 4] <- floor(100000*table[i+1, 2])/(54-2)/100000
}
dimnames(table) <- list(row.names, c("df","SSE","R^2","MSE"))
table
           df     SSE     R^2         MSE
None       53 3.97277 0.00000 0.076390000
X1         52 3.49605 0.11999 0.067231731
X2         52 2.57627 0.35151 0.049543654
X3         52 2.21527 0.44238 0.042601154
X4         52 1.87763 0.52737 0.036108269
X1X2       51 2.23248 0.43805 0.042932115
X1X3       51 1.40718 0.64579 0.027061154
X1X4       51 1.87582 0.52783 0.036073462
X2X3       51 0.74301 0.81297 0.014288654
X2X4       51 1.39215 0.64957 0.026772115
X3X4       51 1.24532 0.68653 0.023948462
X1X2X3     50 0.10985 0.97234 0.002112500
X1X2X4     50 1.39052 0.64998 0.026740769
X1X3X4     50 1.11559 0.71919 0.021453654
X2X3X4     50 0.46520 0.88290 0.008946154
X1X2X3X4   49 0.10977 0.97236 0.002110962
Figure 22.
mins <- matrix(10^(30), numb+1, 2)
for (i in 1:15) {
  temp <- sum(lpm.all[[i]]$residuals^2)
  if (temp < mins[lpm.all[[i]]$rank, 1]) {
    mins[lpm.all[[i]]$rank, 1] <- temp
    mins[lpm.all[[i]]$rank, 2] <- i
  }
}
mins[1, 1] <- TSS
mins[1, 2] <- 0
lines(0:4, mins[1:5, 1])
points(0:4, mins[1:5, 1], col="red")
Figure 23.
3.4 Computational difficulties
The process of solving the normal equations can produce significant round-off error. As long as there are only a few predictor variables (X) the effect may not be noticed but, when there are a large number of predictor variables, the effect can be serious. This is especially true when there is correlation among the predictors. The effect is seen in the computation of the inverse of X^T X (|X^T X| will often be very small, which leads to instability of the inverse) and for this reason we use the singular value decomposition method of solution.
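To see why correlated predictors make this worse, here is a minimal sketch (illustrative only; the variables and seed are made up, not from the notes) with two nearly collinear predictors:

```r
# Illustration of how near-collinearity makes X^T X nearly singular.
set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 1e-4)   # almost identical to x1
X  <- cbind(1, x1, x2)
XtX <- t(X) %*% X
det(XtX)      # very close to 0
kappa(XtX)    # huge condition number: inverting XtX is numerically unstable
```

A determinant near zero and a large condition number together signal that small rounding errors in the entries of X^T X are magnified enormously in its inverse.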
3.4.1 The singular value decomposition method
To solve the normal equations, we will look at a robust method known as singular value decomposition (SVD).
If we have a matrix $A \in \mathbb{R}^{m \times n}$ (note - it does not need to be square), we:
1. Find the eigenvalues $\lambda_i$ of the matrix $A^T A$;
2. Arrange the eigenvalues in descending order;
3. Determine the number of nonzero eigenvalues, $r$;
4. Find the orthogonal eigenvectors of the matrix $A^T A$ (they have to be ordered corresponding to the order of the eigenvalues) and create a matrix $V \in \mathbb{R}^{n \times n}$ whose columns $v_i$ are the ordered eigenvectors;
5. Form a matrix $\Sigma \in \mathbb{R}^{m \times n}$, with 'diagonal' entries being the square roots of the ordered eigenvalues;
6. Create a matrix $U \in \mathbb{R}^{m \times m}$ with the first $r$ column vectors constructed as
$$u_i = \frac{A v_i}{\sigma_i};$$
7. If $r < m$, the remaining ($(r+1)$ to $m$) vectors are constructed using the Gram-Schmidt orthogonalization process.
Aside: If we have a set of linearly independent vectors $v_1, v_2, \ldots, v_n$, we select one of the vectors and normalize it, i.e. $e_1 = v_1/\|v_1\|$. Now define a vector $e_2' = v_2 - (v_2 \cdot e_1)e_1$. We note that
$$e_1 \cdot e_2' = e_1 \cdot v_2 - (v_2 \cdot e_1)(e_1 \cdot e_1) = 0$$
so $e_1$ and $e_2'$ are orthogonal. Then $e_1$ and $e_2 = e_2'/\|e_2'\|$ form an orthonormal set.
Using induction, we can define
$$e_{j+1}' = v_{j+1} - \sum_{i=1}^{j} (v_{j+1} \cdot e_i)\, e_i$$
and show that
$$e_{j+1}' \cdot e_p = v_{j+1} \cdot e_p - \sum_{i=1}^{j} (v_{j+1} \cdot e_i)(e_i \cdot e_p) = 0 \quad \text{for } 1 \le p \le j.$$
The orthonormal vectors $e_p = e_p'/\|e_p'\|$ span the same space as $v_1, v_2, \ldots, v_n$.
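The Gram-Schmidt process in this aside can be sketched in R as follows (an illustrative helper written for these notes' style; the name `gram.schmidt` is chosen here, not taken from the notes):

```r
# Orthonormalize the columns of V by the Gram-Schmidt process above:
# subtract from each v_j its projections onto the earlier e_i, then normalize.
gram.schmidt <- function(V) {
  E <- matrix(0, nrow(V), ncol(V))
  for (j in 1:ncol(V)) {
    u <- V[, j]
    if (j > 1)
      for (i in 1:(j - 1))
        u <- u - sum(V[, j] * E[, i]) * E[, i]   # (v_j . e_i) e_i
    E[, j] <- u / sqrt(sum(u * u))               # normalize
  }
  E
}
V <- cbind(c(1, 1, 0), c(1, 0, 1), c(0, 1, 1))
E <- gram.schmidt(V)
round(t(E) %*% E, 10)   # the identity: the columns are orthonormal
```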
The equation of the form
$$Ax = b$$
then becomes
$$(U \Sigma V^T) x = b$$
or
$$x = V \Sigma^{-1} U^T b$$
Consider the matrix
$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
A <- cbind(c(1, 0, 1), c(1, 1, 0))
nrow <- dim(A)[1]
ncol <- dim(A)[2]
(ATA <- t(A) %*% A)
     [,1] [,2]
[1,]    2    1
[2,]    1    2
(ATA.eig <- eigen(ATA))
$values
[1] 3 1

$vectors
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068
Set 'small' numbers to 0 and remove them. Which ones are smaller than $10^{-15}$? (None in this case, but we need this in general.)
(abs(ATA.eig$values) > 10^(-15))
[1] TRUE TRUE
Create indices for the non-zero eigenvalues:
(abs(ATA.eig$values) > 10^(-15)) * (1:length(ATA.eig$values))
[1] 1 2
List of the non-zero eigenvalues:
(eigs.a <- ATA.eig$values[(abs(ATA.eig$values) > 10^(-15)) * (1:length(ATA.eig$values))])
[1] 3 1
Put in descending order:
(the.eigs <- eigs.a[order(-eigs.a)])
[1] 3 1
(r <- length(the.eigs))
[1] 2
if (r < ncol) {
  V <- as.matrix(ATA.eig$vectors[, c(order(-eigs.a), (r+1):ncol)])
} else {
  V <- as.matrix(ATA.eig$vectors[, order(-eigs.a)])
}
V
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068
t(V) %*% V
     [,1] [,2]
[1,]    1    0
[2,]    0    1
Create the singular value matrix using the square root of the eigenvalues as 'diagonal' elements. This is of the same shape as A.
(Sig <- diag((the.eigs)^(.5), nrow, ncol))
         [,1] [,2]
[1,] 1.732051    0
[2,] 0.000000    1
[3,] 0.000000    0
(Sig.Inv <- diag(1/(the.eigs)^(.5), nrow, ncol))
          [,1] [,2]
[1,] 0.5773503    0
[2,] 0.0000000    1
[3,] 0.0000000    0
U <- matrix(0, nrow, nrow)
for (i in 1:r) {
  U[, i] <- A %*% V[, i] / sqrt(the.eigs[i])
}
U
          [,1]       [,2] [,3]
[1,] 0.8164966  0.0000000    0
[2,] 0.4082483 -0.7071068    0
[3,] 0.4082483  0.7071068    0
I <- diag(nrow)  # standard basis vectors to seed the Gram-Schmidt step
for (i in (r+1):nrow) {
  # Gram-Schmidt for next
  u <- I[, 1]
  for (j in 1:(i-1)) {
    u <- u - sum(I[, 1] * U[, j]) * U[, j]
  }
  U[, i] <- u / sqrt(sum(u * u))
}
U
          [,1]       [,2]       [,3]
[1,] 0.8164966  0.0000000  0.5773503
[2,] 0.4082483 -0.7071068 -0.5773503
[3,] 0.4082483  0.7071068 -0.5773503
U %*% t(U)
              [,1]          [,2]          [,3]
[1,]  1.000000e+00  1.110223e-16  1.110223e-16
[2,]  1.110223e-16  1.000000e+00 -1.110223e-16
[3,]  1.110223e-16 -1.110223e-16  1.000000e+00
U %*% Sig %*% t(V)
     [,1] [,2]
[1,]    1    1
[2,]    0    1
[3,]    1    0
We could use this method to find the coefficients of the regression in the first example (the one that produced Figure 11).
Y <- matrix(y.rough)
X <- cbind(rep(1, length(Y)), x.rough)
(s <- svd(t(X) %*% X))
$d
[1] 784.39426  14.70574

$u
           [,1]       [,2]
[1,] -0.2452483 -0.9694603
[2,] -0.9694603  0.2452483

$v
           [,1]       [,2]
[1,] -0.2452483 -0.9694603
[2,] -0.9694603  0.2452483

(l <- s$v %*% diag(1/s$d) %*% t(s$u) %*% t(X) %*% Y)
           [,1]
[1,]  0.9543387
[2,] -0.3036410
As we see, the coefficients are the same.
A second cause of problems occurs when the entries of X^T X cover a wide range of values. This happens if the X values have widely varying scales (e.g. should income be expressed in cents, dollars, or thousands of dollars?). One way of avoiding such problems is to standardize the data by the use of what is called the correlation transformation.
3.4.2 The correlation transformation
When we transform a variable by subtracting its mean, dividing the result by its standard deviation and multiplying the result of that by $(n-1)^{-1/2}$, we are performing a correlation transformation or a form of standardizing the data. Hence, the response variable becomes
$$Y_i^* = \frac{1}{\sqrt{n-1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$$
and the predictor variables become
$$X_{ij}^* = \frac{1}{\sqrt{n-1}} \left( \frac{X_{ij} - \bar{X}_j}{s_{X_j}} \right) \quad \text{for } j = 1, \ldots, p-1.$$
The result of this transformation is that
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \varepsilon$$
becomes
$$Y^* = \beta_1^* X_1^* + \beta_2^* X_2^* + \cdots + \beta_{p-1}^* X_{p-1}^* + \varepsilon^*$$
(note the lack of an intercept term here) and the solution of the normal equations gives
$$b^* = \left( X^{*T} X^* \right)^{-1} X^{*T} Y^*$$
Using the factor of $(n-1)^{-1/2}$ in the denominator of these transformations results in $\sigma_{Y^*}^2 = 1$,
$$X^{*T} X^* = r_{XX}$$
and
$$X^{*T} Y^* = r_{YX}$$
so
$$b^* = \left( r_{XX} \right)^{-1} r_{YX}$$
(Note that $r_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$.)
#----------------------
#   Standardize data
#----------------------
f.data.std <- function(data) {
  data <- as.matrix(data)
  bar <- apply(data, 2, mean)
  s <- apply(data, 2, sd)
  t((t(data) - bar) / s)
}
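As a quick numerical check (illustrative, not from the notes) that the correlation transformation really does turn $X^{*T}X^*$ into the correlation matrix $r_{XX}$:

```r
# Verify X*^T X* = r_XX for a made-up data matrix.  scale() subtracts the
# column means and divides by the column standard deviations; dividing by
# sqrt(n-1) then completes the correlation transformation.
set.seed(2)
X <- cbind(rnorm(40), runif(40), rexp(40))
n <- nrow(X)
X.star <- scale(X) / sqrt(n - 1)
max(abs(t(X.star) %*% X.star - cor(X)))   # essentially zero
```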
If we wish to return to the original coordinates, we can use the following:
$$X_j^* = \left( \frac{1}{\sqrt{n-1}}\, \frac{X_{1j} - \bar{X}_j}{s_{X_j}},\; \frac{1}{\sqrt{n-1}}\, \frac{X_{2j} - \bar{X}_j}{s_{X_j}},\; \ldots,\; \frac{1}{\sqrt{n-1}}\, \frac{X_{nj} - \bar{X}_j}{s_{X_j}} \right)$$
$$= \frac{1}{s_{X_j}\sqrt{n-1}} \left( X_{1j}, X_{2j}, \ldots, X_{nj} \right) - \frac{1}{s_{X_j}\sqrt{n-1}}\, \bar{X}_j = \frac{1}{s_{X_j}\sqrt{n-1}} \left( X_j - \bar{X}_j \right)$$
and
$$Y^* = \beta_1^* X_1^* + \beta_2^* X_2^* + \cdots + \beta_{p-1}^* X_{p-1}^* + \varepsilon^*$$
so
$$\frac{1}{\sqrt{n-1}}\, \frac{Y_i - \bar{Y}}{s_Y} = \beta_1^* \frac{1}{s_{X_1}\sqrt{n-1}} \left( X_1 - \bar{X}_1 \right) + \beta_2^* \frac{1}{s_{X_2}\sqrt{n-1}} \left( X_2 - \bar{X}_2 \right) + \cdots + \beta_{p-1}^* \frac{1}{s_{X_{p-1}}\sqrt{n-1}} \left( X_{p-1} - \bar{X}_{p-1} \right)$$
so
$$Y_i - \bar{Y} = \beta_1^* \frac{s_Y}{s_{X_1}} \left( X_1 - \bar{X}_1 \right) + \beta_2^* \frac{s_Y}{s_{X_2}} \left( X_2 - \bar{X}_2 \right) + \cdots + \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} \left( X_{p-1} - \bar{X}_{p-1} \right)$$
This gives
$$Y_i = \bar{Y} - \beta_1^* \frac{s_Y}{s_{X_1}} \bar{X}_1 - \beta_2^* \frac{s_Y}{s_{X_2}} \bar{X}_2 - \cdots - \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} \bar{X}_{p-1} + \beta_1^* \frac{s_Y}{s_{X_1}} X_1 + \beta_2^* \frac{s_Y}{s_{X_2}} X_2 + \cdots + \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} X_{p-1}$$
$$= \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \cdots - \beta_{p-1} \bar{X}_{p-1} + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1}$$
$$= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1}$$
where $\beta_j = \beta_j^*\, s_Y / s_{X_j}$ and $\beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \cdots - \beta_{p-1} \bar{X}_{p-1}$.
#--------------------------------------------
#  Convert standardized coeff to original
#--------------------------------------------
std.to.orig <- function(std.coef, mean.X, mean.Y, s.X, s.Y) {
  sz <- length(std.coef)
  B.i <- matrix(0, sz, 1)
  for (i in 2:sz) {
    B.i[i, 1] <- s.Y/s.X[i-1] * std.coef[i]
    std.coef[i] <- B.i[i, 1]
    B.i[1, 1] <- B.i[1, 1] + B.i[i, 1] * mean.X[i-1]
  }
  std.coef[1] <- mean.Y - B.i[1, 1]
  std.coef
}
3.5 Data splitting
Data splitting is often used to validate a model for a study by simulating replication of the study. The data set is split into two (or sometimes three) sets. The first, called the model-building or training set, is used to develop a model. The second - called the test (validation, prediction, calibration) set - is used to evaluate the reasonableness or predictive ability of the developed model (also called the Scientific Method). If only two sets are used, the second is used for both testing and validation. For the selected model we may compare regression coefficient estimates obtained in the training and test sets. We use the model obtained in the training phase to make predictions for the data in the validation data set. This calibrates the predictive ability of the model for new data.
d.file <- paste(data.dir, "prostate.dat", sep="/")
d.temp <- matrix(scan(d.file), ncol=10, byrow=T)
data.orig <- d.temp[, 2:10]
names <- c("lcavol","lweight","age","lbph","svi","lcp","gleason","pgg45","lpsa")
dimnames(data.orig) <- list(1:97, names)
The first 8 columns are predictors, the 9th is the response
pred <- 1:8
resp <- 9
Number of predictors:
p <- 8
The following function allows the computation of the estimated response
f.Yhat <- function(coef, X) {
  X <- as.matrix(X)
  cbind(rep(1, dim(X)[1]), X) %*% as.matrix(coef)
}
Now take a look at the data graphically and numerically
source(paste(code.dir, "pairs_ext.r", sep="/"))
pairs(data.orig, upper.panel=panel.cor, diag.panel=panel.hist)
Figure 24.
cor(data.orig)
             lcavol      lweight       age        lbph         svi          lcp
lcavol   1.00000000  0.194128307 0.2249999  0.027349703  0.53884500  0.675310484
lweight  0.19412831  1.000000000 0.3075286  0.434934587  0.10877848  0.100237802
age      0.22499988  0.307528601 1.0000000  0.350185896  0.11765804  0.127667752
lbph     0.02734970  0.434934587 0.3501859  1.000000000 -0.08584324 -0.006999431
svi      0.53884500  0.108778484 0.1176580 -0.085843238  1.00000000  0.673111185
lcp      0.67531048  0.100237802 0.1276678 -0.006999431  0.67311118  1.000000000
gleason  0.43241706 -0.001275662 0.2688916  0.077820447  0.32041222  0.514830063
pgg45    0.43365225  0.050846836 0.2761124  0.078460018  0.45764762  0.631528246
lpsa     0.73446033  0.354120358 0.1695928  0.179809404  0.56621822  0.548813175
             gleason      pgg45      lpsa
lcavol   0.432417056 0.43365225 0.7344603
lweight -0.001275662 0.05084684 0.3541204
age      0.268891599 0.27611245 0.1695928
lbph     0.077820447 0.07846002 0.1798094
svi      0.320412221 0.45764762 0.5662182
lcp      0.514830063 0.63152825 0.5488132
gleason  1.000000000 0.75190451 0.3689868
pgg45    0.751904512 1.00000000 0.4223159
lpsa     0.368986806 0.42231586 1.0000000
Try a least squares fit on the full data set - original
lm(data.orig[, resp] ~ data.orig[, pred])

Coefficients:
             (Intercept)  data.orig[, pred]lcavol  data.orig[, pred]lweight
                0.669399                 0.587023                  0.454461
    data.orig[, pred]age    data.orig[, pred]lbph     data.orig[, pred]svi
               -0.019637                 0.107054                 0.766156
    data.orig[, pred]lcp  data.orig[, pred]gleason   data.orig[, pred]pgg45
               -0.105474                 0.045136                 0.004525
and standardized
# Standardize the data
data.std <- f.data.std(data.orig)
(lm.std <- lm(data.std[, resp] ~ data.std[, pred]))

Coefficients:
             (Intercept)  data.std[, pred]lcavol  data.std[, pred]lweight
              -9.402e-16               5.994e-01                1.955e-01
     data.std[, pred]age    data.std[, pred]lbph     data.std[, pred]svi
              -1.267e-01               1.346e-01               2.748e-01
     data.std[, pred]lcp  data.std[, pred]gleason   data.std[, pred]pgg45
              -1.278e-01               2.824e-02               1.106e-01
then convert the standardized result back.
std.to.orig(lm.std$coefficients, apply(data.orig[, pred], 2, mean),
            mean(data.orig[, resp]), apply(data.orig[, pred], 2, sd), sd(data.orig[, resp]))
             (Intercept)  data.std[, pred]lcavol  data.std[, pred]lweight
             0.669399309             0.587022878              0.454460536
     data.std[, pred]age    data.std[, pred]lbph     data.std[, pred]svi
            -0.019637207             0.107054371             0.766155934
     data.std[, pred]lcp  data.std[, pred]gleason   data.std[, pred]pgg45
            -0.105473565             0.045135971             0.004525323
The result is the same.
Now consider splitting the data into two parts - a training sample and a test sample. This enables us to test our model in cases for which we may be unable to obtain further data for testing. Recall that the "scientific method" requires that we collect data, form a model based on that data, and then test the model with other data.
If we produce a subset of indices, we often wish to know what indices are in the original set but not in the subset. The following function produces those indices NOT in a set. Use this to get the test sample.
"%w/o%" <- function(x, y) x[!x %in% y]
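A tiny illustration of the %w/o% ("without") operator just defined (the definition is repeated here so the snippet stands alone):

```r
# x %w/o% y returns the elements of x that are not in y.
"%w/o%" <- function(x, y) x[!x %in% y]
(1:10) %w/o% c(2, 5, 9)
# [1]  1  3  4  6  7  8 10
```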
Now we get the indices for the training and test samples
#--------------------------------------------
#  Set the indices for the training/test sets
#--------------------------------------------
get.train <- function(data.sz, train.sz) {
  # Take subsets of data for training/test samples
  # Return the indices
  train.ind <- sample(data.sz, train.sz)
  test.ind <- (1:data.sz) %w/o% train.ind
  list(train=train.ind, test=test.ind)
}
Train.sz <- 67  # Set the size of the training sample
# Get the indices for the training and test samples
(tt.ind <- get.train(dim(data.orig)[1], Train.sz))
$train
 [1] 19  8 58 86 26 80 52 84 53 11 89 69 39 36  3  9  2 72 96 66 23 74 56 46 91
[26] 61 20 59 48 47 78 16 43 54 90 57 27 85 75 87 60 35 29 70 68 65 77 28 51 92
[51]  4 21 97 82 64 73 14 22 71 44 40  7 55 45 25 63 83

$test
 [1]  1  5  6 10 12 13 15 17 18 24 30 31 32 33 34 37 38 41 42 49 50 62 67 76 79
[26] 81 88 93 94 95
This produces two sets of (non-overlapping) indices to the data, so we can randomly split the data into two parts
train.X.orig <- data.orig[tt.ind$train, pred]
train.Y.orig <- data.orig[tt.ind$train, resp]
and also standardize.
train.X.std <- f.data.std(train.X.orig)
train.Y.std <- f.data.std(train.Y.orig)
Now we perform a least squares regression on the standardized data and convert the results back to the original variables.
lm.trn <- lm(train.Y.std ~ train.X.std)
std.to.orig(lm.trn$coefficients, apply(train.X.orig, 2, mean),
            mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig))
        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         1.10007224          0.59895775          0.75542221         -0.02376003
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.07222878          0.49050976         -0.07499372         -0.14310867
   train.X.stdpgg45
         0.00748505
lm(train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            1.100072             0.598958             0.755422
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.023760             0.072229             0.490510
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.074994            -0.143109             0.007485
We note that both methods produce the same result, but differ from that of the full data set. That might cause us to wonder how sensitive the fit is to the training/test split.
Look at the effect of using different training sets:
for (i in 1:5) {
  tt.ind <- get.train(dim(data.std)[1], Train.sz)
  train.X.orig <- data.orig[tt.ind$train, pred]
  train.Y.orig <- data.orig[tt.ind$train, resp]
  train.X.std <- f.data.std(train.X.orig)
  train.Y.std <- f.data.std(train.Y.orig)
  lm.trn <- lm(train.Y.std ~ train.X.std)
  print(lm(train.Y.orig ~ train.X.orig))
  print(std.to.orig(lm.trn$coefficients, apply(train.X.orig, 2, mean),
        mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig)))
  print("")
}

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            0.339876             0.531795             0.613882
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.025523             0.104623             0.669538
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.009038             0.087536             0.003572

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
        0.339875974         0.531794655         0.613882144        -0.025522689
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.104622846         0.669537978        -0.009037589         0.087535930
   train.X.stdpgg45
        0.003572217
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            0.825951             0.615631             0.483133
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.019817             0.092106             0.643410
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.083041            -0.010578             0.006723

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         0.82595121          0.61563149          0.48313272         -0.01981683
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.09210631          0.64341000         -0.08304134         -0.01057834
   train.X.stdpgg45
         0.00672335
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
             0.49567              0.48873              0.73814
     train.X.origage     train.X.origlbph      train.X.origsvi
            -0.01703              0.03406              0.48843
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
            -0.11416             -0.10147              0.01143

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         0.49566937          0.48872861          0.73813804         -0.01703469
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.03405864          0.48842511         -0.11415935         -0.10147043
   train.X.stdpgg45
         0.01143487
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            1.396644             0.629813             0.615335
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.035610             0.107752             0.686840
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.151814            -0.037363             0.008356

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
        1.396643662         0.629812599         0.615335467        -0.035610354
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.107752489         0.686840338        -0.151813657        -0.037362643
   train.X.stdpgg45
        0.008355827
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
           -0.210137             0.535780             0.716955
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.014440             0.046356             0.509348
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.065754            -0.016679             0.007443

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
       -0.210137370         0.535780030         0.716954850        -0.014440157
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.046355829         0.509347668        -0.065753915        -0.016678925
   train.X.stdpgg45
        0.007443285
[1] ""
The intercept estimate ranges from -0.210137 to 1.396644!
3.6 Cross Validation (CV)
One way to attempt to deal with this is cross-validation.
With double cross-validation, the model is built for each part of the split data and then tested on the other part of the data, yielding two measures of consistency and predictive ability. We can expand on this by dividing the training set into more (than 2) sets. There are various schemes, but we will look at the case of 10 subsets of the training data and use the method of developing a model based on 9/10ths of the training sample and then using the other 10th to do an error analysis. Using different combinations of the 9 to form the training set, there are then 10!/(9!1!) = 10 separate training/testing combinations. The idea is that we average a number of weak predictors to produce (what we hope is) a strong predictor.
The training set must be sufficiently large as to allow development of a reasonable model. The number of cases should be at least 6 to 10 times the number of variables in the set of predictor variables. This might necessitate the test data set being smaller than the training data set.
If a data set is very large, it can be divided into three parts - the first part to develop the model, the second to estimate the parameters of the model, and the third for validation. This avoids bias resulting from estimating the parameters from the same data set used for developing the model.
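The fold construction described above can also be sketched compactly. This is an illustrative alternative (not the notes' padded-matrix code, which follows below), dealing a shuffled set of indices out into 10 nearly equal folds:

```r
# Assign n training cases to 10 cross-validation folds: shuffle the
# indices, then deal them out round-robin with split().
set.seed(3)
n <- 67                                        # size of a training sample
folds <- split(sample(1:n), rep(1:10, length.out = n))
length(folds)           # 10 folds
sapply(folds, length)   # fold sizes differ by at most 1
```

Each fold then serves once as the held-out tenth while the other nine folds form the model-building set.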
[In order to try to get reproducible results, a fixed training/test sample split is used. For totally random sets remove the tt.ind <- list(train <- c(64, 62, 75, 6, ...).]
In the following function, a matrix is created with 10 rows (the number of columns being the number that will be in each of the 10 sets). Because there will not (in general) be enough data to fill the matrix, we have to pad. Using the numbers that we wish to use for our training set (see below), we get
64 87 12 34 83 61  3
62 32 16 51 74 28 15
75 23 11  8 60 85 78
 6  9 47 17 14 13 44
94 59 21 46 48 43 20
56  2 54 45 41 29 84
52 96 91 93 35 82 76
80 66 24 57 70 77  0
33 90 81  4 68 92  0
55 53 86 50 72 22  0
Each set is obtained by selecting a row.
#-------------------- Cross Validation ----------------------
# This section does a cross validation on the full LS model
# Set up cross validation indices
# Create a train/test split and then create 10 subsets
#   cv.sets$cv.train  10 cross validation training sets from the training data
#   cv.sets$cv.test   Corresponding test sets from the training data
#   cv.sets$test      Test data
#------------------------------------------------------------
set.up.cv.ind <- function(data.sz, Train.sz) {
  tt.ind <- get.train(data.sz, Train.sz)
  # Fix the samples
  tt.ind <-
    list(train = c(64, 62, 75,  6, 94, 56, 52, 80, 33, 55, 87, 32,
                   23,  9, 59,  2, 96, 66, 90, 53, 12, 16, 11, 47,
                   21, 54, 91, 24, 81, 86, 34, 51,  8, 17, 46, 45,
                   93, 57,  4, 50, 83, 74, 60, 14, 48, 41, 35, 70,
                   68, 72, 61, 28, 85, 13, 43, 29, 82, 77, 92, 22,
                    3, 15, 78, 44, 20, 84, 76),
         test = c( 1,  5,  7, 10, 18, 19, 25, 26, 27, 30, 31, 36,
                  37, 38, 39, 40, 42, 49, 58, 63, 65, 67, 69, 71,
                  73, 79, 88, 89, 95, 97))
  n.in.set <- ceiling(Train.sz * .1)
  # Pad with 0 to avoid repetition
  cv.sets <- matrix(c(tt.ind[[1]], rep(0, n.in.set*10 - Train.sz)),
                    ncol = n.in.set, byrow = F)
  cv.train <- {}
  cv.test <- {}
  for (i in 1:10) {
    # Select the 1/10th for testing and 9/10ths for training
    cv.test <- c(cv.test, list(cv.sets[i, ]))  # The indices are from the full set
    cv.train <- c(cv.train, list(as.vector(cv.sets[(1:10) %w/o% i, ])))
  }
  list(cv.train = cv.train, cv.test = cv.test, test = tt.ind[[2]])
}
cv.sets <- set.up.cv.ind(dim(data.std)[1], Train.sz)
cv.sets$cv.train  # 10 cross validation training sets from the training data
[[1]]
 [1] 62 75  6 94 56 52 80 33 55 32 23  9 59  2 96 66 90 53 16 11 47 21 54 91 24
[26] 81 86 51  8 17 46 45 93 57  4 50 74 60 14 48 41 35 70 68 72 28 85 13 43 29
[51] 82 77 92 22 15 78 44 20 84 76  0  0  0

[[2]]
 [1] 64 75  6 94 56 52 80 33 55 87 23  9 59  2 96 66 90 53 12 11 47 21 54 91 24
[26] 81 86 34  8 17 46 45 93 57  4 50 83 60 14 48 41 35 70 68 72 61 85 13 43 29
[51] 82 77 92 22  3 78 44 20 84 76  0  0  0

[[3]]
 [1] 64 62  6 94 56 52 80 33 55 87 32  9 59  2 96 66 90 53 12 16 47 21 54 91 24
[26] 81 86 34 51 17 46 45 93 57  4 50 83 74 14 48 41 35 70 68 72 61 28 13 43 29
[51] 82 77 92 22  3 15 44 20 84 76  0  0  0

[[4]]
 [1] 64 62 75 94 56 52 80 33 55 87 32 23 59  2 96 66 90 53 12 16 11 21 54 91 24
[26] 81 86 34 51  8 46 45 93 57  4 50 83 74 60 48 41 35 70 68 72 61 28 85 43 29
[51] 82 77 92 22  3 15 78 20 84 76  0  0  0

[[5]]
 [1] 64 62 75  6 56 52 80 33 55 87 32 23  9  2 96 66 90 53 12 16 11 47 54 91 24
[26] 81 86 34 51  8 17 45 93 57  4 50 83 74 60 14 41 35 70 68 72 61 28 85 13 29
[51] 82 77 92 22  3 15 78 44 84 76  0  0  0

[[6]]
 [1] 64 62 75  6 94 52 80 33 55 87 32 23  9 59 96 66 90 53 12 16 11 47 21 91 24
[26] 81 86 34 51  8 17 46 93 57  4 50 83 74 60 14 48 35 70 68 72 61 28 85 13 43
[51] 82 77 92 22  3 15 78 44 20 76  0  0  0

[[7]]
 [1] 64 62 75  6 94 56 80 33 55 87 32 23  9 59  2 66 90 53 12 16 11 47 21 54 24
[26] 81 86 34 51  8 17 46 45 57  4 50 83 74 60 14 48 41 70 68 72 61 28 85 13 43
[51] 29 77 92 22  3 15 78 44 20 84  0  0  0

[[8]]
 [1] 64 62 75  6 94 56 52 33 55 87 32 23  9 59  2 96 90 53 12 16 11 47 21 54 91
[26] 81 86 34 51  8 17 46 45 93  4 50 83 74 60 14 48 41 35 68 72 61 28 85 13 43
[51] 29 82 92 22  3 15 78 44 20 84 76  0  0

[[9]]
 [1] 64 62 75  6 94 56 52 80 55 87 32 23  9 59  2 96 66 53 12 16 11 47 21 54 91
[26] 24 86 34 51  8 17 46 45 93 57 50 83 74 60 14 48 41 35 70 72 61 28 85 13 43
[51] 29 82 77 22  3 15 78 44 20 84 76  0  0

[[10]]
 [1] 64 62 75  6 94 56 52 80 33 87 32 23  9 59  2 96 66 90 12 16 11 47 21 54 91
[26] 24 81 34 51  8 17 46 45 93 57  4 83 74 60 14 48 41 35 70 68 61 28 85 13 43
[51] 29 82 77 92  3 15 78 44 20 84 76  0  0

cv.sets$cv.test  # Corresponding test sets from the training data
[[1]]
[1] 64 87 12 34 83 61  3

[[2]]
[1] 62 32 16 51 74 28 15

[[3]]
[1] 75 23 11  8 60 85 78

[[4]]
[1]  6  9 47 17 14 13 44

[[5]]
[1] 94 59 21 46 48 43 20

[[6]]
[1] 56  2 54 45 41 29 84

[[7]]
[1] 52 96 91 93 35 82 76

[[8]]
[1] 80 66 24 57 70 77  0

[[9]]
[1] 33 90 81  4 68 92  0

[[10]]
[1] 55 53 86 50 72 22  0

cv.sets$test  # Test data
 [1]  1  5  7 10 18 19 25 26 27 30 31 36 37 38 39 40 42 49 58 63 65 67 69 71 73
[26] 79 88 89 95 97
While we will not use the cross-validation at this stage, we now have the training/test sets.
3.7 Best subset regression
We need a few tools -
#--------------------------------------------------------
#   Recursive function to compute all possible subsets
#--------------------------------------------------------
next.combo <- function(combo, numb, all.current, prev, ind) {
  i <- prev + 1
  while (i <= numb) {
    combo[ind, c(all.current, i)] <- c(all.current, i)
    ind <- ind + 1
    res <- next.combo(combo, numb, c(all.current, i), i, ind)
    combo <- res[[1]]
    ind <- res[[2]]
    i <- i + 1
  }
  list(combo, ind)
}
number.of.combos <- function(numb) {
  n.combo <- 0
  for (j in 1:numb) {
    n.combo <- n.combo + choose(numb, j)
  }
  n.combo
}
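Since every nonempty subset of numb predictors is counted exactly once, the loop above just computes $\sum_{j=1}^{\text{numb}} \binom{\text{numb}}{j} = 2^{\text{numb}} - 1$. An equivalent one-liner (illustrative, not the notes' code; the name `number.of.combos.fast` is chosen here):

```r
# 2^numb - 1: the number of nonempty subsets of numb predictors.
number.of.combos.fast <- function(numb) 2^numb - 1
number.of.combos.fast(8)   # 255, matching the 255 subset models below
```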
#--------------------------------------------------------
create.combos <- function(numb) {
  n.combo <- number.of.combos(numb)
  combo <- matrix(0, n.combo, numb)
  temp <- next.combo(combo, numb, {}, 0, 1)[[1]]
  temp[order(apply(temp != 0, 1, sum)), ]
}
We create the subsets
combos <- create.combos(8)
rownames(combos) <- c(1:255)
combos[c(1, 2, 3, 11, 12, 13, 253, 254, 255), ]
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
1      1    0    0    0    0    0    0    0
2      0    2    0    0    0    0    0    0
3      0    0    3    0    0    0    0    0
11     1    0    0    4    0    0    0    0
12     1    0    0    0    5    0    0    0
13     1    0    0    0    0    6    0    0
253    1    0    3    4    5    6    7    8
254    0    2    3    4    5    6    7    8
255    1    2    3    4    5    6    7    8
We create regression models for all possible subsets and plot the SSE.
We use the training and test sets from above. Because we want to use the full training sample for this part we combine the cv.train and cv.test for the training sample.
train.X.orig <- data.orig[c(cv.sets$cv.train[[1]], cv.sets$cv.test[[1]]), pred]
train.Y.orig <- data.orig[c(cv.sets$cv.train[[1]], cv.sets$cv.test[[1]]), resp]
test.X.orig <- data.orig[cv.sets$test, pred]
test.Y.orig <- data.orig[cv.sets$test, resp]
train.X.std <- f.data.std(train.X.orig)
train.Y.std <- f.data.std(train.Y.orig)
# The returned coef are for the original data
n.data <- length(train.Y.std)
lpm.all <- {}
for (i in 1:dim(combos)[1]) {
  X <- train.X.std[, combos[i, ]]
  lpm <- lm(train.Y.std ~ X)
  lpm$coefficients <- std.to.orig(lpm$coefficients, apply(train.X.orig, 2, mean),
      mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig))
  lpm.all <- c(lpm.all, list(lpm))
}
lpm.all[[1]]

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)            X
     1.5626       0.7091

lpm.all[[9]]

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight
    -0.1422       0.6771       0.4752
Now plot the results:
plot(-1, 0, xlim=c(0, 8), ylim=c(0, 100), xlab="k", ylab="SSE")
for (i in 1:dim(combos)[1]) {
  points(sum(combos[i, ] > 0), sum(lpm.all[[i]]$residuals^2))
}
We would like to tabulate the results, so we need a couple more functions. The first one allows us to specify the number of digits that we will display, and the second creates a table of our results.
#--------------------------------------------------------
trunc <- function(numb, dig) {
  mask <- 10^dig
  floor(mask * numb) / mask
}
#--------------------------------------------------------
create. table �- function ( combos, Y) {
n. data �- length( Y)
n. combo �- dim( combos)[ 1]
table �- matrix( 0, n. combo�1, 4)
TSS �- sum(( Y - mean( Y))^ 2)
table[ 1, 2] �- trunc( TSS, 5)
table[ 1, 3] �- 0
table[ 1, 4] �- trunc( table[ 1, 2]/( n. data- 2), 5)
row. names �- ” None”
for ( i in 1: n. combo) {
temp1 �- format( combos[ i,])
temp2 �-{}
for ( j in 1: length( temp1)) {
if ( temp1[ j] ! � ” 0”)
temp2 �- paste( temp2, ” X”, temp1[ j], sep�””)
}
row. names �- c( row. names, temp2)
table[ i�1, 1] �- trunc( lpm. all[[ i]]$ df. residual, 5)
table[ i�1, 2] �- trunc( sum( lpm. all[[ i]]$ residual^2), 5)
table[ i�1, 3] �- trunc( 1 - table[ i�1, 2]/ TSS, 5)
table[ i�1, 4] �- trunc( table[ i�1, 2]/( n. data- 2), 5)
}
dimnames( table) �- list( row. names, c(” df”,” SSE”,” R^2”,” MSE”))
table
}
table <- create.table(combos, train.Y.std)
table[1:15,]
     df      SSE     R^2     MSE
None  0 66.00000 0.00000 1.01538
X1   65 26.58699 0.59716 0.40903
X2   65 58.53916 0.11304 0.90060
X3   65 63.63167 0.03588 0.97894
X4   65 61.24990 0.07197 0.94230
X5   65 48.35566 0.26733 0.74393
X6   65 49.64103 0.24786 0.76370
X7   65 58.32761 0.11624 0.89734
X8   65 52.63896 0.20244 0.80983
X1X2 64 23.53202 0.64345 0.36203
X1X3 64 26.42891 0.59956 0.40659
X1X4 64 23.44072 0.64483 0.36062
X1X5 64 25.48333 0.61388 0.39205
X1X6 64 26.55899 0.59759 0.40859
X1X7 64 26.57080 0.59741 0.40878
We also want the minimum SSE for each subset size, which we get with the next function.
#--------------------------------------------------------
create.mins <- function (lpm.all, n.combo, numb) {
  mins <- matrix(10^(30), numb+1, 2)
  # Look at all the combos and find the minimum for each category (rank)
  for (i in 1:n.combo) {
    temp <- sum(lpm.all[[i]]$residuals^2)
    if (temp < mins[lpm.all[[i]]$rank, 2]) {
      mins[lpm.all[[i]]$rank, 1] <- i
      mins[lpm.all[[i]]$rank, 2] <- temp
    }
  }
  mins[1, 1] <- 0
  mins[1, 2] <- table[1, 2]
  mins
}
mins <- create.mins(lpm.all, dim(combos)[1], 8)
mins
      [,1]     [,2]
 [1,]    0 66.00000
 [2,]    1 26.58699
 [3,]   11 23.44073
 [4,]   42 21.89790
 [5,]   98 20.95709
 [6,]  175 20.47357
 [7,]  230 19.82302
 [8,]  248 18.89969
 [9,]  255 18.34662
We can add that information to the plot
lines(0:p, mins[1:(p+1), 2])
points(0:p, mins[1:(p+1), 2], col="red")
Figure 25.
for (i in 0:p) {
  cat(dimnames(table)[[1]][mins[i+1, 1]+1], table[mins[i+1, 1]+1, 2], "\n")
}
and display the results.
None 66
X1 26.58699
X1X4 23.44072
X1X2X8 21.89789
X1X2X4X5 20.95709
X1X2X4X5X8 20.47356
X1X2X4X5X6X8 19.82302
X1X2X3X4X5X6X8 18.89969
X1X2X3X4X5X6X7X8 18.34661
# Display all the best
for (i in 1:(p-1)) {
  print(lpm.all[[mins[i+1, 1]]])
}

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)            X
     1.5626       0.7091

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol        Xlbph
    -0.1744       0.6960       0.4771

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight       Xpgg45
   -1.96819      0.60525      0.51626      0.02751

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi
   -1.11996      0.59307      0.32478      0.02477      0.13559

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi       Xpgg45
   -1.07243      0.57009      0.36866      0.02129      0.10363      0.29449

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi         Xlcp
   -0.93951      0.61690      0.35376      0.02177      0.14212     -0.47232
     Xpgg45
    0.14464

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight         Xage        Xlbph         Xsvi
   -0.79405      0.64159      0.37983     -0.02216      0.14823      0.58202
       Xlcp       Xpgg45
   -0.19252      0.33245
3.8 Variance Inflation Factors (VIFs)
Under ordinary least squares (OLS) regression,
$$b = (X^T X)^{-1} X^T Y$$
and
$$\sigma_b^2 = \mathrm{Var}(b) = (X^T X)^{-1} X^T \sigma_Y^2 I \, X (X^T X)^{-1} = \sigma_Y^2 (X^T X)^{-1}$$
If we have used the correlation transformation, we obtain
$$\sigma_{b'}^2 = \mathrm{Var}(b') = \sigma_{Y'}^2 (X'^T X')^{-1} = r_{XX}^{-1}$$
The variance inflation factors (VIFs) are then the diagonal elements of $r_{XX}^{-1}$.
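The calculation is direct: invert the predictor correlation matrix and read off its diagonal. A small numpy sketch on hypothetical collinear data (the R code for the body-fat example appears in the next section):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # strongly collinear with x1
x3 = rng.normal(size=n)              # nearly independent of the others
X = np.column_stack([x1, x2, x3])

r_xx = np.corrcoef(X, rowvar=False)  # predictor correlation matrix
vif = np.diag(np.linalg.inv(r_xx))   # VIFs: diagonal of the inverse

# Collinear predictors inflate the variance of their coefficients;
# the independent predictor stays close to the minimum value of 1
assert vif[0] > 10 and vif[1] > 10 and vif[2] < 2
```

A VIF near 1 indicates no inflation; values well above 10 are a common warning sign of serious multicollinearity.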
3.9 Weighted Least Squares (WLS) Regression
We consider a weighted sum of squares for error
$$Q = \sum_i w_i (Y_i - \hat{Y}_i)^2$$
assuming
$$Y = X\beta + \varepsilon$$
but not assuming that the error variances are constant. We set
$$W = \begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0 & & w_n \end{bmatrix}$$
Then
$$(X^T W X) b_w = X^T W Y$$
or
$$b_w = (X^T W X)^{-1} X^T W Y$$
and
$$\sigma_{b_w}^2 = \sigma^2 (X^T W X)^{-1}$$
Note that we may set
$$W^{1/2} = \begin{bmatrix} \sqrt{w_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{w_n} \end{bmatrix}$$
so
$$W^{1/2} Y = (W^{1/2} X)\beta + W^{1/2}\varepsilon$$
Note that this is equivalent to transforming the response variable $Y$ and the predictor variables in the $X$ matrix to become
$$Y' = W^{1/2} Y$$
and
$$X' = W^{1/2} X$$
respectively, and re-writing the model as
$$Y' = X'\beta + \varepsilon'$$
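The equivalence between the weighted normal equations and OLS on the transformed variables can be checked numerically. A numpy sketch with hypothetical data and weights (reciprocal error variances, as suggested below):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])   # intercept + slope design matrix
sigma = 0.2 * (1 + x)                  # non-constant error standard deviation
y = 1.0 + 0.5 * x + rng.normal(scale=sigma)
w = 1.0 / sigma**2                     # weights = reciprocal error variances
W = np.diag(w)

# Direct WLS solution: b_w = (X^T W X)^{-1} X^T W Y
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: OLS on the transformed variables W^{1/2} Y and W^{1/2} X
Xs = np.sqrt(w)[:, None] * X
ys = np.sqrt(w) * y
b_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

assert np.allclose(b_w, b_ols)
```

Both routes give the same coefficient vector; the transformed-variable form is often preferred numerically because it avoids building the n-by-n matrix W.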
The weights $w_i$ are often taken as the reciprocals of the (unequal) error variances, or some function related to them. We will now take a look at some robust methods of regression.
3.10 Ridge Regression
Recall that the least-squares solution vector $b$ is unbiased and of minimum variance in the set of all linear unbiased estimators.
Note: We have been restricting ourselves to finding the linear unbiased estimator that has minimum variance. Recall that
$$MSE = \text{Variance} + (\text{Bias})^2$$
We have been restricting ourselves to situations where bias = 0 and hence our MSE is equivalent to the variance. Now we actually modify least squares to force biased estimators for the regression coefficients by adding a small amount of known bias $c$. (Usually $0 < c < 1$.) We then see if there is a value of $c$ for which $MSE = \text{Variance} + (\text{Bias})^2$ is minimized. We are looking for a situation where allowing biased estimators results in a smaller variance and a resulting MSE that is lower than if we restricted ourselves to only permitting unbiased estimators. This is commonly called the variance-bias trade-off, i.e. there may be a situation where $c > 0$ corresponds to a lower variance (more precise) and a lower combined MSE (actually we can prove that this is indeed true) than when $c = 0$ and we have the minimum variance linear unbiased estimator (BLUE). When an estimator has small (known) bias and is more precise than an unbiased estimator we may actually prefer the biased estimator, since it is more likely (probable) to be close to the true parameter value.
d.file <- paste(data.dir, "ch07ta01.dat", sep = "/")
d.data <- matrix(scan(d.file), ncol=4, byrow=T)
names <- c("Triceps", "Thigh", "MidArm", "BodyFat")
dimnames(d.data) <- list(1:20, names)
df.data <- as.data.frame(d.data)
Ridge regression requires that the data be transformed by a correlation transformation, so, as we saw:
$$Y_i' = \frac{1}{\sqrt{n-1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$$
$$X_{ik}' = \frac{1}{\sqrt{n-1}} \left( \frac{X_{ik} - \bar{X}_k}{s_k} \right) \quad (k = 1, \ldots, p-1).$$
Get the mean and standard deviations:
(bar <- apply(d.data, 2, mean))
 Triceps   Thigh  MidArm BodyFat
  25.305  51.170  27.620  20.195
(s <- apply(d.data, 2, sd))
 Triceps    Thigh   MidArm  BodyFat
5.023259 5.234612 3.647147 5.106186
and compute the new variables:
d.data.t <- t((t(d.data) - bar)/s)/sqrt(dim(d.data)[1]-1)
We also compute the correlation matrices:
$$\underset{(p-1)\times(p-1)}{X'^T X'} = r_{XX} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1,p-1} \\ r_{21} & 1 & \cdots & r_{2,p-1} \\ \vdots & \vdots & & \vdots \\ r_{p-1,1} & r_{p-1,2} & \cdots & 1 \end{bmatrix}$$
$$\underset{(p-1)\times 1}{X'^T Y'} = r_{YX} = \begin{bmatrix} r_{Y1} \\ r_{Y2} \\ \vdots \\ r_{Y,p-1} \end{bmatrix}$$
(r.xx <- cor(d.data[, 1:3]))
          Triceps     Thigh    MidArm
Triceps 1.0000000 0.9238425 0.4577772
Thigh   0.9238425 1.0000000 0.0846675
MidArm  0.4577772 0.0846675 1.0000000
(r.xy <- cor(d.data[, 1:3], d.data[, 4]))
             [,1]
Triceps 0.8432654
Thigh   0.8780896
MidArm  0.1424440
I <- diag(1, 3, 3)
For ordinary least squares, the normal equations are of the form
$$(X^T X) b = X^T Y$$
while after the correlation transformation, the least squares equations are
$$r_{XX} b = r_{YX}.$$
For ridge regression, a biasing constant $(c \ge 0)$ is introduced to give
$$(r_{XX} + cI) b_R = r_{YX}$$
so
$$b_R = (r_{XX} + cI)^{-1} r_{YX}.$$
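The closed form can be checked with a short numpy sketch. Using a hypothetical 2-predictor correlation matrix, we solve the biased normal equations for a few values of $c$ and confirm that the coefficient vector shrinks as $c$ grows:

```python
import numpy as np

# Hypothetical correlation matrices for two collinear predictors
r_xx = np.array([[1.0, 0.9],
                 [0.9, 1.0]])
r_yx = np.array([0.8, 0.7])

def ridge_coef(c):
    """Solve (r_xx + c*I) b_R = r_yx for the standardized ridge coefficients."""
    return np.linalg.solve(r_xx + c * np.eye(2), r_yx)

b0 = ridge_coef(0.0)   # c = 0 recovers ordinary least squares
b1 = ridge_coef(0.1)
b2 = ridge_coef(1.0)

# Increasing c shrinks the coefficient vector toward zero
assert np.linalg.norm(b1) < np.linalg.norm(b0)
assert np.linalg.norm(b2) < np.linalg.norm(b1)
```

Note also how the collinearity shows up at $c = 0$: the two predictors have similar correlations with the response, yet one OLS coefficient comes out negative; a small $c$ already stabilizes the signs.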
The ridge trace is a simultaneous plot of the $p-1$ ridge standardized regression coefficients $b$ versus $c$ (generally on a log scale). The VIFs fall rapidly as $c$ begins to move away from 0 and then tend to stabilize as $c$ increases. We choose the smallest $c$ such that the regression coefficients become stable (see the ridge trace) and the VIF values become reasonably small (close to 1).
Ridge.coef <- matrix(0, 29, 4)
i <- 1
for (c in c(seq(0, .01, by=.001), seq(.02, .1, by=.01), seq(.2, 1.0, by=.1))) {
  Ridge.coef[i, 1] <- c
  Ridge.coef[i, 2:4] <- solve(r.xx + c*I, r.xy)
  i <- i + 1
}
dimnames(Ridge.coef) <- list(NULL, c("c", "b1", "b2", "b3"))
Ridge.coef
          c        b1          b2           b3
 [1,] 0.000 4.2637046 -2.92870065 -1.561416794
 [2,] 0.001 2.0347999 -0.94080230 -0.708676831
 [3,] 0.002 1.4406628 -0.41128508 -0.481273455
 [4,] 0.003 1.1652742 -0.16612384 -0.375799189
 [5,] 0.004 1.0063236 -0.02483665 -0.314865607
 [6,] 0.005 0.9028128  0.06699318 -0.275139506
 [7,] 0.006 0.8300164  0.13142332 -0.247162829
 [8,] 0.007 0.7760090  0.17909237 -0.226373864
 [9,] 0.008 0.7343315  0.21576257 -0.210301840
[10,] 0.009 0.7011811  0.24482646 -0.197492074
[11,] 0.010 0.6741729  0.26841149 -0.187032336
[12,] 0.020 0.5463339  0.37740366 -0.136871542
[13,] 0.030 0.5003777  0.41341436 -0.118077883
[14,] 0.040 0.4760035  0.43023669 -0.107583242
[15,] 0.050 0.4604598  0.43924481 -0.100508309
[16,] 0.060 0.4493942  0.44432014 -0.095186663
[17,] 0.070 0.4409157  0.44714834 -0.090893478
[18,] 0.080 0.4340699  0.44857934 -0.087262359
[19,] 0.090 0.4283233  0.44908795 -0.084087845
[20,] 0.100 0.4233540  0.44896004 -0.081245545
[21,] 0.200 0.3914248  0.43472004 -0.061289960
[22,] 0.300 0.3703229  0.41540311 -0.047886515
[23,] 0.400 0.3529429  0.39658090 -0.037644774
[24,] 0.500 0.3377199  0.37905822 -0.029500232
[25,] 0.600 0.3240420  0.36291486 -0.022888815
[26,] 0.700 0.3115856  0.34806526 -0.017448562
[27,] 0.800 0.3001454  0.33438723 -0.012926334
[28,] 0.900 0.2895758  0.32175831 -0.009136642
[29,] 1.000 0.2797659  0.31006643 -0.005939486
vif <- matrix(0, 29, 5)
i <- 1
for (c in c(seq(0, .01, by=.001), seq(.02, .1, by=.01), seq(.2, 1.0, by=.1))) {
  vif[i, 1] <- c
  a.inv <- solve(r.xx + c*I, I)
  vif[i, 2:4] <- diag(a.inv %*% r.xx %*% a.inv)
  vif[i, 5] <- 1 - sum((Ridge.coef[i, 2:4] %*% t(d.data.t[, 1:3]) - d.data.t[, 4])^2)
  i <- i + 1
}
dimnames(vif) <- list(NULL, c("c", "VIF1", "VIF2", "VIF3", "R^2"))
vif
          c        VIF1        VIF2        VIF3       R^2
 [1,] 0.000 708.8429142 564.3433857 104.6060050 0.8013586
 [2,] 0.001 125.7308694 100.2740321  19.2809671 0.7943486
 [3,] 0.002  50.5591892  40.4483104   8.2797004 0.7901140
 [4,] 0.003  27.1750112  21.8376013   4.8561838 0.7878135
 [5,] 0.004  16.9815701  13.7247233   3.3627919 0.7863882
 [6,] 0.005  11.6434185   9.4759221   2.5798503 0.7854212
 [7,] 0.006   8.5033223   6.9764415   2.1185423 0.7847225
 [8,] 0.007   6.5013345   5.3827244   1.8237725 0.7841937
 [9,] 0.008   5.1471650   4.3045740   1.6237998 0.7837792
[10,] 0.009   4.1886926   3.5413399   1.4817327 0.7834452
[11,] 0.010   3.4855023   2.9812730   1.3770252 0.7831698
[12,] 0.020   1.1025508   1.0805407   1.0105134 0.7817952
[13,] 0.030   0.6256980   0.6969055   0.9234580 0.7812036
[14,] 0.040   0.4527887   0.5552892   0.8814032 0.7807894
[15,] 0.050   0.3704539   0.4858773   0.8531067 0.7804224
[16,] 0.060   0.3243741   0.4454349   0.8306002 0.7800589
[17,] 0.070   0.2956068   0.4188820   0.8110926 0.7796807
[18,] 0.080   0.2761480   0.3998444   0.7933948 0.7792794
[19,] 0.090   0.2621427   0.3852497   0.7769250 0.7788510
[20,] 0.100   0.2515476   0.3734680   0.7613679 0.7783935
[21,] 0.200   0.2052518   0.3078210   0.6341536 0.7722580
[22,] 0.300   0.1837581   0.2685829   0.5384602 0.7638181
[23,] 0.400   0.1675811   0.2383292   0.4634200 0.7538008
[24,] 0.500   0.1540398   0.2136522   0.4033143 0.7427376
[25,] 0.600   0.1422970   0.1930126   0.3543686 0.7310054
[26,] 0.700   0.1319468   0.1754722   0.3139517 0.7188740
[27,] 0.800   0.1227368   0.1603865   0.2801722 0.7065382
[28,] 0.900   0.1144865   0.1472858   0.2516396 0.6941399
[29,] 1.000   0.1070578   0.1358157   0.2273113 0.6817825
plot(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 2], ylim=c(-2, 3), type="l",
     xlab="log(c)", ylab="b", col="red")
lines(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 3], col="blue")
lines(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 4], col="green")
legend(-2, 3, legend=c("b1", "b2", "b3"),
       col=c("red", "blue", "green"), lty=1)
Figure 26.
The following combines some of the previous operations into one function.
ridge <- function (data, p.cols, r.cols, lambda, use.c = T) {
  # Get the mean and standard deviations
  # p.cols are the columns for the predictors
  # r.cols are the columns for the response
  data.std <- f.data.std(data)
  r.xx <- t(data.std[, p.cols]) %*% data.std[, p.cols]
  r.xy <- t(data.std[, p.cols]) %*% data.std[, r.cols]
  n.cols <- length(r.xy)
  I <- diag(1, n.cols, n.cols)
  Ridge.coef <- matrix(0, length(lambda), n.cols+2)
  i <- 1
  for (c in lambda) {
    if (use.c == F) {
      Ridge.coef[i, 1] <- sum(d/(d+L[i]))
    } else {
      Ridge.coef[i, 1] <- c
    }
    r.xx.c <- r.xx + c*I
    Ridge.coef[i, 2:(n.cols+2)] <- std.to.orig(c(0, lm(r.xy ~ r.xx.c - 1)$coefficients),
        apply(data.orig[, pred], 2, mean), mean(data.orig[, resp]),
        apply(data.orig[, pred], 2, sd), sd(data.orig[, resp]))
    i <- i + 1
  }
  vif <- matrix(0, length(lambda), n.cols+2)
  i <- 1
  for (c in lambda) {
    if (use.c == F) {
      vif[i, 1] <- sum(d/(d+L[i]))
    } else {
      vif[i, 1] <- c
    }
    a.inv <- solve(r.xx + c*I, I)
    vif[i, (1:n.cols)+1] <- diag(a.inv %*% r.xx %*% a.inv)
    vif[i, n.cols+2] <- 1 - sum((Ridge.coef[i, 2:4] %*% t(data.std[, 1:3]) - data.std[, 4])^2)
    i <- i + 1
  }
  list(coef=Ridge.coef, vif=vif)
}
In some cases the biasing parameter $c$ is not used directly, but rather we use a function of it:
$$df(c) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + c}$$
where $d_j$ is the diagonal entry in the singular value decomposition.
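This "effective degrees of freedom" runs from $p$ at $c = 0$ down toward 0 as $c$ grows, which gives the biasing parameter a more interpretable scale. A small numpy sketch on a hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))   # hypothetical 30 x 4 predictor matrix

d = np.linalg.svd(X, compute_uv=False)  # singular values d_j

def df(c):
    """Effective degrees of freedom: sum of d_j^2 / (d_j^2 + c)."""
    return float(np.sum(d**2 / (d**2 + c)))

# c = 0 recovers the full column rank p = 4; df shrinks as c grows
assert abs(df(0.0) - 4.0) < 1e-9
assert df(10.0) < df(1.0) < df(0.0)
```

The R code below does the same thing with the squared singular values (equivalently, the eigenvalues of XᵀX) and then chooses L values to give a good spread of df(L).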
d <- svd(data.orig[, pred])$d^2
# or
(d <- eigen(t(data.orig[, pred]) %*% data.orig[, pred])$values)
[1] 4.790826e+05 6.190704e+04 2.109043e+02 1.756330e+02 6.479861e+01
[6] 4.452384e+01 2.023903e+01 8.093138e+00
# Set up points to give a good df(L) spread
L <- c(seq(0, 7, by=7/10), seq(8, 20, by=12/10), seq(22, 50, by=28/10),
       seq(55, 120, by=65/10), seq(130, 400, by=270/10), seq(440, 3500, by=3160/10),
       seq(3700, 200000, by=196300/10), seq(250000, 5000000, by=475000/10))
ridge.res <- ridge(data.orig, pred, resp, L, F)
Ridge.coef <- ridge.res$coef
Ridge.coef[1:5,]
         [,1]      [,2]      [,3]      [,4]        [,5]      [,6]      [,7]
[1,] 8.000000 0.6693993 0.5870229 0.4544605 -0.01963721 0.1070544 0.7661559
[2,] 7.853506 0.6423863 0.5779035 0.4535208 -0.01910545 0.1056473 0.7567784
[3,] 7.721665 0.6176973 0.5692262 0.4525009 -0.01859826 0.1043013 0.7479234
[4,] 7.601828 0.5950669 0.5609539 0.4514107 -0.01811371 0.1030117 0.7395445
[5,] 7.491991 0.5742682 0.5530538 0.4502586 -0.01765007 0.1017743 0.7316005
            [,8]       [,9]       [,10]
[1,] -0.10547357 0.04513597 0.004525323
[2,] -0.09668191 0.04756349 0.004367505
[3,] -0.08847371 0.04974629 0.004224841
[4,] -0.08079499 0.05171680 0.004095576
[5,] -0.07359826 0.05350244 0.003978201
plot(Ridge.coef[1:dim(Ridge.coef)[1], 1], Ridge.coef[1:dim(Ridge.coef)[1], 3],
     ylim=c(-.2, 1), type="l", xlab="df(c)", ylab="b")
for (i in 2:p) {
  lines(Ridge.coef[1:dim(Ridge.coef)[1], 1], Ridge.coef[1:dim(Ridge.coef)[1], i+1])
}
legend(0, 1, legend=paste("b", 1:p, sep=""), col=1:p, lty=1)
Figure 27. Note the use of legend.
3.11 Smoothing Revisited
We saw a simple smoothing method (running mean) earlier. As the examples that we have considered in regression show, you may need to have some knowledge of the surface that you are trying to fit in order to know which type of regression to use. A better model may come from the class of kernel smoothers, which use local information obtained from the data.
We will revisit the running mean and look at it as a template for the kernel smoothers in general.
The smoothing methods use a window that is typically centered at the point at which we wish to obtain an estimate and has a width that is selected by the user. These methods compute a weighted mean of all the points in the window, with the difference between the methods arising from the way in which the weights are determined. (As with the interpolation methods, there are concerns about behaviour at the end points.)
The simplest example of such smoothers is the running mean, which has equal weights (1/n) for the points in the window. We will use $\lambda$ as our width parameter. The running mean uses $2\lambda + 1$ points in the window.
For example,
Figure 28.
Here the window moves across the data and computes the average of 9 data points.
One red marker is the midpoint of the window drawn with solid lines and the other red marker is the midpoint of the next window (dashed lines). One green marker is in the first window but not in the second, whereas the other green marker is in the second but not the first. The predicted line (shown in red) is discontinuous because the running mean window has the same contents until the leading edge encounters a new point. At that position, the window drops the point at the trailing edge and a jump occurs.
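The running mean itself is straightforward to sketch. This Python function averages the $2\lambda + 1$ points in each full window; for simplicity only interior positions are returned, so the end-point concerns mentioned above are side-stepped:

```python
def running_mean(y, lam):
    """Smooth y with a centered window of 2*lam + 1 equal weights.

    Only positions where the full window fits are returned, so the
    result is shorter than y by 2*lam points.
    """
    width = 2 * lam + 1
    return [sum(y[i:i + width]) / width for i in range(len(y) - width + 1)]

# A window sliding over a step in the data shows the jumps described
# above: the output changes only when the leading edge picks up a new
# value and the trailing edge drops an old one.
y = [0, 0, 0, 0, 3, 3, 3, 3]
print(running_mean(y, 1))  # -> [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
```

Each jump of size 1 here corresponds to one new point entering and one old point leaving a window of three.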
In many cases we would prefer a smooth curve. This can be obtained by the use of a window that does not suddenly drop (or gain) points but allows them to 'fade' in and out. There are three commonly used kernels that have this property.
The first has a familiar appearance as it is the Gaussian kernel
$$G_\lambda = \exp\left(-\left(\frac{x - x_0}{\lambda}\right)^2\right)$$
the second is the Epanechnikov kernel
$$E_\lambda = \begin{cases} \dfrac{3}{4}\left(1 - \left(\dfrac{x - x_0}{\lambda}\right)^2\right) & \text{if } |x - x_0| \le \lambda \\ 0 & \text{otherwise} \end{cases}$$
and the third is the Tri-cube
$$T_\lambda = \begin{cases} \left(1 - \left|\dfrac{x - x_0}{\lambda}\right|^3\right)^3 & \text{if } |x - x_0| \le \lambda \\ 0 & \text{otherwise} \end{cases}$$
and they look like
Figure 29.
(The Gaussian and tri-cube are differentiable everywhere.)
Figure 30. Figure 31.
Note that the wider window produces a smoother curve but a worse fit.
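The three kernels defined earlier can be written directly in code. A Python sketch, where x0 is the window centre and lam the half-width $\lambda$ (the exact Gaussian exponent here follows the common convention, as the scanned formula is partly illegible):

```python
import math

def gaussian(x, x0, lam):
    # Smooth everywhere; never exactly zero
    return math.exp(-((x - x0) / lam) ** 2)

def epanechnikov(x, x0, lam):
    u = (x - x0) / lam
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def tricube(x, x0, lam):
    u = abs(x - x0) / lam
    return (1 - u ** 3) ** 3 if u <= 1 else 0.0

# All three peak at the centre of the window ...
assert gaussian(0, 0, 1) == 1.0 and tricube(0, 0, 1) == 1.0
assert epanechnikov(0, 0, 1) == 0.75
# ... and the compact kernels vanish outside |x - x0| <= lam
assert epanechnikov(1.5, 0, 1) == 0.0 and tricube(1.5, 0, 1) == 0.0
```

The tri-cube value at half the window width, $(1 - 0.5^3)^3 \approx 0.66992$, is exactly the W value that shows up repeatedly in the LOWESS trace output later in this section.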
Figure 32. Figure 33.
Figure 34. Figure 35.
It is possible to modify the ideas of kernel smoothing to make use of regression concepts.
3.12 LOWESS
The idea behind LOWESS (or LOESS) [Cleveland, W.S. (1979) "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, Vol. 74, pp. 829-836] is that a span (or window) of the data $(x_j, x_{j+m})$ is selected and the midpoint (except at the ends of the data) is chosen as the point for estimation (similar to kernel smoothing). The distance from the point to be estimated to the other points in the span is found and the distances are scaled by the maximum distance in the span. Weights are determined - one such weighting function is the tricube weight function
$$w(x) = \begin{cases} \left(1 - |x|^3\right)^3, & |x| < 1 \\ 0, & |x| \ge 1. \end{cases}$$
A weighted least squares (WLS) fit is then done on the data in the span, and used to predict a new value for the point to be estimated. A simplistic (not robust) version follows.
f.weight <- function (x) {
  (1 - abs(x)^3)^3
}
Note the do.print=F. If this is changed to do.print=T then the loop that prints out the information about the process is active (by default it is not). It is possible to set default values for arguments in this manner. If a default applies then the argument need not be in the call to the function. See [Writing your own functions] [Named arguments and defaults] in An Introduction to R.
reg.value <- function (d.subset, pt, do.print=F) {
  # Get the distance from pt to every point in the interval.
  d.dist <- abs(d.subset[, 1] - pt)
  scaled.dist <- d.dist/max(d.dist)
  W <- f.weight(scaled.dist)
  n <- length(W)
  # Do a least squares fit (lm for linear model)
  # lm(y ~ x, data, subset, weights)
  lw <- lm(d.subset[, 2] ~ d.subset[, 1], as.data.frame(d.subset), 1:n, W)
  if (do.print) {
    print(pt)
    print(cbind(d.subset[, 1], d.subset[, 2], d.dist, scaled.dist, W))
    print(lw$coefficients)
    print(lw$coefficients %*% c(1, pt))
    print("************************************")
  }
  lw$coefficients %*% c(1, pt)
}
In the following, we start with a first span or 'band' of data and hold it fixed until the point to be estimated gets to the middle (Part 1), then the point and span move together until the span can move no further (Part 2), and then the point moves to the end of the last span (Part 3).
f.lowess <- function (d.data, band, do.print=F) {
  # d.data has the x and y values of the data
  # band is the bandwidth
  # Initialize
  v <- rep(0, dim(d.data)[1])  # Holds the result
  i <- 1
  first <- 1
  last <- first + band - 1
  mid <- floor((first+last)/2)
  ind.pt <- 1  # Points to position of pt in the band
  # Part 1 - move the pt at which we want the smooth
  # until we reach midpoint of band
  d.subset <- d.data[first:last,]
  while (ind.pt <= mid) {
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
    ind.pt <- ind.pt + 1
  }
  # Part 2 - move the pt with the band
  ind.pt <- ind.pt - 1  # In middle of band
  repeat {
    first <- first + 1
    last <- last + 1
    if (last > dim(d.data)[1]) break
    d.subset <- d.data[first:last,]
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
  }
  # Part 3 - move the midpoint until we reach the end
  ind.pt <- ind.pt + 1
  while (ind.pt <= band) {
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
    ind.pt <- ind.pt + 1
  }
  v
}
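For readers outside R, the same sliding-window weighted fit can be sketched compactly with numpy. This is a simplistic, non-robust version like the R code above, with the end-of-data handling simplified to clamping the window inside the data range (the synthetic noisy-sine data mirrors the example plotted below):

```python
import numpy as np

def tricube(u):
    # Tricube weight: (1 - |u|^3)^3 on [-1, 1], zero outside
    return np.clip(1 - np.abs(u) ** 3, 0, None) ** 3

def simple_lowess(x, y, band):
    """Local weighted linear fit at each x[i] over its band nearest points."""
    n = len(x)
    fitted = np.empty(n)
    half = band // 2
    for i in range(n):
        # Clamp the window to the data range (cf. Parts 1-3 above)
        first = min(max(i - half, 0), n - band)
        idx = slice(first, first + band)
        d = np.abs(x[idx] - x[i])
        w = tricube(d / d.max())
        # Weighted least squares line through the window
        A = np.column_stack([np.ones(band), x[idx]])
        WA = w[:, None] * A
        beta = np.linalg.solve(A.T @ WA, A.T @ (w * y[idx]))
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

x = np.linspace(0, 6, 61)
rng = np.random.default_rng(4)
y = np.sin(x) + rng.normal(scale=0.3, size=61)
v = simple_lowess(x, y, 13)
# The smooth should track sin(x) more closely than the noisy data do
assert np.mean((v - np.sin(x)) ** 2) < np.mean((y - np.sin(x)) ** 2)
```

As in the R version, each fitted value comes from its own locally weighted line; only the evaluation at x[i] is kept.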
In the following, the R version of lowess is also used for comparison.
library(stats)
d.lowess <- cbind(x.rough, y.rough)
oldpar <- par(mfrow = c(3, 1))
for (b in c(9, 13, 21)) {
  v <- f.lowess(d.lowess, b, T)
  plot(d.lowess[, 1], d.lowess[, 2], col="red", pch=20,
       main=paste("Lowess with bandwidth =", b))
  curve(sin, 0, 6, add=T, col="blue")
  lines(d.lowess[, 1], v, col="black")
  lines(lowess(d.lowess, f=b/61), col="green")
}
par(oldpar)
...
[1] 0
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.0 0.00000000 1.00000000
 [2,] 0.1 -0.10563567 0.1 0.08333333 0.99826489
 [3,] 0.2  0.01353860 0.2 0.16666667 0.98617531
 [4,] 0.3  0.21000704 0.3 0.25000000 0.95385361
 [5,] 0.4  0.77435571 0.4 0.33333333 0.89295331
 [6,] 0.5  0.31873792 0.5 0.41666667 0.79830593
 [7,] 0.6  0.50385628 0.6 0.50000000 0.66992188
 [8,] 0.7  0.45257248 0.7 0.58333333 0.51489433
 [9,] 0.8  0.84182740 0.8 0.66666667 0.34847330
[10,] 0.9  0.33869364 0.9 0.75000000 0.19322586
[11,] 1.0  1.25643075 1.0 0.83333333 0.07477612
[12,] 1.1  1.11691797 1.1 0.91666667 0.01212663
[13,] 1.2  0.64749519 1.2 1.00000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2425715     1.2950873
           [,1]
[1,] -0.2425715
[1] "************************************"
[1] 0.1
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.1 0.0909091 0.99774775
 [2,] 0.1 -0.10563567 0.0 0.0000000 1.00000000
 [3,] 0.2  0.01353860 0.1 0.0909091 0.99774775
 [4,] 0.3  0.21000704 0.2 0.1818182 0.98207661
 [5,] 0.4  0.77435571 0.3 0.2727273 0.94036966
 [6,] 0.5  0.31873792 0.4 0.3636364 0.86257264
 [7,] 0.6  0.50385628 0.5 0.4545455 0.74388835
 [8,] 0.7  0.45257248 0.6 0.5454545 0.58788237
 [9,] 0.8  0.84182740 0.7 0.6363636 0.40901258
[10,] 0.9  0.33869364 0.8 0.7272727 0.23297941
[11,] 1.0  1.25643075 0.9 0.8181818 0.09252419
[12,] 1.1  1.11691797 1.0 0.9090909 0.01537977
[13,] 1.2  0.64749519 1.1 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2343405     1.2679403
           [,1]
[1,] -0.1075465
[1] "************************************"
[1] 0.2
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.2 0.2 0.97619149
 [2,] 0.1 -0.10563567 0.1 0.1 0.99700300
 [3,] 0.2  0.01353860 0.0 0.0 1.00000000
 [4,] 0.3  0.21000704 0.1 0.1 0.99700300
 [5,] 0.4  0.77435571 0.2 0.2 0.97619149
 [6,] 0.5  0.31873792 0.3 0.3 0.92116732
 [7,] 0.6  0.50385628 0.4 0.4 0.82002586
 [8,] 0.7  0.45257248 0.5 0.5 0.66992188
 [9,] 0.8  0.84182740 0.6 0.6 0.48189030
[10,] 0.9  0.33869364 0.7 0.7 0.28359339
[11,] 1.0  1.25643075 0.8 0.8 0.11621427
[12,] 1.1  1.11691797 0.9 0.9 0.01990251
[13,] 1.2  0.64749519 1.0 1.0 0.00000000
  (Intercept) d.subset[, 1]
   -0.2250528     1.2382417
           [,1]
[1,] 0.02259554
[1] "************************************"
[1] 0.3
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.3 0.3333333 0.89295331
 [2,] 0.1 -0.10563567 0.2 0.2222222 0.96743815
 [3,] 0.2  0.01353860 0.1 0.1111111 0.99589042
 [4,] 0.3  0.21000704 0.0 0.0000000 1.00000000
 [5,] 0.4  0.77435571 0.1 0.1111111 0.99589042
 [6,] 0.5  0.31873792 0.2 0.2222222 0.96743815
 [7,] 0.6  0.50385628 0.3 0.3333333 0.89295331
 [8,] 0.7  0.45257248 0.4 0.4444444 0.75907091
 [9,] 0.8  0.84182740 0.5 0.5555556 0.56875893
[10,] 0.9  0.33869364 0.6 0.6666667 0.34847330
[11,] 1.0  1.25643075 0.7 0.7777778 0.14844970
[12,] 1.1  1.11691797 0.8 0.8888889 0.02637525
[13,] 1.2  0.64749519 0.9 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2107759     1.1998531
        [,1]
[1,] 0.14918
[1] "************************************"
[1] 0.4
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.4 0.500 0.66992188
 [2,] 0.1 -0.10563567 0.3 0.375 0.84999297
 [3,] 0.2  0.01353860 0.2 0.250 0.95385361
 [4,] 0.3  0.21000704 0.1 0.125 0.99415206
 [5,] 0.4  0.77435571 0.0 0.000 1.00000000
 [6,] 0.5  0.31873792 0.1 0.125 0.99415206
 [7,] 0.6  0.50385628 0.2 0.250 0.95385361
 [8,] 0.7  0.45257248 0.3 0.375 0.84999297
 [9,] 0.8  0.84182740 0.4 0.500 0.66992188
[10,] 0.9  0.33869364 0.5 0.625 0.43184014
[11,] 1.0  1.25643075 0.6 0.750 0.19322586
[12,] 1.1  1.11691797 0.7 0.875 0.03596253
[13,] 1.2  0.64749519 0.8 1.000 0.00000000
  (Intercept) d.subset[, 1]
   -0.1800034     1.1358898
          [,1]
[1,] 0.2743525
[1] "************************************"
...
[1] 1
              d.dist scaled.dist            W
 [1,] 0.4 0.7743557 0.6 1.0000000 2.955864e-46
 [2,] 0.5 0.3187379 0.5 0.8333333 7.477612e-02
 [3,] 0.6 0.5038563 0.4 0.6666667 3.484733e-01
 [4,] 0.7 0.4525725 0.3 0.5000000 6.699219e-01
 [5,] 0.8 0.8418274 0.2 0.3333333 8.929533e-01
 [6,] 0.9 0.3386936 0.1 0.1666667 9.861753e-01
 [7,] 1.0 1.2564308 0.0 0.0000000 1.000000e+00
 [8,] 1.1 1.1169180 0.1 0.1666667 9.861753e-01
 [9,] 1.2 0.6474952 0.2 0.3333333 8.929533e-01
[10,] 1.3 0.9956303 0.3 0.5000000 6.699219e-01
[11,] 1.4 1.2856589 0.4 0.6666667 3.484733e-01
[12,] 1.5 1.4437838 0.5 0.8333333 7.477612e-02
[13,] 1.6 0.7472126 0.6 1.0000000 0.000000e+00
  (Intercept) d.subset[, 1]
  -0.01039449    0.83800014
          [,1]
[1,] 0.8276056
[1] "************************************"
[1] 1.1
              d.dist scaled.dist          W
 [1,] 0.5 0.3187379 0.6 1.0000000 0.00000000
 [2,] 0.6 0.5038563 0.5 0.8333333 0.07477612
 [3,] 0.7 0.4525725 0.4 0.6666667 0.34847330
 [4,] 0.8 0.8418274 0.3 0.5000000 0.66992188
 [5,] 0.9 0.3386936 0.2 0.3333333 0.89295331
 [6,] 1.0 1.2564308 0.1 0.1666667 0.98617531
 [7,] 1.1 1.1169180 0.0 0.0000000 1.00000000
 [8,] 1.2 0.6474952 0.1 0.1666667 0.98617531
 [9,] 1.3 0.9956303 0.2 0.3333333 0.89295331
[10,] 1.4 1.2856589 0.3 0.5000000 0.66992188
[11,] 1.5 1.4437838 0.4 0.6666667 0.34847330
[12,] 1.6 0.7472126 0.5 0.8333333 0.07477612
[13,] 1.7 0.8396295 0.6 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   0.02071804    0.81446704
          [,1]
[1,] 0.9166318
[1] "************************************"
[1] 1.2
              d.dist scaled.dist            W
 [1,] 0.6 0.5038563 0.6 1.0000000 0.000000e+00
 [2,] 0.7 0.4525725 0.5 0.8333333 7.477612e-02
 [3,] 0.8 0.8418274 0.4 0.6666667 3.484733e-01
 [4,] 0.9 0.3386936 0.3 0.5000000 6.699219e-01
 [5,] 1.0 1.2564308 0.2 0.3333333 8.929533e-01
 [6,] 1.1 1.1169180 0.1 0.1666667 9.861753e-01
 [7,] 1.2 0.6474952 0.0 0.0000000 1.000000e+00
 [8,] 1.3 0.9956303 0.1 0.1666667 9.861753e-01
 [9,] 1.4 1.2856589 0.2 0.3333333 8.929533e-01
[10,] 1.5 1.4437838 0.3 0.5000000 6.699219e-01
[11,] 1.6 0.7472126 0.4 0.6666667 3.484733e-01
[12,] 1.7 0.8396295 0.5 0.8333333 7.477612e-02
[13,] 1.8 1.3994966 0.6 1.0000000 9.976041e-46
  (Intercept) d.subset[, 1]
    0.2648616     0.6006984
          [,1]
[1,] 0.9856997
[1] "************************************"
...
[1] 5.7
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 0.9 1.0000000 0.00000000
 [2,] 4.9 -1.3073777 0.8 0.8888889 0.02637525
 [3,] 5.0 -1.1968627 0.7 0.7777778 0.14844970
 [4,] 5.1 -0.4363296 0.6 0.6666667 0.34847330
 [5,] 5.2 -0.4661374 0.5 0.5555556 0.56875893
 [6,] 5.3 -0.6437850 0.4 0.4444444 0.75907091
 [7,] 5.4 -0.4652568 0.3 0.3333333 0.89295331
 [8,] 5.5 -0.5301789 0.2 0.2222222 0.96743815
 [9,] 5.6 -0.5135852 0.1 0.1111111 0.99589042
[10,] 5.7 -0.8137505 0.0 0.0000000 1.00000000
[11,] 5.8 -0.8017276 0.1 0.1111111 0.99589042
[12,] 5.9 -0.5136304 0.2 0.2222222 0.96743815
[13,] 6.0  0.2006632 0.3 0.3333333 0.89295331
  (Intercept) d.subset[, 1]
   -2.6373100     0.3778869
           [,1]
[1,] -0.4833547
[1] "************************************"
[1] 5.8
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.0 1.0 0.00000000
 [2,] 4.9 -1.3073777 0.9 0.9 0.01990251
 [3,] 5.0 -1.1968627 0.8 0.8 0.11621427
 [4,] 5.1 -0.4363296 0.7 0.7 0.28359339
 [5,] 5.2 -0.4661374 0.6 0.6 0.48189030
 [6,] 5.3 -0.6437850 0.5 0.5 0.66992188
 [7,] 5.4 -0.4652568 0.4 0.4 0.82002586
 [8,] 5.5 -0.5301789 0.3 0.3 0.92116732
 [9,] 5.6 -0.5135852 0.2 0.2 0.97619149
[10,] 5.7 -0.8137505 0.1 0.1 0.99700300
[11,] 5.8 -0.8017276 0.0 0.0 1.00000000
[12,] 5.9 -0.5136304 0.1 0.1 0.99700300
[13,] 6.0  0.2006632 0.2 0.2 0.97619149
  (Intercept) d.subset[, 1]
   -2.8655861     0.4188919
           [,1]
[1,] -0.4360132
[1] "************************************"
[1] 5.9
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.1 1.00000000 0.00000000
 [2,] 4.9 -1.3073777 1.0 0.90909091 0.01537977
 [3,] 5.0 -1.1968627 0.9 0.81818182 0.09252419
 [4,] 5.1 -0.4363296 0.8 0.72727273 0.23297941
 [5,] 5.2 -0.4661374 0.7 0.63636364 0.40901258
 [6,] 5.3 -0.6437850 0.6 0.54545455 0.58788237
 [7,] 5.4 -0.4652568 0.5 0.45454545 0.74388835
 [8,] 5.5 -0.5301789 0.4 0.36363636 0.86257264
 [9,] 5.6 -0.5135852 0.3 0.27272727 0.94036966
[10,] 5.7 -0.8137505 0.2 0.18181818 0.98207661
[11,] 5.8 -0.8017276 0.1 0.09090909 0.99774775
[12,] 5.9 -0.5136304 0.0 0.00000000 1.00000000
[13,] 6.0  0.2006632 0.1 0.09090909 0.99774775
  (Intercept) d.subset[, 1]
   -3.0145872     0.4450401
           [,1]
[1,] -0.3888509
[1] "************************************"
[1] 6
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.2 1.00000000 0.00000000
 [2,] 4.9 -1.3073777 1.1 0.91666667 0.01212663
 [3,] 5.0 -1.1968627 1.0 0.83333333 0.07477612
 [4,] 5.1 -0.4363296 0.9 0.75000000 0.19322586
 [5,] 5.2 -0.4661374 0.8 0.66666667 0.34847330
 [6,] 5.3 -0.6437850 0.7 0.58333333 0.51489433
 [7,] 5.4 -0.4652568 0.6 0.50000000 0.66992188
 [8,] 5.5 -0.5301789 0.5 0.41666667 0.79830593
 [9,] 5.6 -0.5135852 0.4 0.33333333 0.89295331
[10,] 5.7 -0.8137505 0.3 0.25000000 0.95385361
[11,] 5.8 -0.8017276 0.2 0.16666667 0.98617531
[12,] 5.9 -0.5136304 0.1 0.08333333 0.99826489
[13,] 6.0  0.2006632 0.0 0.00000000 1.00000000
  (Intercept) d.subset[, 1]
   -3.1537386     0.4692872
           [,1]
[1,] -0.3380152
[1] "************************************"
...
Figure 36. LOWESS on noisy sine
Figure 37. LOESS on noisy Runge
In Figures 36 and 37, the blue line is the original function, the black line is the simplistic version, and the green line is the R version. The latter two are very close in value.
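The W column in the listings above is produced by the tricube kernel applied to the scaled distances, W(u) = (1 - |u|^3)^3 for |u| < 1 (the printed values match this form). A minimal sketch of that computation:

```r
# Tricube kernel: W(u) = (1 - |u|^3)^3 for |u| < 1, and 0 otherwise
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Scaled distances from the first rows of the last listing above
scaled.dist <- c(1.00000000, 0.91666667, 0.83333333, 0.75000000)
round(tricube(scaled.dist), 8)
# 0.00000000 0.01212663 0.07477612 0.19322586
```

Points at the edge of the window (scaled distance 1) get zero weight, while the query point itself (scaled distance 0) gets full weight.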
We will use loess, which is a newer version of the function.
oldpar <- par(mfrow = c(3, 1))
for (s in c(0.25, 0.5, 0.75)) {
  # Plot the noisy points
  plot(x.5.5by.1, y.noise, col="red", pch=20, main = paste("Span = ", s))
  # Plot the Runge curve
  curve(runge, -5, 5, add=T, col="blue")
  lines(x.5.5by.1, loess(y.noise ~ x.5.5by.1, span = s)$fitted, col="green")
}
par(oldpar)
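The loop above plots the $fitted component of the loess object, but the fit can also be evaluated at new x values with predict(). A self-contained sketch (it assumes, for illustration, that the document's runge is the Runge function 1/(1 + x^2) on [-5, 5], and builds its own grid in place of x.5.5by.1):

```r
runge <- function(x) 1 / (1 + x^2)   # assumed form of the document's runge

set.seed(1)
x <- seq(-5, 5, by = 0.1)                   # grid like x.5.5by.1
y <- runge(x) + rnorm(length(x), sd = 0.1)  # noisy observations

fit <- loess(y ~ x, span = 0.5)
# Evaluate the local fit at points that need not lie on the original grid
predict(fit, newdata = data.frame(x = c(-2.5, 0.05, 2.5)))
```

This is convenient when the smoothed curve is needed at a finer (or different) set of points than the observations.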
3.13 SuperSmoother
Super smoother [Friedman, J. H. (1984) "A variable span scatterplot smoother", Laboratory for Computational Statistics, Stanford University Technical Report No. 5.] uses 3 different intervals (or spans), called tweeter, midrange, and woofer, on the data and determines the best value from these spans. The user can fix the spans or allow the program to determine the 'best' spans based on cross-validation procedures.
The R documentation states:
"supsmu is a running lines smoother which chooses between three spans for the lines. The running lines smoothers are symmetric, with k/2 data points each side of the predicted point, and values of k as 0.5 * n, 0.2 * n and 0.05 * n, where n is the number of data points. If span is specified, a single smoother with span span * n is used. The best of the three smoothers is chosen by cross-validation for each prediction. The best spans are then smoothed by a running lines smoother and the final prediction chosen by linear interpolation."
(In the above description, the 0.05 * n is the tweeter, the 0.2 * n is the midrange, and the 0.5 * n is the woofer.)
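For concreteness, the three default window sizes for a data set of a given size can be computed directly (a trivial sketch; n = 200 is an arbitrary choice):

```r
n <- 200
spans <- c(tweeter = 0.05, midrange = 0.2, woofer = 0.5)
k <- spans * n   # number of points in each running-lines window
k                # tweeter = 10, midrange = 40, woofer = 100 points
```

Each smoother then uses roughly k/2 points on each side of the point being predicted.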
Friedman also notes that we may
a) know that the underlying curve is smooth, or
b) find a smooth curve appealing.
To achieve this, he introduced a bass control (ranging from 0 to 10) which, for values above 0, causes the larger spans to be used. This control is not needed if the span is selected.
Super smoother - variable span, fixed bass
s <- c(0.05, 0.2, 0.5)
lab <- c("- tweeter", "- midrange", "- woofer")
oldpar <- par(mfrow = c(3, 1))
for (i in 1:3) {
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main=paste("Super smoother - span =", s[i], lab[i]))
  ss <- supsmu(x.5.5by.1, y.noise, span = s[i])
  curve(runge, -5, 5, add=T, col="blue")
  lines(ss$x, ss$y, col="green")
}
par(oldpar)
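The loops above plot lines(ss$x, ss$y) because supsmu returns a list whose x component comes back sorted, with the smoothed values in y. A quick self-contained check on synthetic data:

```r
set.seed(2)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

ss <- supsmu(x, y, span = 0.3)
names(ss)                    # includes "x" and "y"
length(ss$x) == length(ss$y)
is.unsorted(ss$x)            # FALSE: sorted, so it plots cleanly with lines()
```

This is why no explicit ordering step is needed before calling lines().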
Figure 38. Supersmoother on Runge - variable span
Figure 39. Supersmoother on Runge - cv span
Super smoother - cv
b <- c(2, 6, 10)
oldpar <- par(mfrow = c(3, 1))
for (i in 1:3) {
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main=paste("Super smoother - span = cv, bass =", b[i]))
  ss <- supsmu(x.5.5by.1, y.noise, span = "cv", bass = b[i])
  curve(runge, -5, 5, add=T, col="blue")
  lines(ss$x, ss$y, col="green")
}
par(oldpar)
Friedman used the following example to illustrate his Super smoother:
n <- 200
x.f <- runif(n)
y.f <- sin(2*pi*(1 - x.f)^2) + x.f*rnorm(n)
plot(x.f, y.f)
Figure 40 uses the tweeter, midrange, and woofer fixed-span cases to show the effect of the change in span values.
s <- c(0.05, 0.2, 0.5)
lab <- c("tweeter", "midrange", "woofer")
plot(x.f, y.f, col="red", pch=20)
ss.t <- supsmu(x.f, y.f, span = s[1])
ss.m <- supsmu(x.f, y.f, span = s[2])
ss.w <- supsmu(x.f, y.f, span = s[3])
lines(ss.t$x, ss.t$y, col="black")
lines(ss.m$x, ss.m$y, col="green")
lines(ss.w$x, ss.w$y, col="blue")
legend(0, 2, lab, col = c("black", "green", "blue"),
       lty = c(1, 1, 1), pch = c(-1, -1, -1))
Figure 40. Variable span
Figure 41. Effect of bass in cv span case
Figure 41 shows the effect of the various values of bass.
b <- 0:10
plot(x.f, y.f, col="red", pch=20)
for (i in b) {
  ss.cv <- supsmu(x.f, y.f, span = "cv", bass = i)
  lines(ss.cv$x, ss.cv$y, col=i)
}
legend(0, 2, 0:10, col = 0:10, lty = c(1, 1, 1), pch = c(-1, -1, -1))