Data Mining 2015
SECTION 3  Modeling

When dealing with large amounts of data it may be possible to find a model for the data that will reduce it to a much more compact form. For example, if the (x, y) points lie on a straight line, we would only need the slope, intercept, and x values to reproduce the data. If we only wish to be able to determine a y value for any x value then we need only have the slope and intercept. In actual fact, the points may not lie exactly on the line but they may be close enough that a straight line is a close representation of the actual data and, as such, the slope and intercept will be adequate to represent the data. In reality our situation is generally more complicated than this, but we may still be able to find a model that will reduce the amount of information we need to retain.

In other cases, when looking at a large data set it is often difficult to assess its behaviour. For example, if the data is the minute by minute variation of the stock market it is hard to see any pattern to the data. On the other hand, the day to day observations may show some trends, while year to year observations will probably reveal even more. In order to understand our data we may find it helpful to ignore the minute detail and look at these trends.

One possible way of doing this might be to determine an interpolating polynomial for the data.
3.1 Newton's interpolatory divided-difference formula

x     f(x)     First Divided Difference                  Second Divided Difference
x0    f(x0)
               f[x0,x1] = (f(x1) - f(x0))/(x1 - x0)
x1    f(x1)                                              f[x0,x1,x2] = (f[x1,x2] - f[x0,x1])/(x2 - x0)
               f[x1,x2] = (f(x2) - f(x1))/(x2 - x1)
x2    f(x2)                                              f[x1,x2,x3] = (f[x2,x3] - f[x1,x2])/(x3 - x1)
               f[x2,x3] = (f(x3) - f(x2))/(x3 - x2)
x3    f(x3)
Pn(x) = f(x0) + (x - x0) f[x0,x1] + (x - x0)(x - x1) f[x0,x1,x2]
        + ... + (x - x0)(x - x1)...(x - x_{n-1}) f[x0,x1,...,xn]

      = f(x0) + Σ_{k=1}^{n} (x - x0)...(x - x_{k-1}) f[x0,...,xk]
Modeling 72 © Mills2015
Set up the folders for the code and data and read in the required files:
drive <- "D:"
code.dir <- paste(drive, "DATA/Data Mining R-Code", sep="/")
data.dir <- paste(drive, "DATA/Data Mining Data", sep="/")
source(paste(code.dir, "Conv_run.r", sep="/"))
source(paste(code.dir, "Newton_Interp.r", sep="/"))
The file Newton_Interp.r has the following:

#-----------------------------------------------------------------
# Polynomial interpolation (Newton's Divided Difference) given the
# x and y values.
# The y values are replaced by the differences (locally).
#-----------------------------------------------------------------
InterpNewton <- function(knot.x, knot.y) {
  n <- length(knot.x)
  for (k in 1:(n-1)) {
    knot.y[(k+1):n] <- (knot.y[(k+1):n] - knot.y[k])/(knot.x[(k+1):n] - knot.x[k])
  }
  knot.y
}
Horner's rule (nested multiplication) gives an efficient way to evaluate polynomials, i.e.

a0 + a1 x + a2 x^2 + ... + a_{n-1} x^{n-1} + an x^n
   = a0 + x(a1 + x(a2 + ... + x(a_{n-1} + x an)...))
#-----------------------------------------------------------------
# Use Horner's rule z*(c[n] + z*(c[n-1] + z*(...))) to evaluate a
# polynomial given the coefficients and knots and point(s)
#-----------------------------------------------------------------
HornerN <- function(coef, knot.x, z) {
  n <- length(knot.x)
  polyv <- coef[n]*rep(1, length(z))
  for (k in (n-1):1) {
    polyv <- (z - knot.x[k])*polyv + coef[k]
  }
  polyv
}
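As a quick check of the pair of routines (the knot values below are illustrative, not from the text), noise-free data from a cubic should be reproduced exactly, since four knots determine a cubic:

```r
# InterpNewton and HornerN repeated from above so this chunk runs on its own.
InterpNewton <- function(knot.x, knot.y) {
  n <- length(knot.x)
  for (k in 1:(n-1)) {
    knot.y[(k+1):n] <- (knot.y[(k+1):n] - knot.y[k])/(knot.x[(k+1):n] - knot.x[k])
  }
  knot.y
}
HornerN <- function(coef, knot.x, z) {
  n <- length(knot.x)
  polyv <- coef[n]*rep(1, length(z))
  for (k in (n-1):1) {
    polyv <- (z - knot.x[k])*polyv + coef[k]
  }
  polyv
}
knot.x <- c(0, 1, 2, 4)
knot.y <- knot.x^3 - 2*knot.x + 1      # noise-free data from a cubic
coef <- InterpNewton(knot.x, knot.y)   # divided-difference coefficients
HornerN(coef, knot.x, c(0.5, 3))       # 0.125 22, i.e. z^3 - 2z + 1 exactly
```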
To illustrate, consider the noisy sinusoid (Figure 1).

set.seed(1234, "default")        # Seed for random numbers
# Compute sine at 0, 0.1, 0.2, ... for plotting a noisy sine
x.rough <- seq(0, 6, by=0.1)
numb.rough <- length(x.rough)
y.rough <- sin(x.rough) + runif(numb.rough) - 0.5   # Add uniform noise
# Points at which to interpolate
x.in <- seq(0, 6, by=0.01)
The following allows us to reproduce results.
noisy.sine <- list(x.rough, y.rough, x.in)
save(noisy.sine, file = paste(data.dir, "noisySine.Rdata", sep="/"))

(the above must be commented out for any subsequent runs to allow the previous results to be retained) and we recover the data with

load(paste(data.dir, "noisySine.Rdata", sep="/"))
x.rough <- noisy.sine[[1]]
y.rough <- noisy.sine[[2]]
x.in <- noisy.sine[[3]]
plot(x.rough, y.rough, col="red", pch = 16)   # Plot the points
curve(sin, 0, 6, add=T, col="blue")           # Add in the sine
and do an interpolation (Figure 2).

# Use Newton Interpolation to get the coefficients
# and Horner's method to compute the polynomial.
y.rough.in <- HornerN(InterpNewton(x.rough, y.rough), x.rough, x.in)
lines(x.in, y.rough.in, col="green")
Figure 1. Noisy sinusoid Figure 2. Interpolation of noisy sine
While the interpolant passes through every data point (and hence represents the given data accurately), it has no predictive power. We need our estimate to give good results both for the data used in creating the model and for data from the same population not used in creating the model. Figure 2 is an extreme example of overfitting. The reason that interpolation is bad for data subject to error is seen in the simple case of data that forms a straight line, except that one point is a bit in error.
We can do an interpolation on a straight line (Figure 3)

x <- -5:5
y <- -5:5
x.by.1 <- seq(-6, 6, by=.1)
y.by.1 <- seq(-5, 5, by=.1)
y.in <- HornerN(InterpNewton(x, y), x, x.by.1)
plot(x, y, col="red", xlim=c(-6, 6), ylim=c(-6, 7))
lines(x.by.1, y.in, col="green")
lines(c(-5, 5), c(-5, 5), col="blue")
or with one point in error (Figure 4).

#-----------------------------------------------------------------
# Look at what happens with a slight deviation
#-----------------------------------------------------------------
x.blip <- -5:5
y.blip <- -5:5
x.blip.by.1 <- seq(-6, 6, by=.1)
y.blip.by.1 <- seq(-5, 5, by=.1)
y.blip[8] <- 2.5   # Move 1 point
y.blip
 [1] -5.0 -4.0 -3.0 -2.0 -1.0  0.0  1.0  2.5  3.0  4.0  5.0
y.blip.in <- HornerN(InterpNewton(x.blip, y.blip), x.blip, x.blip.by.1)
plot(x.blip, y.blip, col="red", pch=16, xlim=c(-6, 6), ylim=c(-6, 7))
lines(x.blip.by.1, y.blip.in, col="green")
lines(c(-5, 5), c(-5, 5), col="blue")
Figure 3. Interpolation on straight line    Figure 4. Interpolation on straight line with bump
The interpolation of the straight line looks good, as it should because, although we have it passing through 11 points, the algorithm gives a linear fit (construct a difference table). In the case of a slight deviation, we see that the interpolating polynomial has large oscillations near the ends - much larger than the original deviation from the line. We have large errors in predictions near the ends. This results from using a high degree polynomial. A better method is to use lower degree polynomials to do the fitting.
3.2 Cubic Splines
The cubic spline is a compromise. Obviously, if we just join together a bunch of cubic polynomials, each passing through 4 points, it would be little better than a piecewise linear fit. Instead we try to fit the cubics through 2 points and make use of the extra constants to make a smoother fit.
Consider a set of points on the plane

(x_0, f(x_0)), (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_n, f(x_n)).

We fit a cubic

S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3,   k = 0, 1, ..., n-1

between (x_k, f(x_k)) and (x_{k+1}, f(x_{k+1})), so that

S(x) = S_k(x)      for x_k <= x <= x_{k+1}
     = S_{k+1}(x)  for x_{k+1} <= x <= x_{k+2}.

At x = x_k, S_k(x_k) = f(x_k) = s_{k,0}. At x = x_{k+1}, S_k(x_{k+1}) = S_{k+1}(x_{k+1}), k = 0, 1, ..., n-1.

To 'use' the other constants, we require a 'smooth' transition from one segment to another - we specify that the slope and curvature (as well as the value) must match at the nodes or knots. We know that the slope at any node is
S_k'(x) = s_{k,1} + 2 s_{k,2}(x - x_k) + 3 s_{k,3}(x - x_k)^2

so

S_k'(x_k) = s_{k,1},   k = 0, 1, ..., n-1

and for the slopes to match we require

S_{k+1}'(x_{k+1}) = S_k'(x_{k+1})

i.e. (writing h_k = x_{k+1} - x_k)

s_{k+1,1} = s_{k,1} + 2 s_{k,2} h_k + 3 s_{k,3} h_k^2    (2)

The curvature at any node is

S_k''(x) = 2 s_{k,2} + 6 s_{k,3}(x - x_k)

so

S_k''(x_k) = 2 s_{k,2}   or   s_{k,2} = (1/2) S_k''(x_k),   k = 0, 1, ..., n-1

For the curvature to match at the node

S_{k+1}''(x_{k+1}) = S_k''(x_{k+1})

so

2 s_{k+1,2} = 2 s_{k,2} + 6 s_{k,3} h_k.
We need more information. This comes from considering the behaviour of S(x) at the end points. One possibility is

S''(x_0) = 0 = S''(x_n)

i.e. no curvature at the ends.

A second possibility is to match the slope of the spline with that of the function (perhaps estimated graphically) so that

S'(x_0) = f'(x_0) = s_{0,1}
S'(x_n) = f'(x_n)

It is not as obvious that this allows us to solve for the s_{k,j} but it does. Each time we add a node we get one more unknown and an additional equation. Hence we always have enough information to obtain a solution (we have no assurance that we can solve the general system at this point).
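The conditions above can be assembled into a small linear system and solved directly. The following is a sketch of the standard natural-spline construction (the helper and its names are illustrative, not code from the text): it solves the tridiagonal system for the second derivatives m_k = S''(x_k), with m_0 = m_n = 0, and then recovers the coefficients s_{k,j} of each piece.

```r
# Sketch (standard natural cubic spline construction; assumes >= 3 knots).
natural.spline <- function(x, y) {
  n <- length(x) - 1               # number of cubic pieces
  h <- diff(x)
  # Tridiagonal system for the interior m_1, ..., m_{n-1}; m_0 = m_n = 0
  A <- matrix(0, n-1, n-1)
  rhs <- numeric(n-1)
  for (k in 1:(n-1)) {
    A[k, k] <- 2*(h[k] + h[k+1])
    if (k > 1)   A[k, k-1] <- h[k]
    if (k < n-1) A[k, k+1] <- h[k+1]
    rhs[k] <- 6*((y[k+2]-y[k+1])/h[k+1] - (y[k+1]-y[k])/h[k])
  }
  m <- c(0, solve(A, rhs), 0)
  # Coefficients of S_k(x) = s0 + s1 (x-x_k) + s2 (x-x_k)^2 + s3 (x-x_k)^3
  k <- 1:n
  cbind(s0 = y[k],
        s1 = (y[k+1]-y[k])/h - h*(2*m[k]+m[k+1])/6,
        s2 = m[k]/2,
        s3 = (m[k+1]-m[k])/(6*h))
}
```

Evaluating each piece at its right-hand knot reproduces the data there, and the slopes and curvatures match where the pieces join.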
Cubic Spline Interpolant: Suppose that {(x_k, y_k)}_{k=0}^{n} are n+1 points, where a = x_0 < x_1 < ... < x_n = b. The function S(x) is called a cubic spline if there exist n cubic polynomials S_k(x) with coefficients s_{k,0}, s_{k,1}, s_{k,2}, and s_{k,3} that satisfy the properties:

1. S(x) = S_k(x) = s_{k,0} + s_{k,1}(x - x_k) + s_{k,2}(x - x_k)^2 + s_{k,3}(x - x_k)^3 for x in [x_k, x_{k+1}] and k = 0, 1, ..., n-1;
2. S(x_k) = y_k, for each k = 0, 1, ..., n;
3. S_k(x_{k+1}) = S_{k+1}(x_{k+1}), for each k = 0, 1, ..., n-2;
4. S_k'(x_{k+1}) = S_{k+1}'(x_{k+1}), for each k = 0, 1, ..., n-2;
5. S_k''(x_{k+1}) = S_{k+1}''(x_{k+1}), for each k = 0, 1, ..., n-2.

From one point of view, we can consider free cubic splines as the best interpolant for curve fitting. If we let S be the natural cubic spline function that interpolates our function f at x_0 < x_1 < ... < x_n, and let f''(x) be continuous in the interval [a, b] that contains the x_0, x_1, ..., x_n, then

∫_a^b [S''(x)]^2 dx <= ∫_a^b [f''(x)]^2 dx

In other words, the average value of the curvature of S is never larger than the average value of the curvature of any function f passing through the same nodes.
The R spline routine returns the interpolated values at the number of positions indicated by n = #.

#--------------------
# Do a spline
#--------------------
plot(x.blip, y.blip, col="red", pch=16, xlim=c(-6, 6), ylim=c(-6, 7))
lines(spline(x.blip, y.blip, n = 201), col = "black")
Figure 5. Spline on straight line with bump
This gives a much improved approximation. The error dies out quickly.
The classical example for illustrating the behaviour of interpolation is Runge's function 1/(1 + x^2).
#-----------------------------------------------------------------
# Use Runge's function as an example
#-----------------------------------------------------------------
runge <- function(x) {
  1/(1 + x^2)
}
# Plot it at integers from -5 to 5
x.5.5 <- (-5:5)
y.5.5 <- sapply(x.5.5, runge)
x.5.5by.1 <- seq(-5, 5, by=.1)
# Get the interpolated values at -5, -4.9, -4.8, ...
y.in <- HornerN(InterpNewton(x.5.5, y.5.5), x.5.5, x.5.5by.1)
plot(x.5.5, y.5.5, col=3, ylim=c(-0.5, 2))
lines(x.5.5by.1, y.in, col="red")

#-----------------------------------------------------------------
# Do a spline
#-----------------------------------------------------------------
# Compute and plot spline (spline returns the interpolated values).
lines(spline(x.5.5, y.5.5, n = 201), col = "blue")
Figure 6. The red line is the interpolation; the blue line is the spline.    Figure 7. Computing a spline on a noisy Runge function.
#-----------------------------------------------------------------
# Put some noise on it and compute spline
#-----------------------------------------------------------------
y.noise <- sapply(x.5.5by.1, runge) + rnorm(101, 0, .1)
plot(x.5.5by.1, y.noise, col="green", pch=20, cex=1.3)
lines(spline(x.5.5by.1, y.noise, n = 201), col = "red")
curve(runge, -5, 5, add=T, col="blue")
The problem with splines on noisy data is “where do you put the knots?”
A better approach with noisy data may be to smooth it. A simple way of smoothing involves the use of a running mean, and a way of implementing the running mean is by a convolution. This is a polynomial multiplication of the form

w(k) = Σ_j u(j) v(k + 1 - j).

If both u and v have the same length n, then

w(1) = u(1)v(1)
w(2) = u(1)v(2) + u(2)v(1)
w(3) = u(1)v(3) + u(2)v(2) + u(3)v(1)
  ...
w(n-1) = u(1)v(n-1) + u(2)v(n-2) + ... + u(n-1)v(1)
w(n) = u(1)v(n) + u(2)v(n-1) + ... + u(n)v(1)
w(n+1) = u(2)v(n) + u(3)v(n-1) + ... + u(n)v(2)
  ...
w(2n-2) = u(n-1)v(n) + u(n)v(n-1)
w(2n-1) = u(n)v(n)
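These sums are exactly polynomial multiplication, which can be checked against base R (an aside, not part of the text's development): the help for convolve notes that convolve(x, rev(y), type = "open") performs polynomial multiplication.

```r
# Check w(k) = sum_j u(j) v(k + 1 - j) against base R's convolve().
u <- c(1, 2, 3)
v <- c(4, 5, 6)
# Direct evaluation of the sums above
w <- sapply(1:(length(u) + length(v) - 1), function(k) {
  j <- max(1, k + 1 - length(v)):min(k, length(u))
  sum(u[j] * v[k + 1 - j])
})
w                                            # 4 13 28 27 18
round(convolve(u, rev(v), type = "open"))    # same values
```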
#-----------------------------------------------------------------
# Illustrate convolution of vectors
#-----------------------------------------------------------------
bandwidth <- 5
(x <- 1:30)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
(y <- rep(1, bandwidth))
[1] 1 1 1 1 1

Pad x with leading and trailing zeros.

(x.0 <- c(rep(0, bandwidth-1), x, rep(0, bandwidth-1)))
 [1]  0  0  0  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
[26] 22 23 24 25 26 27 28 29 30  0  0  0  0
(x.1 <- c(rep(0, bandwidth-1), rep(1, length(x)), rep(0, bandwidth-1)))
 [1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
z <- rep(0, length(x))
for (i in 1:(length(x) + bandwidth - 1)) {
  t <- y*x.0[(1:bandwidth) + i - 1]
  n <- sum(y*x.1[(1:bandwidth) + i - 1])
  z[i] <- sum(t)/n
  cat("t = ", t, " n = ", n, " z = ", z[i], "\n")
}
t =  0 0 0 0 1    n =  1   z =  1
t =  0 0 0 1 2    n =  2   z =  1.5
t =  0 0 1 2 3    n =  3   z =  2
t =  0 1 2 3 4    n =  4   z =  2.5
t =  1 2 3 4 5    n =  5   z =  3
t =  2 3 4 5 6    n =  5   z =  4
t =  3 4 5 6 7    n =  5   z =  5
t =  4 5 6 7 8    n =  5   z =  6
t =  5 6 7 8 9    n =  5   z =  7
t =  6 7 8 9 10   n =  5   z =  8
t =  7 8 9 10 11  n =  5   z =  9
t =  8 9 10 11 12  n =  5  z =  10
t =  9 10 11 12 13  n =  5  z =  11
t =  10 11 12 13 14  n =  5  z =  12
t =  11 12 13 14 15  n =  5  z =  13
t =  12 13 14 15 16  n =  5  z =  14
t =  13 14 15 16 17  n =  5  z =  15
t =  14 15 16 17 18  n =  5  z =  16
t =  15 16 17 18 19  n =  5  z =  17
t =  16 17 18 19 20  n =  5  z =  18
t =  17 18 19 20 21  n =  5  z =  19
t =  18 19 20 21 22  n =  5  z =  20
t =  19 20 21 22 23  n =  5  z =  21
t =  20 21 22 23 24  n =  5  z =  22
t =  21 22 23 24 25  n =  5  z =  23
t =  22 23 24 25 26  n =  5  z =  24
t =  23 24 25 26 27  n =  5  z =  25
t =  24 25 26 27 28  n =  5  z =  26
t =  25 26 27 28 29  n =  5  z =  27
t =  26 27 28 29 30  n =  5  z =  28
t =  27 28 29 30 0   n =  4  z =  28.5
t =  28 29 30 0 0    n =  3  z =  29
t =  29 30 0 0 0     n =  2  z =  29.5
t =  30 0 0 0 0      n =  1  z =  30
Note that a vector of ones was used to give each point equal value. It would be easy to modify this to do a weighted convolution.
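For example, a weighted version (an illustrative sketch, not code from the text; f.wconv and the 1-2-3-2-1 kernel are made up here) simply replaces the vector of ones with a set of weights and normalizes by the weights that actually overlap the data:

```r
# Sketch of a weighted running smooth: same padding scheme as above,
# but with a 1-2-3-2-1 kernel so the centre point counts most.
f.wconv <- function(x, w) {
  bw <- length(w)
  x.0 <- c(rep(0, bw-1), x, rep(0, bw-1))
  x.1 <- c(rep(0, bw-1), rep(1, length(x)), rep(0, bw-1))
  z <- rep(0, length(x) + bw - 1)
  for (i in 1:(length(x) + bw - 1)) {
    t <- w * x.0[(1:bw) + i - 1]
    n <- sum(w * x.1[(1:bw) + i - 1])
    z[i] <- sum(t)/n
  }
  z
}
f.wconv(1:10, c(1, 2, 3, 2, 1))   # on linear data the interior values are unchanged
```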
Now create functions from the above to do the smoothing.
#-----------------------------------------------------------------
# Compute a running mean with a specified bandwidth
#-----------------------------------------------------------------
f.conv <- function(x, bandwidth) {
  y <- rep(1, bandwidth)
  # pad x with leading and trailing zeros
  x.0 <- c(rep(0, bandwidth-1), x, rep(0, bandwidth-1))
  x.1 <- c(rep(0, bandwidth-1), rep(1, length(x)), rep(0, bandwidth-1))
  z <- rep(0, length(x))
  for (i in 1:(length(x) + bandwidth - 1)) {
    t <- y*x.0[(1:bandwidth) + i - 1]
    n <- sum(y*x.1[(1:bandwidth) + i - 1])
    z[i] <- sum(t)/n
  }
  return(z)
}
f.run.mean <- function(x, bandwidth) {
  bandwidth <- floor(bandwidth/2)*2 + 1   # Use odd width
  f.conv(x, bandwidth)[(bandwidth/2 + 1):((bandwidth/2) + length(x))]
}

x <- 1:30
f.conv(x, 5)
 [1]  1.0  1.5  2.0  2.5  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0
[16] 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0
[31] 28.5 29.0 29.5 30.0
f.run.mean(x, 5)
 [1]  2.0  2.5  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0 15.0
[16] 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 28.5 29.0
Now try it on the noisy sinusoid from earlier on.
The expression oldpar <- par(mfrow = c(3,1)) saves the current display parameters and sets the new ones to show the plots in a 3 row by 1 column array. The par(oldpar) resets the display to the original. main = "5 point running mean" puts a title on the plot.
oldpar <- par(mfrow = c(3, 1))   # Split the plot into a stack of 3 plots
for (i in c(5, 15, 25)) {
  # Plot the points
  plot(x.rough, y.rough, col="red", pch=20,
       main = paste(i, "point running mean"))
  curve(sin, 0, 6, add=T, col="blue")   # Add in the sine
  YS <- f.run.mean(y.rough, i)
  lines(x.rough, YS, col="green")
}
par(oldpar)
Try the smooth also on Runge’s function.
oldpar <- par(mfrow = c(3, 1))   # Split the plot into a stack of 3 plots
for (i in c(5, 15, 25)) {
  # Plot the points
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main = paste(i, "point running mean"))
  curve(runge, -5, 5, add=T, col="blue")   # Add in the curve
  YS <- f.run.mean(y.noise, i)
  lines(x.5.5by.1, YS, col="green")
}
par(oldpar)
Figure 8. Smooth on noisy sine Figure 9. Smooth on noisy Runge
We see that the 5 point smooth follows well but is rough; the 15 point smooth follows well and is less rough; the 25 point smooth has flattened out much of the true structure.
There are better ways of smoothing. A couple of well known ones are LOcally Weighted Scatter plot Smoothing (LOWESS, also called LOESS) by Cleveland and Supersmoother by Friedman. They require concepts that we have not yet discussed, so they will be looked at later.
3.3 Regression

In many cases of analyzing data we want to predict the behaviour of 'future' cases based on the behaviour of the 'current' cases. In mathematics, we would typically fit a function that passes through the data points but, in general, data will have error or "noise" associated with it. Because the data points may be noisy, there is no reason to assume that the function should pass through all the data points and so we can try to get a function that "best" fits the data.
As in many situations with which we will deal, the concept of "best" will be taken as the smallest sum of squares of the deviations or errors (the difference between the actual and predicted results); this is called ordinary least squares (OLS).
Suppose that we have a set of readings of the form

x1  x2  x3  ...  xn
y1  y2  y3  ...  yn

and try to fit a straight line through these points to model Y in terms of X (i.e. we consider the case of approximation using a straight line).

We define the following variables

X = | 1  x1 |        Y = | y1 |
    | 1  x2 |            | y2 |
    | 1  x3 |            | y3 |
    | ...   |            | ...|
    | 1  xn |            | yn |

The statistical model is

Y = X | β0 | + ε = Xβ + ε
      | β1 |

where ε is assumed to be an error vector of independent random variables with E(ε) = 0 and common variance-covariance matrix Iσ^2. We wish to estimate the β values by using least squares. The estimated linear relationship is written as

ŷ_i = b0 + b1 x_i.
We obtain the estimator of β by minimizing the sum of squares of "error" given by

Q = Σ_{i=1}^{n} (y_i - ŷ_i)^2 = Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2

so we want

∂Q/∂b0 = ∂/∂b0 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2
       = Σ_{i=1}^{n} ∂/∂b0 (y_i - (b0 + b1 x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))(-1) = 0

and

∂Q/∂b1 = ∂/∂b1 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))^2
       = Σ_{i=1}^{n} ∂/∂b1 (y_i - (b0 + b1 x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i))(-x_i) = 0.
These equations are called the normal equations and may be written as

Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} b0 + Σ_{i=1}^{n} b1 x_i = n b0 + b1 Σ_{i=1}^{n} x_i

Σ_{i=1}^{n} x_i y_i = Σ_{i=1}^{n} b0 x_i + Σ_{i=1}^{n} b1 x_i^2 = b0 Σ_{i=1}^{n} x_i + b1 Σ_{i=1}^{n} x_i^2.
Solving this system, we get
b0 = ȳ - b1 x̄

and

b1 = Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^{n} (x_i - x̄)^2

where

x̄ = (Σ_{i=1}^{n} x_i)/n ;   ȳ = (Σ_{i=1}^{n} y_i)/n
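These closed-form estimates can be checked directly against R's lm() (the small data set here is made up for illustration; it is not the noisy sine used elsewhere in the text):

```r
# Compute the OLS slope and intercept from the closed-form formulas
# and compare with lm() on illustrative data.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b1 <- sum((x - mean(x))*(y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1*mean(x)
c(b0, b1)
coef(lm(y ~ x))   # same intercept and slope
```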
If there is curvature we may wish to use polynomials of higher degree (say p) to model y_i. In general
Q = Σ_{i=1}^{n} (y_i - ŷ_i)^2 = Σ_{i=1}^{n} (y_i - P_p(x_i))^2
  = Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_p x_i^p))^2

so

∂Q/∂b0 = ∂/∂b0 Σ_{i=1}^{n} (y_i - P_p(x_i))^2 = Σ_{i=1}^{n} ∂/∂b0 (y_i - P_p(x_i))^2
       = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_p x_i^p))(-1)
       = 0

and
∂Q/∂b_k = ∂/∂b_k Σ_{i=1}^{n} (y_i - P_p(x_i))^2
        = Σ_{i=1}^{n} ∂/∂b_k (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_k x_i^k + ... + b_p x_i^p))^2
        = 2 Σ_{i=1}^{n} (y_i - (b0 + b1 x_i + b2 x_i^2 + ... + b_k x_i^k + ... + b_p x_i^p))(-x_i^k)
        = 0   for k = 1, ..., p

so

Σ_{i=1}^{n} y_i x_i^k = Σ_{i=1}^{n} (b0 x_i^k + b1 x_i^{k+1} + b2 x_i^{k+2} + ... + b_p x_i^{p+k})   for k = 1, ..., p
Combining these results we obtain the set of p + 1 normal equations (all sums over i = 1, ..., n):

Σ y_i       = b0 Σ 1     + b1 Σ x_i       + b2 Σ x_i^2     + ... + b_p Σ x_i^p
Σ y_i x_i   = b0 Σ x_i   + b1 Σ x_i^2     + b2 Σ x_i^3     + ... + b_p Σ x_i^{p+1}
  ...
Σ y_i x_i^p = b0 Σ x_i^p + b1 Σ x_i^{p+1} + b2 Σ x_i^{p+2} + ... + b_p Σ x_i^{2p}
One method for solving this involves defining (for example, when p = 2)

Y = | y1 |      and      X = | 1  x1  x1^2 |
    | y2 |                   | 1  x2  x2^2 |
    | y3 |                   | 1  x3  x3^2 |
    | y4 |                   | 1  x4  x4^2 |
    | y5 |                   | 1  x5  x5^2 |

from which

X^T X = | 5                     x1 + x2 + ... + x5          x1^2 + x2^2 + ... + x5^2 |
        | x1 + x2 + ... + x5    x1^2 + x2^2 + ... + x5^2    x1^3 + x2^3 + ... + x5^3 |
        | x1^2 + ... + x5^2     x1^3 + ... + x5^3           x1^4 + x2^4 + ... + x5^4 |
or, in general

X^T X = | n        Σ x_i     Σ x_i^2 |
        | Σ x_i    Σ x_i^2   Σ x_i^3 |
        | Σ x_i^2  Σ x_i^3   Σ x_i^4 |

X^T Y = | y1 + y2 + ... + y5                |     | Σ y_i       |
        | x1 y1 + x2 y2 + ... + x5 y5       |  =  | Σ y_i x_i   |
        | x1^2 y1 + x2^2 y2 + ... + x5^2 y5 |     | Σ y_i x_i^2 |

Then we can write, in matrix notation,

(X^T X) b = X^T Y

so

b = β̂ = (X^T X)^{-1} (X^T Y)

Note that the problem is now reduced to the solution of a system of equations.
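The matrix solution can be verified directly in R (an illustrative quadratic example; note that solve(A, b) is used rather than forming the inverse explicitly, which is numerically preferable):

```r
# Solve the normal equations (X^T X) b = X^T Y for a quadratic fit
# on illustrative noisy data, and compare with lm().
set.seed(1)
x <- seq(0, 6, by = 0.5)
y <- 1 + 2*x - 0.3*x^2 + rnorm(length(x), 0, 0.1)
X <- cbind(1, x, x^2)                  # design matrix
b <- solve(t(X) %*% X, t(X) %*% y)     # normal equations
as.vector(b)
coef(lm(y ~ x + I(x^2)))               # same coefficients
```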
We will apply these ideas to the same noisy sine as before. As a first approximation at a predictor, we might try taking the mean of the y values. Keep in mind that if the residuals arising from using the mean as an approximation are small enough for our calculations, then we do not need to try fitting the data using a non-flat straight line or a higher degree polynomial.
oldpar <- par(mfrow=c(2, 1))
mu <- mean(y.rough)
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by the mean")
curve(sin, 0, 6, add=T, col="blue")
lines(c(0, 6), c(mu, mu))
res.mean <- y.rough - mu
plot(x.rough, res.mean, pch=20, main="Residuals for approximation by the mean")
sum(res.mean*res.mean)
[1] 34.71713
par(oldpar)
Figure 10.
We see that the residuals have a pattern. This indicates that there is structure in the data not captured by the model (the mean value). Even so, if the error is within our tolerance we might decide that the model is acceptable. If not, we might try a least squares fit. We will use the noisy sine from before.
In doing this we will make use of the linear models routine in the stats package (lm(y ~ x)). It computes not only the regression coefficients, but also the fitted values and residuals.
(lm.1 <- lm(y.rough ~ x.rough))
Call:
lm(formula = y.rough ~ x.rough)

Coefficients:
(Intercept)      x.rough
     0.9543      -0.3036

oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by regression line")
curve(sin, 0, 6, add=T, col="blue")
abline(lm.1$coefficients[1], lm.1$coefficients[2], col="green")
res.line <- y.rough - (lm.1$coefficients[1] + x.rough*lm.1$coefficients[2])
plot(x.rough, res.line, pch=20, main="Residuals for approximation by regression line")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252 (better)
Figure 11.
Again, the residual plot shows structure.
We might also notice that there seem to be two sections to the data. Perhaps we could investigate the idea of breaking the region into two parts. In the next section, we will look at fitting the means before and after x = 3.
mid.pt <- min(x.rough) + sum(range(x.rough))/2
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by two means")
# Get means on both sides
l.index <- (x.rough < mid.pt)*(1:numb.rough)
r.index <- (x.rough >= mid.pt)*(1:numb.rough)
mu.2.l <- mean(y.rough[l.index])
mu.2.r <- mean(y.rough[r.index])
lines(c(0, 3, 3, 6), c(mu.2.l, mu.2.l, mu.2.r, mu.2.r))
curve(sin, 0, 6, add=T, col="blue")
res.2 <- c(y.rough[l.index]-mu.2.l, y.rough[r.index]-mu.2.r)
plot(x.rough, res.2, pch=20, main="Residuals for approximation by two means")
par(oldpar)
The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287 (an improvement)
Figure 12.
The residuals look better, and we have a lower sum of squares of error.
We might extend the idea of splitting the interval and do least squares fits that do not cover the entire interval at once but rather piecewise.
quart.pt <- min(x.rough) + sum(range(x.rough))/4
three.quart.pt <- min(x.rough) + sum(range(x.rough))*3/4
a.index <- (x.rough < quart.pt)*(1:numb.rough)
b.index <- ((x.rough >= quart.pt)&(x.rough < three.quart.pt))*(1:numb.rough)
c.index <- (x.rough >= three.quart.pt)*(1:numb.rough)
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by three regression lines")
curve(sin, 0, 6, add=T, col="blue")
# Get regression in the intervals.
lm.3.a <- lm(y.rough[a.index] ~ x.rough[a.index])
m <- lm.3.a$coefficients[2]
int <- lm.3.a$coefficients[1]
lines(c(0, quart.pt), c(int, int + m*quart.pt))
lm.3.b <- lm(y.rough[b.index] ~ x.rough[b.index])
m <- lm.3.b$coefficients[2]
int <- lm.3.b$coefficients[1]
lines(c(quart.pt, three.quart.pt), c(int + m*quart.pt, int + m*three.quart.pt))
lm.3.c <- lm(y.rough[c.index] ~ x.rough[c.index])
m <- lm.3.c$coefficients[2]
int <- lm.3.c$coefficients[1]
lines(c(three.quart.pt, 6), c(int + m*three.quart.pt, int + m*6))
# Residuals
res.3 <- c(lm.3.a$residuals, lm.3.b$residuals, lm.3.c$residuals)
plot(x.rough, res.3, pch=20, main="Residuals for three regression lines")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956 (a further improvement)
Figure 13.
The positions for the split points were chosen arbitrarily. We could use an optimization routine to find the knot locations that give the best fit.
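For a single split point, for example, R's optimize() could search for the location that minimizes the error sum of squares of the two-mean fit (a sketch; sse.split is a hypothetical helper, and the noisy sine is regenerated here so the chunk stands on its own):

```r
# Sketch: find the split point for the two-mean fit that minimizes
# the error sum of squares.
sse.split <- function(s, x, y) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
set.seed(1234, "default")
x.rough <- seq(0, 6, by = 0.1)
y.rough <- sin(x.rough) + runif(length(x.rough)) - 0.5
best <- optimize(sse.split, interval = c(0.5, 5.5), x = x.rough, y = y.rough)
best$minimum      # estimated split location
best$objective    # SSE at that split
```

Because the objective is piecewise constant in s, optimize() finds a local minimum; a grid search over the data points would guarantee the global one.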
Rather than use piecewise linear, suppose we try a quadratic fit.
oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by quadratic regression")
X.rough.q <- cbind(x.rough, x.rough^2)
lm.q <- lm(y.rough ~ X.rough.q)
cat("The coefficients are ", lm.q$coefficients, "\n")
The coefficients are  0.6796526 -0.02429912 -0.04655698
T.q <- lm.q$coefficients[1] + x.in*lm.q$coefficients[2] + (x.in)^2*lm.q$coefficients[3]
lines(x.in, T.q, col="green")
curve(sin, 0, 6, add=T, col="blue")
plot(x.rough, lm.q$residuals, pch=20, main="Residuals for approximation by quadratic regression")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956
For quadratic = 16.26682 (slightly better than a straight line)
Figure 14.
This does not seem to give any improvement as there is still structure on the residual plot.
As a last attempt, we will try a cubic:

oldpar <- par(mfrow=c(2, 1))
plot(x.rough, y.rough, col="red", pch=20, main="Approximation by cubic regression")
X.rough.cubic <- cbind(x.rough, x.rough^2, x.rough^3)
lm.cubic <- lm(y.rough ~ X.rough.cubic)
cat("The coefficients are ", lm.cubic$coefficients, "\n")
The coefficients are  -0.3902004 2.208157 -0.984476 0.1042132
T.cubic <- lm.cubic$coefficients[1] + x.in*lm.cubic$coefficients[2] +
           (x.in)^2*lm.cubic$coefficients[3] + (x.in)^3*lm.cubic$coefficients[4]
lines(x.in, T.cubic, col="green")
curve(sin, 0, 6, add=T, col="blue")
plot(x.rough, lm.cubic$residuals, pch=20, main="Residuals for approximation by cubic regression")
par(oldpar)

The sum of squares for error:
For mean = 34.71713
For regression line = 17.28252
For split = 11.74287
For 3 lines = 4.267956
For quadratic = 16.26682
For cubic = 4.122846 (similar to the 3 linear regressions)
Figure 15.

This seems to give a small error sum of squares and the residual plot has little structure. Perhaps we should consider using a higher degree polynomial.
While it is true that if you have data with no noise you can use higher degree polynomials that will fit the data exactly, it is not a good idea. We will see later that significant problems can arise in fitting data in this way. We must also keep in mind that we wish to use the current data to make predictions on future data. Fitting too closely decreases the predictive strength of a model and, in fact, we usually apply a penalty for 'roughness' as a means of discouraging overfitting. There is a trade-off between a good fit on the information used to construct a model and the ability of the model to make good predictions.
The idea of regression can be applied to functions of several variables (as usual, for purposes of illustration, we will look at a two variable case).

numb <- 1000
max <- 6
set.seed(1234)
x <- sort(runif(numb)*max)
y <- runif(numb)*max
z <- sin(x)*y + runif(numb)
lm.1 <- lm(z ~ x + y)
lm.1$coefficients
(Intercept)           x           y
 3.36044749 -0.96380760  0.03079192
We will use the rgl package for displaying three dimensional data. The data is plotted by finding the array of {z_{i,j}} values above a rectangular grid of (x_i, y_j) values. We create the grid on the xy-plane from values along the x- and y-axes:

X.g <- seq(0, max, by=.1)
Y.g <- seq(0, max, by=.1)
g <- expand.grid(X.g, Y.g)

Compute the estimated z value at every point on the grid and then convert to a matrix:

lm.g <- lm.1$coefficients[1] + g[,1]*lm.1$coefficients[2] + g[,2]*lm.1$coefficients[3]
lm.g.m <- matrix(lm.g, nrow=length(X.g))
library(rgl)

Plot the points as small spheres and then plot the plane of estimated values:

plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.g.m, alpha=0.5)   # alpha gives the transparency
# Residuals
plot3d(x, y, lm.1$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.1$residuals*lm.1$residuals)
[1] 3006.138
Figure 16. Dark red points are below the plane Figure 17. Residuals showing structure
As before we can try a cubic fit:
X <- cbind(x, x^2, x^3)
lm.2 <- lm(z ~ X + y)
lm.2$coefficients
(Intercept)          Xx           X           X           y
-0.31595442  5.79108184 -2.63805108  0.27920336  0.01659746

lm.2.g <- (lm.2$coefficients[1] + g[,1]*lm.2$coefficients[2] + (g[,1])^2*lm.2$coefficients[3] +
           (g[,1])^3*lm.2$coefficients[4] + g[,2]*lm.2$coefficients[5])
lm.2.g.m <- matrix(lm.2.g, nrow=length(X.g))
plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.2.g.m, alpha=0.5)
# Residuals
plot3d(x, y, lm.2$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.2$residuals*lm.2$residuals)
[1] 1639.442
Figure 18. Figure 19. Residuals showing less structure
or a more complicated one:

X <- cbind(y*x, y*x^2, y*x^3)
lm.3 <- lm(z ~ X)
print(lm.3$coefficients)
(Intercept)          X1          X2          X3
 0.37762183  1.73255960 -0.83142514  0.08954253
lm.3.g <- (lm.3$coefficients[1] + g[,1]*g[,2]*lm.3$coefficients[2] +
           (g[,1])^2*g[,2]*lm.3$coefficients[3] + (g[,1])^3*g[,2]*lm.3$coefficients[4])
lm.3.g.m <- matrix(lm.3.g, nrow=length(X.g))

(In the following, the zlim=c(-7,7) gives a range of [-8,8] while zlim=c(-8,8) gives [-10,10].)

plot3d(x, y, z, col="red", type="s", size=0.3)
surface3d(X.g, Y.g, lm.3.g.m, alpha=0.5)
# Residuals
plot3d(x, y, lm.3$residuals, zlim=c(-6, 6), type="s", size=0.3)
sum(lm.3$residuals*lm.3$residuals)
[1] 148.3741
Figure 20. Figure 21. Residuals
3.3.1 All Subsets Regression
Example from Neter, Kutner, Nachtsheim, Wasserman.
d.file <- paste(data.dir, "ch08ta01.dat", sep = "/")
d.temp <- matrix(scan(d.file), ncol=6, byrow=T)
d.data <- d.temp[, c(1:4, 6)]
names <- c("BloodClot", "Prog.Ind", "EnzyneFun", "LiverFun", "logSurvival")
dimnames(d.data) <- list(1:54, names)
df.data <- as.data.frame(d.data)
The following recursive routine creates all possible combinations of the predictors 1 to numb:
next.combo <- function(combo, numb, all.current, prev, ind) {
  i <- prev + 1
  while (i <= numb) {
    combo[ind, c(all.current, i)] <- c(all.current, i)
    ind <- ind + 1
    res <- next.combo(combo, numb, c(all.current, i), i, ind)
    combo <- res[[1]]
    ind <- res[[2]]
    i <- i + 1
  }
  list(combo, ind)
}
numb <- 4
# Determine the number of possible combinations
n.combo <- 0
for (j in 1:numb) {
  n.combo <- n.combo + choose(numb, j)
}
combo <- matrix(0, n.combo, numb)
res <- next.combo(combo, numb, {}, 0, 1)
# Puts them in a better order
(combos <- res[[1]][order(apply(res[[1]] != 0, 1, sum)), ])
      [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    0    2    0    0
 [3,]    0    0    3    0
 [4,]    0    0    0    4
 [5,]    1    2    0    0
 [6,]    1    0    3    0
 [7,]    1    0    0    4
 [8,]    0    2    3    0
 [9,]    0    2    0    4
[10,]    0    0    3    4
[11,]    1    2    3    0
[12,]    1    2    0    4
[13,]    1    0    3    4
[14,]    0    2    3    4
[15,]    1    2    3    4
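For comparison, the same enumeration can be built without recursion. This is an illustrative sketch (not the notes' code): `expand.grid` produces every 0/1 indicator row, and each row is then rescaled to the 0-or-column-index form used above.

```r
# A non-recursive sketch of subset enumeration (illustrative alternative
# to next.combo): each row of the 0/1 grid marks which predictors are in.
numb <- 4
grid <- as.matrix(expand.grid(rep(list(0:1), numb)))  # 2^numb rows
grid <- grid[rowSums(grid) > 0, ]                     # drop the empty subset
combos2 <- sweep(grid, 2, 1:numb, "*")                # 0 or the column index
combos2 <- combos2[order(rowSums(combos2 > 0)), ]     # order by subset size
dim(combos2)   # 15 rows: one per nonempty subset of 4 predictors
```

The recursive version has the advantage of filling the rows in a fixed order; this sketch relies on `order` (which is stable) to group subsets of the same size together.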
Create regression models for all possible subsets and plot the SSE.
xx <- 1:numb
plot(-1, 0, xlim=c(0, numb), ylim=c(0, 5), xlab="", ylab="")
lpm.all <- {}  # A null list
for (i in 1:n.combo) {
  comb <- d.data[, combos[i, ]]  # Simplify names
  lpm <- lm(df.data[, 5] ~ comb)
  # Put model in list
  lpm.all <- c(lpm.all, list(lpm))
  # Plot the results
  points(sum(combos[i, ] > 0), sum(lpm$residuals^2))
}
lpm.all[[1]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
(Intercept)         comb
    1.86399      0.05916

lpm.all[[2]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
(Intercept)         comb
   1.598811     0.009604

lpm.all[[15]]

Call:
lm(formula = df.data[, 5] ~ comb)

Coefficients:
  (Intercept)  combBloodClot   combProg.Ind  combEnzymeFun   combLiverFun
     0.488756       0.068520       0.009254       0.009475        0.00192
table <- matrix(0, n.combo+1, 4)
TSS <- sum((df.data[, 5] - mean(df.data[, 5]))^2)
table[1, 1] <- 53
table[1, 2] <- floor(100000*TSS)/100000  # 5 digits
table[1, 3] <- 0
table[1, 4] <- floor(100000*(table[1, 2]/(54-2)))/100000
row.names <- "None"
for (i in 1:n.combo) {
  temp1 <- format(combos[i, ])
  temp2 <- {}
  for (j in 1:length(temp1)) {
    if (temp1[j] != "0")
      temp2 <- paste(temp2, "X", temp1[j], sep="")
  }
  row.names <- c(row.names, temp2)
  table[i+1, 1] <- floor(100000*lpm.all[[i]]$df.residual)/100000
  table[i+1, 2] <- floor(100000*sum(lpm.all[[i]]$residuals^2))/100000
  table[i+1, 3] <- floor(100000*(1 - table[i+1, 2]/TSS))/100000
  table[i+1, 4] <- floor(100000*table[i+1, 2])/(54-2)/100000
}
dimnames(table) <- list(row.names, c("df","SSE","R^2","MSE"))
table
           df     SSE     R^2         MSE
None       53 3.97277 0.00000 0.076390000
X1         52 3.49605 0.11999 0.067231731
X2         52 2.57627 0.35151 0.049543654
X3         52 2.21527 0.44238 0.042601154
X4         52 1.87763 0.52737 0.036108269
X1X2       51 2.23248 0.43805 0.042932115
X1X3       51 1.40718 0.64579 0.027061154
X1X4       51 1.87582 0.52783 0.036073462
X2X3       51 0.74301 0.81297 0.014288654
X2X4       51 1.39215 0.64957 0.026772115
X3X4       51 1.24532 0.68653 0.023948462
X1X2X3     50 0.10985 0.97234 0.002112500
X1X2X4     50 1.39052 0.64998 0.026740769
X1X3X4     50 1.11559 0.71919 0.021453654
X2X3X4     50 0.46520 0.88290 0.008946154
X1X2X3X4   49 0.10977 0.97236 0.002110962
Figure 22.
mins <- matrix(10^(30), numb+1, 2)
for (i in 1:15) {
  temp <- sum(lpm.all[[i]]$residuals^2)
  if (temp < mins[lpm.all[[i]]$rank, 1]) {
    mins[lpm.all[[i]]$rank, 1] <- temp
    mins[lpm.all[[i]]$rank, 2] <- i
  }
}
mins[1, 1] <- TSS
mins[1, 2] <- 0
lines(0:4, mins[1:5, 1])
points(0:4, mins[1:5, 1], col="red")
Figure 23.
3.4 Computational difficulties
The process of solving the normal equations can produce significant round-off error. As long as there are only a few predictor variables (X) the effect may not be noticed but, when there are a large number of predictor variables, the effect can be serious. This is especially true when there is correlation among the predictors. The effect is seen in the computation of the inverse of X^T X (|X^T X| will often be very small, which leads to instability of the inverse) and for this reason we use the singular value decomposition method of solution.
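To see why correlated predictors make this worse, here is a minimal sketch (illustrative only; the variables and seed are made up, not from the notes) with two nearly collinear predictors:

```r
# Illustration of how near-collinearity makes X^T X nearly singular.
set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 1e-4)   # almost identical to x1
X  <- cbind(1, x1, x2)
XtX <- t(X) %*% X
det(XtX)      # very close to 0
kappa(XtX)    # huge condition number: inverting XtX is numerically unstable
```

A determinant near zero and a large condition number together signal that small rounding errors in the entries of X^T X are magnified enormously in its inverse.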
3.4.1 The singular value decomposition method
To solve the normal equations, we will look at a robust method known as singular value decomposition (SVD).
If we have a matrix $A \in \mathbb{R}^{m \times n}$ (note - it does not need to be square), we:
1. Find the eigenvalues $\lambda_i$ of the matrix $A^T A$;
2. Arrange the eigenvalues in descending order;
3. Determine the number of nonzero eigenvalues, $r$;
4. Find the orthogonal eigenvectors of the matrix $A^T A$ (they have to be ordered corresponding to the order of the eigenvalues) and create a matrix $V \in \mathbb{R}^{n \times n}$ whose columns $v_i$ are the ordered eigenvectors;
5. Form a matrix $\Sigma \in \mathbb{R}^{m \times n}$, with 'diagonal' entries being the square roots of the ordered eigenvalues;
6. Create a matrix $U \in \mathbb{R}^{m \times m}$ with the first $r$ column vectors constructed as
$$u_i = \frac{A v_i}{\sigma_i};$$
7. If $r < m$, the remaining ($(r+1)$ to $m$) vectors are constructed using the Gram-Schmidt orthogonalization process.
Aside: If we have a set of linearly independent vectors $v_1, v_2, \ldots, v_n$, we select one of the vectors and normalize it, i.e. $e_1 = v_1/\|v_1\|$. Now define a vector $e_2' = v_2 - (v_2 \cdot e_1)e_1$. We note that
$$e_1 \cdot e_2' = e_1 \cdot v_2 - (v_2 \cdot e_1)(e_1 \cdot e_1) = 0$$
so $e_1$ and $e_2'$ are orthogonal. Then $e_1$ and $e_2 = e_2'/\|e_2'\|$ form an orthonormal set.
Using induction, we can define
$$e_{j+1}' = v_{j+1} - \sum_{i=1}^{j} (v_{j+1} \cdot e_i)\, e_i$$
and show that
$$e_{j+1}' \cdot e_p = v_{j+1} \cdot e_p - \sum_{i=1}^{j} (v_{j+1} \cdot e_i)(e_i \cdot e_p) = 0 \quad \text{for } 1 \le p \le j.$$
The orthonormal vectors $e_p = e_p'/\|e_p'\|$ span the same space as $v_1, v_2, \ldots, v_n$.
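The Gram-Schmidt process in this aside can be sketched in R as follows (an illustrative helper written for these notes' style; the name `gram.schmidt` is chosen here, not taken from the notes):

```r
# Orthonormalize the columns of V by the Gram-Schmidt process above:
# subtract from each v_j its projections onto the earlier e_i, then normalize.
gram.schmidt <- function(V) {
  E <- matrix(0, nrow(V), ncol(V))
  for (j in 1:ncol(V)) {
    u <- V[, j]
    if (j > 1)
      for (i in 1:(j - 1))
        u <- u - sum(V[, j] * E[, i]) * E[, i]   # (v_j . e_i) e_i
    E[, j] <- u / sqrt(sum(u * u))               # normalize
  }
  E
}
V <- cbind(c(1, 1, 0), c(1, 0, 1), c(0, 1, 1))
E <- gram.schmidt(V)
round(t(E) %*% E, 10)   # the identity: the columns are orthonormal
```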
The equation of the form
$$Ax = b$$
then becomes
$$(U \Sigma V^T) x = b$$
or
$$x = V \Sigma^{-1} U^T b$$
Consider the matrix
$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
A <- cbind(c(1, 0, 1), c(1, 1, 0))
nrow <- dim(A)[1]
ncol <- dim(A)[2]
(ATA <- t(A) %*% A)
     [,1] [,2]
[1,]    2    1
[2,]    1    2
(ATA.eig <- eigen(ATA))
$values
[1] 3 1

$vectors
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068
Set 'small' numbers to 0 and remove them. Which ones are smaller than $10^{-15}$? (None in this case, but we need this in general.)
(abs(ATA.eig$values) > 10^(-15))
[1] TRUE TRUE
Create indices for the non-zero eigenvalues:
(abs(ATA.eig$values) > 10^(-15)) * (1:length(ATA.eig$values))
[1] 1 2
List of the non-zero eigenvalues:
(eigs.a <- ATA.eig$values[(abs(ATA.eig$values) > 10^(-15)) * (1:length(ATA.eig$values))])
[1] 3 1
Put in descending order:
(the.eigs <- eigs.a[order(-eigs.a)])
[1] 3 1
(r <- length(the.eigs))
[1] 2
if (r < ncol) {
  V <- as.matrix(ATA.eig$vectors[, c(order(-eigs.a), (r+1):ncol)])
} else {
  V <- as.matrix(ATA.eig$vectors[, order(-eigs.a)])
}
V
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068
t(V) %*% V
     [,1] [,2]
[1,]    1    0
[2,]    0    1
Create the singular value matrix using the square root of the eigenvalues as 'diagonal' elements. This is of the same shape as A.
(Sig <- diag((the.eigs)^(.5), nrow, ncol))
         [,1] [,2]
[1,] 1.732051    0
[2,] 0.000000    1
[3,] 0.000000    0
(Sig.Inv <- diag(1/(the.eigs)^(.5), nrow, ncol))
          [,1] [,2]
[1,] 0.5773503    0
[2,] 0.0000000    1
[3,] 0.0000000    0
U <- matrix(0, nrow, nrow)
for (i in 1:r) {
  U[, i] <- A %*% V[, i] / sqrt(the.eigs[i])
}
U
          [,1]       [,2] [,3]
[1,] 0.8164966  0.0000000    0
[2,] 0.4082483 -0.7071068    0
[3,] 0.4082483  0.7071068    0
I <- diag(nrow)  # standard basis vectors to seed the Gram-Schmidt step
for (i in (r+1):nrow) {
  # Gram-Schmidt for next
  u <- I[, 1]
  for (j in 1:(i-1)) {
    u <- u - sum(I[, 1] * U[, j]) * U[, j]
  }
  U[, i] <- u / sqrt(sum(u * u))
}
U
          [,1]       [,2]       [,3]
[1,] 0.8164966  0.0000000  0.5773503
[2,] 0.4082483 -0.7071068 -0.5773503
[3,] 0.4082483  0.7071068 -0.5773503
U %*% t(U)
              [,1]          [,2]          [,3]
[1,]  1.000000e+00  1.110223e-16  1.110223e-16
[2,]  1.110223e-16  1.000000e+00 -1.110223e-16
[3,]  1.110223e-16 -1.110223e-16  1.000000e+00
U %*% Sig %*% t(V)
     [,1] [,2]
[1,]    1    1
[2,]    0    1
[3,]    1    0
We could use this method to find the coefficients of the regression in the first example (the one that produced Figure 11).
Y <- matrix(y.rough)
X <- cbind(rep(1, length(Y)), x.rough)
(s <- svd(t(X) %*% X))
$d
[1] 784.39426  14.70574

$u
           [,1]       [,2]
[1,] -0.2452483 -0.9694603
[2,] -0.9694603  0.2452483

$v
           [,1]       [,2]
[1,] -0.2452483 -0.9694603
[2,] -0.9694603  0.2452483

(l <- s$v %*% diag(1/s$d) %*% t(s$u) %*% t(X) %*% Y)
           [,1]
[1,]  0.9543387
[2,] -0.3036410
As we see, the coefficients are the same.
A second cause of problems occurs when the entries of X^T X cover a wide range of values. This happens if the X values have widely varying scales (e.g. should income be expressed in cents, dollars, or thousands of dollars?). One way of avoiding such problems is to standardize the data by the use of what is called the correlation transformation.
3.4.2 The correlation transformation
When we transform a variable by subtracting its mean, dividing the result by its standard deviation and multiplying the result of that by $(n-1)^{-1/2}$, we are performing a correlation transformation or a form of standardizing the data. Hence, the response variable becomes
$$Y_i^* = \frac{1}{\sqrt{n-1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$$
and the predictor variables become
$$X_{ij}^* = \frac{1}{\sqrt{n-1}} \left( \frac{X_{ij} - \bar{X}_j}{s_{X_j}} \right) \quad \text{for } j = 1, \ldots, p-1.$$
The result of this transformation is that
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \varepsilon$$
becomes
$$Y^* = \beta_1^* X_1^* + \beta_2^* X_2^* + \cdots + \beta_{p-1}^* X_{p-1}^* + \varepsilon^*$$
(note the lack of an intercept term here) and the solution of the normal equations gives
$$b^* = \left( X^{*T} X^* \right)^{-1} X^{*T} Y^*$$
Using the factor of $(n-1)^{-1/2}$ in the denominator of these transformations results in $\sigma_{Y^*}^2 = 1$,
$$X^{*T} X^* = r_{XX}$$
and
$$X^{*T} Y^* = r_{YX}$$
so
$$b^* = \left( r_{XX} \right)^{-1} r_{YX}$$
(Note that $r_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$.)
#----------------------
#   Standardize data
#----------------------
f.data.std <- function(data) {
  data <- as.matrix(data)
  bar <- apply(data, 2, mean)
  s <- apply(data, 2, sd)
  t((t(data) - bar) / s)
}
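As a quick numerical check (illustrative, not from the notes) that the correlation transformation really does turn $X^{*T}X^*$ into the correlation matrix $r_{XX}$:

```r
# Verify X*^T X* = r_XX for a made-up data matrix.  scale() subtracts the
# column means and divides by the column standard deviations; dividing by
# sqrt(n-1) then completes the correlation transformation.
set.seed(2)
X <- cbind(rnorm(40), runif(40), rexp(40))
n <- nrow(X)
X.star <- scale(X) / sqrt(n - 1)
max(abs(t(X.star) %*% X.star - cor(X)))   # essentially zero
```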
If we wish to return to the original coordinates, we can use the following:
$$X_j^* = \left( \frac{1}{\sqrt{n-1}}\, \frac{X_{1j} - \bar{X}_j}{s_{X_j}},\; \frac{1}{\sqrt{n-1}}\, \frac{X_{2j} - \bar{X}_j}{s_{X_j}},\; \ldots,\; \frac{1}{\sqrt{n-1}}\, \frac{X_{nj} - \bar{X}_j}{s_{X_j}} \right)$$
$$= \frac{1}{s_{X_j}\sqrt{n-1}} \left( X_{1j}, X_{2j}, \ldots, X_{nj} \right) - \frac{1}{s_{X_j}\sqrt{n-1}}\, \bar{X}_j = \frac{1}{s_{X_j}\sqrt{n-1}} \left( X_j - \bar{X}_j \right)$$
and
$$Y^* = \beta_1^* X_1^* + \beta_2^* X_2^* + \cdots + \beta_{p-1}^* X_{p-1}^* + \varepsilon^*$$
so
$$\frac{1}{\sqrt{n-1}}\, \frac{Y_i - \bar{Y}}{s_Y} = \beta_1^* \frac{1}{s_{X_1}\sqrt{n-1}} \left( X_1 - \bar{X}_1 \right) + \beta_2^* \frac{1}{s_{X_2}\sqrt{n-1}} \left( X_2 - \bar{X}_2 \right) + \cdots + \beta_{p-1}^* \frac{1}{s_{X_{p-1}}\sqrt{n-1}} \left( X_{p-1} - \bar{X}_{p-1} \right)$$
so
$$Y_i - \bar{Y} = \beta_1^* \frac{s_Y}{s_{X_1}} \left( X_1 - \bar{X}_1 \right) + \beta_2^* \frac{s_Y}{s_{X_2}} \left( X_2 - \bar{X}_2 \right) + \cdots + \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} \left( X_{p-1} - \bar{X}_{p-1} \right)$$
This gives
$$Y_i = \bar{Y} - \beta_1^* \frac{s_Y}{s_{X_1}} \bar{X}_1 - \beta_2^* \frac{s_Y}{s_{X_2}} \bar{X}_2 - \cdots - \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} \bar{X}_{p-1} + \beta_1^* \frac{s_Y}{s_{X_1}} X_1 + \beta_2^* \frac{s_Y}{s_{X_2}} X_2 + \cdots + \beta_{p-1}^* \frac{s_Y}{s_{X_{p-1}}} X_{p-1}$$
$$= \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \cdots - \beta_{p-1} \bar{X}_{p-1} + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1}$$
$$= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1}$$
where $\beta_j = \beta_j^*\, s_Y / s_{X_j}$ and $\beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \cdots - \beta_{p-1} \bar{X}_{p-1}$.
#--------------------------------------------
#  Convert standardized coeff to original
#--------------------------------------------
std.to.orig <- function(std.coef, mean.X, mean.Y, s.X, s.Y) {
  sz <- length(std.coef)
  B.i <- matrix(0, sz, 1)
  for (i in 2:sz) {
    B.i[i, 1] <- s.Y/s.X[i-1] * std.coef[i]
    std.coef[i] <- B.i[i, 1]
    B.i[1, 1] <- B.i[1, 1] + B.i[i, 1] * mean.X[i-1]
  }
  std.coef[1] <- mean.Y - B.i[1, 1]
  std.coef
}
3.5 Data splitting
Data splitting is often used to validate a model for a study by simulating replication of the study. The data set is split into two (or sometimes three) sets. The first, called the model-building or training set, is used to develop a model. The second - called the test (validation, prediction, calibration) set - is used to evaluate the reasonableness or predictive ability of the developed model (also called the Scientific Method). If only two sets are used, the second is used for both testing and validation. For the selected model we may compare regression coefficient estimates obtained in the training and test sets. We use the model obtained in the training phase to make predictions for the data in the validation data set. This calibrates the predictive ability of the model for new data.
d.file <- paste(data.dir, "prostate.dat", sep="/")
d.temp <- matrix(scan(d.file), ncol=10, byrow=T)
data.orig <- d.temp[, 2:10]
names <- c("lcavol","lweight","age","lbph","svi","lcp","gleason","pgg45","lpsa")
dimnames(data.orig) <- list(1:97, names)
The first 8 columns are predictors, the 9th is the response
pred <- 1:8
resp <- 9
Number of predictors:
p <- 8
The following function allows the computation of the estimated response
f.Yhat <- function(coef, X) {
  X <- as.matrix(X)
  cbind(rep(1, dim(X)[1]), X) %*% as.matrix(coef)
}
Now take a look at the data graphically and numerically
source(paste(code.dir, "pairs_ext.r", sep="/"))
pairs(data.orig, upper.panel=panel.cor, diag.panel=panel.hist)
Figure 24.
cor(data.orig)
             lcavol      lweight       age        lbph         svi          lcp
lcavol   1.00000000  0.194128307 0.2249999  0.027349703  0.53884500  0.675310484
lweight  0.19412831  1.000000000 0.3075286  0.434934587  0.10877848  0.100237802
age      0.22499988  0.307528601 1.0000000  0.350185896  0.11765804  0.127667752
lbph     0.02734970  0.434934587 0.3501859  1.000000000 -0.08584324 -0.006999431
svi      0.53884500  0.108778484 0.1176580 -0.085843238  1.00000000  0.673111185
lcp      0.67531048  0.100237802 0.1276678 -0.006999431  0.67311118  1.000000000
gleason  0.43241706 -0.001275662 0.2688916  0.077820447  0.32041222  0.514830063
pgg45    0.43365225  0.050846836 0.2761124  0.078460018  0.45764762  0.631528246
lpsa     0.73446033  0.354120358 0.1695928  0.179809404  0.56621822  0.548813175
             gleason      pgg45      lpsa
lcavol   0.432417056 0.43365225 0.7344603
lweight -0.001275662 0.05084684 0.3541204
age      0.268891599 0.27611245 0.1695928
lbph     0.077820447 0.07846002 0.1798094
svi      0.320412221 0.45764762 0.5662182
lcp      0.514830063 0.63152825 0.5488132
gleason  1.000000000 0.75190451 0.3689868
pgg45    0.751904512 1.00000000 0.4223159
lpsa     0.368986806 0.42231586 1.0000000
Try a least squares fit on the full data set - original
lm(data.orig[, resp] ~ data.orig[, pred])

Coefficients:
             (Intercept)  data.orig[, pred]lcavol  data.orig[, pred]lweight
                0.669399                 0.587023                  0.454461
    data.orig[, pred]age    data.orig[, pred]lbph     data.orig[, pred]svi
               -0.019637                 0.107054                 0.766156
    data.orig[, pred]lcp  data.orig[, pred]gleason   data.orig[, pred]pgg45
               -0.105474                 0.045136                 0.004525
and standardized
# Standardize the data
data.std <- f.data.std(data.orig)
(lm.std <- lm(data.std[, resp] ~ data.std[, pred]))

Coefficients:
             (Intercept)  data.std[, pred]lcavol  data.std[, pred]lweight
              -9.402e-16               5.994e-01                1.955e-01
     data.std[, pred]age    data.std[, pred]lbph     data.std[, pred]svi
              -1.267e-01               1.346e-01               2.748e-01
     data.std[, pred]lcp  data.std[, pred]gleason   data.std[, pred]pgg45
              -1.278e-01               2.824e-02               1.106e-01
then convert the standardized result back.
std.to.orig(lm.std$coefficients, apply(data.orig[, pred], 2, mean),
            mean(data.orig[, resp]), apply(data.orig[, pred], 2, sd), sd(data.orig[, resp]))
             (Intercept)  data.std[, pred]lcavol  data.std[, pred]lweight
             0.669399309             0.587022878              0.454460536
     data.std[, pred]age    data.std[, pred]lbph     data.std[, pred]svi
            -0.019637207             0.107054371             0.766155934
     data.std[, pred]lcp  data.std[, pred]gleason   data.std[, pred]pgg45
            -0.105473565             0.045135971             0.004525323
The result is the same.
Now consider splitting the data into two parts - a training sample and a test sample. This enables us to test our model in cases for which we may be unable to obtain further data for testing. Recall that the "scientific method" requires that we collect data, form a model based on that data, and then test the model with other data.
If we produce a subset of indices, we often wish to know what indices are in the original set but not in the subset. The following function produces those indices NOT in a set. Use this to get the test sample.
"%w/o%" <- function(x, y) x[!x %in% y]
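A tiny illustration of the %w/o% ("without") operator just defined (the definition is repeated here so the snippet stands alone):

```r
# x %w/o% y returns the elements of x that are not in y.
"%w/o%" <- function(x, y) x[!x %in% y]
(1:10) %w/o% c(2, 5, 9)
# [1]  1  3  4  6  7  8 10
```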
Now we get the indices for the training and test samples
#--------------------------------------------
#  Set the indices for the training/test sets
#--------------------------------------------
get.train <- function(data.sz, train.sz) {
  # Take subsets of data for training/test samples
  # Return the indices
  train.ind <- sample(data.sz, train.sz)
  test.ind <- (1:data.sz) %w/o% train.ind
  list(train=train.ind, test=test.ind)
}
Train.sz <- 67  # Set the size of the training sample
# Get the indices for the training and test samples
(tt.ind <- get.train(dim(data.orig)[1], Train.sz))
$train
 [1] 19  8 58 86 26 80 52 84 53 11 89 69 39 36  3  9  2 72 96 66 23 74 56 46 91
[26] 61 20 59 48 47 78 16 43 54 90 57 27 85 75 87 60 35 29 70 68 65 77 28 51 92
[51]  4 21 97 82 64 73 14 22 71 44 40  7 55 45 25 63 83

$test
 [1]  1  5  6 10 12 13 15 17 18 24 30 31 32 33 34 37 38 41 42 49 50 62 67 76 79
[26] 81 88 93 94 95
This produces two sets of (non-overlapping) indices to the data, so we can randomly split the data into two parts
train.X.orig <- data.orig[tt.ind$train, pred]
train.Y.orig <- data.orig[tt.ind$train, resp]
and also standardize.
train.X.std <- f.data.std(train.X.orig)
train.Y.std <- f.data.std(train.Y.orig)
Now we perform a least squares regression on the standardized data and convert the results back to the original variables.
lm.trn <- lm(train.Y.std ~ train.X.std)
std.to.orig(lm.trn$coefficients, apply(train.X.orig, 2, mean),
            mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig))
        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         1.10007224          0.59895775          0.75542221         -0.02376003
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.07222878          0.49050976         -0.07499372         -0.14310867
   train.X.stdpgg45
         0.00748505
lm(train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            1.100072             0.598958             0.755422
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.023760             0.072229             0.490510
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.074994            -0.143109             0.007485
We note that both methods produce the same result, but differ from that of the full data set. That might cause us to wonder how sensitive the fit is to the training/test split.
Look at the effect of using different training sets:
for (i in 1:5) {
  tt.ind <- get.train(dim(data.std)[1], Train.sz)
  train.X.orig <- data.orig[tt.ind$train, pred]
  train.Y.orig <- data.orig[tt.ind$train, resp]
  train.X.std <- f.data.std(train.X.orig)
  train.Y.std <- f.data.std(train.Y.orig)
  lm.trn <- lm(train.Y.std ~ train.X.std)
  print(lm(train.Y.orig ~ train.X.orig))
  print(std.to.orig(lm.trn$coefficients, apply(train.X.orig, 2, mean),
        mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig)))
  print("")
}

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            0.339876             0.531795             0.613882
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.025523             0.104623             0.669538
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.009038             0.087536             0.003572

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
        0.339875974         0.531794655         0.613882144        -0.025522689
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.104622846         0.669537978        -0.009037589         0.087535930
   train.X.stdpgg45
        0.003572217
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            0.825951             0.615631             0.483133
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.019817             0.092106             0.643410
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.083041            -0.010578             0.006723

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         0.82595121          0.61563149          0.48313272         -0.01981683
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.09210631          0.64341000         -0.08304134         -0.01057834
   train.X.stdpgg45
         0.00672335
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
             0.49567              0.48873              0.73814
     train.X.origage     train.X.origlbph      train.X.origsvi
            -0.01703              0.03406              0.48843
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
            -0.11416             -0.10147              0.01143

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
         0.49566937          0.48872861          0.73813804         -0.01703469
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
         0.03405864          0.48842511         -0.11415935         -0.10147043
   train.X.stdpgg45
         0.01143487
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
            1.396644             0.629813             0.615335
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.035610             0.107752             0.686840
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.151814            -0.037363             0.008356

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
        1.396643662         0.629812599         0.615335467        -0.035610354
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.107752489         0.686840338        -0.151813657        -0.037362643
   train.X.stdpgg45
        0.008355827
[1] ""

Call:
lm(formula = train.Y.orig ~ train.X.orig)

Coefficients:
         (Intercept)   train.X.origlcavol  train.X.origlweight
           -0.210137             0.535780             0.716955
     train.X.origage     train.X.origlbph      train.X.origsvi
           -0.014440             0.046356             0.509348
     train.X.origlcp  train.X.origgleason    train.X.origpgg45
           -0.065754            -0.016679             0.007443

        (Intercept)   train.X.stdlcavol  train.X.stdlweight      train.X.stdage
       -0.210137370         0.535780030         0.716954850        -0.014440157
    train.X.stdlbph      train.X.stdsvi      train.X.stdlcp  train.X.stdgleason
        0.046355829         0.509347668        -0.065753915        -0.016678925
   train.X.stdpgg45
        0.007443285
[1] ""
The intercept estimate ranges from -0.210137 to 1.396644!
3.6 Cross Validation (CV)
One way to attempt to deal with this is cross-validation.
With double cross-validation, the model is built for each part of the split data and then tested on the other part of the data, yielding two measures of consistency and predictive ability. We can expand on this by dividing the training set into more (than 2) sets. There are various schemes, but we will look at the case of 10 subsets of the training data and use the method of developing a model based on 9/10ths of the training sample and then using the other 10th to do an error analysis. Using different combinations of the 9 to form the training set, there are then 10!/(9!1!) = 10 separate training/testing combinations. The idea is that we average a number of weak predictors to produce (what we hope is) a strong predictor.
The training set must be sufficiently large as to allow development of a reasonable model. The number of cases should be at least 6 to 10 times the number of variables in the set of predictor variables. This might necessitate the test data set being smaller than the training data set.
If a data set is very large, it can be divided into three parts - the first part to develop the model, the second to estimate the parameters of the model, and the third for validation. This avoids bias resulting from estimating the parameters from the same data set used for developing the model.
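The fold construction described above can also be sketched compactly. This is an illustrative alternative (not the notes' padded-matrix code, which follows below), dealing a shuffled set of indices out into 10 nearly equal folds:

```r
# Assign n training cases to 10 cross-validation folds: shuffle the
# indices, then deal them out round-robin with split().
set.seed(3)
n <- 67                                        # size of a training sample
folds <- split(sample(1:n), rep(1:10, length.out = n))
length(folds)           # 10 folds
sapply(folds, length)   # fold sizes differ by at most 1
```

Each fold then serves once as the held-out tenth while the other nine folds form the model-building set.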
[In order to try to get reproducible results, a fixed training/test sample split is used. For totally random sets remove the tt.ind <- list(train <- c(64, 62, 75, 6, ...).]
In the following function, a matrix is created with 10 rows (the number of columns being the number that will be in each of the 10 sets). Because there will not (in general) be enough data to fill the matrix, we have to pad. Using the numbers that we wish to use for our training set (see below), we get
64 87 12 34 83 61  3
62 32 16 51 74 28 15
75 23 11  8 60 85 78
 6  9 47 17 14 13 44
94 59 21 46 48 43 20
56  2 54 45 41 29 84
52 96 91 93 35 82 76
80 66 24 57 70 77  0
33 90 81  4 68 92  0
55 53 86 50 72 22  0
Each set is obtained by selecting a row.
#-------------------- Cross Validation ----------------------
# This section does a cross validation on the full LS model
# Set up cross validation indices
# Create a train/test split and then create 10 subsets
#   cv.sets$cv.train  10 cross validation training sets from the training data
#   cv.sets$cv.test   Corresponding test sets from the training data
#   cv.sets$test      Test data
#------------------------------------------------------------
set.up.cv.ind <- function(data.sz, Train.sz) {
  tt.ind <- get.train(data.sz, Train.sz)
  # Fix the samples
  tt.ind <-
    list(train = c(64, 62, 75,  6, 94, 56, 52, 80, 33, 55, 87, 32,
                   23,  9, 59,  2, 96, 66, 90, 53, 12, 16, 11, 47,
                   21, 54, 91, 24, 81, 86, 34, 51,  8, 17, 46, 45,
                   93, 57,  4, 50, 83, 74, 60, 14, 48, 41, 35, 70,
                   68, 72, 61, 28, 85, 13, 43, 29, 82, 77, 92, 22,
                    3, 15, 78, 44, 20, 84, 76),
         test = c( 1,  5,  7, 10, 18, 19, 25, 26, 27, 30, 31, 36,
                  37, 38, 39, 40, 42, 49, 58, 63, 65, 67, 69, 71,
                  73, 79, 88, 89, 95, 97))
  n.in.set <- ceiling(Train.sz * .1)
  # Pad with 0 to avoid repetition
  cv.sets <- matrix(c(tt.ind[[1]], rep(0, n.in.set*10 - Train.sz)),
                    ncol = n.in.set, byrow = F)
  cv.train <- {}
  cv.test <- {}
  for (i in 1:10) {
    # Select the 1/10th for testing and 9/10ths for training
    cv.test <- c(cv.test, list(cv.sets[i, ]))  # The indices are from the full set
    cv.train <- c(cv.train, list(as.vector(cv.sets[(1:10) %w/o% i, ])))
  }
  list(cv.train = cv.train, cv.test = cv.test, test = tt.ind[[2]])
}
cv.sets <- set.up.cv.ind(dim(data.std)[1], Train.sz)
cv.sets$cv.train  # 10 cross validation training sets from the training data
[[1]]
 [1] 62 75  6 94 56 52 80 33 55 32 23  9 59  2 96 66 90 53 16 11 47 21 54 91 24
[26] 81 86 51  8 17 46 45 93 57  4 50 74 60 14 48 41 35 70 68 72 28 85 13 43 29
[51] 82 77 92 22 15 78 44 20 84 76  0  0  0

[[2]]
 [1] 64 75  6 94 56 52 80 33 55 87 23  9 59  2 96 66 90 53 12 11 47 21 54 91 24
[26] 81 86 34  8 17 46 45 93 57  4 50 83 60 14 48 41 35 70 68 72 61 85 13 43 29
[51] 82 77 92 22  3 78 44 20 84 76  0  0  0

[[3]]
 [1] 64 62  6 94 56 52 80 33 55 87 32  9 59  2 96 66 90 53 12 16 47 21 54 91 24
[26] 81 86 34 51 17 46 45 93 57  4 50 83 74 14 48 41 35 70 68 72 61 28 13 43 29
[51] 82 77 92 22  3 15 44 20 84 76  0  0  0

[[4]]
 [1] 64 62 75 94 56 52 80 33 55 87 32 23 59  2 96 66 90 53 12 16 11 21 54 91 24
[26] 81 86 34 51  8 46 45 93 57  4 50 83 74 60 48 41 35 70 68 72 61 28 85 43 29
[51] 82 77 92 22  3 15 78 20 84 76  0  0  0

[[5]]
 [1] 64 62 75  6 56 52 80 33 55 87 32 23  9  2 96 66 90 53 12 16 11 47 54 91 24
[26] 81 86 34 51  8 17 45 93 57  4 50 83 74 60 14 41 35 70 68 72 61 28 85 13 29
[51] 82 77 92 22  3 15 78 44 84 76  0  0  0

[[6]]
 [1] 64 62 75  6 94 52 80 33 55 87 32 23  9 59 96 66 90 53 12 16 11 47 21 91 24
[26] 81 86 34 51  8 17 46 93 57  4 50 83 74 60 14 48 35 70 68 72 61 28 85 13 43
[51] 82 77 92 22  3 15 78 44 20 76  0  0  0

[[7]]
 [1] 64 62 75  6 94 56 80 33 55 87 32 23  9 59  2 66 90 53 12 16 11 47 21 54 24
[26] 81 86 34 51  8 17 46 45 57  4 50 83 74 60 14 48 41 70 68 72 61 28 85 13 43
[51] 29 77 92 22  3 15 78 44 20 84  0  0  0

[[8]]
 [1] 64 62 75  6 94 56 52 33 55 87 32 23  9 59  2 96 90 53 12 16 11 47 21 54 91
[26] 81 86 34 51  8 17 46 45 93  4 50 83 74 60 14 48 41 35 68 72 61 28 85 13 43
[51] 29 82 92 22  3 15 78 44 20 84 76  0  0

[[9]]
 [1] 64 62 75  6 94 56 52 80 55 87 32 23  9 59  2 96 66 53 12 16 11 47 21 54 91
[26] 24 86 34 51  8 17 46 45 93 57 50 83 74 60 14 48 41 35 70 72 61 28 85 13 43
[51] 29 82 77 22  3 15 78 44 20 84 76  0  0

[[10]]
 [1] 64 62 75  6 94 56 52 80 33 87 32 23  9 59  2 96 66 90 12 16 11 47 21 54 91
[26] 24 81 34 51  8 17 46 45 93 57  4 83 74 60 14 48 41 35 70 68 61 28 85 13 43
[51] 29 82 77 92  3 15 78 44 20 84 76  0  0

cv.sets$cv.test  # Corresponding test sets from the training data
[[1]]
[1] 64 87 12 34 83 61  3

[[2]]
[1] 62 32 16 51 74 28 15

[[3]]
[1] 75 23 11  8 60 85 78

[[4]]
[1]  6  9 47 17 14 13 44

[[5]]
[1] 94 59 21 46 48 43 20

[[6]]
[1] 56  2 54 45 41 29 84

[[7]]
[1] 52 96 91 93 35 82 76

[[8]]
[1] 80 66 24 57 70 77  0

[[9]]
[1] 33 90 81  4 68 92  0

[[10]]
[1] 55 53 86 50 72 22  0

cv.sets$test  # Test data
 [1]  1  5  7 10 18 19 25 26 27 30 31 36 37 38 39 40 42 49 58 63 65 67 69 71 73
[26] 79 88 89 95 97
While we will not use the cross-validation at this stage, we now have the training/test sets.
3.7 Best subset regression
We need a few tools -
#--------------------------------------------------------
#   Recursive function to compute all possible subsets
#--------------------------------------------------------
next.combo <- function(combo, numb, all.current, prev, ind) {
  i <- prev + 1
  while (i <= numb) {
    combo[ind, c(all.current, i)] <- c(all.current, i)
    ind <- ind + 1
    res <- next.combo(combo, numb, c(all.current, i), i, ind)
    combo <- res[[1]]
    ind <- res[[2]]
    i <- i + 1
  }
  list(combo, ind)
}
number.of.combos <- function(numb) {
  n.combo <- 0
  for (j in 1:numb) {
    n.combo <- n.combo + choose(numb, j)
  }
  n.combo
}
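Since every nonempty subset of numb predictors is counted exactly once, the loop above just computes $\sum_{j=1}^{\text{numb}} \binom{\text{numb}}{j} = 2^{\text{numb}} - 1$. An equivalent one-liner (illustrative, not the notes' code; the name `number.of.combos.fast` is chosen here):

```r
# 2^numb - 1: the number of nonempty subsets of numb predictors.
number.of.combos.fast <- function(numb) 2^numb - 1
number.of.combos.fast(8)   # 255, matching the 255 subset models below
```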
#--------------------------------------------------------
create.combos <- function(numb) {
  n.combo <- number.of.combos(numb)
  combo <- matrix(0, n.combo, numb)
  temp <- next.combo(combo, numb, {}, 0, 1)[[1]]
  temp[order(apply(temp != 0, 1, sum)), ]
}
We create the subsets
combos <- create.combos(8)
rownames(combos) <- c(1:255)
combos[c(1, 2, 3, 11, 12, 13, 253, 254, 255), ]
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
1      1    0    0    0    0    0    0    0
2      0    2    0    0    0    0    0    0
3      0    0    3    0    0    0    0    0
11     1    0    0    4    0    0    0    0
12     1    0    0    0    5    0    0    0
13     1    0    0    0    0    6    0    0
253    1    0    3    4    5    6    7    8
254    0    2    3    4    5    6    7    8
255    1    2    3    4    5    6    7    8
We create regression models for all possible subsets and plot the SSE.
We use the training and test sets from above. Because we want to use the full training sample for this part we combine the cv.train and cv.test for the training sample.
train.X.orig <- data.orig[c(cv.sets$cv.train[[1]], cv.sets$cv.test[[1]]), pred]
train.Y.orig <- data.orig[c(cv.sets$cv.train[[1]], cv.sets$cv.test[[1]]), resp]
test.X.orig <- data.orig[cv.sets$test, pred]
test.Y.orig <- data.orig[cv.sets$test, resp]
train.X.std <- f.data.std(train.X.orig)
train.Y.std <- f.data.std(train.Y.orig)
# The returned coef are for the original data
n.data <- length(train.Y.std)
lpm.all <- {}
for (i in 1:dim(combos)[1]) {
  X <- train.X.std[, combos[i, ]]
  lpm <- lm(train.Y.std ~ X)
  lpm$coefficients <- std.to.orig(lpm$coefficients, apply(train.X.orig, 2, mean),
      mean(train.Y.orig), apply(train.X.orig, 2, sd), sd(train.Y.orig))
  lpm.all <- c(lpm.all, list(lpm))
}
lpm.all[[1]]

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)            X
     1.5626       0.7091

lpm.all[[9]]

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight
    -0.1422       0.6771       0.4752
Now plot the results:
plot(-1, 0, xlim=c(0, 8), ylim=c(0, 100), xlab="k", ylab="SSE")
for (i in 1:dim(combos)[1]) {
  points(sum(combos[i, ] > 0), sum(lpm.all[[i]]$residuals^2))
}
We would like to tabulate the results, so we need a couple more functions. The first one allows us to specify the number of digits that we will display, and the second creates a table of our results.
#--------------------------------------------------------
trunc <- function(numb, dig) {
  mask <- 10^dig
  floor(mask * numb) / mask
}
#--------------------------------------------------------
create. table �- function ( combos, Y) {
n. data �- length( Y)
n. combo �- dim( combos)[ 1]
table �- matrix( 0, n. combo�1, 4)
TSS �- sum(( Y - mean( Y))^ 2)
table[ 1, 2] �- trunc( TSS, 5)
table[ 1, 3] �- 0
table[ 1, 4] �- trunc( table[ 1, 2]/( n. data- 2), 5)
row. names �- ” None”
for ( i in 1: n. combo) {
temp1 �- format( combos[ i,])
temp2 �-{}
for ( j in 1: length( temp1)) {
if ( temp1[ j] ! � ” 0”)
temp2 �- paste( temp2, ” X”, temp1[ j], sep�””)
}
row. names �- c( row. names, temp2)
table[ i�1, 1] �- trunc( lpm. all[[ i]]$ df. residual, 5)
table[ i�1, 2] �- trunc( sum( lpm. all[[ i]]$ residual^2), 5)
table[ i�1, 3] �- trunc( 1 - table[ i�1, 2]/ TSS, 5)
table[ i�1, 4] �- trunc( table[ i�1, 2]/( n. data- 2), 5)
}
dimnames( table) �- list( row. names, c(” df”,” SSE”,” R^2”,” MSE”))
table
}
table <- create.table(combos, train.Y.std)
table[1:15,]
     df      SSE     R^2     MSE
None  0 66.00000 0.00000 1.01538
X1   65 26.58699 0.59716 0.40903
X2   65 58.53916 0.11304 0.90060
X3   65 63.63167 0.03588 0.97894
X4   65 61.24990 0.07197 0.94230
X5   65 48.35566 0.26733 0.74393
X6   65 49.64103 0.24786 0.76370
X7   65 58.32761 0.11624 0.89734
X8   65 52.63896 0.20244 0.80983
X1X2 64 23.53202 0.64345 0.36203
X1X3 64 26.42891 0.59956 0.40659
X1X4 64 23.44072 0.64483 0.36062
X1X5 64 25.48333 0.61388 0.39205
X1X6 64 26.55899 0.59759 0.40859
X1X7 64 26.57080 0.59741 0.40878
We also want the minimum SSE for each subset size, which we get with the next function.
#--------------------------------------------------------
create.mins <- function (lpm.all, n.combo, numb) {
  mins <- matrix(10^(30), numb+1, 2)
  # Look at all the combos and find the minimum for each category (rank)
  for (i in 1:n.combo) {
    temp <- sum(lpm.all[[i]]$residuals^2)
    if (temp < mins[lpm.all[[i]]$rank, 2]) {
      mins[lpm.all[[i]]$rank, 1] <- i
      mins[lpm.all[[i]]$rank, 2] <- temp
    }
  }
  mins[1, 1] <- 0
  mins[1, 2] <- table[1, 2]
  mins
}
mins <- create.mins(lpm.all, dim(combos)[1], 8)
mins
      [,1]     [,2]
 [1,]    0 66.00000
 [2,]    1 26.58699
 [3,]   11 23.44073
 [4,]   42 21.89790
 [5,]   98 20.95709
 [6,]  175 20.47357
 [7,]  230 19.82302
 [8,]  248 18.89969
 [9,]  255 18.34662
We can add that information to the plot
lines(0:p, mins[1:(p+1), 2])
points(0:p, mins[1:(p+1), 2], col="red")
Figure 25.
for (i in 0:p) {
  cat(dimnames(table)[[1]][mins[i+1, 1]+1], table[mins[i+1, 1]+1, 2], "\n")
}
and display the results.
None 66
X1 26.58699
X1X4 23.44072
X1X2X8 21.89789
X1X2X4X5 20.95709
X1X2X4X5X8 20.47356
X1X2X4X5X6X8 19.82302
X1X2X3X4X5X6X8 18.89969
X1X2X3X4X5X6X7X8 18.34661
# Display all the best
for (i in 1:(p-1)) {
  print(lpm.all[[mins[i+1, 1]]])
}

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)            X
     1.5626       0.7091

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol        Xlbph
    -0.1744       0.6960       0.4771

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight       Xpgg45
   -1.96819      0.60525      0.51626      0.02751

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi
   -1.11996      0.59307      0.32478      0.02477      0.13559

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi       Xpgg45
   -1.07243      0.57009      0.36866      0.02129      0.10363      0.29449

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight        Xlbph         Xsvi         Xlcp
   -0.93951      0.61690      0.35376      0.02177      0.14212     -0.47232
     Xpgg45
    0.14464

Call:
lm(formula = train.Y.std ~ X)

Coefficients:
(Intercept)      Xlcavol     Xlweight         Xage        Xlbph         Xsvi
   -0.79405      0.64159      0.37983     -0.02216      0.14823      0.58202
       Xlcp       Xpgg45
   -0.19252      0.33245
3.8 Variance Inflation Factors (VIFs)
Under ordinary least squares (OLS) regression,
$$b = (X^T X)^{-1} X^T Y$$
and
$$\sigma_b^2 = \mathrm{Var}(b) = (X^T X)^{-1} X^T \sigma_Y^2 I \, X (X^T X)^{-1} = \sigma_Y^2 (X^T X)^{-1}$$
If we have used the correlation transformation, we obtain
$$\sigma_{b'}^2 = \mathrm{Var}(b') = \sigma_{Y'}^2 (X'^T X')^{-1} = r_{XX}^{-1}$$
The variance inflation factors (VIFs) are then the diagonal elements of $r_{XX}^{-1}$.
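The calculation is direct: invert the predictor correlation matrix and read off its diagonal. A small numpy sketch on hypothetical collinear data (the R code for the body-fat example appears in the next section):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # strongly collinear with x1
x3 = rng.normal(size=n)              # nearly independent of the others
X = np.column_stack([x1, x2, x3])

r_xx = np.corrcoef(X, rowvar=False)  # predictor correlation matrix
vif = np.diag(np.linalg.inv(r_xx))   # VIFs: diagonal of the inverse

# Collinear predictors inflate the variance of their coefficients;
# the independent predictor stays close to the minimum value of 1
assert vif[0] > 10 and vif[1] > 10 and vif[2] < 2
```

A VIF near 1 indicates no inflation; values well above 10 are a common warning sign of serious multicollinearity.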
3.9 Weighted Least Squares (WLS) Regression
We consider a weighted sum of squares for error
$$Q = \sum_i w_i (Y_i - \hat{Y}_i)^2$$
assuming
$$Y = X\beta + \varepsilon$$
but not assuming that the error variances are constant. We set
$$W = \begin{bmatrix} w_1 & & 0 \\ & \ddots & \\ 0 & & w_n \end{bmatrix}$$
Then
$$(X^T W X) b_w = X^T W Y$$
or
$$b_w = (X^T W X)^{-1} X^T W Y$$
and
$$\sigma_{b_w}^2 = \sigma^2 (X^T W X)^{-1}$$
Note that we may set
$$W^{1/2} = \begin{bmatrix} \sqrt{w_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{w_n} \end{bmatrix}$$
so
$$W^{1/2} Y = (W^{1/2} X)\beta + W^{1/2}\varepsilon$$
Note that this is equivalent to transforming the response variable $Y$ and the predictor variables in the $X$ matrix to become
$$Y' = W^{1/2} Y$$
and
$$X' = W^{1/2} X$$
respectively, and re-writing the model as
$$Y' = X'\beta + \varepsilon'$$
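The equivalence between the weighted normal equations and OLS on the transformed variables can be checked numerically. A numpy sketch with hypothetical data and weights (reciprocal error variances, as suggested below):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])   # intercept + slope design matrix
sigma = 0.2 * (1 + x)                  # non-constant error standard deviation
y = 1.0 + 0.5 * x + rng.normal(scale=sigma)
w = 1.0 / sigma**2                     # weights = reciprocal error variances
W = np.diag(w)

# Direct WLS solution: b_w = (X^T W X)^{-1} X^T W Y
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: OLS on the transformed variables W^{1/2} Y and W^{1/2} X
Xs = np.sqrt(w)[:, None] * X
ys = np.sqrt(w) * y
b_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

assert np.allclose(b_w, b_ols)
```

Both routes give the same coefficient vector; the transformed-variable form is often preferred numerically because it avoids building the n-by-n matrix W.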
The weights $w_i$ are often taken as the reciprocals of the (unequal) error variances, or some function related to them. We will now take a look at some robust methods of regression.
3.10 Ridge Regression
Recall that the least-squares solution vector $b$ is unbiased and of minimum variance in the set of all linear unbiased estimators.
Note: We have been restricting ourselves to finding the linear unbiased estimator that has minimum variance. Recall that
$$MSE = \text{Variance} + (\text{Bias})^2$$
We have been restricting ourselves to situations where bias = 0 and hence our MSE is equivalent to the variance. Now we actually modify least squares to force biased estimators for the regression coefficients by adding a small amount of known bias $c$. (Usually $0 < c < 1$.) We then see if there is a value of $c$ for which $MSE = \text{Variance} + (\text{Bias})^2$ is minimized. We are looking for a situation where allowing biased estimators results in a smaller variance and a resulting MSE that is lower than if we restricted ourselves to only permitting unbiased estimators. This is commonly called the variance-bias trade-off, i.e. there may be a situation where $c > 0$ corresponds to a lower variance (more precise) and a lower combined MSE (actually we can prove that this is indeed true) than when $c = 0$ and we have the minimum variance linear unbiased estimator (BLUE). When an estimator has small (known) bias and is more precise than an unbiased estimator we may actually prefer the biased estimator, since it is more likely (probable) to be close to the true parameter value.
d.file <- paste(data.dir, "ch07ta01.dat", sep = "/")
d.data <- matrix(scan(d.file), ncol=4, byrow=T)
names <- c("Triceps", "Thigh", "MidArm", "BodyFat")
dimnames(d.data) <- list(1:20, names)
df.data <- as.data.frame(d.data)
Ridge regression requires that the data be transformed by a correlation transformation, so, as we saw:
$$Y_i' = \frac{1}{\sqrt{n-1}} \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$$
$$X_{ik}' = \frac{1}{\sqrt{n-1}} \left( \frac{X_{ik} - \bar{X}_k}{s_k} \right) \quad (k = 1, \ldots, p-1).$$
Get the mean and standard deviations:
(bar <- apply(d.data, 2, mean))
 Triceps   Thigh  MidArm BodyFat
  25.305  51.170  27.620  20.195
(s <- apply(d.data, 2, sd))
 Triceps    Thigh   MidArm  BodyFat
5.023259 5.234612 3.647147 5.106186
and compute the new variables:
d.data.t <- t((t(d.data) - bar)/s)/sqrt(dim(d.data)[1]-1)
We also compute the correlation matrices:
$$\underset{(p-1)\times(p-1)}{X'^T X'} = r_{XX} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1,p-1} \\ r_{21} & 1 & \cdots & r_{2,p-1} \\ \vdots & \vdots & & \vdots \\ r_{p-1,1} & r_{p-1,2} & \cdots & 1 \end{bmatrix}$$
$$\underset{(p-1)\times 1}{X'^T Y'} = r_{YX} = \begin{bmatrix} r_{Y1} \\ r_{Y2} \\ \vdots \\ r_{Y,p-1} \end{bmatrix}$$
(r.xx <- cor(d.data[, 1:3]))
          Triceps     Thigh    MidArm
Triceps 1.0000000 0.9238425 0.4577772
Thigh   0.9238425 1.0000000 0.0846675
MidArm  0.4577772 0.0846675 1.0000000
(r.xy <- cor(d.data[, 1:3], d.data[, 4]))
             [,1]
Triceps 0.8432654
Thigh   0.8780896
MidArm  0.1424440
I <- diag(1, 3, 3)
For ordinary least squares, the normal equations are of the form
$$(X^T X) b = X^T Y$$
while after the correlation transformation, the least squares equations are
$$r_{XX} b = r_{YX}.$$
For ridge regression, a biasing constant $(c \ge 0)$ is introduced to give
$$(r_{XX} + cI) b_R = r_{YX}$$
so
$$b_R = (r_{XX} + cI)^{-1} r_{YX}.$$
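The closed form can be checked with a short numpy sketch. Using a hypothetical 2-predictor correlation matrix, we solve the biased normal equations for a few values of $c$ and confirm that the coefficient vector shrinks as $c$ grows:

```python
import numpy as np

# Hypothetical correlation matrices for two collinear predictors
r_xx = np.array([[1.0, 0.9],
                 [0.9, 1.0]])
r_yx = np.array([0.8, 0.7])

def ridge_coef(c):
    """Solve (r_xx + c*I) b_R = r_yx for the standardized ridge coefficients."""
    return np.linalg.solve(r_xx + c * np.eye(2), r_yx)

b0 = ridge_coef(0.0)   # c = 0 recovers ordinary least squares
b1 = ridge_coef(0.1)
b2 = ridge_coef(1.0)

# Increasing c shrinks the coefficient vector toward zero
assert np.linalg.norm(b1) < np.linalg.norm(b0)
assert np.linalg.norm(b2) < np.linalg.norm(b1)
```

Note also how the collinearity shows up at $c = 0$: the two predictors have similar correlations with the response, yet one OLS coefficient comes out negative; a small $c$ already stabilizes the signs.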
The ridge trace is a simultaneous plot of the $p-1$ ridge standardized regression coefficients $b$ versus $c$ (generally on a log scale). The VIFs fall rapidly as $c$ begins to move away from 0 and then tend to stabilize as $c$ increases. We choose the smallest $c$ such that the regression coefficients become stable (see the ridge trace) and the VIF values become reasonably small (close to 1).
Ridge.coef <- matrix(0, 29, 4)
i <- 1
for (c in c(seq(0, .01, by=.001), seq(.02, .1, by=.01), seq(.2, 1.0, by=.1))) {
  Ridge.coef[i, 1] <- c
  Ridge.coef[i, 2:4] <- solve(r.xx + c*I, r.xy)
  i <- i + 1
}
dimnames(Ridge.coef) <- list(NULL, c("c", "b1", "b2", "b3"))
Ridge.coef
          c        b1          b2           b3
 [1,] 0.000 4.2637046 -2.92870065 -1.561416794
 [2,] 0.001 2.0347999 -0.94080230 -0.708676831
 [3,] 0.002 1.4406628 -0.41128508 -0.481273455
 [4,] 0.003 1.1652742 -0.16612384 -0.375799189
 [5,] 0.004 1.0063236 -0.02483665 -0.314865607
 [6,] 0.005 0.9028128  0.06699318 -0.275139506
 [7,] 0.006 0.8300164  0.13142332 -0.247162829
 [8,] 0.007 0.7760090  0.17909237 -0.226373864
 [9,] 0.008 0.7343315  0.21576257 -0.210301840
[10,] 0.009 0.7011811  0.24482646 -0.197492074
[11,] 0.010 0.6741729  0.26841149 -0.187032336
[12,] 0.020 0.5463339  0.37740366 -0.136871542
[13,] 0.030 0.5003777  0.41341436 -0.118077883
[14,] 0.040 0.4760035  0.43023669 -0.107583242
[15,] 0.050 0.4604598  0.43924481 -0.100508309
[16,] 0.060 0.4493942  0.44432014 -0.095186663
[17,] 0.070 0.4409157  0.44714834 -0.090893478
[18,] 0.080 0.4340699  0.44857934 -0.087262359
[19,] 0.090 0.4283233  0.44908795 -0.084087845
[20,] 0.100 0.4233540  0.44896004 -0.081245545
[21,] 0.200 0.3914248  0.43472004 -0.061289960
[22,] 0.300 0.3703229  0.41540311 -0.047886515
[23,] 0.400 0.3529429  0.39658090 -0.037644774
[24,] 0.500 0.3377199  0.37905822 -0.029500232
[25,] 0.600 0.3240420  0.36291486 -0.022888815
[26,] 0.700 0.3115856  0.34806526 -0.017448562
[27,] 0.800 0.3001454  0.33438723 -0.012926334
[28,] 0.900 0.2895758  0.32175831 -0.009136642
[29,] 1.000 0.2797659  0.31006643 -0.005939486
vif <- matrix(0, 29, 5)
i <- 1
for (c in c(seq(0, .01, by=.001), seq(.02, .1, by=.01), seq(.2, 1.0, by=.1))) {
  vif[i, 1] <- c
  a.inv <- solve(r.xx + c*I, I)
  vif[i, 2:4] <- diag(a.inv %*% r.xx %*% a.inv)
  vif[i, 5] <- 1 - sum((Ridge.coef[i, 2:4] %*% t(d.data.t[, 1:3]) - d.data.t[, 4])^2)
  i <- i + 1
}
dimnames(vif) <- list(NULL, c("c", "VIF1", "VIF2", "VIF3", "R^2"))
vif
          c        VIF1        VIF2        VIF3       R^2
 [1,] 0.000 708.8429142 564.3433857 104.6060050 0.8013586
 [2,] 0.001 125.7308694 100.2740321  19.2809671 0.7943486
 [3,] 0.002  50.5591892  40.4483104   8.2797004 0.7901140
 [4,] 0.003  27.1750112  21.8376013   4.8561838 0.7878135
 [5,] 0.004  16.9815701  13.7247233   3.3627919 0.7863882
 [6,] 0.005  11.6434185   9.4759221   2.5798503 0.7854212
 [7,] 0.006   8.5033223   6.9764415   2.1185423 0.7847225
 [8,] 0.007   6.5013345   5.3827244   1.8237725 0.7841937
 [9,] 0.008   5.1471650   4.3045740   1.6237998 0.7837792
[10,] 0.009   4.1886926   3.5413399   1.4817327 0.7834452
[11,] 0.010   3.4855023   2.9812730   1.3770252 0.7831698
[12,] 0.020   1.1025508   1.0805407   1.0105134 0.7817952
[13,] 0.030   0.6256980   0.6969055   0.9234580 0.7812036
[14,] 0.040   0.4527887   0.5552892   0.8814032 0.7807894
[15,] 0.050   0.3704539   0.4858773   0.8531067 0.7804224
[16,] 0.060   0.3243741   0.4454349   0.8306002 0.7800589
[17,] 0.070   0.2956068   0.4188820   0.8110926 0.7796807
[18,] 0.080   0.2761480   0.3998444   0.7933948 0.7792794
[19,] 0.090   0.2621427   0.3852497   0.7769250 0.7788510
[20,] 0.100   0.2515476   0.3734680   0.7613679 0.7783935
[21,] 0.200   0.2052518   0.3078210   0.6341536 0.7722580
[22,] 0.300   0.1837581   0.2685829   0.5384602 0.7638181
[23,] 0.400   0.1675811   0.2383292   0.4634200 0.7538008
[24,] 0.500   0.1540398   0.2136522   0.4033143 0.7427376
[25,] 0.600   0.1422970   0.1930126   0.3543686 0.7310054
[26,] 0.700   0.1319468   0.1754722   0.3139517 0.7188740
[27,] 0.800   0.1227368   0.1603865   0.2801722 0.7065382
[28,] 0.900   0.1144865   0.1472858   0.2516396 0.6941399
[29,] 1.000   0.1070578   0.1358157   0.2273113 0.6817825
plot(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 2], ylim=c(-2, 3), type="l",
     xlab="log(c)", ylab="b", col="red")
lines(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 3], col="blue")
lines(log(Ridge.coef[2:29, 1]), Ridge.coef[2:29, 4], col="green")
legend(-2, 3, legend=c("b1", "b2", "b3"),
       col=c("red", "blue", "green"), lty=1)
Figure 26.
The following combines some of the previous operations into one function.
ridge <- function (data, p.cols, r.cols, lambda, use.c = T) {
  # Get the mean and standard deviations
  # p.cols are the columns for the predictors
  # r.cols are the columns for the response
  data.std <- f.data.std(data)
  r.xx <- t(data.std[, p.cols]) %*% data.std[, p.cols]
  r.xy <- t(data.std[, p.cols]) %*% data.std[, r.cols]
  n.cols <- length(r.xy)
  I <- diag(1, n.cols, n.cols)
  Ridge.coef <- matrix(0, length(lambda), n.cols+2)
  i <- 1
  for (c in lambda) {
    if (use.c == F) {
      Ridge.coef[i, 1] <- sum(d/(d+L[i]))
    } else {
      Ridge.coef[i, 1] <- c
    }
    r.xx.c <- r.xx + c*I
    Ridge.coef[i, 2:(n.cols+2)] <- std.to.orig(c(0, lm(r.xy ~ r.xx.c - 1)$coefficients),
        apply(data.orig[, pred], 2, mean), mean(data.orig[, resp]),
        apply(data.orig[, pred], 2, sd), sd(data.orig[, resp]))
    i <- i + 1
  }
  vif <- matrix(0, length(lambda), n.cols+2)
  i <- 1
  for (c in lambda) {
    if (use.c == F) {
      vif[i, 1] <- sum(d/(d+L[i]))
    } else {
      vif[i, 1] <- c
    }
    a.inv <- solve(r.xx + c*I, I)
    vif[i, (1:n.cols)+1] <- diag(a.inv %*% r.xx %*% a.inv)
    vif[i, n.cols+2] <- 1 - sum((Ridge.coef[i, 2:4] %*% t(data.std[, 1:3]) - data.std[, 4])^2)
    i <- i + 1
  }
  list(coef=Ridge.coef, vif=vif)
}
In some cases the biasing parameter $c$ is not used directly, but rather we use a function of it:
$$df(c) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + c}$$
where $d_j$ is the diagonal entry in the singular value decomposition.
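This "effective degrees of freedom" runs from $p$ at $c = 0$ down toward 0 as $c$ grows, which gives the biasing parameter a more interpretable scale. A small numpy sketch on a hypothetical matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))   # hypothetical 30 x 4 predictor matrix

d = np.linalg.svd(X, compute_uv=False)  # singular values d_j

def df(c):
    """Effective degrees of freedom: sum of d_j^2 / (d_j^2 + c)."""
    return float(np.sum(d**2 / (d**2 + c)))

# c = 0 recovers the full column rank p = 4; df shrinks as c grows
assert abs(df(0.0) - 4.0) < 1e-9
assert df(10.0) < df(1.0) < df(0.0)
```

The R code below does the same thing with the squared singular values (equivalently, the eigenvalues of XᵀX) and then chooses L values to give a good spread of df(L).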
d <- svd(data.orig[, pred])$d^2
# or
(d <- eigen(t(data.orig[, pred]) %*% data.orig[, pred])$values)
[1] 4.790826e+05 6.190704e+04 2.109043e+02 1.756330e+02 6.479861e+01
[6] 4.452384e+01 2.023903e+01 8.093138e+00
# Set up points to give a good df(L) spread
L <- c(seq(0, 7, by=7/10), seq(8, 20, by=12/10), seq(22, 50, by=28/10),
       seq(55, 120, by=65/10), seq(130, 400, by=270/10), seq(440, 3500, by=3160/10),
       seq(3700, 200000, by=196300/10), seq(250000, 5000000, by=475000/10))
ridge.res <- ridge(data.orig, pred, resp, L, F)
Ridge.coef <- ridge.res$coef
Ridge.coef[1:5,]
         [,1]      [,2]      [,3]      [,4]        [,5]      [,6]      [,7]
[1,] 8.000000 0.6693993 0.5870229 0.4544605 -0.01963721 0.1070544 0.7661559
[2,] 7.853506 0.6423863 0.5779035 0.4535208 -0.01910545 0.1056473 0.7567784
[3,] 7.721665 0.6176973 0.5692262 0.4525009 -0.01859826 0.1043013 0.7479234
[4,] 7.601828 0.5950669 0.5609539 0.4514107 -0.01811371 0.1030117 0.7395445
[5,] 7.491991 0.5742682 0.5530538 0.4502586 -0.01765007 0.1017743 0.7316005
            [,8]       [,9]       [,10]
[1,] -0.10547357 0.04513597 0.004525323
[2,] -0.09668191 0.04756349 0.004367505
[3,] -0.08847371 0.04974629 0.004224841
[4,] -0.08079499 0.05171680 0.004095576
[5,] -0.07359826 0.05350244 0.003978201
plot(Ridge.coef[1:dim(Ridge.coef)[1], 1], Ridge.coef[1:dim(Ridge.coef)[1], 3],
     ylim=c(-.2, 1), type="l", xlab="df(c)", ylab="b")
for (i in 2:p) {
  lines(Ridge.coef[1:dim(Ridge.coef)[1], 1], Ridge.coef[1:dim(Ridge.coef)[1], i+1])
}
legend(0, 1, legend=paste("b", 1:p, sep=""), col=1:p, lty=1)
Figure 27. Note the use of legend.
3.11 Smoothing Revisited
We saw a simple smoothing method (running mean) earlier. As the examples that we have considered in regression show, you may need to have some knowledge of the surface that you are trying to fit in order to know which type of regression to use. A better model may come from the class of kernel smoothers, which use local information obtained from the data.
We will revisit the running mean and look at it as a template for the kernel smoothers in general.
The smoothing methods use a window that is typically centered at the point at which we wish to obtain an estimate and has a width that is selected by the user. These methods compute a weighted mean of all the points in the window, with the difference between the methods arising from the way in which the weights are determined. (As with the interpolation methods, there are concerns about behaviour at the end points.)
The simplest example of such smoothers is the running mean, which has equal weights (1/n) for the points in the window. We will use $\lambda$ as our width parameter. The running mean uses $2\lambda + 1$ points in the window.
For example,
Figure 28.
Here the window moves across the data and computes the average of 9 data points.
One red marker is the midpoint of the window drawn with solid lines and the other red marker is the midpoint of the next window (dashed lines). One green marker is in the first window but not in the second, whereas the other green marker is in the second but not the first. The predicted line (shown in red) is discontinuous because the running mean window has the same contents until the leading edge encounters a new point. At that position, the window drops the point at the trailing edge and a jump occurs.
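The running mean itself is straightforward to sketch. This Python function averages the $2\lambda + 1$ points in each full window; for simplicity only interior positions are returned, so the end-point concerns mentioned above are side-stepped:

```python
def running_mean(y, lam):
    """Smooth y with a centered window of 2*lam + 1 equal weights.

    Only positions where the full window fits are returned, so the
    result is shorter than y by 2*lam points.
    """
    width = 2 * lam + 1
    return [sum(y[i:i + width]) / width for i in range(len(y) - width + 1)]

# A window sliding over a step in the data shows the jumps described
# above: the output changes only when the leading edge picks up a new
# value and the trailing edge drops an old one.
y = [0, 0, 0, 0, 3, 3, 3, 3]
print(running_mean(y, 1))  # -> [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
```

Each jump of size 1 here corresponds to one new point entering and one old point leaving a window of three.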
In many cases we would prefer a smooth curve. This can be obtained by the use of a window that does not suddenly drop (or gain) points but allows them to 'fade' in and out. There are three commonly used kernels that have this property.
The first has a familiar appearance as it is the Gaussian kernel
$$G_\lambda = \exp\left(-\left(\frac{x - x_0}{\lambda}\right)^2\right)$$
the second is the Epanechnikov kernel
$$E_\lambda = \begin{cases} \dfrac{3}{4}\left(1 - \left(\dfrac{x - x_0}{\lambda}\right)^2\right) & \text{if } |x - x_0| \le \lambda \\ 0 & \text{otherwise} \end{cases}$$
and the third is the Tri-cube
$$T_\lambda = \begin{cases} \left(1 - \left|\dfrac{x - x_0}{\lambda}\right|^3\right)^3 & \text{if } |x - x_0| \le \lambda \\ 0 & \text{otherwise} \end{cases}$$
and they look like
Figure 29.
(The Gaussian and tri-cube are differentiable everywhere.)
Figure 30. Figure 31.
Note that the wider window produces a smoother curve but a worse fit.
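The three kernels defined earlier can be written directly in code. A Python sketch, where x0 is the window centre and lam the half-width $\lambda$ (the exact Gaussian exponent here follows the common convention, as the scanned formula is partly illegible):

```python
import math

def gaussian(x, x0, lam):
    # Smooth everywhere; never exactly zero
    return math.exp(-((x - x0) / lam) ** 2)

def epanechnikov(x, x0, lam):
    u = (x - x0) / lam
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def tricube(x, x0, lam):
    u = abs(x - x0) / lam
    return (1 - u ** 3) ** 3 if u <= 1 else 0.0

# All three peak at the centre of the window ...
assert gaussian(0, 0, 1) == 1.0 and tricube(0, 0, 1) == 1.0
assert epanechnikov(0, 0, 1) == 0.75
# ... and the compact kernels vanish outside |x - x0| <= lam
assert epanechnikov(1.5, 0, 1) == 0.0 and tricube(1.5, 0, 1) == 0.0
```

The tri-cube value at half the window width, $(1 - 0.5^3)^3 \approx 0.66992$, is exactly the W value that shows up repeatedly in the LOWESS trace output later in this section.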
Figure 32. Figure 33.
Figure 34. Figure 35.
It is possible to modify the ideas of kernel smoothing to make use of regression concepts.
3.12 LOWESS
The idea behind LOWESS (or LOESS) [Cleveland, W.S. (1979) "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, Vol. 74, pp. 829-836] is that a span (or window) of the data $(x_j, x_{j+m})$ is selected and the midpoint (except at the ends of the data) is chosen as the point for estimation (similar to kernel smoothing). The distance from the point to be estimated to the other points in the span is found and the distances are scaled by the maximum distance in the span. Weights are determined - one such weighting function is the tricube weight function
$$w(x) = \begin{cases} \left(1 - |x|^3\right)^3, & |x| < 1 \\ 0, & |x| \ge 1. \end{cases}$$
A weighted least squares (WLS) fit is then done on the data in the span, and used to predict a new value for the point to be estimated. A simplistic (not robust) version follows.
f.weight <- function (x) {
  (1 - abs(x)^3)^3
}
Note the do.print=F. If this is changed to do.print=T then the loop that prints out the information about the process is active (by default it is not). It is possible to set default values for arguments in this manner. If a default applies then the argument need not be in the call to the function. See [Writing your own functions] [Named arguments and defaults] in An Introduction to R.
reg.value <- function (d.subset, pt, do.print=F) {
  # Get the distance from pt to every point in the interval.
  d.dist <- abs(d.subset[, 1] - pt)
  scaled.dist <- d.dist/max(d.dist)
  W <- f.weight(scaled.dist)
  n <- length(W)
  # Do a least squares fit (lm for linear model)
  # lm(y ~ x, data, subset, weights)
  lw <- lm(d.subset[, 2] ~ d.subset[, 1], as.data.frame(d.subset), 1:n, W)
  if (do.print) {
    print(pt)
    print(cbind(d.subset[, 1], d.subset[, 2], d.dist, scaled.dist, W))
    print(lw$coefficients)
    print(lw$coefficients %*% c(1, pt))
    print("************************************")
  }
  lw$coefficients %*% c(1, pt)
}
In the following, we start with a first span or 'band' of data and hold it fixed until the point to be estimated gets to the middle (Part 1), then the point and span move together until the span can move no further (Part 2), and then the point moves to the end of the last span (Part 3).
f.lowess <- function (d.data, band, do.print=F) {
  # d.data has the x and y values of the data
  # band is the bandwidth
  # Initialize
  v <- rep(0, dim(d.data)[1])  # Holds the result
  i <- 1
  first <- 1
  last <- first + band - 1
  mid <- floor((first+last)/2)
  ind.pt <- 1  # Points to position of pt in the band
  # Part 1 - move the pt at which we want the smooth
  # until we reach midpoint of band
  d.subset <- d.data[first:last,]
  while (ind.pt <= mid) {
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
    ind.pt <- ind.pt + 1
  }
  # Part 2 - move the pt with the band
  ind.pt <- ind.pt - 1  # In middle of band
  repeat {
    first <- first + 1
    last <- last + 1
    if (last > dim(d.data)[1]) break
    d.subset <- d.data[first:last,]
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
  }
  # Part 3 - move the midpoint until we reach the end
  ind.pt <- ind.pt + 1
  while (ind.pt <= band) {
    pt <- d.subset[ind.pt, 1]
    v[i] <- reg.value(d.subset, pt, do.print)
    i <- i + 1
    ind.pt <- ind.pt + 1
  }
  v
}
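For readers outside R, the same sliding-window weighted fit can be sketched compactly with numpy. This is a simplistic, non-robust version like the R code above, with the end-of-data handling simplified to clamping the window inside the data range (the synthetic noisy-sine data mirrors the example plotted below):

```python
import numpy as np

def tricube(u):
    # Tricube weight: (1 - |u|^3)^3 on [-1, 1], zero outside
    return np.clip(1 - np.abs(u) ** 3, 0, None) ** 3

def simple_lowess(x, y, band):
    """Local weighted linear fit at each x[i] over its band nearest points."""
    n = len(x)
    fitted = np.empty(n)
    half = band // 2
    for i in range(n):
        # Clamp the window to the data range (cf. Parts 1-3 above)
        first = min(max(i - half, 0), n - band)
        idx = slice(first, first + band)
        d = np.abs(x[idx] - x[i])
        w = tricube(d / d.max())
        # Weighted least squares line through the window
        A = np.column_stack([np.ones(band), x[idx]])
        WA = w[:, None] * A
        beta = np.linalg.solve(A.T @ WA, A.T @ (w * y[idx]))
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

x = np.linspace(0, 6, 61)
rng = np.random.default_rng(4)
y = np.sin(x) + rng.normal(scale=0.3, size=61)
v = simple_lowess(x, y, 13)
# The smooth should track sin(x) more closely than the noisy data do
assert np.mean((v - np.sin(x)) ** 2) < np.mean((y - np.sin(x)) ** 2)
```

As in the R version, each fitted value comes from its own locally weighted line; only the evaluation at x[i] is kept.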
In the following, the R version of lowess is also used for comparison.
library(stats)
d.lowess <- cbind(x.rough, y.rough)
oldpar <- par(mfrow = c(3, 1))
for (b in c(9, 13, 21)) {
  v <- f.lowess(d.lowess, b, T)
  plot(d.lowess[, 1], d.lowess[, 2], col="red", pch=20,
       main=paste("Lowess with bandwidth =", b))
  curve(sin, 0, 6, add=T, col="blue")
  lines(d.lowess[, 1], v, col="black")
  lines(lowess(d.lowess, f=b/61), col="green")
}
par(oldpar)
...
[1] 0
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.0 0.00000000 1.00000000
 [2,] 0.1 -0.10563567 0.1 0.08333333 0.99826489
 [3,] 0.2  0.01353860 0.2 0.16666667 0.98617531
 [4,] 0.3  0.21000704 0.3 0.25000000 0.95385361
 [5,] 0.4  0.77435571 0.4 0.33333333 0.89295331
 [6,] 0.5  0.31873792 0.5 0.41666667 0.79830593
 [7,] 0.6  0.50385628 0.6 0.50000000 0.66992188
 [8,] 0.7  0.45257248 0.7 0.58333333 0.51489433
 [9,] 0.8  0.84182740 0.8 0.66666667 0.34847330
[10,] 0.9  0.33869364 0.9 0.75000000 0.19322586
[11,] 1.0  1.25643075 1.0 0.83333333 0.07477612
[12,] 1.1  1.11691797 1.1 0.91666667 0.01212663
[13,] 1.2  0.64749519 1.2 1.00000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2425715     1.2950873
           [,1]
[1,] -0.2425715
[1] "************************************"
[1] 0.1
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.1 0.0909091 0.99774775
 [2,] 0.1 -0.10563567 0.0 0.0000000 1.00000000
 [3,] 0.2  0.01353860 0.1 0.0909091 0.99774775
 [4,] 0.3  0.21000704 0.2 0.1818182 0.98207661
 [5,] 0.4  0.77435571 0.3 0.2727273 0.94036966
 [6,] 0.5  0.31873792 0.4 0.3636364 0.86257264
 [7,] 0.6  0.50385628 0.5 0.4545455 0.74388835
 [8,] 0.7  0.45257248 0.6 0.5454545 0.58788237
 [9,] 0.8  0.84182740 0.7 0.6363636 0.40901258
[10,] 0.9  0.33869364 0.8 0.7272727 0.23297941
[11,] 1.0  1.25643075 0.9 0.8181818 0.09252419
[12,] 1.1  1.11691797 1.0 0.9090909 0.01537977
[13,] 1.2  0.64749519 1.1 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2343405     1.2679403
           [,1]
[1,] -0.1075465
[1] "************************************"
[1] 0.2
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.2 0.2 0.97619149
 [2,] 0.1 -0.10563567 0.1 0.1 0.99700300
 [3,] 0.2  0.01353860 0.0 0.0 1.00000000
 [4,] 0.3  0.21000704 0.1 0.1 0.99700300
 [5,] 0.4  0.77435571 0.2 0.2 0.97619149
 [6,] 0.5  0.31873792 0.3 0.3 0.92116732
 [7,] 0.6  0.50385628 0.4 0.4 0.82002586
 [8,] 0.7  0.45257248 0.5 0.5 0.66992188
 [9,] 0.8  0.84182740 0.6 0.6 0.48189030
[10,] 0.9  0.33869364 0.7 0.7 0.28359339
[11,] 1.0  1.25643075 0.8 0.8 0.11621427
[12,] 1.1  1.11691797 0.9 0.9 0.01990251
[13,] 1.2  0.64749519 1.0 1.0 0.00000000
  (Intercept) d.subset[, 1]
   -0.2250528     1.2382417
           [,1]
[1,] 0.02259554
[1] "************************************"
[1] 0.3
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.3 0.3333333 0.89295331
 [2,] 0.1 -0.10563567 0.2 0.2222222 0.96743815
 [3,] 0.2  0.01353860 0.1 0.1111111 0.99589042
 [4,] 0.3  0.21000704 0.0 0.0000000 1.00000000
 [5,] 0.4  0.77435571 0.1 0.1111111 0.99589042
 [6,] 0.5  0.31873792 0.2 0.2222222 0.96743815
 [7,] 0.6  0.50385628 0.3 0.3333333 0.89295331
 [8,] 0.7  0.45257248 0.4 0.4444444 0.75907091
 [9,] 0.8  0.84182740 0.5 0.5555556 0.56875893
[10,] 0.9  0.33869364 0.6 0.6666667 0.34847330
[11,] 1.0  1.25643075 0.7 0.7777778 0.14844970
[12,] 1.1  1.11691797 0.8 0.8888889 0.02637525
[13,] 1.2  0.64749519 0.9 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   -0.2107759     1.1998531
        [,1]
[1,] 0.14918
[1] "************************************"
[1] 0.4
               d.dist scaled.dist          W
 [1,] 0.0 -0.47388308 0.4 0.500 0.66992188
 [2,] 0.1 -0.10563567 0.3 0.375 0.84999297
 [3,] 0.2  0.01353860 0.2 0.250 0.95385361
 [4,] 0.3  0.21000704 0.1 0.125 0.99415206
 [5,] 0.4  0.77435571 0.0 0.000 1.00000000
 [6,] 0.5  0.31873792 0.1 0.125 0.99415206
 [7,] 0.6  0.50385628 0.2 0.250 0.95385361
 [8,] 0.7  0.45257248 0.3 0.375 0.84999297
 [9,] 0.8  0.84182740 0.4 0.500 0.66992188
[10,] 0.9  0.33869364 0.5 0.625 0.43184014
[11,] 1.0  1.25643075 0.6 0.750 0.19322586
[12,] 1.1  1.11691797 0.7 0.875 0.03596253
[13,] 1.2  0.64749519 0.8 1.000 0.00000000
  (Intercept) d.subset[, 1]
   -0.1800034     1.1358898
          [,1]
[1,] 0.2743525
[1] "************************************"
...
[1] 1
              d.dist scaled.dist            W
 [1,] 0.4 0.7743557 0.6 1.0000000 2.955864e-46
 [2,] 0.5 0.3187379 0.5 0.8333333 7.477612e-02
 [3,] 0.6 0.5038563 0.4 0.6666667 3.484733e-01
 [4,] 0.7 0.4525725 0.3 0.5000000 6.699219e-01
 [5,] 0.8 0.8418274 0.2 0.3333333 8.929533e-01
 [6,] 0.9 0.3386936 0.1 0.1666667 9.861753e-01
 [7,] 1.0 1.2564308 0.0 0.0000000 1.000000e+00
 [8,] 1.1 1.1169180 0.1 0.1666667 9.861753e-01
 [9,] 1.2 0.6474952 0.2 0.3333333 8.929533e-01
[10,] 1.3 0.9956303 0.3 0.5000000 6.699219e-01
[11,] 1.4 1.2856589 0.4 0.6666667 3.484733e-01
[12,] 1.5 1.4437838 0.5 0.8333333 7.477612e-02
[13,] 1.6 0.7472126 0.6 1.0000000 0.000000e+00
  (Intercept) d.subset[, 1]
  -0.01039449    0.83800014
          [,1]
[1,] 0.8276056
[1] "************************************"
[1] 1.1
              d.dist scaled.dist          W
 [1,] 0.5 0.3187379 0.6 1.0000000 0.00000000
 [2,] 0.6 0.5038563 0.5 0.8333333 0.07477612
 [3,] 0.7 0.4525725 0.4 0.6666667 0.34847330
 [4,] 0.8 0.8418274 0.3 0.5000000 0.66992188
 [5,] 0.9 0.3386936 0.2 0.3333333 0.89295331
 [6,] 1.0 1.2564308 0.1 0.1666667 0.98617531
 [7,] 1.1 1.1169180 0.0 0.0000000 1.00000000
 [8,] 1.2 0.6474952 0.1 0.1666667 0.98617531
 [9,] 1.3 0.9956303 0.2 0.3333333 0.89295331
[10,] 1.4 1.2856589 0.3 0.5000000 0.66992188
[11,] 1.5 1.4437838 0.4 0.6666667 0.34847330
[12,] 1.6 0.7472126 0.5 0.8333333 0.07477612
[13,] 1.7 0.8396295 0.6 1.0000000 0.00000000
  (Intercept) d.subset[, 1]
   0.02071804    0.81446704
          [,1]
[1,] 0.9166318
[1] "************************************"
[1] 1.2
              d.dist scaled.dist            W
 [1,] 0.6 0.5038563 0.6 1.0000000 0.000000e+00
 [2,] 0.7 0.4525725 0.5 0.8333333 7.477612e-02
 [3,] 0.8 0.8418274 0.4 0.6666667 3.484733e-01
 [4,] 0.9 0.3386936 0.3 0.5000000 6.699219e-01
 [5,] 1.0 1.2564308 0.2 0.3333333 8.929533e-01
 [6,] 1.1 1.1169180 0.1 0.1666667 9.861753e-01
 [7,] 1.2 0.6474952 0.0 0.0000000 1.000000e+00
 [8,] 1.3 0.9956303 0.1 0.1666667 9.861753e-01
 [9,] 1.4 1.2856589 0.2 0.3333333 8.929533e-01
[10,] 1.5 1.4437838 0.3 0.5000000 6.699219e-01
[11,] 1.6 0.7472126 0.4 0.6666667 3.484733e-01
[12,] 1.7 0.8396295 0.5 0.8333333 7.477612e-02
[13,] 1.8 1.3994966 0.6 1.0000000 9.976041e-46
  (Intercept) d.subset[, 1]
    0.2648616     0.6006984
          [,1]
[1,] 0.9856997
[1] "************************************"
...
[1] 5.7
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 0.9 1.0000000 0.00000000
 [2,] 4.9 -1.3073777 0.8 0.8888889 0.02637525
 [3,] 5.0 -1.1968627 0.7 0.7777778 0.14844970
 [4,] 5.1 -0.4363296 0.6 0.6666667 0.34847330
 [5,] 5.2 -0.4661374 0.5 0.5555556 0.56875893
 [6,] 5.3 -0.6437850 0.4 0.4444444 0.75907091
 [7,] 5.4 -0.4652568 0.3 0.3333333 0.89295331
 [8,] 5.5 -0.5301789 0.2 0.2222222 0.96743815
 [9,] 5.6 -0.5135852 0.1 0.1111111 0.99589042
[10,] 5.7 -0.8137505 0.0 0.0000000 1.00000000
[11,] 5.8 -0.8017276 0.1 0.1111111 0.99589042
[12,] 5.9 -0.5136304 0.2 0.2222222 0.96743815
[13,] 6.0  0.2006632 0.3 0.3333333 0.89295331
  (Intercept) d.subset[, 1]
   -2.6373100     0.3778869
           [,1]
[1,] -0.4833547
[1] "************************************"
[1] 5.8
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.0 1.0 0.00000000
 [2,] 4.9 -1.3073777 0.9 0.9 0.01990251
 [3,] 5.0 -1.1968627 0.8 0.8 0.11621427
 [4,] 5.1 -0.4363296 0.7 0.7 0.28359339
 [5,] 5.2 -0.4661374 0.6 0.6 0.48189030
 [6,] 5.3 -0.6437850 0.5 0.5 0.66992188
 [7,] 5.4 -0.4652568 0.4 0.4 0.82002586
 [8,] 5.5 -0.5301789 0.3 0.3 0.92116732
 [9,] 5.6 -0.5135852 0.2 0.2 0.97619149
[10,] 5.7 -0.8137505 0.1 0.1 0.99700300
[11,] 5.8 -0.8017276 0.0 0.0 1.00000000
[12,] 5.9 -0.5136304 0.1 0.1 0.99700300
[13,] 6.0  0.2006632 0.2 0.2 0.97619149
  (Intercept) d.subset[, 1]
   -2.8655861     0.4188919
           [,1]
[1,] -0.4360132
[1] "************************************"
[1] 5.9
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.1 1.00000000 0.00000000
 [2,] 4.9 -1.3073777 1.0 0.90909091 0.01537977
 [3,] 5.0 -1.1968627 0.9 0.81818182 0.09252419
 [4,] 5.1 -0.4363296 0.8 0.72727273 0.23297941
 [5,] 5.2 -0.4661374 0.7 0.63636364 0.40901258
 [6,] 5.3 -0.6437850 0.6 0.54545455 0.58788237
 [7,] 5.4 -0.4652568 0.5 0.45454545 0.74388835
 [8,] 5.5 -0.5301789 0.4 0.36363636 0.86257264
 [9,] 5.6 -0.5135852 0.3 0.27272727 0.94036966
[10,] 5.7 -0.8137505 0.2 0.18181818 0.98207661
[11,] 5.8 -0.8017276 0.1 0.09090909 0.99774775
[12,] 5.9 -0.5136304 0.0 0.00000000 1.00000000
[13,] 6.0  0.2006632 0.1 0.09090909 0.99774775
  (Intercept) d.subset[, 1]
   -3.0145872     0.4450401
           [,1]
[1,] -0.3888509
[1] "************************************"
[1] 6
               d.dist scaled.dist          W
 [1,] 4.8 -0.9327408 1.2 1.00000000 0.00000000
 [2,] 4.9 -1.3073777 1.1 0.91666667 0.01212663
 [3,] 5.0 -1.1968627 1.0 0.83333333 0.07477612
 [4,] 5.1 -0.4363296 0.9 0.75000000 0.19322586
 [5,] 5.2 -0.4661374 0.8 0.66666667 0.34847330
 [6,] 5.3 -0.6437850 0.7 0.58333333 0.51489433
 [7,] 5.4 -0.4652568 0.6 0.50000000 0.66992188
 [8,] 5.5 -0.5301789 0.5 0.41666667 0.79830593
 [9,] 5.6 -0.5135852 0.4 0.33333333 0.89295331
[10,] 5.7 -0.8137505 0.3 0.25000000 0.95385361
[11,] 5.8 -0.8017276 0.2 0.16666667 0.98617531
[12,] 5.9 -0.5136304 0.1 0.08333333 0.99826489
[13,] 6.0  0.2006632 0.0 0.00000000 1.00000000
  (Intercept) d.subset[, 1]
   -3.1537386     0.4692872
           [,1]
[1,] -0.3380152
[1] "************************************"
...
Figure 36. LOWESS on noisy sine
Figure 37. LOESS on noisy Runge
In Figures 36 and 37, the blue line is the original function, the black line is the simplistic version, and the green line is the R version. The latter two are very close in value.
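The W column in the listings above is produced by the tricube kernel applied to the scaled distances, W(u) = (1 - |u|^3)^3 for |u| < 1 (the printed values match this form). A minimal sketch of that computation:

```r
# Tricube kernel: W(u) = (1 - |u|^3)^3 for |u| < 1, and 0 otherwise
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Scaled distances from the first rows of the last listing above
scaled.dist <- c(1.00000000, 0.91666667, 0.83333333, 0.75000000)
round(tricube(scaled.dist), 8)
# 0.00000000 0.01212663 0.07477612 0.19322586
```

Points at the edge of the window (scaled distance 1) get zero weight, while the query point itself (scaled distance 0) gets full weight.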
We will use loess, which is a newer version of the function.
oldpar <- par(mfrow = c(3, 1))
for (s in c(0.25, 0.5, 0.75)) {
  # Plot the noisy points
  plot(x.5.5by.1, y.noise, col="red", pch=20, main = paste("Span = ", s))
  # Plot the Runge curve
  curve(runge, -5, 5, add=T, col="blue")
  lines(x.5.5by.1, loess(y.noise ~ x.5.5by.1, span = s)$fitted, col="green")
}
par(oldpar)
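The loop above plots the $fitted component of the loess object, but the fit can also be evaluated at new x values with predict(). A self-contained sketch (it assumes, for illustration, that the document's runge is the Runge function 1/(1 + x^2) on [-5, 5], and builds its own grid in place of x.5.5by.1):

```r
runge <- function(x) 1 / (1 + x^2)   # assumed form of the document's runge

set.seed(1)
x <- seq(-5, 5, by = 0.1)                   # grid like x.5.5by.1
y <- runge(x) + rnorm(length(x), sd = 0.1)  # noisy observations

fit <- loess(y ~ x, span = 0.5)
# Evaluate the local fit at points that need not lie on the original grid
predict(fit, newdata = data.frame(x = c(-2.5, 0.05, 2.5)))
```

This is convenient when the smoothed curve is needed at a finer (or different) set of points than the observations.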
3.13 SuperSmoother
Super smoother [Friedman, J. H. (1984) "A variable span scatterplot smoother", Laboratory for Computational Statistics, Stanford University Technical Report No. 5.] uses 3 different intervals (or spans), called tweeter, midrange, and woofer, on the data and determines the best value from these spans. The user can fix the spans or allow the program to determine the 'best' spans based on cross-validation procedures.
The R documentation states:
"supsmu is a running lines smoother which chooses between three spans for the lines. The running lines smoothers are symmetric, with k/2 data points each side of the predicted point, and values of k as 0.5 * n, 0.2 * n and 0.05 * n, where n is the number of data points. If span is specified, a single smoother with span span * n is used. The best of the three smoothers is chosen by cross-validation for each prediction. The best spans are then smoothed by a running lines smoother and the final prediction chosen by linear interpolation."
(In the above description, the 0.05 * n is the tweeter, the 0.2 * n is the midrange, and the 0.5 * n is the woofer.)
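For concreteness, the three default window sizes for a data set of a given size can be computed directly (a trivial sketch; n = 200 is an arbitrary choice):

```r
n <- 200
spans <- c(tweeter = 0.05, midrange = 0.2, woofer = 0.5)
k <- spans * n   # number of points in each running-lines window
k                # tweeter = 10, midrange = 40, woofer = 100 points
```

Each smoother then uses roughly k/2 points on each side of the point being predicted.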
Friedman also notes that we may
a) know that the underlying curve is smooth, or
b) find a smooth curve appealing.
To achieve this, he introduced a bass control (ranging from 0 to 10) which, for values above 0, causes the larger spans to be used. This control is not needed if the span is selected.
Super smoother - variable span, fixed bass
s <- c(0.05, 0.2, 0.5)
lab <- c("- tweeter", "- midrange", "- woofer")
oldpar <- par(mfrow = c(3, 1))
for (i in 1:3) {
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main=paste("Super smoother - span =", s[i], lab[i]))
  ss <- supsmu(x.5.5by.1, y.noise, span = s[i])
  curve(runge, -5, 5, add=T, col="blue")
  lines(ss$x, ss$y, col="green")
}
par(oldpar)
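The loops above plot lines(ss$x, ss$y) because supsmu returns a list whose x component comes back sorted, with the smoothed values in y. A quick self-contained check on synthetic data:

```r
set.seed(2)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

ss <- supsmu(x, y, span = 0.3)
names(ss)                    # includes "x" and "y"
length(ss$x) == length(ss$y)
is.unsorted(ss$x)            # FALSE: sorted, so it plots cleanly with lines()
```

This is why no explicit ordering step is needed before calling lines().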
Figure 38. Supersmoother on Runge - variable span
Figure 39. Supersmoother on Runge - cv span
Super smoother - cv
b <- c(2, 6, 10)
oldpar <- par(mfrow = c(3, 1))
for (i in 1:3) {
  plot(x.5.5by.1, y.noise, col="red", pch=20,
       main=paste("Super smoother - span = cv, bass =", b[i]))
  ss <- supsmu(x.5.5by.1, y.noise, span = "cv", bass = b[i])
  curve(runge, -5, 5, add=T, col="blue")
  lines(ss$x, ss$y, col="green")
}
par(oldpar)
Friedman used the following example to illustrate his Super smoother:
n <- 200
x.f <- runif(n)
y.f <- sin(2*pi*(1 - x.f)^2) + x.f*rnorm(n)
plot(x.f, y.f)
Figure 40 uses the tweeter, midrange, and woofer fixed-span cases to show the effect of the change in span values.
s <- c(0.05, 0.2, 0.5)
lab <- c("tweeter", "midrange", "woofer")
plot(x.f, y.f, col="red", pch=20)
ss.t <- supsmu(x.f, y.f, span = s[1])
ss.m <- supsmu(x.f, y.f, span = s[2])
ss.w <- supsmu(x.f, y.f, span = s[3])
lines(ss.t$x, ss.t$y, col="black")
lines(ss.m$x, ss.m$y, col="green")
lines(ss.w$x, ss.w$y, col="blue")
legend(0, 2, lab, col = c("black", "green", "blue"),
       lty = c(1, 1, 1), pch = c(-1, -1, -1))
Figure 40. Variable span
Figure 41. Effect of bass in cv span case
Figure 41 shows the effect of the various values of bass.
b <- 0:10
plot(x.f, y.f, col="red", pch=20)
for (i in b) {
  ss.cv <- supsmu(x.f, y.f, span = "cv", bass = i)
  lines(ss.cv$x, ss.cv$y, col=i)
}
legend(0, 2, 0:10, col = 0:10, lty = c(1, 1, 1), pch = c(-1, -1, -1))