Date posted: 05-Dec-2014
Category: Business
Uploaded by: maziar-mahboubian
NFL Data Predictor Model: How the Past Predicts the Future
Ryan Kunes
Kristy Huffman
Maziar Mahboubian
Enterprise
Due to a horrible 2010 season performance, the General Manager of the Carolina Panthers, Marty Hurney, wants to replace Jimmy Clausen with a new starting quarterback.
Managerial Question: How can Hurney rate quarterbacks to determine whom to sign for the 2011 season?
To complete this task, we developed a multi-predictor statistical model using the 2010 NFL quarterbacks’ passing statistics.
Executive Summary
We created a 15-predictor model to gauge quarterback performance. To help Hurney, we used the statistics of his number one pick, Player X, as our new observation.
First, we fit a standard multiple linear regression with all 15 predictors, which reached an R2 of 82.2%. A stepwise regression then eliminated 10 variables as unreliable predictors of quarterback success, and the R2 dropped to 78.73%. To improve the model, we tried using LnRate as the response and removing extreme observations. Our best model is the Without Extreme Observations Model.
Predicted Values for New Observations

New Obs     Fit  SE Fit           95% CI           95% PI
      1  136.13   30.79  (74.60, 197.66)  (71.53, 200.74) XX

XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs  Comp  Att   Pct  Att/G   Yds   Avg  Yds/G    TD   Int  1st  1st%   Lng   20+   40+   Sck
      1   273  383  71.3   25.5  3845  10.0    256  35.0  6.00  130  35.5  83.0  39.0  5.00  6.00
The regression equation is Rate = - 0.97 + 5.45 Avg + 0.665 Pct + 0.931 TD - 1.12 Int + 0.0285 Lng
S = 6.24771 R-Sq = 84.8% R-Sq(adj) = 83.7%
Predicted Values for New Observations

New Obs      Fit  SE Fit              95% CI              95% PI
      1  129.179   3.002  (123.186, 135.173)  (115.340, 143.019)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
Data Summary
Variable   N  N*   Mean  SE Mean  StDev  Minimum     Q1  Median      Q3  Maximum
Rate      79   0  74.94     2.36  21.00     5.90  65.50   79.90   90.90   111.00
Comp      79   0  132.2     14.9  132.7      1.0   13.0    73.0   257.0    450.0
Att       79   0  217.5     23.7  210.6      1.0   27.0   133.0   432.0    679.0
Pct       79   0  59.42     1.16  10.34    37.00  52.90   59.60   63.40   100.00
Att/G     79   0  22.08     1.34  11.95     0.20  12.60   25.30   31.70    42.40
Yds       79   0   1525      172   1533        6    122     857    3018     4710
Avg       79   0  6.542    0.157  1.392    2.000  5.900   6.700   7.400    9.500
Yds/G     79   0  148.0     10.0   89.1      1.2   67.0   171.0   221.0    294.4
TD        79   0   9.43     1.20  10.64     0.00   0.00    5.00   17.00    36.00
Int       79   0  6.430    0.701  6.232    0.000  1.000   4.000  10.000   25.000
1st       79   0  73.95     8.57  76.20     0.00   7.00   39.00  149.00   253.00
1st%      79   0  31.96     1.33  11.84     0.00  26.50   32.30   35.60   100.00
Lng       79   0  51.90     2.73  24.25     6.00  31.00   53.00   73.00    92.00
20+       79   0  19.22     2.20  19.59     0.00   1.00   10.00   38.00    65.00
40+       79   0  3.380    0.432  3.837    0.000  0.000   2.000   6.000   14.000
Sck       79   0  14.19     1.48  13.16     0.00   2.00    9.00   25.00    52.00
Note: The extreme range of the predictors leads to many outliers. This is partially explained by the difference between the statistics of starting QBs and those of backup QBs.
Variable Definitions:
Comp: Number of pass completions
Att: Number of pass attempts
Pct: Completion percentage
Att/G: Number of attempts per game
Yds: Number of passing yards
Avg: Average number of yards per attempt
Yds/G: Number of yards per game
TD: Number of touchdowns
Int: Number of interceptions
1st: Number of first downs
1st%: Percentage rate of first downs
Lng: Longest pass in yards
20+: Number of passes over 20 yards
40+: Number of passes over 40 yards
Sck: Number of sacks
Data Correlation Matrices
Cell contents: Pearson correlation (first row), p-value (second row)

         Rate   Comp    Att    Pct  Att/G    Yds    Avg  Yds/G     TD    Int    1st   1st%    Lng    20+    40+
Comp    0.481
        0.000

Att     0.462  0.996
        0.000  0.000

Pct     0.643  0.170  0.139
        0.000  0.133  0.220

Att/G   0.273  0.801  0.809 -0.077
        0.015  0.000  0.000  0.498

Yds     0.504  0.993  0.991  0.163  0.792
        0.000  0.000  0.000  0.150  0.000

Avg     0.695  0.367  0.355  0.360  0.217  0.411
        0.000  0.001  0.001  0.001  0.055  0.000

Yds/G   0.436  0.853  0.854  0.033  0.967  0.863  0.406
        0.000  0.000  0.000  0.774  0.000  0.000  0.000

TD      0.556  0.950  0.937  0.189  0.734  0.958  0.424  0.817
        0.000  0.000  0.000  0.096  0.000  0.000  0.000  0.000

Int     0.224  0.857  0.863  0.065  0.731  0.839  0.245  0.742  0.755
        0.048  0.000  0.000  0.571  0.000  0.000  0.030  0.000  0.000

1st     0.500  0.995  0.991  0.169  0.786  0.997  0.395  0.851  0.961  0.837
        0.000  0.000  0.000  0.136  0.000  0.000  0.000  0.000  0.000  0.000

1st%    0.534  0.193  0.180  0.477 -0.023  0.209  0.577  0.100  0.236  0.101  0.217
        0.000  0.089  0.113  0.000  0.842  0.064  0.000  0.378  0.036  0.375  0.055

Lng     0.389  0.664  0.673 -0.058  0.726  0.679  0.438  0.770  0.664  0.623  0.657  0.051
        0.000  0.000  0.000  0.612  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.658

20+     0.518  0.947  0.948  0.150  0.765  0.976  0.458  0.860  0.933  0.796  0.961  0.218  0.680
        0.000  0.000  0.000  0.186  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.053  0.000

40+     0.497  0.869  0.867  0.141  0.691  0.901  0.447  0.790  0.860  0.676  0.883  0.205  0.697  0.905
        0.000  0.000  0.000  0.215  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.070  0.000  0.000

Sck     0.410  0.847  0.866  0.080  0.692  0.874  0.371  0.760  0.782  0.742  0.853  0.148  0.642  0.882  0.811
        0.000  0.000  0.000  0.482  0.000  0.000  0.001  0.000  0.000  0.000  0.000  0.194  0.000  0.000  0.000
[Figure: Matrix plot of Rate, Comp, Att, Pct, Att/G, Yds, Avg, Yds/G, TD, Int, 1st, 1st%, Lng, 20+, 40+, Sck]
Kitchen Sink Model
Regression equation: Rate = - 0.6 - 0.100 Comp + 0.222 Att + 0.856 Pct - 2.59 Att/G - 0.0136 Yds + 1.80 Avg + 0.417 Yds/G + 0.880 TD - 1.62 Int - 0.109 1st + 0.126 1st% + 0.128 Lng - 0.145 20+ - 0.907 40+ - 0.035 Sck
Analysis of Variance

Source          DF       SS      MS      F      P
Regression      15  28259.6  1884.0  19.38  0.000
Residual Error  63   6123.8    97.2
Total           78  34383.3

Source  DF  Seq SS
Comp     1  7971.0
Att      1  1512.0
Pct      1  9621.5
Att/G    1    13.1
Yds      1  2194.5
Avg      1  3364.8
Yds/G    1   671.5
TD       1  1294.2
Int      1  1327.7
1st      1     7.7
1st%     1    55.2
Lng      1   139.4
20+      1     0.4
40+      1    84.6
Sck      1     2.0

Predicted Values for New Observations

New Obs     Fit  SE Fit           95% CI           95% PI
      1  136.13   30.79  (74.60, 197.66)  (71.53, 200.74) XX

XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs  Comp  Att   Pct  Att/G   Yds   Avg  Yds/G    TD   Int  1st  1st%   Lng   20+   40+   Sck
      1   273  383  71.3   25.5  3845  10.0    256  35.0  6.00  130  35.5  83.0  39.0  5.00  6.00

Predictor      Coef  SE Coef      T      P
Constant      -0.60    13.21  -0.05  0.964
Comp        -0.1005   0.2127  -0.47  0.638
Att          0.2219   0.1005   2.21  0.031
Pct          0.8562   0.1445   5.93  0.000
Att/G       -2.5928   0.9863  -2.63  0.011
Yds        -0.01360  0.03102  -0.44  0.662
Avg           1.798    1.678   1.07  0.288
Yds/G        0.4170   0.1547   2.70  0.009
TD           0.8802   0.4970   1.77  0.081
Int         -1.6194   0.4150  -3.90  0.000
1st         -0.1085   0.3263  -0.33  0.741
1st%         0.1260   0.1316   0.96  0.342
Lng         0.12816  0.09180   1.40  0.168
20+         -0.1446   0.5568  -0.26  0.796
40+         -0.9067   0.9662  -0.94  0.352
Sck         -0.0351   0.2440  -0.14  0.886

S = 9.85916   R-Sq = 82.2%   R-Sq(adj) = 77.9%
The 15-predictor “Kitchen Sink” MLR model has an R2 of 82.2% and an S value of 9.86.
The p-values for Att, Pct, Att/G, Yds/G, and Int are significant at the 95% confidence level (p < 0.05). TD (p = 0.081) is significant only at the 90% level.
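The summary statistics quoted for each model can be recomputed from its Analysis of Variance table. The following is a minimal sketch in Python, not part of the original Minitab workflow; the function name `fit_statistics` is our own, and the sums of squares are copied from the Minitab output in this deck.

```python
import math

def fit_statistics(ss_regression, ss_error, n_obs, n_predictors):
    """Recompute S, R-Sq and R-Sq(adj) (in percent) from ANOVA sums of squares."""
    ss_total = ss_regression + ss_error
    df_error = n_obs - n_predictors - 1   # residual degrees of freedom
    df_total = n_obs - 1
    mse = ss_error / df_error             # mean squared error
    r_sq = ss_regression / ss_total
    r_sq_adj = 1 - mse / (ss_total / df_total)
    return {"S": math.sqrt(mse), "R-Sq": 100 * r_sq, "R-Sq(adj)": 100 * r_sq_adj}

# Sums of squares as reported in the deck's ANOVA tables.
kitchen_sink = fit_statistics(28259.6, 6123.8, n_obs=79, n_predictors=15)
stepwise = fit_statistics(27069.8, 7313.6, n_obs=79, n_predictors=5)
without_extremes = fit_statistics(14378.9, 2576.2, n_obs=72, n_predictors=5)

for name, stats in [("Kitchen Sink", kitchen_sink),
                    ("Stepwise", stepwise),
                    ("Without Extreme Observations", without_extremes)]:
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Because R-Sq(adj) penalizes each extra predictor, it is the fairer number to compare across the 15-predictor and 5-predictor models.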
Predicted QB rating for Player X
Assessment: Kitchen Sink Model
Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered but show some fanning out
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal
Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is Rate on 15 predictors, with N = 79

Step             1        2        3        4        5
Constant     6.342  -32.066  -24.042  -17.452  -23.056

Avg          10.49     8.04     6.19     5.85     5.06
T-Value       8.48     7.43     5.92     6.01     4.95
P-Value      0.000    0.000    0.000    0.000    0.000

Pct                    0.92     0.89     0.85     0.94
T-Value                6.29     6.86     7.07     7.51
P-Value               0.000    0.000    0.000    0.000

TD                              0.59     1.08     0.96
T-Value                         4.55     6.01     5.16
P-Value                        0.000    0.000    0.000

Int                                     -1.05    -1.22
T-Value                                 -3.67    -4.18
P-Value                                 0.000    0.000

Lng                                              0.149
T-Value                                           2.09
P-Value                                          0.040

S             15.2     12.4     11.1     10.2     10.0
R-Sq         48.30    66.02    73.37    77.46    78.73
R-Sq(adj)    47.63    65.12    72.30    76.24    77.27
Mallows Cp   107.9     47.2     23.2     10.7      8.2
Stepwise Regression
To reduce the effect of multicollinearity, as seen in the original model for the predictors Comp, Att, Yds, 1st, 20+, and 40+, we ran a stepwise regression.
To improve our model, we chose step five, with the predictors Avg, Pct, TD, Int, and Lng.
Criteria for selection:
R2: 78.73%
S: 10
P-values: all below 0.05 (significant at the 95% confidence level)
Number of variables: dropped from 15 to 5
Even though the R2 decreased and S increased, we feel that this is an improved model because the predictors’ p-values are all significant at the 95% confidence level.
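The Mallows Cp row of the stepwise output gives another way to compare the steps. As a hedged sketch (the helper name `mallows_cp` is ours, and the error sum of squares for each step is backed out of the rounded R-Sq values, so results agree with Minitab only to about one decimal place), Cp can be recomputed as SSE_p / MSE_full - (n - 2p), where p counts the constant plus the predictors in the step:

```python
def mallows_cp(r_sq_percent, n_params, ss_total, mse_full, n_obs):
    """Mallows Cp for a sub-model, with its SSE recovered from R-Sq (in percent)."""
    sse = ss_total * (1 - r_sq_percent / 100)
    return sse / mse_full - (n_obs - 2 * n_params)

SS_TOTAL = 34383.3          # total sum of squares, from the ANOVA tables
MSE_FULL = 6123.8 / 63      # residual mean square of the 15-predictor model
N_OBS = 79

# R-Sq (percent) at each stepwise step; step k uses k predictors plus a constant.
r_sq_by_step = {1: 48.30, 2: 66.02, 3: 73.37, 4: 77.46, 5: 78.73}

cp_by_step = {
    step: mallows_cp(r_sq, step + 1, SS_TOTAL, MSE_FULL, N_OBS)
    for step, r_sq in r_sq_by_step.items()
}

for step, cp in cp_by_step.items():
    print(f"Step {step}: Cp = {cp:.1f}")
```

A Cp close to the number of parameters (6 at step five) suggests little bias from the omitted predictors, which is consistent with choosing step five even though its R2 is lower than the full model's.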
Stepwise Model: Rate v. Avg, Pct, TD, Int, Lng
The regression equation is Rate = - 23.1 + 5.06 Avg + 0.942 Pct + 0.958 TD - 1.22 Int + 0.149 Lng
S = 10.0093   R-Sq = 78.7%   R-Sq(adj) = 77.3%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  27069.8  5414.0  54.04  0.000
Residual Error  73   7313.6   100.2
Total           78  34383.3

Source  DF   Seq SS
Avg      1  16608.5
Pct      1   6091.0
TD       1   2526.1
Int      1   1408.1
Lng      1    436.1

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  133.28    4.62  (124.08, 142.48)  (111.31, 155.24)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
[Figure: Matrix Plot of Rate, Avg, Pct, TD, Int, Lng]
Cell contents: Pearson correlation (first row), p-value (second row)

        Rate    Avg    Pct     TD    Int
Avg    0.695
       0.000

Pct    0.643  0.360
       0.000  0.001

TD     0.556  0.424  0.189
       0.000  0.000  0.096

Int    0.224  0.245  0.065  0.755
       0.048  0.030  0.571  0.000

Lng    0.389  0.438 -0.058  0.664  0.623
       0.000  0.000  0.612  0.000  0.000
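The pairwise correlations among the five predictors are enough to recover their variance inflation factors, since each VIF is the matching diagonal entry of the inverse of the predictor correlation matrix. The sketch below is our own illustration in plain Python (the `invert` helper is hypothetical, not a Minitab routine). It uses the rounded correlations above, so it should only approximately match the VIFs that the deck's later Minitab output reports for the 79-observation model (Avg 1.580, Pct 1.310, TD 3.037, Int 2.574, Lng 2.326).

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]

NAMES = ["Avg", "Pct", "TD", "Int", "Lng"]
# Predictor-by-predictor correlations copied from the matrix above.
CORR = [
    [1.000, 0.360, 0.424, 0.245, 0.438],
    [0.360, 1.000, 0.189, 0.065, -0.058],
    [0.424, 0.189, 1.000, 0.755, 0.664],
    [0.245, 0.065, 0.755, 1.000, 0.623],
    [0.438, -0.058, 0.664, 0.623, 1.000],
]

inverse = invert(CORR)
vif = {name: inverse[i][i] for i, name in enumerate(NAMES)}

for name, value in vif.items():
    print(f"{name}: VIF = {value:.3f}")
```

TD, Int, and Lng are the most inter-correlated predictors in the matrix, so they carry the largest VIFs.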
[Figure: Residual Plots for Rate. Four panels: Normal Probability Plot (Percent vs. Residual), Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), Versus Order (Residual vs. Observation Order)]
Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered but show some fanning out
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal
R2 decreased to 78.7% and S increased to 10
LnRate Stepwise Model: LnRate v. Avg, Pct, TD, Int, Lng
The regression equation is LnRate = 2.14 + 0.114 Avg + 0.0194 Pct + 0.00954 TD - 0.0166 Int + 0.00428 Lng
Predictor       Coef   SE Coef      T      P    VIF
Constant      2.1439    0.2277   9.42  0.000
Avg          0.11391   0.02852   3.99  0.000  1.580
Pct         0.019426  0.003496   5.56  0.000  1.310
TD          0.009539  0.005170   1.84  0.069  3.037
Int        -0.016590  0.008130  -2.04  0.045  2.574
Lng         0.004275  0.001986   2.15  0.035  2.326
S = 0.278903 R-Sq = 63.7% R-Sq(adj) = 61.2%
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5   9.9458  1.9892  25.57  0.000
Residual Error  73   5.6785  0.0778
Total           78  15.6242

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  5.2571  0.1286  (5.0008, 5.5135)  (4.6450, 5.8692)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
The lower R2 (63.7%) indicates that taking the natural log of Rate does not give a more accurate prediction interval. Back-transformed, the interval runs from e^(4.645) = 104.063 to e^(5.8692) = 353.965, so the prediction interval range has increased significantly to 249.902.
*A few extreme outliers prevent the transformation from improving our model.
TD’s VIF of 3.037, the highest among the predictors, suggests some multicollinearity, which may be skewing the results. In particular, TD’s p-value increased to 0.069, so it is no longer significant at the 95% confidence level.
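The back-transformation of the LnRate interval can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours, and the log-scale values are copied from the Minitab output for Player X.

```python
import math

# Log-scale fit and 95% prediction interval for Player X (LnRate stepwise model).
ln_fit = 5.2571
ln_pi = (4.6450, 5.8692)

# Exponentiate to return to the original Rate scale.
rate_fit = math.exp(ln_fit)
rate_pi = tuple(math.exp(bound) for bound in ln_pi)
pi_width = rate_pi[1] - rate_pi[0]

print(f"Back-transformed fit: {rate_fit:.1f}")
print(f"Back-transformed 95% PI: ({rate_pi[0]:.3f}, {rate_pi[1]:.3f})")
print(f"PI width on the Rate scale: {pi_width:.3f}")
```

Note that exponentiating stretches the upper end of the interval far more than the lower end, which is why an interval that looks modest on the log scale becomes so wide on the Rate scale.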
Independence: The plots appear to make an inverted parabola
Constant Variance: The plots fan inward
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is left skewed
Without Extreme Observations Model: Rate v. Avg, Pct, TD, Int, Lng
The regression equation is Rate = - 0.97 + 5.45 Avg + 0.665 Pct + 0.931 TD - 1.12 Int + 0.0285 Lng
Predictor     Coef  SE Coef      T      P    VIF
Constant    -0.971    6.378  -0.15  0.879
Avg         5.4514   0.6974   7.82  0.000  1.374
Pct        0.66521  0.08587   7.75  0.000  1.273
TD          0.9308   0.1180   7.89  0.000  2.910
Int        -1.1227   0.1829  -6.14  0.000  2.457
Lng        0.02849  0.04806   0.59  0.555  2.284
S = 6.24771 R-Sq = 84.8% R-Sq(adj) = 83.7%
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  14378.9  2875.8  73.67  0.000
Residual Error  66   2576.2    39.0
Total           71  16955.2

Predicted Values for New Observations

New Obs      Fit  SE Fit              95% CI              95% PI
      1  129.179   3.002  (123.186, 135.173)  (115.340, 143.019)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
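The fitted rating for Player X is simply the regression equation evaluated at his predictor values. As a small sketch (the dictionary layout and the `predict_rate` name are our own choices; the coefficients are the unrounded ones from the Minitab coefficient table for this model):

```python
# Coefficients of the Without Extreme Observations Model.
COEFFICIENTS = {
    "Constant": -0.971,
    "Avg": 5.4514,
    "Pct": 0.66521,
    "TD": 0.9308,
    "Int": -1.1227,
    "Lng": 0.02849,
}

# Player X's values for the five predictors.
PLAYER_X = {"Avg": 10.0, "Pct": 71.3, "TD": 35.0, "Int": 6.0, "Lng": 83.0}

def predict_rate(coefficients, observation):
    """Evaluate the linear regression equation for one observation."""
    rate = coefficients["Constant"]
    for name, value in observation.items():
        rate += coefficients[name] * value
    return rate

player_x_rate = predict_rate(COEFFICIENTS, PLAYER_X)
print(f"Predicted Rate for Player X: {player_x_rate:.3f}")
```

The same function can score any other quarterback Hurney is considering, provided the candidate's statistics fall within the range of the data used to fit the model.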
R2 responded well, increasing substantially to 84.8%.
The p-values of the predictors are 0.000, except for Lng, which is 0.555. However, its VIF of 2.284 is not the highest.
The prediction interval range decreased favorably to 27.679.
After analyzing the players who had unusual data, we removed their extreme observations because they represent second-string players who did not get enough playing time to produce adequate data.
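The width of the prediction interval follows from S and the standard error of the fit. The sketch below is our own reconstruction, not Minitab's code: to stay within the Python standard library it recovers the t critical value from the reported 95% CI rather than from a t-distribution routine, then rebuilds the 95% PI as fit ± t · sqrt(S² + SE_fit²).

```python
import math

# Values reported for Player X under the Without Extreme Observations Model.
fit = 129.179
se_fit = 3.002
s = 6.24771
ci_upper = 135.173

# Assumption: back out the t critical value (66 residual df) from the reported CI.
t_crit = (ci_upper - fit) / se_fit

# A new observation adds the residual variance S^2 to the variance of the fit.
se_prediction = math.sqrt(s ** 2 + se_fit ** 2)
pi_lower = fit - t_crit * se_prediction
pi_upper = fit + t_crit * se_prediction
pi_range = pi_upper - pi_lower

print(f"t critical value: {t_crit:.4f}")
print(f"95% PI: ({pi_lower:.3f}, {pi_upper:.3f}), range {pi_range:.3f}")
```

Because S enters the interval directly, the drop in S from about 10 to 6.25 accounts for most of the narrowing of the prediction interval.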
Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal
LnRate Without Extreme Observations Model: LnRate V. Avg, Pct, TD, Int, Lng
The regression equation is LnRate = 3.09 + 0.0890 Avg + 0.00984 Pct + 0.0103 TD - 0.0128 Int + 0.000813 Lng
Predictor       Coef    SE Coef      T      P    VIF
Constant      3.0950     0.1172  26.40  0.000
Avg          0.08904    0.01282   6.95  0.000  1.374
Pct         0.009838   0.001578   6.23  0.000  1.273
TD          0.010264   0.002170   4.73  0.000  2.910
Int        -0.012753   0.003362  -3.79  0.000  2.457
Lng        0.0008131  0.0008834   0.92  0.361  2.284
S = 0.114843 R-Sq = 77.2% R-Sq(adj) = 75.4%
Analysis of Variance

Source          DF       SS       MS      F      P
Regression       5  2.93998  0.58800  44.58  0.000
Residual Error  66  0.87047  0.01319
Total           71  3.81045

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  5.0370  0.0552  (4.9269, 5.1472)  (4.7826, 5.2914)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
R2 decreased to 77.2%.
The p-values of our predictors are 0.000, except for Lng, which decreased to 0.361.
The prediction interval range, from e^(4.7826) = 119.4144 to e^(5.2914) = 198.6213, is 79.2069, nearly three times as wide as in the previous model (27.679). Therefore the natural log does not improve the model.
Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal
Conclusion
Criteria for choice: The Without Extreme Observations Model improves both R2 and S, but the predictor Lng is no longer significant at the 95% confidence level. We feel that this does put a damper on our results. However, the improved fit of the regression equation and the more constant variance of the residuals, compared with the Stepwise Model, lead us to conclude that it is the best model.
*Many improvements can be made to this model. For instance, one could use only starting quarterbacks. However, because we needed to find 50 observations, this adjustment did not meet the objective requirements.
Because the R2 is 84.8% and the S is 6.25, Hurney should use caution when using this model. The model does not take all of the relevant information into account.
Statistics For-The-Win!!!