NFL Data Predictor Model: How the Past Predicts the Future


Ryan Kunes, Kristy Huffman, Maziar Mahboubian

Enterprise

Due to a horrible 2010 season performance, the General Manager of the Carolina Panthers, Marty Hurney, wants to replace Jimmy Clausen with a new starting quarterback.

Managerial Question: How can Hurney rate quarterbacks to determine whom to sign for the 2011 season?

To complete this task, we developed a multi-predictor statistical model using the 2010 NFL quarterbacks' passing statistics.

Executive Summary

We created a 15-predictor model to gauge quarterback performance. To help Hurney, we used the statistics of his number one pick, Player X, as our new observation.

First, we fit a multiple linear regression model with all 15 predictors and reached an R2 of 82.2%. A stepwise regression then eliminated 10 variables as unreliable predictors of quarterback success, and the R2 dropped to 78.73%. To improve the model, we tried using LnRate as the response and eliminating extreme observations. Our best model is our Without Extreme Observations Model.

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  136.13   30.79   (74.60, 197.66)   (71.53, 200.74) XX

XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs  Comp  Att   Pct  Att/G   Yds   Avg  Yds/G    TD   Int  1st  1st%   Lng   20+   40+   Sck
      1   273  383  71.3   25.5  3845  10.0    256  35.0  6.00  130  35.5  83.0  39.0  5.00  6.00

The regression equation is Rate = - 0.97 + 5.45 Avg + 0.665 Pct + 0.931 TD - 1.12 Int + 0.0285 Lng

S = 6.24771 R-Sq = 84.8% R-Sq(adj) = 83.7%

Predicted Values for New Observations

New Obs      Fit  SE Fit                95% CI                95% PI
      1  129.179   3.002    (123.186, 135.173)    (115.340, 143.019)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0
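As a quick check, the fitted value for Player X can be reproduced by plugging the listed predictor values into the regression equation. The sketch below uses the rounded coefficients exactly as printed in the equation, so it agrees with Minitab's 129.179 only to about two decimal places; the function and variable names are illustrative and not part of the original analysis.

```python
# Rounded coefficients copied from the printed regression equation.
INTERCEPT = -0.97
COEFFICIENTS = {"Avg": 5.45, "Pct": 0.665, "TD": 0.931, "Int": -1.12, "Lng": 0.0285}

# Player X's values from "Values of Predictors for New Observations".
PLAYER_X = {"Avg": 10.0, "Pct": 71.3, "TD": 35.0, "Int": 6.0, "Lng": 83.0}


def predict_rate(stats):
    """Evaluate the regression equation for one quarterback."""
    return INTERCEPT + sum(coef * stats[name] for name, coef in COEFFICIENTS.items())


fit = predict_rate(PLAYER_X)
print(f"Predicted Rate for Player X: {fit:.2f}")
```

The small gap between this result and the reported fit comes from coefficient rounding, not from a different model.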

Data Summary

Variable   N  N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
Rate      79   0   74.94     2.36   21.00     5.90   65.50   79.90   90.90   111.00
Comp      79   0   132.2     14.9   132.7      1.0    13.0    73.0   257.0    450.0
Att       79   0   217.5     23.7   210.6      1.0    27.0   133.0   432.0    679.0
Pct       79   0   59.42     1.16   10.34    37.00   52.90   59.60   63.40   100.00
Att/G     79   0   22.08     1.34   11.95     0.20   12.60   25.30   31.70    42.40
Yds       79   0    1525      172    1533        6     122     857    3018     4710
Avg       79   0   6.542    0.157   1.392    2.000   5.900   6.700   7.400    9.500
Yds/G     79   0   148.0     10.0    89.1      1.2    67.0   171.0   221.0    294.4
TD        79   0    9.43     1.20   10.64     0.00    0.00    5.00   17.00    36.00
Int       79   0   6.430    0.701   6.232    0.000   1.000   4.000  10.000   25.000
1st       79   0   73.95     8.57   76.20     0.00    7.00   39.00  149.00   253.00
1st%      79   0   31.96     1.33   11.84     0.00   26.50   32.30   35.60   100.00
Lng       79   0   51.90     2.73   24.25     6.00   31.00   53.00   73.00    92.00
20+       79   0   19.22     2.20   19.59     0.00    1.00   10.00   38.00    65.00
40+       79   0   3.380    0.432   3.837    0.000   0.000   2.000   6.000   14.000
Sck       79   0   14.19     1.48   13.16     0.00    2.00    9.00   25.00    52.00

Note: The extreme range of the predictors leads to many outliers. This is partially explained by the difference between the statistics of starting QBs and those of backup QBs.

Variable Definitions:
Rate: Quarterback rating (the response variable)
Comp: Number of pass completions
Att: Number of pass attempts
Pct: Completion percentage
Att/G: Number of attempts per game
Yds: Number of passing yards
Avg: Average number of yards per attempt
Yds/G: Number of yards per game
TD: Number of touchdowns
Int: Number of interceptions
1st: Number of first downs
1st%: Percentage rate of first downs
Lng: Longest pass in yards
20+: Number of passes over 20 yards
40+: Number of passes over 40 yards
Sck: Number of sacks

Data Correlation Matrices


Correlations (cell contents: Pearson correlation, with its p-value on the line beneath)

        Rate   Comp    Att    Pct  Att/G    Yds    Avg  Yds/G     TD    Int    1st   1st%    Lng    20+    40+
Comp   0.481
       0.000
Att    0.462  0.996
       0.000  0.000
Pct    0.643  0.170  0.139
       0.000  0.133  0.220
Att/G  0.273  0.801  0.809 -0.077
       0.015  0.000  0.000  0.498
Yds    0.504  0.993  0.991  0.163  0.792
       0.000  0.000  0.000  0.150  0.000
Avg    0.695  0.367  0.355  0.360  0.217  0.411
       0.000  0.001  0.001  0.001  0.055  0.000
Yds/G  0.436  0.853  0.854  0.033  0.967  0.863  0.406
       0.000  0.000  0.000  0.774  0.000  0.000  0.000
TD     0.556  0.950  0.937  0.189  0.734  0.958  0.424  0.817
       0.000  0.000  0.000  0.096  0.000  0.000  0.000  0.000
Int    0.224  0.857  0.863  0.065  0.731  0.839  0.245  0.742  0.755
       0.048  0.000  0.000  0.571  0.000  0.000  0.030  0.000  0.000
1st    0.500  0.995  0.991  0.169  0.786  0.997  0.395  0.851  0.961  0.837
       0.000  0.000  0.000  0.136  0.000  0.000  0.000  0.000  0.000  0.000
1st%   0.534  0.193  0.180  0.477 -0.023  0.209  0.577  0.100  0.236  0.101  0.217
       0.000  0.089  0.113  0.000  0.842  0.064  0.000  0.378  0.036  0.375  0.055
Lng    0.389  0.664  0.673 -0.058  0.726  0.679  0.438  0.770  0.664  0.623  0.657  0.051
       0.000  0.000  0.000  0.612  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.658
20+    0.518  0.947  0.948  0.150  0.765  0.976  0.458  0.860  0.933  0.796  0.961  0.218  0.680
       0.000  0.000  0.000  0.186  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.053  0.000
40+    0.497  0.869  0.867  0.141  0.691  0.901  0.447  0.790  0.860  0.676  0.883  0.205  0.697  0.905
       0.000  0.000  0.000  0.215  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.070  0.000  0.000
Sck    0.410  0.847  0.866  0.080  0.692  0.874  0.371  0.760  0.782  0.742  0.853  0.148  0.642  0.882  0.811
       0.000  0.000  0.000  0.482  0.000  0.000  0.001  0.000  0.000  0.000  0.000  0.194  0.000  0.000  0.000

MATRIX PLOT OF RATE, COMP, ATT, PCT, ATT/G, YDS, AVG, YDS/G, TD, INT, 1ST, 1ST%, LNG, 20+, 40+, SCK

Kitchen Sink Model

Regression equation: Rate = - 0.6 - 0.100 Comp + 0.222 Att + 0.856 Pct - 2.59 Att/G - 0.0136 Yds + 1.80 Avg + 0.417 Yds/G + 0.880 TD - 1.62 Int - 0.109 1st + 0.126 1st% + 0.128 Lng - 0.145 20+ - 0.907 40+ - 0.035 Sck

Analysis of Variance

Source          DF       SS      MS      F      P
Regression      15  28259.6  1884.0  19.38  0.000
Residual Error  63   6123.8    97.2
Total           78  34383.3

Source  DF  Seq SS
Comp     1  7971.0
Att      1  1512.0
Pct      1  9621.5
Att/G    1    13.1
Yds      1  2194.5
Avg      1  3364.8
Yds/G    1   671.5
TD       1  1294.2
Int      1  1327.7
1st      1     7.7
1st%     1    55.2
Lng      1   139.4
20+      1     0.4
40+      1    84.6
Sck      1     2.0

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  136.13   30.79   (74.60, 197.66)   (71.53, 200.74) XX

XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs  Comp  Att   Pct  Att/G   Yds   Avg  Yds/G    TD   Int  1st  1st%   Lng   20+   40+   Sck
      1   273  383  71.3   25.5  3845  10.0    256  35.0  6.00  130  35.5  83.0  39.0  5.00  6.00
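The XX flag warns that Player X sits far from the bulk of the sampled predictor values. One rough way to see where, sketched below, is to compare each of Player X's values with the minimum and maximum from the Data Summary table. This is only a simple range check written for illustration; it is not the diagnostic Minitab uses to set the flag, and the helper name is hypothetical.

```python
# (minimum, maximum) of each predictor, copied from the Data Summary table.
OBSERVED_RANGE = {
    "Comp": (1.0, 450.0), "Att": (1.0, 679.0), "Pct": (37.0, 100.0),
    "Att/G": (0.2, 42.4), "Yds": (6.0, 4710.0), "Avg": (2.0, 9.5),
    "Yds/G": (1.2, 294.4), "TD": (0.0, 36.0), "Int": (0.0, 25.0),
    "1st": (0.0, 253.0), "1st%": (0.0, 100.0), "Lng": (6.0, 92.0),
    "20+": (0.0, 65.0), "40+": (0.0, 14.0), "Sck": (0.0, 52.0),
}

# Player X's values from "Values of Predictors for New Observations".
PLAYER_X = {
    "Comp": 273, "Att": 383, "Pct": 71.3, "Att/G": 25.5, "Yds": 3845,
    "Avg": 10.0, "Yds/G": 256, "TD": 35.0, "Int": 6.0, "1st": 130,
    "1st%": 35.5, "Lng": 83.0, "20+": 39.0, "40+": 5.0, "Sck": 6.0,
}


def out_of_range(new_obs, observed_range):
    """Return the predictors whose new value lies outside the sampled range."""
    flagged = []
    for name, value in new_obs.items():
        low, high = observed_range[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


extrapolated = out_of_range(PLAYER_X, OBSERVED_RANGE)
print("Predictors outside the sampled range:", extrapolated)
```

A value outside the sampled range means the model is extrapolating for that predictor, which is one reason to read the very wide intervals above with care.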

Predictor      Coef  SE Coef      T      P
Constant      -0.60    13.21  -0.05  0.964
Comp        -0.1005   0.2127  -0.47  0.638
Att          0.2219   0.1005   2.21  0.031
Pct          0.8562   0.1445   5.93  0.000
Att/G       -2.5928   0.9863  -2.63  0.011
Yds        -0.01360  0.03102  -0.44  0.662
Avg           1.798    1.678   1.07  0.288
Yds/G        0.4170   0.1547   2.70  0.009
TD           0.8802   0.4970   1.77  0.081
Int         -1.6194   0.4150  -3.90  0.000
1st         -0.1085   0.3263  -0.33  0.741
1st%         0.1260   0.1316   0.96  0.342
Lng         0.12816  0.09180   1.40  0.168
20+         -0.1446   0.5568  -0.26  0.796
40+         -0.9067   0.9662  -0.94  0.352
Sck         -0.0351   0.2440  -0.14  0.886

S = 9.85916   R-Sq = 82.2%   R-Sq(adj) = 77.9%

The 15-predictor "Kitchen Sink" MLR model has an R2 of 82.2% and an S value of 9.86.
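The summary figures for the Kitchen Sink Model follow directly from its Analysis of Variance table. The sketch below recomputes R2, adjusted R2, the F statistic, and S from the printed sums of squares and degrees of freedom; the variable names are illustrative, and small differences from Minitab's output are due to rounding in the printed table.

```python
import math

# Values copied from the Kitchen Sink Model's Analysis of Variance table.
SS_REGRESSION, DF_REGRESSION = 28259.6, 15
SS_ERROR, DF_ERROR = 6123.8, 63
SS_TOTAL, DF_TOTAL = 34383.3, 78

ms_regression = SS_REGRESSION / DF_REGRESSION
ms_error = SS_ERROR / DF_ERROR

r_squared = 1 - SS_ERROR / SS_TOTAL
# Adjusted R2 penalizes the model for the number of predictors it uses.
adj_r_squared = 1 - ms_error / (SS_TOTAL / DF_TOTAL)
f_statistic = ms_regression / ms_error
s = math.sqrt(ms_error)  # standard error of the regression

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {adj_r_squared:.1%}")
print(f"F = {f_statistic:.2f}, S = {s:.2f}")
```

The gap between R2 and adjusted R2 is a hint that several of the 15 predictors add little explanatory power.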

The p-values for Att, Pct, Att/G, Yds/G, and Int are significant at the 5% level (95% confidence). TD (p = 0.081) is significant only at the 10% level.

Predicted QB rating for Player X: 136.13, with a 95% prediction interval of (71.53, 200.74)

Assessment: Kitchen Sink Model

Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered but show some fanning out
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is Rate on 15 predictors, with N = 79

Step              1        2        3        4        5
Constant      6.342  -32.066  -24.042  -17.452  -23.056

Avg           10.49     8.04     6.19     5.85     5.06
T-Value        8.48     7.43     5.92     6.01     4.95
P-Value       0.000    0.000    0.000    0.000    0.000

Pct                     0.92     0.89     0.85     0.94
T-Value                 6.29     6.86     7.07     7.51
P-Value                0.000    0.000    0.000    0.000

TD                               0.59     1.08     0.96
T-Value                          4.55     6.01     5.16
P-Value                         0.000    0.000    0.000

Int                                      -1.05    -1.22
T-Value                                  -3.67    -4.18
P-Value                                  0.000    0.000

Lng                                               0.149
T-Value                                            2.09
P-Value                                           0.040

S              15.2     12.4     11.1     10.2     10.0
R-Sq          48.30    66.02    73.37    77.46    78.73
R-Sq(adj)     47.63    65.12    72.30    76.24    77.27
Mallows Cp    107.9     47.2     23.2     10.7      8.2

Stepwise Regression

To reduce the effect of multicollinearity, as seen in the original model for the predictors Comp, Att, Yds, 1st, 20+, and 40+, we ran a stepwise regression.

To improve our model, we chose step five, with the predictors Avg, Pct, TD, Int, and Lng.

Criteria for selection:
R2: 78.73%
S: 10.0
P-value: all below 0.05 (significant at the 95% confidence level)
Number of Variables: Dropped from 15 to 5

Even though the R2 decreased and S increased, we feel that this is an improved model because the predictors' p-values are all below 0.05.
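The trade-off between the full model and step five can also be read from adjusted R2 and Mallows Cp, both of which appear in the stepwise output. The sketch below recomputes them from the error sums of squares printed in the two Analysis of Variance tables (6123.8 for the 15-predictor model and 7313.6 for the five-predictor model). The function names are illustrative, and Cp is computed with the usual formula SSE_p / MSE_full - n + 2p, where p counts the intercept.

```python
N = 79               # observations
SS_TOTAL = 34383.3   # total sum of squares for Rate

SSE_FULL, K_FULL = 6123.8, 15    # Kitchen Sink Model
SSE_STEP5, K_STEP5 = 7313.6, 5   # stepwise step five


def adjusted_r2(sse, k):
    """Adjusted R2 for a model with k predictors plus an intercept."""
    return 1 - (sse / (N - k - 1)) / (SS_TOTAL / (N - 1))


def mallows_cp(sse, k):
    """Mallows Cp, using the full model's mean squared error as the yardstick."""
    mse_full = SSE_FULL / (N - K_FULL - 1)
    return sse / mse_full - N + 2 * (k + 1)


adj_full = adjusted_r2(SSE_FULL, K_FULL)
adj_step5 = adjusted_r2(SSE_STEP5, K_STEP5)
cp_step5 = mallows_cp(SSE_STEP5, K_STEP5)

print(f"Adjusted R2: full = {adj_full:.2%}, step five = {adj_step5:.2%}")
print(f"Mallows Cp for step five = {cp_step5:.1f}")
```

Adjusted R2 barely moves between the two models even though ten predictors were dropped, which supports preferring the smaller model.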

Stepwise Model: Rate v. Avg, Pct, TD, Int, Lng


S = 10.0093   R-Sq = 78.7%   R-Sq(adj) = 77.3%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  27069.8  5414.0  54.04  0.000
Residual Error  73   7313.6   100.2
Total           78  34383.3

Source  DF   Seq SS
Avg      1  16608.5
Pct      1   6091.0
TD       1   2526.1
Int      1   1408.1
Lng      1    436.1



Matrix Plot of Rate, Avg, Pct, TD, Int, Lng

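Correlations (cell contents: Pearson correlation, with its p-value on the line beneath)

       Rate    Avg    Pct     TD    Int
Avg   0.695
      0.000
Pct   0.643  0.360
      0.000  0.001
TD    0.556  0.424  0.189
      0.000  0.000  0.096
Int   0.224  0.245  0.065  0.755
      0.048  0.030  0.571  0.000
Lng   0.389  0.438 -0.058  0.664  0.623
      0.000  0.000  0.612  0.000  0.000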
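The correlations among the five retained predictors are enough to compute their variance inflation factors (VIFs): each VIF is the matching diagonal entry of the inverse of the predictor correlation matrix. The sketch below does this in plain Python with a small Gauss-Jordan routine written for illustration (it is not a Minitab feature). Because the correlations are rounded to three decimals, the results should only approximately match the VIFs Minitab reports for the 79-observation LnRate model (Avg 1.580, Pct 1.310, TD 3.037, Int 2.574, Lng 2.326).

```python
NAMES = ["Avg", "Pct", "TD", "Int", "Lng"]

# Predictor-to-predictor correlations copied from the matrix above.
CORRELATION = [
    [1.000, 0.360, 0.424, 0.245, 0.438],
    [0.360, 1.000, 0.189, 0.065, -0.058],
    [0.424, 0.189, 1.000, 0.755, 0.664],
    [0.245, 0.065, 0.755, 1.000, 0.623],
    [0.438, -0.058, 0.664, 0.623, 1.000],
]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [value / pivot for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


inverse = invert(CORRELATION)
vif = {name: inverse[i][i] for i, name in enumerate(NAMES)}
for name, value in vif.items():
    print(f"{name}: VIF = {value:.3f}")
```

TD and Int have the largest VIFs because they are strongly correlated with each other (0.755) and with Lng.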

Residual Plots for Rate (four panels: Normal Probability Plot of Percent v. Residual; Residual Versus Fits; Histogram of Frequency v. Residual; Residual Versus Observation Order)

Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered but show some fanning out
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal

R2 decreased to 78.7% and S increased to 10

LnRate Stepwise Model: LnRate v. Avg, Pct, TD, Int, Lng

The regression equation is LnRate = 2.14 + 0.114 Avg + 0.0194 Pct + 0.00954 TD - 0.0166 Int + 0.00428 Lng

Predictor       Coef   SE Coef      T      P    VIF
Constant      2.1439    0.2277   9.42  0.000
Avg          0.11391   0.02852   3.99  0.000  1.580
Pct         0.019426  0.003496   5.56  0.000  1.310
TD          0.009539  0.005170   1.84  0.069  3.037
Int        -0.016590  0.008130  -2.04  0.045  2.574
Lng         0.004275  0.001986   2.15  0.035  2.326

S = 0.278903 R-Sq = 63.7% R-Sq(adj) = 61.2%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5   9.9458  1.9892  25.57  0.000
Residual Error  73   5.6785  0.0778
Total           78  15.6242

Predicted Values for New Observations

New Obs     Fit  SE Fit              95% CI              95% PI
      1  5.2571  0.1286    (5.0008, 5.5135)    (4.6450, 5.8692)

Values of Predictors for New Observations

New Obs   Avg   Pct    TD   Int   Lng
      1  10.0  71.3  35.0  6.00  83.0

The lower R2 (63.7%) indicates that using the natural log does not give a more accurate prediction interval. Back-transforming the interval gives e^(4.6450) = 104.063 and e^(5.8692) = 353.965, so the prediction interval range has increased significantly, to 249.902.

*A few extreme outliers prevent our model enhancement.

TD's VIF of 3.037 suggests some multicollinearity, which may be skewing the results. In particular, TD's p-value increased to 0.069, so it is no longer significant at the 5% level.
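Because the response here is LnRate, the fit and interval have to be exponentiated before they can be compared with the Rate models. A minimal sketch of that back-transformation, using the values from the Predicted Values output (variable names are illustrative):

```python
import math

# LnRate fit and 95% prediction interval for Player X, on the log scale.
FIT_LOG = 5.2571
PI_LOG = (4.6450, 5.8692)

# Exponentiating undoes the natural-log transformation of Rate.
fit_rate = math.exp(FIT_LOG)
pi_low, pi_high = (math.exp(bound) for bound in PI_LOG)
width = pi_high - pi_low

print(f"Predicted Rate: {fit_rate:.1f}")
print(f"95% PI on the Rate scale: ({pi_low:.3f}, {pi_high:.3f}), range {width:.3f}")
```

On the Rate scale the interval is no longer symmetric around the fit, and the fit itself (about 192) is far above the largest rating in the data (111.00), another sign that this model does not suit Player X.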

Independence: The plots appear to make an inverted parabola
Constant Variance: The plots fan inward
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is left skewed

Without Extreme Observations Model: Rate v. Avg, Pct, TD, Int, Lng


Predictor      Coef  SE Coef      T      P    VIF
Constant     -0.971    6.378  -0.15  0.879
Avg          5.4514   0.6974   7.82  0.000  1.374
Pct         0.66521  0.08587   7.75  0.000  1.273
TD           0.9308   0.1180   7.89  0.000  2.910
Int         -1.1227   0.1829  -6.14  0.000  2.457
Lng         0.02849  0.04806   0.59  0.555  2.284

S = 6.24771 R-Sq = 84.8% R-Sq(adj) = 83.7%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  14378.9  2875.8  73.67  0.000
Residual Error  66   2576.2    39.0
Total           71  16955.2


R2 responded well, increasing substantially to 84.8%.
The p-values of the predictors are all 0.000 except for Lng, which is 0.555. However, its VIF of 2.284 is not the highest.
The prediction interval range favorably decreased to 27.679 (from 115.340 to 143.019).
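The prediction interval is wider than the confidence interval because it adds the scatter of an individual observation (S) to the uncertainty in the fitted mean (SE Fit). The sketch below rebuilds the 95% PI from the reported Fit, SE Fit, and S. To avoid assuming a t-table value, it recovers the t multiplier from the reported confidence interval; that shortcut and the variable names are choices made for this illustration.

```python
import math

# Reported for Player X under the Without Extreme Observations Model.
FIT = 129.179
SE_FIT = 3.002
S = 6.24771
CI = (123.186, 135.173)

# The CI is FIT +/- t * SE_FIT, so the multiplier can be backed out of it.
t_multiplier = (CI[1] - CI[0]) / 2 / SE_FIT

# A prediction interval adds the residual variance to the variance of the fit.
half_width = t_multiplier * math.sqrt(S ** 2 + SE_FIT ** 2)
pi_low, pi_high = FIT - half_width, FIT + half_width
pi_range = pi_high - pi_low

print(f"t multiplier: {t_multiplier:.4f}")
print(f"95% PI: ({pi_low:.3f}, {pi_high:.3f}), range {pi_range:.3f}")
```

Removing the extreme observations lowered both S and SE Fit, which is why this interval is so much narrower than the Kitchen Sink Model's.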

After analyzing the players who had unusual data, we removed their extreme observations because they represent second-string players who did not get enough playing time to produce adequate data. This reduced the sample from 79 to 72 observations.

Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal

LnRate Without Extreme Observations Model: LnRate v. Avg, Pct, TD, Int, Lng

The regression equation is LnRate = 3.09 + 0.0890 Avg + 0.00984 Pct + 0.0103 TD - 0.0128 Int + 0.000813 Lng

Predictor       Coef    SE Coef      T      P    VIF
Constant      3.0950     0.1172  26.40  0.000
Avg          0.08904    0.01282   6.95  0.000  1.374
Pct         0.009838   0.001578   6.23  0.000  1.273
TD          0.010264   0.002170   4.73  0.000  2.910
Int        -0.012753   0.003362  -3.79  0.000  2.457
Lng        0.0008131  0.0008834   0.92  0.361  2.284

S = 0.114843 R-Sq = 77.2% R-Sq(adj) = 75.4%

Analysis of Variance

Source          DF       SS       MS      F      P
Regression       5  2.93998  0.58800  44.58  0.000
Residual Error  66  0.87047  0.01319
Total           71  3.81045


R2 decreased to 77.2% (from 84.8% in the previous model).
The p-values of our predictors are all 0.000 except for Lng, which decreased to 0.361.
The prediction interval, from e^(4.7826) = 119.4144 to e^(5.2914) = 198.6213, has a range of 79.207, nearly three times as large as in the previous model. Therefore the natural log does not improve the model.
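Putting the prediction intervals for Player X side by side makes the comparison concrete. The sketch below collects the intervals reported in the slides, back-transforms the two LnRate intervals to the Rate scale, and compares their ranges; the dictionary layout and names are illustrative.

```python
import math

# 95% prediction intervals for Player X as reported for each model.
# The flag marks intervals that are on the log scale and need exponentiating.
INTERVALS = {
    "Kitchen Sink": ((71.53, 200.74), False),
    "LnRate Stepwise": ((4.6450, 5.8692), True),
    "Without Extreme Observations": ((115.340, 143.019), False),
    "LnRate Without Extreme Observations": ((4.7826, 5.2914), True),
}

ranges = {}
for model, ((low, high), is_log_scale) in INTERVALS.items():
    if is_log_scale:
        low, high = math.exp(low), math.exp(high)
    ranges[model] = high - low

for model, value in sorted(ranges.items(), key=lambda item: item[1]):
    print(f"{model}: range {value:.3f}")

ratio = (ranges["LnRate Without Extreme Observations"]
         / ranges["Without Extreme Observations"])
print(f"LnRate range is {ratio:.2f} times the Rate range without extreme observations")
```

The Without Extreme Observations Model gives the narrowest interval by a wide margin, which is consistent with it being chosen as the best model.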

Independence: The plots do not appear to make a particular shape
Constant Variance: The plots look scattered
Mean Zero: The plots are mirrored above and below zero
Normal: Residual plot distribution is normal

Conclusion

Criteria for choice: The Without Extreme Observations Model improves both R2 and S, but the predictor Lng is no longer significant at the 5% level (p = 0.555). We feel that this does put a damper on our results. However, the increased reliability of the regression equation due to these improvements, together with the appearance of more constant variance compared to the Stepwise Model, leads us to conclude it is the best model.

*Many improvements can be made to this model. For instance, one could observe only starting quarterbacks. Because we needed 50 observations, this adjustment would not have met the project requirements.

Because the R2 is only 84.8% and S is 6.25, Hurney should use caution when using this model. The model does not take all the information into account.

Statistics For-The-Win!!!

Date post:	05-Dec-2014
Category:	Business
Upload:	maziar-mahboubian