ACCUPREDICT: A Method for Forecasting NASCAR

AVAILABLE BY SUBSCRIPTION AT FANTASYRACINGCHEATSHEET.COM

This report examines several driver performance measures and develops a method for predicting the finishing order of NASCAR Sprint Cup races. The author can be contacted at [email protected].

Cliff DeJong
1/9/2012

INTRODUCTION

Several years ago, I entered a friendly NASCAR fantasy league competition with my brother, who has

beaten me in just about everything. But, I’m a nerd and he is not, so I soon started to look at statistics

to improve my picks. It then became an obsession that has consumed untold hours of my time.

There is a lot of randomness in NASCAR. The plot below shows one of the better measures that I have

found for predictions. It shows the actual finish of each driver plotted against his average finish over the 18 races prior to that race, for the 2011 season (1,260 data points). Only finishes of 35 and better are

included. I have also shown the trendline as a summary of these data.

Figure 1

2011 Actual Finish Compared to Average of The Last 18 Races

The spread of the data is amazing, and it is not obvious that this can be useful. Yet there are tendencies

that are valuable since the data are clustered about the trendline. The important fact is that the order

of drivers in a specific race can be predicted in a meaningful way.

This paper will identify several metrics that will be used to forecast NASCAR outcomes. It will also

address how to combine these to get the best forecast. It is not a rigorous scientific paper, but is intended

to show the methods used in general terms.


The implementation of this method is available as ACCUPREDICT, on FantasyRacingCheatSheet.com. We

will also provide an estimate of fantasy points for NASCAR.com’s Fantasy Live Game on the website,

using similar methods but not presented here. That estimate was used without alteration to score 22nd

overall in 2011 out of several thousand competitors.

SUMMARY

NASCAR data from 1991 through 2011 are used to develop performance metrics. The key driver

performance measures identified here are

average finish over the last 15 races,

year-to-date driver rating,

finishes at the last eight tracks of the same type,

driver ratings at the same track for the last eight races,

practice and

starting position.

Driver Rating is the NASCAR Loop Driver Rating, a formula that combines wins, finishes, green flag

passes and several other driver performance measures.

In this paper, track types are examined and a regrouping of types is suggested by statistical

considerations. Restrictor plate races are scored by a subset of the key measures listed above.

Driver scores based on the above measures are correlated with the actual finishes for the 2011 season

with a value of 0.554. During the 2011 season, ACCUPREDICT achieved a correlation of 0.538.

DATABASE

Predictions of almost anything are either historically based, assuming the past repeats itself, or based on

first principles of physics, like your daily weather forecast. Predictions based on historical databases are

looking for similarities with the past: if a situation has come up before, what has happened and how

does that apply to this week’s race? In other words, if a driver has done well at a particular track in the

past, does this mean he will do well this weekend? Maybe… you can also consider how well he is doing

this year, and at similar tracks, and how he practiced and qualified.

For NASCAR, there is a rich dataset of past races: I have each race back to 1991 in my database with the

finishing positions of each driver. This database is from the LeonardFrye.com website, which is an

excellent source for NASCAR statistics. There are over 19000 data points. The database is in a

computer-readable form, not scattered over various web sites, so it is relatively easy to process. Plus,

each week in the season, and for past races, there are driver loop data, practice data and qualifying

results, and other data such as bonus points earned, laps led, etc. My primary source for these data is

FantasyRacingCheatSheet.com.

There are some very good expert picks available on the web at no cost and some better ones that cost a

subscription fee, including ACCUPREDICT, which is the result of this analysis. Not all of these expert


picks rank all the drivers; some only give a list of the top 5 or so drivers, and perhaps a dark horse or

two.

Success in the fantasy leagues often depends on how well the low-ranked drivers do. These guys are

necessary picks because of fantasy salary constraints. So, I wanted to be able to rank each driver… not

just get someone’s opinion on who would do well at the next track.

METRICS

A metric is a quantifiable measure of a driver’s performance. Metrics available each week for each

driver include

Performance in the last several races

Performance at the same track

Performance at the same type track

Practice

Qualifying

Expert opinions (cheat sheets)

Performance can be measured in two primary ways: finishing position and Driver Rating. Lots of other

data reflecting performance are also available, for example, laps led, fast laps, green flag passes, quality

passes, etc. These latter metrics are not as easy to process, but they are available on web sites such as

fantasyracingcheatsheet.com, and will be addressed for the 2011 season only.

Cheat sheets are opinions of experts, often based on unspecified statistics, and not used in this analysis.

I have found that other cheat sheets often do not score middle and lower ranked drivers.

DNFs or other major problems during a race can easily move a top ranked driver from a predicted top

five to a finish of 40th. I define a DNF as finishing behind anyone who does not complete the race—that

is a clear indication of a major problem, not just poor performance. Typical DNF rates are 15-20%. Since

DNFs are unpredictable, there is no obvious way to include them. Their effects on finishing position are

in the database that is used.
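The DNF rule above can be sketched in code. This is a minimal illustration, assuming race results arrive as (driver, finish position, completed-race) tuples; the data layout is mine, not the paper's database format.

```python
def flag_dnfs(results):
    """Flag DNFs per the rule in the text: a driver counts as having a
    major problem if he finished behind (or is) a driver who did not
    complete the race.

    `results` is a list of (driver, finish_position, completed) tuples,
    where `completed` is True if the driver ran the full distance.
    """
    # Best (lowest) finishing position among drivers who did not complete.
    non_completer_positions = [pos for _, pos, done in results if not done]
    if not non_completer_positions:
        return set()
    cutoff = min(non_completer_positions)
    # >= so the leading non-completer himself is flagged as well.
    return {drv for drv, pos, _ in results if pos >= cutoff}
```

With a 43-car field, typical races flag roughly 15-20% of drivers this way, matching the DNF rates quoted above.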

The process of how to combine the various metrics is a complex subject that takes serious effort but, as

will be seen, provides little gain beyond simple measures. The metrics are not independent—a driver

who has done well at a particular track has generally done well at the same track types, and he is likely

to practice well and qualify well.

STATISTICAL MEASURES

How do you measure the effectiveness of a metric or combinations of metrics? There are two primary

ways that I use: (1) correlation with the predicted finish and (2) the standard deviation of predicted


finish. Less frequently, I also use the likelihood that a higher-ranked driver will finish ahead of a lower-ranked driver.

Correlation is a standard statistical measure that essentially plots one variable (the actual finish, for

example) as a function of the other variable (the metric, practice speed, for example), and measures

how well a straight line will fit the data. Correlation ranges between -1 and 1, with the two extremes

indicating a perfect fit. A correlation of zero indicates that the result is independent of the metric. In

other words, a very low correlation indicates the metric is not a useful indicator of a driver’s finishing

position. I will show some plots later to make this a lot clearer. Correlation can also be expressed as a

percentage: -100% to 100%. Typical NASCAR values range from 30 to 50%, that is, there is a lot of randomness in NASCAR. The data shown in the introduction have a correlation of about 0.50, or 50%.
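The correlation described here is the standard Pearson formula; a minimal from-scratch version (any statistics library provides an equivalent) might look like this:

```python
import math

def correlation(xs, ys):
    """Pearson correlation between a metric and actual finishes.
    Ranges from -1 to 1; the extremes indicate a perfect straight-line
    fit, and a value near zero means the metric carries essentially no
    information about finishing position."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Multiplying the result by 100 gives the percentage form used in this report.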

A negative correlation means that as the metric gets larger, the actual finish gets smaller. Correlations

for NASCAR finishing positions are positive when past performance is measured by finishing positions,

that is, a small actual finish is expected when the average finish over the last several races is good (or

low). When performance is measured by Driver Rating, correlations are negative since a high driver

rating number implies a better driver and therefore a better predicted finish. In this paper, I deal only

with positive correlations by scaling the metrics—for example, the Driver Rating becomes a simple

ranking of the drivers, with the best driver scored a one, second best a two, etc.
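The rescaling just described, turning Driver Rating (where high is good) into a rank (where low is good), can be sketched as follows; the dictionary layout is illustrative.

```python
def rating_to_rank(ratings):
    """Convert Driver Ratings (higher is better) into ranks (1 = best),
    so that, like finishing position, a small number means a strong
    driver and correlations with finish come out positive.
    `ratings` maps driver -> Driver Rating."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {driver: i + 1 for i, driver in enumerate(ordered)}
```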

The standard deviation of the predicted finish is a measure of how accurate the prediction is. In

essence, it is a measure of how much you are wrong on average. Almost 70% of the data are within plus

or minus one standard deviation. It is larger than you might think: typical numbers are 9 to 10,

showing, again, a lot of variability in NASCAR. This is not at all unreasonable if you think about a DNF

rate of about 20%. A driver that finishes 1, 2, 3, 4 and 35 (due to an accident), will average only a 9th

place finish for these five races, despite four outstanding races. The relative average finishing positions

among drivers are the important point.
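A sketch of this standard deviation measure, together with the five-race averaging example from the text:

```python
import math

def prediction_sigma(predicted, actual):
    """Standard deviation of the prediction error: roughly how far off
    the predicted finish is on average; about 70% of actual finishes
    fall within one sigma of the prediction."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))

# The text's example: four excellent runs plus one wreck average to 9th.
average_finish = sum([1, 2, 3, 4, 35]) / 5  # 9.0
```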

Drivers will be ranked by a score, based on the metrics selected. The likelihood that a higher ranked

driver finishes ahead of a lower ranked driver is calculated by comparing each driver with every other

driver ranked below him. The percentage of correct rankings is then calculated, and averages about

70%. This percentage is higher if the difference in rankings is high, and less if differences are small. This

measure is not used often, since it is related closely to correlations.
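The pairwise comparison just described can be sketched as:

```python
def pairwise_accuracy(ranking, actual_finish):
    """Fraction of driver pairs where the higher-ranked driver actually
    finished ahead of the lower-ranked one (the text reports ~70%).
    `ranking` lists drivers from best to worst predicted score;
    `actual_finish` maps driver -> actual finishing position."""
    correct = total = 0
    for i, higher in enumerate(ranking):
        for lower in ranking[i + 1:]:
            total += 1
            if actual_finish[higher] < actual_finish[lower]:
                correct += 1
    return correct / total
```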

RECENT PAST PERFORMANCE-ALL RACES

One metric is how well a driver has done in the last several races, counting every track. If you use a

small number of races, you will measure how a particular driver has done lately and be able to react to a

driver on a hot streak, such as Tony Stewart at the end of the 2011 season (or Kyle Busch’s annual

collapse during the Chase). A small number of races will better reflect how a driver has improved with

time as well. On the other hand, using a large number of past races will not be sensitive to one bad race

caused, for example, by an accident, and will be a better estimate of how consistent a driver is.

Using the database from 1991 through 2011, I evaluated the correlation of actual finishing position to

the driver’s average finish over the last N races. If a driver was only in some of the last N races, the


average is over those races he was in. The plot below shows correlations for the entire database, and

with the years 2010 and 2011 separated out.
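The last-N average used here, including the rule for drivers who ran fewer than N races, can be sketched as follows; the race-log layout is illustrative, not the database's actual format.

```python
def last_n_average(race_history, driver, n):
    """Average finish over the driver's last n races; if he appeared in
    fewer than n of them, average over the races he was in, as the text
    describes. `race_history` is a chronological list of
    {driver: finish} dicts, one per race."""
    finishes = [race[driver] for race in race_history if driver in race]
    recent = finishes[-n:]
    return sum(recent) / len(recent) if recent else None
```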

Figure 2

Correlation for All Races Averages

The curves shown have similar shapes, and all show that averaging over the last 10 races or more gives

the best results. However, an obvious question is why 2010 and 2011 are so much better than the

entire span of data over the years from 1991 to 2011. I believe that this is due to the recent phenomenon of

start-and-park, where some drivers with little or no sponsorship will attempt to qualify and then only

run a few laps due to cost issues. Those drivers are almost certain to finish very poorly every race and

are therefore easy to predict. This raises the overall correlation for those races in a misleading manner.

Since the top 35 cars in owner points are locked into each race and do not start-and-park, I repeated the

calculations using only cars that finished in the top 35. This is shown in Figure 3 below. It would be

better to look only at drivers in the top 35 in points (those locked into the race and not likely to start-

and-park), but that is not readily available. It would take significant effort to add this to the database.


Figure 3

Correlation for All Races (Top 35 Finishers Only)

Again, the curve shapes are similar and correlations for 2011 are somewhat above the long-term

averages, but the differences are much smaller. There is rapid improvement as the number of races

averaged increases to about 12-15, with little or no improvement above that. Trying to read differences

of less than a percent is pushing the data beyond what is reasonable, so I have selected 15 as a

reasonable number to average over all races. I wanted as small a number as possible to preserve any

information about drivers on hot streaks.

RECENT PAST PERFORMANCE-SAME TRACK

Often a driver will excel at a particular track. Denny Hamlin, for example, has always done very well at

Pocono. Using the 1991-2011 data, I calculated correlations of actual finishes for all drivers with each driver’s

average finishing position for each track. Again, only the top 35 finishers are counted. Figure 4 shows

the results for three typical tracks: Phoenix, Atlanta and Daytona.


Figure 4

Representative Same Track Averages

Phoenix and Atlanta show typical curve shapes, with the correlations rising as the number of races

averaged gets larger and then flattening out. The curves peak at 6 or more races. Daytona, on the

other hand, has a very poor correlation, no matter how many races are averaged. There may be a few

drivers that have done well in the past and will do well at Daytona in the future, but in general, at

Daytona, past performance at that track does not imply continued success. Conversely, poor past

performance at Daytona does not necessarily imply another poor finish.

In Figure 5, I have taken each track’s correlation as a function of the number of races and averaged all

the tracks together to give the curve in red. The figure also shows in blue the average correlation for

averages from the most recent N races at any track (from Figure 3 above).


Figure 5

Average of Race Correlations at the Same Track vs All Tracks

The two curves have similar shapes: both start relatively low and then improve as the number of races

averaged increases. The correlation for averages at each individual track starts to level out above five or

six, while the average for the correlations using all tracks climbs more slowly, and peaks at around 15.

Consideration of each individual track’s correlation curve suggests that averaging eight races at the

same track gives very good performance for this measure. The table below in Figure 6 gives each track’s

correlation for performance averaged over the last eight races. I have also included the last four race

averages, since I have frequently used that in the past.

For almost all tracks, averaging over eight races improves correlation over the four-race averages, but

not by much. These correlations are also seen to be lower than the correlations over the last 15 races at

all tracks (see Figure 5). In other words, a driver’s average finish over the last 15 races at all tracks is a

better indicator of how well he will do at a particular track than his past finishes at the same track. Of

course, both performance measures will be used to estimate driver finishes. Notice also that some

tracks are not correlated well at all to past performance at that track: California, Chicago, Daytona,

Sonoma and Talladega.


Figure 6

Same Track Correlations

One possible explanation for the most recent 15 races at all tracks being a better indicator than the

most recent eight races at a specific track is the time interval covered. The last 15 races at any track is

almost half a season, or about half a year, while the last eight races at a specific track cover the past four or eight years of data, depending on whether one or two races are run there each season (such as the single yearly race at Chicago or the two races each year at Martinsville). I suspect that the reason for needing

to average over several races, even over several years at some tracks, is due to accidents or other

problems, like flat tires, that could skew a driver’s performance downward and distort his performance

unfairly. It is interesting to note that the need to use several races is more important than reflecting a

hot streak over a few races, that is, driver consistency for the long haul is more important.

RECENT PAST PERFORMANCE-SIMILAR TRACKS

One way to decrease the time interval for measurement of a driver’s performance is to look at

performance at similar tracks. For example, Jeff Gordon always does well at flat tracks, such as

Martinsville and Loudon. There are several races each year at flat tracks, so averaging over the past

eight races at flat tracks would only cover races during the past year and therefore would reflect more

recent performance, rather than requiring several years’ performance at an individual track.


Tracks can be grouped by several means: until now, the most accurate grouping that I have seen is

illustrated below in Figure 7. This grouping is based on track physical similarities of track length and

corner banking, and was made by Christopher Harris of ESPN in 2007. The ODD tracks are those that do

not fit nicely into the other categories, but have some similarities to others in the same ODD listing.

Figure 7

Traditional Track Groupings

The theory behind similar track groupings is that a driver’s performance at all tracks of the same type will be consistent.

To evaluate this, I looked at the 1991-2011 database and correlated each driver’s finish to his average

performance at previous races at tracks in the same grouping. For example, I looked at Bristol finishes

for each driver against the average finish over the last N races at any steep track. This may include

previous races at Bristol. When these averages for Bristol based on steep tracks are plotted against

averages over all steep tracks, you would expect similar shaped curves if Bristol is properly classified as a

steep track and the supposition of similar performance holds.
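The same-type averaging just described can be sketched as follows; the track-type table shown is a small illustrative sample, not the paper's full grouping, and the race-log layout is assumed.

```python
# Small illustrative sample of a track-type table (not the full grouping).
TRACK_TYPE = {"Bristol": "steep", "Dover": "steep", "Martinsville": "flat"}

def same_type_average(race_log, driver, track_type, n):
    """Mean finish over the driver's last n races at tracks of the given
    type (which may include the track being predicted, as noted above).
    `race_log` is a chronological list of (track, {driver: finish})
    tuples."""
    finishes = [results[driver] for track, results in race_log
                if TRACK_TYPE.get(track) == track_type and driver in results]
    recent = finishes[-n:]
    return sum(recent) / len(recent) if recent else None
```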

Figure 8 shows the results for all eight groupings of tracks. Curves are generally tightly clustered for Flat

Tracks, Steep Tracks and Restrictor Plate Tracks, and have similar shapes for Road Course Tracks. Curve

shapes are somewhat different for Shallow Tracks and Cookie Cutter Tracks. The ODD1 and ODD2

Tracks, plus Road Courses, have curves that are spread out. Kentucky is based on only one result so

little can be determined for that track. Note also that some of the tracks, such as Chicago, Watkins Glen

and the plate tracks are poorly correlated to other tracks in their categories. Most of the curves show their best correlation for an average taken over the past eight races at tracks in the same

category.


Figure 8

Traditional Track Groupings


Some tracks were regrouped because of these results. It was found that including the ODD tracks with

the Cookie Cutter Tracks was advantageous. This revised category was named Large Ovals and its

performance is shown in Figure 9. Again, the average over the last eight races is a good measure of

performance.

Figure 9

Large Oval Track Grouping

This shows a fairly tight clustering of the curves and improved correlations for each of the member

tracks with average finishes over the other tracks in the Large Ovals category. The exception is still

Kentucky, based on only one race, and therefore not a concern. The correlations for these tracks in the

new Large Oval category are shown below in Figure 10 for their previous grouping of track types and the

revised category of Large Oval. Results are for eight races averaged. All tracks in this new category

perform better and Chicago’s correlation is much improved and is now on a par with other tracks in this

category. Michigan and Atlanta are significantly improved as well.


Figure 10

Improvement in Regrouping as Large Oval Tracks

Indianapolis and Pocono, grouped as Shallow Tracks, offered only three races per year, and therefore

were grouped into the Flat Tracks category. Results are given in Figure 11 for eight-race averages, and

show improvement in all Flat Track correlations. Pocono in particular is better correlated with this new

grouping of Flat Tracks.


Figure 11

Revised Flat Track Category Performance

There are no obvious reclassifications for Road Course or Restrictor Plate Tracks. When the four of them

were grouped together as an experiment, the correlations improved for Plate Tracks, but the Road

Course results were worse than the original groupings. There does not appear to be any justification for

regrouping these tracks.

Figure 12

Revised Similar Track Categories


Below in Figure 13 are the correlation averages for the revised track types. The curves all increase with

the number of races averaged and plateau near eight races. The Large Oval and Expanded Flat Tracks

have the highest correlations, with Steep Tracks close behind. Road Course and Restrictor Plate Tracks

are relatively poorly correlated.

Figure 13

Averages for Revised Track Categories

PRACTICE

There are at least two practices each race weekend, except in the case of rainouts, and sometimes there

are more. The last practice is called Happy Hour. Happy Hour is sometimes held after qualifying but, in recent years, more often before it. When it is before qualifying, some drivers make mock

qualifying runs and those are generally faster than practice in race trim. So, performance in Happy Hour

may not be the best comparative measure of a driver’s upcoming performance. In addition, practice

speeds are sometimes measured as the fastest lap a driver has made, and sometimes as the best 10-lap

average speed for those drivers that run at least 10 consecutive laps. TV commentators have often commented on the value of the 10-lap averages. Average speeds for each practice session are also

available.

Practice is not usually intended to give the driver a rehearsal at a particular track, that is, to familiarize

him with the track itself, but serves as a means to dial in the car’s handling characteristics and to

understand how to adjust the car as the track changes. A driver who dominates in practice almost

always does well in the following race.


A comprehensive database of the various practice measures does not seem to be available, so I went

back over the 2011 season and put together various statistics for each race. The correlations of finishing

position to several practice measures are shown below.

Figure 14

Practice Measures Comparison for 2011

The bar labeled Happy Hour is a ranking of the fastest Happy Hour speeds, while Average HH is the

average speed during Happy Hour. The best 10-lap averages in Happy Hour are also shown. Because of the mixture

of race trim and qualifying trim during Happy Hour, I also looked at peak lap speeds in the practice just

prior to Happy Hour. This is often the first practice, which also serves to set the qualifying order and is

therefore important to the driver. As the chart shows, the fastest laps in the next to last practice are the

best measure of practice, and the correlation achieved of 42% is about the same as the other measures

discussed so far in this analysis.

QUALIFYING

Qualifying is important in multiple ways to a driver. It is an obvious measure of how fast a driver can run

a single lap. Pit stalls are selected by teams in the order of the qualifying results, and a good stall may give the

driver an easier (faster) entry and/or exit from his pit stall, and less chance of a pit road mishap. The

qualifying result is also the starting position and this can be very important at tracks where it is difficult

to pass.

The correlation of starting position to finish position is shown in Figure 15 for the last several seasons.

Some drivers are required to start at the end of the field because of an engine change, for example, and

those drivers are treated here by how they qualified.


Figure 15

Correlation of Finishing Position with Starting Position

It is not clear why the correlation has improved with time. Correlations for the latest season (2011) are

again about 40%, which is about the same as the other performance measures examined thus far.

OTHER METRICS

Thus far, we have looked at performance in past races in a number of ways: the past finish positions of

all recent races, races at the same track, and races at similar tracks. A revision of track types from

similar physical characteristics to those with similar statistics was assessed and found to be beneficial.

Practice speeds at the next to last practice are useful, and the starting position (or qualifying results) is

also valuable.

Each season is a new start for drivers and their teams, and may bring a new crew chief or even a new

team for some drivers. I have observed that each year has drivers that seem to do consistently better

(or worse) than expected, based solely on their past performance from previous years. As a

consequence, another measure that I have found to be useful is the current year-to-date standings of

the drivers. I do not use year-to-date statistics until after four races have been run. I have used point

standings and Driver Ratings in past years, and found that driver rankings based on Driver Ratings are a

little better to use. I have not done a formal analysis, but at the end of the 2010 season found that the

Driver Rating was correlated to average finish with a value of 0.93 while the correlation of points (after

taking out the Chase points adjustment) to average finish was 0.76. These are correlations to the final

values for the entire 2010 season and cannot be compared to other correlation measures in this report

that are for single races.


Driver Rating, as defined by NASCAR loop data, combines several measures of driver performance,

including green flag passes, green flag times passed, fast laps, laps led, and more. I collected several of

these measures during the 2011 season for assessment and will show these in the next section.

2011 SEASON ASSESSMENT OF SELECTED METRICS

For the 2011 season, a number of performance measures were collected for each race. The figure

below shows those measures and their correlations with finishing position.

Figure 16

2011 Driver Performance Measures and Correlations

Here are the definitions of each measure:

L18-F: Average of the finishing position of the last 18 races

L4-DR: Ranking of average Driver Rating for the last 4 races

YTD-DR: Ranking of average Driver Rating for the year to date

Not used in the first four races

SType-F: Average finish position for races at the same type track

This uses the traditional track groupings, before revisions above, and averages over 4-12

races for different tracks.

SType-DR: Average Driver Rating for races at the same type track, as above


SType-4F: Average finish position for the last four races at the same type track

STrack-F: Average finish position for races at the same track, over 2-11 races

STrack-DR: Ranking of average Driver Ratings at the same track, over 2-11 races

STrack-Pwr: Ranking of average Driver Ratings at the same track over the past two years

Start: Start Position, defined as qualifying results

Practice: Ranking of fast speeds in the next to the last practice

Bonus Points: Average of bonus points earned

Pass Dif: Average of green flag passes, less green flag times passed, over the last 2-11 races

Laps Led: Number of Laps Led at the same track, averaged over the last 2-11 races

Fast Laps: Number of Fast Laps at the same track, averaged over the last 2-11 races
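As one concrete example, the Pass Dif measure defined above might be computed as follows; the loop-data layout is an assumption for illustration, not NASCAR's raw format.

```python
def pass_differential(loop_data):
    """Pass Dif: green flag passes minus green flag times passed,
    averaged over the recent races supplied. `loop_data` is a list of
    (green_flag_passes, green_flag_times_passed) tuples, one per race."""
    diffs = [passes - passed for passes, passed in loop_data]
    return sum(diffs) / len(diffs)
```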

These metrics were developed as possibly useful in unpublished analyses of past seasons. I offer no

rationale for their selection; they have evolved over time. Some of the measures are much better than

others; the average finish over the last 18 races is the best, with year-to-date driver rating the second

best. Others, like the green flag pass differential, carry little information and have relatively poor correlations.

COMBINING METRICS

Given these 15 measures of driver performance, how can they be combined to give the best estimate of finishing position for each driver and race? The answer is far from obvious, because the measures are heavily correlated with one another: a driver who has finished well in the last 18 races is also likely to rank highly in the year-to-date Driver Ratings, and so on. If two measures are highly correlated, the second adds little information beyond the first. An additional complication is that not all measures are available for every driver; Trevor Bayne, for example, had no prior Sprint Cup history at Daytona before the 2011 season-opening race.

The goal is a simple method for combining selected measures. First, every measure must be transformed mathematically so that a small value indicates a likely good finish. The easiest way to do this is to fit each measure to average finish and then use the curve-fit value to represent the measure in question.
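The transformation step can be sketched as follows, using a simple linear fit as one possible curve; the sample measure and finish values are hypothetical:

```python
import numpy as np

# Hypothetical raw measure (e.g. start position) and actual finishes.
measure = np.array([1, 5, 10, 15, 20, 25, 30, 35], dtype=float)
finish = np.array([4, 8, 12, 16, 19, 24, 27, 33], dtype=float)

# Fit the measure against finish; here a degree-1 (linear) fit.
slope, intercept = np.polyfit(measure, finish, 1)

# The fitted value is the transformed measure: it lives on the
# "expected finish" scale, so a small number means a likely good finish.
transformed = slope * measure + intercept
```

Once every measure is mapped onto the same expected-finish scale, averaging them becomes meaningful.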

A score defined as the simple average of all the transformed measures gives a correlation of 0.538 with the actual finishes; the standard deviation of the estimated finishes for 2011, based on this simple average, is 9.45. A large number of perturbations on combinations of the measures were examined, and the best approach was to average L18-F, YTD-DR, SType-F, STrack-DR, Start, and Practice. This gave a score-to-actual-finish correlation of 0.550 and a standard deviation of 9.36 for the estimated finishes.
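The scoring step can be sketched as below, assuming the measures have already been transformed to the expected-finish scale, with NaN marking a measure that does not exist for a driver; all numbers are hypothetical:

```python
import numpy as np

# Hypothetical transformed measures for six drivers (rows) across the six
# chosen metrics (columns: L18-F, YTD-DR, SType-F, STrack-DR, Start,
# Practice); np.nan marks a measure unavailable for that driver.
measures = np.array([
    [ 5.0,  6.0,  4.0,    7.0,    3.0,    5.0],
    [10.0,  9.0, 11.0, np.nan,   12.0,    8.0],
    [15.0, 14.0, 16.0,   13.0, np.nan,   17.0],
    [20.0, 22.0, 19.0,   21.0,   23.0,   18.0],
    [25.0, 24.0, 26.0,   27.0,   23.0, np.nan],
    [30.0, 31.0, 29.0,   32.0,   28.0,   30.0],
])
actual_finish = np.array([3.0, 11.0, 13.0, 22.0, 26.0, 31.0])

# Score = simple average of whichever measures exist for each driver.
score = np.nanmean(measures, axis=1)

# Pearson correlation of the score with the actual finishes.
r = np.corrcoef(score, actual_finish)[0, 1]
```

Using `nanmean` mirrors the rule that a driver is scored on whatever measures exist for him.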

To determine the best possible fit of the metrics to the finishing positions, a multiple regression was calculated using all 15 metrics as inputs. Strictly speaking, regression is only valid for independent variables, and these are not independent; still, in practice, regression can be very useful even here. For data points missing some of the 15 metrics, the simple averages of the best combinations from the previous paragraph were used. This gave a correlation of 0.559 and a standard deviation of 9.29. The disadvantages of this approach, however, are its complexity and the fact that the regression is highly tuned to the 2011 season data; the regression for 2012 data will almost certainly be different. Plus, the 0.559 is not dramatically better than the 0.550 found by experiment.
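A multiple regression of this kind can be reproduced with ordinary least squares; this sketch uses synthetic correlated metrics rather than the real 2011 data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2011 data: 40 driver-race rows and 15 metrics
# that all share a common component, so they are heavily correlated.
n_rows, n_metrics = 40, 15
common = rng.normal(size=(n_rows, 1))
X = common + 0.5 * rng.normal(size=(n_rows, n_metrics))
finish = 10.0 * common[:, 0] + 18.0 + rng.normal(size=n_rows)

# Multiple regression: prepend an intercept column, solve least squares.
A = np.column_stack([np.ones(n_rows), X])
coef, *_ = np.linalg.lstsq(A, finish, rcond=None)
predicted = A @ coef

# In-sample correlation of the regression estimate with "finish".
r = np.corrcoef(predicted, finish)[0, 1]
```

The high in-sample correlation here illustrates the tuning concern: a regression on correlated predictors fits the season it was trained on very closely, with no guarantee it carries over to the next.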

There are other approaches to maximizing the correlation of a combination of correlated variables. For the statistics geeks: I tried Principal Component Analysis and a method from a paper by Keller and Olkin (you can Google these for more information). The required assumptions are only partially met, and the results were correlations of 0.550-0.551, not quite as good as the regression results. These methods have also been tried in earlier seasons, with similar results, and they share the drawbacks of the regression method.
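For the curious, extracting a combined score from the first principal component looks roughly like this; this is a generic PCA sketch on synthetic correlated data, not the exact method used here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated measures: 30 rows, 5 metrics sharing one factor.
common = rng.normal(size=(30, 1))
X = common + 0.4 * rng.normal(size=(30, 5))

# Standardize, then take the first principal component via SVD;
# the projection onto that component serves as the combined score.
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1_score = X @ Vt[0]
```

When the metrics all share one dominant factor, as here, the first component recovers essentially the same ordering as a simple average, which matches the modest gains reported above.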

Another interesting approach was to compare drivers metric by metric: if one driver was ranked ahead of another on more of the metrics, he was ranked higher in the combined ordering. The resulting rankings differed only slightly from the averages of the metrics, and performance was slightly worse.

For all approaches, the likelihood that a driver would finish ahead of a lower-ranked driver was calculated. It varied by race, but for all of the best approaches it averaged about 70%.
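This pairwise likelihood can be computed directly by counting, over all driver pairs, how often the driver with the better (lower) score actually finished ahead; a minimal sketch with hypothetical scores and finishes:

```python
from itertools import combinations

# Hypothetical predicted scores (lower = predicted better) and finishes.
score = [2.0, 5.0, 8.0, 11.0, 14.0, 17.0]
finish = [1, 3, 2, 6, 4, 5]

correct = total = 0
for i, j in combinations(range(len(score)), 2):
    total += 1
    # A correct call: the driver with the lower score finished ahead.
    if (score[i] < score[j]) == (finish[i] < finish[j]):
        correct += 1

likelihood = correct / total
```

In this toy example 12 of the 15 pairs are called correctly, a likelihood of 0.8; the real races land around 0.7.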

The approach of a simple average of the metrics was chosen. With it, the performance of the best combination is very poor for the restrictor-plate races, so those were split out. The best combination for the plate races was L18-F, YTD-DR, and Practice; their correlation improved from 0.243 to 0.318, and the finish standard deviation went from 11.5 to 11.2. This is still poor performance.

When the plate races and their best combination were merged back with the non-plate races and theirs, the final correlation of score with finish is 0.554, and the standard deviation is 9.29. The corresponding ACCUPREDICT results for 2011 were 0.535 and 9.53.

FINAL ACCUPREDICT METHOD FOR 2012

The method proposed here is somewhat better than the approach used in 2011. For each race, the top 35 drivers in points are identified, and their finishes over the last 15 races are averaged. Their year-to-date Driver Ratings are ranked. Similar track types are identified using the revised definitions from a previous section, and finishing position is averaged over the last eight races on those tracks. The average Driver Ratings over the last eight races at the same track are ranked. Speeds in the next-to-last practice are ranked, and the starting position is used. These six performance measures, or whichever of them exist for a given driver, are averaged, and the resulting score gives the expected finishing position via a simple curve fit to the 2011 data. For restrictor-plate races, the average of the last 15 races, the year-to-date Driver Ratings, and the practice rankings are used.
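The final scoring step might be sketched as below; the metric key names, the handling of missing measures, and the linear curve-fit coefficients are all hypothetical stand-ins for the actual 2011 fit:

```python
def accupredict_score(metrics, plate_race=False):
    """Average whichever of the selected measures exist for a driver.

    `metrics` maps hypothetical metric names to transformed values on
    the expected-finish scale (small = good).
    """
    keys = (["L15-F", "YTD-DR", "Practice"] if plate_race
            else ["L15-F", "YTD-DR", "SType-F", "STrack-DR", "Start", "Practice"])
    values = [metrics[k] for k in keys if k in metrics]
    return sum(values) / len(values)

# Hypothetical driver with no same-track history (no STrack-DR entry).
driver = {"L15-F": 8.0, "YTD-DR": 6.0, "SType-F": 10.0,
          "Start": 12.0, "Practice": 4.0}
score = accupredict_score(driver)        # average of the five available
expected_finish = 0.9 * score + 1.5      # hypothetical linear curve fit
```

The `plate_race` flag mirrors the split described above, reducing the combination to the three measures that work at the restrictor-plate tracks.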

Note that the proposed 2012 metrics differ slightly from the 2011-season metrics used to select the combinations. The last 15 races are used rather than the last 18, the track groupings into related track types have been revised, and the number of races averaged for the same-type and same-track metrics is fixed at eight, rather than the variable 2-14 races used in 2011. The rationale is that the numbers and groupings have been changed to improve correlations, and the metrics measure very similar information.


This new approach was applied to five sample races from 2011, one from each track type, and

correlations improved on average from 0.470 to 0.502.

A similar approach has been defined for NASCAR.com's Fantasy Live to predict fantasy points for each driver, and in 2012 it will be included in the ACCUPREDICT forecasts. The 2011 method finished 22nd overall out of several thousand competitors.

