Predicting BABIP

7/24/2019 Predicting BABIP

1/37

Yudelman 1

Predicting Batting Average of Balls in Play in Major League Baseball using StatCast Data

Adam Yudelman

ECN 215

Professor Allin Cottrell

11/23/2015



Table of Contents:

I. Introduction

II. Literature Review

III. Data

IV. Expectations

V. Modeling and Testing

VI. Interpretation and Conclusions

VII. Suggestions for Future Work

VIII. Appendix

a. Summary Statistics

b. Actual vs Predicted BABIP

c. R Script

IX. Works Cited



Introduction:

In Major League Baseball, teams are constantly searching for undervalued players. This process

is considered arbitrage, where players are signed to contracts that do not accurately represent their skill

level or value to a team. This search became popularized by the movie Moneyball, where the Oakland

Athletics target on-base percentage, a measure of how often a player reaches base given his plate

appearances, as an undervalued skill. The movie brought the idea of sports analytics to the forefront of

American pop culture as the movie received more than a handful of Oscar nominations (Moneyball).

Sports analytics crosses econometrics with sports. Baseball in particular is ripe for data analysis; every

pitch is recorded with a speed, location, and break, and every hit is described in a variety of ways, such as

how hard the ball was hit and where it landed, to properly describe the process and

outcome. Moreover, baseball data is almost entirely independent. A single pitcher pitches to a

single hitter. The outcomes depend almost entirely on those two players. However, in some cases, this

does not hold.

As more data is being collected, teams are looking for new types of arbitrage. One of the

greatest factors teams have to keep in mind is luck. Over a sample of 162 games, players can in fact get

lucky, improving their hitting statistics. For a hitter, this can mean a weak ground ball finding a gap in

between two players. The premise of luck is that a pitcher and hitter cannot control where a defensive

player is set up. Hitting the ball hard and on a line is the best outcome for a hitter, yet sometimes that

batted ball is hit directly at a defender; thus, the player's statistics, particularly his batting average

(defined as (total hits)/(total at bats)), are penalized with an out rather than a hit. Unfortunately, for a

long while, teams looked at batting average as the most important metric for determining a player's skill

level. However, as described above, batting average is dependent on the exogenous variable of

defensive positioning. In an attempt to improve the understanding of batting average, analysts came

up with the statistic Batting Average of Balls in Play (BABIP).



BABIP is defined as:

BABIP = (H - HR) / (AB - K - HR + SF)

where H = hits, HR = home runs, AB = at bats, K = strikeouts, and SF = sacrifice flies.

BABIP is the frequency with which a player gets a hit on a ball in play. It relies on

three factors: skill, defense, and luck. Skill is described as the ability to hit a ball hard and in a manner

that likely would result in a hit. Defense is the defensive positioning of the opponent, an aspect the

player cannot control. Luck is best exemplified by a weakly hit fly ball that lands right between two

players; it is an unintentional outcome. The idea is that a player who has one season with a high BABIP

compared to the rest of his career may have been the beneficiary of luck more so than an improvement

in skill (BABIP).
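As a minimal sketch of the definition above, the formula can be written as a small Python function. The function name and the season line in the example are assumptions made for illustration; they are not taken from the paper's dataset.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average of balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line, invented for illustration only.
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=100, sac_flies=5)
print(round(example, 3))  # 130 hits on 435 balls in play, about 0.299
```

Home runs are removed from both the numerator and the denominator because a ball hit over the fence is never in play for the defense.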

This paper investigates BABIP in a non-results based format. Essentially, the research intends to

predict BABIP using batted ball descriptors. This is important because batted ball descriptors are not

subject to luck or defense; they describe only how the ball comes off the bat. These descriptors range from a

subjective categorical method of how hard a ball is hit to the direction in which the ball is headed. A

good model that predicts BABIP (generally referred to as expected BABIP, or xBABIP) can isolate skill

against the luck factor; thus, large residuals against actual BABIP can identify players who are victims or

beneficiaries of this luck. For teams, identifying those unlucky players, and thus those likely

undervalued, can allow them to sign good players at a below-value price. This can be seen as two

separate markets: one for analytically based teams and one for non-analytically based teams. More

often than not, by using these underlying statistics, the analytically based team will be the one to

profit from arbitrage on the valuation difference between the two markets.

This paper in particular looks at the release of a new dataset and its potential effects on

predicting BABIP. Last year, Major League Baseball introduced a new ball tracking system called StatCast.



StatCast uses optical tracking technology to measure how fast, with what acceleration, and how far a

defender runs. In addition, and more importantly for this paper, StatCast tracks and publishes the exit

velocity of a baseball hit by the batter. Previous research into predicting a non-results based BABIP has

not used this new StatCast data, so this research intends to see if the data can be used to build a better

model for xBABIP.

Literature Review

The idea of creating an xBABIP is not new. The rise in popularity in baseball analytics has created

countless dedicated websites where individuals commit their free time to analyzing statistics in hopes of

better understanding the game. Three of the most reputable websites are FanGraphs, Baseball

Prospectus, and The Hardball Times. Searching the archives of these websites returns a bit of prior

research on the subject matter.

In 2008, Chris Dutton of The Hardball Times first delved into predictive BABIP. Popular

opinion at the time said that adding .120 to a player's line drive percentage (percentage of batted balls

categorized as line drives) could act as a proxy for what a player's BABIP should look like. Dutton rejected

this as a sufficient explanation, for other factors, such as speed and ability to control the strike zone,

seem to be variables that would also be relevant. He postulated that a quicker player would be able to

get a hit on a slow grounder to an infielder while a slower player, who hit the ball with the exact same

profile, would not get a hit.

Using data from 2002 to 2008, Dutton developed a model that took into account the batted ball

profile as well as some relevant metrics for the player's personal skill profile. His OLS regression found

positive and significant (at the 1% level) coefficients for a hitter's eye (defined as walks divided by strikeouts), line

drive percentage, speed score (to be discussed later on), and pitches per plate appearance (Dutton,

2008). He found negative coefficients for pitches per extra base hit, fly ball to ground ball ratio, spray (a



measure of how well a hitter disperses his hits all over the field), and contact rate (a metric that looks at

how well a player avoids strikeouts). Dutton also attempted to control for park effects, the year, and

whether a batter hits left-handed, right-handed, or both, but he found these indicator variables to be insignificant.

Dutton's model had an r-squared of .348. During out of sample testing, he found a correlation of 59%

between his xBABIP and actual BABIP. In comparison, the rudimentary formula of xBABIP = (.120 + LD%)

only had an r-squared of .03 and an out of sample correlation of 18% (Dutton, 2008).

Dutton's results very clearly show that a model for expected BABIP is necessary. Conventional

wisdom with the very simple model using only line drive percentage proved not very predictive at all.

Dutton isolated significant variables and his work is used as the backbone for other research on the

topic.

In 2010, Matt Swartz of Baseball Prospectus looked once again at batting average of balls in

play. Swartz noted that year-to-year BABIP only has a correlation of about .37. BABIP is highly influenced

by the type of batted ball, as defined by line drive percentage, outfield fly ball percentage, ground ball

percentage, and infield fly ball percentage. The following table shows the league average distribution,

the year-to-year correlation for this distribution, the league average BABIP, and the year-to-year

correlation for BABIP for each type of batted ball.

BABIP by Type of Batted Ball

Batted Ball Type     League Average   Distribution    Average BABIP   BABIP
                     Distribution     Year-to-Year    for Type        Year-to-Year
                                      Correlation     of Hit          Correlation
Line Drive           .21              .37             .730            .12
Outfield Fly Ball    .44              .72             .240            .22
Ground Ball          .35              .78             .170            .30
Infield Fly Ball     .11              .68             .020            .17



Interpreting the year-to-year correlations, batted ball type distribution is rather steady. This is to say if a

batter hits 30% fly balls one year, the numbers suggest that the batter will hit very close to that same

percentage the next. This is very important, for if a model is based on these batted ball type distributions,

a player has to have similar numbers year to year for the predictions to mean anything. In contrast, the

BABIP average year-to-year correlations suggest that there is far more uncertainty and inconsistency.

In addition to batted ball type, Swartz, like Dutton, also looked at the speed of the player.

Rather than using the speed score, Swartz used triples per at bat as a proxy because triples require a

component of speed to beat the ball to third base rather than just settle for a double.

Swartz then developed two models. The first used the weighted average of the previous three

years to predict the fourth year's BABIP. This model found positive and statistically significant

coefficients for line drive percentage, ground ball percentage, ground ball BABIP, infield hits per infield

chances, outfield fly ball BABIP, the natural log of home runs per at bat, and the natural log of contact

made per pitches swung at. The model then had one variable, infield fly ball percentage, with a negative

coefficient. This model had an r-squared of .31. For this paper's purpose, this model is not very helpful,

for the data being used is only from the 2015 season and does not include any of the previous seasons'

BABIP data. Also, given the BABIP year-to-year inconsistency, it is questionable to have used the

previous years' BABIP as a variable.

Swartz's second model looked only at the previous year's data, a model that is much more like

the one this paper attempts to build. Swartz finds that line drive percentage, ground ball percentage,

infield hits per infield chances, the natural log of home runs per at bat, outfield fly ball percentage, and

triples all have positive coefficients. Again, infield fly ball percentage has a negative coefficient. This

model has an r-squared of just .21 (Swartz, 2010). Looking at his approach, Swartz shows that batted ball



type is in fact an important variable when predicting BABIP; however, his models did not show

improvement on Dutton's previous research.

The most recent study on expected batting average of balls in play comes from Alex

Chamberlain of FanGraphs. Chamberlain's impetus for the research comes from the release of a few

new batted ball type statistics. Hard%, in conjunction with Medium% and Soft%, measures how often a

player hits a ball hard. Interestingly, Chamberlain notes that Hard% has almost no correlation with line

drive percentage; thus, Hard% captures well-hit ground balls as well as well-hit fly balls. True FB% is

defined as fly ball percentage minus infield fly ball percentage, and True IFFB% measures

infield fly balls per ball hit in play rather than per fly ball. Finally, Chamberlain also introduces Oppo%,

complemented by Pull% and Center%, which measures the percent of batted balls that are hit to the

opposite field. Using data from 2002 to 2014 (n = 1971), Chamberlain developed the following model:

xBABIP = .1975 - .4838*(True IFFB%) - .0914*(True FB%) + .2594*(LD%) + .1822*(Hard%) + .1198*(Oppo%) + .0042*(Speed Score)

The model has an adjusted r-squared of .456 as well as a year-to-year correlation of .4712 (Chamberlain,

2015).
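To make the model concrete, the following is a minimal Python sketch that evaluates the equation quoted above. The coefficients are the ones reported in the text; the function name, the argument names, and the example player profile are assumptions made for illustration. Percentage inputs are expected in decimal form (10% is .10).

```python
def chamberlain_xbabip(true_iffb, true_fb, ld, hard, oppo, speed_score):
    """Evaluate the xBABIP equation quoted from Chamberlain (2015)."""
    return (
        0.1975
        - 0.4838 * true_iffb
        - 0.0914 * true_fb
        + 0.2594 * ld
        + 0.1822 * hard
        + 0.1198 * oppo
        + 0.0042 * speed_score
    )


# Hypothetical batted ball profile, invented for illustration only.
x = chamberlain_xbabip(
    true_iffb=0.03, true_fb=0.31, ld=0.21, hard=0.30, oppo=0.25, speed_score=4.5
)
print(round(x, 3))  # about 0.313
```

The signs show the intuition directly: infield flies and fly balls pull the expectation down, while line drives, hard contact, opposite-field hitting, and speed push it up.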

Using these new statistics, this model has significantly more predictive power than the models

discussed previously. Batted ball type and speed remain very important factors in predicting BABIP over

all the research reviewed, and it seems that the new metrics, which further describe the batted ball

profile, improve the model. Given the new dataset upon which this paper intends to build a

model, this is a promising outcome. Because of this, this paper intends to use Chamberlain's model

as the archetype.



Data:

The data for this paper comes from FanGraphs.com and BaseballSavant.com. Both websites

have full data from the 2015 Major League Baseball season for their respective statistics. This paper

limits the sample to only players who qualified for the batting title, which requires 3.1 plate appearances

per game in the season. This translates out to at least 502 plate appearances for the entire season. Thus,

the sample is limited to the 141 qualified players from the 2015 season. The training set is a randomly

selected collection of 106 players and the testing set consists of the remaining 36 players.
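The sampling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the player labels and plate appearance counts are invented, the helper name split_players and the seed are hypothetical, and the paper's own split was not necessarily produced this way.

```python
import random

GAMES_PER_SEASON = 162
PA_PER_GAME = 3.1
# 3.1 * 162 = 502.2, which gives the 502 plate appearance threshold.
MIN_PA = round(PA_PER_GAME * GAMES_PER_SEASON)


def split_players(players, train_share=0.75, seed=0):
    """Shuffle the players and cut them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(players)
    rng.shuffle(shuffled)
    n_train = round(train_share * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical plate appearance totals, invented for illustration only.
plate_appearances = {"A": 650, "B": 480, "C": 502, "D": 610, "E": 530}
qualified = sorted(p for p, pa in plate_appearances.items() if pa >= MIN_PA)
train, test = split_players(qualified)
print(MIN_PA, len(train), len(test))
```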

The dataset currently has 43 columns representing the key (the player) and 42 descriptive

statistics. However, the research presented only builds an OLS model on a selected group of the

variables. The following explains each of these selected variables. The attached appendix also includes

the means, medians, and standard deviations for each statistic. Note that for all percentage metrics, this

paper will be using decimal format (i.e. 10% is represented as .10).

Batting Average of Balls in Play (BABIP):

Calculated as (Hits - Home Runs) / (At Bats - Strikeouts - Home Runs + Sacrifice Flies)

Batting Average of Balls in Play measures how often a ball put in play results in a hit. As

discussed before, BABIP incorporates talent, luck, and defense. No one with over 4,000 career

plate appearances (roughly six seasons) has ever had a BABIP of over .380, and a more

traditional mark of .350 indicates the best players in the league (BABIP). The following box

plot shows the wide spread of BABIP:



The most significant outlier is Albert Pujols at .217. This is a far departure from Pujols's career

average BABIP of .297, so even as he ages, it is unlikely that .217 is a representative measure for

his hitting skill.

Batted Ball Type: Line Drive Percentage (LD%), Groundball Percentage (GB%), Fly Ball Percentage (FB%),

and Infield Fly Ball Percentage (IFFB%):

Calculated as:

Line Drive Percentage = Line Drives / Balls in Play

Fly Ball Percentage = Fly Balls / Balls in Play

Ground Ball Percentage = Ground Balls / Balls in Play

Infield Fly Ball Percentage = Infield Fly Balls / Fly Balls
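A short Python sketch of these four ratios is given below. It assumes that infield fly balls are already counted among fly balls, so that balls in play are the sum of line drives, ground balls, and fly balls; the function name and the counts are invented for illustration.

```python
def batted_ball_profile(line_drives, ground_balls, fly_balls, infield_flies):
    """Return LD%, GB%, FB% (shares of balls in play) and IFFB% (share of fly balls)."""
    # Assumption: infield flies are a subset of fly_balls, not a fourth bucket.
    balls_in_play = line_drives + ground_balls + fly_balls
    return {
        "LD%": line_drives / balls_in_play,
        "GB%": ground_balls / balls_in_play,
        "FB%": fly_balls / balls_in_play,
        "IFFB%": infield_flies / fly_balls,
    }


# Hypothetical counts, invented for illustration only.
profile = batted_ball_profile(
    line_drives=90, ground_balls=190, fly_balls=150, infield_flies=12
)
print({k: round(v, 3) for k, v in profile.items()})
```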



These four metrics are grouped together, for they are all related. These are the four categorized

outcomes of a ball in play. LD%, FB%, and GB% sum to 1, while IFFB% measures a subset of

fly balls. The following shows the hit distribution for 2015 players (ordered by ascending GB%):

For the majority of players, GB% dominates the hit profile; however, line drive percentage is

rather steady across the board regardless of the other two metrics.

Batted Ball Direction: Pull Percent (Pull%), Center Percent (Cent%), and Opposite Field Percent (Oppo%):

Calculated as:

Pull% = Pulled Balls/Total Batted Balls

Cent% = Centered Balls/Total Batted Balls

Oppo% = Opposite Balls/Total Batted Balls

Batted ball direction metrics split the field into three equal 30-degree sections. A pulled ball is

one the batter hits towards the side of the plate from which he bats. For example, if a



right-handed hitter pulls the ball towards third base, it counts towards the Pull%. If the ball is hit

up the middle, it is counted towards the Cent%, and if the ball is hit towards the first baseman, it

is counted towards the Oppo%. For a left-handed hitter, a ball hit to the first base side counts

toward the Pull%, and if the ball is hit towards the third baseman, it is counted towards the

Oppo%. Similar to FB%, GB%, and LD%, these metrics sum to 1 for each player. The following,

courtesy of the FanGraphs page Batted Ball Direction, attempts to describe a player's hitting

style based on the distribution of his batted ball direction breakdown.

Batter Type Pull% Cent% Oppo%

Average .40 .35 .25

Extreme Pull .55 .25 .20

Extreme Oppo. .30 .30 .40

Of note, players want to have as balanced a distribution as possible so that defenses are not

able to position themselves heavily towards one side or the other.

Quality of Contact: Soft Hit Percentage (Soft%), Medium Hit Percentage (Med%), Hard Hit Percentage

(Hard%):

Quality of contact statistics are proprietary metrics from Baseball Info Solutions, and

have only recently been released to the public. While the exact formula is not known, it is

common knowledge that hang time, trajectory, and landing location factor into the calculation

(Quality of Contact Stats). Once again, these metrics sum to 1, so every batted ball is assigned

into one of the three buckets. The following shows the distribution for the qualified players in

2015.



Medium hit balls dominate every player's profile. Based on the research of Alex Chamberlain,

a higher hard hit percentage should lead to a higher BABIP. This study hopes to confirm that.

Speed Score (Spd%):

Speed score is also a proprietary metric. It attempts to capture both the speed and base running

ability of a player. The metric varies depending on the website, but FanGraphs uses a

combination of Stolen Base Percentage, Frequency of Stolen Base Attempts, Percentage of

Triples, and Runs Scored Percentage (Speed Score). The 2015 sample shows speed scores vary

quite a bit from player to player. The following is a box plot of the distribution:



The wide distribution makes intuitive sense. The nature of baseball is that some positions

require far more athleticism than others, so rosters have a variety of body types and athleticism.

From FanGraphs' own research on speed score over the years, the following, taken from their

Speed Score page, shows how one can rate a player's speed according to his score:

Rating Speed Score

Excellent 7.0

Great 6.0

Above Average 5.5

Average 4.5

Below Average 4.0

Poor 3.0

Awful 2.0



StatCast Data: Average miles per hour of a ball off the bat (AvgMPH), Average miles per hour of a ball

off the bat for a line drive and fly ball (AvgLD/FB MPH), Average miles per hour of a ball off the bat for

a ground ball (AvgGB MPH)

Before delving into each individual metric, each of which is rather self-explanatory, there must

be a discussion regarding the reliability of StatCast data. Research done at FanGraphs by Tony

Blengino looked at the limitations of the 2015 data. Blengino downloaded all the data from the

first half of the 2015 season and found that for 25.4% of batted balls, the batted ball velocity

was reported as NULL. Hence, for any batter, it is reasonable to assume that roughly one-fourth of the data is

missing. More troubling is the split among the missing and reported data. Blengino found that

the reported data was associated with a much higher batting average and slugging percentage (a measurement

of a player's ability to consistently get extra base hits) (Blengino, 2015). Digging

further, StatCast reported infield fly balls as NULL 56.3% of the time and often missed weak

ground balls (Blengino, 2015). This no doubt will have an effect on this paper's analysis and is

important to keep in mind.

Breaking down each of the statistics, the following is a line graph plotting each of the three

metrics:



As expected, line drives are hit at a higher speed than ground balls. However, there is a lot of

spread among the data. Regardless, this paper hopes that StatCast data can be used as a

complement, or even a substitute, to the quality of contact metrics.

Plate Discipline Statistics: Outside of the Strike Zone Swing Percentage (O-Swing%), Inside of the Strike

Zone Swing Percentage (Z-Swing%), Overall Swing Percentage (Swing%), and Swinging Strike Percentage

(SwStr%)

Calculated as:

O-Swing% = Swings at pitches outside of the strike zone / Total pitches outside of the strike zone

Z-Swing% = Swings at pitches inside of the strike zone / Total pitches inside of the strike zone

Swing% = Swings at pitches / Total pitches

SwStr% = Swings and misses / Total pitches
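The four ratios can be sketched in Python as below. The function name, argument names, and pitch counts are assumptions made for illustration and do not come from the paper's data.

```python
def plate_discipline(pitches_in_zone, pitches_out_zone,
                     swings_in_zone, swings_out_zone, swinging_strikes):
    """Compute O-Swing%, Z-Swing%, Swing% and SwStr% from raw pitch counts."""
    total_pitches = pitches_in_zone + pitches_out_zone
    total_swings = swings_in_zone + swings_out_zone
    return {
        "O-Swing%": swings_out_zone / pitches_out_zone,
        "Z-Swing%": swings_in_zone / pitches_in_zone,
        "Swing%": total_swings / total_pitches,
        "SwStr%": swinging_strikes / total_pitches,
    }


# Hypothetical season totals, invented for illustration only.
discipline = plate_discipline(
    pitches_in_zone=1200, pitches_out_zone=1300,
    swings_in_zone=800, swings_out_zone=400, swinging_strikes=225,
)
print({k: round(v, 3) for k, v in discipline.items()})
```

Note that Swing% is a pitch-weighted blend of O-Swing% and Z-Swing%, which is one reason the swing metrics move together.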

These metrics represent how well a player is able to control the strike zone. Balls pitched inside

the strike zone are easier to hit, so players with a high O-Swing% lack plate discipline. SwStr% is

a metric that captures a player's ability to make contact consistently. The data summary in the

appendix shows the league averages with rather small standard deviations; thus, players are

rather similar to one another on these metrics.

Contact Consistency Metrics: Contact Rate for Swings for Pitches Outside of the Strike Zone (O-

Contact%), Contact Rate for Swings for Pitches Inside the Strike Zone (Z-Contact%), Contact Rate for All

Swings (Contact%):

Calculated as:

O-Contact% = Contact made on pitches outside of the strike zone / Swings on pitches outside the zone

Z-Contact% = Contact made on pitches inside the zone / Swings on pitches inside the zone

Contact% = Contact made on a swing / Swings

The ability to avoid swinging at pitches outside of the zone is important because getting ahead

in the count (more balls than strikes) allows for hitters to expect pitches closer to the middle of

the zone. Having a high contact rate itself is indicative of a hitter's bat control.



Expectations:

The expectation for this paper is that this research will prove to be a more thorough

examination of predictive BABIP methods. Chamberlain's model proved to be the best method

examined, yet he openly admitted that he handpicked statistics that he thought would be helpful in

predicting BABIP. This research intends to look at all the batted ball profile statistics described above to

develop a thorough model complete with diagnostic tests. The expectation is that the StatCast data will

help the model by providing previously unused data. The model must be careful of collinearity, however,

for quality of contact and StatCast MPH are likely related. The model likely will see positive and

significant coefficients associated with LD%, Oppo%, HR/FB%, and Speed Score. Negative coefficients are

to be expected on FB%, GB%, Pull%, and Soft%. Plate discipline metrics are more difficult to project. O-

Swing% should have a negative coefficient, for hitting balls outside of the strike zone is difficult, and

SwStr% should be negative as well, for lots of missed swings are not likely to correlate with good contact

when the ball is eventually put in play. Z-Swing% should have a positive coefficient, for players are

swinging at strikes early and often, which are usually the easiest pitches to hit. For the StatCast data,

higher MPH should mean more well hit balls resulting in hits, so there should be a positive effect on

BABIP.

Modeling And Testing:

First, the data is split into a 75% training set and a 25% testing set. This leaves 106 players for

training the model and 36 for testing it. Because of the fear of collinearity, the first step to building this

model is making a correlation matrix of the concerned metrics:



By interpreting this matrix, it becomes clear that including certain combinations of metrics will

lead to the overfitting of the data. The following bullet points summarize the findings:

• GB% and FB% are highly correlated with each other, but not with LD%

• Hard% is highly correlated with Soft% and Med%

• Hard%/Med%/Soft% are highly correlated with the StatCast data

• StatCast data for GB and LD/FB are not correlated with each other, so both can be used in one model as long as the overall StatCast average is not used

• Pull% is correlated heavily with Med% and Oppo%, but Oppo% and Med% are not correlated with each other

• All the swing metrics are correlated

• All the contact metrics are correlated
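The screening step summarized in the bullets can be sketched as a pairwise Pearson correlation check. The sketch below uses only the Python standard library; the helper names, the 0.7 cutoff, and the five rows of data are assumptions invented for illustration, not the paper's correlation matrix.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_collinear(columns, threshold=0.7):
    """List metric pairs whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical values for five players, invented for illustration only.
columns = {
    "GB%": [0.50, 0.45, 0.40, 0.35, 0.48],
    "FB%": [0.30, 0.33, 0.40, 0.43, 0.30],
    "LD%": [0.20, 0.22, 0.20, 0.22, 0.22],
}
flagged = flag_collinear(columns)
for a, b, r in flagged:
    print(a, b, round(r, 2))
```

With these made-up numbers only the GB%/FB% pair is flagged, mirroring the first bullet: the two trade off against each other while LD% stays roughly flat.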



Given these findings, I intend to build two models. One model will use the Quality of Contact data and

the other will use StatCast data as a substitute. Comparing the two models should show whether the

release of StatCast data helps improve the predictive power of xBABIP.

The advantage of this paper's Quality of Contact Model will be the inclusion of several additional

metrics. The model attempts to predict BABIP using LD%, GB%, Oppo%, Hard%, Speed Score, O-Swing%,

Swing%, and Contact%. Using these metrics, it accounts for the following: batted ball type, batted ball direction,

quality of contact, player speed, player discipline, and player contact skills. Thinking through the metrics,

there does not seem to be any variable where diminishing effects would come about; thus, none of the

metrics are transformed. The following is the first model output:

The two insignificant variables, Swing% and Contact%, lead to an omit F-test where the null hypothesis is

β_Swing% = β_Contact% = 0. The following shows the regression output for the restricted model:



As we see, removing the two variables leaves only significant variables. However, the

omit F-test requires an F-statistic in order to decide whether to reject the null hypothesis. The following shows the anova

outputs for both models:

Calculation of the F-statistic finds a p-value just below .05, so we reject the null hypothesis. Thus,

Contact% and Swing% jointly improve the model, even if neither is individually significant.
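For reference, the F-statistic behind an omit test compares the residual sum of squares (RSS) of the restricted and unrestricted models: F = ((RSS_r - RSS_u) / q) / (RSS_u / (n - k)), where q is the number of omitted variables, n the number of observations, and k the number of estimated parameters in the unrestricted model. The Python sketch below is a minimal illustration; the RSS values and k = 9 are assumptions invented for the example, and only n = 106 comes from the training set size stated in the text.

```python
def omit_f_statistic(rss_restricted, rss_unrestricted, q, n, k):
    """F-statistic for H0: the q omitted coefficients are all zero.

    q: number of restrictions (variables dropped)
    n: number of observations
    k: parameters in the unrestricted model, including the intercept
    """
    if rss_restricted < rss_unrestricted:
        raise ValueError("restricted RSS cannot be below unrestricted RSS")
    numerator = (rss_restricted - rss_unrestricted) / q
    denominator = rss_unrestricted / (n - k)
    return numerator / denominator


# RSS values and k are hypothetical; n = 106 is the stated training size.
f_stat = omit_f_statistic(
    rss_restricted=0.080, rss_unrestricted=0.075, q=2, n=106, k=9
)
print(round(f_stat, 2))  # about 3.23
```

The statistic is then compared against an F distribution with q and n - k degrees of freedom to obtain the p-value.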



Given this result, the paper will go ahead with the unrestricted model as the Quality of Contact Model.

The following shows the residual plot:

The scatter is random, has no influential outliers, and is centered around zero, so the residuals are consistent with iid

errors. Furthermore, a heteroscedasticity test shows no signs of heteroscedasticity:

Moving on to the StatCast model, the variables remain the same except for the replacement of

Hard% with AvgLD/FB MPH. In theory, these metrics are measuring the same skill, so the results

should be analogous. The following shows the first output:



This initial output is very similar to the initial output of the Quality of Contact Model. Again, an omit F-test

is used to test the null hypothesis β_Swing% = β_Contact% = 0. Below is the reduced model:

And the Anova tables:



The omit F-test again gives a p-value below .05, so we can reject the null hypothesis.

Given this result, the research will continue with the unrestricted model. To evaluate the diagnostics,

below are the residual plot and heteroscedasticity test:



Again, the scatter is random, has no influential outliers, and is centered around zero, so the residuals

are consistent with iid errors. There is also no sign of heteroscedasticity.

Interpretation and Conclusions:

The coefficients for both models agree with each other. LD%, Oppo%, and Speed Score all

have positive coefficients. Unsurprisingly, LD% has the largest coefficient, which makes sense given how

high the batting average is on that type of hit. For Swing% and Contact%, both models have negative

coefficients. This was not expected, yet makes sense. If a player is swinging and making contact on the

majority of pitches, he is likely giving up waiting for a pitch in his preferred zone in favor of

hitting anything. If the player waits for a pitch in a certain zone, he is more likely to make better contact

and thus have a higher chance of getting a hit. The variables on which the models diverge, Hard% and

AvgLD/FB MPH, both have positive coefficients. This is not surprising. However, the StatCast

AvgLD/FB MPH is not as significant; thus, it is likely not as good a predictor as Hard%. To

compare the two models, the following are graphs that plot the actual BABIP against the predicted

BABIP using the 36 player testing set:



The plots act as a complement to the r-squared analysis of the two models. Both plots have fairly linear

relationships, which suggests good predictive power. Interestingly, the StatCast Model seems to be

consistently over-predicting BABIP by .02-.03. This may be a factor of the previously discussed data

issues associated with StatCast.
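A systematic over-prediction of this kind can be checked numerically with the mean signed error (predicted minus actual) alongside the root mean squared error. The Python sketch below uses the first three rows of Table 2 in the appendix purely as sample input; that table covers all players rather than only the testing set, so the printed numbers illustrate the calculation and are not the paper's test-set results. The helper names are assumptions.

```python
from math import sqrt

# (actual BABIP, Quality of Contact prediction, StatCast prediction),
# copied from the first three rows of Table 2.
rows = {
    "Lucas Duda": (0.285, 0.292, 0.297),
    "Jose Bautista": (0.237, 0.259, 0.266),
    "Todd Frazier": (0.271, 0.284, 0.275),
}


def mean_error(pairs):
    """Average of (predicted - actual); positive values mean over-prediction."""
    return sum(pred - actual for actual, pred in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared prediction error."""
    return sqrt(sum((pred - actual) ** 2 for actual, pred in pairs) / len(pairs))


qoc_pairs = [(actual, qoc) for actual, qoc, _ in rows.values()]
statcast_pairs = [(actual, sc) for actual, _, sc in rows.values()]

qoc_bias = mean_error(qoc_pairs)
statcast_bias = mean_error(statcast_pairs)
print(round(qoc_bias, 3), round(statcast_bias, 3))
print(round(rmse(qoc_pairs), 4), round(rmse(statcast_pairs), 4))
```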

Overall, the models and research show that the Quality of Contact Model is better than the StatCast

Model. This is evident in the significance of the variables, the adjusted r-squared values, and the results of the

testing set. The Quality of Contact Model explains 44.1% of the variation in BABIP while the StatCast

model explains 40.1% of the variation. This is a disappointing finding. StatCast was announced to much

excitement, yet the data quality issues seem to cloud its ability to be a truly useful dataset. This is in no

way a damning statement for the future of StatCast; the results of the StatCast Model are indeed

promising. Despite the small sample size, it is valuable to have been able to confirm and improve upon

(ever so slightly) the previous research on the topic. To further examine the results of the models, the

actual and predicted values for BABIP for all 141 players are attached in the appendix as Table 2. The full

dataset used is available digitally.

Suggestions for Future Work:

In several years, StatCast data should be more reliable, creating a much more thorough and

comprehensive dataset. At that point, this research should be repeated. Moreover, more data on

exogenous variables, such as defensive positioning, should become available in the next few years, so

adding more variables to a larger training set could help the predictive value.

There is also a completely different approach to think about predicting BABIP. Rather than use

player averages over an entire season to try to predict BABIP, a different attempt at modeling could look at

each at bat individually. Given the depth of StatCast data and the adjoining exit angle, defensive

positioning, and hang time data, a logit model should be able to give a value for whether a batted ball

will become a hit. Averaging the results of this model over an entire season could give an expected



BABIP devoid of luck. There is no telling whether this model would be much better, but it is an

alternative worth looking at.
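A rough Python sketch of that alternative is shown below: a logistic function turns each batted ball's descriptors into a hit probability, and the season-long average of those probabilities serves as the expected BABIP. Everything here is an assumption made for illustration. The coefficients are invented rather than estimated, only exit velocity and exit angle are used, and a real model would be fit to data and would likely include more inputs such as hang time and defensive positioning.

```python
from math import exp

# Hypothetical coefficients, NOT estimated from any data.
HYPOTHETICAL_COEF = {"intercept": -6.0, "exit_velocity": 0.06, "exit_angle": -0.01}


def hit_probability(exit_velocity_mph, exit_angle_deg, coef=HYPOTHETICAL_COEF):
    """Logit model: probability that a single batted ball becomes a hit."""
    z = (
        coef["intercept"]
        + coef["exit_velocity"] * exit_velocity_mph
        + coef["exit_angle"] * exit_angle_deg
    )
    return 1.0 / (1.0 + exp(-z))


def expected_babip(batted_balls, coef=HYPOTHETICAL_COEF):
    """Average the per-ball hit probabilities over a set of batted balls."""
    probs = [hit_probability(ev, angle, coef) for ev, angle in batted_balls]
    return sum(probs) / len(probs)


# Hypothetical (exit velocity, exit angle) pairs, invented for illustration.
season = [(95.0, 12.0), (80.0, -5.0), (102.0, 20.0)]
season_xbabip = expected_babip(season)
print(round(season_xbabip, 3))
```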



Appendix

Table 1: Summary Statistics for Metrics

Metric Mean Median Standard Deviation

BABIP 0.30893662 0.309 0.033764557

LD% 0.212 0.212 0.028549438

GB% 0.443767606 0.447 0.068567399

FB% 0.34421831 0.3495 0.069931366

IFFB% 0.086190141 0.086 0.043866303

HR/FB 0.120309859 0.1155 0.059034626

IFH% 0.066866197 0.0595 0.035786776

Pull% 0.397605634 0.394 0.059530316

Cent% 0.350443662 0.3495 0.034337908

Oppo% 0.252077465 0.246 0.042932649

Soft% 0.169485915 0.167 0.036206905

Med% 0.525901408 0.525 0.038956333

Hard% 0.304753521 0.306 0.056165649

Spd 4.102816901 3.95 1.684098943

ABs With Data 320.2112676 322.5 44.25579153

Avg - MPH 89.23725352 89.38 2.318449623

Avg - FB/LD MPH 92.27697183 92.39 2.636167724

Avg - GB MPH 86.94746479 87 2.551268791

O-Swing% 0.317943662 0.313 0.057985765

Z-Swing% 0.676295775 0.678 0.060572828

Swing% 0.477323944 0.479 0.051245962

SwStr% 0.091823944 0.087 0.030332268



Table 2: Actual BABIP vs Predicted

Name Actual BABIP Quality of Contact Prediction StatCast Prediction

Lucas Duda 0.285 0.292 0.297

Jose Bautista 0.237 0.259 0.266

Todd Frazier 0.271 0.284 0.275

Brian McCann 0.235 0.248 0.254

Brandon Moss 0.285 0.286 0.282

Kris Bryant 0.378 0.314 0.313

Edwin Encarnacion 0.267 0.277 0.283

Jay Bruce 0.251 0.293 0.287

Justin Upton 0.304 0.302 0.303

Brian Dozier 0.261 0.280 0.291

Nolan Arenado 0.284 0.293 0.286

Asdrubal Cabrera 0.306 0.284 0.291

Anthony Rizzo 0.289 0.295 0.296

J.D. Martinez 0.339 0.319 0.308

Chris Davis 0.319 0.308 0.311

Aramis Ramirez 0.253 0.272 0.265

Carlos Beltran 0.297 0.275 0.280

Jimmy Rollins 0.246 0.272 0.279

Joc Pederson 0.262 0.285 0.292

Mookie Betts 0.31 0.296 0.298

Albert Pujols 0.217 0.263 0.259

Curtis Granderson 0.305 0.319 0.325

Matt Carpenter 0.321 0.327 0.331

Mike Moustakas 0.294 0.282 0.279

Derek Norris 0.31 0.275 0.280

David Ortiz 0.264 0.303 0.296

Kyle Seager 0.278 0.299 0.300

Trevor Plouffe 0.274 0.288 0.287

Addison Russell 0.324 0.281 0.289

Ian Kinsler 0.323 0.299 0.301

Logan Forsythe 0.323 0.290 0.296

Josh Reddick 0.278 0.293 0.309

Evan Longoria 0.309 0.291 0.301

Stephen Vogt 0.29 0.280 0.291

Nick Castellanos 0.322 0.314 0.314

Mark Trumbo 0.313 0.301 0.302

Bryce Harper 0.369 0.315 0.306

Logan Morrison 0.238 0.283 0.280



Marcus Semien 0.312 0.308 0.318

Alex Rodriguez 0.278 0.289 0.292

Manny Machado 0.297 0.296 0.296

Mike Trout 0.344 0.346 0.348

Andrew McCutchen 0.339 0.327 0.325

Josh Donaldson 0.314 0.307 0.306

Yoenis Cespedes 0.323 0.311 0.309

Brandon Belt 0.363 0.355 0.347

Evan Gattis 0.264 0.291 0.292

Salvador Perez 0.27 0.263 0.269

Carlos Santana 0.261 0.277 0.289

Yangervis Solarte 0.279 0.280 0.269

Wilmer Flores 0.273 0.288 0.285

Troy Tulowitzki 0.331 0.309 0.298

Freddy Galvis 0.309 0.307 0.309

Charlie Blackmon 0.325 0.328 0.322

Neil Walker 0.306 0.304 0.301

Kevin Pillar 0.306 0.304 0.303

Adrian Gonzalez 0.294 0.319 0.316

Ryan Howard 0.272 0.321 0.322

Carlos Gonzalez 0.284 0.294 0.288

Dexter Fowler 0.308 0.309 0.316

Adam Jones 0.286 0.288 0.280

Marlon Byrd 0.297 0.314 0.308

Daniel Murphy 0.278 0.299 0.296

Jose Reyes 0.301 0.276 0.278

Adrian Beltre 0.295 0.308 0.303

Prince Fielder 0.323 0.292 0.287

Kole Calhoun 0.304 0.303 0.309

Jose Altuve 0.329 0.280 0.272

Matt Kemp 0.311 0.333 0.309

Adam Lind 0.309 0.303 0.294

Paul Goldschmidt 0.382 0.352 0.353

Gregory Polanco 0.308 0.313 0.313

Kendrys Morales 0.319 0.304 0.303

Mitch Moreland 0.317 0.301 0.303

Torii Hunter 0.258 0.282 0.280

Chris Owings 0.305 0.340 0.334

Chris Coghlan 0.284 0.323 0.318

Didi Gregorius 0.297 0.292 0.297

Angel Pagan 0.31 0.304 0.310



Nelson Cruz 0.35 0.315 0.320

Brett Gardner 0.312 0.318 0.334

Buster Posey 0.32 0.313 0.299

Brandon Crawford 0.294 0.310 0.307

Matt Duffy 0.336 0.258 0.277

Russell Martin 0.262 0.296 0.304

Kolten Wong 0.296 0.311 0.307

Joey Votto 0.371 0.342 0.342

Brett Lawrie 0.32 0.304 0.310

Miguel Cabrera 0.384 0.352 0.349

Ben Zobrist 0.288 0.287 0.295

Pablo Sandoval 0.27 0.290 0.290

Jose Abreu 0.333 0.318 0.313

Yadier Molina 0.295 0.301 0.297

Elvis Andrus 0.283 0.308 0.309

Jace Peterson 0.296 0.315 0.320

Michael Taylor 0.311 0.331 0.333

Michael Brantley 0.318 0.310 0.307

Billy Butler 0.282 0.299 0.303

Lorenzo Cain 0.347 0.345 0.339

Jhonny Peralta 0.311 0.313 0.307

Ian Desmond 0.307 0.308 0.313

Ryan Braun 0.322 0.346 0.334

Chase Headley 0.317 0.308 0.317

Brandon Phillips 0.315 0.332 0.327

Martin Prado 0.313 0.312 0.319

Alcides Escobar 0.286 0.316 0.316

Melky Cabrera 0.297 0.314 0.316

Gerardo Parra 0.325 0.336 0.330

Kevin Kiermaier 0.306 0.320 0.322

Odubel Herrera 0.387 0.339 0.344

Alexei Ramirez 0.264 0.284 0.285

A.J. Pollock 0.338 0.337 0.324

Starlin Castro 0.298 0.277 0.277

Shin-Soo Choo 0.335 0.318 0.326

Billy Burns 0.339 0.314 0.317

Jason Kipnis 0.356 0.347 0.353

Adam Eaton 0.345 0.342 0.344

Nick Markakis 0.338 0.313 0.314

Francisco Cervelli 0.359 0.325 0.324

Avisail Garcia 0.32 0.330 0.327



David Peralta 0.368 0.337 0.333

Matt Duffy 0.336 0.342 0.337

Erick Aybar 0.3 0.300 0.293

Ender Inciarte 0.329 0.337 0.328

Xander Bogaerts 0.372 0.336 0.333

Robinson Cano 0.316 0.325 0.322

Anthony Gose 0.352 0.349 0.350

Wilson Ramos 0.256 0.302 0.314

Austin Jackson 0.342 0.344 0.342

Eric Hosmer 0.336 0.349 0.349

Jean Segura 0.298 0.324 0.322

Jason Heyward 0.329 0.327 0.322

Brock Holt 0.35 0.336 0.335

Yunel Escobar 0.347 0.323 0.323

Starling Marte 0.333 0.338 0.337

Andrelton Simmons 0.285 0.304 0.306

Joe Mauer 0.309 0.347 0.354

Cameron Maybin 0.316 0.331 0.345

DJ LeMahieu 0.362 0.379 0.385

Ben Revere 0.338 0.335 0.339

Dee Gordon 0.383 0.335 0.334

Christian Yelich 0.37 0.364 0.361
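One way to compare the two prediction columns in Table 2 is to compute an error measure for each. The sketch below uses only the first five rows of the table, so it says nothing about overall accuracy. The helper names RMSE and MAE are illustrative, and the paper's own evaluation metric may differ.

```r
# First five rows of Table 2 (Actual BABIP vs Predicted)
Comparison <- data.frame(
  Name             = c("Lucas Duda", "Jose Bautista", "Todd Frazier",
                       "Brian McCann", "Brandon Moss"),
  Actual           = c(0.285, 0.237, 0.271, 0.235, 0.285),
  QualityOfContact = c(0.292, 0.259, 0.284, 0.248, 0.286),
  StatCast         = c(0.297, 0.266, 0.275, 0.254, 0.282)
)

# Root mean squared error and mean absolute error (helper names are illustrative)
RMSE <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
MAE  <- function(actual, predicted) mean(abs(actual - predicted))

# One row per model, computed against the same actual values
Errors <- data.frame(
  Model = c("Quality of Contact", "StatCast"),
  RMSE  = c(RMSE(Comparison$Actual, Comparison$QualityOfContact),
            RMSE(Comparison$Actual, Comparison$StatCast)),
  MAE   = c(MAE(Comparison$Actual, Comparison$QualityOfContact),
            MAE(Comparison$Actual, Comparison$StatCast))
)
print(Errors)
```

Extending Comparison to every row of the table would give the same measures for the full set of hitters.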

R Script:

attach(Data)

#Diagnostic Plots

#Box Plot for BABIP

boxplot(Data$BABIP, main = "BABIP in 2015 for Qualified Hitters")

#Line Graph for Batted Ball Types

plot(Data$GB., type = "b", col = "blue", lwd = 2, ylim = c(0, 1),
     ylab = "Percentage of Hits", xlab = "Player",
     main = "Hit Distribution in 2015 for Qualified Hitters")

lines(Data$FB., col = "red", type = "b", lwd=2)

lines(Data$LD., col = "green", type = "b", lwd=2)

legend("topright",legend=c("GB%","FB%","LD%"),

lty=1,lwd=2,pch=21,col=c("Blue","Red", "Green"),

ncol=2,bty="n",cex=0.8,

text.col=c("blue","red","green"),

inset=0.01)



#Line Graph for Quality of Contact

plot(Data$Soft., type = "b", col = "blue", lwd = 2, ylim = c(0, 1),
     ylab = "Percentage of Hits", xlab = "Player",
     main = "Quality of Contact Distribution in 2015 for Qualified Hitters")

lines(Data$Med., col = "red", type = "b", lwd=2)

lines(Data$Hard., col = "green", type = "b", lwd=2)

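# Legend keys use the same colors and line width as the plotted series
legend("topright", legend = c("Soft%", "Med%", "Hard%"),
       lty = 1, lwd = 2, col = c("blue", "red", "green"),
       inset = 0.01)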

#Box Plot for Speed Score

boxplot(Data$Spd, data = Data, main = "Speed Score in 2015 for Qualified Hitters")

#Line Graph for StatCast data

plot(Data$Avg...GB.MPH, type = "b", col = "blue", lwd = 2, ylim = c(80, 115),
     ylab = "Average MPH", xlab = "Player",
     main = "StatCast Data off the Bat in 2015 for Qualified Hitters")

lines(Data$Avg...FB.LD.MPH, col = "red", type = "b", lwd=2)

lines(Data$Avg...MPH, col = "green", type = "b", lwd=2)

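# Legend keys use the same colors and line width as the plotted series
legend("topright",
       legend = c("Average MPH for Ground Balls",
                  "Average MPH for Line Drives and Fly Balls",
                  "Average MPH Overall"),
       lty = 1, lwd = 2, col = c("blue", "red", "green"),
       inset = 0.01)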

Data$HR.FB



Data$Oppo., Data$O.Swing., Data$Z.Swing., Data$SwStr., Data$O.Contact.,Data$Z.Contact.,

Data$Contact.)

M



plot(TestingSet$BABIP, predict(QualityOfContactModel, TestingSet),
     main = "Quality of Contact Model Against Actual")

#Creating Final Dataset

FinalDataSet



Works Cited

BABIP. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/pitching/babip/

Batted ball direction. (n.d.). Retrieved November 19, 2015, from

http://www.fangraphs.com/library/offense/batted-ball-direction/

Blengino, T. (2015, August 6). The limitations of the 2015 statcast data. Retrieved November 19,

2015, from http://www.fangraphs.com/blogs/the-limitations-of-the-statcast-data/

Chamberlain, A. (2015, May 6). New hitter xBABIP based on BIS batted ball data. Retrieved November

19, 2015, from

http://www.fangraphs.com/fantasy/new-hitter-xbabip-based-on-bis-batted-ball-data/

Dutton, C. (2008, December 2). Batters and BABIP. Retrieved November 19, 2015, from

http://www.hardballtimes.com/batters-and-babip/

Moneyball. (n.d.). Retrieved November 19, 2015, from Internet Movie Database:

http://www.imdb.com/title/tt1210166/

Quality of contact stats. (n.d.). Retrieved November 19, 2015, from

http://www.fangraphs.com/library/offense/quality-of-contact-stats/

Spd. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/offense/spd/

Swartz, M. (2010, March 23). Ahead in the count: Predicting BABIP, part 1. Retrieved November 19,

2015, from http://www.baseballprospectus.com/article.php?articleid=10333
