Predicting BABIP

Date post: 20-Feb-2018
Upload: adam-yudelman

  • 7/24/2019 Predicting BABIP

    1/37

    Yudelman 1

    Predicting Batting Average of Balls in

    Play in Major League Baseball using StatCast Data

    Adam Yudelman

    ECN 215

    Professor Allin Cottrell

    11/23/2015

    Table of Contents:

    I. Introduction

    II. Literature Review

    III. Data

    IV. Expectations

    V. Modeling and Testing

    VI. Interpretation and Conclusions

    VII. Suggestions for Future Work

    VIII. Appendix

    a. Summary Statistics

    b. Actual vs Predicted BABIP

    c. R Script

    IX. Works Cited

    Introduction:

    In Major League Baseball, teams are constantly searching for undervalued players. This process

    is considered arbitrage, where players are signed to contracts that do not accurately represent their skill

    level or value to a team. This search became popularized by the movie Moneyball, where the Oakland

    Athletics target on-base percentage, a measure of how often a player reaches base given his plate

    appearances, as an undervalued skill. The movie brought the idea of sports analytics to the forefront of

    American pop culture as the movie received more than a handful of Oscar nominations (Moneyball).

    Sports analytics crosses econometrics with sports. Baseball in particular is ripe for data analysis; every

    pitch is recorded with a speed, location, and break, and every hit is defined in a variety of ways, such as

    how hard the ball was hit and where it landed, to properly describe the process and

    outcome. Moreover, baseball data is almost entirely independent. A singular pitcher pitches to a

    singular hitter. The outcomes depend almost entirely on those two players. However, in some cases, this

    does not hold.

    As more data is being collected, teams are looking for new types of arbitrage. One of the

    greatest factors teams have to keep in mind is luck. Over a sample of 162 games, players can in fact get

    lucky, improving their hitting statistics. For a hitter, this can mean a weak ground ball finding a gap in

    between two players. The premise of luck is that a pitcher and hitter cannot control where a defensive

    player is set up. Hitting the ball hard and on a line is the best outcome for a hitter, yet sometimes that

    batted ball is hit directly at a defender; thus, the player's statistics, particularly his batting average

    (defined as (total hits)/(total at bats)), are penalized with an out rather than a hit. Unfortunately, for a

    long while, teams looked at batting average as the most important metric for determining a player's skill

    level. However, as described above, batting average is dependent on the exogenous variable of

    defensive positioning. In an attempt to improve the understanding of batting average, analysts came

    up with the statistic Batting Average of Balls in Play (BABIP).

    BABIP is defined as:

    BABIP = (H - HR)/(AB - K - HR + SF)

    where H = hits, HR = home runs, AB = at bats, K = strikeouts, and SF = sacrifice flies.

    BABIP is, in words, the frequency with which a player gets a hit on a ball in play. It relies on

    three factors: Skill, defense, and luck. Skill is described as the ability to hit a ball hard and in a manner

    that likely would result in a hit. Defense is the defensive positioning of the opponent, an aspect the

    player cannot control. Luck is best exemplified by a weakly hit fly ball that lands right between two

    players; it's an unintentional outcome. The idea is that a player who has one season with a high BABIP

    compared to the rest of his career may have been the beneficiary of luck more so than an improvement

    in skill (BABIP).
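
    As a minimal sketch of the definition above, the calculation can be written as a small function. The function name, argument names, and the season line in the example are illustrative assumptions and are not taken from the dataset.

```python
def babip(hits, home_runs, at_bats, strikeouts, sacrifice_flies):
    """Batting average of balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sacrifice_flies
    if balls_in_play <= 0:
        raise ValueError("the player has no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line (assumed values, not a real player).
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=100, sacrifice_flies=5)
print(round(example, 3))  # 130 / 435, about 0.299
```

    Home runs are removed from both the numerator and the denominator because they are not fieldable, while sacrifice flies are added back because they are balls in play that do not count as at bats.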

    This paper investigates BABIP in a non-results based format. Essentially, the research intends to

    predict BABIP using batted ball descriptors. This is important because batted ball descriptors are not

    subject to luck or defense; they describe only how the ball comes off the bat. These descriptors range from a

    subjective categorical method of how hard a ball is hit to the direction in which the ball is headed. A

    good model that predicts BABIP (generally referred to as expected BABIP, or xBABIP) can isolate skill

    against the luck factor; thus, large residuals against actual BABIP can identify players who are victims or

    beneficiaries of this luck. For teams, identifying those unlucky players, and thus those likely

    undervalued, can allow them to sign good players at a below-value price. This can be seen as two

    separate markets: one for analytically based teams and one for non-analytically based teams. More

    often than not, by using these underlying statistics, the analytically based team will be the one to

    commit arbitrage through the two market valuation difference.
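
    The residual idea described above can be sketched as follows: subtract a player's expected BABIP from his actual BABIP and sort the differences. The function name and the player labels and values in the example are hypothetical and only illustrate the mechanics.

```python
def luck_residuals(actual, expected):
    """Return (player, actual - expected) pairs, most unlucky first.

    A large negative residual suggests a player who hit better than his
    results show; a large positive residual suggests a beneficiary of luck.
    """
    residuals = {
        player: actual[player] - expected[player]
        for player in actual
        if player in expected
    }
    return sorted(residuals.items(), key=lambda item: item[1])


# Hypothetical players and values, for illustration only.
actual_babip = {"Player A": 0.280, "Player B": 0.340, "Player C": 0.305}
expected_babip = {"Player A": 0.310, "Player B": 0.300, "Player C": 0.303}

for player, residual in luck_residuals(actual_babip, expected_babip):
    print(f"{player}: {residual:+.3f}")
```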

    This paper in particular looks at the release of a new dataset and its potential effects on

    predicting BABIP. Last year, Major League Baseball introduced a new ball tracking system called StatCast.

    StatCast uses optical tracking technology to measure how fast, with what acceleration, and how far a

    defender runs. In addition, and more importantly for this paper, StatCast tracks and publishes the exit

    velocity of a baseball hit by the batter. Previous research into predicting a non-results based BABIP has

    not used this new StatCast data, so this research intends to see if the data can be used to build a better

    model for xBABIP.

    Literature Review:

    The idea of creating an xBABIP is not new. The rise in popularity in baseball analytics has created

    countless dedicated websites where individuals commit their free time to analyzing statistics in hopes of

    better understanding the game. Three of the most reputable websites are FanGraphs, Baseball

    Prospectus, and The Hardball Times. Searching the archives of these websites returns a bit of prior

    research on the subject matter.

    In 2008, Chris Dutton of The Hardball Times first delved into predictive BABIP. Popular

    opinion at the time said that adding .120 to a player's line drive percentage (percentage of batted balls

    categorized as line drives) could act as a proxy for what a player's BABIP should look like. Dutton rejected

    this as a reasonable explanation, for other factors, such as speed and ability to control the strike zone,

    seem to be variables that would also be relevant. He postulated that a quicker player would be able to

    get a hit on a slow grounder to an infielder while a slower player, who hit the ball with the exact same

    profile, would not get a hit.

    Using data from 2002 to 2008, Dutton developed a model that took into account the batted ball

    profile as well as some relevant metrics for the player's personal skill profile. His OLS regression found

    positive and significant (at the 1% level) coefficients for a hitter's eye (defined as strikeouts divided by walks), line

    drive percentage, speed score (to be discussed later on), and pitches per plate appearance (Dutton,

    2008). He found negative coefficients for pitches per extra base hit, fly ball to ground ball ratio, spray (a

    measure of how well a hitter disperses his hits all over the field), and contact rate (a metric that looks at

    how well a player avoids strikeouts). Dutton also attempted to control for park effects, the year, and

    whether a batter hits left-handed, right-handed, or both, but he found these indicator variables to be insignificant.

    Dutton's model had an r-squared of .348. During out of sample testing, he found a correlation of 59%

    between his xBABIP and actual BABIP. In comparison, the rudimentary formula of xBABIP = (.120 + LD%)

    only had an r-squared of .03 and an out of sample correlation of 18% (Dutton, 2008).

    Dutton's results very clearly show that a model for expected BABIP is necessary. Conventional

    wisdom with the very simple model using only line drive percentage proved not very predictive at all.

    Dutton isolated significant variables and his work is used as the backbone for other research on the

    topic.

    In 2010, Matt Swartz of Baseball Prospectus looked once again at batting average of balls in

    play. Swartz noted that year-to-year BABIP only has a correlation of about .37. BABIP is highly influenced

    by the type of batted ball, as defined by line drive percentage, outfield fly ball percentage, ground ball

    percentage, and infield fly ball percentage. The following table shows the league average distribution,

    the year-to-year correlation for this distribution, the league average BABIP, and the year-to-year

    correlation for BABIP for each type of batted ball.

    BABIP by Type of Batted Ball

    Batted Ball Type     League Average   Distribution    Average BABIP     BABIP
                         Distribution     Year-to-Year    for Type of Hit   Year-to-Year
                                          Correlation                       Correlation

    Line Drive           .21              .37             .730              .12

    Outfield Fly Ball    .44              .72             .240              .22

    Ground Ball          .35              .78             .170              .30

    Infield Fly Ball     .11              .68             .020              .17

    Interpreting the year-to-year correlations, batted ball type distribution is rather steady. This is to say if a

    batter hits 30% fly balls one year, the numbers suggest that the batter will hit very close to that same

    percentage the next. This is very important, for if a model is based on these batted ball type distributions,

    a player has to have similar numbers year to year for the predictions to mean anything. In contrast, the

    BABIP average year-to-year correlations suggest that there is far more uncertainty and inconsistency.

    In addition to batted ball type, Swartz also looked at the speed of the player, as Dutton did.

    Rather than using the speed score, Swartz used triples per at bat as a proxy because triples require a

    component of speed to beat the ball to third base rather than just settle for a double.

    Swartz then developed two models. The first used the weighted average of the previous three

    years to predict the fourth year's BABIP. This model found positive and statistically significant

    coefficients for line drive percentage, ground ball percentage, ground ball BABIP, infield hits per infield

    chances, outfield fly ball BABIP, the natural log of home runs per at bat, and the natural log of contact

    made per pitch swung at. The model then had one variable, infield fly ball percentage, with a negative

    coefficient. This model had an r-squared of .31. For this paper's purpose, this model is not very helpful,

    for the data being used is only from the 2015 season and does not include any of the previous seasons'

    BABIP data. Also, given the BABIP year-to-year inconsistency, it is questionable to have used the

    previous years' BABIP as a variable.

    Swartz's second model looked only at the previous year's data, making it much more like

    the one this paper attempts to build. Swartz finds that line drive percentage, ground ball percentage,

    infield hits per infield chances, the natural log of home runs per at bat, outfield fly ball percentage, and

    triples all have positive coefficients. Again, infield fly ball percentage has a negative coefficient. This

    model has an r-squared of just .21 (Swartz, 2010). Looking at his approach, Swartz shows that batted ball

    type is in fact an important variable when predicting BABIP; however, his models did not show

    improvement on Dutton's previous research.

    The most recent study on expected batting average of balls in play comes from Alex

    Chamberlain of FanGraphs. Chamberlain's impetus for the research came from the release of a few

    new batted ball type statistics. Hard%, in conjunction with Medium% and Soft%, measures how often a

    player hits a ball hard. Interestingly, Chamberlain notes that Hard% has almost no correlation with line

    drive percentage; thus, Hard% captures well-hit ground balls as well as well-hit fly balls. True FB% is

    defined as fly ball percentage minus infield fly ball percentage, and True IFFB% measures

    infield fly balls per ball hit in play rather than per fly ball. Finally, Chamberlain also introduces Oppo%,

    complemented by Pull% and Center%, which measures the percent of batted balls that are hit to the

    opposite field. Using data from 2002 to 2014 (n = 1971), Chamberlain developed the following model:

    xBABIP = .1975 - .4838*(True IFFB%) - .0914*(True FB%) + .2594*(LD%) + .1822*(Hard%) +

    .1198*(Oppo%) + .0042*(Speed Score)

    The model has an adjusted r-squared of .456 as well as a year-to-year correlation of .4712 (Chamberlain,

    2015).
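
    To make the equation concrete, the following is a small sketch that evaluates Chamberlain's published coefficients for one player. It assumes that all percentage inputs are given in decimal format (as used throughout this paper); the function name and the example inputs are hypothetical.

```python
def chamberlain_xbabip(true_iffb, true_fb, ld, hard, oppo, speed_score):
    """Evaluate the xBABIP equation quoted above.

    Percentage inputs are assumed to be decimals (e.g. 21% -> 0.21);
    speed_score is on its usual scale (roughly 0 to 10).
    """
    return (
        0.1975
        - 0.4838 * true_iffb
        - 0.0914 * true_fb
        + 0.2594 * ld
        + 0.1822 * hard
        + 0.1198 * oppo
        + 0.0042 * speed_score
    )


# Hypothetical batted ball profile, not a real player.
print(round(chamberlain_xbabip(0.03, 0.30, 0.21, 0.30, 0.25, 4.5), 3))
```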

    Using these new statistics, this model has significantly more predictive power than the models

    discussed previously. Batted ball type and speed remain very important factors in predicting BABIP over

    all the research reviewed, and it seems that the new metrics, which further describe the batted ball

    profile, improve the model. Given the new dataset on which this paper intends to build a

    model, this is a promising outcome. Because of this, this paper intends to use Chamberlain's model

    as the archetype.

    Data:

    The data for this paper comes from FanGraphs.com and BaseballSavant.com. Both websites

    have full data from the 2015 Major League Baseball season for their respective statistics. This paper

    limits the sample to only players who qualified for the batting title, which requires 3.1 plate appearances

    per game in the season. This translates out to at least 502 plate appearances for the entire season. Thus,

    the sample is limited to the 141 qualified players from the 2015 season. The training set is a randomly

    selected collection of 106 players and the testing set consists of the remaining 36 players.

    The dataset currently has 43 columns: the key (the player) and 42 descriptive

    statistics. However, the research presented only builds an OLS model on a selected group of the

    variables. The following explains each of these selected variables. The attached appendix also includes

    the means, medians, and standard deviations for each statistic. Note that for all percentage metrics, this

    paper will be using decimal format (e.g., 10% is represented as .10).

    Batting Average of Balls in Play (BABIP):

    Calculated as (Hits - Home Runs)/(At Bats - Strikeouts - Home Runs + Sacrifice Flies)

    Batting Average of Balls in Play measures how often a ball put in play results in a hit. As

    discussed before, BABIP incorporates talent, luck, and defense. No one with over 4,000 career

    plate appearances (roughly six seasons) has ever had a BABIP of over .380, and a more

    traditional mark of .350 indicates the best players in the league (BABIP). The following box

    plot shows the wide spread of BABIP:

    The most significant outlier is Albert Pujols at .217. This is a far departure from Pujols's career

    average BABIP of .297, so even as he ages, it is unlikely that .217 is a representative measure for

    his hitting skill.

    Batted Ball Type: Line Drive Percentage (LD%), Groundball Percentage (GB%), Fly Ball Percentage (FB%),

    and Infield Fly Ball Percentage (IFFB%):

    Calculated as:

    Line Drive Percentage = Line Drives / Balls in Play

    Fly Ball Percentage = Fly Balls / Balls in Play

    Ground Ball Percentage = Ground Balls / Balls in Play

    Infield Fly Ball Percentage = Infield Fly Balls / Fly Balls

    These four metrics are grouped together, for they are all related. These are the four categorized

    outcomes of a ball in play. LD%, FB%, and GB% sum to 1, while IFFB% describes a subset of

    fly balls. The following shows the hit distribution for 2015 players (ordered by ascending GB%):

    For the majority of players, GB% dominates the hit profile; however, line drive percentage is

    rather steady across the board regardless of the other two metrics.
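
    The four batted ball type rates defined above can be sketched as a short calculation. This sketch assumes that balls in play are exactly the sum of line drives, fly balls, and ground balls (which is what makes LD%, FB%, and GB% sum to 1) and that infield fly balls are already counted among the fly balls; the function name and the counts in the example are hypothetical.

```python
def batted_ball_rates(line_drives, fly_balls, ground_balls, infield_fly_balls):
    """Return LD%, FB%, GB% (per ball in play) and IFFB% (per fly ball) as decimals."""
    balls_in_play = line_drives + fly_balls + ground_balls
    if balls_in_play == 0 or fly_balls == 0:
        raise ValueError("need at least one ball in play and one fly ball")
    if infield_fly_balls > fly_balls:
        raise ValueError("infield fly balls are a subset of fly balls")
    return {
        "LD%": line_drives / balls_in_play,
        "FB%": fly_balls / balls_in_play,
        "GB%": ground_balls / balls_in_play,
        "IFFB%": infield_fly_balls / fly_balls,
    }


# Hypothetical counts for one player-season.
rates = batted_ball_rates(line_drives=90, fly_balls=150, ground_balls=190, infield_fly_balls=15)
print({name: round(value, 3) for name, value in rates.items()})
```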

    Batted Ball Direction: Pull Percent (Pull%), Center Percent (Cent%), and Opposite Field Percent (Oppo%):

    Calculated as:

    Pull% = Pulled Balls/Total Batted Balls

    Cent% = Centered Balls/Total Batted Balls

    Oppo% = Opposite Balls/Total Batted Balls

    Batted ball direction metrics split the field into three equal 30-degree sections. A pull location is

    defined as the batter pulling the ball towards the side of the plate from which he hits. For example, if a

    right-handed hitter pulls the ball towards third base, it counts towards the Pull%. If the ball is hit

    up the middle, it is counted towards the Cent%, and if the ball is hit towards the first baseman, it

    is counted towards the Oppo%. For a left-handed hitter, a ball hit to the first base side counts

    toward the Pull%, and if the ball is hit towards the third baseman, it is counted towards the

    Oppo%. Similar to FB%, GB%, and LD%, these metrics sum to 1 for each player. The following,

    courtesy of the FanGraphs page Batted Ball Direction, attempts to describe a player's hitting

    style based on the distribution of his batted ball direction breakdown.

    Batter Type Pull% Cent% Oppo%

    Average .40 .35 .25

    Extreme Pull .55 .25 .20

    Extreme Oppo. .30 .30 .40

    Of note, players want to have as balanced a distribution as possible so that defenses are not

    able to position themselves heavily towards one side or the other.

    Quality of Contact: Soft Hit Percentage (Soft%), Medium Hit Percentage (Med%), Hard Hit Percentage

    (Hard%):

    Quality of contact statistics are proprietary metrics from Baseball Info Solutions, and

    have only recently been released to the public. While the exact formula is not known, it is

    common knowledge that hang time, trajectory, and landing location factor into the calculation

    (Quality of Contact Stats). Once again, these metrics sum to 1, so every batted ball is assigned

    to one of the three buckets. The following shows the distribution for the qualified players in

    2015.

    Medium hit balls dominate every player's profile. Based on the research of Alex Chamberlain,

    a higher hard hit percentage should lead to a higher BABIP. This study hopes to confirm that.

    Speed Score (Spd%):

    Speed score is also a proprietary metric. It attempts to capture both the speed and base running

    ability of a player. The metric varies depending on the website, but FanGraphs uses a

    combination of Stolen Base Percentage, Frequency of Stolen Base Attempts, Percentage of

    Triples, and Runs Scored Percentage (Speed Score). The 2015 sample shows speed scores vary

    quite a bit from player to player. The following is a box plot of the distribution:

    The wide distribution makes intuitive sense. The nature of baseball is that some positions

    require far more athleticism than others, so rosters have a variety of body types and athleticism.

    From FanGraphs' own research on speed score over the years, the following, taken from their

    Speed Score page, shows how one can rate a player's speed according to his score:

    Rating Speed Score

    Excellent 7.0

    Great 6.0

    Above Average 5.5

    Average 4.5

    Below Average 4.0

    Poor 3.0

    Awful 2.0

    StatCast Data: Average miles per hour of a ball off the bat (AvgMPH), Average miles per hour of a ball

    off the bat for a line drive and fly ball (AvgLD/FB MPH), Average miles per hour of a ball off the bat for

    a ground ball (AvgGB MPH)

    Before delving into each individual metric, each of which is rather self-explanatory, there must

    be a discussion regarding the reliability of StatCast data. Research done at FanGraphs by Tony

    Blengino looked at the limitations of the 2015 data. Blengino downloaded all the data from the

    first half of the 2015 season and found that for 25.4% of batted balls, the batted ball velocity

    was reported as NULL. Hence, for any batter, it is safe to assume that roughly one-fourth of the data is

    missing. More troubling is the split among the missing and reported data. Blengino found that

    reported data was associated with a much higher batting average and slugging percentage (a measurement

    of a player's ability to consistently get extra base hits) (Blengino, 2015). Digging

    further, StatCast reported infield fly balls as NULL 56.3% of the time and often missed weak

    ground balls (Blengino, 2015). This no doubt will have an effect on this paper's analysis and is

    important to keep in mind.

    Breaking down each of the statistics, the following is a line graph plotting each of the three

    metrics:

    As expected, line drives are hit at a higher speed than ground balls. However, there is a lot of

    spread among the data. Regardless, this paper hopes that StatCast data can be used as a

    complement, or even a substitute, to the quality of contact metrics.

    Plate Discipline Statistics: Outside of the Strike Zone Swing Percentage (O-Swing%), Inside of the Strike

    Zone Swing Percentage (Z-Swing%), Overall Swing Percentage (Swing%), and Swinging Strike Percentage

    (SwStr%)

    Calculated as:

    O-Swing% = Swings at pitches outside of the strike zone / Total pitches outside of the

    strike zone

    Z-Swing% = Swings at pitches inside of the strike zone/Total pitches inside of the strike

    zone

    Swing% = Swings at pitches/Total pitches

    SwStr% = Swings and misses/Total pitches

    These metrics represent how well a player is able to control the strike zone. Balls pitched inside

    the strike zone are easier to hit, so players with a high O-Swing% lack plate discipline. SwStr% is

    a metric that captures a player's ability to make contact consistently. The data summary in the

    appendix shows the league averages with rather small standard deviations; thus, players are

    rather consistent with these metrics.
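
    The four plate discipline formulas above can be sketched from raw pitch counts as follows. The function name, argument names, and the counts in the example are hypothetical; the sketch assumes every pitch is classified as either inside or outside the strike zone.

```python
def plate_discipline(pitches_in_zone, pitches_out_zone,
                     swings_in_zone, swings_out_zone, swings_and_misses):
    """Return O-Swing%, Z-Swing%, Swing%, and SwStr% as decimals."""
    total_pitches = pitches_in_zone + pitches_out_zone
    total_swings = swings_in_zone + swings_out_zone
    if pitches_in_zone == 0 or pitches_out_zone == 0:
        raise ValueError("need pitches both inside and outside the zone")
    return {
        "O-Swing%": swings_out_zone / pitches_out_zone,
        "Z-Swing%": swings_in_zone / pitches_in_zone,
        "Swing%": total_swings / total_pitches,
        "SwStr%": swings_and_misses / total_pitches,
    }


# Hypothetical season totals for one hitter.
stats = plate_discipline(pitches_in_zone=1000, pitches_out_zone=1200,
                         swings_in_zone=650, swings_out_zone=360,
                         swings_and_misses=200)
print({name: round(value, 3) for name, value in stats.items()})
```

    Note that O-Swing% and Z-Swing% use zone-specific denominators, while Swing% and SwStr% are both measured per total pitch.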

    Contact Consistency Metrics: Contact Rate for Swings for Pitches Outside of the Strike Zone (O-

    Contact%), Contact Rate for Swings for Pitches Inside the Strike Zone (Z-Contact%), Contact Rate for All

    Swings (Contact%):

    Calculated as:

    O-Contact% = Contact made on pitches outside of the strike zone / Swings on pitches

    outside the zone

    Z-Contact% = Contact made on pitches inside the zone / Swings on pitches inside the

    zone

    Contact% = Contact made on a swing / Swings

    The ability to avoid swinging at pitches outside of the zone is important because getting ahead

    in the count (more balls than strikes) allows for hitters to expect pitches closer to the middle of

    the zone. Having a high contact rate itself is indicative of a hitter's bat control.

    Expectations:

    The expectation for this paper is that this research will prove to be a more thorough

    examination of predictive BABIP methods. Chamberlain's model proved to be the best method

    examined, yet he openly admitted that he handpicked statistics that he thought would be helpful in

    predicting BABIP. This research intends to look at all the batted ball profile statistics described above to

    develop a thorough model complete with diagnostic tests. The expectation is that the StatCast data will

    help the model by providing previously unused data. The model must guard against collinearity, however,

    for quality of contact and StatCast MPH are likely related. The model likely will see positive and

    significant coefficients associated with LD%, Oppo%, HR/FB%, and Speed Score. Negative coefficients are

    to be expected on FB%, GB%, Pull%, and Soft%. Plate discipline metrics are more difficult to project. O-

    Swing% should have a negative coefficient, for hitting balls outside of the strike zone is difficult, and

    SwStr% should be negative as well, for frequent missed swings likely do not correlate with good contact

    when the ball is eventually put in play. Z-Swing% should have a positive coefficient, for players are

    swinging at strikes early and often, which are usually the easiest pitches to hit. For the StatCast data,

    higher MPH should mean more well hit balls resulting in hits, so there should be a positive effect on

    BABIP.

    Modeling And Testing:

    First, the data is split into a 75% training set and a 25% testing set. This leaves 106 players for

    training the model and 36 for testing it. Because of the fear of collinearity, the first step to building this

    model is making a correlation matrix of the concerned metrics:

    By interpreting this matrix, it becomes clear that including certain combinations of metrics will

    lead to the overfitting of the data. The following bullet points summarize the findings:

    - GB% and FB% are highly correlated with each other, but not with LD%

    - Hard% is highly correlated with Soft% and Med%

    - Hard%/Med%/Soft% are highly correlated with the StatCast data

    - StatCast data for GB and LD/FB are not correlated with each other, so both can be used in one model as long as the overall StatCast average is not used

    - Pull% is correlated heavily with Med% and Oppo%, but Oppo% and Med% are not correlated with each other

    - All the swing metrics are correlated

    - All the contact metrics are correlated
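
    The screening step summarized above can be sketched as a pairwise Pearson correlation check over the candidate metrics. This is an illustration only, not the R script used for the paper: the 0.7 cutoff, the function names, and the toy columns are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.7):
    """List (metric_a, metric_b, r) for pairs with |r| at or above the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Toy columns (assumed values) keyed by metric name, one entry per player.
toy = {
    "GB%": [0.40, 0.45, 0.50, 0.38, 0.47],
    "FB%": [0.39, 0.34, 0.30, 0.41, 0.32],
    "Spd": [4.1, 6.3, 3.0, 5.2, 4.8],
}
for name_a, name_b, r in correlated_pairs(toy):
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

    Pairs flagged this way are candidates to keep out of the same regression, which is the reasoning behind building two separate models below.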

    Given these findings, I intend to build two models. One model will use the Quality of Contact data and

    the other will use StatCast data as a substitute. Comparing the two models should show whether the

    release of StatCast data helps improve the predictive power of xBABIP.

    The advantage of this paper's Quality of Contact Model will be the inclusion of several additional

    metrics. The model attempts to predict BABIP using LD%, GB%, Oppo%, Hard%, Speed Score, O-Swing%,

    Swing%, and Contact%. Using these metrics, it accounts for the following: batted ball type, batted ball direction,

    quality of contact, player speed, player discipline, and player contact skills. Thinking through the metrics,

    there does not seem to be any variable where diminishing effects would come about; thus, none of the

    metrics are transformed. The following is the first model output:

    The two insignificant variables, Swing% and Contact%, lead to an omit F-test where the null hypothesis is

    β_Swing% = β_Contact% = 0. The following shows the regression output for the new model:

    As we see, removing the two variables leaves only significant variables. However, the

    omit F-test requires an F-statistic in order to decide whether to reject the null hypothesis. The following shows the ANOVA

    outputs for both models:

    Calculation of the F-statistic finds a p-value just below .05, so we reject the null hypothesis. Thus,

    Contact% and Swing% do improve the model, even if they are not individually significant.
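
    For reference, the F-statistic for an omit test compares the residual sum of squares (RSS) of the restricted and unrestricted models. The following is a minimal sketch of the standard formula; the function name and the RSS values and degrees of freedom in the example are hypothetical and are not the values from the ANOVA output.

```python
def omit_f_statistic(rss_restricted, rss_unrestricted,
                     num_restrictions, df_unrestricted):
    """F = ((RSS_r - RSS_u) / q) / (RSS_u / df_u).

    q is the number of coefficients set to zero under the null hypothesis,
    and df_u is the residual degrees of freedom of the unrestricted model
    (observations minus estimated coefficients, including the intercept).
    """
    if num_restrictions <= 0 or df_unrestricted <= 0 or rss_unrestricted <= 0:
        raise ValueError("q, df_u, and RSS_u must all be positive")
    numerator = (rss_restricted - rss_unrestricted) / num_restrictions
    denominator = rss_unrestricted / df_unrestricted
    return numerator / denominator


# Hypothetical inputs: dropping q = 2 variables raises RSS from 1.0 to 1.2.
print(round(omit_f_statistic(1.2, 1.0, num_restrictions=2, df_unrestricted=100), 2))
```

    The resulting statistic is compared against an F distribution with (q, df_u) degrees of freedom to obtain the p-value.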

    Given this result, the paper will go ahead with the unrestricted model as the Quality of Contact Model.

    The following shows the residual plot:

    The scatter is random, has no influential outliers, and is centered around zero, so the residuals are consistent with iid

    errors. Furthermore, a heteroscedasticity test shows no signs of heteroscedasticity:

    Moving on to the StatCast model, the variables remain the same except for the replacement of

    Hard% with AvgLD/FB MPH. In theory, these metrics are measuring the same skill, so the results

    should be analogous. The following shows the first output:

    This initial output is very similar to the initial output of the Quality of Contact Model. Again, an omit

    F-test is used to test the null hypothesis β_Swing% = β_Contact% = 0. Below is the reduced model:

    And the ANOVA tables:

    The omit F-test again gives a p-value below .05, so we can reject the null hypothesis.

    Given this result, the research will continue with the unrestricted model. To evaluate the diagnostics,

    below are the residual plot and heteroscedasticity test:

    Again, the scatter is random, has no influential outliers, and is centered around zero, so the residuals

    are consistent with iid errors. There is also no sign of heteroscedasticity.

    Interpretation and Conclusions:

    The coefficients for both models agree with each other. LD%, Oppo%, Hard% and Speed score all

    have positive coefficients. Unsurprisingly, LD% has the largest coefficient, which makes sense given how

    high the batting average is on that type of hit. For Swing% and Contact%, both models have negative

    coefficients. This was not expected, yet makes sense. If a player is swinging and making contact on the

    majority of pitches, the player is likely giving up the chance to wait for a pitch in his preferred zone in favor of

    hitting anything. If the player waits for a pitch in a certain zone, he is more likely to make better contact

    and thus have a higher chance of getting a hit. The variables on which the models diverge, Hard% and

    AvgLD/FB MPH, both have positive coefficients. This is not surprising. However, the StatCast Avg

    LD/FB MPH is not as significant; thus, it is likely not as good a predictor as Hard%. To

    compare the two models, the following are graphs that plot the actual BABIP against the predicted

    BABIP using the 36-player testing set:

    The plots act as a complement to the r-squared analysis of the two models. Both plots show fairly linear

    relationships, which suggests good predictive power. Interestingly, the StatCast Model seems to be

    consistently over-predicting BABIP by .02-.03. This may be a result of the previously discussed data

    issues associated with StatCast.
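
    A systematic over-prediction like this can be quantified on the testing set with the mean error, alongside the root mean squared error as a measure of overall fit. The following is a small sketch; the function names and the three pairs of values are hypothetical and do not come from the testing set.

```python
from math import sqrt


def mean_error(actual, predicted):
    """Average of (predicted - actual); positive values indicate over-prediction."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Typical size of a prediction miss, regardless of direction."""
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical actual and predicted BABIP values for three players.
actual = [0.300, 0.310, 0.290]
predicted = [0.320, 0.340, 0.310]
print(round(mean_error(actual, predicted), 4))
print(round(root_mean_squared_error(actual, predicted), 4))
```

    A mean error near zero with a large root mean squared error would point to noise, while a mean error close to the root mean squared error points to a consistent bias of the kind described above.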

    Overall, the models and research show that the Quality of Contact Model is better than the StatCast

    Model. This is evident in the significance of the variables, the adjusted r-squared values, and the results of the

    testing set. The Quality of Contact Model explains 44.1% of the variation in BABIP while the StatCast

    model explains 40.1% of the variation. This is a disappointing finding. StatCast was announced to much

    excitement, yet the data quality issues seem to cloud its ability to be a truly useful dataset. This is in no

    way a damning statement for the future of StatCast the results of the Stat Cast Model are indeed

    promising. Despite the small sample size, it is valuable to have been able to confirm and improve upon

    (ever slightly) the previous research on the topic. To further examine the results of the models, the

    actual and predicted values for BABIP for all 141 players are attached in the appendix as Table 2. The full

    dataset used is available digitally.
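
    The kind of comparison described above, adjusted r-squared on the training data plus accuracy on

    a held-out testing set, can be sketched in R as follows. This is a minimal illustration on simulated data,

    not the paper's dataset or its fitted models: the column names, the coefficients used to generate the

    data, and the two-regressor model formulas are all assumptions made for the example. Only the LD%

    and Hard% means and standard deviations (Table 1) and the 141/36 split sizes come from the text.

```r
# Simulated stand-in for the player-season data; NOT the paper's dataset.
set.seed(2015)
n <- 141  # number of players reported in the text
sim <- data.frame(
  ld   = rnorm(n, mean = 0.212, sd = 0.029),  # LD% mean/sd from Table 1
  hard = rnorm(n, mean = 0.305, sd = 0.056)   # Hard% mean/sd from Table 1
)
# Made-up relationships, chosen only so the example has signal to fit.
sim$fbld_mph <- 92.3 + 30 * (sim$hard - 0.305) + rnorm(n, sd = 1.5)
sim$babip    <- 0.15 + 0.5 * sim$ld + 0.17 * sim$hard + rnorm(n, sd = 0.025)

# Hold out 36 players for testing, as in the text; fit on the rest.
test_rows <- sample(n, 36)
train <- sim[-test_rows, ]
test  <- sim[test_rows, ]

fit_hard <- lm(babip ~ ld + hard, data = train)      # Hard%-style model
fit_mph  <- lm(babip ~ ld + fbld_mph, data = train)  # exit-velocity-style model

# In-sample comparison: adjusted r-squared penalizes extra regressors.
adj_r2 <- c(hard = summary(fit_hard)$adj.r.squared,
            mph  = summary(fit_mph)$adj.r.squared)

# Out-of-sample comparison: root mean squared error on the held-out players.
rmse <- function(fit, newdata) {
  sqrt(mean((predict(fit, newdata) - newdata$babip)^2))
}
test_rmse <- c(hard = rmse(fit_hard, test), mph = rmse(fit_mph, test))

print(round(adj_r2, 3))
print(round(test_rmse, 4))
```

    Because the data are simulated, the printed numbers say nothing about real hitters; the point is

    only the mechanics of comparing two specifications both in and out of sample.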

    Suggestions for Future Work:

    In several years, StatCast data should be more reliable, creating a much more thorough and

    comprehensive dataset. At that point, this research should be repeated. Moreover, more data on

    exogenous variables, such as defensive positioning, should become available in the next few years, so

    adding more variables to a larger training set could help the predictive value.

    There is also a completely different way to think about predicting BABIP. Rather than using

    player averages over an entire season to try to predict BABIP, a different modeling attempt could look

    at each at bat individually. Given the depth of StatCast data and the accompanying exit angle, defensive

    positioning, and hang time data, a logit model should be able to give a probability that a batted ball

    will become a hit. Averaging the results of this model over an entire season could give an expected

    BABIP devoid of luck. There is no telling whether this model would be much better, but it is an

    alternative worth looking at.
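
    A rough sketch of this per-batted-ball approach is given below. It is an illustration under stated

    assumptions rather than a tested model: the batted-ball data are simulated, the variable names

    (exit_mph, exit_angle, hit) are hypothetical, the data-generating coefficients are made up, and hang

    time and defensive positioning are left out for brevity. The only technique it demonstrates is fitting a

    logit model with glm() and averaging the predicted hit probabilities by batter.

```r
# Simulated batted balls; NOT real StatCast data.
set.seed(42)
n_balls <- 2000
balls <- data.frame(
  batter     = sample(paste0("batter_", 1:10), n_balls, replace = TRUE),
  exit_mph   = rnorm(n_balls, mean = 89, sd = 12),  # exit velocity
  exit_angle = rnorm(n_balls, mean = 12, sd = 25)   # launch angle, degrees
)

# Made-up hit process: harder contact helps, extreme angles hurt.
true_logit <- -0.2 + 0.03 * (balls$exit_mph - 89) -
  0.0012 * (balls$exit_angle - 12)^2
balls$hit <- rbinom(n_balls, size = 1, prob = plogis(true_logit))

# Logit model: probability that a batted ball becomes a hit.
hit_model <- glm(hit ~ exit_mph + exit_angle + I(exit_angle^2),
                 data = balls, family = binomial)

# Predicted hit probability for every batted ball.
balls$p_hit <- predict(hit_model, type = "response")

# Expected BABIP per batter: the average predicted probability over his
# batted balls, shown next to the share that actually fell for hits.
expected_babip <- tapply(balls$p_hit, balls$batter, mean)
actual_babip   <- tapply(balls$hit, balls$batter, mean)

print(round(cbind(expected = expected_babip, actual = actual_babip), 3))
```

    The gap between a batter's actual and expected values is what this approach would attribute to luck

    and to whatever factors the model omits.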

    Appendix

    Table 1: Summary Statistics for Metrics

    Metric Mean Median Standard Deviation

    BABIP 0.30893662 0.309 0.033764557

    LD% 0.212 0.212 0.028549438

    GB% 0.443767606 0.447 0.068567399

    FB% 0.34421831 0.3495 0.069931366

    IFFB% 0.086190141 0.086 0.043866303

    HR/FB 0.120309859 0.1155 0.059034626

    IFH% 0.066866197 0.0595 0.035786776

    Pull% 0.397605634 0.394 0.059530316

    Cent% 0.350443662 0.3495 0.034337908

    Oppo% 0.252077465 0.246 0.042932649

    Soft% 0.169485915 0.167 0.036206905

    Med% 0.525901408 0.525 0.038956333

    Hard% 0.304753521 0.306 0.056165649

    Spd 4.102816901 3.95 1.684098943

    ABs With Data 320.2112676 322.5 44.25579153

    Avg - MPH 89.23725352 89.38 2.318449623

    Avg - FB/LD MPH 92.27697183 92.39 2.636167724

    Avg - GB MPH 86.94746479 87 2.551268791

    O-Swing% 0.317943662 0.313 0.057985765

    Z-Swing% 0.676295775 0.678 0.060572828

    Swing% 0.477323944 0.479 0.051245962

    SwStr% 0.091823944 0.087 0.030332268

    Table 2: Actual BABIP vs Predicted

    Name Actual BABIP Quality of Contact Prediction StatCast Prediction

    Lucas Duda 0.285 0.292 0.297

    Jose Bautista 0.237 0.259 0.266

    Todd Frazier 0.271 0.284 0.275

    Brian McCann 0.235 0.248 0.254

    Brandon Moss 0.285 0.286 0.282

    Kris Bryant 0.378 0.314 0.313

    Edwin Encarnacion 0.267 0.277 0.283

    Jay Bruce 0.251 0.293 0.287

    Justin Upton 0.304 0.302 0.303

    Brian Dozier 0.261 0.280 0.291

    Nolan Arenado 0.284 0.293 0.286

    Asdrubal Cabrera 0.306 0.284 0.291

    Anthony Rizzo 0.289 0.295 0.296

    J.D. Martinez 0.339 0.319 0.308

    Chris Davis 0.319 0.308 0.311

    Aramis Ramirez 0.253 0.272 0.265

    Carlos Beltran 0.297 0.275 0.280

    Jimmy Rollins 0.246 0.272 0.279

    Joc Pederson 0.262 0.285 0.292

    Mookie Betts 0.31 0.296 0.298

    Albert Pujols 0.217 0.263 0.259

    Curtis Granderson 0.305 0.319 0.325

    Matt Carpenter 0.321 0.327 0.331

    Mike Moustakas 0.294 0.282 0.279

    Derek Norris 0.31 0.275 0.280

    David Ortiz 0.264 0.303 0.296

    Kyle Seager 0.278 0.299 0.300

    Trevor Plouffe 0.274 0.288 0.287

    Addison Russell 0.324 0.281 0.289

    Ian Kinsler 0.323 0.299 0.301

    Logan Forsythe 0.323 0.290 0.296

    Josh Reddick 0.278 0.293 0.309

    Evan Longoria 0.309 0.291 0.301

    Stephen Vogt 0.29 0.280 0.291

    Nick Castellanos 0.322 0.314 0.314

    Mark Trumbo 0.313 0.301 0.302

    Bryce Harper 0.369 0.315 0.306

    Logan Morrison 0.238 0.283 0.280

    Marcus Semien 0.312 0.308 0.318

    Alex Rodriguez 0.278 0.289 0.292

    Manny Machado 0.297 0.296 0.296

    Mike Trout 0.344 0.346 0.348

    Andrew McCutchen 0.339 0.327 0.325

    Josh Donaldson 0.314 0.307 0.306

    Yoenis Cespedes 0.323 0.311 0.309

    Brandon Belt 0.363 0.355 0.347

    Evan Gattis 0.264 0.291 0.292

    Salvador Perez 0.27 0.263 0.269

    Carlos Santana 0.261 0.277 0.289

    Yangervis Solarte 0.279 0.280 0.269

    Wilmer Flores 0.273 0.288 0.285

    Troy Tulowitzki 0.331 0.309 0.298

    Freddy Galvis 0.309 0.307 0.309

    Charlie Blackmon 0.325 0.328 0.322

    Neil Walker 0.306 0.304 0.301

    Kevin Pillar 0.306 0.304 0.303

    Adrian Gonzalez 0.294 0.319 0.316

    Ryan Howard 0.272 0.321 0.322

    Carlos Gonzalez 0.284 0.294 0.288

    Dexter Fowler 0.308 0.309 0.316

    Adam Jones 0.286 0.288 0.280

    Marlon Byrd 0.297 0.314 0.308

    Daniel Murphy 0.278 0.299 0.296

    Jose Reyes 0.301 0.276 0.278

    Adrian Beltre 0.295 0.308 0.303

    Prince Fielder 0.323 0.292 0.287

    Kole Calhoun 0.304 0.303 0.309

    Jose Altuve 0.329 0.280 0.272

    Matt Kemp 0.311 0.333 0.309

    Adam Lind 0.309 0.303 0.294

    Paul Goldschmidt 0.382 0.352 0.353

    Gregory Polanco 0.308 0.313 0.313

    Kendrys Morales 0.319 0.304 0.303

    Mitch Moreland 0.317 0.301 0.303

    Torii Hunter 0.258 0.282 0.280

    Chris Owings 0.305 0.340 0.334

    Chris Coghlan 0.284 0.323 0.318

    Didi Gregorius 0.297 0.292 0.297

    Angel Pagan 0.31 0.304 0.310

    Nelson Cruz 0.35 0.315 0.320

    Brett Gardner 0.312 0.318 0.334

    Buster Posey 0.32 0.313 0.299

    Brandon Crawford 0.294 0.310 0.307

    Matt Duffy 0.336 0.258 0.277

    Russell Martin 0.262 0.296 0.304

    Kolten Wong 0.296 0.311 0.307

    Joey Votto 0.371 0.342 0.342

    Brett Lawrie 0.32 0.304 0.310

    Miguel Cabrera 0.384 0.352 0.349

    Ben Zobrist 0.288 0.287 0.295

    Pablo Sandoval 0.27 0.290 0.290

    Jose Abreu 0.333 0.318 0.313

    Yadier Molina 0.295 0.301 0.297

    Elvis Andrus 0.283 0.308 0.309

    Jace Peterson 0.296 0.315 0.320

    Michael Taylor 0.311 0.331 0.333

    Michael Brantley 0.318 0.310 0.307

    Billy Butler 0.282 0.299 0.303

    Lorenzo Cain 0.347 0.345 0.339

    Jhonny Peralta 0.311 0.313 0.307

    Ian Desmond 0.307 0.308 0.313

    Ryan Braun 0.322 0.346 0.334

    Chase Headley 0.317 0.308 0.317

    Brandon Phillips 0.315 0.332 0.327

    Martin Prado 0.313 0.312 0.319

    Alcides Escobar 0.286 0.316 0.316

    Melky Cabrera 0.297 0.314 0.316

    Gerardo Parra 0.325 0.336 0.330

    Kevin Kiermaier 0.306 0.320 0.322

    Odubel Herrera 0.387 0.339 0.344

    Alexei Ramirez 0.264 0.284 0.285

    A.J. Pollock 0.338 0.337 0.324

    Starlin Castro 0.298 0.277 0.277

    Shin-Soo Choo 0.335 0.318 0.326

    Billy Burns 0.339 0.314 0.317

    Jason Kipnis 0.356 0.347 0.353

    Adam Eaton 0.345 0.342 0.344

    Nick Markakis 0.338 0.313 0.314

    Francisco Cervelli 0.359 0.325 0.324

    Avisail Garcia 0.32 0.330 0.327

    David Peralta 0.368 0.337 0.333

    Matt Duffy 0.336 0.342 0.337

    Erick Aybar 0.3 0.300 0.293

    Ender Inciarte 0.329 0.337 0.328

    Xander Bogaerts 0.372 0.336 0.333

    Robinson Cano 0.316 0.325 0.322

    Anthony Gose 0.352 0.349 0.350

    Wilson Ramos 0.256 0.302 0.314

    Austin Jackson 0.342 0.344 0.342

    Eric Hosmer 0.336 0.349 0.349

    Jean Segura 0.298 0.324 0.322

    Jason Heyward 0.329 0.327 0.322

    Brock Holt 0.35 0.336 0.335

    Yunel Escobar 0.347 0.323 0.323

    Starling Marte 0.333 0.338 0.337

    Andrelton Simmons 0.285 0.304 0.306

    Joe Mauer 0.309 0.347 0.354

    Cameron Maybin 0.316 0.331 0.345

    DJ LeMahieu 0.362 0.379 0.385

    Ben Revere 0.338 0.335 0.339

    Dee Gordon 0.383 0.335 0.334

    Christian Yelich 0.37 0.364 0.361

    R Script:

    attach(Data)

    #Diagnostic Plots

    #Box Plot for BABIP

    boxplot(Data$BABIP, data = Data, main = "BABIP in 2015 for Qualified Hitters")

    #Line Graph for Batted Ball Types

    plot(Data$GB., type = "b", col = "blue", lwd = 2, ylim = c(0,1), ylab = "Percentage of Hits", xlab =

    "Player", main = "Hit Distribution in 2015 for Qualified Hitters")

    lines(Data$FB., col = "red", type = "b", lwd=2)

    lines(Data$LD., col = "green", type = "b", lwd=2)

    legend("topright",legend=c("GB%","FB%","LD%"),

    lty=1,lwd=2,pch=21,col=c("Blue","Red", "Green"),

    ncol=2,bty="n",cex=0.8,

    text.col=c("blue","red","green"),

    inset=0.01)

    #Line Graph for Quality of Contact

    plot(Data$Soft., type = "b", col = "blue", lwd = 2, ylim = c(0,1), ylab = "Percentage of Hits", xlab =

    "Player", main = "Quality of Contact Distribution in 2015 for Qualified Hitters")

    lines(Data$Med., col = "red", type = "b", lwd=2)

    lines(Data$Hard., col = "green", type = "b", lwd=2)

    legend("topright",legend=c("Soft%","Med%","Hard%"),

    lty=1,lwd=2,pch=21,col=c("Blue","Red", "Green"),

    ncol=2,bty="n",cex=0.8,

    text.col=c("blue","red","green"),

    inset=0.01)

    #Box Plot for Speed Score

    boxplot(Data$Spd, data = Data, main = "Speed Score in 2015 for Qualified Hitters")

    #Line Graph for StatCast data

    plot(Data$Avg...GB.MPH, type = "b", col = "blue", lwd = 2, ylim = c(80, 115), ylab = "Average MPH", xlab

    = "Player", main = "StatCast Data off the Bat in 2015 for Qualified Hitters")

    lines(Data$Avg...FB.LD.MPH, col = "red", type = "b", lwd=2)

    lines(Data$Avg...MPH, col = "green", type = "b", lwd=2)

    legend("topright",legend=c("Average MPH for Ground Balls",

    "Average MPH for Line Drives and Fly Balls","Average MPH Overall"),

    lty=1,lwd=2,pch=21,col=c("Blue","Red", "Green"),

    ncol=2,bty="n",cex=0.8,

    text.col=c("blue","red","green"),

    inset=0.01)

    Data$HR.FB

    Data$Oppo., Data$O.Swing., Data$Z.Swing., Data$SwStr., Data$O.Contact.,Data$Z.Contact.,

    Data$Contact.)

    M

    plot(TestingSet$BABIP, predict(QualityOfContactModel, TestingSet),

    main = "Quality of Contact Model Against Actual")

    #Creating Final Dataset

    FinalDataSet

    Works Cited

    BABIP. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/pitching/babip/

    Batted ball direction. (n.d.). Retrieved November 19, 2015, from

    http://www.fangraphs.com/library/offense/batted-ball-direction/

    Blengino, T. (2015, August 6). The limitations of the 2015 statcast data. Retrieved November 19,

    2015, from http://www.fangraphs.com/blogs/the-limitations-of-the-statcast-data/

    Chamberlain, A. (2015, May 6). New hitter xBABIP based on BIS batted ball data. Retrieved November

    19, 2015, from http://www.fangraphs.com/fantasy/new-hitter-xbabip-based-on-bis-batted-ball-data/

    Dutton, C. (2008, December 2). Batters and BABIP. Retrieved November 19, 2015, from

    http://www.hardballtimes.com/batters-and-babip/

    Moneyball. (n.d.). Retrieved November 19, 2015, from Internet Movie Database:

    http://www.imdb.com/title/tt1210166/

    Quality of contact stats. (n.d.). Retrieved November 19, 2015, from

    http://www.fangraphs.com/library/offense/quality-of-contact-stats/

    Spd. (n.d.). Retrieved November 19, 2015, from http://www.fangraphs.com/library/offense/spd/

    Swartz, M. (2010, March 23). Ahead in the count: Predicting BABIP, part 1. Retrieved November 19,

    2015, from http://www.baseballprospectus.com/article.php?articleid=10333

