Batting Practice: Introduction and Statistics

Numbers have always played a major role in baseball and its fans’ love for the game. How many people can explain the significance of these numbers: 511; 73; 714; 56; .406; 4256? Respectively, they represent Cy Young’s career wins total; Barry Bonds’s 2001 home run total; Babe Ruth’s career home run total; the length in games of Joe DiMaggio’s 1941 hitting streak; Ted Williams’s 1941 batting average; and Pete Rose’s career hits total.

What is sabermetrics? The term is a combination of the acronym SABR (Society for American Baseball Research) and “metrics,” meaning “measurement.” Defined variously as the “search for objective knowledge about baseball” and “the mathematical and statistical analysis of baseball records” by the man who coined the term, noted baseball author Bill James, sabermetrics has become more and more widely accepted as an evaluation tool. Baseball fans, who already memorize and quote numbers to the thousandths place (as in no other part of their lives), now work into their baseball arguments such terms as OPS (on-base plus slugging percentages), its mathematically superior cousin SLOB (slugging times on base), WHIP (walks plus hits per inning pitched) and RF (range factor).

James once wrote that the main reason for sabermetrics is that there is a Baseball Hall of Fame, and sabermetric arguments are frequently used to plead the case for a player’s inclusion in (or exclusion from) the Hall. However, as Michael Lewis’s bestseller Moneyball attests, baseball insiders are taking a serious look at sabermetrics as a team-building tool. Just as a study of baseball statistics can help people better understand mathematics, Moneyball is viewed in the business community as a model for business startups.

The main contribution of Bill James and Pete Palmer, another noted sabermetrician, is their exposure of the deficiency of looking merely at the traditional statistics.
Fresh analysis can be provided by the new statistics, or at the very least, by a new twist on the old statistics.

In statistical analysis and data mining, anomalies are bound to appear, and baseball numbers are no stranger to these. For example, when unequally sized groups are combined into a larger data set, expectations can be confounded. A well-known instance is Simpson’s Paradox. To illustrate how it can arise in baseball, consider the following example. Player A may have a .223 BA against right-handed pitching (45 H / 202 AB) and a .284 BA against lefties (71 H / 250 AB), giving him an overall BA of .257 (116 H / 452 AB). Player B may have a higher BA against righties (.232 on 58 H / 250 AB) and a higher one against left-handers as well (.296 on 32 H / 108 AB), but his overall batting average can nonetheless be lower than that of Player A (.251, or 90 H / 358 AB). The reversal occurs because Player A took most of his at-bats against left-handers, against whom both players hit better, while Player B took most of his against right-handers.

Descriptive statistics provide the mathematical underpinning for many of the measures used in sabermetrics, so it is here that we review some terms and formulas. If you feel comfortable with this subject, you may skip ahead to the next chapter.

Statistics Refresher

We define the mean, or average, of a data set to be the sum of the elements in the set divided by the total number of elements in the set. It is a measure of central tendency. Let’s think about the mean by way of an example. The following are the year-by-year home run totals for Hank Aaron over the course of his career:

13 27 26 44 30 39 40 34 45 44 24 32 44 39 29 44 38 47 34 40 20 12 10

The mean of this set is denoted by the symbol $\bar{x}$ and is determined to be $\bar{x} = 32.83$ (755 total home runs divided by 23 seasons). Another measure of central tendency is the median, which provides the midpoint of the data set. For Aaron’s home runs, this is 34, i.e., he had about as many seasons with more than 34 home runs as he did with fewer. Finally, the mode is the most frequently occurring element in the data set.
For Aaron’s home runs, this number is 44, which, coincidentally, also happens to be his uniform number. The mode and median are least affected by unusually high or low scores, while the mean is most stable, meaning that it shows the least variability when several random samples are taken. In a data set that is normally distributed, one in which the data can be modeled by a bell-shaped curve, all three of the measures are equal. For Aaron, they are not; however, in his playing days, it would have been reasonable to expect Hank Aaron to hit 32 to 34 home runs per season based on this data.

In statistics, measures of dispersion show how tightly or loosely the data is spread in relation to a measure of central tendency. The main measures of the dispersion of the data are range, variance and standard deviation. Aaron’s seasonal home run totals vary from a low of 10 to a high of 47. The range is the maximum minus the minimum, so for Aaron, this is 47 minus 10, or 37. If the home run data is broken up into quartiles, we see that the first quartile consists of those values less than 26, the second quartile ends at the median (34) and the third quartile ends at 44. Thus, the interquartile range (IQR) is 44 − 26 = 18, meaning that the middle 50 percent of the data spans 18 home runs.

A measure of dispersion that utilizes the mean is called the variance. Its formula is given by

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where n represents the number of items in the data set and $x_i$ represents the i-th item. Aaron’s season-by-season home run totals have a variance of 119.62. The square root of the variance is the standard deviation, and it is this measure that gives a clearer picture of the spread of the data. Aaron’s standard deviation is 10.94, and it can be inferred from Chebyshev’s rule that at least 75 percent of the data falls within two standard deviations of the mean, i.e., between 32.83 − (2 × 10.94) and 32.83 + (2 × 10.94), or between about 11 and 55. In fact, 22 of the 23 values (about 96 percent) do; the lone exception is the season of 10 home runs.
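The measures reviewed in this refresher can be recomputed directly from Aaron’s raw totals. What follows is a minimal sketch in Python, not a definitive implementation: the names `AARON_HR` and `describe` are illustrative, and two conventions are assumptions made here, namely dividing by n in the variance and using the “exclusive” quartile method of the standard `statistics` module.

```python
import statistics

# Hank Aaron's season-by-season home run totals, as listed in the text.
AARON_HR = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
            44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]


def describe(data):
    """Return the measures of central tendency and dispersion for data."""
    n = len(data)
    mean = sum(data) / n

    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((x - mean) ** 2 for x in data) / n
    std_dev = variance ** 0.5

    # Assumption: "exclusive" quartile convention; other conventions
    # can give slightly different cut points.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")

    # Count the values lying within two standard deviations of the mean.
    within_two_sd = sum(1 for x in data if abs(x - mean) <= 2 * std_dev)

    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": variance,
        "std_dev": std_dev,
        "within_two_sd": within_two_sd,
    }


summary = describe(AARON_HR)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# Chebyshev's rule guarantees at least 1 - 1/k^2 of the data within k
# standard deviations of the mean; for k = 2 that bound is 0.75.
print("observed share within 2 SD:", summary["within_two_sd"] / len(AARON_HR))
```

Dividing by n is the choice that reproduces the variance of 119.62 quoted above; a sample variance (dividing by n − 1) would come out somewhat larger.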
The mathematical bases for many of the formulas used in sabermetrics are provided by a study of statistical regression and correlation. These studies attempt to determine a line that closely approximates data that can be expressed as ordered pairs, and to measure how well-defined this linear relationship is. If the second coordinate increases when the first coordinate does, then the correlation is said to be positive. If the second coordinate decreases when the first increases, the correlation is said to be negative.

As an example, we will use a sample of the home run and runs batted in totals for some of the seasons of Gil Hodges’ career. Hodges played 18 seasons in the National League, for the Brooklyn (and later Los Angeles) Dodgers and the New York Mets. Consider the following chart:

HR    RBI
11     70
23    115
32    113
40    103
32    102
31    122
42    130
27    102
32     87
27     98
22     64
25     80

Table 1.1 Gil Hodges’ HR and RBI numbers for 12 years of his career

The data is entered point by point to create a picture called a scatterplot, and then we infer a curve or line that approximately passes through the points, which could then be used to predict second coordinates given a first coordinate. Here is the scatterplot for Hodges’ home run and runs batted in data:

Figure 1.1 Scatterplot for Gil Hodges’ 12 seasons of HR and RBI

It would seem that a line having positive slope would pass close to the points, which would indicate a positive correlation. Obviously, a straight line could not possibly hit every point. Figure 1.2 shows the scatterplot with line segments connecting all the points. If we wanted to find a line to model this data in order to make some predictions, we would find that the y-values associated with the x-values in the data set do not all lie exactly on any one line. We want to find a line that best fits the data.
Figure 1.2 Scatterplot for Gil Hodges’ 12 seasons of HR and RBI, with line segments joining the points

One common method of finding such a line is a process leading to a line from which the data points have the minimum distance. Since the standard formula for distance involves square roots, the process instead minimizes the sum of the squares of the (vertical) distances from the data points to the line. This process is called the method of least squares; the line is called the least squares line. Recall that the slope-intercept equation for a line is generally represented as y = mx + b, where x represents the independent variable, y represents the dependent variable, m represents the slope (the ratio of the difference between two y-coordinates to the difference between their corresponding x-coordinates), and b represents the y-intercept. The standard form for the least squares line is y = ax + b, where

$$a = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{and} \quad b = \frac{\sum y - a \sum x}{n}.$$

For Gil Hodges’ home run and RBI data, these values are summarized in Table 1.2.

x       y       xy       x²
11      70      770      121
22      64      1408     484
23      115     2645     529
25      80      2000     625
27      98      2646     729
27      102     2754     729
31      122     3782     961
32      87      2784     1024
32      102     3264     1024
32      113     3616     1024
40      103     4120     1600
42      130     5460     1764
Sum:
344     1186    35249    10614

Table 1.2 Least squares computations for Gil Hodges’ 12 seasons

By the formulas, a = 1.66 and b = 51.21, so the regression line has equation y = 1.66x + 51.21. Thus, if Hodges had hit 35 home runs in a season, he would be expected to have 1.66 × 35 + 51.21 RBI, or roughly 109, not “dead-on” perfect, but not unreasonable in the context of his other seasons.

To determine just how good the fit is, we compute the correlation coefficient r:

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

For this particular data set, the value of r is approximately 0.676. A value close to 1 indicates a high positive correlation, one close to −1 indicates a high negative correlation, and values close to zero mean a very weak correlation, or no correlation at all.
Thus, Gil Hodges’ home runs are moderately correlated with his RBI.
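
The regression line and correlation coefficient for the Hodges data can be checked with a short script. The following is a minimal sketch in Python that applies the standard summation formulas for the least squares slope and intercept and for the correlation coefficient to the (HR, RBI) pairs of Table 1.2. The names `HODGES` and `least_squares` are illustrative, not taken from any library.

```python
from math import sqrt

# (HR, RBI) pairs for the 12 Gil Hodges seasons, in the order of Table 1.2.
HODGES = [(11, 70), (22, 64), (23, 115), (25, 80), (27, 98), (27, 102),
          (31, 122), (32, 87), (32, 102), (32, 113), (40, 103), (42, 130)]


def least_squares(pairs):
    """Return slope a, intercept b and correlation r for y = a*x + b."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_xx = sum(x * x for x, _ in pairs)
    sum_yy = sum(y * y for _, y in pairs)

    # These two quantities appear in both the slope and the correlation.
    covariation = n * sum_xy - sum_x * sum_y
    x_variation = n * sum_xx - sum_x ** 2
    y_variation = n * sum_yy - sum_y ** 2

    a = covariation / x_variation
    b = (sum_y - a * sum_x) / n
    r = covariation / sqrt(x_variation * y_variation)
    return a, b, r


a, b, r = least_squares(HODGES)
print(f"regression line: y = {a:.2f}x + {b:.2f}")
print(f"correlation coefficient: r = {r:.3f}")

# Predicted RBI for a hypothetical 35-home-run season.
print(f"predicted RBI at 35 HR: {a * 35 + b:.0f}")
```

The intermediate sums in the function correspond to the column totals of Table 1.2, plus the sum of the squared y-values, which is needed only for r.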