In the summer of 1994, I was pursuing a doctorate in government. Desperately procrastinating, I found myself building datasets of 1993 major league baseball player data.
Funny that I never finished that doctorate…
Another motivation was frustration with the letter-grade system then in use to determine compensation for major league baseball teams for the loss of free agents to a different team. I think the formula for assigning compensation grades was a simple unweighted combination of batting average (BA), home runs (HR), runs batted in (RBI) and stolen bases (SB). Somehow this formula felt invalid to me, even if I did not really become familiar with the details of that concept until later graduate studies in biostatistics and epidemiology. But leaving out runs scored (baseball’s primary goal) and stolen base percentage (stealing 20 bases is not impressive if you are also caught 15 times) felt incomplete.
Procrastination and frustration thus converged in my development of metrics for hitters, starting pitchers and relief pitchers that started from first principles (what defines success as a hitter or a pitcher?) and attempted to weight components in a more thoughtful (but not overly complex) way.
And what I created for hitters was the Index of Offensive Ability (IOA):
(Runs produced + 0.5*SB*SB%)*OPS
Runs produced is RBI plus runs scored minus HR (to avoid double-counting runs and RBI when you drive yourself home with a HR). OPS is on-base percentage (OBP) plus slugging percentage (SLG). On-base percentage is the number of times a hitter reaches base (hit, walk, hit by pitch) divided by plate appearances (PA). Slugging percentage is total bases (singles=1, doubles=2, triples=3, HR=4) divided by at-bats. OPS thus measures on-base ability AND power (albeit double-counting singles). SB% is the proportion of stolen base attempts that succeed, SB divided by SB plus caught stealing (CS).
IOA cannot drop below 0, while it can (in very rare cases) top 300: a player who drove in 150 runs, scored 150 runs, hit 50 HR, stole 50 bases without being caught, with OPS=1.200, would have IOA=(250+25)*1.200=330.0.
In 2016, 633 non-pitchers had at least one PA; Mookie Betts of the American League (AL) Boston Red Sox led them with a 213.8 IOA. Betts scored 122 runs, drove in 113 runs and hit 31 HR; he stole 26 bases in 30 attempts; and his OPS was 0.897. Ranked 2nd was Nolan Arenado of the National League (NL) Colorado Rockies (IOA=208.3), with 116 runs, 133 RBI, 41 HR, 2 SB (3 CS) and OPS=0.932. Rounding out the top 10 were Ian Desmond (177.2), Edwin Encarnacion (184.6), Kris Bryant (186.1), Josh Donaldson (186.6), Xander Bogaerts (186.8), Paul Goldschmidt (189.1), Jose Altuve (190.2) and Mike Trout (205.7). Nobody would argue these were not 10 of the top offensive baseball players in 2016.
The average IOA in 2016 was 58.3, with a standard deviation (SD) of 52.7; median IOA was 41.7. If you combine rounded 2016 averages for all 633 non-pitchers of 34 runs scored, 32 RBI, 9 HR, 4 SB, 2 CS and 0.751 OPS (weighted by PA) into a single player, that player’s IOA would be 43.8.
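The IOA formula can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet actually used for these analyses: the function name `ioa` and its parameter names are made up here, SB% is treated as a proportion between 0 and 1, and a player with no stolen base attempts is assumed to get an SB% of 0. Plugging in the composite average player described above returns the same 43.8.

```python
def ioa(runs, rbi, hr, sb, cs, ops):
    """Index of Offensive Ability: (runs produced + 0.5*SB*SB%) * OPS."""
    # Runs produced: subtract HR so a home run is not counted as both a run and an RBI.
    runs_produced = runs + rbi - hr
    attempts = sb + cs
    # Assumption: with zero steal attempts, SB% is taken as 0 (the term vanishes anyway).
    sb_pct = sb / attempts if attempts else 0.0
    return (runs_produced + 0.5 * sb * sb_pct) * ops


# Composite 2016 "average" non-pitcher from the text.
composite = ioa(runs=34, rbi=32, hr=9, sb=4, cs=2, ops=0.751)
print(round(composite, 1))  # 43.8
```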
I wrote four papers from these analyses. Two years later, when I was applying for my first professional data analysis job, I used these papers as my writing sample.
I got the job.
Since then, I have used the IOA formula (along with a defensive index I devised) to determine my All-Star Game votes and to rank hitters on my beloved Philadelphia Phillies. I continue to use my pitching indices as well (or simply innings pitched divided by opposition OPS).
Meanwhile, sabermetrics boomed over the last few decades, especially after the 2003 release of Michael Lewis’ Moneyball, leading to the proliferation of more statistically-sophisticated metrics of offense, pitching and defense. Two such metrics are wins above replacement (WAR) and OPS+.
WAR calculates how many wins a player would add over the course of a season relative to a minor league replacement; it combines runs created on offense (more complex than runs+RBI-HR) and runs prevented/allowed on defense. It thus combines offense AND defense, while IOA does not. According to baseball-reference.com, WAR>8.0 is “MVP,” 5.0-7.9 is “All-Star,” 2.0-4.9 is “Starter,” 0-1.9 is “Sub” and < 0 is “Replacement.” In 2016, the number of non-pitchers in each category was 2 (0.3%), 24 (3.8%), 103 (16.3%), 269 (42.1%) and 235 (37.5%), respectively.
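The baseball-reference.com tiers quoted above amount to a simple lookup. Here is a hedged sketch: the function name `war_tier` is hypothetical, and because the quoted ranges leave a WAR of exactly 8.0 unassigned, the code assumes it belongs in the “MVP” tier.

```python
def war_tier(war):
    """Map a season WAR value to the tier labels quoted from baseball-reference.com."""
    # Assumption: exactly 8.0 counts as "MVP" (the quoted ranges skip it).
    if war >= 8.0:
        return "MVP"
    if war >= 5.0:
        return "All-Star"
    if war >= 2.0:
        return "Starter"
    if war >= 0.0:
        return "Sub"
    return "Replacement"


# 2016 WAR values taken from the text.
for name, war in [("Mike Trout", 10.6), ("Nolan Arenado", 6.5), ("Matt Kemp", 0.0)]:
    print(name, war_tier(war))
```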
The top 10 2016 non-pitchers based on WAR were Nolan Arenado (6.5), Brian Dozier (6.5), Manny Machado (6.7), Kyle Seager (6.9), Robinson Cano (7.3), Josh Donaldson (7.4), Jose Altuve (7.7), Kris Bryant (7.7), Mookie Betts (9.6) and Mike Trout (10.6); all but Seager were also in the top 25 in 2016 IOA. The average WAR in 2016 was 0.9, with SD=1.8; median WAR was 0.3.
OPS+ is how much higher or lower a player’s OPS is than league average. Thus, OPS+=110 means OPS was 10% above league average, while OPS+=90 means OPS was 10% below league average. OPS and OPS+ both require a minimum number of PA (3.1 per game played) to be reported on “leader boards” (OPS=0.800 over 10 PA is far less meaningful than over 100 PA or 500 PA). OPS+ can be negative (n=48 in 2016), dropping as low as -100 (n=17 in 2016).
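Why the floor is -100 is easier to see from the formula. The sketch below assumes the commonly cited unadjusted form, 100*(OBP/league OBP + SLG/league SLG - 1); that formula is not given in this post, and the published statistic also adjusts for ballpark, which is omitted here. The league averages in the example are hypothetical round numbers, not 2016 values.

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """Unadjusted OPS+ sketch (assumed formula; park adjustment omitted)."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)


# Hypothetical league averages, for illustration only.
LG_OBP, LG_SLG = 0.320, 0.420

print(ops_plus(0.320, 0.420, LG_OBP, LG_SLG))  # an exactly league-average hitter
print(ops_plus(0.000, 0.000, LG_OBP, LG_SLG))  # a hitter who never reaches base
```

Under this form a league-average hitter scores 100, and a hitter with OBP and SLG both equal to 0 scores -100, which matches the observed minimum.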
The top 10 2016 non-pitchers using OPS+ (minimum 200 PA, n=354) were Kris Bryant (149), Josh Donaldson (152), Jose Altuve (154), Freddie Freeman (157), Miguel Cabrera (157), Daniel Murphy (157), Joey Votto (160), David Ortiz (162), Gary Sanchez (168) and Mike Trout (174); all 10 were also in the top 30 in 2016 IOA. Average OPS+ in 2016 was 99.6, with SD=23.6; median OPS+ was 100.
Just bear with me while I briefly discuss validity.
Validity is the extent to which an index/measure/score actually measures what it is designed to measure (the “underlying construct”). While validity is now considered a unitary concept, historically there were three broad approaches to “assessing” it: content, construct and criterion. Criterion validity relates to outcomes derived from the index/measure/score, which are less relevant here.
Content validity is the extent to which an index/measure/score includes the appropriate set of components (not too many, not too few) to capture the underlying construct (say, baseball offensive performance).
The goal of a baseball game is to score more runs than your opponent. To do this, hitters reach base, advance on the bases, and get driven home. The IOA posits that better offensive players meet three criteria: they produce more runs (runs, RBI), they steal more bases (with fewer caught stealings), and they reach base more often (OBP) with more power (SLG). Both IOA and WAR appear to have high content validity, with their emphasis on runs produced/created. At the same time, OPS+ has fewer dimensions (reaching base, power), perhaps giving it less content validity vis-à-vis baseball offensive performance.
Construct validity is how strongly your index/measure/score relates to other indices/measures/scores of the same underlying construct, including a priori expectations of what values should be (sometimes called face validity). The presence of Mike Trout (2014, 2016 AL Most Valuable Player [MVP]), Kris Bryant (2016 NL MVP, 2015 NL Rookie of the Year), Josh Donaldson (2015 AL MVP) and Jose Altuve (4 All-Star Game appearances) on each of the 2016 top 10 in IOA, WAR and OPS+ suggests each metric has at least moderately-high construct validity.
When I developed these indices 20+ years ago, metrics like WAR and OPS+ did not exist. The internet was in its infancy; I was dependent upon printed volumes and publications for data. The easiest way for me to assess construct validity was to see how well IOA (and pitching metric) score rankings fit received wisdom regarding the game’s top players. The top 10 in 1993 IOA were Albert Belle (176.4), Juan Gonzalez (178.3), Lenny Dykstra (183.7), Roberto Alomar (185.7), Rafael Palmeiro (186.4), Ken Griffey Jr. (186.9), Paul Molitor (199.5), Frank Thomas (200.5), John Olerud (205.8) and Barry Bonds (245.4).
Let’s see. Bonds and Thomas were 1993 NL and AL MVPs, respectively; Dykstra was 1993 NL MVP runner-up. Gonzalez would be AL MVP three years later. Alomar, Griffey Jr., Molitor and Thomas are in the Baseball Hall of Fame; Bonds may join them someday (Palmeiro, despite 3,020 hits and 569 HR, will not). And Belle and Olerud had seven All-Star Game appearances between them.
Not bad on the construct validity front.
But now I am able to compare IOA to OPS+ and WAR in a more formal way.
First, however, just bear with me while I briefly describe correlation coefficients and ordinary least squares (OLS) regression.
A correlation coefficient (“r”) is a number between -1.00 and 1.00 indicating how two variables co-relate to each other in a linear way. If the two variables rise together so that the points fall exactly on an upward-sloping line, that would be r=1.00, and if one falls as the other rises so that the points fall exactly on a downward-sloping line, that would be r=-1.00. A value of r=0.00 means there is no linear association between the two variables.
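For readers who prefer code to prose, here is a minimal pure-Python sketch of the correlation coefficient, computed as the covariance divided by the product of the two standard deviations. The function name `pearson_r` and the toy numbers are illustrative only; they are not 2016 baseball data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance(x, y) / (SD(x) * SD(y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Toy data: y doubles x exactly, then y falls as x rises, then a noisy case.
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 2))
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 2))
print(round(pearson_r([1, 2, 3, 4], [2, 1, 4, 3]), 2))
```

The same divisor (n) is used for the covariance and both standard deviations, so it cancels; dividing by n-1 throughout would give the same r.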
OLS regression formally details a linear relationship between one or more independent variables and a dependent variable, analogous to calculating the slope of a line (y = m*x + b).
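With a single independent variable, the OLS slope and intercept have closed-form solutions. The sketch below is a generic textbook version on toy numbers, not the software used for the analyses that follow; `ols_fit` is a hypothetical name.

```python
def ols_fit(x, y):
    """Simple OLS: return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the co-variation of x and y divided by the variation in x.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2*x + 1.
m, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```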
IOA vs. OPS+. OPS+, it turns out, is almost perfectly correlated with OPS (r=0.996 in 2016), an element of IOA. IOA is correlated 0.50 with OPS+ (and OPS) in 2016, as shown in Figure 1:
Figure 1: 2016 IOA and 2016 OPS+ (n=632)
IOA is very highly correlated with PA (0.975), so what Figure 1 really shows is that OPS+/OPS is extremely variable at very low values of PA (with no obvious association between IOA and OPS+/OPS); otherwise, the association appears strongly positive and linear. Indeed, limiting the analysis to the 499 non-pitchers with more than 50 PA increases r to 0.65 (and makes the regression equation IOA= -29.0 + 1.1*OPS+ + error).
Using all 632 non-pitchers, IOA equals 20.6 plus OPS+ times 0.5, on average. For example, OPS+=110 would equate to IOA=75.6, on average. However, limiting the analysis to PA>50 would yield IOA=92.0 on average and limiting to PA>500 would yield IOA=140.6 on average. The association between IOA and OPS+ in 2016 seems to get stronger with more PA.
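The three predictions above are just the three fitted lines evaluated at OPS+=110. A small sketch makes the arithmetic explicit; the intercepts and slopes are the rounded values reported in this post (the PA>500 equation appears in the notes at the end), while the dictionary and function names are made up for illustration.

```python
# (intercept, slope) for IOA = intercept + slope * OPS+, as reported in the post.
MODELS = {
    "all non-pitchers (n=632)": (20.6, 0.5),
    "PA>50 (n=499)": (-29.0, 1.1),
    "PA>500 (n=146)": (30.6, 1.0),
}


def predict_ioa(ops_plus, intercept, slope):
    """Average IOA implied by a fitted line at a given OPS+."""
    return intercept + slope * ops_plus


for label, (intercept, slope) in MODELS.items():
    print(label, round(predict_ioa(110, intercept, slope), 1))
```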
IOA vs. WAR. This association should be stronger because the contents of the two metrics substantially overlap (runs produced/created). The biggest differences are a) IOA (but not WAR) includes OPS and b) WAR (but not IOA) includes runs prevented/allowed.
And, in fact, the correlation between IOA and WAR in 2016 was a solid 0.79, as shown in Figure 2:
Figure 2: 2016 IOA and 2016 WAR (n=633)
Similar to OPS+ and IOA, the positive relationship between WAR and IOA breaks down at very low levels of WAR. A more precise picture of the association is shown in Figure 3, which displays average values of IOA across 15 categories of WAR: <-2.0, -1.9 to -1.0, -0.9 to -0.6, -0.5 to -0.1, 0.0 to 0.4, 0.5 to 0.9, 1.0 to 1.4, 1.5 to 1.9, 2.0 to 2.9, 3.0 to 3.9, 4.0 to 4.9, 5.0 to 5.9, 6.0 to 6.9, 7.0 to 7.9, 8.0 and higher.
Figure 3: Mean 2016 IOA Scores by 2016 WAR Category (n=633)
While the bottom two and top five averages are calculated using fewer than 20 cases, the pattern is remarkably clear. There is a strong positive linear association between WAR and IOA when WAR is -0.5 or higher (n=576), but when WAR is less than -0.5 (n=57), the association is strong, linear and negative.
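The binning behind Figure 3 can be sketched as follows. The bin edges are an assumption: they treat the 15 categories as contiguous ranges of WAR reported to one decimal, with anything below -1.9 in the lowest bin. The seven players are simply those whose 2016 WAR and IOA both appear in this post, not the full n=633 dataset, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Lower edges of categories 2 through 15 (assumed contiguous); below -1.9 is category 1.
EDGES = [-1.9, -0.9, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]


def war_bin(war):
    """Return the 0-based category index (0 = lowest WAR bin, 14 = 8.0 and higher)."""
    return bisect_right(EDGES, war)


def mean_ioa_by_bin(players):
    """Average IOA within each WAR category, given (war, ioa) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for war, ioa_value in players:
        slot = totals[war_bin(war)]
        slot[0] += ioa_value
        slot[1] += 1
    return {b: total / count for b, (total, count) in sorted(totals.items())}


# (WAR, IOA) pairs quoted in the text: Trout, Betts, Arenado, Donaldson, Altuve, Bryant, Kemp.
players = [(10.6, 205.7), (9.6, 213.8), (6.5, 208.3), (7.4, 186.6),
           (7.7, 190.2), (7.7, 186.1), (0.0, 162.2)]
for category, mean_ioa in mean_ioa_by_bin(players).items():
    print(category, round(mean_ioa, 1))
```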
This J-shaped curve suggests that the lowest WAR values are driven by poor defense more than by run production. This could explain Matt Kemp, who had a 2016 WAR of 0.0 despite season totals with the NL San Diego Padres and Atlanta Braves of 89 runs scored, 108 RBI, 35 HR, 1 SB, 0 CS and OPS=0.803, for IOA=162.2. Based on the simple OLS regression in Figure 2, a WAR of 0.0 yields IOA=36.4 on average. And the average WAR for IOA between 160.0 and 164.9 (n=11) was 3.5. The discrepancy lies in poor defensive metrics for the two-time (2009, 2011) NL Gold Glove winner, which essentially cancelled out the 162 runs Kemp produced.
The bottom line? The straightforward metric of baseball player offensive skill I devised in 1994 is very strongly associated with more sabermetrically-sophisticated metrics developed since then, particularly wins above replacement. The association would likely be stronger had I included defense in my metric, or if I had used Offensive WAR instead of Total WAR.
IOA has high content validity (its components include all run-production necessities) and construct validity (its rankings comport with a priori expectations, and it is highly correlated with analogous metrics), and possibly even good criterion validity.
The larger point is that increasing the apparent “sophistication” of a metric does not necessarily make it qualitatively “better.” It is hard to strike the balance between overly simplistic and overly complex, and I believe that IOA does that as well as any baseball offensive metric.
Until next time…
 Standard deviation is the positive square root of variance, a measure of how widely (or narrowly) values are distributed around the average (if you must know, variance is the sum of the squared differences from the mean, divided by the number of observations [sometimes, the number of observations minus 1]).
 Setting the minimum to 500 PA (n=146) knocks out Gary Sanchez, making Nelson Cruz (147) #10.
 I weighted SB efficiency (SB*SB%) 50% to reduce its influence on the final metric: SB are important, but not THAT important.
 This could be strong evidence of criterion validity if “future MVP” and “will be elected to Baseball Hall of Fame” are the outcomes.
 More formally, r = covariance(x,y) divided by [SD(x) * SD(y)].
 Technically, to get the actual value of Y, you add an “error” amount, also known as a residual.
 IOA is correlated 0.99 with runs, 0.87 with HR, 0.97 with RBI, 0.49 with SB, 0.53 with CS, and 0.50 with OPS in 2016.
 OPS could not be calculated for Spencer Kieboom of the Washington Nationals, who walked (then scored a run) in his only 2016 plate appearance.
 Correlations with PA: 0.47 (OPS+), 0.72 (WAR).
 Further limiting to the 146 non-pitchers with PA>500 yields r=0.70 and the equation IOA = 30.6 + 1.0*OPS+ + error.
 Unlike most other offensive data, I could not find a single tabulation of 2016 player WAR values. I entered each value by hand from baseball-reference.com, meaning that I use a single measure of WAR, and limit analyses to 2016.