Revisiting my old baseball player metrics…

In the summer of 1994, I was pursuing a doctorate in government. Desperately procrastinating, I found myself building datasets of 1993 major league baseball player data.

Funny that I never finished that doctorate…

Another motivation was frustration with the letter-grade system then in use to determine compensation for major league baseball teams for the loss of free agents to a different team. I think the formula for assigning compensation grades was a simple unweighted combination of batting average (BA), home runs (HR), runs batted in (RBI) and stolen bases (SB). Somehow this formula felt invalid to me, even if I did not really become familiar with the details of that concept until later graduate studies in biostatistics and epidemiology. But leaving out runs scored (baseball’s primary goal) and stolen base percentage (stealing 20 bases is not impressive if you are also caught 15 times) felt incomplete somehow.

Procrastination and frustration thus converged in my development of metrics for hitters, starting pitchers and relief pitchers that started from first principles (what defines success as a hitter or a pitcher?) and attempted to weight components in a more thoughtful (but not overly complex) way.

And what I created for hitters was the Index of Offensive Ability (IOA):

(Runs produced + 0.5*SB*SB%)*OPS

Runs produced is RBI plus runs scored minus HR (to avoid double-counting runs and RBI when you drive yourself home with a HR). OPS is on-base percentage (OBP) plus slugging percentage (SLG). On-base percentage is the number of times a hitter reaches base (hit, walk, hit by pitch) divided by plate appearances (PA). Slugging percentage is total bases (singles=1, doubles=2, triples=3, HR=4) divided by at-bats. OPS thus measures on-base ability AND power (albeit double-counting singles). SB% is the percentage of successful stolen base attempts.

IOA cannot drop below 0, while it can reach (in very rare cases) 330: a player who drove in 150 runs, scored 150 runs, hit 50 HR, stole 50 bases without being caught, with OPS=1.200, would have IOA = (150 + 150 - 50 + 0.5*50*1.0)*1.200 = 330.0.

In 2016, 633 non-pitchers had at least one PA; Mookie Betts of the American League (AL) Boston Red Sox led them with a 213.8 IOA. Betts scored 122 runs, drove in 113 runs and hit 31 HR; he stole 26 bases in 30 attempts; and his OPS was 0.897. Ranked 2nd was Nolan Arenado of the National League (NL) Colorado Rockies (IOA=208.3), with 116 runs, 133 RBI, 41 HR, 2 SB (3 CS) and OPS=0.932. Rounding out the top 10 were Ian Desmond (177.2), Edwin Encarnacion (184.6), Kris Bryant (186.1), Josh Donaldson (186.6), Xander Bogaerts (186.8), Paul Goldschmidt (189.1), Jose Altuve (190.2) and Mike Trout (205.7). Nobody would argue these were not 10 of the top offensive baseball players in 2016.

The average IOA in 2016 was 58.3, with a standard deviation (SD[1]) of 52.7; median IOA was 41.7. If you combine rounded 2016 averages for all 633 non-pitchers of 34 runs scored, 32 RBI, 9 HR, 4 SB, 2 CS and 0.751 OPS (weighted by PA) into a single player, that player’s IOA would be 43.8.
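For readers who prefer code to formulas, here is a minimal Python sketch of the IOA calculation, applied to the composite "average" player just described. The function name `ioa` and the choice to give zero stolen-base credit to a player with no attempts are my assumptions for illustration, not part of the original formula.

```python
def ioa(runs, rbi, hr, sb, cs, ops):
    """Index of Offensive Ability: (runs produced + 0.5*SB*SB%) * OPS."""
    # Subtract HR so a home run is not counted as both a run and an RBI.
    runs_produced = runs + rbi - hr
    attempts = sb + cs
    # Assumption: a player with no steal attempts gets no stolen-base credit.
    sb_pct = sb / attempts if attempts else 0.0
    return (runs_produced + 0.5 * sb * sb_pct) * ops


# Composite 2016 non-pitcher built from the rounded league averages.
average_player = ioa(runs=34, rbi=32, hr=9, sb=4, cs=2, ops=0.751)
print(round(average_player, 1))  # 43.8
```

Here runs produced is 34 + 32 - 9 = 57, the stolen-base term adds 0.5*4*(4/6), or about 1.3, and multiplying the sum by 0.751 gives 43.8.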


I wrote four papers from these analyses. Two years later, when I was applying for my first professional data analysis job, I used these papers as my writing sample.

I got the job.

Since then, I have used the IOA formula (along with a defensive index I devised) to determine my All-Star Game votes and to rank hitters on my beloved Philadelphia Phillies. I continue to use my pitching indices as well (or simply innings pitched divided by opposition OPS).

Meanwhile, sabermetrics boomed over the last few decades, especially after the 2003 release of Michael Lewis’ Moneyball, leading to the proliferation of more statistically sophisticated metrics of offense, pitching and defense. Two such metrics are wins above replacement (WAR) and OPS+.

WAR calculates how many wins a player would add over the course of a season relative to a minor league replacement; it combines runs created on offense (more complex than runs+RBI-HR) and runs prevented/allowed on defense. It thus combines offense AND defense, while IOA does not. According to, WAR>8.0 is “MVP,” 5.0-7.9 is “All-Star,” 2.0-4.9 is “Starter,” 0-1.9 is “Sub” and < 0 is “Replacement.” In 2016, the number of non-pitchers in each category was 2 (0.3%), 24 (3.8%), 103 (16.3%), 269 (42.1%) and 235 (37.5%), respectively.

The top 10 2016 non-pitchers based on WAR were Nolan Arenado (6.5), Brian Dozier (6.5), Manny Machado (6.7), Kyle Seager (6.9), Robinson Cano (7.3), Josh Donaldson (7.4), Jose Altuve (7.7), Kris Bryant (7.7), Mookie Betts (9.6) and Mike Trout (10.6); all but Seager were also in the top 25 in 2016 IOA. The average WAR in 2016 was 0.9, with SD=1.8; median WAR was 0.3.

OPS+ is how much higher or lower a player’s OPS is than league average. Thus, OPS+= 110 means OPS was 10% above league average, while OPS+ =90 means OPS was 10% below league average. OPS and OPS+ both require a minimum number of PA (3.1 per game played) to be reported on “leader boards” (OPS=0.800 over 10 PA is far less meaningful than over 100 PA or 500 PA). OPS+ can be negative (n=48 in 2016), dropping as low as -100 (n=17 in 2016).

The top 10 2016 non-pitchers using OPS+ (minimum 200 PA, n=354) were Kris Bryant (149), Josh Donaldson (152), Jose Altuve (154), Freddie Freeman (157), Miguel Cabrera (157), Daniel Murphy (157), Joey Votto (160), David Ortiz (162), Gary Sanchez (168) and Mike Trout (174)[2]; all 10 were also in the top 30 in 2016 IOA. Average OPS+ in 2016 was 99.6, with SD=23.6; median OPS+ was 100.


Just bear with me while I briefly discuss validity.

Validity is the extent to which an index/measure/score actually measures what it is designed to measure: its “underlying construct.” While now considered a unitary concept, historically, there were three broad approaches to “assessing” validity: content, construct and criterion. Criterion validity relates to outcomes derived from the index/measure/score, which are less relevant here.

Content validity is the extent to which an index/measure/score includes the appropriate set of components (not too many, not too few) to capture the underlying construct (say, baseball offensive performance).

The goal of a baseball game is to score more runs than your opponent. To do this, hitters reach base, advance on the bases, and get driven home. The IOA posits that better offensive players meet three criteria: they produce more runs (runs, RBI), they steal more bases (with fewer caught stealings[3]), and they reach base more often (OPS) with more power (SLG). Both IOA and WAR appear to have high content validity, with their emphasis on runs produced/created. At the same time, OPS+ has fewer dimensions (reaching base, power), perhaps giving it less content validity vis-à-vis baseball offensive performance.

Construct validity is how strongly your index/measure/score relates to other indices/measures/scores of the same underlying construct, including a priori expectations of what values should be (sometimes called face validity). The presence of Mike Trout (2014, 2016 AL Most Valuable Player [MVP]), Kris Bryant (2016 NL MVP, 2015 NL Rookie of the Year), Josh Donaldson (2015 AL MVP) and Jose Altuve (4 All-Star Game appearances) on each of the 2016 top 10 in IOA, WAR and OPS+ suggests each metric has at least moderately-high construct validity.

When I developed these indices 20+ years ago, metrics like WAR and OPS+ did not exist. The internet was in its infancy; I was dependent upon printed volumes and publications like Baseball Weekly for data. The easiest way for me to assess construct validity was to see how well IOA (and pitching metric) score rankings fit perceived wisdom regarding the game’s top players. The top 10 in 1993 IOA were Albert Belle (176.4), Juan Gonzalez (178.3), Lenny Dykstra (183.7), Roberto Alomar (185.7), Rafael Palmeiro (186.4), Ken Griffey Jr. (186.9), Paul Molitor (199.5), Frank Thomas (200.5), John Olerud (205.8) and Barry Bonds (245.4).

Let’s see. Bonds and Thomas were 1993 NL and AL MVPs, respectively; Dykstra was 1993 NL MVP runner-up. Gonzalez would be AL MVP three years later. Alomar, Griffey Jr., Molitor and Thomas are in the Baseball Hall of Fame; Bonds may join them someday (Palmeiro, despite 3,020 hits and 569 HR, will not)[4]. And Belle and Olerud had seven All-Star Game appearances between them.

Not bad on the construct validity front.

But now I am able to compare IOA to OPS+ and WAR in a more formal way.

First, however, just bear with me while I briefly describe correlation coefficients and ordinary least squares (OLS) regression.

A correlation coefficient (“r”) is a number between -1.00 and 1.00 indicating how two variables co-relate to each other in a linear way[5]. If every time one variable increases, the other variable increases by a proportional amount, that would be r=1.00, and if every time one variable increases, the other variable decreases by a proportional amount (and vice versa), that would be r=-1.00. An r of 0.00 means there is no linear association between the two variables.

OLS regression formally details a linear relationship between one or more independent variables and a dependent variable, analogous to calculating the slope of a line (y = m*x + b[6]).
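If it helps to see these two ideas as arithmetic, here is a small Python sketch that computes r and a one-variable OLS line from scratch. The function names and the toy numbers are mine, invented purely for illustration; they are not baseball data.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """r = covariance(x, y) / (SD(x) * SD(y)); the 1/n factors cancel out."""
    mx, my = mean(x), mean(y)
    co_variation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x)
    spread_y = sum((b - my) ** 2 for b in y)
    return co_variation / sqrt(spread_x * spread_y)


def ols_line(x, y):
    """Slope m and intercept b of the least-squares line y = m*x + b."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept


# Toy data (hypothetical): y_up rises exactly 2 units per unit of x.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]
y_down = [-v for v in x]

print(pearson_r(x, y_up))    # 1.0
print(pearson_r(x, y_down))  # -1.0
print(ols_line(x, y_up))     # (2.0, 1.0)
```

Real data never line up this neatly, which is why the regression equations later in this post carry an error term.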

IOA vs. OPS+. OPS+, it turns out, is nearly interchangeable with OPS (r=0.996 in 2016), which is itself an element of IOA. IOA is correlated 0.50 with OPS+ (and OPS[7]) in 2016, as shown in Figure 1:

Figure 1: 2016 IOA and 2016 OPS+ (n=632[8])


IOA is very highly correlated with PA (0.975[9]), so what Figure 1 really shows is that OPS+/OPS is extremely variable at very low values of PA (with no obvious association between IOA and OPS+/OPS); otherwise, the association appears strongly positive and linear. Indeed, limiting the analysis to the 499 non-pitchers with more than 50 PA increases r to 0.65 (and makes the regression equation IOA= -29.0 + 1.1*OPS+ + error)[10].

Using all 632 non-pitchers, IOA equals 20.6 plus OPS+ times 0.5, on average. For example, OPS+=110 would equate to IOA=75.6, on average. However, limiting the analysis to PA>50 would yield IOA=92.0 on average and limiting to PA>500 would yield IOA=140.6 on average. The association between IOA and OPS+ in 2016 seems to get stronger with more PA.
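The three predictions above are just the three fitted lines evaluated at OPS+=110. A quick Python sketch (the dictionary labels and the helper name `predicted_ioa` are mine; the intercepts and slopes are the ones reported in the text and in footnote [10]):

```python
# (intercept, slope) for IOA = intercept + slope * OPS+ in each subset.
fits = {
    "all non-pitchers (n=632)": (20.6, 0.5),
    "PA > 50 (n=499)": (-29.0, 1.1),
    "PA > 500 (n=146)": (30.6, 1.0),
}


def predicted_ioa(ops_plus, intercept, slope):
    """Average IOA implied by a fitted line, ignoring the error term."""
    return intercept + slope * ops_plus


for label, (intercept, slope) in fits.items():
    print(label, round(predicted_ioa(110, intercept, slope), 1))
# all non-pitchers (n=632) 75.6
# PA > 50 (n=499) 92.0
# PA > 500 (n=146) 140.6
```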

IOA vs. WAR. This association should be stronger because the contents of the two metrics substantially overlap (runs produced/created). The biggest differences are a) IOA (but not WAR) includes OPS and b) WAR (but not IOA) includes runs prevented/allowed.

And, in fact, the correlation between IOA and WAR in 2016 was a solid 0.79, as shown in Figure 2:

Figure 2: 2016 IOA and 2016 WAR (n=633)


Similar to OPS+ and IOA, the relationship between WAR and IOA vanishes at very low levels of WAR. A more precise picture of the association is shown in Figure 3, which displays average values of IOA across 15 categories of WAR: <-2.0, -1.9 to -1.0, -0.9 to -0.6, -0.5 to -0.01, 0.0 to 0.4, 0.5 to 0.9, 1.0 to 1.4, 1.5 to 1.9, 2.0 to 2.9, 3.0 to 3.9, 4.0 to 4.9, 5.0 to 5.9, 6.0 to 6.9, 7.0 to 7.9, 8.0 and higher.

Figure 3: Mean 2016 IOA Scores by 2016 WAR Category (n=633)

While the bottom two and top five averages are calculated using fewer than 20 cases, the pattern is remarkably clear. There is a strong positive linear association between WAR and IOA when WAR is -0.5 or higher (n=576), but when WAR is less than -0.5 (n=57), the association is strong, linear and negative.

This J-shaped curve suggests that the lowest WAR values are driven by poor defense more than run production. This could explain why Matt Kemp had a 2016 WAR of 0.0 despite season totals with the NL San Diego Padres and Atlanta Braves of 89 runs scored, 108 RBI, 35 HR, 1 SB, 0 CS and OPS=0.803, for IOA=162.2. Based on the simple OLS regression in Figure 2, a WAR of 0.0 yields IOA=36.4 on average. And the average WAR for IOA between 160.0 and 164.9 (n=11) was 3.5. The discrepancy lies in poor defensive metrics for the two-time (2009, 2011) NL Gold Glove winner, which essentially cancelled out the 162 runs Kemp produced.


The bottom line? The straightforward metric of baseball player offensive skill I devised in 1994 is very strongly associated with more sabermetrically-sophisticated metrics developed since then, particularly wins above replacement. The association would likely be stronger had I included defense in my metric, or if I had used Offensive WAR instead of Total WAR[11].

IOA has high content validity (its components include all run-production necessities) and construct validity (its rankings comport with a priori expectations, and it is highly correlated with analogous metrics), and possibly even good criterion validity.

The larger point is that increasing the apparent “sophistication” of a metric does not necessarily make it qualitatively “better.” It is hard to strike the balance between overly simplistic and overly complex, and I believe that IOA does that as well as any baseball offensive metric.

Until next time…

[1] Standard deviation is the positive square root of variance, a measure of how widely (or narrowly) values are distributed around the average (if you must know, variance is the sum of the squared differences from the mean, divided by the number of observations [sometimes, the number of observations minus 1]).

[2] Setting the minimum to 500 PA (n=146) knocks out Gary Sanchez, making Nelson Cruz (147) #10.

[3] I weighted SB efficiency (SB*SB%) 50% to reduce its influence on the final metric: SB are important, but not THAT important.

[4] This could be strong evidence of criterion validity if “future MVP” and “will be elected to Baseball Hall of Fame” are the outcomes.

[5] More formally r = covariance(x,y) divided by SD(x) * SD(y).

[6] Technically, to get the actual value of Y, you add an “error” amount, also known as a residual.

[7] IOA is correlated 0.99 with runs, 0.87 with HR, 0.97 with RBI, 0.49 with SB, 0.53 with CS, and 0.50 with OPS in 2016.

[8] OPS could not be calculated for Spencer Kieboom of the Washington Nationals, who walked (then scored a run) in his only 2016 plate appearance.

[9] Correlations with PA: 0.47 (OPS+), 0.72 (WAR).

[10] Further limiting to the 146 non-pitchers with PA>500 yields r=0.70 and the equation IOA = 30.6 + 1.0*OPS+ + error.

[11] Unlike most other offensive data, I could not find a single tabulation of 2016 player WAR values. I entered each value by hand from, meaning that I use a single measure of WAR, and limit analyses to 2016.

A (kinda sorta) brief epidemiology primer

As an epidemiologist (OK, as “a person with a doctorate in epidemiology,” which is totally the same thing), I expect that I will write future blog posts relating to epidemiologic research. In those posts, I will refer readers unfamiliar with epidemiologic concepts and methods to my “epidemiology primer” post.

This is that post.


Epidemiology can be described two ways:

  1. The study of the distribution and determinants of health-related states in specified populations.
  2. The process for estimating the most valid and precise measure of the association between an exposure and an outcome in a specified population.

These definitions are basically the same:

Epidemiology helps to answer the question: At a pre-determined point in the future, how much more/less likely is it that a person with characteristic X will have outcome A than a person with characteristic Y?

To answer this deceptively simple question, members of a selected population (none of whom has A) are divided into two groups—those with X and those with Y. Members of these groups are followed for a certain period of time, after which we compare the percentage of people with X who got A to the percentage of people with Y who got A. If a higher percentage of people with X got A, then X is a risk factor for A. If a lower percentage with X got A, then X is protective against A.

Makes sense, right?

Here is an example (demonstrating why I prefer definition #2). In 2016, the 30 major league baseball (MLB) teams played a total of 2,427 games as the “home” team and 2,424 games as the “road” team. One may ask: Is a road team more likely to lose than a home team? In this case, the exposure would be “road” vs. “home,” and the outcome would be “lose MLB game.”

Wait, you say. Isn’t epidemiology about who gets what disease, or tracking outbreaks of infections, or preventing epidemics, or something?

Yes, epidemiology typically measures occurrence and etiology (“causes”) of diseases (or “health-related states”). My doctoral research explored, in part, the association between an exposure (neighborhood walkability) and two health outcomes, diabetes and depressive symptoms.

But…one can apply epidemiologic methods to any scenario which compares the likelihood of some future event or state across different exposure categories.

An epidemiologist would begin to answer the question posed above with the 2×2 table:

                  Exposed    Unexposed    Total
Outcome           a          b            a+b
No Outcome        c          d            c+d
Total             a+c        b+d          a+b+c+d
% with Outcome    a/(a+c)    b/(b+d)


The 2×2 table for our question shows that road teams lost 53.1% of their games in 2016, while home teams lost 47.0% of their games.

          Road     Home     Total
Lose      1,287    1,140    2,427
Win       1,137    1,287    2,424
Total     2,424    2,427    4,851
% Lose    53.1%    47.0%

There are two ways to compare these percentages, or risks: arithmetic difference and ratio. Collectively these comparisons are called measures of effect.

The risk difference (RD) is the exposed percentage with the outcome minus the unexposed percentage with the outcome. Here, the RD would be 53.1 – 47.0 = 6.1, meaning that a road team’s risk of losing a game was 6.1 percentage points higher than a home team’s risk in 2016. Difference measures are useful for determining how many people would be impacted by a change in exposure status, say through a public health intervention. When risks are equal across exposure categories, the risk difference is 0, implying no association between the exposure and the outcome (the null).

The risk ratio (RR), meanwhile, is the exposed percentage with the outcome divided by the unexposed percentage with the outcome. RR are most commonly reported in peer-reviewed journal articles (and then by the media). Dividing 53.1 by 47.0 yields 1.13, meaning that a road team had a 13% higher risk of losing a game than a home team in 2016. For ratio measures, the null is 1.00, the ratio of two equal values.

A common variant of the risk ratio is the odds ratio, the ratio of the odds[1] an outcome will occur across exposure categories. It can be calculated as (a*d)/(b*c). For our baseball data, the calculation is (1287*1287)/(1137*1140)=1.28. The odds of losing a game for the road team were 28% higher than the odds for the home team in 2016.
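All three measures of effect come straight out of the four cells of the 2×2 table. Here is a short Python sketch using the 2016 road/home counts; the variable names are mine, and "exposed" means "road team."

```python
def risk(cases, total):
    """Proportion of a group that experienced the outcome."""
    return cases / total


# 2016 MLB counts: exposed = road team, outcome = losing the game.
road_losses, road_wins = 1287, 1137   # cells a and c
home_losses, home_wins = 1140, 1287   # cells b and d

risk_road = risk(road_losses, road_losses + road_wins)
risk_home = risk(home_losses, home_losses + home_wins)

risk_difference = (risk_road - risk_home) * 100  # in percentage points
risk_ratio = risk_road / risk_home
odds_ratio = (road_losses * home_wins) / (home_losses * road_wins)  # (a*d)/(b*c)

print(f"RD = {risk_difference:.1f} percentage points")  # RD = 6.1 percentage points
print(f"RR = {risk_ratio:.2f}")                         # RR = 1.13
print(f"OR = {odds_ratio:.2f}")                         # OR = 1.28
```

Note that the odds ratio sits further from the null than the risk ratio, as it generally does when the outcome is common; losing a baseball game is about as common as outcomes get.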


Of course, it is not really this simple, leading the chair of my government doctoral committee to say, in the middle of an American politics seminar, “You know, if you want to do something REALLY hard, get a doctorate in epidemiology.”

[I am not sure what it says about me that I completed the “hard” doctorate (epidemiology) but not the “less hard” doctorate (political science).]

The factors that make epidemiologists want to tear out their hair can be divided into four broad categories:

  1. Counting and timing of events
  2. Study design
  3. Number of exposed cases, AKA random error, or “precision”
  4. Difficulties ascertaining measures of effect, AKA systematic error, or “validity”

Just bear with me while I review each category.

Counting and timing of events. Epidemiology depends upon accurate counting and measurement. Examples include the number of people in Malawi who currently have diabetes, the number of people averaging more than 30 minutes of exercise a day in Saskatchewan in June 2010, or the number of people who will be diagnosed with ovarian cancer in Massachusetts in the next 15 years.

The first two are examples of prevalence: the number of people with a specific condition (or falling into a specific category) at a specific time and place. The third is an example of incidence: the number of people who currently do not have an outcome (not yet a case) who will develop it by a specified future date.

Time is an underappreciated aspect of epidemiology. For one thing, the exposure must come BEFORE the outcome to have any sort of causal association (pollen I will ingest in 2024 cannot be the cause of my current nasal allergy misery). For another, epidemiologic methods depend upon following people over time (follow-up) to see who becomes a case, though follow-up may already have occurred when our intrepid researcher enters the picture.

Some measures of effect include the time from the start of follow-up to case ascertainment. Consider the incidence rate (IR), the number of incident cases in an exposure category divided by the total person-time (PT) in that category. Person-time is how long each person is followed in a study until a) (s)he becomes a case, b) (s)he is lost from the study or c) the study period ends[2]. The associated measures of effect are incidence rate difference (IRD) and incidence rate ratio (IRR).
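The person-time arithmetic in footnote [2] can be sketched in a few lines of Python. The 10 hypothetical study subjects are the ones from that footnote; the function name and the "per 100 person-years" scaling parameter are my own choices.

```python
def incidence_rate(cases, person_years, per=100):
    """Incident cases per `per` person-years of follow-up."""
    return cases / person_years * per


# Footnote [2]: five non-cases followed 5 years each, three cases
# (at 2, 4 and 4 years), and two dropouts after 3 years each.
total_person_years = 5 * 5 + (2 + 4 + 4) + 2 * 3

ir_exposed = incidence_rate(cases=2, person_years=14)
ir_unexposed = incidence_rate(cases=1, person_years=27)

ir_difference = ir_exposed - ir_unexposed
ir_ratio = ir_exposed / ir_unexposed

print(total_person_years)        # 41
print(round(ir_exposed, 1))      # 14.3
print(round(ir_unexposed, 1))    # 3.7
print(round(ir_difference, 1))   # 10.6
print(round(ir_ratio, 2))        # 3.86
```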

This is where things begin to get tricky…because it is not always clear exactly when a person becomes a case. Memories of diagnosis timing may be murky. Research subjects may only be queried every year or two about becoming a case, with only “since the last questionnaire” for timing.

And even this is only when a diagnosis occurred, which is NOT the same as when a person actually became a case.

So here are two more concepts: induction and latency. Induction is the period between when an exposure occurs and a person gets an outcome (if [s]he does). Latency is the period between when the person gets an outcome and when (s)he learns that fact. For an outcome like the common cold, both induction and latency periods are quite short. But some outcomes have long latency periods; a person may have diabetes for as many as 10 years without being aware of it. And for as many as 10 years that person will “truthfully” say/record “I do not have diabetes” whenever asked.

This is our first example of misclassification; more on this later.

Study design. Underlying all epidemiologic studies is the conception that people in different exposure categories are compared on outcome occurrence over a period of time.

I cannot emphasize this fundamental structure (epistemology, even) often or loudly enough.

However, study design differs by when and how data are collected and by when our intrepid researcher enters the picture.

Two study designs (sorry, ecological, cross-sectional and prevalence studies; randomized control trials; and natural experiments) most interest us here: cohort studies and case-control studies.

A cohort study (e.g., the Black Women’s Health Study, a key data source for my doctoral research) is one where investigators enroll a large “cohort” of people (often tens of thousands) at a single point in time (baseline), give them a comprehensive questionnaire, then follow them for as long as possible. Its key strength is that you can collect a lot of information on a lot of people for a long period of time, allowing you to study any exposure or outcome queried in the questionnaires/interviews. Cohort studies include follow-up time and case ascertainment, so each measure of effect discussed above can be estimated from them. Their weaknesses primarily involve cost and time; following tens of thousands of people for decades costs money and requires patience to wait for enough incident cases for adequate study precision. They are also less useful for rare outcomes; if an outcome only occurs in one in one thousand people, even a cohort of 100,000 persons would only contain 100 or so cases, lowering study precision.

And that leads us to case-control studies, which begin by identifying and enrolling cases. What better way to study the etiology of a rare disease (say, molar pregnancy, present in about 1 in 1,500 women with early pregnancy symptoms) than by gathering as many prevalent cases of that disease as possible?

That is the relatively easy part. The harder part is selecting “controls.” Controls are more than just “non-cases,” as they provide “the distribution of exposure in the source population that gave rise to the cases.”

Just bear with me…

It is easiest to think of all case-control studies as being embedded within a larger cohort study. The population of the cohort study would be the “source population,” and so, just as every case came from the cohort study, then so must every control. We can use the “would criterion” here. Would a control have been selected as a case in the study had that person gotten the outcome? If not, then that person cannot be a control[3].

Every case and control is quizzed about their history to determine exposure status and other relevant study information.

Other than being ideal for studying rare disease etiology, case-control studies are more efficient and cost-effective than cohort studies. At the same time, however, case-control study populations number in the hundreds (not tens of thousands), reducing precision. Controls are only a sample of all possible study participants in the source population, meaning only an OR can be calculated from a case-control study. Finally, nearly all exposure and other relevant data are based on participant memory, and thus may not be as accurate as you would prefer (likely resulting in misclassification).

Random error (Precision). Every epidemiologic study draws a sample from a population of interest (e.g., black women), leaving open the possibility of an estimated measure of effect differing from the “true” measure of effect purely by chance. Arithmetic differences between estimated and “true” measures of effect are called bias.

And that leads us to the confidence interval (CI), a range of values around your measure of effect which you are highly confident includes the “true” measure of effect[4]. At a fixed confidence level (typically 90% or 95%), the larger the sample size and (more specifically) the more exposed cases a study has, the narrower the confidence interval.

In the first study of my doctoral thesis (neighborhood walkability and diabetes), a woman was exposed if she lived in a “non-most-walkable” neighborhood, and she was unexposed if she lived in a “most walkable” neighborhood. The estimated IRR[5] was 1.06, with a 95% CI of 0.97-1.18. In other words, while my best estimate was that women living in less walkable neighborhoods had a 6% higher incidence of diabetes over 16 years than women living in the most walkable neighborhoods, I was 95% confident that the truth is somewhere between a 3% decrease and an 18% increase in diabetes incidence over 16 years. That is a very narrow range of possibilities, meaning I had a fairly precise estimate.
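To make the idea of a confidence interval concrete, here is a Python sketch that puts a 95% CI around the road-vs-home risk ratio from the baseball example. Two loud caveats: the post does not say how any CI was computed, so the log-based standard error below is a common textbook approximation that I am assuming, not the method used in my thesis; and it treats every game as an independent observation, which is a simplification.

```python
from math import exp, log, sqrt


def risk_ratio_ci(a, n_exposed, b, n_unexposed, z=1.96):
    """Risk ratio with an approximate CI computed on the log scale.

    a, b: number with the outcome among exposed / unexposed.
    z=1.96 corresponds to a 95% confidence level.
    """
    rr = (a / n_exposed) / (b / n_unexposed)
    # Assumed textbook formula for the standard error of ln(RR).
    se_log_rr = sqrt(1 / a - 1 / n_exposed + 1 / b - 1 / n_unexposed)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# 2016 MLB: 1,287 losses in 2,424 road games, 1,140 in 2,427 home games.
rr, lower, upper = risk_ratio_ci(1287, 2424, 1140, 2427)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

Because the interval is built on the log scale, it is not symmetric around the estimate, and more exposed cases shrink the standard error and narrow the interval.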

Systematic error (Validity). And here is where things REALLY get interesting.

There are errors in estimated measures of effect that CANNOT be fixed by increasing sample size: misclassification and confounding.

Misclassification is when an exposed person is recorded as unexposed (or vice versa) and/or a person with the outcome is recorded as not having the outcome (or vice versa). All epidemiologic studies have some degree of misclassification.

This can happen for many reasons, including a) misrecording or misremembering relevant information; b) being in the latency period for an outcome; c) inadequate measures of exposure/outcome that miscategorize people; d) adequate measures of exposure/outcome divided into etiologically-incorrect categories; and so forth.

Misclassification is measured in two ways: sensitivity and specificity. Sensitivity is the percentage of study participants who truly are exposed/have the outcome that are categorized as exposed/having the outcome. Specificity is the percentage of study participants who truly are unexposed/do not have the outcome that are categorized as unexposed/not having the outcome.

To show how sensitivity and specificity work in practice, let us return to the baseball data.

What if I (or someone else) made transcription errors in defining road and home status for a given game, so that exposure sensitivity and specificity were each 90%? Then this is what the data would “truly” be:

          Road     Home     Total
Lose      1,305    1,122    2,427
Win       1,118    1,306    2,424
Total     2,423    2,428    4,851
% Lose    53.9%    46.2%

RR (1.13 to 1.17) and OR (1.28 to 1.36) are now somewhat further from the null. Identical changes in RR and OR would occur if outcome (loss-win) sensitivity and specificity were each 90%.
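Here is a Python sketch of one way to get from the observed table to the "true" one. I am assuming the standard back-calculation: within each outcome row, observed exposed = sensitivity × true exposed + (1 − specificity) × true unexposed, solved for the true count. The function and variable names are mine.

```python
def correct_exposure(observed_exposed, row_total, sensitivity, specificity):
    """Back-calculate the 'true' number exposed within one outcome row."""
    false_positive_rate = 1 - specificity
    return ((observed_exposed - false_positive_rate * row_total)
            / (sensitivity + specificity - 1))


se = sp = 0.90  # exposure sensitivity and specificity

# Observed 2016 counts: 1,287 of 2,427 losers and 1,137 of 2,424 winners
# were recorded as road teams.
true_road_losses = correct_exposure(1287, 2427, se, sp)
true_home_losses = 2427 - true_road_losses
true_road_wins = correct_exposure(1137, 2424, se, sp)
true_home_wins = 2424 - true_road_wins

risk_road = true_road_losses / (true_road_losses + true_road_wins)
risk_home = true_home_losses / (true_home_losses + true_home_wins)
corrected_rr = risk_road / risk_home
corrected_or = (true_road_losses * true_home_wins) / (true_home_losses * true_road_wins)

print(round(true_road_losses), round(true_road_wins))  # 1305 1118
print(round(corrected_rr, 2), round(corrected_or, 2))  # 1.17 1.36
```

The correction only works when sensitivity plus specificity exceeds 1; at exactly 1 the recorded exposure carries no information and the denominator is zero.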

In general (and there are enough caveats[6] here to fill volumes, or graduate seminars), misclassification tends to “bias toward the null,” meaning that your study estimate will be closer to the null than the true value.

A good way to think about confounding is the counterfactual.

Ideally, if one wanted to measure the independent effect of an exposure—the etiologic role played exclusively by that exposure—on the risk of an outcome, you would follow a group of exposed people over time, ascertaining who does/does not get the outcome. You would then reverse time and rerun the entire study again, with the single difference that everyone is now unexposed. In this way, the ONLY explanation for differences in outcome incidence would be exposure status, because everything else was exactly the same.

The counterfactual is a lot like “what if?” history questions (What if Abraham Lincoln had NOT been assassinated in April 1865? What if “Dutch” Schultz HAD knocked off a young special prosecutor named Thomas Dewey in 1934 or 1935?)

Clearly, we are not able to re-run epidemiologic studies (or history), so we need unexposed study participants to represent what would have happened to exposed study participants had they been unexposed.

Confounding thus occurs when a study’s exposed and unexposed participants differ on a risk (or protective) factor for the outcome, biasing estimated measure(s) of effect. A rule of thumb for determining whether a factor is a confounder is whether adjusting for it changes the measure of effect by at least 10%[7].

Here is an example. One factor assessed in my doctoral research was neighborhood socioeconomic status, or SES. SES combines income and education data to provide a relative sense of how “well-off” a person or place is; it is strongly associated with health outcomes, including diabetes (lower SES=>higher diabetes risk). In my study, more walkable neighborhoods had lower SES, and vice versa. In other words, neighborhood SES appeared to be a strong confounder of the association between neighborhood walkability and diabetes incidence. Indeed, the simple (or crude) association was this: women living in the least walkable neighborhoods had a 21.7% lower incidence of diabetes over 16 years than women living in the most walkable neighborhoods. After adjusting for neighborhood SES, the association was a 1.2% higher incidence, a 29.2% change in the estimated IRR[8].
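The 10% rule of thumb is easy to check by hand, or with a few lines of Python. The function name is mine; the crude and adjusted IRRs are the 21.7% lower and 1.2% higher incidence just described, expressed as ratios.

```python
def percent_change_in_estimate(crude, adjusted):
    """Change in a ratio measure after adjustment, as a percentage of the crude value."""
    return abs(adjusted - crude) / crude * 100


crude_irr = 1 - 0.217      # 21.7% lower incidence -> IRR of 0.783
adjusted_irr = 1 + 0.012   # 1.2% higher incidence -> IRR of 1.012

change = percent_change_in_estimate(crude_irr, adjusted_irr)
is_confounder = change >= 10  # the 10% rule of thumb

print(round(change, 1), is_confounder)  # 29.2 True
```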

Analytic techniques exist to identify[9] and adjust for confounders. The latter essentially involves stratifying (or “drilling down”) study participants into categories of the suspected confounder (e.g., men and women, age groups, selection criteria—“selection bias” is really “confounding by selection”), calculating measures of effect within each subgroup, then calculating a weighted average of those stratified measures of effect in such a way (I will spare you the gory details) that removes the confounding by that particular factor.

However, when stratified measures of effect differ widely (e.g., American League teams’ RR was 1.17 and National League teams’ RR was 1.09—but “league” cannot be a confounder because it does not differ between road and home teams), this is effect modification (or effect measure modification). It is usual practice to report these differences and not adjust for the effect modifier as a confounder.

Of course, one way to remove confounding before a study begins is to randomize study participants into exposure categories, theoretically balancing all possible confounders. This is what happens in randomized controlled trials, often considered the gold standard of health research. There are two problems for most epidemiologic research, however. Randomization does not guarantee balance, and it creates ethical dilemmas around knowingly exposing a study subject to something suspected to be harmful (or at least to increase outcome risk).


I would say “and that is basically it,” except that I have barely scratched the surface of the surface of epidemiologic thinking and methods. However, I think I have presented enough information to clarify future epidemiology-based posts.

If you want to learn more about epidemiology, I strongly recommend Kenneth Rothman’s Epidemiology: An Introduction and Aschengrau and Seage’s Essentials of Epidemiology in Public Health. For eminently readable historical context, I highly recommend Steven Johnson’s The Ghost Map. Finally, the uber-text for contemporary epidemiology is Rothman, Greenland and Lash’s Modern Epidemiology (3rd edition).

Until next time…

[1] The odds of an event is the probability the event occurs divided by the probability the event does not occur: p/(1-p) .

[2] Let’s say you follow 10 people for five years. Five people are followed for five years without becoming a case, which would total 5*5=25 person-years (PY). Three people become cases, one in two years, two in four years; 2+4+4=10 PY. Two people drop out of the study after three years, which would be 2*3=6 PY. Thus, total study PY is 41 person years. Let’s posit five exposed persons and five unexposed persons. The five exposed persons include two cases over 14 PY, and the five unexposed persons include one case over 27 PY. The IR for the exposed would be 2/14 PY = 14.3 cases per 100 PY, while the IR for the unexposed would be 1/27 PY = 3.7 cases per 100 PY. The IR difference would be 14.3 – 3.7 = 10.6 cases per 100 PY, and the IR ratio would be 14.3/3.7 =3.86.
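For readers who prefer to check the arithmetic in footnote [2] themselves, here is a minimal Python sketch. The helper name `incidence_rate` is hypothetical; the inputs are the person-year totals given in the footnote.

```python
def incidence_rate(cases, person_years, per=100):
    """Cases per `per` person-years of follow-up."""
    return cases / person_years * per


# 5 non-cases followed 5 years each, cases at 2, 4 and 4 years,
# and 2 dropouts followed 3 years each
total_py = 5 * 5 + (2 + 4 + 4) + 2 * 3

ir_exposed = incidence_rate(cases=2, person_years=14)
ir_unexposed = incidence_rate(cases=1, person_years=27)
ir_difference = ir_exposed - ir_unexposed
ir_ratio = ir_exposed / ir_unexposed

print(f"Total person-years: {total_py}")
print(f"IR exposed: {ir_exposed:.1f} per 100 PY")
print(f"IR unexposed: {ir_unexposed:.1f} per 100 PY")
print(f"IR difference: {ir_difference:.1f} per 100 PY")
print(f"IR ratio: {ir_ratio:.2f}")
```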

[3] Clearly, there is more to control selection than that, but that is a level of angst I am happy to avoid here.

[4] I am deliberately avoiding comparing confidence intervals and p-values. I may address these issues in a later post.

[5] Technically a “hazard ratio,” which estimates an IRR.

[6] Caveats primarily relate to whether misclassification was “non-differential” or “differential” (did sensitivity and specificity differ across subgroups or not?), and whether it was “non-dependent” or “dependent” (was the source of exposure misclassification independent or dependent upon the source of outcome misclassification?). Another caveat relates to whether exposure has two categories (binary) or more, and where the cut-points are. A more accurate “rule of thumb” is thus “Non-differential, non-dependent misclassification of a binary exposure (or outcome) tends, on average, to bias estimated measures of effect toward the null.”

[7] For various reasons, I used a 5% change-in-estimate criterion in my doctoral studies.

[8] I also adjusted for age and city, yielding the IRR discussed earlier.

[9] A confounder MUST satisfy three conditions: it must differ by exposure status, it must be a risk/protective factor for the outcome among the unexposed, and it must not be on the causal pathway between exposure and outcome.

Gerrymandering is a bigger problem for democracy than for Democrats

It is an article of faith among Democrats and some political commentators that a major barrier to Democrats retaking control of the House of Representatives in 2018, or even in 2020, is Republican gerrymandering following the 2010 U.S. Census. Republicans, the narrative goes, used the governors’ mansions and state legislatures they controlled after the 2010 midterm elections to draw state legislative and U.S. House districts to their partisan advantage.

Gerrymandering is nearly as old as the Republic. The word “gerrymander” comes from Elbridge Gerry, the Massachusetts governor who supervised the redrawing of his state’s legislative districts (U.S. House, state senate, state house) to advantage his Democratic-Republicans following the 1810 U.S. Census. One new state senate district resembled a salamander, leading to the term “Gerry-mander” to describe the drawing of legislative district lines for partisan advantage.

However, outside of “compactness” and rough equality of population across districts of the same type in the same state (under the 14th Amendment’s Equal Protection Clause, per Reynolds v. Sims [1964]), there is no official guidance for drawing legislative districts. Article 1, Section 2 of the U.S. Constitution says only that Representatives “shall be apportioned among the several States…according to their respective Numbers…[and] each State shall have at Least one Representative,” thus requiring the United States Census Bureau to determine every 10 years how many U.S. House members each state has (and thus how many electoral votes it has [# U.S. House members plus 2]). And the Voting Rights Act mandates majority-minority districts under certain circumstances.

What does not exist is a measure of how “fair” legislative districts are. That is why, until very recently, courts have been reluctant to overturn legislative district maps that may confer a partisan advantage, yet still adhere to the requirements of compactness, relative population equality and, where necessary, the creation of majority-minority districts.

Naturally, I will now describe and analyze two potential, related measures of redistricting “fairness,” utilizing two-party vote for the U.S. House of Representatives[1].

Turning votes into seats. The first measure is the seat-to-vote ratio (SVR):  the percentage of legislative districts (seats) won by a party in an election divided by the percentage of the vote won by that party. This is a straightforward concept—if one party wins 53% of the vote, you would expect that party to win (very close to) 53% of available seats. An SVR of 1.00 is a perfect correspondence of seats and votes, while an SVR greater than 1.00 suggests a partisan advantage.

Figure 1: Percentage Democratic U.S. House Seats and Two-Party U.S. House Vote, 1968-2016


In 2016, Democrats won 47.6% of the total U.S. House vote, and Republicans won 48.7%, equivalent to 49.4% and 50.6% of the two-party vote, respectively (Figure 1). However, Democrats “only” won 44.6% (194) of the seats, when they “should” have won 49.4% (~215 seats; SVR=0.90). In other words, Democrats won 10% fewer seats than you would expect if seats and votes lined up evenly.
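As a rough sketch of how the SVR is computed, the following Python snippet works through the 2016 national figures quoted above. The function names are illustrative only, and 435 is the total number of U.S. House seats.

```python
def two_party_share(party_pct, other_pct):
    """Party's share of the votes cast for the two major parties."""
    return party_pct / (party_pct + other_pct)


def seat_vote_ratio(seats_won, total_seats, vote_share):
    """Share of seats won divided by share of the (two-party) vote."""
    return (seats_won / total_seats) / vote_share


TOTAL_SEATS = 435

dem_share = two_party_share(47.6, 48.7)            # Democratic two-party share
svr = seat_vote_ratio(194, TOTAL_SEATS, dem_share)
expected_seats = dem_share * TOTAL_SEATS           # seats if SVR were exactly 1.00

print(f"Democratic two-party share: {dem_share:.1%}")
print(f"SVR: {svr:.2f}")
print(f"Expected Democratic seats: {expected_seats:.0f}")
```

Swapping in Republican figures (241 seats, 48.7% and 47.6%) gives the corresponding Republican SVR.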

Just bear with me while I dive into the arcana of counting U.S. House votes. Feel free to skip the next six italicized paragraphs.

My sources for U.S. House election data are the biennial Statistics of the Presidential and Congressional Election reports published by the Office of the Clerk, U.S. House of Representatives (House Clerk). These reports tabulate votes for the Democratic and Republican candidates as well as for every other smaller party candidate (e.g., Libertarian, Green) as well as No Party Affiliation/No Political Party/Nonaffiliated/Nonpartisan, None of these Candidates (Nevada), Nominated by Petition (Iowa), Write-in/Other Write-in votes, Blank Votes (Hawaii, Maine, Massachusetts, New York, Vermont), Under Votes (Wyoming), Spoiled Votes (Vermont), Void (New York), Over Votes (Hawaii, Wyoming), Scatter/Scattering (New Hampshire, Wisconsin), All Others (Massachusetts) and Miscellaneous (Oregon).

California and Washington have all candidates for an office, regardless of party affiliation, run in a single “jungle primary” prior to Election Day. The top two finishers, regardless of party or vote share, then face off on Election Day. In seven Congressional Districts (CD) in California and one CD in Washington, two Democrats faced off on Election Day, while one CD in Washington had two Republicans facing off on Election Day. I recorded all Democratic and Republican votes from these nine CD in my major party vote totals. Similarly, Louisiana holds its jungle primary on Election Day (with run-offs for the top two vote winners in December). I counted all Democratic and Republican votes on Election Day in my major party vote totals. Finally, Alabama, Georgia, Kentucky, North Carolina and South Dakota only list the Democratic and Republican candidates on their Election Day ballots.

Three states tabulate candidate votes from multiple party lines because smaller parties officially endorsed either the Democratic or Republican candidate for an office; the official House Clerk Democratic and Republican vote tallies exclude these minor party votes. Democrats in Connecticut received votes on the Working Families line, and Republicans received votes on the Independent line. In New York, Democrats received votes on the Working Families, Women’s Equality and, in CDs 2, 18 and 20, Independence lines, and Republicans received votes on the Conservative, Reform, Blue Lives Matter, Stop Iran Deal and, in CDs 1, 10, 11, 13, 19, 21, 23-25 and 27, Independence lines. In South Carolina, Democrats running in CDs 1, 2 and 7 received votes on the Working Families and Green lines. I included these smaller party votes in my major party vote totals because I also wanted to count the margin of votes separating the top two finishers in each U.S. House race.

In Minnesota, the Democratic Party is listed in CDs 1, 4, 5, 7 and 8, while the Democratic-Farmer-Labor Party is listed in CDs 2, 3 and 6. Following Minnesota tradition, I counted all of these votes as “Democratic” in my major party vote totals.

Two states—Florida and Oklahoma—do not print unopposed candidates’ names on their ballots. This means that I have no votes for Democrat Frederica S. Wilson in Florida CD 24 and Republican Jim Bridenstine in Oklahoma CD 1.

For these reasons, while the House Clerk and I agree on a total of 129,888,395 votes cast in U.S. House elections in 2016, the House Clerk breakdown is 47.3% Democratic, 48.3% Republican, 4.3% All Other votes, while my breakdown is 47.6%, 48.7% and 3.7%–a difference that does not materially affect these analyses.

Figure 2: Ratio of Percentage Democratic U.S. House Seats and Percentage Democratic of Two-Party U.S. House Vote, 1968-2016


The Democrats’ 0.90 SVR in 2016 is in line with their 0.91 and 0.92 SVRs from 2012 and 2014 (Figure 2), suggesting that their disadvantage in converting votes to seats has been consistent since the 2010 redistricting.

However, as Figures 1 and 2 show, it is perhaps hypocritical for Democrats to yell too much about having SVR<1.00. Between 1968 and 1992, Democrats averaged 60.1% of U.S. House seats. They also averaged SVRs of 1.13 in the five elections following the 1970 redistricting (1972-80) and 1.11 in the five elections following the 1980 redistricting (1982-90), higher than the average 1.09 SVR for Republicans since 2010.

In 1994, the Republicans netted 54 U.S. House seats to win the majority for the first time since 1952, and they have held that majority for all but four years (2007-10) since. Still, in the five elections after each of the 1990 and 2000 redistricting processes (1992-2010), the U.S. House was, on average, evenly divided (50.1% Democratic) with an average SVR of 1.00.

In other words, when one party is consistently in the majority (Democrats in the 1970s and 1980s; Republicans since 2010), their SVR tends to be greater than 1.00. But is that a sign of systematic bias (partisan redistricting) or merely success at winning a majority of close elections?

One way to address this question is to examine results at the state level.

Seven states (Alaska, Delaware, Montana, North Dakota, South Dakota, Vermont, Wyoming) have only one U.S. House member, rendering gerrymandering moot. Four states (Arizona, California, Idaho and Virginia) use an independent commission for redistricting, ostensibly removing political motivations from redistricting[2].

Removing the 82 U.S. House seats in these 11 states leaves Democrats with 145 (41.1%) of the remaining 353 seats after the 2016 elections, and 47.7% of the two-party vote, for a SVR of 0.86. That is, in the 39 states where partisan gerrymandering could happen, Democrats won 14% fewer seats than would be expected based upon their two-party vote share. In these states, moreover, Republicans netted 21.7 “extra” seats (Table 1).

Table 1: 2016 U.S. House Election Results in 39 States Where State Legislatures Control Redistricting

State | U.S. House Seats | Democratic % 2-Party Vote | Actual Democratic Seats | Expected Democratic Seats | Actual – Expected
Pennsylvania | 18 | 45.9% | 5 | 8.3 | -3.3
Texas | 36 | 39.3% | 11 | 14.2 | -3.2
North Carolina | 13 | 46.7% | 3 | 6.1 | -3.1
Ohio | 16 | 41.8% | 4 | 6.7 | -2.7
Michigan | 14 | 49.4% | 5 | 6.9 | -1.9
South Carolina | 7 | 40.2% | 1 | 2.8 | -1.8
Indiana | 9 | 42.2% | 2 | 3.8 | -1.8
Georgia | 14 | 39.7% | 4 | 5.6 | -1.6
Oklahoma | 5 | 28.1% | 0 | 1.4 | -1.4
Alabama | 7 | 33.7% | 1 | 2.4 | -1.4
Florida | 27 | 45.7% | 11 | 12.3 | -1.3
Utah | 4 | 33.4% | 0 | 1.3 | -1.3
Kansas | 4 | 31.4% | 0 | 1.3 | -1.3
Tennessee | 9 | 35.3% | 2 | 3.2 | -1.2
Wisconsin | 8 | 52.1% | 3 | 4.2 | -1.2
Missouri | 8 | 39.4% | 2 | 3.2 | -1.2
West Virginia | 3 | 33.5% | 0 | 1.0 | -1.0
Louisiana | 6 | 31.0% | 1 | 1.9 | -0.9
Nebraska | 3 | 28.4% | 0 | 0.9 | -0.9
Iowa | 4 | 45.3% | 1 | 1.8 | -0.8
Kentucky | 6 | 29.3% | 1 | 1.8 | -0.8
Mississippi | 4 | 39.8% | 1 | 1.6 | -0.6
Arkansas | 4 | 12.8% | 0 | 0.5 | -0.5
Colorado | 7 | 49.5% | 3 | 3.5 | -0.5
Maine | 2 | 52.0% | 1 | 1.0 | 0.0
New Mexico | 3 | 56.0% | 2 | 1.7 | 0.3
Hawaii | 2 | 78.7% | 2 | 1.6 | 0.4
Washington | 10 | 55.3% | 6 | 5.5 | 0.5
New Jersey | 12 | 54.2% | 7 | 6.5 | 0.5
Rhode Island | 2 | 65.1% | 2 | 1.3 | 0.7
New York | 27 | 63.5% | 18 | 17.2 | 0.8
Minnesota | 8 | 51.8% | 5 | 4.1 | 0.9
New Hampshire | 2 | 51.6% | 2 | 1.0 | 1.0
Nevada | 4 | 50.5% | 3 | 2.0 | 1.0
Oregon | 5 | 58.4% | 4 | 2.9 | 1.1
Illinois | 18 | 54.0% | 11 | 9.7 | 1.3
Massachusetts | 9 | 83.9% | 9 | 7.5 | 1.5
Connecticut | 5 | 63.5% | 5 | 3.2 | 1.8
Maryland | 8 | 63.0% | 7 | 5.0 | 2.0
TOTAL | 353 | 47.7% | 145 | 166.7 | -21.7
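The “expected” and “actual minus expected” columns in Table 1 follow from a simple calculation, sketched here in Python for three of the states. The function name is hypothetical, and because the vote shares in the table are rounded to one decimal place, recomputing other rows this way can differ from the published values by 0.1.

```python
def seat_gap(total_seats, dem_vote_share, dem_seats_won):
    """Return (expected Democratic seats, actual minus expected)."""
    expected = total_seats * dem_vote_share
    return expected, dem_seats_won - expected


# (state, U.S. House seats, Democratic share of two-party vote, Democratic seats won)
sample_rows = [
    ("Pennsylvania", 18, 0.459, 5),
    ("Ohio", 16, 0.418, 4),
    ("Maryland", 8, 0.630, 7),
]

for state, seats, share, won in sample_rows:
    expected, gap = seat_gap(seats, share, won)
    print(f"{state}: expected {expected:.1f}, actual - expected {gap:+.1f}")
```

A negative gap means Democrats won fewer seats than their vote share would suggest; a positive gap means they won more.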

In 17 of these 39 states, Republicans won at least one full seat more than expected from their two-party vote share, totaling 30.5 “extra” seats, while in seven states, Democrats won at least one full seat more than expected from their two-party vote share, totaling 9.6 “extra” seats.

So is this clear evidence that gerrymandering cost Democrats 21 or 22 U.S. House seats in 2016 (nearly the 24 they need to win back control)? Or, as suggested earlier, is it simply that parties with SVR>1.00 do a better job of winning close elections?

The marginals have vanished. Let’s define a “close” election as one decided by less than 10 percentage points and/or where the winning candidate received less than 55% of the total vote. In the four states where Republicans won at least two full seats more than expected from their two-party vote share (Pennsylvania, Texas, North Carolina, Ohio; total=12.3 seats), there were only four close elections (out of 83); Republicans won three of them for a net of two seats. There were zero close seats in the three states (Massachusetts, Connecticut, Maryland; total=5.3 seats) where Democrats won at least 1.5 more seats than expected from their two-party vote share.

In fact, only 47 U.S. House races (10.8%) met this definition of “close” in 2016. Republicans won 27 of them, for a net of seven seats–only one-third of the total advantage Republicans had in winning U.S. House seats in 2016 using the SVR measure.

In 2016, Democrats won their seats by an average 43.4 percentage points, while Republicans won their seats by an average 34.4 percentage points: the vast majority of U.S. House seats are extremely safe. Even if you exclude the 59 CD (13.6%) where candidates of one major party ran essentially unopposed[3] (31 Democrats, 28 Republicans), the average winning margin was still 35.1 percentage points for Democrats and 26.9 percentage points for Republicans.

And that leads us to a second measure of redistricting “fairness”: extraneous votes (ExV).

To win a U.S. House election, you simply need to receive one vote more than your nearest opponent; all other votes are extraneous. For example, Democrat Carol Shea-Porter defeated Republican Frank Guinta in New Hampshire CD 1 by 162,080 to 157,176 votes. Ms. Shea-Porter only NEEDED 157,177 votes to win, meaning that she received 4,903 “extraneous” votes. By contrast, Democrat Dwight Evans defeated Republican James A. Jones in Pennsylvania CD 2 by 322,514 to 35,131, fully 287,382 votes more than he needed.
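The ExV calculation for a single race can be written as a short Python function. This is a minimal sketch with a function name of my own choosing, applied to the vote totals from the two races just described.

```python
def extraneous_votes(winner_votes, runner_up_votes):
    """Votes the winner received beyond the one-vote margin needed to win."""
    votes_needed = runner_up_votes + 1
    return winner_votes - votes_needed


nh_cd1 = extraneous_votes(162_080, 157_176)  # Shea-Porter vs. Guinta
pa_cd2 = extraneous_votes(322_514, 35_131)   # Evans vs. Jones

print(f"New Hampshire CD 1: {nh_cd1:,} extraneous votes")
print(f"Pennsylvania CD 2: {pa_cd2:,} extraneous votes")
```

Averaging this quantity over every seat a party won gives that party's average ExV.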

This matters because, ever since Governor Gerry and his salamander-shaped state senate district, partisan gerrymandering has meant packing as many of one party’s voters (Party A) into as few (very safe because filled with extraneous voters) legislative districts as possible. The other party’s voters (Party B, in control of redistricting) would then be spread across the remaining (somewhat less safe, with fewer extraneous voters) legislative districts. Evidence for partisan gerrymandering would thus be if one party tends to have more extraneous votes.

And there is such evidence. In 2016, Democrats averaged 112,222 ExV and Republicans averaged 98,582, meaning Democrats averaged 13.8% more ExV than Republicans[4]. Narrowing the analysis only to the 39 states where partisan redistricting is even possible closes the gap: 111,401 to 102,963, with Democrats averaging 8.2% more ExV. Further removing seats with candidate(s) of only one major party reduces the absolute gap to 97,701 to 89,970, with Democrats averaging 8.6% more ExV than Republicans.

Table 2: Extraneous Votes for Winning U.S. House Candidates, and State Seat-Vote Ratios for Eight States, 2016

State | Average Democratic ExV 2016 | Democratic U.S. House Seats 2016 | Average Republican ExV 2016 | Republican U.S. House Seats 2016 | 2016 SVR
Pennsylvania | 181,773 | 5 | 106,175 | 13 | 0.61
Texas | 82,805 | 11 | 103,112 | 25 | 0.78
North Carolina | 126,537 | 3 | 112,303 | 10 | 0.49
Ohio | 135,747 | 4 | 71,190 | 12 | 0.60
Michigan | 122,686 | 5 | 73,650 | 9 | 0.72
South Carolina | 107,847 | 1 | 82,484 | 6 | 0.36
Indiana | 135,898 | 2 | 94,554 | 7 | 0.53
Georgia | 165,971 | 4 | 143,680 | 10 | 0.72
OVERALL | 125,430 | 35 | 100,805 | 92 | 0.60
Minus Texas | 144,967 | 24 | 99,944 | 67 | 0.57

There were eight states in 2016 where Republican U.S. House candidates won at least 1.5 more seats than expected from their two-party vote share (Table 1), for a total of 19.3 seats. Across these eight states, Democrats were also well behind on both measures of redistricting unfairness, averaging 24,625 more ExV and a SVR of 0.60 (without Texas: 45,023 ExV and SVR=0.57). Following the 2010 elections (and thus the 2010 U.S. Census), all but North Carolina had a Republican governor and all had state legislatures controlled by Republicans.

There is thus reasonably compelling evidence that Democrats were on the losing end of partisan redistricting in these eight states that was not nearly countered by controlling the post-2010 redistricting process in such states as Massachusetts, Connecticut and Maryland.

What this means for Democrats, and for democracy. There is nothing Democrats can do until after the 2020 U.S. Census to reverse the post-2010 redistricting. The first step, of course, would be to control governors’ mansions in the eight states in Table 2 following the 2018 and 2020 elections. This would require Democrats to hold governors’ mansions in North Carolina and Pennsylvania; win open races in Georgia, Michigan and Ohio; and defeat incumbent Republican governors in Indiana, South Carolina and Texas. They also need to chip away at Republican control of these state legislatures. If I were in charge of the Democratic Party, this is where I would focus my energy.

But there are two larger problems that I see, neither of which is good for our democracy because of the ways they impact partisan polarization and the responsiveness of elected officials.

The first problem stems from a limiting factor on drawing legislative district lines: where each party’s voters live. If Democratic and Republican voters were scattered evenly across a state, partisan redistricting would be much harder.

The fact, however, is that voters are NOT evenly distributed across states. Democrats primarily self-segregate into large urban areas and college towns, the Atlantic and Pacific Coasts, and majority-minority counties (e.g. the southern “Black Belt,” northern New Mexico, southwestern Texas). And while this self-sorting may allow Democrats to win state races by winning large majorities in a few overwhelmingly Democratic counties, it also makes it easier for Republicans to create a few safe Democratic CDs and state legislative districts in a state while carving up the rest of the state for themselves.

To be fair, many of these CDs are majority-minority as dictated by the Voting Rights Act, but there is a difference between winning 90% of the vote and winning 55% of the vote. To a large extent, Democrats have been complicit in adverse redistricting by allowing for the creation of super-majority-minority CDs.

For example, consider the 2016 election results in Pennsylvania (Figures 3 and 4, taken from here and here):

Figure 3: Pennsylvania Counties won by Democrat Hillary Clinton (Red) and by Republican Donald Trump (Blue) in 2016


Figure 4: Pennsylvania Congressional Districts won by Democrat (Blue) and Republican (Red) in 2016


Just six of Pennsylvania’s 67 counties (9.0%) accounted for 57.9% of Hillary Clinton’s vote in that state: five southeastern counties around Philadelphia and Allegheny County (Pittsburgh). In total, she won 11 Pennsylvania counties (16.4%), accounting for 67.1% of her statewide total.

Democrats won only five of 18 Pennsylvania CDs in 2016. Three of them (1, 2 and 13) are clustered in the southeastern corner of the state (average winning margin 81.6 percentage points); the Democrats also won a Pittsburgh-area CD (14) by 48.7 percentage points and a Scranton-area CD (17) by 7.6 percentage points.

Do you see the problem?

Democrats won CDs 1, 2 and 13 in 2016 by an average 239,757 votes; they are bordered to the north, south and west by CDs 6-8, which Republicans won by an average 52,676 votes. If Democrats control the post-2020 redistricting in Pennsylvania, they theoretically could redraw CD boundaries to shift enough Democratic voters from CDs 1, 2 and 13 into CDs 6-8 to make the latter CDs no worse than swing districts, while keeping the former CDs in Democratic hands, partially winning back the 3.3 “extra” seats Republicans won in 2016 (Table 1).

But nothing would be guaranteed and Democrats could end up violating the rules of compactness and relative population equality. Also, the population of Pennsylvania CD 2 is 61.2% Black; redrawing it could run afoul of the Voting Rights Act.

The other two Democratic seats in Pennsylvania offer little aid. Michael Doyle won his CD 14 by 167,294 votes, while Republicans won the bordering CDs (12, 18) by 84,498 and 293,684 votes, respectively; Tim Murphy was unopposed in the latter race. And Matt Cartwright only won his CD 17 by 22,304 votes; not very many Democratic votes to spare there.

So, Democrats’ best chance to make Pennsylvania’s redistricting more “fair” lies in just three CDs in the southeastern corner of the state because that is where the vast majority of the state’s Democrats are. It would almost be easier for Democrats either to win (back?) voters in the rest of the state or to redistribute themselves across the state, neither of which is necessarily all that easy.

And this is broadly true for Democrats nationwide, who would need to flip a net 24 CDs to win back control of the U.S. House in 2018 (since 1970 the average seat loss for the party holding the White House in its first midterm election [n=7] is 24.1). The path of least resistance is to win back the 27 “close” seats won by Republicans in 2016, while losing no more than three of the 20 “close” seats they won; Republicans won 12 of those 27 seats by more than 10 percentage points.

Talk about threading the needle!

This leads us to the second large problem. Even though Democrats are behind on both the ExV and SVR measures, Republicans also had very high average winning margins and extraneous vote totals. BOTH parties are drawing safer and safer CDs for themselves, a trend Yale political scientist David Mayhew first pointed out in 1974.

This lack of competition makes U.S. House members (and legislators generally) more worried about being challenged in a primary (from the left for Democrats and from the right for Republicans) than they are about losing the “swing” voters. When elected officials are more responsive to the median voter in their own party than they are to the median voter in their district (or state), their incentive to cooperate and compromise with elected officials of the other party diminishes accordingly. This leads to legislative gridlock, meaningless “gotcha” votes and increased political polarization.

If partisan gerrymandering exists, and if it leads to such perverse incentives, then it should be viewed as a threat not to one political party, but to our democracy more generally.

Until next time…

[1] I use two-party vote (i.e., the percentage Democratic and percentage Republican of the votes cast only for those two parties) because I will, inter alia, compare percentage of votes won to U.S. House seats held. Every House member is either a Democrat or a Republican, so including third-party votes in calculating these percentages would be misleading.

[2] Hawaii and New Jersey use a “political commission” for redistricting. As I am uncertain how free from partisan pressures these commissions are, I choose to include them in these analyses.

[3] Or faced primarily third-party opposition, including the three Arkansas Republicans whose closest opponent was a Libertarian.

[4] In 34 CDs, only one major party fielded a candidate while smaller parties (usually Libertarians) also fielded candidates. In these CDs, extraneous votes are calculated relative to “all other votes combined.” This increased the average ExV for Democrats and Republicans, only marginally affecting the analysis, as shown later in the paragraph.

Degree or not degree? That is (still) the Democrats’ question.

Democrat Hillary Clinton, despite winning a 2.1 percentage point popular vote margin over Republican Donald Trump, lost the presidency in 2016 because she lost the combined 46 electoral votes (EV) from three states: Pennsylvania, Wisconsin and Michigan. Clinton lost these states by a combined 77,744 votes, and an average of 0.57 percentage points, based on data from Dave Leip’s indispensable Atlas of U.S. Presidential Elections.

In an earlier post, I observed that she lost these three states because, while she essentially held her own in the core Democratic counties of these states, she dramatically underperformed Democrat Barack Obama’s 2012 performance in the rest of these states. In a follow-up post, I found that two variables, state percentage white and state percentage of persons over age 25 with a college degree, account for three-quarters of the variation in state 2016 Democratic presidential margin. It thus becomes apparent that a primary driver of Trump’s narrow victories in these three states was his overwhelming support among white voters without a college degree, heavily concentrated in the vast majority of counties outside the Democratic base counties.

Still, very smart people like pollster Cornell Belcher continue to argue that Clinton was actually doomed by her failure to turn out the Obama coalition of younger voters, minority voters and women.

Here is an excerpt from his argument:

Donald Trump is a president who did not win a plurality of the public. In fact, one of my reports was leaked to the New York Times, saying that millennials were rejecting the binary choice of the lesser of two evils.

When you look at the exit data, you have 8 or 9 percent of younger African-Americans voting third-party. You have 6 or 7 percent of younger Latinos voting third-party. Hillary is almost off Barack Obama’s winning margins by the same percentage of our young people voting third party. So that’s how Trump squeaked in. 

Again, Trump didn’t expand the Republican tent. He didn’t bring in all these millions upon millions of new Republican voters. This was about Democrats losing, more so than Trump remaking the electorate and winning in some sort of profound and new way. (Quoted here)

Belcher is correct that younger voters of ALL races were twice as likely as older voters to vote third party/no answer in 2016, although he overstates the percentages. According to 2016 CNN exit polling data, 10% of white voters aged 18-29 voted third party/no answer, compared to 6% each of black and Latino voters in this age range. Overall, 9% of voters aged 18-29 voted third party/no answer.

Here is the flaw in his argument, however. Voters aged 18-29 of ALL races only comprised 19% of the national electorate in 2016, averaging 18% in the three states that doomed Clinton. So, while up to 9% of these voters cast a third-party ballot, these votes account for less than 2% of all votes cast (0.19*0.09=0.017)…not enough to account for the average decline in the Democratic margin in these three states of 7.8 percentage points.

The counter-argument, of course, is that the margins were so close in these three states that simply holding the youngest voters to, say, 5% third-party would have allowed Clinton to eke out the narrowest of victories in these states.

But when elections are that close, you could argue that ANY demographic group was responsible.

I thus think the best approach is to examine changes from 2012 to 2016 in the number of votes cast for each candidate by members of specified demographic groups in Pennsylvania, Wisconsin and Michigan.

One way to do this is to, first, multiply each group’s percentage of the electorate by the number of votes cast for president in that state, then multiply these totals by the percentage voting for the Democrat, the Republican and Others to get the number of votes cast for each candidate by members of these groups. Exit polling data for 2012 were obtained here.

For example, 6,166,710 presidential votes were cast in Pennsylvania in 2016. An estimated 48% of these voters had a college degree, which translates to 2,960,021 voters. Of these voters, 52% reported voting for Clinton, which translates to 1,539,211 voters. The corresponding number voting for Obama in 2012 is 1,270,841, meaning that an estimated 268,370 more college-educated Pennsylvanians supported the Democratic presidential nominee in 2016 than in 2012.
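The two-step estimate described above can be sketched in Python as follows. The function name is illustrative; the inputs are the Pennsylvania figures from the example, and the 2012 figure is taken as given from the text rather than recomputed.

```python
def estimated_group_votes(total_votes, group_share, candidate_share):
    """Estimate a group's size and its votes for one candidate from exit-poll shares."""
    group_voters = total_votes * group_share
    candidate_votes = group_voters * candidate_share
    return round(group_voters), round(candidate_votes)


# Pennsylvania 2016: college-educated voters and their support for Clinton
college_voters, clinton_college = estimated_group_votes(6_166_710, 0.48, 0.52)

obama_college_2012 = 1_270_841  # corresponding 2012 estimate, as quoted in the text
change = clinton_college - obama_college_2012

print(f"College-educated voters: {college_voters:,}")
print(f"Estimated Clinton votes among them: {clinton_college:,}")
print(f"Change from 2012: {change:+,}")
```

Repeating the same calculation for the Republican and Other columns, and for each demographic group, produces the per-group vote changes summarized in the tables.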

There are problems with this estimation method. State exit polls, with roughly 3,000-vote samples, have a margin of error of approximately 1.7% for “share of the electorate” percentages and larger margins of error for candidate percentages within each group. Also, because these percentages are sample-based, exit-poll-based state-wide candidate percentages differ slightly from actual percentages[1], meaning that summing estimated candidate votes across groups that comprise the entire population (e.g., men and women) gives you state-wide candidate totals slightly different than the actual totals.

In other words, the two sets of “change in vote total” figures reported in Tables 1-3 should be taken with a pinch of salt (or with a margin of error between 2% and 7%). Even with that, however, these data allow for a relative comparison of election-to-election changes in voting behavior by group.

Table 1: Presidential Votes in Pennsylvania, 2012 to 2016, Overall and by Group, Estimated from Actual Votes Cast and Exit Polling

Group | 2016 Electorate | Change from 2012 | Change in Total Votes, 2012-16 | Net Change in Democratic Votes, 2012-16*
Overall | 6,166,710 | +7.1% | +411,090 | -538,756
Women | 53% | +1% | +250,767 | -34,782
Men | 47% | -1% | +160,323 | -474,096
White | 81% | +3% | +505,652 | -280,706
Non-White | 19% | -3% | -94,562 | -222,088
Black | 10% | -3% | -131,560 | -125,475
Latino/a | 6% | 0% | +24,665 | -29,601
White Women | 43% | +3% | +349,437 | +71,124
White Men | 38% | 0% | +156,214 | -362,436
Non-White Women | 10% | -2% | -98,670 | -122,876
Non-White Men | 9% | -1% | +4,109 | -111,660
18-29 | 16% | -3% | -91,477 | -244,244
30-44 | 24% | -1% | +56,522 | -54,165
45-64 | 38% | -1% | +114,975 | -146,089
65+ | 21% | +4% | +331,970 | -20,267
College | 48% | 0% | +197,323 | +339,417
Non-College | 52% | 0% | +213,767 | -739,678

* Change in Democratic votes minus sum of changes in Republican votes and changes in Other votes
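The footnote's definition can be checked against the Pennsylvania college-degree example worked through earlier. The sketch below relies on one identity that is my own observation rather than something stated in the text: within a group, the change in Republican plus Other votes equals the change in total votes minus the change in Democratic votes. The function name is hypothetical.

```python
def net_change_in_democratic_votes(change_in_total_votes, change_in_dem_votes):
    """Change in Democratic votes minus the change in Republican + Other votes."""
    # Whatever part of the total change is not Democratic is Republican or Other.
    change_in_rep_and_other = change_in_total_votes - change_in_dem_votes
    return change_in_dem_votes - change_in_rep_and_other


# Pennsylvania voters with a college degree:
#   change in total votes comes from Table 1,
#   change in Democratic votes is Clinton 2016 minus Obama 2012 from the text.
change_in_total = 197_323
change_in_dem = 1_539_211 - 1_270_841

pa_college_net = net_change_in_democratic_votes(change_in_total, change_in_dem)
print(pa_college_net)
```

The result agrees with the College row in the last column of Table 1.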

Table 2: Presidential Votes in Wisconsin, 2012 to 2016, Overall and by Group, Estimated from Actual Votes Cast and Exit Polling

| Group | 2016 Electorate | Change from 2012 | Change in Total Votes, 2012-16 | Net Change in Democratic Votes, 2012-16* |
|---|---|---|---|---|
| Overall | 2,976,150 | -3.0% | -92,284 | -384,614 |
| Women | 51% | 0% | -47,605 | -128,016 |
| Men | 49% | 0% | -45,219 | -201,451 |
| White | 86% | 0% | -79,364 | -303,964 |
| Non-White | 14% | 0% | -12,920 | -60,723 |
| Black | 7% | 0% | -6,460 | -14,018 |
| Latino/a | 4% | 0% | -3,691 | -8,324 |
| White Women | 45% | +1% | -10,843 | -107,792 |
| White Men | 42% | -1% | -69,444 | -163,887 |
| Non-White Women | 6% | 0% | -5,537 | -498 |
| Non-White Men | 7% | 0% | -6,460 | -57,713 |
| 18-29 | 17% | -4% | -138,426 | -159,231 |
| 30-44 | 23% | -3% | -113,278 | +52,496 |
| 45-64 | 41% | +4% | +70,020 | -239,668 |
| 65+ | 20% | +4% | +89,400 | -3,576 |
| College | 45% | +3% | +50,525 | +56,602 |
| Non-College | 55% | -3% | -142,809 | -362,970 |

* Change in Democratic votes minus sum of changes in Republican votes and changes in Other votes

Table 3: Presidential Votes in Michigan, 2012 to 2016, Overall and by Group, Estimated from Actual Votes Cast and Exit Polling

| Group | 2016 Electorate | Change from 2012 | Change in Total Votes, 2012-16 | Net Change in Democratic Votes, 2012-16* |
|---|---|---|---|---|
| Overall | 4,824,542 | +1.7% | +79,226 | -670,686 |
| Women | 52% | +1% | +88,651 | -188,290 |
| Men | 48% | -1% | -9,425 | -416,840 |
| White | 75% | -2% | -35,487 | -574,687 |
| Non-White | 25% | +2% | +114,713 | -94,411 |
| Black | 15% | -1% | -35,569 | -75,433 |
| Latino/a | 5% | +2% | +98,868 | +6,407 |
| White Women | 40% | +1% | +79,144 | -159,134 |
| White Men | 36% | -2% | -66,385 | -404,891 |
| Non-White Women | 12% | 0% | +9,507 | -29,156 |
| Non-White Men | 12% | +1% | +56,960 | -11,949 |
| 18-29 | 21% | +2% | +111,544 | -92,577 |
| 30-44 | 21% | -2% | -78,269 | -252,549 |
| 45-64 | 39% | -2% | -64,008 | -264,700 |
| 65+ | 19% | +2% | +109,959 | -22,732 |
| College | 42% | -4% | -156,538 | -43,657 |
| Non-College | 58% | +4% | +235,764 | -587,320 |

* Change in Democratic votes minus sum of changes in Republican votes and changes in Other votes

For context, Clinton lost Pennsylvania by 44,292 votes (-0.72%), Wisconsin by 22,748 votes (-0.76%) and Michigan by 10,704 votes (-0.22%).

The Pennsylvania and Michigan electorates were larger in 2016 than in 2012, while the Wisconsin electorate was smaller. Overall, the Democratic presidential margin in these states fell by roughly 385,000 to 671,000 votes.

The composition of the electorates, though, changed very little between 2012 and 2016 in these three states. In Pennsylvania, the 2016 electorate had more white voters, white women, and voters aged 65 and older, and fewer black voters and voters aged 18-29. In Wisconsin, the 2016 electorate had more voters aged 45 and older and voters with a college degree, and fewer voters aged 18-44 and voters without a college degree. In Michigan, the 2016 electorate had fewer voters with a college degree. The one consistency across all three states was a 2-4 percentage point increase in the share of voters aged 65 and older.

However, the voting behavior of demographic groups did change from 2012 to 2016, often dramatically. In all three states, the steepest declines in Democratic vote margin were among white men, all white voters, all men, and, especially, voters without a college degree; there were less steep, but still substantial, declines among voters aged 45-64 (and among voters aged 18-29 in Pennsylvania and Wisconsin, and among voters aged 30-44 in Michigan). Put another way, while the average decline in state-wide Democratic vote margin was 531,352, the average decline in Democratic margin only among voters without a college degree was even higher: 563,323!

At the same time, the average decline in the Democratic vote margin among non-white men was 60,441 and among voters aged 18-29 was 165,351, meaning that the average decline in the Democratic vote margin among non-white men aged 18-29 (the demographic Belcher claims was primarily responsible for Clinton’s defeat) was well below 60,000 votes, probably closer to 15,000-20,000 votes. Ms. Clinton’s average margin of defeat in these states was about 25,915 votes, so even if non-white men aged 18-29 had voted in the same numbers and in the same percentage Democratic as they had in 2012, she still would have lost these three states.
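Several of the three-state averages quoted in the last two paragraphs can be reproduced directly from the last column of Tables 1-3 and from the margins of defeat listed earlier. The sketch below simply hard-codes those published figures; the dictionary and variable names are my own.

```python
from statistics import mean

# Net change in Democratic votes, 2012-16, copied from Tables 1-3.
net_change = {
    "Overall":       {"PA": -538_756, "WI": -384_614, "MI": -670_686},
    "Non-College":   {"PA": -739_678, "WI": -362_970, "MI": -587_320},
    "Non-White Men": {"PA": -111_660, "WI": -57_713,  "MI": -11_949},
}

# Clinton's margin of defeat in votes, as listed after Table 3.
margin_of_defeat = {"PA": 44_292, "WI": 22_748, "MI": 10_704}

# Average decline per group, expressed as a positive number of votes.
average_decline = {
    group: round(-mean(by_state.values()))
    for group, by_state in net_change.items()
}
average_margin_of_defeat = round(mean(margin_of_defeat.values()))

for group, decline in average_decline.items():
    print(f"{group}: {decline:,}")
print(f"Average margin of defeat: {average_margin_of_defeat:,}")
```

These are unweighted means across the three states, so a large state like Pennsylvania counts the same as Wisconsin.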

Why? Because Clinton absolutely cratered among white middle-aged men with no college degree. Had the decline in Democratic vote margin just among voters without a college degree been as little as 10% lower, she would have eked out victories in these three states and won the 2016 presidential election.
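The 10% claim can be checked state by state. The sketch below assumes that "10% lower" means recovering one-tenth of the net decline among voters without a college degree, and that a net change in margin can be compared directly with the margin of defeat in votes; both are my reading of the text. The figures come from Tables 1-3 and the margins listed after Table 3.

```python
# Decline in Democratic vote margin among voters without a college degree.
non_college_decline = {"PA": 739_678, "WI": 362_970, "MI": 587_320}

# Clinton's margin of defeat in votes.
margin_of_defeat = {"PA": 44_292, "WI": 22_748, "MI": 10_704}

# Votes of margin recovered if the decline had been 10% smaller.
recovered = {state: 0.10 * decline for state, decline in non_college_decline.items()}

# Would that recovery have exceeded the margin of defeat?
would_flip = {state: recovered[state] > margin_of_defeat[state] for state in recovered}

# Smallest share of the decline that would have needed to be avoided.
share_needed = {
    state: margin_of_defeat[state] / non_college_decline[state]
    for state in non_college_decline
}

for state in non_college_decline:
    print(f"{state}: flips={would_flip[state]}, share needed={share_needed[state]:.1%}")
```

Under that reading, the share needed is below 10% in every state, which is why a 10% smaller decline is enough in all three.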

Still, the news was not all bad for Democrats in these three states. Voters with a college degree became dramatically more Democratic in Pennsylvania and somewhat more Democratic in Wisconsin; on average across the three states, there was a 117,454 vote increase in the Democratic margin among these voters. Voters aged 65 and older essentially held steady, averaging 15,525 votes less Democratic, while becoming a larger share of the electorate. White women in Pennsylvania, voters aged 30-44 in Wisconsin and Latino/a voters in Michigan also drifted toward the Democrats.

So Belcher has a valid point: the margins in these three states were narrow enough that even a small improvement among the Obama coalition—higher turnout and/or higher Democratic margin—would flip these states blue.

But to truly return to majority status, as I keep demonstrating, Democrats face a choice: try to improve with white voters without a college degree (which would help in states like Iowa, Ohio and, to a lesser extent, North Carolina) or continue to attract white voters with a college degree (which would help in states like Georgia, Texas and Arizona).

Until next time.

[1] Exit-poll-based values (averaging values derived from different sets of groups): Clinton 48.0%, Trump 48.6% (actual percentages: 47.5%, 48.2%, respectively) in Pennsylvania; Clinton 46.7%, Trump 48.4% (46.5%, 47.2%) in Wisconsin; and Clinton 47.3%, Trump 47.1% (47.0%, 47.2%) in Michigan.