Using Jon Ossoff polling data to make a point about statistical significance testing

I do not like the phrase “statistical dead heat,” nor do I like the phrase “statistical tie.” These phrases oversimplify the level of uncertainty accruing to any value (e.g., polling percentage or margin) estimated from a sample of a larger population of interest, such as the universe of election-day voters; when you sample, you are only estimating the value you wish to discern. These phrases also reduce quantifiable uncertainty (containing interesting and useful information) to a metaphorical shoulder shrug: we really have no idea which candidate is leading in the poll, or whether two estimated values differ or not.

For example, a poll released June 16, 2017 showed Democrat Jon Ossoff leading Republican Karen Handel 49.7% to 49.4% among 537 likely voters in the special election runoff in Georgia’s 6th Congressional District. The margin of error (MOE) for the poll was +/-4.2 percentage points, meaning that we are 95% confident that Ossoff’s “true” percentage is between 45.5% and 53.9%, while Handel’s is between 45.2% and 53.6%.

In other words, these data suggest a wide range of possible values, anywhere from Ossoff being ahead 53.9% to 45.2% to Handel being ahead 53.6% to 45.5%. In fact, there is a 5% chance that either candidate is further ahead than that. Finally, because estimates from random samples such as these are approximately normally distributed (the “bell curve”) around the “true” value, percentages closer to those reported (Ossoff ahead 49.7% to 49.4%) are more likely than percentages further from those reported.
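As a rough check, the +/-4.2 figure can be reproduced from the sample size alone. The sketch below assumes a simple random sample and the usual normal approximation; the pollster may well have applied weighting or design adjustments that are not reflected here, and the function name is my own.

```python
import math

def margin_of_error(share, n, z=1.96):
    """95% margin of error, in percentage points, for a share (0 to 1)
    estimated from a simple random sample of size n (an assumption)."""
    return 100 * z * math.sqrt(share * (1 - share) / n)

n = 537  # likely voters in the June 16, 2017 poll
for name, pct in [("Ossoff", 49.7), ("Handel", 49.4)]:
    moe = margin_of_error(pct / 100, n)
    print(f"{name}: {pct:.1f}% +/- {moe:.1f} -> {pct - moe:.1f}% to {pct + moe:.1f}%")
```

Under these assumptions the MOE comes out at roughly 4.2 percentage points for both candidates, in line with the reported value.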

But this is a lot to report, and to digest, so we use phrases like “statistical dead heat” or “statistical tie” as cognitive shorthand for “there is a wide range of possible values consistent with the data we collected, including each candidate having the exact same percentage of the vote.”

Each phrase has its roots in classical statistical significance testing. The goal of this testing is to assure ourselves that any value we estimate from data we have collected (a percentage in a poll, a relative risk, a difference between two means) is not 0 (or, for a ratio such as a relative risk, not 1, the value that denotes “no effect”).

To do so, we use the following, somewhat convoluted, logic.

Let’s assume that the value (or some test statistic derived from that value) we have estimated actually is 0; we will call this the null hypothesis. What is the probability (we will call this the “p-value”) that we would have obtained this value/test statistic or one even higher purely by chance?

Got that?

We are measuring the probability—assuming that the null hypothesis is true—that a value (or one higher) was obtained purely by chance.

And if the probability is very low, it would therefore be very unlikely that we have gotten our value purely by chance, so it must be the case that we did NOT get it by chance. And so we can “reject” the null hypothesis (even though we assumed it to be true to arrive at this rejection), given that the value we got was so high.

The higher the probability, the more difficult it becomes to reject the null hypothesis.

By a historical accident, any p-value less than 0.05 is considered “statistically significant,” meaning that we can reject the null hypothesis.
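To make the logic concrete, here is a minimal sketch of how a p-value falls out of a test statistic, assuming the statistic follows a standard normal distribution when the null hypothesis is true. The z values in the loop are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

# Distribution of the test statistic if the null hypothesis is true
# (assumed here to be standard normal: mean 0, SD 1).
STANDARD_NORMAL = NormalDist()

def one_sided_p_value(z):
    """Probability of a statistic this high or higher, purely by chance."""
    return 1 - STANDARD_NORMAL.cdf(z)

def two_sided_p_value(z):
    """Probability of a statistic this far from 0 in either direction."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

for z in (0.5, 1.0, 1.96, 3.0):  # hypothetical test statistics
    verdict = "reject" if two_sided_p_value(z) < 0.05 else "cannot reject"
    print(f"z = {z:.2f}: one-sided p = {one_sided_p_value(z):.3f}, "
          f"two-sided p = {two_sided_p_value(z):.3f} -> {verdict} the null")
```

Note how the verdict flips right around z = 1.96, which is where the 0.05 cut-point sits for a two-sided test.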

Of course, we REALLY want to know how probable the null hypothesis itself is, but that is a vastly trickier proposition.

Or, even better…we REALLY want to know how likely the actual value we observed is.

Think about it. All we are really learning from classical statistical significance testing is either “our value is probably not 0” or “we can’t be certain that our value is not 0…it just might be.” This tells us nothing about the quality of the actual estimate we obtained, or how near the “true” value it actually is.

Now, to be fair to the 0.05 cut-point for determining “statistical significance,” it does have an analogue in the 95% confidence interval.

The 95% confidence interval (CI) is very similar to the polling MOE discussed earlier. It is a range of values (often calculated as value +/-1.96*standard error[1]) which we are 95% confident includes the “true” value.

Let’s say you estimate the impact of living in a less walkable neighborhood relative to living in a more walkable neighborhood on incident diabetes over 16 years of follow-up. Your estimated relative risk is 1.06 (i.e., a 6% higher risk of developing diabetes), with a 95% CI of 0.90 to 1.24. In other words, you are 95% confident that the “true” effect is somewhere between a 10% decrease in incident diabetes risk and a 24% increase in incident diabetes risk.

Ahh, but this is where that pesky cognitive shorthand comes back. See, that 95% CI you reported includes the value 1.00 (i.e., no effect at all). Therefore, there is likely no effect of neighborhood walkability on incident diabetes.

No, no, a thousand times no.

It simply means that there is a specified range of possible measures of effect, only one of which is “no effect.” In fact, the bulk of the possible effects are on the “risk” side (1.01-1.24), rather than on the “protective” side (0.90-0.99).
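One hedged way to put a number on “the bulk” is to treat the confidence interval as a normal distribution of plausible effects on the log scale, centered on the estimate. This is an informal reading (the same one applied to polling margins in this post), not something the interval strictly guarantees, and the variable names are my own.

```python
import math
from statistics import NormalDist

estimate, lower, upper = 1.06, 0.90, 1.24  # relative risk and its 95% CI

# Relative risks are roughly symmetric on the log scale, so back out the
# standard error from the width of the interval (an approximation).
log_se = (math.log(upper) - math.log(lower)) / (2 * 1.96)

# Assumption: plausible effects are normally distributed around the estimate.
plausible = NormalDist(mu=math.log(estimate), sigma=log_se)

share_protective = plausible.cdf(math.log(1.0))  # relative risk below 1.00
share_risk = 1 - share_protective                # relative risk above 1.00

print(f"Share of plausible effects above 1.00: {share_risk:.0%}")
print(f"Share of plausible effects below 1.00: {share_protective:.0%}")
```

Under these assumptions, roughly three quarters of the plausible effects fall on the risk side, even though the interval includes 1.00.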

Just bear with me while I come to the point of this statistical rigmarole.

Early this morning, I posted this on Facebook:

The election-eve consensus is that the Jon Ossoff-Karen Handel race (special election runoff in Georgia’s 6th Congressional District) is a dead heat, with Handel barely ahead. This consensus is based in large part on the RealClearPolitics polling average (Handel +0.25). However, the RCP only looks at the most recent poll by any given pollster, and only within a very narrow time frame

Hogwash (for the most part).

All polls are samples from a population of interest, meaning that you WANT to pool recent polls from the same pollster (each is a separate dive into the same pool using the same methods). Also, I found no evidence that the polling average has changed much since the first election April 19

My analysis (90% hard science, 10% voodoo) is that Ossoff is ahead by 1.4 percentage points. Assume a very wide “real” margin of error of 9 percentage points, and Ossoff is about a 62% favorite to win today.

Meaning, of course, that there is a 38% chance Handel wins

That is still a very close race, but I would give Ossoff a small edge

And, bloviating punditry aside, for Ossoff even to lose by a percentage point would be a remarkable pro-Democratic shift for a Congressional seat Republicans have dominated for 40 years.

Polls close at 7 pm EST.

Here is the full extent of my reasoning.

I collected all 12 polls of this race taken after the first round of voting on April 20, 2017. Four were conducted by WSVB-TV/Landmark and showed Ossoff ahead by 1 (polling midpoint 5/31/2017), 3 (6/7), 2 (6/15) and 0 (6/19) percentage points. Two each were conducted by the Republican firm Trafalgar Group (Ossoff +3 [6/12], Ossoff -2 [6/18]) and by WXIA-TV/SurveyUSA (Ossoff +7 [5/18], even [6/9]). Other polls were conducted by Landmark Communications (Ossoff -2 [5/7]), Gravis Marketing (Ossoff +2 [5/9]), the Atlanta Journal-Constitution (Ossoff +7 [6/7]) and Fox 5 Atlanta/Opinion Savvy (Ossoff +1 [6/15]).

On average, these polls show Ossoff ahead by 1.85 percentage points.

Using a procedure I suggest here, I subtracted the average of all other polls from the average of those from a single pollster. For example, the average of the four WSVB-TV/Landmark polls was Ossoff +1.5, while the average of the other eight polls was Ossoff +2.0. This difference (or “bias”) of -0.5 percentage points shows the WSVB-TV/Landmark polls may have slightly underestimated the Ossoff margin.

I then “adjusted” each poll by subtracting its “bias” from the original polling value (e.g., I added 0.5 to each WSVB-TV/Landmark Ossoff margin). For convenience, I lumped the pollsters releasing only one poll into a single “other” category; its “bias” was only 0.2.

The “adjusted” Ossoff margin was now +1.865.
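The adjustment can be sketched in a few lines. This version uses the whole-number margins listed above, so it lands near, but not exactly on, the figures quoted in the text, which were presumably computed from unrounded values. The function and variable names are illustrative.

```python
# Ossoff margins by pollster, as listed above (rounded to whole points).
polls = {
    "WSVB-TV/Landmark": [1, 3, 2, 0],
    "Trafalgar Group": [3, -2],
    "WXIA-TV/SurveyUSA": [7, 0],
    "Other": [-2, 2, 7, 1],  # the four single-poll pollsters, lumped together
}

def mean(values):
    return sum(values) / len(values)

def house_bias(pollster):
    """Pollster's own average minus the average of everyone else's polls."""
    own = polls[pollster]
    others = [m for name, ms in polls.items() if name != pollster for m in ms]
    return mean(own) - mean(others)

# Subtract each pollster's bias from each of its polls.
adjusted = {name: [m - house_bias(name) for m in ms] for name, ms in polls.items()}

raw_average = mean([m for ms in polls.values() for m in ms])
adjusted_average = mean([m for ms in adjusted.values() for m in ms])

for name in polls:
    print(f"{name}: bias {house_bias(name):+.2f}")
print(f"Raw average: {raw_average:+.2f}, adjusted average: {adjusted_average:+.2f}")
```

With the rounded inputs, the WSVB-TV/Landmark bias comes out at -0.5 and the adjusted average at about +1.85.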

To see whether the Ossoff margin had been increasing or decreasing monotonically over time, I ran an ordinary least squares (OLS) regression of Ossoff margin against polling date midpoint (using the average, if polls had the same midpoint date). There was no evidence of change over time; the r-squared (a measure of the variance in Ossoff margin accounted for by time) was 0.01.
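For readers who want to try this themselves, here is a minimal sketch of such a regression in plain Python. It assumes the raw, rounded margins listed above (the text does not say whether raw or adjusted margins were used) and counts time in days since the earliest poll, so the r-squared it produces will not match the quoted value exactly.

```python
from collections import defaultdict
from datetime import date

# (polling midpoint, Ossoff margin), using the rounded margins listed above.
polls = [
    (date(2017, 5, 31), 1), (date(2017, 6, 7), 3), (date(2017, 6, 15), 2),
    (date(2017, 6, 19), 0), (date(2017, 6, 12), 3), (date(2017, 6, 18), -2),
    (date(2017, 5, 18), 7), (date(2017, 6, 9), 0), (date(2017, 5, 7), -2),
    (date(2017, 5, 9), 2), (date(2017, 6, 7), 7), (date(2017, 6, 15), 1),
]

# Average the margins of polls that share a midpoint date.
by_date = defaultdict(list)
for midpoint, margin in polls:
    by_date[midpoint].append(margin)

first_day = min(by_date)
xs = [(d - first_day).days for d in sorted(by_date)]
ys = [sum(by_date[d]) / len(by_date[d]) for d in sorted(by_date)]

def ols(xs, ys):
    """Simple least-squares line: returns slope, intercept and r-squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)

slope, intercept, r_squared = ols(xs, ys)
print(f"Slope: {slope:+.3f} points per day, r-squared: {r_squared:.2f}")
```

An r-squared near 0 means polling date explains almost none of the variation in the Ossoff margin.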

Still, out of a surfeit of caution, I decided to assign a weight of “2” to the most recent poll by WSVB-TV/Landmark, Trafalgar Group and WXIA-TV/SurveyUSA and a weight of 1 to the other nine polls.

Using the bias-adjusted polls and this simple weighting scheme, I calculated an Ossoff margin of 1.38, suggesting recent tightening in the race not captured by my OLS regression[2].
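The weighting step can be sketched as follows, again assuming the rounded margins listed earlier, which is why the result comes out a few hundredths below 1.38. The names are illustrative, and midpoints are stored as (month, day) pairs purely for easy comparison.

```python
# (pollster, (month, day) of polling midpoint, Ossoff margin), rounded values.
polls = [
    ("WSVB-TV/Landmark", (5, 31), 1), ("WSVB-TV/Landmark", (6, 7), 3),
    ("WSVB-TV/Landmark", (6, 15), 2), ("WSVB-TV/Landmark", (6, 19), 0),
    ("Trafalgar Group", (6, 12), 3), ("Trafalgar Group", (6, 18), -2),
    ("WXIA-TV/SurveyUSA", (5, 18), 7), ("WXIA-TV/SurveyUSA", (6, 9), 0),
    ("Other", (5, 7), -2), ("Other", (5, 9), 2),
    ("Other", (6, 7), 7), ("Other", (6, 15), 1),
]

def mean(values):
    return sum(values) / len(values)

def house_bias(pollster):
    """Pollster's own average minus the average of all other polls."""
    own = [m for p, _, m in polls if p == pollster]
    others = [m for p, _, m in polls if p != pollster]
    return mean(own) - mean(others)

# The most recent poll from each multi-poll pollster gets double weight.
repeat_pollsters = {"WSVB-TV/Landmark", "Trafalgar Group", "WXIA-TV/SurveyUSA"}
latest = {p: max(d for q, d, _ in polls if q == p) for p in repeat_pollsters}

weighted_sum = 0.0
total_weight = 0
for pollster, midpoint, margin in polls:
    weight = 2 if latest.get(pollster) == midpoint else 1
    weighted_sum += weight * (margin - house_bias(pollster))
    total_weight += weight

weighted_average = weighted_sum / total_weight
print(f"Weighted, bias-adjusted Ossoff margin: {weighted_average:+.2f}")
```

Because the three double-weighted polls are also among the least favorable to Ossoff, the weighted average sits below the unweighted one.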

So, let’s say that our best estimate is that Ossoff is ahead by 1.38 percentage points heading into today’s voting. There is a great deal of uncertainty around this estimate, resulting both from sampling error (an overall MOE of 2.5 to 3 percentage points around an average Ossoff percentage and an average Handel percentage, which you would double to get the MOE for the Ossoff margin—say, 5 to 6 percentage points) and the quality of the polls themselves.

Now, let’s say that our Ossoff margin MOE is nine percentage points. I admit up front that this figure, deliberately larger than 6 percentage points, is somewhat arbitrary; I am using it to make a point.

In a normal distribution, 95% of all values are within two (OK, 1.96) standard deviations (SD) of the midpoint, or mean. If you think of the Ossoff margin of +1.38 as the midpoint of a range of possible margins distributed normally around the midpoint, then the MOE is analogous to the 95% CI, and the standard deviation of this normal distribution is thus 9/1.96 = 4.59.

To win this two-candidate race, Ossoff needs a margin of one vote more than 0%. We can use the normal distribution (mean=1.38, SD=4.59) to determine the probability (based purely upon these 12 polls taken over two months with varying quality) that Ossoff’s margin will be AT LEAST 0.01%.

And the answer is…61.7%!

Using a higher SD will yield a win probability somewhat closer to (but still larger than) 50%, while a lower SD will yield an even higher win probability.
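The calculation, and its sensitivity to the assumed MOE, can be sketched like this. It assumes, as above, that possible margins are normally distributed around the estimate; the 6- and 12-point MOEs are simply alternative values for comparison, and the function name is my own.

```python
from statistics import NormalDist

def win_probability(margin, moe, threshold=0.01):
    """Probability that the final margin is at least `threshold`, treating
    the MOE as 1.96 standard deviations of a normal distribution."""
    sd = moe / 1.96
    return 1 - NormalDist(mu=margin, sigma=sd).cdf(threshold)

ossoff_margin = 1.38
for moe in (6, 9, 12):  # 9 is the value used in the text
    print(f"MOE {moe:>2}: SD {moe / 1.96:.2f}, "
          f"Ossoff win probability {win_probability(ossoff_margin, moe):.1%}")
```

With an MOE of 9 this reproduces the 61.7% figure; a narrower MOE pushes the probability up, and a wider one pulls it toward (but never below) 50%.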

Here is the larger point.

It may sound like Ossoff +1.38 +/-9.0 is a “statistical dead heat” or “statistical tie” because it includes 0.00 and covers a wide range of possible margins (Ossoff -7.62 to Ossoff +10.38, with 95% confidence), but the reality is that this range of values includes more Ossoff wins than Ossoff losses, by a ratio of 62 to 38.

You can reanalyze these polls and/or question my assumptions, but you cannot change the mathematical fact that a positive margin, however small and however large the MOE, is still indicative of a slight advantage (more values above 0 than below).

Until next time…

**********

This is an addendum started at 12:13 am on June 21, 2017.

According to the New York Times, Handel beat Ossoff by 3.8 percentage points, 51.9 to 48.1%. My polling average (Ossoff+1.4) was thus off by -5.2 percentage points. That is a sizable polling error. RealClearPolitics (RCP) was somewhat closer (-3.6 percentage points), while HuffPostPollster (HPP) was the most dramatically different (-6.2 percentage points).

Why such a stark difference? And why was EVERY pollster off (the best Handel did in any poll was +2 percentage points, twice)?

I think the answer can be found in a simple difference in aggregation methods. RCP used four polls in its final average, with starting dates of June 7, June 14, June 17 and June 18, and their final average was Handel+0.2. HPP, however, included no polls AFTER June 7, and their final average was Ossoff+2.4, a difference of 2.6 percentage points in Handel’s favor.

Moreover, Handel’s final polling average was 2.1 percentage points higher in RCP (49.0 vs. 46.9%), while Ossoff’s final polling average was only 0.5 percentage points lower (48.8 vs. 49.3%).
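The arithmetic behind these comparisons is easy to lay out. In the sketch below, a polling error is the actual Ossoff margin minus the predicted Ossoff margin, so a negative number means the average was too favorable to Ossoff; the dictionary keys are just labels.

```python
actual = {"Ossoff": 48.1, "Handel": 51.9}  # result as reported

# Final polling averages from the two aggregators.
aggregators = {
    "RealClearPolitics": {"Ossoff": 48.8, "Handel": 49.0},
    "HuffPostPollster": {"Ossoff": 49.3, "Handel": 46.9},
}

def ossoff_margin(shares):
    return shares["Ossoff"] - shares["Handel"]

# Predicted Ossoff margins, including the +1.4 average from this post.
predicted = {name: ossoff_margin(shares) for name, shares in aggregators.items()}
predicted["This post"] = 1.4

errors = {
    name: round(ossoff_margin(actual) - margin, 1)
    for name, margin in predicted.items()
}

for name, error in errors.items():
    print(f"{name}: predicted Ossoff {predicted[name]:+.1f}, error {error:+.1f}")
```

The gap between the two aggregators (Ossoff +2.4 versus Handel +0.2) is the 2.6-point difference discussed above.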

In other words, over the last week or so of the race, Handel was clearly gaining ground, while Ossoff was fading slightly.

What could have caused this shift?

On the morning of June 14, 2017, a man named James T. Hodgkinson opened fire on a group of Republican members of Congress, members of the Capitol Police and others on an Alexandria, Virginia baseball diamond. Mr. Hodgkinson, who claimed to have volunteered on Senator Bernie Sanders’ 2016 presidential campaign, appeared to be singling out Republicans for attack; he had posted violent anti-Trump and anti-Republican screeds on his Facebook page.

When this ad, brazenly (and absurdly) tying Ossoff to the left-wing rage and violence deemed responsible for the Alexandria shooting, started playing in Georgia’s 6th Congressional District, I thought it was a despicable and desperate attempt to save Handel from a certain loss.

But the overarching message of “blame the left” appears to have resonated with district residents who otherwise may not have voted. The final poll of the campaign found that “…a majority of voters who had yet to cast their ballots said the recent shootings had no effect on their decision. About one-third of election-day voters said the attack would make them ‘more likely’ to cast their ballots, and most of those were Republican.”

It is conceivable that this event changed a narrow Ossoff win into a narrow loss, as disillusioned Republicans decided to cast an election-day ballot for Handel in defense of their party. While Ossoff won the early vote by 5.6 percentage points (and 9,363 votes), he lost the election day vote by a whopping 16.4 percentage points (and 19,073 votes).

Ossoff may well have lost anyway, for other reasons: his non-residence in the district, the difference between Republican opposition to Trump and support for mainstream Republicans, his inexperience as a politician, and the amount of outside money which flowed into the district, making it harder for Ossoff to cast himself as a more centrist, district-friendly Democrat. Indeed, the Democrat in the most expensive U.S. House race in history lost by a larger margin (3.8 vs. 3.2 percentage points) than the Democrat in the barely-noticed South Carolina 5th Congressional District special election held the same day.

But the fact that Handel herself cited the Alexandria shooting in her victory speech (starting at 03:23) speaks loudly about why SHE thinks she won the election.

Until next time…again…

[1] Itself usually calculated as standard deviation divided by the square root of the sample size.

[2] Other recency weighting schemes yielded similar results.