As an epidemiologist (OK, as “a person with a doctorate in epidemiology,” which is totally the same thing), I expect that I will write future blog posts relating to epidemiologic research. In those posts, I will refer readers unfamiliar with epidemiologic concepts and methods to my “epidemiology primer” post.
This is that post.
Epidemiology can be described two ways:
- The study of the distribution and determinants of health-related states in specified populations.
- The process for estimating the most valid and precise measure of the association between an exposure and an outcome in a specified population.
These definitions are basically the same:
Epidemiology helps to answer the question: At a pre-determined point in the future, how much more/less likely is it that a person with characteristic X will have outcome A than a person with characteristic Y?
To answer this deceptively simple question, members of a selected population (none of whom has A) are divided into two groups—those with X and those with Y. Members of these groups are followed for a certain period of time, after which we compare the percentage of people with X who got A to the percentage of people with Y who got A. If a higher percentage of people with X got A, then X is a risk factor for A. If a lower percentage with X got A, then X is protective against A.
Makes sense, right?
Here is an example (demonstrating why I prefer definition #2). In 2016, the 30 major league baseball (MLB) teams played a total of 2,424 games as the “home” team and 2,427 games as the “road” team. One may ask: Is a road team more likely to lose than a home team? In this case, the exposure would be “road” vs. “home,” and the outcome would be “lose MLB game.”
Wait, you say. Isn’t epidemiology about who gets what disease, or tracking outbreaks of infections, or preventing epidemics, or something?
Yes, epidemiology typically measures occurrence and etiology (“causes”) of diseases (or “health-related states”). My doctoral research explored, in part, the association between an exposure (neighborhood walkability) and two health outcomes, diabetes and depressive symptoms.
But…one can apply epidemiologic methods to any scenario which compares the likelihood of some future event or state across different exposure categories.
An epidemiologist would begin to answer the question posed above with the 2×2 table:
| |Exposed|Unexposed|
|---|---|---|
|Outcome|a|b|
|No Outcome|c|d|
|Total|a+c|b+d|
|% with Outcome|a/(a+c)|b/(b+d)|
The 2×2 table for our question shows that road teams lost 53.1% of their games in 2016, while home teams lost 47.0% of their games.
There are two ways to compare these percentages, or risks: arithmetic difference and ratio. Collectively, these comparisons are called measures of effect.
The risk difference (RD) is the exposed percentage with the outcome minus the unexposed percentage with the outcome. Here, the RD would be 53.1 – 47.0 = 6.1, meaning that a road team’s risk of losing a game was 6.1 percentage points higher than a home team’s risk in 2016. Difference measures are useful for determining how many people would be impacted by a change in exposure status, say through a public health intervention. When risks are equal across exposure categories, the risk difference is 0, implying no association between the exposure and the outcome (the null).
The risk ratio (RR), meanwhile, is the exposed percentage with the outcome divided by the unexposed percentage with the outcome. RR are most commonly reported in peer-reviewed journal articles (and then by the media). Dividing 53.1 by 47.0 yields 1.13, meaning that a road team had a 13% higher risk of losing a game than a home team in 2016. For ratio measures, the null is 1.00, the ratio of two equal values.
A common variant of the risk ratio is the odds ratio, the ratio of the odds an outcome will occur across exposure categories. It can be calculated as (a*d)/(b*c). For our baseball data, the calculation is (1287*1287)/(1137*1140)=1.28. The odds of losing a game for the road team were 28% higher than the odds for the home team in 2016.
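The three measures above can be reproduced in a few lines of Python. This is a minimal sketch rather than the calculation actually used for this post: the function name `measures_of_effect` is hypothetical, and the four cell counts are an assumption inferred from the odds ratio arithmetic and the quoted percentages (road teams: 1,287 losses and 1,137 wins; home teams: 1,140 losses and 1,287 wins).

```python
def measures_of_effect(a, b, c, d):
    """Return (risk_exposed, risk_unexposed, RD, RR, OR) for a 2x2 table.

    a = exposed with outcome,    b = unexposed with outcome
    c = exposed without outcome, d = unexposed without outcome
    """
    risk_exposed = a / (a + c)
    risk_unexposed = b / (b + d)
    risk_difference = risk_exposed - risk_unexposed
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    return risk_exposed, risk_unexposed, risk_difference, risk_ratio, odds_ratio


# Assumed cell counts (exposed = road team, outcome = loss); see the note above.
risk_road, risk_home, rd, rr, odds_ratio = measures_of_effect(
    a=1287, b=1140, c=1137, d=1287
)

print(f"Risk of losing: road {risk_road:.1%}, home {risk_home:.1%}")
print(f"RD = {rd * 100:.1f} percentage points")
print(f"RR = {rr:.2f}")
print(f"OR = {odds_ratio:.2f}")
```

Under these assumed counts, the rounded results agree with the 6.1, 1.13 and 1.28 reported in the text.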
Of course, it is not really this simple, leading the chair of my government doctoral committee to say, in the middle of an American politics seminar, “You know, if you want to do something REALLY hard, get a doctorate in epidemiology.”
[I am not sure what it says about me that I completed the “hard” doctorate (epidemiology) but not the “less hard” doctorate (political science).]
The factors that make epidemiologists want to tear out their hair can be divided into four broad categories:
- Counting and timing of events
- Study design
- Number of exposed cases, AKA random error, or “precision”
- Difficulties ascertaining measures of effect, AKA systematic error, or “validity”
Just bear with me while I review each category.
Counting and timing of events. Epidemiology depends upon accurate counting and measurement. Examples include the number of people in Malawi who currently have diabetes, the number of people averaging more than 30 minutes of exercise a day in Saskatchewan in June 2010, or the number of people who will be diagnosed with ovarian cancer in Massachusetts in the next 15 years.
The first two are examples of prevalence: the number of people with a specific condition (or falling into a specific category) at a specific time and place. The third is an example of incidence: the number of people who currently do not have an outcome (not yet a case) who will develop it by a specified future date.
Time is an underappreciated aspect of epidemiology. For one thing, the exposure must come BEFORE the outcome to have any sort of causal association (pollen I will ingest in 2024 cannot be the cause of my current nasal allergy misery). For another, epidemiologic methods depend upon following people over time (follow-up) to see who becomes a case, though follow-up may already have occurred when our intrepid researcher enters the picture.
Some measures of effect include the time from the start of follow-up to case ascertainment. Consider the incidence rate (IR), the number of incident cases in an exposure category divided by the total person-time (PT) in that category. Person-time is how long each person is followed in a study until a) (s)he becomes a case, b) (s)he is lost from the study or c) the study period ends. The associated measures of effect are the incidence rate difference (IRD) and the incidence rate ratio (IRR).
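To make person-time concrete, here is a minimal Python sketch using the ten-person worked example from the person-time footnote at the end of this post. The function name `incidence_rate` and the list layout are hypothetical. The footnote gives only the totals for each exposure group (2 cases over 14 person-years, 1 case over 27 person-years), so those totals are used as given.

```python
def incidence_rate(cases, person_years, per=100):
    """Incident cases divided by person-time, expressed per `per` person-years."""
    return cases / person_years * per


# Follow-up time (in years) for each of the 10 people in the footnote example.
followed_full_study = [5, 5, 5, 5, 5]   # never became cases
became_cases = [2, 4, 4]                # follow-up stops at case ascertainment
dropped_out = [3, 3]                    # follow-up stops when lost from the study
total_person_years = sum(followed_full_study + became_cases + dropped_out)

# Group totals as stated in the footnote.
ir_exposed = incidence_rate(cases=2, person_years=14)
ir_unexposed = incidence_rate(cases=1, person_years=27)

ird = ir_exposed - ir_unexposed
irr = ir_exposed / ir_unexposed

print(f"Total person-years: {total_person_years}")
print(f"IR exposed = {ir_exposed:.1f}, IR unexposed = {ir_unexposed:.1f} per 100 PY")
print(f"IRD = {ird:.1f} per 100 PY, IRR = {irr:.2f}")
```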
This is where things begin to get tricky…because it is not always clear exactly when a person becomes a case. Memories of diagnosis timing may be murky. Research subjects may only be queried every year or two about becoming a case, with only “since the last questionnaire” for timing.
And even this is only when a diagnosis occurred, which is NOT the same as when a person actually became a case.
So here are two more concepts: induction and latency. Induction is the period between when an exposure occurs and a person gets an outcome (if [s]he does). Latency is the period between when the person gets an outcome and when (s)he learns that fact. For an outcome like the common cold, both induction and latency periods are quite short. But some outcomes have long latency periods; a person may have diabetes for as many as 10 years without being aware of it. And for as many as 10 years that person will “truthfully” say/record “I do not have diabetes” whenever asked.
This is our first example of misclassification; more on this later.
Study design. Underlying all epidemiologic studies is the conception that people in different exposure categories are compared on outcome occurrence over a period of time.
I cannot emphasize this fundamental structure (epistemology, even) often or loudly enough.
However, study design differs by when and how data are collected and by when our intrepid researcher enters the picture.
Two study designs (sorry, ecological, cross-sectional and prevalence studies; randomized controlled trials; and natural experiments) most interest us here: cohort studies and case-control studies.
A cohort study (e.g., the Black Women’s Health Study, a key data source for my doctoral research) is one where investigators enroll a large “cohort” of people (often tens of thousands) at a single point in time (baseline), give them a comprehensive questionnaire, then follow them for as long as possible. Its key strength is that you can collect a lot of information on a lot of people for a long period of time, allowing you to study any exposure or outcome queried in the questionnaires/interviews. Cohort studies include follow-up time and case ascertainment, so each measure of effect discussed above can be estimated from them. Its weaknesses primarily involve cost and time; following tens of thousands of people for decades costs money and requires patience to wait for enough incident cases for adequate study precision. They are also less useful for rare outcomes; if an outcome only occurs to one in one thousand people, even a cohort of 100,000 persons would only contain 100 or so cases, lowering study precision.
And that leads us to case-control studies, which begin by identifying and enrolling cases. What better way to study the etiology of a rare disease (say, molar pregnancy, present in about 1 in 1,500 women with early pregnancy symptoms) than by gathering as many prevalent cases of that disease as possible?
That is the relatively easy part. The harder part is selecting “controls.” Controls are more than just “non-cases,” as they provide “the distribution of exposure in the source population that gave rise to the cases.”
Just bear with me…
It is easiest to think of all case-control studies as being embedded within a larger cohort study. The population of the cohort study would be the “source population,” and so, just as every case came from the cohort study, then so must every control. We can use the “would criterion” here. Would a control have been selected as a case in the study had that person gotten the outcome? If not, then that person cannot be a control.
Every case and control is quizzed about their history to determine exposure status and other relevant study information.
Besides being ideal for studying rare disease etiology, case-control studies are more efficient and cost-effective than cohort studies. At the same time, however, case-control study populations number in the hundreds (not tens of thousands), reducing precision. Controls are only a sample of all possible study participants in the source population, meaning only an OR can be calculated from a case-control study. Finally, nearly all exposure and other relevant data are based on participant memory, and thus may not be as accurate as you would prefer (likely resulting in misclassification).
Random error (Precision). Every epidemiologic study draws a sample from a population of interest (e.g., black women), leaving open the possibility of an estimated measure of effect differing from the “true” measure of effect purely by chance. Arithmetic differences between estimated and “true” measures of effect are called error; differences arising by chance are random error, while systematic differences are called bias.
And that leads us to the confidence interval (CI), a range of values around your measure of effect which you are highly confident includes the “true” measure of effect. At a fixed confidence level (typically 90% or 95%), the larger the sample size and (more specifically) the more exposed cases a study has, the narrower the confidence interval.
In the first study of my doctoral thesis (neighborhood walkability and diabetes), a woman was exposed if she lived in a “non-most-walkable” neighborhood, and she was unexposed if she lived in a “most walkable” neighborhood. The estimated IRR was 1.06, with a 95% CI of 0.97-1.18. In other words, while my best estimate was that women living in less walkable neighborhoods had a 6% higher incidence of diabetes over 16 years than women living in the most walkable neighborhoods, I was 95% confident that the truth is somewhere between a 3% decrease and an 18% increase in diabetes incidence over 16 years. That is a very narrow range of possibilities, meaning I had a fairly precise estimate.
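To see how sample size drives the width of a confidence interval, here is a minimal Python sketch applied to the baseball risk ratio. It rests on several assumptions. It uses a Wald-type interval on the log scale, one common large-sample method, which is not necessarily the method behind the intervals reported in my thesis. The function name `risk_ratio_ci` is hypothetical. The baseball counts (1,287 losses in 2,424 road games; 1,140 losses in 2,427 home games) are inferred from the percentages quoted earlier.

```python
import math


def risk_ratio_ci(a, n_exposed, b, n_unexposed, z=1.96):
    """Risk ratio with a Wald-type confidence interval on the log scale.

    a, b = exposed / unexposed with the outcome
    n_exposed, n_unexposed = total exposed / unexposed
    z = 1.96 corresponds to a 95% confidence level
    """
    rr = (a / n_exposed) / (b / n_unexposed)
    se_log_rr = math.sqrt(1 / a - 1 / n_exposed + 1 / b - 1 / n_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Assumed 2016 counts: road teams are "exposed", losing is the outcome.
rr, lower, upper = risk_ratio_ci(1287, 2424, 1140, 2427)

# Same risks, four times as many games: the point estimate is unchanged,
# but the interval narrows.
rr4, lower4, upper4 = risk_ratio_ci(4 * 1287, 4 * 2424, 4 * 1140, 4 * 2427)

print(f"Original sample:  RR = {rr:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
print(f"Fourfold sample:  RR = {rr4:.2f}, 95% CI {lower4:.2f}-{upper4:.2f}")
```

On the log scale, quadrupling every count halves the standard error, which is why the second interval is tighter around the same estimate.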
Systematic error (Validity). And here is where things REALLY get interesting.
There are errors in estimated measures of effect that CANNOT be fixed by increasing sample size: misclassification and confounding.
Misclassification is when an exposed person is recorded as unexposed (or vice versa) and/or a person with the outcome is recorded as not having the outcome (or vice versa). All epidemiologic studies have some degree of misclassification.
This can happen for many reasons, including a) misrecording or misremembering relevant information; b) being in the latency period for an outcome; c) inadequate measures of exposure/outcome that miscategorize people; d) adequate measures of exposure/outcome divided into etiologically-incorrect categories; and so forth.
Misclassification is measured in two ways: sensitivity and specificity. Sensitivity is the percentage of study participants who truly are exposed/have the outcome that are categorized as exposed/having the outcome. Specificity is the percentage of study participants who truly are unexposed/do not have the outcome that are categorized as unexposed/not having the outcome.
To show how sensitivity and specificity work in practice, let us return to the baseball data.
What if I (or someone else) made transcription errors in defining road and home status for a given game, so that exposure sensitivity and specificity were each 90%? Then this is what the data would “truly” be:
RR (1.13 to 1.17) and OR (1.28 to 1.36) are now somewhat further from the null. Nearly identical changes in RR and OR would occur if outcome (loss-win) sensitivity and specificity were each 90%.
In general (and there are enough caveats here to fill volumes, or graduate seminars), misclassification tends to “bias toward the null,” meaning that your study estimate will be closer to the null than the true value.
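The shift in RR and OR can be reproduced with a standard back-calculation, sketched below in Python. Several things here are assumptions: the function name `correct_counts`, the observed cell counts (road teams: 1,287 losses and 1,137 wins; home teams: 1,140 losses and 1,287 wins, inferred from figures quoted earlier), and the premise that the misclassification is the same among wins and losses. The idea is that, within each outcome group, the observed exposed count equals sensitivity × (true exposed) + (1 − specificity) × (true unexposed), which can be solved for the true exposed count.

```python
def correct_counts(obs_exposed, obs_unexposed, sensitivity, specificity):
    """Back-calculate 'true' exposed/unexposed counts within one outcome group.

    Solves obs_exposed = Se * T + (1 - Sp) * (N - T) for T (true exposed),
    where N is the group total.
    """
    total = obs_exposed + obs_unexposed
    true_exposed = (obs_exposed - (1 - specificity) * total) / (
        sensitivity + specificity - 1
    )
    return true_exposed, total - true_exposed


SE = SP = 0.90  # exposure (road/home) sensitivity and specificity

# Assumed observed counts; exposed = road team.
a, b = correct_counts(1287, 1140, SE, SP)  # among losses: road, home
c, d = correct_counts(1137, 1287, SE, SP)  # among wins:   road, home

rr_corrected = (a / (a + c)) / (b / (b + d))
or_corrected = (a * d) / (b * c)

print(f"Corrected cells: a={a:.1f}, b={b:.1f}, c={c:.1f}, d={d:.1f}")
print(f"Corrected RR = {rr_corrected:.2f}, corrected OR = {or_corrected:.2f}")
```

Under these assumptions the corrected values round to the 1.17 and 1.36 given above, both further from the null than the observed 1.13 and 1.28.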
A good way to think about confounding is the counterfactual.
Ideally, if one wanted to measure the independent effect of an exposure—the etiologic role played exclusively by that exposure—on the risk of an outcome, you would follow a group of exposed people over time, ascertaining who does/does not get the outcome. You would then reverse time and rerun the entire study again, with the single difference that everyone is now unexposed. In this way, the ONLY explanation for differences in outcome incidence would be exposure status, because everything else was exactly the same.
The counterfactual is a lot like “what if?” history questions (What if Abraham Lincoln had NOT been assassinated in April 1865? What if “Dutch” Schultz HAD knocked off a young special prosecutor named Thomas Dewey in 1934 or 1935?)
Clearly, we are not able to re-run epidemiologic studies (or history), so we need unexposed study participants to represent what would have happened to exposed study participants had they been unexposed.
Confounding thus occurs when a study’s exposed and unexposed participants differ on a risk (or protective) factor for the outcome, biasing estimated measure(s) of effect. A rule of thumb for determining whether a factor is a confounder is whether adjusting for it changes the measure of effect by at least 10%.
Here is an example. One factor assessed in my doctoral research was neighborhood socioeconomic status, or SES. SES combines income and education data to provide a relative sense of how “well-off” a person or place is; it is strongly associated with health outcomes, including diabetes (lower SES=>higher diabetes risk). In my study, more walkable neighborhoods had lower SES, and vice versa. In other words, neighborhood SES appeared to be a strong confounder of the association between neighborhood walkability and diabetes incidence. Indeed, the simple (or crude) association was this: women living in the least walkable neighborhoods had a 21.7% lower incidence of diabetes over 16 years than women living in the most walkable neighborhoods. After adjusting for neighborhood SES, the association was a 1.2% higher incidence, a 29.2% change in the estimated IRR.
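The change-in-estimate arithmetic can be checked with a short Python sketch. The IRR values are derived from the percentages above (21.7% lower incidence gives a crude IRR of 0.783; 1.2% higher gives an adjusted IRR of 1.012). The function name `change_in_estimate` is hypothetical, and dividing by the crude estimate is an assumed convention, chosen because it reproduces the 29.2% figure.

```python
def change_in_estimate(crude, adjusted):
    """Relative change in a ratio measure after adjustment, as a proportion of the crude estimate."""
    return abs(adjusted - crude) / crude


crude_irr = 1 - 0.217      # 21.7% lower incidence
adjusted_irr = 1 + 0.012   # 1.2% higher incidence

change = change_in_estimate(crude_irr, adjusted_irr)
THRESHOLD = 0.10  # the 10% rule of thumb

print(f"Crude IRR = {crude_irr:.3f}, adjusted IRR = {adjusted_irr:.3f}")
print(f"Change in estimate = {change:.1%}")
print("Treat as a confounder" if change >= THRESHOLD else "Below the threshold")
```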
Analytic techniques exist to identify and adjust for confounders. The latter essentially involves stratifying (or “drilling down”) study participants into categories of the suspected confounder (e.g., men and women, age groups, selection criteria—“selection bias” is really “confounding by selection”), calculating measures of effect within each subgroup, then calculating a weighted average of those stratified measures of effect in such a way (I will spare you the gory details) that removes the confounding by that particular factor.
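For readers who do want a taste of the gory details, here is a minimal Python sketch of one widely used weighted average, the Mantel-Haenszel risk ratio. It is offered as an illustration, not as the method used in my own research. The function names are hypothetical, and the two strata are invented numbers, built so that the exposure has no effect within either stratum while the crude comparison suggests a strong one.

```python
def crude_rr(strata):
    """Risk ratio ignoring the stratifying factor (all strata pooled)."""
    a = sum(s[0] for s in strata)
    n1 = sum(s[1] for s in strata)
    b = sum(s[2] for s in strata)
    n0 = sum(s[3] for s in strata)
    return (a / n1) / (b / n0)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio across strata.

    Each stratum is (a, n1, b, n0):
    a = exposed cases, n1 = total exposed, b = unexposed cases, n0 = total unexposed.
    """
    numerator = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata)
    denominator = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata)
    return numerator / denominator


# Hypothetical data: the high-risk stratum holds most of the exposed people.
strata = [
    (40, 100, 8, 20),   # high-risk stratum: risk is 40% in both exposure groups
    (2, 20, 10, 100),   # low-risk stratum:  risk is 10% in both exposure groups
]

for a, n1, b, n0 in strata:
    print(f"Stratum RR = {(a / n1) / (b / n0):.2f}")
print(f"Crude RR = {crude_rr(strata):.2f}")
print(f"Mantel-Haenszel RR = {mantel_haenszel_rr(strata):.2f}")
```

In this invented example the crude RR is well above 1 only because exposed people are concentrated in the high-risk stratum; the stratum-specific and summary RRs are both 1.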
However, when stratified measures of effect differ widely (e.g., American League teams’ RR was 1.17 and National League teams’ RR was 1.09—but “league” cannot be a confounder because it does not differ between road and home teams), this is effect modification (or effect measure modification). It is usual practice to report these differences and not adjust for the effect modifier as a confounder.
Of course, one way to remove confounding before a study begins is to randomize study participants into exposure categories, theoretically balancing all possible confounders. This is what happens in randomized controlled trials, often considered the gold standard of health research. There are two problems for most epidemiologic research, however. Randomization does not guarantee balance, and it creates ethical dilemmas around knowingly exposing a study subject to something suspected to be harmful (or at least to increase outcome risk).
I would say “and that is basically it,” except that I have barely scratched the surface of the surface of epidemiologic thinking and methods. However, I think I have presented enough information to clarify future epidemiology-based posts.
If you want to learn more about epidemiology, I strongly recommend Kenneth Rothman’s Epidemiology: An Introduction and Aschengrau and Seage’s Essentials of Epidemiology in Public Health. For eminently readable historical context, I highly recommend Steven Johnson’s The Ghost Map. Finally, the uber-text for contemporary epidemiology is Rothman, Greenland and Lash’s Modern Epidemiology (3rd edition).
Until next time…
 The odds of an event is the probability the event occurs divided by the probability the event does not occur: p/(1-p) .
 Let’s say you follow 10 people for five years. Five people are followed for five years without becoming a case, which would total 5*5=25 person-years (PY). Three people become cases, one in two years, two in four years; 2+4+4=10 PY. Two people drop out of the study after three years, which would be 2*3=6 PY. Thus, total study PY is 41 person-years. Let’s posit five exposed persons and five unexposed persons. The five exposed persons include two cases over 14 PY, and the five unexposed persons include one case over 27 PY. The IR for the exposed would be 2/14 PY = 14.3 cases per 100 PY, while the IR for the unexposed would be 1/27 PY = 3.7 cases per 100 PY. The IR difference would be 14.3 – 3.7 = 10.6 cases per 100 PY, and the IR ratio would be 14.3/3.7 = 3.86.
 Clearly, there is more to control selection than that, but that is a level of angst I am happy to avoid here.
 I am deliberately avoiding comparing confidence intervals and p-values. I may address these issues in a later post.
 Technically a “hazard ratio,” which estimates an IRR.
 Caveats primarily relate to whether misclassification was “non-differential” or “differential” (did sensitivity and specificity differ across subgroups or not?), and whether it was “non-dependent” or “dependent” (was the source of exposure misclassification independent of or dependent upon the source of outcome misclassification?). Another caveat relates to whether exposure has two categories (binary) or more, and where the cut-points are. A more accurate “rule of thumb” is thus “Non-differential, non-dependent misclassification of a binary exposure (or outcome) tends, on average, to bias estimated measures of effect toward the null.”
 For various reasons, I used a 5% change-in-estimate criterion in my doctoral studies.
 I also adjusted for age and city, yielding the IRR discussed earlier.
 A confounder MUST satisfy three conditions: it must differ by exposure status, it must be a risk/protective factor for the outcome among the unexposed, and it must not be on the causal pathway between exposure and outcome.