Have you heard about the Oregon Health Study yet? Believe me, you will. The Oregon Health Study is that rare creature in American policy discourse: a statistically complex argument that boils down to an easily understandable conclusion. Such unicorns are worth looking at because they hold some insight into some of the limitations our public discourse faces when it tries to agree on facts.
The last one of these studies to capture our attention was probably a paper by Carmen Reinhart and Kenneth Rogoff, of Harvard University, purporting to show a strong demarcation between countries with debts amounting to less than 90% of GDP and countries that exceeded that threshold. Their study showed that when a country’s debt exceeds 90% of GDP, median economic growth drops by about one percentage point, and average growth falls far more.
Republicans had a field day, and understandably so. The Reinhart-Rogoff study vindicated the central premise of their 2012 campaign: the crippling effect of debt on the US economy.
So imagine the brouhaha when a handful of researchers from UMass-Amherst published an article showing that the Reinhart-Rogoff conclusions were based on a mathematical error. When the error was corrected, the 90% threshold disappeared. There were still some interesting findings in there, but hardly the revelation it first appeared to be.
The important lesson is this: quantitative research in the social sciences is an art, and it’s hard. If you try to over-interpret the data, you’re going to get burned. Let’s keep that in mind as we take a look at the Oregon study.
So, you’re probably wondering by now, what is this Oregon study anyway? I’m glad you asked. In 2008, Oregon decided to expand its Medicaid rolls. But because it had limited funds, it held a lottery to decide who would get the opportunity to apply for the program. This was a bit of a sad state of affairs, but it was also a huge gift to social science.
To understand why, you need to first understand the magic that is the randomized controlled trial, the gold standard for clinical studies. The basic challenge in trying to tease out the effect of any intervention or treatment – whether it’s aspirin or access to healthcare – is that you can never be sure whether the effects you’re seeing are the product of the treatment or instead caused by some unaccounted-for variable. Randomization allows you to control for this because, in theory, if people are randomly assigned to group A or group B, then those groups should be essentially the same as each other. With two initially similar groups, you can be confident that any difference you later observe between them is due to the treatment that you introduced.
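The balancing act that randomization performs can be sketched in a few lines of code. This toy simulation uses made-up numbers, not the Oregon data: it randomly splits a hypothetical population in two and checks that the groups start out nearly identical.

```python
import random
import statistics

# Illustrative simulation only -- the population and "health score"
# are invented, not drawn from the Oregon study.
random.seed(0)

# A hypothetical population whose health scores vary widely.
population = [random.gauss(50, 15) for _ in range(10_000)]

# Random assignment: shuffle, then split down the middle.
random.shuffle(population)
group_a = population[:5_000]   # e.g. offered Medicaid
group_b = population[5_000:]   # e.g. not offered

# Despite the wide individual variation, the group averages land
# within a fraction of a point of each other, so any difference we
# observe later can be credited to the treatment we introduce.
print(round(statistics.mean(group_a), 2))
print(round(statistics.mean(group_b), 2))
```

With two groups this similar at the start, the lottery did for the Oregon researchers what a trial designer would otherwise have to engineer deliberately.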
In the social sciences, we rarely get to do this because it’s usually either unethical or absurdly expensive. So the Oregon Medicaid expansion presented a rare opportunity: the lottery effectively randomized the applicants into two comparable groups and then introduced the treatment (access to Medicaid) into only one of those groups. So, what did it show?
The short version is this: over the two-year window analyzed, access to Medicaid had little to no effect on blood pressure, blood sugar, or weight. Diabetes diagnoses increased, but blood-sugar tests showed no improvement. Healthcare spending in the group spiked, and the increase can’t be explained by pent-up demand. Depression improved. Not surprisingly, people with access to Medicaid were less likely to face financial difficulty than people who were eligible for Medicaid but didn’t get it because they weren’t picked in the lottery.
Two major narratives have emerged from the discussion around the study.
People who generally oppose expanding Medicaid have argued that the study highlights the limited value of the program. The central premise of their argument is compelling: the study shows little measurable return on a huge public expenditure. What’s the point of spending all this money if it isn’t going to improve people’s health?
The counter-narrative, from people who support Medicaid, is that the study is flawed in ways that invalidate some of the most scathing claims against the program. How reasonable, they argue, is it to imagine that people’s health will show improvement on these metrics in a two-year window? Many of the most important effects of healthcare, they argue, are long-term. Routine medical care might not do much for diabetes in two years, but it might mean the difference between dying from and surviving breast cancer. Such events are relatively rare, but hugely significant. A two-year study mostly misses them.
Another concern is the generalizability of the study. It’s discussed as a study about the effects of Medicaid, but in non-trivial ways it’s really a study about the effects of Medicaid in Oregon during that two-year window. Maybe Oregon is a good proxy for Kentucky or for wherever you live, but maybe it isn’t. This study simply can’t answer that question.
Lastly, Medicaid supporters point out that, in many ways, the point of health insurance (as opposed to healthcare, per se) is precisely to insulate people from the financial exposure of negative health events. On that measure, the Oregon study seems to indicate Medicaid was largely a success.
There are also more technical arguments, some of which seem to support the common-sense reading of the study and some of which seem to undermine it. I want to focus on two that I think are important for how we should think about social science research generally.
The first of these has to do with what’s called ‘statistical power’. Simply put, statistical power refers to the ability of a research protocol and corresponding analysis to detect an effect if it is there. Think about it this way: if you look at the night sky and you don’t see the moon, you should probably conclude that the moon isn’t there. But if you don’t see Jupiter’s moons, you should probably conclude nothing because your eyes couldn’t see them even if they were there. Statistical power captures the same idea: with a limited sample size and a given set of variables under study, an effect has to be of a certain size to be detected at all. If the experiment is ‘underpowered’, then the absence of evidence isn’t evidence of absence.
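The moon-versus-Jupiter’s-moons analogy can be made concrete with a simulation. The sketch below (all numbers invented for illustration) estimates power directly: it runs many simulated experiments with a known true effect and counts how often a standard two-sided test actually detects it.

```python
import math
import random

def power_by_simulation(effect, n_per_group, sd, trials=2000, alpha=0.05):
    """Fraction of simulated experiments that detect a true effect of
    the given size -- i.e., the statistical power of the design."""
    random.seed(1)
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0, sd) for _ in range(n_per_group)]
        treated = [random.gauss(effect, sd) for _ in range(n_per_group)]
        diff = (sum(treated) - sum(control)) / n_per_group
        se = sd * math.sqrt(2 / n_per_group)       # std. error of the difference
        z = diff / se
        p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value
        if p < alpha:
            hits += 1
    return hits / trials

# The same sample size is well-powered for a big effect but badly
# underpowered for a subtle one.
print(power_by_simulation(effect=1.0, n_per_group=50, sd=2.0))  # detected most of the time
print(power_by_simulation(effect=0.2, n_per_group=50, sd=2.0))  # almost never detected
```

The second experiment is the Jupiter’s-moons case: the effect is real in every simulated trial, yet the test almost never sees it, so a null result there tells you nothing.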
That’s precisely the claim Austin Frakt makes about the Oregon Health Study. A lot of people agree with him and a lot of people disagree with him. I don’t have the time or expertise to tell you who’s right.
But at one level, the claim that the Oregon study is underpowered must be a little grating to the researchers because they didn’t control the research conditions. Usually, you calculate the power of the study as you’re drafting your research protocol, and then you amend that protocol as necessary to make sure you have the necessary statistical power. But the researchers in Oregon were presented with an established scenario and couldn’t really make alterations.
In addition, whether or not the experiment is underpowered depends on what you think is a reasonable expectation for an effect size. Frakt maintains that the Oregon study is powered for absurdly large effect sizes. The authors, not surprisingly, seem to disagree.
But the dispiriting thing is this: this is a really good study. How good? Well, we don’t really have anything that’s nearly as good. The closest thing is a RAND Corporation study from almost thirty years ago! And yet, here we are, having fights about whether or not the study is underpowered. And it isn’t just that. If you read through the reports, you’ll notice table after table reporting statistically insignificant results.
‘Statistical significance’, for the interested reader, is measured against something called a ‘p-value’ and refers to the probability that a result as extreme as the one observed would occur purely by chance, given that the true effect size is zero. The gold standard is a p-value of 0.05, which means that if there were no real effect, a result at least that extreme would turn up by chance no more than 5% of the time.
In this sense, ‘significant’ doesn’t mean the same thing as ‘important’ – rather, it’s a measure of your confidence in your estimate of the effect size. A lot of research, in the interest of completeness, presents both significant and insignificant findings. Findings are also oftentimes reported with error bars – that is, the range of effect sizes consistent with the data at the chosen confidence level. The broader the range, the less reliable the point estimate.
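To see how an error bar can swallow an estimate, consider a small sketch with purely illustrative numbers (these are not figures from the Oregon study): a two-point difference in some outcome rate, measured on a few hundred people per group.

```python
import math

def diff_with_ci(p1, n1, p2, n2, z=1.96):
    """Point estimate and 95% confidence interval for a
    difference in proportions between two groups."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical numbers: 12% vs 10% of 400 people per group.
diff, (lo, hi) = diff_with_ci(0.12, 400, 0.10, 400)
print(f"estimate: {diff:+.3f}, 95% CI: ({lo:+.3f}, {hi:+.3f})")

# The interval straddles zero, so the two-point "effect" is not
# statistically significant: the error bars are wider than the
# estimate itself.
```

A reader who quotes only the point estimate walks away with a conclusion the data simply don’t support.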
Casual readers tend to focus on the point estimates and ignore the error bars or whether results are statistically significant. This is similar to people paying attention to who has a one- or two-point lead in a poll, while ignoring the fact that the poll has a four- or five-point margin of error.
The Oregon study presents a few pitfalls in this regard because the error bars are huge and so many of the estimates in the findings are insignificant. For most readers, including the politicians and pundits that will try to make noise with these numbers, this is worse than uninformative; it can be downright misleading.
The second technical point has to do with what researchers refer to as the ‘intent-to-treat’ principle. It basically amounts to the insight that even if you randomize people into two similar groups, if not everyone within your treatment group receives treatment, you lose some of your ability to control for the effects of unmeasured variables.
Again, we see a similar effect in political polling. Polling firms will attempt to contact a representative sample of voters and ask them about their preferences. The problem is that sometimes as few as 20% of people contacted actually respond to the survey. Random selection assures us that the people we contact don’t differ systematically from the people we don’t contact, but a limited response rate raises concerns that the subset of people who participate are different from the general population in some unseen way. Much of the art of analyzing polling data comes down to how you try to control for that.
The Oregon study faces a similar challenge. Once people were selected for Medicaid, only about half of them actually enrolled, but a lot of the effects reported in the study compare people who enrolled to people who did not win the lottery. There’s no way to compare them, instead, to people who did not win the lottery but would have enrolled if they had – which would be the true apples-to-apples randomized comparison. The intent-to-treat effect estimates are typically about one-fourth of the effect estimates reported in the main findings of the study – implying that the effects of offering Medicaid are even lower than indicated.
This is because, while offering Medicaid is a relatively straightforward matter of public policy, whether or not people actually take advantage of the program is a different matter. This seems to argue, if you weren’t yet convinced, that health outcomes are about much more than just nominal access to health insurance or healthcare.
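The intent-to-treat arithmetic can be sketched with purely illustrative numbers (none of these figures come from the study): the effect of *offering* a program is roughly the effect of *receiving* it, diluted by the take-up rate.

```python
# Hypothetical numbers for illustration only -- not Oregon study results.
effect_on_enrollees = 0.40   # assumed effect among people who actually enroll
take_up_rate = 0.25          # assumed share of lottery winners who enroll

# The intent-to-treat effect -- what offering the program achieves
# across everyone offered it -- is diluted by the take-up rate.
itt_effect = effect_on_enrollees * take_up_rate
print(itt_effect)   # one-fourth of the per-enrollee effect

# Researchers often run this in reverse (a Wald-style instrumental
# variables estimate): divide the diluted ITT effect by the take-up
# rate to recover the effect on those who enrolled.
recovered = itt_effect / take_up_rate
print(recovered)
```

The policy lever a legislature controls is the offer, not the enrollment, which is why the diluted intent-to-treat number is arguably the one that matters for public decisions.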
After all that, I think it’s important to re-emphasize one of the most important things to keep in mind about the Oregon Health Study: as far as these things go, it’s a really good study.
The researchers have been impressively transparent, which is actually what has facilitated so much of the detailed criticism. They also had the foresight and integrity to specify their analysis beforehand.
The problem is that a lot of people are trying to do too much with what should be a modest data point. Part of it is because the conclusions are so politically expedient for some, just as was the case with the Reinhart-Rogoff study. But part of it is also because we simply don’t have enough quality social science research.
A lot of people have an idea of discovery that basically amounts to a lone genius or a counter-conventional rogue making a breakthrough. One of the least easily understood things about how academic and scientific research progresses is the important role played by community consensus. The media loves to play up the lone genius or silver-bullet approach because it makes good copy, but the truth is that, in science, new and potentially disruptive ideas lurk under every pile of scattered papers.
It takes time and a lot of effort for a consensus to emerge on whether or not those disruptive ideas have value.
It also takes a lot of dead ends and a fair amount of funding. No one study is ever going to settle a question on a large and complex issue like the relative benefits of healthcare spending. What we need is a sustained effort over a long period of time and in a variety of different settings – not a single study augmenting another from thirty years ago.
(You know, the kind of sustained effort that Lamar Smith’s legislation in Congress seems designed to make impossible.)