Donald Trump’s disastrous interview with Jonathan Swan centered on COVID testing data, a subject I want to take up here to organize my thoughts.  I am looking for a good overview of COVID case data and its potential biases, and I am a bit flummoxed.

As best as I can tell, there are two main stages where COVID data goes awry:

  • Who Gets Tested
  • How Testing Data Is Aggregated

Who Gets Tested

It goes without saying that people who are infected with SARS-CoV-2 but are never tested will not show up in the data.  Who are these cases?  My initial thought was that they would primarily be asymptomatic cases and people who get sick but recover fairly quickly.  But this does not take into account how testing availability is segregated by race and class.  Science Magazine discusses the research of epidemiologists Wil Lieberman-Cribbin, Stephanie Tuminello, Raja Flores, and Emanuela Taioli showing that in New York City testing rates in March 2020 were highest in White-dominated zip codes (although testing rates were not correlated with zip code SES).  Taioli told Science Magazine that NYC has equalized testing capacity since March, but we do not know what the situation is in other parts of the country (h/t Vox).

How Testing Data Is Aggregated


Even more depressingly, the very simple act of combining testing data to get a picture of what’s going on in the entire United States is fraught.  Ourworldindata describes its data sources for COVID tests, which are the COVID Tracking Project and the CDC (it does not make clear how it combines these two sources of data).  As of this writing, the CDC’s “Testing Data in the U.S.” page (updated 8/3/2020) lists 52.9 million tests done and 5.0 million positive tests, while the CDC COVID Data Tracker (also updated 8/3/2020) lists 61.5 million tests done and 5.5 million positive tests.  And the COVID Tracking Project (updated today, 8/4/2020) lists 52.5 million tests and 4.7 million positive tests.


As best as I can tell, all COVID-testing labs are required to report their testing results to “the appropriate state or local public health department”.  [I am not sure about this, but I believe that the recent kerfuffle over HHS being the sole recipient of hospital COVID data is not related to this basic issue of COVID testing and case counts.  I could be wrong.]  In turn, these state and local public health departments report their data to the CDC (and also make it public, which is how the COVID Tracking Project picks it up).  We also know that the testing aggregation is flawed because some states combine viral tests (which measure whether someone has a current infection) with antibody tests (which measure whether someone was infected in the past).  This muddies the totals, since most international comparisons and time trends focus on viral tests.  According to the CDC Data Tracker, the states still doing this as of now are Delaware, D.C., Maine, Mississippi, Missouri, Oklahoma, Puerto Rico, and Tennessee.  It appears that this aggregation is where the different estimates come from.
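To see why mixing the two kinds of tests matters, here is a minimal sketch in Python.  Every count in it is made up for illustration (none comes from the CDC or the COVID Tracking Project), and the function name is my own.  It only shows that pooling antibody tests with viral tests changes both the test total and the share of tests that are positive.

```python
def positivity(positives, tests):
    """Share of tests that came back positive, as a percentage."""
    return 100 * positives / tests


# Hypothetical counts for one state (illustrative only, not real data).
viral_tests, viral_positives = 1_000, 100        # current-infection tests
antibody_tests, antibody_positives = 500, 10     # past-infection tests

viral_only = positivity(viral_positives, viral_tests)
combined = positivity(
    viral_positives + antibody_positives,
    viral_tests + antibody_tests,
)

print(f"Viral tests only: {viral_tests} tests, {viral_only:.1f}% positive")
print(f"Viral + antibody: {viral_tests + antibody_tests} tests, "
      f"{combined:.1f}% positive")
```

In this made-up case the pooled figure overstates the number of viral tests by half and understates the positive share.  The direction and size of the distortion in a real state would depend on how many antibody tests were run and how many were positive.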

Back to Trump

Trump’s argument that the increase in COVID cases is just an artifact of increased COVID testing is not a totally crazy idea.  As Joel Best points out in Damned Lies and Statistics, when researchers start counting incidents of a new social problem, the counts often show dramatic increases early on simply because surveillance has gotten better, not necessarily because the actual problem got worse.  Best illustrates this with the example of FBI statistics on hate crime incidents.

Sharon Begley at StatNews wrote an article taking on this argument.  Essentially, she treats the percentage of tests that are positive as a more “valid” measure of prevalence.  So in Florida (the state that saw the biggest relative increase in COVID cases from May to July), .002 percent of the population tested positive in mid-May; in mid-July, the corresponding percentage was .059 percent, 25 times the percentage in May.  But Florida’s testing also increased over time, from 479 tests to 65,567 tests.  That 2,400 percent increase in positive cases could just be an artifact of the increased tests.  So Begley looks at the percentage of tests that were positive over the same time frame; in mid-May, it was 3.160 percent; in mid-July, it was 19.254 percent, about 6 times the percentage in mid-May.  So the prevalence of COVID increased even among the tested, which undermines Trump’s argument.
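The arithmetic behind this comparison is easy to check.  The sketch below is my own, not Begley’s code, and the helper names are hypothetical.  It uses only the Florida percentages quoted above.  Because those population percentages are rounded to three decimals, the ratio computed from them will not exactly match the 25-fold figure, which I assume rests on unrounded data.

```python
def fold_change(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


def percent_increase(before, after):
    """Increase from `before` to `after`, as a percentage of `before`."""
    return 100 * (after - before) / before


# Florida figures as quoted in the text (all in percent).
pop_positive_may, pop_positive_july = 0.002, 0.059   # share of population positive
positivity_may, positivity_july = 3.160, 19.254      # share of tests positive

pop_fold = fold_change(pop_positive_may, pop_positive_july)
positivity_fold = fold_change(positivity_may, positivity_july)

print(f"Population share testing positive grew {pop_fold:.1f}-fold (rounded inputs)")
print(f"Test positivity grew {positivity_fold:.1f}-fold")

# A 25-fold rise is the same thing as a 2,400 percent increase.
print(f"25-fold rise = {percent_increase(1, 25):.0f} percent increase")
```

The point of the comparison is that positivity rose several-fold on its own, so the growth in cases cannot be explained by the growth in testing alone.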

According to Begley’s calculations, there are only a handful of states seeing increased cases of COVID but steady (or even declining) percentages among those tested: Colorado, Indiana, Michigan, Missouri, North Carolina, Ohio, and Wisconsin.

I find this exercise a bit unsatisfying though.  Although Begley did not do it, we can do the same calculation for the United States.  In mid-May, .008 percent of the population tested positive; in mid-July it was .019 percent of the population, roughly 2.4 times the share in May.  And guess what?  The percentage of tests that came back positive held pretty steady from mid-May to mid-July, going from 7.4 to 7.6 percent.  So if you really think the percentage of positive cases among the tested is a more valid measure of COVID prevalence, you would have to accept Trump’s argument that, overall, the increased number of COVID cases in the United States is due to greater testing.  I guess the state-level analysis is a bit more valid because it is getting “closer to the ground”, but the state seems like an arbitrary unit of analysis (why is the state level better than, say, the county level?).
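One way to see what the national numbers imply is to split the change in cases into a testing part and a positivity part.  Positives per capita equal tests per capita times the share of tests that are positive, so the fold change in cases is the product of the fold changes in those two pieces.  The sketch below is my own back-of-the-envelope calculation using only the rounded figures quoted above; the variable names are hypothetical, and I am assuming the population was effectively constant over the two months.

```python
# National figures as quoted in the text (all in percent).
pop_positive_may, pop_positive_july = 0.008, 0.019   # share of population positive
positivity_may, positivity_july = 7.4, 7.6           # share of tests positive

# positives / population = (tests / population) * (positives / tests),
# so: case fold change = testing fold change * positivity fold change.
case_fold = pop_positive_july / pop_positive_may
positivity_fold = positivity_july / positivity_may
implied_testing_fold = case_fold / positivity_fold

print(f"Cases per capita:        {case_fold:.2f}x")
print(f"Test positivity:         {positivity_fold:.2f}x")
print(f"Implied tests per capita: {implied_testing_fold:.2f}x")
```

With these rounded inputs, almost all of the national rise in cases is matched by a rise in testing, which is exactly why the positivity measure, taken at face value, seems to support Trump’s reading.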

[To be clear, I am skeptical of the argument that trends in the percentage of positive cases among the tested are a more valid estimate of trends in the population overall; it seems to me that the number of tests is responding to more and more people getting the disease.]
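Here is a toy model of what I mean, with entirely hypothetical parameters.  Suppose a fixed share of infections gets detected, and each detected case prompts a fixed number of additional tests on people who turn out to be negative (worried contacts, coworkers, and so on).  Under those assumptions, which are mine and not drawn from any data, the number of tests rises in lockstep with infections and the positivity rate never moves.

```python
def simulate(infections, detect_share=0.5, extra_tests_per_case=12):
    """Toy model in which testing demand is driven by detected cases.

    All parameters are hypothetical and chosen only for illustration.
    Returns (tests, positives, positivity as a fraction).
    """
    positives = infections * detect_share
    # Each positive accounts for its own test plus the negative tests it prompts.
    tests = positives * (1 + extra_tests_per_case)
    return tests, positives, positives / tests


for infections in (10_000, 20_000, 40_000):
    tests, positives, positivity = simulate(infections)
    print(f"infections={infections:>6}  tests={tests:>9.0f}  "
          f"positives={positives:>7.0f}  positivity={positivity:.1%}")
```

In this setup infections quadruple while positivity stays flat, so a steady positivity rate is compatible with a real and growing outbreak.  Real testing behavior is surely messier than this, but the sketch shows why flat positivity by itself does not settle the question.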

I think it would be more compelling to look at COVID hospitalizations and deaths.  Both have increased since mid-June, but the levels are still considerably lower than what the United States was experiencing in April (although hospitalization data has been disrupted by the switchover to HHS reporting, so the recent hospitalization counts are probably biased downward).


To be honest, I am a bit pessimistic that data could really put Trump’s argument to rest, because our data is so poor (which I would guess is the legacy of our federal system combined with willful negligence in the executive branch).  Vox had an interesting argument about the need for “surveillance testing”, which, instead of aggregating all tests ever done, would involve outpatient clinics recording data on all patients showing symptoms.  This would supposedly yield trends that are more comparable over time.