Administrative Data Sources to Understand the Covid 19 Pandemic
by Joshua K. Dubrow, IFiS PAN
This article appeared as:
Dubrow, Joshua K. 2020. “Administrative Data Sources to Understand the Covid 19 Pandemic.” Harmonization: Newsletter on Survey Data Harmonization in the Social Sciences Spring/Summer 6(1): 2 – 10.
The Covid 19 pandemic is a modern-day disaster whose severity the media broadcast, often via tables and graphs that show the growth, stability, and decline in cases and mortality across nations and time. Media broadcasts are generally based on administrative data – cases of infected persons and deaths – that various people and organizations collect, harmonize, and aggregate. Many data providers have made their data available in machine-readable form for public scrutiny. One intention is to guide policy-makers and educate the public.
This note focuses on cross-national administrative Covid 19 data that are a main empirical foundation for the innumerable present and future comparative analyses on the virus’ impact, and for possible harmonization projects. I ask two basic questions: What does the landscape of major distributors of cross-national Covid 19 counts look like? And what are some major sources of error in these data?
The Landscape of Administrative Data on Covid 19 Cases and Mortality
In this information age, if data can be collected, they will be collected. This is true for Covid 19 – there are a great many sources of the numbers of infected persons (Covid 19 cases) and mortality (Covid 19 deaths), or what can be called “counts,” which are widely available for modelers and media. Many organizations, through their websites, distribute count data. These counts are not raw numbers (especially not at the individual or hospital level, for obvious ethical and privacy reasons). Organizations distribute aggregates at various levels of administration that they produce from raw numbers, or from already aggregated information that other actors (people, organizations) gave them. Many, but not all, organizations provide their data on GitHub, a repository for data and computer programs.
To survey this landscape of data sources, I ask two basic questions: First, who are some of the major distributors of cross-national Covid 19 count data? Second, do their data overlap? I define overlap as the situation in which an organization relies, in part or in whole, on another source for the data it distributes. The answers to these questions will tell us a bit about the field of data providers that governments, academics, and others, must choose from.
To identify highly visible organizations that report Covid 19 counts across nations and time, in April 2020, at what arguably was the height of the pandemic, I Googled the term “covid 19 data.” I chose a group of 14 organizations such as one would find referenced in news reports (in alphabetical order): 1point3acres, Bing/covid, European Centre for Disease Prevention and Control (ECDC), Google’s webpage for “Covid 19”, Healthdata.org, HealthMap, Humanitarian Data Exchange (HDX), Johns Hopkins University (JHU) COVID Tracker, Reuters Graphics, The New York Times, Statista, Wikipedia, World Health Organization (WHO), and Worldometer. Figure 1 displays these organizations and the data sharing relationships between them.
On the 14 data sources’ websites, I searched for information on where they get their Covid 19 count data. A first observation is that, in their textual description, data providers can be vague about where exactly their data come from and, perhaps to assure the reader that the data are thorough, some boast about how many sources they have. The European Union’s ECDC claims that “a team of epidemiologists screens up to 500 relevant sources to collect the latest figures.” Worldometer writes that they “validate the data from an ever-growing list of over 5,000 sources.” The initially popular 1point3acres reports in their FAQ:
“Where does your data come from? We have a mix of scripts, API calls, and user submissions. Our data team then jumps in and fact check, cross referencing difference sources, de-dup and make sure we present most accurate data from credible sources… we use data from COVID Tracking and JHU (Rest of the World).”
Figure 1. Graph of 14 “Covid 19” Data Sources of Cases and Mortality
Note: Data hubs are marked in large print and blue circles. Organizations that claim to collect their own data are in bold. Organizations that do not claim to collect their own data, but instead rely on data from others in this network (and also outside of this network), are in italics. Arrows indicate that organizations share Covid 19 data. The tip of the arrow points to where, in this group, an organization takes the data from (e.g. healthdata.org takes data from WHO and JHU; arrows point to WHO, but WHO does not publicly claim to use data from anyone in this network).
Of the 14 data sources, nine claim to collect their own data (Figure 1, names in bold), whereas five do not make such statements (names in italics). Among the organizations that report data collection, two – WHO and JHU Covid Tracker (Figure 1, circle in blue) – constitute what I call main data hubs. Covid 19 counts that WHO collects are used by six other organizations, and counts by JHU Covid Tracker are used by eight other organizations (including Bing and Google).
Most organizations (11 out of 14) explicitly indicate on their websites that the Covid 19 count data they distribute come, at least in part, from other organizations. Most of these (9 of the 11) rely on at least two other sources (in Figure 1, see nodes with at least two outgoing arrows). Within this network, JHU Covid Tracker is the distributor with the most heterogeneous source base; it relies on four other organizations: WHO, ECDC, 1point3acres, and Worldometer. Bing/Covid depends on three sources: WHO, ECDC, and Wikipedia. Because Wikipedia in turn draws on JHU Covid Tracker, Bing/Covid also depends, without explicit mention, on JHU Covid Tracker.
This map is useful to understand the not-explicitly-stated basis of Covid 19 count data. Consider the following data chain. For cross-national counts, Google/Covid 19 depends solely on Wikipedia. Wikipedia depends solely on JHU Covid Tracker. Google and Wikipedia thus depend on the same source – JHU.
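The data chain described above can be traced mechanically. The following is a minimal sketch in Python, assuming only the source relationships that are stated in the text of this note (Figure 1 contains more arrows than are listed here). The dictionary `TAKES_FROM` and the function names are hypothetical and serve only as illustration.

```python
from collections import deque

# Each key takes data from the organizations in its list.
# Assumption: only relationships stated in the text are included;
# this is not a complete transcription of Figure 1.
TAKES_FROM = {
    "Google": ["Wikipedia"],
    "Wikipedia": ["JHU"],
    "Bing": ["WHO", "ECDC", "Wikipedia"],
    "JHU": ["WHO", "ECDC", "1point3acres", "Worldometer"],
    "1point3acres": ["JHU"],
    "Healthdata.org": ["WHO", "JHU"],
}


def upstream_sources(org, graph=TAKES_FROM):
    """Return every provider that org depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(org, []))
    while queue:
        provider = queue.popleft()
        # Skip providers already visited, and skip loops back to org itself
        # (JHU and 1point3acres cite each other).
        if provider in seen or provider == org:
            continue
        seen.add(provider)
        queue.extend(graph.get(provider, []))
    return seen


def unstated_sources(org, graph=TAKES_FROM):
    """Return providers that org depends on only through intermediaries."""
    return upstream_sources(org, graph) - set(graph.get(org, []))


for name in ("Google", "Bing"):
    print(name, "also depends on:", sorted(unstated_sources(name)))
```

Under these assumptions, the sketch shows that Google and Bing both reach JHU through Wikipedia, even though neither lists JHU in the relationships recorded here.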
Of the 14 organizations’ websites I examined, four do not explicitly mention that they take Covid 19 counts from other organizations shown in Figure 1: WHO, Worldometer, HealthMap, and Reuters Graphics. Yet, the absence of such a statement does not necessarily mean that there is no data overlap. For example, while Worldometer does not list any of the organizations in Figure 1 as a data source, they do claim to use 5000+ sources. Since I do not have the resources (e.g. patience) to go through each of them, I listed Worldometer as non-overlap. HealthMap’s textual description of sources is too vague to allow for an assessment. Reuters Graphics lists only local and national health authorities and themselves as sources of their data; to get their data, they write on the website that users must contact Reuters Graphics directly.
Of course, data providers within this group of 14 rely on outside sources. These sources include public authorities (various national health authorities and various subnational health authorities, including their press conferences and social media presence), reports in the mainstream media, social media (Twitter, Facebook, Telegram), specialty media sources such as BNO News and 24/7 Wall St., and what they call “user submissions,” meaning that anybody in the world can contact them to report some information that could, perhaps, be included in their dataset.
Sources of Error
These Data Providers Depend on Upstream Reporting
Data collection of Covid 19 counts is difficult. At root, organizations depend on information provided by various national and subnational data sources that, in turn, received it from hospitals, labs, and other health organizations and medical authorities, which in turn depend on professionals within those organizations to report on Covid 19 cases and mortality. We have limited descriptions of upstream reporting from the USA via the CDC. Descriptions within other nations, in English, that share details of this upstream data collection process are difficult to find.
There are attempts to standardize Covid 19 count data collection in order to compare counts across nations and within nations, among lower administrative units, and over time. For example, WHO provides guidelines for case reporting. They do so through the “Revised Case Report Form for Confirmed Novel Coronavirus COVID 19 (report to WHO within 48 hours of case identification).” Standardization requires that the data from a variety of sources and at different levels of aggregation are harmonized ex-post. Two short excerpts from WHO and The New York Times, respectively, illustrate this need well. According to WHO:
“Due to differences in reporting methods, retrospective data consolidation, and reporting delays, the number of new cases may not always reflect the exact difference between yesterday’s and today’s totals.”
The New York Times, describing its effort to track Covid 19, wrote:
“In tracking the cases, the reporting process is labor-intensive but straightforward much of the time. But with dozens of states and hundreds of local health departments using their own reporting methods — and sometimes moving patients from county to county or state to state with no explanation — judgment calls have sometimes been required.”
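The WHO caveat about daily totals can be checked mechanically in any published series. The following is a minimal sketch in Python, assuming a provider publishes both a cumulative total and a "new cases" figure for each day. The function name and the numbers are made up for illustration and are not real counts.

```python
def flag_inconsistent_days(cumulative, reported_new):
    """Return (day, implied_new, reported_new) for each day on which the
    reported new cases differ from the change in the cumulative total."""
    flagged = []
    for day in range(1, len(cumulative)):
        implied_new = cumulative[day] - cumulative[day - 1]
        if implied_new != reported_new[day]:
            flagged.append((day, implied_new, reported_new[day]))
    return flagged


# Made-up series for illustration only (day 0 is the first reporting day).
cumulative = [100, 130, 170, 165, 210]
reported_new = [100, 30, 40, 10, 45]

for day, implied, reported in flag_inconsistent_days(cumulative, reported_new):
    print(f"day {day}: totals imply {implied} new cases, provider reported {reported}")
```

In this made-up series the cumulative total falls on day 3, the kind of pattern that retrospective data consolidation can produce. A check like this only locates the discrepancy; it does not say which of the two figures is closer to the true count.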
Ex-post harmonization of Covid 19 data counts is difficult and data providers (including those who aggregate the data) do not often explicitly state these difficulties. I mention some of the difficulties in this brief note. First, we can imagine the difficulties that the initial reporting agencies – i.e. the tens of thousands of hospitals with unequal economic development (as Covid Tracking Project hints at with their State Data Quality Grade) – are likely to have. Belgium stated well the problem of standardizing information when data sources are so disparate. In reporting the prevalence of Covid 19 in Belgium, “The Health, Food Chain Safety and Environment, a Federal Public Service of Belgium,” notes: “In practice, we collect the data reported to us by: the national reference lab; the hospitals; the residential care centres; the General Practitioners (GPs); and the network of sentinel GPs and hospitals for the monitoring of flu-like syndrome.” They go on to write that: “The various sources do not always report the same type of data by any means, and the manner and frequency of reporting can also vary.”
Second, there will be discrepancies in the quality of data reporting. Sometimes, political reasons lead to inaccuracy on Covid 19 cases and deaths. For example, in May 2020, The New Yorker reported about Iran:
“Soon, Iran became a global center of the coronavirus, with nearly seventy thousand reported cases and four thousand deaths. But the government maintained tight control over information; according to a leaked official document, the Revolutionary Guard ordered hospitals to hand over death tallies before releasing them to the public.”
Also in May 2020, Russia experienced a huge upswing in cases, but reported low mortality. According to the Bloomberg News article “Experts Question Russian Data on Covid 19 Death Toll”:
“Russian Deputy Prime Minister Tatyana Golikova Tuesday rejected suggestions Russia was understating the death rate. ‘That’s what it is and we never manipulate official data,’ she said.”
Yet, critics of Russia have questioned this general stance:
“Russian authorities detained the leader of an independent doctors’ union, an outspoken critic of the Kremlin who has dismissed as ‘lies’ the country’s low official numbers for coronavirus infections.”
In response to the criticism, Russia in June retrospectively doubled the number of Moscow’s mortalities that they had reported in April.
Recently, as Brazil became one of the world leaders in Covid 19 cases, for a weekend in June the country simply stopped the daily reporting of cases and removed all previous information about Covid-19 tracking.
We find unequal reporting between and within countries. In the US, the Centers for Disease Control and Prevention (CDC) admitted that, in their harmonization and aggregation of data, they combined serology tests for antibodies with diagnostic tests of active viral infection, a data situation that may have led to a slight over-count of the number of Americans tested for Covid 19. In a New York Times article, blame for this mishap was attributed to too much pressure and too much work in too short a span of time, an understandable situation that unfortunately led to poor decision making:
“Epidemiologists, state health officials and a spokeswoman for the C.D.C. said there was no ill intent; they attributed the flawed reporting system to confusion and fatigue in overworked state and local health departments that typically track infections — not tests — during outbreaks. The C.D.C. relies on states to report their data.”
The CDC, and many state health officials, acknowledged the error and vowed to separate these counts in their future reporting, The New York Times reported.
There have also been recent complaints that states have found errors in their Covid 19 counts. An article by NBC News from May 25, 2020, “‘I’m looking for the truth’: States face criticism for COVID 19 data cover-ups,” summarizes some of the headlines since the beginning of May. Georgia apologized for a “processing error” that led to the erroneous presentation that counts were decreasing, rather than increasing. And in Florida:
“… officials last month stopped releasing the list of coronavirus deaths being compiled by the state’s medical examiners, which had at times shown a higher death toll than the total being published by the state. State officials said that list needed to be reviewed as a result of the discrepancy.”
Indeed, widespread is the notion that counts are under-reported. Although the Coronavirus Conspiracists argue that there is a case and mortality over-count, there is no logic or evidence for that argument. In the second half of April, The New York Times reported “63,000 Missing Deaths: Tracking the True Toll of the Coronavirus Outbreak” and The Economist reported that “Official Covid 19 death tolls still under-count the true number of fatalities.” “The data is limited,” writes The New York Times, “and, if anything, excess deaths are underestimated because not all deaths have been reported.” In June, the director of the CDC said, “Our best estimate right now is that for every case that was reported, there actually were 10 other infections.”
A third difficulty for ex-post harmonization is that collecting Covid 19 counts, like any other data collection process, introduces errors related both to representation (e.g. who gets tested) and measurement (e.g. inadequate testing instruments and data processing errors). Regarding the processing errors, one reason for changes in the numbers of Covid 19 cases and mortality is the continual re-definition of “what is a case” and what counts as a fatality due to Covid 19.
Systematic errors can come from humans, from machines, or from some combination of the two, especially since Covid 19 data collection and reporting is not well standardized within or between nations, and in the midst of a pandemic, people and systems are severely stressed. Many of these errors may be extremely difficult to identify, let alone to correct ex-post, which in turn can affect the accuracy of statistics derived from Covid 19 counts.
The COVID Tracking Project for the US reports that the data situation has improved: “Reporting on even basic testing data was very patchy when we first began collecting data in early March, but is mostly now well reported.” Still, they have “State data quality grades,” ranging from A+ (best, such as Iowa) to F (worst, such as Arkansas).
The CDC also seems to make great efforts to standardize reporting. Apparently, at the beginning of the pandemic there was quite a bit of manual entry of forms. To improve the timeliness of reporting, the CDC has worked to improve their electronic case reporting (eCR) system:
“According to the CDC, electronic case reporting (eCR) is defined as the automated generation and sending of EHR case reports to public health officials. eCR allows for automatic, complete, and accurate data to be reported in real-time. In return, it lessens burden for providers by improving the timeliness and accuracy of case reports… In an effort to reduce the healthcare system’s burden of manually completing the COVID 19 reporting forms, the CDC will make these forms available electronically.”
The speed at which the pandemic requires standardization across all parts of a reporting system whose infrastructure cannot handle the load can cause problems in the production of timely and accurate data. The CDC is a case in point. A recent New York Times article reported that the CDC’s data infrastructure contains
“antiquated data systems, many of which rely on information assembled by or shared with local health officials through phone calls, faxes and thousands of spreadsheets attached to emails. The data is not integrated, comprehensive or robust enough, with some exceptions, to depend on in real time.”
To cope, “The agency rushed to hire extra workers to process incoming emails from hospitals,” but the White House had turned toward JHU for timely accounts.
Improvements in collecting and reporting Covid 19 counts likely occur in other countries, too, as organizations at different levels of administration get more experience with the process, and learn from previous mistakes. Yet, it is important to remember that, in the pursuit of data, errors occur. Identifying error sources in publicly released datasets of Covid 19 counts and understanding to what extent errors can be accounted for is an important part of ex-post harmonization decisions that will inform the future use of these data for social scientists.
The importance of Covid 19 administrative data cannot be overstated. Data on Covid 19 cases and deaths have led to a host of institutional decisions that potentially impacted the lives of billions of people. These data are the basis of social distancing policies, economic redistribution, and how to conduct elections, among other things. This brief note explored some of what we know, and some of what we do not know, about the landscape of data sources that provide crucial information for governments, academics, and the public.
The pandemic has given policy-makers and academics plenty of data, and for their projects present and future, they will need to choose which data sources to use. In a limited analysis presented here, we see that, for cross-national Covid 19 counts, the data landscape is dominated by WHO and JHU. There is considerable overlap. Many data sources use others’ sources. I explored 14 different organizations, but I predict that WHO, JHU, and Worldometer will emerge as the core cross-national data sources that academics and governments will use. WHO is the sole official source of cross-national data. Yet, JHU quickly established itself as a premier data provider early in the process and has held on to that status. Worldometer has been well-known to social scientists, and thus may be used because their data can easily be merged with the social, economic, and political data that they already provide.
I do not assess the validity and reliability of these data, but I do mention some of the sources of error that may lead to discrepancies across nations and time. A main source is the frequent redefinitions of cases and mortality, which are due in part to changes in knowledge about Covid 19. As countries update their knowledge, it is not clear whether they will expend the effort to retrospectively change their data to reflect the new knowledge. Other errors occur in any large-scale data collection process, such as processing errors.
A source of error that has yet to gain much attention is the difficulty in harmonizing and aggregating data from multiple local sources that report upstream to national organizations. These discrepancies may be due to unequal infrastructures and the unequal resources of hospitals, labs, and other organizations staffed with people under time and social pressure who, due to systemic problems and simple fatigue, make mistakes that can introduce a series of minor errors in the data that they report upstream. The upstream reporting problem may not matter much, or it may matter a lot. We don’t know. Upstream reporting is a black box that we should open.
Joshua K. Dubrow is co-editor of Harmonization: Newsletter on Survey Data Harmonization in the Social Sciences. This material is based upon work supported by the National Science Foundation under Grant No. (PTE Federal award 1738502) and by the National Science Centre, Poland (2016/23/B/HS6/03916).
 Error can be defined as “a measure of the estimated difference between the observed or calculated value of a quantity and its true value” (Google). Here, I am interested in the difference between observed values of Covid 19 cases and mortality as collected by people and organizations and the true counts across nations and time. Deviations from the true counts are errors. Reasons for the deviations can be called “the sources of error.”
 e.g. HealthMap provides a link to GitHub, but Reuters Graphics does not.
 I focus here on cross-national counts and thus exclude US-only sources, e.g. CDC, The COVID Tracking Project, and Wunderground (an IBM company).
 The list is not exhaustive, but it does cover many of the websites available on the first few pages of Google search results.
 ECDC is “An agency of the European Union” https://web.archive.org/web/*/https://www.ecdc.europa.eu/en/Covid 19-pandemic
 They do mention the other sources, but as topics. Specifically, they write:
“This includes websites of ministries of health (43% of the total number of sources), websites of public health institutes (9%), websites from other national authorities (ministries of social services and welfare, governments, prime minister cabinets, cabinets of ministries, websites on health statistics and official response teams) (6%), WHO websites and WHO situation reports (2%), and official dashboards and interactive maps from national and international institutions (10%). In addition, ECDC screens social media accounts maintained by national authorities, for example Twitter, Facebook, YouTube or Telegram accounts run by ministries of health (28%) and other official sources (e.g. official media outlets) (2%). Several media and social media sources are screened to gather additional information which can be validated with the official sources previously mentioned. Only cases and deaths reported by the national and regional competent authorities from the countries and territories listed are aggregated in our database.” https://www.ecdc.europa.eu/en/Covid 19/data-collection
 https://www.healthmap.org/Covid 19/# reports on data source: “All data used to produce this map are exclusively collected from publicly available sources including government reports and news media.” This is their textual description, contained in a pop-up.
 They are protective of this publicly available resource. At the bottom of this document they wrote:
“This is a draft. The content of this document is not final, and the text may be subject to revisions before publication. The document may not be reviewed, abstracted, quoted, reproduced, transmitted, distributed, translated or adapted, in part or in whole, in any form or by any means without the permission of the World Health Organization.”
 From “Coronavirus disease 2019 (COVID 19) Situation Report – 96.” https://web.archive.org/web/20200503174111/https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200425-sitrep-96-Covid 19.pdf?sfvrsn=a33836bb_4
 “We’re Sharing Coronavirus Case Data for Every U.S. County” by The New York Times March 28, 2020.
 “The COVID 19 figures: collection, verification and publication,” April 14, 2020. https://web.archive.org/web/*/https://www.info-coronavirus.be/en/news/collection-data/
 “The COVID 19 figures: collection, verification and publication,” April 14, 2020. https://web.archive.org/web/*/https://www.info-coronavirus.be/en/news/collection-data/
 “The Twilight of the Iranian Revolution” by Dexter Filkins in The New Yorker, May 25, 2020. https://www.newyorker.com/magazine/2020/05/25/the-twilight-of-the-iranian-revolution
 “Experts Question Russian Data on Covid 19 Death Toll,” by Henry Meyer in Bloomberg News, May 13, 2020. https://web.archive.org/web/*/https://www.bloomberg.com/news/articles/2020-05-13/experts-question-russian-data-on-Covid 19-death-toll
 “Russian Doctor Detained After Challenging Virus Figures” by Andrew Higgins, The New York Times, April 3, 2020 Updated April 10, 2020. https://web.archive.org/web/20200527132124/https://www.nytimes.com/2020/04/03/world/europe/russian-virus-doctor-detained.html?action=click&module=RelatedLinks&pgtype=Article
 “Moscow more than doubles city’s Covid 19 death toll,” BBC News, May 29, 2020. https://web.archive.org/web/20200529002935/https://www.bbc.com/news/world-europe-52843976
 “Brazil stops releasing Covid 19 death toll and wipes data from official site” by Dom Phillips, The Guardian, June 7, 2020. https://web.archive.org/web/20200608015743/https://www.theguardian.com/world/2020/jun/07/brazil-stops-releasing-Covid 19-death-toll-and-wipes-data-from-official-site
 “C.D.C. Test Counting Error Leaves Epidemiologists ‘Really Baffled’” by Sheryl Gay Stolberg, Sheila Kaplan and Sarah Mervosh, The New York Times, May 22, 2020. https://web.archive.org/web/20200526230234/https://www.nytimes.com/2020/05/22/us/politics/coronavirus-tests-cdc.html
 “’I’m looking for the truth’: States face criticism for COVID 19 data cover-ups” by Allan Smith, NBC News, May 25, 2020. https://web.archive.org/web/20200525164215/https://www.nbcnews.com/politics/politics-news/i-m-looking-truth-states-face-criticism-Covid 19-data-n1202086
 https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html. Note that they consider “data” to be singular, not plural.
On death counts, see also an article in The Independent, which writes: “One reason was for the time-lag in cases being reported, and how information about the disease included in coroners’ reports was not always complete. Factors such as a person dying because they were too afraid to go hospital would likely not be included.”
 “As Virus Surges, Younger People Account for ‘Disturbing’ Number of Cases.” By Julie Bosman and Sarah Mervosh, The New York Times, June 26, 2020, Section A, Page 1. https://www.nytimes.com/2020/06/25/us/coronavirus-cases-young-people.html
 As of May 2020.
 Electronic Health Record (EHR)
 “CDC Unveils FHIR-Based COVID 19 EHR Reporting Application” by Christopher Jason April 20, 2020.
This does not describe how individual states conduct upstream reporting. I found one county that described it, in part. According to the official website of Pinal County, Arizona: “How are COVID 19 cases reported? All infectious diseases, including COVID 19 cases, are reported to local Public Health departments through an electronic platform known as MEDSIS (Medical Electronic Disease Surveillance Intelligence System). Our epidemiologist team manually checks each of the data entered to ensure the data is accurate.” https://www.govserv.org/US/Florence/726925130750461/Pinal-County-Board-of-Supervisors
 “Built for This, C.D.C. Shows Flaws in Crisis.” By Eric Lipton, Abby Goodnough, Michael D. Shear, Megan Twohey, Apoorva Mandavilli, Sheri Fink and Mark Walker in The New York Times, June 3, 2020. https://web.archive.org/web/20200607180433/https://www.nytimes.com/2020/06/03/us/cdc-coronavirus.html
“As the number of suspected cases — and deaths — mounted, the C.D.C. struggled to record them accurately. Still, many officials turned to Johns Hopkins University, which became the primary source for up-to-date counts. Even the White House cited its numbers instead of the C.D.C.’s lagging tallies.”