Organizations often do great work collecting data, but then share it in ways that are hard to access or understand, or require all users to repeat hours of cleaning to make the data usable. Sometimes a data hero comes along to share their own improved version that is cleaned and easier to access and understand. Here I share links to some of these “most-improved” datasets.
IPUMS.org is the gold standard here if you want microdata (individual-level survey responses) from any of the following:
– American Community Survey (surveys ~3 million Americans per year about demographics, work, income, etc.)
– Current Population Survey (surveys ~100,000 Americans per month about demographics and work, with supplements on additional topics. Some questions asked since 1962)
– Medical Expenditure Panel Survey (surveys ~30,000 Americans per year about health status and health spending)
County Business Patterns Database: The Census Bureau has long collected data about the number of employees and establishments in each industry in each county. But their website makes you download each year separately, and only goes back to 1986. The authors of the County Business Patterns Database provide a harmonized panel in one file that goes from the present all the way to 1975.
Quarterly Census of Employment and Wages: The Bureau of Labor Statistics has collected data on employment, wages, and the number of establishments by state and detailed industry back to 1975. Their page is actually decent; they provide links to each year of data, and they have a good reason for not providing one file with all years: it would be well over 10GB. Still, it could take each user hours to download each year they want, delete extraneous information, and merge years together into a reasonably sized panel. That’s why it’s great that some people who already spent those hours shared their code: here’s R code and Stata code to get exactly what you want (and nothing more) out of the QCEW. The Stata code comes from Gabriel Chodorow-Reich; his page has code for several other datasets too.
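To make the "download each year, delete extraneous information, merge" workflow concrete, here is a minimal Python sketch of the general idea. It is not the R or Stata code linked above. The toy data and the column names (`area_fips`, `industry_code`, `annual_avg_emplvl`, `extra_field`) are assumptions for illustration; check the actual file layout before adapting it.

```python
import csv
import io

# Toy stand-ins for per-year downloads. Column names are assumed for
# illustration and may not match the real files.
SAMPLE_FILES = {
    2019: "area_fips,industry_code,annual_avg_emplvl,extra_field\n"
          "01000,10,1990000,x\n"
          "02000,10,320000,y\n",
    2020: "area_fips,industry_code,annual_avg_emplvl,extra_field\n"
          "01000,10,1900000,x\n"
          "02000,10,300000,y\n",
}

# Only these columns survive into the panel; everything else is dropped.
KEEP = ["area_fips", "industry_code", "annual_avg_emplvl"]


def build_panel(files_by_year, keep):
    """Stack per-year CSV text into one list of rows, keeping selected columns."""
    panel = []
    for year in sorted(files_by_year):
        reader = csv.DictReader(io.StringIO(files_by_year[year]))
        for row in reader:
            slim = {col: row[col] for col in keep}
            slim["year"] = year  # tag each row with its source year
            panel.append(slim)
    return panel


panel = build_panel(SAMPLE_FILES, KEEP)
```

Dropping unneeded columns as each year is read, rather than after stacking, is what keeps the combined panel a manageable size.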
Statistics of US Businesses: The SUSB is compiled by the Census Bureau, and like the QCEW it collects data on employment, payrolls, and the number of establishments by state and detailed industry. They each have slight advantages and disadvantages; the SUSB has firm counts as well as establishment counts, and has more detail at some levels (e.g. 4-digit NAICS codes by establishment size), but it’s only annual (instead of quarterly) and only goes back to 1997. The official SUSB page has the same basic issue as the QCEW page, with the additional problem that they sometimes change their file naming conventions from year to year. But because it’s not quite as big as the QCEW, it’s actually reasonable to merge all years into a single file that retains all variables; doing so comes in at just under 3GB. Here’s the Stata code I used to do so, and here’s a page with the full merged SUSB (1997-2019) and a smaller version with less detail (up to 3-digit NAICS).
Behavioral Risk Factor Surveillance System Survey: The BRFSS has been collected by the Centers for Disease Control since the 1980s. It now surveys 400,000 Americans each year on health-related topics including alcohol and drug use, health status, chronic disease, health care use, height and weight, diet, and exercise, along with demographics and geography. It’s a great survey that is underused because the CDC only offers it in XPT and ASC formats. So I offer the 1987-2023 BRFSS in Stata DTA and Excel CSV formats here.
State Life Expectancy Data 1990-2019: The CDC NCHS collects the underlying mortality data, but only makes state life expectancy easily available back to 2018. IHME extended this back to 1990, but puts it in a complex sheet that never actually gives overall life expectancy by state. I offer a simplified, easy-to-use version of state life expectancy data back to 1990 here.
National Health Expenditure Accounts Historical State Data: The original data from the Centers for Medicare and Medicaid Services on health spending by state and type of provider are actually pretty good as government datasets go: they offer all years (1980-2020) together in a reasonable format (CSV). But it comes in separate files for overall spending, Medicare spending, and Medicaid spending; I merge the variables from all 3 into a single file, transform it from a “wide format” to a “long format” that is easier to analyze in Stata, and in the “enhanced” version I offer inflation-adjusted versions of all spending variables. Excel and Stata versions of these files, together with the code I used to generate them, are here.
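For readers unfamiliar with the wide-to-long reshape and the inflation adjustment described above, here is a minimal Python sketch of both steps on toy data. It is not the code I used to build the files. The `Y2019`-style column names, the spending figures, and the price index values are all made up for illustration, and the merge of the three source files is left out.

```python
# Toy "wide" data: one row per state, one column per year (values invented).
WIDE_ROWS = [
    {"state": "Alabama", "Y2019": "100.0", "Y2020": "110.0"},
    {"state": "Alaska", "Y2019": "50.0", "Y2020": "55.0"},
]

# Made-up price index with 2020 = 100, for illustration only.
PRICE_INDEX = {2019: 80.0, 2020: 100.0}


def wide_to_long(rows, id_col, value_name, year_prefix="Y"):
    """Turn one row per state into one row per state-year."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col or not col.startswith(year_prefix):
                continue  # skip the identifier and any non-year columns
            long_rows.append({
                id_col: row[id_col],
                "year": int(col[len(year_prefix):]),
                value_name: float(val),
            })
    return long_rows


def add_real_values(long_rows, value_name, price_index, base_year):
    """Add an inflation-adjusted column expressed in base-year dollars."""
    base = price_index[base_year]
    for r in long_rows:
        r["real_" + value_name] = r[value_name] * base / price_index[r["year"]]
    return long_rows


long_rows = add_real_values(
    wide_to_long(WIDE_ROWS, "state", "spending"), "spending", PRICE_INDEX, 2020
)
```

In long format each state-year is its own row, which is the shape panel commands in Stata and most regression tools expect.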
National Survey on Drug Use and Health State-Level Data: The NSDUH is mostly quite good as government datasets go: they share individual-level data in many formats and with the option to get most years together in a single file. But due to privacy concerns, the individual-level data doesn’t tell you what state people live in, which means it can’t be used to study things like state policy. SAMHSA does offer a state-level version of their data, but it is messy and only available in a SAS format. I offer a cleaned version available in Stata .dta and Excel .xlsx formats here.
Real Time Crime Index: Crimes in the US are reported separately by each state and local police department. Eventually the FBI comes along to combine most of this into a single dataset on crime in the US. But if you want crime data from the past year, the Real Time Crime Index is the way to go. They also offer a nice spreadsheet of city-level crime data going back to 2018.
National Welfare Data- Poverty, Population, Gross State Product: The University of Kentucky Center for Poverty Research offers this great panel of state-level data from 1980 to 2021 that brings together data from many government sources. As you’d expect from the name it has lots of poverty- and welfare-related variables, but is also just the simplest place to get a long state-level panel of more general variables like population, employment, and personal income. I particularly want to highlight one variable they offer that is not just difficult to get from the official government sources, but seems to currently be impossible: Gross State Product (aka State GDP) from prior to 2017.
Andrew Forrester maintains a similar page to this one here that provides cleaned versions of data on economic freedom, DOL PERM (permanent labor certification), the Home Mortgage Disclosure Act (HMDA), and the Community Reinvestment Act (CRA).
If you’re looking for data from one of my papers, see the links on my Research page or my Open Science Framework profile. Not everything is posted publicly yet, but if you ask about something I’ll move it up the priority list.
Coming soon: AHRF files in non-terrible formats
Think another dataset belongs on this page? Want me to update one of the datasets I manage to include a new release? Let me know.