Organizations often do great work collecting data, but then share it in ways that are hard to access or understand, or that require every user to repeat hours of cleaning to make the data usable. Sometimes a data hero comes along to share their own improved version that is cleaned and easier to access and understand. Here I share links to some of these “most-improved” datasets.
IPUMS.org is the gold standard here if you want microdata (individual-level survey responses) from any of the following:
– American Community Survey (surveys ~3 million Americans per year about demographics, work, income, etc.)
– Current Population Survey (surveys ~100,000 Americans per month about demographics and work, with supplements on additional topics. Some questions asked since 1962)
– Medical Expenditure Panel Survey (surveys ~30,000 Americans per year about health status and health spending)
County Business Patterns Database: The Census Bureau has long collected data about the number of employees and establishments in each industry in each county. But their website makes you download each year separately, and only goes back to 1986. The authors of the County Business Patterns Database provide a harmonized panel in one file that goes from the present all the way to 1975.
Quarterly Census of Employment and Wages: The Bureau of Labor Statistics has collected data on employment, wages, and the number of establishments by state and detailed industry back to 1975. Their page is actually decent; they provide links to each year of data, and they have a good reason for not providing one file with all years: it would be well over 10GB. Still, it could take each user hours to download each year they want, delete extraneous information, and merge years together into a reasonably sized panel. That’s why it’s great that some people who already spent those hours shared their code: here’s R code and Stata code to get exactly what you want (and nothing more) out of the QCEW. The Stata code comes from Gabriel Chodorow-Reich; his page has code for several other datasets too.
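For readers who use neither R nor Stata, the trim-then-stack idea can be sketched in Python. This is only an illustration, not the linked code: the function name `build_panel` is made up here, and it assumes each year has already been downloaded as a CSV file with a header row. It streams one row at a time, so the full 10GB+ of raw data never has to sit in memory.

```python
import csv


def build_panel(yearly_files, out_path, keep_columns, keep_row=lambda row: True):
    """Stream yearly CSV files into one panel, keeping only chosen columns and rows.

    yearly_files: paths to the per-year CSV files (assumed to have header rows)
    keep_columns: column names to retain (assumed names; check the file layout)
    keep_row:     function that takes a row dict and returns True to keep the row
    """
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        # extrasaction="ignore" silently drops every column not in keep_columns
        writer = csv.DictWriter(out, fieldnames=keep_columns, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(yearly_files):
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    if keep_row(row):
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

A call such as `build_panel(files, "panel.csv", ["area_fips", "industry_code", "year"], keep_row=lambda r: r["industry_code"] == "10")` would keep three columns and one industry; those column names and the industry code are assumptions to verify against the files you actually download.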
Statistics of US Business: The SUSB is compiled by the Census Bureau, and like the QCEW it collects data on employment, payrolls, and the number of establishments by state and detailed industry. They each have slight advantages and disadvantages; the SUSB has firm counts as well as establishment counts, and has more detail at some levels (e.g. 4-digit NAICS codes by establishment size), but it’s only annual (instead of quarterly) and only goes back to 1997. The official SUSB page has the same basic issue as the QCEW page, with the additional problem that they sometimes change their file naming conventions from year to year. But because it’s not quite as big as the QCEW, it’s actually reasonable to merge all years into a single file that retains all variables; doing so comes in at just under 3GB. Here’s the Stata code I used to do so, and here’s a page with the full merged SUSB (1997-2019) and a smaller version with less detail (up to 3-digit NAICS).
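Shifting file names are the fiddly part of a merge like this. One way to cope, sketched here in Python rather than Stata, is to pull the year out of each file name instead of hard-coding the names. This is not the code linked above; the function `index_files_by_year` and the file names in the usage note are hypothetical.

```python
import re
from pathlib import Path

# A four-digit year starting with 19 or 20, not embedded in a longer run of digits
YEAR_PATTERN = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")


def index_files_by_year(paths):
    """Map each data year to its file, whatever naming convention the file uses."""
    by_year = {}
    for path in paths:
        # Search only the file name, so a year in a directory name is ignored
        match = YEAR_PATTERN.search(Path(path).name)
        if match is None:
            raise ValueError(f"no four-digit year found in {path!r}")
        year = int(match.group(0))
        if year in by_year:
            raise ValueError(f"two files claim year {year}: {by_year[year]!r} and {path!r}")
        by_year[year] = path
    # Return the mapping in chronological order
    return dict(sorted(by_year.items()))
```

Given made-up names like `state_naics_2019.txt` and `2012_us_6digitnaics.csv`, this returns a dictionary keyed by 2012 and 2019 that a loop can then read and stack with a year column. The sketch deliberately raises an error for names with no four-digit year (two-digit years, for example), so those files get handled by hand rather than silently skipped.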
Behavioral Risk Factor Surveillance System Survey: The BRFSS has been collected by the Centers for Disease Control since the 1980s. It now surveys 400,000 Americans each year on health-related topics including alcohol and drug use, health status, chronic disease, health care use, height and weight, diet, and exercise, along with demographics and geography. It’s a great survey that is underused because the CDC only offers it in XPT (SAS transport) and ASC (fixed-width ASCII) formats. So I offer the 1987-2021 BRFSS in Stata DTA and Excel CSV formats here.
State Life Expectancy Data 1990-2019: The CDC NCHS collects the underlying mortality data, but only makes state life expectancy easily available back to 2018. IHME extended this back to 1990, but puts it in a complex sheet that never actually gives overall life expectancy by state. I offer a simplified, easy-to-use version of state life expectancy data back to 1990 here.
Andrew Forrester maintains a page similar to this one here, with cleaned versions of data on economic freedom, DOL PERM (permanent labor certification), the Home Mortgage Disclosure Act (HMDA), and the Community Reinvestment Act (CRA).
Coming soon: AHRF files in non-terrible formats
Think another dataset belongs on this page? Let me know