Twin Designs and Cultural Capital

I am late to this party, but in 2017 sociologists Mads Meier Jæger and Stine Møllegaard published a study using a monozygotic twin design to estimate the effects of cultural capital, a concept in education research capturing familiarity with the dominant culture.  Other sociologists [1,2] have made convincing claims that cultural capital matters for academic success, but I am not so sure about this study.

The study’s design is clever.  The authors used administrative data to get a list of all twin births in Denmark from 1985 to 2000.  In 2013, they surveyed the mothers and asked them, for each of their children (both twins and up to two non-twin siblings), 12 questions about the kind of cultural capital they received when they were 12 years old.  They linked these cultural capital reports to the children’s academic success at the end of their compulsory schooling years (which is grade 9 in Denmark).  They estimate the “within-twin-pair” effects of cultural capital and report some astonishing findings: the standardized effect of cultural capital on an end-of-compulsory-schooling exam is .301, and a one-standard-deviation increase in cultural capital is associated with a 12.5 percentage point higher chance of enrolling in upper-secondary schooling.  They also report some sizable but nonsignificant effects for GPA and Danish exams.
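The logic of the within-twin-pair estimator is that differencing outcomes and exposures within a pair wipes out anything shared by both twins (genes, family background).  Here is a minimal sketch in Python, using simulated data and a made-up effect size of 0.3 (nothing here comes from the authors' data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000  # simulated twin pairs

# Family-level confounders that affect both twins identically...
family = rng.normal(0, 1, n)
cc1 = family + rng.normal(0, 0.2, n)  # each twin's cultural capital
cc2 = family + rng.normal(0, 0.2, n)
beta = 0.3  # a made-up "true" within-pair effect
y1 = beta * cc1 + family + rng.normal(0, 1, n)
y2 = beta * cc2 + family + rng.normal(0, 1, n)

# ...drop out when you difference within pairs, leaving the within-pair effect.
d_cc, d_y = cc1 - cc2, y1 - y2
beta_hat = float(np.sum(d_cc * d_y) / np.sum(d_cc ** 2))  # OLS through the origin
print(round(beta_hat, 2))  # ≈ 0.3
```

The catch, as becomes clear below, is that everything rides on those within-pair differences in cultural capital being real and meaningfully large.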

My suspicion, however, is that the paper says little about the effects of cultural capital on academic success and more about the effects of academic success on mothers’ recall of the cultural capital their twin children received.

To their credit, the authors are upfront about potential issues with their measurement of cultural capital.  They report intra-class correlations (ICCs) for the mothers’ cultural capital reports; for their omnibus cultural capital scale the ICC is .972.  This means that the correlation between twins’ cultural capital reports is .972.  In other words, there are precious few differences between twins’ cultural capital reports.  The large effect sizes the authors see are driven by minute differences between twins.
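To get a feel for what an ICC of .972 implies, here is a small simulation (my own illustration, not the authors' data) of twin pairs whose reports share 97.2% of their variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000

# Shared (between-pair) and unique (within-pair) components chosen so
# that the intraclass correlation is .972, as reported in the paper.
icc = 0.972
shared = rng.normal(0, np.sqrt(icc), n_pairs)
twin1 = shared + rng.normal(0, np.sqrt(1 - icc), n_pairs)
twin2 = shared + rng.normal(0, np.sqrt(1 - icc), n_pairs)

within_sd = np.std(twin1 - twin2)  # spread of within-pair differences
total_sd = np.std(np.concatenate([twin1, twin2]))
print(within_sd / total_sd)  # ≈ 0.24
```

The typical within-pair difference is under a quarter of a standard deviation of the scale itself, which is the thin sliver of variation doing all the causal work in the paper.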

I am just trying to imagine what would produce a situation where a parent, raising identical twins, says that one twin had more cultural capital than the other twin (e.g. they took one twin to a museum more times than the other twin, or one twin had many more books than the other twin, or they talked with one twin about social issues much more often than with the other twin).

I can think of three scenarios.

Scenario A: These mothers are filling out an online survey; we don’t know how many questions are asked in total, but they have to answer at least 12 cultural capital questions for each twin (as well as for up to two non-twin children).  I would guess the survey is kind of boring and asks a lot of mothers: accurately estimating the extent of each of these forms of cultural capital for at least two kids.  Most parents are probably not going to think too hard about each question.  To the extent that they think about distinguishing the cultural capital each kid got, they are probably going to fall back on easily retrievable information, like how the kid is doing presently or how the kid fared in school overall, and that is going to influence their reports of cultural capital.

Now, Jæger and Møllegaard anticipated this objection, and argued that this “recall bias” should make parents much less consistent in reporting cultural capital for differently-aged kids than for equally-aged kids like twins.  Fortunately, they did ask parents about their non-twin children, and they show that moms are roughly as consistent in reporting cultural capital for differently-aged kids as for their twins.  I do not find this compelling; it seems to me they are taking the “recall” in “recall bias” too literally.  The scenario I laid out above should play out much the same whether a mother is reporting on her 25-year-old twins or her 15-year-old twins.

Scenario B: An alternative scenario is that within-twin differences in cultural capital were caused by some kind of health mishap or trauma (e.g. a kid who becomes disabled will not be making many trips to museums; a kid who gets bullied at school may not talk much even with their parents).  In that case, the estimated effects of cultural capital in this study are really the effects of trauma, and the study fails to account for an important confound.

Scenario C: Within-twin differences in cultural capital are random.  This could especially be the case if the survey asked mothers to estimate each child’s cultural capital on different screens as opposed to a grid-like format (Jæger and Møllegaard are not clear on this point).  The authors acknowledge this possibility as well and say “Random measurement error leads to attenuation bias, i.e. downwardly biased estimates of the effect of cultural capital on educational success…we get statistically significant estimates of the effect of individual cultural capital on educational success even in the presence of attenuation bias.”

This falls prey to the “What does not kill my statistical significance makes it stronger” fallacy, coined by statistician Andrew Gelman the same year this study came out.  In statistics, “power” means the ability of a statistical test to detect a real effect.  One thing that weakens power is a small sample size.  Another is random measurement error.  The fallacy is the tempting notion that if you detect an effect using an underpowered design, that effect must be real.  Jæger and Møllegaard are essentially saying that they have an underpowered design but they still found an effect–so it must be REALLY real.

It is true that if you have an underpowered study, you will be less likely to detect an effect that exists in the population.  HOWEVER, if you have an underpowered study, and you still detect an effect, as Gelman shows, the chances your effect is of the wrong sign increase, and if your effect is of the right sign it will inevitably be overestimated.
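Gelman calls these sign and magnitude problems “Type S” and “Type M” errors, and a quick simulation shows both (a generic illustration with made-up numbers, not anything from the study):

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1  # a small but real effect (hypothetical)
se = 0.5           # a large standard error, i.e. a badly underpowered design

# Simulate many replications of the study and keep only the
# estimates that reach statistical significance (|z| > 1.96).
est = rng.normal(true_effect, se, 200_000)
significant = est[np.abs(est / se) > 1.96]

type_m = np.mean(np.abs(significant)) / true_effect  # exaggeration ratio
type_s = np.mean(significant < 0)                    # share with the wrong sign
print(f"significant estimates overstate the effect {type_m:.0f}x")
print(f"{type_s:.0%} of them even have the wrong sign")
```

With these numbers, the estimates that survive the significance filter overstate the true effect by an order of magnitude, and a nontrivial share point the wrong way entirely.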

My guess is that Jæger and Møllegaard’s estimates of the effect of cultural capital result from a mix of Scenarios A and C, especially if mothers were answering questions about each kid on a different screen.  If the questions were presented in a grid, making it easy for mothers to compare their answers across kids, Scenario C seems less plausible to me.  For some reason, Scenario B seems unlikely to me–about as unlikely as cultural capital having the sizable causal effects that Jæger and Møllegaard present.

Posted in Uncategorized | Comments closed

The Quantitative Literacy Gap in Sociology Undergraduate Education

Thomas Linneman wrote an article appearing earlier this year in Teaching Sociology that documents a continual upgrading of the statistical methods used in sociology articles.  He asks the reader to ponder whether sociological statistics courses are preparing undergraduate students to read most published quantitative sociological investigations (they are not), and urges statistics instructors to consider covering interaction terms, nonlinear terms, and logistic regression.  This way, our undergraduate students will at least be able to understand the bulk of published quantitative sociological work.

I am sympathetic to Linneman’s aims here.  In my own statistics course for upperclassmen, I have the students read David Brady et al.’s “Rethinking the Risks of Poverty,” which uses multi-level linear probability models with interaction terms.  I try mightily to get the students to understand interaction terms so they can understand the statistics that Brady and his co-authors present.  Interactions come at the end of the term; in some semesters this is more rushed than it should be, and when that happens I wonder whether my students would have been better off if I hadn’t covered interactions at all.

I am having a hard time envisioning how I could cover the other “advanced” topics that Linneman advocates (nonlinear effects and logistic regression).  The main issue is that I need to review some basic quantitative literacy concepts, which take up a third of the semester, before I can even dive into regression.

The only way I could see this working is if I dispensed with spending class time on basic quantitative literacy (e.g. percentaging tables, comparing quantities, measures of central tendency, levels of measurement).  Instead, I would either (a) offload those topics to readings students would do on their own time, or (b) hope the General Education quantitative literacy course required of undergraduates (who may take it before, during, or after my statistics course) covers those things.  I am not sure either option is that appealing.

I don’t really have a solution here; I think Linneman’s goal could be met more easily if (a) the department required majors to take multiple statistics courses and (b) the instructors of those courses worked closely together to coherently sequence the content covered.




About that youth responses to COVID study…

A couple of weeks ago I wrote skeptically about a CDC study by Mark Czeisler et al. reporting very high rates of mental health issues among young people in 2020 due to COVID.  I wrote:

If it was me doing the study, I would have minimized mention of COVID-19 and tried to mimic questions about substance abuse and suicidal ideation that are fielded in other recent surveys (pre-pandemic) and just do a pre/post comparison. 

I am chagrined to report that they actually did try to do this.  A write-up in the Philadelphia Inquirer alerted me to this, and sure enough, in their conclusion they say this:

Elevated levels of adverse mental health conditions, substance use, and suicidal ideation were reported by adults in the United States in June 2020. The prevalence of symptoms of anxiety disorder was  approximately three times those reported in the second quarter of 2019 (25.5% versus 8.1%), and prevalence of depressive disorder was approximately four times that reported in the second quarter of 2019 (24.3% versus 6.5%) (2). However, given the methodological differences and potential unknown biases in survey designs, this analysis might not be directly comparable with data reported on anxiety and depression disorders in 2019 (2). Approximately one quarter of respondents reported symptoms of a TSRD related to the pandemic, and approximately one in 10 reported that they started or increased substance use because of COVID-19. Suicidal ideation was also elevated; approximately twice as many respondents reported serious consideration of suicide in the previous 30 days than did adults in the United States in 2018, referring to the previous 12 months (10.7% versus 4.3%) (6).

So the numbers regarding anxiety and depressive symptoms come from the National Health Interview Survey (NHIS).  The NHIS’s sample is quite different from that of the Czeisler et al. study.  While Czeisler et al. relied on a Qualtrics panel and an online survey, the NHIS randomly sampled households for its interviews and used computer-assisted personal interviewing (CAPI).

Czeisler et al.’s measures appear comparable to those used by the NHIS, but it is hard to tell.  Here is what Czeisler et al. say about how they measured anxiety and depression:

Symptoms of anxiety disorder and depressive disorder were assessed via the four-item Patient Health Questionnaire (PHQ-4). Those who scored ≥3 out of 6 on the Generalized Anxiety Disorder (GAD-2) and Patient Health Questionnaire (PHQ-2) subscales were considered symptomatic for these respective disorders. This instrument was included in the April, May, and June surveys. [p. 1050]

Here is how NHIS measured anxiety and depressive symptoms:

They are derived from responses to the first two questions of the eight-item Patient Health Questionnaire (PHQ-2) and the seven-item Generalized Anxiety Disorder (GAD-2) scale.  

In the PHQ-2, survey respondents are asked about how often in the last two weeks they have been bothered by 1) having little interest or pleasure in doing things, and 2) feeling down, depressed, or hopeless. In the GAD-2, survey respondents are asked about how often the respondent has been bothered by 1) feeling nervous, anxious, or on edge, and 2) not being able to stop or control worrying. For each scale, the answers are assigned a numerical value: not at all = 0, several days = 1, more than half the days = 2, and nearly every day = 3. The two responses for each scale are added together. The NHIS indicators in the table are the percentages of adults who had reported symptoms of anxiety or depression that resulted in scale scores equal to three or greater. These adults have symptoms that generally occur more than half the days or nearly every day.

I think both studies used two items each to measure anxiety symptoms (the questions about frequency of feeling nervous, anxious, or on edge, and not being able to stop or control worrying) and depressive symptoms (the questions about having little pleasure in doing things and feeling down, depressed or hopeless), all four of which were coded on a 0-3 scale (0=not at all; 3=nearly every day).
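In other words, both surveys appear to use the same two-item screeners with a cutoff of 3.  Here is a sketch of the scoring rule as I read the NHIS description above (my own paraphrase in code, not code from either study):

```python
# Response coding used by both instruments, per the NHIS description:
# not at all = 0, several days = 1, more than half the days = 2, nearly every day = 3
FREQUENCY = {"not at all": 0, "several days": 1,
             "more than half the days": 2, "nearly every day": 3}

def phq2_symptomatic(interest_item: str, depressed_item: str) -> bool:
    """Sum the two PHQ-2 items; a total of 3 or more flags depressive symptoms."""
    return FREQUENCY[interest_item] + FREQUENCY[depressed_item] >= 3

# A respondent bothered "several days" by low interest and "more than
# half the days" by feeling down scores 1 + 2 = 3, so they are flagged.
print(phq2_symptomatic("several days", "more than half the days"))  # True
```

Note how low the bar is: symptoms on several days plus symptoms on more than half the days already clears the cutoff, which matters for the worry about clinical severity I raise below.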

The questions about suicidal ideation come from the 2018 National Survey on Drug Use and Health (NSDUH); the sampling strategy and interviewing method seem very similar to the NHIS’s, although NSDUH uses audio computer-assisted self-interviewing for sensitive questions.  (Fun fact: I think I was interviewed for the 2016 NSDUH; I was definitely part of a very similar study at least.)  It does appear that Czeisler et al. used the same question wording regarding suicidal ideation as NSDUH, with the exception of shortening the time frame (from seriously thinking about committing suicide within the past 12 months to the past 30 days).

Czeisler et al., then, did their due diligence in ensuring comparable measurement across the two studies, so we are left with two possibilities for the different prevalences of mental health issues: the population really did change (probably due to COVID-19), or Czeisler et al.’s Qualtrics panel was biased towards representing people with mental health issues.  The latter is plausible; I have talked before about Stephanie Fryberg’s research on Native American attitudes towards Native American sports mascots, where her Qualtrics panel of Native Americans tended to be more highly educated than the general population of Native Americans.  But while panels can clearly skew on education, a Qualtrics panel for the nation as a whole skewing towards people really struggling with COVID seems a bit of a stretch.

So I have to reluctantly conclude that Czeisler et al. are correct that people are more likely to say they have seriously considered suicide, or experienced anxious or depressive symptoms, now than they did last year or in 2018.  Having said that, the description of the PHQ makes me wonder whether the measure really taps into the debilitating nature of clinical anxiety and depression.



That fat politician study

Today my Twitter feed had people commenting on Pavlo Blavatskyy’s study correlating the body-mass index of post-Soviet politicians with country corruption measures.  This is a fortuitous coincidence, as this is one of the few studies I can use to illustrate the importance of paying attention to the unit of analysis.

So Blavatskyy used machine learning to infer politicians’ BMI from photographs and he shows this is highly correlated with corruption indices like the Transparency International Corruption Perceptions Index (r=-.92), the World Bank Control of Corruption (r=-.91), the Index of Public Integrity (r=-.93), IDEA Absence of Corruption (r=-.76), and the Basel Anti-Money Laundering Index (r=.80).

So let’s set aside two problematic aspects of this study.  First is the utter ridiculousness of it–Blavatskyy cannot even be bothered to come up with some kind of mechanism that would explain this association.  Second are the corruption indices themselves, which I have not looked into, but I am skeptical of projects to quantify such a hazy concept.

The real issue is: what is this study good for?  If you take the corruption indices for granted–which I do not–then why should anyone bother with politicians’ BMI?  The reality is that Blavatskyy has a country-level, cross-sectional design.  It does not speak to whether or not individual politicians are corrupt.  But Blavatskyy disregards this issue and suggests that we can in fact infer an individual post-Soviet politician’s corruption from his physical appearance:

Does our median estimated ministers’ body-mass index capture meaningful changes in grand political corruption? The Armenian velvet revolution (which is also known as #RejectSerzh movement) that occurred in spring 2018 offers a convenient natural experiment. The third president of Armenia, Serzh Sargsyan, at the end of his second (and last) term in office initiated a constitutional reform transforming the country from a semi-presidential to parliamentary republic. Mass protests erupted when Serzh Sargsyan was elected as the new prime minister in spring 2018 setting him as a de facto head of government for the third term. The protests resulted in a minority coalition government formed on 12 May 2018 and headed by Nikol Pashinyan. Our median estimated body-mass index of ministers in the Pashinyan government (cf. online Appendix Q) is 31.2, which is lower than Armenia’s 2017 value of 32.1 (cf. Table 1). Thus, according to our measure, the Armenian velvet revolution lowered grand political corruption. Yet, the Transparency International Corruption Perceptions Index 2018 for Armenia is the same as it was in 2017, indicating no change in corruption before and after the Armenian velvet revolution. This is perhaps not surprising as the Transparency International Corruption Perceptions Index is based on subjective perceptions. Individual perceptions are known to be sticky and change relatively slowly over time. In contrast, our proposed measure of grand political corruption changes every time a median cabinet minister (on the body-mass index scale) is changed.

So essentially, Armenia had a change of government headed by thinner politicians.  Blavatskyy gives no evidence that this new government is less corrupt than its predecessor.  For some reason, he thinks one of the corruption indices should have registered a drop; when it doesn’t, he takes this as evidence for using a dynamic, BMI-based measure to gauge the corruption of individual politicians and/or regimes.  All from a cross-sectional, country-level design.  Talk about tautological–and all based on an n of 1.

So essentially, for the purposes of a quantitative literacy class, we can say Blavatskyy fell afoul of confusing units of analysis (the ecological fallacy).
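The ecological fallacy is easy to demonstrate with a simulation (entirely made-up numbers): construct countries where average ministerial BMI tracks country-level corruption by design, while within each country a minister’s own BMI says nothing about his own corruption.  The country-level correlation comes out huge anyway:

```python
import numpy as np

rng = np.random.default_rng(2)
n_countries, n_ministers = 15, 20

# Country level: corruption and average BMI move together *by construction*.
corruption = rng.normal(0, 1, n_countries)
country_mean_bmi = 27 + 2 * corruption

# Individual level: a minister's own corruption is unrelated to his own BMI.
bmi = country_mean_bmi[:, None] + rng.normal(0, 1.5, (n_countries, n_ministers))
minister_corruption = corruption[:, None] + rng.normal(0, 1, (n_countries, n_ministers))

country_r = np.corrcoef(corruption, bmi.mean(axis=1))[0, 1]
within_r = np.corrcoef(
    (bmi - bmi.mean(axis=1, keepdims=True)).ravel(),
    (minister_corruption - minister_corruption.mean(axis=1, keepdims=True)).ravel(),
)[0, 1]
print(f"country-level r = {country_r:.2f}, within-country r = {within_r:.2f}")
```

A country-level r above .9 alongside a within-country r near zero: exactly the pattern where inferring anything about individual politicians from the aggregate correlation is fallacious.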


Youth Responses to the COVID-19 pandemic

I was a bit skeptical of this statistic when I saw it: a quarter of young people having such strong, adverse reactions to the pandemic seemed a bit incredible to me.  I looked up the study, and sure enough Wellmon is portraying the findings accurately, although I am still a bit skeptical.


First, this was a Qualtrics survey, so recruitment was presumably done among Qualtrics panelists.  My suspicion is that online panelists are a bit unrepresentative of the general population; one way to counteract this is to use weights, which this study did (although it appears the weights were based on only a few core demographic variables–gender, age, and race).  The write-up in MMWR says that respondents were informed of the study’s purposes beforehand, so one wonders if that drew in people who were more likely to see themselves as victimized by the pandemic.

Second, this appears to be a relatively lengthy survey, with 86 questions gauging people’s adverse emotional reactions to the pandemic.  Doing an online survey with so many questions tapping into relatively similar things can be pretty mind-numbing, and lends itself to careless responses.  This does not automatically translate into upwardly-biased estimates, but it is a reservation I have.

Third, the question on substance use explicitly asks people to attribute their behaviors to the pandemic, and I am not a fan of asking people for the conscious motivations behind their behavior.  The question on suicide does not do this, but it is still fielded in a survey that told participants upfront that it was looking at the effects of the pandemic on people’s lives.  Again, this does not mean the percentage is biased upward, but it is another reservation I have.  If it was me doing the study, I would have minimized mention of COVID-19 and tried to mimic questions about substance abuse and suicidal ideation that are fielded in other recent surveys (pre-pandemic) and just do a pre/post comparison.

And of course, “starting or increasing substance use” is pretty vague, and one can imagine a lot of situations where someone would say yes while still not suffering from a substance use disorder–this does not just capture people becoming alcoholics or misusing opioids.  For the record, the study defined substance use as use of “alcohol, legal or illegal drugs, or prescription drugs that are taken in a way not recommended by your doctor” (p. 1051).

I note that the percentages of young people classified as suffering from anxiety disorder, depressive disorder, or trauma- or stressor-related disorder (TSRD) are also quite high, and I would guess we have a similar problem of quite broad conceptual definitions.


COVID testing data addendum

In my last post, I expressed some dissatisfaction with attempted take-downs of Donald Trump’s assertion that increased COVID cases are just an artifact of testing.  While looking over the COVID Tracking Project I found this nice visualization:

I think this does a better job of putting to rest Trump’s excuse for greater COVID case counts.  For one thing, the number of new cases does not track the number of tests being done: from May through June, cases declined while testing grew.  For a second, the smoothing of the death trend makes the increase starting in July a bit clearer (although the number of new deaths is fortunately still much lower than it was in April).  For a third, the hospitalization data make it very clear that we have seen a true increase in COVID cases that is hard to wish away as a data artifact (I suppose one could argue that COVID cases have been constant and the dip in June was because people were staying away from the hospital, but that stretches credulity).  But the hospitalization trend looks quite different from the trend I showed in my last post, from COVID-NET:

I missed the fine print that those hospitalization figures are really based on 100 counties in 14 states (although I am not sure if it is actually 100 counties in the 10 states in the Emerging Infections Program plus the entirety of four states in the Influenza Hospitalization Surveillance Project), whereas the COVID Tracking Project aggregates reported hospitalizations from all 50 states (although, again, the quality of state reporting on COVID hospitalizations is a bit murky to me).


COVID testing data

Donald Trump’s disastrous interview with Jonathan Swan centered on COVID testing data, which I want to take up here to organize my thoughts on the subject.  I am looking for a good overview of COVID case data and where its potential biases are, and I am a bit flummoxed.

As best as I can tell, there are two main stages where COVID data goes awry:

  • Who Gets Tested
  • How Testing Data Is Aggregated

Who Gets Tested

It goes without saying that those who are infected with SARS-CoV-2 but not tested are not going to show up in the data.  Who are these cases?  My initial thought was that these would primarily be asymptomatic cases and people who get sick but get over the disease fairly quickly.  But this does not take into account how testing availability is segregated by race and class.  Science Magazine discusses the research of epidemiologists Wil Lieberman-Cribbin, Stephanie Tuminello, Raja Flores, and Emanuela Taioli showing that in New York City, testing rates in March 2020 were highest in White-dominated zip codes (although testing was not correlated with zip code SES).  Taioli told Science Magazine that NYC has equalized testing capacity since March, but we do not know what the situation is in other parts of the country (h/t Vox).

How Testing Data is Aggregated


Even more depressingly, even the very simple act of combining testing data to get a picture of what’s going on in the entire United States is fraught.  Our World in Data describes its data sources for COVID tests, which are the COVID Tracking Project and the CDC (they don’t make clear how they combine the two).  As of this writing, the CDC’s “Testing Data in the U.S.” page (updated 8/3/2020) lists 52.9 million tests done and 5.0 million positive tests, while the CDC COVID Data Tracker (also updated 8/3/2020) lists 61.5 million tests done and 5.5 million positive tests.  And the COVID Tracking Project (updated today, 8/4/2020) lists 52.5 million tests and 4.7 million positive tests.


As best as I can tell, all COVID-testing labs are required to report their results to “the appropriate state or local public health department.”  [I am not sure about this, but I believe the recent kerfuffle over HHS being the sole recipient of hospital COVID data is not related to this basic issue of COVID testing and case counts; I could be wrong.]  In turn, these state and local public health departments report their data to the CDC (as well as make it public, which is where the COVID Tracking Project picks it up).  We also know the aggregation is screwed up because some states combine viral tests (which measure whether someone has a current infection) and antibody tests (which measure whether someone was infected in the past); the focus of most international comparisons and time trends has been viral tests.  According to the CDC Data Tracker, the states still doing this as of now are Delaware, D.C., Maine, Mississippi, Missouri, Oklahoma, Puerto Rico, and Tennessee.  It appears this aggregation step is where we are getting these different estimates.

Back to Trump

Trump’s argument that increased COVID cases are just an artifact of increased COVID testing is not a totally crazy idea.  As Joel Best points out in Damned Lies and Statistics, oftentimes when researchers start counting incidents of a new social problem, they inevitably show dramatic increases early on just because surveillance has gotten better, not necessarily because the actual problem got worse.  Best illustrates this with the example of FBI statistics on incidents of hate crimes.

Sharon Begley at StatNews wrote an article taking on this argument.  Essentially, she treats the percentage of tests that come back positive as a more “valid” measure of prevalence.  So in Florida (the state that saw the biggest relative increase in COVID cases from May to July), .002% of the population tested positive in mid-May; in mid-July, the corresponding percentage was .059%, roughly 25 times the percentage in May.  But Florida’s testing also increased over time, from 479 tests to 65,567 tests.  That 2,400 percent increase in positive cases could just be an artifact of the increased testing.  So Begley looks at the percentage of tests that were positive over the same time frame: in mid-May it was 3.160 percent; in mid-July it was 19.254 percent, roughly five times the mid-May figure.  So the prevalence of COVID increased even among the tested, which undermines Trump’s argument.

According to Begley’s calculations, there are only a handful of states seeing increased cases of COVID but steady (or even declining) positive percentages among those tested: Colorado, Indiana, Michigan, Missouri, North Carolina, Ohio, and Wisconsin.

I find this exercise a bit unsatisfying, though.  Although Begley did not do it, we can do the same calculation for the United States as a whole.  In mid-May, .008 percent of the population tested positive; in mid-July it was .019 percent.  And guess what?  The percentage of tests that came back positive held pretty steady from mid-May to mid-July, going from 7.4 to 7.6 percent.  So if you really think the percentage of positive tests is a more valid measure of COVID prevalence, you would have to accept Trump’s argument that, overall, the increased number of COVID cases in the United States is due to greater testing.  I guess the state-level analysis is a bit more valid because it gets “closer to the ground,” but the state seems like an arbitrary unit of analysis (why is the state level better than, say, the county level?).
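The arithmetic behind the comparison is simple enough to put in a few lines (the raw counts here are hypothetical, chosen only to reproduce positivity rates like the national figures above):

```python
def positivity(positives: int, tests: int) -> float:
    """Percent of administered tests that came back positive."""
    return 100 * positives / tests

# Hypothetical counts: tests grow tenfold and positives grow roughly
# tenfold with them, so positivity barely moves even though the raw
# case count has exploded.
may_rate = positivity(370, 5_000)      # 7.4% positive
july_rate = positivity(3_800, 50_000)  # 7.6% positive
print(may_rate, july_rate)
```

This is the whole tension: judged by raw positives, cases exploded; judged by positivity, almost nothing happened. Which denominator you trust determines which story you tell.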

[to be clear, I am skeptical of the argument that trends in the percentage of positive cases among the tested is a more valid estimate of trends in the population overall; it seems to me that the number of tests is responding to more and more people getting the disease]

I think it would be more compelling to look at COVID hospitalizations and deaths.  Both trends have increased since mid-June, but the levels are still much lower than what the United States was experiencing in April (although hospitalization data has been disrupted by the switchover to HHS reporting, so the recent hospitalization counts are probably downwardly biased).


To be honest, I am a bit pessimistic that data could really put Trump’s argument to rest, because our data is so poor (which I would guess is the legacy of our federal system combined with willful negligence in the executive branch).  Vox had an interesting argument about the need for “surveillance testing,” which, instead of aggregating all tests ever done, would involve outpatient clinics recording data on all patients showing symptoms.  This would supposedly yield trends that are more comparable over time.


Gender conservatism among young people

Back in February, New York Times writer Claire Cain Miller wrote a piece about recent research on the gendered division of labor among heterosexual partners.  Part of her article discussed a study by sociologists Brittany Dernberger and Joanna Pepin, who used Monitoring the Future data to track 12th-graders’ attitudes towards division-of-labor arrangements from 1976 to 2014.  These students were asked to imagine being married and having at least one preschool-aged child, and to indicate the desirability of various division-of-labor schemes (e.g. “husband works full time, wife doesn’t work” or “husband works about half time, wife works full time”; the only permutation of working full-time/working half-time/not working not asked about was both husband and wife not working).  Dernberger and Pepin found that while the desirability of the traditional breadwinner-husband/stay-at-home-wife arrangement had declined since the 1970s, it was still the single most desired arrangement in 2014.  In 1976, 44 percent of students thought the traditional arrangement was “desirable” (as opposed to “acceptable” or “not at all acceptable”); by 2014 this was down to 23 percent.  But the share desiring gender parity (saying that both working part-time or both working full-time was “desirable”) had risen only slightly, from 9 percent to 19 percent.


Figure 1 in Dernberger & Pepin (2020), p. 43

For my general education statistics class, an early assignment I have my students do is double-check a media write-up of research: they go from the media piece to the study itself, and then look into the actual data documents (like questionnaires).  In this case, Miller’s write-up checks out.  It was a bit of a struggle to get the MTF questionnaires or codebooks, but ICPSR has them; in the case of 2014, the questions about imagined division-of-labor arrangements are in “Form 2.”


Posted in Uncategorized | Comments closed

Long-lasting symptoms of COVID

Fabio Rojas has argued for most institutions reopening in the face of COVID-19, on the grounds that COVID mortality rates are quite low for the non-elderly.

A number of commenters on his post legitimately raised the issue of people with serious, long-lasting complications from COVID.  It’s not just a matter of a few people dying and the overwhelming majority of people recovering; we are hearing about people who have been suffering for months from debilitating, COVID-related complications.

This raises the question, how common are these so-called COVID-19 “long-haulers”?

The CDC released a study last week tracing the resolution of COVID-19 among people with mild symptoms–that is, people diagnosed in out-patient clinics (rather than diagnosed after being hospitalized for serious COVID-19 symptoms).  Some media accounts [1,2,3] are linking the study to the nightmarish long-hauler phenomenon, but I am skeptical.

The CDC obtained a list of people testing positive between March 31 and June 4 from 14 academic health centers and randomly sampled individuals within each test site.  Subjects were called 2-3 weeks after their test date and interviewed about their symptoms.  The researchers were able to interview around 47% of their sample (274/582).

Among these interviewees, 35 percent said they had not returned to usual health at the time of the interview, with the rate depending on age (at most a third of those under 50 still had symptoms; around half of those 50 or older did).  The most common symptoms failing to resolve were cough and fatigue (43 and 35 percent of people experiencing those respective symptoms on the day of testing reported still having them on the day of the interview); fever and chills resolved for nearly everyone who had them on the day of testing.

[A CNN report on this study misreported the symptom results as “for the people whose symptoms lingered, 43% said they had a cough, 35% said they felt tired…”, which implies that 15% of subjects still had a cough (that is, 43% of the 35% who still felt unwell at the time of the interview).  In reality, 26% of subjects still reported having a cough–that is, 43% of the 166 subjects who reported having a cough at the time of testing.]
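The difference between the two readings is easy to verify with the counts reported in the study (274 interviewees, 166 of whom reported a cough at testing); this quick check is just my own arithmetic, not a calculation from the CDC report:

```python
# Two readings of "43% said they had a cough" from the CDC study's numbers.
interviewees = 274       # people interviewed 2-3 weeks after testing
cough_at_testing = 166   # of them, reported a cough on the day of testing

# CNN's (incorrect) reading: 43% of the 35% who had not returned to usual health
cnn_share = 0.43 * 0.35                        # share of ALL interviewees

# Correct reading: 43% of those with a cough at testing still had one
correct_share = 0.43 * cough_at_testing / interviewees

print(round(cnn_share, 2), round(correct_share, 2))  # 0.15 0.26
```

So the correct figure is 26 percent of all interviewees, not 15.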

It is not clear to me how serious these symptoms are; it looks like many people who had mild cases of COVID-19 were still impaired for 2-3 weeks, but I thought we already knew that having COVID-19 could mean being sick for a couple of weeks.  I wish the CDC had interviewed people a bit further out from the time of a positive test result (as long-haulers say they have been sick for months).  In addition, the CDC is not making the survey instrument available but the measurement of symptoms seems binary and does not capture the debilitating nature of the long-hauler symptoms.

I would also guess that the study’s estimates of the prevalence of long-lasting symptoms are upwardly biased, as people with mild forms of COVID-19 are probably (a) less likely to get tested and (b) less likely to have enduring symptoms.  I suspect, but I don’t really know, that people with long-lasting symptoms are also more likely to agree to an interview request from health researchers (although if they are really sick they may not; the researchers excluded nine subjects because a proxy did the interview and I wonder if that was because they felt too unwell to talk with an interviewer).

Having said that, I do not endorse arguments for re-opening, or even qualified ones calling for re-opening for the non-elderly while keeping the elderly locked down (I am not even sure what that would look like, much less how it would work).  While I suspect long-hauler cases are rare, we just do not know enough about the disease to let it burn through the population (even the non-elderly population).

Posted in Uncategorized | Comments closed


Trying not to go for something obvious here

In Damned Lies and Statistics, Joel Best argues that consumers of statistics need to especially scrutinize international comparisons because there are so many opportunities to mix up apples and oranges (I have discussed this with regard to the conceptual definitions used to quantify police-related deaths in different countries).   One of Best’s examples was international comparisons of test scores; he pointed out that sampling strategies vary across countries and that a country’s performance level could often be chalked up to the broadness of its sampling strategy.  In particular, countries with comprehensive secondary school systems (like the United States, where all, or nearly all, adolescents are exposed to an academic-focused curriculum) would sample from the entire population of schools, while countries with “streaming” systems (like Germany), where some adolescents go to academic high schools while others go to more vocationally-oriented schools, would sample from the academic high schools only.  This would stack the deck against countries like the United States.

Damned Lies and Statistics came out in 2001, and the international testing comparisons Best talked about have been supplanted by the Programme for International Student Assessment (PISA), run by the OECD.  When I teach Best in my statistics class, I show students the general sampling strategy of PISA:

The desired base PISA target population in each country consisted of 15-year-old students attending educational institutions located within the country. This meant that countries were to include (i) 15-year-olds enrolled full-time in educational institutions, (ii) 15-year-olds enrolled in educational institutions who attended on only a part-time basis, (iii) students in vocational training types of programmes, or any other related type of educational programmes…

Sure, there were some problems with China, but what are you going to do?  Surely PISA must be good for comparing democratic countries, right?

Well, no.  A team of UCL educational researchers headed by Jake Anders analyzed the 2015 Canadian sample for PISA, and their analysis raises questions about the quality of comparisons involving Canada, which does very well on PISA in terms of average scores.

Their article is nice for walking the reader through the sampling strategy of PISA countries.

  • First you have to talk about sample exclusions–what part of the population are you trying to generalize to, and what part are you not trying to generalize to?  As shown above, PISA is trying to get at 15-year olds in any kind of educational institution.  In Canada, that covers 96 percent of 15-year olds so you are dropping 4 percent right off the bat there (the Anders article has a really nice table comparing Canada’s figures to other countries including the United States, where 95% of 15-year-olds are in educational institutions).
  • Not mentioned above is that PISA lets countries exclude students from their target population based on special needs (although PISA caps this at 5% of students).  In 2015 Canada breached this cap–Anders et al. note that Canada “has one of the highest rates of student exclusions (7.5%).”  So now Canada’s sample is supposed to cover 88.8 percent [96% × (1 − .075)] of Canadian 15-year-olds.
  • PISA countries are in charge of their own sampling, and I did not realize how much discretion they have.  With nationally representative samples, you really need to do stratified sampling (and probably clustered sampling as well which Anders et al do not get into).  Stratified sampling means countries divide schools into strata based on combinations of variables and sample from each of their strata to ensure a representative sample.  Countries choose their own stratifying variables (!), and in Canada these are “province, language, and school size” which seems fine to me.
  • We have not even talked about school and student non-compliance, and here is where things get really messy.  Canada selected 1008 schools to participate in the 2015 PISA, and 30% refused.  What countries can do is try to recruit “replacement schools” that are similar to refusing schools based on the stratifying variables as well as another set of variables (which Anders et al. refer to as “implicit” stratifying variables).  Canada was able to recruit 23 replacement schools, but it is not clear that the variables Canada used to implicitly stratify schools were that meaningful for test scores–meaning it is possible that the replacement schools are very different from the originally-selected schools in unobserved ways.  It is only 2% of the sample but this problem of using meaningless variables to gauge the representativeness of the sample will be an issue.
  • Anders et al. point out that at 70%, Canada is fourth-worst among OECD countries in terms of the response rate of initially selected schools (beaten out by the Netherlands, New Zealand, and the USA).  In terms of overall response rate (after including the replacement schools), Canada’s 72 percent is the worst (the US, at 83 percent, still looks very bad relative to other OECD countries, most of which are at 95 percent or above).
  • PISA requires countries with initial response rates between 65% and 85% to do a non-response bias analysis (NRBA).  Countries below 65% (like the Netherlands) are supposed to be excluded, but in this case they were not.  PISA does not report the details of these NRBAs, but Anders et al. tracked down Canada’s province-specific NRBAs and found they were pretty superficial, suffering from the problem of using a handful of variables to show that non-responding schools are similar to responding schools (although in the case of Quebec there were significant differences between refusing and complying schools).
  • Now we get into pupil non-response, and again, Canada is among the worst in this regard, with 81% of students in complying schools taking the PISA tests (the other two countries with worse or comparable rates are Austria at 71% and Australia also at 81%).  We know that non-participating students tend to do worse on the tests, and one way to get around this is to weight student participants such that those with characteristics similar to non-participants are weighted more.  But again, we run into the issue that Canada uses weights based on variables that do not really matter for test scores (the stratifying variables plus “urbanisation, source of school funding, and [school level]”).
  • I am not sure how Anders et al. calculate this, but all told, Canada’s sample is really only representative of 53% of Canadian students (although I wonder if they meant to say 53% of Canadian 15-year olds).
  • Anders et al. do some simulations for reading scores, assuming that non-participant students on average would perform worse on the PISA tests than participant students.  If we assume that the non-participating students do moderately worse on the PISA instrument (say, at the 40th or 35th percentile), Canada’s reading scores are still better than average but are not at the “super-star” levels it enjoys with its reported performance.  If we assume the non-participants do substantially worse (say, at the 30th percentile), Canada’s mean PISA reading scores take a serious dive and Canada starts looking more like an average country.
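For what it’s worth, simply multiplying the coverage and response figures above gets close to that 53% figure.  This is just my back-of-envelope guess at the calculation–Anders et al. do not spell it out, and they presumably work with the actual weighted data rather than these rounded rates:

```python
# Rough cumulative coverage of Canada's 2015 PISA sample,
# using the rates reported above (a guess at the calculation, not Anders et al.'s method).
in_education    = 0.96        # 15-year-olds in educational institutions
not_excluded    = 1 - 0.075   # after special-needs exclusions (7.5%)
school_response = 0.72        # overall school response rate (with replacements)
pupil_response  = 0.81        # student participation within complying schools

coverage = in_education * not_excluded * school_response * pupil_response
print(round(coverage, 2))  # 0.52
```

That lands at roughly 52 percent of Canadian 15-year-olds, in the neighborhood of the 53% Anders et al. report.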

The one thing that really sticks out here–and also with Tom Loveless’s discussion of PISA and China–is that PISA’s behavior is not consistent with the dispassionate collection and analysis of data.  They have opened the door to countries (especially wealthy ones) fudging their data and they do not really seem to care.

Posted in Uncategorized | Comments closed