A central concern in economic policymaking is ensuring that policies benefit all Americans. Too often, policy evaluation is done at the broadest population level, missing important variations within subgroups. For example, free trade may benefit Americans broadly, while still hurting particular individuals or groups. Policymakers are increasingly focused on understanding the impact of policies on marginalized groups, including the disabled, Black Americans, and Americans with only a high school education, to ensure equitable outcomes for all.
However, measuring policy impacts on subgroups is complicated. All economic indicators have some degree of meaningless variation, or “noise” as statisticians say, attached to them. This may be due to changes in the population being surveyed (a survey of a thousand people will have a different thousand people each month) or simply because the economy is big and messy.
For example, in recent months there have been substantial fluctuations in the unemployment rate of Black Americans. In April 2023, the Black American unemployment rate hit an all-time low of 4.7%. However, in the subsequent month, it jumped to 5.6% and then hit 6.0% in June. Perhaps there were sizable and rapid shifts in the labor market that hit Black Americans especially hard, but it’s more plausible that we are just seeing noise in the data collection process.
The unemployment rate data comes from surveys, just like political polls. Even if unemployment is consistently steady, some month-to-month shifts in the measured rate as people move in and out of the survey sample would be expected. This is true for all survey results, but especially for subgroups. Because Black Americans are only about 13 percent of the population, small changes in the survey composition can have significant effects. It’s more likely that April somewhat underestimated Black American unemployment, and June somewhat overestimated it.
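As a rough sketch of why subgroup estimates are noisier, consider the standard error of an estimated proportion, √(p(1−p)/n). The sample sizes below are illustrative assumptions, not the actual survey design (real labor-force surveys use weighting and rotating panels), but the scaling logic carries over:

```python
import math

def se_proportion(p: float, n: int) -> float:
    """Standard error of an estimated proportion p from a simple random sample of size n."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative numbers: a survey of 60,000 people, a 5% unemployment rate,
# and a subgroup that makes up ~13% of the sample.
full_n = 60_000
subgroup_n = int(full_n * 0.13)

se_full = se_proportion(0.05, full_n)
se_subgroup = se_proportion(0.05, subgroup_n)

print(f"SE, full sample: {se_full:.4f}")      # ~0.0009, i.e. about 0.09 percentage points
print(f"SE, subgroup:    {se_subgroup:.4f}")  # roughly 2.8x larger
```

The same measured rate is simply estimated from far fewer people, so a one-month swing that would be alarming for the overall rate is well within ordinary sampling noise for the subgroup.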
Noisy data makes assessing economic changes hard in general, but it is even harder when we try to look at subgroup effects, for several reasons:
Subgroup effects are especially noisy:
If you flip a coin twice, it’s not particularly surprising if it comes up “heads” both times. But if you flip a coin a hundred times and it continually comes up heads, you would be confident that it is not a fair coin. Similarly, we should expect to see more noise in subgroup effects simply because we are looking at fewer cases. This means we will sometimes see patterns in the data that are not really there.
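The coin-flip intuition can be checked with a quick simulation: estimate the same 50/50 “rate” from small and large samples and compare how much the estimates bounce around. The sample and trial counts here are arbitrary illustrations:

```python
import random

random.seed(0)

def observed_share_sd(n_flips: int, n_trials: int = 2_000) -> float:
    """Standard deviation of the observed share of heads across many
    repetitions of n_flips fair coin flips."""
    shares = []
    for _ in range(n_trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        shares.append(heads / n_flips)
    mean = sum(shares) / n_trials
    var = sum((s - mean) ** 2 for s in shares) / n_trials
    return var ** 0.5

print(observed_share_sd(10))    # ~0.16: small samples swing a lot
print(observed_share_sd(1000))  # ~0.016: large samples settle near 50%
```

With ten flips, shares of 70% or 30% heads are routine; with a thousand flips they would be extraordinary. Subgroup estimates live on the small-sample end of this spectrum.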
Subgroup analyses are often underpowered:
On the other hand, because of the noise we just discussed, statistical analyses often fail to find subgroup effects even when they exist. Researchers often look for effects that are “statistically significant” (that is, large enough that it is unlikely that they would be generated by random variation). This can be hard to obtain for small subgroups because statistical power is a function of sample size. For example, to have sufficient statistical power to detect the existence of an interaction between a policy intervention and gender, you need four times the number of subjects as you would to detect an effect across the full population. As a result, many evaluations lack the statistical power necessary to see subgroup effects.
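The “four times the subjects” rule of thumb can be derived with a back-of-envelope variance calculation, under the simplifying assumptions of equal-sized cells and equal outcome variance in every cell:

```python
import math

def se_main_effect(sigma: float, n: int) -> float:
    # n subjects split into treatment/control halves:
    # SE of the difference in means between the two groups.
    return math.sqrt(sigma**2 / (n / 2) + sigma**2 / (n / 2))

def se_interaction(sigma: float, n: int) -> float:
    # n subjects split into 4 cells (treatment x subgroup):
    # SE of the difference-in-differences across the four cells.
    return math.sqrt(4 * sigma**2 / (n / 4))

n, sigma = 1000, 1.0
print(se_main_effect(sigma, n))      # ~0.063
print(se_interaction(sigma, n))      # ~0.126, exactly twice as large
print(se_interaction(sigma, 4 * n))  # back to ~0.063 with 4x the sample
```

Because the interaction estimate has twice the standard error at the same sample size, matching the precision of the main-effect estimate requires quadrupling the sample.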
Subgroup effects are vulnerable to exploitation:
Finally, subgroup analyses are vulnerable to exploitation precisely because they require the researcher to conduct multiple analyses. This means that we should be looking for “p-hacking” — cases where a researcher ran multiple tests until they could find a statistically significant result. If a researcher does not find an effect of a policy on the general population, they may be tempted to run the same analysis for smaller and smaller subgroups until they find an effect.
While it is possible to “pre-register” analyses, ensuring that any subgroup analyses are not the result of ad-hoc data dredging — throwing stuff at the wall until something sticks — most papers and commentaries do not take this step.
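A small simulation illustrates the danger. Here a purely hypothetical policy has zero true effect everywhere, yet a researcher who tests one subgroup after another until something clears p < 0.05 will “find” an effect surprisingly often (all sizes and counts below are illustrative):

```python
import math
import random

random.seed(1)

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def hunt_for_significance(n_subgroups=8, n_per_cell=50, n_sims=2_000):
    """Share of simulations where at least one subgroup test comes out
    'significant' at p < 0.05 even though the true effect is zero everywhere."""
    hits = 0
    for _ in range(n_sims):
        for _ in range(n_subgroups):
            treat = [random.gauss(0, 1) for _ in range(n_per_cell)]
            control = [random.gauss(0, 1) for _ in range(n_per_cell)]
            if two_sample_p(treat, control) < 0.05:
                hits += 1
                break
    return hits / n_sims

print(hunt_for_significance())  # ~0.34, close to 1 - 0.95**8
```

With eight subgroups, a nonexistent policy effect “works” for somebody roughly a third of the time.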
These factors highlight the need to interpret subgroup analyses carefully when making policy decisions. We do not want to make policies based on analyses of subgroup effects that may be statistical noise, nor do we want to ignore subgroup effects because they can be hard to find.
Below, I discuss three recent examples of policy-relevant subgroup analyses that suffer from the above limitations: potential disparities in economic recovery between men and women, possible adverse effects from taxes on particular demographic groups, and whether men and women equally benefit from re-employment programs. For each, I show how subgroup analyses can be challenging to interpret and potentially lead people away from accurate narratives.
There was no great she-cession
During the Covid economic recovery, a frequent concern was the possibility of a “she-cession” in which female employment consistently lagged behind male employment. For example, in September 2022, male employment increased by 451,000, while female employment actually declined by 395,000. It briefly appeared as if men and women were experiencing very different recoveries.
However, this initial observation was misleading. Month-to-month jobs data is inherently noisy, and this is especially true for subgroup effects. In three years of post-pandemic jobs data, it should not be surprising to see isolated months where male employment grew substantially faster than female employment or vice versa.
When you zoom out across several years and look at employment levels instead of month-to-month changes, the picture is clearer: male and female employment recovered at similar rates. As economist Claudia Goldin suggested, “The real story of women during the pandemic is that they remained in the labor force and stayed on their jobs as much as they could.”
Journalists and policymakers are attentive to potential disparities between subgroups for good reason. But this extra attention can highlight apparent disparities that might be fluctuations in the data. When women lagged behind men in a monthly jobs report, it was reported on — but not when they caught up in the subsequent month. It’s critical to look across a broad range of months when reporting on differential trends for subgroups. Otherwise, it is easy to accidentally “p-hack” your way into a phenomenon that doesn’t exist.
Higher taxes do not cause infant mortality for African Americans
A recent paper published in JAMA Network Open examined the effects of state-level tax policies on infant mortality outcomes. Looking at infant mortality across all 50 states for six years (1995, 2002, 2009, 2012, 2014, and 2018), they tested for an association between an additional $1,000 raised in tax revenue (per capita) and infant mortality.
Their primary finding was that states with higher tax rates saw lower infant mortality levels. For every $1,000 in tax revenues raised, infant mortality decreased by 3%.
They also examined the effects of tax revenues on infant mortality rates by race for Hispanic, non-Hispanic Black, and non-Hispanic White populations. They found no impact from an additional $1,000 in tax revenues for Hispanic infants and a 4% increase in infant mortality for African Americans.
This is surprising, as it seems unlikely that higher taxes are protective for the general population but harm Black Americans specifically. The study suggests possible mechanisms that could explain this result — for example, it may be that structural racism leads to Black Americans paying the increased tax rates, without gaining the benefits of more government services.
An alternative explanation is that the apparent relationship is a statistical artifact. If you run tests on the general population and on each subgroup (Hispanics, non-Hispanic Whites, and non-Hispanic Blacks), every additional test increases the probability that one will return an apparently significant — but non-meaningful — result. This is especially true of rare outcome variables such as infant mortality. Since the infant mortality rate is already low, you are making inferences about a fraction of a fraction of the population.
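The arithmetic behind this concern is simple: if each test independently has a 5% false-positive rate, the chance that at least one of several tests fires grows quickly. (Independence is a simplifying assumption; the actual tests in the study are correlated, so treat this as a rough gauge.)

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability that at least one of k independent tests at level alpha
    comes back 'significant' even when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k

print(familywise_error(0.05, 1))  # 0.05
print(familywise_error(0.05, 4))  # ~0.185: four tests, as in the tax study
```

Running a handful of subgroup tests nearly quadruples the odds of at least one spurious “finding” relative to a single pre-specified test.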
As subgroup analyses become more common, we will see more reporting of weird effects that are hard to explain. Researchers should exercise caution in reporting these effects, which can confuse policymakers and the public.
Reemployment programs help men as much as women
Richard Reeves’s 2022 book “Of Boys and Men” presents an overview of how institutions are failing men. One phenomenon he highlights is that many social programs and policies are only effective for women. Reeves claims that “there is a clear, recurring pattern in evaluation studies of policy interventions, with stronger effects for girls and women than for boys and men.” His evidence for this claim includes several policy evaluations that were effective for women, but not men, including Kalamazoo Promise (a program that provides Kalamazoo residents with free college tuition) and Paycheck Plus (a wage subsidy program similar to the Earned Income Tax Credit).
But we should expect that some evaluations would show this phenomenon because of the limitations of subgroup analyses discussed above. Some policies may be ineffective but show effects for women because of statistical noise. Others may be effective for both genders but not show an effect for men for the same reason.
To limit the effects of these potential pitfalls, we need to look at a broad range of studies to ensure that the studies examined are representative of the whole literature. One way of doing that is using a pre-existing database of studies. By looking at the entire set of studies in a particular area, we can be assured that any patterns we see reflect the literature.
For example, we can use the Department of Labor’s “Clearinghouse for Labor Evaluation and Research” (CLEAR) database of reemployment studies as a corpus. CLEAR is useful here because it catalogs many interventions aimed at improving attachment to the labor force — one of the outcomes Reeves is interested in.
CLEAR’s Reemployment topic area lists 49 studies published between 1978 and 2018. I could not find 6 of the studies online (for example, because some were unpublished doctoral dissertations, and others were older typewritten studies that have not been scanned and published online). Of the 43 studies examined, 10 included a table showing differential effects of the intervention by gender, breaking down the treatment effects, if any, over men and women. Unfortunately, most studies did not include a specific test of the interaction between the intervention and gender, so it is unclear if any differential gender effects were statistically significant.
The table below lists each of the 10 studies, briefly describes the intervention and summarizes any differential results by gender.
| Study | Intervention | Differential gender effects |
| --- | --- | --- |
| First impact analysis of the Washington State Self-Employment and Enterprise Development (SEED) demonstration (Benus et al. 1994) | UI claimants participated in “self-employment” activities instead of work search. | No differential gender effects |
| Back to work: Testing reemployment services for displaced workers (Bloom 1990) | UI claimants received enhanced Job Search Assistance services. | Stronger effects for women |
| Pennsylvania Reemployment Bonus Demonstration final report (Corson et al. 1992) | UI claimants were given financial incentives for finding a new job. | Stronger effects for women |
| Evaluation of the Charleston Claimant Placement and Work Test Demonstration (Corson et al. 1985) | UI claimants received enhanced Job Search Assistance services, as well as more stringent work search requirements. | Stronger effects for men |
| Assisting Unemployment Insurance claimants: The long-term impacts of the Job Search Assistance Demonstration (DC) (Decker et al. 2000) | UI claimants received enhanced Job Search Assistance services. | Marginally stronger effects for women |
| Assisting Unemployment Insurance claimants: The long-term impacts of the Job Search Assistance Demonstration (Florida) (Decker et al. 2000) | UI claimants received enhanced Job Search Assistance services. | Marginally stronger effects for women |
| Evaluation of impacts of the Reemployment and Eligibility Assessment (REA) Program: Final report (Klerman et al. 2019) | UI claimants were provided with Reemployment and Eligibility Assessment (REA) services, an in-person meeting with a career counselor. | No differential gender effects |
| Assessment of the impact of WorkSource job search services (Lee et al. 2009) | UI claimants received enhanced Job Search Assistance services. | No differential gender effects |
| The Illinois Unemployment Insurance Incentive Experiments (Spiegelman & Woodbury 1987) | UI claimants were given financial incentives for finding a new job. | No differential gender effects |
| The Washington Reemployment Bonus Experiment: Final report (Spiegelman et al. 1992) | UI claimants were given financial incentives for finding a new job. | Stronger effects for men |
We do see more studies finding stronger effects for women than for men, but not enough to justify Reeves’s claim of a clear, recurring pattern. The pattern is murky at best, and none of the studies examined actually reported a statistically significant difference in the treatment effect across gender.
Reeves cautions against an overreliance on “gender blind programs and services,” but we should also be cautious about overinterpreting subgroup effects (or the lack thereof). Due to the statistical laws at play, we do not have a solid evidence base here. Most studies conducted were not sufficiently powered to detect any differential gender effects consistently.
When examining subgroup effects, it is crucial to consider the context and long-term trends rather than relying solely on isolated data points. Fluctuations observed in month-to-month data or single studies may not reflect meaningful patterns. Moreover, caution must be exercised in interpreting subgroup analyses, as the statistical limitations, including noise and limited statistical power, can lead to misleading or inconclusive results. Policy decisions should center on a comprehensive understanding of the overall impact and evidence from diverse studies.
Sound techniques to reduce misinterpretation of subgroup effects include:
- More data: Researchers, journalists, and policymakers should always be careful about reporting on data fluctuations, but especially careful with subgroup analyses. Any given pull of a noisy data series will likely have interesting subgroup findings, but most of these will be short-term aberrations, not actual trends.
- Pre-registration of analyses: Researchers are increasingly encouraged by academic journals to “pre-register” their plans with a third party before conducting data analysis. This makes it clearer when a given subgroup analysis is the researcher testing a specific hypothesis instead of hunting for significance.
- Statistical corrections for multiple testing – Researchers have developed various statistical tools to hold analyses with multiple trials to a higher standard. For example, Bonferroni corrections and similar statistical procedures can raise the bar for what counts as “statistically significant”.
- More comprehensive evaluations – Finally, if we want to take subgroup analyses seriously, it will be necessary to invest in collecting the data required to do it well. This can mean oversampling subpopulations to ensure we collect sufficient data for analysis. However, this isn’t always possible. For example, in an experimental program evaluation, it may not be possible to ensure that small demographic groups are oversampled.
In these cases, we would have to run substantially larger experiments (at a considerably higher cost) or accept that most evaluations cannot tell us anything useful about subgroup effects.
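As one concrete illustration of the Bonferroni idea in the list above: with m tests, each individual test is held to the stricter threshold α/m. The p-values below are made up for illustration:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return which p-values remain significant after a Bonferroni correction:
    each of the m tests is held to the stricter threshold alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Hypothetical p-values from four subgroup tests
p_values = [0.030, 0.200, 0.011, 0.600]
print(bonferroni_significant(p_values))
# [False, False, True, False]: only p=0.011 clears 0.05 / 4 = 0.0125
```

Note that a result like p = 0.030, which would pass an uncorrected 0.05 cutoff, no longer counts once the number of tests is taken into account.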
By promoting rigorous research practices, policymakers can make informed choices that aim to benefit all Americans. While we shouldn’t dismiss subgroup effects, it is essential to approach them cautiously and ensure that policies are based on reliable and representative evidence.
Four of the ten studies reported that the intervention had similar effects across both genders analyzed (“Program impacts do not substantially vary across subgroups”; “No difference in impact in pooled effects”; “For the major pair-wise comparisons of males and females…there is no statistically significant difference”; “There is no difference”).
Two evaluations (both from the “Assisting Unemployment Insurance claimants” study, which CLEAR separates into two evaluations to differentiate between a study that was done in Washington, DC and a second study done in Florida) had marginally stronger effects for women. In both studies, there were 3 Job Search Assistance (JSA) programs being tested, for a total of 6 separate tests. None of the JSA results had significant effects for the male subgroups. In DC, one of the programs was statistically significant for women (p<0.10, suggesting there was less than a one-in-ten probability of observing the difference due to random chance). In Florida, two of the programs were statistically significant (again, at the p<0.10 level). While this is suggestive that there might be a general effect where JSAs are more effective for women, we should be cautious not to overinterpret these results. Since we are conducting 12 tests (3 interventions x 2 genders x 2 locations), we would expect to see 1.2 results with statistical significance at the p<0.10 level. That we see three is not particularly improbable – these are fairly weak effects.
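The back-of-envelope expectation above can be checked with a binomial calculation, under the simplifying assumption that the 12 tests are independent (they are not exactly, so treat this as a rough gauge):

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k 'hits'
    out of n independent tests that each fire with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 12 tests at the p < 0.10 level, with every null hypothesis true:
print(12 * 0.10)                   # expected count of spurious hits: 1.2
print(prob_at_least(3, 12, 0.10))  # ~0.11: three hits is not that unlikely
```

An event that happens by pure chance about one time in nine is thin evidence for a real differential effect.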
Two studies had stronger effects for men than for women. The Washington Reemployment Bonus Experiment reported that “male impact was double female impact” (though note that the interaction effect was not large enough to be statistically significant), while the Charleston Claimant Demonstration reported that “Practically all the difference was due to increased placements for males”.
Finally, two studies reported stronger effects for women. The Worker Adjustment Demonstration said, “Program impacts for displaced female workers were substantial and sustained throughout the one-year follow-up period, although they diminished continually over time…Impacts for men were appreciable, but much smaller and shorter-lived than those for women,” and the Pennsylvania Reemployment Bonus Demonstration notes, “The treatment impacts do not vary by subgroup…impacts tended to be higher on females, but not significantly.”
Similarly, it is hard to make strong claims about the composition of results. Based on this evidence, it seems worth investigating whether Job Search Assistance programs tend to be more effective for women than for men, given that we see multiple studies where that may be the case. On the other hand, we see effects that go in both directions for re-employment incentives. The “Washington Reemployment Bonus Experiment” finds stronger effects for men, while the “Pennsylvania Reemployment Bonus Demonstration” finds stronger effects for women.
In all cases, it seems likely that any differences (if they are real and not statistical artifacts) may be driven by what Nobel Laureate Esther Duflo has called the “plumbing” of the program. For example, it may be that the way in which “Job Search Activities” were marketed or described to the subjects was less appealing for male participants than female ones, instead of there being a general effect where JSAs tend to be less effective for men.
In recent years, policymakers have become increasingly attuned to the diversity of experience in America, and have looked at how different programs affect, not just the average American, but Americans across a wide spectrum of categories. But while it is valuable to examine differential outcomes for different groups, including gender, it is crucial to avoid overgeneralization or drawing definitive conclusions based on limited evidence. Embracing a cautious and rigorous approach to subgroup analyses enables policymakers and researchers to make informed decisions.
A similar effect can be seen in other subgroups. For example, in April 2023, the African American unemployment rate hit an all-time low of 4.7%. However, in the subsequent month, it jumped back up to 5.6%. While this could be a cause for concern (rising African American unemployment might be a forerunner of increasing unemployment generally), it is likely just statistical noise on both ends. Both the all-time low of 4.7% and the reversion in the next month could be statistical artifacts, and we should be more attentive to the overall trend, which continues to show a steady decrease.