Highlights:
- We review a recently published article in Proceedings of the National Academy of Sciences reporting findings of a randomized controlled trial that evaluated the effects of restoring vacant urban lots.
- The article claims that the restoration intervention produced sizable reductions in violence, crime, and fear in surrounding neighborhoods compared to a no-treatment control group.
- Our concern: The article does not make clear that the study’s primary, registered hypothesis—that the restoration intervention would reduce illegal drug trafficking and consumption—was not supported by the data. The article instead portrays the study results as positive based on exploratory findings that are considered preliminary and unreliable under accepted scientific standards since they could well be “false-positives,” produced by chance due to the study’s measurement of dozens of outcomes.
- Readers of the Proceedings article would have no way of knowing these limitations without looking up the study’s registration and protocol.
- This is a common practice in the research literature, generating results that could easily lead policy officials to adopt ineffective interventions under the mistaken belief that they are evidence based.
- The lead researcher’s response to our concerns, and our rejoinder, follow the main report.
Proceedings of the National Academy of Sciences—a leading scientific journal—recently published the results of a randomized controlled trial (RCT) that evaluated the effects of restoring blighted, vacant lots in Philadelphia. The study reports that restoring the vacant lots produced sizable reductions in violence, crime, and fear in surrounding neighborhoods. Here are relevant excerpts from the abstract:
[W]e investigated the effects of standardized, reproducible interventions that restore vacant land on the commission of violence, crime, and the perceptions of fear and safety…. A total of 541 randomly sampled vacant lots were randomly assigned into treatment and control study arms; outcomes from police and 445 randomly sampled participants were analyzed over a 38-month study period. Participants living near treated vacant lots reported significantly reduced perceptions of crime (−36.8%, P < 0.05), vandalism (−39.3%, P < 0.05), and safety concerns when going outside their homes (−57.8%, P < 0.05), as well as significantly increased use of outside spaces for relaxing and socializing (75.7%, P < 0.01). Significant reductions in crime overall (−13.3%, P < 0.01), gun violence (−29.1%, P < 0.001), burglary (−21.9%, P < 0.001), and nuisances (−30.3%, P < 0.05) were also found after the treatment of vacant lots in neighborhoods below the poverty line…. Restoration of [blighted and vacant] land can be an effective and scalable infrastructure intervention for gun violence, crime, and fear in urban neighborhoods.
The restoration intervention sounds like a great success, doesn’t it? Unfortunately, we believe that these study findings are not reliable for a straightforward reason: The researchers changed their main hypotheses and planned methods for analyzing the outcomes over the course of the study. Had they instead followed accepted scientific practice by adhering to their original, pre-specified hypotheses and methods, the study results would have appeared far less rosy. Under accepted standards, the above findings therefore should be considered preliminary and unreliable since they could well be “false-positives,” produced by chance as a result of the study’s measurement of numerous outcomes.
Let’s examine the specifics.
First, we credit the researchers with carrying out a well-implemented and innovative RCT. The researchers also appropriately pre-specified their main hypotheses about the effects of the restoration intervention, as well as their planned methods for analyzing the outcome data. So far, so good.
The main pre-specified hypotheses, however, focused on substance abuse rather than the other crime outcomes described in the abstract. The study registration is, in fact, entitled “Urban Vacant Lot Stabilization and Substance Abuse Outcomes,” and in the registration the researchers specified the study’s primary outcome as “illegal drug trafficking and consumption” and the secondary outcome as “illegal drunkenness and drinking,” both measured using police data.[i] The researchers hypothesized that the vacant lots randomly assigned to the restoration intervention would have better substance abuse outcomes than either of two randomly assigned control groups (a “trash clean-up control” group and a “no-treatment” control group), as follows:
Study hypothesis
1. The stabilization of randomly chosen vacant lots will change the public occurrence of illegal drug trafficking and consumption compared with vacant lots that have been randomly chosen to receive only trash clean-up and lots that have been randomly chosen to receive nothing.
2. The stabilization of randomly chosen vacant lots changes the public occurrence of illegal drunkenness and drinking compared with vacant lots that have been randomly chosen to receive only trash clean-up and lots that have been randomly chosen to receive nothing.[ii]
If the researchers had reported their findings in accord with these pre-specified hypotheses, the study abstract and main text would have prominently stated that the primary hypothesis (an effect on illegal drug trafficking and consumption) was not supported by the data. Instead, the published article reads as if reducing substance abuse was never a hypothesized goal of the restoration intervention (we only discovered this by clicking on the article’s link to the study registration and locating the study protocol through an internet search). Examination of table 2 of the article reveals that the restoration intervention had no significant effect on the primary outcome; in fact, the study found a slightly higher number of illicit drug crimes around the restored lots than around either the trash clean-up control lots or the no-treatment control lots over the follow-up period, although these differences were not statistically significant.[iii]
We don’t know whether the restoration intervention had an effect on the pre-specified secondary outcome of illegal drunkenness and drinking because the published article does not report on that outcome in either the main text or supplemental materials.
In addition, the article reports effects on crime and safety outcomes for the restoration intervention group compared to the no-treatment control group but does not report on the other pre-specified comparison—namely, the restoration intervention group versus the trash clean-up control group. Through our examination of tables 1 and 2 of the article, we back-calculate that a comparison of the restoration intervention and trash clean-up control groups would show few or no significant differences between the two groups.
The positive effects reported in the article are thus based not on the pre-specified primary or secondary analyses, but rather on exploratory and post-hoc analyses examining many outcomes and between-group comparisons. This is problematic because, for each effect that a study measures, there is roughly a one in 20 chance that a test at the conventional 0.05 significance level will produce a false-positive result when the intervention’s true effect is zero. So if a study measures numerous effects (and this study measured dozens, including many that are not reported in the article[iv]), it becomes a near certainty that the study will produce false-positive findings. Recent Food and Drug Administration (FDA) draft guidance refers to this as a “multiplicity” problem, and explains why, in cases like this vacant lot restoration study, it results in findings that are best viewed as preliminary and not definitive:
In the past, it was not uncommon, after the study was unblinded and analyzed, to see a variety of post hoc adjustments of design features (e.g., endpoints, analyses), usually plausible on their face, to attempt to elicit a positive study result from a failed study …. Although post hoc analyses of trials that fail on their prospectively specified endpoints may be useful for generating hypotheses for future testing, they do not yield definitive results. The results of such analyses can be biased because the choice of analyses can be influenced by a desire for success. The results also represent a multiplicity problem because there is no way to know how many different analyses were performed and there is no credible way to correct for the multiplicity of statistical analyses and control the Type I error rate [i.e., rate of false-positives]. Consequently, post hoc analyses by themselves cannot establish effectiveness. [pp. 7-8]
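The one-in-20 arithmetic behind the multiplicity concern can be sketched in a few lines of Python. This is an illustration only, under a simplifying assumption that is ours and not the study's: every test is independent, every true effect is zero, and each test uses a 0.05 significance threshold. The function name is hypothetical. The counts 48 and 76 are the numbers of reported effects tallied in endnote [iv]. Under these assumptions the chance of at least one false positive is about 64 percent for 20 tests and above 90 percent for 48.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` tests is a false positive.

    Assumes (for illustration) that the tests are independent and that the
    intervention's true effect is zero for every outcome tested.
    """
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid it with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 48 and 76 are the counts of reported effects cited in endnote [iv].
    for num_tests in (1, 5, 20, 48, 76):
        probability = chance_of_any_false_positive(num_tests)
        print(f"{num_tests:>3} tests: {probability:.1%} chance of at least one false positive")
```

Real outcomes in a single study are usually correlated, so the exact figures would differ, but the direction of the effect is the same: the more effects measured, the more likely that some cross the significance threshold by chance.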
In conclusion, we believe this was a well-conducted RCT whose primary hypothesis was unfortunately not supported by the data. The published article obscures this central finding and instead portrays the results as positive based on analyses that are not reliable under accepted scientific standards because they can often produce false-positive effects. This is an all-too-common practice in the research literature, generating results that could easily lead policy officials to adopt ineffective interventions under the mistaken belief that they are evidence based. To prevent this outcome, we believe it is imperative for researchers, research funders, and journal editors to ensure that studies fully pre-specify their primary outcomes and methods, and report their findings in accord with those specifications.
Response provided by Charles Branas, lead author of the Proceedings of the National Academy of Sciences paper on the RCT findings
(1) We only had three days to respond and, other than Jon Baron, had no idea who authored the Arnold Foundation report, their qualifications, or if it was scientifically peer-reviewed. The title and aspects of the report misrepresent our RCT and do not account for our specific prior work that led to the RCT and that an RCT like this had never been attempted. It would have been irresponsible for us NOT to report our findings for gun assaults and nuisances, alongside the registered primary outcomes, because gun assaults and nuisances: (a) were strongly hypothesized to be affected by the same intervention in our prior quasi-experimental work and not simply included because of their statistical significance, (b) were in a short list of five police-reported outcomes tested (gun assaults, robbery/theft, burglary, illicit drugs, nuisances), and (c) produced low p-values (p<0.01, p<0.001) under random assignment.
(2) Accounting for multiple hypothesis testing and false discovery,[v] we find the overall crime reduction observed would occur by chance 3 times out of 1,000, the gun assault reduction 11/100 times (and even lower for gun assaults in neighborhoods-below-the-poverty-line), burglary reductions <1/1,000 times, and nuisance reductions 2/100 times. Public drunkenness was bundled in our nuisances outcome and if we separate it out it is statistically significant (p<0.001) with a false discovery rate of <1/100,000.
(3) The federal biomedical funding environment for gun violence research was difficult when application for this study was made, and remains so today. Only in 2013 did the NIH explicitly begin offering funding for gun violence research. As such, gun violence was not a primary outcome in our RCT funded by the NIH/NIAAA in 2012. However, a prior quasi-experimental study of ours [vi] pointed to gun violence and vandalism as key outcomes, significantly affected by the same treatment (“vacant lot greening was associated with consistent reductions in gun assaults … (P < 0.001) and consistent reductions in vandalism … (P < 0.001)” [vi]). We thus appropriately hypothesized, before the RCT began, that gun violence and vandalism would be affected by this treatment and that future RCTs should be conducted (“Community-based trials are warranted to further test these findings” [vi]), although in keeping with the original grant proposal we did not include gun violence as a primary outcome in our RCT registration.
(4) Any study that is first to report on a treatment that has yet to be tested in an RCT is technically “unreliable” until it is reproduced by other RCTs. This is obvious and the Arnold Foundation should retitle its report lest it be misinterpreted by public readers unfamiliar with the technical definition of “reliability.”[vii] We are eager to see other RCTs now test the impact of the same intervention to confirm its reliability. The current study remains a major first step in testing a new intervention that can improve health and safety for large populations, and demonstrating that experimental testing of this place-based intervention and others like it can indeed be conducted with random assignment, population-based random sampling, and integrated ethnographic research.
Rejoinder by the LJAF Evidence-Based Policy team
We agree with the lead author about the value of reporting the study’s findings on gun assaults, nuisances, and other outcomes. Our concern is about how they were reported in the article, with (i) no mention of the fact that the study’s primary, registered hypothesis was not supported by the data; and (ii) no recognition that the article’s reported positive findings were generated by exploratory and post-hoc analyses that are not reliable because they could easily produce false-positives. We believe it is important to report such findings as a source of hypotheses for future studies, but to do so with transparency about their limitations. The Institute of Education Sciences provides useful guidance on this score: “Results from post-hoc analyses are not automatically invalid, but, irrespective of plausibility or statistical significance, they should be regarded as preliminary and unreliable unless they can be rigorously tested and replicated in future studies” (page 4).
The lead author’s response presents the results of “multiple-hypothesis tests” as evidence that the study’s findings on gun assaults and other outcomes are unlikely to be false-positives. However, per the FDA guidance cited in our report, such tests are not valid in the context of post-hoc analyses because “there is no way to know how many different analyses were performed and there is no credible way to correct for the multiplicity of statistical analyses and control the Type I error rate [i.e., rate of false-positives].” Indeed, as we noted in endnote iv of our report (below), the researchers have reported over 75 effects of the restoration intervention in publications on this RCT, and measured dozens of additional effects that have not been reported. The multiple-hypothesis tests that the author cites are based on only a small subset of the measured effects.
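To illustrate why the number of analyses matters for such a correction, here is a minimal Python sketch of the Benjamini-Yekutieli adjustment cited in the author's response [v]. It is not the researchers' actual calculation. All p-values below are hypothetical and chosen only for illustration; none comes from the study. The function and variable names are ours as well. The sketch adjusts the same five p-values twice: once as a family of five, and once alongside 70 additional hypothetical unreported tests.

```python
def benjamini_yekutieli_adjust(p_values):
    """Return Benjamini-Yekutieli adjusted p-values, in the original order."""
    m = len(p_values)
    # Harmonic-sum penalty that makes the procedure valid under dependency.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        scaled = min(1.0, p_values[index] * m * c_m / rank)
        running_min = min(running_min, scaled)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for five reported outcomes (NOT taken from the study).
reported = [0.001, 0.008, 0.02, 0.04, 0.30]

# Hypothetical p-values for 70 further tests that were run but not reported,
# spread evenly over (0, 1) as they would be, on average, under no true effect.
unreported = [(k + 0.5) / 70 for k in range(70)]

adjusted_small_family = benjamini_yekutieli_adjust(reported)
adjusted_full_family = benjamini_yekutieli_adjust(reported + unreported)

if __name__ == "__main__":
    for i, p in enumerate(reported):
        print(f"p = {p:<5}  adjusted over 5 tests: {adjusted_small_family[i]:.3f}  "
              f"adjusted over 75 tests: {adjusted_full_family[i]:.3f}")
```

In this hypothetical example, the smallest p-value stays below 0.05 when adjusted over five tests but not when adjusted over 75. The adjustment is only as credible as the count of analyses fed into it.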
Finally, the lead author suggests that the research team intended to include gun violence as a primary outcome of the RCT, in addition to their registered primary outcome of illegal drug trafficking and consumption, but could not due to the funding environment at the time the study was launched. This is difficult to verify, but accepting it as true, the study findings—had they been fully reported in accord with the pre-specified analyses—do not clearly show an effect on gun violence. While the restoration intervention group had a significantly lower rate of gun assaults than the no-treatment control group, we back-calculate from table 2 of the article that it had a modestly higher rate of gun assaults than the other pre-specified control group (“trash clean-up control”).[viii]
As a next step, we would encourage the researchers to publish a table showing the restoration intervention’s effects, as compared to both pre-specified control groups, on the complete set of outcomes that the study measured—including those that have not yet been reported. Doing so would help readers, including potential funders (such as our Foundation), to gauge whether the overall pattern of exploratory findings is sufficiently promising to warrant investment in future replication studies.
References
[i] Similarly, the study protocol describes the RCT’s “primary aim [as] studying the occurrence of public occurrences of drug and alcohol use.”
[ii] The registration also set out a third hypothesis related to the restoration intervention’s cost-effectiveness in reducing illegal drug trafficking and consumption and illegal public drunkenness and drinking.
[iii] The table shows that the number of illicit drug crimes around the restored lots was 1.5 percent higher than around the no-treatment control lots. Based on the data in the table, we back-calculate that the restored lots also had a higher number of illicit drug crimes than the lots in the trash clean-up control group.
[iv] The Proceedings article reports on 48 effects that the study measured; a different article reports on an additional 28 effects; and the survey that the researchers administered to individuals residing near the vacant lots collected many additional outcomes (such as their personal substance use and sales) that are reported in neither article. In addition, the researchers presumably conducted additional group comparisons (such as the restoration intervention versus trash-only group comparison they had pre-specified) involving measurement of dozens of additional effects that are also not reported.
[v] Benjamini Y, Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165-1188.
[vi] Branas CC, Cheney RA, MacDonald JM, Tam VW, Jackson TD, Ten Have TR. (2011). A difference-in-differences analysis of health, safety, and greening vacant urban space. American Journal of Epidemiology, 174(11):1296-306.
[vii] We did retitle our report as the lead author suggested. The current title is the revised one.
[viii] The article does not provide enough information for us to back-calculate whether this difference was statistically significant.