Does Research Design Affect Study Outcomes in Criminal Justice?
By DAVID WEISBURD, CYNTHIA M. LUM, and ANTHONY PETROSINO
David Weisburd is a senior research fellow in the Department of Criminology and Criminal Justice at the University of Maryland and a professor of criminology at the Hebrew University Law School in Jerusalem.
Cynthia M. Lum is a doctoral student in the Department of Criminology and Criminal Justice at the University of Maryland.
Anthony Petrosino is a research fellow at the Center for Evaluation, Initiative for Children Program at the American Academy of Arts and Sciences and a research associate at Harvard University. He is also the coordinator of the Campbell Crime and Justice Coordinating Group.
ABSTRACT: Does the type of research design used in a crime and justice study influence its conclusions? Scholars agree in theory that randomized experimental studies have higher internal validity than do nonrandomized studies. But there is not consensus regarding the costs of using nonrandomized studies in coming to conclusions regarding criminal justice interventions. To examine these issues, the authors look at the relationship between research design and study outcomes in a broad review of research evidence on crime and justice commissioned by the National Institute of Justice. Their findings suggest that design does have a systematic effect on outcomes in criminal justice studies. The weaker a design, indicated by internal validity, the more likely a study is to report a result in favor of treatment and the less likely it is to report a harmful effect of treatment. Even when comparing randomized studies with strong quasi-experimental research designs, systematic and statistically significant differences are observed.
NOTE: We are indebted to a number of colleagues for helpful comments in preparing this article. We especially want to thank Iain Chalmers, John Eck, David Farrington, Denise Gottfredson, Doris MacKenzie, Joan McCord, Lawrence Sherman, Brandon Welsh, Charles Wellford, and David Wilson.
There is a growing consensus among scholars, practitioners, and policy makers that crime control practices and policies should be rooted as much as possible in scientific research (Cullen and Gendreau 2000; MacKenzie 2000; Sherman 1998). This is reflected in the steady growth in interest in evaluation of criminal justice programs and practices in the United States and the United Kingdom over the past decade and by large increases in criminal justice funding for research during this period (Visher and Weisburd 1998). Increasing support for research and evaluation in criminal justice may be seen as part of a more general trend toward utilization of scientific research for establishing rational and effective practices and policies. This trend is perhaps most prominent in the health professions, where the idea of evidence-based medicine has gained strong government and professional support (Millenson 1997; Zuger 1997), though the evidence-based paradigm is also developing in other fields (see Nutley and Davies 1999; Davies, Nutley, and Smith 2000).
A central component of the movement toward evidence-based practice and policy is reliance on systematic review of prior research and evaluation (Davies 1999). Such review allows policy makers and practitioners to identify what programs and practices are most effective and in which contexts. The Cochrane Collaboration, for example, seeks to prepare, maintain, and make accessible systematic reviews of research on the effects of health care interventions (see Chalmers and Altman 1995; www.cochrane.org). The Cochrane Library is now widely recognized as the single best source of evidence on the effectiveness of health care and medical treatments and has played an important part in the advancement of evidence-based medicine (Egger and Smith 1998). More recently, social scientists following the Cochrane model established the Campbell Collaboration for developing systematic reviews of research evidence in the area of social and educational interventions (see Boruch, Petrosino, and Chalmers 1999). In recognition of the growing importance of evidence-based policies in criminal justice, the Campbell Collaboration commissioned a coordinating group to deal with crime and justice issues. This group began with the goal of providing the best evidence on "what works in crime and justice" through the development of "systematic reviews of research" on the effects of crime and justice interventions (Farrington and Petrosino 2001 [this issue]).
In the Cochrane Collaboration, and in medical research in general, clinical trials that randomize participants to treatment and control or comparison groups are considered more reliable than studies that do not employ randomization. And the recognition that experimental designs form the gold standard for drawing conclusions about the effects of treatments or programs is not restricted to medicine. There is broad agreement among social and behavioral scientists that randomized experiments provide the best method for drawing causal inferences between treatments and programs and their outcomes (for example, see Boruch, Snyder, and DeMoya 2000; Campbell and Boruch 1975; Farrington 1983; Feder, Jolin, and Feyerherm 2000). Indeed, a task force convened by the Board of Scientific Affairs of the American Psychological Association to look into statistical methods concluded that "for research involving causal inferences, the assignments of units to levels of the causal variable is critical. Random assignment (not to be confused with random selection) allows for the strongest possible causal inferences free of extraneous assumptions" (Wilkinson and Task Force on Statistical Inference 1999).
While reliance on experimental studies in drawing conclusions about treatment outcomes has become common in the development of evidence-based medicine, the Campbell Collaboration Crime and Justice Coordinating Group has concluded that it is unrealistic at this time to restrict systematic reviews on the effects of interventions relevant to crime and justice to experimental studies. In developing its Standards for Inclusion of Studies in Systematic Reviews (Farrington 2000), the group notes that it does not require that reviewers select only randomized experiments:
This might possibly be the case for an intervention where there are many randomized experiments (e.g. cognitive-behavioral skills training). However, randomized experiments to evaluate criminological interventions are relatively uncommon. If reviews were restricted to randomized experiments, they would be relevant to only a small fraction of the key questions for policy and practice in criminology. Where there are few randomized experiments, it is expected that reviewers will select both randomized and non-randomized studies for inclusion in detailed reviews. (3)
In this article we examine a central question relevant both to the Campbell Collaboration crime and justice effort and to the more general emphasis on developing evidence-based practice in criminal justice: Does the type of research design used in a crime and justice study influence the conclusions that are reached? Assuming that experimental designs are the gold standard for evaluating practices and policies, it is important to ask what price we pay in including other types of studies in our reviews of what works in crime and justice. Are we likely to overestimate or underestimate the positive effects of treatment? Or conversely, might we expect that the use of well-designed nonrandomized studies will lead to about the same conclusions as we would gain from randomized experimental evaluations?
To examine these issues, we look at the relationship between research design and study outcomes in a broad review of research evidence on crime and justice commissioned by the National Institute of Justice. Generally referred to as the Maryland Report because it was developed in the Department of Criminology and Criminal Justice at the University of Maryland at College Park, the study was published under the title Preventing Crime: What Works, What Doesn’t, What’s Promising (Sherman et al. 1997). The Maryland Report provides an unusual opportunity for assessing the impact of study design on study outcomes in crime and justice both because it sought to be comprehensive in identifying available research and because the principal investigators of the study devoted specific attention to the nature of the research designs of the studies included. Below we detail the methods we used to examine how study design affects study outcomes in crime and justice research and report on our main findings. We turn first, however, to a discussion of why randomized experiments as contrasted with quasi-experimental and nonexperimental research designs are generally considered a gold standard for making causal inferences. We also examine what prior research suggests regarding the questions we raise.
WHY ARE RANDOMIZED EXPERIMENTS CONSIDERED THE GOLD STANDARD?
The key to understanding the strength of experimental research designs is found in what scholars refer to as the internal validity of a study. A research design in which the effects of treatment or intervention can be clearly distinguished from other effects has high internal validity. A research design in which the effects of treatment are confounded with other factors is one in which there is low internal validity. For example, suppose a researcher seeks to assess the effects of a specific drug treatment program on recidivism. If at the end of the evaluation the researcher can present study results and confidently assert that the effects of treatment have been isolated from other confounding causes, the internal validity of the study is high. But if the researcher has been unable to ensure that other factors such as the seriousness of prior records or the social status of offenders have been disentangled from the influence of treatment, he or she must note that the effects observed for treatment may be due to such confounding causes. In this case internal validity is low.
In randomized experimental studies, internal validity is developed through the process of random allocation of the units of treatment or intervention to experimental and control or comparison groups. This means that the researcher has randomized other factors besides treatment itself, since there is no systematic bias that brings one type of subject into the treatment group and another into the control or comparison group. Although the groups are not necessarily the same on every characteristic (indeed, simply by chance, there are likely to be differences), such differences can be assumed to be distributed randomly and are part and parcel of the stochastic processes taken into account in statistical tests. Random allocation thus allows the researcher to assume that the only systematic differences between the treatment and comparison groups are found in the treatments or interventions that are applied. When the study is complete, the researcher can argue with confidence that if a difference has been observed between treatment and comparison groups, it is likely the result of the treatment itself (since randomization has isolated the treatment effect from other possible causes).
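The balancing property of random allocation can be illustrated with a short simulation. This is a minimal sketch and not part of the study: the confounder ("prior arrests"), the sample size, and the seed are all hypothetical values chosen for illustration.

```python
import random
from statistics import mean


def randomize(units, rng):
    """Randomly split a list of units into two equal-sized groups."""
    shuffled = units[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical confounder: number of prior arrests for 2,000 offenders.
priors = [rng.randint(0, 10) for _ in range(2000)]

treatment, control = randomize(priors, rng)
gap = mean(treatment) - mean(control)

# The two groups differ only by chance, so the gap should be close to zero.
print(f"difference in mean prior arrests: {gap:.3f}")
```

Any remaining gap is chance variation of the kind that statistical tests are designed to account for; no systematic process sorts offenders with heavier records into one group.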
In nonrandomized studies, two methods may be used for isolating treatment or program effects. Quasi-experiments, like randomized experiments, rely on the design of a research study to isolate the effects of treatment. Using matching or other methods in an attempt to establish equivalence between groups, quasi-experiments mimic experimental designs in that they attempt to rule out competing causes by identifying groups that are similar except in the nature of the treatment that they receive in the study. Importantly, however, quasi-experiments do not randomize out the effects of other causes as is the case in randomized experimental designs; rather they seek to maximize the equivalence between the units studied through matching or other methods. Threats to internal validity in quasi-experimental studies derive from the fact that it is seldom possible to find or to create treatment and control groups that are not systematically different in one respect or another.
Nonexperimental studies rely primarily on statistical techniques to distinguish the effects of the intervention or treatment from other confounding causes. In practice, quasi-experimental studies often rely as well on statistical approaches to increase the equivalence of the comparisons made.1 However, in nonexperimental studies, statistical controls are the primary method applied in attempts to increase the level of a study’s internal validity. In this case, multivariate statistical methods are used to isolate the effects of treatment from that of other causes. This demands of course that the researcher clearly identify and measure all other factors that may threaten the internal validity of the study outcomes. Only if all such factors are included in the multivariate models estimated can the researcher be confident that the effects of treatment that have been reported are not confounded with other causes.
In theory, the three methods described here are equally valid for solving the problem of isolating treatment or program effects. Each can ensure high internal validity when applied correctly. In practice, however, as Feder and Boruch (2000) note, "there is little disagreement that experiments provide a superior method for assessing the effectiveness of a given intervention" (292). Randomization, according to Kunz and Oxman (1998), "is the only means of controlling for unknown and unmeasured differences between comparison groups as well as those that are known and measured" (1185). While random allocation itself ensures high internal validity in experimental research, for quasi-experimental and nonexperimental research designs, unknown and unmeasured causes are generally seen as representing significant potential threats to the internal validity of the comparisons made.2
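The threat posed by an unmeasured cause can be made concrete with a toy simulation. The sketch below is illustrative only: the "risk" variable, the selection rule, and every number in it are assumptions, and the true treatment effect is set to zero so that any observed difference is pure bias.

```python
import random
from statistics import mean


def naive_difference(n, selection, rng, true_effect=0.0):
    """Treated-minus-untreated difference in mean outcome.

    selection = +1: low-risk people are far more likely to be treated
    selection = -1: high-risk people are far more likely to be treated
    selection =  0: a fair coin decides treatment (random allocation)
    """
    treated, untreated = [], []
    for _ in range(n):
        risk = rng.gauss(0, 1)                    # unmeasured confounder
        low_risk = 1 if risk < 0 else -1
        p_treated = 0.5 + 0.4 * selection * low_risk
        is_treated = rng.random() < p_treated
        # Higher risk means a worse outcome, regardless of treatment.
        outcome = true_effect * is_treated - risk + rng.gauss(0, 1)
        (treated if is_treated else untreated).append(outcome)
    return mean(treated) - mean(untreated)


rng = random.Random(1)
favorable = naive_difference(20000, +1, rng)    # treatment looks helpful
unfavorable = naive_difference(20000, -1, rng)  # treatment looks harmful
randomized = naive_difference(20000, 0, rng)    # close to the true effect, 0

print(f"{favorable:.2f} {unfavorable:.2f} {randomized:.2f}")
```

With no real effect, the nonrandomized comparisons still show a sizeable difference whose sign depends on who was selected into treatment, while the randomized comparison stays near zero.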
INTERNAL VALIDITY AND STUDY OUTCOMES IN PRIOR REVIEWS
While there is general agreement that experimental studies are more likely to ensure high internal validity than are quasi-experimental or nonexperimental studies, it is difficult to specify at the outset the effects that this will have on study outcomes. On one hand, it can be assumed that weaker internal validity is likely to lead to biases in assessment of the effects of treatments or interventions. However, the direction of that bias in any particular study is likely to depend on factors related to the specific character of the research that is conducted. For example, if nonrandomized studies do not account for important confounding causes that are positively related to treatment, they may on average overestimate program outcomes. However, if such unmeasured causes are negatively related to treatment, nonrandomized studies would be expected to underestimate program outcomes. Heinsman and Shadish (1996) suggested that whatever the differences in research design, if nonrandomized and randomized studies are equally well designed and implemented (and thus internal validity is maximized in each), there should be little difference in the estimates gained.
Much of what is known empirically about these questions is drawn from reviews in such fields as medicine, psychology, economics, and education (for example, see Burtless 1995; Hedges 2000; Kunz and Oxman 1998; Lipsey and Wilson 1993). Following what one would expect in theory, a general conclusion that can be reached from the literature is that there is not a consistent bias that results from use of nonrandomized research designs. At the same time, a few studies suggest that differences, in whatever direction, will be smallest when nonrandomized studies are well designed and implemented.
Kunz and Oxman (1998), for example, using studies drawn from the Cochrane database, found varying results when analyzing 18 meta-analyses (incorporating 1211 clinical trials) in the field of health care. Of these 18 systematic reviews, 4 found randomized and higher-quality studies3 to give higher estimates of effects than nonrandomized and lower-quality studies, and 8 reviews found randomized or high-quality studies to produce lower estimates of effect sizes than nonrandomized or lower-quality studies. Five other reviews found little or inconclusive differences between different types of research designs, and in one review, low-quality studies were found to be more likely to report findings of harmful effects of treatments.
Mixed results are also found in systematic reviews in the social sciences. Some reviews suggest that nonrandomized studies will on average underestimate program effects. For example, Heinsman and Shadish (1996) looked at four meta-analyses that focused on interventions in four different areas: drug use, effects of coaching on Scholastic Aptitude Test performance, ability grouping of pupils in secondary schools, and psychosocial interventions for postsurgery outcomes. Included in their analysis were 98 published and unpublished studies. As a whole, randomized experiments were found to yield larger effect sizes than studies where randomization was not used. In contrast, Friedlander and Robins (2001), in a review of social welfare programs, found that nonexperimental statistical approaches often yielded estimates larger than those gained in randomized studies (see also Cox, Davidson, and Bynum 1995; LaLonde 1986).
In a large-scale meta-analysis examining the efficacy of psychological, educational, and behavioral treatment, Lipsey and Wilson (1993) suggested that conclusions reached on the basis of nonrandomized studies are not likely to strongly bias conclusions regarding treatment or program effects. Although studies varied greatly in both directions as to whether nonrandomized designs overestimated or underestimated effects as compared with randomized designs, no consistent bias in either direction was detected. Lipsey and Wilson, however, did find a notable difference between studies that employed a control/comparison design and those that used one-group pre and post designs. The latter studies produced consistently higher estimates of treatment effects.
Support for the view that stronger nonrandomized studies are likely to provide results similar to randomized experimental designs is provided by Shadish and Ragsdale (1996). In a review of 100 studies of marital or family psychotherapy, they found overall that randomized experiments yielded significantly larger weighted average effect sizes than nonequivalent control group designs. Nonetheless, the difference between randomized and nonrandomized studies decreased when confounding variables related to the quality of the design of the study were included.
Works that specifically address the relationship between study design and study outcomes are scarce in criminal justice. In turn, assessment of this relationship is most often not a central focus of the reviews developed, and reviewers generally examine a specific criminal justice area, most often corrections (for example, see Bailey 1966; MacKenzie and Hickman 1998; Whitehead and Lab 1989). Results of these studies provide little guidance for specifying a general relationship between study design and study outcomes for criminal justice research. In an early review of 100 reports of correctional treatment between 1940 and 1960, for example, Bailey (1966) found that research design had little effect on the claimed success of treatment, though he noted a slight positive relationship between the "rigor" of the design and study outcome. Logan (1972), who also reviewed correctional treatment programs, found a slight negative correlation between study design and claimed success.
Recent studies are no more conclusive. Wilson, Gallagher, and MacKenzie (2000), in a meta-analysis of corrections-based education, vocation, and work programs, found that run-of-the-mill quasi-experimental studies produced larger effects than did randomized experiments. However, such studies also produced larger effects than did low-quality designs that clearly lacked comparability among groups. In a review of 165 school-based prevention programs, Whitehead and Lab (1989) found little difference in the size of effects in randomized and nonrandomized studies. Interestingly, however, they reported that nonrandomized studies were much less likely to report a backfire effect whereby treatment was found to exacerbate rather than ameliorate the problem examined. In contrast, a more recent review by Wilson, Gottfredson, and Najaka (in press) found overall that nonrandomized studies yielded results on average significantly lower than randomized experiments’ results, even accounting for a series of other design characteristics (including the overall quality of the implementation of the study). However, it should be noted that many of these studies did not include delinquency measures, and schools rather than individuals were often the unit of random allocation.4
THE STUDY
We sought to define the influence of research design on study outcomes across a large group of studies representing the different types of research design as well as a broad array of criminal justice areas. The most comprehensive source we could identify for this purpose has come to be known as the Maryland Report (Sherman et al. 1997). The Maryland Report was commissioned by the National Institute of Justice to identify "what works, what doesn’t, and what’s promising" in preventing crime. It was conducted at the University of Maryland’s Department of Criminology and Criminal Justice over a yearlong period between 1996 and 1997. The report attempted to identify all available research relevant to crime prevention in seven broad areas: communities, families, schools, labor markets, places, policing, and criminal justice (corrections). Studies chosen for inclusion in the Maryland Report met minimal methodological requirements.5
Though the Maryland Report did not examine the relationship between study design and study outcomes, it did define the quality of the methods used to evaluate the strength of the evidence provided through a scientific methods scale (SMS). This SMS was coded with numbers 1 through 5, with "5 being the strongest scientific evidence" (Sherman et al. 1997, 2.18). Overall, studies higher on the scale have higher internal validity, and studies with lower scores have lower internal validity. The 5-point scale was broadly defined in the Maryland Report (Sherman et al. 1997) as follows:
1: Correlation between a crime prevention program and a measure of crime or crime risk factors.
2: Temporal sequence between the program and the crime or risk outcome clearly observed, or a comparison group present without the demonstrated comparability to the treatment group.
3: A comparison between two or more units of analysis, one with and one without the program.
4: Comparison between multiple units with and without the program, controlling for other factors, or a non-equivalent comparison group has only minor differences evident.
5: Random assignment and analysis of comparable units to program and comparison groups. (2.18-2.19)
A score of 5 on this scale suggests a randomized experimental design, and a score of 1 a nonexperimental approach. Scores of 3 and 4 may be associated with quasi-experimental designs, with 4 distinguished from 3 by a greater concern with control for threats to internal validity. A score of 2 represents a stronger nonexperimental design or a weaker quasi-experimental approach. However, the overall rating given to a study could be affected by other design criteria such as response rate, attrition, use of statistical tests, and statistical power. It is impossible to tell from the Maryland Report how much influence such factors had on each study’s rating. However, correspondence with four of the main study investigators suggests that adjustments based on these other factors were uncommon and generally would result in an SMS decrease or increase of only one level.
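For readers who want to work with the scale programmatically, the correspondence between SMS scores and design families described above can be written as a small lookup. This is only a sketch: the member names and the `design_family` helper are hypothetical labels introduced here and do not come from the Maryland Report.

```python
from enum import IntEnum


class SMS(IntEnum):
    """Scientific methods scale levels (member names are illustrative)."""
    CORRELATION = 1
    TEMPORAL_SEQUENCE = 2
    SIMPLE_COMPARISON = 3
    CONTROLLED_COMPARISON = 4
    RANDOM_ASSIGNMENT = 5


def design_family(score: int) -> str:
    """Map an SMS score to the broad design type it suggests."""
    level = SMS(score)  # raises ValueError for scores outside 1-5
    if level is SMS.RANDOM_ASSIGNMENT:
        return "randomized experiment"
    if level in (SMS.SIMPLE_COMPARISON, SMS.CONTROLLED_COMPARISON):
        return "quasi-experiment"
    if level is SMS.TEMPORAL_SEQUENCE:
        return "stronger nonexperiment or weaker quasi-experiment"
    return "nonexperiment"


for score in range(1, 6):
    print(score, design_family(score))
```

The lookup ignores the secondary criteria (response rate, attrition, statistical power) that could shift a study's rating by one level, since the report does not document those adjustments study by study.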
Although the Maryland Report included a measure of study design, it did not contain a standardized measure of study outcome. Most prior reviews have relied on standardized effect measures as a criterion for studying the relationship between design type and study findings. Although in some of the area reviews in the Maryland Report, standardized effect sizes were calculated for specific studies, this was not the case for the bulk of the studies reviewed in the report. Importantly, in many cases it was not possible to code such information because the original study authors did not provide the specific details necessary for calculating standardized effect coefficients. But the approach used by the Maryland investigators also reflected a broader philosophical decision that emphasized the bottom line of what was known about the effects of crime and justice interventions. In criminal justice, the outcome of a study is often considered more important than the effect size noted. This is the case in good part because there are often only a very small number of studies that examine a specific type of treatment or intervention. In addition, policy decisions are made not on the basis of a review of the effect sizes that are reported but rather on whether one or a small group of studies suggests that the treatment or intervention works.
From the data available in the Maryland Report, we developed an overall measure of study outcomes that we call the investigator reported result (IRR). The IRR was created as an ordinal scale with three values: 1, 0, and -1, reflecting whether a study concluded that the treatment or intervention worked, had no detected effect, or led to a backfire effect. It is defined by what is reported in the tables of the Maryland Report and is coded as follows:6
1: The program or treatment is reported to have had an intended positive effect for the criminal justice system or society. Outcomes in this case supported the position that interventions or treatments lead to reductions in crime, recidivism, or related measures.7
0: The program or treatment was reported to have no detected effect, or the effect was reported as not statistically significant.
-1: The program or treatment had an unintended backfire effect for the criminal justice system or society. Outcomes in this case supported the position that interventions or treatments were harmful and led to increases in crime, recidivism, or related measures.8
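The coding rule above can be expressed as a small function. This is a hedged sketch rather than the coding instrument actually used: the function name, the three direction labels, and the `reported_nonsignificant` flag are assumptions made for illustration.

```python
def code_irr(reported_effect: str, reported_nonsignificant: bool = False) -> int:
    """Return the investigator reported result (IRR) for one study.

    reported_effect: "beneficial", "none", or "harmful", as stated by the
        original investigators (labels are hypothetical).
    reported_nonsignificant: True when the effect was explicitly reported
        as not statistically significant.
    """
    if reported_effect not in {"beneficial", "none", "harmful"}:
        raise ValueError(f"unknown effect label: {reported_effect!r}")
    if reported_nonsignificant or reported_effect == "none":
        return 0                      # no detected effect
    return 1 if reported_effect == "beneficial" else -1


print(code_irr("beneficial"))                               # 1
print(code_irr("beneficial", reported_nonsignificant=True)) # 0
print(code_irr("harmful"))                                  # -1
```

The flag defaults to False because a study only falls to 0 on significance grounds when the effect was explicitly reported as not significant; many tables in the report do not say whether a test was run at all.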
This scale provides an overall measure of the conclusions reached by investigators in the studies that were reviewed in the Maryland Report. However, we think it is important to note at the outset some specific features of the methodology used that may affect the findings we gain using this approach. Perhaps most significant is the fact that Maryland reviewers generally relied on the reported conclusions of investigators unless there was obvious evidence to the contrary.9 This approach led us to term the scale the investigator reported result and reinforces the fact that we examine the impacts of study design on what investigators report rather than on the actual outcomes of the studies examined.
While the Maryland reviewers examined tests of statistical significance in coming to conclusions about which programs or treatments work,10 they did not require that statistical tests be reported by investigators to support the specific conclusions reached in each study. In turn, the tables in the Maryland Report often do not note whether specific studies employed statistical tests of significance. Accordingly, in reviewing the Maryland Report studies, we cannot assess whether the presence or absence of such tests influences our conclusions. Later in our article we reexamine our results, taking into account statistical significance in the context of a more recent review in the corrections area that was modeled on the Maryland Report.
Finally, as we noted earlier, most systematic reviews of study outcomes have come to use standardized effect size as a criterion. While we think that the IRR scale is useful for gaining an understanding of the relationship between research design and reported study conclusions, we recognize that a different set of conclusions might have been reached had we focused on standardized effect sizes. Again, we use the corrections review referred to above to assess how our conclusions might have differed if we had focused on standardized effect sizes rather than the IRR scale.
We coded the Scientific Methods Scale and the IRR directly from the tables reported in Preventing Crime: What Works, What Doesn’t, What’s Promising (Sherman et al. 1997). We do not include all of the studies in the Maryland Report in our review. First, given our interest in the area of criminal justice, we excluded studies that did not have a crime or delinquency outcome measure. Second, we excluded studies that did not provide an SMS score (a feature of some tables in the community and family sections of the report). Finally, we excluded the school-based area from review because only selected studies were reported in tables.11 All other studies reviewed in the Maryland Report were included, which resulted in a sample of 308 studies. Tables 1 and 2 display the breakdown of these studies by SMS and IRR.
TABLE 1
STUDIES CATEGORIZED BY SMS
As is apparent from Table 1, there is wide variability in the nature of the research methods used in the studies that are reviewed. About 15 percent were coded in the highest SMS category, which demands a randomized experimental design. Only 10 studies included were coded in the lowest SMS category, though almost a third fall in category 2. The largest category is score 3, which required simply a comparison between two units of analysis, one with and one without treatment. About 1 in 10 cases were coded as 4, suggesting a quasi-experimental study with strong attention to creating equivalence between the groups studied.
TABLE 2
STUDIES CATEGORIZED BY THE IRR
The most striking observation that is drawn from Table 2 is that almost two-thirds of the crime and justice studies reviewed in the Maryland Report produced a reported result in the direction of success for the treatment or intervention examined. This result is very much at odds with reviews conducted in earlier decades that suggested that most interventions had little effect on crime or related problems (for example, see Lipton, Martinson, and Wilks 1975; Logan 1972; Martinson 1974).12 At the same time, a number of the studies examined, about 1 in 10, reported a backfire effect for treatment or intervention.
RELATING STUDY DESIGN AND STUDY OUTCOMES
In Tables 3 and 4 we present our basic findings regarding the relationship between study design and study outcomes in the Maryland Report sample. Table 3 provides mean IRR outcome scores across the five SMS design categories. While the mean IRR scores in this case present a simple method for examining the results, we also provide an overall statistical measure of correlation, Tau-c (and the associated significance level), which is more appropriate for data of this type. In Table 4 we provide the cross-tabulation of IRR and SMS scores. This presentation of the results allows us to examine more carefully the nature of the relationship both in terms of outcomes in the expected treatment direction and outcomes that may be classified as backfire effects.
TABLE 3
MEAN IRR SCORES ACROSS SMS CATEGORIES
NOTE: Tau-c = -.181. p < .001.
Overall, Tables 3 and 4 suggest that there is a linear inverse relationship between the SMS and the IRR. The mean IRR score decreases with each increase in step in the SMS score (see Table 3). While fully nonexperimental designs have a mean IRR score of .80, randomized experiments have a mean of only .22. The run-of-the-mill quasi-experimental designs represented in category 3 have a mean IRR score of .56, while the strongest quasi-experiments (category 4) have a mean of .39. The overall correlation between study design and study outcomes is moderate and negative (-.18), and the relationship is statistically significant at the .001 level.
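As an illustration of the summary statistic used here, the following is a minimal sketch of how Tau-c (Stuart's Tau-c) can be computed from a cross-tabulation of two ordered variables. It assumes the standard definition, Tau-c = 2m(P - Q) / (n^2 (m - 1)), where P and Q are the numbers of concordant and discordant pairs, n is the number of cases, and m is the smaller of the number of rows and columns. The function name and the counts are hypothetical and are not the Maryland Report data.

```python
def stuart_tau_c(table):
    """Return Stuart's tau-c for a cross-tabulation of two ordered variables.

    `table` is a list of rows; rows and columns must both be listed in
    ascending order of the variable they represent.
    """
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    concordant = 0  # pairs ordered the same way on both variables
    discordant = 0  # pairs ordered in opposite ways on the two variables
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    m = min(n_rows, n_cols)
    return 2 * m * (concordant - discordant) / (n * n * (m - 1))


# Hypothetical counts (NOT the Maryland Report data): rows run from weaker
# to stronger designs, columns are outcomes coded -1, 0, 1.
hypothetical_table = [
    [1, 2, 7],
    [2, 3, 5],
    [3, 4, 3],
]
print(round(stuart_tau_c(hypothetical_table), 3))
```

A negative value in such a table means that cases in higher rows (stronger designs) tend to fall in lower outcome columns, which is the direction of the relationship reported above.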
Looking at the cross-tabulation of SMS and IRR scores, our findings are reinforced. The stronger the method in terms of internal validity as measured by the SMS, the less likely is a study to conclude that the intervention or treatment worked. The weaker the method, the less likely the study is to conclude that the intervention or treatment backfired.
While 8 of the 10 studies in the lowest SMS category and 74 percent of those in category 2 show a treatment impact in the desired direction, this was true for only 37 percent of the randomized experiments in category 5. Only in the case of backfire outcomes in categories 4 and 5 does the table not follow our basic findings, and this departure is small. Overall, the relationship observed in the table is statistically significant at the .005 level.
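The significance test for a cross-tabulation of this kind can be sketched as a Pearson chi-square test of independence. The example below is a minimal illustration under stated assumptions: the function name and the counts are hypothetical (they are not the Maryland Report data), and every row and column total is assumed to be nonzero. With five SMS categories and three IRR outcomes, the degrees of freedom are (5 - 1)(3 - 1) = 8, which matches the 8 df reported for Table 4.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if design and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts (NOT the Maryland Report data): rows are SMS
# categories 1 to 5, columns are IRR outcomes coded -1, 0, 1.
hypothetical_crosstab = [
    [1, 1, 8],
    [4, 10, 30],
    [8, 25, 50],
    [3, 8, 17],
    [5, 12, 10],
]
statistic, df = chi_square_independence(hypothetical_crosstab)
print(round(statistic, 3), df)
```

The statistic would then be compared with the chi-square distribution on the returned degrees of freedom to obtain a significance level.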
Comparing the highest-quality nonrandomized studies with randomized experiments
As noted earlier, some scholars argue that higher-quality nonrandomized studies are likely to have outcomes similar to outcomes of randomized evaluations. This hypothesis is not supported by our data. In Table 5 we combine quasi-experimental studies in SMS categories 3 and 4 and compare them with randomized experimental studies placed in SMS category 5. Again we find a statistically significant negative relationship (p < .01). While 37 percent of the level 5 experimental studies show a treatment effect in the desired direction, this was true for 65 percent of the quasi-experimental studies.
TABLE 4
CROSS-TABULATION OF SMS AND IRR
NOTE: Chi-square = 25.487 with 8 df (p < .005).
TABLE 5
COMPARING QUASI-EXPERIMENTAL STUDIES (SMS = 3 OR 4) WITH RANDOMIZED EXPERIMENTS (SMS = 5)
NOTE: Chi-square = 12.971 with 2 df (p < .01).
Even if we examine only the highest-quality quasi-experimental studies as represented by category 4 and compare these to the randomized studies included in category 5, the relationship between study outcomes and study design remains statistically significant at the .05 level (see Table 6). There is little difference between the two groups in the proportion of backfire outcomes reported; however, there remains a very large gap between the proportion of SMS category 4 and SMS category 5 studies that report an outcome in the direction of treatment effectiveness. While 61 percent of the category 4 SMS studies reported a positive treatment or intervention effect, this was true for only 37 percent of the randomized studies in category 5. Accordingly, even when comparing those nonrandomized studies with the highest internal validity with randomized experiments, we find significant differences in terms of reported study outcomes.
TABLE 6
COMPARING HIGH-QUALITY QUASI-EXPERIMENTAL DESIGNS (SMS = 4) WITH RANDOMIZED DESIGNS (SMS = 5)
NOTE: Chi-square = 6.805 with 2 df (p < .05).
Taking into account tests of statistical significance
It might be argued that had we used a criterion of statistical significance, the overall findings would not have been consistent with the analyses reported above. While we cannot examine this question in the context of the Maryland Report, since statistical significance is generally not reported in the tables or the text of the report, we can review this concern in the context of a more recent review conducted in the corrections area by one of the Maryland investigators, which uses a similar methodology and reports Maryland SMS (see MacKenzie and Hickman 1998). MacKenzie and Hickman (1998) examined 101 studies in their review of what works in corrections, of which 68 are reported to have included tests of statistical significance.
Developing the IRR score for each of MacKenzie and Hickman’s (1998) studies proved more complex than the coding done for the Maryland Report. MacKenzie and Hickman reported all of the studies’ results, sometimes breaking up results by gender, employment, treatment mix, or criminal history, to list a few examples. Rather than count each result as a separate study, we developed two different methods that followed different assumptions for coding the IRR index.
The first simply notes whether any significant findings were found supporting a treatment effect and codes a backfire effect when there are statistically significant negative findings with no positive treatment effects (scale A).13 The second (scale B) is more complex and gives weight to each result in each study.14
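The two coding schemes, as spelled out in notes 13 and 14, can be sketched as simple decision rules. The code below is a minimal illustration rather than the procedure actually used: the function names and the three result labels are hypothetical, and for scale B it assumes that the 50 percent threshold is computed over all results reported for a study, which is one reading of note 14.

```python
# Each study is represented by a list of result labels (hypothetical labels):
#   "sig_positive"   statistically significant finding favoring treatment
#   "sig_negative"   statistically significant backfire finding
#   "nonsignificant" finding that is not statistically significant


def code_scale_a(results):
    """Scale A: 1 if any significant positive finding, -1 if significant
    negative findings with no significant positive ones, otherwise 0."""
    if "sig_positive" in results:
        return 1
    if "sig_negative" in results:
        return -1
    return 0


def code_scale_b(results):
    """Scale B: weights significant positive findings by their share of
    all results reported for the study (assumed denominator)."""
    n_positive = results.count("sig_positive")
    if n_positive > 0:
        # More than half of all results significant and positive scores 2.
        return 2 if n_positive / len(results) > 0.5 else 1
    if "sig_negative" in results:
        return -1
    return 0


example_study = ["sig_positive", "nonsignificant", "nonsignificant"]
print(code_scale_a(example_study), code_scale_b(example_study))
```

Under these rules a study with one significant positive result among three would be coded 1 on both scales, while a study whose results are mostly significant and positive would be coded 1 on scale A and 2 on scale B.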
Taking this approach, our findings analyzing the MacKenzie and Hickman (1998) data follow those reported when analyzing the Maryland Report. The correlation between study design and study outcomes is negative and statistically significant (p < .005) irrespective of the approach we used to define the IRR outcome scale (see Table 7). Using scale A, the correlation observed is -.29, while using scale B, the observed correlation is -.31.
TABLE 7
RELATING SMS AND IRR ONLY FOR STUDIES IN MACKENZIE AND HICKMAN (1998) THAT INCLUDE TESTS OF STATISTICAL SIGNIFICANCE
NOTE: Tau-c for scale A = -.285 (p < .005). Tau-c for scale B = -.311 (p < .005).
Comparing effect size and IRR score results
It might be argued that our overall findings are related to specific characteristics of the IRR scale rather than the underlying relationship between study design and study outcomes. We could not test this question directly using the Maryland Report data because, as noted earlier, standardized effect sizes were not consistently recorded in the report. However, MacKenzie and Hickman (1998) did report standardized effect size coefficients, and thus we are able to reexamine this question in the context of corrections-based criminal justice studies.
Using the average standardized effect size reported for each study reviewed by MacKenzie and Hickman (1998) for the entire sample (including studies where statistical significance is not reported), the results follow those gained from relating IRR and SMS scores using the Maryland Report sample (see Table 8). Again the correlation between SMS and study outcomes is negative; in this case the correlation is about -.30. The observed relationship is also statistically significant at the .005 level. Accordingly, these findings suggest that our observation of a negative relationship between study design and study outcomes in the Maryland Report sample is not an artifact of the particular codings of the IRR scale.
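For readers who want to see how a correlation between design score and average effect size is obtained, the following is a minimal sketch. It assumes the coefficient in question is a Pearson product-moment correlation, which the text does not state explicitly; the function name and the data values are hypothetical and are not taken from MacKenzie and Hickman (1998).

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical values (NOT data from any reviewed study): one SMS score and
# one average standardized effect size per study.
hypothetical_sms = [2, 3, 3, 4, 5, 5]
hypothetical_effect_sizes = [0.40, 0.30, 0.25, 0.20, 0.05, 0.10]
print(round(pearson_r(hypothetical_sms, hypothetical_effect_sizes), 3))
```

A negative coefficient here would indicate that studies with higher design scores tend to report smaller average effect sizes.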
DISCUSSION
Our review of the Maryland Report studies suggests that in criminal justice, there is a moderate inverse relationship between the quality of a research design, defined in terms of internal validity, and the outcomes reported in a study. This relationship continues to be observed even when comparing the highest-quality nonrandomized studies with randomized experiments. Using a related database concentrating only on the corrections area, we also found that our findings are consistent when taking into account only studies that employed statistical tests of significance. Finally, using the same database, we were able to examine whether our results would have differed had we used standardized effect size measures rather than the IRR index that was drawn from the Maryland Report. We found our results to be consistent using both methods. Studies that were defined as including designs with higher internal validity were likely to report smaller effect sizes than studies with designs associated with lower internal validity.
TABLE 8
RELATING AVERAGE EFFECT SIZE AND SMS FOR STUDIES IN MACKENZIE AND HICKMAN (1998)
NOTE: Correlation (r) = -.296 (p < .005).
Prior reviews of the relationship between study design and study outcomes do not predict our findings. Indeed, as we noted earlier, the main lesson that can be drawn from prior research is that the impact of study design is very much dependent on the characteristics of the particular area or studies that are reviewed. In theory as well, there is no reason to assume that there will be a systematic type of bias in studies with lower internal validity. What can be said simply is that such studies, all else being equal, are likely to provide biased findings as compared with results drawn from randomized experimental designs. Why then do we find in reviewing a broad group of crime and justice studies what appears to be a systematic relationship between study design and study outcomes?
One possible explanation for our findings is that they are simply an artifact of combining a large number of studies drawn from many different areas of criminal justice. Indeed, there are generally very few studies that examine a very specific type of treatment or intervention in the Maryland Report. And it may be that were we able to explore the impacts of study design on study outcomes for specific types of treatments or interventions, we would find patterns different from the aggregate ones reported here. We think it is likely that for specific areas of treatment or specific types of studies in criminal justice, the relationship between study design and study outcomes will differ from the one we observe. Nonetheless, review of this question in the context of one specific type of treatment examined by the Campbell Collaboration (where there was a substantial enough number of randomized and nonrandomized studies for comparison) points to the salience of our overall conclusions even within specific treatment areas (see Petrosino, Petrosino, and Buehler 2001). We think this example is particularly important because it suggests the potential confusion that might result from drawing conclusions from nonrandomized studies.
Relying on a systematic review conducted by Petrosino, Petrosino, and Buehler (2001) on Scared Straight and other kids-visit programs, we identified 20 programs that included crime-related outcome measures. Of these, 9 were randomized experiments, 4 were quasi-experimental trials, and 7 were fully nonexperimental studies. Petrosino, Petrosino, and Buehler reported on the randomized experimental trials in their Campbell Collaboration review. They concluded that Scared Straight and related programs do not evidence any benefit in terms of recidivism and actually increase subsequent delinquency. However, a very different picture of the effectiveness of these programs is drawn from our review of the quasi-experimental and nonexperimental studies. Overall, these studies, in contrast to the experimental evaluations, suggest that Scared Straight programs not only are not harmful but are more likely than not to produce a crime prevention benefit.
We believe that our findings, however preliminary, point to the possibility of an overall positive bias in nonrandomized criminal justice studies. This bias may in part reflect a number of other factors that we could not control for in our data, for example, publication bias or differential attrition rates across designs (see Shadish and Ragsdale 1996). However, we think that a more general explanation for our findings is likely to be found in the norms of criminal justice research and practice.
Such norms are particularly important in the development of nonrandomized studies. Randomized experiments provide little freedom to the researcher in defining equivalence between treatment and comparison groups. Equivalence in randomized experiments is defined simply through the process of randomization. However, nonrandomized studies demand much insight and knowledge in the development of comparable groups of subjects. Not only must the researcher understand the factors that influence treatment so that he or she can prevent confounding in the study results, but such factors must be measured and then controlled for through some statistical or practical procedure.
It may be that such manipulation is particularly difficult in criminal justice study. Criminal justice practitioners may not be as strongly socialized to the idea of experimentation as are practitioners in other fields like medicine. And in this context, it may be that a subtle form of creaming in which the cases considered most amenable to intervention are placed in the intervention group is common. In specific areas of criminal justice, such creaming may be exacerbated by self-selection of subjects who are motivated toward rehabilitation. Nonrandomized designs, even in relatively rigorous quasi-experimental studies, may be unable to compensate or control for why a person is considered amenable and placed in the intervention group. Matching on traditional control variables like age and race, in turn, might not identify the subtle components that make individuals amenable to treatment and thus more likely to be placed in intervention or treatment categories.
Of course, we have so far assumed that nonrandomized studies are biased in their overestimation of program effects. Some scholars might argue just the opposite. The inflexibility of randomized experimental designs has sometimes been seen as a barrier to development of effective theory and practice in criminology (for example, see Clarke and Cornish 1972; Eck 2001; Pawson and Tilley 1997). Here it is argued that in a field in which we still know little about the root causes and processes that underlie phenomena we seek to influence, randomized studies may not allow investigators the freedom to carefully explore how treatments or programs influence their intended subjects. While this argument has merit in specific circumstances, especially in exploratory analyses of problems and treatments, we think our data suggest that it can lead in more developed areas of our field to significant misinterpretation and confusion.
CONCLUSION
We asked at the outset of our article whether the type of research design used in criminal justice influences the conclusions that are reached. Our findings, based on the Maryland Report, suggest that design does matter and that its effect in criminal justice study is systematic. The weaker a design, as indicated by internal validity, the more likely was a study to report a result in favor of treatment and the less likely it was to report a harmful effect of treatment. Even when comparing studies defined as randomized designs in the Maryland Report with strong quasi-experimental research designs, systematic and statistically significant differences were observed. Though our study should be seen only as a preliminary step in understanding how research design affects study outcomes in criminal justice, it suggests that systematic reviews of what works in criminal justice may be strongly biased when including nonrandomized studies. In efforts such as those being developed by the Campbell Collaboration, such potential biases should be taken into account in coming to conclusions about the effects of interventions.
Notes
1. Statistical adjustments for random group differences are sometimes employed in experimental studies as well.
2. We should note that we have assumed so far that external validity (the degree to which it can be inferred that outcomes apply to the populations that are the focus of treatment) is held constant in these comparisons. Some scholars argue that experimental studies are likely to have lower external validity because it is often difficult to identify institutions that are willing to randomize participants. Clearly, where randomized designs have lower external validity, the assumption that they are to be preferred to nonrandomized studies is challenged.
3. Kunz and Oxman (1998) not only compared randomized and nonrandomized studies but also adequately and inadequately concealed randomized trials and high-quality versus low-quality studies. Generally, high-quality randomized studies included adequately concealed allocation, while lower-quality randomized trials were inadequately concealed. In addition, the general terms high-quality trials and low-quality trials indicate a difference where "the specific effect of randomization or allocation concealment could not be separated from the effect of other methodological manoeuvres such as double blinding" (Kunz and Oxman 1998, 1185).
4. Moreover, it may be that the finding of higher standardized effect sizes for randomized studies in this review was due to school-level as opposed to individual-level assignment. When only those studies that include a delinquency outcome are examined, a larger effect is found when school rather than student is the unit of analysis (Denise Gottfredson, personal communication, 2001).
5. As the following Scientific Methods Scale illustrates, the lowest acceptable type of evaluation for inclusion in the Maryland Report is a simple correlation between a crime prevention program and a measure of crime or crime risk factors. Thus studies that were descriptive or contained only process measures were excluded.
6. There were also (although rarely) studies in the Maryland Report that reported two findings in opposite directions. For instance, in Sherman and colleagues’ (1997) section on specific deterrence (8.18-8.19), studies of arrest for domestic violence had positive results for employed offenders and backfire results for nonemployed offenders. In these isolated cases, the study was coded twice with the same scientific methods scores and each of the investigator-reported result scores (of 1 and -1) separately.
7. For studies examining the absence of a program (such as a police strike) where social conditions worsened or crime increased, this would be coded as 1.
8. For studies examining the absence of a program (such as a police strike) where social conditions improved or crime decreased, this would be coded as -1.
9. Only in the school-based area was there a specific criterion for assessing the investigator’s conclusions. As noted below, however, the school-based studies are excluded from our review for other reasons.
10. For example, the authors of the Maryland Report noted in discussing criteria for deciding which programs work, "These are programs that we are reasonably certain of preventing crime or reducing risk factors for crime in the kinds of social contexts in which they have been evaluated, and for which the findings should be generalizable to similar settings in other places and times. Programs coded as ’working’ by this definition must have at least two level 3 evaluations with statistical significance tests showing effectiveness and the preponderance of all available evidence supporting the same conclusion" (Sherman et al. 1997, 2-20).
11. It is the case that many of the studies in this area would have been excluded anyway since they often did not have a crime or delinquency outcome measure (but rather examined early risk factors for crime and delinquency).
12. While the Maryland Report is consistent with other recent reviews that also point to greater success in criminal justice interventions during the past 20 years (for example, see Poyner 1993; Visher and Weisburd 1998; Weisburd 1997), we think the very high percentage of studies showing a treatment impact is likely influenced by publication bias. The high rate of positive findings is also likely influenced by the general weaknesses of the study designs employed. This is suggested by our findings reported later: that the weaker a research design in terms of internal validity, the more likely is the study to report a positive treatment outcome.
13. The coding scheme for scale A was as follows. A value of 1 indicates that the study had any statistically significant findings supporting a positive treatment effect, even if findings included results that were not significant or had negative or backfire findings. A value of 0 indicates that the study had only nonsignificant findings. A value of -1 indicates that the study had only statistically significant negative or backfire findings or statistically significant negative findings with other nonsignificant results.
14. Scale B was created according to the following rules. A value of 2 indicates that the study had only or mostly statistically significant findings supporting a treatment effect (more than 50 percent) when including all results, even nonsignificant ones. A value of 1 indicates that the study had some statistically significant findings supporting a treatment effect (50 percent or less, counting both positive significant and nonsignificant results) even if the nonsignificant results outnumbered the positive statistically significant results. A value of 0 indicates that no statistically significant findings were reported. A value of -1 indicates that the study evidenced statistically significant backfire effects (even if nonsignificant results were present) but no statistically significant results supporting the effectiveness of treatment.
References
Bailey, Walter C. 1966. Correctional Outcome: An Evaluation of 100 Reports. Journal of Criminal Law, Criminology and Police Science 57:153-60.
Boruch, Robert F., Anthony Petrosino, and Iain Chalmers. 1999. The Campbell Collaboration: A Proposal for Systematic, Multi-National, and Continuous Reviews of Evidence. Background paper for the meeting at University College-London, School of Public Policy, July.
Boruch, Robert F., Brook Snyder, and Dorothy DeMoya. 2000. The Importance of Randomized Field Trials. Crime & Delinquency 46:156-80.
Burtless, Gary. 1995. The Case for Randomized Field Trials in Economic and Policy Research. Journal of Economic Perspectives 9:63-84.
Campbell, Donald P. and Robert F. Boruch. 1975. Making the Case for Randomized Assignment to Treatments by Considering the Alternatives: Six Ways in Which Quasi-Experimental Evaluations in Compensatory Education Tend to Underestimate Effects. In Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, ed. Carl Bennett and Arthur Lumsdaine. New York: Academic Press.
Chalmers, Iain and Douglas G. Altman. 1995. Systematic Reviews. London: British Medical Journal Press.
Clarke, Ronald V. and Derek B. Cornish. 1972. The Control Trial in Institutional Research: Paradigm or Pitfall for Penal Evaluators? London: HMSO.
Cox, Stephen M., William S. Davidson, and Timothy S. Bynum. 1995. A Meta-Analytic Assessment of Delinquency-Related Outcomes of Alternative Education Programs. Crime & Delinquency 41:219-34.
Cullen, Francis T. and Paul Gendreau. 2000. Assessing Correctional Rehabilitation: Policy, Practice, and Prospects. In Policies, Processes, and Decisions of the Criminal Justice System: Criminal Justice 3, ed. Julie Horney. Washington, DC: U.S. Department of Justice, National Institute of Justice.
Davies, Huw T. O., Sandra Nutley, and Peter C. Smith. 2000. What Works: Evidence-Based Policy and Practice in Public Services. London: Policy Press.
Davies, Philip. 1999. What Is Evidence-Based Education? British Journal of Educational Studies 47:108-21.
Eck, John. 2001. Learning from Experience in Problem Oriented Policing and Crime Prevention: The Positive Functions of Weak Evaluations and the Negative Functions of Strong Ones. Unpublished manuscript.
Egger, Matthias and G. Davey Smith. 1998. Bias in Location and Selection of Studies. British Medical Journal 316:61-66.
Farrington, David P. 1983. Randomized Experiments in Crime and Justice. In Crime and Justice: An Annual Review of Research, ed. Norval Morris and Michael Tonry. Chicago: University of Chicago Press.
—. 2000. Standards for Inclusion of Studies in Systematic Reviews. Discussion paper for the Campbell Collaboration Crime and Justice Coordinating Group.
Farrington, David P. and Anthony Petrosino. 2001. The Campbell Collaboration Crime and Justice Group. Annals of the American Academy of Political and Social Science 578:35-49.
Feder, Lynette and Robert F. Boruch. 2000. The Need for Experiments in Criminal Justice Settings. Crime & Delinquency 46:291-94.
Feder, Lynette, Annette Jolin, and William Feyerherm. 2000. Lessons from Two Randomized Experiments in Criminal Justice Settings. Crime & Delinquency 46:380-400.
Friedlander, Daniel and Philip K. Robins. 2001. Evaluating Program Evaluations: New Evidence on Commonly Used Non-Experimental Methods. American Economic Review 85:923-37.
Hedges, Larry V. 2000. Using Converging Evidence in Policy Formation: The Case of Class Size Research. Evaluation and Research in Education 14:193-205.
Heinsman, Donna T. and William R. Shadish. 1996. Assignment Methods in Experimentation: When Do Nonrandomized Experiments Approximate Answers from Randomized Experiments? Psychological Methods 1:154-69.
Kunz, Regina and Andy Oxman. 1998. The Unpredictability Paradox: Review of Empirical Comparisons of Randomized and Non-Randomized Clinical Trials. British Medical Journal 317:1185-90.
LaLonde, Robert J. 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review 76:604-20.
Lipsey, Mark W. and David B. Wilson. 1993. The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis. American Psychologist 48:1181-209.
Lipton, Douglas S., Robert M. Martinson, and Judith Wilks. 1975. The Effectiveness of Correctional Treatment: A Survey of Treatment Evaluation Studies. New York: Praeger.
Logan, Charles H. 1972. Evaluation Research in Crime and Delinquency—A Reappraisal. Journal of Criminal Law, Criminology and Police Science 63:378-87.
MacKenzie, Doris L. 2000. Evidence-based Corrections: Identifying What Works. Crime & Delinquency 46:457-71.
MacKenzie, Doris L. and Laura J. Hickman. 1998. What Works in Corrections (Report submitted to the State of Washington Legislature Joint Audit and Review Committee). College Park: University of Maryland.
Martinson, Robert. 1974. What Works? Questions and Answers About Prison Reform. Public Interest 35:22-54.
Millenson, Michael L. 1997. Demanding Medical Excellence: Doctors and Accountability in the Information Age. Chicago: University of Chicago Press.
Nutley, Sandra and Huw T. O. Davies. 1999. The Fall and Rise of Evidence in Criminal Justice. Public Money & Management 19:47-54.
Pawson, Ray and Nick Tilley. 1997. Realistic Evaluation. London: Sage.
Petrosino, Anthony, Carolyn Petrosino, and John Buehler. 2001. Pilot Test: The Effects of Scared Straight and Other Juvenile Awareness Programs on Delinquency. Unpublished manuscript.
Poyner, Barry. 1993. What Works in Crime Prevention: An Overview of Evaluations. In Crime Prevention Studies. Vol. 1, ed. Ronald V. Clarke. Monsey, NY: Criminal Justice Press.
Shadish, William R. and Kevin Ragsdale. 1996. Random Versus Nonrandom Assignment in Controlled Experiments: Do You Get the Same Answer? Journal of Consulting and Clinical Psychology 64:1290-305.
Sherman, Lawrence W. 1998. Evidence-Based Policing. In Ideas in American Policing. Washington, DC: Police Foundation.
Sherman, Lawrence W., Denise C. Gottfredson, Doris Layton MacKenzie, John E. Eck, Peter Reuter, and Shawn D. Bushway. 1997. Preventing Crime: What Works, What Doesn’t, What’s Promising. Washington, DC: U.S. Department of Justice, National Institute of Justice.
Visher, Christy A. and David Weisburd. 1998. Identifying What Works: Recent Trends in Crime Prevention. Crime, Law and Social Change 28:223-42.
Weisburd, David. 1997. Reorienting Crime Prevention Research and Policy: From the Causes of Criminality to the Context of Crime (Research Report NIJ 16504). Washington, DC: U.S. Department of Justice, National Institute of Justice.
Whitehead, John T. and Steven P. Lab. 1989. A Meta-Analysis of Juvenile Correctional Treatment. Journal of Research in Crime and Delinquency 26:276-95.
Wilkinson, Leland and Task Force on Statistical Inference. 1999. Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist 54:594-604.
Wilson, David B., Catherine A. Gallagher, and Doris L. MacKenzie. 2000. A Meta-Analysis of Corrections-Based Education, Vocation, and Work Programs for Adult Offenders. Journal of Research in Crime and Delinquency 37:347-68.
Wilson, David B., Denise C. Gottfredson, and Stacy S. Najaka. In Press. School-Based Prevention of Problem Behaviors: A Meta-Analysis. Journal of Quantitative Criminology.
Zuger, Abigail. 1997. New Way of Doctoring: By the Book. New York Times, 16 Dec.