
Methods for Policy Analysis Rebecca A. Maynard, Editor

Kenneth A. Couch, Guest Editor


STRENGTHENING THE REGRESSION DISCONTINUITY DESIGN USING ADDITIONAL DESIGN ELEMENTS: A WITHIN-STUDY COMPARISON

Coady Wing and Thomas D. Cook

Abstract

The sharp regression discontinuity design (RDD) has three key weaknesses compared to the randomized clinical trial (RCT). It has lower statistical power, it is more dependent on statistical modeling assumptions, and its treatment effect estimates are limited to the narrow subpopulation of cases immediately around the cutoff, which is rarely of direct scientific or policy interest. This paper examines how adding an untreated comparison to the basic RDD structure can mitigate these three problems. In the example we present, pretest observations on the posttest outcome measure are used to form a comparison RDD function. To assess its performance as a supplement to the basic RDD, we designed a within-study comparison that compares causal estimates and their standard errors for (1) the basic posttest-only RDD, (2) a pretest-supplemented RDD, and (3) an RCT chosen to serve as the causal benchmark. The two RDD designs are constructed from the RCT, and all analyses are replicated with three different assignment cutoffs in three American states. The results show that adding the pretest makes functional form assumptions more transparent. It also produces causal estimates that are more precise than in the posttest-only RDD, but that are nonetheless larger than in the RCT. Neither RDD version shows much bias at the cutoff, and the pretest-supplemented RDD produces causal effects in the region beyond the cutoff that are very similar to the RCT estimates for that same region. Thus, the pretest-supplemented RDD improves on the standard RDD in multiple ways that bring causal estimates and their standard errors closer to

Journal of Policy Analysis and Management, Vol. 32, No. 4, 853–877 (2013) © 2013 by the Association for Public Policy Analysis and Management Published by Wiley Periodicals, Inc. View this article online at wileyonlinelibrary.com/journal/pam DOI:10.1002/pam.21721


those of an RCT, not just at the cutoff, but also away from it. © 2013 by the Association for Public Policy Analysis and Management.

INTRODUCTION

A carefully executed regression discontinuity design (RDD) is now widely considered a sound basis for causal inference. The design was introduced in Thistlethwaite and Campbell (1960), and Goldberger (1972a, 1972b) showed that RDD produces causal estimates that are unbiased, but less efficient than those produced by a comparable randomized clinical trial (RCT). Recent work has clarified the assumptions that support parametric and nonparametric identification in the RDD (Hahn, Todd, & Van der Klaauw, 2001; Lee, 2008), and has examined the statistical properties of common estimators (Lee & Card, 2008; Porter, 2003; Schochet, 2009). In addition, a growing literature compares RDD estimates to benchmark estimates from an RCT, and these within-study comparisons show that RDD and RCT estimates have been similar in various applied settings (Cook & Wong, 2008; Green et al., 2009; Shadish et al., 2011). Despite this recent work, the basic elements of the design have not changed. An RDD requires an outcome variable, a binary treatment, a continuous assignment variable, and a cutoff-based treatment assignment rule. The assignment rule is crucial: In a successful RDD, individuals with assignment scores on one side of the cutoff receive one treatment and individuals on the other side receive another treatment, usually a no-treatment control condition. An RDD is sharp when all individuals receive the intended treatment, and it is fuzzy when compliance is partial. This paper deals only with sharp RDD studies.

The analysis of an RDD is not complicated in principle. Researchers estimate treatment effects by comparing mean outcomes among people with assignment scores immediately below and immediately above the cutoff. The difference between these two conditional means can be understood as a discontinuity in the regression function that links average outcomes across subpopulations defined by the assignment variable. A basic assumption in the RDD is that in the absence of a treatment effect, the regression would be a smooth function near the cutoff; conversely, a sudden break or discontinuity at the cutoff is evidence of a treatment effect. The size of the discontinuity measures the magnitude of the effect.

The RDD has at least three important limitations relative to an RCT. The first involves the amount of statistical modeling required to identify and estimate causal effects. In an RCT, treatment effects are nonparametrically identified so that assumptions about the underlying statistical model are not required to interpret the data. Moreover, there is usually a close connection between the research design and the statistical tools used to perform the analysis.1 In an RDD, treatment effects are also nonparametrically identified, but a fully nonparametric analysis requires very large sample sizes that cannot always be attained. In practice, researchers often proceed by specifying a parametric or semiparametric functional form of the regression and allowing for an intercept shift at the cutoff (Lee & Card, 2008). Choosing the wrong functional form can lead to biased treatment effect estimates, so it is good practice for analysts to use flexible methods to estimate functional forms before evaluating how sensitive the results are to alternative specifications. Although many techniques for sensitivity analysis exist, it would be a boon

1 Of course, analysts often employ parametric regression models in the analysis of experimental data either to improve the statistical precision of the treatment effect estimates or to adjust for chance imbalances in observable covariates. But this additional modeling is usually not central to the study’s findings.


in RDD studies to have better methods for validating functional form assumptions. We present one such method here.

A second limitation of standard RDD is that treatment effect estimates are less statistically precise than in an RCT, reducing the statistical power of key hypothesis tests (Goldberger, 1972b; Schochet, 2009). Some of the efficiency loss is due to the multicollinearity between the assignment scores and the treatment variable that is inherent in the RDD assignment rule. RDD estimates that rely on nonparametric estimation methods may also have lower power because they employ a bandwidth that decreases the study’s effective sample size. Lower statistical power is a secondary concern in RDD studies with large administrative databases, but it is more central when investigators prospectively design a study and collect their own data directly from respondents. In this last circumstance, adding more cases may be costly and tempt researchers into favoring alternative designs with greater power but a weaker identification strategy (Schochet, 2009).

A third limitation concerns the generality of RDD results. RCTs produce treatment effect estimates averaged across all members of the study population. In contrast, RDD estimates are limited to average treatment effects among members of the narrow subpopulation located immediately around the cutoff. For example, if a treatment is given to students scoring above the 75th percentile on an achievement test, then RDD results can only be generalized to students near that point. Unfortunately, social science and public policy debates usually are concerned with the effects of treatments in broader subpopulations, such as all students, or all students in the upper quartile of the test score distribution. Constructing estimates of these more general parameters in an RDD setting requires making extrapolations beyond the cutoff score. Researchers often are reluctant to make such extrapolations because there is rarely a firm theoretical basis for the assumption that the functional form of the regression is stable beyond the range of the observed data. The crux of the problem is that no one knows what the treatment group functional form would have looked like in the absence of the treatment. The absence of this counterfactual regression function is why it is standard practice to limit causal inferences to the cutoff subpopulation, even though this narrow applicability of the estimates reduces the value of the standard RDD as a practical method for policy analysis (Manski, 2013).

This paper explores an RDD variant that can improve on all three of these limitations. It requires supplementing the conventional posttest-only RDD with a pretest measure of the outcome variable. In what follows, we refer to the conventional RDD as a “posttest RDD” because it only requires posttest information. We refer to the pretest-supplemented design as a “pretest RDD,” noting that it makes use of both pretest and posttest outcome data. The key idea is that the pretest data provide information about what the regression function linking outcomes and assignment scores looked like in the absence of the treatment in an earlier time period. If the functions are stable over time, then the pretest data can inform the analysis of the posttest data. Minor differences between the pretest and posttest functional forms in the untreated part of the assignment variable, such as intercept differences, are easily accommodated. But functional forms that are observed to be very dissimilar over time in the untreated part of the assignment variable would cast doubt on the results of a pretest-supplemented RDD.

The core of this paper is a within-study comparison that evaluates the performance of the pretest and posttest RDDs relative to each other and to a benchmark RCT. LaLonde (1986) and Fraker and Maynard (1987) were the first to use this method to examine whether econometric adjustments for selection bias in an observational study could reproduce the results of job-training RCTs. Since then, researchers have used the method to study the performance of RDD (Green et al., 2009; Shadish et al., 2011), intact group and individual case matching (Bifulco, 2012; Cook, Shadish, &


Wong, 2008; Wilde & Hollister, 2007), and alternative strategies for covariate selection (Cook & Steiner, 2010). The implementation details of within-study comparisons vary, but the basic idea is always to test the validity of a nonexperimental method by comparing its estimates to a trustworthy benchmark from an RCT. Methods for conducting a high-quality within-study comparison have evolved over time, and Cook, Shadish, and Wong (2008) describe the current best practices that we follow in this paper.

Our within-study comparison is based on data from the Cash and Counseling Demonstration Experiment (Dale & Brown, 2007). In the original study, disabled Medicaid beneficiaries in Arkansas, Florida, and New Jersey were randomly assigned to obtain home- and community-based health services through Medicaid (the control group), or to receive a spending account that they could use to procure home- and community-based services directly (the treatment group). The original study examined the effects of the program on a variety of health, social, and economic outcomes. But for the purposes of our within-study comparison, the outcome variable we focus on is a measure of individual Medicaid expenditures in the 12 months after the study began.

To construct pretest and posttest RDDs from the RCT, we used baseline age as the assignment variable and sorted the RCT treatment-group and control-group cases by baseline age. Then, we defined a cutoff age for treatment assignment, selecting three of them for replication purposes—ages 35, 50, and 70. Next, we systematically deleted control cases from above the cutoff and treatment cases below the cutoff. Since we had data from Florida, New Jersey, and Arkansas, a total of nine posttest and nine pretest RDDs resulted—three age cutoffs crossed with three states. At each age cutoff, we compared the pretest and posttest RDD estimates to each other and to the corresponding RCT estimate. In the pretest RDD, we also used the comparison data to compute an estimate of the average treatment effect for everyone older than the cutoff, which is the average treatment effect on the treated (ATT) parameter that is often of interest in program evaluation research and is usually out of reach in RDD studies. We compared these extrapolated estimates to the corresponding RCT benchmarks.

The results of our analysis indicate that the pretest RDD can shore up all three key weaknesses of the posttest RDD. First, our comparisons show that the pretest and posttest functional forms are similar below the cutoff, thus providing some support for the proposition that the pretest data could be informative about the counterfactual untreated regression function in the posttest period. Second, we found that adding the pretest led to more statistically precise estimates than the conventional posttest RDD, although the estimates are still not quite as precise as in the RCT. And finally, the pretest RDD produced unbiased treatment effects relative to the RCT, not only at the cutoff, but also beyond the cutoff. In the within-study comparison considered in this paper, the multidimensional superiority of the pretest RDD over the posttest RDD is clear.

THE RCT DATA

The Cash and Counseling Demonstration and Evaluation is described in detail elsewhere (Brown & Dale, 2007; Dale & Brown, 2007a; Doty, Mahoney, & Simon-Rusinowitz, 2007; Carlson et al., 2007). Study participants were disabled elderly and nonelderly adult Medicaid beneficiaries who agreed to participate and lived in Arkansas, New Jersey, or Florida from 1999 to 2003. The study employed a rolling enrollment design in which new enrollees completed a baseline survey and then were randomly assigned to treatment or control status, after which the state agency was informed of the assignments. The treatment condition was a “consumer-directed


Table 1. Descriptive statistics for the variables and samples used to form within-study comparisons.

                                  Arkansas               Florida                New Jersey
Variable                          Control    Treatment   Control    Treatment   Control    Treatment
Pretest Medicaid expenditures     $6,358     $6,439      $14,300    $14,377     $18,779    $18,215
Posttest Medicaid expenditures    $7,583     $9,443      $18,088    $19,944     $20,100    $21,299
Mean age                          70         70          55         55          62         63
N                                 1,004      1,004       906        907         869        861

budget” program. It allowed disabled Medicaid beneficiaries to procure their own home- and community-based support services and providers using a Medicaid-financed spending allowance. The control group received home- and community-based support services procured by a local Medicaid agency from Medicaid-certified providers, which is the status quo policy. In both groups, Medicaid pays for the services. The key difference is whether the Medicaid enrollee or the Medicaid agency makes the micro-level spending decisions. In the new program, the personal allowance was set to the amount the agency would have allocated in the absence of the new program because the intervention was meant to be revenue neutral. So, the study outcome we analyze (how much was actually spent for services) tests whether individuals or Medicaid officials spent more of the same allocated total.

Our methodological study used a small subset of the measures collected in the original study. For each member of the study, we retained information on age at baseline, state of residence, and randomly assigned treatment status. We created a measure of annual Medicaid expenditures by adding up six categories of monthly expenditures across the 12 months before random assignment (pretest) and after random assignment (posttest). The six expenditure categories were Inpatient Expenditures, Diagnosis-Related Group Expenditures, Skilled Nursing Expenditures, Personal Assistance Services Expenditures, Home Health Services Expenditures, and Other Services Expenditures.2 Throughout, we refer to this six-item index as “Medicaid expenditures,” and it is the sole outcome in our study.

The summary statistics in Table 1 show that in the RCT, Arkansas had 1,004 participants in each of the treatment and control arms, Florida had 906 control and 907 treatment participants, and New Jersey had 869 control and 861 treatment-group members. In Arkansas, the average participant was 70 years old, compared to 55 in Florida, and 62 in New Jersey. Within each state, average pretest expenditures were similar in the treatment and control groups, but the level of spending varied by state. The average person in Arkansas had pretest expenditures of $6,400 compared to $14,300 in Florida and $18,500 in New Jersey. Mean posttest expenditures were consistently higher in the treatment groups. Simple intent-to-treat (ITT) comparisons imply that the intervention increased average expenditures by about $1,860

2 The claims data included a small number of cases with very high levels of expenditures that could be either real or data entry errors. To reduce concerns that these outliers would skew our regression estimates, we top coded the pretest and posttest Medicaid expenditures variable at the 99th percentile of the pooled distribution of posttest expenditures, which was equal to $78,273. The top coding procedure affected 89 posttest observations and 79 pretest observations.


Table 2. Sample sizes in the nine constructed posttest regression discontinuity designs.

State         Age cutoff   Below the cutoff   Above the cutoff   Total
Arkansas      35           59                 944                1,003
Florida       35           296                609                905
New Jersey    35           106                770                876

Arkansas      50           143                868                1,011
Florida       50           417                496                913
New Jersey    50           224                650                874

Arkansas      70           361                623                984
Florida       70           555                359                914
New Jersey    70           491                387                878

(P < 0.01) in Arkansas, $1,856 (P = 0.01) in Florida, and $1,200 (P = 0.09) in New Jersey. Thus, the Cash and Counseling treatment increased Medicaid expenditures relative to when Medicaid officials controlled the expenditures.
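The ITT figures quoted above can be checked directly against the posttest means in Table 1. The short sketch below does only that arithmetic; the dictionary layout and variable names are illustrative and not part of the original analysis.

```python
# Posttest mean Medicaid expenditures from Table 1, as (control, treatment) in dollars.
posttest_means = {
    "Arkansas": (7583, 9443),
    "Florida": (18088, 19944),
    "New Jersey": (20100, 21299),
}

# Simple intent-to-treat contrast: treatment mean minus control mean in each state.
itt = {state: treat - control for state, (control, treat) in posttest_means.items()}

for state, diff in itt.items():
    print(f"{state}: ${diff:,}")
```

The New Jersey difference is $1,199 before rounding, which matches the roughly $1,200 reported in the text.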

WITHIN-STUDY RESEARCH DESIGN

To implement the within-study comparison, we created 21 different subsets of the original RCT data. The first three are the state-specific RCT treatment and control groups, for which sample sizes and basic descriptive statistics are in Table 1. The next nine subsets represent state- and age-specific posttest RDDs based on three states and three age cutoffs (35, 50, and 70). To create the posttest RDD samples, we removed from the RCT data all treatment group members younger than the relevant age cutoff and all control group members at least as old as the cutoff. Table 2 shows the sample sizes for the nine posttest RDD subsets. The number of observations below the cutoff increases with cutoff age. With the cutoff set at 35, there are many more observations above the cutoff than below; at age 50, observations are more balanced; and at age 70 balance is best overall. The different age cutoffs also determine how much extrapolation is required to compute average effects for everyone above the cutoff. For example, estimating the average effect among everyone older than 35 requires an extrapolation from 36 to 90. In contrast, estimating the average effect for people over 70 only requires an extrapolation from 71 to 90.
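The deletion rule that turns the RCT into a sharp posttest RDD can be sketched in a few lines. This is a minimal illustration, assuming the RCT data are held as NumPy arrays of baseline age and a 0/1 treatment indicator; the function name and the toy values are hypothetical.

```python
import numpy as np

def posttest_rdd_mask(age, treated, cutoff):
    """Boolean mask for the cases retained in the constructed posttest RDD.

    Keeps treatment-group members at or above the cutoff and control-group
    members below it; everyone else is dropped.
    """
    age = np.asarray(age)
    treated = np.asarray(treated).astype(bool)
    return (treated & (age >= cutoff)) | (~treated & (age < cutoff))

# Toy data (not from the study): six treatment cases followed by four control cases.
age = np.array([30, 34, 35, 49, 50, 72, 30, 35, 50, 72])
treated = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
keep = posttest_rdd_mask(age, treated, cutoff=50)

# In the retained sample, treatment status is a deterministic function of age.
is_sharp = np.array_equal(treated[keep] == 1, age[keep] >= 50)
```

After the deletion, treatment status depends only on which side of the cutoff a person falls, which is what makes the constructed design a sharp RDD.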

Next, we used Medicaid expenditures from the pre-randomization year to create nine pretest RDD data subsets based on the same cutoff values and states. With the pretest and posttest RDD subsets in hand, we created a long-form data set by stacking the pretest and posttest RDD data, and defined an indicator variable to identify which observations were from each time period. These stacked data sets form the pretest RDD. They combine data from the pretest period when no one of any age had received the treatment, with data from the posttest period when treatment was available above a specified age cutoff. Stacking the data in this way results in twice as many observations in the pretest RDD compared to the posttest RDD because each participant is observed twice.
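The stacking step can be illustrated as follows. The sketch assumes the inputs already describe one constructed RDD sample (each person appears once, with a pretest and a posttest expenditure value); the function name, dictionary keys, and numbers are made up for illustration.

```python
import numpy as np

def stack_pre_post(age, pre_y, post_y, cutoff):
    """Stack pretest and posttest records into one long-form data set."""
    age = np.asarray(age)
    n = len(age)
    return {
        "age": np.concatenate([age, age]),
        "y": np.concatenate([pre_y, post_y]),
        # Period indicator: 1 for pretest rows, 0 for posttest rows.
        "pre": np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)]),
        # Nobody is treated in the pretest period; in the posttest period
        # treatment is determined by the age cutoff.
        "treated": np.concatenate([np.zeros(n, dtype=int), (age >= cutoff).astype(int)]),
    }

# Toy values, not study data.
age = np.array([30, 45, 50, 72])
pre_y = np.array([6000.0, 6500.0, 7000.0, 9000.0])
post_y = np.array([6400.0, 7000.0, 9100.0, 11500.0])
long = stack_pre_post(age, pre_y, post_y, cutoff=50)
```

Each person contributes two rows, so the stacked data set has twice as many observations as the posttest RDD it is built from.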

These procedures resulted in an RCT, a posttest RDD, and a pretest RDD, each replicated across three states and three age cutoffs. The basic goal of our analysis is to construct estimates of the same causal parameters using each of these research designs. Interpreting the RCT estimates as internally valid allows us to measure the performance of the RDD estimates relative to each other and to the best estimate of the true effect.


METHODS

Implementing the within-study comparison requires (1) defining treatment effects of interest, (2) specifying estimators for each effect in each design, and (3) developing measures of performance by which to judge the strengths and weaknesses of each design.

Parameters of Interest

Throughout the paper, we use $i$ to index individuals, and $t \in \{0, 1\}$ to denote the pretest and posttest periods. $A_i$ is a person’s (time-invariant) age at baseline, and $\mathrm{Pre}_{it} = 1(t = 0)$ is a dummy variable that identifies observations made during the pretest time period. We adopt a potential outcomes framework in which $Y(1)_{it}$ denotes the $i$th person’s treated outcome at time $t$, and $Y(0)_{it}$ denotes the person’s untreated outcome at time $t$. The outcome variable in all of our analyses refers to the person’s Medicaid expenditures over the 12 months prior to period $t$. $D_{it}$ is an indicator set to 1 if the person has received the treatment at time $t$. In the Cash and Counseling data, a person is treated if she has the option to control her own Medicaid-financed home care budget. Since no one received the treatment in the pretest time period, $D_{i0} = 0$ for everyone in the sample. A person’s realized outcome is $Y_{it} = Y(1)_{it} D_{it} + Y(0)_{it} (1 - D_{it})$.

To estimate treatment effects at the conventional RDD cutoff and also beyond it, we define treatment effects conditional on specific ages and age ranges. In our notation, the average treatment effect in the posttreatment time period for people who are, say, 70 years old is written as $\Delta(70) = E[Y(1)_{it} \mid \mathrm{Pre}_{it} = 0, A_i = 70] - E[Y(0)_{it} \mid \mathrm{Pre}_{it} = 0, A_i = 70]$. If the cutoff value in an RDD was set at age 50, then $\Delta(50) = \Delta(RDD)$ is the average treatment effect in the cutoff subpopulation for that particular RDD.

In a conventional RDD, inference is limited to the average treatment effect at the cutoff. Since part of our analysis is concerned with extrapolating beyond the cutoff, it is also useful to describe average treatment effects in broader subpopulations. One way to do this is to consider average effects across a range of age groups as relative-frequency-weighted averages of age-specific treatment effects. For example, the average treatment effect among people who are at least 50 years old is

$$\Delta(m \geq 50) = \frac{\sum_{m=50}^{M} \Delta(m) \times \Pr(A_i = m \mid \mathrm{Pre}_{it} = 0)}{\Pr(A_i \geq 50 \mid \mathrm{Pre}_{it} = 0)}.$$

In a sharp RDD with a cutoff set at $c$, the parameter $\Delta(m \geq c)$ represents the average treatment effect above the cutoff, which might also be called the ATT: $\Delta(m \geq c) = \Delta(ATT)$. Estimating the ATT parameter requires extrapolation away from the cutoff, so the ATT parameter is not immediately identified in a standard RDD. The pretest RDD that we propose provides one mechanism for making credible extrapolations beyond the cutoff.
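The weighting in the above-cutoff parameter can be made concrete with a small sketch. It assumes integer ages and takes the age-specific effect as a user-supplied function; both the function name and the toy effect schedule are hypothetical.

```python
import numpy as np

def average_effect_above_cutoff(ages, effect_by_age, cutoff):
    """Relative-frequency-weighted average of age-specific effects above a cutoff."""
    ages = np.asarray(ages)
    above = ages[ages >= cutoff]
    values, counts = np.unique(above, return_counts=True)
    weights = counts / counts.sum()  # Pr(A = m | A >= cutoff)
    effects = np.array([effect_by_age(m) for m in values])
    return float(np.sum(weights * effects))

# Toy example: the effect is 10 * age, and the ages at or above 50 are 50, 50, 60, 60, 60, 70.
ages = np.array([40, 50, 50, 60, 60, 60, 70])
att = average_effect_above_cutoff(ages, lambda m: 10.0 * m, cutoff=50)
```

Dividing the counts by their total above the cutoff is equivalent to dividing each unconditional age probability by the probability of being at or above the cutoff.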

Estimation

To estimate the quantities of interest, we used regression methods that account for unknown functional forms either with kernel weighting or a polynomial series in the age variable—the two most common methods used in the modern RDD literature. The use of these flexible models meant that we could not specify a single polynomial model or a single bandwidth for all the designs and states in the analysis. Instead, we


specified a method of selecting polynomial specifications and bandwidth parameters that was applied uniformly across the designs. In what follows, we describe the general approach to estimation employed with the RCT, posttest RDD, and pretest RDD. Then, we explain the model selection algorithm used to guide our choice of smoothing parameters like bandwidths and polynomial series lengths. The details regarding the bandwidths and polynomial specifications employed in the analysis are reported in the Appendix.3

Estimation in the RCT

We estimated age-specific treatment effects using two methods. First, we estimated local linear regressions of Medicaid expenditures on age separately for the treatment and control groups. Then, we computed age-specific treatment effects as pointwise differences in treatment and control regression functions for each age. To calculate average treatment effects above the cutoff, we weighted these age-specific differences according to the relative frequency distribution of ages among all of the treatment and control observations from each state. We computed the frequency weights separately for each state to account for differences in the age distribution of each state’s study population.
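A local linear regression and the pointwise treatment-control difference might be sketched as below. The triangular kernel, the bandwidth of 10 years, and the simulated data are assumptions made for illustration; the kernel and bandwidths actually used are described in the Appendix rather than here.

```python
import numpy as np

def local_linear(x, y, x0, bandwidth):
    """Local linear estimate of E[y | x = x0] with triangular kernel weights."""
    w = np.clip(1.0 - np.abs((x - x0) / bandwidth), 0.0, None)
    keep = w > 0
    design = np.column_stack([np.ones(keep.sum()), x[keep] - x0])
    root_w = np.sqrt(w[keep])
    # Weighted least squares via a square-root-weight transformation.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y[keep] * root_w, rcond=None)
    return coef[0]  # the intercept is the fitted value at x0

def age_specific_effect(age_t, y_t, age_c, y_c, at_age, bandwidth):
    """Pointwise difference between treatment and control regression functions."""
    return (local_linear(age_t, y_t, at_age, bandwidth)
            - local_linear(age_c, y_c, at_age, bandwidth))

# Simulated (not real) data with a constant treatment effect of 500.
rng = np.random.default_rng(0)
age_t = rng.integers(20, 91, size=3000).astype(float)
age_c = rng.integers(20, 91, size=3000).astype(float)
y_t = 1000 + 50 * age_t + 500 + rng.normal(0, 100, 3000)
y_c = 1000 + 50 * age_c + rng.normal(0, 100, 3000)
effect_at_60 = age_specific_effect(age_t, y_t, age_c, y_c, at_age=60, bandwidth=10)
```

Repeating the calculation at every age and averaging with age-frequency weights gives the above-cutoff average effect described in the text.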

Since many applied researchers prefer to work with flexible polynomial specifications rather than kernel-based regressions, we also estimated ordinary least squares (OLS) regressions of Medicaid expenditures on a polynomial series in age, a treatment group indicator, and interactions between the polynomial series and the treatment indicator for each state. Treatment effect estimates were computed using the coefficients on the treatment indicator and the appropriate interaction terms. Average treatment effects above the cutoff were taken as weighted averages of age-specific differences with weights equal to the relative frequency of each age in the state sample.

Estimation in the Posttest RDD

We estimated treatment effects in the posttest RDDs using both kernel and polynomial series regression methods. To implement the kernel regression approach, we estimated treatment effects at the cutoff using local linear regressions applied separately to the data from above and below the cutoff in each state. Treatment effects at the cutoff were calculated using the difference in the estimates of mean Medicaid expenditures at the cutoff.

To implement the polynomial series methods, we pooled data from above and below the cutoff and estimated OLS regressions of Medicaid expenditures on a polynomial in age, a dummy variable set to 1 for observations above the cutoff, and interactions between the age polynomial series and the cutoff dummy variable. In these posttest RDD analyses, we computed treatment effects only at the cutoff. We did not make extrapolations based on the functional form implied by the polynomial regression coefficients because of the well-known tendency of polynomial series estimates to have very poor out-of-sample properties.
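The pooled polynomial specification can be sketched as follows. Centering age at the cutoff (so that the coefficient on the cutoff dummy is the discontinuity at the cutoff), the quadratic order, and the simulated data are illustrative assumptions, not the specifications selected in the paper.

```python
import numpy as np

def polynomial_rdd(age, y, cutoff, order=2):
    """Effect at the cutoff from a pooled OLS with a polynomial in age,
    a cutoff dummy, and polynomial-by-dummy interactions."""
    x = np.asarray(age, dtype=float) - cutoff      # center age at the cutoff
    above = (x >= 0).astype(float)
    powers = np.column_stack([x ** k for k in range(1, order + 1)])
    design = np.column_stack([np.ones_like(x), powers, above, powers * above[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[order + 1]                         # coefficient on the cutoff dummy

# Simulated (not real) data with a jump of 800 at age 50.
rng = np.random.default_rng(0)
age = rng.integers(20, 91, size=5000).astype(float)
x = age - 50
y = 5000 + 40 * x + 0.5 * x ** 2 + 800 * (age >= 50) + rng.normal(0, 200, 5000)
estimate = polynomial_rdd(age, y, cutoff=50, order=2)
```

Because the interactions let the polynomial differ on each side of the cutoff, the estimate is only read off at the cutoff itself, in line with the caution about out-of-sample polynomial behavior.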

Estimation in the Pretest RDD

The pretest RDD combines pretest and posttest RDD data, and for our purposes the key idea is that information about the relationship between the assignment

3 All appendices are available at the end of this article as it appears in JPAM online. Go to the publisher’s Web site and use the search engine to locate the article at http://www3.interscience.wiley.com/cgi-bin/jhome/34787.


variable and the outcomes during the pretest period may provide a sound basis for extrapolation beyond the assignment cutoff in the posttest time period. To put the idea into practice, we specify a flexible model of the untreated outcome variable in the pretest and posttest periods that accounts for simple nonequivalencies between the two periods. In particular, we consider models in which pretest and posttest untreated outcome regression functions differ by a constant across all ages:

$$Y(0)_{it} = \mathrm{Pre}_{it}\,\theta_P + g(A_i) + \nu_{it}.$$

In this model, $\theta_P$ represents the fixed difference in conditional mean outcomes across the pretest and posttest periods, and $g(\cdot)$ is an unknown smooth function that is assumed to be constant across the two periods. We assume that $E[\nu_{it} \mid \mathrm{Pre}_{it}, A_i] = 0$. In essence, our model assumes that the difference between the mean untreated potential outcome in the pretest and posttest time periods does not vary across subpopulations defined by the assignment variable. The assumption that the time period effect is invariant to the assignment variable is important. It implies that, after adjusting for the constant period effect, the underlying regression relationship between the outcomes and the assignment variable can be recovered across the entire range of the assignment variable in the pretest time period, and then applied to the posttest period. This implication is what makes extrapolation possible.

Similar fixed effect restrictions are widely used in the analysis of longitudinal data (Wooldridge, 2011), though with standard panel data models the assumptions are somewhat stronger than in RDD because such models usually pair a fixed effects assumption with a specific functional form assumption for a vector of time-varying covariates. The point here is that the pretest RDD model is agnostic with respect to the functional form associated with the assignment variable, but it does impose the restriction that the shape of the function does not change across the two time periods except for a change in level that is attributable to the time period effect. Clearly, the accuracy of extrapolations away from the cutoff depends on the validity of the assumption that the time period effect is age invariant. In the next section, we present evidence that this particular assumption is credible in the Cash and Counseling data, so our within-study comparisons represent a test of the performance of the pretest RDD method in a situation where the core assumptions appear plausible. Readers should note, of course, that applying our methods in situations where the constant period effect assumption is implausible would likely lead to very poor performance.

With the basic pretest RDD model of the untreated outcomes defined, we turn to methods for estimating treatment effects using the pretest RDD. The first task is to estimate the untreated outcome regression function. As usual, one approach involves approximating the unknown smooth function, $g(A_i)$, using a polynomial series. For instance, one might specify a $K$th-order polynomial series and estimate model parameters using an OLS regression such as

$$Y(0)_{it} = \mathrm{Pre}_{it}\,\theta_P + \sum_{k=0}^{K} \delta_k A_i^k + \nu_{it}.$$

The equation can be estimated by applying OLS to all of the untreated cases in the sample. The key point is that the untreated sample includes pretest Medicaid expenditures from the full range of ages and also the posttest Medicaid expenditures of people under the design’s age cutoff. In this setting, $\hat{g}(a) = \sum_{k=0}^{K} \hat{\delta}_k a^k$ represents an estimate of $E[Y(0)_{it} \mid \mathrm{Pre}_{it} = 0, A_i = a]$. The extrapolations beyond the cutoff are now made with what might be called partial empirical support. Rather than extrapolate outside the range of the data, extrapolations are made to the posttest outcomes on

Journal of Policy Analysis and Management DOI: 10.1002/pam Published on behalf of the Association for Public Policy Analysis and Management

862 / Methods for Policy Analysis

the support of pretest data under the assumption that estimates of θP are sufficient to account for any between period nonequivalence.
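The polynomial series step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are simulated, age is centered and rescaled for numerical stability, and the names `fit_untreated_polynomial` and `g_hat` are hypothetical helpers rather than anything used in the original analysis.

```python
import numpy as np

def fit_untreated_polynomial(age, pre, y, degree):
    """OLS fit of y = pre*theta_P + sum_k delta_k * age**k + error on untreated rows."""
    # Design matrix columns: pretest indicator, then age**0 ... age**degree.
    X = np.column_stack([pre] + [age ** k for k in range(degree + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]          # theta_P, (delta_0, ..., delta_K)

def g_hat(a, delta):
    """Estimated E[Y(0) | Pre = 0, A = a] implied by the polynomial coefficients."""
    a = np.asarray(a, dtype=float)
    return sum(d * a ** k for k, d in enumerate(delta))

# Simulated data (all numbers are invented for this sketch).
rng = np.random.default_rng(0)
n, cutoff = 2000, 50.0
age = rng.uniform(20, 90, n)
a = (age - cutoff) / 10.0                      # centered, rescaled age
true_g = 10.0 + 1.5 * a + 0.4 * a ** 2
y_pre = true_g - 2.0 + rng.normal(0, 1, n)     # pretest rows: period effect of -2
y_post = true_g + rng.normal(0, 1, n)          # untreated posttest outcome
below = age < cutoff

# Untreated sample: every pretest row plus posttest rows below the cutoff.
A = np.concatenate([a, a[below]])
Pre = np.concatenate([np.ones(n), np.zeros(below.sum())])
Y = np.concatenate([y_pre, y_post[below]])

theta_p, delta = fit_untreated_polynomial(A, Pre, Y, degree=2)
extrapolated = g_hat(2.0, delta)   # untreated posttest mean at age 70 (a = 2)
```

In this simulation the extrapolation at age 70 relies on pretest rows above the cutoff for the shape of the function and on the estimated period effect for its level, which is the sense in which the support is only partial.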

This method provides estimates of the untreated outcome function, but to form treatment effect estimates, we also require estimates of the treated outcome function. An obvious strategy is to estimate polynomial regressions of expenditures on age using posttest data from those sample members who are above the age cutoff. Then, treatment effects can be computed at the cutoff using differences between the fitted values of the treated and untreated regression functions. Average treatment effects among all observations above the cutoff can be formed by computing age-specific treatment effects for each age above the cutoff and then forming a weighted average of these differences based on the relative frequency of the ages above the cutoff.
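As a small worked illustration of the two estimands just described, the sketch below takes two already-fitted regression functions and computes the effect at the cutoff and the frequency-weighted average effect above it. The fitted functions and ages are hypothetical placeholders, and `effects_from_fits` is an illustrative name.

```python
import numpy as np

def effects_from_fits(treated_fn, untreated_fn, cutoff, ages_above):
    """Effect at the cutoff and frequency-weighted average effect above it."""
    at_cutoff = float(treated_fn(cutoff) - untreated_fn(cutoff))
    # Distinct ages above the cutoff and how often each occurs in the sample.
    ages, counts = np.unique(ages_above, return_counts=True)
    age_specific = treated_fn(ages) - untreated_fn(ages)
    weights = counts / counts.sum()
    return at_cutoff, float(np.sum(weights * age_specific))

# Hypothetical fitted functions (stand-ins for estimated regressions).
treated = lambda a: 20.0 + 0.10 * np.asarray(a, dtype=float)
untreated = lambda a: 18.0 + 0.12 * np.asarray(a, dtype=float)

ages_above = np.array([50, 50, 60, 70])
at_cut, ate_above = effects_from_fits(treated, untreated, 50, ages_above)
# Age-specific effects are 1.0, 0.8, 0.6 with weights 0.5, 0.25, 0.25.
```

Here the effect at the cutoff is 1.0 and the weighted average above the cutoff is 0.85, because half of the hypothetical sample sits at age 50.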

A second way to implement the pretest RDD model is to estimate a version of Robinson’s (1988) partial linear regression model. The model exploits the same assumptions that \( g(\cdot) \) is a smooth function and that \( E[\nu_{it} \mid Pre_{it}, A_i] = 0 \), but it also requires a support condition so that the pretest indicator in the parametric component of the model is not a deterministic function of the assignment variable. Formally, the requirement is that \( \mathrm{Var}(Pre_{it} \mid A_i) > 0 \). The support condition fails by definition in the full sample because there are no untreated RDD observations above the cutoff that determines treatment. Our solution is to estimate the parametric period effect using only observations that fall on the common support of the different time periods. In practice, this means that we estimate the period effect using only the below-cutoff data from the two time periods. Then, with estimates of \( \theta_P \) in hand, we estimate the nonparametric component using the full sample of observations both above and below the cutoff value.
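The two-step logic can be sketched as follows. This is a simplified illustration and not the original implementation: it assumes a Gaussian kernel, a single bandwidth for every smoothing step, simulated data, and a double-residual regression for the period effect; `local_linear` and `partial_linear_untreated` are hypothetical names.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of E[y | x = x0], Gaussian kernel, bandwidth h."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    fitted = np.empty(x0.shape)
    for j, c in enumerate(x0):
        w = np.exp(-0.5 * ((x - c) / h) ** 2)
        X = np.column_stack([np.ones_like(x), x - c])
        XtW = X.T * w                      # apply kernel weights to each row
        fitted[j] = np.linalg.solve(XtW @ X, XtW @ y)[0]
    return fitted

def partial_linear_untreated(age, pre, y, cutoff, h):
    """Period effect from the common support, then a smooth function of age."""
    below = age < cutoff
    a_b, p_b, y_b = age[below], pre[below], y[below]
    # Step 1: residualize outcome and pretest indicator on age, regress residuals.
    y_res = y_b - local_linear(a_b, a_b, y_b, h)
    p_res = p_b - local_linear(a_b, a_b, p_b, h)
    theta_p = float(p_res @ y_res / (p_res @ p_res))
    # Step 2: remove the period effect, smooth over all untreated observations.
    adjusted = y - theta_p * pre
    return theta_p, lambda a0: local_linear(a0, age, adjusted, h)

# Simulated untreated sample (numbers invented for the sketch).
rng = np.random.default_rng(2)
n, cutoff = 600, 50.0
base_age = rng.uniform(20, 90, n)
g_true = lambda a: 10.0 + 0.05 * (a - 50.0) + 0.002 * (a - 50.0) ** 2
y_pre = g_true(base_age) - 2.0 + rng.normal(0, 1, n)   # period effect of -2
y_post = g_true(base_age) + rng.normal(0, 1, n)
keep = base_age < cutoff                                # untreated posttest rows

age = np.concatenate([base_age, base_age[keep]])
pre = np.concatenate([np.ones(n), np.zeros(keep.sum())])
y = np.concatenate([y_pre, y_post[keep]])

theta_p, g_hat = partial_linear_untreated(age, pre, y, cutoff, h=5.0)
untreated_at_70 = g_hat(70.0)[0]   # extrapolated untreated posttest mean at age 70
```

Restricting step 1 to ages below the cutoff is what keeps the pretest indicator from being a deterministic function of age, which is the support condition stated in the text.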

We calculated treatment effects at the cutoff using differences between the predicted values from local linear regressions among treated observations from the posttest time period and the predicted values from the partially linear model. We constructed average treatment effects above the cutoff by taking age-specific differences between the predicted values from the two models and weighting them by the relative frequency of each age in each state sample.

The Validity of the Pretest RDD Assumption

A key issue for the analysis of the pretest RDD is the empirical validity of the assumption that the period effect is age-invariant or, equivalently, that the cross-sectional age-expenditure profile in our sample would not have changed over a one-year time horizon in the absence of a treatment effect. We conducted two simple tests of this assumption. First, we computed the pre–post change in Medicaid expenditures for each person in the control group from each state. Figure 1 shows scatter plots of these person-specific time-period effects against age for the three control groups. Because the plots are based only on the control group, the change scores represent pure time effects that have nothing to do with the treatment. The scatter plots reveal no evidence of an age-biased pattern of time-period effects in any state. To test the time-invariant age-effect assumption more formally and using a modeling framework that is closer to the one we use in our analysis, we regressed control group Medicaid expenditures on a cubic function of age, a postperiod indicator variable, and interactions between the age terms and the postindicator separately for each state. The estimated coefficients are reported in Table 3. The coefficients on the interaction terms are not statistically different from zero in any of the states, implying that the functional relationship between age and Medicaid expenditures did not change between the two periods under analysis. Figure 1 and Table 3 suggest that, in our data, the central assumption in the pretest RDD is a reasonable one. Although these empirical results are encouraging for our work, we think researchers are best advised to evaluate the credibility of the pretest RDD assumptions on conceptual and theoretical grounds rather than through a program of statistical testing. In the application at hand, we think the assumption is plausible because it is simply unlikely that the cross-sectional relationship between age and medical expenditures changes much over a one-year time horizon.

[Figure 1 image: three scatter-plot panels (Arkansas, Florida, New Jersey). Y-axis: change in Medicaid expenditures (−50,000 to 100,000). X-axis: baseline age (20 to 100).]

Note: The graph plots within-person expenditure changes against age in the three experimental control groups. The graphs are consistent with the assumption of a small period effect that affected each age group in the same way.

Figure 1. One-Year Changes in Expenditures by Age in the Control Groups.
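The regression-based check described in this section, a cubic in age fully interacted with a postperiod indicator, might be set up along the following lines. The sketch uses simulated data in which the age profile truly does not change, classical homoskedastic standard errors, and centered age; these are simplifying assumptions, and `ols_with_se` is a hypothetical helper. With repeated observations on the same people, clustered standard errors would be a natural refinement.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

# Simulated control-group data with an age-invariant period effect of 1.2.
rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(20, 90, n)
a = (age - 55.0) / 10.0                      # centered, rescaled age
post = rng.integers(0, 2, n).astype(float)   # 1 = posttest period
y = 15.0 + 1.0 * a - 0.3 * a**2 + 0.05 * a**3 + 1.2 * post + rng.normal(0, 2, n)

poly = np.column_stack([a, a**2, a**3])
# Columns: intercept, age terms (1-3), post (4), age-by-post interactions (5-7).
X = np.column_stack([np.ones(n), poly, post, poly * post[:, None]])
beta, se = ols_with_se(X, y)
t_interactions = beta[5:] / se[5:]   # t-statistics on the interaction terms
```

Interaction t-statistics near zero are what the age-invariance assumption predicts; large values would indicate that the shape of the age profile shifted between periods.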

Although statistical tests and graphical evidence are always welcome, it is important to note that in real-world applications researchers will only be able to conduct such an analysis using data from the region of the assignment variable that falls below the cutoff. For instance, if the cutoff was set at age 50, then a researcher would only be able to inspect the change scores for people under age 50. If the assumption that the period effect did not vary with age seemed reasonable below the cutoff, the researcher would still be forced to accept the additional assumption that the invariance assumption continued to hold above the cutoff.

Procedures for Choosing Smoothing Parameters

Each of the methods described above depends on assumptions about the degree of smoothing to allow across the different age groups. In the local linear regressions, smoothing is controlled by a kernel function and a bandwidth parameter. In the partially linear model, a separate bandwidth is required for the two preliminary regressions and for the ultimate residualized regression model. And in the polynomial series regressions, the amount of smoothing is determined by the degree of the polynomial function. The point of these flexible modeling approaches is to allow the data to determine the functional form specification; however, some arbitrariness is inevitably associated with selecting these smoothing parameters, and so a model selection protocol needs to be specified in advance of data analysis.

Table 3. A regression-based test of whether the age–expenditure relationship in the control group differed in the pretest and posttest samples.

| Variable | Statistic | Arkansas | New Jersey | Florida |
|---|---|---|---|---|
| Age | Coefficient | 175.54 | 1,297.3 | 147.22 |
| | SE | 592.36 | 722.52 | 491.32 |
| | P | 0.77 | 0.07 | 0.76 |
| Age squared | Coefficient | −5.86 | −23.06 | −4.56 |
| | SE | 9.77 | 13.14 | 9.13 |
| | P | 0.55 | 0.08 | 0.62 |
| Age cubed | Coefficient | 0.04 | 0.12 | 0.03 |
| | SE | 0.05 | 0.07 | 0.05 |
| | P | 0.47 | 0.11 | 0.6 |
| Post | Coefficient | 1,300.71 | −8,718.18 | 11,410.03 |
| | SE | 6,336.86 | 9,516.4 | 5,005.54 |
| | P | 0.84 | 0.36 | 0.02 |
| Age × post | Coefficient | −106.59 | 417.28 | −373.14 |
| | SE | 353.41 | 580.57 | 331.67 |
| | P | 0.76 | 0.47 | 0.26 |
| Age squared × post | Coefficient | 3.46 | −6 | 5.38 |
| | SE | 6.05 | 10.63 | 6.39 |
| | P | 0.57 | 0.57 | 0.4 |
| Age cubed × post | Coefficient | −0.03 | 0.03 | −0.03 |
| | SE | 0.03 | 0.06 | 0.04 |
| | P | 0.42 | 0.61 | 0.5 |
| Intercept | Coefficient | 9,849.19 | −700.2 | 15,545.24 |
| | SE | 11,173.13 | 12,119.55 | 7,775.39 |
| | P | 0.38 | 0.95 | 0.05 |
| N | | 2,008 | 1,738 | 1,812 |
| R2 | | 0.074 | 0.018 | 0.052 |

To this end, we selected bandwidth parameters by using least-squares cross-validation to evaluate a grid of candidate bandwidths ranging from 1 to 90 years in width. We then inspected the function produced by using the bandwidth that minimized the cross-validation statistic.4 When visual inspection revealed that the bandwidth chosen by cross-validation led to a function that appeared jagged and undersmoothed, we increased the bandwidth to produce a more regular function. Details about the selected bandwidth for each of the research designs are in the Appendix.5
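The data-driven part of this bandwidth search can be illustrated with a short leave-one-out sketch. It assumes a Gaussian kernel, a coarse illustrative grid rather than the full 1 to 90 year grid, and simulated data; the subsequent visual check for undersmoothing is a judgment call that code like this does not capture. `loo_cv_score` is a hypothetical name.

```python
import numpy as np

def loo_cv_score(x, y, h):
    """Mean squared leave-one-out prediction error of a local linear fit."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # drop observation i
        xi, yi = x[keep], y[keep]
        w = np.exp(-0.5 * ((xi - x[i]) / h) ** 2)
        X = np.column_stack([np.ones_like(xi), xi - x[i]])
        XtW = X.T * w
        # Intercept of the weighted fit is the prediction at x[i].
        errors[i] = y[i] - np.linalg.solve(XtW @ X, XtW @ yi)[0]
    return float(np.mean(errors ** 2))

# Simulated nonlinear age profile (invented for the sketch).
rng = np.random.default_rng(4)
age = rng.uniform(20, 90, 300)
spending = 3.0 * np.sin(age / 8.0) + rng.normal(0, 1, 300)

grid = np.array([2.0, 5.0, 10.0, 20.0, 40.0, 90.0])   # candidate bandwidths (years)
scores = np.array([loo_cv_score(age, spending, h) for h in grid])
h_star = float(grid[np.argmin(scores)])               # cross-validated bandwidth
```

Because the simulated profile is strongly nonlinear, very wide bandwidths behave like a global linear fit and score poorly, so the selected bandwidth falls at the narrow end of the grid.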

4 The cross-validation statistic we worked with is the mean squared out-of-sample prediction error formed by predicting the value of each observation when it is left out of the estimation.
5 All appendices are available at the end of this article as it appears in JPAM online. Go to the publisher’s Web site and use the search engine to locate the article at http://www3.interscience.wiley.com/cgi-bin/jhome/34787.

To choose a polynomial functional form, we used least-squares cross-validation to evaluate a set of candidate models that included linear, quadratic, cubic, and quartic polynomial functions, and also models that fully interacted the polynomial terms with a treatment-group indicator variable. In the within-study comparisons, we always worked with the polynomial models that minimized the cross-validation function. That is, we conducted the cross-validation for each of the candidate specifications and chose the specification that produced the smallest mean square error of out-of-sample predictions. The specific functional forms used in each part of the within-study comparison are reported in the Appendix.6
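For polynomial candidates, the leave-one-out statistic does not require refitting the model once per observation: for OLS, the leave-one-out residual equals the ordinary residual divided by one minus the observation's leverage. The sketch below uses that identity on simulated data. It covers only the linear through quartic candidates and omits the specifications interacted with a treatment-group indicator; `loo_cv_polynomial` is a hypothetical name.

```python
import numpy as np

def loo_cv_polynomial(x, y, degree):
    """Exact leave-one-out mean squared prediction error for a polynomial OLS fit."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x**degree
    Q, _ = np.linalg.qr(X)                          # orthonormal basis for the columns
    leverage = np.sum(Q ** 2, axis=1)               # diagonal of the hat matrix
    resid = y - Q @ (Q.T @ y)                       # ordinary OLS residuals
    return float(np.mean((resid / (1.0 - leverage)) ** 2))

# Simulated data with a quadratic truth (invented for the sketch).
rng = np.random.default_rng(3)
x = rng.uniform(-3, 4, 400)                         # centered, rescaled age
y = 10 + 1.5 * x + 0.4 * x**2 + rng.normal(0, 1, 400)

scores = {d: loo_cv_polynomial(x, y, d) for d in (1, 2, 3, 4)}
best_degree = min(scores, key=scores.get)           # smallest out-of-sample error
```

In this simulation the linear specification is clearly penalized, while the quadratic and higher-order candidates score similarly, which mirrors the kind of comparison the selection rule is designed to make.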

The need to choose smoothing parameters is one area of RDD that seems always to leave room for investigator manipulation. We worked hard to define a selection procedure that was separate from the within-study comparison component. Importantly, we did not assess the performance of the alternative estimators until after settling on the bandwidths and polynomial series lengths. It is also worth noting that the bandwidth-dependent methods (which supplemented the data-driven cross-validation procedure with more subjective assessments of undersmoothing) led to treatment effect estimates that were very similar to the effects estimated by the polynomial series methods (which relied exclusively on the data-driven cross-validation procedure). The consistency between the two sets of results gives us some confidence that the bandwidth selection procedure was not an important determinant of our results. We also conducted a small sensitivity analysis to better understand the extent to which our results were dependent on the chosen bandwidths. We reestimated the RCT benchmark estimates that were used throughout the analysis using bandwidths that were half the size of our preferred bandwidth in each state. Across the 18 RCT benchmark parameters of interest, the average difference in point estimates between the preferred bandwidth models and the half-sized bandwidths was $11, and the average difference in standard errors was −$90. We pursued a similar exercise with the pretest RDD by reducing the bandwidth used in the residualized regression stage of the partially linear model to half of our preferred size. Here, we found that the average difference in point estimates between the results produced using our preferred bandwidth and the half-sized bandwidths was −$72 and the average difference in estimated standard errors was $21. This small sensitivity analysis suggests that our main results are unlikely to depend heavily on bandwidth parameters within a reasonable range of what we ultimately considered optimal.

Estimating Standard Errors

To ensure comparability across the different designs, we used a nonparametric bootstrap to estimate standard errors for all treatment effect estimates. We always used 500 bootstrap replications. Point estimates were recalculated for each replicate, and the standard deviation of the point estimates across the 500 replicates was used as the bootstrap estimate of the standard error. Bandwidths, polynomial functional forms, and relative frequency weights for computing above-the-cutoff averages were fixed across bootstrap replicates. In the pretest RDD designs, we resampled individual participants rather than individual observations to account for within-person dependencies in the error structure.
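Resampling participants rather than rows can be sketched as below. The example is a generic cluster bootstrap on simulated two-period data with a simple mean as the estimator; the function name `cluster_bootstrap_se`, the seed, and the use of the sample standard deviation (`ddof=1`) are assumptions of this illustration.

```python
import numpy as np

def cluster_bootstrap_se(person_id, values, estimator, n_reps=500, seed=0):
    """Bootstrap SE that resamples whole participants, keeping their rows together."""
    rng = np.random.default_rng(seed)
    ids = np.unique(person_id)
    rows_by_id = {pid: np.flatnonzero(person_id == pid) for pid in ids}
    estimates = np.empty(n_reps)
    for b in range(n_reps):
        drawn = rng.choice(ids, size=len(ids), replace=True)   # sample people
        rows = np.concatenate([rows_by_id[pid] for pid in drawn])
        estimates[b] = estimator(values[rows])                 # recompute estimate
    return float(np.std(estimates, ddof=1)), estimates

# Simulated panel: 200 people, two periods each, with a shared person effect.
rng = np.random.default_rng(5)
n_people = 200
person_effect = rng.normal(0, 1, n_people)
person_id = np.repeat(np.arange(n_people), 2)
values = np.repeat(person_effect, 2) + rng.normal(0, 0.5, 2 * n_people)

se, reps = cluster_bootstrap_se(person_id, values, np.mean)
```

Drawing both of a person's observations together preserves the within-person correlation in each replicate, whereas resampling rows independently would tend to understate the standard error in data like these.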

Measuring Performance

It is straightforward to compare the point estimates and standard errors from the posttest RDD and pretest RDD to those of the RCTs, but to draw conclusions across the different treatment effect parameters in each design, we also examined two standardized performance measures. The first is a measure of the standardized bias of the quasi-experimental point estimate. Let \( \pi_q \) be the point estimate of a given parameter produced by quasi-experimental estimator q. And let \( \pi_{RCT} \) be the point estimate of the same parameter produced by the RCT. Finally, let \( \sigma_{RCT} \) be the standard deviation of posttest Medicaid expenditures observed in the RCT. The standardized bias measure that we worked with is \( SB_q = (\pi_q - \pi_{RCT}) \times \frac{1}{\sigma_{RCT}} \). Essentially, \( SB_q \) measures the magnitude of the bias in a particular quasi-experimental estimate in standard deviation units. We computed this measure for each parameter estimated with each cutoff age, state, and research design. It provides a uniform account of the size of the bias across different causal parameters and research designs, but it does not incorporate any information about the statistical precision of the different estimates.

6 All appendices are available at the end of this article as it appears in JPAM online. Go to the publisher’s Web site and use the search engine to locate the article at http://www3.interscience.wiley.com/cgi-bin/jhome/34787.

The second performance measure combines bias and variance estimates in a mean squared error framework. To compute the mean square error statistic, we centered the point estimates from each bootstrap replicate around the experimental benchmark. Then, we squared these deviations and computed the average of the squared deviations across the 500 bootstrap replicates. Formally, the statistic we work with is

\[
MSE(\pi_q) = \frac{1}{B} \sum_{b=1}^{B} \left( \pi_q^{(b)} - \pi_{RCT} \right)^2,
\]

where b = 1 . . . B indexes the bootstrap replicates. To keep the scale of the statistic interpretable in dollar terms, we report the square root of the mean squared error, the root mean square error (RMSE), in the results. The basic idea is that any particular estimate of the treatment effect will differ from the RCT benchmark because of both bias and statistical imprecision. The RMSE statistics measure the size of the typical error associated with a given estimation technique by using the RCT to form a benchmark and the bootstrap replications to evaluate variability. Estimators with smaller levels of error provide answers that are closer to the truth on average than estimators with higher levels of error, and the RMSE statistic formalizes this logic. As a performance metric, the RMSE statistic gives equal weight to improvements in correspondence of the quasi-experimental estimates with the RCT benchmarks that come from changes in both the bias and the variance of the estimator. Of course, some researchers may value improvements in correspondence that come from bias reduction differently from improvements that come from variance reduction. In presenting the results, we take care to present estimates of standardized bias, standard errors, and RMSE for each research design so that readers can form their own conclusions about performance.
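The two performance measures are simple to compute once the bootstrap replicates are available. The sketch below uses a handful of invented numbers and also checks the standard decomposition of the squared RMSE into squared bias (of the bootstrap mean relative to the benchmark) plus bootstrap variance; the function names are hypothetical.

```python
import numpy as np

def standardized_bias(pi_q, pi_rct, sigma_rct):
    """Difference from the RCT benchmark in RCT outcome standard deviation units."""
    return (pi_q - pi_rct) / sigma_rct

def rmse_around_benchmark(boot_estimates, pi_rct):
    """Root mean squared deviation of bootstrap estimates from the RCT benchmark."""
    boot_estimates = np.asarray(boot_estimates, dtype=float)
    return float(np.sqrt(np.mean((boot_estimates - pi_rct) ** 2)))

# Invented numbers for illustration only.
boot = np.array([900.0, 1100.0, 1300.0, 1500.0])   # replicate estimates (dollars)
pi_rct = 1000.0                                     # RCT benchmark (dollars)
sigma_rct = 4000.0                                  # SD of posttest outcome in the RCT

sb = standardized_bias(boot.mean(), pi_rct, sigma_rct)
rmse = rmse_around_benchmark(boot, pi_rct)

# Decomposition: MSE = (mean of replicates - benchmark)^2 + variance of replicates.
bias_sq = (boot.mean() - pi_rct) ** 2
variance = boot.var()          # ddof=0 so the identity holds exactly
```

With these numbers the RMSE is 300 dollars, made up of a squared bias of 40,000 and a variance of 50,000, which shows how the statistic weights the two sources of error equally.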

RESULTS

Causal Benchmarks from the RCT

Figures 2 to 4 plot estimates of average Medicaid expenditures by age in the treatment and control groups from each state across all ages. Each graph includes estimates based on the polynomial series estimator and the local linear regressions, and the two estimation approaches yield very similar results. It is also clear that the expenditure-age profile varies across the states: The relationship is linear in Florida, highly nonlinear in New Jersey, and somewhat nonlinear in Arkansas.

Age-specific treatment effects can be constructed by forming point-wise differences between the treatment and control group regression functions. Table 4 reports RCT estimates of treatment effects for each of the three age cutoffs and also for the total group above each age cutoff. We treat the RCT point estimates as unbiased benchmarks, so the standardized bias metric is not reported in Table 4. Note that the RMSE estimates allowed for the possibility that the average of the bootstrap point estimates differed from the original point estimates due to finite sample bias. This means that the RMSE statistics are not theoretically identical to the standard error of the effect estimates. However, as might be expected, there was very little evidence of finite sample bias in the estimates from the RCT data, so the RMSE statistics for the RCT benchmarks were essentially identical to the bootstrap SEs. To avoid reporting a redundant column of results, we report the RMSE and SE statistics for the RCT benchmarks in a single column in the tables throughout the paper.

[Figure 2 image, Arkansas: posttest Medicaid expenditures (y-axis, 5,000 to 25,000) against baseline age (x-axis, 20 to 100). Series: control LLR, treated LLR, control polynomial, treated polynomial.]

Note: The graph plots local linear regression and polynomial series regression estimates of the average posttest Medicaid expenditures by age for RCT treatment and control participants in Arkansas. These estimates form the basis for the causal benchmarks used in the within-study comparisons.

Figure 2. Benchmark Estimates from the RCT in Arkansas.

[Figure 3 image, New Jersey: posttest Medicaid expenditures (y-axis, 16,000 to 24,000) against baseline age (x-axis, 20 to 100). Series: control LLR, treated LLR, control polynomial, treated polynomial.]

Note: The graph plots local linear regression and polynomial series regression estimates of the average posttest Medicaid expenditures by age for RCT treatment and control participants in New Jersey. These estimates form the basis for the causal benchmarks used in the within-study comparisons.

Figure 3. Benchmark Estimates from the RCT in New Jersey.

[Figure 4 image, Florida: posttest Medicaid expenditures (y-axis, 10,000 to 30,000) against baseline age (x-axis, 20 to 100). Series: control LLR, treated LLR, control polynomial, treated polynomial.]

Note: The graph plots local linear regression and polynomial series regression estimates of the average posttest Medicaid expenditures by age for RCT treatment and control participants in Florida. These estimates form the basis for the causal benchmarks used in the within-study comparisons.

Figure 4. Benchmark Estimates from the RCT in Florida.

Three things stand out in Table 4. First, the polynomial and local linear estimates are very similar in all three states. Second, the age-specific RCT treatment effects have large RMSE scores—an expected finding since the Cash and Counseling RCT was not designed to estimate treatment effects in one-year age brackets. And third, the estimates of average effects across all participants older than the cutoff are more stable than estimates at the cutoff, particularly for the younger age cutoffs where most observations fall above the cutoff. For the age 70 cutoff, the RMSE statistics are almost the same at the cutoff as above it.

Treatment Effects at the Cutoff

Table 5 compares the performance of the three research designs at the cutoff in terms of standardized bias, standard error, and RMSE as described earlier.7 Both the posttest RDD and the pretest RDD performed quite well in terms of standardized bias at the cutoff. For instance, 12 of the 18 posttest RDD estimates had absolute standardized bias of less than 0.2 standard deviations. In comparison, 13 of the 18 pretest RDD estimates were biased by less than 0.2 standard deviations. Table 5 suggests that both the posttest RDD and the pretest RDD performed better in terms of bias in within-study comparisons with older age cutoffs where the sample size was more balanced and more concentrated near the cutoff.

7 The standardized bias measure is suppressed for the RCT because it serves as the causal benchmark so that the standardized bias is always zero.


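Table 4. Benchmark treatment effect estimates at and above the cutoff for each design type based on the RCT.

Panel A: RCT benchmarks for designs with a cutoff at age 35

| State | Estimation method | ATE at age 35: point estimate | ATE at age 35: RMSE/SE | ATE above age 35: point estimate | ATE above age 35: RMSE/SE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 2,980 | 1,334 | 1,703 | 258 |
| New Jersey | Polynomial series | 3,202 | 1,469 | 622 | 718 |
| Florida | Polynomial series | 3,329 | 1,590 | 788 | 729 |
| Arkansas | Local linear regression | 2,738 | 1,331 | 1,772 | 256 |
| New Jersey | Local linear regression | 3,529 | 1,594 | 755 | 724 |
| Florida | Local linear regression | 3,456 | 1,139 | 741 | 562 |

Panel B: RCT benchmarks for designs with a cutoff at age 50

| State | Estimation method | ATE at age 50: point estimate | ATE at age 50: RMSE/SE | ATE above age 50: point estimate | ATE above age 50: RMSE/SE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 1,467 | 616 | 1,671 | 250 |
| New Jersey | Polynomial series | 822 | 1,155 | 387 | 728 |
| Florida | Polynomial series | 2,034 | 978 | 357 | 795 |
| Arkansas | Local linear regression | 1,547 | 887 | 1,752 | 239 |
| New Jersey | Local linear regression | 611 | 1,235 | 529 | 727 |
| Florida | Local linear regression | 2,191 | 774 | 224 | 604 |

Panel C: RCT benchmarks for designs with a cutoff at age 70

| State | Estimation method | ATE at age 70: point estimate | ATE at age 70: RMSE/SE | ATE above age 70: point estimate | ATE above age 70: RMSE/SE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 1,134 | 410 | 1,872 | 249 |
| New Jersey | Polynomial series | −84 | 902 | 562 | 786 |
| Florida | Polynomial series | 739 | 732 | 0 | 901 |
| Arkansas | Local linear regression | 1,294 | 335 | 1,934 | 238 |
| New Jersey | Local linear regression | 466 | 788 | 679 | 843 |
| Florida | Local linear regression | 625 | 569 | −180 | 677 |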

To facilitate comparisons across the designs, the first panel of Figure 5 plots the absolute standardized bias of the pretest and posttest RDD estimates for each of the 18 within-study comparisons. The dashed 45° line in the graph marks equality of bias in the two designs. The circles above the line are within-study comparisons in which the pretest RDD was more biased than the posttest RDD. The circles below the line are comparisons in which the pretest RDD had lower bias. The results are distributed around the 45° line, which shows that in some cases the pretest RDD reduced bias slightly and in other cases it increased bias slightly. One important point, however, is that most of the points in the graph are in the bottom left corner, which makes it clear that both the posttest RDD and the pretest RDD produce estimates of the treatment effect at the cutoff with very little bias. Since both the RCT and non-RCT parameters are estimated with error, and exact point correspondences are therefore very unlikely, these results confirm conclusions from other within-study comparisons that the usual posttest-only RDD provides estimates that are quite close to the results from a comparable RCT (Cook, Shadish, & Wong, 2008; Green et al., 2009; Shadish et al., 2011). They also suggest that efforts to supplement the posttest RDD with data from a pretest time period do not lead to much in the way of additional bias.

Table 5. Performance of the posttest and pretest RDD at the cutoff.

Panel A: Age 35 cutoff

| State | Estimation method | RCT RMSE/SE | Posttest RDD SE | Posttest RDD SD bias | Posttest RDD RMSE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|---|---|---|
| Arkansas | Polynomial series | 2,980 | 2,583 | 0.49 | 4,573 | 2,949 | −0.07 | 3,019 |
| New Jersey | Polynomial series | 3,202 | 2,338 | 0.24 | 5,391 | 2,120 | 0.21 | 4,962 |
| Florida | Polynomial series | 3,329 | 3,715 | −0.12 | 4,388 | 3,313 | −0.22 | 5,351 |
| Arkansas | Local linear regression | 2,738 | 3,563 | 0.45 | 4,804 | 2,114 | 0.01 | 2,150 |
| New Jersey | Local linear regression | 3,529 | 5,010 | 0.2 | 6,629 | 1,634 | 0.15 | 1,546 |
| Florida | Local linear regression | 3,456 | 4,159 | −0.06 | 4,299 | 3,226 | −0.15 | 3,338 |

Panel B: Age 50 cutoff

| State | Estimation method | RCT RMSE/SE | Posttest RDD SE | Posttest RDD SD bias | Posttest RDD RMSE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|---|---|---|
| Arkansas | Polynomial series | 616 | 1,267 | 0.04 | 1,293 | 941 | 0.99 | 7,563 |
| New Jersey | Polynomial series | 1,155 | 3,676 | −0.1 | 4,355 | 2,292 | 0.11 | 3,226 |
| Florida | Polynomial series | 978 | 3,359 | 0.25 | 6,035 | 2,743 | 0.22 | 5,073 |
| Arkansas | Local linear regression | 887 | 2,300 | 0.32 | 3,340 | 2,018 | 0.51 | 1,889 |
| New Jersey | Local linear regression | 1,235 | 3,689 | −0.03 | 3,778 | 2,662 | 0.05 | 2,695 |
| Florida | Local linear regression | 774 | 4,705 | 0.06 | 4,863 | 3,943 | −0.04 | 3,746 |

Panel C: Age 70 cutoff

| State | Estimation method | RCT RMSE/SE | Posttest RDD SE | Posttest RDD SD bias | Posttest RDD RMSE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|---|---|---|
| Arkansas | Polynomial series | 410 | 894 | −0.03 | 911 | 536 | 0.06 | 721 |
| New Jersey | Polynomial series | 902 | 2,322 | 0.04 | 2,455 | 1,937 | 0.04 | 2,093 |
| Florida | Polynomial series | 732 | 1,785 | −0.07 | 2,357 | 1,387 | −0.17 | 3,703 |
| Arkansas | Local linear regression | 335 | 857 | −0.06 | 976 | 521 | 0.07 | 532 |
| New Jersey | Local linear regression | 788 | 2,377 | 0.01 | 2,378 | 1,645 | 0.08 | 1,676 |
| Florida | Local linear regression | 569 | 2,642 | 0.08 | 2,968 | 2,223 | −0.02 | 2,196 |

As for standard error comparisons, Table 5 shows that the standard errors of both the pretest and posttest RDD estimates are larger than the RCT standard errors. In theory, the age-specific treatment effects from the RCT are estimated more precisely partly because the way we constructed the RDD reduced the sample size relative to the RCT, and partly because the RDD assignment rules induce a correlation between the age variable and the treatment variable (Shochet, 2009; Goldberger, 1972a, 1972b). Table 5 also shows that the pretest RDD estimates have smaller standard errors than the posttest RDD in all but one within-study comparison. The efficiency gains from the pretest RDD can arise because of the larger sample sizes and also because the pretest RDD assignment rule reduces the correlation between the age variable and the treatment variable so that variance inflation from multicollinearity may be less in the pretest RDD than in the posttest RDD. All of the designs are more precise in within-study comparisons with older age cutoffs because the observations are more densely distributed near the cutoff.

[Figure 5 image: two scatter-plot panels, each labeled "pretest RDD vs. RDD" with a dashed 45° line. Left panel, "Standardized bias at the cutoff": pretest RDD absolute standardized bias (y-axis, 0 to 1) against posttest RDD absolute standardized bias (x-axis, 0 to 1). Right panel, "Mean squared error at the cutoff": pretest RDD RMSE (y-axis, 0 to 8,000) against posttest RDD RMSE (x-axis, 0 to 8,000).]

Note: The graphs plot measures of the performance of the pretest and posttest RDD estimators across the different within-study comparisons in our analysis. Standardized bias is shown in the left panel and RMSE statistics are in the right panel. The y-axis measures the performance of the pretest RDD strategy. The x-axis measures performance in the corresponding posttest RDD. The dashed 45° line marks points at which the two designs have equal performance. Points that fall above the 45° line represent within-study comparisons in which the posttest RDD outperformed the pretest RDD.

Figure 5. Comparative Performance of the Pretest and Posttest RDD at the Cutoff.

One reaction to the idea of supplementing the RDD with pretest data is that researchers may face a trade-off between the efficiency and extrapolation gains from the new data and the possibility that bias will arise because the stronger identifying assumptions may fail to hold. The mean squared error criterion provides a way of gauging the net effect of changes in the bias and variance of estimates that arise from different estimation strategies. The RMSE statistics that are reported in Table 5 measure the size of the average error in dollars that is associated with the posttest RDD and pretest RDD estimates of the treatment effect at the cutoff. In the RCT, the RMSE ranges from $335 at the oldest age cutoff to around $3,500 at the youngest age cutoff. The RMSE of the posttest RDD is larger than the corresponding RCT estimate in all 18 within-study comparisons. In contrast, the pretest RDD actually has a slightly lower RMSE than the RCT in the three within-study comparisons in which the cutoff was set at age 35 and local linear regression was used for estimation. In general, all three designs had less error when the cutoff was fixed at older ages.

The right-hand panel of Figure 5 compares the RMSE statistic from the pretest RDD and posttest RDD in each of the within-study comparisons. As before, the 45° line marks points at which the two designs have the same average error. The circles above the line indicate comparisons where the pretest RDD had more error on average than the posttest RDD, and the circles below the line are from comparisons in which the pretest RDD more reliably replicated the RCT benchmark than the posttest RDD. In the majority of within-study comparisons, supplementing the posttest RDD with data from the pretest time period improved the correspondence between the quasi-experiment and the RCT in terms of estimating the treatment effect at the cutoff. Overall, our within-study comparisons provide considerable support for efforts to supplement the standard posttest RDD with data from a pretest time period. Adding the pretest led to only small changes in bias that would be of little substantive interest. And the results from the RMSE statistics suggest that the reductions in variance swamp the changes in bias in most cases so that adding the pretest leads to better correspondence with the RCT.

Table 6. Performance of the pretest RDD at extrapolation above the cutoff.

Panel A: Age 35 cutoff

| State | Estimation method | RCT RMSE/SE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 258 | 1,312 | 0.19 | 1,859 |
| New Jersey | Polynomial series | 718 | 1,394 | 0.2 | 4,351 |
| Florida | Polynomial series | 729 | 841 | −0.11 | 2,344 |
| Arkansas | Local linear regression | 256 | 1,096 | 0.07 | 1,201 |
| New Jersey | Local linear regression | 724 | 1,181 | 0.19 | 4,056 |
| Florida | Local linear regression | 562 | 704 | −0.09 | 1,961 |

Panel B: Age 50 cutoff

| State | Estimation method | RCT RMSE/SE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 250 | 877 | 0.11 | 1,203 |
| New Jersey | Polynomial series | 728 | 1,176 | 0.09 | 2,349 |
| Florida | Polynomial series | 795 | 738 | −0.09 | 1,920 |
| Arkansas | Local linear regression | 239 | 709 | 0.05 | 838 |
| New Jersey | Local linear regression | 727 | 1,002 | 0.13 | 2,890 |
| Florida | Local linear regression | 604 | 704 | −0.08 | 1,755 |

Panel C: Age 70 cutoff

| State | Estimation method | RCT RMSE/SE | Pretest RDD SE | Pretest RDD SD bias | Pretest RDD RMSE |
|---|---|---|---|---|---|
| Arkansas | Polynomial series | 249 | 469 | 0.09 | 822 |
| New Jersey | Polynomial series | 786 | 933 | 0.14 | 3,116 |
| Florida | Polynomial series | 901 | 891 | −0.05 | 1,461 |
| Arkansas | Local linear regression | 238 | 449 | 0.04 | 544 |
| New Jersey | Local linear regression | 843 | 886 | 0.12 | 2,644 |
| Florida | Local linear regression | 677 | 640 | −0.04 | 992 |

Extrapolation Beyond the Cutoff

Adding the pretest does more than improve correspondence with the RCT at the cutoff: It also provides a way to extrapolate beyond the cutoff subpopulation. Table 6 reports the performance of the estimates of the average treatment effect among all subjects above the cutoff based on the pretest RDD extrapolations. The results are again presented in terms of standardized bias, standard errors, and RMSE statistics from each of the within-study comparisons. Since no theorists argue that causal estimates beyond the cutoff are warranted for the posttest RDD, discussion is limited below to comparison of results from the pretest RDD and the RCT.

All of the extrapolations had standardized bias of less than 0.2 standard deviations. Indeed, the median standardized bias of the estimates across all 18 within-study comparisons was only 0.09 standard deviations. Even within the narrow range of standardized bias estimates we observed, the extrapolations beyond the cutoff were usually less biased in the comparisons with older age cutoffs. This makes sense because less extrapolation is required when the cutoff is fixed at an older age. The main finding, though, is how consistently low the pretest RDD bias estimates are above the cutoff.

The standard errors of the extrapolated treatment effects are larger in the pretest RDD than in the RCT, particularly for the youngest age cutoff. For instance, on average, the standard errors from the pretest RDD are about 2.5 times as large as the standard errors from the RCT when the cutoff is set at age 35. But the average ratio is only 1.9 when the cutoff is set at 50 and only 1.3 when the cutoff is set at 70. The RMSE of the extrapolations based on the pretest RDD follows a similar pattern.
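The average ratios quoted above can be approximately recovered from the standard error columns of Table 6. The short Python sketch below is only an illustration: the variable and function names are ours, and it assumes that the reported figure is the mean of the six state-by-method ratios at each cutoff (rather than, say, a ratio of mean standard errors).

```python
from statistics import mean

# (RCT SE, pretest RDD SE) pairs transcribed from Table 6, ordered as
# AR, NJ, FL for the polynomial series, then AR, NJ, FL for local linear regression.
se_pairs = {
    35: [(258, 1312), (718, 1394), (729, 841), (256, 1096), (724, 1181), (562, 704)],
    50: [(250, 877), (728, 1176), (795, 738), (239, 709), (727, 1002), (604, 704)],
    70: [(249, 469), (786, 933), (901, 891), (238, 449), (843, 886), (677, 640)],
}


def mean_se_ratio(pairs):
    """Average of (pretest RDD SE / RCT SE) across the comparisons at one cutoff."""
    return mean(rdd_se / rct_se for rct_se, rdd_se in pairs)


for cutoff, pairs in se_pairs.items():
    print(f"cutoff {cutoff}: mean SE ratio = {mean_se_ratio(pairs):.2f}")
```

Under this assumption the three averages fall close to the 2.5, 1.9, and 1.3 reported in the text, declining as the cutoff moves to older ages.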

Evaluating the performance of the pretest RDD at extrapolations beyond the cutoff is difficult because there is no uniform measure of what constitutes good performance. One line of thinking goes as follows. There is a growing consensus among applied researchers that the posttest RDD provides high-quality estimates of treatment effects at the cutoff. If we accept that the performance of the RDD is a reasonable standard of good performance, then we can compare the properties of the pretest RDD extrapolations to the properties of the posttest RDD estimates at the cutoff. Across the 18 within-study comparisons presented here, the average standardized bias at the cutoff in the posttest RDD estimates was about 0.095 standard deviations. In contrast, the average standardized bias of the extrapolations away from the cutoff based on the pretest RDD was only 0.05 standard deviations. If the posttest RDD is lauded as an estimator with an acceptably low level of bias, then these extrapolations beyond the cutoff seem to meet an even higher standard of fidelity to the RCT results. Similarly, the average RMSE was about $3,655 across all of the posttest RDD estimates at the cutoff, and it was only $2,017 across all of the pretest RDD extrapolations beyond the cutoff. These arguments provide some support for our claim that the extrapolations beyond the cutoff are high quality relative to the standards that researchers apply to quasi-experimental research.
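The summary figures for the pretest RDD extrapolations can be checked against the 18 rows of Table 6. The Python sketch below uses our own variable names and rests on two assumptions that the text does not state explicitly: that the average standardized bias is a signed average, and that the median is taken over absolute values.

```python
from statistics import mean, median

# Standardized bias and RMSE for the pretest RDD, transcribed from Table 6
# (Panels A, B, C; within each panel: AR, NJ, FL polynomial, then AR, NJ, FL local linear).
sd_bias = [0.19, 0.20, -0.11, 0.07, 0.19, -0.09,
           0.11, 0.09, -0.09, 0.05, 0.13, -0.08,
           0.09, 0.14, -0.05, 0.04, 0.12, -0.04]
rmse = [1859, 4351, 2344, 1201, 4056, 1961,
        1203, 2349, 1920, 838, 2890, 1755,
        822, 3116, 1461, 544, 2644, 992]

mean_bias = mean(sd_bias)                           # signed average (assumption)
median_abs_bias = median(abs(b) for b in sd_bias)   # median of absolute values (assumption)
mean_rmse = mean(rmse)

print(f"mean standardized bias:     {mean_bias:.3f}")
print(f"median |standardized bias|: {median_abs_bias:.2f}")
print(f"mean RMSE:                  {mean_rmse:.0f}")
```

Because the signed average lets positive and negative biases offset one another, it is smaller than the median absolute bias; readers comparing the 0.05 and 0.09 figures should keep that distinction in mind.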

DISCUSSION

The results from the within-study comparisons show some specific ways in which the pretest RDD can shore up key weaknesses of the standard RDD. First, adding the pretest data led to more statistically precise estimates of the treatment effect at the cutoff without incurring a substantial penalty in terms of bias. Second, the pretest design improves the justification for extrapolations away from the cutoff by providing information about the relationship between the untreated outcome and the assignment variable across the full range of the assignment variable. And finally, supplementing the basic RDD with a pretest measure of the outcome led to estimates of the average treatment effect above the cutoff that were very similar to those produced by the RCT.

The key risk in supplementing the posttest RDD with data from a pretest time period is the possibility of additional bias that might arise if the fixed period effect assumption fails to hold. This does not seem to have been an important consequence in these within-study comparisons. Indeed, when both bias and variance are considered together in the RMSE statistic, the pretest RDD performed much better than the posttest RDD at the cutoff, and the extrapolations beyond the cutoff met an even higher standard of performance. Of course, our within-study comparisons also support the general superiority of the RCT in terms of both bias and statistical precision.
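The way the RMSE combines bias and variance can be made explicit with the standard decomposition. The expression below is a sketch in our own notation, with \hat{\tau} denoting a treatment effect estimate; it assumes the reported RMSE follows this textbook form, which may differ in detail from the computation used for the tables. The numerical illustration uses the first row of Table 6 (Arkansas, polynomial series, age 35 cutoff).

```latex
\mathrm{RMSE}(\hat{\tau})
  = \sqrt{\mathrm{Bias}(\hat{\tau})^{2} + \mathrm{SE}(\hat{\tau})^{2}},
\qquad
\bigl|\mathrm{Bias}(\hat{\tau})\bigr|
  = \sqrt{1{,}859^{2} - 1{,}312^{2}} \approx 1{,}317 .
```

Under this assumption, an estimator that trades a small increase in bias for a large reduction in standard error has a lower RMSE.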

The present study has some inevitable external validity limitations. The paper is framed in terms of a sharp RDD. Extending our basic approach to the fuzzy case requires additional assumptions both about the joint distribution of heterogeneous treatment effects and compliance status, and also about the stability of that joint distribution over time and across levels of the assignment variable. These additional assumptions may or may not be credible in particular instances of applied work. We leave to future work the application of within-study comparisons to fuzzy designs. Another limitation comes from the fact that only one data set is used here, and we have no guarantee that similar results would be achieved with data sets whose characteristics differ from ours. This issue is an important concern with within-study comparisons and with case studies more generally. The weakness disappears to some extent when the literature is considered as a whole and when one attaches value to the insights into analytical choices and modeling assumptions that may be produced simply by engaging in the often complicated effort to evaluate the performance of a particular research design in a particular setting.

It is important not to confuse our use of pretest assessments in RDD with other legitimate ways that a pretest may be used. We treated the pretest as a comparison data set rather than as covariates. Using the pretest as a covariate in a regression framework will increase statistical power (Lee & Lemieux, 2010), but will not facilitate the extrapolation beyond the cutoff that we have emphasized here. The pretest RDD estimator we used also shares some important features with the difference-in-differences (DID) design. But the two are not identical. Our approach exploits the smoothness assumptions that are central to RDD and uses pretest information about functional form to justify extrapolating beyond the cutoff.
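The logic of using the pretest as a comparison function, rather than as a covariate, can be illustrated with a small simulation. The Python sketch below is not the estimator used in this paper (which relies on polynomial series and a partially linear model); it is a simplified stand-in with hypothetical data, hypothetical function names, and a fixed period effect built in by construction. Cases at or above the cutoff are treated in the posttest period only.

```python
import numpy as np


def simulate(n=4000, cutoff=50.0, effect=500.0, period_shift=200.0, seed=0):
    """Hypothetical data: a smooth untreated outcome in age, a constant
    period shift, and a constant treatment effect above the cutoff."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(18, 90, size=n)

    def untreated(a):
        return 3000 + 40 * (a - 50) + 0.8 * (a - 50) ** 2

    treated = age >= cutoff
    y_pre = untreated(age) + rng.normal(0, 300, size=n)
    y_post = untreated(age) + period_shift + effect * treated + rng.normal(0, 300, size=n)
    return age, y_pre, y_post, treated


def pretest_rdd_att(age, y_pre, y_post, treated, degree=2):
    """Average effect above the cutoff via a pretest comparison function."""
    # 1. Estimate the untreated outcome function on pretest data, full age range.
    coefs = np.polyfit(age, y_pre, degree)
    g_hat = np.polyval(coefs, age)
    # 2. Estimate the fixed period effect from untreated (below-cutoff) posttest cases.
    shift = np.mean(y_post[~treated] - g_hat[~treated])
    # 3. Extrapolate the untreated posttest outcome above the cutoff and compare.
    counterfactual = g_hat[treated] + shift
    return np.mean(y_post[treated] - counterfactual)


age, y_pre, y_post, treated = simulate()
estimate = pretest_rdd_att(age, y_pre, y_post, treated)
print(f"estimated average effect above the cutoff: {estimate:.1f}")
```

The sketch differs from a simple DID contrast in step 1: the pretest data are used to learn the shape of the untreated regression function over the whole assignment variable, and that shape is what licenses the extrapolation in step 3. If the period effect varied with age, step 2 would no longer be valid and the extrapolation would be biased.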

One of the basic findings of our study is that supplementing the RDD with data from a pretest time period can improve the precision of the estimates without substantial costs in terms of bias. There are at least two mechanisms through which the pretest RDD achieves these efficiency gains. The first is an increase in the number of observations available for analysis. These gains may seem trivial, but they should not be ignored since pretest assessments are relatively common and, as we have seen, can improve statistical power and causal credibility without much threat of bias. In contrast, data generated according to a standard RDD assignment rule are relatively rare, and researchers may find it difficult to collect additional RDD data in the pursuit of improved statistical power.

A second way that the pretest RDD could improve statistical power is by altering the design effect associated with the RDD (Schochet, 2009). For example, combining data from a pretest and posttest time period may change the variability in the treatment variable across the pooled sample, and it may also alter the degree of collinearity between the treatment variable and the assignment variable. In empirical settings where the pooled data set increases variability in treatment and reduces collinearity, the pretest RDD will produce efficiency gains that are independent of the sample size.
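A stylized calculation shows how pooling can change both quantities. The Python sketch below is an assumption-laden illustration, not the analysis used in this paper: it summarizes the design effect as 1 / (1 − ρ²), where ρ is the correlation between treatment status and the assignment variable, it uses a hypothetical assignment variable spread evenly over [0, 1] with 80 percent of cases above the cutoff, and it ignores the period indicator that a pooled regression would also include.

```python
import numpy as np


def design_effect(treat, assign):
    """Variance inflation from collinearity between treatment and assignment."""
    rho = np.corrcoef(treat, assign)[0, 1]
    return 1.0 / (1.0 - rho ** 2)


# Hypothetical assignment variable; cases at or above 0.2 are treated.
x = np.linspace(0.0, 1.0, 10001)
t_post = (x >= 0.2).astype(float)

# Pooled sample: the same cases observed in a pretest period, all untreated.
x_pooled = np.concatenate([x, x])
t_pooled = np.concatenate([np.zeros_like(x), t_post])

print(f"Var(T):        posttest {t_post.var():.3f}, pooled {t_pooled.var():.3f}")
print(f"design effect: posttest {design_effect(t_post, x):.2f}, "
      f"pooled {design_effect(t_pooled, x_pooled):.2f}")
```

In this configuration, pooling raises the variance of the treatment indicator and lowers its correlation with the assignment variable. With a cutoff that treats only a small share of cases, pooling would instead reduce the variance of the treatment indicator, which is why the gains depend on the empirical setting.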

Pretest measures of the outcome are not the only design elements (Corrin & Cook, 1998) that improve functional form estimation, statistical power, and causal generalization. Repeated cross-sectional samples are also possible. Lohr’s study, reported in Cook and Campbell (1979), concerned how the introduction of Medicaid affected the number of doctor visits in a nationally representative sample. Household income was the assignment variable, an income threshold adjusted for family size was the cutoff, and the number of doctor visits in the year after the introduction of Medicaid was the posttest. The supplemental RDD element was a representative sample of families and their doctor visits from the year before Medicaid became available. Lohr did not perform all the analyses presented in Cook and Campbell (1979), but he demonstrated that functional forms were very similar in the untreated part of the assignment variable in both the pretest and posttest samples, suggesting that this cohort-based design supplement may also perform well beyond the cutoff.

Other RDD supplements that could be considered include contemporaneous but nonequivalent comparison groups of people who were not offered the treatment. Depending on context, nonequivalent control groups might come from a different geographical area or from institutions where the treatment was not available (another city, state, school, or workplace), and they might even be matched on pretreatment covariates to make the regression functions more comparable. In the early stages of this study, we explored pairing the RCT control group from one state with the nonequivalent comparison group for another state, but soon decided this strategy was not viable because, as Figure 4 makes clear, the regression functions are very different by state. Recent work by Dong and Lewbel (2011) considers extrapolation beyond the cutoff using information about the local derivative of the regression function at the cutoff, and work by Angrist and Rokkanen (2012) combines this idea with a conditional independence assumption that facilitates extrapolation. The spirit of these approaches to supplementing the RDD with additional design elements is similar to the approach we consider in this paper.

Nonequivalent dependent variables offer a third kind of supplement to the basic posttest RDD. These are variables that should be affected by the most plausible alternative interpretations operating at the cutoff, but that are not related to treatment. An example comes from Ludwig and Miller (2007), who conducted a long-term evaluation of Head Start in which help in writing application proposals was originally given to the 300 poorest U.S. counties. They showed that spending on other poverty programs did not differentially occur at this cutoff, making it possible to use these other programs as comparison RDD functions since there was little if any reason to believe that outcomes should change at the 300th poorest county.

COADY WING is Assistant Professor of Health Policy and Administration, University of Illinois at Chicago, 1603 West Taylor Street, 754 SPHPI, Chicago, IL 60612-4394.

THOMAS D. COOK, Joan and Sarepta Harrison Chair of Ethics and Justice and Professor of Sociology, Psychology, and Education and Social Policy, Institute for Policy Research, Northwestern University, 2040 Sheridan Road, Evanston, IL 60208.

ACKNOWLEDGMENTS

Several people deserve our thanks. The editor and three anonymous reviewers provided thoughtful comments and suggestions. Vivian Wong, Peter Steiner, Kelly Hallberg, Will Shadish, and Dan Black provided helpful feedback on an earlier draft. In addition, work on this project was facilitated by IES Grant R305D100033.

REFERENCES

Angrist, J. D., & Rokkanen, M. (2012). Wanna get away? RD identification away from the cutoff. NBER Working Paper 18662. Cambridge, MA: National Bureau of Economic Research.

Bifulco, R. (2012). Can non-experimental estimates replicate estimates based on random assignment in evaluations of school choice? A within study comparison. Journal of Policy Analysis and Management, 31, 729–751.

Brown, R. S., & Dale, S. B. (2007). The research design and methodological issues for the Cash and Counseling evaluation. Health Services Research, 42, 414–445.

Carlson, B. L., Foster, L., Dale, S. B., & Brown, R. (2007). Effects of Cash and Counseling on personal care and well-being. Health Services Research, 42, 467–487.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis for field settings. Chicago, IL: Rand McNally.

Cook, T. D., & Steiner, P. M. (2010). Case matching and the reduction of selection bias in quasi-experiments: The relative importance of covariate choice, unreliable measurement and mode of data analysis. Psychological Methods, 1, 56–68.

Cook, T. D., & Wong, V. C. (2008). Empirical tests of the validity of the regression discontinuity design. Annals of Economics and Statistics, 91/92, 127–150.

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies often produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27, 724–750.

Corrin, W., & Cook, T. (1998). Design elements of quasi-experimentation. Advances in Educational Productivity, 7, 35–57.

Dale, S. B., & Brown, R. S. (2007). How does Cash and Counseling affect costs? Health Services Research, 42, 488–509.

Dong, Y., & Lewbel, A. (2011). Regression discontinuity marginal threshold treatment effects. Working Paper. Boston, MA: Boston College.

Doty, P., Mahoney, K. J., & Simon-Rusinowitz, L. (2007). Designing the Cash and Counseling demonstration and evaluation. Health Services Research, 42, 378–396.

Fraker, T., & Maynard, R. (1987). Evaluating comparison group designs with employment-related programs. Journal of Human Resources, 22, 194–227.

Goldberger, A. S. (1972a). Selection bias in evaluating treatment effects: Some formal illustrations. Unpublished manuscript.

Goldberger, A. S. (1972b). Selection bias in evaluating treatment effects: The case of interaction. Unpublished manuscript.

Green, D. P., Leong, T. Y., Kern, H. L., Gerber, A. S., & Larimer, C. W. (2009). Testing the accuracy of regression discontinuity analysis using experimental benchmarks. Political Analysis, 17, 400–417.

Hahn, J., Todd, P., & van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69, 201–209.

LaLonde, R. (1986). Evaluating the econometric evaluations of training with experimental data. American Economic Review, 76, 604–620.

Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142, 675–697.

Lee, D. S., & Card, D. (2008). Regression discontinuity inference with specification error. Journal of Econometrics, 142, 655–674.

Lee, D. S., & Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48, 281–355.

Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children’s life chances? Evidence from a regression discontinuity design. Quarterly Journal of Economics, 122, 159–208.

Manski, C. (2013). Public policy in an uncertain world. Cambridge, MA: Harvard University Press.

Porter, J. (2003). Estimation in the regression discontinuity model. Mimeo. Madison, WI: Department of Economics, University of Wisconsin.

Robinson, P. (1988). Root-N-consistent semi-parametric regression. Econometrica, 56, 931–954.

Schochet, P. Z. (2009). Statistical power for regression discontinuity designs in education evaluations. Journal of Educational and Behavioral Statistics, 34, 238–266.

Shadish, W., Galindo, R., Wong, V., Steiner, P., & Cook, T. (2011). A randomized experiment comparing random and cut-off-based assignment. Psychological Methods, 16, 179–191.

Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex-post facto experiment. Journal of Educational Psychology, 51, 309–317.

Wilde, E. T., & Hollister, R. (2007). How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. Journal of Policy Analysis and Management, 26, 455–477.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data, 2nd ed. Cambridge, MA: MIT Press.

APPENDIX: SELECTION OF BANDWIDTHS AND POLYNOMIAL SERIES LENGTHS

We adopted the following procedures for selecting smoothing parameters:

1. Use cross-validation to evaluate a grid of candidate bandwidths from 1 to 90 years. For the polynomial models, evaluate a grid of four possible functions each for the treatment and control data: linear, quadratic, cubic, and quartic.

2. Select the candidate with the smallest mean integrated squared error.

3. Visually inspect the fitted values from each of the models chosen in step 2 and make adjustments to resolve concerns about undersmoothing.
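Steps 1 and 2 can be sketched in a few lines of Python. The code below is a minimal illustration under stated assumptions rather than the procedure used for the tables that follow: it uses leave-one-out squared prediction error as a stand-in for the mean integrated squared error criterion, a triangular kernel, a coarse hypothetical grid, and simulated data; all function and variable names are ours. The visual inspection in step 3 is not automated.

```python
import numpy as np


def local_linear_predict(x0, x, y, bandwidth):
    """Local linear fit evaluated at x0 using a triangular kernel."""
    w = np.clip(1.0 - np.abs(x - x0) / bandwidth, 0.0, None)
    keep = w > 0
    if keep.sum() < 3:          # too few points in the window to fit a line
        return np.nan
    design = np.column_stack([np.ones(keep.sum()), x[keep] - x0])
    sw = np.sqrt(w[keep])       # weighted least squares via rescaling
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y[keep] * sw, rcond=None)
    return beta[0]              # intercept = fitted value at x0


def loo_cv_error(x, y, bandwidth):
    """Leave-one-out mean squared prediction error for one bandwidth."""
    idx = np.arange(len(x))
    errs = np.array([
        (y[i] - local_linear_predict(x[i], x[idx != i], y[idx != i], bandwidth)) ** 2
        for i in idx
    ])
    # Penalize bandwidths that leave some points without a usable window.
    return np.inf if np.isnan(errs).any() else errs.mean()


def select_bandwidth(x, y, grid):
    """Pick the candidate bandwidth with the smallest cross-validation error."""
    return min(grid, key=lambda h: loo_cv_error(x, y, h))


# Hypothetical data: outcome roughly linear in age with noise.
rng = np.random.default_rng(1)
age = rng.uniform(18, 90, size=200)
cost = 100 + 2 * age + rng.normal(0, 5, size=200)

grid = [1, 5, 10, 20, 45, 90]
best = select_bandwidth(age, cost, grid)
print(f"cross-validation bandwidth: {best} years")
```

The same loop applies to the polynomial series by replacing the bandwidth grid with the four candidate orders and the local fit with a global polynomial fit. When the underlying function is close to linear, cross-validation tends to favor very wide bandwidths, which is consistent with the frequent appearance of the 90-year maximum in the tables below.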

Table A1. Smoothing parameters for the RCT benchmarks.

                         Cross-validation        Preferred
Parameter   State        Treatment   Control     Treatment   Control
Bandwidth   Arkansas     11          11          20          20
            New Jersey   9           13          25          25
            Florida      90          90          90          90
Polynomial  Arkansas     Interacted quadratic
            New Jersey   Interacted quadratic
            Florida      Interacted linear

Table A2. Smoothing parameters for the posttest RDD.

Panel A: Cross-validation parameters

Parameter   State        Cut = 35            Cut = 50            Cut = 70
Bandwidth   Arkansas     90 above/19 below   90 above/2 below    90 above/70 below
            New Jersey   14 above/90 below   90 above/15 below   13 above/9 below
            Florida      90 above/17 below   90 above/4 below    90 above/11 below
Polynomial  Arkansas     Quadratic           Linear              Linear
            New Jersey   Linear              Quartic             Quartic
            Florida      Quartic             Cubic               Interacted linear

Panel B: Preferred parameters (used in the analysis)

Parameter   State        Cut = 35            Cut = 50            Cut = 70
Bandwidth   Arkansas     90 above/19 below   90 above/20 below   90 above/70 below
            New Jersey   14 above/90 below   90 above/15 below   19 above/13 below
            Florida      90 above/17 below   90 above/11 below   90 above/11 below
Polynomial  Arkansas     Quadratic           Linear              Linear
            New Jersey   Linear              Quartic             Quartic
            Florida      Quartic             Cubic               Interacted linear

Notes: The partially linear model approach to the pretest RDD required choosing three bandwidths for each research design under test. We used the cross-validation bandwidth of 90 years for the pretest indicator bandwidth for all of the models. Table A3 shows the bandwidths that were selected using the cross-validation and the preferred bandwidths used in the Medicaid expenditures equation and in the residualized equation used to form the treatment effect estimates.

Table A3. Bandwidths in the pretest RDD based on the partially linear model.

Panel A: Cross-validation bandwidths

                         Medicaid                         Residual
Parameter   State        Cut = 35   Cut = 50   Cut = 70   Cut = 35   Cut = 50   Cut = 70
Bandwidth   Arkansas     90         90         90         13         90         11
            New Jersey   1          1          1          9          7          13
            Florida      2          90         90         66         66         67

Panel B: Preferred bandwidths

                         Medicaid                         Residual
Parameter   State        Cut = 35   Cut = 50   Cut = 70   Cut = 35   Cut = 50   Cut = 70
Bandwidth   Arkansas     90         90         90         13         90         11
            New Jersey   10         10         10         15         7          13
            Florida      20         90         90         66         66         67

Notes: We used cross-validation to select the length of the polynomial series for each model. In Arkansas, for the untreated samples, we used a quartic model in the age 35 design, a linear model for the age 50 design, and a quartic model for the age 70 design. For the treated samples in Arkansas, we used a quartic model for the age 35 design, a cubic model for the age 50 design, and a linear model for the age 70 design. In New Jersey, for the untreated samples, we used a quartic for all three designs. For the treated samples in New Jersey, we used a linear model for the age 35 design, a quadratic model for the age 50 design, and a quartic model for the age 70 design. In Florida, for the untreated samples, we used a linear model in the age 35 design, a quadratic model for the age 50 design, and a linear model for the age 70 design. For the treated samples in Florida, we used a cubic model for the age 35 design, a quadratic model for the age 50 design, and a linear model for the age 70 design.

Table A1 shows the cross-validation and preferred bandwidths and functional forms from our analysis for the RCT benchmarks. Table A2 shows the parameters for the posttest RDD.
