
Econometric Reviews, 24(4):333–368, 2005 Copyright © Taylor & Francis, Inc. ISSN: 0747-4938 print/1532-4168 online DOI: 10.1080/07474930500405485

CLASS SIZE AND EDUCATIONAL POLICY: WHO BENEFITS FROM SMALLER CLASSES?

Esfandiar Maasoumi, Daniel L. Millimet, and Vasudha Rangaprasad, Department of Economics, Southern Methodist University, Dallas, Texas, USA

The impact of class size on student achievement remains an open question despite hundreds of empirical studies and the perception among parents, teachers, and policymakers that larger classes are a significant detriment to student development. This study sheds new light on this ambiguity by utilizing nonparametric tests for stochastic dominance to analyze unconditional and conditional test score distributions across students facing different class sizes. Analyzing the conditional distributions of test scores (purged of observables using class-size specific returns), we find that there is little causal effect of marginal reductions in class size on test scores within the range of 20 or more students. However, reductions in class size from above 20 students to below 20 students, as well as marginal reductions in classes with fewer than 20 students, increase test scores for students below the median, but decrease test scores above the median. This nonuniform impact of class size suggests that compensatory school policies, whereby lower-performing students are placed in smaller classes and higher-performing students are placed in larger classes, improve the academic achievement of not just the lower-performing students but also the higher-performing students.

Keywords Class size; Program evaluation; Quantile treatment effects; School quality; Stochastic dominance; Student achievement.

JEL Classification C14; C33; I21; I28.

1. INTRODUCTION

The effect of school quality, notably class size, on student achievement is one of the most debated educational policy issues (e.g., Hanushek, 1986, 1996, 2003; Krueger, 2003). While a number of researchers have analyzed the issue, no consistent relationship between class size and student outcomes has been identified. In a recent review, Akerhielm (1995)

Received August 2004; Accepted September 2005. Address correspondence to Daniel L. Millimet, Department of Economics, Box 0496, Southern Methodist University, Dallas, TX 75275-0496, USA; Fax: (214) 768-1821; E-mail: millimet@mail.smu.edu.


noted that of 112 studies in the literature, 23 documented a statistically significant relationship between class size and student achievement, 14 finding a negative relationship and nine showing a positive relationship.1

Consequently, Krueger (2003, p. F61) labels the effect “subtle and easily obscured,” Fertig (2003b, p. 2) calls the mixed results “disconcerting,” and Card and Krueger (1996, p. 47) note that “decisions about educational resources and reform have to be made in an environment of much uncertainty.” In spite of these ambiguous findings, the common perception among parents, teachers, and policymakers is that larger classes are a significant detriment to student development. Several states, including Texas, have mandated maximum class size levels. As a result, Dustmann (2003, p. F1) labels class size reduction policies as “easily the most popular policy for school improvements in the US,” although efficacy of such policies is “an issue of ongoing debate.” Hanushek (2002, p. 61) concludes, “Despite the political popularity of overall class size reduction, the scientific support of such policies is weak to nonexistent.”

Given the importance of student development, future labor market success, racial achievement and wage gaps, economic growth, and the limited budgets that school districts confront, improved understanding of any benefits of reductions in class size, and of who benefits, is crucial to informed policymaking. Unfortunately, all existing empirical studies (to our knowledge) are lacking in at least one of three important respects, contributing to the inconsistent results noted above. First, in the past, most studies utilized the average pupil–teacher ratio at an individual’s school to proxy for the actual class size faced by the individual, given the absence of data on actual class size. However, the relationship between student performance and average pupil–teacher ratios is likely to be weaker (Ehrenberg et al., 2001). Second, nearly all studies focus on only one aspect of the distribution of student achievement: the (conditional) mean.2

This is helpful only to a decision maker who places equal weights on all student groups. Finally, class size is treated as an intercept effect/shift, potentially missing important interactions between class size and other educational inputs. For example, it may be that effects of smaller classes arise because they increase the productivity of other inputs, such as teacher quality or parental involvement.

To account simultaneously for these three shortcomings, we use a nationally representative sample of public high school students in the US.

1 More recently, Todd and Wolpin (2003, p. F4) state that despite “hundreds of empirical studies of the school quality–achievement relationship,” such studies “do not appear to be converging toward a consensus.” See also Hanushek (2002).

2 Notable exceptions are Eide and Showalter (1998), Levin (2001), and Fertig (2003a), who use quantile regression methods to analyze the effect of school attributes on student achievement at select points in the distribution. However, Eide and Showalter (1998) only have data on the pupil–teacher ratio at the school level, not actual class size, and Levin (2001) and Fertig (2003a) analyze Dutch and German data, respectively.


The data, taken from the National Educational Longitudinal Study of 1988 (NELS), contain several test score measures, the class size corresponding to the subject being tested, as well as an exhaustive set of conditioning variables. We examine the entire distribution of tenth and twelfth grade test scores, both unconditional and conditional, across several class size groupings, controlling for a host of student, family, and school attributes and incorporating the logic of the Oaxaca–Blinder decomposition to allow for interaction effects between class size and the returns to other educational inputs.

Our comparisons of the distributions of test scores rely on recently developed nonparametric tests for stochastic dominance (SD). Such tests of first and second order stochastic dominance (hereafter FSD and SSD, respectively), as well as of higher orders, were examined in McFadden (1989) and Kaur et al. (1994). Our implementation of these tests treats the distribution of student test scores as an unknown to be estimated nonparametrically, and it draws upon bootstrap techniques developed in Maasoumi and Heshmati (2000) and Linton et al. (2005) to assess the level of statistical confidence regarding various relations.

Testing for SD is a useful and insightful companion to standard regression analysis for several reasons. First, it allows one to examine effects of the “treatment” in question at different parts of the distribution, as viewed by very different evaluation functions. Because many policy interventions have different effects on high-, middle-, and low-achieving students, any summary (index) measure of this distributed outcome must assign implicit weights to these different groups. For instance, averages assign equal weights, while focusing on the lowest quintile places zero weight on all other students. Second, SD analysis endows a context to averages and estimators of Quantile Treatment Effects (QTEs), which exposes the welfare functions, or policy evaluation functions, underlying the assessment of the differences in the distributions of the outcomes of different groups exposed to different treatments (see, e.g., Bitler et al., 2005). For example, SD analysis may make clear that the only way two situations can be (could have been) ranked is by a summary measure that prefers higher scores and cares about “inequality” or certain trade-offs between groups. Third, finding an SD ranking (of a particular order) indicates that all utility/welfare functions belonging to a particular class would agree that one policy is preferable to another. Thus summarized comparisons based on specific indices (e.g., mean comparisons in the case of first order SD) are only needed for quantification of the treatment effect. In addition, the inability to infer a dominance relation is equally informative, indicating that any (implicit) welfare ordering based on a particular index (such as the conditional mean) is highly subjective; different indices, even within the same general class of utility/welfare functions, will yield different substantive conclusions. As a result, summary


measures may be used for complete, strong rankings of student outcomes across classes of varying size, consistent with explicitly revealed welfare preferences. More sophisticated techniques are needed for weak yet uniform ranking of student outcomes over large classes of welfare functions. Such uniform rankings (i.e., rankings robust to the specific choice of preference function and distribution of outcomes) are needed for “majority” rankings of policy outcomes. Empirical examination of such uniform rankings, based on the notion of stochastic dominance (SD), is the approach utilized in this paper.

The power of such SD relations, combined with the recently developed theory necessary to conduct rigorous statistical tests for the presence of such relations, has led to their growing application. For example, Maasoumi and Millimet (2005) examine changes in US pollution distributions over time and across regions at a point in time. Maasoumi and Heshmati (2005) analyze changes in the Swedish income distribution over time as well as across different population subgroups. Fisher et al. (1998) compare the distribution of returns to different length US Treasury Bills. Particularly relevant to the analysis at hand are previous applications of SD to the analysis of various treatment effects on the distribution of the outcome of interest. For instance, Amin et al. (2003) analyze the effect of a microcredit program in Bangladesh on the distribution of consumption of participants versus nonparticipants. Abadie (2002) analyzes the impact of veteran status on the distribution of civilian earnings. Bishop et al. (2000) compare the distribution of nutrition levels across populations exposed to two different types of food stamp programs. Anderson (1996) compares pre- and posttax income distributions in Canada over several years.

The results are quite striking. In particular, we reach four main conclusions. First, we find that the unconditional distributions of test scores favor students in larger classes below the 70th percentile or so, and students in smaller classes in the upper tail. Specifically, while marginal changes in class size in classes with at least 20 students are not associated with changes in the unconditional test score distributions, the unconditional test scores of low-achieving (high-achieving) students are superior in classes with at least (less than) 20 students. Second, analyzing the distribution of test scores purged of a multitude of individual, family, class, teacher, and school attributes, including lagged test scores and grade point average, using class-size specific returns, we find that there is little causal effect of marginal reductions in class size on test scores within the range of 20 or more students. However, reductions in class size from above 20 students to below 20 students, as well as marginal reductions in classes with fewer than 20 students, increase test scores for students below the median, but decrease test scores for students above the median. Third, our conclusions hinge on the fact that we allow the returns to observables to vary by class size. In fact, unknown heretofore, we conclude that most of the beneficial impacts


of class size reduction arise because of the productivity-enhancing effect it has on other educational inputs. However, the improvement in the returns to observables is nonmonotonic; the overall returns are maximized (in some sense) in classes with 16–19 students. Finally, class size reductions rarely have a uniform impact across the test score distributions (i.e., the QTEs vary across the distributions in terms of sign and magnitude), so that the current focus on mean treatment effects is misleading. Only by expanding one’s analysis to also incorporate the dispersion of test scores can policymakers hope to arrive at uniform rankings of class sizes.

In terms of informing educational policy, our results provide two key insights. First, since marginal reductions in class size in classes with more than 20 students have little impact on test scores, absent other considerations (e.g., student discipline), class size reductions in schools with current class sizes well above 20 should only be undertaken if schools are willing to reduce class size close to or below 20 students per class. Second, the nonuniform impact of class size suggests that compensatory school policies, whereby lower-performing students are placed in smaller classes and higher-performing students are placed in larger classes, improve the academic achievement of not just the lower-performing students but also the higher-performing students.

The remainder of the paper is organized as follows. Section 2 presents an overview of the literature. Section 3 defines the various dominance relations and describes the tests used to identify such relations in the data. Section 4 discusses the data. Section 5 presents the initial results, while Section 6 presents some further detailed analysis. Section 7 offers some concluding remarks.

2. LITERATURE REVIEW

Given that student achievement has been linked to variation in economic growth across countries (Barro, 2001; Hanushek and Kimko, 2000), and that racial disparities in student achievement have been well-documented (Cook and Evans, 2000; Hanushek, 2001; Jenks and Phillips, 1998), a considerable literature has developed attempting to understand the factors contributing to student development. The vast majority of previous studies focus on the pupil–teacher ratio at the school level to proxy for class size (e.g., Chubb and Moe, 1990; Coleman et al., 1996; Eide and Showalter, 1998), with the resulting link between student achievement and pupil–teacher ratios found to be tenuous at best (Hanushek, 1986, 1989). More recent work has utilized actual measures of class size (e.g., Akerhielm, 1995; Boozer and Rouse, 2001).

In early work, Summers and Wolfe (1977) analyzed data on 627 sixth grade elementary school students across 103 randomly selected elementary schools from the Philadelphia School District in 1970 and


1971, as well as 553 eighth grade students and 716 twelfth grade students. The authors concluded that disadvantaged students and students from low socioeconomic backgrounds benefited from smaller class sizes. Several more recent studies have reaffirmed this conclusion. Angrist and Lavy (1999) find a beneficial impact of smaller classes on the achievement of fourth and fifth grade students in Israel using an instrumental variable (IV) approach. Rivkin et al. (2002) use a complex fixed effects model and data on class size within Texas schools across multiple cohorts, documenting some significant positive effects of smaller classes on fourth and fifth grade students, although the effects disappear by sixth grade and in any event are of much smaller magnitude than the impact of teacher quality. Akerhielm (1995) and Boozer and Rouse (2001) use data from the NELS, relying on IV methods to control for potential endogeneity of class size. Both find statistically significant, negative effects of larger classes on test scores. Utilizing experimental evidence on kindergarten through third grade students from Tennessee’s Project STAR (Student/Teacher Achievement Ratio), Krueger (1999) finds a modest beneficial impact of smaller classes during kindergarten and first grade but no impact on subsequent achievement.3 Krueger and Whitmore (2002), also examining the Project STAR data, document a positive impact of class size reductions, particularly for minorities.

Conversely, several studies have offered compelling empirical support for the notion that (i) smaller classes are not as beneficial as the above studies suggest, or (ii) larger classes are actually beneficial for students. Jepsen and Rivkin (2002) examine the effectiveness of California’s Class Size Reduction Program (CSRP) on student achievement using longitudinal data from the 1990s. Employing a difference-in-differences approach, the authors find that smaller class sizes raised third grade mathematics and reading test scores, particularly for low income students. However, the reduction in class size lowered the quality of teachers on the margin, and the deterioration in teacher quality at least partially offsets the gains from smaller classes (see also Rivkin et al., 2002 for a similar argument). Hoxby (2000a) employs a similar approach to many of the above studies, relying on exogenous variation in class size in Connecticut arising from idiosyncratic variation in the population, and finds no significant relationship between class size and student performance on tests in fourth and sixth grade. Goldhaber and Brewer (1997) control for school and teacher fixed and random effects, finding a positive impact of class size on mathematics achievement of tenth

3 Interestingly, Hanushek (2002) implicitly suggests the benefit of examining achievement distributions, rather than average outcomes, noting that while overall average kindergarten achievement is higher in smaller classes under the STAR experiment, this ranking only holds in 40 of 79 schools participating in the experiment.


grade students using the NELS. Wößmann (2003) uses international data from the Trends in International Mathematics and Science Study (TIMSS) on seventh and eighth grade students, spanning 39 countries, and presents statistically significant evidence (using IV and other estimation techniques) that smaller class size is related to lower student achievement in mathematics and science. Fertig (2003a), using German data on the reading achievement of 15 and 16 year old students, also finds a statistically significant positive impact of larger classes using both OLS and quantile regression methods and controlling for a host of other attributes, including the homogeneity (in terms of ability) of the school population. The author argues that heterogeneity, not class size, is a more important detriment to student performance (see also Fertig, 2003b and Lazear, 2001). Finally, Dobbelsteen et al. (2002, p. 36) utilize Dutch data on fourth, sixth, and eighth grade students and exogenous variation in class size arising from rules linking total school enrollment to the number of teachers. The authors find that “after correcting for endogeneity, pupils in large classes do no worse—and sometimes even better—than identical pupils in small classes.” Consonant with Fertig (2003a,b), the authors test and find support for their hypothesis that students learn from other students of similar ability; thus larger classes increase the probability of being surrounded by others from whom one may learn (see also Levin, 2001).

Finally, Krueger (2003) performs a meta-analysis of the literature, finding in general a small, not particularly robust, positive impact of smaller class sizes. Hanushek (1996, 2002, 2003) also performs several meta-analyses, concluding that there is no consistent relationship between resources and student achievement. This lack of consensus suggests that examination of the QTEs, along with the application of SD testing, may be particularly fruitful. By way of contrast, it is worth reemphasizing that practically all the above studies utilizing regression-based inferences face two limitations: (i) the assumed structure of the various educational production functions estimated may be too simplistic, as they restrict class size to only an intercept shift, and (ii) the focus is on the conditional mean of the distribution of scores/performance and the impact of different conditioning variables thereon.

3. EMPIRICAL METHODOLOGY

3.1. Test Statistics

Our distributional comparisons are based on the notion of SD. Several tests for SD have been proposed in the literature; the approach herein is based on a generalized Kolmogorov-Smirnov test.4 To begin, let X and Y

4 Maasoumi and Heshmati (2000) provide a brief review of the development of alternative tests.


denote two outcome (test score) variables being compared (e.g., $X$ ($Y$) might refer to test scores of students exposed to a short (long) school year). $\{x_i\}_{i=1}^{N}$ is a vector of $N$ possibly dependent observations of $X$; $\{y_i\}_{i=1}^{M}$ is an analogous vector of realizations of $Y$. In the spirit of the historical development of such two-sample tests, $\{x_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{M}$ each constitute one sample. Thus we refer to dependence between $x_i$ and $x_j$, $i \neq j$, as within-sample dependence (similarly for observations of $Y$), and dependence between $X$ and $Y$ as between-sample dependence.

Assuming general von Neumann–Morgenstern conditions, let $\mathcal{U}_1$ denote the class of (increasing) utility functions $u$ such that utility is increasing in test scores (i.e., $u' \geq 0$) and $\mathcal{U}_2$ the class of social welfare functions in $\mathcal{U}_1$ such that $u'' \leq 0$ (i.e., concavity). Concavity represents risk aversion, or an aversion to inequality in the achievement of students; a high concentration of both high- and low-achieving students is undesirable. Let $F(x)$ and $G(y)$ represent the cumulative distribution functions (CDFs) of $X$ and $Y$, respectively, which are assumed to be continuous and differentiable.

Under this notation, $X$ first order stochastically dominates $Y$ (denoted $X$ FSD $Y$) iff $E[u(X)] \geq E[u(Y)]$ for all $u \in \mathcal{U}_1$, with strict inequality for some $u$.5 Equivalently,

$$F(z) \leq G(z) \quad \forall z \in \mathcal{Z}, \text{ with strict inequality for some } z, \tag{1}$$

where $\mathcal{Z}$ denotes the union of the supports of $X$ and $Y$. If $X$ FSD $Y$, then the expected welfare from $X$ is at least as great as that from $Y$ for all increasing welfare functions, with strict inequality holding for some utility function(s) in the class. Equivalent conditions may be given in terms of quantiles. Loosely speaking, all quantiles of $F(\cdot)$ will be at least as large as those of $G(\cdot)$. The distribution of $X$ second order stochastically dominates $Y$ (denoted as $X$ SSD $Y$) iff $E[u(X)] \geq E[u(Y)]$ for all $u \in \mathcal{U}_2$, with strict inequality for some $u$. Equivalently,

$$\int_{-\infty}^{z} F(v)\,dv \leq \int_{-\infty}^{z} G(v)\,dv \quad \forall z \in \mathcal{Z}, \text{ with strict inequality for some } z. \tag{2}$$

If $X$ SSD $Y$, then the expected social welfare from $X$ is at least as great as that from $Y$ for all increasing and concave utility functions in the class $\mathcal{U}_2$, with strict inequality holding for some utility function(s) in the class. FSD implies SSD and higher orders, and SSD is equivalent to generalized Lorenz dominance.
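As an illustration only, the sample analogues of conditions (1) and (2) can be checked directly on two sets of scores. The sketch below is not the authors' procedure (it involves no statistical inference); the function names `ecdf`, `integrated_ecdf`, and `dominates` are hypothetical, and the check is carried out on the pooled sample points, which is where an empirical CDF difference and its integral attain their extremes.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function z -> share of observations <= z (empirical CDF)."""
    xs = sorted(sample)
    n = len(xs)
    return lambda z: bisect_right(xs, z) / n


def integrated_ecdf(sample):
    """Return z -> integral of the empirical CDF from -infinity to z.

    For an empirical CDF this integral equals the mean of max(z - x_i, 0).
    """
    xs = list(sample)
    n = len(xs)
    return lambda z: sum(max(z - x, 0.0) for x in xs) / n


def dominates(x, y, order=1, tol=1e-12):
    """Sample analogue of condition (1) (order=1) or condition (2) (order=2).

    Returns True if x weakly lies below y in the relevant sense at every
    pooled sample point, with a strict difference at some point.
    """
    make = ecdf if order == 1 else integrated_ecdf
    fx, gy = make(x), make(y)
    grid = sorted(set(x) | set(y))  # union of the two sample supports
    diffs = [fx(z) - gy(z) for z in grid]
    return all(d <= tol for d in diffs) and any(d < -tol for d in diffs)


# Hypothetical scores: x is y shifted up by one point, so x FSD y.
shifted_up = dominates([2, 3, 4, 5], [1, 2, 3, 4], order=1)
# Same mean, less dispersion: no first order ranking, but a second order one.
crossing_fsd = dominates([2, 2], [1, 3], order=1)
crossing_ssd = dominates([2, 2], [1, 3], order=2)
```

The second pair shows why SSD is informative when the CDFs cross: neither sample dominates at first order, yet any increasing and concave welfare function prefers the less dispersed one.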

5 Note that SD relations offer insights into which distribution provides greater welfare considering only the outcome of interest, without regard for other considerations such as cost. If the treatment is costly, a separate cost–benefit analysis is required to decide if the welfare gains exceed the costs.


Define the following generalizations of the Kolmogorov–Smirnov test criteria:

$$d = \sqrt{\frac{NM}{N+M}} \, \min \sup_{z \in \mathcal{Z}} \, [F(z) - G(z)] \tag{3}$$

$$s = \sqrt{\frac{NM}{N+M}} \, \min \sup_{z \in \mathcal{Z}} \int_{-\infty}^{z} [F(u) - G(u)]\,du, \tag{4}$$

where min is taken over $F - G$ and $G - F$, in effect performing two tests in order to leave no ambiguity between the “equal” and “unrankable” cases. Our nonparametric tests for FSD and SSD are based on the empirical counterparts of $d$ and $s$ using the empirical CDFs, where the empirical CDF for $X$ is given by

$$\hat{F}_N(x) = \frac{1}{N} \sum_{i=1}^{N} I(X_i \leq x), \tag{5}$$

where $I(\cdot)$ is an indicator function; $\hat{G}_M(y)$ is defined similarly for $Y$. If $\hat{d} \leq 0$ ($\hat{s} \leq 0$) to a degree of statistical confidence, then the null hypothesis of FSD (SSD) is not rejected (see Appendix A for details).
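A minimal sketch of the empirical counterparts of (3) and (4) follows, assuming the suprema are evaluated on the pooled sample points; the grid actually used is described in Appendix A and may differ. The function name `sd_statistics` and the dictionary keys are hypothetical.

```python
from math import sqrt


def sd_statistics(x, y):
    """Empirical counterparts of d in (3) and s in (4) on the pooled grid."""
    n, m = len(x), len(y)
    scale = sqrt(n * m / (n + m))
    grid = sorted(set(x) | set(y))

    def cdf(sample, z):
        # Empirical CDF as in (5): share of observations <= z.
        return sum(v <= z for v in sample) / len(sample)

    def integrated_cdf(sample, z):
        # Integral of the empirical CDF up to z: mean of max(z - v, 0).
        return sum(max(z - v, 0.0) for v in sample) / len(sample)

    d1 = max(cdf(x, z) - cdf(y, z) for z in grid)  # sup of F - G
    d2 = max(cdf(y, z) - cdf(x, z) for z in grid)  # sup of G - F
    s1 = max(integrated_cdf(x, z) - integrated_cdf(y, z) for z in grid)
    s2 = max(integrated_cdf(y, z) - integrated_cdf(x, z) for z in grid)
    return {
        "d1": d1, "d2": d2, "d": scale * min(d1, d2),
        "s1": s1, "s2": s2, "s": scale * min(s1, s2),
    }


# Hypothetical scores: x is y shifted up by one point.
shifted = sd_statistics([2, 3, 4, 5], [1, 2, 3, 4])
# Hypothetical scores with crossing CDFs but equal means.
crossing = sd_statistics([2, 2], [1, 3])
```

In the shifted example the sup of $F - G$ is zero, so the minimum over the two directions is zero; in the crossing example both first order suprema are positive, while one of the second order suprema is zero.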

To this point X and Y have represented two unconditional test score variables. However, dependence between class size and other determinants of student achievement may confound the effect of class size with the impact of other characteristics (Dustmann, 2003; Hanushek, 1979; Hanushek et al., 1996). In particular, student achievement may be related to family background variables (such as family structure and income), teacher attributes (such as salary and experience), and school characteristics (such as school size and the ability of peers).6 Concern over the ability to control for all such reasonable determinants of student achievement, however, has led researchers to focus on the various IV methods cited in the previous section. In particular, two potential sources of endogeneity of class size are frequently cited in the literature (Boozer and Rouse, 2001; Dobbelsteen et al., 2002; Ehrenberg et al., 2001; Hoxby, 2000b; Todd and Wolpin, 2003). First, family attributes may be correlated with both student achievement and class size through endogenous residential decisions (so-called Tiebout, 1956 sorting). Second, schools may have compensatory policies specifically designed to assign less able students to smaller classes and/or to superior teachers.

To control for the myriad of determinants of student performance, as well as circumvent the potential endogeneity issue, we follow in the spirit of Dearden et al. (2002) and perform dominance tests on conditional test

6 For a theoretical account of the cognitive development of students, see Todd and Wolpin (2003).


score distributions derived via two approaches. Under the first approach, we control for a host of observable attributes that may generate a spurious correlation between class size and test scores and conduct dominance tests on the distributions of test scores purged of the “average” effects of these attributes. Under the second approach, we incorporate differential average effects by class size into the conditional distributions. In both cases, the conditioning covariates (discussed below) represent individual (such as race, gender, and lagged test scores), family (such as parental education, socioeconomic status, and family composition), class (such as subject and average student ability), teacher (such as race, gender, experience, and education), and school (such as enrollment, number of teachers, and average teacher salaries) attributes. Because our conditioning set is quite exhaustive, the implicit selection on observables assumption required to identify the treatment effect on the distribution of test scores appears reasonable.

To proceed with the first approach, we estimate separate educational production function models for students in each size category, obtain the intercept-adjusted residuals, and perform the dominance tests on these residuals.7 Specifically, in the first stage, we estimate

$$t_{ijk} = \alpha_k + h_{ij}\beta_k + \tilde{\varepsilon}_{ijk}, \quad k = 1, \ldots, K, \tag{6}$$

where $t_{ijk}$ is the test score for individual $i$ in school $j$ in class size group $k$, $h$ is a lengthy vector of individual, family, class, teacher, and school attributes, $\tilde{\varepsilon}$ is the error term, and there are $K$ class size categories ($K = 3$ in the application). In the second stage, we analyze the distributions of $\hat{\varepsilon}_{ijk} \equiv \hat{\alpha}_k + \hat{\tilde{\varepsilon}}_{ijk}$, which correspond to test scores net of all observable characteristics (evaluated at the size specific returns, $\beta_k$).8

Under the above approach, the intercept-adjusted residuals, $\hat{\varepsilon}_{ijk}$, reflect test scores net of all observable characteristics evaluated at the size specific returns, $\beta_k$. Since this method nets out test score differences due to observables as well as the size specific returns to such observables, we refer to these tests as being based on partial residuals (PR). As an alternative, we implement a second approach based on the full residuals (FR), where we denote the full residuals as inclusive of differences in the return to observables. Specifically, we rewrite the first-stage regression (6) for class size $k$ as

$$t_{ijk} = \alpha_k + h_{ij}\beta_k + \tilde{\varepsilon}_{ijk} = \alpha_k + h_{ij}\beta_k + \tilde{\varepsilon}_{ijk} + \left( h_{ij}\beta_{k'} - h_{ij}\beta_{k'} \right) = \alpha_k + h_{ij}\beta_{k'} + h_{ij}(\beta_k - \beta_{k'}) + \tilde{\varepsilon}_{ijk}, \tag{7}$$

7 The intercepts are included as part of the residuals, otherwise the conditional distributions will all be mean zero, precluding the possibility of first order dominance.

8 Note that controls for class size are omitted from (6), thereby allowing the error term to capture the residual effect of class size not captured by the included regressors.


where class size $k'$ is implicitly treated as the dominant category (Neuman and Oaxaca, 2004). Consequently, we amend the residual tests to compare the previous intercept-adjusted residual distribution of $\hat{\varepsilon}_{ijk'}$ with $\hat{\varepsilon}^{FR}_{ijk} \equiv (\hat{\alpha}_k + h_{ij}(\hat{\beta}_k - \hat{\beta}_{k'}) + \hat{\tilde{\varepsilon}}_{ijk})$, $k \neq k'$.
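To make the two residual constructions concrete, the following sketch computes partial and full residuals for a single class size group. It assumes, purely for illustration, one scalar covariate in place of the lengthy attribute vector, and uses a hand-rolled least-squares fit; the names `ols_line`, `partial_residuals`, and `full_residuals` are hypothetical.

```python
def ols_line(h, t):
    """Least-squares intercept and slope of t on a single regressor h."""
    n = len(h)
    h_bar = sum(h) / n
    t_bar = sum(t) / n
    sxx = sum((v - h_bar) ** 2 for v in h)
    sxy = sum((v - h_bar) * (w - t_bar) for v, w in zip(h, t))
    beta = sxy / sxx
    alpha = t_bar - beta * h_bar
    return alpha, beta


def partial_residuals(h, t):
    """Intercept-adjusted residuals (PR): alpha_k + first-stage residual."""
    alpha, beta = ols_line(h, t)
    return [alpha + (ti - alpha - hi * beta) for hi, ti in zip(h, t)]


def full_residuals(h, t, beta_ref):
    """Full residuals (FR) relative to a reference group's slope beta_ref.

    Adds h * (beta_k - beta_ref) to the partial residual, which is
    algebraically the same as t - h * beta_ref.
    """
    alpha, beta = ols_line(h, t)
    return [
        alpha + hi * (beta - beta_ref) + (ti - alpha - hi * beta)
        for hi, ti in zip(h, t)
    ]


# Hypothetical data: group k has slope 2, the reference group k' has slope 1.
h_k, t_k = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
h_ref, t_ref = [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]
_, beta_ref = ols_line(h_ref, t_ref)

pr_k = partial_residuals(h_k, t_k)
fr_k = full_residuals(h_k, t_k, beta_ref)
```

Here the partial residuals of group $k$ are constant, because its own returns absorb all variation in $h$, while the full residuals retain the part of the score that stems from the difference in returns.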

To aid in the comparison of the various residual dominance results, we also present the results from standard Oaxaca–Blinder parametric decompositions. Specifically, the mean test score gap between any two class sizes, $k$ and $k'$, may be expressed as

$$\bar{t}_k - \bar{t}_{k'} = \underbrace{(\alpha_k - \alpha_{k'})}_{U} + \underbrace{(\bar{h}_k - \bar{h}_{k'})\beta_{k'}}_{E} + \underbrace{\bar{h}_k(\beta_k - \beta_{k'})}_{C} \tag{8}$$

where class k′ is treated as the dominant category. If the difference in returns (term C) in (8) is large in absolute value, then the two residual tests may be expected to yield disparate results. Moreover, if the three sets of results (two sets of residual tests and the one set of unconditional tests) offer different inferences, one may infer a significant association between class size, the distribution of test scores, the set of conditioning variables, and the returns to the conditioning variables.
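The decomposition in (8) can be verified numerically. The sketch below again assumes a single scalar covariate and hypothetical data; `ols_line` and `oaxaca_blinder` are illustrative names, not functions from any library. Because least-squares lines pass through the group means, the three terms sum to the raw mean gap.

```python
def ols_line(h, t):
    """Least-squares intercept and slope of t on a single regressor h."""
    n = len(h)
    h_bar = sum(h) / n
    t_bar = sum(t) / n
    sxx = sum((v - h_bar) ** 2 for v in h)
    sxy = sum((v - h_bar) * (w - t_bar) for v, w in zip(h, t))
    beta = sxy / sxx
    return t_bar - beta * h_bar, beta


def oaxaca_blinder(h_k, t_k, h_ref, t_ref):
    """Return the U, E, and C terms of (8), with the reference group as k'."""
    a_k, b_k = ols_line(h_k, t_k)
    a_ref, b_ref = ols_line(h_ref, t_ref)
    hbar_k = sum(h_k) / len(h_k)
    hbar_ref = sum(h_ref) / len(h_ref)
    u = a_k - a_ref                    # difference in intercepts
    e = (hbar_k - hbar_ref) * b_ref    # difference in endowments
    c = hbar_k * (b_k - b_ref)         # difference in returns
    return u, e, c


# Hypothetical groups: k has mean score 3, the reference group has mean 2.
h_k, t_k = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
h_ref, t_ref = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]

u, e, c = oaxaca_blinder(h_k, t_k, h_ref, t_ref)
mean_gap = sum(t_k) / len(t_k) - sum(t_ref) / len(t_ref)
```

In this example the endowment term is negative (group $k$ has lower average $h$) and is exactly offset by the returns term, so the whole mean gap is attributed to the intercept difference.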

3.2. Inference

The asymptotic null distributions of the test statistics, $d$ and $s$, depend on the unknown distributions $F$ and $G$. In the analysis below, we first approximate the empirical distribution of the test statistics using simple bootstrap methods as in Maasoumi and Heshmati (2000) and Maasoumi and Millimet (2005) and report the estimated significance level. To evaluate the null $H_0: d \leq 0$, we first report in our tables whether the observed empirical distributions are seemingly rankable by FSD or SSD; we present the sample values of $\max\{d_1\}$, $\max\{d_2\}$, $\hat{d}$, $\max\{s_1\}$, $\max\{s_2\}$, and $\hat{s}$ (see Appendix A). We then obtain bootstrap estimates of the probability that $d$ lies in the nonpositive interval (i.e., $\Pr\{d \leq 0\}$) using the relative frequency of $\{\hat{d}^{*} \leq 0\}$, where $\hat{d}^{*}$ is the bootstrap estimate of $d$ (500 repetitions are used).9 If this interval has a large probability, say 0.90 or higher, and $\hat{d} \leq 0$, we may infer dominance to a desirable degree of confidence. If this interval has a low probability, say 0.10 or smaller, and $\hat{d} > 0$, we may infer the presence of significant crossings of the empirical CDFs, implying an inability to rank the outcomes. Finally, if the probability lies in the intermediate range, say between 0.10 and 0.90, there is insufficient evidence to distinguish between equal and unrankable distributions. This is a classic confidence interval test; specifically, we are assessing the

9 Note that we also report simple bootstrap estimates of $\Pr\{d^{*} \geq \hat{d}\}$. These are provided to facilitate visualization of the simple bootstrap distribution.


likelihood that the event $d \leq 0$ has occurred. Similarly, we estimate $\Pr\{s \leq 0\}$ to evaluate the second order dominance proposition given by $H_0: s \leq 0$ (see footnote 10).
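The simple bootstrap and the accompanying decision rule can be sketched as follows. This is an illustrative approximation under stated assumptions rather than the authors' algorithm (which is given in Appendix A): observations are resampled independently with replacement within each sample, the statistic is evaluated on the pooled points of each resample, and the function names and the labels returned by `classify` are hypothetical. The 0.90 and 0.10 thresholds and the 500 repetitions are taken from the text.

```python
import random
from math import sqrt


def d_statistic(x, y):
    """Empirical counterpart of d in (3), evaluated on the pooled points."""
    n, m = len(x), len(y)
    grid = sorted(set(x) | set(y))
    diffs = [
        sum(v <= z for v in x) / n - sum(v <= z for v in y) / m
        for z in grid
    ]
    return sqrt(n * m / (n + m)) * min(max(diffs), max(-v for v in diffs))


def simple_bootstrap_prob(x, y, reps=500, seed=0):
    """Relative frequency of {d* <= 0} across bootstrap resamples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = rng.choices(x, k=len(x))  # N draws from the X sample
        ys = rng.choices(y, k=len(y))  # M draws from the Y sample
        hits += d_statistic(xs, ys) <= 0
    return hits / reps


def classify(prob_nonpositive, d_hat, high=0.90, low=0.10):
    """Map the bootstrap probability and sample statistic to an inference."""
    if prob_nonpositive >= high and d_hat <= 0:
        return "dominance"
    if prob_nonpositive <= low and d_hat > 0:
        return "unrankable"
    return "inconclusive"


# Hypothetical, fully separated samples: every resample is rankable.
x = list(range(10, 20))
y = list(range(0, 10))
prob = simple_bootstrap_prob(x, y, reps=50)
verdict = classify(prob, d_statistic(x, y))
```

Replacing `d_statistic` with the integrated version from (4) gives the analogous estimate of $\Pr\{s \leq 0\}$.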

As an alternative, we also evaluate the less decisive dominance proposition $H_0: d = 0$ via the Linton et al. (2005) recentered bootstrap procedure, which the authors demonstrate provides a consistent test. It is known that $\hat{d}$ converges to $d$ under general conditions (likewise for the SSD statistic). However, under the null $H_0: d = 0$, centering of computations around their corresponding sample values introduces second order errors that are negligible for first order (asymptotic) approximations, but it is desirable for removing some uncertainties due to estimation of unknown parameters and distributions. This is the source of improvement in bootstrap power gained from recentering. The other source of improvement arising from recentering pertains to the technique’s robustness to within-sample dependence.

Utilizing the algorithm detailed in Appendix A, we obtain recentered bootstrap p-values in the classical sense as the relative frequency of {d̂∗∗ > d̂}, where d̂∗∗ is the recentered bootstrap estimate of d. If Pr{d̂∗∗ > d̂} is low, say 0.10 or smaller, we reject the null Ho : d = 0; if this p-value is greater than 0.10, we do not reject the null.11 It is important to emphasize, however, that while rejection of the null provides valuable insight in the recentered bootstrap case, failure to reject the null provides less information. If we reject the null and d̂ < 0, we may infer dominance to a desirable degree of confidence. Conversely, if we reject the null and d̂ > 0, we may infer unequal, but unrankable, distributions. These are both strong findings, as the former (latter) indicates that all (not all) increasing social welfare functions will concur on the relative rankings of the distributions in question. On the other hand, failure to reject the null merely implies that we cannot eliminate the possibility that F = G; strict dominance also cannot be ruled out to some degree of confidence. Seen in this light, the recentered bootstrap is a conservative test. In the discussion of the results, we focus more heavily on the more decisive simple bootstrap for inference. Similarly, we report the relative frequency of {ŝ∗∗ > ŝ} and {ŝ∗∗ > 0} to evaluate the null Ho : s = 0.
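As a rough illustration of the recentering idea, the following Python sketch computes a p-value as the relative frequency of {d̂∗∗ > d̂}. It is a simplified stand-in, not the Appendix A algorithm: it assumes independent resampling within each sample, recenters the resampled CDF difference at its sample value, and reuses the assumed sqrt(NM/(N + M)) scaling; all names are hypothetical.

```python
import bisect
import math
import random


def ecdf_values(sample, grid):
    """Empirical CDF of the sample evaluated at each grid point."""
    xs = sorted(sample)
    n = len(xs)
    return [bisect.bisect_right(xs, z) / n for z in grid]


def d_from_gaps(gaps, scale):
    """min(max(F - G), max(G - F)) from a list of F - G values."""
    return scale * min(max(gaps), max(-g for g in gaps))


def recentered_pvalue(x, y, grid, reps=500, seed=0):
    """Return (d_hat, relative frequency of {d** > d_hat})."""
    rng = random.Random(seed)
    scale = math.sqrt(len(x) * len(y) / (len(x) + len(y)))
    fx, fy = ecdf_values(x, grid), ecdf_values(y, grid)
    gap_hat = [a - b for a, b in zip(fx, fy)]
    d_hat = d_from_gaps(gap_hat, scale)
    exceed = 0
    for _ in range(reps):
        xb = rng.choices(x, k=len(x))
        yb = rng.choices(y, k=len(y))
        fxb, fyb = ecdf_values(xb, grid), ecdf_values(yb, grid)
        # Recentre: subtract the sample gap so the resampled gap
        # fluctuates around zero, mimicking the null d = 0.
        gap_star = [(a - b) - g for a, b, g in zip(fxb, fyb, gap_hat)]
        exceed += d_from_gaps(gap_star, scale) > d_hat
    return d_hat, exceed / reps
```

With clearly crossing CDFs the sample statistic d̂ is large relative to the recentered draws, so the p-value is small and the null Ho : d = 0 is rejected with d̂ > 0 (unequal but unrankable distributions).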

10Note that we do not impose and or the least favorable case (LFC) of equality of the distributions. This could be done by combining the data on X and Y and bootstrapping from the combined sample (e.g., Abadie, 2002). Our bootstrap samples still contain N (M) observations from X (Y). As argued in Linton et al. (2005), working under LFC has some undesirable power consequences as it can produce biased tests that are not similar on the boundary of the null. This happens when the boundary of the null itself is composite. The bootstrap methods employed herein, combined with a fixed critical value at zero (the boundary of the null hypothesis), render our tests “asymptotically similar” and unbiased on the boundary (Maasoumi and Heshmati, 2005).

11We also report Pr{d̂∗∗ ≤ 0} in the tables. This allows the reader to see the significance level (size) of the test associated with the special critical value zero. In our tables, these are obtained simply as Pr{d̂∗∗ > 0} = 1 − Pr{d̂∗∗ ≤ 0}.


A final, necessary comment pertains to inference in the FR tests (i.e., those incorporating the Oaxaca–Blinder decomposition). Owing to the usage of a common set of coefficient estimates in obtaining both residual distributions being compared, there necessarily exists between-sample dependence. For example, the FR test using data on small (k = 1) and medium (k = 2) classes compares the distributions of $\hat{\varepsilon}_{ij1}$ and $\hat{\varepsilon}^{FR}_{ij2}$. The former depends on $\{t_{ij1}, h_{ij1}, \beta_1(t_1, h_1)\}$, where $t_1$ and $h_1$ represent the full data vector for t and h for the sample of students in small classes; the latter, $\hat{\varepsilon}^{FR}_{ij2} = t_{ij2} - h_{ij2}\beta_1$, depends on $\{t_{ij2}, h_{ij2}, \beta_1(t_1, h_1)\}$. This source of dependence is atypical. Between-sample dependence usually arises when the same individuals appear in the two samples being compared (e.g., distributions of pre- and posttax incomes for a sample of individuals). To handle this more common type of between-sample dependence, pairwise (or clustered) bootstrap samples are drawn in order to maintain the dependence in the resampled data (Linton et al., 2005). In the current situation, the between-sample dependence is maintained by reestimating the first-stage equations (6) and (7) on each bootstrap resample. Specifically, by resampling N observations $\{t^*_{ij1}, h^*_{ij1}\}$ and M observations $\{t^*_{ij2}, h^*_{ij2}\}$ nonparametrically and reestimating (6), we obtain the resampled distributions of $\hat{\varepsilon}^*_{ij1}$ and $\hat{\varepsilon}^{FR*}_{ij2}$, where the former depends on $\{t^*_{ij1}, h^*_{ij1}, \beta^*_1(t^*_1, h^*_1)\}$ and the latter depends on $\{t^*_{ij2}, h^*_{ij2}, \beta^*_1(t^*_1, h^*_1)\}$. Thus, as in the usual pairwise bootstrap case, the source of between-sample dependence is maintained in the resampling procedure.
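The resampling step for the FR tests can be sketched in Python as follows. This is an illustrative simplification: the control vector h is reduced to a single regressor plus a constant so that the first stage has a closed-form OLS solution, and the function names are hypothetical rather than taken from the paper.

```python
import random


def ols(t, h):
    """Intercept and slope from regressing scores t on one control h."""
    n = len(t)
    mt, mh = sum(t) / n, sum(h) / n
    sxy = sum((a - mh) * (b - mt) for a, b in zip(h, t))
    sxx = sum((a - mh) ** 2 for a in h)
    slope = sxy / sxx
    return mt - slope * mh, slope


def residuals(t, h, coef):
    """Scores net of the fitted first-stage prediction."""
    intercept, slope = coef
    return [ti - (intercept + slope * hi) for ti, hi in zip(t, h)]


def fr_bootstrap_draw(small, medium, rng):
    """One resample of the two residual series; inputs are lists of (t, h)."""
    s = rng.choices(small, k=len(small))    # N pairs from small classes
    m = rng.choices(medium, k=len(medium))  # M pairs from medium classes
    ts, hs = zip(*s)
    tm, hm = zip(*m)
    # Re-estimate the first stage on the resampled small-class data, then
    # apply the SAME coefficients to both groups. This is what carries the
    # between-sample dependence into every bootstrap replication.
    beta1_star = ols(ts, hs)
    return residuals(ts, hs, beta1_star), residuals(tm, hm, beta1_star)
```

Feeding each pair of resampled residual series into the dominance statistic yields the bootstrap distribution used for inference.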

4. DATA

The data are obtained from the National Education Longitudinal Study (NELS) of 1988, a large-scale longitudinal study of high school students conducted by the National Center for Education Statistics (NCES). The NELS contains a nationally representative sample of eighth graders first surveyed in the spring of 1988. Follow-up surveys were administered to the respondents in 1990 and 1992. The original sample was chosen by initially sampling some 1,000 public and private schools from a universe of approximately 40,000 schools containing eighth grade students, and then drawing random samples of approximately 24–26 eighth grade students per school. The original sample, therefore, contains roughly 25,000 eighth grade students.

The students were administered cognitive tests in reading, social studies, mathematics and science during the base year (1988), first follow-up (1990), and second follow-up (1992). The grade specific tests contained material appropriate for each grade but included sufficient overlap with the exams for the other grades to permit measurement of academic growth.12 For each sampled student, the teachers from two of the four subjects were surveyed, thereby yielding information on class size. Thus, we have two observations for each student in each wave of the survey after we match the class size with the test score for that particular subject. The NELS also supplies general descriptive information about the school obtained from the chief administrator of each school in the sample.

Following Boozer and Rouse (2001), we construct two samples, each including only students attending public schools. The first sample uses test score results from the first follow-up (tenth grade) and information on eighth grade class size, and contains 12,412 students from 762 schools (23,549 total observations).13 The second sample is analogously defined using information from tenth and twelfth grades, and contains 8,685 students from 756 schools (14,796 total observations), considerably smaller than the previous sample due to attrition (either dropping out of the survey or relocating to an out-of-sample school). All results are weighted using the appropriate sample weights.
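One way sample weights can enter a distributional comparison is through a weighted empirical CDF. The short Python sketch below shows that construction; how the weights are actually applied in the analysis is not spelled out in this section, so this is an assumption, and the function name is hypothetical.

```python
def weighted_ecdf(values, weights):
    """Return z -> weighted share of observations with value <= z."""
    total = sum(weights)
    pairs = list(zip(values, weights))

    def cdf(z):
        return sum(w for v, w in pairs if v <= z) / total

    return cdf
```

For example, with scores [1, 2, 3] and weights [1, 1, 2], half of the weighted population has a score of 2 or less.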

Because there is a continuum of class sizes to which students belong, we classify each student-test observation in each sample into one of three groups to make the number of SD tests manageable. The groupings are: small (19 or fewer students), medium (20–30 students), and large (more than 30 students).
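The three-way grouping can be written as a small Python helper. The cutoffs are those stated in the text; the function name and labels are illustrative only.

```python
def class_size_group(n_students):
    """Map a class size to the small / medium / large grouping."""
    if n_students <= 19:
        return "small"   # 19 or fewer students
    if n_students <= 30:
        return "medium"  # 20 to 30 students
    return "large"       # more than 30 students
```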

To obtain the residual test scores, we utilize an extensive set of individual, family, class, teacher, and school characteristics available from the NELS. Specifically, the vector h in (6) includes controls for the following (in addition to a constant term):

Individual: race, gender, an indicator for limited English proficiency (LEP), and lagged test scores;

Family: father’s education, mother’s education, family composition, number of siblings, an indicator for the presence of a home computer, family socioeconomic status, and dummy variables indicating if family composition and number of siblings are missing;

Class: subject and the overall relative ability of the class;

Teacher: race, gender, experience, education, indicators if the student and teacher are of the same race and of the same gender;

School: urban/rural status, region, total school enrollment, grade-level enrollment, number of full-time teachers, number of teachers by race, level of student disruptions in class, percentage of minority students in the school, teacher salaries, length of school year, percentage of students in remedial reading, remedial math, and bilingual education, and dummy variables indicating if teacher salary information, school year length, the number of full-time teachers by race, and the percentage of students in remedial reading, remedial math, and bilingual education are missing.

12We follow Goldhaber and Brewer (1997) and Boozer and Rouse (2001) and utilize the raw item response theory (IRT) scores for each test.

13The exams may be administered as early as January, midway through the school year. As a result, we follow the lead of Boozer and Rouse (2001) and initially examine the effect of previous class size on current test scores. See also Hoxby (2000a).

Summary statistics for the samples are available upon request.

Before continuing, a few comments are warranted. First, the inclusion of lagged test scores proxies for innate ability, following the strategy of Eide and Showalter (1998), Dearden et al. (2002), and others. Lagged test scores also control for all previous inputs into the educational production process, giving the results a “value-added” interpretation (Goldhaber and Brewer, 1997; Rivkin et al., 2002; Todd and Wolpin, 2003). In addition, we condition on the teacher’s subjective assessment of the overall ability of the class from which the test score is taken.14 Such ability controls are vital to circumventing the potential endogeneity arising from endogenous residential choice and nonrandomness in the assignment of students to classes by schools (Betts and Shkolnik, 2000). The ability level of the class also captures peer effects that have been shown to be important (Dobbelsteen et al., 2002; Levin, 2001). Second, to minimize further any potential spurious correlation between class size and other determinants of student achievement, we condition on a fairly substantial vector of school attributes. Since actual class size is only observed ex post, while school attributes are observable ex ante, controlling for school characteristics not only removes their direct effect on student performance but also proxies for family background traits and parental involvement.15 Third, as argued in Hanushek (1979), controlling for family attributes such as socioeconomic status and parental education levels also severely mitigates any bias resulting from endogenous residential choice.

5. RESULTS

5.1. Unconditional SD Tests

The initial SD tests involve comparing the unconditional test score distributions across students differentiated by class size. Results are provided in Table 1; Panel A examines tenth grade test scores as a function of eighth grade class size, and Panel B examines twelfth grade test scores as a function of tenth grade class size. The corresponding CDFs, integrated CDFs, and differences in the CDFs are plotted in Figures 1 and 2, respectively.

14The response options are (i) the class is of above average ability (relative to the school), (ii) the class is of average ability, (iii) the class is of below average ability, or (iv) the class has students of widely varying abilities. Figlio and Paige (2002) also make use of this variable.

15For instance, when looking at homes for sale at http://www.realtor.com, links are provided for (virtually) every house in the US in order to view the average pupil–teacher ratio and total enrollment for that location (along with average SAT scores and percentage of students continuing to college). See also Todd and Wolpin (2003).

TABLE 1 Unconditional stochastic dominance tests

A. 8th Grade Class Size: 10th Grade Test Scores

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  None     Simple      2.150       1.533       1.533       0.000       0.000       0.000       0.622       0.598       0.514
                         Recentered                                      0.016       0.012       0.028       0.030       0.090       0.000
Small   Large   None     Simple      1.636       0.804       0.804       0.002       0.000       0.002       0.658       0.814       0.752
                         Recentered                                      0.014       0.028       0.042       0.136       0.486       0.126
Medium  Large   None     Simple      0.714       0.648       0.648       0.000       0.002       0.002       0.774       0.818       0.640
                         Recentered                                      0.020       0.018       0.038       0.608       0.618       0.308

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  None     Simple      246.826     9.049       9.049       0.000       0.266       0.266       0.480       0.530       0.528
                         Recentered                                      0.148       0.130       0.278       0.092       0.666       0.336
Small   Large   None     Simple      153.163     4.385       4.385       0.096       0.122       0.218       0.538       0.634       0.522
                         Recentered                                      0.220       0.328       0.548       0.242       0.602       0.326
Medium  Large   None     Simple      26.211      41.308      26.211      0.354       0.100       0.454       0.538       0.594       0.198
                         Recentered                                      0.266       0.328       0.594       0.586       0.456       0.114

B. 10th Grade Class Size: 12th Grade Test Scores

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  M SSD S  Simple      2.720       1.169       1.169       0.000       0.000       0.000       0.658       0.658       0.658
                         Recentered                                      0.008       0.010       0.018       0.000       0.132       0.000
Small   Large   L SSD S  Simple      4.149       0.333       0.333       0.000       0.012       0.012       0.608       0.728       0.728
                         Recentered                                      0.010       0.004       0.014       0.000       0.838       0.704
Medium  Large   L FSD M  Simple      2.935       0.044       0.044       0.000       0.134       0.134       0.578       0.804       0.804
                         Recentered                                      0.008       0.008       0.016       0.000       0.984       0.968

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  M SSD S  Simple      373.404     −0.439      −0.439      0.000       0.882       0.882       0.524       0.598       0.598
                         Recentered                                      0.198       0.184       0.382       0.004       0.992       0.986
Small   Large   L SSD S  Simple      547.169     −0.174      −0.174      0.000       0.870       0.870       0.542       0.496       0.496
                         Recentered                                      0.152       0.190       0.342       0.000       0.940       0.892
Medium  Large   L FSD M  Simple      477.344     0.112       0.112       0.000       0.408       0.408       0.518       0.532       0.532
                         Recentered                                      0.184       0.176       0.360       0.000       0.808       0.602

Small classes have fewer than 20 students; medium classes have between 20 and 30 students; large classes have 31 or more students. All results use appropriate panel weights. Probabilities obtained via 500 bootstrap repetitions (first row: simple bootstrap; second row: recentered bootstrap). No observed ranking implies only that the distributions are not rankable in the first- or second-degree sense. See text for further details.

FIGURE 1 CDFs and integrated CDFs: unconditional 10th grade test scores by 8th grade class size. Small, <20 students, Medium, 20–30 students, Large, >30 students.

Comparing the empirical distributions of tenth grade test scores across small, medium, and large classes (Panel A), we find no instance where we are able to rank the distributions in either the first- or second-degree sense, although a clear ordering exists for mean test scores.16 Moreover, the simple bootstrap indicates that the crossings of the CDFs in each case are statistically meaningful at the p < 0.01 level. The recentered bootstrap confirms this finding, indicating rejection of the null Ho : d = 0 when comparing small and medium classes (p-value = 0.000), rejecting strict dominance and equality of the distributions; the null is nearly rejected when comparing small and large classes as well (p-value = 0.126). The recentered bootstrap also nearly rejects the null Ho : s = 0 when comparing medium and large classes (p-value = 0.114), rejecting strict second order dominance and equality of the distributions.

The lack of first- and second-degree SD is an extremely powerful result. For example, if the test score distribution for small classes first order dominated that for medium (or large) classes, then any policymaker with a social welfare function increasing in test scores would prefer smaller class sizes. Similarly, a finding of SSD would imply that any policymaker with a social welfare function that is increasing and averse to dispersion in test scores would prefer smaller class sizes. However, we find no such dominance relations; individuals with different preference functions in the class U1 or U2 can reasonably disagree about the efficacy of smaller classes.
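The notion of an observed ranking can be made concrete with a small Python sketch that compares two samples on an equally spaced grid: X first order dominates Y when its empirical CDF lies weakly below that of Y everywhere, and second order dominates when the same holds for the integrated CDFs. The Riemann-sum integration, the grid, and the function names are assumptions of this sketch, not the Appendix A procedure.

```python
import bisect


def cdf_on_grid(sample, grid):
    """Empirical CDF of the sample at each grid point."""
    xs = sorted(sample)
    return [bisect.bisect_right(xs, z) / len(xs) for z in grid]


def integrated(cdf_vals, step):
    """Running Riemann sum of the CDF over an equally spaced grid."""
    out, total = [], 0.0
    for v in cdf_vals:
        total += v * step
        out.append(total)
    return out


def observed_ranking(x, y, grid):
    """Return 'X FSD Y', 'Y FSD X', 'X SSD Y', 'Y SSD X' or 'None'."""
    step = grid[1] - grid[0]
    fx, fy = cdf_on_grid(x, grid), cdf_on_grid(y, grid)
    if all(a <= b for a, b in zip(fx, fy)):
        return "X FSD Y"
    if all(b <= a for a, b in zip(fx, fy)):
        return "Y FSD X"
    ix, iy = integrated(fx, step), integrated(fy, step)
    if all(a <= b for a, b in zip(ix, iy)):
        return "X SSD Y"
    if all(b <= a for a, b in zip(ix, iy)):
        return "Y SSD X"
    return "None"
```

A return value of "None" corresponds to the case discussed above, in which increasing (and dispersion-averse) welfare functions need not agree on a ranking.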

Examining the actual plots (Figure 1), we see that the three CDFs and integrated CDFs are extremely similar, never differing by more than four test points in any part of the distribution.17 Nonetheless, the plot of the differences in the CDFs (corresponding to estimates of the QTEs) is still revealing, indicating that medium and large classes are most similar, while small classes are quite distinct. Specifically, small classes outperform medium and large classes at the upper end of the distribution (above the 70th percentile), with the converse holding below the 70th percentile.

16The average test score is highest in small classes (31.24), followed by large classes (31.21), and then medium classes (31.17).

FIGURE 2 CDFs and integrated CDFs: unconditional 12th grade test scores by 10th grade class size. Small, <20 students, Medium, 20–30 students, Large, >30 students.

The twelfth grade test scores (Panel B) reveal several instances where the empirical distributions of test scores are rankable, and in all such cases it is the distribution from the larger class size that dominates the empirical distribution from the smaller class size.18 However, not all of these rankings are statistically meaningful at conventional levels, highlighting the need for formal statistical testing. Specifically, we observe Medium SSD Small, Large SSD Small, and Large FSD Medium. The simple bootstrap yields a marginally significant Pr(s ≤ 0) = 0.882 and 0.870 in the first two cases, and a Pr(d ≤ 0) = 0.134 (Pr(s ≤ 0) = 0.408) in the final case. Moreover, the recentered bootstrap fails to reject the null Ho : s = 0 in any of the three cases; it does reject the null Ho : d = 0 in the test of small versus medium classes (p-value = 0.000), thereby rejecting strict dominance and equality of the distributions. Thus there is at best modest evidence that uniform rankings favor larger classes when (i) unconditional test scores and (ii) the dispersion of test scores are considered.

Examining the actual plots (Figure 2), we see that, as in Figure 1, the three CDFs and integrated CDFs are extremely similar, never differing by more than four points in any part of the distribution.19 However, as above, the plot of the differences in the CDFs (i.e., the QTEs) is informative, indicating that large classes are preferable to medium classes over the entire distribution (consonant with the observed FSD ranking) and that the gains from large classes (relative to medium classes) are fairly uniform across the distribution. Furthermore, as was the case with eighth grade class size, small classes outperform medium and large classes at the upper end of the distribution (roughly above the 70th percentile), with the reverse holding at the lower end.

17The standard deviation of tenth grade test scores is 12.08.

18Focusing on the mean test scores by class size grouping also favors larger classes. The average twelfth grade test score is 36.6 in large classes, 35.5 in medium classes, and 35.0 in small classes.

19The standard deviation of twelfth grade test scores is 13.60.

In sum, the unconditional tests assessing the impact of eighth grade class size refute the existence of a uniform ranking—a ranking robust to the choice of specific preference function—of test score distributions across class size categories. This finding highlights the false sense of decisiveness that one gets from summary comparisons, such as those based on mean treatment effects. The unconditional tests based on tenth grade class size, however, provide modest evidence of a uniform ranking of test score distributions across class size categories. However, such rankings, to the extent that they exist, are only obtained when one incorporates dispersion of test scores into the welfare criteria. Moreover, the distributional approach reveals exactly who gains and who loses (in the unconditional sense) from smaller classes: students in the upper (lower) tail of the distribution gain from smaller (larger) classes. These findings are merely suggestive, however, as they fail to control for observables correlated with both class size and student achievement. For instance, if low-achieving students are allocated to smaller classes, this may explain the lower test scores found in smaller classes below the median. To determine if the unconditional results hold once we purge test scores of such potential confounders, we turn to the conditional SD tests.
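The "who gains and who loses" reading rests on comparing the two distributions quantile by quantile. A minimal Python sketch of such a comparison follows; it uses the standard library's decile cut points on unweighted data, and the function and argument names are hypothetical.

```python
import statistics


def quantile_differences(group_a, group_b, n=10):
    """Q_a(p) - Q_b(p) at the n - 1 interior cut points (deciles if n=10)."""
    qa = statistics.quantiles(group_a, n=n, method="inclusive")
    qb = statistics.quantiles(group_b, n=n, method="inclusive")
    return [a - b for a, b in zip(qa, qb)]
```

A sequence of differences that changes sign across the distribution, positive in one tail and negative in the other, is exactly the pattern that a comparison of means would hide.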

5.2. Conditional SD Tests

5.2.1. Tenth Grade Test Scores

The PR and FR dominance test results for the pairwise comparisons involving tenth grade test scores are displayed in Table 2.20 The corresponding plots are displayed in Figures 3–5. Examining the PR results (Panel A), we find that in every pairwise comparison, the empirical distribution for larger classes first order dominates the corresponding distributions for smaller and medium classes. Moreover, both rankings are statistically significant at conventional levels according to the simple bootstrap (Large FSD Small: Pr(d ≤ 0) = 0.986; Large FSD Medium: Pr(d ≤ 0) = 0.924). The recentered bootstrap fails to reject the null Ho : d = 0 or s = 0 (Large FSD Small: p-value = 0.922; Large FSD Medium: p-value = 0.874). In addition, we observe that Medium FSD Small, although this ranking is only marginally significant according to the simple bootstrap (Pr(d ≤ 0) = 0.882); the recentered bootstrap fails to reject the null of equality (p-value = 0.604). However, the second order relation is statistically significant at conventional levels according to the simple bootstrap (Pr(s ≤ 0) = 0.938). Examination of Figure 3 reveals that disparities in PR test scores (the QTEs) are fairly uniform across the entire distribution: at any quantile of the distribution, students in large (medium) classes score roughly ten (two) points higher than students in small classes.

20First-stage regression results are not presented but are available from the authors upon request.

TABLE 2 Conditional stochastic dominance tests

A. 8th Grade Class Size: 10th Grade Test Scores (Partial Residual)

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  M FSD S  Simple      15.571      −0.902      −0.902      0.148       0.734       0.882       0.594       0.466       0.438
                         Recentered                                      0.324       0.358       0.682       0.366       0.780       0.604
Small   Large   L FSD S  Simple      30.681      −1.084      −1.084      0.008       0.978       0.986       0.552       0.584       0.584
                         Recentered                                      0.252       0.148       0.400       0.000       0.930       0.922
Medium  Large   L FSD M  Simple      29.703      −0.945      −0.945      0.042       0.882       0.924       0.420       0.504       0.498
                         Recentered                                      0.380       0.158       0.538       0.002       0.934       0.874

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  M FSD S  Simple      3697.722    −0.985      −0.985      0.196       0.742       0.938       0.576       0.440       0.426
                         Recentered                                      0.358       0.486       0.844       0.334       0.598       0.368
Small   Large   L FSD S  Simple      7391.206    −1.084      −1.084      0.008       0.986       0.994       0.518       0.582       0.582
                         Recentered                                      0.378       0.370       0.748       0.002       0.736       0.728
Medium  Large   L FSD M  Simple      7014.720    −0.945      −0.945      0.052       0.920       0.972       0.404       0.502       0.484
                         Recentered                                      0.462       0.316       0.778       0.014       0.818       0.754

B. 8th Grade Class Size: 10th Grade Test Scores (Full Residual)

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  S SSD M  Simple      1.256       1.615       1.256       0.000       0.000       0.000       0.978       0.896       0.962
                         Recentered                                      0.000       0.000       0.000       0.816       0.534       0.584
Small   Large   None     Simple      1.220       1.484       1.220       0.000       0.000       0.000       0.902       0.828       0.800
                         Recentered                                      0.004       0.000       0.004       0.796       0.636       0.558
Medium  Large   M SSD L  Simple      0.212       3.084       0.212       0.052       0.000       0.052       0.702       0.442       0.702
                         Recentered                                      0.000       0.004       0.004       1.000       0.052       0.988

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  S SSD M  Simple      −0.422      194.296     −0.422      0.526       0.000       0.526       0.620       0.890       0.620
                         Recentered                                      0.156       0.026       0.182       0.918       0.212       0.914
Small   Large   None     Simple      9.341       144.512     9.341       0.398       0.002       0.400       0.536       0.788       0.530
                         Recentered                                      0.256       0.056       0.312       0.672       0.422       0.588
Medium  Large   M SSD L  Simple      −0.076      461.819     −0.076      0.586       0.000       0.586       0.474       0.428       0.474
                         Recentered                                      0.122       0.182       0.304       0.880       0.006       0.738

C. 10th Grade Class Size: 12th Grade Test Scores (Partial Residual)

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  S FSD M  Simple      −0.108      1.760       −0.108      0.574       0.216       0.790       0.450       0.590       0.260
                         Recentered                                      0.272       0.450       0.722       0.908       0.372       0.606
Small   Large   L FSD S  Simple      4.854       −0.165      −0.165      0.174       0.650       0.824       0.660       0.400       0.250
                         Recentered                                      0.272       0.414       0.686       0.504       0.662       0.430
Medium  Large   L FSD M  Simple      8.381       −0.573      −0.573      0.122       0.644       0.766       0.586       0.464       0.420
                         Recentered                                      0.300       0.298       0.598       0.384       0.874       0.774

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  S FSD M  Simple      −0.362      394.168     −0.362      0.620       0.226       0.846       0.410       0.580       0.272
                         Recentered                                      0.418       0.480       0.898       0.740       0.332       0.438
Small   Large   L FSD S  Simple      1121.477    −0.291      −0.291      0.188       0.746       0.934       0.630       0.314       0.178
                         Recentered                                      0.312       0.518       0.830       0.466       0.596       0.396
Medium  Large   L FSD M  Simple      1916.552    −0.599      −0.599      0.132       0.758       0.890       0.574       0.402       0.350
                         Recentered                                      0.356       0.396       0.752       0.334       0.818       0.720

D. 10th Grade Class Size: 12th Grade Test Scores (Full Residual)

First order dominance
X       Y       Ranking  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1∗≤0}   Pr{d2∗≤0}   Pr{d∗≤0}    Pr{d1∗≥d1}  Pr{d2∗≥d2}  Pr{d∗≥d}
Small   Medium  None     Simple      1.523       0.582       0.582       0.000       0.000       0.000       0.894       0.976       0.976
                         Recentered                                      0.002       0.000       0.002       0.450       0.946       0.886
Small   Large   None     Simple      1.898       0.366       0.366       0.000       0.006       0.006       0.722       0.864       0.864
                         Recentered                                      0.008       0.002       0.010       0.200       0.966       0.894
Medium  Large   None     Simple      0.867       0.452       0.452       0.000       0.004       0.004       0.840       0.746       0.720
                         Recentered                                      0.000       0.000       0.000       0.672       0.948       0.872

Second order dominance
X       Y       Ranking  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1∗≤0}   Pr{s2∗≤0}   Pr{s∗≤0}    Pr{s1∗≥s1}  Pr{s2∗≥s2}  Pr{s∗≥s}
Small   Medium  None     Simple      231.479     48.165      48.165      0.086       0.000       0.086       0.442       0.968       0.818
                         Recentered                                      0.392       0.006       0.398       0.056       0.834       0.218
Small   Large   None     Simple      346.350     19.359      19.359      0.074       0.032       0.106       0.430       0.792       0.706
                         Recentered                                      0.294       0.086       0.380       0.038       0.820       0.242
Medium  Large   None     Simple      123.885     2.296       2.296       0.110       0.270       0.380       0.514       0.670       0.532
                         Recentered                                      0.154       0.090       0.244       0.192       0.834       0.604

Small classes have fewer than 20 students; medium classes have between 20 and 30 students; large classes have 31 or more students. All results use appropriate panel weights. Probabilities obtained via 500 bootstrap repetitions (first row: simple bootstrap; second row: recentered bootstrap). No observed ranking implies only that the distributions are not rankable in the first- or second-degree sense. First-stage regressions include race dummies, gender dummy, limited English proficiency (LEP) dummy, father’s education dummies, mother’s education dummies, home computer dummy, family composition dummies, family socioeconomic status, number of siblings dummies, dummies for teacher race, teacher gender dummy, teacher experience dummies, teacher education dummies, average class ability dummies, dummies indicating student and teacher are of the same race, dummy indicating student and teacher are of the same gender, dummies for amount of student disruptions in class, regional dummies, urban and rural dummies, school enrollment dummies, grade-level enrollment dummies, dummies for the percentage of minority students in school, dummies for number of total full-time teachers as well as by race, teacher salary dummies, percentage of students in school in remedial reading, percentage of students in school in remedial math, percentage of students in school in bilingual education, length of school year dummies, lagged test score, dummies indicating if family composition, number of siblings, teacher salaries, school year length, number of full-time teachers by race, and percentage of students in remedial reading, remedial math, and bilingual education are missing. Partial residual test comparisons based on Equation (7); full residual comparisons based on Equation (8). See text for further details.

FIGURE 3 CDFs and integrated CDFs: partial residual 10th grade test scores by 8th grade class size. Small, <20 students, Medium, 20–30 students, Large, >30 students.

These striking results indicate that not only do smaller classes not lead to improvements in the subsequent distribution of test scores, net of a host of individual, family, class, teacher, and school attributes, but they actually result in inferior student performance according to the PR tests. Furthermore, this finding is robust across any social welfare function that is increasing in test scores. Finally, the corresponding unconditional distributions are unrankable (Table 1, Panel A), which indicates that observable attributes that are positively associated with test scores are negatively correlated with class size, consonant with positive selection into smaller classes through Tiebout sorting.

FIGURE 4 CDFs and integrated CDFs: full residual 10th grade test scores by 8th grade class size. Small class size is dominant category. Small, <20 students, Medium, 20–30 students, Large, >30 students.


FIGURE 5 CDFs and integrated CDFs: full residual 10th grade test scores by 8th grade class size. Medium class size is dominant category. Medium, 20–30 students, Large, >30 students.

As noted previously, however, a potential shortcoming of the PR tests is that differences in the class size specific returns to observables are not included in the PR distributions. As a result, any differences in the returns to observable characteristics are not attributable to the effects of class size, which may yield misleading inferences. The results of FR dominance tests, which account for such impacts of class size, are displayed in Panel B.21

The results indicate that the empirical distributions are rankable for two of the three comparisons: Small SSD Medium and Medium SSD Large.22 However, neither ranking is statistically significant according to the simple bootstrap. In the former (latter) case, the simple bootstrap yields Pr(s ≤ 0) = 0.526 (0.586). The recentered bootstrap gives p-values above 0.70 in both cases, failing to reject the null Ho : s = 0. In terms of the comparison of small and large classes, the FR distributions are unrankable.23

Examination of Figures 4 and 5 yields three additional findings. First, while small classes second order dominate medium classes, but not large classes, according to the empirical distributions, there is little difference in actuality between the observed distributions for medium and large classes (when small classes are the dominant group). Second, small classes fare better than medium and large classes at the lower tail of the distribution (below the median), while the converse holds above the median. This fact highlights the concavity assumption explicitly required of the social welfare function under the SSD ranking. Finally, medium classes outperform large classes (when medium classes are the dominant group) across the majority of the distribution; if not for a few crossings in the extreme tails, one would have observed a FSD ranking. However, the difference is of low magnitude as test scores differ by less than half a point (in absolute value) over the majority of the distribution.

21In the tables of FR results, the distribution in the X column is treated as the dominant category; see Equation (7).

22While the CDFs for small and medium classes appear to cross in the extreme lower tail in the top row of Figure 3 (favoring medium classes and thereby precluding an SSD ranking in favor of small classes), Table 2 reports a finding of Small SSD Medium because of a “trimming” procedure. Specifically, in the empirical implementation, the support points used to obtain the test statistics (see Appendix A) are chosen to be equally spaced, beginning at the first percentile and ending at the 99th percentile of the empirical support. This process focuses attention away from extreme outliers.

23Note that SSD relations, especially using the FR distributions, do not obey a transitivity property; Small SSD Medium and Medium SSD Large is not sufficient to guarantee Small SSD Large. While this is true in general for SSD (not FSD), it is especially true in the FR tests since the Small versus Medium and Small versus Large comparisons utilize Small as the dominant category, while the Medium versus Large comparison uses Medium as the dominant group.

These results, which suggest a modest, but statistically insignificant, monotonic ranking in favor of smaller classes, reverse the findings from the PR tests and indicate the importance of allowing the returns to observable attributes to vary by class size, as well as incorporating these differential returns into the residual distributions being compared. To verify that this is indeed the case, Table 3 presents the results from standard parametric Oaxaca–Blinder decompositions. Consonant with the change in results from the PR to the FR tests, Panel A indicates that the coefficients differ considerably across the three class size groups, with a significant advantage belonging to small classes (followed by medium classes). In particular, these results are driven primarily by greater returns to teacher race and education, school size, and lagged test scores. Thus an important benefit of smaller classes (in terms of subsequent test scores) appears to be a greater return to other observable attributes such as teacher education and students’ innate ability and/or previous educational inputs.

In the end, we believe that FR tests best isolate the effects of class size on the distribution of test scores and that the simple bootstrap is the more informative method of inference. Accordingly, we conclude that the unconditional distributions of test scores favor students in classes with 20 or more students below the 70th percentile, and students in classes with fewer than 20 students in the upper tail. However, after purging test scores of the average impact of a host of observable attributes using class size specific returns to observables and invoking the selection on observables assumption, we find that test scores exhibit a monotonic second order SD relationship, albeit statistically insignificant at conventional levels, favoring smaller classes. Moreover, the advantage for small classes is confined solely to below the median and is mainly attributable to the advantageous returns to observables enjoyed by students in smaller classes. As a result, uniform rankings are only possible when the debate over class size is broadened to incorporate the dispersion of test scores into the evaluation of educational policy. Furthermore, these results suggest the possibility of a Pareto-improving reallocation of students whereby lower performing students are placed in smaller classes and higher performing students are placed in larger classes.

TABLE 3 Oaxaca–Blinder decompositions of mean test score gaps

Class size                         Portion of observed gap due to differences in
X        Y        Observed gap     Endowments    Intercepts    Coefficients

A. 8th Grade Class Size: 10th Grade Test Scores
Small    Medium    0.067            0.254        −3.067         2.880
Small    Large     0.029            0.863        −10.047        9.213
Medium   Large    −0.038            0.032        −6.979         6.910

B. 10th Grade Class Size: 12th Grade Test Scores
Small    Medium   −0.385           −0.267         1.272        −1.391
Small    Large    −1.939           −1.742        −2.085         1.888
Medium   Large    −1.554           −1.380        −3.357         3.183

Positive numbers in columns 4–6 indicate advantage to X; negative numbers indicate advantage to Y. See Table 2 for further details.
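The additive structure behind the Table 3 decompositions can be checked directly. The sketch below is only an illustration, not the authors' code. It assumes that the four numbers reported in each row of Table 3 are, in order, the observed gap followed by the endowments, intercepts, and coefficients components, and it verifies that the three components sum to the observed gap up to rounding. The names `TABLE_3` and `decomposition_residual` are hypothetical.

```python
# Rows of Table 3, read as (X, Y, observed gap, endowments, intercepts, coefficients).
# Assumption: this column order is inferred from the table note and the additive identity.
TABLE_3 = [
    ("Small", "Medium", 0.067, 0.254, -3.067, 2.880),     # Panel A
    ("Small", "Large", 0.029, 0.863, -10.047, 9.213),
    ("Medium", "Large", -0.038, 0.032, -6.979, 6.910),
    ("Small", "Medium", -0.385, -0.267, 1.272, -1.391),   # Panel B
    ("Small", "Large", -1.939, -1.742, -2.085, 1.888),
    ("Medium", "Large", -1.554, -1.380, -3.357, 3.183),
]


def decomposition_residual(observed, endowments, intercepts, coefficients):
    """Observed gap minus the sum of its three reported components."""
    return observed - (endowments + intercepts + coefficients)


residuals = [decomposition_residual(*row[2:]) for row in TABLE_3]
for row, residual in zip(TABLE_3, residuals):
    print(f"{row[0]:>6} vs {row[1]:<6}: residual = {residual:+.4f}")
```

Under this reading, no residual exceeds 0.001 in absolute value (apart from floating-point noise), which is what rounding to three decimals would produce. The check also makes the sign pattern easy to see: in Panel A the coefficients component is positive in every row, the advantage to X that the text attributes to smaller classes.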

5.2.2. Twelfth Grade Test Scores

The PR and FR results for twelfth grade test scores are given in Panels C and D of Table 2, respectively. The corresponding plots are displayed in Figures 6–8. The PR results suggest a nonmonotonic effect of class size. Specifically, we observe Small FSD Medium, Large FSD Small, and Large FSD Medium. However, all three observed rankings are statistically insignificant at conventional levels according to the simple bootstrap. A second order ranking of Large over Small is statistically significant (Pr(s ≤ 0) = 0.934), while the other two comparisons are only marginally significant (Small SSD Medium: Pr(s ≤ 0) = 0.846; Large SSD Medium: Pr(s ≤ 0) = 0.890). Examination of Figure 6 confirms the previous results based on eighth grade class size (Figure 3): disparities in PR test scores (i.e., the QTEs) are fairly uniform across the entire distribution. Specifically, at any quantile of the distribution, students in large (small) classes score roughly two (one-quarter) points higher on twelfth grade tests than students in medium classes.
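To make the notion of a uniform QTE concrete, the following sketch computes quantile treatment effects as differences between the empirical quantiles of two samples. It is a hedged illustration on made-up data: the function names, the order-statistic quantile rule, and the two hypothetical score samples are all assumptions, not the authors' procedure or the NELS data.

```python
import math


def empirical_quantile(sample, q):
    """Smallest order statistic whose empirical CDF is at least q (0 < q <= 1)."""
    ordered = sorted(sample)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def quantile_treatment_effects(x, y, quantiles):
    """QTE at each quantile: quantile of x minus the same quantile of y."""
    return [empirical_quantile(x, q) - empirical_quantile(y, q) for q in quantiles]


# Hypothetical scores: group X is group Y shifted up by two points everywhere.
y_scores = [40.0 + 0.5 * i for i in range(200)]
x_scores = [score + 2.0 for score in y_scores]

deciles = [k / 10 for k in range(1, 10)]
qtes = quantile_treatment_effects(x_scores, y_scores, deciles)
print(qtes)
```

A pure location shift of this kind yields the same QTE at every quantile, which is the pattern the text describes as "fairly uniform"; QTEs that change sign across quantiles would instead correspond to crossing CDFs.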

FIGURE 6 CDFs and integrated CDFs: partial residual 12th grade test scores by 10th grade class size. Small, <20 students, Medium, 20–30 students, Large, >30 students.


FIGURE 7 CDFs and integrated CDFs: full residual 12th grade test scores by 10th grade class size. Small class size is dominant category. Small, <20 students, Medium, 20–30 students, Large, >30 students.

As documented in the eighth grade class size results, however, the PR tests ignore a potentially important difference across class sizes: the returns to observable determinants of student achievement. The results of the FR dominance tests, displayed in Panel D, reveal that the three distributions are unrankable in the first- and second-degree sense. Examination of Figures 7 and 8 yields two additional findings. First, small classes fare better than medium and large classes only in the extreme lower tail of the distribution (roughly below the 15th percentile) when small classes are treated as the dominant group. Second, large classes outperform medium classes across the majority of the distribution, although the difference is of low magnitude as test scores differ by less than one-half point (in absolute value) over the majority of the distribution. This result is invariant to the choice of small or medium classes as the dominant group.

These results, which suggest little impact of class size on subsequent test scores, refute the findings from the PR tests and confirm our previous finding that the returns to observable attributes vary in an important manner by class size. Specifically, Panel B of Table 3 verifies that, according to the standard parametric Oaxaca–Blinder decomposition, the return to observables is most favorable to medium classes, followed by small and then large classes. In particular, these results are driven by differences in the returns to teacher education and salary, teacher and student race, number of full-time teachers, and lagged test scores.

FIGURE 8 CDFs and integrated CDFs: full residual 12th grade test scores by 10th grade class size. Medium class size is dominant category. Medium, 20–30 students, Large, >30 students.

As stated previously, we believe that the FR tests best isolate the effects of class size on the distribution of test scores, and that the simple bootstrap is the more informative method of inference. Accordingly, the analysis using twelfth grade test scores largely reaffirms the conclusions we drew from the analysis of tenth grade test scores. Specifically, we conclude that the unconditional distribution of test scores favors smaller classes roughly only above the 70th percentile. However, after purging test scores of a host of observable attributes using class size specific returns, we conclude that the distribution of subsequent test scores is largely unaffected by class size, although students in the extreme lower tail are, if anything, aided by smaller classes.

5.3. Disaggregation of “Small” Classes

As noted in Ehrenberg et al. (2001), target class size varies across states, with some states targeting classes with 15 students and others targeting classes of 20 students. In addition, the few experiments that have been conducted in the US typically involve changes in class size at the lower end of the class size distribution. For example, Tennessee’s Project STAR compared classes of 13–17 students with classes of 22–26 students. Wisconsin’s SAGE (Student Achievement Guarantee in Education) program reduced many classes from between 21 and 25 students to only 12 to 15 students. California’s CSRP reduced classes from roughly 30 to 20 students. To see if marginal changes in this range matter, we define two new class size groupings: Small I (10–15 students) and Small II (16–19 students).

5.3.1. Tenth Grade Test Scores

The results pertaining to the tenth grade test score distributions are displayed in Panels A–C of Table 4. The CDFs, integrated CDFs, and differences in the CDFs are given in Figures 9–11. In terms of the unconditional empirical distribution (Panel A), we observe Small II SSD Small I. The observed dominance relation favoring the relatively larger class size is consonant with the previous unconditional tests examining tenth grade class size (Table 1). However, the second order ranking is not statistically significant at conventional levels according to the simple bootstrap (Pr(s ≤ 0) = 0.696). The recentered bootstrap fails to reject the null H0 : s = 0 (p-value = 0.926).

TABLE 4 Unconditional and conditional stochastic dominance tests: detailed comparison of small classes

Distributions: X = Small I and Y = Small II in every panel.

Panel                                                                    Observed ranking
A. 8th Grade Class Size: 10th Grade Test Scores (Unconditional)          II SSD I
B. 8th Grade Class Size: 10th Grade Test Scores (Partial Residual)       I FSD II
C. 8th Grade Class Size: 10th Grade Test Scores (Full Residual)          None
D. 10th Grade Class Size: 12th Grade Test Scores (Unconditional)         None
E. 10th Grade Class Size: 12th Grade Test Scores (Partial Residual)      I FSD II
F. 10th Grade Class Size: 12th Grade Test Scores (Full Residual)         None

First order dominance

Panel  Bootstrap   d1,MAX      d2,MAX      d           Pr{d1*≤0}   Pr{d2*≤0}   Pr{d*≤0}    Pr{d1*≥d1}  Pr{d2*≥d2}  Pr{d*≥d}
A      simple      1.943       0.278       0.278       0.000       0.114       0.114       0.564       0.692       0.692
       recentered                                      0.016       0.008       0.024       0.016       0.898       0.802
B      simple      −0.360      9.592       −0.360      0.794       0.122       0.916       0.330       0.640       0.262
       recentered                                      0.284       0.260       0.544       0.800       0.416       0.664
C      simple      2.296       1.191       1.191       0.000       0.000       0.000       0.898       0.904       0.904
       recentered                                      0.000       0.000       0.000       0.448       0.890       0.792
D      simple      0.966       0.755       0.755       0.000       0.000       0.000       0.752       0.684       0.564
       recentered                                      0.018       0.016       0.034       0.250       0.378       0.056
E      simple      −0.558      8.300       −0.558      0.794       0.088       0.882       0.516       0.608       0.508
       recentered                                      0.206       0.304       0.510       0.932       0.272       0.802
F      simple      2.068       1.104       1.104       0.000       0.000       0.000       0.978       0.986       0.986
       recentered                                      0.000       0.000       0.000       0.586       0.922       0.880

Second order dominance

Panel  Bootstrap   s1,MAX      s2,MAX      s           Pr{s1*≤0}   Pr{s2*≤0}   Pr{s*≤0}    Pr{s1*≥s1}  Pr{s2*≥s2}  Pr{s*≥s}
A      simple      282.157     −0.285      −0.285      0.002       0.694       0.696       0.536       0.578       0.576
       recentered                                      0.172       0.200       0.372       0.042       0.946       0.926
B      simple      −0.369      2263.625    −0.369      0.824       0.136       0.960       0.316       0.626       0.234
       recentered                                      0.472       0.330       0.802       0.606       0.330       0.462
C      simple      239.626     96.319      96.319      0.084       0.000       0.084       0.486       0.944       0.724
       recentered                                      0.238       0.012       0.250       0.128       0.786       0.130
D      simple      103.464     11.552      11.552      0.016       0.400       0.416       0.550       0.500       0.426
       recentered                                      0.176       0.200       0.376       0.270       0.596       0.248
E      simple      −0.585      2025.631    −0.585      0.840       0.100       0.940       0.528       0.576       0.498
       recentered                                      0.312       0.408       0.720       0.870       0.220       0.718
F      simple      150.144     132.353     132.353     0.146       0.000       0.146       0.556       0.976       0.580
       recentered                                      0.202       0.000       0.202       0.202       0.722       0.034

Small I classes have 10–15 students; Small II classes have 16–19 students. See Table 1 for further details.


FIGURE 9 CDFs and integrated CDFs: unconditional 10th grade test scores by 8th grade class sizes. Small I, 10–15 students, Small II, 16–19 students.

Examining the actual plots (Figure 9), we see that the Small II distribution lies to the right of the Small I distribution over the majority of the support. In fact, if not for a few crossings, we would have observed a first order ranking. Moreover, it is interesting to note that the unconditional test score gap favoring Small II classes gets wider in the upper tail (i.e., the QTEs are increasing across the quantiles). This contrasts with previous unconditional eighth grade class size results in Figure 1, where (the previously defined) small classes fared better than medium and large classes only in the upper tail.

The PR test result is displayed in Panel B. Now, we observe Small I FSD Small II. Given the reversal in ranking from the unconditional comparison, this implies that attributes associated with higher test scores are positively correlated with class size over the range being analyzed, consistent with compensatory policies in this range by schools. Moreover, the first order ranking is statistically significant at conventional levels according to the simple bootstrap (Pr(d ≤ 0) = 0.916). The recentered bootstrap fails to reject the null H0 : d = 0 or s = 0 (p-values above 0.46). Examination of Figure 10 reveals sizeable disparities in PR test scores that are fairly uniform across the entire distribution: at any quantile of the distribution, students in classes with 10 to 15 students score roughly four points higher than students in classes with 16 to 19 students.

FIGURE 10 CDFs and integrated CDFs: partial residual 10th grade test scores by 8th grade class sizes. Small I, 10–15 students, Small II, 16–19 students.

FIGURE 11 CDFs and integrated CDFs: full residual 10th grade test scores by 8th grade class sizes. Small I class as dominant category. Small I, 10–15 students, Small II, 16–19 students.

The results of the FR dominance tests are displayed in Panel C. Examination of the results reveals that the first order ranking found in the PR test disappears. The elimination of the previous FSD ranking implies that the returns to observables favor classes with 16 to 19 students. Indeed, this is confirmed in Panel A of Table 5, where the returns on the whole favor Small II classes. In particular, the returns to teacher education and salary, as well as the number of white teachers, yield the primary discrepancies in favor of classes with 16 to 19 students. Thus the improvement in returns as class size diminishes documented in the previous section (Panel A, Table 3) is not monotonic.

Viewing the actual plots in Figure 11 shows that the disparity in test scores favors classes with 10 to 15 students in the lower tail of the distribution (below the 30th percentile). However, the area between the CDFs (i.e., the cumulative sum, in absolute terms, of the QTEs) is greater above the 30th percentile, and this precludes a finding of SSD in favor of classes with 10 to 15 students. In the end, then, given our stated preference for the FR tests, disaggregation of eighth grade small classes indicates that reduction in class size below 16 students only aids the lowest scoring of students and does not enjoy unambiguous support by all preference functions in the class U2.

TABLE 5 Oaxaca–Blinder decompositions of mean test score gaps

Class size                         Portion of observed gap due to differences in
X        Y        Observed gap     Endowments    Intercepts    Coefficients

A. 8th Grade Class Size: 10th Grade Test Scores
Small I  Small II  1.430            1.332         1.079        −0.982

B. 10th Grade Class Size: 12th Grade Test Scores
Small I  Small II −0.409           −0.023         6.061        −6.447

See Table 3.

FIGURE 12 CDFs and integrated CDFs: unconditional 12th grade test scores by 10th grade class sizes. Small I, 10–15 students, Small II, 16–19 students.

5.3.2. Twelfth Grade Test Scores

The results analyzing the twelfth grade test score distributions are given in Panels D–F of Table 4. The CDFs, integrated CDFs, and differences in the CDFs are given in Figures 12–14. In terms of the unconditional distributions, we are unable to rank the distributions in either the first- or second-degree sense. Noteworthy, however, is that the recentered bootstrap rejects the null H0 : d = 0 (p-value = 0.056), rejecting both strict dominance and equality of the distributions. Examining Figure 12, we see that, consonant with the earlier results comparing small, medium, and large classes, the relatively large Small II classes outperform Small I classes up to approximately the 70th percentile. Thus, as is the case for eighth grade class size, there is modest evidence of an unconditional advantage for classes with 16 to 19 students over those with 10 to 15 students, except for the highest performing students.

FIGURE 13 CDFs and integrated CDFs: partial residual 12th grade test scores by 10th grade class sizes. Small I, 10–15 students, Small II, 16–19 students.

The PR dominance test results are displayed in Panel E. As with tenth grade test scores, we find that Small I FSD Small II. This continues to imply a positive correlation between attributes improving test scores and class size for students in classes with fewer than 20 students. However, unlike the previous results for eighth grade class size, this ranking is only marginally statistically significant according to the simple bootstrap (Pr(d ≤ 0) = 0.882); the second order ranking is statistically significant (Pr(s ≤ 0) = 0.940). The recentered bootstrap fails to reject the null H0 : d = 0 or s = 0.

Examination of Figure 13 reveals a sizeable, fairly uniform disparity in PR test scores: at any quantile of the distribution, students in tenth grade classes with 10 to 15 students score roughly three to four points higher than students in classes with 16 to 19 students. However, the distributions come close to crossing in the extreme lower tail, and apparently do cross sufficiently frequently in the bootstrap resamples to preclude a stronger statistically significant FSD ranking.

Lastly, the FR test results are displayed in Panel F. As with the previous section examining tenth grade test scores, we are unable to rank the empirical distributions. Moreover, the recentered bootstrap rejects the null H0 : s = 0 (p-value = 0.034), rejecting both strict second order dominance and equality of the distributions. The elimination of the FSD ranking obtained in Panel E implies that the returns to observables favor classes with 16 to 19 students. Panel B in Table 5 confirms this.24 Figure 14 shows further consistency with the previous tenth grade test score results: the disparity in test scores favors classes with 10 to 15 students in the lower tail of the distribution (below the 40th percentile). However, since the area between the CDFs (i.e., the cumulative sum (in absolute terms) of the QTEs) is greater above the 40th percentile, this precludes a finding of SSD in favor of classes with 10 to 15 students. Consequently, as stated above, preferences for classes of 10 to 15 or 16 to 19 students may differ among individuals with different welfare functions in the class U2.

FIGURE 14 CDFs and integrated CDFs: full residual 12th grade test scores by 10th grade class sizes. Small I class size is dominant category. Small I, 10–15 students, Small II, 16–19 students.

24 Some of the returns that differed most across class size groupings are for size of the tenth grade student body, student race, the number of white teachers, and teacher experience.

6. CONCLUSION

Identifying the factors most relevant to student achievement is important to parents, as well as policymakers concerned with maximizing the use of funds available for public schools. However, the impact of improved student achievement is potentially even more far-reaching, extending beyond school walls through its effect on completed education, future earnings, racial disparities, and economic competitiveness. While the factor most consistently discussed is class size, given the ease with which it may be manipulated by policy, previous empirical examinations of the impact of class size reductions have been mixed, with the results suggesting at best a modest impact. Despite this lack of convincing support, the US federal government allocated $12 billion (over a seven-year period) to reduce class sizes (Hoxby, 2000a), the state of California has spent over $3.6 billion on class size reduction since 1996, 20 US states are currently undertaking or discussing policies to reduce class sizes, and the Dutch government decided to allocate approximately $500 million (in US dollars) to reduce class sizes (Levin, 2001).

To investigate these previous mixed results, we analyze the relationship between class size and student achievement using recently developed tests for stochastic dominance. The tests are nonparametric and utilize information on the entire distribution of test scores. Moreover, through the use of bootstrap techniques, we are able to report the results of the dominance tests to a degree of statistical certainty. Thus a finding of a first or second degree dominance is extremely powerful, implying that any social welfare function that is increasing in test score levels (FSD) or increasing and concave (SSD) will prefer one distribution over another. This type of analysis is useful for policy decisions as it lends itself to broad-based, consensus ranking of outcomes. Furthermore, the absence of such rankings is equally informative, implying that any judgment of one distribution over another is subjective.

Using data from the National Education Longitudinal Study of 1988, we estimate the effects of eighth and tenth grade class size on the unconditional and conditional distributions of test scores in reading, social studies, mathematics, and science. The results shed some light on the previous ambiguous findings and should prove quite informative for future educational policy discussions. First, we find that the unconditional distributions of test scores favor students in larger classes below the 70th percentile or so, and students in smaller classes in the upper tail. In particular, while marginal changes in class size in classes with at least 20 students are not associated with changes in the unconditional test score distributions, the unconditional test scores of low achieving (high achieving) students are superior in classes with at least (less than) 20 students. Second, analyzing the conditional distribution of test scores, using class-size specific returns, we find little causal effect of marginal reductions in class size on test scores within the range of 20 or more students. However, reductions in class size from above 20 students to below 20 students, as well as marginal reductions in classes with fewer than 20 students, increase test scores for students below the median, but decrease test scores for students above the median.

Our methodology and findings point to several possible culprits when trying to make sense of the current empirical evidence on class size. First, reducing class size by one student does not have a constant effect: reductions in classes above 20 students have essentially no impact, while reductions in classes with 20 or fewer students raise the test scores of some students (those below the median) and lower the test scores of other students (those above the median). Second, whereas the majority of the beneficial impact of class size reduction arises because of the productivity enhancing effect that it has on other educational inputs, the majority of previous studies allow, at most, the impact of class size to vary by race or gender.

A. APPENDIX: TECHNICAL DETAILS

A.1. Computation of d̂ and ŝ

The test for FSD requires

(i) Computing the values of $\hat{F}(z_j)$ and $\hat{G}(z_j)$ for $z_j$, $j = 1, \ldots, J$, where $J$ denotes the number of points in the support that are utilized ($J = 500$ in the application, where the points are equally spaced beginning at the first percentile and ending at the 99th percentile of the empirical support, to focus attention away from extreme values)

(ii) Computing the differences $d_1(z_j) = \hat{F}(z_j) - \hat{G}(z_j)$ and $d_2(z_j) = \hat{G}(z_j) - \hat{F}(z_j)$

(iii) Finding $\hat{d} = \sqrt{\frac{NM}{N+M}} \, \min\{\max\{d_1\}, \max\{d_2\}\}$.

If $\hat{d} \le 0$ (to a degree of statistical certainty), then the null of FSD is not rejected. Furthermore, if $\hat{d} \le 0$ and $\max\{d_1\} < 0$, then X FSD Y. On the other hand, if $\hat{d} \le 0$ and $\max\{d_2\} < 0$, then Y FSD X. If $\hat{d} = \max\{d_1\} = \max\{d_2\} = 0$, then the (estimated) distributions of X and Y are identical.


The test for SSD requires the following additional steps:

(i) Calculating the sums $s_{1j} = \sum_{k=1}^{j} d_1(z_k)$ and $s_{2j} = \sum_{k=1}^{j} d_2(z_k)$, $j = 1, \ldots, J$

(ii) Finding $\hat{s} = \sqrt{\frac{NM}{N+M}} \, \min\{\max\{s_{1j}\}, \max\{s_{2j}\}\}$

If $\hat{s} \le 0$ (to a degree of statistical certainty), then the null of SSD is not rejected. Moreover, if $\hat{s} \le 0$ and $\max\{s_{1j}\} < 0$, then X SSD Y; otherwise, if $\max\{s_{2j}\} < 0$, then Y SSD X.
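The steps for the FSD and SSD statistics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the percentile rule used to trim the grid (pooled sample, simple order statistics) is an assumption since the text does not spell it out, and the data are made up.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at every point of `grid`."""
    ordered = sorted(sample)
    return [bisect_right(ordered, z) / len(ordered) for z in grid]


def trimmed_grid(x, y, num_points=500):
    """Equally spaced points from roughly the 1st to the 99th percentile.

    Assumption: percentiles come from the pooled sample via a crude
    order-statistic rule; the appendix does not specify this detail.
    """
    pooled = sorted(list(x) + list(y))
    low = pooled[int(0.01 * (len(pooled) - 1))]
    high = pooled[int(0.99 * (len(pooled) - 1))]
    step = (high - low) / (num_points - 1)
    return [low + k * step for k in range(num_points)]


def dominance_statistics(x, y, grid):
    """Return the maxima of d1, d2, s1j, s2j and the scaled statistics d-hat, s-hat."""
    n, m = len(x), len(y)
    scale = math.sqrt(n * m / (n + m))
    f_hat = ecdf_on_grid(x, grid)
    g_hat = ecdf_on_grid(y, grid)
    d1 = [f - g for f, g in zip(f_hat, g_hat)]  # F-hat minus G-hat
    d2 = [g - f for f, g in zip(f_hat, g_hat)]  # G-hat minus F-hat
    s1 = list(accumulate(d1))                   # partial sums s_{1j}
    s2 = list(accumulate(d2))                   # partial sums s_{2j}
    return {
        "max_d1": max(d1),
        "max_d2": max(d2),
        "d_hat": scale * min(max(d1), max(d2)),
        "max_s1": max(s1),
        "max_s2": max(s2),
        "s_hat": scale * min(max(s1), max(s2)),
    }


# Hypothetical scores: X is Y shifted up by 10 points, so X should dominate Y.
y_scores = [float(v) for v in range(100)]
x_scores = [v + 10.0 for v in y_scores]
stats = dominance_statistics(x_scores, y_scores, trimmed_grid(x_scores, y_scores))
print(stats)
```

In this constructed example the CDF of X lies below that of Y at every grid point, so both `max_d1` and `d_hat` are negative, the case the appendix labels X FSD Y, and the partial sums give a negative `s_hat` as well. The sketch returns point estimates only; the "degree of statistical certainty" comes from the bootstrap.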

A.2. The Recentered Bootstrap

To obtain recentered bootstrap p-values, we compute the relative frequency of $\{\hat{d}^{**} > \hat{d}\}$, where $\hat{d}^{**}$ is the recentered bootstrap estimate of $d$. The recentering algorithm requires

(i) Generating bootstrap samples of size N (M) from X (Y)

(ii) Computing the values of $\hat{F}^{*}(z_j)$ and $\hat{G}^{*}(z_j)$ for $z_j$, $j = 1, \ldots, J$, where the values of $z_j$ used to analyze the original sample are utilized

(iii) Computing the differences $d^{c}_{1}(z_j) = \{\hat{F}^{*}(z_j) - \hat{G}^{*}(z_j)\} - \{\hat{F}(z_j) - \hat{G}(z_j)\}$ and $d^{c}_{2}(z_j) = \{\hat{G}^{*}(z_j) - \hat{F}^{*}(z_j)\} - \{\hat{G}(z_j) - \hat{F}(z_j)\}$

(iv) Finding $\hat{d}^{**} = \sqrt{\frac{NM}{N+M}} \, \min\{\max\{d^{c}_{1}\}, \max\{d^{c}_{2}\}\}$

We then compute the relative frequency of $\{\hat{d}^{**} > \hat{d}\}$, where $\hat{d}$ is the sample estimate of $d$.
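The recentering steps translate into a short resampling loop. The sketch below is a hedged illustration rather than the authors' code: the function names, the number of replications, the seed, the evaluation grid, and the simulated scores are all assumptions made for the example.

```python
import math
import random
from bisect import bisect_right


def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at every point of `grid`."""
    ordered = sorted(sample)
    return [bisect_right(ordered, z) / len(ordered) for z in grid]


def recentered_bootstrap_pvalue(x, y, grid, replications=200, seed=0):
    """Relative frequency of {d** > d-hat} over recentered bootstrap draws."""
    rng = random.Random(seed)
    n, m = len(x), len(y)
    scale = math.sqrt(n * m / (n + m))
    f_hat = ecdf_on_grid(x, grid)
    g_hat = ecdf_on_grid(y, grid)
    d_hat = scale * min(max(f - g for f, g in zip(f_hat, g_hat)),
                        max(g - f for f, g in zip(f_hat, g_hat)))
    exceed = 0
    for _ in range(replications):
        # Step (i): resample each group with replacement at its own size.
        f_star = ecdf_on_grid(rng.choices(x, k=n), grid)
        g_star = ecdf_on_grid(rng.choices(y, k=m), grid)
        # Step (iii): subtract the sample differences to recenter.
        dc1 = [(fs - gs) - (f - g)
               for fs, gs, f, g in zip(f_star, g_star, f_hat, g_hat)]
        dc2 = [(gs - fs) - (g - f)
               for fs, gs, f, g in zip(f_star, g_star, f_hat, g_hat)]
        # Step (iv): recentered statistic for this replication.
        d_recentered = scale * min(max(dc1), max(dc2))
        if d_recentered > d_hat:
            exceed += 1
    return d_hat, exceed / replications


# Hypothetical data: two groups drawn from the same distribution.
data_rng = random.Random(1)
x_scores = [data_rng.gauss(50.0, 10.0) for _ in range(150)]
y_scores = [data_rng.gauss(50.0, 10.0) for _ in range(120)]
grid = [20.0 + 0.5 * k for k in range(121)]  # fixed grid from 20 to 80

d_hat, p_value = recentered_bootstrap_pvalue(x_scores, y_scores, grid)
print(d_hat, p_value)
```

The same grid is reused in every replication, as step (ii) requires, and fixing the seed makes the reported frequency reproducible. The analogous p-value for the second order statistic would replace the recentered differences with their partial sums.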

ACKNOWLEDGMENTS

The authors are extremely grateful to David Drukker for assistance; to Julian Betts, Eric Hanushek, and Christopher Jepsen for comments; and to seminar participants at the 8th Annual Texas Camp Econometrics, Georgetown University, University of Texas, Arlington, and the SOLE Labor Economics Internet Seminar.

REFERENCES

Abadie, A. (2002). Bootstrap tests for distributional treatment effects in instrumental variable models. Journal of the American Statistical Association 97:284–292.

Akerhielm, K. (1995). Does class size matter?, Economics of Education Review 14:229–241. Amin, S., Rai, A. S., Topa, G. (2003). Does microcredit reach the poor and vulnerable? Evidence

from Northern Bangladesh. Journal of Development Economics 70:59–82. Anderson, G. (1996). Nonparametric tests of stochastic dominance in income distributions.

Econometrica 64:1183–1193.

Class Size and Educational Policy 367

Angrist, J., Lavy, V. (1999). Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics 114:533–575.

Barro, R. J. (2001). Human capital and growth. American Economic Review 91:12–17. Betts, J. R., Shkolnik, J. L. (2000). The effects of ability grouping on student math achievement and

resource allocation in secondary schools. Economics of Education Review 19:1–15. Bishop, J. A., Formby, J. P., Zeager, L. A. (2000). The effect of food stamp cashout on

undernutrition. Economics Letters 67:75–85. Bitler, M. P., Gelbach, J. B., Hoynes, H. H. (2005). Distributional impacts of the self-sufficiency

project. NBER Working Paper No. 11626. Boozer, M. A., Rouse, C. (2001). Intraschool variation in class size: patterns and implications. Journal

of Urban Economics 50:163–189. Card, D., Krueger, A. B. (1996). School resources and student outcomes: an overview of the

literature and new evidence from North and South Carolina. Journal of Economics Perspectives 10:31–50.

Chubb, J. E., Moe, T. M. (1990). Politics, Markets and America’s Schools. Washington, DC: Brookings Institution Press.

Coleman, J. S., et al. (1966). Equality of Educational Opportunity. Washington, DC: Department of Health, Education, and Welfare.

Cook, M. D., Evans, W. E. (2000). Families or schools? Explaining the convergence in white and black academic performance. Journal of Labor Economics 18:729–754.

Dearden, L., Ferri, J., Meghir, C. (2002). The effect of school quality on educational attainment and wages. Review of Economics and Statistics 84:1–20.

Dobbelsteen, S., Levin, J., Oosterbeek, H. (2002). The causal effect of class size on scholastic achievement: distinguishing the pure class size effect from the effect of changes in class composition. Oxford Bulletin of Economics and Statistics 64:17–38.

Dustmann, C. (2003). The class size debate and educational mechanisms: editorial. Economic Journal 113:F1–F2.

Ehrenberg, R. G., Brewer, D. J., Gamoran, A., Willms, J. D. (2001). Class size and student achievement. Psychological Science in the Public Interest 2:1–30.

Eide, E., Showalter, M. H. (1998). The effect of school quality on student performance: a quantile regression approach. Economics Letters 58:345–350.

Fertig, M. (2003a). Who’s to blame? The determinants of German students’ achievement in the PISA 2000 study. IZA Discussion Paper 739.

Fertig, M. (2003b). Educational production, endogenous peer group formation and class composition—evidence from the PISA 2000 study. IZA Discussion Paper 714.

Figlio, D. N., Paige, M. E. (2002). School choice and the distributional effects of ability tracking: does separation increase inequality? Journal of Urban Economics 51:497–514.

Fisher, G., Wilson, D., Xu, K. (1998). An empirical analysis of term premiums using significance tests for stochastic dominance. Economics Letters 60:195–203.

Goldhaber, D., Brewer, D. (1997). Why don’t schools and teachers seem to matter? Assessing the impact of unobservables on educational productivity. Journal of Human Resources 32m:505–523.

Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources 14:351–388.

Hanushek, E. A. (1986). The economics of schooling: production and efficiency in public schools. Journal of Economic Literature 24:1141–1177.

Hanushek, E. A. (1989). The impact of differential expenditures on school performance. Educational Researcher 18:45–51.

Hanushek, E. A. (1996). School resources and student performance. In: Burtless, G., ed. Does Money Matter? The Effect of School Resources on Student Achievement and Adult Success. Washington, DC: Brookings Institution, pp. 43–73.

Hanushek, E. A. (2001). Black-white achievement differences and government interventions. American Economic Review 91:24–28.

Hanushek, E. A. (2002). Evidence, politics, and the class size debate. In: Mishel, L., Rothstein, R., eds. The Class Size Debate. Washington, DC: Economic Policy Institute, pp. 37–65.

Hanushek, E. A. (2003). The failure of input-based schooling policies. Economic Journal 113:F64–F98.

Hanushek, E. A., Kimko, D. D. (2000). Schooling, labor-force quality, and the growth of nations. American Economic Review 90:1184–1208.

Hanushek, E. A., Rivkin, S. G., Taylor, L. L. (1996). Aggregation and estimated effects of school resources. Review of Economics and Statistics 78:611–627.

Hoxby, C. M. (2000a). The effects of class size on student achievement: new evidence from population variation. Quarterly Journal of Economics 115:1239–1285.

Hoxby, C. M. (2000b). Does competition among public schools benefit students and taxpayers? American Economic Review 90:1209–1238.

Jencks, C., Phillips, M., eds. (1998). The Black-White Test Score Gap. Washington, DC: Brookings Institution Press.

Jepsen, C., Rivkin, S. (2002). What is the tradeoff between smaller classes and teacher quality? NBER Working Paper 9205.

Kaur, A., Prakasa Rao, B. L. S., Singh, H. (1994). Testing for second-order stochastic dominance of two distributions. Econometric Theory 10:849–866.

Krueger, A. B. (1999). Experimental estimates of education production functions. Quarterly Journal of Economics 114:497–532.

Krueger, A. B. (2003). Economic considerations and class size. Economic Journal 113:F34–F63.

Krueger, A. B., Whitmore, D. M. (2002). Would smaller classes help close the black-white achievement gap? In: Chubb, J., Loveless, T., eds. Bridging the Achievement Gap. Washington, DC: Brookings Institution Press.

Lazear, E. P. (2001). Educational production function. Quarterly Journal of Economics 116:777–801.

Levin, J. (2001). For whom the reductions count: a quantile regression analysis of class size and peer effects on scholastic achievement. Empirical Economics 26:221–246.

Linton, O., Maasoumi, E., Whang, Y. J. (2005). Consistent testing for stochastic dominance: a subsampling approach. Review of Economic Studies 72:735–765.

Maasoumi, E., Heshmati, A. (2000). Stochastic dominance amongst Swedish income distributions. Econometric Reviews 19:287–320.

Maasoumi, E., Heshmati, A. (2005). Evaluating dominance ranking of PSID incomes by various household attributes. IZA Discussion Paper No. 1727.

Maasoumi, E., Millimet, D. L. (2005). Robust inference concerning recent trends in U.S. environmental quality. Journal of Applied Econometrics 20:55–77.

McFadden, D. (1989). Testing for stochastic dominance. In: Part II of Fomby, T., Seo, T. K., eds. Studies in the Economics of Uncertainty (in honor of J. Hadar). Springer-Verlag, pp. 113–134.

Neuman, S., Oaxaca, R. L. (2004). Wage decompositions with selectivity corrected wage equations: a methodological note. Journal of Economic Inequality 2:3–10.

Rivkin, S., Hanushek, E. A., Kain, J. (2002). Teachers, schools, and academic achievement. Unpublished manuscript, Hoover Institution, Stanford University.

Summers, A., Wolfe, B. (1977). Do schools make a difference? American Economic Review 67:639–652.

Tiebout, C. M. (1956). A pure theory of local expenditures. Journal of Political Economy 64:416–424.

Todd, P. E., Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. Economic Journal 113:F3–F33.

Wößmann, L. (2003). Schooling resources, educational institutions and student performance: the international evidence. Oxford Bulletin of Economics and Statistics 65:117–170.