Chi-Square Data Analysis
Method Note
The Chi-Square Test: Often Used and More Often Misinterpreted
Todd Michael Franke¹, Timothy Ho², and Christina A. Christie³
Abstract
The examination of cross-classified category data is common in evaluation and research, with Karl Pearson’s family of chi-square tests representing one of the most utilized statistical analyses for answering questions about the association or difference between categorical variables. Unfortunately, these tests are also among the more commonly misinterpreted statistical tests in the field. The problem is not that researchers and evaluators misapply the results of chi-square tests, but rather that they tend to overinterpret or incorrectly interpret the results, leading to statements that may have limited or no statistical support based on the analyses performed.
This paper attempts to clarify any confusion about the uses and interpretations of the family of chi-square tests developed by Pearson, focusing primarily on the chi-square tests of independence and homogeneity of variance (identity of distributions). A brief survey of the recent evaluation literature is presented to illustrate the prevalence of the chi-square test and to offer examples of how these tests are misinterpreted. While the omnibus form of all three tests in the Karl Pearson family of chi-square tests (independence, homogeneity, and goodness of fit) uses essentially the same formula, each of these three tests is, in fact, distinct, with specific hypotheses, sampling approaches, interpretations, and options following rejection of the null hypothesis. Finally, a little-known option, the use and interpretation of post hoc comparisons based on Goodman’s procedure (Goodman, 1963) following the rejection of the chi-square test of homogeneity, is described in detail.
Keywords
chi-square test, quantitative methods, methods use, using chi-square test
¹ Department of Social Welfare, Meyer and Rene Luskin School of Public Affairs, University of California, Los Angeles, CA, USA
² Department of Education, Graduate School of Education and Information Sciences, University of California, Los Angeles, CA, USA
³ Department of Education, Social Research Methods Division, Graduate School of Education and Information Sciences, University of California, Los Angeles, CA, USA
Corresponding Author:
Todd Michael Franke, Department of Social Welfare, Meyer and Rene Luskin School of Public Affairs, University of California,
Box 951656, Los Angeles, CA, 90095, USA
Email: [email protected]
American Journal of Evaluation 33(3) 448-458 © The Author(s) 2012 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/1098214011426594 http://aje.sagepub.com
Karl Pearson initially developed the chi-square test in 1900 and applied it to test the goodness of fit
for frequency curves. Later, in 1904, he extended it to contingency tables to test for independence
between rows and columns (Stigler, 1999). Since then, the Pearson family of chi-square tests has
become one of the most common sets of statistical analyses in evaluation and social science
research. Unfortunately, these tests are also among the more commonly misinterpreted statistical
tests in the field. The problem is not that researchers and evaluators misapply the results of
chi-square tests, but rather that they tend to overinterpret or incorrectly interpret the results, leading them
to make statements that may have limited or no statistical support based on the analyses performed.
In this article, we will attempt to clarify any confusion about the uses and interpretations of the
family of chi-square tests developed by Pearson, focusing primarily on the chi-square tests of inde-
pendence and homogeneity of variance (identity of distributions). First, the family of chi-square sta-
tistics will be presented, including distinguishing features of and appropriate uses for each specific
test. Next, a brief survey of the recent evaluation literature will be presented to illustrate the preva-
lence of the chi-square test and to offer examples of how these tests are misinterpreted. Finally, a
little known option, the use of post hoc comparisons based on Goodman’s procedure (Goodman,
1963) following the rejection of the chi-square test of homogeneity, will be described.
The Karl Pearson Family of Chi-Square Tests
The chi-square test is computationally simple. It is used to examine independence across
two categorical variables or to assess how well a sample fits the distribution of a known population
(goodness of fit). The chi-square tests in the Karl Pearson family are not to be confused with others
such as the Yates chi-square test (correction for continuity), the Mantel–Haenszel chi-square or the
Maxwell–Stuart tests of correlated proportions. Each of these has its own applications, though they
all utilize the chi-square distribution as the reference distribution. In fact, many tests that assess
model fit use the chi-square distribution as the reference distribution. For example, many covar-
iance structure analyses, including factor analysis and structural equation modeling, assess model
fit by comparing the sample covariances to those derived from the model. Again, while they are
based on the same chi-square distribution, these tests are similar to the Karl Pearson family of tests
only in that they compare an observed set of data to what is expected.
The omnibus form of all three tests in the Karl Pearson family of chi-square tests (goodness of
fit, independence, and homogeneity) uses essentially the same formula. Each of these three tests is,
however, distinct, with specific hypotheses, interpretations, and options following rejection of the null
hypothesis. The formula for computing the test statistic is as follows:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$
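where n is the number of cells in the table, and O_i and E_i are the observed and expected frequencies in cell i. The obtained test statistic is compared against a critical
value from the chi-square distribution. For an r × c contingency table (the tests of independence and homogeneity), the distribution has (r − 1)(c − 1) degrees of freedom; for a one-way goodness-of-fit table with k categories and no estimated parameters, it has k − 1.

The main difference across each of the three chi-square tests relates to the appropriate situations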
for which each should be used. The chi-square goodness of fit test is used when a sample is com-
pared on a variable of interest against a population with known parameters. For example, a goodness
of fit test might be applied on a survey sample to compare whether the ethnicity or income of the
survey respondents is consistent with the known demographic makeup of the geographic locale from
which the sample was drawn. The null and alternative hypotheses are:
Hypothesis0: The data follow a specified distribution.
HypothesisA: The data do not follow the specified distribution.
The interpretation upon rejection is that the sample differs significantly from the population on
the variable of interest.
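The goodness-of-fit computation described above can be sketched in a few lines of Python. This is only an illustration: the survey counts and population shares are hypothetical values invented for the example, and the critical value 5.991 is the standard tabled chi-square value for α = .05 with 2 degrees of freedom.

```python
def pearson_chi_square(observed, expected):
    """Return the Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical survey counts in three income brackets (illustrative only).
observed = [45, 30, 25]
# Hypothetical known population shares for the same three brackets.
population_shares = [0.50, 0.30, 0.20]

n = sum(observed)
# Expected counts if the sample followed the population distribution.
expected = [share * n for share in population_shares]

chi2 = pearson_chi_square(observed, expected)
# One-way table with k categories and no estimated parameters: df = k - 1.
df = len(observed) - 1

# Tabled chi-square critical value for alpha = .05 and df = 2.
CRITICAL_VALUE = 5.991
reject_null = chi2 > CRITICAL_VALUE

print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject_null}")
```

With these made-up numbers the statistic falls below the critical value, so the sample would not be judged to differ from the population distribution.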
The chi-square test of independence determines whether two categorical variables in a single
sample are independent from or associated with each other. For example, a survey might be admi-
nistered to 1,000 participants who each respond with their hair color and favorite ice cream flavor.
The test would then be used to determine whether hair color and ice cream preference are indepen-
dent of each other. The null and alternative hypotheses are as follows:
Hypothesis0: The variables of interest are independent.
HypothesisA: The variables of interest are associated.
A significant test rejecting the null hypothesis would suggest that within the sample, one variable
of interest is associated with a second variable of interest.
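As a rough sketch of the test of independence, the snippet below derives expected counts from the row and column totals of a single cross-classified sample and then applies the Pearson formula. The hair color by ice cream counts are hypothetical, and the function names are illustrative choices rather than anything defined in this article.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical single sample of 1,000 respondents (illustrative counts only).
# Rows: hair color (blonde, brunette).
# Columns: favorite flavor (vanilla, chocolate, strawberry).
survey = [
    [200, 150, 150],
    [150, 200, 150],
]

stat, df = chi_square(survey)
print(f"chi-square = {stat:.2f} with {df} degrees of freedom")
```

Because the data come from one sample, a significant result here would support only a statement of association between the two variables.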
Finally, the chi-square test of homogeneity is used to determine whether two or more independent
samples differ in their distributions on a single variable of interest. One common use of this test is to
compare two or more groups or conditions on a categorical outcome. A significant test statistic
would indicate that the groups differ on the distribution of the variable of interest but does not indi-
cate which of the groups are different or where the groups differ. The null and alternative hypotheses
are as follows:
Hypothesis0: The proportions between groups are the same.
HypothesisA: The proportions between groups are different.
We focus on the practical and important differences between the tests of independence
and homogeneity because they are so frequently used in evaluation and applied research studies.
Despite the fact that the formulation of the omnibus test statistic is the same for the test of inde-
pendence and the test of homogeneity, these two tests differ in their sampling assumptions, null
hypotheses, and options following a rejection. The main difference between them is how data are
collected and sampled. Specifically, the test of independence collects data on a single sample, and
then compares two variables within that sample to determine the relationship between them. The
test of homogeneity collects data on two¹ or more distinct groups intentionally, as might be the
case in a treatment or intervention study with a comparison group. The two samples are then com-
pared on a single variable of interest to test whether the proportions differ between them. Wickens
(1989) presents a thoughtful and succinct description of these tests, as well as their sampling
assumptions and hypotheses. In addition to the tests of homogeneity and independence, Wickens
presents an additional alternative where both margins are fixed, which he refers to as ‘‘test of unre-
lated classification.’’
When data are collected using only a single sample, only the test of independence is valid and
only interpretations of association between variables can be made. When data on two or more sam-
ples are collected, the test of homogeneity is appropriate and comparisons of proportions can be
made across the multiple groups. When sampling occurs from multiple populations, and thus the
homogeneity hypothesis is appropriate, it is also reasonable (although less interesting) to ask the
independence question.
In the above example regarding hair color and ice cream preference, if the researcher
defined the population by hair color and eye color and collected information on 500
brunettes and 500 blondes, these would constitute two independent samples. Comparisons of
proportions of blondes and brunettes by their ice cream preferences would be valid. When
random assignment is used to assign participants to two or more conditions, these groups are
by definition independent and the test of homogeneity may be used to test for differences
between the groups.
Perhaps these distinctions can be best illustrated by the null hypothesis tested in each of
these two tests. The null hypothesis of the chi-square test of independence states that there is no association
between two categorical variables. It can be written as $H_0: \phi = 0$ or $H_0: V = 0$. This states that the association between two categorical variables, as measured by a phi ($\phi$) correlation for 2 × 2 contingency tables or by Cramér’s V for larger tables, is zero; that is, the variables are independent:
$$H_0: \phi = 0, \quad H_A: \phi \neq 0$$
or
$$H_0: V = 0, \quad H_A: V \neq 0.$$
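For readers who want to compute these measures of association, a minimal sketch follows. It relies on the standard textbook relationship V = sqrt(χ² / (N · min(r − 1, c − 1))), which is not stated in this article; for a 2 × 2 table the same expression gives the magnitude of phi. The input values are hypothetical.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from a chi-square statistic, sample size, and table shape.

    For a 2 x 2 table, min(r - 1, c - 1) is 1, so this reduces to
    sqrt(chi2 / n), the magnitude of the phi coefficient.
    """
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical inputs: chi-square of 14.29 from a 2 x 3 table of 1,000 cases.
v = cramers_v(14.29, 1000, 2, 3)

# Hypothetical inputs: chi-square of 4.0 from a 2 x 2 table of 100 cases.
phi_magnitude = cramers_v(4.0, 100, 2, 2)

print(f"Cramer's V = {v:.3f}, |phi| = {phi_magnitude:.3f}")
```

A value of zero corresponds to the null hypothesis of independence; the statistic itself says nothing about which cells drive a nonzero value.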
The chi-square test of homogeneity compares the proportions between groups on a variable of
interest. The null hypothesis is presented in matrix form:
$$H_0: \begin{bmatrix} p_{11} = p_{12} = \cdots = p_{1k} \\ p_{21} = p_{22} = \cdots = p_{2k} \\ p_{31} = p_{32} = \cdots = p_{3k} \\ p_{k1} = p_{k2} = \cdots = p_{kk} \end{bmatrix}$$
$$H_A: \text{The null is false.}$$
Rejection of the null hypothesis in the case of three or more groups only allows the researcher to
conclude that the proportions between the groups differ, not which groups are different. Table 1
summarizes the distinction between the three types of chi-square tests—specifically, the sampling
required for each test, the correct interpretation of each test, and the null hypothesis assumed of
each test.
One common misinterpretation of chi-square tests comes from not distinguishing between these
three specific tests. Indeed, when most researchers declare that they ‘‘utilized a chi-square test,’’
they are typically referring to the chi-square test of independence. This lack of specificity often leads
researchers to use interpretations of one test where another was actually conducted. For example,
researchers will more often feel compelled to compare the proportions between groups, regardless
of how the data were drawn. Most often, however, the data on two categorical variables are
collected from a single sample (e.g., survey data). In that case the assumptions for the chi-square test of
homogeneity are not met, and an interpretation comparing proportions between groups is not valid.
Even in those situations where data are drawn from multiple samples and the test of homogeneity
is appropriate, researchers seem unaware that procedures exist specifically to follow up the
rejection of the omnibus test. Consider the following null hypothesis:
$$H_0: \begin{bmatrix} p_{11} = p_{12} = p_{13} \\ p_{21} = p_{22} = p_{23} \end{bmatrix}.$$
Table 1. Chi-Square Tests and Attributes

Attribute         Test of Independence             Test of Homogeneity                          Test of Goodness of Fit
Sampling type     Single dependent sample          Two (or more) independent samples            Sample from population
Interpretation    Association between variables    Difference in proportions                    Difference from population
Null hypothesis   No association between           No difference in proportion between groups   No difference in distribution between
                  variables                                                                     sample and population
A rejection in this case indicates that at least one proportion is different from at least one other
proportion.² Often, a researcher will conduct a chi-square test, find a significant value, and then look
for the cells with the largest disparity in proportions or frequencies to make a substantive interpreta-
tion. The proper procedure would involve conducting post hoc comparisons after the omnibus
chi-square test to determine where the significant differences actually are. Post hoc procedures for
chi-square tests are discussed in a later section.
Chi-square Tests in Recent Evaluation Literature
A brief survey of recent evaluation literature was conducted in order to obtain a general sense of how
often chi-square tests are used and how often researchers misinterpret the results.
Surveying the evaluation literature is an approach that has been used by several researchers as a
method for better understanding the methods and strategies used in evaluation practice. For example,
Greene, Caracelli, and Graham (1989) included published evaluation studies in their sample when
reviewing 57 empirical mixed-methods evaluations. Findings from the empirical study were used to
refine a mixed-methods conceptual framework that had originally been developed from the theore-
tical literature and was intended to inform and guide practice. More recently, Miller and Campbell
(2006) studied empowerment evaluation in practice by examining 47 case examples published from
1994 through June 2005 to determine the extent to which empowerment evaluation could be distin-
guished from evaluation approaches emphasizing similar elements, and the extent to which empow-
erment evaluation led to empowered outcomes for program beneficiaries.
For the current study, four prominent evaluation journals were selected for review: American
Journal of Evaluation, Evaluation Review, Educational Evaluation and Policy Analysis, and Eva-
luation and Program Planning. Every article published in these four journals between January
2008 and August 2010 was reviewed. These journals and periods were not intended to be a compre-
hensive search of the evaluation literature, but mainly to obtain a picture of the prevalence of
chi-square tests and the extent to which these tests are incorrectly interpreted. The vast majority
of chi-square tests and misinterpretations probably exist in evaluation reports that are never read
beyond a small circle of intended users, but we believe that the proliferation of chi-square test mis-
interpretations is exacerbated by evaluation literature that is read by a larger audience.
After book reviews, section introductions, memoranda, and other editorial content were excluded,
there were a total of 292 articles available for review. Two graduate student researchers coded each
article on a variety of measures, including whether inferential statistics were used and whether a chi-
square test was used. For articles that used a chi-square test, additional codes identified whether the
article contained the correct interpretation given the sampling procedure, whether post hoc interpre-
tations were used, and whether post hoc tests were conducted.
Table 2 details the number of articles in each journal as well as how many used inferential
quantitative statistics. Overall, just over a third (36.6%; n = 107) of the articles used some sort
of inferential statistic, ranging from a simple t test to more advanced structural equation models. Of
the 107 articles that used inferential statistics, 32 articles (29.9%) also used a chi-square test in the
Karl Pearson family. Evaluation and Program Planning had the most articles employing a chi-square
test (n = 12), while the American Journal of Evaluation had the fewest (n = 3).

Table 2. Use of Statistical Tests in Journal Articles

Journal                                       Total Number   Articles Using           Articles Using    Proportion of Articles Using
                                              of Articles    Inferential Statistics   Chi-Square Test   Chi-Square Test (%)
American Journal of Evaluation                65             16                       3                 18.75
Evaluation Review                             61             30                       11                36.67
Educational Evaluation and Policy Analysis    52             35                       6                 17.14
Evaluation and Program Planning               114            26                       12                46.15
Total                                         292            107                      32                29.91

Note. Proportions are based on the number of articles using inferential statistics.

The 32 articles that used chi-square tests were further reviewed to determine whether the
interpretations were justified. Often, researchers were not specific about which chi-square tests were
being used (only one of the 32 articles correctly specified the type of chi-square test conducted).
To make the determination, then, coders reviewed the Method section in each article to identify
which chi-square test would have been appropriate given the sampling design used. The interpreta-
tions from the chi-square tests presented in each article were then coded for the types of interpreta-
tion used, that is, whether an association claim was made between variables or whether a comparison
of proportions was made between groups. This allowed the researchers to determine the type of
chi-square test used by the researchers in each article. Any discrepancy between a study’s sampling
design and the type of chi-square test used was coded as a nonvalid interpretation of the chi-
square test. In addition, each of the 32 chi-square articles was coded on whether a post hoc inter-
pretation was used, meaning that the author made comparisons across select rows and columns of
the table.
The results from these additional analyses are presented in Table 3. Overall, less than half of
the chi-square articles (43.75%; n = 14) had interpretations that were justified by the type of
chi-square test used. All three articles in the American Journal of Evaluation included the correct
usage of the chi-square test, whereas only a third (two out of six) of the articles in Educational
Evaluation and Policy Analysis did so. As shown in Table 3, 9 of the 32 articles that used
chi-square (28.1%) included a post hoc interpretation. None of the articles used any post hoc analyses to justify their claims.
Hypothetical Example: Support Components for At-Risk Families
We offer a hypothetical example to illustrate the concepts described above and to guide readers
through a proper chi-square post hoc analysis. In this scenario, suppose that researchers are inves-
tigating the impact of various family support components for families at risk for child abuse and
neglect. Study participants were randomly assigned to receive either parent education/life skills,
connections to community resources, or wraparound services made up of the previous components
plus case management. Using the county data system, a sample was drawn from each of these three
conditions. The dependent variable of interest consisted of four outcome categories, measured 12 months after the
families’ initial involvement with Child Protective Services (CPS): (a) a CPS rereferral; (b) a substantiated
allegation; (c) the child’s removal from home; or (d) no further involvement with CPS.
Table 3. Description of Articles Using Chi-Square Analyses

Journal                                       Number of      Articles That Used a Valid      Articles That Used a
                                              Chi-Square     Chi-Square Test Interpretation   Post Hoc Interpretation
                                              Articles (N)   N        %                       N        %
American Journal of Evaluation                3              3        100.00                  1        33.33
Evaluation Review                             11             4        36.36                   4        36.36
Educational Evaluation and Policy Analysis    6              2        33.33                   2        33.33
Evaluation and Program Planning               12             5        41.67                   2        16.67
Total                                         32             14       43.75                   9        28.13
While randomization is often used to form independent groups, it is not a prerequisite for the appro-
priate use of the test for homogeneity. What is required is that the groups are identified and sampled
intentionally. Table 4 shows the distribution of involvement with CPS across the three conditions.
The null hypothesis is as follows:
$$H_0: \begin{bmatrix} p_{11} = p_{12} = p_{13} \\ p_{21} = p_{22} = p_{23} \\ p_{31} = p_{32} = p_{33} \\ p_{41} = p_{42} = p_{43} \end{bmatrix},$$
$$H_A: \text{The null is false.}$$
The obtained $\chi^2_6 = 36.77$ is significant at the conventional α level of .05. The justified interpretation following the rejection of the null hypothesis would be to conclude that the proportions are not
equal across the three groups.
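The omnibus statistic can be reproduced directly from the cell counts in Table 4. The sketch below uses only the Python standard library; the variable names are illustrative, and 12.59 is the tabled critical value for α = .05 with 6 degrees of freedom.

```python
# Cell counts from Table 4.
# Rows: rereferral, substantiated allegation, child removed, no new involvement.
# Columns: parent education, community resources, wraparound.
table4 = [
    [38, 42, 49],
    [24, 18, 35],
    [27, 8, 15],
    [97, 120, 258],
]

row_totals = [sum(row) for row in table4]
col_totals = [sum(col) for col in zip(*table4)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (O - E)^2 / E,
# with E = row total * column total / N.
chi2 = 0.0
for i, row in enumerate(table4):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

df = (len(table4) - 1) * (len(table4[0]) - 1)

# Tabled chi-square critical value for alpha = .05 and df = 6.
CRITICAL_VALUE = 12.59
reject_null = chi2 > CRITICAL_VALUE

print(f"chi-square({df}) = {chi2:.2f}, reject H0: {reject_null}")
```

Rejecting the null here establishes only that the three conditions do not share one distribution of outcomes; it does not identify which conditions or outcomes differ.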
Often at this point, researchers will conclude that the proportions are not equal and will want
to compare specific conditions. For example, they might examine the ‘‘no new involvement’’
row and conclude that the wraparound condition (72.3%) is preferable to the parent education (52.2%) or community resources (63.8%) condition. Alternatively, a researcher may be interested in comparing the proportion of children removed across the conditions. It might be tempting
to conclude that parent education (14.5%) is significantly different from community resources (4.26%) and wraparound (4.2%). However, this interpretation would be incorrect because there is no statistical justification for these claims based solely on the results of the
omnibus test; the omnibus test indicates only that the conditions are significantly different but
not which conditions are different.
Because the chi-square test is an omnibus test, post hoc procedures would need to be con-
ducted in order to compare individual conditions. As previously mentioned, the procedure for
comparing conditions or groups was developed by Goodman (1963).³ Similar to the comparison
procedures following an analysis of variance (ANOVA), several different approaches, including
Scheffé, Holm,⁴ and Dunn-Bonferroni, are available for selecting the appropriate critical
value. Also similar to the ANOVA, the comparison often takes on the name associated with the
formulation of the critical value. For purposes of this article, the Scheffé post hoc values are
presented because this represents the most conservative approach. For an alternative approach
based on Dunn-Bonferroni, see Marascuilo and Serlin (1988).
The Goodman procedure is described below. The test statistic for each contrast is as follows:
$$\frac{\hat{c}}{\sqrt{SE^2_{\hat{c}}}} = Z.$$
Table 4. Involvement with CPS and Service Conditions

                               Parent Education   Community Resources   Wraparound    Total
                               N, Col %           N, Col %              N, Col %      N, Col %
Rereferral to CPS              38, 20.43          42, 22.34             49, 13.73     129, 17.65
Substantiated allegation       24, 12.9           18, 9.57              35, 9.8       77, 10.53
Child removed                  27, 14.52          8, 4.26               15, 4.2       50, 6.84
No new involvement with CPS    97, 52.15          120, 63.83            258, 72.27    475, 64.98
Total                          186                188                   357           731

Note. CPS = child protective services.
The same equation in an expanded form is as follows:
$$\frac{\hat{c}}{\sqrt{SE^2_{\hat{c}}}} = \frac{w_1(p_1) - w_2(p_2)}{\sqrt{w_1^2\left(\frac{p_1 q_1}{n_1}\right) + w_2^2\left(\frac{p_2 q_2}{n_2}\right)}} = Z,$$
where $\hat{c}$ represents the linear combination of weights ($W_k$) and proportions ($y_k$) of the specific contrast:
$$\hat{c} = W_1 y_1 + W_2 y_2 + \cdots + W_k y_k, \quad \text{where} \quad W_1 + W_2 + \cdots + W_k = 0.$$
The denominator of the test statistic is the square root of the weighted squared standard error of the contrast:
$$SE^2_{\hat{c}} = W_1^2 SE^2_{y_1} + W_2^2 SE^2_{y_2} + \cdots + W_k^2 SE^2_{y_k}.$$
The squared standard error of each column is that of an estimated proportion, with $q_k = 1 - p_k$:
$$SE^2_{y_k} = \frac{p_k q_k}{N_k}.$$
Once the obtained test statistic is found for a comparison of interest, it is compared to a critical
value. The Scheffé critical value is found by taking the square root of the critical value in the original
omnibus chi-square analysis. In the above example, the chi-square omnibus critical value at the conventional
α level of .05 with (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6 degrees of freedom is 12.59. The square root of this critical value is
$$S^{*} = \sqrt{\chi^2_{\nu;\,1-\alpha}} = \sqrt{12.59} = \pm 3.55,$$
which represents the Scheffé critical value for all contrasts.
Referring back to our previous example, comparing wraparound (72.3%) to parent education (52.2%) on “no new involvement” leads to the following hypotheses:
$$\text{Hypothesis}_0: p_{\text{No new involvement/wraparound}} = p_{\text{No new involvement/parent education}},$$
$$\text{Hypothesis}_A: p_{\text{No new involvement/wraparound}} \neq p_{\text{No new involvement/parent education}}.$$
The appropriate test statistic is as follows:
$$\frac{\left(\frac{357}{357}\right)(.7227) - \left(\frac{186}{186}\right)(.5215)}{\sqrt{\left(\frac{357}{357}\right)^2\left(\frac{(.7227)(.2773)}{357}\right) + \left(\frac{186}{186}\right)^2\left(\frac{(.5215)(.4785)}{186}\right)}} = \frac{.2012}{.0436} = 4.61.$$
Since this is a pairwise comparison, the weights 357/357 and 186/186 equal 1 and essentially drop out of
the equation in both the numerator and the denominator. Given that 4.61 > 3.55, we reject the null hypothesis and conclude that there is a statistically significant difference between these conditions.
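The pairwise comparison above can be checked with a short sketch. The function name `pairwise_z` is an illustrative choice, not part of Goodman’s paper; the counts come from the “no new involvement” row of Table 4, and the Scheffé critical value is taken as the square root of the omnibus critical value of 12.59.

```python
import math

def pairwise_z(count_1, n_1, count_2, n_2):
    """Z statistic for the difference between two independent proportions.

    With pairwise weights of +1 and -1, the contrast is p1 - p2 and its
    squared standard error is p1*q1/n1 + p2*q2/n2.
    """
    p1 = count_1 / n_1
    p2 = count_2 / n_2
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    return (p1 - p2) / se

# Scheffe critical value: square root of the omnibus critical value
# (12.59 for alpha = .05 with 6 degrees of freedom).
scheffe_critical = math.sqrt(12.59)

# "No new involvement": wraparound (258 of 357) vs. parent education (97 of 186).
z = pairwise_z(258, 357, 97, 186)
significant = abs(z) > scheffe_critical

print(f"Z = {z:.2f}, critical value = {scheffe_critical:.2f}, "
      f"significant: {significant}")
```

Using the raw counts rather than the rounded percentages avoids small rounding differences in the final statistic.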
Comparisons can be performed within any row. If the researcher wanted to compare wraparound
(4.2%) to parent education (14.5%) on the “child removed” outcome, the test statistic is given by
$$\frac{\left(\frac{357}{357}\right)(.042) - \left(\frac{186}{186}\right)(.1452)}{\sqrt{\left(\frac{357}{357}\right)^2\left(\frac{(.042)(.958)}{357}\right) + \left(\frac{186}{186}\right)^2\left(\frac{(.1452)(.8548)}{186}\right)}} = \frac{-.1031}{.0279} = -3.69.$$
Given that |−3.69| > 3.55, we reject the null hypothesis and conclude that there is a statistically significant difference between these conditions. A comparison between community resources (4.26%) and parent education (14.5%) produces a test statistic of 3.45 and is not significant, owing to the differing sample sizes and their impact on the standard error. This is an instance where simply examining the difference
between the proportions, without conducting the appropriate post hoc test, might lead to a statistically
unsupported conclusion. In both of these comparisons, the difference between parent
education and the other condition was about .10. However, in one case there was a significant
difference and in the other there was no significant difference based on the critical value. A complete listing of all
pairwise comparisons is available in Table 5.
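To see why two nearly identical differences lead to different decisions, the two “child removed” comparisons can be computed side by side. This is a sketch using the Table 4 counts; the helper name `contrast` is illustrative only.

```python
import math

def contrast(count_a, n_a, count_b, n_b):
    """Return (difference, standard error, Z) for two independent proportions."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff, se, diff / se

# Scheffe critical value from the omnibus critical value of 12.59.
scheffe_critical = math.sqrt(12.59)

# "Child removed" counts from Table 4:
# wraparound 15 of 357, parent education 27 of 186, community resources 8 of 188.
wrap_vs_parent = contrast(15, 357, 27, 186)
parent_vs_community = contrast(27, 186, 8, 188)

for label, (diff, se, z) in [
    ("wraparound vs. parent education", wrap_vs_parent),
    ("parent education vs. community resources", parent_vs_community),
]:
    verdict = "significant" if abs(z) > scheffe_critical else "not significant"
    print(f"{label}: diff = {diff:.4f}, SE = {se:.4f}, Z = {z:.2f} ({verdict})")
```

The differences are almost the same size, but the smaller community resources group yields a larger standard error, which is enough to pull the second statistic below the critical value.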
As noted previously, comparisons under this model are not limited to being pairwise. The post
hoc procedure can also be used to test complex contrasts. Suppose you want to compare wraparound
to the combination of parent education and community resources on rereferral to CPS:
$$\frac{\left(\frac{357}{357}\right)(.1373) - \left[\left(\frac{186}{374}\right)(.2043) + \left(\frac{188}{374}\right)(.2234)\right]}{\sqrt{\left(\frac{357}{357}\right)^2\left(\frac{(.1373)(.8627)}{357}\right) + \left[\left(\frac{186}{374}\right)^2\left(\frac{(.2043)(.7957)}{186}\right) + \left(\frac{188}{374}\right)^2\left(\frac{(.2234)(.7766)}{188}\right)\right]}} = \frac{-.0766}{.0273} = -2.81.$$
Unlike with the previous pairwise contrast weights, the combination of parent education and
community resources needs to be weighted for their respective contributions. Once this is done, the
test statistic is calculated as it was before. Given that |−2.81| < 3.55, we do not reject the null hypothesis;
there is no statistically significant difference between the wraparound condition and the combination
of parent education and community resources.

Table 5. Pairwise Contrasts from Hypothetical Example

Contrast                                          ĉ         SE       TS
Rereferral
  Wraparound versus parent education              −.0670    .0347    −1.931
  Wraparound versus community resources           −.0861    .0354    −2.432
  Parent education versus community resources     −.0191    .0424    −0.451
Substantiated abuse
  Wraparound versus parent education              −.0310    .0292    −1.062
  Wraparound versus community resources            .0023    .0306     0.075
  Parent education versus community resources      .0333    .0326     1.020
Child removed
  Wraparound versus parent education              −.1031    .0279    −3.693
  Wraparound versus community resources           −.0005    .0182    −0.030
  Parent education versus community resources      .1026    .0297     3.451
No new case opened
  Wraparound versus parent education               .2012    .0436     4.612
  Wraparound versus community resources            .0844    .0423     1.995
  Parent education versus community resources     −.1168    .0507    −2.304

Note. ĉ = estimated contrast; SE = standard error; TS = test statistic.
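The complex contrast can be sketched with a general function that accepts any set of weights summing to zero. The function name and argument layout are illustrative assumptions. Evaluating the formula directly from the Table 4 rereferral counts gives a statistic of roughly −2.74 rather than the −2.81 reported above; the source of that small gap is not clear from the text, and the conclusion is the same either way.

```python
import math

def contrast_z(counts, sizes, weights):
    """Estimate, standard error, and Z for a weighted contrast of proportions.

    Squared standard error of the contrast: sum of W_k^2 * p_k * q_k / N_k.
    """
    if abs(sum(weights)) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    props = [count / size for count, size in zip(counts, sizes)]
    estimate = sum(w * p for w, p in zip(weights, props))
    variance = sum(
        w ** 2 * p * (1 - p) / size
        for w, p, size in zip(weights, props, sizes)
    )
    se = math.sqrt(variance)
    return estimate, se, estimate / se

# Rereferral counts and group sizes from Table 4, in the order
# wraparound, parent education, community resources.
counts = [49, 38, 42]
sizes = [357, 186, 188]

# Wraparound against the size-weighted combination of the other two groups.
weights = [1.0, -186 / 374, -188 / 374]

estimate, se, z = contrast_z(counts, sizes, weights)
scheffe_critical = math.sqrt(12.59)

print(f"contrast = {estimate:.4f}, SE = {se:.4f}, Z = {z:.2f}")
print("significant" if abs(z) > scheffe_critical else "not significant")
```

Pairwise comparisons are the special case with weights of +1 and −1 on two groups and 0 on the rest.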
Discussion
Common misconceptions of the chi-square test were clarified in this article. Specifically, we have
distinguished between the members of the Karl Pearson family of chi-square tests and presented post
hoc procedures. Evaluators often need to examine the association between categorical variables or to
compare groups or conditions on a categorical outcome, which explains the prevalence of chi-square tests in evaluation
literature and reports. However, effective use of the chi-square test, or any other statistical test
for that matter, is dependent on a clear understanding of the assumptions of the test and what is actu-
ally being tested (null hypothesis) in the statistical procedure.
A correct interpretation of the chi-square test or of other statistical procedures is often dependent
on factors outside of distributional assumptions and characteristics of the data themselves; for example,
individual observations must be independent of other observations in the contingency table. When
this is the case, an interpretation of the chi-square test is based on sampling procedures and how data
were collected. Furthermore, since the asymptotic approximation of the chi-square test is less precise
at the extreme end of the distribution, expected values of cells need to be greater than five.
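The expected-count guideline is easy to check before running the test. A minimal sketch for the hypothetical Table 4 data follows; the threshold of five is the rule of thumb stated above, and the variable names are illustrative.

```python
# Cell counts from Table 4 (rows: outcomes; columns: conditions).
table4 = [
    [38, 42, 49],
    [24, 18, 35],
    [27, 8, 15],
    [97, 120, 258],
]

row_totals = [sum(row) for row in table4]
col_totals = [sum(col) for col in zip(*table4)]
n = sum(row_totals)

# Expected count in each cell: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

smallest = min(min(row) for row in expected)
meets_guideline = smallest > 5

print(f"smallest expected count = {smallest:.2f}, "
      f"guideline met: {meets_guideline}")
```

Note that the check applies to expected counts, not observed counts: the observed count of 8 in the “child removed” row for community resources is not itself a concern.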
The review of the evaluation literature reveals that in about half of the instances where a chi-square test
was used, the wrong interpretation was presented. The appropriate interpretation of the results is directly
tied to the null hypothesis under test and the interpretation—whether independence or homogeneity—is
limited to that hypothesis. More commonly, researchers prefer to interpret the chi-square test as a test of
homogeneity by comparing groups across a variable of interest. When the data come from a single sample, however, the sampling procedure precludes the
researcher from making this claim, and the results of the chi-square test have thus been misinterpreted.
Researchers also tend to overinterpret the results of statistical tests. An omnibus chi-square test
informs us that the distribution of observed values deviates from expected values, but does not tell us
where the discrepancy is located in the contingency table. Often, researchers will make naïve
comparisons between two or more groups without conducting any post hoc tests to determine whether
the contrasts were significant.
Many more complex statistical models exist and we have faith that these procedures are still being
faithfully and thoughtfully applied. Although the chi-square tests were found to be commonly misinter-
preted in recent evaluation literature, the results of these studies are not wrong. Rather, the problem is
simply that there is often no statistical justification for some of the claims being made. However, Good-
man’s procedure is computationally simple and there is little reason it cannot be conducted to justify
significant contrasts. Our hope in this article is that researchers and evaluators will be more thoughtful
in using common statistical procedures and more carefully consider what their results actually say.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication
of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
1. The two-sample test of proportions, which uses the Z distribution, is a special case of the test of homoge-
neity, employed when you have only two groups.
2. Comparisons in this context are not limited to pairwise contrasts. It is perfectly feasible that Groups 2 and 3
combined are different from Group 1 and responsible for the significant result.
3. The approach presented here builds logically on the post hoc procedures following multiple group compar-
isons in analysis of variance (ANOVA) models. Goodman’s approach is not the only one available for
addressing pairwise comparisons, however. See Seaman and Hill (1996), Gardner (2000), and Delucchi
(1993).
4. For information on the use of the Holm procedure, see Holm (1979).
References
Delucchi, K. L. (1993). On the use and misuse of chi-square. In G. Keren & C. Lewis (Eds.), A handbook for
data analysis in the behavioral sciences (pp. 295–319). Hillsdale, NJ: Lawrence Erlbaum.
Gardner, R. C. (2000). Psychological statistics using SPSS for Windows. Upper Saddle River, NJ: Prentice Hall.
Goodman, L. (1963). Simultaneous confidence intervals for contrasts among multinomial populations. The
Annals of Mathematical Statistics, 35, 716–725.
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework for mixed-method
evaluation designs. Educational Evaluation and Policy Analysis, 11, 255–274.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6,
65–70.
Marascuilo, L., & Serlin, R. (1988). Statistical methods for the social and behavioral sciences. New York, NY:
W.H. Freeman.
Miller, R. L., & Campbell, R. (2006). Taking stock of empowerment evaluation: An empirical review. American
Journal of Evaluation, 27, 296–319. doi:10.1177/109821400602700303
Seaman, M. H., & Hill, C. C. (1996). Pairwise comparisons for proportions: A note on Cox and Key. Educational
and Psychological Measurement, 56, 452–459.
Stigler, S. (1999). Statistics on the table: The history of statistical concepts and methods. Cambridge, MA:
Harvard University Press.
Wickens, T. D. (1989). Multiple contingency tables analysis for the social sciences. Hillsdale, NJ: Lawrence
Erlbaum.