Week 4 - Assignment: Apply the Normal Distribution
Stat Methods Appl (2016) 25:581–599 DOI 10.1007/s10260-016-0353-z
ORIGINAL PAPER
Methods to test for equality of two normal distributions
Julian Frank1 · Bernhard Klar1
Accepted: 18 January 2016 / Published online: 29 January 2016 © Springer-Verlag Berlin Heidelberg 2016
Abstract Statistical tests for two independent samples under the assumption of normality are applied routinely by most practitioners of statistics. Likewise, presumably each introductory course in statistics treats some statistical procedures for two independent normal samples. Often, the classical two-sample model with equal variances is introduced, emphasizing that a test for equality of the expected values is a test for equality of both distributions as well, which is the actual goal. In a second step, usually the assumption of equal variances is discarded. The two-sample t test with Welch correction and the F test for equality of variances are introduced. The first test is treated solely as a test for the equality of central location, and the second solely as a test for the equality of scatter. Typically, there is no discussion of whether and to what extent testing for equality of the underlying normal distributions is possible, which is quite unsatisfactory given the motivation and treatment of the situation with equal variances. It is the aim of this article to investigate the problem of testing for equality of two normal distributions, and to do so using knowledge and methods adequate to statistical practitioners as well as to students in an introductory statistics course. The power of the different tests discussed in the article is examined empirically. Finally, we apply the tests to several real data sets to illustrate their performance. In particular, we consider several data sets arising from intelligence tests since there is a large body of research supporting the existence of sex differences in mean scores or in variability in specific cognitive abilities.
Keywords Fisher combination method · Minimum combination method · Likelihood ratio test · Two-sample model
Corresponding author: Bernhard Klar
1 Department of Mathematics, Karlsruhe Institute of Technology (KIT), Englerstr. 2, 76199 Karlsruhe, Germany
1 Introduction
Statistical tests for two independent samples under the assumption of normality are applied routinely by most practitioners of statistics. Likewise, statistical inference for two independent normal samples is of great relevance in every introductory statistics course. There, the approach is often quite similar: First, the importance of shift models is stated, motivating the classical two-sample model with equal variances (see, e.g., Bickel and Doksum 2006, page 4). The ultimate aim is to compare both distributions. If normality is assumed, this corresponds to a test for equality of the expected values, i.e. Student’s t test. In a second step, usually the assumption of equal variances is discarded. The two-sample t test with Welch correction is introduced, mostly without going into the details of Welch’s distribution approximation. The introduction and subsequent discussion of the F test for equality of variances often varies in the level of detail. Welch’s t test is treated solely as a test for the equality of central location, and the F test solely as a test for the equality of scatter. Typically, there is no discussion of whether and to what extent testing for equality of the underlying normal distributions is possible. This is astonishing not only in view of the motivation of the classical t test, but also for (at least) two other reasons: For one thing, lectures continue with general procedures for testing nested parametric models, including in particular likelihood-ratio tests. For another, when it comes to dealing with the one-way ANOVA, one rarely fails to see the problem of multiple testing being mentioned, along with suitable corrections including, most of the time, the Bonferroni correction.
In some textbooks testing for equality of variances is merely left as an exercise if not outright skipped. A possible reason for this could be seen in the non-robustness of this particular test against deviations from the normal distribution. Still, as no alternative tests are even alluded to, students get the impression that differences in scatter are more or less irrelevant - variance is a statistical Cinderella. Yet, everybody actually applying statistical procedures knows very well that differences in variance and location are of comparable importance.
Summing up, it can be said that from a practical point of view, given a two-sample model under normality, the aim has to be to judge whether the two samples originate from basically similar distributions or not. However, in many cases the classical and, of course, very comfortable assumption of equal variances has no grounding. In the midst of these considerations, discussion in lectures and textbooks stops without further ado and the students (and maybe some lecturers as well) are left without a clue how to deal with this situation.
It is the aim of the following article to investigate the problem of testing for the equality of two normal distributions, and to do so using knowledge and methods adequate to statistical practitioners as well as to students in an introductory course in mathematical statistics. Mathematically speaking, the following testing problem will be considered: Let $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ be independent normally distributed random variables, where $X_i \sim N(\mu, \sigma^2)$ for all $i = 1, \ldots, m$ and $Y_j \sim N(\nu, \tau^2)$ for all $j = 1, \ldots, n$. In contrast to Student’s t test, we do not make further assumptions about the parameters, so that $(\mu, \nu, \sigma^2, \tau^2) \in \Theta = \mathbb{R}^2 \times (0, \infty)^2$ is arbitrary. It is the objective to test if the two samples stem from identical distributions. The corresponding testing problem is given by the following hypothesis and alternative:

$$H_0 : \vartheta = (\mu, \nu, \sigma^2, \tau^2) \in \Theta_0 = \{\vartheta \in \Theta : \mu = \nu,\ \sigma^2 = \tau^2\} \quad \text{vs.} \quad H_1 : \vartheta \in \Theta \setminus \Theta_0. \tag{1}$$
The first classical approach is to develop a likelihood-ratio test. Doing so is a simple way to obtain an asymptotically valid test. In Sect. 2 the likelihood-ratio test statistic is derived, and different approximations of the distribution of the test statistic under H0 found in the literature are summed up. Among them one can find an asymptotic expansion proposed by Muirhead (1982), as well as a recently developed method to derive the exact distribution by numerical integration (Zhang et al. 2012).
A further approach is to combine different p values as illustrated in Sect. 3. For this procedure the hypothesis H0 is obtained by combining the hypotheses of both t and F test. Performing both tests using the same data (x1, . . . , xm) and (y1, . . . , yn), the resulting p values can be combined, yielding a new test statistic and, thus, a test result for (1). Most combination methods require the tests to be combined to be independent under H0, which holds in the case under consideration. In the specific case of Fisher’s method, the same approach, but applied in a slightly different way, can be found in Perng and Littell (1976).
In Sect. 4, power of the different tests is compared empirically. The ability of each method to correctly detect the alternative differs with respect to whether there is a difference in expectation, variance, or both. Loughin (2004) compares the method of combining the p values without regard to a specific testing problem. However, it is instructive to apply these methods directly to the problem at hand and compare them with the likelihood-ratio tests in Sect. 2.
Situations where one is interested in differences in variability as well as in means can be found almost everywhere. A long list of such applications is compiled in Gastwirth et al. (2009). We discuss in Sect. 5 several examples from two subject areas, namely engineering and psychology. In particular, we consider several data sets arising from mental or intelligence tests since there is a large body of research supporting the existence of sex differences in specific cognitive abilities, some favouring men, some favouring women, sometimes differences are found in mean scores, or in variability, or in both.
2 The likelihood ratio test
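A classic approach in order to construct a test for H0 is the application of the maximum likelihood method. The unrestricted maximum likelihood estimator $\hat\vartheta$ is given by

$$\hat\vartheta = (\hat\mu, \hat\nu, \hat\sigma^2, \hat\tau^2) = \left( \bar X,\ \bar Y,\ \frac{1}{m} \sum_{i=1}^{m} (X_i - \bar X)^2,\ \frac{1}{n} \sum_{j=1}^{n} (Y_j - \bar Y)^2 \right),$$

with $\bar X = \frac{1}{m} \sum_{i=1}^{m} X_i$, $\bar Y = \frac{1}{n} \sum_{j=1}^{n} Y_j$, while the maximum likelihood estimator $\hat\vartheta_0$ under $H_0$ is given by

$$\hat\vartheta_0 = (\hat\mu_0, \hat\mu_0, \hat\sigma_0^2, \hat\sigma_0^2)$$

with $\hat\mu_0 = \frac{m \bar X + n \bar Y}{m+n}$ and $\hat\sigma_0^2 = \frac{1}{m+n} \left( \sum_{i=1}^{m} (X_i - \hat\mu_0)^2 + \sum_{j=1}^{n} (Y_j - \hat\mu_0)^2 \right)$. Denoting the likelihood function by $L(\vartheta)$, the likelihood ratio statistic $\Lambda_{m,n}$ is equal to

$$\Lambda_{m,n} = \frac{L(\hat\vartheta_0)}{L(\hat\vartheta)} = \frac{(2\pi \hat\sigma_0^2)^{-\frac{m+n}{2}} \exp\left( -\frac{1}{2\hat\sigma_0^2} \left( \sum_{i=1}^{m} (x_i - \hat\mu_0)^2 + \sum_{j=1}^{n} (y_j - \hat\mu_0)^2 \right) \right)}{(2\pi \hat\sigma^2)^{-\frac{m}{2}} \exp\left( -\frac{1}{2\hat\sigma^2} \sum_{i=1}^{m} (x_i - \hat\mu)^2 \right) \cdot (2\pi \hat\tau^2)^{-\frac{n}{2}} \exp\left( -\frac{1}{2\hat\tau^2} \sum_{j=1}^{n} (y_j - \hat\nu)^2 \right)} = \frac{(\hat\sigma^2)^{\frac{m}{2}} \cdot (\hat\tau^2)^{\frac{n}{2}}}{(\hat\sigma_0^2)^{\frac{m+n}{2}}}.$$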
Assuming $\frac{m}{m+n} \to p$ for $m + n \to \infty$ and some $p \in (0, 1)$, it follows from the general theory of likelihood ratio tests, given that $\Theta_0$ and $\Theta$ have dimensions 2 and 4, that

$$-2 \log \Lambda_{m,n} \xrightarrow{D} \chi^2_2 \quad \text{for } m + n \to \infty \text{ under } H_0 \tag{2}$$

(Hogg et al. 2005, pp. 351–353). Hence, an asymptotic level $\alpha$ test rejects $H_0$ if

$$-2 \log \Lambda_{m,n} \ge \chi^2_{2;1-\alpha}, \tag{3}$$

where $\chi^2_{2;p}$ denotes the p-quantile of the $\chi^2$-distribution with 2 degrees of freedom.
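As a rough illustration of (2) and (3), the following Python sketch computes $-2 \log \Lambda_{m,n}$ from two samples and converts it to an asymptotic p value. The function name `lr_statistic` and the toy data are hypothetical and not taken from the article; the p value uses the closed-form tail $e^{-x/2}$ of the $\chi^2_2$-distribution and is therefore only as accurate as the asymptotic approximation itself.

```python
import math


def lr_statistic(x, y):
    """Return (-2 log Lambda, asymptotic chi-square(2) p value) for two samples.

    Assumes both samples have positive sample variance; otherwise log(0) fails.
    """
    m, n = len(x), len(y)
    xbar = sum(x) / m
    ybar = sum(y) / n
    # Unrestricted ML variance estimates (divisors m and n, not m - 1 and n - 1).
    s2x = sum((v - xbar) ** 2 for v in x) / m
    s2y = sum((v - ybar) ** 2 for v in y) / n
    # ML estimates under H0: common mean and common variance.
    mu0 = (m * xbar + n * ybar) / (m + n)
    s20 = (sum((v - mu0) ** 2 for v in x) + sum((v - mu0) ** 2 for v in y)) / (m + n)
    # -2 log Lambda = (m + n) log s20 - m log s2x - n log s2y
    stat = (m + n) * math.log(s20) - m * math.log(s2x) - n * math.log(s2y)
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    sample_x = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
    sample_y = [6.9, 3.2, 8.4, 2.5, 7.7, 5.0]
    print(lr_statistic(sample_x, sample_y))
```

For small samples the p value returned here tends to be too small, which is the motivation for the finite-sample corrections discussed next.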
Typically, fairly large sample sizes are needed to use these asymptotic results for finite samples. However, there are several approaches available to transform the test statistic or to determine a more exact distribution in order to improve the finite sample behaviour.
Pearson and Neyman (1930) directly considered $\Lambda_{m,n}$, showing that under $H_0$, the limiting distribution is the uniform distribution $U(0, 1)$ [note that, if $Z$ is uniformly distributed, $-2 \log Z$ is exponentially distributed with mean 2, or $\chi^2_2$-distributed; hence, this result is in agreement with (2)]. They proposed to approximate the exact distribution of $\Lambda_{m,n}$ for finite $n$ and $m$ by a beta distribution matching the first two moments.
Muirhead (1982) considered an asymptotic expansion of the distribution of the likelihood ratio test statistic under multivariate normality; in the univariate case, we obtain the following corollary.
Corollary 2.1 Let $F_{\chi^2_q}$ denote the distribution function of the $\chi^2_q$-distribution. It holds under $H_0$:

$$P_{H_0}\left( -2\rho \log \Lambda_{m,n} \le u \right) = F_{\chi^2_2}(u) + \frac{\gamma}{\rho^2 (m+n)^2} \left( F_{\chi^2_6}(u) - F_{\chi^2_2}(u) \right) + O\left( (m+n)^{-3} \right),$$

with

$$\rho = 1 - \frac{22}{24(m+n)} \left( \frac{m+n}{m} + \frac{m+n}{n} - 1 \right),$$

$$\gamma = \frac{1}{2} \left( \left( \frac{m+n}{m} \right)^2 + \left( \frac{m+n}{n} \right)^2 - 1 \right) - \frac{121}{96} \left( \frac{m+n}{m} + \frac{m+n}{n} - 1 \right)^2.$$

Hence, the function $F_{m,n}$, defined by

$$F_{m,n}(u) = F_{\chi^2_2}(u) + \frac{\gamma}{\rho^2 (m+n)^2} \left( F_{\chi^2_6}(u) - F_{\chi^2_2}(u) \right),$$

is an approximation of the distribution function of $-2\rho \log \Lambda_{m,n}$ under the hypothesis. Then, an approximate test of $H_0$ against $H_1$ rejects $H_0$ if

$$-2\rho \log \Lambda_{m,n} \ge F_{m,n}^{-1}(1 - \alpha). \tag{4}$$

Table 1 Comparison of the χ2- and Muirhead-approximation for m = 10 and n = 20

         Asymptotic χ2                                  Muirhead approximation
p        F^{-1}_{χ2_2}(p)   p-quantile of               F^{-1}_{10,20}(p)   p-quantile of
                            −2 log Λ_{10,20}                                −2ρ log Λ_{10,20}
0.75     2.77               3.13                        2.70                2.79
0.90     4.61               5.19                        4.46                4.64
0.95     5.99               6.74                        5.77                6.02
0.99     9.21               10.33                       8.74                9.22
0.999    13.82              15.48                       12.78               13.83
The improvement achieved by this expansion is illustrated in Table 1. There, the quantiles of $F_{\chi^2_2}$ and $F_{10,20}$ are compared with the simulated quantiles of $-2 \log \Lambda_{10,20}$ and $-2\rho \log \Lambda_{10,20}$ (based on $10^5$ replications) for sample sizes m = 10 and n = 20. Table 1 indicates that the empirical and theoretical levels are much closer for the Muirhead-approximation than for the test based on asymptotic χ2 results. For practical purposes, the empirical level of the test based on the expansion is sufficiently close to the theoretical level even for small sample sizes.
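The approximation of Corollary 2.1 is straightforward to evaluate numerically. The sketch below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the $\chi^2$ distribution functions are coded in closed form for even degrees of freedom, and the quantile is found by plain bisection under the assumption that $F_{m,n}$ crosses the level $p$ exactly once on the search interval.

```python
import math


def muirhead_constants(m, n):
    """Return (rho, gamma) as defined in Corollary 2.1."""
    total = m + n
    a = total / m + total / n - 1.0
    rho = 1.0 - 22.0 * a / (24.0 * total)
    gamma = 0.5 * ((total / m) ** 2 + (total / n) ** 2 - 1.0) - (121.0 / 96.0) * a ** 2
    return rho, gamma


def chi2_cdf_even(u, df):
    """Chi-square distribution function for even df, via its closed form."""
    if u <= 0.0:
        return 0.0
    h = u / 2.0
    term, partial_sum = 1.0, 1.0
    for k in range(1, df // 2):
        term *= h / k
        partial_sum += term
    return 1.0 - math.exp(-h) * partial_sum


def f_mn(u, m, n):
    """Approximate distribution function of -2 rho log Lambda under H0."""
    rho, gamma = muirhead_constants(m, n)
    weight = gamma / (rho ** 2 * (m + n) ** 2)
    f2 = chi2_cdf_even(u, 2)
    return f2 + weight * (chi2_cdf_even(u, 6) - f2)


def f_mn_quantile(p, m, n, lo=0.0, hi=100.0, tol=1e-10):
    """Solve f_mn(u) = p by bisection (assumes a single crossing on [lo, hi])."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_mn(mid, m, n) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rho, gamma = muirhead_constants(10, 20)
    print("rho =", rho, "gamma =", gamma)
    print("approximate 0.95-quantile:", f_mn_quantile(0.95, 10, 20))
```

For m = 10 and n = 20 the computed 0.95-quantile should land close to the value 5.77 listed for $F^{-1}_{10,20}(0.95)$ in Table 1; small deviations in the last digit can arise from rounding.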
There have also been several approaches to determine the exact distribution of $\Lambda_{m,n}$ in more or less computable form. Jain et al. (1975) developed computable but complicated series representations for the density and distribution function. Nagar and Gupta (2004) tabulated the distribution of $\Lambda_{m,n}$ for the balanced case m = n. Zhang et al. (2012) determined the exact distribution of $\Lambda_{m,n}$ as

$$P\left( \Lambda_{m,n} \le u \right) = 1 - C \int_{r_1}^{r_2} w_1^{(m-3)/2} \left( \int_{\frac{u^{2/n} m^{m/n} n}{(m+n)^{(m+n)/n} w_1^{m/n}}}^{1 - w_1} \frac{w_2^{(n-3)/2}}{\sqrt{1 - w_1 - w_2}} \, dw_2 \right) dw_1 \tag{5}$$

for $u \in (0, 1)$, with $C = \frac{\Gamma\left( \frac{m+n-1}{2} \right)}{\Gamma\left( \frac{m-1}{2} \right) \Gamma\left( \frac{n-1}{2} \right) \Gamma\left( \frac{1}{2} \right)}$ and $r_1 < r_2$. Hereby $r_1$ and $r_2$ denote the two roots of the function $g(w_1) = 1 - w_1 - \frac{u^{2/n} m^{m/n} n}{(m+n)^{(m+n)/n} w_1^{m/n}}$. The double integral in (5) can be evaluated by any numerical quadrature method.
Remark In many cases, likelihood ratio tests exhibit some kind of optimality. Hsieh (1979) has shown that for the testing problem under consideration, the likelihood ratio test is asymptotically optimal in the sense of Bahadur efficiency.
3 Combination of p values
Another method to obtain a test for H0 is to combine the p values of Student’s t test and the F test for equality of variances. For this purpose, the hypothesis H0 has to be rephrased as a multiple one. Using the hypothesis and alternative of the t test

$$H_0' : \vartheta \in \Theta_0' = \{\vartheta \in \Theta : \mu = \nu,\ \sigma^2 = \tau^2\} \quad \text{vs.} \quad H_1' : \vartheta \in \Theta_1' = \{\vartheta \in \Theta : \mu \ne \nu,\ \sigma^2 = \tau^2\}$$

and the F test

$$H_0'' : \vartheta \in \Theta_0'' = \{\vartheta \in \Theta : \sigma^2 = \tau^2\} \quad \text{vs.} \quad H_1'' : \vartheta \in \Theta_1'' = \{\vartheta \in \Theta : \sigma^2 \ne \tau^2\},$$

$H_0$ can be reformulated as

$H_0$ : $H_0'$ and $H_0''$ are true vs. $H_1$ : At least one of the alternatives $H_1'$ or $H_1''$ is true.
The statistic of the t test

$$T = \sqrt{\frac{mn}{m+n}} \, \frac{\bar X - \bar Y}{S}$$

with $S_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar X)^2$, $S_Y^2 = \frac{1}{n-1} \sum_{j=1}^{n} (Y_j - \bar Y)^2$ and the pooled variance estimator

$$S^2 = \frac{1}{m+n-2} \left( (m-1) S_X^2 + (n-1) S_Y^2 \right),$$

has a t-distribution with m + n − 2 degrees of freedom under $H_0'$, while the F test statistic

$$Q = \frac{S_X^2}{S_Y^2}$$

is $F_{m-1,n-1}$-distributed under $H_0''$. Methods of combining p values like Fisher’s method (Fisher 1932) presume the independence of the corresponding p values under the hypothesis. To prove the independence of T and Q under $H_0$, one can invoke Basu’s theorem to prove the independence of the sum and the quotient of two independent χ2-distributed random variables (Lehmann and Romano 2005, pp. 152–153). This result is applied to $(m-1) S_X^2 / \sigma^2$ and $(n-1) S_Y^2 / \sigma^2$ to derive the independence of $Q$ and $S^2$ under $H_0$. Since $S_X^2$, $S_Y^2$, $\bar X$ and $\bar Y$ are independent, T and Q are independent as well. After having shown the independence of t and F test statistics, we can combine the p values

$$G_1(t) = P_{H_0'}(|T| \ge |t|) \quad \text{and} \quad G_2(q) = 2 \min\left\{ P_{H_0''}(Q \ge q),\ P_{H_0''}(Q \le q) \right\}$$
in order to obtain a test for H0. Note that G2(q) corresponds to the p value of the usual two-sided F test with equal tail probabilities.
The crucial fact in the following examples of combination methods proposed in the literature is the following: if the distribution of the test statistic under H0 is unique and continuous, then the p value, considered as a random variable, follows the uniform distribution on the unit interval under H0 (this fact is often stated only for simple null hypotheses which is overly restrictive).
3.1 Combination method due to Fisher
Fisher (1932) proposed the combined statistic
M1 = −2 log(G1(T )G2(Q)) = −2 log G1(T ) − 2 log G2(Q).
Since $-2 \log G_1(T)$ and $-2 \log G_2(Q)$ are independent and $\chi^2_2$-distributed under $H_0$, their sum $M_1$ is $\chi^2_4$-distributed, and the decision to reject $H_0$ if $M_1 \ge \chi^2_{4;1-\alpha}$ leads to an exact level α-test.
For the testing problem given in (1), Perng and Littell (1976) proved that the test based on M1, just as the likelihood ratio test in Sect. 2, is asymptotically optimal in the sense of Bahadur efficiency (see also Singh 1986).
This is not too astonishing regarding the close relation between both tests. Fisher’s method combines both tests as a product with equal weights under $H_0$. On the other hand, as shown by Pearson and Neyman (1930), the squared t test statistic and the F-statistic are in one-to-one correspondence with the likelihood ratio statistics for testing $H_0'$ against $H_1'$ and $H_0''$ against $H_1''$, respectively. They showed further that the likelihood ratio for testing $H_0$ against $H_1$ can be expressed as the product of the likelihood ratio for testing $H_0'$ against $H_1'$ and the likelihood ratio for testing $H_0''$ against $H_1''$. Hence, the likelihood ratio test combines both tests as a product with approximately equal weights under $H_0$ (since they have the same limiting distribution under $H_0$).
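Fisher’s rule is easy to evaluate once the two p values are available. The following sketch is a minimal illustration with a hypothetical function name; it uses the closed-form tail $e^{-x/2}(1 + x/2)$ of the $\chi^2_4$-distribution, so no statistics library is needed. The inputs in the usage lines are the rounded p values reported for the word knowledge subtest in Example 2a (t test 0.077, F test 0.0008).

```python
import math


def fisher_combination(g1, g2):
    """Return (M1, combined p value) for two independent p values in (0, 1]."""
    m1 = -2.0 * (math.log(g1) + math.log(g2))
    # Survival function of the chi-square distribution with 4 degrees of freedom.
    p_combined = math.exp(-m1 / 2.0) * (1.0 + m1 / 2.0)
    return m1, p_combined


if __name__ == "__main__":
    m1, p = fisher_combination(0.077, 0.0008)
    print("M1 =", m1, "combined p value =", p)
```

Because $e^{-M_1/2} = G_1 G_2$, the combined p value can equivalently be written as $G_1 G_2 (1 - \log(G_1 G_2))$, which shows directly that it depends on the two p values only through their product.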
3.2 Minimum combination method and Bonferroni correction
Another classical approach is the minimum combination method proposed by Tippett (1931) with test statistic
M2 = min(G1(T ), G2(Q)).
Since the distribution function of the minimum of two independent and uniformly distributed random variables is u(2 − u) for u ∈ (0, 1), an exact test at level α can be stated as
$$\text{reject } H_0 \text{ if } M_2 \le 1 - \sqrt{1 - \alpha}. \tag{6}$$
This method is closely related to the simplest method to control the familywise error rate, the Bonferroni correction. Using this method, the overall hypothesis is rejected at a significance level of α if either of the individual hypotheses is rejected at a level of α/2, which corresponds to
reject H0 if M2 ≤ α/2. (7)
Generally, the Bonferroni correction is considered rather conservative, which can indeed be the case for strongly dependent statistics. However, for independent test statistics, both methods are equal for practical purposes since α/2, the critical value in (7), is just the first order Taylor expansion of the critical value in (6).
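A short numerical check makes the closeness of (6) and (7) concrete. The sketch below uses hypothetical function names; it computes both critical values and the combined p value $1 - (1 - M_2)^2$ that follows from the distribution function $u(2 - u)$ of the minimum.

```python
import math


def tippett_critical_value(alpha):
    """Exact critical value for the minimum p value, as in (6)."""
    return 1.0 - math.sqrt(1.0 - alpha)


def bonferroni_critical_value(alpha):
    """Bonferroni critical value for two tests, as in (7)."""
    return alpha / 2.0


def tippett_p_value(g1, g2):
    """Combined p value of the minimum method for independent p values."""
    m2 = min(g1, g2)
    return m2 * (2.0 - m2)  # equals 1 - (1 - m2)**2


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        print(alpha, tippett_critical_value(alpha), bonferroni_critical_value(alpha))
```

At α = 0.05 the two thresholds are about 0.0253 and 0.025, so a decision can differ only when the smaller p value falls in this narrow gap.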
3.3 Maximum and sum combination methods
Two further very simple approaches use maximum and sum of the two p values. Using the maximum statistic M3 = max(G1(T ), G2(Q)), the corresponding test of level α is given by
$$\text{reject } H_0 \text{ if } M_3 \le \sqrt{\alpha}. \tag{8}$$
The use of the sum M4 = G1 + G2 was proposed by Edington (1972). Since the distribution function of the convolution of two uniform random variables is u2/2 for u ∈ (0, 1), an exact test at level α is
$$\text{reject } H_0 \text{ if } M_4 \le \sqrt{2\alpha}. \tag{9}$$
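Rules (8) and (9) can equivalently be expressed through combined p values. The sketch below uses hypothetical function names. The branch of the sum method for $M_4 > 1$ is not stated in the text; it uses the standard distribution function $1 - (2 - u)^2/2$ of the sum of two independent uniform variables on $(1, 2)$ and is only needed to report a p value above 0.5.

```python
import math


def maximum_p_value(g1, g2):
    """Combined p value of the maximum method: P(max <= m3) = m3**2."""
    return max(g1, g2) ** 2


def sum_p_value(g1, g2):
    """Combined p value of the sum method for independent uniform p values."""
    s = g1 + g2
    if s <= 1.0:
        return s * s / 2.0
    # Upper half of the triangular distribution (assumption: standard result).
    return 1.0 - (2.0 - s) ** 2 / 2.0


if __name__ == "__main__":
    alpha = 0.05
    print("critical value (8):", math.sqrt(alpha))
    print("critical value (9):", math.sqrt(2.0 * alpha))
    # Rounded p values of the t and F test reported in Example 2a.
    print(maximum_p_value(0.077, 0.0008), sum_p_value(0.077, 0.0008))
```

Both rules can reject only if the larger of the two p values is itself small, since $M_3 \le \sqrt{\alpha}$ and $M_4 \le \sqrt{2\alpha}$ bound both p values at once.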
3.4 Combination methods due to Stouffer–Lipták and due to Mudholkar and George
Two further, more frequently used methods are based on the statistic

$$M_5 = \left( \Phi^{-1}(1 - G_1) + \Phi^{-1}(1 - G_2) \right) / \sqrt{2},$$

where $\Phi^{-1}$ denotes the inverse of the standard normal distribution function, and the logit statistic

$$M_6 = -\sqrt{7/(4\pi^2)} \, \left( \operatorname{logit}(G_1) + \operatorname{logit}(G_2) \right),$$

with $\operatorname{logit}(p) = \log(p/(1 - p))$. Early references for the combination method using $M_5$ are Stouffer et al. (1949) and Lipták (1958), and the method is found in the literature under both names. For brevity, we use the name Stouffer’s method in the following. Since $\Phi^{-1}(1 - G_i)$, $i = 1, 2$, is standard normally distributed under $H_0$, the decision

$$\text{reject } H_0 \text{ if } M_5 \ge \Phi^{-1}(1 - \alpha) \tag{10}$$

defines an exact test of level α.

Since $\operatorname{logit}(G_i)$, $i = 1, 2$, follows a standard logistic distribution under $H_0$, direct calculations yield

$$P_{H_0}\left( \operatorname{logit}(G_1) + \operatorname{logit}(G_2) > u \right) = u \left( \frac{e^{-u}}{1 - e^{-u}} \right)^2 + (u - 1) \frac{e^{-u}}{1 - e^{-u}}, \quad -\infty < u < \infty,$$

which can be used to perform an exact test based on $M_6$. However, Mudholkar and George (1979) (see also George and Mudholkar 1983) proposed to approximate the distribution of $M_6$ by a $t_{14}$-distribution, leading to the approximate level α test

$$\text{reject } H_0 \text{ if } M_6 \ge t_{14;1-\alpha}. \tag{11}$$

The proposed approximation is indeed very accurate, the exact and approximate 0.95-quantiles being 1.7649 and 1.7613, respectively.
The logit statistic has the same exact Bahadur slope under H0 as Fisher’s statistic. Hence, it is also optimal in the sense of Bahadur efficiency (Mudholkar and George 1979; Berk and Cohen 1979).
Figure 1 shows the different rejection regions of the discussed combination methods. Each region covers 5% of the unit square. In the right figure, the abscissa is continued from 0.5 to 1, but the scaling of the ordinate is different. Similar displays are shown, for example, in Loughin (2004), who concludes that the main feature of the maximum method and the combination method due to Edington is the inability to reject the hypothesis if one p value is large, regardless how small the other is. He also states that these combination methods can be useful, given the circumstance that both p values are equally significant. Thus, it is crucial to compare the tests empirically under different alternatives.
It should be noted that the presented combination rules can also be used in other situations, for example for combining dependent tests. A description and comparison of combination rules in a nonparametric context can be found in Pesarin and Salmaso (2010, pp. 128–134).
Fig. 1 Rejection regions of the different combination methods for α = 5 %
4 Empirical level and power of the tests
In the following, the tests are compared at level α = 0.05, whereby the exact 0.95-quantile of $M_6$ was used as well as the exact 0.95-quantile of $-2 \log \Lambda_{m,n}$, calculated by (5), which is 6.93032 for m = n = 10, 6.430465 for m = n = 20 and 6.657326 for m = 10, n = 30. The values for m = n coincide with the values given in Nagar and Gupta (2004). To achieve such a high accuracy, we used very high numbers of quadrature points in evaluating the double integral in (5) by Gauss–Legendre quadrature together with an extrapolation method.
First, we consider the case of equal variances but different expected values. To this end, the p values $G_1$ and $G_2$ were simulated $10^5$ times with sample sizes of m = n = 20 and under fixed parameters σ2 = τ2 = 1, μ = 0. Table 2 shows the empirical power of the tests for varying expectation ν. Clearly, the power should only depend on the absolute value of ν, which is the case within the simulation accuracy.
The maximum and the sum combination methods are often incapable of rejecting $H_0$, since with increasing ν the expected value of $G_1$ is decreasing, but $G_2$ remains uniformly distributed over (0, 1) under $H_0''$. This leads to a minimal type II error of $1 - \sqrt{\alpha} = 0.776$ for the maximum method and a minimal error of $1 - \sqrt{2\alpha} = 0.683$ for the combination method due to Edington. This is illustrated in Fig. 2, where the left plot shows the absolute power, whereas the plot on the right hand side shows the power relative to the best test. From the remaining tests, the minimum method performs best, followed closely by the LQ-test and Fisher’s method, and, in some distance, Mudholkar’s and Stouffer’s methods.

Table 2 Empirical power for varying ν, σ2 = τ2 = 1, μ = 0 and m = n = 20

ν      Fisher  Minimum  Maximum  Edington  Stouffer  Mudholkar  LQ-test
−1.5   0.986   0.991    0.223    0.314     0.903     0.962      0.988
−1     0.769   0.798    0.216    0.291     0.613     0.698      0.781
−0.5   0.252   0.258    0.143    0.164     0.215     0.233      0.256
0      0.050   0.050    0.050    0.050     0.050     0.050      0.050
0.5    0.251   0.257    0.142    0.163     0.215     0.232      0.255
1      0.770   0.799    0.218    0.292     0.618     0.701      0.782
1.5    0.986   0.990    0.222    0.313     0.903     0.962      0.988

Fig. 2 Empirical absolute (left) and relative (right) power for varying ν

Table 3 Empirical power for varying τ, σ2 = 1, μ = ν = 0 and m = n = 20

τ      Fisher  Minimum  Maximum  Edington  Stouffer  Mudholkar  LQ-test
0.2    1.000   1.000    0.230    0.321     0.998     1.000      1.000
0.6    0.459   0.480    0.180    0.224     0.365     0.411      0.473
1      0.050   0.050    0.049    0.050     0.049     0.049      0.050
1.4    0.216   0.222    0.126    0.142     0.183     0.198      0.221
1.8    0.585   0.609    0.198    0.253     0.459     0.523      0.600
2.2    0.853   0.870    0.223    0.301     0.701     0.789      0.864
2.6    0.959   0.966    0.226    0.313     0.849     0.921      0.964
Next, the case of equal expectations but different variances is considered. Here, the parameters μ = ν = 0, σ 2 = 1 are fixed, and τ varies between 0.2 and 2.6. As Table 3 and Fig. 3 show, the results are similar to the preceding setup. This is perhaps surprising at first sight, since the t test is not a level α-test if the variances differ. However, it is well-known that the t test is quite robust against the violation of homogeneity of variances; a quantitative statement in this direction can be found in Perng and Littell (1976, p. 970).
Fig. 3 Empirical absolute (left) and relative (right) power for varying τ
Table 4 Average p values for different choices of ν and τ
No. ν τ Ḡ1 Ḡ2
1 0.000 1.00 0.50 0.50
2 0.075 1.05 0.49 0.49
3 0.150 1.10 0.47 0.48
4 0.225 1.15 0.44 0.45
5 0.300 1.20 0.40 0.41
6 0.375 1.25 0.36 0.37
7 0.450 1.30 0.31 0.33
8 0.525 1.35 0.27 0.30
9 0.600 1.40 0.24 0.26
10 0.675 1.45 0.20 0.23
11 0.750 1.50 0.17 0.20
12 0.825 1.55 0.15 0.17
13 0.900 1.60 0.13 0.14
14 0.975 1.65 0.11 0.12
15 1.050 1.70 0.09 0.10
16 1.125 1.75 0.08 0.09
17 1.200 1.80 0.07 0.07
Our main interest lies in situations where the expected values as well as the variances differ. Here, the parameters ν and τ have been chosen so that the p values of the t and F test are nearly equal on average (see Table 4; Fig. 4). To this end, the statistics T and Q were simulated $10^5$ times in order to estimate $E_\vartheta(G_1)$ and $E_\vartheta(G_2)$ by their arithmetic means.
Fig. 4 Visualisation of Table 4 (axes: G1 and G2)
Table 5 Empirical power for varying ν and τ, μ = 0, σ2 = 1 and m = n = 20

No.    Fisher  Minimum  Maximum  Edington  Stouffer  Mudholkar  LQ-test
1      0.050   0.050    0.049    0.049     0.049     0.049      0.049
3      0.076   0.074    0.069    0.071     0.074     0.075      0.076
5      0.159   0.148    0.129    0.139     0.153     0.156      0.158
7      0.294   0.260    0.234    0.257     0.287     0.293      0.291
9      0.455   0.393    0.364    0.401     0.449     0.454      0.448
11     0.621   0.539    0.509    0.555     0.617     0.624      0.612
13     0.758   0.671    0.637    0.689     0.757     0.763      0.749
15     0.859   0.783    0.742    0.791     0.855     0.862      0.852
17     0.922   0.864    0.822    0.864     0.921     0.925      0.917
In the present case, the methods of Mudholkar, Fisher, Stouffer and the LQ-test behave very similarly, the first two having the edge over the latter (see Table 5). The minimum combination method exhibits a slightly lower power, as already stated by Loughin (2004). However, the results contradict Loughin’s statement that the maximum method and the one due to Edington perform better in this case than the other tests. As illustrated in Fig. 5, both methods can recognise the alternative now, but their empirical power is still lower.
We reran all simulations with smaller sample sizes m = n = 10. Apart from generally lower values of the power, all of the aforementioned conclusions remain unchanged. Further, we reran all simulations with unequal sample sizes, namely m = 10 and n = 30, hence maintaining the total sample size. Under this scenario the empirical power decreases under every alternative compared to the balanced case m = n = 20. Thereby, the power under a changing τ declines more than under a changing ν. Given the alternative that both parameters change, the LQ-test clearly performs best and the minimum method worst.

Fig. 5 Empirical absolute (left) and relative (right) power for varying ν and τ, μ = 0, σ2 = 1
Remark The statements about the asymptotic optimality of the likelihood ratio test or the combination method of Fisher are at first sight surprising. Consider, for example, the first simulation scenario with a difference between means but equal variances. Then, the optimality property of, say, Fisher’s method means that we lose no power in a specific asymptotic sense if we use this method instead of the t test which is well-known to be optimal in this situation for any finite sample size.
Relative efficiency of a sequence of tests {Sn} with respect to another sequence {Tn} is the ratio of the sample sizes necessary for {Sn} and {Tn} in order to attain the power β under the level α for a specific alternative. Bahadur asymptotic relative efficiency considers the limit of this ratio for a sequence of levels decreasing to zero keeping β and the alternative fixed.
As an example, we fix power β = 0.6 and mean ν = 0.5, and consider four decreasing values of α, namely 0.187, 0.027, 0.0012 and $3 \cdot 10^{-6}$. For the balanced case m = n, the t test needs sample sizes 20, 50, 100 and 200 to reach power β = 0.6, whereas the combination method of Fisher has the same power for sample sizes 27, 64, 122 and 230. Hence, the relative efficiencies for the four different levels are 0.74, 0.78, 0.82 and 0.87. Indeed, relative efficiency increases, but even for large sample sizes (and, correspondingly, small levels) it is far away from 1, the limit for α → 0 given by theory.
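The quoted relative efficiencies are simply the ratios of the required sample sizes. The short sketch below reproduces this arithmetic; the variable names are hypothetical, and the sample sizes are the ones stated in the paragraph above.

```python
# Sample sizes needed to reach power 0.6 at the four levels quoted in the text.
levels = [0.187, 0.027, 0.0012, 3e-6]
t_test_sizes = [20, 50, 100, 200]
fisher_sizes = [27, 64, 122, 230]

# Relative efficiency of Fisher's method with respect to the t test.
relative_efficiency = [
    round(n_t / n_fisher, 2) for n_t, n_fisher in zip(t_test_sizes, fisher_sizes)
]

if __name__ == "__main__":
    for alpha, eff in zip(levels, relative_efficiency):
        print(f"alpha = {alpha:g}: relative efficiency = {eff}")
```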
5 Data examples
We applied the tests to several data sets exemplifying the behaviour of the presented methods.
Fig. 6 Box-plots (left) and QQ-plots for normality (middle: 32 kV, right: 36 kV) for the log-transformed samples of example 1
Table 6 p values for the different testing methods applied to examples 1–3

Ex.   Fisher      Minimum    Maximum    Edington   Stouffer    Mudholkar   LQ-test
1     0.0058∗∗    0.030∗     0.0021∗∗   0.0019∗∗   0.0033∗∗    0.0045∗∗    0.0072∗∗
2a    0.0007∗∗∗   0.0017∗∗   0.0059∗∗   0.0030∗∗   0.0006∗∗∗   0.0006∗∗∗   0.0008∗∗∗
2b    0.11        0.045∗     1          0.52       1           1           0.074◦
3     0.0098∗∗    0.040∗     0.0039∗∗   0.0034∗∗   0.0057∗∗    0.0075∗∗    0.012∗
Example 1 First, we consider a small data set discussed by Nair (1984) giving the times in minutes to breakdown of an insulating fluid under elevated voltage stresses of 32 and 36 kV, respectively, for the first and second sample. 15 observations are taken at each voltage. This data set is also considered in Shoemaker (1999), Marozzi (2011, 2012) where the question of interest is if the variability of times is significantly higher for the 32 KV power voltage. However, in practice, it would certainly be interesting as well if the mean times differ significantly between the two samples. Since the observations are failure times, the distributions of the raw data are highly skewed. However, after a log-transformation, the data look rather symmetric and short-tailed, as Fig. 6 shows. Hence, the proposed tests should be appropriate for the transformed data. Means and standard deviations of the transformed samples are
32 kV : x̄ = 2.229, sx = 0.902, 36 kV : ȳ = 2.198, sy = 1.110.
For testing the equality of the underlying distributions, we can use any of the tests discussed above. The corresponding p values are given in the second line of Table 6; there, we added the usual significance codes to the p values for a fast overview: p∗∗∗ if p ≤ 0.001, p∗∗ if 0.001 < p ≤ 0.01, p∗ if 0.01 < p ≤ 0.05, and p◦ if 0.05 < p ≤ 0.10.
For the data at hand, the p value of the minimum combination method is considerably larger than for all other methods, whereas all remaining p values are comparable and smaller than 0.01. Since the tests find a significant difference between the two underlying distributions, one could informally proceed as proposed in Sect. 3 of Zhang et al. (2012), first applying the F test (p = 0.015) followed by Welch’s t test [p = 0.050, cf. Bickel and Doksum (2006, p. 264), or Aspin and Welch 1949]. Clearly, one has to be careful when formally reporting the results of such follow-up tests. The results of this example are in agreement with the simulation results: the minimum method has comparatively low power when there exist differences in the means as well as in the variances.
Example 2a,b There is a long scientific dispute about whether there are sex differences in cognitive abilities. Deary et al. (2007) used a novel design, comparing 1292 pairs of opposite-sex siblings who participated in the US National Longitudinal Survey of Youth 1979. The mental test applied was divided in several subtasks (Deary et al. 2007, Table 1). Here, we consider the results of two subtests of the test battery, namely word knowledge and mathematics knowledge. Means x̄, ȳ and standard deviations sx , sy for males and females for the word knowledge subtest are
x̄ = 22.3, sx = 9.0, ȳ = 22.9, sy = 8.2.
The third line in Table 6 shows the p values for the tests for equality of distributions. Here, the values for the maximum, sum and minimum methods are larger than the remaining values, but all tests yield a significant result at the 0.01 level. Since there is a significant difference between the two distributions, we applied the two-sample t test (p = 0.077) and the F test (p = 0.0008), indicating a significantly larger variability in the results of males compared to females.
Means and standard deviations for the mathematics knowledge subtest for males and females are given by
x̄ = 11.9, sx = 6.5, ȳ = 11.9, sy = 6.1.
It is a somewhat extreme example showing no difference between the sample means. The p values of the tests can be found in line 4 of Table 6. Here, only the minimum combination method shows a significant result at the 0.05 level, followed by the LQ-test and Fisher’s combination method resulting in p values of 0.074 and 0.11. The remaining p values are larger than 0.5. Here, the F test yields a p value of 0.023. Clearly, the minimum method is not affected by the large p value of the t test, and hence, performs best in this example.
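The behaviour of Fisher’s method and the minimum method in this example can be checked with a few lines of arithmetic. The sketch below assumes two independent p values; for two tests, Fisher’s statistic −2(ln p1 + ln p2) is chi-square distributed with 4 degrees of freedom under H0, whose survival function has the closed form used here. The minimum method is written in the form 1 − (1 − min p)², with the Bonferroni bound 2 · min p next to it. The function names are hypothetical, and the t test p value is set to 1.0 as an idealization of the coinciding rounded means.

```python
import math


def fisher_two(p1, p2):
    """Fisher combination of two independent p values.

    -2 * (ln p1 + ln p2) is chi-square with 4 df under H0, and the
    chi-square(4) survival function is exp(-x/2) * (1 + x/2).
    """
    q = p1 * p2
    return q * (1.0 - math.log(q))


def minimum_two(p1, p2):
    """Minimum combination of two independent p values."""
    return 1.0 - (1.0 - min(p1, p2)) ** 2


def bonferroni_two(p1, p2):
    """Bonferroni-adjusted p value for two tests (capped at 1)."""
    return min(1.0, 2.0 * min(p1, p2))


# Mathematics knowledge subtest: reported F test p value, idealized t test p value.
p_t, p_f = 1.0, 0.023
print(f"Fisher:     {fisher_two(p_t, p_f):.3f}")
print(f"minimum:    {minimum_two(p_t, p_f):.3f}")
print(f"Bonferroni: {bonferroni_two(p_t, p_f):.3f}")
```

Under these assumptions Fisher’s method gives about 0.11, in line with the value quoted above, while the minimum method stays just below 0.05 because it ignores the large t test p value.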
It should be noted that in this and the following example we are dealing with large samples, where very small differences may be statistically significant but of little scientific or practical importance.
Example 3 Table 1 in Steinmayr et al. (2010) shows the results for various subtests of the German Intelligence-Structure-Test 2000-R performed by 426 male and 551 female students attending 11th or 12th grade with age ranging from 16 to 18 years. Means and standard deviations for the matrices knowledge subtest for males and females are reported as
x̄ = 11.06, sx = 2.90, ȳ = 11.39, sy = 2.61.
The results in line 5 of Table 6 show that, as in Example 1, the p value of the minimum method is considerably larger than for all other methods, followed by the LQ-test and Fisher’s method with p values around 0.01. Since there are significant differences between the underlying distributions, we applied the two-sample t test (p = 0.062) and the F test (p = 0.020), again indicating a larger variability in the test scores of male students. In this example, it is clearly noticeable that the p values of all combination tests except the minimum method can be much smaller than the p values of the individual t and F tests.
6 Discussion
• To sum up the results of the simulation study, it is clear that the maximum and the sum combination methods should not be used due to their inability to detect many alternatives. From Fig. 1, one could expect that both methods might be useful if differences in location come along with differences in variability, which is the rule rather than the exception in many biometrical applications. However, the simulations show that these methods are inferior to other combination methods even in the case of location-scale differences. Furthermore, the minimum combination method is not really recommendable due to its comparatively low power when there exist differences in the means as well as in the variances. There is not much to choose between the remaining tests. In terms of power, the likelihood ratio test has the edge over the other methods, at least in unbalanced situations, whereas the Fisher combination method stands out due to its simplicity.
• Even if the data examples corroborate the previous findings, the performance of the different tests for specific data sets may be astonishing. As always in such situations, there is a danger that one performs several tests, and chooses a specific one afterwards for reporting.
• Like the F test for the homogeneity of variances, all tests previously described are sensitive to the assumption that the data are drawn from underlying Gaussian distributions. This assumption should be checked by diagnostic plots. There are various more robust (and less efficient) competitors to the F test available (see, e.g., Marozzi 2011), but combinations of these tests with tests for equality of location are not straightforward since the test statistics are not independent. The same holds for combinations of nonparametric tests like the Lepage test (Lepage 1971; Marozzi 2013). There also exist nonparametric location-scale tests like the Cucconi rank test (Cucconi 1968) which are not combination tests. Marozzi (2009) shows that the Cucconi test is a powerful alternative to the Lepage test and suggests carrying out the test as a permutation test. Clearly, such tests can be preferable in specific applications.
• It is certainly possible to cover one or more of the presented methods in the classroom. At least, it should be made clear that the combination of the t and the F test using a Bonferroni correction leads to a valid test for H0 against H1 at level α.
If one accepts (or proves) the independence of the tests, it is possible to discuss the more refined combination methods. Such a treatment accentuates the randomness of p values, an important fact which is often obscured in the classroom (Murdoch et al. 2008).
Determining the likelihood ratio statistic in (2) is a worthwhile exercise, while a more or less sophisticated implementation of the likelihood ratio test of Muirhead in Corollary 2.1 is an interesting task for an accompanying statistical computing lab.
• One caveat: strictly speaking, none of the tests is a diagnostic test, insofar as it is not possible to deduce differences in means or variances from a rejection of the overall hypothesis (this would be possible using Welch’s t test and the F test with Bonferroni correction). However, nothing speaks against an informal approach as in Sect. 3 of Zhang et al. (2012).
Since the minimum method corresponds to the Bonferroni correction, and since the t test is robust against violations of variance homogeneity, the minimum method is closest to a diagnostic test.
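The first discussion point, that the maximum and sum methods cannot detect many alternatives, can be made concrete for two independent p values. The sketch below assumes the usual definitions: under H0 the maximum of two independent uniform p values has distribution function u², and their sum has the triangular distribution on [0, 2]. The function names are hypothetical, and the inputs are illustrative (the first pair mirrors the mathematics knowledge example with the t test p value idealized as 1.0).

```python
def maximum_two(p1, p2):
    """Maximum combination: P(max(U1, U2) <= m) = m**2 under H0."""
    return max(p1, p2) ** 2


def sum_two(p1, p2):
    """Sum combination: distribution function of U1 + U2 under H0."""
    s = p1 + p2
    if s <= 1.0:
        return 0.5 * s * s
    return 1.0 - 0.5 * (2.0 - s) ** 2


# One large p value keeps both combined p values large,
# no matter how small the other p value becomes.
for p_t, p_f in ((1.0, 0.023), (1.0, 1e-12), (0.9, 1e-12)):
    print(p_t, p_f, round(maximum_two(p_t, p_f), 4), round(sum_two(p_t, p_f), 4))

# Both methods respond only when both p values are small.
print(maximum_two(0.02, 0.03), sum_two(0.02, 0.03))
```

With a t test p value near 1, the combined p value is 1 for the maximum method and cannot fall below 0.5 for the sum method, so a pure difference in variances goes undetected however small the F test p value is.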
Acknowledgments The authors thank the Editor and two anonymous referees for their valuable comments on the original version of the manuscript.
References
Aspin AA, Welch BL (1949) Tables for use in comparisons whose accuracy involves two variances, separately estimated. Biometrika 36:290–296
Berk RH, Cohen A (1979) Asymptotically optimal methods of combining tests. Journal of the American Statistical Association 74:812–814
Bickel PJ, Doksum KA (2006) Mathematical statistics, basic ideas and selected topics, 2nd ed, vol 1. Pearson, London
Cucconi O (1968) Un nuovo test non parametrico per il confronto tra due gruppi campionari. Giornale degli Economisti XXVII, pp 225–248
Deary IJ, Irwing P, Der G, Bates TC (2007) Brother-sister differences in the g factor in intelligence: analysis of full, opposite-sex siblings from the NLSY1979. Intelligence 35:451–456
Edgington ES (1972) An additive method for combining probability values from independent experiments. J Psychol 80:351–363
Fisher RA (1932) Statistical methods for research workers, 4th ed. Oliver & Boyd, Edinburgh
Gastwirth JL, Gel YR, Miao W (2009) The impact of Levene’s test of equality of variances on statistical theory and practice. Stat Sci 24:343–360
George EO, Mudholkar GS (1983) On the convolution of logistic random variables. Metrika 30:1–13
Hogg RV, McKean JW, Craig AT (2005) Introduction to mathematical statistics, 6th ed. Pearson Education, London
Hsieh HK (1979) On asymptotic optimality of likelihood ratio tests for multivariate normal distributions. Ann Statist 7:592–598
Jain SK, Rathie PN, Shah MC (1975) The exact distributions of certain likelihood ratio criteria. Sankhya Ser A 37:150–163
Lehmann EL, Romano JP (2005) Testing statistical hypotheses, 3rd edn. Springer, Berlin
Lepage Y (1971) A combination of Wilcoxon’s and Ansari–Bradley’s statistics. Biometrika 58:213–217
Lipták T (1958) On the combination of independent tests. Magyar Tudományos Akadémia Matematikai Kutató Intezetenek Kozlemenyei 3:1971–1977
Loughin TM (2004) A systematic comparison of methods for combining p-values from independent tests. Comput Stat Data Anal 47:467–485
Marozzi M (2009) Some notes on the location-scale Cucconi test. J Nonparametric Stat 21:629–647
Marozzi M (2011) Levene type tests for the ratio of two scales. J Stat Comput Simul 81:815–826
Marozzi M (2012) A combined test for differences in scale based on the interquantile range. Stat Papers 53:61–72
Marozzi M (2013) Nonparametric simultaneous tests for location and scale testing: a comparison of several methods. Commun Stat Simul Comput 42:1298–1317
Mudholkar GS, George EO (1979) The logit statistic for combining probabilities. In: Rustagi J (ed) Symposium on optimizing methods in statistics. Academic Press, New York, pp 345–366
Muirhead RJ (1982) On the distribution of the likelihood ratio test of equality of normal populations. Can J Stat 10:59–62
Murdoch DJ, Tsai Y, Adcock J (2008) P-values are random variables. Am Stat 62:242–245
Nagar DK, Gupta AK (2004) Percentage points for testing homogeneity of several univariate Gaussian populations. Appl Math Comput 156:551–561
Nair VN (1984) On the behaviour of some estimators from probability plots. J Am Stat Assoc 79:823–830
Pearson ES, Neyman J (1930) On the problem of two samples. In: Neyman J, Pearson ES (eds) Joint statistical papers. Cambridge University Press, Cambridge, pp 99–115, 1967
Perng SK, Littell RC (1976) A test of equality of two normal population means and variances. J Am Stat Assoc 71:968–971
Pesarin F, Salmaso L (2010) Permutation tests for complex data: theory, applications and software. Wiley, New York
Shoemaker LH (1999) Interquantile tests for dispersion in skewed distributions. Commun Stat Simul Comput 28:189–205
Singh N (1986) A simple and asymptotically optimal test for the equality of normal populations: a pragmatic approach to one-way classification. J Am Stat Assoc 81:703–704
Steinmayr R, Beauducel A, Spinath B (2010) Do sex differences in a faceted model of fluid and crystallized intelligence depend on the method applied? Intelligence 38:101–110
Stouffer S, Suchman E, DeVinney L, Star S, Williams R (1949) The American soldier, vol I. Adjustment during army life. Princeton University Press, Princeton
Tippett LHC (1931) The method of statistics. Williams and Norgate, London
Zhang L, Xu X, Chen G (2012) The exact likelihood ratio test for equality of two normal populations. Am Stat 66:180–184