Dummy Variables, Regression Diagnostics, and Model Evaluation

Non-Normally Distributed Errors

In: Regression Diagnostics

By: John Fox

Pub. Date: 2011

Access Date: October 16, 2019

Publishing Company: SAGE Publications, Inc.

City: Thousand Oaks

Print ISBN: 9780803939714

Online ISBN: 9781412985604

DOI: https://dx.doi.org/10.4135/9781412985604

Print pages: 41-48

© 1991 SAGE Publications, Inc. All Rights Reserved.

Non-Normally Distributed Errors

The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central-limit

theorem assures that under very broad conditions inference based on the least-squares estimators is

approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?

First, although the validity of least-squares estimation is robust—as stated, the levels of tests and confidence

intervals are approximately correct in large samples even when the assumption of normality is violated—the

method is not robust in efficiency: The least-squares estimator is maximally efficient among unbiased

estimators when the errors are normal. For some types of error distributions, however, particularly those with

heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares

estimator becomes much less efficient than alternatives (e.g., so-called robust estimators, or least-squares

augmented by diagnostics). To a substantial extent, heavy-tailed error distributions are problematic because

they give rise to outliers, a problem that I addressed in the previous chapter.

A commonly quoted justification of least-squares estimation, called the Gauss-Markov theorem, states

that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of

the observations yi. This result depends on the assumptions of linearity, constant error variance, and

independence, but does not require normality (see, e.g., Fox, 1984, pp. 42–43). Although the restriction to

linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least

squares to heavy-tailed error distributions.

Second, highly skewed error distributions, aside from their propensity to generate outliers in the direction of

the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y

given the xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently,

we may prefer to transform the data to produce a symmetric error distribution.

Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that

divide the data naturally into groups. An examination of the distribution of residuals may therefore motivate

respecification of the model.

Although there are tests for non-normal errors, I shall describe here instead graphical methods for examining

the distribution of the residuals (but see Chapter 9). These methods are more useful for pinpointing the

character of a problem and for suggesting solutions.

Normal Quantile-Comparison Plot of Residuals

One such graphical display is the quantile-comparison plot, which permits us to compare visually the

cumulative distribution of an independent random sample—here of studentized residuals—to a cumulative

reference distribution—the unit-normal distribution. Note that approximations are implied, because the

studentized residuals are t distributed and dependent, but generally the distortion is negligible, at least for

moderate-sized to large samples.

To construct the quantile-comparison plot:

1. Arrange the studentized residuals in ascending order: t(1), t(2), …, t(n). By convention, the ith smallest studentized residual, t(i), has gi = (i - 1/2)/n proportion of the data below it. This convention avoids cumulative proportions of zero and one by (in effect) counting half of each observation below and half above its recorded value. Cumulative proportions of zero and one would be problematic because the normal distribution, to which we wish to compare the distribution of the residuals, never quite reaches cumulative probabilities of zero or one.

2. Find the quantile of the unit-normal distribution that corresponds to a cumulative probability of gi, that is, the value zi from Z ∼ N(0, 1) for which Pr(Z < zi) = gi.

3. Plot the t(i) against the zi.

If the ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, t(i) = zi.

Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can

be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of non-normality.
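
The three construction steps can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name and the example residuals are hypothetical, and the unit-normal quantiles come from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist


def quantile_comparison_points(residuals):
    """Return (z_i, t_(i)) pairs for a normal quantile-comparison plot."""
    ordered = sorted(residuals)            # step 1: t_(1) <= ... <= t_(n)
    n = len(ordered)
    unit_normal = NormalDist(mu=0.0, sigma=1.0)
    points = []
    for i, t in enumerate(ordered, start=1):
        g = (i - 0.5) / n                  # cumulative proportion g_i
        z = unit_normal.inv_cdf(g)         # step 2: Pr(Z < z_i) = g_i
        points.append((z, t))              # step 3: plot t_(i) against z_i
    return points


# Hypothetical studentized residuals, for illustration only.
example_residuals = [0.30, -0.50, 3.13, 1.20, -1.10]
points = quantile_comparison_points(example_residuals)
```

Plotting the second coordinate of each pair against the first gives the display; the comparison line with zero intercept and unit slope is simply t = z.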

It is sometimes advantageous to adjust the fitted line for the observed center and spread of the residuals. To

understand how the adjustment may be accomplished, suppose more generally that a variable X is normally

distributed with mean μ and variance σ². Then, for an ordered sample of values, approximately x(i) = μ +

σzi, where zi is defined as before. In applications, we need to estimate μ and σ, preferably robustly, because

the usual estimators (the sample mean and standard deviation) are markedly affected by extreme values.

Generally effective choices are the median of x to estimate μ and (Q3 - Q1)/1.349 to estimate σ, where Q1 and

Q3 are, respectively, the first and third quartiles of x: The median and quartiles are not sensitive to outliers.

Note that 1.349 is the number of standard deviations separating the quartiles of a normal distribution. Applied

to the studentized residuals, we have the fitted line t̂(i) = median(t) + {[Q3(t) - Q1(t)]/1.349} × zi. The normal

quantile-comparison plots in this monograph employ the more general procedure.
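
A short sketch of the robust intercept and slope follows. It assumes one particular quartile convention (the "inclusive" method of Python's `statistics.quantiles`); other conventions give slightly different quartiles in small samples. The function name and data are hypothetical.

```python
from statistics import NormalDist, median, quantiles


def robust_line(values):
    """Return (intercept, slope) of the robust comparison line.

    intercept = median, slope = (Q3 - Q1) / 1.349.
    The quartile rule used here is an assumption, not the text's.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), (q3 - q1) / 1.349


# 1.349 is the interquartile range of the unit-normal distribution.
normal_iqr = 2 * NormalDist().inv_cdf(0.75)

# Hypothetical residuals, for illustration only.
intercept, slope = robust_line([-2.0, -1.0, 0.0, 1.0, 2.0])
```

The fitted value at each normal quantile z is then intercept + slope × z.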

Several illustrative normal-probability plots for simulated data are shown in Figure 5.1. In parts a and b of

the figure, independent samples of size n = 25 and n = 100, respectively, were drawn from a unit-normal

distribution. In parts c and d, samples of size n = 100 were drawn from the highly positively skewed χ² distribution

with 4 degrees of freedom and the heavy-tailed t distribution with 2 degrees of freedom, respectively. Note how the skew and heavy tails show up as

departures from linearity in the normal quantile-comparison plots. Outliers are discernible as unusually large

or small values in comparison with corresponding normal quantiles.

Judging departures from normality can be assisted by plotting information about sampling variation. If the

studentized residuals were drawn independently from a unit-normal distribution, then

SE(t(i)) = [1/ϕ(zi)] × √[gi(1 - gi)/n]

where ϕ(zi) is the probability density (i.e., the “height”) of the unit-normal distribution at Z = zi. Thus, zi ± 2 ×

SE(t(i)) gives a rough 95% confidence interval around the fitted line t̂(i) = zi in the quantile-comparison plot.

If the slope of the fitted line is taken as σ̂ = (Q3 - Q1)/1.349 rather than 1, then the estimated standard error

may be multiplied by σ̂. As an alternative to computing standard errors, Atkinson (1985) has suggested a

computationally intensive simulation procedure that does not treat the studentized residuals as independent

and normally distributed.
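
The two-standard-error limits can be computed as in the sketch below. It assumes the usual large-sample standard error of an order statistic, SE = √[g(1 - g)/n]/ϕ(z), scaled by the slope of the fitted line. The function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def envelope(n, intercept=0.0, slope=1.0):
    """Return (z_i, lower_i, upper_i) for i = 1..n: fitted line +/- 2 SE."""
    unit_normal = NormalDist()
    rows = []
    for i in range(1, n + 1):
        g = (i - 0.5) / n
        z = unit_normal.inv_cdf(g)
        # Assumed formula: SE = slope * sqrt(g(1 - g)/n) / phi(z)
        se = slope * sqrt(g * (1 - g) / n) / unit_normal.pdf(z)
        fitted = intercept + slope * z
        rows.append((z, fitted - 2 * se, fitted + 2 * se))
    return rows


rows = envelope(25)
```

The limits are narrowest near the center of the distribution and widen in the tails, where the normal density is small.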

Figure 5.1. Illustrative normal quantile-comparison plots. (a) For a sample of n = 25 from N(0, 1). (b)

For a sample of n = 100 from N(0, 1). (c) For a sample of n = 100 from the positively skewed χ² distribution with 4 degrees of freedom. (d)

For a sample of n = 100 from the heavy-tailed t distribution with 2 degrees of freedom.

Figure 5.2 shows a normal quantile-comparison plot for the studentized residuals from Duncan's regression of

rated prestige on occupational income and education levels. The plot includes a fitted line with two-standard-

error limits. Note that the residual distribution is reasonably well behaved.

Figure 5.2. Normal quantile-comparison plot for the studentized residuals from the regression of

occupational prestige on income and education. The plot shows a fitted line, based on the median

and quartiles of the ts, and approximate ±2SE limits around the line.

Histograms of Residuals

A strength of the normal quantile-comparison plot is that it retains high resolution in the tails of the distribution,

where problems often manifest themselves. A weakness of the display, however, is that it does not convey a

good overall sense of the shape of the distribution of the residuals. For example, multiple modes are difficult

to discern in a quantile-comparison plot.

Histograms (frequency bar graphs), in contrast, have poor resolution in the tails or wherever data are

sparse, but do a good job of conveying general distributional information. The arbitrary class boundaries,

arbitrary intervals, and roughness of histograms sometimes produce misleading impressions of the data,

however. These problems can partly be addressed by smoothing the histogram (see Silverman, 1986, or Fox,

1990). Generally, I prefer to employ stem-and-leaf displays—a type of histogram (Tukey, 1977) that records

the numerical data values directly in the bars of the graph—for small samples (say n < 100), smoothed

histograms for moderate-sized samples (say 100 ≤ n ≤ 1,000), and histograms with relatively narrow bars for

large samples (say n > 1,000).

Figure 5.3. Stem-and-leaf display of studentized residuals from the regression of occupational

prestige on income and education.

A stem-and-leaf display of studentized residuals from the Duncan regression is shown in Figure 5.3. The

display reveals nothing of note: There is a single mode, the distribution appears reasonably symmetric, and

there are no obvious outliers, although the largest value (3.1) is somewhat separated from the next-largest

value (2.0).

Each data value in the stem-and-leaf display is broken into two parts: The leading digits comprise the stem;

the first trailing digit forms the leaf; and the remaining trailing digits are discarded, thus truncating rather

than rounding the data value. (Truncation makes it simpler to locate values in a list or table.) For studentized

residuals, it is usually sensible to make this break at the decimal point. For example, for the residuals shown

in Figure 5.3: 0.3039 → 0 |3; 3.1345 → 3 |1; and -0.4981 → -0 |4. Note that each stem digit appears twice,

implicitly producing bins of width 0.5. Stems marked with asterisks (e.g., 1*) take leaves 0–4; stems

marked with periods (e.g., 1.) take leaves 5–9. (For more information about stem-and-leaf displays, see,

e.g., Velleman and Hoaglin [1981] or Fox [1990].)
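
The splitting rule can be sketched as follows, using the three values quoted above. The function name is hypothetical, and the sketch covers only the case described in the text: a break at the decimal point with two lines per stem.

```python
def split_stem_leaf(value):
    """Split a residual into (stem label, leaf), truncating extra digits.

    The stem label carries '*' for leaves 0-4 and '.' for leaves 5-9.
    """
    sign = "-" if value < 0 else ""
    tenths = int(abs(value) * 10)      # truncate, do not round
    stem, leaf = divmod(tenths, 10)
    marker = "*" if leaf <= 4 else "."
    return f"{sign}{stem}{marker}", leaf


examples = [split_stem_leaf(v) for v in (0.3039, 3.1345, -0.4981)]
```

Note that the sign is kept with the stem, so that -0.4981 falls on a "-0" stem rather than on the "0" stem.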

Figure 5.4. The family of powers and roots. The transformation labeled “p” is actually y' = (y^p - 1)/p;

for p = 0, y' = log_e y.

SOURCE: Adapted with permission from Figure 4-1 from Hoaglin, Mosteller, and Tukey (eds.). Understanding

Robust and Exploratory Data Analysis, © 1983 by John Wiley and Sons, Inc.

Correcting Asymmetry by Transformation

A frequently effective approach to a variety of problems in regression analysis is to transform the data so that

they conform more closely to the assumptions of the linear model. In this and later chapters I shall introduce

transformations to produce symmetry in the error distribution, to stabilize error variance, and to make the

relationship between y and the xs linear.

In each of these cases, we shall employ the family of powers and roots, replacing a variable y (used here

generically, because later we shall want to transform xs as well) by y' = y^p. Typically, p = -2, -1, -1/2, 1/2, 2, or

3, although sometimes other powers and roots are considered. Note that p = 1 represents no transformation.

In place of the 0th power, which would be useless because y^0 = 1 regardless of the value of y, we take y' =

log y, usually using base 2 or 10 for the log function. Because logs to different bases differ only by a constant

factor, we can select the base for convenience of interpretation. Using the log transformation as a “zeroth

power” is reasonable, because the closer p gets to zero, the more y^p looks like the log function (formally,

lim p→0 [(y^p - 1)/p] = log_e y, where the log to the base e ≈ 2.718 is the so-called “natural” logarithm). Finally, for

negative powers, we take y' = -y^p, preserving the order of the y values, which would otherwise be reversed.

As we move away from p = 1 in either direction, the transformations get stronger, as illustrated in Figure 5.4.
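
The family of powers and roots, with the log standing in for the zeroth power and the sign reversed for negative powers, can be sketched as below. The function names are hypothetical, and base 10 is chosen for the log only as one of the convenient bases mentioned above.

```python
import math


def power_transform(y, p):
    """Ladder-of-powers transformation of a positive value y."""
    if y <= 0:
        raise ValueError("y must be positive")
    if p == 0:
        return math.log10(y)       # log replaces the useless 0th power
    if p < 0:
        return -(y ** p)           # negate to preserve the order of y
    return y ** p


def scaled_power(y, p):
    """(y^p - 1)/p, which approaches the natural log as p approaches 0."""
    return math.log(y) if p == 0 else (y ** p - 1) / p


near_zero = scaled_power(5.0, 1e-6)    # close to math.log(5.0)
```

For example, with p = -1 the values 2 and 4 become -0.5 and -0.25, so the larger original value remains the larger transformed value.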

The effect of some of these transformations is shown in Table 5.1a. Transformations “up the ladder” of powers

and roots (a term borrowed from Tukey, 1977), that is, toward y^2, serve differentially to spread out large

values of y relative to small ones; transformations “down the ladder”—toward log y—have the opposite effect.

To correct a positive skew (as in Table 5.1b), it is therefore necessary to move down the ladder; to correct a

negative skew (Table 5.1c), which is less common in applications, move up the ladder.
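
A small numerical illustration of moving down the ladder follows. The data are hypothetical (they are not the values of Table 5.1), and the skewness measure used, the average cubed standardized deviation, is one common choice among several.

```python
from math import log10
from statistics import fmean, pstdev


def skewness(values):
    """Average cubed standardized deviation (moment-based skewness)."""
    center = fmean(values)
    spread = pstdev(values)
    return fmean([((v - center) / spread) ** 3 for v in values])


# Hypothetical positively skewed data, for illustration only.
raw = [1, 10, 100, 1000, 10000]
logged = [log10(v) for v in raw]    # down the ladder: p = 0

skew_raw = skewness(raw)
skew_logged = skewness(logged)
```

Here the raw values have a strongly positive skewness, whereas their logs are evenly spaced and therefore symmetric.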

I have implicitly assumed that all data values are positive, a condition that must hold for power transformations

to maintain order. In practice, negative values can be eliminated prior to transformation by adding a small

constant, sometimes called a “start,” to the data. Likewise, for power transformations to be effective, the ratio

of the largest to the smallest data value must be sufficiently large; otherwise the transformation will be too

nearly linear. A small ratio can be dealt with by using a negative start.
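
The effect of a start on the ratio of the largest to the smallest value can be seen in a brief sketch. The data and function names are hypothetical.

```python
def add_start(values, start):
    """Add a constant 'start' to every value; the result must stay positive."""
    shifted = [v + start for v in values]
    if min(shifted) <= 0:
        raise ValueError("start leaves nonpositive values")
    return shifted


def max_min_ratio(values):
    """Ratio of the largest to the smallest data value."""
    return max(values) / min(values)


# Hypothetical data with a small ratio, for illustration only.
years = list(range(2001, 2011))
ratio_before = max_min_ratio(years)                    # about 1.004
ratio_after = max_min_ratio(add_start(years, -2000))   # 10.0
```

With a ratio near 1, any power transformation of the original values is nearly a straight line; after the negative start the same transformations have room to act.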

In the specific context of regression analysis, a skewed error distribution, revealed by examining the

distribution of the residuals, can often be corrected by transforming the dependent variable. Although more

sophisticated approaches are available (see, e.g., Chapter 9), a good transformation can be located by trial

and error.

Dependent variables that are bounded below, and hence that tend to be positively skewed, often respond

well to transformations down the ladder of powers. Power transformations usually do not work well, however,

when many values stack up against the boundary, a situation termed truncation or censoring (see, e.g., Tobin

[1958] for a treatment of “limited” dependent variables in regression). As well, data that are bounded both

above and below, such as proportions and percentages, generally require another approach. For example,

the logit or “log odds” transformation, given by y' = log[y/(1 - y)], often works well for proportions.
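
A minimal sketch of the logit for a proportion strictly between 0 and 1 is given below. The function name is hypothetical, and the natural log is used as an assumption; another base would rescale the result by a constant factor.

```python
import math


def logit(proportion):
    """Log odds of a proportion strictly between 0 and 1."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return math.log(proportion / (1.0 - proportion))


center = logit(0.5)                  # 0.0: even odds
upper, lower = logit(0.9), logit(0.1)
```

The transformation is symmetric about a proportion of 0.5 and stretches values near the two boundaries, which removes the bounds at 0 and 1.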

TABLE 5.1 Correcting Skews by Power Transformations

Transforming variables in a regression analysis raises issues of interpretation. I address these issues briefly

at the end of Chapter 7.

http://dx.doi.org/10.4135/9781412985604.n5
