Multiple Regression


Outlying and Influential Data

In: Regression Diagnostics

By: John Fox

Pub. Date: 2011

Access Date: October 13, 2019

Publishing Company: SAGE Publications, Inc.

City: Thousand Oaks

Print ISBN: 9780803939714

Online ISBN: 9781412985604

DOI: https://dx.doi.org/10.4135/9781412985604

Print pages: 22-40

© 1991 SAGE Publications, Inc. All Rights Reserved.

This PDF has been generated from SAGE Research Methods. Please note that the pagination of the

online version will vary from the pagination of the print book.

Outlying and Influential Data

Unusual data are problematic in a least-squares regression because they can unduly influence the results of

the analysis, and because their presence may be a signal that the regression model fails to capture important

characteristics of the data. Some central distinctions are illustrated in Figure 4.1 for the simple-regression

model y = β0 + β1x + ε.

In simple regression, an outlier is an observation whose dependent-variable value is unusual given the value

of the independent variable. In contrast, a univariate outlier is a value of y or x that is unconditionally unusual;

such a value may or may not be a regression outlier. Regression outliers appear in both part a and part b

of Figure 4.1. In Figure 4.1a, the outlying observation has an x value at the center of the x distribution; as

a consequence, deleting the outlier has no impact on the least-squares slope b1 and little impact on the

intercept b0. In Figure 4.1b, the outlier has an unusual x value, and consequently its deletion markedly affects

both the slope and the intercept. Because of its unusual x value, the last observation in Figure 4.1b has strong

leverage on the regression coefficients, whereas the middle observation in Figure 4.1a is at a low-leverage

point.

The combination of high leverage with an outlier produces substantial influence on the regression coefficients.

In Figure 4.1c, the last observation has no influence on the regression coefficients even though it is a high-

leverage point, because this observation is not out of line with the rest of the data. The following heuristic

formula helps to distinguish among these concepts:

Influence on Coefficients = Leverage × Discrepancy

Figure 4.1. Leverage and influence in simple-regression analysis. (a)

An outlier near the mean of x has little influence on the regression

coefficients. (b) An outlier far from the mean of x markedly affects the

regression coefficients. (c) A high-leverage observation in line with the

rest of the data does not influence the regression coefficients.
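The three configurations in Figure 4.1 can be imitated with a small numerical sketch. The data below are hypothetical (five points lying exactly on a line, plus one added observation per panel) and are not the values plotted in the figure; `fit_line` and `slope_change` are illustrative names.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope for a simple regression."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Hypothetical well-behaved data lying exactly on y = 1 + 2x.
x_base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_base = 1.0 + 2.0 * x_base

# One extra observation (x, y) for each panel of the figure.
extra_points = {
    "a": (3.0, 17.0),   # discrepant y, x at the mean: low leverage
    "b": (15.0, 11.0),  # discrepant y, x far from the mean: high leverage
    "c": (15.0, 31.0),  # in line with the rest, x far from the mean
}

slope_change = {}
for panel, (x_new, y_new) in extra_points.items():
    x_all = np.append(x_base, x_new)
    y_all = np.append(y_base, y_new)
    _, b1_all = fit_line(x_all, y_all)
    _, b1_deleted = fit_line(x_base, y_base)  # fit with the extra point deleted
    slope_change[panel] = b1_all - b1_deleted
    print(f"panel {panel}: slope {b1_all:.3f} with the point, "
          f"{b1_deleted:.3f} without it")
```

Only panel b, which combines high leverage with discrepancy, should show a marked change in the slope.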

SAGE

1991 SAGE Publications, Ltd. All Rights Reserved.

SAGE Research Methods

Page 2 of 18 Outlying and Influential Data

A simple and transparent example, with real data from Davis (1990), appears in Figure 4.2. These data record

the measured and reported weight (in kilograms) of 183 male and female subjects who engage in programs of

regular physical exercise. As part of a larger study, the investigator was interested in ascertaining whether the

subjects reported their weights accurately, and whether men and women reported similarly. (The published

study is based on the data for the female subjects only and includes additional data for non-exercising

women.) Davis (1990) gives the correlation between measured and reported weight.

Figure 4.2. Regression of reported weight in kilograms on measured

weight and gender for 183 subjects engaged in regular exercise. The

solid line shows the least-squares regression for women, the broken

line the regression for men.

SOURCE: Data taken from C. Davis, personal communication.

A least-squares regression of reported weight (RW) on measured weight (MW), a dummy variable for sex

(F: coded one for women, zero for men), and an interaction regressor produces the following results (with

coefficient standard errors in parentheses):

Were these results to be taken seriously, we would conclude that men are on average accurate reporters

of their weights (because b0 ≈ 0 and b1 ≈ 1), whereas women tend to overreport their weights if they

are relatively light and underreport if they are relatively heavy. However, Figure 4.2 makes clear that the

differential results for women and men are due to one female subject whose reported weight is about average

(for women), but whose measured weight is extremely large.

In fact, this subject's measured weight and height (in centimeters) were switched erroneously on data entry,

as Davis discovered after calculating an anomalously low correlation between reported and measured weight

among women. Correcting the data produces the regression

which suggests that both women and men are accurate reporters of weight.

There is another way to analyze the Davis weight data: One of the investigator's interests was to determine

whether subjects reported their weights accurately enough to permit the substitution of reported weight for

measured weight, which would decrease the cost of collecting data on weight. It is natural to think of reported

weight as influenced by “real” weight, as in the regression presented above in which reported weight is the

dependent variable. The question of substitution, however, is answered by the regression of measured weight

on reported weight, giving the following results for the uncorrected data:

Note that here the outlier does not have much impact on the regression coefficients, precisely because the

value of RW for this observation is near the mean of RW for women. However, there is a marked effect on the multiple

correlation and standard error: For the corrected data, R² = 0.97, s = 2.25.

Measuring Leverage: Hat-Values

The so-called hat-value hi is a common measure of leverage in regression. These values are so named

because it is possible to express the fitted values ŷj in terms of the observed values yi:

$\hat{y}_j = h_{1j} y_1 + h_{2j} y_2 + \cdots + h_{nj} y_n = \sum_{i=1}^{n} h_{ij} y_i$

Thus the weight hij captures the extent to which yi can affect ŷj: If hij is large, then the ith observation can

have a substantial impact on the jth fitted value. It may be shown that $h_{ii} = \sum_{j=1}^{n} h_{ij}^2$, and so the hat-value hi = hii

summarizes the potential influence (the leverage) of yi on all of the fitted values. The hat-values are bounded

between 1/n and 1 (i.e., 1/n ≤ hi ≤ 1), and the average hat-value is h̄ = (k + 1)/n (see Appendix A4.1).

In simple-regression analysis, the hat-values measure distance from the mean of x:

$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$

In multiple regression, hi measures distance from the centroid (point of means) of the xs, taking into account

the correlational structure of the xs, as illustrated for k = 2 in Figure 4.3. Multivariate outliers in the x space

are thus high-leverage observations.

For Davis's regression of reported weight on measured weight, the largest hat-value by far belongs to the

12th subject, whose measured weight was erroneously recorded as 166 kg: h12 = 0.714. This quantity is

many times the average hat-value, h̄ = (3 + 1)/183 = 0.0219.
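The stated properties of the hat-values are easy to check numerically. The following is a minimal sketch on simulated data (not the Davis data), assuming the usual definition of the hat matrix as H = X(X'X)⁻¹X'; the function name `hat_values` is illustrative.

```python
import numpy as np

def hat_values(X):
    """Return the hat-values (diagonal of H) and H = X (X'X)^{-1} X'.

    X is the model matrix, with the constant regressor in its first column.
    """
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H), H

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
x[0] = 6.0                                # one x value far from the rest
X = np.column_stack([np.ones(n), x])      # simple regression: k = 1
k = X.shape[1] - 1

h, H = hat_values(X)

# Simple-regression form: 1/n plus scaled squared distance from the mean of x.
h_formula = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("sum of hat-values:", h.sum(), " k + 1 =", k + 1)
print("average hat-value:", h.mean(), " largest:", h.max())
```

The hat-values should sum to k + 1, lie between 1/n and 1, and agree with the simple-regression formula; the observation with the extreme x value should have by far the largest leverage.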

Detecting Outliers: Studentized Residuals

To identify an outlying observation, we need an index of the unusualness of y given the xs. Generally,

discrepant observations have large residuals, but it turns out that even if the errors εi have equal variances

(as assumed in the regression model), the residuals ei do not: V(ei) = σ²(1 − hi) (see Appendix A4.2).

High-leverage observations, therefore, tend to have small residuals—a sensible result, because these

observations can force the regression surface to be close to them.

Although we can form a standardized residual by calculating $e'_i = e_i / (s\sqrt{1 - h_i})$, this measure suffers from the

defect that the numerator and denominator are not independent, preventing e'i from following a t distribution:

When |ei| is large, $s = \sqrt{\sum e_i^2 / (n - k - 1)}$, which contains ei², tends to be large as well. Suppose, however, that

we refit the regression model deleting the ith observation, obtaining an estimate s(-i) of σ based on the rest of

the data. Then the studentized residual

$t_i = \dfrac{e_i}{s_{(-i)}\sqrt{1 - h_i}}$ (4.1)

has independent numerator and denominator, and follows a t distribution with n − k − 2 degrees of freedom.

Figure 4.3. Contours of constant leverage (constant hi) for k = 2 independent variables. Two high-leverage points appear: One (shown as a large hollow dot) has unusually large values for each of x1 and x2, but the other (large filled dot) is unusual only in its combination of x1 and x2 values.

An alternative, but equivalent, procedure for finding the studentized residuals employs the “mean-shift” outlier

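model

$y_j = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} + \gamma d_j + \varepsilon_j$ (4.2)

where d is a dummy variable set to one for observation i and zero for all other observations. Thus

$E(y_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \gamma$, and $E(y_j) = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj}$ for $j \neq i$.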

It would be natural to specify Equation 4.2 if before examining the data we suspected that observation i

differed from the others. Then, to test H0: γ = 0, we would find $t_i = \hat{\gamma}/\mathrm{SE}(\hat{\gamma})$, which is distributed as $t_{n-k-2}$

under H0, and which (it turns out) is the studentized residual of Equation 4.1.
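The equivalence of the two routes to the studentized residual can be checked on simulated data. This sketch assumes the deletion form t_i = e_i / (s_(−i) √(1 − h_i)) and a mean-shift model that adds a dummy regressor for observation i; the function names and the data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, residuals, and the residual standard error s."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s = np.sqrt(e @ e / (X.shape[0] - X.shape[1]))
    return b, e, s

def studentized_residuals(X, y):
    """Compute each t_i by refitting the model without observation i."""
    n = X.shape[0]
    _, e, _ = ols(X, y)
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        _, _, s_deleted = ols(X[keep], y[keep])
        t[i] = e[i] / (s_deleted * np.sqrt(1.0 - h[i]))
    return t

def mean_shift_t(X, y, i):
    """t statistic for the dummy-variable coefficient in the mean-shift model."""
    d = np.zeros(X.shape[0])
    d[i] = 1.0
    Xd = np.column_stack([X, d])
    b, _, s = ols(Xd, y)
    cov = s ** 2 * np.linalg.inv(Xd.T @ Xd)
    return b[-1] / np.sqrt(cov[-1, -1])

rng = np.random.default_rng(42)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[4] += 6.0                                # plant one discrepant observation

t_deletion = studentized_residuals(X, y)
t_dummy = np.array([mean_shift_t(X, y, i) for i in range(n)])
print(np.max(np.abs(t_deletion - t_dummy)))   # should be near zero
```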

Here, as elsewhere in statistics, terminology is not wholly standard: ti is sometimes called a deleted

studentized residual, an externally studentized residual, or even a standardized residual. Because the last

term also is often applied to e'i, it is important to determine exactly what is being calculated by a computer

program before using these quantities. In large samples, though, usually ti ≈ e'i ≈ ei/s.

Testing for Outliers in Regression. Because in most applications we do not suspect a particular observation

in advance, we can in effect refit the mean-shift model n times, once for each observation, producing t1, t2,

…, tn. In practice, alternative formulas to Equations 4.1 and 4.2 provide the ti with little computational effort.

Usually, our interest then will focus on the largest absolute ti, called t*. Because we have picked the biggest

of n test statistics, however, it is no longer legitimate simply to use tn−k−2 to find the statistical significance

of t*: For example, even if our model is wholly adequate, and disregarding for the moment the dependence

among the tis, we would expect to observe about 5% of tis beyond t0.025 ≈ ±2, about 1% beyond t0.005 ≈ ±2.6,

and so forth.

One solution to the problem of simultaneous inference is to perform a Bonferroni adjustment to the p value

for the largest ti. (Another way to take into account the number of studentized residuals, by constructing a

quantile-comparison plot, is discussed in Chapter 5.) The Bonferroni test requires either a special t table or,

more conveniently, a computer program that returns accurate p values for t far into the tail of the distribution.

In the latter event, suppose that p' = Pr(tn−k−2 > |t*|). Then the p value for testing the statistical significance

of t* is p = 2np'. The factor 2 reflects the two-tail character of the test: We want to detect large negative as well

as large positive outliers. The factor n adjusts for conducting n simultaneous tests, which is implicit in selecting

the largest of n test statistics. Beckman and Cook (1983) have shown that the Bonferroni adjustment usually

is exact for testing the largest studentized residual. Note that a much larger t* is required for a statistically

significant result than would be the case for an ordinary individual t test.
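The Bonferroni calculation p = 2np' can be sketched with the standard library alone. The tail probability below is obtained by Simpson's rule on the t density, which is an illustrative choice and not the routine used in the text; a statistical library function would normally be preferred. Because the tail is computed as 0.5 minus an integral, the sketch loses accuracy for extremely large t, where it can say only that the probability is very small. The values t_max = 3.5, n = 50, and k = 2 are hypothetical.

```python
import math

def t_upper_tail(t, df, steps=20000):
    """Approximate Pr(T_df > t) for t >= 0 by Simpson's rule on [0, t]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    width = t / steps                      # steps must be even
    total = pdf(0.0) + pdf(t)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * pdf(j * width)
    return max(0.0, 0.5 - total * width / 3)

def bonferroni_outlier_p(t_max, n, k):
    """Two-sided Bonferroni p value, p = 2 n p', capped at 1."""
    p_prime = t_upper_tail(abs(t_max), n - k - 2)
    return min(1.0, 2 * n * p_prime)

# Hypothetical example: largest |t_i| of 3.5 among n = 50 cases, k = 2.
p = bonferroni_outlier_p(t_max=3.5, n=50, k=2)
print(f"unadjusted two-sided p = {2 * t_upper_tail(3.5, 46):.5f}")
print(f"Bonferroni p           = {p:.5f}")
```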

In Davis's regression of reported weight on measured weight, the largest studentized residual by far belongs

to the 12th observation: t12 = −24.3. Here, n − k − 2 = 183 − 3 − 2 = 178, and Pr(t178 > 24.3) << 10⁻⁸. (The

symbol “<<” means “much less than.” The computer program that I employed to find the tail probability was

unable to calculate a more accurate result for such a large t.) The Bonferroni p value for the outlier test is p

<< 183 × 2 × 10⁻⁸ ≈ 4 × 10⁻⁶ (i.e., 0.000004), an unambiguous result.

An Analogy to Insurance. Thus far, I have treated the identification (and, implicitly, the potential correction,

removal, or accommodation) of outliers as a hypothesis-testing problem. Although this is by far the most

common approach in practice, a more reasonable general perspective weighs the costs and benefits for

estimation of rejecting a potentially outlying observation.

Suppose, for the moment, that the observation with the largest ti is simply an unusual data point, but one

generated by the assumed statistical model, that is, yi = β0 + β1x1i + … + βkxki + εi, with εi ∼ NID(0, σ²).

To discard an observation under these circumstances would decrease the efficiency of estimation, because

when the model—including the assumption of normality—is correct, the least-squares estimator is maximally

efficient among all unbiased estimators of the βs. If, however, the datapoint in question does not belong with

the rest (say, e.g., the mean-shift model applies), then to eliminate it may make estimation more efficient.

Anscombe (1960) expressed this insight by drawing an analogy to insurance: To obtain protection against

“bad” data, one purchases a policy of outlier rejection (or uses an estimator that is resistant to outliers—a

so-called robust estimator), a policy paid for by a small premium in efficiency when the policy rejects “good”

data.

Let P denote the desired premium, say 0.05—a 5% increase in estimator mean-squared error if the model

holds for all of the data. Let z represent the unit-normal deviate corresponding to a tail probability of

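P(n − k − 1)/n. Following the procedure derived by Anscombe and Tukey (1963), compute m = 1.4 + 0.85z, and

then find

$f = m\left[1 - \dfrac{m^2 - 2}{4(n - k - 1)}\right]\sqrt{\dfrac{n - k - 1}{n}}$ (4.3)

and

$t' = f\sqrt{\dfrac{n - k - 2}{n - k - 1 - f^2}}$ (4.4)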

Finally, reject the observation with the largest studentized residual if |t*| > t'. In a real application, of course,

we should inquire about discrepant observations (see the discussion at the end of this section).

For example, for Davis's first regression n = 183 and k = 3; so for a premium of P = 0.05, we have

P(n - k - 1)/n = 0.05 (183 - 3 - 1)/183 = 0.0489

From the unit-normal table, z = 1.66, from which m = 1.4 + 0.85 × 1.66 = 2.81. Then, using Equation 4.3, f

= 2.76, and using Equation 4.4, t' = 2.81. Because |t*| = 24.3 is much larger than t', the 12th observation is

identified as an outlier.
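The arithmetic of this example can be retraced in a few lines. The sketch assumes that Equations 4.3 and 4.4 take the form f = m[1 − (m² − 2)/(4(n − k − 1))]√((n − k − 1)/n) and t' = f√((n − k − 2)/(n − k − 1 − f²)); that form is an assumption here, adopted because it reproduces the values quoted above to rounding. The function name is illustrative.

```python
import math
from statistics import NormalDist

def anscombe_tukey_cutoff(n, k, premium):
    """Return (tail, z, m, f, t_cut) for the insurance-premium rejection rule.

    The expressions for f and t_cut are assumed forms of Equations 4.3
    and 4.4 (see the accompanying text), not a quotation of them.
    """
    tail = premium * (n - k - 1) / n
    z = NormalDist().inv_cdf(1.0 - tail)   # unit-normal deviate for this tail
    m = 1.4 + 0.85 * z
    nu = n - k - 1
    f = m * (1.0 - (m * m - 2.0) / (4.0 * nu)) * math.sqrt(nu / n)
    t_cut = f * math.sqrt((nu - 1) / (nu - f * f))
    return tail, z, m, f, t_cut

tail, z, m, f, t_cut = anscombe_tukey_cutoff(n=183, k=3, premium=0.05)
print(f"tail = {tail:.4f}, z = {z:.2f}, m = {m:.2f}, f = {f:.2f}, t' = {t_cut:.2f}")
print("reject largest studentized residual:", abs(-24.3) > t_cut)
```

Small differences in the second decimal place of f can arise because the text carries rounded intermediate values (z = 1.66, m = 2.81) forward.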

Measuring Influence: Cook's Distance and Other Diagnostics

As noted previously, influence on the regression coefficients combines leverage and discrepancy. The most

direct measure of influence simply examines the impact on each coefficient of deleting each observation in

turn:

dij = bj - bj(-i), for i = 1, …, n; j = 0, …, k

where bj(-i) denotes the least-squares estimate of βj produced when the ith observation is omitted. To assist

in interpretation, it is useful to scale the dij by (deleted) estimates of the coefficient standard errors:

$d^*_{ij} = \dfrac{d_{ij}}{\mathrm{SE}_{(-i)}(b_j)}$

Following Belsley et al. (1980), the dij are often termed DFBETAij, and the d*ij are called DFBETASij.
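A direct, if computationally wasteful, way to obtain these quantities is to refit the regression n times. In the sketch below the scaling is an assumption: each deleted standard error is taken as s_(−i) times the square root of the jth diagonal entry of (X'X)⁻¹ from the full data, one common convention for DFBETAS. The data are simulated, with one discrepant high-leverage observation planted in the last position.

```python
import numpy as np

def dfbeta_by_deletion(X, y):
    """Return (DFBETA, DFBETAS) arrays of shape (n, k + 1) by refitting."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    dfbeta = np.empty((n, p))
    dfbetas = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        b_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        e_del = y[keep] - X[keep] @ b_del
        s_del = np.sqrt(e_del @ e_del / (n - 1 - p))
        dfbeta[i] = b - b_del                       # d_ij = b_j - b_j(-i)
        # Assumed scaling: deleted s with the full-data (X'X)^{-1} diagonal.
        dfbetas[i] = dfbeta[i] / (s_del * np.sqrt(np.diag(XtX_inv)))
    return dfbeta, dfbetas

rng = np.random.default_rng(1)
n = 25
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
x[-1], y[-1] = 5.0, -4.0            # high leverage and far from the line
X = np.column_stack([np.ones(n), x])

dfbeta, dfbetas = dfbeta_by_deletion(X, y)
worst = int(np.argmax(np.abs(dfbetas[:, 1])))
print("largest |DFBETAS| for the slope at observation index", worst)
```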

One problem associated with using the dij or d*ij is their large number: n(k + 1) of each. Of course, these

values can be more quickly examined graphically than in numerical tables. For example, we can construct an

“index plot” of the d*ij for each coefficient j = 0, 1, …, k, that is, a simple scatterplot with d*ij on the vertical axis

versus the observation index i on the horizontal axis. Nevertheless, it is useful to have a summary index of

the influence of each observation on the fit.

Cook (1977) has proposed measuring the “distance” between the bj and the corresponding bj(-i) by

calculating the F statistic for the “hypothesis” that βj = bj(−i), j = 0, 1, …, k. This statistic is recalculated for each

observation i = 1, …, n. The resulting values should not literally be interpreted as F tests—Cook's approach

merely exploits an analogy to testing to produce a measure of distance independent of the scales of the x

variables. Cook's statistic may be written (and simply calculated) as

$D_i = \dfrac{e_i'^2}{k + 1} \times \dfrac{h_i}{1 - h_i}$

In effect, the first term is a measure of discrepancy, and the second a measure of leverage (see Appendix

A4.3). We look for values of Di that are substantially larger than the rest.

Belsley et al. (1980) have suggested the very similar measure

$\mathrm{DFFITS}_i = t_i \sqrt{\dfrac{h_i}{1 - h_i}}$

Note that except for unusual data configurations Di ≈ DFFITSi²/(k + 1). Other global measures of influence

are available (see Chatterjee and Hadi, 1988, Ch. 4, for a comparative treatment).
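Both summary measures can be computed from the residuals and hat-values without refitting. The sketch below assumes the standard forms D_i = [e'_i²/(k + 1)] × [h_i/(1 − h_i)] and DFFITS_i = t_i √(h_i/(1 − h_i)), together with the usual deletion identity for s_(−i); the data are simulated, with one influential observation planted in the first position.

```python
import numpy as np

def influence_summaries(X, y):
    """Return hat-values, studentized residuals, Cook's D, and DFFITS."""
    n, p = X.shape                              # p = k + 1 coefficients
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # diagonal of the hat matrix
    e = y - X @ (XtX_inv @ X.T @ y)
    s2 = e @ e / (n - p)
    # Deleted error variance from the full fit (no refitting needed).
    s2_del = (e @ e - e ** 2 / (1 - h)) / (n - p - 1)
    e_std = e / np.sqrt(s2 * (1 - h))           # standardized residuals e'_i
    t = e / np.sqrt(s2_del * (1 - h))           # studentized residuals t_i
    cooks_d = e_std ** 2 / p * h / (1 - h)      # discrepancy x leverage
    dffits = t * np.sqrt(h / (1 - h))
    return h, t, cooks_d, dffits

rng = np.random.default_rng(7)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
X[0, 1] = 4.0        # unusual x value ...
y[0] -= 8.0          # ... combined with a discrepant y value

h, t, cooks_d, dffits = influence_summaries(X, y)
approx_d = dffits ** 2 / (k + 1)
print("largest Cook's D at index", int(np.argmax(cooks_d)))
print("D and DFFITS^2/(k+1) for that case:", cooks_d.max(), approx_d.max())
```

The two printed values differ for the planted case because D uses the full-data s and DFFITS the deleted s_(−i); for unremarkable observations they should be close.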

Because all of the deletion statistics depend on the hat-values and residuals, a graphical alternative to either

of the general influence measures is to plot the hi against the ti and to look for observations for which both are

big. A slightly more sophisticated version of this plot displays circles of area proportional to Cook's D instead

of points (see Figure 4.6, page 38). We can follow up by examining the dij or d*ij for the observations with the

largest few Di, |DFFITSi|, or combination of large hi and |ti|.

For Davis's regression of reported weight on measured weight, all of the indices of influence point to the

obviously discrepant 12th observation:

Cook's D12 = 85.9 (next largest, D21 = 0.065)

DFFITS12 = -38.4 (next largest, DFFITS50 = 0.512)

DFBETAS0, 12 = DFBETAS1, 12 = 0, DFBETAS2, 12 = 20.0, DFBETAS3, 12 = -24.8

Note that observation 12, which is for a female subject, has no impact on the male intercept b0 and slope b1.

Influence on Standard Errors. In developing the concept of influence in regression, I have focused on changes

in regression coefficients. Other regression outputs may be examined as well, however. One important output

is the set of coefficient variances and covariances, which capture the precision of estimation. For example,

recall Figure 4.1c, where a high-leverage observation exerts no influence on the regression coefficients,

because it is in line with the rest of the data. The estimated standard error of the least-squares slope in

simple regression is $\mathrm{SE}(b_1) = s/\sqrt{\sum (x_i - \bar{x})^2}$, and, therefore, by increasing the variance of x the high-leverage

observation serves to decrease SE(b1), even though it does not influence b0 and b1. Depending on context,

such an observation may be considered beneficial—increasing the precision of estimation—or it may cause

us to exaggerate our confidence in the estimate b1.
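A small numerical sketch illustrates the point. The data are hypothetical: ten points scattered about a line, and then an eleventh observation with a distant x value placed exactly on the fitted line, so that it is in line with the rest of the data by construction.

```python
import numpy as np

def slope_and_se(x, y):
    """Return b0, b1, and SE(b1) = s / sqrt(sum of squares of x about its mean)."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(resid @ resid / (n - 2))
    return b0, b1, s / np.sqrt(sxx)

# Hypothetical data: a line plus fixed small disturbances.
x = np.arange(1.0, 11.0)
noise = np.array([0.4, -0.3, 0.1, -0.5, 0.6, -0.2, 0.3, -0.4, 0.2, -0.2])
y = 3.0 + 0.5 * x + noise
b0, b1, se_without = slope_and_se(x, y)

# Add a high-leverage observation lying exactly on the fitted line.
x_far = 30.0
x_plus = np.append(x, x_far)
y_plus = np.append(y, b0 + b1 * x_far)
b0_plus, b1_plus, se_with = slope_and_se(x_plus, y_plus)

print(f"slope without: {b1:.4f}, with: {b1_plus:.4f}")
print(f"SE(b1) without: {se_without:.4f}, with: {se_with:.4f}")
```

The coefficients should be unchanged while SE(b1) shrinks, because the added point increases the spread of x without adding to the residual sum of squares.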

In multiple regression, we can examine the impact of deleting each observation in turn on the size of the joint-

confidence region for the βs. Recall from Chapter 2 that the size of this region is analogous to the length of

a confidence interval for an individual coefficient, which in turn is proportional to coefficient standard error.

The squared length of a confidence interval is therefore proportional to coefficient sampling variance, and,

analogously, the squared size of a joint confidence region is proportional to the “generalized” variance of a

set of coefficients. An influence measure proposed by Belsley et al. (1980) closely approximates the squared

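ratio of volumes of the deleted and full-data confidence regions:

$\mathrm{COVRATIO}_i = \dfrac{1}{(1 - h_i)\left(\dfrac{n - k - 2 + t_i^2}{n - k - 1}\right)^{k+1}}$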

Alternative, similar measures have been suggested by several authors (again, Chatterjee and Hadi, 1988,

Ch. 4, provide a comparative discussion). Look for values of COVRATIOi that differ substantially from 1.

As for measures of influence on the regression coefficients, both the hat-value and the (studentized)

residual figure in COVRATIO. A large hat-value produces a large COVRATIO, however, even when (actually,

especially when) t is small, because a high-leverage, in-line observation improves the precision of estimation.

In contrast, a discrepant, low-leverage observation might not change the coefficients much, but it decreases

the precision of estimation by increasing the estimated error variance; such an observation, with small h and

large t, produces a COVRATIO substantially below 1.

For example, for Davis's first regression by far the most extreme value is COVRATIO12 = 0.0103. In this case,

a very large h12 = 0.714 is more than offset by a massive t12 = -24.3.
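The value quoted for observation 12 can be retraced from h12 and t12 alone. The sketch assumes the Belsley et al. form COVRATIO_i = 1 / {(1 − h_i)[(n − k − 2 + t_i²)/(n − k − 1)]^(k+1)}, and it also checks that form on simulated data against a direct ratio of determinants of the deleted and full-data coefficient covariance matrices. Function names and the simulated data are illustrative.

```python
import numpy as np

def covratio_formula(h, t, n, k):
    """COVRATIO from hat-values and studentized residuals (assumed form)."""
    return 1.0 / ((1.0 - h) * ((n - k - 2 + t ** 2) / (n - k - 1)) ** (k + 1))

def covratio_by_deletion(X, y, i):
    """Ratio of determinants of estimated coefficient covariance matrices."""
    n, p = X.shape

    def cov_det(Xs, ys):
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        r = ys - Xs @ b
        s2 = r @ r / (Xs.shape[0] - p)
        return np.linalg.det(s2 * np.linalg.inv(Xs.T @ Xs))

    keep = np.arange(n) != i
    return cov_det(X[keep], y[keep]) / cov_det(X, y)

# Values reported in the text for observation 12 of the Davis regression.
davis_12 = covratio_formula(h=0.714, t=-24.3, n=183, k=3)
print(f"COVRATIO for h = 0.714, t = -24.3: {davis_12:.4f}")

# Check of the formula against brute-force deletion on simulated data.
rng = np.random.default_rng(5)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
s2_del = (e @ e - e ** 2 / (1 - h)) / (n - k - 2)
t = e / np.sqrt(s2_del * (1 - h))
by_formula = covratio_formula(h, t, n, k)
by_deletion = np.array([covratio_by_deletion(X, y, i) for i in range(n)])
print(np.max(np.abs(by_formula - by_deletion)))   # should be near zero
```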

Influence on Collinearity. Other characteristics of a regression analysis also may be influenced by individual

observations, including the degree of collinearity. Although a formal consideration of influence on collinearity

is above the level of this presentation (see Chatterjee and Hadi, 1988, Ch.4–5), the following remarks may

prove helpful:

1.

Influence on collinearity is one of the factors reflected in influence on coefficient standard errors.

Influence on the error variance and influence on the variation of the xs also are implicitly

factored into a measure such as COVRATIO, however. As well, COVRATIO and similar measures

examine the sampling variances and covariances of all of the regression coefficients, including

the constant. Nevertheless, our concern for collinearity reflects its impact on the precision of

estimation, and the global precision of estimation is assessed by COVRATIO.

2.

Collinearity-influential points are those that either induce or substantially weaken correlations

among the xs. Such points usually—but not always—have large hat-values. Conversely, points

with large hat-values often influence collinearity.

3.

Individual points that induce collinearity are obviously problematic. Points that substantially

weaken collinearity also merit examination, because they may cause us to be overly confident in

our results.

4.

It is frequently possible to detect collinearity-influential observations by plotting independent

variables against each other. This approach will fail, however, if the collinear relations in question

involve more than two independent variables at a time.

Numerical Cutoffs for Diagnostic Statistics

I have deliberately refrained from suggesting specific numerical criteria for identifying noteworthy

observations on the basis of measures of leverage and influence. I believe that generally it is more useful

to examine the distributions of these quantities to locate observations with unusual values. For studentized

residuals, the hypothesis-testing and insurance approaches produce cutoffs of sorts, but even these

numerical criteria are no substitute for graphical examination of the residuals.

Nevertheless, cutoffs can be of some use, as long as they are not given too much weight, and especially

when they serve to enhance graphical displays. A horizontal line may be drawn on an index plot, for example,

to draw attention to values beyond a cutoff. Similarly, such values may be identified individually in a graph (as

in Figure 4.6, page 38).

Cutoffs for a diagnostic statistic may be the product of statistical theory, or they may result from examination

of the sample distribution of the statistic. Cutoffs may be absolute, or they may be adjusted for sample size

(Belsley et al., 1980, Ch. 2). For some diagnostic statistics, such as measures of influence, absolute cutoffs

are unlikely to identify noteworthy observations in large samples. In part, this characteristic reflects the ability

of large samples to absorb discrepant data without changing the results substantially, but it is still often of

interest to identify relatively influential points, even if no observation has strong absolute influence.

The cutoffs presented below are, as explained briefly, based on the application of statistical theory. An

alternative, very simple, and universally applicable data-based criterion is to examine the most extreme 5%

(say) of values for a diagnostic measure.

1.

Hat-values: Belsley et al. (1980) suggest that hat-values exceeding about twice the average (k

+ 1)/n are noteworthy. This size-adjusted cutoff was derived as an approximation identifying the

most extreme 5% of cases when the xs are multivariate-normal and k and n - k - 1 are relatively

large, but it is recommended by these authors as a rough general guide. (See Chatterjee and

Hadi [1988, Ch. 4] for a discussion of alternative cutoffs for hat-values.)

2.

Studentized residuals: Beyond the issues of “statistical significance” and estimator robustness

and efficiency discussed above, it sometimes helps to call attention to residuals that are relatively

large. Recall that under ideal conditions about 5% of studentized residuals are outside the range

|ti| ≤ 2. It is therefore reasonable, for example, to draw lines at ±2 on a display of studentized

residuals to highlight observations outside this range.

3.

Measures of influence: Many cutoffs have been suggested for different measures of influence. A

few are presented here:

a. Standardized change in regression coefficients: The d*ij are scaled by standard

errors, and consequently |d*ij| > 1 or 2 suggests itself as an absolute cutoff. As

explained above, however, this criterion is unlikely to nominate observations in

large samples. Belsley et al. propose the size-adjusted cutoff 2/√n for identifying

noteworthy d*ij values.

b. Cook's D and DFFITS: A variety of numerical cutoffs have been recommended

for Cook's D and DFFITS—exploiting the analogy between D and an F statistic,

for example. Chatterjee and Hadi (1988) suggest comparing |DFFITSi| with the

size-adjusted cutoff $2\sqrt{(k + 1)/(n - k - 1)}$. (Also see Cook [1977], Belsley et al.

[1980], and Velleman and Welsch [1981].) Moreover, because of the

approximate relationship between DFFITS and Cook's D, it is simple to

translate cutoffs between the two measures. For Chatterjee and Hadi's

criterion, for example, we have the translated cutoff Di > 4/(n - k - 1). Absolute

cutoffs, such as Di > 1, risk missing influential data.

c. COVRATIO: Belsley et al. suggest that COVRATIOi is noteworthy when

|COVRATIOi - 1| exceeds the size-adjusted cutoff 3(k + 1)/n.
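The size-adjusted cutoffs listed above can be gathered into one small helper. The expressions for DFBETAS (2/√n) and DFFITS (2√((k + 1)/(n − k − 1))) are assumed here as the commonly cited forms; the function and key names are illustrative. The example call uses n = 45 and k = 2, the dimensions of the Duncan regression.

```python
import math

def size_adjusted_cutoffs(n, k):
    """Rough guides only; they are not substitutes for graphical examination."""
    return {
        "hat_value": 2 * (k + 1) / n,                  # twice the average
        "studentized_residual": 2.0,                   # about 5% beyond +/- 2
        "dfbetas": 2 / math.sqrt(n),                   # assumed form
        "dffits": 2 * math.sqrt((k + 1) / (n - k - 1)),  # assumed form
        "cooks_d": 4 / (n - k - 1),                    # translated from DFFITS
        "covratio_minus_1": 3 * (k + 1) / n,
    }

cutoffs = size_adjusted_cutoffs(n=45, k=2)
for name, value in cutoffs.items():
    print(f"{name:22s} {value:.3f}")
```

The Cook's D entry is the DFFITS cutoff squared and divided by k + 1, which is the translation described in item b.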

Jointly Influential Subsets of Observations: Partial-Regression Plots

As illustrated in Figure 4.4, subsets of observations can be jointly influential or can offset each other's

influence. Often, influential subsets or multiple outliers can be identified by applying single-observation

diagnostics sequentially. It is important, however, to refit the model after deleting each point, because the

presence of a single influential value may dramatically affect the fit at other points. Still, the sequential

approach is not always successful.

Although it is possible to generalize deletion statistics formally to subsets of several points, the very large

number of subsets (there are n!/[p!(n - p)!] subsets of size p) usually renders the approach impractical (but see

Belsley et al. 1980, Ch. 2; Chatterjee and Hadi, 1988, Ch. 5). An attractive alternative is to employ graphical

methods.

A particularly useful influence graphic is the partial-regression plot, also called a partial-regression leverage

plot or an added-variable plot. Let yi(1) represent the residuals from the least-squares regression of y on all

of the xs save x1, that is, the residuals from the fitted model

$y_i = b_0^{(1)} + b_2^{(1)} x_{2i} + \cdots + b_k^{(1)} x_{ki} + y_i^{(1)}$

Figure 4.4. Jointly influential data. In each case, the solid line gives the regression for all of the

data, the light broken line gives the regression with the triangle deleted, and the heavy broken line

gives the regression with both the square and the triangle deleted. (a) Jointly influential observations

located close to one another: Deletion of both observations has a much greater impact than deletion

of only one. (b) Jointly influential observations located on opposite sides of the data. (c) Observations

that offset one another: The regression with both observations deleted is the same as for the whole

dataset.

Likewise, the xi(1) are residuals from the least-squares regression of x1 on the other xs:

$x_{1i} = c_0^{(1)} + c_2^{(1)} x_{2i} + \cdots + c_k^{(1)} x_{ki} + x_i^{(1)}$

The notation emphasizes the interpretation of the residuals y(1) and x(1) as the parts of y and x1 that remain

when the effects of x2, …, xk are removed. It may be shown (see Appendix A4.4) that the slope from the

least-squares regression of y(1) on x(1) is simply the least-squares slope b1 from the full multiple regression,

and that the residuals from this regression are the same as those from the full regression, that is, yi(1) =

b1xi(1) + ei. Note that no constant is required here, because as least-squares residuals, both y(1) and x(1)

have means of zero.

Plotting y(1) against x(1) permits us to examine leverage and influence on b1. Similar partial-regression plots

can be constructed for the other regression coefficients, including b0:

Plot y(j) versus x(j), for j = 0, 1, …, k

In the case of b0, we regress the “constant regressor” x0 = 1 and y on x1 through xk, with no constant in the

regression equations.
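The two properties that make the partial-regression plot useful (its slope equals the multiple-regression coefficient, and its residuals equal the full-regression residuals) can be verified numerically. The sketch uses simulated data with k = 2 correlated regressors; the variable names are illustrative.

```python
import numpy as np

def residuals_from(X, v):
    """Residuals from the least-squares regression of v on the columns of X."""
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Full multiple regression of y on the constant, x1, and x2.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
e_full = y - X_full @ b_full

# Residualize y and x1 on everything except x1 (constant and x2).
X_others = np.column_stack([np.ones(n), x2])
y_1 = residuals_from(X_others, y)
x_1 = residuals_from(X_others, x1)

# Regression through the origin of y(1) on x(1): no constant is needed.
slope = (x_1 @ y_1) / (x_1 @ x_1)
e_partial = y_1 - slope * x_1

print("b1 from full regression:   ", b_full[1])
print("slope from partial residuals:", slope)
```

Plotting `y_1` against `x_1` would then give the partial-regression plot for b1.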

Illustrative partial-regression plots appear in Figure 4.5. The data for this example are drawn from Duncan

(1961), who regressed the rated prestige of 45 occupations (P, assessed as the percentage of raters

scoring the occupations as “good” or “excellent”) on the income and educational levels of the occupations in

1950 (respectively, I, the percent of males earning at least $3,500, and E, the percent of male high-school

graduates). The primary aim of this regression was to produce fitted prestige scores for occupations for which

there were no direct prestige ratings, but for which income and educational data were available. The fitted

regression (with standard errors in parentheses) is

P̂ = −6.06 + 0.599I + 0.546E
    (4.27)  (0.120)  (0.098)

R² = 0.83, s = 13.4

The partial-regression plot for income (Figure 4.5a) reveals three apparently influential observations that

serve to decrease the income slope: ministers (6), whose income is unusually low given the educational level

of the occupation; and railroad conductors (16) and railroad engineers (27), whose incomes are unusually

high given education. Recall that the horizontal variable in the partial-regression plot is the residual from the

regression of income on education, and thus values far from zero in this direction are those for which income

is unusual given education.

Figure 4.5. Partial-regression plots for (a) income and (b) education in the regression of prestige on

the income and education levels of 45 U.S. occupations in 1950. The observation numbers of the

points are plotted. If the plots were drawn to a larger scale, as on a computer screen, then the names

of the occupations could be plotted in place of their numbers. The partial-regression plot for the

constant is not shown.

The partial-regression plot for education (Figure 4.5b) shows that the same three observations have relatively

high leverage on the education coefficient: Observations 6 and 16 tend to increase b2, whereas observation

27 appears to be closer in line with the rest of the data.

Examining the single-observation deletion diagnostics reveals that observation 6 has the largest Cook's D

(D6 = 0.566) and studentized residual (t6 = 3.14). This studentized residual is not especially big, however:

The Bonferroni p value for the outlier test is Pr(t41 > 3.14) × 2 × 45 = 0.14. Figure 4.6 displays a plot of

studentized residuals versus hat-values, with the areas of the plotted circles proportional to values of Cook's

D. Observation indices are shown on the plot for |ti| > 2 or hi > 2(k + 1)/n = 2(2 + 1)/45 = 0.13.
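The two numbers quoted above can be checked with a short calculation. This is a sketch under stated assumptions: the upper-tail probability of the t distribution is obtained by numerically integrating its density with Simpson's rule (a statistical library would normally be used instead), and the function names t_pdf and t_upper_tail are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, upper=60.0, steps=20000):
    """Approximate Pr(T > t) by composite Simpson's rule on [t, upper].

    Assumes the probability beyond `upper` is negligible and `steps` is even.
    """
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

n, k = 45, 2            # observations and regressors in Duncan's regression
df = n - k - 2          # degrees of freedom for a studentized residual: 41
t_max = 3.14            # largest studentized residual (observation 6)

# Two-sided p value, multiplied by n for the Bonferroni adjustment.
bonferroni_p = 2 * n * t_upper_tail(t_max, df)

# Rule-of-thumb cutoff for hat-values: twice the average hat-value.
hat_cutoff = 2 * (k + 1) / n

print("Bonferroni p value:", round(bonferroni_p, 2))
print("hat-value cutoff:", round(hat_cutoff, 2))
```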


Figure 4.6. Plot of studentized residuals against hat-values for the regression of occupational prestige

on income and education. Each point is plotted as a circle with area proportional to Cook's D. The

observation number is shown when hi > 2(k + 1)/n = 0.13 or |ti| > 2.

Deleting observations 6 and 16 produces the fitted regression

P̂ = -6.41 + 0.867I + 0.332E
    (3.65)  (0.122)  (0.099)

R² = 0.88, s = 11.4

which, as expected from the partial-regression plots, has a larger income slope and smaller education slope

than the original regression. The estimated standard errors are likely optimistic, because relative outliers have

been trimmed away. Deleting observation 27 as well further increases the income slope and decreases the

education slope, but the change is not dramatic: bI = 0.931, bE = 0.285.
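The sequence followed in this example (compute deletion diagnostics, identify the most influential observation, refit without it) can be sketched for the simple-regression case. The data below are invented for illustration and are not Duncan's; the function names are hypothetical. The sketch uses the standard closed forms for simple regression: the hat-value 1/n + (xi - x̄)²/Sxx, the studentized residual based on the error variance estimated with observation i deleted, and Cook's D computed from the standardized residual and the hat-value.

```python
import math

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def deletion_diagnostics(x, y):
    """Return a list of (hat_value, studentized_residual, cooks_d) per case."""
    n, p = len(x), 2                      # p = k + 1 coefficients
    a, b = fit_line(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    s2 = rss / (n - p)
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx
        standardized = e / math.sqrt(s2 * (1 - h))
        # Error variance with this observation deleted.
        s2_deleted = (rss - e * e / (1 - h)) / (n - p - 1)
        studentized = e / math.sqrt(s2_deleted * (1 - h))
        cooks_d = standardized ** 2 / p * h / (1 - h)
        out.append((h, studentized, cooks_d))
    return out

# Synthetic data: nine points near y = 2x plus one high-leverage outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.0]

diag = deletion_diagnostics(x, y)
worst = max(range(len(x)), key=lambda i: diag[i][2])   # largest Cook's D

_, slope_before = fit_line(x, y)
x_trim = [v for i, v in enumerate(x) if i != worst]
y_trim = [v for i, v in enumerate(y) if i != worst]
_, slope_after = fit_line(x_trim, y_trim)

print("most influential case (0-based index):", worst)
print("slope with all data:", slope_before)
print("slope after deletion:", slope_after)
```

As in the text, standard errors computed after such trimming would tend to be optimistic, because the deleted point was selected for being discrepant.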

Should Unusual Data Be Discarded?

The discussion in this section has proceeded as if outlying and influential data are simply discarded. Though

problematic data should not be ignored, they also should not be deleted automatically and thoughtlessly:

1. It is important to investigate why data are unusual. Truly bad data (e.g., errors in data entry

as in Davis's regression) can often be corrected or, if correction is not possible, thrown away.

Alternatively, when a discrepant data-point is correct, we may be able to understand why the

observation is unusual. For Duncan's regression, for example, it makes sense that ministers enjoy

prestige not accounted for by the income and educational levels of the occupation. Likewise, I


suspect that the high incomes of railroad workers relative to their educational level and prestige

reflect the power of railroad unions around 1950. In a case like this, we may choose to deal with

outlying observations separately.

2. Alternatively, outliers or influential data may motivate model respecification. For example, the

pattern of outlying data may suggest introduction of additional independent variables. If, in

Duncan's regression, we can identify a factor that produces the unusually high prestige of

ministers (net of their income and education), and we can measure that factor for other

occupations, then this variable could be added to the regression. In some instances,

transformation of the dependent variable or of an independent variable may, by rendering the

error distribution symmetric or eliminating nonlinearity (see Chapters 5 and 7), draw apparent

outliers toward the rest of the data. We must, however, be careful to avoid “overfitting” the

data—permitting a small portion of the data to determine the form of the model. I shall return to

this problem in Chapters 9 and 10.

3. Except in clear-cut cases, we are justifiably reluctant to delete observations or to respecify

to accommodate unusual data. Some researchers reasonably adopt alternative estimation

strategies, such as robust regression, which continuously downweights outlying data rather than

simply including or discarding them. Such methods are termed “robust” because they behave well

even when the errors are not normally distributed (see the discussion of lowess in Appendix A6.1

for an example). As mentioned in passing, the attraction of robust estimation may be understood

using Anscombe's insurance analogy: Robust methods are nearly as efficient as least squares

when the errors are normally distributed, and much more efficient in the presence of outliers.

Because these methods assign zero or very small weight to highly discrepant data, however, the

result is not generally very different from careful application of least squares, and, indeed, robust-

regression weights may be used to identify outliers. Moreover, most robust-regression methods

are vulnerable to high-leverage points (but see the “high-breakdown” estimators described by

Rousseeuw and Leroy, 1987).
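To make the idea of continuous downweighting concrete, here is a minimal sketch of one common robust estimator, a Huber M-estimator fit by iteratively reweighted least squares. It is not taken from the text: the tuning constant 1.345, the scale estimate (median absolute residual divided by 0.6745), the fixed iteration count, the synthetic data, and the function names are all assumptions made for the illustration.

```python
def weighted_line(x, y, w):
    """Return (intercept, slope) of the weighted least-squares line."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line by iteratively reweighted least squares."""
    w = [1.0] * len(x)                    # first pass is ordinary least squares
    for _ in range(iterations):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = median([abs(e) for e in resid]) / 0.6745
        if scale == 0:
            break
        # Full weight for small residuals; weight shrinks as |residual| grows.
        w = [1.0 if abs(e) <= c * scale else c * scale / abs(e) for e in resid]
    return a, b, w

# Synthetic data: y is close to 2x except for one outlying response at x = 8.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 36.0, 18.0, 20.1]

_, ols_slope = weighted_line(x, y, [1.0] * len(x))
_, huber_slope, weights = huber_line(x, y)

print("least-squares slope:", ols_slope)
print("Huber slope:", huber_slope)
print("weight given to the outlier:", weights[7])
```

The final weights flag the discrepant observation, which is the sense in which robust-regression weights may be used to identify outliers. The outlier here sits at a moderate x value; consistent with the caveat above, an estimator of this kind would offer little protection against a discrepant point at an extreme x value.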

http://dx.doi.org/10.4135/9781412985604.n4
