Multiple Regression
Outlying and Influential Data
In: Regression Diagnostics
By: John Fox
Pub. Date: 2011
Access Date: October 13, 2019
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9780803939714
Online ISBN: 9781412985604
DOI: https://dx.doi.org/10.4135/9781412985604
Print pages: 22-40
© 1991 SAGE Publications, Inc. All Rights Reserved.
Outlying and Influential Data
Unusual data are problematic in a least-squares regression because they can unduly influence the results of
the analysis, and because their presence may be a signal that the regression model fails to capture important
characteristics of the data. Some central distinctions are illustrated in Figure 4.1 for the simple-regression
model y = β0 + β1x + ε.
In simple regression, an outlier is an observation whose dependent-variable value is unusual given the value
of the independent variable. In contrast, a univariate outlier is a value of y or x that is unconditionally unusual;
such a value may or may not be a regression outlier. Regression outliers appear in both part a and part b
of Figure 4.1. In Figure 4.1a, the outlying observation has an x value at the center of the x distribution; as
a consequence, deleting the outlier has no impact on the least-squares slope b1 and little impact on the
intercept b0. In Figure 4.1b, the outlier has an unusual x value, and consequently its deletion markedly affects
both the slope and the intercept. Because of its unusual x value, the last observation in Figure 4.1b has strong
leverage on the regression coefficients, whereas the middle observation in Figure 4.1a is at a low-leverage
point.
The combination of high leverage with an outlier produces substantial influence on the regression coefficients.
In Figure 4.1c, the last observation has no influence on the regression coefficients even though it is a high-
leverage point, because this observation is not out of line with the rest of the data. The following heuristic
formula helps to distinguish among these concepts:
Influence on Coefficients = Leverage × Discrepancy
Figure 4.1. Leverage and influence in simple-regression analysis. (a)
An outlier near the mean of x has little influence on the regression
coefficients. (b) An outlier far from the mean of x markedly affects the
regression coefficients. (c) A high-leverage observation in line with the
rest of the data does not influence the regression coefficients.
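The three configurations in Figure 4.1 can be imitated with a small numerical sketch. The data below are hypothetical (eight points lying exactly on y = 1 + 2x, plus one added observation per scenario), and the helper names `fit_line` and `deletion_effect` are illustrative only; the point is simply to see how deleting the added observation changes the intercept and slope in each case.

```python
def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1 of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def deletion_effect(x, y, i):
    """Full-data coefficients minus the coefficients with observation i deleted."""
    b0, b1 = fit_line(x, y)
    b0_del, b1_del = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return b0 - b0_del, b1 - b1_del


# Hypothetical data: eight points lying exactly on y = 1 + 2x (mean of x is 5).
base_x = [1, 2, 3, 4, 6, 7, 8, 9]
base_y = [1 + 2 * v for v in base_x]

# One extra observation is appended in each scenario (compare Figure 4.1a-c).
scenarios = {
    "a": (5, 31),    # discrepant (20 above the line) but at the mean of x
    "b": (20, 21),   # discrepant (20 below the line) and far from the mean of x
    "c": (20, 41),   # far from the mean of x but exactly on the line
}

effects = {}
for label, (x_new, y_new) in scenarios.items():
    x, y = base_x + [x_new], base_y + [y_new]
    effects[label] = deletion_effect(x, y, len(x) - 1)
    d0, d1 = effects[label]
    print(f"({label}) change in b0 = {d0:+.3f}, change in b1 = {d1:+.3f}")
```

In scenario (a) only the intercept moves, in (b) both coefficients move substantially, and in (c) neither moves, which is the pattern the heuristic "leverage × discrepancy" predicts.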
SAGE
1991 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods
Page 2 of 18 Outlying and Influential Data
A simple and transparent example, with real data from Davis (1990), appears in Figure 4.2. These data record
the measured and reported weight (in kilograms) of 183 male and female subjects who engage in programs of
regular physical exercise. As part of a larger study, the investigator was interested in ascertaining whether the
subjects reported their weights accurately, and whether men and women reported similarly. (The published
study is based on the data for the female subjects only and includes additional data for non-exercising
women.) Davis (1990) gives the correlation between measured and reported weight.
Figure 4.2. Regression of reported weight in kilograms on measured
weight and gender for 183 subjects engaged in regular exercise. The
solid line shows the least-squares regression for women, the broken
line the regression for men.
SOURCE: Data taken from C. Davis, personal communication.
A least-squares regression of reported weight (RW) on measured weight (MW), a dummy variable for sex
(F: coded one for women, zero for men), and an interaction regressor produces the following results (with
coefficient standard errors in parentheses):
Were these results to be taken seriously, we would conclude that men are on average accurate reporters
of their weights (because b0 ≈ 0 and b1 ≈ 1), whereas women tend to overreport their weights if they
are relatively light and underreport if they are relatively heavy. However, Figure 4.2 makes clear that the
differential results for women and men are due to one female subject whose reported weight is about average
(for women), but whose measured weight is extremely large.
In fact, this subject's measured weight and height (in centimeters) were switched erroneously on data entry,
as Davis discovered after calculating an anomalously low correlation between reported and measured weight
among women. Correcting the data produces the regression
which suggests that both women and men are accurate reporters of weight.
There is another way to analyze the Davis weight data: One of the investigator's interests was to determine
whether subjects reported their weights accurately enough to permit the substitution of reported weight for
measured weight, which would decrease the cost of collecting data on weight. It is natural to think of reported
weight as influenced by “real” weight, as in the regression presented above in which reported weight is the
dependent variable. The question of substitution, however, is answered by the regression of measured weight
on reported weight, giving the following results for the uncorrected data:
Note that here the outlier does not have much impact on the regression coefficients, precisely because the
value of RW for this observation is near the mean of RW for women. However, there is a marked effect on the
multiple correlation and standard error: For the corrected data, R² = 0.97, s = 2.25.
Measuring Leverage: Hat-Values
The so-called hat-value hi is a common measure of leverage in regression. These values are so named
because it is possible to express the fitted values ŷj in terms of the observed values yi:
$$\hat{y}_j = h_{1j}y_1 + h_{2j}y_2 + \cdots + h_{nj}y_n = \sum_{i=1}^{n} h_{ij}y_i$$
Thus the weight hij captures the extent to which yi can affect ŷj: If hij is large, then the ith observation can
have a substantial impact on the jth fitted value. It may be shown that $h_{ii} = \sum_{j=1}^{n} h_{ij}^2$, and so the hat-value hi = hii
summarizes the potential influence (the leverage) of yi on all of the fitted values. The hat-values are bounded
between 1/n and 1 (i.e., 1/n ≤ hi ≤ 1), and the average hat-value is h̄ = (k + 1)/n (see Appendix A4.1).
In simple-regression analysis, the hat-values measure distance from the mean of x:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$
In multiple regression, hi measures distance from the centroid (point of means) of the xs, taking into account
the correlational structure of the xs, as illustrated for k = 2 in Figure 4.3. Multivariate outliers in the x space
are thus high-leverage observations.
For Davis's regression of reported weight on measured weight, the largest hat-value by far belongs to the
12th subject, whose measured weight was erroneously recorded as 166 kg: h12 = 0.714. This quantity is
many times the average hat-value, h̄ = (3 + 1)/183 = 0.0219.
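As a minimal sketch of these properties, the snippet below computes hat-values for a simple regression from the distance of each x value to the mean, using the standard simple-regression expression h_i = 1/n + (x_i − x̄)²/Σ(x_j − x̄)². The x values are hypothetical and the function name `hat_values` is illustrative; the checks mirror the bounds 1/n ≤ hi ≤ 1 and the average (k + 1)/n stated above.

```python
def hat_values(x):
    """Hat-values for simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical x values; the last one lies far from the others.
x = [1, 2, 3, 4, 6, 7, 8, 9, 20]
n, k = len(x), 1          # k = 1 independent variable in simple regression

h = hat_values(x)
h_bar = sum(h) / n        # should equal (k + 1) / n

for xi, hi in zip(x, h):
    print(f"x = {xi:2d}  h = {hi:.3f}  h / h_bar = {hi / h_bar:.2f}")
```

Expressing each hat-value as a multiple of the average, as in the last column, is one convenient way to see which observations stand apart in the x space.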
Detecting Outliers: Studentized Residuals
To identify an outlying observation, we need an index of the unusualness of y given the xs. Generally,
discrepant observations have large residuals, but it turns out that even if the errors εi have equal variances
(as assumed in the regression model), the residuals ei do not: V(ei) = σ²(1 - hi) (see Appendix A4.2).
High-leverage observations, therefore, tend to have small residuals—a sensible result, because these
observations can force the regression surface to be close to them.
Although we can form a standardized residual by calculating $e'_i = e_i/(s\sqrt{1 - h_i})$, this measure suffers from the
defect that the numerator and denominator are not independent, preventing e'i from following a t distribution:
When |ei| is large, $s = \sqrt{\sum e_i^2/(n - k - 1)}$, which contains ei², tends to be large as well. Suppose, however, that
we refit the regression model deleting the ith observation, obtaining an estimate s(-i) of σ based on the rest of
the data. Then the studentized residual
$$t_i = \frac{e_i}{s_{(-i)}\sqrt{1 - h_i}} \qquad (4.1)$$
has independent numerator and denominator, and follows a t distribution with n - k - 2 degrees of freedom.
Figure 4.3. Contours of constant leverage (constant hi) for k = 2 independent variables. Two high-leverage
points appear: One (shown as a large hollow dot) has unusually large values for each of x1
and x2, but the other (large filled dot) is unusual only in its combination of x1 and x2 values.
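An alternative, but equivalent, procedure for finding the studentized residuals employs the “mean-shift” outlier
model
$$y_j = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} + \gamma d_j + \varepsilon_j \qquad (4.2)$$
where d is a dummy variable set to one for observation i and zero for all other observations. Thus
$$E(y_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \gamma, \qquad E(y_j) = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} \ \text{ for } j \neq i$$
It would be natural to specify Equation 4.2 if before examining the data we suspected that observation i
differed from the others. Then, to test H0: γ = 0, we would find $t_i = \hat{\gamma}/SE(\hat{\gamma})$, which is distributed as $t_{n-k-2}$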
under H0, and which (it turns out) is the studentized residual of Equation 4.1.
Here, as elsewhere in statistics, terminology is not wholly standard: ti is sometimes called a deleted
studentized residual, an externally studentized residual, or even a standardized residual. Because the last
term also is often applied to e'i, it is important to determine exactly what is being calculated by a computer
program before using these quantities. In large samples, though, usually ti ≈ e'i ≈ ei/s.
Testing for Outliers in Regression. Because in most applications we do not suspect a particular observation
in advance, we can in effect refit the mean-shift model n times, once for each observation, producing t1, t2,
…, tn. In practice, alternative formulas to Equations 4.1 and 4.2 provide the ti with little computational effort.
Usually, our interest then will focus on the largest absolute ti, called t*. Because we have picked the biggest
of n test statistics, however, it is no longer legitimate simply to use $t_{n-k-2}$ to find the statistical significance
of t*: For example, even if our model is wholly adequate, and disregarding for the moment the dependence
among the tis, we would expect to observe about 5% of tis beyond t0.025 ≈ ±2, about 1% beyond t0.005 ≈ ±2.6,
and so forth.
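The remark that the ti can be obtained with little computational effort can be illustrated for simple regression. The sketch below computes each ti twice: literally, by refitting without observation i to obtain s(-i), and from a single fit using the standard identity t_i = e'_i √[(n − k − 2)/(n − k − 1 − e'_i²)]. That identity is a well-known shortcut and is assumed here rather than taken from the text; the data and function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1 of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def leverage_and_residuals(x, y):
    """Hat-values and least-squares residuals for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return h, e


def studentized_by_refit(x, y, i):
    """t_i computed literally: refit without observation i to obtain s(-i)."""
    n, k = len(x), 1
    h, e = leverage_and_residuals(x, y)
    _, e_del = leverage_and_residuals(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    # The deleted fit has n - 1 observations and k + 1 coefficients.
    s_del = math.sqrt(sum(r * r for r in e_del) / (n - k - 2))
    return e[i] / (s_del * math.sqrt(1 - h[i]))


def studentized_shortcut(x, y):
    """All t_i from one fit, via the standardized residuals e'_i."""
    n, k = len(x), 1
    h, e = leverage_and_residuals(x, y)
    s = math.sqrt(sum(r * r for r in e) / (n - k - 1))
    t = []
    for hi, ei in zip(h, e):
        e_std = ei / (s * math.sqrt(1 - hi))
        t.append(e_std * math.sqrt((n - k - 2) / (n - k - 1 - e_std ** 2)))
    return t


# Hypothetical data with one discrepant y value (the fifth observation).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 9.0, 6.1, 6.9, 8.2]

t_refit = [studentized_by_refit(x, y, i) for i in range(len(x))]
t_fast = studentized_shortcut(x, y)
for i, (a, b) in enumerate(zip(t_refit, t_fast), start=1):
    print(f"obs {i}: refit t = {a:7.3f}, shortcut t = {b:7.3f}")
```

The two columns should agree up to rounding error, so n separate refits are not needed in practice.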
One solution to the problem of simultaneous inference is to perform a Bonferroni adjustment to the p value
for the largest ti. (Another way to take into account the number of studentized residuals, by constructing a
quantile-comparison plot, is discussed in Chapter 5.) The Bonferroni test requires either a special t table or,
more conveniently, a computer program that returns accurate p values for t far into the tail of the distribution.
In the latter event, suppose that p' = Pr($t_{n-k-2}$ > |t*|). Then the p value for testing the statistical significance
of t* is p = 2np'. The factor 2 reflects the two-tail character of the test: We want to detect large negative as well
as large positive outliers. The factor n adjusts for conducting n simultaneous tests, which is implicit in selecting
the largest of n test statistics. Beckman and Cook (1983) have shown that the Bonferroni adjustment usually
is exact for testing the largest studentized residual. Note that a much larger t* is required for a statistically
significant result than would be the case for an ordinary individual t test.
In Davis's regression of reported weight on measured weight, the largest studentized residual by far belongs
to the 12th observation: t12 = -24.3. Here, n - k - 2 = 183 - 3 - 2 = 178, and Pr(t178 > 24.3) ≪ $10^{-8}$. (The
symbol “≪” means “much less than.” The computer program that I employed to find the tail probability was
unable to calculate a more accurate result for such a large t.) With n = 183, the Bonferroni p value for the
outlier test is p ≪ 2 × 183 × $10^{-8}$ ≈ 4 × $10^{-6}$ (i.e., 0.000004), an unambiguous result.
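The Bonferroni calculation p = 2np' needs only an upper-tail probability of the t distribution. The sketch below is one way to obtain it with the Python standard library: it integrates the t density numerically (midpoint rule after the substitution x = t/u, which keeps far-tail values from being lost to cancellation). The integration scheme and the function names `t_upper_tail` and `bonferroni_outlier_p` are illustrative choices for this sketch, not the program referred to in the text.

```python
import math


def t_upper_tail(t, df, steps=20000):
    """Approximate Pr(T > t) for t > 0, where T has a t distribution with df degrees of freedom.

    Uses the substitution x = t / u, so that the tail integral over
    (t, infinity) becomes an integral over u in (0, 1), evaluated by the
    midpoint rule.
    """
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) / steps
        x = t / u
        log_density = log_const - (df + 1) / 2 * math.log1p(x * x / df)
        total += math.exp(log_density) * t / (u * u)
    return total / steps


def bonferroni_outlier_p(t_max, n, k):
    """Bonferroni p value 2 * n * p' for the largest absolute studentized residual."""
    p_one_tail = t_upper_tail(abs(t_max), n - k - 2)
    return min(1.0, 2 * n * p_one_tail)


# Values quoted in the text for Davis's regression: t12 = -24.3, n = 183, k = 3.
p_davis = bonferroni_outlier_p(-24.3, n=183, k=3)
print(f"Bonferroni p for Davis's observation 12: {p_davis:.3g}")
print("Below the bound of 4e-6 given in the text:", p_davis < 4e-6)
```

Capping the result at 1 is a convenience: the product 2np' can exceed 1 when the largest studentized residual is unremarkable.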
An Analogy to Insurance. Thus far, I have treated the identification (and, implicitly, the potential correction,
removal, or accommodation) of outliers as a hypothesis-testing problem. Although this is by far the most
common approach in practice, a more reasonable general perspective weighs the costs and benefits for
estimation of rejecting a potentially outlying observation.
Suppose, for the moment, that the observation with the largest ti is simply an unusual data point, but one
generated by the assumed statistical model, that is, yi = β0 + β1x1i + … + βkxki + εi, with εi ∼ NID(0, σ²).
To discard an observation under these circumstances would decrease the efficiency of estimation, because
when the model (including the assumption of normality) is correct, the least-squares estimator is maximally
efficient among all unbiased estimators of the βs. If, however, the data point in question does not belong with
the rest (say, e.g., the mean-shift model applies), then to eliminate it may make estimation more efficient.
Anscombe (1960) expressed this insight by drawing an analogy to insurance: To obtain protection against
“bad” data, one purchases a policy of outlier rejection (or uses an estimator that is resistant to outliers—a
so-called robust estimator), a policy paid for by a small premium in efficiency when the policy rejects “good”
data.
Let P denote the desired premium, say 0.05, that is, a 5% increase in estimator mean-squared error if the model
holds for all of the data. Let z represent the unit-normal deviate corresponding to a tail probability of
P(n - k - 1)/n. Following the procedure derived by Anscombe and Tukey (1963), compute m = 1.4 + 0.85z, and
then find
$$f = m\left[1 - \frac{m^2 - 2}{4(n - k - 1)}\right]\sqrt{\frac{n - k - 1}{n}} \qquad (4.3)$$
and
$$t' = f\sqrt{\frac{n - k - 2}{n - k - 1 - f^2}} \qquad (4.4)$$
Finally, reject the observation with the largest studentized residual if |t*| > t'. In a real application, of course,
we should inquire about discrepant observations (see the discussion at the end of this section).
For example, for Davis's first regression n = 183 and k = 3; so for a premium of P = 0.05, we have
P(n - k - 1)/n = 0.05 (183 - 3 - 1)/183 = 0.0489
From the unit-normal table, z = 1.66, from which m = 1.4 + 0.85 × 1.66 = 2.81. Then, using Equation 4.3,
f = 2.76, and using Equation 4.4, t' = 2.81. Because |t*| = 24.3 is much larger than t', the 12th observation is
identified as an outlier.
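The arithmetic of this worked example can be retraced in a few lines. The sketch below assumes the following forms for Equations 4.3 and 4.4: f = m[1 − (m² − 2)/(4(n − k − 1))]√((n − k − 1)/n) and t' = f√((n − k − 2)/(n − k − 1 − f²)). These forms are an assumption of the sketch; they do reproduce the values f = 2.76 and t' = 2.81 quoted above when z = 1.66. The function name `insurance_cutoff` is illustrative.

```python
import math
from statistics import NormalDist


def insurance_cutoff(n, k, premium=0.05, z=None):
    """Anscombe-Tukey style rejection cutoff t' for the largest studentized residual.

    If z is not supplied, it is taken as the unit-normal deviate whose upper
    tail probability is premium * (n - k - 1) / n.
    """
    nu = n - k - 1
    if z is None:
        z = NormalDist().inv_cdf(1 - premium * nu / n)
    m = 1.4 + 0.85 * z
    # Assumed form of Equation 4.3.
    f = m * (1 - (m * m - 2) / (4 * nu)) * math.sqrt(nu / n)
    # Assumed form of Equation 4.4.
    t_prime = f * math.sqrt((nu - 1) / (nu - f * f))
    return z, m, f, t_prime


# Davis's first regression: n = 183, k = 3, premium P = 0.05, table value z = 1.66.
z, m, f, t_prime = insurance_cutoff(n=183, k=3, premium=0.05, z=1.66)
print(f"z = {z:.2f}, m = {m:.2f}, f = {f:.2f}, t' = {t_prime:.2f}")
print("Reject observation 12 (|t*| = 24.3):", 24.3 > t_prime)
```

Leaving `z=None` lets the function compute the normal deviate itself; because the table value 1.66 is rounded, the results then differ slightly in the second decimal place.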
Measuring Influence: Cook's Distance and Other Diagnostics
As noted previously, influence on the regression coefficients combines leverage and discrepancy. The most
direct measure of influence simply examines the impact on each coefficient of deleting each observation in
turn:
dij = bj - bj(-i), for i = 1, …, n; j = 0, …, k
where bj(-i) denotes the least-squares estimate of βj produced when the ith observation is omitted. To assist
in interpretation, it is useful to scale the dij by (deleted) estimates of the coefficient standard errors:
$$d^*_{ij} = \frac{d_{ij}}{SE_{(-i)}(b_j)}$$
Following Belsley et al. (1980), the dij are often termed DFBETAij, and the d*ij are called DFBETASij.
One problem associated with using the dij or d*ij is their large number: n(k + 1) of each. Of course, these
values can be more quickly examined graphically than in numerical tables. For example, we can construct an
“index plot” of the d*ij for each coefficient j = 0, 1, …, k, that is, a simple scatterplot with d*ij on the vertical axis
versus the observation index i on the horizontal axis. Nevertheless, it is useful to have a summary index of
the influence of each observation on the fit.
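A brute-force sketch of these deletion quantities for simple regression follows. Each observation is omitted in turn, the line is refitted, and the change in each coefficient is recorded. For the standardized version, the sketch assumes the scaling of Belsley et al.: the deleted residual standard deviation s(-i) multiplied by the square root of the corresponding diagonal entry of (X'X)⁻¹ from the full data. The data and the function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1 of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residual_sd(x, y):
    """Residual standard deviation of the simple regression of y on x."""
    b0, b1 = fit_line(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - 2))


def dfbeta_table(x, y):
    """One row per observation: (DFBETA_0, DFBETA_1, DFBETAS_0, DFBETAS_1)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Square roots of the diagonal of (X'X)^-1 for the full data (assumed scaling).
    c0 = math.sqrt(sum(xi * xi for xi in x) / (n * sxx))
    c1 = math.sqrt(1 / sxx)
    b0, b1 = fit_line(x, y)
    rows = []
    for i in range(n):
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0_del, b1_del = fit_line(x_del, y_del)
        s_del = residual_sd(x_del, y_del)
        d0, d1 = b0 - b0_del, b1 - b1_del
        rows.append((d0, d1, d0 / (s_del * c0), d1 / (s_del * c1)))
    return rows


# Hypothetical data: roughly y = 1 + 2x, except for the last, high-leverage point.
x = [1, 2, 3, 4, 6, 7, 8, 9, 20]
y = [3.1, 4.8, 7.2, 8.9, 13.1, 14.8, 17.2, 18.9, 21.0]

rows = dfbeta_table(x, y)
for i, (d0, d1, ds0, ds1) in enumerate(rows, start=1):
    print(f"obs {i}: DFBETA = ({d0:+.3f}, {d1:+.3f})  DFBETAS = ({ds0:+.2f}, {ds1:+.2f})")
```

Printing the rows in observation order gives a crude numerical counterpart to the index plot described above.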
Cook (1977) has proposed measuring the “distance” between the bj and the corresponding bj(-i) by
calculating the F statistic for the “hypothesis” that βj = bj(-i), j = 0, 1, …, k. This statistic is recalculated for each
observation i = 1, …, n. The resulting values should not literally be interpreted as F tests; Cook's approach
merely exploits an analogy to testing to produce a measure of distance independent of the scales of the x
variables. Cook's statistic may be written (and simply calculated) as
$$D_i = \frac{e_i'^2}{k + 1} \times \frac{h_i}{1 - h_i}$$
In effect, the first term is a measure of discrepancy, and the second a measure of leverage (see Appendix
A4.3). We look for values of Di that are substantially larger than the rest.
Belsley et al. (1980) have suggested the very similar measure
$$\mathrm{DFFITS}_i = t_i\sqrt{\frac{h_i}{1 - h_i}}$$
Note that except for unusual data configurations Di ≈ DFFITSi²/(k + 1). Other global measures of influence
are available (see Chatterjee and Hadi, 1988, Ch. 4, for a comparative treatment).
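The two summary measures can be computed from hat-values and residuals alone. The sketch below assumes the usual closed forms, D_i = [e'_i²/(k + 1)] × [h_i/(1 − h_i)] and DFFITS_i = t_i √[h_i/(1 − h_i)], and checks Cook's D against its definition as a scaled sum of squared changes in the fitted values when observation i is deleted. The data are hypothetical and the function names are illustrative.

```python
import math


def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1 of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def influence_measures(x, y):
    """List of (Cook's D, DFFITS) pairs computed from h_i, e'_i and t_i."""
    n, k = len(x), 1
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    s2 = sum(ei * ei for ei in e) / (n - k - 1)
    out = []
    for ei, hi in zip(e, h):
        e_std = ei / math.sqrt(s2 * (1 - hi))                          # e'_i
        t = e_std * math.sqrt((n - k - 2) / (n - k - 1 - e_std ** 2))  # t_i
        cook = e_std ** 2 / (k + 1) * hi / (1 - hi)
        dffits = t * math.sqrt(hi / (1 - hi))
        out.append((cook, dffits))
    return out


def cooks_d_by_refit(x, y, i):
    """Cook's D from its definition: scaled shift in all fitted values."""
    n, k = len(x), 1
    b0, b1 = fit_line(x, y)
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - k - 1)
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum(((b0 + b1 * xj) - (c0 + c1 * xj)) ** 2 for xj in x)
    return shift / ((k + 1) * s2)


# Hypothetical data: roughly y = 1 + 2x, except for the last, high-leverage point.
x = [1, 2, 3, 4, 6, 7, 8, 9, 20]
y = [3.1, 4.8, 7.2, 8.9, 13.1, 14.8, 17.2, 18.9, 21.0]

measures = influence_measures(x, y)
for i, (cook, dffits) in enumerate(measures, start=1):
    approx = dffits ** 2 / 2   # DFFITS^2 / (k + 1) with k = 1
    print(f"obs {i}: D = {cook:8.3f}  DFFITS = {dffits:+8.3f}  DFFITS^2/(k+1) = {approx:8.3f}")
```

The last column differs from D by the factor (t_i/e'_i)², so the approximation is close when the standardized and studentized residuals are similar and deteriorates for a grossly outlying observation.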
Because all of the deletion statistics depend on the hat-values and residuals, a graphical alternative to either
of the general influence measures is to plot the hi against the ti and to look for observations for which both are
big. A slightly more sophisticated version of this plot displays circles of area proportional to Cook's D instead
of points (see Figure 4.6, page 38). We can follow up by examining the dij or d*ij for the observations with the
largest few Di, |DFFITSi|, or combination of large hi and |ti|.
For Davis's regression of reported weight on measured weight, all of the indices of influence point to the
obviously discrepant 12th observation:
Cook's D12 = 85.9 (next largest, D21 = 0.065)
DFFITS12 = -38.4 (next largest, DFFITS50 = 0.512)
DFBETAS0,12 = DFBETAS1,12 = 0; DFBETAS2,12 = 20.0; DFBETAS3,12 = -24.8
Note that observation 12, which is for a female subject, has no impact on the male intercept b0 and slope b1.
Influence on Standard Errors. In developing the concept of influence in regression, I have focused on changes
in regression coefficients. Other regression outputs may be examined as well, however. One important output
is the set of coefficient variances and covariances, which capture the precision of estimation. For example,
recall Figure 4.1c, where a high-leverage observation exerts no influence on the regression coefficients,
because it is in line with the rest of the data. The estimated standard error of the least-squares slope in
simple regression is $SE(b_1) = s/\sqrt{\sum (x_i - \bar{x})^2}$, and, therefore, by increasing the variance of x the high-leverage
observation serves to decrease SE(b1), even though it does not influence b0 and b1. Depending on context,
such an observation may be considered beneficial—increasing the precision of estimation—or it may cause
us to exaggerate our confidence in the estimate b1.
In multiple regression, we can examine the impact of deleting each observation in turn on the size of the joint-
confidence region for the βs. Recall from Chapter 2 that the size of this region is analogous to the length of
a confidence interval for an individual coefficient, which in turn is proportional to coefficient standard error.
The squared length of a confidence interval is therefore proportional to coefficient sampling variance, and,
analogously, the squared size of a joint confidence region is proportional to the “generalized” variance of a
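set of coefficients. An influence measure proposed by Belsley et al. (1980) closely approximates the squared
ratio of volumes of the deleted and full-data confidence regions:
$$\mathrm{COVRATIO}_i = \frac{1}{(1 - h_i)\left(\dfrac{n - k - 2 + t_i^2}{n - k - 1}\right)^{k+1}}$$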
Alternative, similar measures have been suggested by several authors (again, Chatterjee and Hadi, 1988,
Ch. 4, provide a comparative discussion). Look for values of COVRATIOi that differ substantially from 1.
As for measures of influence on the regression coefficients, both the hat-value and the (studentized)
residual figure in COVRATIO. A large hat-value produces a large COVRATIO, however, even when (actually,
especially when) t is small, because a high-leverage, in-line observation improves the precision of estimation.
In contrast, a discrepant, low-leverage observation might not change the coefficients much, but it decreases
the precision of estimation by increasing the estimated error variance; such an observation, with small h and
large t, produces a COVRATIO substantially below 1.
For example, for Davis's first regression by far the most extreme value is COVRATIO12 = 0.0103. In this case,
a very large h12 = 0.714 is more than offset by a massive t12 = -24.3.
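The interplay of leverage and discrepancy in COVRATIO is easy to explore numerically. The sketch below assumes the form usually attributed to Belsley et al., COVRATIO_i = 1/{(1 − h_i)[(n − k − 2 + t_i²)/(n − k − 1)]^(k+1)}; with the values quoted for Davis's 12th observation it reproduces the reported 0.0103. The two further cases use made-up values of h and t purely for illustration.

```python
def covratio(h, t, n, k):
    """COVRATIO for one observation from its hat-value h and studentized residual t.

    Assumes the closed form attributed to Belsley et al. (1980).
    """
    return 1.0 / ((1 - h) * ((n - k - 2 + t * t) / (n - k - 1)) ** (k + 1))


# Davis's first regression, observation 12 (values quoted in the text).
davis_12 = covratio(h=0.714, t=-24.3, n=183, k=3)
print(f"COVRATIO for Davis's observation 12: {davis_12:.4f}")

# Hypothetical high-leverage, in-line observation: precision improves (> 1).
in_line = covratio(h=0.5, t=0.0, n=50, k=1)
print(f"High leverage, t = 0: {in_line:.3f}")

# Hypothetical low-leverage, discrepant observation: precision degrades (< 1).
discrepant = covratio(h=0.03, t=4.0, n=50, k=1)
print(f"Low leverage, t = 4: {discrepant:.3f}")
```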
Influence on Collinearity. Other characteristics of a regression analysis also may be influenced by individual
observations, including the degree of collinearity. Although a formal consideration of influence on collinearity
is above the level of this presentation (see Chatterjee and Hadi, 1988, Ch. 4–5), the following remarks may
prove helpful:
1. Influence on collinearity is one of the factors reflected in influence on coefficient standard errors.
   Influence on the error variance and influence on the variation of the xs also are implicitly
   factored into a measure such as COVRATIO, however. As well, COVRATIO and similar measures
   examine the sampling variances and covariances of all of the regression coefficients, including
   the constant. Nevertheless, our concern for collinearity reflects its impact on the precision of
   estimation, and the global precision of estimation is assessed by COVRATIO.
2. Collinearity-influential points are those that either induce or substantially weaken correlations
   among the xs. Such points usually, but not always, have large hat-values. Conversely, points
   with large hat-values often influence collinearity.
3. Individual points that induce collinearity are obviously problematic. Points that substantially
   weaken collinearity also merit examination, because they may cause us to be overly confident in
   our results.
4. It is frequently possible to detect collinearity-influential observations by plotting independent
   variables against each other. This approach will fail, however, if the collinear relations in question
   involve more than two independent variables at a time.
Numerical Cutoffs for Diagnostic Statistics
I have deliberately refrained from suggesting specific numerical criteria for identifying noteworthy
observations on the basis of measures of leverage and influence. I believe that generally it is more useful
to examine the distributions of these quantities to locate observations with unusual values. For studentized
residuals, the hypothesis-testing and insurance approaches produce cutoffs of sorts, but even these
numerical criteria are no substitute for graphical examination of the residuals.
Nevertheless, cutoffs can be of some use, as long as they are not given too much weight, and especially
when they serve to enhance graphical displays. A horizontal line may be drawn on an index plot, for example,
to draw attention to values beyond a cutoff. Similarly, such values may be identified individually in a graph (as
in Figure 4.6, page 38).
Cutoffs for a diagnostic statistic may be the product of statistical theory, or they may result from examination
of the sample distribution of the statistic. Cutoffs may be absolute, or they may be adjusted for sample size
(Belsley et al., 1980, Ch. 2). For some diagnostic statistics, such as measures of influence, absolute cutoffs
are unlikely to identify noteworthy observations in large samples. In part, this characteristic reflects the ability
of large samples to absorb discrepant data without changing the results substantially, but it is still often of
interest to identify relatively influential points, even if no observation has strong absolute influence.
The cutoffs presented below are, as explained briefly, based on the application of statistical theory. An
alternative, very simple, and universally applicable data-based criterion is to examine the most extreme 5%
(say) of values for a diagnostic measure.
1. Hat-values: Belsley et al. (1980) suggest that hat-values exceeding about twice the average
   (k + 1)/n are noteworthy. This size-adjusted cutoff was derived as an approximation identifying the
   most extreme 5% of cases when the xs are multivariate-normal and k and n - k - 1 are relatively
   large, but it is recommended by these authors as a rough general guide. (See Chatterjee and
   Hadi [1988, Ch. 4] for a discussion of alternative cutoffs for hat-values.)
2. Studentized residuals: Beyond the issues of “statistical significance” and estimator robustness
   and efficiency discussed above, it sometimes helps to call attention to residuals that are relatively
   large. Recall that under ideal conditions about 5% of studentized residuals are outside the range
   |ti| ≤ 2. It is therefore reasonable, for example, to draw lines at ±2 on a display of studentized
   residuals to highlight observations outside this range.
3. Measures of influence: Many cutoffs have been suggested for different measures of influence. A
   few are presented here:
   a. Standardized change in regression coefficients: The d*ij are scaled by standard
      errors, and consequently |d*ij| > 1 or 2 suggests itself as an absolute cutoff. As
      explained above, however, this criterion is unlikely to nominate observations in
      large samples. Belsley et al. propose the size-adjusted cutoff $2/\sqrt{n}$ for identifying
      noteworthy d*ij.
   b. Cook's D and DFFITS: A variety of numerical cutoffs have been recommended
      for Cook's D and DFFITS, exploiting the analogy between D and an F statistic,
      for example. Chatterjee and Hadi (1988) suggest comparing |DFFITSi| with the
      size-adjusted cutoff $2\sqrt{(k + 1)/(n - k - 1)}$. (Also see Cook [1977], Belsley et al.
      [1980], and Velleman and Welsch [1981].) Moreover, because of the
      approximate relationship between DFFITS and Cook's D, it is simple to
      translate cutoffs between the two measures. For Chatterjee and Hadi's
      criterion, for example, we have the translated cutoff Di > 4/(n - k - 1). Absolute
      cutoffs, such as Di > 1, risk missing influential data.
   c. COVRATIO: Belsley et al. suggest that COVRATIOi is noteworthy when
      |COVRATIOi - 1| exceeds the size-adjusted cutoff 3(k + 1)/n.
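The size-adjusted cutoffs listed above are simple functions of n and k, so they can be collected in one helper. In the sketch below, the forms assumed for DFBETAS and DFFITS are 2/√n and 2√((k + 1)/(n − k − 1)), the versions usually attributed to Belsley et al. and to Chatterjee and Hadi; the function name and dictionary keys are illustrative. The example uses n = 45 and k = 2, the dimensions of Duncan's regression discussed later in the chapter.

```python
import math


def size_adjusted_cutoffs(n, k):
    """Rough cutoffs for regression diagnostics with n observations and k regressors."""
    return {
        "hat_value": 2 * (k + 1) / n,                         # twice the average hat-value
        "abs_studentized_residual": 2.0,                      # about 5% expected beyond +/-2
        "abs_dfbetas": 2 / math.sqrt(n),                      # assumed form
        "abs_dffits": 2 * math.sqrt((k + 1) / (n - k - 1)),   # assumed form
        "cooks_d": 4 / (n - k - 1),                           # DFFITS cutoff translated to D
        "abs_covratio_minus_1": 3 * (k + 1) / n,
    }


cutoffs = size_adjusted_cutoffs(n=45, k=2)
for name, value in cutoffs.items():
    print(f"{name:26s} {value:.3f}")
```

As the text stresses, such numbers are best used to annotate graphical displays, for example as reference lines on an index plot, rather than as mechanical rejection rules.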
Jointly Influential Subsets of Observations: Partial-Regression Plots
As illustrated in Figure 4.4, subsets of observations can be jointly influential or can offset each other's
influence. Often, influential subsets or multiple outliers can be identified by applying single-observation
diagnostics sequentially. It is important, however, to refit the model after deleting each point, because the
presence of a single influential value may dramatically affect the fit at other points. Still, the sequential
approach is not always successful.
Although it is possible to generalize deletion statistics formally to subsets of several points, the very large
number of subsets (there are n!/[p!(n - p)!] subsets of size p) usually renders the approach impractical (but see
Belsley et al. 1980, Ch. 2; Chatterjee and Hadi, 1988, Ch. 5). An attractive alternative is to employ graphical
methods.
A particularly useful influence graphic is the partial-regression plot, also called a partial-regression leverage
plot or an added-variable plot. Let $y_i^{(1)}$ represent the residuals from the least-squares regression of y on all
of the xs save x1, that is, the residuals from the fitted model
$$y_i = b_0^{(1)} + b_2^{(1)}x_{2i} + \cdots + b_k^{(1)}x_{ki} + y_i^{(1)}$$
Figure 4.4. Jointly influential data. In each case, the solid line gives the regression for all of the
data, the light broken line gives the regression with the triangle deleted, and the heavy broken line
gives the regression with both the square and the triangle deleted. (a) Jointly influential observations
located close to one another: Deletion of both observations has a much greater impact than deletion
of only one. (b) Jointly influential observations located on opposite sides of the data. (c) Observations
that offset one another: The regression with both observations deleted is the same as for the whole
dataset.
Likewise, the $x_i^{(1)}$ are residuals from the least-squares regression of x1 on the other xs:
$$x_{1i} = c_0^{(1)} + c_2^{(1)}x_{2i} + \cdots + c_k^{(1)}x_{ki} + x_i^{(1)}$$
The notation emphasizes the interpretation of the residuals $y^{(1)}$ and $x^{(1)}$ as the parts of y and x1 that remain
when the effects of x2, …, xk are removed. It may be shown (see Appendix A4.4) that the slope from the
least-squares regression of $y^{(1)}$ on $x^{(1)}$ is simply the least-squares slope b1 from the full multiple regression,
and that the residuals from this regression are the same as those from the full regression, that is,
$y_i^{(1)} = b_1 x_i^{(1)} + e_i$. Note that no constant is required here, because as least-squares residuals, both $y^{(1)}$ and $x^{(1)}$
have means of zero.
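The property that the partial-regression slope equals the multiple-regression slope can be checked numerically. The sketch below uses hypothetical data with k = 2: it residualizes y and x1 on x2, regresses one set of residuals on the other through the origin, and compares the result with b1 from the closed-form two-predictor least-squares solution. All names and data values are illustrative.

```python
def residuals_on(z, v):
    """Residuals from the least-squares regression (with a constant) of v on z."""
    n = len(z)
    z_bar, v_bar = sum(z) / n, sum(v) / n
    slope = (sum((zi - z_bar) * (vi - v_bar) for zi, vi in zip(z, v))
             / sum((zi - z_bar) ** 2 for zi in z))
    return [(vi - v_bar) - slope * (zi - z_bar) for zi, vi in zip(z, v)]


def full_regression_b1(x1, x2, y):
    """Coefficient of x1 in the least-squares regression of y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


# Hypothetical data with two correlated independent variables.
x1 = [2, 4, 5, 7, 8, 10, 11, 13]
x2 = [1, 3, 2, 5, 4, 6, 8, 7]
y = [5, 9, 10, 16, 15, 21, 25, 26]

y_res = residuals_on(x2, y)    # the part of y not explained by x2
x_res = residuals_on(x2, x1)   # the part of x1 not explained by x2

# Regression through the origin of the y residuals on the x1 residuals.
partial_slope = (sum(a * b for a, b in zip(x_res, y_res))
                 / sum(a * a for a in x_res))
b1_full = full_regression_b1(x1, x2, y)

print(f"slope in the partial-regression plot: {partial_slope:.6f}")
print(f"b1 from the full multiple regression: {b1_full:.6f}")
```

Plotting `y_res` against `x_res` would give the partial-regression plot for x1; the same two steps, repeated for each regressor, give the remaining plots.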
Plotting $y^{(1)}$ against $x^{(1)}$ permits us to examine leverage and influence on b1. Similar partial-regression plots
can be constructed for the other regression coefficients, including b0:
Plot $y^{(j)}$ versus $x^{(j)}$, for j = 0, 1, …, k
In the case of b0, we regress the “constant regressor” x0 = 1 and y on x1 through xk, with no constant in the
regression equations.
Illustrative partial-regression plots appear in Figure 4.5. The data for this example are drawn from Duncan
(1961), who regressed the rated prestige of 45 occupations (P, assessed as the percentage of raters
scoring the occupations as “good” or “excellent”) on the income and educational levels of the occupations in
1950 (respectively, I, the percent of males earning at least $3,500, and E, the percent of male high-school
graduates). The primary aim of this regression was to produce fitted prestige scores for occupations for which
there were no direct prestige ratings, but for which income and educational data were available. The fitted
regression (with standard errors in parentheses) is
$$\hat{P} = \underset{(4.27)}{-6.06} + \underset{(0.120)}{0.599}\,I + \underset{(0.098)}{0.546}\,E$$
$$R^2 = 0.83 \qquad s = 13.4$$
The partial-regression plot for income (Figure 4.5a) reveals three apparently influential observations that
serve to decrease the income slope: ministers (6), whose income is unusually low given the educational level
of the occupation; and railroad conductors (16) and railroad engineers (27), whose incomes are unusually
high given education. Recall that the horizontal variable in the partial-regression plot is the residual from the
regression of income on education, and thus values far from zero in this direction are those for which income
is unusual given education.
Figure 4.5. Partial-regression plots for (a) income and (b) education in the regression of prestige on
the income and education levels of 45 U.S. occupations in 1950. The observation numbers of the
points are plotted. If the plots were drawn to a larger scale, as on a computer screen, then the names
of the occupations could be plotted in place of their numbers. The partial-regression plot for the
constant is not shown.
The partial-regression plot for education (Figure 4.5b) shows that the same three observations have relatively
high leverage on the education coefficient: Observations 6 and 16 tend to increase b2, whereas observation
27 appears to be closer in line with the rest of the data.
Examining the single-observation deletion diagnostics reveals that observation 6 has the largest Cook's D
(D6 = 0.566) and studentized residual (t6 = 3.14). This studentized residual is not especially big, however:
The Bonferroni p value for the outlier test is Pr(t41 > 3.14) × 2 × 45 = 0.14. Figure 4.6 displays a plot of
studentized residuals versus hat-values, with the areas of the plotted circles proportional to values of Cook's
D. Observation indices are shown on the plot for |ti| > 2 or hi > 2(k + 1)/n = 2(2 + 1)/45 = 0.13.
Figure 4.6. Plot of studentized residuals against hat-values for the regression of occupational prestige
on income and education. Each point is plotted as a circle with area proportional to Cook's D. The
observation number is shown when hi > 2(k + 1)/n = 0.13 or |ti| > 2.
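The labeling rule used in Figure 4.6 can be sketched as follows. To keep the example free of matrix algebra it treats a simple regression (k = 1) on ten invented data points, the last of which is a deliberately planted high-leverage outlier; the function name diagnostics and the data are assumptions made for illustration. The formulas are the usual ones: hat-values, externally studentized residuals computed from the deleted estimate of the error variance, and Cook's D.

```python
import math

def diagnostics(x, y):
    """Return (hat-value, studentized residual, Cook's D) for each observation
    in the simple least-squares regression of y on x."""
    n, k = len(x), 1
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - k - 1)      # residual variance
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx            # hat-value
        # Error variance estimated with this observation deleted.
        s2_del = ((n - k - 1) * s2 - e * e / (1 - h)) / (n - k - 2)
        t = e / math.sqrt(s2_del * (1 - h))           # studentized residual
        # Cook's D: squared standardized residual times a leverage factor.
        cook = (e * e / (s2 * (1 - h))) / (k + 1) * h / (1 - h)
        rows.append((h, t, cook))
    return rows

# Invented data: nine points near a line plus one discrepant high-leverage point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.3, 18.1, 25.0]

n, k = len(x), 1
cutoff = 2 * (k + 1) / n
rows = diagnostics(x, y)

# Observation numbers (1-based) that would be labeled on the plot.
flagged = [i + 1 for i, (h, t, d) in enumerate(rows)
           if abs(t) > 2 or h > cutoff]
print(flagged)
```

In a plot of this kind each point would be drawn at (h, t) with circle area proportional to the third element of the corresponding row.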
Deleting observations 6 and 16 produces the fitted regression
$$\hat{P} = \underset{(3.65)}{-6.41} + \underset{(0.122)}{0.867}\,I + \underset{(0.099)}{0.332}\,E$$
$$R^2 = 0.88 \qquad s = 11.4$$
which, as expected from the partial-regression plots, has a larger income slope and smaller education slope
than the original regression. The estimated standard errors are likely optimistic, because relative outliers have
been trimmed away. Deleting observation 27 as well further increases the income slope and decreases the
education slope, but the change is not dramatic: bI = 0.931, bE = 0.285.
Should Unusual Data Be Discarded?
The discussion in this section has proceeded as if outlying and influential data are simply discarded. Though
problematic data should not be ignored, they also should not be deleted automatically and thoughtlessly:
1. It is important to investigate why data are unusual. Truly bad data (e.g., errors in data entry
as in Davis's regression) can often be corrected or, if correction is not possible, thrown away.
Alternatively, when a discrepant data-point is correct, we may be able to understand why the
observation is unusual. For Duncan's regression, for example, it makes sense that ministers enjoy
prestige not accounted for by the income and educational levels of the occupation. Likewise, I
suspect that the high incomes of railroad workers relative to their educational level and prestige
reflect the power of railroad unions around 1950. In a case like this, we may choose to deal with
outlying observations separately.
2. Alternatively, outliers or influential data may motivate model respecification. For example, the
pattern of outlying data may suggest introduction of additional independent variables. If, in
Duncan's regression, we can identify a factor that produces the unusually high prestige of
ministers (net of their income and education), and we can measure that factor for other
occupations, then this variable could be added to the regression. In some instances,
transformation of the dependent variable or of an independent variable may, by rendering the
error distribution symmetric or eliminating nonlinearity (see Chapters 5 and 7), draw apparent
outliers toward the rest of the data. We must, however, be careful to avoid “overfitting” the
data—permitting a small portion of the data to determine the form of the model. I shall return to
this problem in Chapters 9 and 10.
3. Except in clear-cut cases, we are justifiably reluctant to delete observations or to respecify
the model to accommodate unusual data. Some researchers reasonably adopt alternative estimation
strategies, such as robust regression, which continuously downweights outlying data rather than
simply including or discarding them. Such methods are termed “robust” because they behave well
even when the errors are not normally distributed (see the discussion of lowess in Appendix A6.1
for an example). As mentioned in passing, the attraction of robust estimation may be understood
using Anscombe's insurance analogy: Robust methods are nearly as efficient as least squares
when the errors are normally distributed, and much more efficient in the presence of outliers.
Because these methods assign zero or very small weight to highly discrepant data, however, the
result is not generally very different from careful application of least squares, and, indeed, robust-
regression weights may be used to identify outliers. Moreover, most robust-regression methods
are vulnerable to high-leverage points (but see the “high-breakdown” estimators described by
Rousseeuw and Leroy, 1987).
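The idea of continuous downweighting in point 3 can be illustrated with a small sketch. It uses Huber weights fitted by iteratively reweighted least squares, which is one common robust-regression method and not necessarily the one a given researcher would choose; the tuning constant 1.345 and the scale estimate MAD/0.6745 are conventional choices. The data, with true slope 3 and one planted outlier, and the function names are invented for illustration.

```python
import statistics

def weighted_line(x, y, w):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def huber_fit(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line by iteratively reweighted least squares."""
    w = [1.0] * len(x)                       # start from ordinary least squares
    for _ in range(iterations):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        med = statistics.median(resid)
        mad = statistics.median([abs(r - med) for r in resid])
        scale = max(mad / 0.6745, 1e-12)     # robust scale, guarded against zero
        # Full weight for small residuals, weight shrinking as 1/|r| beyond c*scale.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
    a, b = weighted_line(x, y, w)
    return a, b, w

# Invented data: y = 2 + 3x plus small disturbances, with one gross outlier.
x = list(range(10))
noise = [0.5, -0.3, 0.2, -0.6, 0.4, -0.1, 0.3, -0.5, 0.1, -0.2]
y = [2 + 3 * xi + ei for xi, ei in zip(x, noise)]
y[7] += 30                                   # planted outlier

a_ols, b_ols = weighted_line(x, y, [1.0] * len(x))
a_rob, b_rob, weights = huber_fit(x, y)
print(b_ols, b_rob, weights[7])
```

The final weights double as an outlier diagnostic: observations with weights well below 1 are those the robust fit treats as discrepant. Because the weights depend only on the residuals, a discrepant point at a sufficiently high-leverage x value can still pull the fit toward itself, which is the vulnerability noted above.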
http://dx.doi.org/10.4135/9781412985604.n4