Statistics lesson 28

MTH 245 Lesson 28 Notes

Outliers, High-Leverage Points, and Influence

In the context of linear regression, an outlier is a point whose residual is large compared to the rest of the residuals of a fitted model. In other words, an outlier is far away from the estimated regression line in terms of vertical distance. The most common criterion for judging a point as an outlier is a studentized residual (basically the z-score of the residual) greater than 3.00 or less than −3.00.
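As a rough illustration of the outlier criterion, the sketch below computes studentized residuals by hand for a simple linear regression and flags any that fall outside ±3.00. It assumes the internally studentized form (residual divided by s·sqrt(1 − leverage)); the notes do not say which form StatCrunch reports, and the data are hypothetical.

```python
import math

def studentized_residuals(x, y):
    """Internally studentized residuals for a simple linear regression fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Raw residuals: observed y minus fitted y.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage of each point (depends only on the x-values).
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)]

# Hypothetical data: y stays close to 2x + 1, except at index 10,
# where the response is shifted upward by 15 units.
x = list(range(1, 21))
y = [2 * xi + 1 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[10] += 15

r = studentized_residuals(x, y)
outliers = [i for i, ri in enumerate(r) if abs(ri) > 3.00]
print(outliers)
```

Note that with very small data sets an internally studentized residual cannot reach 3.00 at all, which is one reason the example uses 20 points.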

A high-leverage point is a point whose x-coordinate is extreme compared to the rest of the points in the data set. In other words, a high-leverage point is far away from the rest of the scatter plot in terms of horizontal distance. For a simple linear regression (SLR) model, a point is high-leverage if the leverage value of its x-coordinate exceeds the high-leverage threshold of 6/n, where n is the number of observations.
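The 6/n rule can be sketched in a few lines. This assumes the usual SLR leverage formula, h = 1/n + (x − x̄)² / Sxx, which the notes do not state explicitly; the handspan-like values are hypothetical.

```python
def leverages(x):
    """Leverage of each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def high_leverage_indices(x):
    """Indices of points whose leverage exceeds the 6/n threshold."""
    threshold = 6 / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > threshold]

# Hypothetical x-values: 19 points between 20 and 22, plus one at 30.
x = [20, 21, 22] * 6 + [21, 30]

print(6 / len(x))                # threshold for n = 20
print(high_leverage_indices(x))  # positions of the high-leverage points
```

Leverage depends only on the x-values, so a point can be high-leverage without being an outlier.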

An influential point is a point that strongly affects the graph of the regression line. Typically, an influential point is both an outlier and a high-leverage point. (For large data sets, a single point is not likely to be influential.)

There are several quantitative measures available to determine if a point is influential. The easiest to apply is Cook's distance (or Cook's D).

− An observation with a Cook's D greater than 0.5 (but not greater than 1) is a potential influential point. The influence is relatively weak, so no further action is required.

− An observation with a Cook's D of greater than 1 is a likely influential point. The influence is relatively strong, so it should be flagged for further investigation.
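The two cutoffs above can be applied mechanically. The sketch below assumes the standard Cook's distance formula for a model with two coefficients, D = e²·h / (2·MSE·(1 − h)²); the notes do not give the formula, and the function names and data are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # The 2 in the denominator is the number of estimated coefficients.
    return [e ** 2 * h / (2 * mse * (1 - h) ** 2) for e, h in zip(residuals, lev)]

def classify(d):
    """Apply the 0.5 and 1 cutoffs from the notes."""
    if d > 1:
        return "likely"
    if d > 0.5:
        return "potential"
    return "none"

# Hypothetical data: nine points on the line y = x, plus one point far to
# the right (x = 20) whose response (y = 0) is far below that line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

d = cooks_distances(x, y)
labels = [classify(v) for v in d]
print(labels)
```

The last point is both far from the other x-values and far from the trend of the other points, which is the combination described above as typical of an influential point.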

To find the studentized residuals, leverages and Cook's D values for each observation, select the appropriate entries under "Save:" on the options screen:

When you fit the regression model, StatCrunch will calculate these three values for each ordered pair and output them to the same row in a new column.

The easiest way to identify extreme values in each column is to sort the data on the appropriate column using Data > Sort:

1. In "Columns to sort:", select all columns (do not sort single columns!).

2. In the "Sort criteria:" section, use the pull-down menus to select the appropriate column and sort direction.

3. In the "Options:" section, select the desired destination for the sorted data (leave this at "Replace current column(s)" unless you have a specific reason to change it).
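The warning in step 1 is about keeping each row intact: if only one column is sorted, its values no longer line up with the observations they belong to. A minimal sketch of the same idea outside StatCrunch, using hypothetical rows and values:

```python
# Each row keeps an observation's values together (all numbers are hypothetical).
rows = [
    {"obs": 1, "x": 21.0, "y": 68, "leverage": 0.012},
    {"obs": 2, "x": 16.5, "y": 63, "leverage": 0.041},
    {"obs": 3, "x": 25.5, "y": 78, "leverage": 0.058},
    {"obs": 4, "x": 20.0, "y": 66, "leverage": 0.009},
]

# Sort whole rows by leverage, largest first, so every value stays
# attached to its own observation.
sorted_rows = sorted(rows, key=lambda row: row["leverage"], reverse=True)

for row in sorted_rows:
    print(row["obs"], row["x"], row["y"], row["leverage"])
```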

Example 1: Fit a model with the Handspan-Height data set, saving the studentized residuals, leverages and Cook's D values. Which points are outliers or high-leverage, if any? Are there any potential or likely influential points?

There are no outliers (points with studentized residuals greater than 3.00 or less than −3.00).

There are three high-leverage points (leverage > 6/n = 6/167 = 0.036): #70 (16.5, 63), #114 (16, 57), and #151 (25.5, 78).

There are no potential or likely influential points (i.e., no points with Cook's D > 0.5).

Example 2: Fit a model with the IQ-Cranial data set, saving the studentized residuals, leverages and Cook's D values. Which points are outliers or high-leverage, if any? Are there any potential or likely influential points?

There are no outliers (points with studentized residuals greater than 3.00 or less than −3.00).

There are three high-leverage points (leverage > 6/n = 6/20 = 0.300): #70 (16.5, 63), #114 (16, 57), and #151 (25.5, 78).

There are no potential or likely influential points (i.e., no points with Cook's D > 0.5).

Least-Squares Estimation: Predicting Response Values

When the full model is appropriate, we can use it to predict the value of a new observation of the response variable for a given value of the predictor variable. (As we saw in the last lesson, this x-value must be within the scope of the full model.) This prediction has two parts: a point estimate, which we'll call y_new, and a prediction interval, which is conceptually similar to a confidence interval. (We use the word "prediction" because we are predicting an individual value of y rather than estimating the mean μ_y.)

To estimate y_new using StatCrunch, enter the appropriate x-value in the "X value(s):" field under the "Prediction of Y:" section. In the "Level:" field, enter the desired confidence level for the prediction interval.

After fitting the model, you can find y_new on the output screen in the "Predicted values:" table. The point estimate ŷ is under the "Pred. Y" heading and the prediction interval limits are under the "XX% P.I. for new" heading (make sure you use the correct interval):

Round y_new and the interval limits to one more decimal place than the response variable values in the original data set.

To plot the prediction limits for all values of x within the scope of the model, select "--- with prediction interval" in the "Graphs:" section.

Example 3: Fit a model using the Handspan-Height data set and:

a. Construct a scatter plot that includes the least squares line and the 95% prediction interval lines.

b. Construct point and 95% prediction interval estimates of y_new for x = 25.0. Interpret this interval in the context of the problem.

StatCrunch produces the point estimate y_new = 74.5 and the 95% prediction interval 73.5 < y_new < 75.5. We can be 95% confident that a single, randomly selected individual with a handspan of 25.0 cm is between 73.5 and 75.5 inches tall.

If the reduced model is appropriate, or the x-value is outside the scope of the full model, use ȳ, the mean of the response variable, to estimate y_new. This involves simply calculating the mean of the data set's response variable column using the techniques of Lesson 6. As above, round ȳ to one more decimal place than the response variable values.

Example 4: Using the IQ-Cranial data set, construct point and 95% prediction interval estimates of y_new for an individual with a cranial capacity of 56.0. Is it appropriate to calculate a prediction interval estimate in this case?

The reduced model is appropriate, so we use the mean of the IQ data column as the point estimate: y_new = ȳ = 101.0. Since we didn't use a valid regression model to determine y_new, we cannot construct a prediction interval.
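The choice between the two point estimates can be summarized as a small decision rule. The sketch below is only an illustration: it assumes "within the scope of the model" means between the smallest and largest observed x-values, and the function name, the full_model_appropriate flag, and the data are all hypothetical.

```python
def point_estimate(x, y, x_new, full_model_appropriate):
    """Return the fitted value when the full model applies, else the mean of y."""
    n = len(x)
    y_bar = sum(y) / n
    # Assumption: the scope of the model is the range of observed x-values.
    in_scope = min(x) <= x_new <= max(x)
    if not (full_model_appropriate and in_scope):
        return y_bar

    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept + slope * x_new

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.2, 3.8, 6.1, 8.0, 9.9]

print(point_estimate(x, y, 4, True))    # full model, x-value in scope
print(point_estimate(x, y, 10, True))   # x-value out of scope: mean of y
print(point_estimate(x, y, 4, False))   # reduced model: mean of y
```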