Week 3 Discussion: Analyzing Correlation
Scatterplots
Does smoking cause lung cancer? Does low unemployment lead to inflation? Does human use of
fossil fuel cause global warming? A major goal of statistical studies is to determine if there is a
relationship between different variables. Once we know there is a relationship between the variables,
we can try to determine if one variable causes the other. One of the first steps in this process is to
make a scatterplot. A scatterplot is a diagram that represents the relationship between two
quantitative variables. It is a plot of paired values (x, y), with the horizontal axis representing the first, or x,
variable and the vertical axis representing the second, or y, variable. In choosing the x and y variables,
x is the explanatory variable and y is the response variable. The choice of the x and y variables is
important. Ask yourself which variable depends on the result of the other, and that will be the response
or y variable. The pattern of the dots in a scatterplot is important in determining whether there is a
correlation or relationship between the two variables.
Interpreting scatterplots involves examining the overall pattern and looking for deviations from that pattern, or
outliers. The overall pattern can be described using the direction, form, and strength of the
relationship. The direction would state whether the two variables have a positive or negative
relationship. In the case of a positive relationship, both variables move in the same direction. As one
variable increases, so does the other; and as one variable decreases, so does the other. A negative
direction would see the variables move in opposite directions. As one variable increases, the other
decreases. On the scatterplot, points in a positive relationship rise from left to right, while points in a
negative relationship fall from left to right. The form of the relationship describes whether the
points seem to cluster in the shape of a straight line, a parabola, or a cubic curve. We will only study
linear, or straight-line, relationships. The strength of the relationship is seen in how closely the points
are clustered around a line. The measure of strength is the correlation coefficient, which
we will discuss next.
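As a preview of how strength is measured, the correlation coefficient most often used for linear relationships is the Pearson sample correlation, r. The formula below is the standard textbook definition rather than something taken from this handout; here n is the number of (x, y) pairs, and x-bar and y-bar are the sample means of the two variables.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of r always lies between -1 and 1. Its sign matches the direction of the relationship, and values close to -1 or 1 correspond to points that cluster tightly around a straight line.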
There are a number of ways to create a scatterplot. Using technology is preferred.
Stat Disk Example: We want to determine if the weight of a car is related to its city miles per gallon.
Solution: Open Stat Disk and choose Data Sets, 12th edition of the textbook, and the file Car
Measurements. The data will populate the spreadsheet. Refer to page 4 in the Stat Disk User’s Manual
for directions on how to open a data file. Once you have the dataset displayed, click on Data,
Scatterplot, and choose column 3, weight as the x or explanatory variable, and column 8, city MPG,
as the y or response variable. Then click Evaluate. Refer to page 13 of the Stat Disk User’s Manual for
help. The resulting scatterplot is displayed to the left. Note that there is a line drawn on the
scatterplot. This is called the regression line or line of best fit, and we will discuss that later. Note
that the dots and line have a downward direction. As in algebra, this means that the slope of the line
is negative: as the x variable increases, the y variable decreases. In this example, the city
MPG decreases as the weight of the car increases. The cluster of points does seem to form a
straight line, so the form is linear. The relationship appears to be strong because the points are clustered
closely around the line.
TI-84 Example: Scientists have examined data on sea surface temperature and coral growth per year at
locations in the Red Sea. Determine the explanatory and response variables and create a scatterplot using
the data in the table below.
Sea surface temperature:  29.68  29.87  30.16  30.22  30.48  30.65  30.90
Coral growth:              2.63   2.58   2.60   2.48   2.26   2.38   2.26
Solution: The coral growth would depend on the temperature, so growth is the response variable and
temperature is the explanatory variable. To make the scatterplot on the TI-84, press STAT, choose 1: Edit,
and enter temperature in L1 and growth in L2. Then press STAT PLOT, choose 1, and press Enter. Turn
Plot 1 on and verify the information. Press Zoom and choose 9 to plot the data. Directions for this are
given in the D2L classroom TI Technology Manual. This relationship has a negative direction and appears
to be linear, and it appears to be strong because the points are clustered close to a line.
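For readers without a TI-84, the same check can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the course materials: the function names pearson_r and text_scatter are made up for this example, the correlation formula is the standard Pearson definition, and the character-grid plot is only a rough stand-in for the calculator's graph. The data are copied from the table above.

```python
from math import sqrt

# Data copied from the table above (the handout does not state units).
temperature = [29.68, 29.87, 30.16, 30.22, 30.48, 30.65, 30.90]  # explanatory (x)
growth = [2.63, 2.58, 2.60, 2.48, 2.26, 2.38, 2.26]              # response (y)


def pearson_r(xs, ys):
    """Return the sample correlation coefficient of the paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Assumes neither variable is constant; otherwise this divides by zero.
    return sxy / sqrt(sxx * syy)


def text_scatter(xs, ys, width=40, height=10):
    """Return a rough character-grid scatterplot (x across, y up)."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is higher
    return "\n".join("".join(line) for line in grid)


r = pearson_r(temperature, growth)
direction = "negative" if r < 0 else "positive"
print(text_scatter(temperature, growth))
print(f"r = {r:.2f} ({direction} direction)")
```

For these seven pairs, r comes out near -0.89, which agrees with the visual impression of a strong, negative, roughly linear relationship.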
Example: There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The table on the next page gives data on yearly wine consumption (liters of alcohol from drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations.
(a) Make a scatterplot that shows how national wine consumption helps explain heart disease death rates. (b) Describe the form of the relationship. Is there a linear pattern? How strong is the relationship? Is the direction of the association positive or negative? Explain in simple language what this says about wine and heart disease. Do you think these data give good evidence that drinking wine causes a reduction in heart disease deaths? Why?
Solution: (a) Enter the alcohol from wine into the first column and heart disease deaths into the second column in Stat Disk. Choose Data, Scatterplot, x as column 1 and y as column 2, and click Plot. The resulting scatterplot is shown to the left. (b) The points tend to move downward, which means a negative relationship. This means that as the alcohol consumption increases, the number of deaths from heart disease decreases. The relationship appears to be strong because the points are fairly close to the line. Later we will discuss calculating the correlation coefficient as a measure of the strength of the relationship. This particular relationship has a correlation coefficient of -0.84, which qualifies as a strong, negative correlation.
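This is strong evidence of an association between moderate wine consumption and lower heart disease death rates. Because these are observational data on whole nations, however, the correlation alone does not show that drinking wine causes the reduction; other differences between the nations, such as diet or lifestyle, could also play a role.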