Econometrics homework
Correlation
Class 4 for Econometrics 1
Vincent Geloso
Correlation
This is going to be the simplest of the themes we will discuss.
A correlation speaks to the relation between two variables as well as the degree of that relation.
Correlation
First, define the two variables (X and Y) which you think are relevant to study together:
E.g. 1 : Real Wages/Average Productivity to Unemployment
E.g. 2 : Heights (in cm) of a person to income of parents at birth
E.g. 3 : Pirates to population to climate change
Notice something important already by asking: what is different between examples 1 and 2 on the one hand and example 3 on the other? (I.e. why do 1 and 2 seem reasonable while 3 does not?)
Correlation
In our case, let’s take this dataset from the textbook (see here).
There are tons of variables (which are described on page 498 of the textbook).
Let’s pick unemployment (X) and relief (Y)
Notice that each row pairs a value of X with a single value of Y. These pairs of values come together, so we can plot them.
What do you see when you look at this graph?
What did I add in this graph that was not there on the previous one?
Think about it by looking at the cloud of dots and the density of that cloud.
Correlation
So, what we see (intuitively) is that there is a positive relationship between the two variables.
However, correlation is not causation (duhhh… think of the pirates example: who thinks that pirates cause climate change, or even the reverse?).
Also, what if the relation looks like this?
Is there a relationship between X and Y in this one?
Correlation is merely a descriptive tool: it has great strengths, but it can miss a lot and can be misused (correlation is not causation).
This is also because correlation is a linear measure (there are others that we can use, but we won’t even discuss them in Econometrics II; we remain linear for the whole term).
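To see what "linear measure" means in practice, here is a minimal sketch with made-up numbers (not the textbook dataset). Y is completely determined by X, but through a U-shaped curve, and r still comes out at zero. The helper name `pearson_r` is just an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient built from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: a perfect but non-linear (U-shaped) relation between X and Y.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```

The positive and negative deviations cancel out exactly, so a reader looking only at r would conclude there is "no relationship" even though the scatter plot shows an obvious one.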
And what about that guy? Does he mess things up?
This is known as an outlier
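A small sketch of how much a single outlier can matter, using invented numbers purely for illustration. Six points with a clear positive relation are joined by one extreme observation, and the sign of r flips.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient built from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data: six observations that move together fairly closely.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
r_without = pearson_r(xs, ys)

# Same data plus one extreme observation far from the cloud of dots.
r_with = pearson_r(xs + [20], ys + [0])

print(round(r_without, 2), round(r_with, 2))
```

With only seven observations, one point is enough to turn a strong positive r into a negative one, which is why it is worth looking at the graph before trusting the number.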
Why outliers matter
In the 1970s, research emerged suggesting that the New Deal spending of President FDR was geared towards re-election odds (i.e. electorally important states received funds even if they were not those in need of the funds).
Later, in the 1990s, someone questioned this by pointing out that a single state (Nevada) was driving the results. This was partly because of the way these smaller states were included, a point which speaks to the functional form of a statistical test and which we will only discuss in Econometrics II, so never mind that particularly small point.
The Correlation Coefficient ( r )
Intuition about correlation
Look at the deviation from the means!!!! That is the correlation!
Correlation table
This is a correlation table that applies to pre-famine (1845) Ireland. All the counties of Ireland are the population, and the table presents the relations between different variables.
The values in the table are the r values.
Why aren’t all the cells filled up though? The table seems empty, no? Or is it…
To get « arrrghhhh » (phonetic joke for r)
You first need the covariance (see right)
Covariance says this:
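Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship (independent variables have zero covariance, but zero covariance does not by itself prove independence)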
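For reference, the standard textbook definition of the population covariance can be written as follows. The notation is an assumption made here: N is the number of observations in the population, and the two mu terms are the population means of x and y.

```latex
\operatorname{Cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
```

Each term in the sum is positive when x and y are on the same side of their means and negative when they are on opposite sides, which is where the three sign cases come from.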
To get « arrrghhhh » (phonetic joke for r)
If it is the covariance of a sample (above, it was the formula for a population), the equation remains the same except that we subtract « one » from the denominator (we divide by the number of observations minus one) and use the sample means.
(for a sample)
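In its standard form, the sample covariance looks like this. The notation is assumed here: n is the sample size, and the barred terms are the sample means.

```latex
s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```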
To get « arrrghhhh » (phonetic joke for r)
With the covariance, you can get r by doing what is in the box on your right!
(if it is for a population, you use the covariance for the population and the standard deviations for the populations – if it is for the sample, you use the equation in the box there)
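Written out for a sample, the usual definition of r divides the covariance by the product of the two standard deviations. The symbols are assumed notation: the numerator is the sample covariance, and the two terms in the denominator are the sample standard deviations of x and y.

```latex
r = \frac{s_{xy}}{s_x \, s_y},
\qquad
s_x = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
s_y = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Dividing by the standard deviations removes the units, so r always lies between -1 and 1. The n - 1 terms cancel between numerator and denominator, so the population version gives the same r for the same data.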
How to read « r »
r does not tell us the nature of the relationship (just like the high r in the graph of Nicolas Cage appearances and drownings); it only tells us how strongly related the two variables seem to be. Using a less extreme example, two variables can be related according to the correlation coefficient but in fact be independent of each other and both dependent on a third variable that gives us the impression that they move together.
The strength of the r is harder to « measure » : what is strong? 0.8? 0.2? On this, it depends on what you expected (i.e. your priors).
An illustration of the weakness of reading r: the spirit level
In the early 2000s, two British academics became famous for proposing the idea of the « spirit level » : that inequality in economic dimensions meant worsening outcomes on other dimensions (social, health, crime etc.)
This was heavily criticized, in part because they relied mostly on correlations.
Using a single year, they showed this relation between inequality and life expectancy
Using another year before or after the one they used changed the relation – it inverted it!
The r of -0.44 became 0.02
Tweaking correlation when the data is not continuous
Interval or ratio measurements are not the same as ranks. Ranks (like saying that Canadian historians think Mackenzie Bowell was a worse prime minister than Pierre Elliott Trudeau) tell us nothing about the size of the difference between the ranks (i.e. how much worse Bowell was is not a question that we can answer with such « discrete » data).
The substitute we have is the « Spearman’s Rank Correlation Coefficient ».
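The usual textbook form of Spearman's coefficient, valid when there are no tied ranks, is given here for reference. The subscript on r is assumed notation: the D terms are the differences between the two ranks of each observation, and n is the number of observations.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n (n^2 - 1)}
```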
Example: Presidential Rankings
The first column is the rank of presidential greatness (made by panels of historians). The second column is the number of soldiers killed in duty per 100,000 Americans. The third column is the ordered ranking of the second column (see Lincoln).
The difference between the two ranks gives us Di, which is squared in the equation above; the rest just uses n, the number of observations.
Question: Does it matter if I do C-SPAN rank minus MDPC rank instead of MDPC rank minus C-SPAN rank?
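One way to check the question is to compute the coefficient both ways. The sketch below uses five invented pairs of ranks (not the actual presidential data) and the standard no-ties formula, 1 minus 6 times the sum of squared rank differences over n(n² - 1). The function name `spearman_rs` is an illustrative choice.

```python
def spearman_rs(rank_a, rank_b):
    """Spearman's rank correlation from two lists of ranks (assumes no ties)."""
    n = len(rank_a)
    # Each rank difference is squared, so its sign cannot affect the sum.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Invented ranks for five hypothetical observations.
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 4, 3, 5]

rs_a_minus_b = spearman_rs(rank_a, rank_b)
rs_b_minus_a = spearman_rs(rank_b, rank_a)
print(rs_a_minus_b, rs_b_minus_a)
```

Both orders give the same value because only the squared differences enter the formula.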
Using Excel to Find the Correlation Coefficient
Select Data / Data Analysis
Choose Correlation from the selection menu
Click OK . . .
Using Excel to Find the Correlation Coefficient (continued)
Input data range and select appropriate options
Click OK to get output
Interpreting the Result
r = .733
There is a relatively strong positive linear relationship between test score #1 and test score #2
Students who scored high on the first test tended to score high on second test
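As a cross-check on the Excel output, here is a minimal Python sketch that rebuilds r by hand from the ten pairs of test scores in the data sheet: sample covariance first, then division by the two sample standard deviations. Variable names are illustrative.

```python
import math

# The ten pairs of scores from the data sheet (Sheet1).
test1 = [78, 92, 86, 83, 95, 85, 91, 76, 88, 79]
test2 = [82, 88, 91, 90, 92, 85, 89, 81, 96, 77]

n = len(test1)
mean1 = sum(test1) / n
mean2 = sum(test2) / n

# Sample covariance: sum of cross-deviations divided by n - 1.
cov_sample = sum((a - mean1) * (b - mean2) for a, b in zip(test1, test2)) / (n - 1)

# Sample standard deviations, also with n - 1 in the denominator.
sd1 = math.sqrt(sum((a - mean1) ** 2 for a in test1) / (n - 1))
sd2 = math.sqrt(sum((b - mean2) ** 2 for b in test2) / (n - 1))

r = cov_sample / (sd1 * sd2)
print(round(r, 3))  # 0.733
```

The result agrees with the 0.7332437047 reported in the Excel correlation output.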
[Figure: Relief payment to the poor in England, 1831. Scatter plot. Vertical axis: relief expenditures of each parish per head of population (shillings), 0 to 50. Horizontal axis: ratio of unemployed laborers to wage laborers, 0 to 0.6.]
[Figure: Relief payments to the poor in England, 1831. Second version of the scatter plot, with the same axes.]
[Figure: Scatter Plot of Test Scores. Axes: Test #1 Score and Test #2 Score, each running from 70 to 100.]
Sheet4
| | Test #1 Score | Test #2 Score |
| Test #1 Score | 1 | |
| Test #2 Score | 0.7332437047 | 1 |
Sheet1
| Test #1 Score | Test #2 Score |
| 78 | 82 |
| 92 | 88 |
| 86 | 91 |
| 83 | 90 |
| 95 | 92 |
| 85 | 85 |
| 91 | 89 |
| 76 | 81 |
| 88 | 96 |
| 79 | 77 |