Econometrics homework

Correlation

Class 4 for Econometrics 1

Vincent Geloso

Correlation

This is going to be the simplest of the themes we will discuss.

A correlation speaks to the relation between two variables as well as the degree of that relation.

Correlation

First, define the two variables (X and Y) which you think are relevant to study together:

E.g. 1 : Real Wages/Average Productivity to Unemployment

E.g. 2 : Heights (in cm) of a person to income of parents at birth

E.g. 3 : Pirate population to climate change

Notice something important already by asking: what is different between examples 1 and 2 on the one hand and example 3 on the other? (I.e. why do 1 and 2 seem reasonable while 3 does not?)

Correlation

In our case, let’s take this dataset from the textbook (see here).

There are tons of variables (which are described on page 498 of the textbook).

Let’s pick unemployment (X) and relief (Y)

Notice that each row has a value of X associated with a single unique value of Y. These are pairs of values that come together, so we can draw them.

What do you see when you look at this graph?

What did I add in this graph that was not there on the previous one?

Think about it by looking at the cloud of dots and the density of that cloud.

Correlation

So, what we see (intuitively) is that there is a positive relationship between the two variables.

However, correlation is not causation (duhhh….pirates example – who thinks that pirates cause climate change? (or even the reverse?)).

Also, what if the relation looks like this?

Is there a relationship between X and Y in this one?

Correlation is merely a descriptive tool: it has great strengths, but it can miss a lot and can be misused (correlation is not causation).

This is also because correlation is a linear measure (there are others that we can use, but we won't even discuss them in Econometrics II; we remain linear for the whole term).
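To see what "linear measure" means in practice, here is a minimal sketch with made-up numbers (not the textbook dataset). Y is completely determined by X, yet the relation is U-shaped, so r comes out at zero. The helper name pearson_r is an assumption used only for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: Y = X squared, a perfect but non-linear relation.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(r_curved)  # zero: r only picks up the linear part of a relation
```

The positive and negative deviations cancel out exactly, which is why a strong curved relation can hide behind an r of zero.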

And what about that guy? Does he mess things up?

This is known as an outlier
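A small sketch of how much a single outlier can matter, using made-up numbers chosen for illustration (they are not from any dataset in this class). Five points with no linear relation get one extra observation far from the cloud.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: five points with essentially no linear relation.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 3, 1, 2]
r_without = pearson_r(xs, ys)

# Add one hypothetical observation far away from the rest of the cloud.
r_with = pearson_r(xs + [20], ys + [20])

print(f"without outlier: {r_without:.3f}")
print(f"with outlier:    {r_with:.3f}")
```

With the extra point, r jumps from roughly zero to above 0.9, even though the other five observations have not changed.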

Why outliers matter

In the 1970s, research emerged suggesting that the New Deal spending of President FDR was geared towards re-election odds (i.e. electorally important states received funds even if they were not those in need of the funds).

Later, in the 1990s, someone questioned this by pointing out that a single state (Nevada) was driving the results (in part because of the way these smaller states were included; this speaks to the functional form of a statistical test, which we will only discuss in Econometrics II, so never mind that particularly small point).

The Correlation Coefficient ( r )

Intuition about correlation

Look at the deviation from the means!!!! That is the correlation!

Correlation table

This is a correlation table that applies to pre-famine (1845) Ireland. All the counties of Ireland are the population, and there are different variables whose relations are presented.

The values in the table are the r values.

Why aren't all the cells filled in, though? The table seems empty, no? Or is it…
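One way to think about the half-empty table: r between X and Y is the same as r between Y and X, and any variable has r = 1 with itself. The sketch below checks this on three hypothetical variables (A, B and C are invented for illustration; they are not the Irish county data).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for three variables.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0],
    "C": [5.0, 3.0, 4.0, 1.0, 2.0],
}
names = list(data)

# Full table: every variable against every other variable.
matrix = {(a, b): pearson_r(data[a], data[b]) for a in names for b in names}

# Print only the lower triangle; the upper one would repeat the same numbers.
for i, a in enumerate(names):
    row = [f"{matrix[(a, b)]:6.2f}" for b in names[: i + 1]]
    print(a, " ".join(row))
```

Since the upper triangle mirrors the lower one, leaving it blank loses no information.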

To get « arrrghhhh » (phonetic joke for r)

You first need the covariance (see right)

Covariance says this:

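Cov(x,y) > 0 : x and y tend to move in the same direction

Cov(x,y) < 0 : x and y tend to move in opposite directions

Cov(x,y) = 0 : x and y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not prove independence)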
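For reference, the standard textbook definition of the covariance of a population can be written as follows. The notation is an assumption: N is the population size, and \mu_x and \mu_y are the population means of x and y.

```latex
\operatorname{Cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
```

Each term is the product of the two deviations from the means: it is positive when x and y are on the same side of their means and negative when they are on opposite sides.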

To get « arrrghhhh » (phonetic joke for r)

If it is the covariance of a sample (above, it was the formula for a population), the equation remains the same except that we subtract one from the denominator (we divide by n - 1 instead of N).

(for a sample)
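As a rough sketch of the difference between the two denominators, the code below computes both versions of the covariance for the ten pairs of test scores that appear in the spreadsheet at the end of this deck. The variable names are assumptions chosen for the illustration.

```python
# Test scores taken from the spreadsheet data in this deck.
test1 = [78, 92, 86, 83, 95, 85, 91, 76, 88, 79]
test2 = [82, 88, 91, 90, 92, 85, 89, 81, 96, 77]

n = len(test1)
mean1 = sum(test1) / n
mean2 = sum(test2) / n

# Sum of the products of the deviations from the means.
cross = sum((a - mean1) * (b - mean2) for a, b in zip(test1, test2))

cov_population = cross / n        # treat the ten students as the whole population
cov_sample = cross / (n - 1)      # treat them as a sample: one less in the denominator

print(cov_population, cov_sample)
```

The numerator is identical in both cases; only the denominator changes, so the sample covariance is always a little larger in absolute value.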

To get « arrrghhhh » (phonetic joke for r)

With the covariance, you can get r by doing what is in the box on your right!

(if it is for a population, you use the covariance for the population and the standard deviations for the populations – if it is for the sample, you use the equation in the box there)
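For reference, the usual way to write this step is given below. The symbols are assumptions: \sigma_x and \sigma_y are population standard deviations, while s_{xy}, s_x and s_y are the sample covariance and the sample standard deviations.

```latex
\rho = \frac{\operatorname{Cov}(x, y)}{\sigma_x \, \sigma_y}
\quad \text{(population)}
\qquad
r = \frac{s_{xy}}{s_x \, s_y}
\quad \text{(sample)}
```

Dividing by the two standard deviations removes the units, which is why r always falls between -1 and 1.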

How to read « r »

r does not tell us the nature of the relationship (as with the high r in the graph of Nicolas Cage film appearances and drownings); it only tells us how strongly the two variables seem to be related. To use a less extreme example, two variables can be related according to the correlation coefficient yet in fact be independent of each other and both dependent on a third variable, which gives us the impression that they move together.

The strength of the r is harder to « measure » : what is strong? 0.8? 0.2? On this, it depends on what you expected (i.e. your priors).

An illustration of the weakness of reading r : the spirit level

In the early 2000s, two British academics became famous for proposing the idea of the « spirit level » : that inequality in economic dimensions meant worsening outcomes on other dimensions (social, health, crime etc.)

This was heavily criticized -- in part because they used correlations (mostly).

Using a single year, they showed this relation between inequality and life expectancy

Using another year before or after the one they used changed the relation – it inverted it!

The r of -0.44 became 0.02

Tweaking correlation when the data is not continuous

Interval or ratio measurements are not the same as ranks. Ranks (like saying that Canadian historians think Mackenzie Bowell was a worse prime minister than Pierre Elliott Trudeau) tell us nothing about the difference between the ranks (i.e. how much worse Bowell was is not a question that we can answer with such « discrete » data).

The substitute we have is the « Spearman’s Rank Correlation Coefficient ».
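For reference, the standard form of Spearman's coefficient when there are no tied ranks is shown below. The notation is an assumption: D_i is the difference between the two ranks of observation i, and n is the number of observations.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n (n^2 - 1)}
```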

Example: Presidential Rankings

The rank of presidential greatness (made by panels of historians) is in the first column. The second column is the number of soldiers killed in the line of duty per 100,000 Americans. The third column is the ordered ranking of the second column (see Lincoln).

The difference between the two ranks gives us Di, which is squared and summed in the equation above; the only other thing you need is n, the number of observations.

Question: Does it matter if I do C-SPAN rank minus MDPC rank instead of MDPC rank minus C-SPAN rank?
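One way to check your answer to this question is to compute the coefficient both ways. The sketch below uses the standard no-ties Spearman formula on hypothetical ranks for five items (these are not the presidential rankings from the slide), and the function name spearman_rs is an assumption.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman rank correlation for two lists of ranks with no ties."""
    n = len(ranks_a)
    # D_i is the difference between the two ranks of each item.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks for five items.
first_ranking = [1, 2, 3, 4, 5]
second_ranking = [2, 1, 4, 3, 5]

rs_ab = spearman_rs(first_ranking, second_ranking)  # first minus second
rs_ba = spearman_rs(second_ranking, first_ranking)  # second minus first

print(rs_ab, rs_ba)
```

Both calls return the same number because each difference is squared before it is summed, so its sign disappears.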

Using Excel to Find the Correlation Coefficient

Select Data / Data Analysis

Choose Correlation from the selection menu

Click OK . . .

Using Excel to Find the Correlation Coefficient

Input data range and select appropriate options

Click OK to get output

Interpreting the Result

r = .733

There is a relatively strong positive linear relationship between test score #1 and test score #2

Students who scored high on the first test tended to score high on second test
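The Excel result can be cross-checked by hand. This is a minimal sketch that uses the sample covariance and sample standard deviations on the test scores from the spreadsheet in this deck; it relies only on Python's standard statistics module, and the variable names are assumptions.

```python
from statistics import mean, stdev

# Test scores taken from the spreadsheet data in this deck.
test1 = [78, 92, 86, 83, 95, 85, 91, 76, 88, 79]
test2 = [82, 88, 91, 90, 92, 85, 89, 81, 96, 77]

n = len(test1)
m1, m2 = mean(test1), mean(test2)

# Sample covariance: deviations from the means, divided by n - 1.
sample_cov = sum((a - m1) * (b - m2) for a, b in zip(test1, test2)) / (n - 1)

# r is the covariance divided by the product of the two sample standard deviations.
r = sample_cov / (stdev(test1) * stdev(test2))

print(round(r, 3))  # 0.733, matching the Excel output
```

Using the population versions of the covariance and standard deviations gives the same r, because the denominators cancel out.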

[Figure: Relief payments to the poor in England, 1831. Scatter plot shown twice, with relief expenditures of each parish per head of population (shillings, 0 to 50) on one axis and the ratio of unemployed laborers to wage laborers (0 to 0.6) on the other.]

[Figure: Scatter Plot of Test Scores. Test #1 Score against Test #2 Score, both axes running from 70 to 100.]

Sheet4: Excel correlation output

                 Test #1 Score    Test #2 Score
Test #1 Score    1
Test #2 Score    0.7332437047     1

Sheet1: test score data

Test #1 Score    Test #2 Score
78               82
92               88
86               91
83               90
95               92
85               85
91               89
76               81
88               96
79               77
