Econometrics homework
The Presentation of Data
Class 2 for Econometrics 1
Vincent Geloso
Class organization
On Monday, we go through the slides, which are divided into two parts:
Graphical presentation
Numerical presentation
On Wednesday, I will complete the lecture and introduce you to Stata
NOTE: The homework I will assign may involve the use of Stata
Graphical presentation
A dirty trick in econometrics: for most « applied » topics, a few basic graphs will generally show you where the results are heading.
Graphics from textbook
An illustration from economic history
It was the heat of the « moment »!
First moment: central tendency (mean)
Second moment: dispersion (variance)
Third moment: skewness (shape of the curve)
Fourth moment: kurtosis (the tails, or the sharpness of the peak, of the distribution; the etymology of kurtosis is the Greek for « arching »)
Measures of Central Tendency
Measures of central tendency tell you about, well, « the central tendencies » of the data (duh!); they correspond to the first moment
But think about it! « Central » refers to a single thing (there can only be one centre to any unit or, in our case, any distribution).
A measure of central tendency tells you what the distribution moves around!
Measures of Central Tendency
Overview of central tendency measures:
Mean: arithmetic average
Median: midpoint of ranked values
Mode: most frequently observed value (if one exists)
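Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency (but it is affected by extreme values!)
For a population of N values:
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
where N is the population size and the \( x_i \) are the population values
For a sample of size n:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
where n is the sample size and the \( x_i \) are the observed values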
Weighted mean
\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
Where \( w_i \) is the weight applied to each value \( x_i \)
Very useful when trying to get measures that are unaffected by other variables
For example: Cuba has an increasing death rate but also has an aging population. If you used the death rates in each age group as they evolve over time but conserved the weights of each group in, say, 1995, would you get a different evolution?
This relates a little to when we mentioned Simpson’s paradox (see this great clip on why these kinds of adjustments may matter a lot!)
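The Cuba question above can be sketched numerically. The snippet below is only an illustration: the three age groups, their death rates and their population shares are hypothetical numbers invented for the example (not Cuban data), and `weighted_mean` is a helper name chosen here.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# HYPOTHETICAL death rates (per 1,000) for three age groups: young, middle, old.
rates_1995 = [1.0, 5.0, 50.0]
rates_2015 = [0.8, 4.0, 45.0]   # every age-specific rate has fallen

# HYPOTHETICAL population shares: the population ages between the two years.
shares_1995 = [0.5, 0.4, 0.1]
shares_2015 = [0.3, 0.4, 0.3]

crude_1995 = weighted_mean(rates_1995, shares_1995)
crude_2015 = weighted_mean(rates_2015, shares_2015)
# Same 2015 rates, but holding the age structure fixed at its 1995 weights.
standardized_2015 = weighted_mean(rates_2015, shares_1995)

print(f"Crude death rate 1995:        {crude_1995:.2f}")
print(f"Crude death rate 2015:        {crude_2015:.2f}")
print(f"2015 rate with 1995 weights:  {standardized_2015:.2f}")
```

With these made-up numbers the crude rate rises even though every age-specific rate falls; fixing the weights at their 1995 values removes the effect of ageing.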
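Geometric Mean
Geometric mean
Used to measure the rate of change of a variable over time
\[ \bar{x}_g = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} \]
Geometric mean rate of return
Measures the status of an investment over time
\[ \bar{r}_g = \left[(1 + x_1)(1 + x_2)\cdots(1 + x_n)\right]^{1/n} - 1 \]
Where \( x_i \) is the rate of return in time period i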
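Example
An investment of $100,000 rose to $150,000 at the end of year one and increased to $180,000 at the end of year two:
Year 1: 50% increase; year 2: 20% increase
What is the mean percentage return over time?
Example
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return:
\[ \bar{x} = \frac{0.50 + 0.20}{2} = 0.35 = 35\% \]
Misleading result: two years of growth at 35% would turn $100,000 into $182,250, not $180,000
Geometric mean rate of return:
\[ \bar{r}_g = \left[(1.50)(1.20)\right]^{1/2} - 1 = \sqrt{1.80} - 1 \approx 0.3416 = 34.16\% \]
Accurate result: two years of growth at 34.16% turn $100,000 into approximately $180,000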
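The investment example can be checked with a few lines of Python. This is a minimal sketch: `geometric_mean_return` is a helper name chosen here, and it assumes returns are given as decimals (0.50 for 50%).

```python
import math


def geometric_mean_return(returns):
    """Geometric mean rate of return for per-period returns given as decimals."""
    growth = math.prod(1 + r for r in returns)   # total growth factor
    return growth ** (1 / len(returns)) - 1


start_value = 100_000
returns = [0.50, 0.20]   # year 1: +50%, year 2: +20%

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

# Compound each mean rate over two years and compare with the true end value.
end_with_arithmetic = start_value * (1 + arithmetic) ** 2
end_with_geometric = start_value * (1 + geometric) ** 2

print(f"Arithmetic mean return: {arithmetic:.4f}")
print(f"Geometric mean return:  {geometric:.4f}")
print(f"End value implied by arithmetic mean: {end_with_arithmetic:,.0f}")
print(f"End value implied by geometric mean:  {end_with_geometric:,.0f}")
```

Only the geometric mean reproduces the actual end value of $180,000 when compounded, which is why it is the accurate measure here.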
Median
In an ordered list, the median is the “middle” number (50% above, 50% below)
Not affected by extreme values
[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
Finding the Median
The location of the median:
\[ \text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data} \]
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that \( \frac{n+1}{2} \) is not the value of the median, only the position of the median in the ranked data
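The position rule can be sketched as follows. The function names and the two small data sets are hypothetical, chosen only to show the odd case, the even case, and how an extreme value leaves the median (but not the mean) unchanged.

```python
def median_position(n):
    """1-based position of the median in the ranked data: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# HYPOTHETICAL data, for illustration only.
odd_data = [4, 1, 3, 2, 100]   # n = 5, includes an extreme value
even_data = [4, 1, 3, 2]       # n = 4

print(median_position(len(odd_data)), median(odd_data))     # position 3.0
print(median_position(len(even_data)), median(even_data))   # position 2.5
print(sum(odd_data) / len(odd_data))                        # mean is pulled up by 100
```

A position of 2.5 means "halfway between the 2nd and 3rd ranked values", which is exactly the even-case rule.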
Mean and median
You can already learn something economically useful from just these two values: comparing them gives you an idea of how large an effect outliers have.
E.g. the ratio of the two over time is sometimes used to study income inequality
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: two dot plots. On the 0 to 14 axis: Mode = 9. On the 0 to 6 axis: No Mode]
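The three cases on this slide (one mode, no mode, several modes) can be sketched in Python. The helper `modes` and the data sets are hypothetical; the convention that "every value occurs once" means "no mode" is an assumption matching the slide, and other conventions exist.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of most frequent values; [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []          # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == top)


# HYPOTHETICAL data, for illustration only.
print(modes([0, 1, 1, 2, 9, 9, 9, 13]))      # one mode
print(modes([0, 1, 2, 3, 4, 5, 6]))          # no mode
print(modes([1, 1, 5, 5, 7]))                # several modes
print(modes(["red", "blue", "red"]))         # works for categorical data too
```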
Mean, median and mode
The three together should give you a rough (though not precise) idea of what the distribution of your data looks like
For example: if you have two modes
Shape of a Distribution
The mean and median can also tell you about the shape of the distribution!
Symmetric or skewed
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
Measures of Dispersion
The problem with « first moments » is that like « first impressions », they can mislead you!
Dispersion measures
Same center, different variation
Measures of variation:
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Measures of variation give information on the spread or variability of the data values.
The range
The range is the difference between the minimum and the maximum, which has the obvious downside of ignoring everything in between
You will probably never use it in any great detail; it is often reported in economics papers simply for the sake of due diligence (as min/max)
Percentiles and Quartiles
Percentiles and Quartiles indicate the position of a value relative to the entire set of data
Generally used to describe large data sets
Example: An IQ score at the 90th percentile means that 10% of the population has a higher IQ score and 90% have a lower IQ score.
Pth percentile = value located in the (P/100)(n + 1)th ordered position
With percentiles, you can compute the interquartile range: the 75th percentile (Q3) minus the 25th percentile (Q1)
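The (P/100)(n + 1) position rule can be sketched as below. Two things are assumptions not stated on the slide: when the position is not a whole number the sketch interpolates linearly between the two neighbouring ranked values, and positions falling outside the data are clamped to the minimum or maximum. The data are the eight sample observations used in the standard deviation calculation example in these slides.

```python
def percentile(values, p):
    """Value at the (p/100)(n + 1)th ordered position (1-based).

    ASSUMPTION: linear interpolation between neighbouring ranks, and
    clamping to the min/max when the position falls outside 1..n.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = (p / 100) * (n + 1)
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower_rank = int(position)            # rank just below the position
    fraction = position - lower_rank      # how far towards the next rank
    below = ordered[lower_rank - 1]
    above = ordered[lower_rank]
    return below + fraction * (above - below)


data = [10, 12, 14, 15, 17, 18, 18, 24]

q1 = percentile(data, 25)       # position 2.25
q2 = percentile(data, 50)       # position 4.5 (the median)
q3 = percentile(data, 75)       # position 6.75
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Statistical packages use several slightly different percentile conventions, so their output may differ a little from this rule on small samples.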
Interquartile range
Practical illustration
Income at P25, P50 (median) and P75, Delaware 1926-1938
Year        Q1    Median        Q3
1926    372.08    512.33   1520.11
1927    313.48    441.83   1308.59
1928    357.39    696.79   1551.48
1929    378.66    744.75   1660.83
1930    386.70    720.02   1377.74
1931    248.94    352.77    925.92
1932    288.13    396.60    967.53
1933    289.17    399.19    917.62
1934    285.32    399.71    876.02
1935    235.54    334.63    715.34
1936    378.65    662.92   1352.93
1937    402.04    879.74   1562.87
1938    390.39    681.50   1456.16
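As a rough sketch, the interquartile range for each year of the Delaware series is simply Q3 minus Q1. The values below are the P25 and P75 figures from the slide, rounded to two decimals; the variable names are chosen here for illustration.

```python
years = list(range(1926, 1939))

# P25 and P75 of income in Delaware, 1926-1938 (slide values, rounded to 2 decimals).
q1 = [372.08, 313.48, 357.39, 378.66, 386.70, 248.94, 288.13,
      289.17, 285.32, 235.54, 378.65, 402.04, 390.39]
q3 = [1520.11, 1308.59, 1551.48, 1660.83, 1377.74, 925.92, 967.53,
      917.62, 876.02, 715.34, 1352.93, 1562.87, 1456.16]

# Interquartile range per year: IQR = Q3 - Q1.
iqr_by_year = {year: round(high - low, 2) for year, low, high in zip(years, q1, q3)}

for year, iqr in iqr_by_year.items():
    print(year, f"{iqr:8.2f}")

widest = max(iqr_by_year, key=iqr_by_year.get)
narrowest = min(iqr_by_year, key=iqr_by_year.get)
print("Widest spread:", widest, "Narrowest spread:", narrowest)
```

The spread between the 25th and 75th percentiles is widest in 1929 and narrowest in 1935 in this series.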
Interquartile range
Very useful in discussions of income inequality, especially the « Great Compression » of the 1940s-1950s (incomes below the 90th percentile grew, and grew closer together)
Measures of Dispersion
By the way, the coefficient of variation is quite useful even without any econometrics. It can serve to assess market integration and convergence to the law of one price over time in economic history.
Population Variance
Average of squared deviations of values from the mean
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Where
\( \mu \) = population mean
N = population size
\( x_i \) = ith value of the variable x
Sample Variance
Average (approximately) of squared deviations of values from the mean
Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Where
\( \bar{x} \) = arithmetic mean
n = sample size
\( x_i \) = ith value of the variable x
Population Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
Sample Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Sample standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Comparing Standard Deviations
[Figure: three dot plots on a common 11 to 21 axis; Mean = 15.5 for each data set]
Data A: s = 3.338 (compare to the two cases below)
Data B: s = 0.926 (values are concentrated near the mean)
Data C: s = 4.570 (values are dispersed far from the mean)
Measuring dispersion
Small standard deviation
Large standard deviation
Calculation Example: Sample Standard Deviation
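Sample Data (\( x_i \)): 10 12 14 15 17 18 18 24
n = 8, Mean = \( \bar{x} \) = 16
\[ s = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} \approx 4.3095 \]
A measure of the “average” scatter around the mean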
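The calculation can be reproduced step by step in Python. This sketch computes the sample standard deviation by hand (dividing by n - 1) and cross-checks it against the standard library's `statistics.stdev`; the population version (dividing by N) is shown only for comparison.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

sample_variance = sum_of_squares / (n - 1)     # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

population_variance = sum_of_squares / n       # divide by N for a population
population_sd = math.sqrt(population_variance)

print(f"n = {n}, mean = {mean}")
print(f"sum of squared deviations = {sum_of_squares}")
print(f"sample s = {sample_sd:.4f}, population sigma = {population_sd:.4f}")

# Cross-check against the standard library.
print(statistics.stdev(data), statistics.pstdev(data))
```

Note that the sample figure is slightly larger than the population figure: dividing by n - 1 rather than n is why the sample variance is only "approximately" an average of squared deviations.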
Coefficient of Variation
Measures relative variation
Always expressed as a percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of data measured in different units
Population coefficient of variation:
\[ CV = \frac{\sigma}{\mu} \cdot 100\% \]
Sample coefficient of variation:
\[ CV = \frac{s}{\bar{x}} \cdot 100\% \]
Comparing Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
Stock B:
Average price last year = $100
Standard deviation = $5
\[ CV_A = \frac{5}{50} \cdot 100\% = 10\% \qquad CV_B = \frac{5}{100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price
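The stock comparison is a one-line computation. A minimal sketch, with `coefficient_of_variation` as a helper name chosen here:

```python
def coefficient_of_variation(standard_deviation, mean):
    """Coefficient of variation in percent: 100 * sd / mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * standard_deviation / mean


cv_stock_a = coefficient_of_variation(standard_deviation=5, mean=50)
cv_stock_b = coefficient_of_variation(standard_deviation=5, mean=100)

print(f"Stock A: CV = {cv_stock_a:.1f}%")
print(f"Stock B: CV = {cv_stock_b:.1f}%")
```

The same $5 standard deviation is twice as large relative to a $50 price as it is relative to a $100 price.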
The coefficient of variation
The coefficient of variation is also super useful when the time comes to measure « market integration ».
Remember that in micro (and macro), you were shown the law of one price (prices net of transport costs should converge and, if there are no costs to transport, equalize).
How do you measure convergence?
The coefficient of variation for the sample of all places-prices is one way!
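As a sketch of the idea, compute the coefficient of variation of prices across places in each year and watch whether it falls. The prices below are entirely hypothetical (four imaginary markets in two imaginary years), and `cv_percent` is a helper name chosen here; it uses the sample standard deviation.

```python
import statistics


def cv_percent(prices):
    """Sample coefficient of variation of a list of prices, in percent."""
    return 100 * statistics.stdev(prices) / statistics.mean(prices)


# HYPOTHETICAL prices of the same good in four markets, for two years.
prices_by_year = {
    1850: [8.0, 10.0, 12.0, 14.0],
    1900: [10.0, 10.5, 11.0, 11.5],
}

cv_by_year = {year: cv_percent(prices) for year, prices in prices_by_year.items()}

for year, cv in cv_by_year.items():
    print(year, f"CV = {cv:.2f}%")
```

A falling coefficient of variation across places means prices are moving closer together relative to their average level, which is what convergence towards the law of one price would look like.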
Skewness
Notice one important thing we did when dealing with measures of dispersion: their computation included a measure of central tendency (the mean).
At each « moment », we include previous moments to arrive at values.
In the present case, we want to know about skewness
Skewness
As we will see in the next unit, a lot of econometrics relies on the « normal distribution » (i.e. bell curve).
The two best measures we will use in this class are the Pearson skewness coefficient and a variant on it