DATA SCIENCE PROJECT INSTRUCTIONS
Topic 6 Application of Data Science in Global Economic Analysis
Outline
Panel Data
  Economic relevance
Fixed Effects Regression
  Concept
  Application
  Example
Setting Hypotheses
Data Collection – PolyU
Data Analysis
  Python Setup
  Python Implementation
  Fixed Effects Regression Example
Hypothesis Testing
  Results Interpretation
  Decision Rule
  Conclusions (Example)
Data Analysis: Important Caveat
Panel Data
Panel data (or longitudinal data): data containing observations on
multiple cross-sectional units (e.g., countries) over time, i.e.,
a combination of time-series and cross-sectional data.
It is the most common form of economic data for multiple countries.
Example
Year  Country        Population   Money Supply Growth   Current A/C Balance (% of GDP)
2018  Singapore         5638676   3.90187887            15.40885493
2018  Thailand         69428454   4.667547639            5.610325153
2018  UnitedStates    326838199   4.030420024           -2.181753513
2019  Singapore         5703569   4.951350605           14.26299854
2019  Thailand         69625581   3.636896985            7.019674757
2019  UnitedStates    328329953   8.39189902            -2.240577453
2 years x 3 countries panel = 6 country-year observations
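As a minimal sketch, the example panel above could be held in a pandas DataFrame indexed by country and year. The column names `pop`, `ms_growth` and `cab_gdp` follow the variable names used later in these slides; the names `country` and `year` are assumptions made for illustration.

```python
import pandas as pd

# The six country-year observations from the example table.
rows = [
    (2018, "Singapore", 5638676, 3.90187887, 15.40885493),
    (2018, "Thailand", 69428454, 4.667547639, 5.610325153),
    (2018, "UnitedStates", 326838199, 4.030420024, -2.181753513),
    (2019, "Singapore", 5703569, 4.951350605, 14.26299854),
    (2019, "Thailand", 69625581, 3.636896985, 7.019674757),
    (2019, "UnitedStates", 328329953, 8.39189902, -2.240577453),
]
columns = ["year", "country", "pop", "ms_growth", "cab_gdp"]  # "year"/"country" are assumed names

# A (country, year) MultiIndex makes the panel structure explicit.
panel = pd.DataFrame(rows, columns=columns).set_index(["country", "year"]).sort_index()

n_countries = panel.index.get_level_values("country").nunique()
n_years = panel.index.get_level_values("year").nunique()

print(panel)
print(f"{n_years} years x {n_countries} countries = {len(panel)} country-year observations")
```

Each row is one country-year observation, so selecting `panel.loc["Thailand"]` gives that country's time series and `panel.xs(2018, level="year")` gives one cross-section.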
Fixed Effects Regression
A fixed effects regression is an estimation
technique that controls for UNOBSERVED
individual characteristics that do not vary over
time (i.e., are fixed) but that might affect the
independent variables, the dependent variable,
or both in the regression analysis.
It is usually employed in an analysis involving panel
data.
Example: a multi-country panel data analysis where
each country might have its own unobserved
characteristics that do not vary over time
but that might affect the observed
variables in the analysis.
Fixed Effects Regression: Example
[Figure: MS Growth and Inflation for Country A and Country B, with the fixed effect “a”]
• The fixed effect “a” captures the unobservable differences
between Countries A and B that are not captured by MS Growth but
that might affect either or both variables.
• An example of “a” is the resilience or perseverance of
labor unions in a country.
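The role of the fixed effect can be illustrated with a small simulation. This is only a sketch on made-up data: the true slope (0.5), the number of countries and years, and the way the unobserved effect enters both variables are all assumptions chosen for illustration. It compares a pooled regression that ignores the effect with one that subtracts each country's own mean first.

```python
import numpy as np

rng = np.random.default_rng(42)
n_countries, n_years = 50, 10
true_beta = 0.5  # assumed true effect of x on y

# Unobserved, time-invariant country effect a(i), repeated for every year.
a = rng.normal(size=n_countries)
a_it = np.repeat(a, n_years)
country = np.repeat(np.arange(n_countries), n_years)

# The effect influences both the regressor and the outcome.
x = a_it + rng.normal(size=a_it.size)
y = true_beta * x + 2.0 * a_it + 0.5 * rng.normal(size=a_it.size)

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))

def demean_by_group(v, groups):
    """Subtract each group's mean from its own observations."""
    means = np.bincount(groups, weights=v) / np.bincount(groups)
    return v - means[groups]

pooled_beta = slope(x, y)  # ignores a(i), so it is biased
within_beta = slope(demean_by_group(x, country), demean_by_group(y, country))

print(f"pooled OLS slope:    {pooled_beta:.3f}")
print(f"fixed effects slope: {within_beta:.3f} (true value {true_beta})")
```

In this setup the pooled slope is pushed well above the true value because the omitted effect moves both variables together, while the demeaned regression recovers a value close to 0.5.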
Fixed Effects Regression
Fixed effects regression model:
inflat_cpi(i,t) = β0 + β1 ms_growth(i,t) + β2 pop(i,t)
+ β3 cab_gdp(i,t) + a(i) + ε(i,t)
Dependent variable: inflat_cpi(i,t) is the inflation rate of country i in year t.
Independent (explanatory) variables:
ms_growth(i,t) is the money supply growth.
pop(i,t) is the population.
cab_gdp(i,t) is the current account balance as % of GDP.
Other terms: β0 is the intercept, a(i) is the fixed effect of country i, and ε(i,t) is the error term.
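One standard way to see how the fixed effect is removed is the within (demeaning) transformation, sketched below. The symbols for the intercept and the error term are naming assumptions; bars denote averages over the years observed for country i.

```latex
% Model for country i in year t
\mathit{inflat\_cpi}_{it} = \beta_0 + \beta_1\,\mathit{ms\_growth}_{it}
  + \beta_2\,\mathit{pop}_{it} + \beta_3\,\mathit{cab\_gdp}_{it} + a_i + \varepsilon_{it}

% Average the model over t for each country i (a_i and \beta_0 do not vary with t)
\overline{\mathit{inflat\_cpi}}_{i} = \beta_0 + \beta_1\,\overline{\mathit{ms\_growth}}_{i}
  + \beta_2\,\overline{\mathit{pop}}_{i} + \beta_3\,\overline{\mathit{cab\_gdp}}_{i} + a_i + \bar{\varepsilon}_{i}

% Subtract the second equation from the first: a_i and \beta_0 cancel
\mathit{inflat\_cpi}_{it} - \overline{\mathit{inflat\_cpi}}_{i}
  = \beta_1\bigl(\mathit{ms\_growth}_{it} - \overline{\mathit{ms\_growth}}_{i}\bigr)
  + \beta_2\bigl(\mathit{pop}_{it} - \overline{\mathit{pop}}_{i}\bigr)
  + \beta_3\bigl(\mathit{cab\_gdp}_{it} - \overline{\mathit{cab\_gdp}}_{i}\bigr)
  + \bigl(\varepsilon_{it} - \bar{\varepsilon}_{i}\bigr)
```

The slopes can then be estimated from the demeaned data without ever observing a(i). A variable that never changes within a country would be demeaned to zero, which is why its effect cannot be separated from the fixed effect.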
Setting Hypotheses
Hypotheses:
H1: ms_growth has a positive effect on
inflat_cpi (β1 > 0)
Explanation?
H2: pop has a positive effect on inflat_cpi (β2 > 0)
Explanation?
H3: cab_gdp has a positive effect on inflat_cpi
(β3 > 0)
Explanation?
Data Collection
At the PolyU Library website:
Find > Databases > Databases by Subject > Economics > World Development Indicators (WDI)
Example: data_wdi.csv
Panel dataset of 10 countries x 10 years = 100 country-year observations
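Before estimation it can help to confirm that a downloaded file really has the expected panel shape. The sketch below is a hypothetical helper, not part of the course notebook; the column names `country` and `year` are assumptions about how the file is laid out, and the toy data stands in for the real file.

```python
import pandas as pd

def panel_shape(df, entity_col="country", time_col="year"):
    """Return (number of entities, number of periods, is_balanced)."""
    periods_per_entity = df.groupby(entity_col)[time_col].nunique()
    n_entities = int(periods_per_entity.size)
    n_periods = int(df[time_col].nunique())
    # Balanced: every entity is observed exactly once in every period.
    balanced = bool((periods_per_entity == n_periods).all()) and len(df) == n_entities * n_periods
    return n_entities, n_periods, balanced

# Toy stand-in for the real data (hypothetical values).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2018, 2019, 2018, 2019],
})

print(panel_shape(toy))           # full 2 x 2 panel
print(panel_shape(toy.iloc[:3]))  # one country-year missing
```

For the example dataset one would expect 10 entities, 10 periods and a balanced panel if all 100 observations are present.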
Data Analysis: Python Setup
Programming Language: Python
Platform: Anaconda
Install from:
https://www.anaconda.com/products/individual
Program Editor and Interface:
Jupyter Notebook
Example: Open the Jupyter (IPython) notebook file
Python fixed effect model.ipynb (see Course Website)
Note: The “linearmodels” package might not be
bundled with the Anaconda download. If it is missing,
install it manually from the Anaconda command prompt:
conda install -c conda-forge linearmodels
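A quick way to check whether the package is already available, sketched here with the standard library only (the helper name `has_package` is made up for illustration):

```python
import importlib.util

def has_package(name):
    """True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(name) is not None

if has_package("linearmodels"):
    print("linearmodels is available")
else:
    print("linearmodels is missing; run: conda install -c conda-forge linearmodels")
```

Running this in a notebook cell before the analysis avoids an import error partway through.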
Data Analysis: Python Implementation
After running the Python code (in Jupyter):
Dep. Variable:      inflat_cpi
No. Observations:   100
R-squared:          0.3923
                    Parameter Estimates
=========================================================
             Parameter    Std. Err.     T-stat    P-value
const           14.967       2.0452     7.3182     0.0000
ms_growth       0.0693       0.0272     2.5508     0.0125
pop         -6.161e-08    9.726e-09    -6.3347     0.0000
cab_gdp        -0.0659       0.0611    -1.0783     0.2839
=========================================================
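The course notebook is not reproduced in these slides. As a rough sketch only, an estimate of this kind might be produced with the `PanelOLS` class from linearmodels as shown below. That the notebook uses `PanelOLS` with entity (country) effects, and that data_wdi.csv has columns named `country` and `year`, are assumptions; the helper names are made up for illustration.

```python
import pandas as pd

REGRESSORS = ["ms_growth", "pop", "cab_gdp"]

def to_panel(df, entity_col="country", time_col="year"):
    """Index the data by (entity, time), the layout linearmodels expects."""
    return df.set_index([entity_col, time_col]).sort_index()

def fit_fixed_effects(panel):
    """Regress inflat_cpi on the regressors with country fixed effects."""
    # Imported here so that to_panel() still works if linearmodels is missing.
    from linearmodels.panel import PanelOLS

    exog = panel[REGRESSORS].assign(const=1.0)  # add the intercept column
    model = PanelOLS(panel["inflat_cpi"], exog, entity_effects=True)
    return model.fit()

# Usage (needs data_wdi.csv and linearmodels; column names are assumed):
# panel = to_panel(pd.read_csv("data_wdi.csv"))
# results = fit_fixed_effects(panel)
# print(results.summary)
```

The fitted result exposes the estimates as `results.params`, the p-values as `results.pvalues` and the fit measure as `results.rsquared`. Numbers obtained this way need not match the table above exactly, since they depend on the notebook's actual settings.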
Hypothesis Testing: Results Interpretation
The overall fit of the model is measured by
R-squared (R²):
0 ≤ R² ≤ 1
It is the proportion of the variance of the dependent
variable that is explained by the model.
The larger R² is, the better the model
fits the data.
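The definition can be made concrete with a short calculation. This is a generic sketch on made-up numbers, not the course data; statistical packages may report variants of R² for fixed effects models (for example, computed on demeaned data).

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance of y explained by the fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed and fitted values.
y_obs = [2.0, 4.0, 6.0, 8.0]
y_fit = [2.5, 3.5, 6.5, 7.5]

print(f"R-squared = {r_squared(y_obs, y_fit):.2f}")
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so the model explains 95% of the variance. The reported value of 0.3923 in the example output is read the same way: about 39% of the variation is explained.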
Hypothesis Testing: Results Interpretation
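t-statistic: the estimated coefficient divided by its
standard error.
For a given coefficient estimate, the smaller the standard error
(i.e., the larger the absolute value of the t-statistic), the more
statistically significant is the estimated coefficient.
p-value of the t-statistic: the probability, assuming the null hypothesis (β = 0) is true, of obtaining a t-statistic at least as extreme as the one observed.
Rejecting the null only when the p-value is below a chosen significance level caps the probability of wrongly rejecting a true null (i.e., a Type-I error) at that level.
The smaller the p-value, the stronger the evidence against the null hypothesis (β = 0).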
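The t-statistics in the example output can be re-derived from the reported coefficients and standard errors. The sketch below also computes a two-sided p-value using a normal approximation, which is an assumption made to keep the example dependency-free; software typically uses the Student t distribution, so its p-values (such as the reported 0.0125 for ms_growth) differ slightly.

```python
import math

def t_statistic(coef, std_err):
    """Estimated coefficient divided by its standard error."""
    return coef / std_err

def two_sided_p_normal(t):
    """Two-sided p-value for t under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# (coefficient, standard error) as reported in the example output.
reported = {
    "const": (14.967, 2.0452),
    "ms_growth": (0.0693, 0.0272),
    "pop": (-6.161e-08, 9.726e-09),
    "cab_gdp": (-0.0659, 0.0611),
}

for name, (coef, se) in reported.items():
    t = t_statistic(coef, se)
    print(f"{name:10s} t = {t:8.4f}   approx. p = {two_sided_p_normal(t):.4f}")
```

Because the table values are rounded, the recomputed t-statistics match the reported ones only to about two decimal places.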
Hypothesis Testing: Decision Rule
Significance of Estimated Coefficient:
When p-value ≤ 0.01, one can reject the null
hypothesis of β = 0 at the 1% level of significance.
When p-value ≤ 0.05, one can reject the null
hypothesis of β = 0 at the 5% level of significance.
When p-value > 0.05, one cannot reject the null
hypothesis of β = 0 (at least at the 5% level of significance) => there is insufficient evidence
that the parameter is different from
zero.
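The decision rule can be written as a small function and applied to the p-values from the example output. The function name and the wording of the returned labels are made up for illustration.

```python
def significance(p_value):
    """Apply the 1% / 5% decision rule to a p-value."""
    if p_value <= 0.01:
        return "reject beta = 0 at the 1% level"
    if p_value <= 0.05:
        return "reject beta = 0 at the 5% level"
    return "cannot reject beta = 0 at the 5% level"

# p-values as reported in the example output.
reported_p = {"ms_growth": 0.0125, "pop": 0.0000, "cab_gdp": 0.2839}

for name, p in reported_p.items():
    print(f"{name:10s} p = {p:.4f} -> {significance(p)}")
```

Note that ms_growth passes the 5% threshold but not the 1% threshold, so the strength of the conclusion depends on the level chosen.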
Hypothesis Testing
Conclusions:
H1 is supported: ms_growth has a significantly
positive effect on inflat_cpi (β1 > 0).
H2 is not supported: pop has a significantly
negative effect on inflat_cpi (β2 < 0).
H3 is not supported: The effect of cab_gdp on
inflat_cpi is not significantly different from zero
(p-value is high => cannot reject β3 = 0).
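Each hypothesis predicts a positive sign, so the conclusion depends on both the p-value and the sign of the estimate. A minimal sketch of that reasoning, applied to the reported estimates at an assumed 5% significance level (the function name and labels are illustrative):

```python
def evaluate_positive_effect(coef, p_value, alpha=0.05):
    """Judge a hypothesis of the form 'beta > 0' from an estimate and its p-value."""
    if p_value > alpha:
        return "not supported: not significantly different from zero"
    if coef > 0:
        return "supported: significantly positive"
    return "not supported: significantly negative"

# (coefficient, p-value) as reported in the example output.
estimates = {
    "H1 ms_growth": (0.0693, 0.0125),
    "H2 pop": (-6.161e-08, 0.0000),
    "H3 cab_gdp": (-0.0659, 0.2839),
}

for label, (coef, p) in estimates.items():
    print(f"{label:13s} -> {evaluate_positive_effect(coef, p)}")
```

H2 and H3 are both unsupported, but for different reasons: the pop estimate is significant with the wrong sign, while the cab_gdp estimate is simply too imprecise to distinguish from zero.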
Data Analysis: Important Caveat
Correlation DOES NOT IMPLY Causation
“Storks Deliver Babies (p = 0.008)”: studies the relationship between the number of storks in a country and the number of human births using data from 17 European countries. It finds a statistically significant correlation (p-value = 0.008) between stork populations and human birth rates.
Does this really imply that storks actually deliver babies? No: careless use of correlation and p-values can deliver unreliable conclusions!
Such spurious correlations usually arise from a confounding variable (in this case, country size).
To reduce the possibility of such false and misleading results, it is important to have:
i. a good statistical design that includes all relevant variables, including possible confounders
ii. a compelling conceptual analysis that demonstrates a potential causal mechanism behind the statistical results
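The confounding mechanism can be reproduced on simulated data. This sketch does not use the study's data: the 500 simulated "countries", the coefficients and the noise levels are all assumptions. Storks and births each depend only on country size, yet they appear strongly correlated until size is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # simulated countries (hypothetical; the study used 17)

size = rng.uniform(1.0, 10.0, n)               # confounder: country size
storks = 2.0 * size + rng.normal(0.0, 1.0, n)  # depends on size only
births = 3.0 * size + rng.normal(0.0, 1.5, n)  # depends on size only

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation ignores the confounder.
raw_corr = float(np.corrcoef(storks, births)[0, 1])

# Partial correlation: correlate what is left after removing the effect of size.
partial_corr = float(np.corrcoef(residuals(storks, size), residuals(births, size))[0, 1])

print(f"raw correlation (storks, births):       {raw_corr:.3f}")
print(f"correlation after controlling for size: {partial_corr:.3f}")
```

Including the confounder in the analysis, as point i recommends, is what makes the apparent relationship disappear in this simulation.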