DATA SCIENCE PROJECT INSTRUCTIONS

Topic 6 Application of Data Science in Global Economic Analysis

Outline

• Panel Data
  ◦ Economic relevance
• Fixed Effects Regression
  ◦ Concept
  ◦ Application
• Example
  ◦ Setting Hypotheses
  ◦ Data Collection – PolyU
  ◦ Data Analysis
    ▪ Python Setup
    ▪ Python Implementation
    ▪ Fixed Effects Regression Example
  ◦ Hypothesis Testing
    ▪ Results Interpretation
    ▪ Decision Rule
  ◦ Conclusions (Example)
• Data Analysis: Important Caveat

Panel Data

• Panel data (or longitudinal data): observations on several cross-sectional units (e.g. countries) followed over time, i.e. a combination of time-series and cross-sectional data.
  ◦ The most common form of economic data covering multiple countries.
• Example

Year  Country       Population  Money Supply Growth  Current A/C Balance (% of GDP)
2018  Singapore     5638676     3.90187887           15.40885493
2018  Thailand      69428454    4.667547639          5.610325153
2018  UnitedStates  326838199   4.030420024          -2.181753513
2019  Singapore     5703569     4.951350605          14.26299854
2019  Thailand      69625581    3.636896985          7.019674757
2019  UnitedStates  328329953   8.39189902           -2.240577453

2 years × 3 countries panel = 6 country-year observations
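As a rough illustration, the example panel can be held in a pandas DataFrame indexed by country and year. The column names `pop`, `ms_growth` and `cab_gdp` are assumptions borrowed from the variable names used later in this handout, and the values are rounded to four decimals.

```python
import pandas as pd

# Rows copied from the example table (decimals rounded for readability).
rows = [
    (2018, "Singapore", 5638676, 3.9019, 15.4089),
    (2018, "Thailand", 69428454, 4.6675, 5.6103),
    (2018, "UnitedStates", 326838199, 4.0304, -2.1818),
    (2019, "Singapore", 5703569, 4.9514, 14.2630),
    (2019, "Thailand", 69625581, 3.6369, 7.0197),
    (2019, "UnitedStates", 328329953, 8.3919, -2.2406),
]

# Column names are assumed here, not taken from the original data file.
panel = pd.DataFrame(rows, columns=["year", "country", "pop", "ms_growth", "cab_gdp"])

# Index by (country, year): cross-sectional unit first, then time.
panel = panel.set_index(["country", "year"]).sort_index()

n_countries = panel.index.get_level_values("country").nunique()
n_years = panel.index.get_level_values("year").nunique()

print(panel)
print(n_countries, "countries x", n_years, "years =", len(panel), "observations")
```

Each row is one country-year observation, so a single value is looked up with both keys, for example `panel.loc[("Singapore", 2019), "pop"]`.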

Fixed Effects Regression

• A fixed effects regression is an estimation technique that controls for UNOBSERVED individual characteristics that do not vary over time (i.e. are fixed) but that might affect the independent and/or dependent variables in the regression analysis.
• It is usually employed in analyses involving panel data.
  ◦ Example: in a multi-country panel data analysis, each country might have its own unobserved characteristics that do not vary over time but that might nevertheless affect the observed variables in the analysis.

Fixed Effects Regression: Example

[Diagram: Country A and Country B; MS Growth, Inflation, + a]

• The fixed effect “a” captures the unobservable differences between Countries A and B that are not captured by MS Growth but that may affect either or both variables.
• An example of a is the resilience or perseverance of labor unions in a country.
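The role of a can be illustrated with simulated data. The sketch below is not the handout's dataset: every number in it is made up, and the true effect of MS Growth on Inflation is set to 0.5 by assumption. An unobserved country effect a drives both variables, so a plain regression that ignores a overstates the effect, while subtracting each country's own mean (one common way of estimating fixed effects) recovers a value close to 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries, n_years, true_beta = 50, 20, 0.5   # all values are illustrative

# a: one unobserved, time-invariant effect per country (e.g. union strength).
a = rng.normal(0.0, 2.0, size=n_countries)
a_it = np.repeat(a, n_years)                     # same value in every year
country = np.repeat(np.arange(n_countries), n_years)

# MS growth is correlated with a, and inflation depends on both.
ms_growth = a_it + rng.normal(0.0, 1.0, size=a_it.size)
inflation = true_beta * ms_growth + a_it + rng.normal(0.0, 0.5, size=a_it.size)


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def demean_by_group(v, groups, n_groups):
    """Subtract each group's own mean from its observations."""
    sums = np.bincount(groups, weights=v, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return v - (sums / counts)[groups]


pooled = slope(ms_growth, inflation)             # ignores a
within = slope(demean_by_group(ms_growth, country, n_countries),
               demean_by_group(inflation, country, n_countries))

print(f"true: {true_beta}, pooled OLS: {pooled:.3f}, fixed effects: {within:.3f}")
```

Demeaning works here because a is constant within a country, so it disappears once the country mean is subtracted.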

Fixed Effects Regression

• Fixed effects regression model:

$$\text{inflat\_cpi}(i,t) = \alpha + \beta_1 \times \text{ms\_growth}(i,t) + \beta_2 \times \text{pop}(i,t) + \beta_3 \times \text{cab\_gdp}(i,t) + a(i) + \varepsilon(i,t)$$

• Dependent variable:
  ◦ inflat_cpi(i,t) is the inflation rate of country i in year t.
• Independent (explanatory) variables:
  ◦ ms_growth(i,t) is the money supply growth.
  ◦ pop(i,t) is the population.
  ◦ cab_gdp(i,t) is the current account balance as % of GDP.
• a(i) is the fixed effect of country i, and $\varepsilon(i,t)$ is the error term.
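As background on why a(i) does not need to be observed, one standard estimation route is the within transformation. The handout does not spell this out, so the derivation is offered as a sketch. It writes y for inflat_cpi and x_1, x_2, x_3 for ms_growth, pop and cab_gdp, and assumes T years of data per country (T and the bar notation for country averages are introduced here).

```latex
\begin{aligned}
\bar{y}(i) &= \alpha + \sum_{k=1}^{3} \beta_k \, \bar{x}_k(i) + a(i) + \bar{\varepsilon}(i),
\qquad \bar{y}(i) = \frac{1}{T} \sum_{t=1}^{T} y(i,t) \\
y(i,t) - \bar{y}(i) &= \sum_{k=1}^{3} \beta_k \bigl[ x_k(i,t) - \bar{x}_k(i) \bigr]
 + \bigl[ \varepsilon(i,t) - \bar{\varepsilon}(i) \bigr]
\end{aligned}
```

The first line averages the model over time for country i, and the second subtracts that average from the original equation. Because the intercept and a(i) are constant over time within a country, both cancel, so the slope coefficients can be estimated from deviations from country means alone.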

Setting Hypotheses

• Hypotheses:
  ◦ H1: ms_growth has a positive effect on inflat_cpi ($\beta_1 > 0$)
    ▪ Explanation?
  ◦ H2: pop has a positive effect on inflat_cpi ($\beta_2 > 0$)
    ▪ Explanation?
  ◦ H3: cab_gdp has a positive effect on inflat_cpi ($\beta_3 > 0$)
    ▪ Explanation?

Data Collection

• At the PolyU Library website:
  ◦ Find > Databases > Databases by Subject > Economics > World Development Indicators (WDI)
• Example: data_wdi.csv
  ◦ 10 countries × 10 years panel dataset
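Once a file such as data_wdi.csv has been downloaded, it could be loaded along the following lines. This is only a sketch: the column names `country`, `year`, `inflat_cpi`, `ms_growth`, `pop` and `cab_gdp` are assumptions about the file layout, the helper `load_panel` is hypothetical, and the small in-memory sample stands in for the real file so that the example runs on its own.

```python
import io

import pandas as pd


def load_panel(source):
    """Read a country-year CSV and index it by (country, year)."""
    df = pd.read_csv(source)
    panel = df.set_index(["country", "year"]).sort_index()
    # Each (country, year) pair should appear exactly once in a panel.
    if panel.index.has_duplicates:
        raise ValueError("duplicate country-year rows found")
    return panel


# Tiny made-up sample standing in for data_wdi.csv (numbers are illustrative).
sample = io.StringIO(
    "country,year,inflat_cpi,ms_growth,pop,cab_gdp\n"
    "A,2018,1.0,4.0,100,2.0\n"
    "A,2019,1.5,5.0,101,1.8\n"
    "B,2018,3.0,6.0,200,-1.0\n"
    "B,2019,2.5,5.5,202,-1.2\n"
)

panel = load_panel(sample)
counts = panel.groupby(level="country").size()   # observations per country

print(counts)
print("balanced:", counts.nunique() == 1)        # same number of years everywhere
```

With the real file the call would be `load_panel("data_wdi.csv")`. If the assumed layout holds, a 10 × 10 panel gives 100 rows.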

Data Analysis: Python Setup

• Programming language: Python
• Platform: Anaconda
  ◦ Install from: https://www.anaconda.com/products/individual
• Program editor and interface:
  ◦ Jupyter Notebook
• Example: open the IPython notebook file
  ◦ Python fixed effect model.ipynb (see Course Website)
• Note: the “linearmodels” package might not be bundled with the Anaconda download. If it is missing, install it manually from the Anaconda command prompt:
  ◦ conda install -c conda-forge linearmodels

Data Analysis: Python Implementation

After running the Python code (in Jupyter):

Dep. Variable:     inflat_cpi
No. Observations:  100
R-squared:         0.3923

                    Parameter Estimates
========================================================
            Parameter    Std. Err.   T-stat    P-value
--------------------------------------------------------
const       14.967       2.0452      7.3182    0.0000
ms_growth   0.0693       0.0272      2.5508    0.0125
pop         -6.161e-08   9.726e-09   -6.3347   0.0000
cab_gdp     -0.0659      0.0611      -1.0783   0.2839
========================================================

Hypothesis Testing: Results Interpretation

• The overall goodness of fit of the model is measured by R-squared ($R^2$):

$$0 \le R^2 \le 1$$

  ◦ It is the proportion of the variance of the dependent variable that is explained by the model.
  ◦ The larger $R^2$ is, the better the model fits the data.
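To make the "proportion of variance explained" reading concrete, the sketch below computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The four data points are made up for illustration and have nothing to do with the inflation data.

```python
def r_squared(actual, fitted):
    """Share of the variance in `actual` that is explained by `fitted`."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)               # total variation
    ss_resid = sum((y - f) ** 2 for y, f in zip(actual, fitted))  # unexplained part
    return 1.0 - ss_resid / ss_total


actual = [1.0, 2.0, 3.0, 4.0]   # illustrative observations
fitted = [1.1, 1.9, 3.2, 3.8]   # illustrative model predictions

print(round(r_squared(actual, fitted), 4))   # 0.98
```

A perfect fit leaves no residual variation and gives 1, while a model that explains nothing beyond the mean gives 0.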

Hypothesis Testing: Results Interpretation

• t-statistic: the estimated coefficient divided by its standard error.
  ◦ For a given coefficient estimate, the smaller the standard error (i.e. the larger the absolute value of the t-statistic), the more statistically significant the estimated coefficient.
• p-value of the t-statistic: the probability, if the null hypothesis ($\beta = 0$) were true, of obtaining a t-statistic at least as extreme as the one observed. Rejecting the null only when the p-value is at most 5% therefore limits the probability of wrongly rejecting a true null (a Type-I error) to 5%.
  ◦ The smaller the p-value, the stronger the evidence against the null hypothesis ($\beta = 0$).
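The relationship between coefficient, standard error, t-statistic and p-value can be checked against the Parameter Estimates table using only the standard library. The p-values here come from a standard normal approximation, which is an assumption made to avoid extra packages. They therefore differ slightly from the reported ones, which are presumably based on a t-distribution. The recomputed t-statistics also differ in the third decimal because the table's inputs are rounded.

```python
import math

# (coefficient, standard error) pairs taken from the Parameter Estimates table.
estimates = {
    "const": (14.967, 2.0452),
    "ms_growth": (0.0693, 0.0272),
    "pop": (-6.161e-08, 9.726e-09),
    "cab_gdp": (-0.0659, 0.0611),
}


def t_statistic(coef, std_err):
    """Estimated coefficient divided by its standard error."""
    return coef / std_err


def approx_p_value(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


for name, (coef, se) in estimates.items():
    t = t_statistic(coef, se)
    print(f"{name:10s} t = {t:8.4f}   approx. p = {approx_p_value(t):.4f}")
```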

Hypothesis Testing: Decision Rule

• Significance of the estimated coefficient:
  ◦ When p-value ≤ 0.01, one can reject the null hypothesis of $\beta = 0$ at the 1% level of significance.
  ◦ When p-value ≤ 0.05, one can reject the null hypothesis of $\beta = 0$ at the 5% level of significance.
  ◦ When p-value > 0.05, one cannot reject the null hypothesis of $\beta = 0$ (at least at the 5% level of significance) => there is a lack of evidence that the parameter differs from zero.
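The decision rule translates directly into a small function. The sketch below applies it to the p-values reported in the Parameter Estimates table. The function name `significance_level` is simply a label chosen for this illustration.

```python
def significance_level(p_value):
    """Return the strictest level (1% or 5%) at which beta = 0 is rejected."""
    if p_value <= 0.01:
        return "reject at 1%"
    if p_value <= 0.05:
        return "reject at 5%"
    return "cannot reject at 5%"


# p-values as reported in the Parameter Estimates table.
p_values = {"ms_growth": 0.0125, "pop": 0.0000, "cab_gdp": 0.2839}

for name, p in p_values.items():
    print(f"{name:10s} p = {p:.4f} -> {significance_level(p)}")
```

The 1% check comes first because a p-value at or below 0.01 also satisfies the 5% rule, and the stricter level is the more informative one to report.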

Hypothesis Testing

• Conclusions:
  ◦ H1 is supported: ms_growth has a significantly positive effect on inflat_cpi ($\beta_1 > 0$).
  ◦ H2 is not supported: pop has a significantly negative effect on inflat_cpi ($\beta_2 < 0$).
  ◦ H3 is not supported: the effect of cab_gdp on inflat_cpi is not significantly different from zero (the p-value is high => $\beta_3 = 0$ cannot be rejected).
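These conclusions combine two checks: whether the estimate is significant, and whether its sign matches the hypothesis. A minimal sketch of that reasoning is given below, using the coefficients and p-values from the Parameter Estimates table. The 5% threshold and the helper name `assess` are choices made for this illustration.

```python
def assess(coef, p_value, alpha=0.05):
    """Judge a 'positive effect' hypothesis from an estimate and its p-value."""
    significant = p_value <= alpha
    if significant and coef > 0:
        return "supported"
    if significant:
        return "not supported (significant, but negative)"
    return "not supported (not significantly different from zero)"


# (estimated coefficient, p-value) from the Parameter Estimates table.
table = {
    "H1: ms_growth": (0.0693, 0.0125),
    "H2: pop": (-6.161e-08, 0.0000),
    "H3: cab_gdp": (-0.0659, 0.2839),
}

for hypothesis, (coef, p) in table.items():
    print(f"{hypothesis:14s} -> {assess(coef, p)}")
```

H2 and H3 fail for different reasons: the pop estimate is significant but has the opposite sign, whereas the cab_gdp estimate cannot be distinguished from zero.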

Data Analysis: Important Caveat

• Correlation DOES NOT IMPLY causation
  ◦ “Storks Deliver Babies (p = 0.008)” studies the relationship between the number of storks in a country and the number of human births, using data from 17 European countries.
  ◦ It finds a statistically significant correlation (p-value = 0.008) between stork populations and human birth rates.
  ◦ Does this really imply that storks actually deliver babies?
    ▪ Unmindful and reckless use of correlation and p-values can deliver unreliable conclusions!
  ◦ Such spurious correlations usually arise from a confounding variable (in this case, country size).
• To reduce the chance of such false and misleading results, it is important to have:
  i. a good statistical design that includes all relevant variables, including possible confounders
  ii. a compelling conceptual analysis that demonstrates a potential causal mechanism behind the statistical results
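The confounding mechanism can be reproduced with simulated data. The sketch below does not use the stork study's data. It invents 200 hypothetical countries in which country size drives both stork numbers and births, while storks have no effect on births at all. The raw correlation is still strong, and it disappears once size is included as a regressor, which is point (i) above in action.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # hypothetical countries; every number here is made up

size = rng.uniform(1.0, 10.0, size=n)                  # confounder: country size
storks = 2.0 * size + rng.normal(0.0, 1.0, size=n)     # bigger country, more storks
births = 5.0 * size + rng.normal(0.0, 1.0, size=n)     # bigger country, more births

# Strong raw correlation, although storks never enter the births equation.
raw_corr = np.corrcoef(storks, births)[0, 1]

# Regress births on storks AND size: the stork coefficient shrinks toward zero.
X = np.column_stack([np.ones(n), storks, size])
coefs, *_ = np.linalg.lstsq(X, births, rcond=None)
stork_effect, size_effect = coefs[1], coefs[2]

print(f"raw correlation (storks, births): {raw_corr:.3f}")
print(f"stork coefficient controlling for size: {stork_effect:.3f}")
print(f"size coefficient: {size_effect:.3f}")
```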