This is Statistics' assignments by using R studio.
Chapter 17: More on Simple Regression
Human Development Index (HDI) 2012
The United Nations Development Program (UNDP) collects data in the developing world to help countries solve global and national development challenges. In the UNDP annual Human Development Report, you can find data on over 100 variables for each of 197 counties worldwide. One summary measure used by the agency is the Human Development Index (HDI), which attempts to summarize in a single number the progress in health, education, and economics of a country. In 2012, the HDI was as high as 0.955 for Norway and as low as 0.304 for the Congo and Niger. The gross domestic product per capita (GDPPC), by contrast, is often used to summarize the overall economics strength of a country. Is the HDI related to the GDPPC?
Let’s first read the dataset into R hdi <- read.table('HDI.txt', sep = '\t', header = TRUE)
We then blindly regress GDPPC on HDI without checking any assumption imod0 <- lm(HDI ~ GDP.Per.Capita, data = hdi)
and check the regression summary summary(imod0)
## ## Call: ## lm(formula = HDI ~ GDP.Per.Capita, data = hdi) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.30095 -0.06230 0.04071 0.09097 0.15815 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.168e-01 1.625e-02 37.97 < 2e-16 *** ## GDP.Per.Capita 7.516e-06 7.987e-07 9.41 3.3e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.1257 on 94 degrees of freedom ## Multiple R-squared: 0.4851, Adjusted R-squared: 0.4796 ## F-statistic: 88.56 on 1 and 94 DF, p-value: 3.295e-15
The coefficient of GDP.Per.Capita is positive and highly statistically significant (very small p-value). However, we cannot conclude that GDP.Per.Capita is positively correlated with HDI, and is a useful predictor, because some model assumptions may be violated, making the model not valid at all! We have to check the assumptions.
We first check the Normal population assumption using histogram and Q-Q plot: par(mfrow = c(1, 2)) hist(hdi$HDI, xlab = "HDI", main = "") qqnorm(hdi$HDI) qqline(hdi$HDI)
1
HDI
F re
qu en
cy
0.3 0.5 0.7 0.9
0 5
10 15
20
−2 0 1 2
0. 4
0. 6
0. 8
Normal Q−Q Plot
Theoretical Quantiles
S am
pl e
Q ua
nt ile
s
As we can see, the distribution of HDI is skewed to the left. We therefore apply square (power 2) transformation to HDI and recheck: par(mfrow = c(1, 2)) hist(hdi$HDI^2, xlab = expression(HDI^2), main = "") qqnorm(hdi$HDI^2) qqline(hdi$HDI^2)
HDI2
F re
qu en
cy
0.2 0.6 1.0
0 5
10 15
−2 0 1 2
0. 2
0. 4
0. 6
0. 8
Normal Q−Q Plot
Theoretical Quantiles
S am
pl e
Q ua
nt ile
s
Although it is still slighly left-skewed, the transformed distribution becomes more Normal and is now acceptable, considering the sample size is large (n = 96).
We then check the linearity assumption. To obtain residuals, we regress HDI2 on GDPPC:
2
imod1 <- lm(HDI^2 ~ GDP.Per.Capita, data = hdi)
Let’s check scatterplot plot of HDI2 against GDPPC and residual plot: par(mfrow = c(1,2)) plot(hdi$GDP.Per.Capita, hdi$HDI^2, xlab = 'GDPPC', ylab = expression(HDI^2)) plot(imod1$fitted.values, imod1$residuals, xlab = 'Fitted Value', ylab = 'Residual') abline(0,0)
0 20000 60000
0. 2
0. 4
0. 6
0. 8
GDPPC
H D
I2
0.4 0.6 0.8 1.0 1.2
− 0.
4 −
0. 2
0. 0
0. 2
Fitted Value
R es
id ua
l
par(mfrow = c(1,1))
We can see an obvious bend in the scatterplot and a pattern in the residual plot. The unequal spreads are also shown in both plots. We therefore try a few transformations in the ladder of powers on GDPPC, which are √
GDPPC, log10 GDPPC, −1/ √
GDPPC and −1/GDPPC. The following scatterplots show the effect of each transformation: par(mfrow = c(1, 2)) plot(sqrt(hdi$GDP.Per.Capita), hdi$HDI^2, xlab = expression(sqrt(GDPPC)),
ylab = expression(HDI^2)) plot(log10(hdi$GDP.Per.Capita), hdi$HDI^2, xlab = expression(Log[10](GDPPC)),
ylab = expression(HDI^2))
3
50 150 250
0. 2
0. 4
0. 6
0. 8
GDPPC
H D
I2
2.0 3.0 4.0
0. 2
0. 4
0. 6
0. 8
Log10(GDPPC) H
D I2
par(mfrow = c(1, 2)) plot(-1/sqrt(hdi$GDP.Per.Capita), hdi$HDI^2, xlab = expression(-1/sqrt(GDPPC)),
ylab = expression(HDI^2)) plot(-1/hdi$GDP.Per.Capita, hdi$HDI^2, xlab = expression(-1/GDPPC),
ylab = expression(HDI^2))
−0.08 −0.04
0. 2
0. 4
0. 6
0. 8
− 1 GDPPC
H D
I2
−0.008 −0.004 0.000
0. 2
0. 4
0. 6
0. 8
− 1 GDPPC
H D
I2
As we go down the ladder, the effect of the power transformation is similar but getting stronger. We choose log10 transformation because it makes the scatter most straight. We then fit the following model: imod2 <- lm(HDI^2 ~ log10(GDP.Per.Capita), data = hdi)
The resulting residual plot is given by plot(imod2$fitted.values, imod2$residuals, xlab = "Fitted Value", ylab = "Residual") abline(a = 0, b = 0)
4
0.2 0.4 0.6 0.8
− 0.
3 −
0. 1
0. 1
0. 2
Fitted Value
R es
id ua
l
We see no pat- tern and equal spread in the residual plot, so the linearity and constant variance assumptions are satisfied in the transformed model. But, there seems to be an outlier with a significant residual. Let’s find that outlier: idx <- which.min(imod2$residuals) hdi[idx, ]
## Country HDI ## 41 Equatorial Guinea 0.547 ## Expected.Years.of.Schooling..of.children...years. ## 41 7.9 ## Life.expectancy.at.birth..years. ## 41 50.8 ## Maternal.mortality.ratio..deaths.per.100000.live.births. ## 41 240 ## Mean.years.of.schooling..of.adults...years. Population.urban.... ## 41 5.4 39.3 ## GDP.Per.Capita Cell.phones.100.people ## 41 8886 10.49519
The outlier is from Equatorial Guinea, a country that has HDI = 0.547 and GDPPC = 8886. It has a significantly negative residual, which means given GDPPC = 8886 the HDI of that country is much lower than the prediction provided by the linear model. There must be some other variable(s) affecting the HDI in a negative way.
Is the outlier from Equatorial Guinea an influential point? Let’s fit two models: one with the outlier and one without. We need to create a new dataset without the outlier for comparison hdi.new <- hdi[-idx, ]
and use this dataset to fit a same regression model as imod2 imod3 <- lm(HDI^2 ~ log10(GDP.Per.Capita), data = hdi.new)
Let’s first compare the two models visually: plot(log10(hdi$GDP.Per.Capita), hdi$HDI^2, xlab = expression(Log[10](GDPPC)),
ylab = expression(HDI^2)) abline(imod2, col = 'blue') abline(imod3, col = 'red')
5
2.0 2.5 3.0 3.5 4.0 4.5
0. 2
0. 4
0. 6
0. 8
Log10(GDPPC)
H D
I2
It is hard to see any difference between the two regression lines because they almost overlap each other. Note that there is no data point of high leverage, as indicated in the plot.
We then compare the two models quantitatively: summary(imod2)
## ## Call: ## lm(formula = HDI^2 ~ log10(GDP.Per.Capita), data = hdi) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.32938 -0.03480 0.01118 0.05078 0.18190 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.52568 0.04336 -12.12 <2e-16 *** ## log10(GDP.Per.Capita) 0.29231 0.01172 24.95 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.08305 on 94 degrees of freedom ## Multiple R-squared: 0.8688, Adjusted R-squared: 0.8674 ## F-statistic: 622.3 on 1 and 94 DF, p-value: < 2.2e-16 summary(imod3)
## ## Call: ## lm(formula = HDI^2 ~ log10(GDP.Per.Capita), data = hdi.new) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.168950 -0.032771 0.007626 0.048351 0.179724 ## ## Coefficients:
6
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.52991 0.03974 -13.34 <2e-16 *** ## log10(GDP.Per.Capita) 0.29444 0.01075 27.40 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.07609 on 93 degrees of freedom ## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8886 ## F-statistic: 750.5 on 1 and 93 DF, p-value: < 2.2e-16
As we can see, the two models have very similar regression coefficients and R2. As a result, that outlier from Equatorial Guinea is not influential. We then do not need to remove it from the dataset and stick with imod2, the model fitted with the full dataset.
Be very careful! We cannot interpret the estimated slope in the model as the change of HDI on average if GDPPC increases by 1 unit, because both variables are transformed. In this case, it should be interpreted as follows: the HDI2 will increase by 0.292 on average if log10 GDPPC increases by 1 unit.
Now let’s use imod2 to make prediction on HDI of a country if its GDPPC is 4000. For this purpose we first create a new dataset for predicting: pred.data <- data.frame(GDP.Per.Capita = 4000)
and then make prediction using pred.data and imod2: result.pred <- predict(imod2, newdata = pred.data, interval = 'prediction', level = 0.9) result.pred
## fit lwr upr ## 1 0.5272557 0.3885762 0.6659353
Be very careful! The fitted value and its prediction interval obtained above are for the transformed response variable HDI2. We need to transform result.pred back to the original scale: sqrt(result.pred)
## fit lwr upr ## 1 0.7261238 0.6233588 0.8160486
When a country has a GDPPC of 4000, its HDI is predicted to be 0.726 with 90% prediction interval from 0.623 and 0.816.
7
- Human Development Index (HDI) 2012