Module 6 Discussion-Stats
#Q1
Jtrain2
a. Fraction of men receiving job training is 0.4157 or 41.57%
b. Averages of re78 for the sample of men who received job training is 6.35 thousand dollars
whereas Averages of re78 for the sample of men not received job training is 4.55 thousand
dollars. This is a 1.8 thousand dollars which is a economically statistical figure.
c. Fraction of the men who received job training are unemployed is 0.2432 or 24.32% whereas
fraction of the men who not received job training are unemployed is 0.3538 or 35.38%. This
is a 11.06% difference which is very large and it tells us that job training decreases the
chances for the unemployment.
d. From both b) and C) we conclude that job training significantly increases the income and
reduced the unemployment.
#Q1 data(jtrain2, package='wooldridge') View(jtrain2) #a sum(jtrain2$train==1)/nrow(jtrain2)
## [1] 0.4157303
#b s1 <- subset(jtrain2,train==1) mean(s1$re78)
## [1] 6.349145
s2 <- subset(jtrain2,train==0) mean(s2$re78)
## [1] 4.554802
#c s3 <- subset(jtrain2,train==1) sum(s3$unem78==1)/sum(jtrain2$train==1)
## [1] 0.2432432
s4 <- subset(jtrain2,train==0) sum(s4$unem78==1)/sum(jtrain2$train==0)
## [1] 0.3538462
Q2
The mean, standard deviation, intercept, slope will differ each time because this is a simulation.
a. The Sample mean is approximately equal to 5 which is same as the mean of the Uniform
distribution in the range (0,10). The Standard deviation is 3.00 which is also very close to
the standard deviation of the Uniform distribution in the range (0,10).
b. No, the average of ui is not exactly 0 but very close to 0 which is also expected because
this is the random numbers and mean value can differ. The Standard deviation is 5.95 and
it is close to the actual standard deviation of 6.
c. The Intercept is 2.087 and slope is 1.779. They are not equal to the Population values but
they are very close especially the slope
d. Both residual sum and Product if xi and residual is almost 0.
e. The sum of error and Product of xi and error is not 0.
f. The new Intercept is 0.6041 and slope is 2.0126. They are different from the first no
because this is a simulation regression and results will come different each time but the
slope is almost 2 in both cases.
#Q2 #a. Generate 500 Uniform Distribution Observation in the range (0,10) xi<- runif(500)*10 mean(xi)
## [1] 5.007284
sd(xi)
## [1] 3.008299
#b.Generate 500 Normal Distribution Errors in the range (0,36) ui<-rnorm(500)*6 mean(ui)
## [1] -0.0205087
sd(ui)
## [1] 5.809095
#c yi <- 1+2*xi+ui resul <- lm(yi ~ xi) resul
## ## Call: ## lm(formula = yi ~ xi) ## ## Coefficients:
## (Intercept) xi ## 2.087 1.779
#d resid <- residuals(resul) options(scipen = 999) sum(resid)
## [1] -0.00000000000005384582
Prod <- xi*resid sum(Prod)
## [1] -0.000000000001600942
#e sum(ui)
## [1] -10.25435
Prod2 <- xi*ui sum(Prod2)
## [1] -1050.401
#f #a. Generate 500 Uniform Distribution Observation in the range (0,10) xi_new<- runif(500)*10 mean(xi_new)
## [1] 4.966936
sd(xi_new)
## [1] 2.899189
#b.Generate 500 Normal Distribution Errors in the range (0,36) ui_new<-rnorm(500)*6 mean(ui_new)
## [1] -0.3332974
sd(ui_new)
## [1] 6.400612
#c yi_new <- 1+2*xi_new+ui_new resul_new <- lm(yi_new ~ xi_new) resul_new
## ## Call:
## lm(formula = yi_new ~ xi_new) ## ## Coefficients: ## (Intercept) xi_new ## 0.6041 2.0126
Q3
Discrim
a. The averages for “prpblck” and “income” are 0.113 and 47,053.78, respectively. The
standard deviations for “prpblck” and “income” are 0.1824 and 13179.29 respectively.
prpblck represents a proportion of the black population and the income is represented in
dollar terms.
b. The resulting regression is psoda = 0.956 + 0.115 *prpblck + 0.0000016 *Income. The
Sample size is 401 and R-squared is 0.0642. The coefficient of prpblck indicates that if
population of African-Americans increases by 1%, then price of the soda will increase by
1.15 cents. It is not economically large.
c. The estimate of the coefficient on prpblack with the simple regression is 0.065 which is
lower than the prior estimate. This tells us that discrimination effect get smaller when the
income variable is excluded.
d. If “prpblck” increases by 20 percentage points, estimated psoda will increase by 2.43%
e. By adding the prppov variables, the estimate of the coefficient of prpblck falls to 0.0738.
f. The correlation between log(income) and prppov is approximately -0.8385 which tells us
that the relationship is strongly negative. This is also expected because decreases in
income will increase the poverty rates. This makes sense, because one would expect that
declines in income would result in higher poverty rates.
g. Yes, they are highly correlated and include both may result in multicollinearity problem
but our main purpose is to study the discrimination effect and we need to control as many
measures of income as we can. So, include both would makes sense.
#Q3 data(discrim, package='wooldridge') View(discrim) #a mean(discrim$prpblck,na.rm=TRUE)
## [1] 0.1134864
sd(discrim$prpblck,na.rm=TRUE)
## [1] 0.1824165
mean(discrim$income,na.rm=TRUE)
## [1] 47053.78
sd(discrim$income,na.rm=TRUE)
## [1] 13179.29
#b results <- lm(discrim$psoda ~ discrim$prpblck+discrim$income) options(scipen = 999) results
## ## Call: ## lm(formula = discrim$psoda ~ discrim$prpblck + discrim$income) ## ## Coefficients: ## (Intercept) discrim$prpblck discrim$income ## 0.956319626 0.114988191 0.000001603
# No of Observations nobs(results)
## [1] 401
# R-Square summary(results)$r.squared
## [1] 0.06422039
#c results2 <- lm(discrim$psoda ~ discrim$prpblck) options(scipen = 999) results2
## ## Call: ## lm(formula = discrim$psoda ~ discrim$prpblck) ## ## Coefficients: ## (Intercept) discrim$prpblck ## 1.03740 0.06493
#d results3 <- lm(log(discrim$psoda) ~ discrim$prpblck+log(discrim$income)) options(scipen = 999) results3
## ## Call: ## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income)) ## ## Coefficients: ## (Intercept) discrim$prpblck log(discrim$income) ## -0.79377 0.12158 0.07651
# Percentage Change Coefficients <- coef(results3) 0.2*100*Coefficients["discrim$prpblck"]
## discrim$prpblck ## 2.431605
#e results4 <- lm(log(discrim$psoda) ~ discrim$prpblck+log(discrim$income)+discr im$prppov) results4
## ## Call: ## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income) + ## discrim$prppov) ## ## Coefficients: ## (Intercept) discrim$prpblck log(discrim$income) ## -1.46333 0.07281 0.13696 ## discrim$prppov ## 0.38036
#f cor(log(discrim$income), discrim$prppov,use = "complete.obs")
## [1] -0.838467
Q4
K401Subs
a. 2017 single-person households are in the data set
b. The resulting regression is nettfa = -43.0398+ 0.7993*Inc + 0.8427*age. The Sample size
is 2017 and R-squared is 0.1193. The coefficient of Inc is 0.7993 and it’s indicates that if
annual Income of the family increases by $1000 then net financial wealth will increase by
$799.3 keeping age variable constant. The coefficient of age is 0.8427 and it’s indicates
that if age of the survey respondent increases by one year then net financial wealth will
increases by $842.7 keeping the annual family Income constant. I do not see any
surprises in the result because as the family income, their net wealth grows too but not in
same amount because of the expenses. Also, as the age grows, family accumulate more
funds and their wealth grows too.
c. Intercept is -43.0398 and it’s tells us that for 0 family income and 0 page, the net
financial wealth is -43.0309 thousand dollars. I do not think it is any interpretation
because it does not have any meaning, also the survey does not have any respondent with
0 age.
d. P value for the left tailed test is 0.0437. Since P value is greater than 0.01, we failed to
reject H0 and conclude that there is insufficient evidence that age coefficient is
significantly less than 1.
e. For simple regression of nettfa on inc, the estimated coefficient on inc is 0.8207 which is
not so much differ from the earlier part estimate. The correlation between age and
Income is 0.1056 which is very low and thus it’s also explained that why omitting an age
variable does not affect the Income variable much because variables are very weakly
corelated. So, there is no Omitted variable Bias.
#Q4 data(k401ksubs,package='wooldridge') View(k401ksubs) #a sum(k401ksubs$fsize==1)
## [1] 2017
#b res <- lm(k401ksubs$nettfa ~ k401ksubs$inc+k401ksubs$age,subset = (k401ksubs$ fsize==1)) res
## ## Call: ## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc + k401ksubs$age, ## subset = (k401ksubs$fsize == 1)) ## ## Coefficients: ## (Intercept) k401ksubs$inc k401ksubs$age ## -43.0398 0.7993 0.8427
# No of Observations nobs(res)
## [1] 2017
# R-Square summary(res)$r.squared
## [1] 0.1193432
#d # Age Coefficient estimate age_coeff <- res$coefficients[3] # Age Coefficient standard error age_Sder<-summary(res)$coefficients[3,"Std. Error"] #t statistic with Beta2 = 1
t_slope <- (age_coeff - (1)) / age_Sder t_slope
## k401ksubs$age ## -1.709944
# calculate Left sided p-value, degrees of freedom = n - 2 pvalue <- pt(t_slope,df=length(k401ksubs$age)-2) pvalue
## k401ksubs$age ## 0.04365485
#e res2 <- lm(k401ksubs$nettfa ~ k401ksubs$inc,subset = (k401ksubs$fsize==1)) res2
## ## Call: ## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc, subset = (k401ksubs$fsize = = ## 1)) ## ## Coefficients: ## (Intercept) k401ksubs$inc ## -10.5710 0.8207
cov(k401ksubs$age,k401ksubs$inc)
## [1] 26.21035