Module 6 Discussion-Stats

profilecsht
MidtermStatisticalAnalysisResults.pdf

#Q1

Jtrain2

a. Fraction of men receiving job training is 0.4157 or 41.57%

b. Averages of re78 for the sample of men who received job training is 6.35 thousand dollars

whereas Averages of re78 for the sample of men not received job training is 4.55 thousand

dollars. This is a 1.8 thousand dollars which is a economically statistical figure.

c. Fraction of the men who received job training are unemployed is 0.2432 or 24.32% whereas

fraction of the men who not received job training are unemployed is 0.3538 or 35.38%. This

is a 11.06% difference which is very large and it tells us that job training decreases the

chances for the unemployment.

d. From both b) and C) we conclude that job training significantly increases the income and

reduced the unemployment.

#Q1 data(jtrain2, package='wooldridge') View(jtrain2) #a sum(jtrain2$train==1)/nrow(jtrain2)

## [1] 0.4157303

#b s1 <- subset(jtrain2,train==1) mean(s1$re78)

## [1] 6.349145

s2 <- subset(jtrain2,train==0) mean(s2$re78)

## [1] 4.554802

#c s3 <- subset(jtrain2,train==1) sum(s3$unem78==1)/sum(jtrain2$train==1)

## [1] 0.2432432

s4 <- subset(jtrain2,train==0) sum(s4$unem78==1)/sum(jtrain2$train==0)

## [1] 0.3538462

Q2

The mean, standard deviation, intercept, slope will differ each time because this is a simulation.

a. The Sample mean is approximately equal to 5 which is same as the mean of the Uniform

distribution in the range (0,10). The Standard deviation is 3.00 which is also very close to

the standard deviation of the Uniform distribution in the range (0,10).

b. No, the average of ui is not exactly 0 but very close to 0 which is also expected because

this is the random numbers and mean value can differ. The Standard deviation is 5.95 and

it is close to the actual standard deviation of 6.

c. The Intercept is 2.087 and slope is 1.779. They are not equal to the Population values but

they are very close especially the slope

d. Both residual sum and Product if xi and residual is almost 0.

e. The sum of error and Product of xi and error is not 0.

f. The new Intercept is 0.6041 and slope is 2.0126. They are different from the first no

because this is a simulation regression and results will come different each time but the

slope is almost 2 in both cases.

#Q2 #a. Generate 500 Uniform Distribution Observation in the range (0,10) xi<- runif(500)*10 mean(xi)

## [1] 5.007284

sd(xi)

## [1] 3.008299

#b.Generate 500 Normal Distribution Errors in the range (0,36) ui<-rnorm(500)*6 mean(ui)

## [1] -0.0205087

sd(ui)

## [1] 5.809095

#c yi <- 1+2*xi+ui resul <- lm(yi ~ xi) resul

## ## Call: ## lm(formula = yi ~ xi) ## ## Coefficients:

## (Intercept) xi ## 2.087 1.779

#d resid <- residuals(resul) options(scipen = 999) sum(resid)

## [1] -0.00000000000005384582

Prod <- xi*resid sum(Prod)

## [1] -0.000000000001600942

#e sum(ui)

## [1] -10.25435

Prod2 <- xi*ui sum(Prod2)

## [1] -1050.401

#f #a. Generate 500 Uniform Distribution Observation in the range (0,10) xi_new<- runif(500)*10 mean(xi_new)

## [1] 4.966936

sd(xi_new)

## [1] 2.899189

#b.Generate 500 Normal Distribution Errors in the range (0,36) ui_new<-rnorm(500)*6 mean(ui_new)

## [1] -0.3332974

sd(ui_new)

## [1] 6.400612

#c yi_new <- 1+2*xi_new+ui_new resul_new <- lm(yi_new ~ xi_new) resul_new

## ## Call:

## lm(formula = yi_new ~ xi_new) ## ## Coefficients: ## (Intercept) xi_new ## 0.6041 2.0126

Q3

Discrim

a. The averages for “prpblck” and “income” are 0.113 and 47,053.78, respectively. The

standard deviations for “prpblck” and “income” are 0.1824 and 13179.29 respectively.

prpblck represents a proportion of the black population and the income is represented in

dollar terms.

b. The resulting regression is psoda = 0.956 + 0.115 *prpblck + 0.0000016 *Income. The

Sample size is 401 and R-squared is 0.0642. The coefficient of prpblck indicates that if

population of African-Americans increases by 1%, then price of the soda will increase by

1.15 cents. It is not economically large.

c. The estimate of the coefficient on prpblack with the simple regression is 0.065 which is

lower than the prior estimate. This tells us that discrimination effect get smaller when the

income variable is excluded.

d. If “prpblck” increases by 20 percentage points, estimated psoda will increase by 2.43%

e. By adding the prppov variables, the estimate of the coefficient of prpblck falls to 0.0738.

f. The correlation between log(income) and prppov is approximately -0.8385 which tells us

that the relationship is strongly negative. This is also expected because decreases in

income will increase the poverty rates. This makes sense, because one would expect that

declines in income would result in higher poverty rates.

g. Yes, they are highly correlated and include both may result in multicollinearity problem

but our main purpose is to study the discrimination effect and we need to control as many

measures of income as we can. So, include both would makes sense.

#Q3 data(discrim, package='wooldridge') View(discrim) #a mean(discrim$prpblck,na.rm=TRUE)

## [1] 0.1134864

sd(discrim$prpblck,na.rm=TRUE)

## [1] 0.1824165

mean(discrim$income,na.rm=TRUE)

## [1] 47053.78

sd(discrim$income,na.rm=TRUE)

## [1] 13179.29

#b results <- lm(discrim$psoda ~ discrim$prpblck+discrim$income) options(scipen = 999) results

## ## Call: ## lm(formula = discrim$psoda ~ discrim$prpblck + discrim$income) ## ## Coefficients: ## (Intercept) discrim$prpblck discrim$income ## 0.956319626 0.114988191 0.000001603

# No of Observations nobs(results)

## [1] 401

# R-Square summary(results)$r.squared

## [1] 0.06422039

#c results2 <- lm(discrim$psoda ~ discrim$prpblck) options(scipen = 999) results2

## ## Call: ## lm(formula = discrim$psoda ~ discrim$prpblck) ## ## Coefficients: ## (Intercept) discrim$prpblck ## 1.03740 0.06493

#d results3 <- lm(log(discrim$psoda) ~ discrim$prpblck+log(discrim$income)) options(scipen = 999) results3

## ## Call: ## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income)) ## ## Coefficients: ## (Intercept) discrim$prpblck log(discrim$income) ## -0.79377 0.12158 0.07651

# Percentage Change Coefficients <- coef(results3) 0.2*100*Coefficients["discrim$prpblck"]

## discrim$prpblck ## 2.431605

#e results4 <- lm(log(discrim$psoda) ~ discrim$prpblck+log(discrim$income)+discr im$prppov) results4

## ## Call: ## lm(formula = log(discrim$psoda) ~ discrim$prpblck + log(discrim$income) + ## discrim$prppov) ## ## Coefficients: ## (Intercept) discrim$prpblck log(discrim$income) ## -1.46333 0.07281 0.13696 ## discrim$prppov ## 0.38036

#f cor(log(discrim$income), discrim$prppov,use = "complete.obs")

## [1] -0.838467

Q4

K401Subs

a. 2017 single-person households are in the data set

b. The resulting regression is nettfa = -43.0398+ 0.7993*Inc + 0.8427*age. The Sample size

is 2017 and R-squared is 0.1193. The coefficient of Inc is 0.7993 and it’s indicates that if

annual Income of the family increases by $1000 then net financial wealth will increase by

$799.3 keeping age variable constant. The coefficient of age is 0.8427 and it’s indicates

that if age of the survey respondent increases by one year then net financial wealth will

increases by $842.7 keeping the annual family Income constant. I do not see any

surprises in the result because as the family income, their net wealth grows too but not in

same amount because of the expenses. Also, as the age grows, family accumulate more

funds and their wealth grows too.

c. Intercept is -43.0398 and it’s tells us that for 0 family income and 0 page, the net

financial wealth is -43.0309 thousand dollars. I do not think it is any interpretation

because it does not have any meaning, also the survey does not have any respondent with

0 age.

d. P value for the left tailed test is 0.0437. Since P value is greater than 0.01, we failed to

reject H0 and conclude that there is insufficient evidence that age coefficient is

significantly less than 1.

e. For simple regression of nettfa on inc, the estimated coefficient on inc is 0.8207 which is

not so much differ from the earlier part estimate. The correlation between age and

Income is 0.1056 which is very low and thus it’s also explained that why omitting an age

variable does not affect the Income variable much because variables are very weakly

corelated. So, there is no Omitted variable Bias.

#Q4 data(k401ksubs,package='wooldridge') View(k401ksubs) #a sum(k401ksubs$fsize==1)

## [1] 2017

#b res <- lm(k401ksubs$nettfa ~ k401ksubs$inc+k401ksubs$age,subset = (k401ksubs$ fsize==1)) res

## ## Call: ## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc + k401ksubs$age, ## subset = (k401ksubs$fsize == 1)) ## ## Coefficients: ## (Intercept) k401ksubs$inc k401ksubs$age ## -43.0398 0.7993 0.8427

# No of Observations nobs(res)

## [1] 2017

# R-Square summary(res)$r.squared

## [1] 0.1193432

#d # Age Coefficient estimate age_coeff <- res$coefficients[3] # Age Coefficient standard error age_Sder<-summary(res)$coefficients[3,"Std. Error"] #t statistic with Beta2 = 1

t_slope <- (age_coeff - (1)) / age_Sder t_slope

## k401ksubs$age ## -1.709944

# calculate Left sided p-value, degrees of freedom = n - 2 pvalue <- pt(t_slope,df=length(k401ksubs$age)-2) pvalue

## k401ksubs$age ## 0.04365485

#e res2 <- lm(k401ksubs$nettfa ~ k401ksubs$inc,subset = (k401ksubs$fsize==1)) res2

## ## Call: ## lm(formula = k401ksubs$nettfa ~ k401ksubs$inc, subset = (k401ksubs$fsize = = ## 1)) ## ## Coefficients: ## (Intercept) k401ksubs$inc ## -10.5710 0.8207

cov(k401ksubs$age,k401ksubs$inc)

## [1] 26.21035