
GEOG 3000 - Advanced Geographic Statistics

Midterm Exam: Winter 2021

Submit: one Word file (.doc/.docx) and one Excel file (.xls/.xlsx) that includes your calculations

General Instructions:

You have 7 days to complete this exam. This exam is different from our labs and should reflect

your work only, although you may use your notes and books. You may consult the instructor if

you have questions, but do not discuss the exam with your classmates. The exam questions are

included in this document and the data is in the accompanying Excel document. Worksheets are

labeled by question number (for example, question 3 data is labeled “Q3”). Please make sure that

you answer all of the questions and attach all relevant graphs and table outputs to your write-up

(you will need to submit the Excel spreadsheet as well). Format all documents so that they are

organized and easy to interpret. The numbers in the parentheses shown after every question

number are the points associated with that question. There are a total of 100 points.

Section I. Basic Statistics

1. (6) It has been suggested that the average Facebook user spends 6 hours a week on the site.

If we assume a normal distribution, with a population standard deviation of 3 hours…

i. (3) What percentage of users spend more than 9 hours on Facebook?

ii. (3) What percentage of users spend between 2 and 4 hours on Facebook?
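
The two probabilities in question 1 are areas under a normal curve. A minimal sketch of that calculation, assuming Python's standard-library `statistics.NormalDist` stands in for the z-table (the variable names are illustrative, not part of the exam):

```python
from statistics import NormalDist

# Weekly Facebook hours modeled as Normal(mean = 6, sd = 3), as stated in question 1.
hours = NormalDist(mu=6, sigma=3)

# P(X > 9): the area to the right of 9 hours.
p_more_than_9 = 1 - hours.cdf(9)

# P(2 < X < 4): the difference between two cumulative probabilities.
p_between_2_and_4 = hours.cdf(4) - hours.cdf(2)

print(f"P(X > 9)     = {p_more_than_9:.4f}")
print(f"P(2 < X < 4) = {p_between_2_and_4:.4f}")
```

The same areas can be read from a z-table after converting each bound to z = (x - 6) / 3.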

2. (6) On final exams, Sarah scored 90 in Economics and Martha scored 85 in Geography. We

can assume that the students’ scores within each subject are approximately “normal” in

distribution. The mean scores in Economics and Geography were both 72. The standard deviation

in Economics was 12 points, while in Geography it was 8.

Which of the two students has a better score, compared with his/her fellow students? Please show

how you arrived at your conclusion.
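
One way to compare scores from two different distributions is to standardize each one. A minimal sketch, assuming a z-score comparison is the intended approach (the helper name `z_score` is hypothetical):

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above its class mean."""
    return (score - mean) / sd

# Values taken from question 2.
sarah_z = z_score(90, mean=72, sd=12)   # Economics
martha_z = z_score(85, mean=72, sd=8)   # Geography

print(f"Sarah:  z = {sarah_z:.3f}")
print(f"Martha: z = {martha_z:.3f}")
```

The student with the larger z-score stands higher relative to her own class.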

3. (6) The “Q3” tab in the midterm Excel data sheet contains data on the maximum August

temperatures for Denver between 1958 and 2017. Calculate the means and the standard

deviations for the entire sample, and then separately for 1958-1987 and 1988-2017. Give the

equations used to calculate the values.

Select an appropriate bin size and draw a histogram of relative frequencies for the August high

temperature between 1958 and 2017.
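
The quantities in question 3 can be sketched in a few lines. This is an illustration only: the `temps` list holds placeholder values, NOT the Denver data from the "Q3" worksheet, and the helper names, bin start, and bin width are assumptions chosen for the example.

```python
from math import sqrt

def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by n."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation, using the n - 1 denominator."""
    m = sample_mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

def relative_frequencies(values, start, width, n_bins):
    """Share of observations in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        k = int((x - start) // width)
        k = min(max(k, 0), n_bins - 1)  # out-of-range values go to the nearest edge bin
        counts[k] += 1
    return [c / len(values) for c in counts]

# Placeholder temperatures for illustration only (not the exam data).
temps = [84.0, 86.5, 85.0, 88.0, 87.5, 90.0, 86.0, 89.0]

print(f"mean = {sample_mean(temps):.2f}, sd = {sample_sd(temps):.2f}")
freqs = relative_frequencies(temps, start=84, width=2, n_bins=4)
print(freqs)
```

In Excel, `AVERAGE` and `STDEV.S` compute the same mean and sample standard deviation.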

4. (4) What is the difference between random sampling, stratified random sampling, and cluster

sampling? How might random sampling of the entire population in the US help us understand

the current spread of COVID-19, rather than just testing people who show symptoms?

Section II: Confidence Intervals and Hypothesis Testing

5. (20) Transparency International compiles an index of corruption for over 150 world nations.

The data for these nations for 2018 is provided in the midterm Excel file. The 24 nations for

which oil and gas make up more than 50% of export revenues are separated from the other 127

nations. Note that higher values in the index indicate lower levels of corruption.

i. (4) Find the means and variances of the corruption index for each of the two groups

of countries.

ii. (8) Construct a 95% confidence interval for the true mean difference between the

two groups.

iii. (8) Using a t-statistic, determine whether the difference in mean corruption between the

two groups of countries is statistically significant at the 95% confidence level. Give your

answer in the form of a classical hypothesis test.
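
Parts ii and iii of question 5 rest on the same standard error. The sketch below is one possible setup, assuming an unpooled (Welch) standard error; the course may instead expect a pooled-variance version. The function name, the toy samples, and the critical value passed in are all assumptions for illustration, not the exam data.

```python
from math import sqrt
from statistics import mean, variance

def difference_interval_and_t(sample_a, sample_b, t_critical):
    """Confidence interval for mean(a) - mean(b) and the t statistic for H0: equal means.

    Uses the unpooled standard error sqrt(s_a^2/n_a + s_b^2/n_b).
    The caller looks up t_critical in a t-table for the chosen confidence level.
    """
    diff = mean(sample_a) - mean(sample_b)
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    t_stat = diff / se
    interval = (diff - t_critical * se, diff + t_critical * se)
    return interval, t_stat

# Toy index values for illustration only (not the Transparency International data).
oil_exporters = [20, 25, 30, 35]
other_nations = [40, 45, 50, 55, 60]

# 2.365 is the two-tailed 5% t-table value for 7 degrees of freedom,
# which is close to the Welch degrees of freedom for this toy data.
interval, t_stat = difference_interval_and_t(oil_exporters, other_nations, t_critical=2.365)
print(f"95% CI for the difference: ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"t = {t_stat:.3f}")
```

If the interval excludes 0, the two-tailed test at the matching significance level rejects the hypothesis of equal means; the two parts should always agree.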

6. (6) A group of researchers at the University of Texas-Houston conducted a comprehensive

study of pregnant cocaine-dependent women (Journal of Drug Issues, Summer 1997). All the

women in the study used cocaine on a regular basis for more than a year. One of the many

variables measured was birth weight (in grams) of the baby delivered. For a sample of 16 cocaine-

dependent women, the mean birth weight was 2,971 grams and the standard deviation was 410

grams. Test the hypothesis that the true mean birth weight of babies delivered by cocaine-

dependent women is less than 3,100 grams. Use alpha = .05.
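
The test in question 6 is a lower-tailed one-sample t-test. A minimal sketch using the summary figures from the question; the critical value is taken from a standard t-table for 15 degrees of freedom, and the variable names are illustrative:

```python
from math import sqrt

# Summary statistics from question 6.
n = 16          # sample size
x_bar = 2971    # sample mean birth weight (grams)
s = 410         # sample standard deviation (grams)
mu_0 = 3100     # hypothesized mean under H0

# H0: mu = 3100   versus   Ha: mu < 3100
t_stat = (x_bar - mu_0) / (s / sqrt(n))

# Lower-tail critical value from a t-table: df = n - 1 = 15, alpha = .05.
t_critical = -1.753

reject_h0 = t_stat < t_critical
print(f"t = {t_stat:.3f}, critical value = {t_critical}, reject H0: {reject_h0}")
```

Because the alternative is one-sided, only the lower tail of the t-distribution is used for the rejection region.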

7. (3) Confidence bands for population means are smaller if you (1) know the standard deviation

of the population, instead of estimating it from the sample, or (2) if you decrease the level of

confidence required. In addition to these two possibilities, how else could you obtain a smaller

confidence band?

8. (3) We can calculate confidence bands for means by using z-values from a normal distribution

table (or t-values from a t-distribution table), even if the population under study is not normally

distributed. Briefly explain the property of random samples that makes this possible.

9. (2) If the p-value is larger than the significance level, what does this suggest about the

hypothesis test?

a. Reject the null hypothesis

b. Fail to reject the null hypothesis

10. (2) Concluding that no significant effect exists when in fact one does is which of the

following?

a. False positive

b. False negative

11. (2) Which error type is avoided if the null hypothesis is true?

a. Type I

b. Type II

Section III: Bivariate regression

12. (20) A “Happiness” statistic for different countries was compiled by the World Values

Survey. The "Happiness (net)" statistic was calculated as the percentage of people who rated

themselves as either "quite happy" or "very happy" minus the percentage of people who rated

themselves as either "not very happy" or "not at all happy". Now we want to know if reading

comprehension affects happiness. Conduct a regression analysis for predicting “Happiness”

from “Reading Comprehension” by answering the following questions. You can use the

regression tool in the data analysis toolpak, rather than performing regression manually in the

spreadsheet.

i. (2) State the dependent and independent variables and explain your selection.

ii. (2) Make a scatterplot of the variables and comment on the relationship between X

and Y evident from the scatterplot.

iii. (5) Perform a linear regression on the dataset. What is the regression equation?

What is the r²? What does the r² mean?

iv. (2) Add a regression line to the scatterplot.

v. (2) Test, at the .05 level of significance, whether the slope is 0.

vi. (5) Show and comment on the residual plot. Are there any apparent violations of

regression assumptions or outliers?

vii. (2) Calculate the expected net happiness of Hungary, which has a reading

comprehension index of 470.
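
The regression equation, r², and the prediction in part vii follow from a few sums. A minimal least-squares sketch, assuming toy data: the `reading` and `happiness` lists are placeholders, NOT the World Values Survey figures, and `fit_line` is a hypothetical helper rather than the Excel toolpak output.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns slope, intercept, r^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of the variance in y explained by x
    return slope, intercept, r_squared

# Placeholder data for illustration only (not the exam worksheet).
reading = [400, 450, 500, 550]      # independent variable X
happiness = [50, 60, 65, 80]        # dependent variable Y

slope, intercept, r_squared = fit_line(reading, happiness)
predicted = intercept + slope * 470  # plug X = 470 into the fitted equation

print(f"Happiness = {intercept:.2f} + {slope:.3f} * Reading, r^2 = {r_squared:.3f}")
print(f"Predicted net happiness at reading = 470: {predicted:.1f}")
```

With the real worksheet data, the slope and intercept should match the coefficients reported by the regression tool in the data analysis toolpak.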

13. (20) In the worksheet, Q12 lists a data set that you want to investigate for the relationship

between body weight and brain weight of some animals.

Show and comment on the scatterplot, discussing the apparent relationship between the two

variables.

Conduct a regression analysis (in data analysis toolpak) for the two variables and report the

results. In particular, comment on the strength of the relationship (i.e., coefficient of

determination), the regression coefficients, their significance, and analyze the residual plot for

violations of regression assumptions and outliers.

Try to improve your initial regression by performing log transformations and/or eliminating

outlier(s). After each change, perform a new regression analysis and report the results, making

sure to discuss all of the points raised above. Pay particular attention to changes in the coefficient

of determination and in the residual plot. After performing 2 regression analyses with

transformed variables and/or removed outliers, compare your results and determine the best

regression equation. Explain why you chose that particular result.
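
The log transformation mentioned in question 13 can be sketched as follows. The data are hypothetical placeholders, NOT the animal data from the worksheet, and `fit_line` is an illustrative helper. The point is only the mechanics: transform both variables, refit, and compare the coefficient of determination.

```python
from math import log10

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns slope, intercept, r^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)

# Hypothetical body and brain weights, for illustration only.
body = [0.1, 1.0, 10.0, 100.0, 1000.0]
brain = [0.2, 0.9, 6.0, 30.0, 180.0]

# Regression on the raw values.
raw_slope, raw_intercept, raw_r2 = fit_line(body, brain)

# Regression after taking log10 of both variables.
log_body = [log10(v) for v in body]
log_brain = [log10(v) for v in brain]
log_slope, log_intercept, log_r2 = fit_line(log_body, log_brain)

print(f"raw:     r^2 = {raw_r2:.4f}")
print(f"log-log: slope = {log_slope:.3f}, r^2 = {log_r2:.4f}")
```

A high r² on the raw values can be driven by one very large observation, so the residual plot remains the main evidence when choosing between the two fits.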