GEOG 3000 - Advanced Geographic Statistics
Midterm Exam: Winter 2021
Submit: One Word file (.doc/.docx) and one Excel file (.xls/.xlsx) including calculations
General Instructions:
You have 7 days to complete this exam. This exam is different from our labs and should reflect
your work only, although you may use your notes and books. You may consult the instructor if
you have questions, but do not discuss the exam with your classmates. The exam questions are
included in this document and the data is in the accompanying Excel document. Worksheets are
labeled by question number (for example, question 3 data is labeled “Q3”). Please make sure that
you answer all of the questions and attach all relevant graphs and table outputs to your write-up
(you will need to submit the Excel spreadsheet as well). Format all documents so that they are
organized and easy to interpret. The numbers in the parentheses shown after every question
number are the points associated with that question. There are a total of 100 points.
Section I: Basic Statistics
1. (6) It has been suggested that the average Facebook user spends 6 hours a week on the site.
If we assume a normal distribution, with a population standard deviation of 3 hours…
i. (3) What percentage of users spend more than 9 hours on Facebook?
ii. (3) What percentage of users spend between 2 and 4 hours on Facebook?
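As a rough illustration of the normal-distribution calculation in question 1, the sketch below uses Python's standard-library `statistics.NormalDist` with the mean (6 hours) and standard deviation (3 hours) stated in the question. The variable names are illustrative only; the same areas can be obtained from a z-table or a spreadsheet function.

```python
from statistics import NormalDist

# Weekly hours on the site, using the mean and standard deviation given in question 1.
hours = NormalDist(mu=6, sigma=3)

# P(X > 9): area in the upper tail beyond 9 hours.
p_more_than_9 = 1 - hours.cdf(9)

# P(2 < X < 4): area under the curve between 2 and 4 hours.
p_between_2_and_4 = hours.cdf(4) - hours.cdf(2)

print(f"P(X > 9)     = {p_more_than_9:.4f}")
print(f"P(2 < X < 4) = {p_between_2_and_4:.4f}")
```

Multiplying each probability by 100 gives the percentage of users.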
2. (6) On final exams, Sarah scored 90 in Economics and Martha scored 85 in Geography. We
can assume that the students’ scores within each subject are approximately “normal” in
distribution. The mean scores in Economics and Geography were both 72. The standard deviation
in Economics was 12 points, while in Geography it was 8.
Which of the two students has the better score relative to her fellow students? Please show
how you arrived at your conclusion.
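One possible way to set up the comparison in question 2 is to standardize each score against its own class, as sketched below. The helper name `z_score` is hypothetical; the scores, means, and standard deviations are those given in the question.

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (or below) the class mean."""
    return (score - mean) / sd

# Values taken from question 2.
sarah_z = z_score(90, mean=72, sd=12)   # Economics
martha_z = z_score(85, mean=72, sd=8)   # Geography

print(f"Sarah:  z = {sarah_z:.3f}")
print(f"Martha: z = {martha_z:.3f}")
```

The student with the larger z-score did better relative to her classmates.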
3. (6) The “Q3” tab in the midterm Excel data sheet contains data on the maximum August
temperatures for Denver between 1958 and 2017. Calculate the means and the standard
deviations for the entire sample, and then separately for 1958-1987 and 1988-2017. Give the
equations used to calculate the values.
Select an appropriate bin size and draw a histogram of relative frequencies for the August high
temperature between 1958 and 2017.
4. (4) What is the difference between random sampling, stratified random sampling, and cluster
sampling? How might random sampling of the entire population in the US help us understand
the current spread of COVID-19, rather than just testing people who show symptoms?
Section II: Confidence Intervals and Hypothesis Testing
5. (20) Transparency International compiles an index of corruption for over 150 world nations.
The data for these nations for 2018 are provided in the midterm Excel file. The 24 nations for
which oil and gas make up more than 50% of export revenues are separated from the other 127
nations. Note that higher values in the index indicate lower levels of corruption.
i. (4) Find the means and variances of the corruption index for each of the two groups
of countries.
ii. (8) Construct a 95% confidence interval for the true mean difference between the
two groups.
iii. (8) Using a t-statistic, determine whether the difference in mean corruption between the
two groups of countries is statistically significant at the 95% confidence level. Give your
answer in the form of a classical hypothesis test.
6. (6) A group of researchers at the University of Texas-Houston conducted a comprehensive
study of pregnant cocaine-dependent women (Journal of Drug Issues, Summer 1997). All the
women in the study used cocaine on a regular basis for more than a year. One of the many
variables measured was birth weight (in grams) of the baby delivered. For a sample of 16 cocaine-
dependent women, the mean birth weight was 2,971 grams and the standard deviation was 410
grams. Test the hypothesis that the true mean birth weight of babies delivered by cocaine-
dependent women is less than 3,100 grams. Use alpha = .05.
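The test in question 6 can be sketched as a one-sample, lower-tailed t-test. The sample size, mean, standard deviation, and hypothesized mean below come from the question; the critical value of 1.753 is an assumption read from a standard t-table (one tail, alpha = .05, 15 degrees of freedom), since Python's standard library has no t-distribution.

```python
from math import sqrt

# Values given in question 6.
n = 16
sample_mean = 2971.0   # grams
sample_sd = 410.0      # grams
mu0 = 3100.0           # hypothesized mean under H0

# H0: mu >= 3100   versus   Ha: mu < 3100 (lower-tailed test).
standard_error = sample_sd / sqrt(n)
t_stat = (sample_mean - mu0) / standard_error

# Assumption: critical value taken from a t-table for df = n - 1 = 15, one tail, alpha = .05.
t_critical = -1.753

reject_null = t_stat < t_critical
print(f"t = {t_stat:.3f}, critical value = {t_critical}, reject H0: {reject_null}")
```

The decision rule is to reject the null hypothesis only if the test statistic falls below the negative critical value.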
7. (3) Confidence bands for population means are smaller if you (1) know the standard deviation
of the population instead of estimating it from the sample, or (2) decrease the level of
confidence required. In addition to these two possibilities, how else could you obtain a smaller
confidence band?
8. (3) We can calculate confidence bands for means by using z-values from a normal distribution
table (or t-values from a t-distribution table), even if the population under study is not normally
distributed. Briefly explain the property of random samples that makes this possible.
9. (2) If the p-value is larger than the significance level, what does this suggest about the
hypothesis test?
a. Reject the null hypothesis
b. Fail to reject the null hypothesis
10. (2) Concluding that no significance exists when in fact it does is which of the
following?
a. False positive
b. False negative
11. (2) Which error type is avoided if the null hypothesis is true?
a. Type I
b. Type II
Section III: Bivariate Regression
12. (20) A “Happiness” statistic for different countries was compiled by the World Values
Survey. The "Happiness (net)" statistic was calculated as the percentage of people who rated
themselves as either "quite happy" or "very happy" minus the percentage of people who rated
themselves as either "not very happy" or "not at all happy". Now we want to know if reading
comprehension affects happiness. Conduct a regression analysis for predicting “Happiness”
from “Reading Comprehension” by answering the following questions. You can use the
regression tool in the Data Analysis ToolPak, rather than performing regression manually in the
spreadsheet.
i. (2) State the dependent and independent variables and explain your selection.
ii. (2) Make a scatterplot of the variables and comment on the relationship between X
and Y evident from the scatterplot.
iii. (5) Perform a linear regression on the dataset. What is the regression equation?
What is the r²? What does the r² mean?
iv. (2) Add a regression line to the scatterplot.
v. (2) Provide a .05 level of significance test for the slope being 0.
vi. (5) Show and comment on the residual plot. Are there any apparent violations of
regression assumptions or outliers?
vii. (2) Calculate the expected net happiness of Hungary, which has a reading
comprehension index of 470.
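For reference, the quantities reported by a simple linear regression (slope, intercept, and r²) can be sketched by hand as below. The function name `fit_line` and the five data points are hypothetical toy values chosen only to show the arithmetic; they are not the exam's Happiness or Reading Comprehension data, which live in the Excel file.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)   # coefficient of determination
    return slope, intercept, r2

# Hypothetical toy data, NOT the exam data.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

slope, intercept, r2 = fit_line(toy_x, toy_y)
print(f"y = {intercept:.2f} + {slope:.2f} x, r2 = {r2:.2f}")

# A prediction for a new x value follows directly from the fitted equation.
predicted = intercept + slope * 6
```

The spreadsheet's regression tool reports these same quantities, along with standard errors and p-values that this sketch omits.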
13. (20) In the worksheet, Q12 lists a data set that you want to investigate for the relationship
between body weight and brain weight of some animals.
Show and comment on the scatterplot, discussing the apparent relationship between the two
variables.
Conduct a regression analysis (in the Data Analysis ToolPak) for the two variables and report the
results. In particular, comment on the strength of the relationship (i.e., coefficient of
determination), the regression coefficients, and their significance, and analyze the residual plot
for violations of regression assumptions and outliers.
Try to improve your initial regression by performing log transformations and/or eliminating
outlier(s). After each change, perform a new regression analysis and report the results, making
sure to discuss all of the points raised above. Pay particular attention to changes in the coefficient
of determination and in the residual plot. After performing 2 regression analyses with
transformed variables and/or removed outliers, compare your results and determine the best
regression equation. Explain why you chose that particular result.