Assignment - R Project
STA5990 - MathStat Tools Dr. A Cohen
Fall 2018 Due at 11:59 pm on Sunday October 21, 2018
This assignment covers how to use R to manage datasets, perform statistical analyses, and visualize data. You should submit a single PDF file typeset with LaTeX. Graphs should be produced using the ggplot2 package. A single separate, well-commented .R file with all the commands used to do your work should also be submitted.
Problem A: Data management
Suppose we want to find the closest pair of observations for some variable. This might be useful to test whether some data had been accidentally duplicated. In this problem, we will create some sample data, say rnorm(n), and then create the IDs for each observation, say 1, 2, ..., n. The result should give the IDs of the two closest observations and the distance between them (their difference). You may need the functions order(), diff(), and which.min().
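A minimal sketch of the idea, not a definitive solution: in one dimension the closest pair is always adjacent once the values are sorted, so it is enough to look at the gaps between sorted neighbours. The function name closest_pair() and the seed are assumptions made for illustration.

```r
# Hypothetical helper: return the IDs of the two closest values and their gap.
closest_pair <- function(x, id = seq_along(x)) {
  ord  <- order(x)            # indices that sort x in increasing order
  gaps <- diff(x[ord])        # distances between neighbours in sorted order
  k    <- which.min(gaps)     # position of the smallest gap
  list(ids = id[ord[c(k, k + 1)]], distance = gaps[k])
}

set.seed(1)                   # arbitrary seed, only for reproducibility
x <- rnorm(10)                # sample data
closest_pair(x)               # IDs default to 1, 2, ..., n
```

Sorting first keeps the work at O(n log n) instead of comparing all n(n − 1)/2 pairs.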
Problem B: Law of large numbers
The “law of large numbers” concerns the convergence of the arithmetic average to the expected value, as sample sizes increase. We will calculate and plot a moving average in this problem.
Create a function mymovave(n, typedist, ...) that takes at least two arguments: a sample size n and a function typedist() used to generate samples from a distribution. Generate n = 1000 observations from the Cauchy and t distributions. Plot the moving average for both distributions and comment on the results.
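One possible shape for such a function is sketched below. It assumes that "moving average" means the running mean of the first 1, 2, ..., n observations, and it picks df = 4 for the t distribution only as an example; neither choice is specified by the assignment. The plot is drawn only if ggplot2 is installed.

```r
# Running mean of n draws from the generator typedist; extra arguments
# in ... are passed on to typedist (e.g. df for rt).
mymovave <- function(n, typedist, ...) {
  x <- typedist(n, ...)        # draw n observations
  cumsum(x) / seq_len(n)       # mean of the first 1, 2, ..., n draws
}

set.seed(2018)                 # arbitrary seed, only for reproducibility
n  <- 1000
ma <- data.frame(
  index        = rep(seq_len(n), 2),
  average      = c(mymovave(n, rcauchy), mymovave(n, rt, df = 4)),
  distribution = rep(c("Cauchy", "t (df = 4)"), each = n)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(ma, ggplot2::aes(x = index, y = average)) +
    ggplot2::geom_line() +
    ggplot2::facet_wrap(~ distribution, scales = "free_y")
  print(p)
}
```

Passing the generator as a function keeps mymovave() independent of any particular distribution.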
Problem C: Diploma problem (hat-check)
You have most likely heard about this problem before. Smith College is a residential women's liberal arts college in Northampton, MA that is steeped in tradition. One such tradition is to give each student at graduation a diploma at random. At the end of the ceremony, a diploma circle is formed, and students pass the diplomas that they receive to the person next to them, and step out once they've received their own diploma. What is the expected number of students who receive their own diplomas in the initial disbursement?
The analytic solution (the expected value) is easy to derive. Let X_i be the indicator of the event that the ith student receives their own diploma. Then E(X_i) = 1/n for all i, where n is the number of diplomas/students (the diplomas are uniformly distributed). Thus, if Y is the sum of all the X_i, then E(Y) = n · (1/n) = 1. It may be surprising that the expected number of students receiving their own diplomas in the initial disbursement does not depend on n. The variance is more difficult to derive because the X_i are dependent.
We will solve the problem using simulations with R. Simulate the problem and find the expected value and the variance of the number of students who receive their diplomas in the initial giving.
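The simulation can be sketched as follows, under the assumption that a random disbursement is a uniformly random permutation of the diplomas. The function name simulate_matches(), the class size of 50, and the number of replications are illustrative choices, not values given in the assignment.

```r
# Hypothetical helper: for each simulated ceremony, count how many students
# receive their own diploma (fixed points of a random permutation).
simulate_matches <- function(n_students, n_sims = 10000) {
  replicate(n_sims, sum(sample(n_students) == seq_len(n_students)))
}

set.seed(5990)                 # arbitrary seed, only for reproducibility
matches <- simulate_matches(n_students = 50)

mean(matches)                  # simulated estimate of E(Y)
var(matches)                   # simulated estimate of Var(Y)
```

Rerunning with different values of n_students is a quick way to check the claim that E(Y) does not depend on n.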
Problem D: More Practice with R
The standard error of a proportion in a binomial experiment (n trials, each with probability of success p) is given by:

$$\sqrt{\frac{p(1 - p)}{n}} \qquad (1)$$
We are going to use the Economic.csv dataset. The data come from a large study, based on tax records, which allowed researchers to link the income of adults to the income of their parents several decades previously. For privacy reasons, we don't have the individual-level data, but we do have aggregate statistics about economic mobility for several hundred communities, containing most of the American population, and covariate information about those communities.
The dataset Economic.csv¹ has information on 741 communities (cities and their suburbs and exurbs, but also many rural areas with integrated economies). We will use three variables:
Mobility: the probability that a child born in 1980-1982 into the lowest quintile (20%) of household income will be in the top quintile at age 30. Individuals are assigned to the community they grew up in, not the one they were in as adults.
Population: the population in 2000.
State: the state of the principal city or town of the community.
1. Write a function stderr.prop() to calculate the standard error for proportions. This function has two arguments: a vector of proportions p and a vector of n (numbers of trials), and it returns a vector of standard errors.
2. Check that the result of stderr.prop(), when given a vector of different n's with the same p, is proportional to 1/√n.
¹ The data file can be downloaded from eLearning or from HERE.
3. Check the function when given p = c(0.25, 0.75) and n = c(15, 88). You should also work out what the proper answer should be.
4. Use stderr.prop() to find the standard error of the Mobility variable for each community. Report the summary statistics.
5. Plot the histogram of the standard errors (use ggplot2 ).
6. Plot a scatter plot of the standard errors (y-axis) against population (x-axis) (ggplot2).
7. Plot a scatter plot of the standard errors (y-axis) against mobility (x-axis) (ggplot2).
8. Create a function wms() to calculate a weighted mean squared error. It should take as arguments a vector of predicted values, a vector of observed values, and a vector of weights. It returns a single real number, the weighted mean squared error, where x_i are the observed values, x̂_i the predicted values, and w_i the weights:

$$\frac{\sum_{i=1}^{n} w_i (x_i - \hat{x}_i)^2}{\sum_{j=1}^{n} w_j} \qquad (2)$$
9. Check wms() for the predicted values c(0.1, 0.8), the observed values c(0.13, 0.79), and the weights c(0.05, 0.3).
10. Create two maps, one for Mobility and the other for its standard errors, using the ggmap and ggplot2 packages. You may need to average the data by State. An example of a map would be as follows:
[Figure: example map titled "Mobility in the 48 States"; x-axis: long, y-axis: lat; color scale: Mobility]
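For items 1 and 8 of this problem, the two functions might be sketched as below, following equations (1) and (2) term by term. The input checks are an added assumption, not a requirement stated in the assignment, and the argument names of wms() are illustrative.

```r
# Standard error of a proportion, equation (1); vectorised over p and n.
stderr.prop <- function(p, n) {
  stopifnot(all(p >= 0 & p <= 1), all(n > 0))   # assumed sanity checks
  sqrt(p * (1 - p) / n)
}

# Weighted mean squared error, equation (2).
wms <- function(predicted, observed, weights) {
  stopifnot(length(predicted) == length(observed),
            length(observed) == length(weights))
  sum(weights * (observed - predicted)^2) / sum(weights)
}

# Calls with the inputs from items 3 and 9; compare against hand calculations.
stderr.prop(p = c(0.25, 0.75), n = c(15, 88))
wms(predicted = c(0.1, 0.8), observed = c(0.13, 0.79), weights = c(0.05, 0.3))

# Item 2: with p fixed, se * sqrt(n) should be the same for every n.
n <- c(10, 100, 1000)
stderr.prop(0.3, n) * sqrt(n)
```

Because both functions use only vectorised arithmetic, they can be applied directly to whole columns of a data frame.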
Problem E: Regression Analysis
Find a dataset (check out this Rdatasets list) and perform a regression analysis. Your data should have:
• A response variable: numeric.
• Explanatory variables: at least one numeric and at least one categorical variable.
• You may need to plot your dataset to see if the relationship is linear. If the relationship is not linear, you may want to consider a transformation or find another dataset.
• Discuss the results.
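As an illustration of the required data shape only, the sketch below uses R's built-in iris data, which has a numeric response, a numeric predictor, and a categorical predictor. The choice of dataset and model formula is an assumption for demonstration; the assignment asks you to pick your own dataset.

```r
# Numeric response (Sepal.Length), one numeric predictor (Petal.Length),
# and one categorical predictor (Species, a factor with three levels).
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
summary(fit)                   # coefficients, standard errors, R-squared

# Quick visual check of linearity, drawn only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    iris,
    ggplot2::aes(x = Petal.Length, y = Sepal.Length, colour = Species)
  ) +
    ggplot2::geom_point()
  print(p)
}
```

lm() expands the factor into indicator variables automatically, so each non-reference level of Species gets its own coefficient.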