need help with R


Today we will explore the Central Limit Theorem.

The Central Limit Theorem describes the sampling distribution of the sample mean, so before we talk about it, we must discuss sampling distributions. Recall that a probability distribution describes the likely values of a random number and how probable each value is. A sampling distribution does the same thing, but for a statistic. Let's look at this a little more closely.

Suppose we roll a fair die twice, and repeat this 100 times, so that we have 100 samples of size 2. (samples <- replicate(100, sample(1:6, 2, TRUE)))
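To see what the command above produces, here is a minimal sketch. The seed value is an assumption added only so the output is reproducible; it is not part of the original exercise.

```r
set.seed(1)  # assumed seed, only for reproducibility

# sample(1:6, 2, TRUE) rolls a fair die twice (sampling with replacement);
# replicate() repeats this 100 times and stores each pair as one column.
samples <- replicate(100, sample(1:6, 2, TRUE))

dim(samples)     # 2 rows (the two rolls) by 100 columns (the samples)
samples[, 1:5]   # the first five samples, one per column
```

Because each sample is a column, later calls such as apply(samples, 2, max) work column by column.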

Once we've got the samples, we could do various things with them, depending on what question we are trying to answer. For each sample, we could average the numbers, find the standard deviation of the numbers, find the minimum number, or find the maximum number. Each of these values, which is derived from a sample, is a different statistic. One definition of “statistic” is “a number that describes a sample.” For right now, let's focus on the maximum of the two values.

1. Create a new vector that contains the maximum value of the two dice rolls in each sample. (In the samples matrix, each column is one sample.)

Then create a histogram of the values in this vector, and paste it here.

In R:

maxsamples <- apply(samples,2,max)

hist(maxsamples)

This graph summarizes the possible values that the maximum can take and tells you how frequent each of those values was in the 100 samples. Of course, the more pairs of dice we roll, the more closely the resulting graph will match the true distribution of the maximum.
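As a rough check on that claim, the observed frequencies can be compared with the exact probabilities. The formula P(max = k) = (2k - 1)/36 is not in the original worksheet; it comes from counting the 36 equally likely outcomes of two rolls. The seed and the choice of 1000 samples are assumptions for illustration.

```r
set.seed(2)  # assumed seed, only for reproducibility

samples <- replicate(1000, sample(1:6, 2, TRUE))
maxsamples <- apply(samples, 2, max)

# Observed relative frequency of each possible maximum, 1 through 6
observed <- tabulate(maxsamples, nbins = 6) / length(maxsamples)

# Exact probabilities: (2k - 1) of the 36 outcomes have maximum k
exact <- (2 * (1:6) - 1) / 36

round(rbind(observed, exact), 3)
```

With more samples, the observed row should move closer to the exact row.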

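If yours is behaving like mine, the bar for 1 is drawn to the right of the value “1”, while the bar for 6 is drawn to the left of the value “6”. This happens because the first bin includes its left endpoint, while every other bin includes only its right endpoint. You can fix this by setting where the bins start and stop.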

hist(maxsamples, breaks=seq(.5,6,.5))
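An alternative, assuming you would rather have one bar per die face centered on that face, is to use breaks halfway between the integers. This is a sketch, not the worksheet's own choice of breaks; plot = FALSE is used here only to inspect the bin counts without drawing.

```r
set.seed(3)  # assumed seed, only for reproducibility

maxsamples <- apply(replicate(100, sample(1:6, 2, TRUE)), 2, max)

# Unit-width bins (0.5, 1.5], (1.5, 2.5], ..., (5.5, 6.5]
h <- hist(maxsamples, breaks = seq(0.5, 6.5, by = 1), plot = FALSE)

h$mids     # bin midpoints: 1, 2, ..., 6
h$counts   # one count per die face
```

Dropping plot = FALSE draws the histogram with these same bins.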

2. Repeat what you did above, but this time use 1000 samples instead of 100. Paste the new graph here, and describe the difference.

The Central Limit Theorem describes the shape of the distribution of sample means. Before we talk about the theorem itself, let's look at some graphs of sample means.

4. Starting with your dice roll data, create a vector that contains the sample means of the dice rolls (meansamples <- apply(samples, 2, mean)). Then create a histogram from the new vector. Paste your histogram here.

5. What if we had larger sample sizes? Recall that the sample size in our data was 2, and we had 1000 samples. Generate a new set of dice rolls, so that you have a sample size of 10, with 1000 samples. Then calculate the mean of each of these samples, and create a histogram of the means. Paste it here.
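One way to build the samples of size 10 is to change the arguments of the earlier replicate call. The names samples10 and meansamples10 and the seed are assumptions for this sketch.

```r
set.seed(4)  # assumed seed, only for reproducibility

# 1000 samples, each consisting of 10 rolls of a fair die (one sample per column)
samples10 <- replicate(1000, sample(1:6, 10, TRUE))

# Mean of each column; equivalent to apply(samples10, 2, mean)
meansamples10 <- colMeans(samples10)

length(meansamples10)   # 1000 sample means
mean(meansamples10)     # should be close to 3.5
```

Calling hist(meansamples10) then draws the histogram asked for in this step.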

Notice that the bell shape of the normal distribution is showing up here. This turns out to be true in general: if you graph the distribution of sample means, you get a bell shape. This is not true of other statistics, in general. The Central Limit Theorem applies to sample means:

Central Limit Theorem:

If the sample size n is large, then the distribution of the sample means is

· normal (bell-shaped)

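· with mean $\mu$ (the population mean)

· with standard deviation $\sigma / \sqrt{n}$ (the population standard deviation divided by the square root of the sample size)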

OK, so we already discussed the bell shape. What about the mean and standard deviation? Look back above at the graph you made of the means of samples of size 2 (number 4). Compare it to the graph you made of the means of samples of size 10 (number 5). Notice that both graphs are centered at about the same place, but the graph for the larger sample size is less spread out than the graph for the smaller sample size. Both are centered at the population mean, which for a dice roll is 3.5. The standard deviation of the sample means is the population standard deviation (1.71 for a dice roll) divided by the square root of the sample size (the sample size is 2 for number 4, and 10 for number 5). If we calculate the expected standard deviation of the sample means in number 4, we get 1.21, and if we calculate the expected standard deviation of the sample means in number 5, we get 0.54.
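The numbers quoted above can be reproduced with a few lines of arithmetic. This is a sketch; the variable names are assumptions. Note that the population standard deviation divides by 6, whereas R's sd() divides by n - 1, so sd(1:6) would not give 1.71.

```r
faces <- 1:6

pop_mean <- mean(faces)                       # 3.5
pop_sd <- sqrt(mean((faces - pop_mean)^2))    # population SD, about 1.71

expected_sd_2 <- pop_sd / sqrt(2)     # sample size 2 (number 4), about 1.21
expected_sd_10 <- pop_sd / sqrt(10)   # sample size 10 (number 5), about 0.54

round(c(pop_sd, expected_sd_2, expected_sd_10), 2)
```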

6. Calculate the standard deviations for the vectors you used in numbers 4 and 5, and compare them to the expected standard deviations. (sd(meansamples))

Now, the hard part is the first phrase in the statement of the theorem: “If the sample size n is large...” How large is large enough? It turns out that the answer is “it depends.” We saw that with dice rolls, a sample size of 10 was enough, but in general you need more. We'll stick with 30 as a general rule: if n>=30, we can assume we know the distribution of the sample mean; if n<30, we can't.

7. Select either an exponential distribution (rexp(samplesize, rate), where rate is 1 divided by the mean) or a binomial distribution with a p of .1 (rbinom(samplesize, n, prob=.1), where n is the number of trials). Select 10000 samples of size 10, calculate the mean of each sample, and create a histogram of the means. Then select 10000 samples of size 20, calculate the mean of each sample, and create a histogram of the means. Then select 10000 samples of size 30, calculate the mean of each sample, and create a histogram of the means. Below, paste your code and your three histograms. Comment on what happens as the sample size increases.
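A possible starting point for the exponential case is sketched below. The helper exp_means, the rate of 1, the seed, and the simple skewness measure are all assumptions for illustration, not part of the original exercise.

```r
set.seed(5)  # assumed seed, only for reproducibility

# Hypothetical helper: means of `nsamples` exponential samples of size `samplesize`
exp_means <- function(samplesize, nsamples = 10000, rate = 1) {
  colMeans(replicate(nsamples, rexp(samplesize, rate = rate)))
}

means10 <- exp_means(10)
means20 <- exp_means(20)
means30 <- exp_means(30)

# Simple skewness measure: 0 for a symmetric distribution, positive for a long right tail
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

sapply(list(n10 = means10, n20 = means20, n30 = means30), sd)
sapply(list(n10 = means10, n20 = means20, n30 = means30), skew)
```

Calling hist() on each vector of means gives the three histograms; the numeric summaries are only a second way to look at spread and asymmetry.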

8. Assume we interview 150 people and ask them the mileage on their car. If we assume that the population mean mileage is 100000 miles, with a population standard deviation of 30000 miles, what does the Central Limit Theorem tell us about the distribution of the sample mean?

9. In the situation in number 8, what is the probability that we get a sample mean between 99000 and 101000?

NormSamples <- replicate(1000, rnorm(150, mean=100000, sd=30000))

meansamples <- apply(NormSamples, 2, mean)

count <- sum(meansamples < 101000 & meansamples > 99000)

prob <- count/1000 #gives an empirical estimate of the probability

#To get the theoretical probability, use the command pnorm:

#Calculate the mean and standard deviation of the sample mean from the Central Limit Theorem. Suppose we call them meanofmeans and sdofmeans.

#then use pnorm(101000, meanofmeans, sdofmeans)-pnorm(99000, meanofmeans, sdofmeans)
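Put together, the theoretical calculation described in the comments above might look like the following sketch. The names pop_mean, pop_sd, n, and theoretical are assumptions; meanofmeans and sdofmeans follow the naming suggested in the comments.

```r
pop_mean <- 100000   # population mean mileage
pop_sd <- 30000      # population standard deviation
n <- 150             # sample size

# Central Limit Theorem: the sample mean has mean mu and SD sigma / sqrt(n)
meanofmeans <- pop_mean
sdofmeans <- pop_sd / sqrt(n)   # about 2449.49

theoretical <- pnorm(101000, meanofmeans, sdofmeans) -
  pnorm(99000, meanofmeans, sdofmeans)
theoretical   # roughly 0.32
```

The empirical estimate from the simulation should land near this value, but it will change from run to run.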

10. In the previous problem, you were given code for finding an empirical probability and for finding a theoretical probability. What is the difference between empirical and theoretical probability?
