Reflection paper Applied Business Analytics

Basic Statistical Concepts Part 2

THE STANDARD NORMAL PROBABILITY DISTRIBUTION (finding normal probabilities)

For every y value there is a z value, thus there is a population of z values corresponding to the population of y values.

If y is randomly selected from a normally distributed population with mean µ and standard deviation σ, the z value corresponding to the y value is

$$z = \frac{y - \mu}{\sigma}$$

Standard normal distribution – if y is normally distributed with mean µ and standard deviation σ, then z is normally distributed with mean 0 and standard deviation 1.

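If we subtract µ from each part of the inequality

$$a \le y \le b$$

and divide by σ, we obtain the following inequalities:

$$\frac{a - \mu}{\sigma} \le \frac{y - \mu}{\sigma} \le \frac{b - \mu}{\sigma}, \quad \text{or} \quad z_a \le z \le z_b$$

Note that $z_a$ is the z value corresponding to a, and $z_b$ is the z value corresponding to b. We see that

$$P(a \le y \le b) = P(z_a \le z \le z_b)$$

This is the area under the standard normal curve corresponding to the interval $[z_a, z_b]$.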

Example 1:

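If the population of all mileages is normally distributed with mean µ = 31.5 and standard deviation σ = .8, what is $P(29.9 \le y \le 33.1)$, the probability that a randomly selected mileage lies between 29.9 and 33.1?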

Hints: Solve for the z value corresponding to 29.9 and 33.1. What do these tell us? Now, use the normal table in the appendix of your book (you can also get one online) to solve for the probability.

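Note: A normal distribution table often gives the area under the standard normal curve between 0 and a positive z value $z_0$,

$$P(0 \le z \le z_0)$$

for values of $z_0$ ranging from .00 to 3.09. Looking at this table, we see that

$$P(0 \le z \le 2) = .4772$$

Since the standard normal curve is symmetrical, it must be that

$$P(-2 \le z \le 0) = .4772, \quad \text{so} \quad P(-2 \le z \le 2) = .4772 + .4772 = .9544$$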
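As a cross-check on the table lookup in Example 1, the same probability can be computed in Stata. This is a minimal sketch: "normal()" is Stata's standard normal cumulative distribution function, and the scalar names za and zb are chosen here only for illustration.

```stata
* z values for the endpoints 29.9 and 33.1 (mu = 31.5, sigma = .8)
scalar za = (29.9 - 31.5) / .8
scalar zb = (33.1 - 31.5) / .8
display za " " zb

* area under the standard normal curve between za and zb
display normal(zb) - normal(za)
```

The displayed area should agree with the table-based answer up to rounding.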

To perform the calculations involved in some statistical inference procedures, we need to find the z value such that the area to its right under the standard normal curve is γ. This value is denoted $z_{[\gamma]}$.

The point $z_{[\gamma]}$ on the scale of the standard normal distribution is called the critical z value. This value is often compared to the z values for sample statistics.

Example 2:

Find z[.025].

Hints: The area under the standard normal curve between zero and z[.025] must equal .5-.025 = .475. Looking at a normal distribution table, we see that the z value corresponding to an area of .4750 is 1.96.

This says that we must be 1.96 standard deviations above the mean to obtain a right-hand tail area of .025.
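The table lookup in Example 2 can also be checked in Stata. The sketch below assumes only the built-in function "invnormal()", which returns the z value with a given area to its left, so a right-hand tail area of .025 corresponds to a left-hand area of 1 - .025.

```stata
* critical z value with area .025 to its right
* (equivalently, area .975 to its left)
display invnormal(1 - .025)
```

The result should round to the 1.96 found in the table.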

THE t-DISTRIBUTION, THE F-DISTRIBUTION, AND THE CHI-SQUARE DISTRIBUTION

Sometimes a population has what is called a t-distribution. We use the t-distribution when the sample size is small and the population standard deviation must be estimated from the sample.

Google search “t-distribution” to review the properties of a t-distribution. Here is a great video:

https://www.youtube.com/watch?v=Uv6nGIgZMVw

To perform calculations involved in some statistical inference procedures, we need to find the t value such that the area to its right under the t-distribution is γ. This value is denoted $t_{[\gamma]}^{(df)}$.

In general, we refer to $t_{[\gamma]}^{(df)}$ as the point on the scale of the t-distribution having df degrees of freedom such that the area under the curve to the right of this point is γ.

The point $t_{[\gamma]}^{(df)}$ on the scale of the t-distribution is called the critical t value.

This value is often compared to the t values for sample statistics.

We can find this point using a t-table (there is one in the back of the book, but you can find one online, too).
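Instead of a printed t-table, the critical t value can be looked up in Stata. This is a small sketch using the built-in function "invttail(df, p)", which returns the t value with area p to its right. The two calls below use the degrees of freedom and tail areas that appear in Examples 5 and 6 of these notes.

```stata
* critical t value with 5 df and area .025 in the right tail
display invttail(5, .025)

* critical t value with 4 df and area .05 in the right tail
display invttail(4, .05)
```

These should round to the table values 2.571 and 2.132 quoted in the examples.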

The F-Distribution

Sometimes a population has an F-distribution.

What are the properties of the F-distribution?

https://www.youtube.com/watch?v=G_RDxAZJ-ug

To perform calculations involved in some statistical inference procedures, we need to find the point on the scale of the F-distribution having r1 and r2 degrees of freedom such that the area under this curve to the right of this point is γ.

This point is denoted $F_{[\gamma]}^{(r_1, r_2)}$.

The point $F_{[\gamma]}^{(r_1, r_2)}$ on the scale of the F-distribution is called a critical F value.

This value is often compared to the F values for sample statistics.

We can find this point using an F-table similar to the one in the back of the book (again, there are lots of examples online, too, just “Google” F-table).
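A critical F value can also be obtained directly in Stata rather than from a table. The sketch below uses the built-in function "invFtail(df1, df2, p)"; the degrees of freedom (3 and 12) and the tail area (.05) are hypothetical values picked only to show the call.

```stata
* critical F value with r1 = 3 and r2 = 12 degrees of freedom
* and area .05 to its right (illustrative values, not from the notes)
display invFtail(3, 12, .05)
```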

The Chi-square Distribution

Sometimes a population has a chi-square distribution.

Here is a great video that reviews this distribution:

https://www.youtube.com/watch?v=6Z4MBxye5VA

The exact form of the curve depends on a parameter that is called the number of degrees of freedom and is denoted df.

We refer to $\chi^2_{[\gamma]}(df)$ as the point on the scale of the chi-square distribution having df degrees of freedom such that the area under this curve to the right of this point is γ.
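As with the other distributions, this point can be computed in Stata. A minimal sketch, assuming the built-in function "invchi2tail(df, p)"; the values df = 10 and γ = .05 are hypothetical and serve only as an illustration.

```stata
* chi-square point with 10 df and area .05 to its right
* (illustrative values, not from the notes)
display invchi2tail(10, .05)
```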

CONFIDENCE INTERVALS FOR A POPULATION MEAN

We can use the sample mean $\bar{y}$ and the sample standard deviation s to solve for an interval estimate of the population mean, or a confidence interval.

Why? The sample mean $\bar{y}$ by itself does not provide any indication of how close it is to the population mean µ.

A 100(1-α)% confidence interval for the population mean µ based on the t-distribution can be solved for as follows (α is known as the level of significance and 100(1-α)% is known as the level of confidence):

$$\left[\bar{y} \pm t_{[\alpha/2]}^{(n-1)} \frac{s}{\sqrt{n}}\right]$$

Remember, $t_{[\alpha/2]}^{(n-1)}$ is simply a critical t value.

Also note that once the number of degrees of freedom reaches 30 or so, the critical t value is close to the critical z value, so we can use the critical z value instead.
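To see how quickly the critical t value approaches the critical z value, the two can be printed side by side in Stata. This is a sketch only; the list of degrees of freedom (4, 10, 30, 100) is an arbitrary choice for illustration.

```stata
* compare critical t values (area .025 in the right tail)
* with the corresponding critical z value as df grows
foreach df of numlist 4 10 30 100 {
    display "df = " `df' "   t = " invttail(`df', .025) "   z = " invnormal(.975)
}
```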

Example 3:

Federal gasoline standards state that µ, the mean gasoline mileage obtained by the fleet of all Hawks, must be at least 30 mpg. To demonstrate that this standard is being met, National Motors randomly selects a sample of n=5 Hawks and tests them for gasoline mileage. If the sample

30.7, 31.8, 30.2, 32.0, 31.3

is obtained, solve for the 95% confidence interval for µ.

Input this data in Stata using the “input” command as follows:

. input mpg

1. 30.7

2. 31.8

3. 30.2

4. 32.0

5. 31.3

6. end

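Solve for the confidence interval using the command “ameans” as follows:

. ameans mpg

Save the data so that it can be reloaded in later examples:

. save mpg, replace

You can also solve for the confidence interval by hand using the formula above. Since 100(1-α)%=95% implies that α=.05, we use the critical value $t_{[.025]}^{(4)} = 2.776$. With $\bar{y} = 31.2$ and s = .7517, the interval is

$$\left[31.2 \pm 2.776 \frac{.7517}{\sqrt{5}}\right] = [31.2 \pm .933] = [30.27, 32.13]$$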
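The by-hand calculation can be reproduced step by step in Stata. The following is a sketch under the assumption that the mpg data entered above are still in memory; the scalar names (ybar, s, n, half) are introduced here for illustration.

```stata
* sample mean, standard deviation, and size from the data in memory
quietly summarize mpg
scalar ybar = r(mean)
scalar s = r(sd)
scalar n = r(N)

* half-width: critical t value (n-1 df, .025 in each tail) times s/sqrt(n)
scalar half = invttail(n - 1, .025) * s / sqrt(n)

* lower and upper limits of the 95% confidence interval
display ybar - half " " ybar + half
```

The limits should match the arithmetic-mean interval reported by “ameans” up to rounding.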

The Derivation of the Interval

Sampling distribution – A sampling distribution can be obtained by taking repeated samples of a population, calculating the mean for each sample, and examining the distribution of the sample means.

An important result is that the mean of a sampling distribution is equal to the population mean, and that the sampling distribution of the mean is approximately normal when the sample size is large, regardless of the distribution of the population of individual values.

To derive the 100(1-α)% confidence interval, we state several important results.

The population of all possible sample means

1. Has mean $\mu_{\bar{y}} = \mu$ (the mean of the means is the population mean)

2. Has variance $\sigma^2_{\bar{y}} = \sigma^2 / n$ (the variance of the sampling distribution, or the variance of the distribution of sample means; if the population sampled is infinite)

3. Has standard deviation $\sigma_{\bar{y}} = \sigma / \sqrt{n}$ (the standard deviation of the sampling distribution; if the population sampled is infinite)

4. Has a normal distribution (if the population sampled has a normal distribution)

Since $\mu_{\bar{y}} = \mu$, or since the average of the sample means equals the true population mean, we say that the sample mean is unbiased, or that we are using an unbiased estimation procedure (in arriving at $\bar{y}$).

Also, since $\sigma_{\bar{y}} = \sigma / \sqrt{n}$, the standard deviation of the population of all sample means decreases as the sample size n increases.

Since each possible sample mean is an average of n sample values, the sample mean “averages out” high and low sample values.

Thus, we’d expect the sample means to be more closely clustered around µ than the individual population values. That is, intuitively, $\sigma_{\bar{y}}$, or the standard deviation of all sample means, should be smaller than σ, the standard deviation of the individual population values.

This is evident in the formula of the standard deviation of the population of all sample means since we divide the population standard deviation σ by $\sqrt{n}$.
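The effect of dividing by the square root of n can be seen numerically. The sketch below uses σ = .8 (the mileage standard deviation used in these notes) and a hypothetical list of sample sizes chosen only for illustration.

```stata
* standard deviation of the sample mean, sigma/sqrt(n), for several n
foreach n of numlist 1 5 25 100 {
    display "n = " `n' "   sd of sample means = " .8 / sqrt(`n')
}
```

Each fourfold increase in n cuts the standard deviation of the sample means in half.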

Example 4:

If the true values of µ, σ², and σ are 31.5, .64, and .8, then what are the values of $\mu_{\bar{y}}$, $\sigma^2_{\bar{y}}$, and $\sigma_{\bar{y}}$?

Moreover, assume that the population of all mileages is normally distributed. If this is the case, then the population of all possible sample means is also normally distributed. Thus, 95.44% of all possible sample means lie in the interval

$$\left[\mu_{\bar{y}} \pm 2\sigma_{\bar{y}}\right] = \left[31.5 \pm 2 \frac{.8}{\sqrt{n}}\right]$$

Note: this interval is narrower than the interval containing 95.44% of the individual mileages (example 2.3 in the text).

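Results 1, 2, and 3 from above imply that if the population that is sampled is normally distributed, then the population of all possible values of

$$z = \frac{\bar{y} - \mu}{\sigma_{\bar{y}}} = \frac{\bar{y} - \mu}{\sigma / \sqrt{n}}$$

has a standard normal distribution.

Note: we estimate $\sigma_{\bar{y}}$ by $s_{\bar{y}} = s / \sqrt{n}$, which is called the standard error of the estimate $\bar{y}$.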

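Then it can be proved that if the population that is sampled is normally distributed, the population of all possible values of

$$t = \frac{\bar{y} - \mu}{s_{\bar{y}}} = \frac{\bar{y} - \mu}{s / \sqrt{n}}$$

has a t-distribution with n-1 degrees of freedom. This implies that

$$P\left(-t_{[\alpha/2]}^{(n-1)} \le \frac{\bar{y} - \mu}{s_{\bar{y}}} \le t_{[\alpha/2]}^{(n-1)}\right)$$

is the area under the curve of the t-distribution having n-1 degrees of freedom between $-t_{[\alpha/2]}^{(n-1)}$ and $t_{[\alpha/2]}^{(n-1)}$.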

Just as before, the probability a particular t-value is greater than a lower critical t-value, but less than an upper critical t-value is simply the area under the curve in between the lower and upper critical t-values.

We can define these values for any given significance level, or for any given α. Thus,

$$P\left(-t_{[\alpha/2]}^{(n-1)} \le \frac{\bar{y} - \mu}{s_{\bar{y}}} \le t_{[\alpha/2]}^{(n-1)}\right) = 1 - \alpha$$

For example, with α = .05, the probability that a particular t-value is greater than the lower critical t-value with .05/2 = .025 in the lower tail of the distribution, but less than the upper critical t-value with .05/2 = .025 in the upper tail of the distribution, is 1-.05, or .95.

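Multiply the inequality in the probability statement by $s_{\bar{y}} = s / \sqrt{n}$ to get

$$P\left(-t_{[\alpha/2]}^{(n-1)} s_{\bar{y}} \le \bar{y} - \mu \le t_{[\alpha/2]}^{(n-1)} s_{\bar{y}}\right) = 1 - \alpha$$

Subtracting through by $\bar{y}$ implies

$$P\left(-\bar{y} - t_{[\alpha/2]}^{(n-1)} s_{\bar{y}} \le -\mu \le -\bar{y} + t_{[\alpha/2]}^{(n-1)} s_{\bar{y}}\right) = 1 - \alpha$$

Multiplying the above inequality by -1, which reverses the direction of the inequalities, gives

$$P\left(\bar{y} + t_{[\alpha/2]}^{(n-1)} s_{\bar{y}} \ge \mu \ge \bar{y} - t_{[\alpha/2]}^{(n-1)} s_{\bar{y}}\right) = 1 - \alpha$$

This can be written as

$$P\left(\bar{y} - t_{[\alpha/2]}^{(n-1)} s_{\bar{y}} \le \mu \le \bar{y} + t_{[\alpha/2]}^{(n-1)} s_{\bar{y}}\right) = 1 - \alpha$$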

This says that the proportion of confidence intervals containing the population mean µ in the population of all possible 100(1-α)% confidence intervals for µ is equal to 1-α.

Thus, if we compute a 100(1-α)% confidence interval for µ by using the formula

$$\left[\bar{y} \pm t_{[\alpha/2]}^{(n-1)} \frac{s}{\sqrt{n}}\right]$$

then 100(1-α)% of the confidence intervals in the population of all possible 100(1-α)% confidence intervals for µ contain µ, and 100(α)% of the confidence intervals in this population do not contain µ.
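This interpretation can be illustrated with a small simulation. The sketch below is one possible way to do it, not part of the original notes: it assumes a normal population with µ = 31.5 and σ = .8 (the values from Example 4), draws 1,000 samples of n = 5, and records how often the 95% t-based interval contains µ. The program name onesample, the seed, and the number of repetitions are all hypothetical choices. Note that "clear all" removes any data currently in memory.

```stata
clear all
set seed 12345

* hypothetical helper: draw one sample of n = 5 mileages and record
* whether its 95% t-based interval contains the true mean 31.5
program define onesample, rclass
    drop _all
    quietly set obs 5
    generate y = rnormal(31.5, .8)
    quietly summarize y
    local half = invttail(r(N) - 1, .025) * r(sd) / sqrt(r(N))
    return scalar covers = (r(mean) - `half' <= 31.5) & (31.5 <= r(mean) + `half')
end

* repeat 1,000 times; the mean of covers estimates the coverage rate
simulate covers = r(covers), reps(1000) nodots: onesample
summarize covers
```

The mean of covers should be close to .95, although it will vary somewhat with the seed and the number of repetitions.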

Confidence Intervals Based on the Normal Distribution

The preceding confidence interval is based on the t-distribution. It assumes that the population sampled is normally distributed. We now consider a confidence interval that is valid for any population.

The central limit theorem states that if the sample size n is large (greater than 30), then the population of all possible sample means is approximately normal with mean $\mu_{\bar{y}} = \mu$ and standard deviation $\sigma_{\bar{y}} = \sigma / \sqrt{n}$, no matter what probability distribution describes the population sampled.

Therefore, if n is large, the population of all possible values of

$$\frac{\bar{y} - \mu}{\sigma / \sqrt{n}}$$

approximately has a standard normal distribution (remember, these are just z values, and the distribution of z values is a standard normal distribution). This implies that

$$\left[\bar{y} \pm z_{[\alpha/2]} \frac{\sigma}{\sqrt{n}}\right] \quad \text{and} \quad \left[\bar{y} \pm z_{[\alpha/2]} \frac{s}{\sqrt{n}}\right]$$

are approximately correct 100(1-α)% confidence intervals for µ, no matter what probability distribution describes the population sampled. We can derive these intervals using a process similar to that followed above.

Note: the 2nd interval follows from the first by approximating σ by s.

Note: a more precise statement of the central limit theorem says that the larger the sample size n is, the more nearly normally distributed is the population of all possible sample means. Also, the larger n is, the smaller is $\sigma_{\bar{y}} = \sigma / \sqrt{n}$.

In summary, when we do not know the population standard deviation σ, we should use the 100(1-α)% confidence interval for µ based on the normal distribution

$$\left[\bar{y} \pm z_{[\alpha/2]} \frac{s}{\sqrt{n}}\right]$$

if the sample size n is large.

If the sample size n is small and the population is normally distributed, we should use the 100(1-α)% confidence interval for µ based on the t-distribution

$$\left[\bar{y} \pm t_{[\alpha/2]}^{(n-1)} \frac{s}{\sqrt{n}}\right]$$

In both cases, if we know the population standard deviation, use σ, and if not, use s.

Note: please review example 2.7 in the text. Also, please review the Stata code for this example.

HYPOTHESIS TESTING FOR A POPULATION MEAN

We sometimes wish to test the null hypothesis, H0: µ=c versus the alternative hypothesis, Ha: µ ≠ c. Here µ is the population mean, which is estimated by $\bar{y}$, and c is an arbitrary constant.

If our sample statistic has a normal distribution (that is, if the distribution of all of the sample means is normal), then we can see how far apart our sample statistic, or our estimate of the population mean, is from the hypothesized value by calculating

$$t = \frac{\bar{y} - c}{s / \sqrt{n}}$$

Remember, $\bar{y}$ and s are the mean and standard deviation of a sample of size n that has been randomly selected from the population having mean µ.

Further, the sample mean is an unbiased estimator of the population mean – i.e., although the sample mean does not equal the population mean µ, the average of all of the different sample means that we could have calculated is equal to µ.

If the population is normally distributed, then we know the sampling distribution of the mean, and once we know the sampling distribution of the mean, we can make probabilistic statements about sampled values versus hypothesized values.

Think of the hypothesized value as a reference point – our test centers the sampling distribution of the mean on this value.

So, if the calculated value of our sample statistic is far from the hypothesized value (where distance is measured in standard deviations), then we have statistical evidence that the population mean is different from the hypothesized value c.

For example, if the t-statistic is a large positive value, this provides evidence to support rejecting H0 in favor of Ha because the point estimate $\bar{y}$ indicates that µ is greater than c.

This likely warrants some sort of intervention by the firm. Interventions can be costly, but waiting too long to intervene can be even more costly.

A test statistic nearly equal to zero results when $\bar{y}$ is nearly equal to c. Such a test statistic provides little or no evidence to support rejecting H0 in favor of Ha. This is so because the point estimate $\bar{y}$ indicates that µ is nearly or exactly equal to c.

A type 1 error is committed if we reject H0 when it is true. The probability of a type 1 error is α, the significance level, which is chosen by the researcher in advance of conducting a hypothesis test. Why? If a researcher chooses a significance level of α = .05, we reject the null hypothesis when it is true about 5% of the time.

If the distribution of all possible sample means is normal with µ = c, then with α = .05 we’d expect the test statistic to fall above the upper critical t-value or below the lower critical t-value about 5% of the time.

That is, if the population is normally distributed with mean µ, we can reject H0: µ = c in favor of Ha: µ ≠ c by setting the probability of a type 1 error equal to α if and only if

$$t > t_{[\alpha/2]}^{(n-1)} \quad \text{or} \quad t < -t_{[\alpha/2]}^{(n-1)}, \quad \text{that is,} \quad |t| > t_{[\alpha/2]}^{(n-1)}$$

The points $t_{[\alpha/2]}^{(n-1)}$ and $-t_{[\alpha/2]}^{(n-1)}$ are called rejection points, or critical values, because they tell us how different from zero t must be for us to be able to reject H0 by setting the probability of a type 1 error equal to α.

A type 2 error is committed if we do not reject H0 when it is false.

Why does using this rejection point procedure ensure that the probability of a type 1 error equals α?

Recall that if the population sampled is normally distributed with mean µ, then the population of all possible values of

$$\frac{\bar{y} - \mu}{s / \sqrt{n}}$$

has a t-distribution with n-1 degrees of freedom.

It follows that if the null hypothesis H0: µ = c is true, then the population of all possible values of the test statistic

$$t = \frac{\bar{y} - c}{s / \sqrt{n}}$$

has a t-distribution with n-1 degrees of freedom.

Thus, using the above rejection points says that if H0: µ = c is true, then the probability that

$$-t_{[\alpha/2]}^{(n-1)} \le t \le t_{[\alpha/2]}^{(n-1)}$$

is 1-α. With α = .05, 95% of all possible values of t are in between these points.

Further, the probability that

$$t < -t_{[\alpha/2]}^{(n-1)} \quad \text{or} \quad t > t_{[\alpha/2]}^{(n-1)}$$

is α. With α = .05, 5% of all possible values of t are to the left or to the right of the rejection points, which leads us to reject the null hypothesis when it is true.

Follow these steps when testing a hypothesis:

1. State in advance the significance level, or α.

2. State in advance the decision rule, or the H0 and Ha hypotheses.

3. Compute the test statistic.

4. Compare the test statistic to the critical value(s), or the rejection point(s).

5. Reject or fail to reject the null hypothesis H0: µ = c

Example 5:

1. Use a significance level of α = .05.

2. G&B Corporation will randomly select a sample of n = 6 bottle fills from its bottle-filling process to test the following hypotheses (we assume that the infinite population of bottle fills is normally distributed, or at least mound shaped):

H0: µ = 16

Ha: µ ≠ 16

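3. If G&B observes the following sample of n = 6 bottle fills:

15.68, 16.00, 15.61, 15.93, 15.86, 15.72

Compute the test statistic as

$$t = \frac{\bar{y} - 16}{s / \sqrt{n}} = \frac{15.8 - 16}{.1532 / \sqrt{6}} = -3.2$$

4. Compare the absolute value of the test statistic to the critical value of $t_{[.025]}^{(5)} = 2.571$.

5. Reject the null hypothesis H0: µ = 16 in favor of Ha: µ ≠ 16 since |t| = 3.2 > 2.571.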

How can we solve for this in Stata?

First, input the data as

. input fills

1. 15.68

2. 16.00

3. 15.61

4. 15.93

5. 15.86

6. 15.72

7. end

Then save the data and run the test:

. save fills, replace

. ttest fills == 16

That’s it! You even get a 95% confidence interval with your ttest – since the hypothesized value is not contained in the interval, we can reject the null hypothesis.
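To connect the “ttest” output back to the five steps above, the test statistic and critical value can also be computed by hand in Stata. This is a sketch that assumes the fills data are in memory; the scalar names (ybar, s, n, tstat) are introduced here for illustration.

```stata
* sample mean, standard deviation, and size from the data in memory
quietly summarize fills
scalar ybar = r(mean)
scalar s = r(sd)
scalar n = r(N)

* test statistic for H0: mu = 16 and the two-sided critical value
scalar tstat = (ybar - 16) / (s / sqrt(n))
display "t = " tstat
display "critical value = " invttail(n - 1, .025)
```

The value of t should match the one reported by “ttest”, and its absolute value can be compared with the critical value as in step 4.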

One-Tailed Hypothesis Tests

Note: in the gasoline problem, recall that mileage standards state that the mean mileage µ must be at least 30 mpg.

Here we might be tempted to say that National Motors can “prove” that µ ≥ 30 if it can accept the null hypothesis H0: µ ≥ 30 instead of the alternative hypothesis Ha: µ < 30.

However, hypothesis testing seeks to find how confident we can be that the null hypothesis should be rejected in favor of the alternative hypothesis. It does not seek to find how confident we can be that the null hypothesis should be accepted.

Therefore, we cannot use hypothesis testing to “prove” that a null hypothesis is true.

In conducting one-tail tests, do the following:

1. State what you wish to justify in the form of a strict inequality (< or >), and make it the alternative hypothesis Ha.

2. State what we’d expect if the alternative hypothesis is false, and make this statement the null hypothesis H0.

Example 6:

1. Use a significance level of .05.

2. National Motors will randomly select a sample of n = 5 mileages to test the following hypotheses:

H0: µ ≤ 30

Ha: µ > 30

3. Using our mpg data, we saw that $\bar{y}$ = 31.2 and that s = .7517. Thus

$$t = \frac{31.2 - 30}{.7517 / \sqrt{5}} = 3.569$$

4. Compare the value of the test statistic to the critical value of $t_{[.05]}^{(4)} = 2.132$.

5. Reject the null hypothesis H0: µ ≤ 30 in favor of Ha: µ > 30 since 3.569 > 2.132.

In Stata, do the following:

. use mpg, clear

. ttest mpg == 30

Note: This gives results for one-sided and two-sided tests.
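If only the summary statistics are available, the same test can be run from them. The sketch below uses Stata's immediate command “ttesti”, whose arguments are the sample size, sample mean, sample standard deviation, and hypothesized value, with the figures from Example 6.

```stata
* one-sample t test from summary statistics: n, mean, sd, hypothesized value
ttesti 5 31.2 .7517 30

* one-sided critical value with 4 df and area .05 in the right tail
display invttail(4, .05)
```

For Ha: µ > 30, the relevant part of the output is the column for the alternative that the mean is greater than 30.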

Example 7:

Suppose National Motors wishes to claim in an advertisement that the Hawk’s mean stopping distance is less than 60 feet, the value claimed by its competitors. Do the following:

1. Use a significance level of .05.

2. National Motors will randomly select a sample of n = 64 stopping distances to test the following hypotheses:

H0: µ ≥ 60

Ha: µ < 60

3. Suppose the sample mean and standard deviation are $\bar{y}$ = 58.12 feet and s = 6.13 feet. Thus,

$$z = \frac{58.12 - 60}{6.13 / \sqrt{64}} = -2.45$$

4. Compare the value of the test statistic to the critical value of $-z_{[.05]} = -1.645$. (Why do we use the z-value instead of the t-value?)

5. Since -2.45 < -1.645, reject the null hypothesis H0: µ ≥ 60 in favor of the alternative hypothesis Ha: µ < 60.

National Motors will be allowed to make the television claim that µ < 60.

Note: Remember to use z-scores instead of t-scores for large samples.
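The large-sample calculation in Example 7 can be checked with Stata's normal functions. This is a minimal sketch using the summary statistics given in the example; the scalar name z is chosen here for illustration.

```stata
* z statistic for H0: mu >= 60 with ybar = 58.12, s = 6.13, n = 64
scalar z = (58.12 - 60) / (6.13 / sqrt(64))
display "z = " z

* lower-tail critical value with area .05 to its left
display "critical value = " invnormal(.05)

* area under the standard normal curve to the left of z
display "area to the left of z = " normal(z)
```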

Using p-Values

For a two-sided test, the p-value is twice the area under the curve of the t-distribution having n-1 df to the right of the absolute value of the calculated t-value.

For a one-sided test, the p-value is the area to the left of the calculated t-value (when Ha: µ < c) or to the right of the calculated t-value (when Ha: µ > c).

In other words, if the null hypothesis is true, the p-value is the probability that we observe a test statistic at least as extreme as the one actually observed.

Thus, if p < α, reject the null hypothesis H0: µ = c.

P-values are automatically calculated in Stata.

. use fills, clear

. ttest fills == 16

What is the p-value?
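To see where the reported p-value comes from, it can be rebuilt from the definition. The sketch below assumes the file fills.dta saved earlier is available; "ttail(df, t)" is Stata's built-in function for the area to the right of t under the t-distribution, and the scalar names n and tstat are chosen here for illustration.

```stata
* reload the bottle-fill data and recompute the test statistic
use fills, clear
quietly summarize fills
scalar n = r(N)
scalar tstat = (r(mean) - 16) / (r(sd) / sqrt(r(N)))

* two-sided p-value: twice the area to the right of |t| with n-1 df
display "two-sided p-value = " 2 * ttail(n - 1, abs(tstat))
```

The displayed value should agree with the two-sided p-value in the “ttest” output.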