8210 assgn wk 5


Hypothesis Testing

In doing research, one of the most common activities is testing hypotheses. The Afrobarometer data set is a survey of African citizens’ attitudes on democracy, governance, the economy, and other related topics (www.afrobarometer.org). Using this data set, you might want to examine hypotheses about whether rural and urban citizens differ, on average, in how much they trust the government. The tables below present results from an independent samples t-test that examines these hypotheses using a random sample of 43 participants from the complete data set. Each respondent’s score is a value between 0 and 15, with a higher score indicating greater trust. You can see that the mean for the urban group is 7.00 (SD = 4.17) and the mean for the rural group is 7.74 (SD = 4.38). The observed value of the t-statistic is -.564 and the p-value equals .576 (see the column labeled “Sig. (2-tailed)”).

African Citizens' Attitudes on Democracy


| | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference |
| --- | --- | --- | --- | --- | --- |
| Trust in Government Index (higher scores = more trust) | -.564 | 41 | .576 | -.73913 | 1.30978 |

Group Statistics: Trust in Government Index (higher scores = more trust)

| Urban or Rural Primary Sampling Unit | N | Mean | Std. Deviation | Std. Error Mean |
| --- | --- | --- | --- | --- |
| Urban | 20 | 7.000 | 4.16754 | .93189 |
| Rural | 23 | 7.7391 | 4.38196 | .91370 |
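
The t-test table can be checked against the group statistics. The sketch below recomputes the pooled-variance t-statistic from the summary values only. It assumes a rural group size of 23, which is the value implied by the reported standard error (4.38196 / .91370 ≈ √23) and by df = 41; the function name is illustrative and is not part of SPSS.

```python
import math

def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Independent-samples t-test (equal variances assumed) from summary statistics."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference between the two sample means
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    mean_diff = mean1 - mean2
    return mean_diff / se_diff, df, mean_diff, se_diff

# Urban group: n = 20. Rural group: n = 23 (assumed from the reported SE and df).
t_stat, df, mean_diff, se_diff = pooled_t_from_summary(
    20, 7.000, 4.16754, 23, 7.7391, 4.38196
)
print(f"t = {t_stat:.3f}, df = {df}, mean difference = {mean_diff:.4f}, SE = {se_diff:.5f}")
```

The recomputed values should agree with the SPSS output to rounding: t near -.564 with 41 degrees of freedom and a standard error of the difference near 1.30978.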

The p-value is the probability of obtaining a t value at least as extreme as the observed one (less than or equal to -.564 or greater than or equal to +.564) if you were to repeat the test with a new sample of data and if the null hypothesis is true. You will see in this Skill Builder that the p-value can easily be used to make statistical decisions in hypothesis testing. However, while the p-value is important in determining statistical significance, it does not tell the whole story.

Steps of Hypothesis Testing

To interpret p-values, let's review the key steps in hypothesis testing.

Step 1

State the null and alternative hypotheses

Recall that hypotheses are statements about population parameters. For the Trust in Government example from the Afrobarometer data set, the null (H0) and alternative (HA) hypotheses are H0: μurban = μrural and HA: μurban ≠ μrural.

The Greek letter, µ, indicates a population mean, and the subscripts indicate levels of the independent variable (“urban” and “rural”). Here the null is saying that the mean for the urban population on the Trust In Government variable is the same as the mean for the rural population. The alternative hypothesis states that these means are not the same.

One-tailed vs. Two-tailed Tests

One important factor to be aware of is whether the test you are conducting is one-tailed or two-tailed. So far, the hypotheses have been written for a two-tailed test, which means that the alternative hypothesis stated simply that there was a difference between the means, without specifying the direction of the difference. In a one-tailed test, the alternative hypothesis does specify the direction of the difference; that is, it specifies that one of the means (e.g., urban or rural) is expected to be larger than the other.

In a one-tailed test, the p-value will be the area in the test statistic distribution to the right of the observed value if the alternative hypothesis has an “is greater than” sign, and to the left of the observed value if the alternative hypothesis has an “is less than” sign. For example, suppose we had the following hypothesis test: H0: μurban ≤ μrural versus HA: μurban > μrural.

In this hypothesis test, the alternative hypothesis HA states that the mean for the urban population will be greater than the mean for the rural population. The p-value would, therefore, be determined by the area to the right of the observed value of the test statistic using the sampling distribution for the test statistic. 

For a two-tailed test, as is being illustrated with the Afrobarometer data file, the area beyond the observed value is doubled to obtain the p-value. The reason for doubling is related to setting the rejection region for a two-tailed test. For a two-tailed test, alpha is divided in half (α/2), and the “half-areas” are used to identify rejection regions in both the upper and lower tails of the test statistic’s sampling distribution. 

The doubling of the area beyond the observed value allows the p-value to be compared to alpha to test the null hypothesis.

Figure 1

Figure 1 shows the p-value determination for the Afrobarometer hypothesis test. In the SPSS output, the observed value for the t-statistic is -0.564. Because the value of t is negative, the more extreme values of t are considered to be to the left of -0.564. As shown in Figure 1, the area under the curve and less than -0.564 is .288. Because of the two-tailed test, however, the area is doubled to account for the probability of the test statistic taking on a value greater than +0.564. Hence the p-value for the hypothesis test is .576.
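
The doubling step can be reproduced numerically. The following is a minimal sketch that integrates the density of Student's t distribution with Simpson's rule, using only the Python standard library; the helper names are illustrative, and a statistics package would normally do this calculation for you.

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def t_tail_area(t, df, steps=2000):
    """Area beyond |t| in one tail, via Simpson's rule on the interval [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The distribution is symmetric, so each half holds an area of 0.5
    return 0.5 - area_0_to_t

one_tail = t_tail_area(-0.564, 41)
p_two_tailed = 2 * one_tail  # double the tail area for a two-tailed test
print(f"one-tail area = {one_tail:.3f}, two-tailed p = {p_two_tailed:.3f}")
```

Because the t distribution is symmetric, the area below -0.564 equals the area above +0.564, which is why a single tail area can simply be doubled.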

If alpha had been set equal to .05, the null hypothesis would be retained (that is, you would fail to reject it) because .576 is greater than .05. The data do not provide sufficient evidence that the urban and rural populations differ in their average levels of trust in government. Note that failing to reject the null does not establish that the population means are equal.

Keep in mind the following important points related to making a statistical decision and interpreting your p-value:

· By definition, the p-value is the probability of obtaining a value for the test statistic as extreme or more extreme than the observed value if the null hypothesis is true.

· If the p-value is less than alpha, the null is rejected, and the result is said to be statistically significant.

· If the p-value is greater than alpha, then researchers would fail to reject the null hypothesis.
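
The decision rule in the points above can be written as a short function. This is only a sketch; the function name is hypothetical, and treating a p-value exactly equal to alpha as "fail to reject" is an assumption, since the text does not address that boundary case.

```python
def decide(p_value, alpha=0.05):
    """Return the statistical decision for a given p-value and alpha."""
    if p_value < alpha:
        return "reject H0 (statistically significant)"
    # Assumption: p exactly equal to alpha is treated as not significant
    return "fail to reject H0 (not statistically significant)"

# p-value from the Trust in Government example
print(decide(0.576))
# An illustrative smaller p-value
print(decide(0.01))
```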

Statistically Significant Results

The final step in conducting a hypothesis test is to link the statistical result to the real world. That is, you need to examine the practical significance or the meaningfulness of the statistical result.

If the result of the hypothesis test is to retain the null—that is, obtain a non-significant result—the researcher has clearly not identified a meaningful effect. In most hypothesis tests, retaining the null is not what the researcher is hoping to do.  

On the other hand, if you reject the null hypothesis, you will have a statistically significant result. You are, in essence, saying that the result is so unlikely under the assumption of the null being true that the null appears to be false. A false null hypothesis does not mean, however, that the result is scientifically or socially important. When a researcher finds a statistically significant result, knowledge of the research area is used to decide whether the result is important and meaningful. Large effects are more often meaningful than small effects, but there are times when small effects can be important. 

Knowledge of the research area is key in making the decision.

Probably the most frequent concern with meaningless statistically significant results has to do with sample size. With extremely large sample sizes, hypothesis tests can result in rejecting the null even though the effect is small and unimportant from an applied perspective. To understand how this works, let’s take another look at the Afrobarometer data set. Participants in the survey were asked whether they agreed or disagreed with the statement, “People must obey the law.” Responses were made using a five-point Likert scale:

1 = strongly disagree

2 = disagree

3 = neither agree nor disagree

4 = agree

5 = strongly agree

Suppose a researcher had wanted to compare the urban and rural populations and tested the null hypothesis H0: μurban = μrural using alpha equal to .05. Unlike the example above that used a sample of 43 participants, the following results are based on over 50,000 respondents. As shown in the following table, the p-value (Sig. (2-tailed)) for this test is .004.

| | | t | df | Sig. (2-tailed) | Mean Difference |
| --- | --- | --- | --- | --- | --- |
| Q48b. People must obey the law | Equal variances assumed | -2.892 | 50125 | .004 | -.029 |

Using APA style, the researcher could report that, on average, the urban population agrees less with the statement than does the rural population, t(50125) = -2.892, p = .004, d = .027, 95% CI [-.039, -.019].

· The statement says the t-test was conducted with 50,125 degrees of freedom or 50,127 participants.

· The p-value of .004 is less than alpha, so the null hypothesis is rejected.

· The d statistic is Cohen’s d, a common measure of effect size.

· The 95% confidence interval for the difference in population means does not contain zero, which is consistent with having rejected the null hypothesis.

There is no doubt the result is statistically significant, but how meaningful is it? The d-statistic is quite useful because it compares the difference in sample means to an average of the standard deviations for the two groups. (The average standard deviation is based on a weighted average of the two sample variances.) According to Cohen, d = .2 is generally considered a small effect, d = .5 a medium effect, and d = .8 a large effect. The value of .027 is little more than 10% of a small effect. The statistically significant result that was obtained is therefore not likely to be important.
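
To make the d-statistic concrete, the sketch below computes Cohen's d from summary statistics and labels it against Cohen's benchmarks. The standard deviations for the "obey the law" item are not reported here, so the calculation uses the Trust in Government group statistics instead, again assuming a rural group size of 23 (the value implied by the reported standard error and df = 41). The function names and the "below small" label are illustrative choices.

```python
import math

def cohens_d(n1, mean1, sd1, n2, mean2, sd2):
    """Difference in sample means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def label_effect(d):
    """Label an effect size using Cohen's benchmarks of .2, .5, and .8."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

# Trust in Government example (rural n = 23 is an assumption)
d_trust = cohens_d(20, 7.000, 4.16754, 23, 7.7391, 4.38196)
print(f"d = {d_trust:.3f} ({label_effect(d_trust)})")

# The reported d of .027 as a share of the "small" benchmark of .2
share_of_small = 0.027 / 0.2
print(f"d = .027 is {share_of_small:.1%} of a small effect ({label_effect(0.027)})")
```

The sign of d only reflects which group is subtracted from which; the benchmarks apply to its absolute value.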

Statistical Power

Statistical power is the probability of rejecting a null hypothesis if the null is false (i.e., the alternative is true). It is the degree to which the researcher is able to detect an effect if there actually is one. With low statistical power, a researcher may struggle to detect an effect (to reject the null), even if an effect actually occurs in the population.

Suppose you are planning an experiment involving stereotype threat. Stereotype threat is defined as a tendency to behave in a manner consistent with negative beliefs that others have about a racial or gender group. For example, if some black test takers are told that as a group, black test takers do not perform well on math tests, performance among those black test-takers is worse than for black test takers for whom the stereotype is not evoked. One question you will need to answer is how many participants should you include in your study to be confident in identifying the effect? In other words, how many participants do you need in order to have adequate statistical power in your study?

Factors Affecting Statistical Power

Understanding how several factors affect the statistical power of a study will help you to understand and critique research findings and will also lead to greater satisfaction with your own research. When conducting your own research studies, you should do a power analysis prior to collecting data to make sure you have a good chance of demonstrating the effect you are looking for. 

There are three main factors that affect how much statistical power you have in your study:

1. Alpha (i.e., the probability of a type I error)

2. Effect size (i.e., the difference between the population means for the experimental and control groups)

3. Sample size (i.e., n)

As a researcher, you have control over alpha and sample size. The effect size, however, is not under your control and is predetermined. What will be important to you is having an idea about how great the effect may be. This Skill Builder is concerned with how alpha, effect size, and sample size are related to statistical power.

A Review of Hypothesis Testing

Before discussing power, let’s review the basics of hypothesis testing:

· The null hypothesis is the statement of no effect.

· The alternative hypothesis is a statement that an effect exists in the population.

· Obtaining a significant result means that you have rejected the null hypothesis and have concluded that it’s likely that there is an effect in the population.

· A type I error happens when the null hypothesis is true but you reject it erroneously. This is referred to as a false positive.

· A type II error happens when the null hypothesis is false but you fail to reject it. This is referred to as a false negative.

Reviewing Type I and Type II Errors

Type I and type II errors and their probabilities are important concepts when thinking about hypothesis testing. These error events are called “conditional,” meaning that the events can only occur under certain conditions.

The following is the language that is used to talk about these conditional events:

· Alpha (α) = P(type I error) = P(Reject H0 | H0 is true), which is read as the probability of a type I error equals the probability of rejecting the null hypothesis given the null is true.

· Beta (β) = P(type II error) = P(Retain H0 | HA is true), which is read as the probability of a type II error equals the probability of retaining the null hypothesis given the alternative hypothesis is true.

Table 1 shows the possible outcomes for a hypothesis test.

Table 1: Possible Outcomes for a Hypothesis Test

| Decision | True state of nature: H0 is true | True state of nature: H0 is false |
| --- | --- | --- |
| Retain H0 | Correct decision | Type II error |
| Reject H0 | Type I error | Correct decision |
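
A small simulation can illustrate what alpha means as a conditional probability. The sketch below repeatedly draws two samples from the same population, so the null hypothesis is true by construction, and counts how often a two-tailed test at alpha = .05 rejects it. To stay within the standard library, it uses a z-test with a known standard deviation of 1 rather than a t-test; that simplification, the sample size of 30 per group, and the function name are all assumptions made for illustration.

```python
import math
import random

def simulate_type_i_rate(n_per_group=30, n_sims=4000, z_crit=1.96, seed=1):
    """Share of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    # Standard error of the difference in means when sigma = 1 in both groups
    se = math.sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the same population, so any rejection is a type I error
        mean_1 = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_2 = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_1 - mean_2) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

type_i_rate = simulate_type_i_rate()
print(f"Observed type I error rate: {type_i_rate:.3f}")
```

The observed rejection rate should land close to .05, the "Reject H0" cell in the "H0 is true" column of Table 1.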

Power Analysis

Power analysis is the process of examining a test of the null hypothesis to determine the chances of rejecting it and placing belief in the alternative hypothesis.

Researchers typically want to get a sense of how much statistical power they will have in their study before collecting data. In order to do so, they usually conduct a power analysis. Suppose you design a study, and a part of it is to demonstrate stereotype threat involving females. Nguyen and Ryan (2008) provide results that indicate the average Cohen’s d in previous studies of gender-based stereotype threat for cognitive tests is about .21. This means that over many studies, females who are NOT made aware of a gender stereotype (NOT primed) score about 0.2 standard deviations higher on cognitive tests than females who are made aware of a gender stereotype (primed). To demonstrate this effect in your study, you will test the following null hypothesis:

H0 : μNOT primed − μprimed ≤ 0

If you reject the null, you will place your confidence in the following alternative hypothesis:

HA :  μNOT primed − μprimed  > 0

μNOT primed indicates the population mean for the “not primed” condition.

μprimed indicates the population mean for the “primed” condition.

The alternative hypothesis, HA : μNOT primed − μprimed > 0, specifies that the “not primed” condition will score higher than the “primed” condition.

To test this null hypothesis, you would examine a test statistic distribution and note the area in the upper tail of the distribution equal to alpha. Suppose you plan to test this hypothesis with a t-test with 50 participants in each condition (primed or NOT primed).

The sampling distribution in Figure 1 shows the values you should expect for the test statistic if the null hypothesis is true. In order to reject the null hypothesis, the t value would need to be greater than 1.66055, the critical value for a one-tailed test with 98 degrees of freedom.

Figure 1

Because the test statistic is a continuous variable, the curve shows probability density, and probability is found by determining the area under the curve. 

The entire area under the curve, between  - ∞  and + ∞  , is 1.00. 

To find the probability of a statistic taking on a value within a certain range, you need to find the area under the curve within the range. For example, there are tables that will tell you that the area under the curve between t = 0 and t = +1 corresponds to a probability of about .34. Most importantly, because alpha has been set equal to .05, the area beyond 1.66 corresponds to a probability of .05. Fortunately, statistical programs calculate the areas for you, and you do not need to do the calculations yourself. 

Nevertheless, the essence of hypothesis testing is that if you obtain a value of t greater than 1.66, you will say, “This is not a very likely event if the null is true. Thus, the null hypothesis is probably not true because the alternative hypothesis provides a more likely explanation.” In making the decision to reject the null, however, you recognize that if the null is, in fact, true, you are making a type I error.

While alpha provides assurance that the researcher has a small chance of making a type I error, you are also interested in what will happen if the null hypothesis is false—the real world expectation that is driving you to do the study.

Figure 2

Now, in Figure 2, switch your focus from the curve on the left and attend to the curve on the right formed by the dashed line. This curve is based on the alternative hypothesis (i.e., that the unprimed group performs better than the primed group).  

To construct this curve based on the alternative hypothesis, a specific value for the difference in means had to be specified; in this case, the value of d = .21, the overall gender effect that Nguyen and Ryan (2008) found. Note, again, that the vertical line at t = 1.66 separates the values of the test statistic that lead to rejecting versus retaining the null hypothesis, and that the line is based on the null hypothesis. The statistical power of the test, (1 − β), is the area under the dashed curve to the right of the vertical line at t = 1.66. The area designated by beta (β), to the left of the vertical line, corresponds to the probability of a type II error, retaining the null if the null is actually false.

In this example, note that the area corresponding to power (1- β ) is less than the area corresponding to  β . Hence, you can conclude that the power is less than 0.5 because the sum of the two areas is 1.0. Almost always, you would like statistical power to be greater than beta for the important hypothesis tests in your study. In this example, a plan to do an experiment with 50 participants in each group may be doomed. The statistical power of the test (.27) is relatively low, and the risk of making a type II error is relatively high. In other words, the statistical power of the test, as currently constructed, limits your ability to detect a gender effect of priming versus not priming if there is one.  
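
The power figure of .27 can be approximated with a short calculation. The sketch below uses a normal approximation rather than the exact noncentral t distribution, so it is a rough check and not the method behind the figure. It also estimates how many participants per group would be needed for a power of .80; that target is a common convention and an assumption here, not a value from this Skill Builder, and the function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def approx_power(d, n_per_group, crit):
    """Normal approximation to the power of a one-tailed two-sample test."""
    # Expected value of the test statistic when the true effect size is d
    shift = d * sqrt(n_per_group / 2)
    # Power is the area of the shifted curve beyond the critical value
    return 1 - Z.cdf(crit - shift)

def n_per_group_for_power(d, alpha=0.05, power=0.80):
    """Approximate participants per group needed for a one-tailed test."""
    z_alpha = Z.inv_cdf(1 - alpha)
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

power_50 = approx_power(0.21, 50, crit=1.66)
n_needed = n_per_group_for_power(0.21)
print(f"Approximate power with 50 per group: {power_50:.2f}")
print(f"Approximate n per group for power of .80: {n_needed}")
```

Under these assumptions, the required sample is several times larger than the planned 50 per group, which is why the planned study is described as being at risk.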


Consider the following scenario when answering the question below.

You are planning a study of stereotype threat and are concerned you may not be able to detect a significant result, even though you believe your experimental procedures should induce the stereotype threat effect. 

Hint: A type I error happens when the null hypothesis is true and you reject it.

Which of the following errors are you concerned about?

Type I error

Type II error


The Relationship Between Power and Sample Size

Prior discussions have focused on testing hypotheses about population means, but you can also do hypothesis tests involving population proportions. In general, larger sample sizes give you more information to pin down the true nature of the population. You can, therefore, expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. 

As a result, for the same level of confidence, you can report a smaller margin of error, and get a narrower confidence interval. In other words, larger sample sizes increase how much you trust your sample results. In the two scenarios below, you will see that a larger sample size results in a greater ability to reject the null when an effect actually exists in the population.
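
The link between sample size and margin of error can be sketched for a sample proportion. The example below assumes a 95% confidence level and an illustrative sample proportion of .19; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion using the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Same illustrative sample proportion, two different sample sizes
moe_100 = margin_of_error(0.19, 100)
moe_400 = margin_of_error(0.19, 400)
print(f"n = 100: margin of error = {moe_100:.3f}")
print(f"n = 400: margin of error = {moe_400:.3f}")
```

Because the margin of error shrinks with the square root of n, quadrupling the sample size halves the width of the interval.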

Scenario: Examining Marijuana Use

Imagine you are a researcher examining marijuana use at a certain liberal arts college and read through the scenario below.

Step 1

You believe that marijuana use at the college is greater than the national average; large-scale studies have shown that about 15.7% of college students use marijuana (reported by the Harvard School of Public Health). Based on this belief, you take a simple random sample of 100 students from the college, find that 19 admit to marijuana use, and perform the hypothesis test shown in Figure 9 below, which tests H0: p = .157 against HA: p > .157.

· Note that p in this figure means population proportion and p̂ means sample proportion. On the other hand, p-value continues to have the same meaning as defined in the glossary.

Because the p-value is greater than .05, the customary alpha level, the data do not provide enough evidence that the proportion of marijuana users at the college is higher than the proportion among all U.S. college students, which is .157.

Step 2

Let’s make some small changes to the above problem. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use as seen in Figure 8 below. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is .157?

Step 3

You now have a larger sample (400 instead of 100), and also the number of marijuana users is 76 instead of 19. The question of interest did not change, so if you carry out the test in this case, you are testing the same hypotheses as before: H0: p = .157 versus HA: p > .157.

Step 4

You select a random sample of size 400 and find that 76 are marijuana users, so the sample proportion is p̂ = 76/400 = .19. This is the same sample proportion as in the original problem, so it seems that the data give the same evidence.

Step 5

However, when you calculate the test statistic, you see that this is not the case: z = (p̂ − p0) / √(p0(1 − p0)/n) = (.19 − .157) / √(.157 × .843 / 400) ≈ 1.81.

Even though the sample proportion is the same (.19), because here it is based on a larger sample (400 instead of 100), it is 1.81 standard deviations above the null value of .157 (as opposed to .91 standard deviations in the original problem). The sampling distribution for the sample proportion has a smaller standard error because of the larger sample size.

Step 6

The p-value here is .035, as opposed to .182 in the original problem. In other words, when H0 is true (i.e., if p = .157 at the college), it is quite unlikely (probability of .035) to get a sample proportion of .19 or higher based on a sample of size 400. When the sample size is 100, a sample proportion of .19 or higher is considerably more likely (probability of .182).

The results here are important. With n = 400, the data provide enough evidence to reject Ho and conclude that the proportion of marijuana users at the college is higher than among all U.S. students. With n = 100, however, the evidence is insufficient to reject the null. Figure 9 summarizes these findings.
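
The two tests can be reproduced with a one-sided z-test for a proportion. The following sketch uses the normal approximation and the counts given in the scenario (19 of 100 and 76 of 400); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_test(successes, n, p0):
    """z-test of H0: p = p0 against HA: p > p0 using the normal approximation."""
    p_hat = successes / n
    # Standard error of the sample proportion, computed under the null hypothesis
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Upper-tail area because the alternative is "greater than"
    p_value = 1 - NormalDist().cdf(z)
    return p_hat, z, p_value

p_hat_100, z_100, p_100 = one_sided_proportion_test(19, 100, 0.157)
p_hat_400, z_400, p_400 = one_sided_proportion_test(76, 400, 0.157)
print(f"n = 100: p-hat = {p_hat_100:.2f}, z = {z_100:.2f}, p-value = {p_100:.3f}")
print(f"n = 400: p-hat = {p_hat_400:.2f}, z = {z_400:.2f}, p-value = {p_400:.3f}")
```

The sample proportion is identical in both cases; only the standard error changes, and it is halved when the sample size is quadrupled.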

You can see that results that are based on a larger sample carry more weight. A sample proportion of .19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users in the college is higher than .157. Recall that this conclusion (not having enough evidence to reject the null hypothesis) doesn't mean the null hypothesis is necessarily true; it only means that the particular study did not yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference, and a type II error was made.

To summarize, you saw that when the sample proportion of .19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than .157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.

The following graphs show the power of the two tests if the population proportion p for the college is actually .19.


Figure 10

Figure 11

Figure 12

Finally, Figure 12 shows how sample size affects the test for proportions concerning marijuana use at the liberal arts college. The graph is based on a hypothesis test with alpha = .05, the proportion for the null hypothesis equal to .157, and the population proportion for the liberal arts college = .19.

In general, whether you are testing hypotheses about proportions, means, or other parameters, the larger the sample size, the greater the statistical power. Because of your interest in rejecting the null, you need to pay attention to how large your sample size will be prior to collecting data.
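
The relationship between sample size and power for the marijuana example can be sketched as follows. The calculation uses the normal approximation with alpha = .05, a null proportion of .157, and a true proportion of .19, as described above. It is an approximation and is not guaranteed to match the figures exactly; the function name and the list of sample sizes are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_proportion_test(n, p0=0.157, p_true=0.19, alpha=0.05):
    """Approximate power of the one-sided test of H0: p = p0 against HA: p > p0."""
    z_alpha = Z.inv_cdf(1 - alpha)
    # Smallest sample proportion that leads to rejecting H0
    cutoff = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    # Spread of the sample proportion when the true proportion is p_true
    se_true = sqrt(p_true * (1 - p_true) / n)
    return 1 - Z.cdf((cutoff - p_true) / se_true)

sample_sizes = [100, 200, 400, 800, 1600]
powers = [power_proportion_test(n) for n in sample_sizes]
for n, power in zip(sample_sizes, powers):
    print(f"n = {n:5d}: power = {power:.2f}")
```

Each increase in n pulls the rejection cutoff closer to .157 and narrows the sampling distribution around .19, so more of that distribution falls in the rejection region.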