Statistical Power

Statistical power is the probability of rejecting a null hypothesis if the null is false (i.e., the alternative is true). It is the degree to which the researcher is able to detect an effect if there actually is one. With low statistical power, a researcher may struggle to detect an effect (to reject the null), even if an effect actually occurs in the population.

Suppose you are planning an experiment involving stereotype threat. Stereotype threat is defined as a tendency to behave in a manner consistent with negative beliefs that others have about a racial or gender group. For example, if some black test takers are told that, as a group, black test takers do not perform well on math tests, their performance is worse than that of black test takers for whom the stereotype is not evoked. One question you will need to answer is how many participants you should include in your study to be confident of identifying the effect. In other words, how many participants do you need in order to have adequate statistical power in your study?

Factors That Affect Statistical Power

Understanding how several factors affect the statistical power of a study will help you to understand and critique research findings and will also lead to greater satisfaction with your own research. When conducting your own research studies, you should do a power analysis prior to collecting data to make sure you have a good chance of demonstrating the effect you are looking for. 

There are three main factors that affect how much statistical power you have in your study:

1. Alpha (i.e., the probability of a type I error)

2. Effect size (i.e., the difference between the population means for the experimental and control groups)

3. Sample size (i.e., n)

As a researcher, you have control over alpha and sample size. The effect size, however, is not under your control and is predetermined. What will be important to you is having an idea about how great the effect may be. This Skill Builder is concerned with how alpha, effect size, and sample size are related to statistical power.

A Review of Hypothesis Testing

Before discussing power, let’s review the basics of hypothesis testing:

· The null hypothesis is the statement of no effect.

· The alternative hypothesis is a statement that an effect exists in the population.

· Obtaining a significant result means that you have rejected the null hypothesis and have concluded that it’s likely that there is an effect in the population.

· A type I error happens when the null hypothesis is true but you reject it erroneously. This is referred to as a false positive.

· A type II error happens when the null hypothesis is false but you fail to reject it. This is referred to as a false negative.

Reviewing Type I and Type II Errors

Type I and type II errors and their probabilities are important concepts when thinking about hypothesis testing. These error events are called “conditional,” meaning that the events can only occur under certain conditions.

The following is the language that is used to talk about these conditional events:

· Alpha (α) = P(type I error) = P(Reject H0 | H0 is true), which is read as "the probability of a type I error equals the probability of rejecting the null hypothesis given that the null is true."

· Beta (β) = P(type II error) = P(Retain H0 | HA is true), which is read as "the probability of a type II error equals the probability of retaining the null hypothesis given that the alternative hypothesis is true."

Table 1 shows the possible outcomes for a hypothesis test.

Table 1: Possible Outcomes for a Hypothesis Test

                     True State of Nature
Decision        H0 is true           H0 is false
Retain H0       Correct decision     Type II error
Reject H0       Type I error         Correct decision
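The four cells of Table 1 can be illustrated with a small simulation. The sketch below is not part of the original Skill Builder; it assumes a one-tailed z-test of H0: μ ≤ 0 on normally distributed scores with a known standard deviation of 1, and the sample size (30), the true mean under the alternative (0.5), and the function name rejection_rate are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=30, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated one-tailed z-tests that reject H0: mu <= 0.

    Assumes normal scores with a known standard deviation of 1
    (an illustrative simplification, not the study design in the text).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # cutoff with area alpha in the upper tail
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # (sample mean - 0) / (1 / sqrt(n))
        if z > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true: every rejection is a type I error, so the rate should be near alpha.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false: every rejection is a correct decision (power),
# and every retention is a type II error.
power = rejection_rate(true_mean=0.5, seed=1)
type_ii_rate = 1 - power

print(f"type I error rate  ~ {type_i_rate:.3f}")
print(f"power              ~ {power:.3f}")
print(f"type II error rate ~ {type_ii_rate:.3f}")
```

Because the rates come from random draws, they will only be close to alpha and beta, not exactly equal to them.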

Power Analysis

Power analysis is the process of examining a test of the null hypothesis to determine the chances of rejecting it and placing belief in the alternative hypothesis.

Researchers typically want to get a sense of how much statistical power they will have in their study before collecting data. In order to do so, they usually conduct a power analysis. Suppose you design a study, and a part of it is to demonstrate stereotype threat involving females. Nguyen and Ryan (2008) provide results that indicate the average Cohen’s d in previous studies of gender-based stereotype threat for cognitive tests is about .21. This means that over many studies, females who are NOT made aware of a gender stereotype (NOT primed) score about 0.2 standard deviations higher on cognitive tests than females who are made aware of a gender stereotype (primed). To demonstrate this effect in your study, you will test the following null hypothesis:

H0 :  μNOT primed − μprimed  ≤ 0

If you reject the null, you will place your confidence in the following alternative hypothesis:

HA :  μNOT primed − μprimed  > 0

μNOT primed Indicates the population mean for the “not primed” condition.

μprimed Indicates the population mean for the “primed” condition.

HA : μNOT primed − μprimed  > 0 The alternative hypothesis specifies that the “not primed” condition will score higher than the “primed” condition.

To test this null hypothesis, you would examine a test statistic distribution and note the area in the upper tail of the distribution equal to alpha. Suppose you plan to test this hypothesis with a t-test with 50 participants in each condition (primed or NOT primed).

Because the test statistic is a continuous variable, the curve shows probability density, and probability is found by determining the area under the curve. 

The entire area under the curve, between −∞ and +∞, is 1.00.

To find the probability of a statistic taking on a value within a certain range, you need to find the area under the curve within the range. For example, there are tables that will tell you that the area under the curve between t = 0 and t = +1 corresponds to a probability of about .34. Most importantly, because alpha has been set equal to .05, the area beyond 1.66 corresponds to a probability of .05. Fortunately, statistical programs calculate the areas for you, and you do not need to do the calculations yourself. 
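As a rough check on the two areas quoted above, the sketch below integrates the t density numerically using only the Python standard library. It assumes 98 degrees of freedom (two groups of 50, so 50 + 50 − 2); the helper names t_pdf and area are illustrative, and a statistics package would normally do this for you.

```python
import math

def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def area(lo, hi, df, steps=2000):
    """Area under the t curve between lo and hi (Simpson's rule; steps must be even)."""
    h = (hi - lo) / steps
    total = t_pdf(lo, df) + t_pdf(hi, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(lo + i * h, df)
    return total * h / 3

df = 50 + 50 - 2  # assumed: two independent groups of 50 participants

area_0_to_1 = area(0.0, 1.0, df)
# The curve is symmetric around 0, so half of the total area lies above 0.
upper_tail = 0.5 - area(0.0, 1.66, df)

print(f"area between t = 0 and t = 1: {area_0_to_1:.3f}")
print(f"area beyond t = 1.66:         {upper_tail:.3f}")
```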

Nevertheless, the essence of hypothesis testing is that if you obtain a value of t greater than 1.66, you will say, “This is not a very likely event if the null is true. Thus, the null hypothesis is probably not true because the alternative hypothesis provides a more likely explanation.” In making the decision to reject the null, however, you recognize that if the null is, in fact, true, you are making a type I error.

While alpha provides assurance that the researcher has a small chance of making a type I error, you are also interested in what will happen if the null hypothesis is false—the real world expectation that is driving you to do the study.

To construct this curve based on the alternative hypothesis, a specific value for the difference in means had to be specified; in this case, the value of d = .21, the overall gender effect that Nguyen and Ryan (2008) found. Note, again, that the vertical line with t = 1.66 separates the values of the test statistic that lead to rejecting versus retaining the null hypothesis, and that the line is based on the null hypothesis. The statistical power of the test, (1 − β), is the area under the curve with the dashed lines and to the right of the vertical line for t = 1.66. The area designated by beta (β), to the left of the vertical line, corresponds to the probability of a type II error, retaining the null if the null is actually false.

In this example, note that the area corresponding to power (1- β ) is less than the area corresponding to β . Hence, you can conclude that the power is less than 0.5 because the sum of the two areas is 1.0. Almost always, you would like statistical power to be greater than beta for the important hypothesis tests in your study. In this example, a plan to do an experiment with 50 participants in each group may be doomed. The statistical power of the test (.27) is relatively low, and the risk of making a type II error is relatively high. In other words, the statistical power of the test, as currently constructed, limits your ability to detect a gender effect of priming versus not priming if there is one.
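The power of .27 can be reproduced approximately by hand. The sketch below is an approximation, not the method used to draw the figures: it treats the test statistic under the alternative as a normal curve centered at d·√(n/2) with standard deviation 1, and it reuses the critical value of 1.66 from the text. The function name approx_power is illustrative.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, t_crit):
    """Approximate power of a one-tailed two-sample t-test.

    Assumes the test statistic under the alternative is roughly normal,
    centered at d * sqrt(n / 2) with standard deviation 1.
    """
    center = d * (n_per_group / 2) ** 0.5      # expected value of the test statistic
    return 1 - NormalDist().cdf(t_crit - center)  # area to the right of the cutoff

power = approx_power(d=0.21, n_per_group=50, t_crit=1.66)
beta = 1 - power  # probability of a type II error

print(round(power, 2), round(beta, 2))  # prints 0.27 0.73
```

The two areas sum to 1.0, and power is clearly the smaller of the two.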

Power and Alpha

As the researcher, you have control of alpha, and you will set alpha when you are planning your study. Continuing with the example from the previous page, Figure 3 below shows what would happen to power if you change alpha, the probability of a type I error, to .15.

Compare the curves in Figure 3 to the ones above in Figure 2, where alpha (α) was equal to .05. Notice that β becomes smaller, and power, (1 − β), becomes larger. If you change α to .01, a relatively small value for the probability of a type I error, beta (β) becomes larger, and power becomes smaller. See Figure 4 below.

In general, making alpha (α) smaller results in a decrease in the power of the statistical test, and making alpha larger results in greater power. This is because if you set a more stringent alpha (e.g., .01 instead of .05), it becomes more difficult to reject the null hypothesis. While .05 is the most common value for α, the decision of which value to use is up to the researcher.

Many journal editors expect alpha (α) to equal .05. There are other times, however, when the researcher may wish to use a different value for alpha (α) depending on the severity of the consequences of making a type I error. For example, if you are studying whether or not a drug has serious side effects, with the null specifying that there are no serious side effects, a type I error means concluding that the drug has serious side effects when it actually does not. If that mistake would be costly, you may want a more stringent alpha to lower the risk of making it; you may opt for a .01 alpha instead of a .05 alpha.
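The relationship between alpha and power described above can be sketched numerically. This example assumes the gender-based study from earlier (d = .21, 50 participants per group) and uses a normal approximation with z critical values in place of t critical values, so the result for alpha = .05 comes out slightly above the .27 reported in the text. The function name approx_power is illustrative.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha):
    """Approximate one-tailed power using a normal approximation (assumption)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)           # cutoff with area alpha in the upper tail
    center = d * (n_per_group / 2) ** 0.5   # expected test statistic under the alternative
    return 1 - z.cdf(z_crit - center)

# Same effect size and sample size, three different alpha levels.
power_by_alpha = {a: approx_power(0.21, 50, a) for a in (0.01, 0.05, 0.15)}

for a, p in power_by_alpha.items():
    print(f"alpha = {a:.2f}  approximate power = {p:.2f}")
```

Only the cutoff moves when alpha changes; the curve for the alternative hypothesis stays where it is, which is why a larger alpha leaves more of that curve to the right of the cutoff.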

Power and Effect Size

A second factor that is related to the statistical power of a test is the effect size. There are several measures of effect size. With a comparison of two populations, Cohen’s d is often used. The value of d is the difference in population means between two groups in standard deviation units. According to Cohen’s rule of thumb, a value of d = .2 is considered a small effect, d = .5 is considered a medium sized effect, and d = .8 is considered a large effect.  

Let’s revisit the earlier example about planning a study to demonstrate race-based stereotype threat. Nguyen and Ryan (2008) note that, overall, race-based stereotype threat studies have resulted in an average d equal to about .32. Figure 5 below shows what you can expect if you induced a general racial stereotype threat in a rather typical way so that, in the population, d = .32, there are 50 participants in each group, and alpha = .05. Note that power has increased noticeably compared to the study examined in Figure 2. This is due to the effect size (d = .32) in this figure being larger than the effect size (d = .21) in Figure 2.

There are instances in which stereotype effects as large as d = .64 have been identified in the samples being studied. If the population d is .64, the hypothesis test with alpha = .05 and 50 participants in each group will result in power equal to .93 as shown in Figure 6. This is a high value for statistical power, meaning that the researchers are very likely to detect an effect if d = .64 in the population.
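To see how power grows with effect size when alpha and sample size are held fixed, the sketch below applies the same normal approximation to the three effect sizes discussed so far. It assumes 50 participants per group and the critical value of 1.66 from the text; because it is an approximation, the value for d = .64 lands within about .01 of the .93 shown in Figure 6 rather than matching it exactly.

```python
from statistics import NormalDist

def approx_power(d, n_per_group=50, t_crit=1.66):
    """Approximate one-tailed power; normal approximation to the t-test (assumption)."""
    center = d * (n_per_group / 2) ** 0.5  # expected test statistic under the alternative
    return 1 - NormalDist().cdf(t_crit - center)

effect_sizes = (0.21, 0.32, 0.64)
power_by_d = {d: approx_power(d) for d in effect_sizes}

for d, p in power_by_d.items():
    print(f"d = {d:.2f}  approximate power = {p:.2f}")
```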

Most researchers prefer to have the estimate of power be at least .80 before they are willing to conduct a study. So planning to do a study with 50 participants in each group may be a bad decision if the effect size in the population is small or moderate, as it was above in Figures 2 and 5. On the other hand, with a large effect (e.g., d = .64), a sample of 50 participants in each condition provides more than sufficient statistical power for most researchers.
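The planning question from the start of this Skill Builder (how many participants are needed) can be answered approximately by turning the power calculation around. The sketch below uses the common normal-approximation formula n = 2((z_alpha + z_power) / d)² for a one-tailed two-group comparison; an exact t-based calculation from a statistics package would typically ask for a participant or two more per group. The function name n_per_group is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group (one-tailed, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # cutoff under the null hypothesis
    z_power = z.inv_cdf(power)      # distance needed beyond the cutoff for the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.21, 0.32, 0.64):
    print(f"d = {d:.2f}  about {n_per_group(d)} participants per group")
# d = 0.21  about 281 participants per group
# d = 0.32  about 121 participants per group
# d = 0.64  about 31 participants per group
```

Under these assumptions, 50 participants per group is comfortable only for the largest of the three effect sizes.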

The Relationship Between Power and Sample Size

Prior discussions have focused on testing hypotheses about population means, but you can also do hypothesis tests involving population proportions. In general, larger sample sizes give you more information to pin down the true nature of the population. You can, therefore, expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. 

As a result, for the same level of confidence, you can report a smaller margin of error, and get a narrower confidence interval. In other words, larger sample sizes increase how much you trust your sample results. In the two scenarios below, you will see that a larger sample size results in a greater ability to reject the null when an effect actually exists in the population.

Scenario: Examining Marijuana Use

Imagine you are a researcher examining marijuana use at a certain liberal arts college and read through the scenario below.

Step 1

You believe that marijuana use at the college is greater than the national average, for which large-scale studies have shown that about 15.7% of college students use marijuana (reported by the Harvard School of Public Health). Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Based on your belief, you perform the hypothesis test shown in Figure 7 below.

· Note that p in this figure means population proportion and p̂ means sample proportion. On the other hand, p-value continues to have the same meaning as defined in the glossary.

Because the p-value is greater than .05, the customary alpha level, the data do not provide enough evidence that the proportion of marijuana users at the college is higher than the proportion among all U.S. college students, which is .157.

Step 2

Let’s make some small changes to the above problem. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use as seen in Figure 8 below. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is .157?

You can see that results that are based on a larger sample carry more weight. A sample proportion of .19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users in the college is higher than .157. Recall that this conclusion (not having enough evidence to reject the null hypothesis) doesn't mean the null hypothesis is necessarily true; it only means that the particular study did not yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference, and a type II error was made.

To summarize, you saw that when the sample proportion of .19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than .157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.
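Both tests in this scenario can be reproduced with a one-sided z-test for a single proportion. The sketch below assumes the counts implied by the text (19 of 100 students, matching the sample proportion of .19 with a sample of size 100, and 76 of 400 students) and uses the standard error under the null hypothesis; the function name is illustrative.

```python
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """One-sided z-test of H0: p <= p0 against HA: p > p0."""
    p_hat = successes / n
    se_null = (p0 * (1 - p0) / n) ** 0.5   # standard error if the null is true
    z = (p_hat - p0) / se_null
    p_value = 1 - NormalDist().cdf(z)      # area in the upper tail beyond z
    return p_hat, z, p_value

small = one_proportion_z_test(successes=19, n=100, p0=0.157)
large = one_proportion_z_test(successes=76, n=400, p0=0.157)

for label, (p_hat, z, p_value) in (("n = 100", small), ("n = 400", large)):
    print(f"{label}: sample proportion = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.3f}")
```

The sample proportion is the same in both cases, but quadrupling the sample size halves the standard error and doubles z, which is enough to move the p-value below .05.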

The following graphs show the power of the two tests if the population proportion p for the college is actually .19.
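The power of the two tests can also be approximated directly. The sketch below assumes a one-sided z-test at alpha = .05 and a true population proportion of .19: it finds the smallest sample proportion that would lead to rejecting the null, then asks how likely a sample proportion at least that large is when p = .19. The function name is illustrative, and the figures may have been produced with a different method.

```python
from statistics import NormalDist

def proportion_test_power(p0, p_true, n, alpha=0.05):
    """Approximate power of a one-sided z-test for a proportion (normal approximation)."""
    z = NormalDist()
    # Smallest sample proportion that leads to rejecting H0: p <= p0.
    cutoff = p0 + z.inv_cdf(1 - alpha) * (p0 * (1 - p0) / n) ** 0.5
    # Spread of the sample proportion when the true proportion is p_true.
    se_true = (p_true * (1 - p_true) / n) ** 0.5
    return 1 - z.cdf((cutoff - p_true) / se_true)

power_100 = proportion_test_power(p0=0.157, p_true=0.19, n=100)
power_400 = proportion_test_power(p0=0.157, p_true=0.19, n=400)

print(f"n = 100: approximate power = {power_100:.2f}")
print(f"n = 400: approximate power = {power_400:.2f}")
```

Under these assumptions, the larger sample more than doubles the chance of detecting the difference, though even 400 students leaves power well short of the .80 that most researchers prefer.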