8210 assgn wk 5
Assignment: Evaluating Significance of Findings
Part of your task as a scholar-practitioner is to act as a critical consumer of research and ask informed questions of published material. Sometimes, claims are made that do not match the results of the analysis. Unfortunately, this is why statistics is sometimes unfairly associated with telling lies. These misalignments are not always attributable to statistical nonsense; many stem from “user error.” One of the greatest areas of user error is within the practice of hypothesis testing and interpreting statistical significance. As you continue to consume research, be sure to read everything with a critical eye and call out statements that do not match the results.
For this Assignment, you will examine statistical significance and meaningfulness based on sample statements.
To prepare for this Assignment:
· Review the Week 5 Scenarios found in this week’s Learning Resources and select two of the four scenarios for this Assignment.
· For additional support, review the Skill Builder: Evaluating P Values and the Skill Builder: Statistical Power, which you can find by navigating back to your Blackboard Course Home Page. From there, locate the Skill Builder link in the left navigation pane.
For this Assignment:
Critically evaluate the two scenarios you selected based upon the following points:
· Critically evaluate the sample size.
· Critically evaluate the statements for meaningfulness.
· Critically evaluate the statements for statistical significance.
· Based on your evaluation, provide an explanation of the implications for social change.
Use proper APA format, citations, and references.
https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf
Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society (9th ed.). Thousand Oaks, CA: Sage Publications.
· Chapter 8, “Testing Hypothesis: Assumptions of Statistical Hypothesis Testing” (pp. 241-242)
Wagner, III, W. E. (2020). Using IBM® SPSS® statistics for research methods and social science statistics (7th ed.). Thousand Oaks, CA: Sage Publications.
· Chapter 6, “Testing Hypotheses Using Means and Cross-Tabulation”
Walden University, LLC. (Producer). (2016f). Meaningfulness vs. statistical significance [Video file]. Baltimore, MD: Author.
Note: The approximate length of this media piece is 4 minutes.
In this media program, Dr. Matt Jones discusses the differences in meaningfulness and statistical significance. Focus on how this information will inform your Discussion and Assignment for this week.
Skill builder: Evaluating P Values
Hypothesis Testing
In doing research, one of the most common activities is testing hypotheses. The Afrobarometer data set below is a survey of African citizens’ attitudes on democracy, governance, the economy, and other related topics (www.afrobarometer.org). Using this data set, you might want to examine hypotheses related to whether rural and urban citizens differ, on average, in how much they trust the government. The tables below present results from an independent samples t-test to examine these hypotheses using a random sample of 44 participants from the complete data set. Each respondent’s score is a value between 0 and 15 with a higher score indicating greater trust. You can see that the mean for the urban group is 7.00 (SD = 4.17) and the mean for the rural group is 7.74 (SD = 4.38). The observed value of the t-statistic is -.564 and the p-value equals 0.576 (see the column labeled “Sig. (2-tailed)”).
African Citizens' Attitudes on Democracy
| | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference |
|---|---|---|---|---|---|
| Trust in Government Index (higher scores = more trust) | -.564 | 41 | .576 | -.73913 | 1.30978 |
Group Statistics
| | Urban or Rural Primary Sampling Unit | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Trust in Government Index (higher scores = more trust) | Urban | 20 | 7.000 | 4.16754 | .93189 |
| | Rural | 23 | 7.7391 | 4.38196 | .91370 |
The p-value is the probability of obtaining a t-value at least as extreme as the observed one (less than -.564 or greater than +.564) if the null hypothesis is true and you were to repeat the test with a new sample of data. You will see in this Skill Builder that the p-value can easily be used to make statistical decisions in hypothesis testing. However, while the p-value is important in determining statistical significance, it does not tell the whole story.
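The reported t-statistic and p-value can be checked from the summary statistics alone. The following is a minimal sketch, not the procedure the statistical package uses internally. It assumes the pooled-variance ("equal variances assumed") form of the test and group sizes of 20 (urban) and 23 (rural), which are inferred here from the reported standard errors and df = 41. The function names are hypothetical, and the tail area is found by simple numerical integration rather than a statistics library.

```python
import math

def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df, standard error) for a pooled-variance two-sample t-test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df, se

def t_density(x, df):
    """Probability density of Student's t distribution at x."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: area beyond -|t| and +|t|, using Simpson's rule."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3          # area between 0 and |t|
    return 1 - 2 * area_0_to_t           # the curve is symmetric around 0

# Urban group first, rural group second (group sizes are an assumption, see above).
t_obs, df, se = pooled_t_test(7.0, 4.16754, 20, 7.7391, 4.38196, 23)
p_value = two_tailed_p(t_obs, df)
print(f"t({df}) = {t_obs:.3f}, SE = {se:.5f}, p = {p_value:.3f}")
```

Under these assumptions the sketch should agree with the tabled values to rounding, which is a useful habit when checking whether a published claim matches its own numbers.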
Steps of Hypothesis Testing
To interpret p-values, let's review the key steps in hypothesis testing.
Step 1
State the null and alternative hypotheses
Recall that hypotheses are statements about population parameters. For the Trust in Government example from the Afrobarometer data set, the null (H0) and alternative (HA) hypotheses are:
H0 : μurban = μrural
HA : μurban ≠ μrural
The Greek letter, µ, indicates a population mean, and the subscripts indicate levels of the independent variable (“urban” and “rural”). Here the null is saying that the mean for the urban population on the Trust In Government variable is the same as the mean for the rural population. The alternative hypothesis states that these means are not the same.
Step 2
Set alpha, the probability of a type I error
Frequently, the value of alpha is set equal to 0.05, although researchers are free to use other values. If using an alpha of .05, then researchers are specifying that, when the null hypothesis is actually true, there is a 5% chance that they will reject it anyway. Setting alpha at .05 is popular because there is relatively minimal risk of making a type I error, and alpha is not so small that researchers greatly increase their risk of not rejecting the null when they actually should (a type II error). So in setting alpha, researchers have to be aware of both the risk of rejecting the null erroneously and of not rejecting it when they actually should. For our Afrobarometer example, we will set alpha at .05.
Step 3
Decide on a test statistic
Because of a desire to compare two groups (rural and urban), a t-test for two independent samples is being used.
Step 4
Collect the data and examine the model assumptions
Before calculating the value for your test statistic, be sure you have checked assumptions, like homogeneity of variance and the absence of outliers.
Step 5
Calculate the observed value of the test statistic
Once the data have been collected, the observed value of the test statistic will be used to make a statistical decision. In the Afrobarometer example, the observed value of the test statistic is -.564, sometimes written as tobserved(41)= −.564 where the 41 is the number of degrees of freedom associated with the test.
Step 6
Make a statistical decision using the observed value
This decision requires examining the distribution of the test statistic under the assumption the null hypothesis is true. Practically, the area in the tail of the distribution beyond the observed value of the test statistic, called the p-value, needs to be determined (see the figure above). Fortunately, computer programs can do the calculation of the area quickly and easily. If the probability is less than alpha (e.g., .05), we will reject the null hypothesis. Thus, if you set alpha equal to .05 and the p-value for your test statistic is any value less than .05, you will reject the null hypothesis. Otherwise, retain the null. In the Afrobarometer example, the p-value of .576 is greater than .05, so the null hypothesis is retained.
Step 7
Make a real-world decision
The statistical decision is focused on the abstract hypothesis test. The final step is to examine the implications of the statistical decision in the real world. You will need to consider whether your results are practically significant. It turns out that not all statistically significant results are important in the real world. We will discuss more about this later in the Skill Builder.
Skill Builder: Statistical Power
Statistical Power
Statistical power is the probability of rejecting a null hypothesis if the null is false (i.e., the alternative is true). It is the degree to which the researcher is able to detect an effect if there actually is one. With low statistical power, a researcher may struggle to detect an effect (to reject the null), even if an effect actually occurs in the population.
Suppose you are planning an experiment involving stereotype threat. Stereotype threat is defined as a tendency to behave in a manner consistent with negative beliefs that others have about a racial or gender group. For example, if some black test takers are told that as a group, black test takers do not perform well on math tests, performance among those black test-takers is worse than for black test takers for whom the stereotype is not evoked. One question you will need to answer is how many participants should you include in your study to be confident in identifying the effect? In other words, how many participants do you need in order to have adequate statistical power in your study?
Factors That Affect Statistical Power
Understanding how several factors affect the statistical power of a study will help you to understand and critique research findings and will also lead to greater satisfaction with your own research. When conducting your own research studies, you should do a power analysis prior to collecting data to make sure you have a good chance of demonstrating the effect you are looking for.
There are three main factors that affect how much statistical power you have in your study:
1. Alpha (i.e., the probability of a type I error)
2. Effect size (i.e., the difference between the population means for the experimental and control groups)
3. Sample size (i.e., n)
As a researcher, you have control over alpha and sample size. The effect size, however, is not under your control and is predetermined. What will be important to you is having an idea about how great the effect may be. This Skill Builder is concerned with how alpha, effect size, and sample size are related to statistical power.
A Review of Hypothesis Testing
Before discussing power, let’s review the basics of hypothesis testing:
· The null hypothesis is the statement of no effect.
· The alternative hypothesis is a statement that an effect exists in the population.
· Obtaining a significant result means that you have rejected the null hypothesis and have concluded that it’s likely that there is an effect in the population.
· A type I error happens when the null hypothesis is true but you reject it erroneously. This is referred to as a false positive.
· A type II error happens when the null hypothesis is false but you fail to reject it. This is referred to as a false negative.
Reviewing Type I and Type II Errors
Type I and type II errors and their probabilities are important concepts when thinking about hypothesis testing. These error events are called “conditional,” meaning that the events can only occur under certain conditions.
The following is the language that is used to talk about these conditional events:
· Alpha (α) = P(type I error) = P(Reject H0 | H0 is true), which is read as the probability of a type I error equals the probability of rejecting the null hypothesis given the null is true.
· Beta (β) = P(type II error) = P(Retain H0 | HA is true), which is read as the probability of a type II error equals the probability of retaining the null hypothesis given the alternative hypothesis is true.
Table 1 shows the possible outcomes for a hypothesis test.
Table 1: Possible Outcomes for a Hypothesis Test
| Decision | True State of Nature: Ho is true | True State of Nature: Ho is false |
|---|---|---|
| Retain Ho | Correct decision | Type II error |
| Reject Ho | Type I error | Correct decision |
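Because alpha and beta are conditional probabilities, they can be illustrated by simulating many studies under each "true state of nature." The sketch below is purely illustrative: it uses a hypothetical one-sample, one-tailed z-test with a known standard deviation of 1, n = 25, and a made-up true mean of 0.5 when the null is false. None of these values come from the examples in this Skill Builder.

```python
import random
import statistics

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, reps=10_000, seed=1):
    """Share of simulated one-tailed z-tests (H0: mu <= 0) that reject H0."""
    rng = random.Random(seed)                       # seeded so results are repeatable
    critical_z = statistics.NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = statistics.fmean(sample) / (sigma / n ** 0.5)
        if z > critical_z:
            rejections += 1
    return rejections / reps

type_i_rate = rejection_rate(true_mean=0.0)   # H0 true: every rejection is a type I error
power = rejection_rate(true_mean=0.5)         # H0 false: every rejection is a correct decision
type_ii_rate = 1 - power                      # H0 false: every retention is a type II error
print(f"alpha ~ {type_i_rate:.3f}, beta ~ {type_ii_rate:.3f}, power ~ {power:.3f}")
```

The first rate should land near the chosen alpha of .05, while the second depends on the assumed true mean and sample size, which is the subject of the power analysis that follows.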
Power Analysis
Power analysis is the process of examining a test of the null hypothesis to determine the chances of rejecting it and placing belief in the alternative hypothesis.
Researchers typically want to get a sense of how much statistical power they will have in their study before collecting data. In order to do so, they usually conduct a power analysis. Suppose you design a study, and a part of it is to demonstrate stereotype threat involving females. Nguyen and Ryan (2008) provide results that indicate the average Cohen’s d in previous studies of gender-based stereotype threat for cognitive tests is about .21. This means that over many studies, females who are NOT made aware of a gender stereotype (NOT primed) score about 0.2 standard deviations higher on cognitive tests than females who are made aware of a gender effect (primed). To demonstrate this effect in your study, you will test the following null hypothesis:
H0 : μNOT primed − μprimed ≤ 0
If you reject the null, you will place your confidence in the following alternative hypothesis:
HA : μNOT primed − μprimed > 0
μNOT primed indicates the population mean for the “not primed” condition.
μprimed indicates the population mean for the “primed” condition.
The alternative hypothesis, HA : μNOT primed − μprimed > 0, specifies that the “not primed” condition will score higher than the “primed” condition.
To test this null hypothesis, you would examine a test statistic distribution and note the area in the upper tail of the distribution equal to alpha. Suppose you plan to test this hypothesis with a t-test with 50 participants in each condition (primed or NOT primed).
Figure 1 shows the sampling distribution you should expect for the test statistic if the null hypothesis is true. In order to reject the null hypothesis, the t value would need to be greater than 1.66055.
Figure 1
Because the test statistic is a continuous variable, the curve shows probability density, and probability is found by determining the area under the curve.
The entire area under the curve, between - ∞ and + ∞ , is 1.00.
To find the probability of a statistic taking on a value within a certain range, you need to find the area under the curve within the range. For example, there are tables that will tell you that the area under the curve between t = 0 and t = +1 corresponds to a probability of about .34. Most importantly, because alpha has been set equal to .05, the area beyond 1.66 corresponds to a probability of .05. Fortunately, statistical programs calculate the areas for you, and you do not need to do the calculations yourself.
Nevertheless, the essence of hypothesis testing is that if you obtain a value of t greater than 1.66, you will say, “This is not a very likely event if the null is true. Thus, the null hypothesis is probably not true because the alternative hypothesis provides a more likely explanation.” In making the decision to reject the null, however, you recognize that if the null is, in fact, true, you are making a type I error.
While alpha provides assurance that the researcher has a small chance of making a type I error, you are also interested in what will happen if the null hypothesis is false—the real world expectation that is driving you to do the study.
Figure 2
Now, in Figure 2, switch your focus from the curve on the left and attend to the curve on the right formed by the dashed line. This curve is based on the alternative hypothesis (i.e., that the unprimed group performs better than the primed group).
To construct this curve based on the alternative hypothesis, a specific value for the difference in means had to be specified; in this case, the value of d = .21, the overall gender effect that Nguyen and Ryan (2008) found. Note, again, that the vertical line with t = 1.66 separates the values of the test statistic that lead to rejecting versus retaining the null hypothesis, and that the line is based on the null hypothesis. The statistical power of the test, (1 − β), is the area under the curve with the dashed lines and to the right of the vertical line for t = 1.66. The area designated by beta (β), to the left of the vertical line, corresponds to the probability of a type II error, retaining the null if the null is actually false.
In this example, note that the area corresponding to power (1- β ) is less than the area corresponding to β . Hence, you can conclude that the power is less than 0.5 because the sum of the two areas is 1.0. Almost always, you would like statistical power to be greater than beta for the important hypothesis tests in your study. In this example, a plan to do an experiment with 50 participants in each group may be doomed. The statistical power of the test (.27) is relatively low, and the risk of making a type II error is relatively high. In other words, the statistical power of the test, as currently constructed, limits your ability to detect a gender effect of priming versus not priming if there is one.
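The power of this planned study can be approximated with a few lines of code. The sketch below is an approximation, not the calculation behind the figures: it replaces the t distributions with normal curves, so the cutoff is about 1.645 rather than 1.66 and the result will differ slightly from the .27 reported above. The function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed, two-group test (normal approximation)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)            # cutoff set by the null distribution
    shift = d * (n_per_group / 2) ** 0.5       # center of the alternative curve
    beta = z.cdf(critical - shift)             # alternative-curve area left of the cutoff
    return 1 - beta

power = approx_power(d=0.21, n_per_group=50)
print(f"power ~ {power:.2f}, beta ~ {1 - power:.2f}")
```

The shift term shows why both effect size and sample size matter: the alternative curve moves away from the null curve in proportion to d and to the square root of the group size.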
Power Analysis
As the researcher, you have control of alpha, and you will set alpha when you are planning your study. Continuing with the example above, Figure 3 below shows what would happen to power if you change alpha, the probability of a type I error, to .15.
Figure 2
Compare the curves in Figure 3 to the ones above in Figure 2; in that figure, alpha ( α ) was equal to .05. Notice that β becomes smaller, and power, (1- β ), becomes larger. If you change α to .01, a relatively small value for the probability of a type I error, beta ( β ) becomes larger, and power becomes less. See Figure 4 below.
Figure 3
Figure 4
· In general, making alpha (α) smaller results in a decrease in the power of the statistical test, and making alpha larger results in greater power. This is because if you set a more stringent alpha (e.g., .01 instead of .05), it becomes more difficult to reject the null hypothesis. While .05 is the typical value for α, the decision of which value to use is up to the researcher.
· Many journal editors expect alpha (α) to equal .05. There are other times, however, when the researcher may wish to use a different value for alpha (α) depending on the severity of the consequences of making a type I error. For example, if you are studying whether or not a drug has serious side effects, with the null specifying that there are no serious side effects, a more stringent alpha (.01 instead of .05) lowers your risk of saying that there are serious side effects when there actually are not. It does not lower the risk of saying that there aren’t side effects when there actually are; that is a type II error, and a smaller alpha makes it more likely unless the sample size is increased.
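The trade-off between alpha and power can be tabulated for the planned stereotype threat study (d = .21, 50 participants per group). As before, this is a sketch under a normal approximation with a hypothetical function name, so the values are indicative rather than exact matches to the figures.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha):
    """Normal-approximation power for a one-tailed, two-group comparison."""
    z = NormalDist()
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - d * (n_per_group / 2) ** 0.5)

# Same planned study under three choices of alpha.
results = {alpha: approx_power(0.21, 50, alpha) for alpha in (0.01, 0.05, 0.15)}
for alpha, power in results.items():
    print(f"alpha = {alpha:.2f}: power ~ {power:.2f}, beta ~ {1 - power:.2f}")
```

Reading down the output, each increase in alpha buys power at the cost of a higher type I error rate, which is why alpha should be chosen by weighing the consequences of each kind of error.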
Power and Effect Size
A second factor that is related to the statistical power of a test is the effect size. There are several measures of effect size. With a comparison of two populations, Cohen’s d is often used. The value of d is the difference in population means between two groups in standard deviation units. According to Cohen’s rule of thumb, a value of d = .2 is considered a small effect, d = .5 is considered a medium sized effect, and d = .8 is considered a large effect.
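Effect size is also the natural tool for judging meaningfulness. As an illustration, a sample estimate of Cohen's d can be computed for the Afrobarometer trust example from the Evaluating P Values Skill Builder. The sketch assumes group sizes of 20 and 23 (inferred from the reported standard errors) and uses the pooled standard deviation. The labeling function is a hypothetical convenience that treats Cohen's benchmarks as cut points, which is a simplification of a rule of thumb.

```python
import math

def cohens_d_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Sample Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def cohen_label(d):
    """Rough label based on the benchmarks .2 (small), .5 (medium), .8 (large)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Urban minus rural trust scores (group sizes are an assumption, see above).
d = cohens_d_from_summary(7.0, 4.16754, 20, 7.7391, 4.38196, 23)
print(f"d = {d:.2f} ({cohen_label(d)})")
```

Reporting d alongside the p-value keeps the question "is there an effect?" separate from "is the effect large enough to matter?"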
Let’s revisit the earlier example about planning a study to demonstrate race-based stereotype threat. Nguyen and Ryan (2008) note that overall race-based stereotype threat studies have resulted in an average d equal to about .32. Figure 5 below shows what you can expect if you induced a general racial stereotype threat in a rather typical way so that in the population d = .32, there are 50 participants in each group, and alpha = .05. Note that power has increased noticeably compared to the study examined in Figure 2. This is due to the effect size ( d = .32) in this figure being larger than the effect size ( d = .21) in Figure 2.
Figure 5
There are instances in which stereotype effects as large as d = .64 have been identified in the samples being studied. If the population d is .64, the hypothesis test with alpha = .05 and 50 participants in each group will result in power equal to .93 as shown in Figure 6. This is a high value for statistical power, meaning that the researchers are very likely to detect an effect if d = .64 in the population.
Figure 6
Most researchers prefer to have the estimate of power be at least .80 before they are willing to conduct a study. So planning to do a study with 50 participants in each group may be a bad decision if the effect size in the population is small or moderate, as it was above in Figures 2 and 5. On the other hand, with a large effect (e.g., d = .64), a sample of 50 participants in each condition provides more than sufficient statistical power for most researchers.
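The .80 guideline can be turned around to ask how many participants per group a study would need. The sketch below solves the normal-approximation power formula for n in a one-tailed test with alpha = .05. It is a planning aid under stated assumptions, with a hypothetical function name; dedicated power software based on the t distribution will typically report slightly larger samples.

```python
import math
from statistics import NormalDist

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Smallest per-group n reaching the target power (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)     # cutoff under the null
    z_power = z.inv_cdf(power)         # distance needed beyond the cutoff
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

needed = {d: n_per_group_for_power(d) for d in (0.21, 0.32, 0.64)}
for d, n in needed.items():
    print(f"d = {d:.2f}: about {n} participants per group for power of .80")
```

Because n grows with the inverse square of d, halving the expected effect size roughly quadruples the required sample, which is why 50 per group suffices for d = .64 but not for d = .21 or d = .32.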
The Relationship Between Power and Sample Size
Prior discussions have focused on testing hypotheses about population means, but you can also do hypothesis tests involving population proportions. In general, larger sample sizes give you more information to pin down the true nature of the population. You can, therefore, expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively.
As a result, for the same level of confidence, you can report a smaller margin of error, and get a narrower confidence interval. In other words, larger sample sizes increase how much you trust your sample results. In the two scenarios below, you will see that a larger sample size results in a greater ability to reject the null when an effect actually exists in the population.
Scenario: Examining Marijuana Use
Imagine you are a researcher examining marijuana use at a certain liberal arts college and read through the scenario below.
Step 1
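You believe that marijuana use at the college is greater than the national average; large-scale studies have shown that about 15.7% of college students use marijuana (reported by the Harvard School of Public Health). Based on this belief, you test Ho: p = .157 against HA: p > .157, as shown in the figure below. In a random sample of 100 students from the college, 19 admit to marijuana use, a sample proportion of .19. The test statistic is z = .91, and the p-value is .182.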
· Note that p in this figure means population proportion and p̂ means sample proportion. On the other hand, p-value continues to have the same meaning as defined in the glossary.
Because the p-value is greater than .05, the customary alpha level, the data do not provide enough evidence that the proportion of marijuana users at the college is higher than the proportion among all U.S. college students, which is .157.
Step 2
Let’s make some small changes to the above problem. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use as seen in Figure 8 below. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is .157?
Step 3
You now have a larger sample (400 instead of 100), and also the number of marijuana users is 76 instead of 19. The question of interest did not change, so if you carry out the test in this case, you are testing the same hypotheses: Ho: p = .157 and HA: p > .157.
Step 4
You select a random sample of size 400 and find that 76 are marijuana users, so the sample proportion is p̂ = 76/400 = .19. This is the same sample proportion as in the original problem, so it seems that the data give the same evidence.
Step 5
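However, when you calculate the test statistic, you see that this is actually not the case:
z = (p̂ − p0) / √(p0(1 − p0)/n) = (.19 − .157) / √(.157 × .843 / 400) ≈ 1.81
where p0 = .157 is the proportion stated in the null hypothesis.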
Even though the sample proportion is the same (.19), because here it is based on a larger sample (400 instead of 100), it is 1.81 standard deviations above the null value of .157 (as opposed to .91 standard deviations in the original problem). The sampling distribution for the sample proportion has a smaller standard error because of the larger sample size.
Step 6
The p-value here is .035, as opposed to .182 in the original problem. In other words, when Ho is true (i.e., if p = .157 at the college), it is quite unlikely (probability of .035) to get a sample proportion of .19 or higher based on a sample of size 400. When the sample size is 100, a sample proportion of .19 or higher is considerably more likely (probability of .182).
The results here are important. With n = 400, the data provide enough evidence to reject Ho and conclude that the proportion of marijuana users at the college is higher than among all U.S. students. With n = 100, however, the evidence is insufficient to reject the null. Figure 9 summarizes these findings.
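The two tests can be reproduced with a short calculation. This is a minimal sketch of a one-sided z-test for a single proportion, with the standard error computed under the null value of .157; the function name is hypothetical.

```python
from statistics import NormalDist

def one_sided_proportion_test(successes, n, p0):
    """z statistic and upper-tail p-value for H0: p = p0 against HA: p > p0."""
    p_hat = successes / n
    se_null = (p0 * (1 - p0) / n) ** 0.5      # standard error computed under H0
    z = (p_hat - p0) / se_null
    p_value = 1 - NormalDist().cdf(z)          # area above the observed z
    return p_hat, z, p_value

for successes, n in [(19, 100), (76, 400)]:
    p_hat, z, p_value = one_sided_proportion_test(successes, n, p0=0.157)
    decision = "reject Ho" if p_value < 0.05 else "retain Ho"
    print(f"n = {n}: p_hat = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.3f} -> {decision}")
```

The sample proportion is identical in both runs; only the standard error changes, shrinking by half when the sample is four times as large.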
You can see that results that are based on a larger sample carry more weight. A sample proportion of .19 based on a sample of size of 100 was not enough evidence that the proportion of marijuana users in the college is higher than .157. Recall that this conclusion (not having enough evidence to reject the null hypothesis) doesn't mean the null hypothesis is necessarily true; it only means that the particular study did not yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference, and a type II error was made.
To summarize, you saw that when the sample proportion of .19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than .157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.
The following graphs show the power of the two tests if the population proportion p for the college is actually .19.
Figure 10
Figure 11
Figure 12
Finally, Figure 12 shows how sample size affects the test for proportions concerning marijuana use at the liberal arts college. The graph is based on a hypothesis test with alpha = .05, the proportion for the null hypothesis equal to .157, and the population proportion for the liberal arts college = .19.
In general, whether you are testing hypotheses about proportions, means, or other parameters, the larger the sample size, the greater the statistical power. Because of your interest in rejecting the null, you need to pay attention to how large your sample size will be prior to collecting data.
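The relationship between sample size and power for the marijuana example can be sketched numerically. The code below assumes the setup described above (alpha = .05, null proportion .157, true proportion .19) and uses the normal approximation to the sampling distribution of the sample proportion. The function name and the particular sample sizes beyond 100 and 400 are illustrative choices.

```python
from statistics import NormalDist

def proportion_test_power(n, p0=0.157, p_true=0.19, alpha=0.05):
    """Approximate power of the one-sided test of H0: p = p0 when p is really p_true."""
    z = NormalDist()
    # Smallest sample proportion that leads to rejecting H0.
    cutoff = p0 + z.inv_cdf(1 - alpha) * (p0 * (1 - p0) / n) ** 0.5
    # Spread of the sample proportion when the true proportion is p_true.
    se_true = (p_true * (1 - p_true) / n) ** 0.5
    return 1 - z.cdf((cutoff - p_true) / se_true)

for n in (100, 400, 800, 1200):
    print(f"n = {n}: power ~ {proportion_test_power(n):.2f}")
```

Under these assumptions, even the sample of 400 that produced a significant result had only modest power, so a replication of the same size could easily fail to reject the null.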