
Andreas Rentz/Getty Images

chapter 4

Applying z to Groups

Learning Objectives

After reading this chapter, you will be able to. . .

1. describe the distribution of sample means.

2. explain the central limit theorem.

3. analyze the relationship between sample size and confidence in normality.

4. calculate and explain z-test results.

5. explain statistical significance.

6. calculate and explain confidence intervals.

7. explain how decision errors can affect statistical analysis.

8. calculate the z-test using Excel.

9. present results and draw conclusions based on z-tests.

10. interpret results of z-tests in APA format.


suk85842_04_c04.indd 103 10/23/13 1:16 PM

CHAPTER 4Section 4.1 The Distribution of Sample Means

Chapter 3 ended by noting that, as interesting as it is to be able to determine the percentage of individuals below or above a point or between two scores, we are more often interested in groups than in individuals. A researcher is more likely to investigate the probability that a group of clients a psychologist has been working with will score below some point on a depression scale. In this chapter, what you have learned in the first three chapters will be applied to analyses of groups.

In Chapters 2 and 3, we discussed that many characteristics that interest behavioral scientists are normally distributed in a population. But, by inference, that also means that some characteristics are probably not normally distributed, and in the moment it may not be clear which are not. Relying on Table A in the Appendix to reveal the proportions of the entire population that fall in certain areas is appropriate only if the data are normally distributed; Table A assumes a normal distribution.

4.1 The Distribution of Sample Means

So where do we go if we are suspicious about the normality of the data? The answer is the distribution of sample means, a distribution made of the means of samples rather than individual scores.

To this point, population has meant populations created by sampling one subject at a time, measuring each individual on some trait, and then plotting each score in a frequency distribution. Consider an alternative. What if, instead of selecting each individual in a population one at a time, an analyst

1. selects a group with a specified size,

2. calculates the sample mean (M) for each group,

3. plots M (rather than the individual scores) in a frequency distribution, and

4. continues doing this until the population is exhausted?

How would that affect the distribution? Would it still be a population? The answer to the second question is yes, it is still a population. Recall that, by definition, a population is all members of a defined group. Whether the members of the population are measured individually or as members of a group is incidental, as long as they are all included.

If researchers want to know how dogmatic registered voters in Brazos County, Texas, are, they can measure each voter and then record the mean level of dogmatism for each group of 30. If the means for groups of 30 are recorded until the population is exhausted, the result is still a population.

The Central Limit Theorem

The first question—how would the distribution be affected?—is a little more involved, but it is very important to nearly everything we do in statistical analysis. The answer requires introducing the central limit theorem, which holds that

• if a population is sampled an infinite number of times using sample size n and

• the mean (M) of each sample is determined,

• the multiple Ms will take on the characteristics of a normal distribution whether or not the original population of individuals is normal.

Take a minute to absorb this. A population of an infinite number of sample means drawn from one population will reflect a normal distribution whatever the nature of the original distribution. A healthy skepticism prompts at least two questions:

1. How would we know whether this is true given that an infinite number of samples is out of everyone’s reach?

2. How can sampling in groups rather than as individuals affect normality?

Although prove is too strong a word, we can at least provide evidence for the effect of the central limit theorem with an example. Perhaps a psychologist is working with 10 people who are very resistant to change; they are highly dogmatic. Technically, because 10 is the number in the entire group, the population is N = 10 (the uppercase N signifies the population). A small population does not change the fact that there still cannot be an infinite number of samples, of course, but for the sake of the illustration let us assume that

• dogmatism scores are available for each of the 10 people;

• the data are interval scale;

• the scores range from 1 to 10; and

• each person receives a different score.

So with N = 10, the scores are

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Figure 4.1 is a frequency distribution of those 10 scores.

Figure 4.1: A frequency distribution for the scores 1–10, with each score occurring once

[Figure 4.1 axes: Score Values (1–10) on the horizontal axis; Score Frequency (1–10) on the vertical axis.]


The distribution in Figure 4.1 is many things, perhaps, but it is not normal. With range = 10 − 1 = 9 and s = 3.028 (a calculation worth checking; this value uses n − 1 in the denominator), it is extremely platykurtic, with no apparent mode (or 10 modes). We can illustrate the workings of the central limit theorem with the following procedure:

a. We will use samples of just n = 2.

b. Rather than an infinite number of samples, we will make the example manageable by using one sample for each possible combination of scores in samples of n = 2 from the population.

All the possible combinations of two scores from values 1–10 are listed in Table 4.1. Counting order (so that 1, 2 and 2, 1 are separate samples), there are 10 × 9 = 90 possible combinations of the 10 dogmatism scores.

Table 4.1: All possible combinations of the integers 1–10

1, 2    2, 1    3, 1    4, 1    5, 1    6, 1    7, 1    8, 1    9, 1    10, 1
1, 3    2, 3    3, 2    4, 2    5, 2    6, 2    7, 2    8, 2    9, 2    10, 2
1, 4    2, 4    3, 4    4, 3    5, 3    6, 3    7, 3    8, 3    9, 3    10, 3
1, 5    2, 5    3, 5    4, 5    5, 4    6, 4    7, 4    8, 4    9, 4    10, 4
1, 6    2, 6    3, 6    4, 6    5, 6    6, 5    7, 5    8, 5    9, 5    10, 5
1, 7    2, 7    3, 7    4, 7    5, 7    6, 7    7, 6    8, 6    9, 6    10, 6
1, 8    2, 8    3, 8    4, 8    5, 8    6, 8    7, 8    8, 7    9, 7    10, 7
1, 9    2, 9    3, 9    4, 9    5, 9    6, 9    7, 9    8, 9    9, 8    10, 8
1, 10   2, 10   3, 10   4, 10   5, 10   6, 10   7, 10   8, 10   9, 10   10, 9

As a test of the central limit theorem, if we calculate the mean of each possible pair of scores and plot those means in a frequency distribution, the result is Figure 4.2. Because the entire distribution is based on sample means, Figure 4.2 is a distribution of sample means.
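For readers who would rather not tally 90 means by hand, the procedure can be sketched in a few lines of Python. This is only an illustration, not part of the chapter's Excel workflow; the variable names are our own, and the text histogram is a rough stand-in for the plotted figure.

```python
from collections import Counter
from itertools import permutations

scores = range(1, 11)  # the 10 dogmatism scores, 1 through 10

# Every ordered pair of two different scores, as listed in Table 4.1.
pairs = list(permutations(scores, 2))

# The mean (M) of each pair.
sample_means = [(a + b) / 2 for a, b in pairs]

# How often each distinct sample mean occurs.
frequency = Counter(sample_means)

# A crude text histogram: one asterisk per occurrence of each mean.
for m in sorted(frequency):
    print(f"{m:4.1f} {'*' * frequency[m]}")
```

The tallest row is the one for M = 5.5, and the shortest rows are at the two ends, M = 1.5 and M = 9.5.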


Figure 4.2: A frequency distribution of the means of all possible pairs of scores 1–10

The Mean of the Distribution of Sample Means

The symbol used for a population mean to this point, $\mu$, is actually the symbol for a population mean formed from one score at a time. To distinguish between the mean of the population of individual scores and the mean of the population of sample means, we will subscript $\mu$ with an M: $\mu_M$. This symbol indicates a population mean based on sample means.

With a distribution of just 90 sample means, this is nothing like an infinite number, of course, but the resulting figure is instructive nevertheless.

• The mean of the scores 1–10 is 5.5: $\mu = 5.5$.

Study Figure 4.2 for a moment. What is the mean of that distribution?

• The mean of that distribution of sample means is also 5.5: $\mu_M = 5.5$.

The point is this: When the same data are used to create two distributions, one a population based on individual scores and the other a distribution of sample means, the two population means will have the same value,

• $\mu = \mu_M$.
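The equality of the two means is easy to confirm numerically. The sketch below assumes the same 10 scores and the same 90 ordered pairs used in the example; `mu` and `mu_M` are simply our names for the two population means.

```python
from itertools import permutations
from statistics import fmean

scores = list(range(1, 11))  # the 10 individual scores

# Mean of the population of individual scores.
mu = fmean(scores)

# Mean of the distribution of sample means (all ordered pairs, n = 2).
sample_means = [(a + b) / 2 for a, b in permutations(scores, 2)]
mu_M = fmean(sample_means)

print(mu, mu_M)
```

Both values come out at 5.5, as the text states.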

Describing the distribution as “normal” is a stretch, but Figure 4.2 is certainly much more like a normal distribution than Figure 4.1 is. For one thing, rather than the perfectly flat distribution that occurs when all the scores have the same frequency, mean scores near the middle of the distribution in Figure 4.2 occur more frequently than means at the extreme right or left. Why are extreme scores less likely than scores near the middle of the distribution? It is because many combinations of scores can produce the mean values in the middle of the distribution, but comparatively few combinations can produce the values in the tails. With repetitive sampling, the mean scores that can be produced by multiple combinations increase in frequency and the more extreme scores occur only occasionally; this tendency is illustrated in the next section.

[Figure 4.2 axes: Sample Means (1.5–9.5) on the horizontal axis; Score Frequency (1–10) on the vertical axis.]

Try It! A: Why is there less variability in the distribution of sample means than in a distribution of individual scores?

Variability in the Distribution of Sample Means

In the original distribution of 10 scores (Figure 4.1), what is the probability that someone could randomly select one score (x) that happens to have a value of 1? Because there are 10 scores, and just one score of 1, the probability is p = 1/10 = .1, right? For the same reason, what is the probability of selecting x = 10? It is the same, p = .1.

Now, moving to the distribution based on the 90 scores, what is the probability of selecting a sample of n = 2 that will have M = 1.0? Is there any probability of selecting two scores out of the 10 that will have M = 1.0? Because there is only a single 1, the answer is no. As soon as a score of 1 is averaged with any other score in the group, M > 1 because all other scores are greater than 1. That’s why the lowest possible mean score in Figure 4.2 is 1.5, which can occur only when 1 and 2 are in the same sample.

The same thing occurs in the upper end of the distribution. The probability of selecting a group of n = 2 with M = 10 is also zero (p = 0) because all other scores are lower than 10. In the 90 possible combinations, the highest possible mean score is 9.5, which can occur only when the 10 and the 9 happen to be in the same sample. There will always be less variability in a distribution of sample means than in a distribution of scores sampled one at a time. As the size of the sample increases, the impact of the most extreme scores likewise diminishes as they are included in samples with less extreme scores.
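The shrinking reach of the extremes can be seen by repeating the exercise for several sample sizes. The sketch below is an illustration under our own assumptions: it uses unordered samples (which give the same lowest and highest means as the ordered pairs in Table 4.1), and the sample sizes 1, 2, 3, and 5 are chosen arbitrarily.

```python
from itertools import combinations
from statistics import fmean

scores = range(1, 11)  # the 10 dogmatism scores

# For each sample size n, find the lowest and highest possible sample mean.
extremes = {}
for n in (1, 2, 3, 5):
    means = [fmean(sample) for sample in combinations(scores, n)]
    extremes[n] = (min(means), max(means))
    low, high = extremes[n]
    print(f"n = {n}: lowest M = {low:.1f}, highest M = {high:.1f}")
```

With n = 1 the means span the full 1 to 10; with n = 2 they span 1.5 to 9.5, matching Figure 4.2; and the span keeps narrowing as n grows.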

The Standard Error of the Mean

The sigma, $\sigma$, which indicates a population standard deviation, is specific to a population based on individual scores. The symbol for the standard deviation of the sample means is $\sigma_M$. The formal name for this value is standard error of the mean.

Note the difference between the language of statistics and everyday language. The error part in “standard error of the mean” has nothing to do with making a mistake. In statistics, there are actually several different kinds of “standard errors,” and they all have one thing in common: They are all measures of data variability. The size of the statistic indicates the amount of variability in whatever the particular standard error is gauging.

Earlier in this chapter, we noted that whether it is the distribution of individual scores or the distribution of sample means, the means of the two distributions will always be equal, $\mu = \mu_M$. Does the same hold true for the measures of variability? Will $\sigma = \sigma_M$? Actually, we answered that question when noting that there is always more variability in the distribution of individual scores than in the distribution of sample means. Put more succinctly, $\sigma > \sigma_M$. We can check this conclusion with our data.


The standard deviation of the 10 original scores (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) is

$\sigma = 2.872$

It is a lot of data to enter, but it is a good idea to check this. For this calculation and for the calculation of the standard error of the mean that follows, use the formula for the population rather than the sample standard deviation (N rather than n − 1 in the denominator). Elsewhere in this presentation, it will always be n − 1. Here we get into the degrees of freedom (df) as discussed in Chapter 1, which is a theoretical and mathematical adjustment for the use of samples. Keep in mind that if we want to look at a population parameter, we would use N or the population size, but because we are using samples, the adjustment would be the sample size minus 1, or (n − 1).
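The effect of the denominator can be checked quickly. In the sketch below we lean on Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n − 1; the variable names are ours.

```python
from statistics import pstdev, stdev

scores = list(range(1, 11))  # 1, 2, ..., 10

sigma = pstdev(scores)  # population formula: N in the denominator
s = stdev(scores)       # sample formula: n - 1 in the denominator

print(round(sigma, 3), round(s, 3))
```

The population formula gives 2.872, and the sample formula gives the slightly larger 3.028.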

The standard error of the mean can be calculated by taking the standard deviation of the mean scores of each of those 90 samples from which the distribution of sample means was constituted. It is a little laborious, and happily this is not a pattern that must be followed later, but the value is

$\sigma_M = 1.915$

This too is a parameter, so it involves the N rather than n − 1 in the denominator. As predicted, $\sigma$ has a larger value than $\sigma_M$. It reflects the moderating influence that the less extreme scores have on the more extreme scores when they occur in the same sample.
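The laborious part can be handed to a short script. The sketch below assumes the same 90 ordered pairs from Table 4.1 and applies the population formula (`pstdev`) to both the individual scores and the sample means.

```python
from itertools import permutations
from statistics import pstdev

scores = list(range(1, 11))

# Means of all 90 ordered pairs of two different scores.
sample_means = [(a + b) / 2 for a, b in permutations(scores, 2)]

sigma = pstdev(scores)          # standard deviation of individual scores
sigma_M = pstdev(sample_means)  # standard error of the mean

print(round(sigma, 3), round(sigma_M, 3))
```

The output reproduces the two values in the text, with the standard error of the mean the smaller of the two.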

Sampling Error

Although the standard error of the mean does not refer to a mistake per se, another kind of error, the sampling error, does refer to a mistake in sampling that causes an error. In inferential statistics, samples are important for what they reveal about populations. This is effective only when the sample accurately represents the population. The degree to which the sample does not represent the population is the degree of sampling error.

Samples tend to accurately reflect the population when two important prerequisites are satisfied:

• the sample must be relatively large, and

• the sample must be based on random selection.

The safety of large samples is explained by the law of large numbers. According to this mathematical principle, errors diminish as a proportion of the whole as the number of data points increases. The potential for serious sampling error diminishes as the size of the sample grows.

Random selection refers to a situation where every member of the population has an equal probability of being selected. A random sample of n = 5 could be created from the 10 people being treated for dogmatic behavior by

• assigning each person a number,

• placing the 10 numbers into a paper bag,

• shaking the bag well, and

• without looking, drawing out five numbers.

The result would be a randomly selected sample. When they are randomly selected, samples differ from populations only by chance. Such randomization of subjects can be achieved through a random number generator (e.g., in SPSS) used for experimental purposes.
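The paper bag has a direct software equivalent. The sketch below uses Python's standard `random` module rather than SPSS; the seed value is arbitrary and is fixed only so that the example gives the same draw each time it is run.

```python
import random

people = list(range(1, 11))  # each of the 10 people is assigned a number

# A seeded generator makes the draw repeatable; omit the seed for a fresh draw.
rng = random.Random(2013)

# Draw five numbers without replacement, each person equally likely.
sample = rng.sample(people, k=5)

print(sample)
```

Because `sample` draws without replacement, no person can be selected twice, just as a number pulled from the bag cannot be pulled again.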

If the sample should fail to capture some important characteristic of the population other than its size, there is a sampling-error problem. The important characteristic might be the mean, for example, and if $M \neq \mu$, this indicates a sampling error. Actually, there is always some sampling error because a sample can never exactly duplicate all the descriptive characteristics of the population, but the sampling error will usually be minor if samples are relatively large and randomly selected. Statistical analysis procedures tolerate minor, random sampling error, but systematic sampling error is another matter. Systematic sampling error occurs when the same mistake is made time after time.

In 1936, the publishers of The Literary Digest, a prominent publication of the time, decided to predict the outcome of that year’s presidential election in the United States. To ensure that the sample size would not be a problem, they sent out millions of postcards to registered voters. It would seem that they at least met the requirement for a relatively large sample, because the Harris and Gallup organizations typically get very accurate results with a few thousand, and sometimes just a few hundred, responses. Fatefully, they decided to use telephone books and automobile registrations to locate those who would be polled. Consider the historical setting. At the height of the Great Depression, voters were identified by two indicators of relative prosperity: a telephone in the home and a currently registered car. The results indicated that Alf Landon would win, but of course Franklin D. Roosevelt was elected in a landslide to a second term, carrying every state in the union except Maine and Vermont. The study was disastrous for the magazine’s reputation. Its prediction had been directly challenged by George Gallup (founder of the American Institute of Public Opinion), who, using quota sampling, predicted that FDR would win and that The Literary Digest poll would be wrong. Because Gallup was correct, the Gallup poll gained credibility and went on to become one of the most recognized names in public opinion polling.

The problem encountered by The Literary Digest poll was systematic sampling error. The voters were consistently and nonrandomly selected from groups not representative of the entire population. If they had been randomly selected, chances are that with the large sample size, the study would have predicted the election results very accurately, but the sample size alone was not enough to salvage the effort.

4.2 The z-Test

To summarize, the distribution of sample means is a distribution based not on individual scores but on the means of samples of the same size repeatedly drawn from a population. The central limit theorem indicates that when a population is based on samples rather than individual scores, the resulting population will be normal regardless of how the population of individual scores was distributed.
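A simulation cannot draw an infinite number of samples, but it can suggest how the theorem behaves. The sketch below is our own illustration, not the chapter's method: it assumes a strongly right-skewed population (an exponential distribution), a sample size of n = 30, and 2,000 samples, all chosen arbitrarily, and it uses skewness (0 for a symmetric distribution) as a rough gauge of departure from normality.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Return the skewness of a list of values (0 for a symmetric distribution)."""
    m = fmean(values)
    sd = pstdev(values)
    return fmean([((v - m) / sd) ** 3 for v in values])


rng = random.Random(4)  # arbitrary seed so the run is repeatable
n = 30                  # assumed sample size
num_samples = 2000      # a finite stand-in for "an infinite number"

# Draw the samples from a skewed (exponential) population.
samples = [[rng.expovariate(1.0) for _ in range(n)] for _ in range(num_samples)]

individual_scores = [x for sample in samples for x in sample]
sample_means = [fmean(sample) for sample in samples]

skew_individuals = skewness(individual_scores)
skew_means = skewness(sample_means)

print(f"skewness of individual scores: {skew_individuals:.2f}")
print(f"skewness of sample means:      {skew_means:.2f}")
```

The individual scores remain heavily skewed, while the sample means are much closer to symmetric, which is the pattern the theorem leads us to expect.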


Because the central limit theorem provides assurance of a normal distribution, if the z score formula from Chapter 3 is adjusted to accommodate groups rather than individual scores, Table A will answer all the same questions about groups that were initially asked about individuals in Chapter 3. Recall that the z score formula (3.1) had the following form:

$$z = \frac{x - M}{s}$$

If the following substitutions are made:

M for x so that the focus is on the mean of the group rather than the individual,

$\mu_M$ for M to shift from the sample mean to the mean of the distribution of sample means, and

$\sigma_M$ for s so that the measure of variability is for the distribution rather than the sample,

the result is the z-test:

$$z = \frac{M - \mu_M}{\sigma_M}$$

Formula 4.1

The z-test produces a z value for groups rather than individual scores; it indicates how distant a particular sample mean is from the mean of the distribution of sample means. The procedure is the same as it was for individual scores: Calculate a value of z and then use the table to interpret that value.

Note the similarities between Formula 3.1 and Formula 4.1:

• Both formulas produce values of z.

• Both numerators call for subtractions that result in difference scores.

• Both denominators measure data variability.

Calculating the z-Test

When calculating z scores, as you saw in Chapter 3, everything that is needed (x, M, and s) can be determined from the sample:

$$z = \frac{x - M}{s}$$

What is needed for the z-test, however, is often not as easy to determine. Because $\mu_M = \mu$, one of those two parameters must be provided. The standard error of the mean ($\sigma_M$) can also be a problem. No one wishing to complete a z-test is going to have the mean scores for that infinite number of samples that make up the distribution of sample means. So calculating the standard deviation of those means, which is what the standard error of the mean represents, is not an option. Nevertheless, there is a way to determine this value that lets us skip the tedium of calculating a standard deviation for who-knows-how-many scores. If $\sigma_M$ is not given (“And the standard error of the mean is . . .”), but the population standard deviation ($\sigma$) is provided, $\sigma_M$ can be found as follows:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

Formula 4.2

Where

$\sigma_M$ = the standard error of the mean

$\sigma$ = the population standard deviation

N = the number in the group

So for a group of 100 with $\sigma = 15$, $\sigma_M$ is

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{15}{\sqrt{100}} = \frac{15}{10} = 1.5$$
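Formula 4.2 translates into a one-line function. The sketch below is a minimal illustration; the function name `standard_error` is our own choice.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation over root n."""
    return sigma / sqrt(n)


# The example from the text: a group of 100 with a standard deviation of 15.
print(standard_error(15, 100))  # 15 / 10 = 1.5
```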

This is only a partial solution, however, because it still requires at least $\sigma$.

You will learn a way around the problem of $\mu_M$ in Chapter 5, but in the meantime, let us take the following example. A marriage and family counselor has access to some national data on the frequency of negative verbal comments exchanged between divorcing couples. The counselor finds that

• couples in troubled marriages tend to have 11 negative exchanges per week, with a standard deviation of 4.755, and

• a study of 45 couples who have filed for divorce in the counselor’s county reveals that the mean number of negative comments per week is 12.865.

Given the national data, the counselor wants to know the probability that a randomly selected group of couples from that population will have as many negative exchanges as the counselor’s clients, or more.

Although the question is about groups rather than individuals, the problem is much like a z score problem. Here is the information that is available:

$\mu$, and therefore $\mu_M$, = 11.0

$\sigma$ = 4.755

N = 45

M = 12.865

1. The standard error of the mean is

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{4.755}{\sqrt{45}} = 0.709$$

2. And z is

$$z = \frac{M - \mu_M}{\sigma_M} = \frac{12.865 - 11.0}{0.709} = 2.630$$
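The two steps can be combined into a single function. This is a sketch under our own naming (`z_test` and its parameters are not from the text); because it does not round the standard error along the way, the result can differ from a hand calculation in the third decimal place.

```python
from math import sqrt


def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return z for a sample mean, given the population mean and standard deviation."""
    standard_error = pop_sd / sqrt(n)  # Formula 4.2
    return (sample_mean - pop_mean) / standard_error  # Formula 4.1


# The counselor's data: M = 12.865, population mean 11.0, sd 4.755, 45 couples.
z = z_test(12.865, 11.0, 4.755, 45)
print(round(z, 2))
```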


Comparing M to $\mu_M$ indicates that the counselor’s group has a higher number of negative verbal exchanges per week than the number nationally among couples with troubled marriages: 12.865 is a higher value than 11.0. What else can be determined from the analysis?

Interpreting the Value from the z-Test

This is a value of z just like those that were calculated in Chapter 3, except that the value indicates how much a sample mean (M) differs from the mean of a population of samples ($\mu_M$) rather than how an individual (x) differs from either a sample mean (M) or a population mean ($\mu$). The Table A value indicates that 0.4957 out of 0.5 occurs between a value for z = 2.63 and the mean of the distribution.

• So among all samples of this size drawn from the population of couples with troubled marriages, 49.57% will have a mean number of negative verbal exchanges somewhere between the mean of the population (11.0 per week) and the level of this group (12.865 per week).

• But the question is the probability that a group of clients selected at random would have a mean of 12.865 negative comments per week, or more.

• Because half of all sample means (50%) lie above the population mean and 49.57% lie between 11.0 and 12.865, just 0.43% (50% − 49.57%) will have a mean of 12.865 negative comments per week or more. Stated as a probability, p = .0043 that a randomly selected group of couples in troubled marriages will have a mean of 12.865 negative exchanges per week or more.

This result is depicted in Figure 4.3.
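The Table A lookup can be reproduced with the standard normal distribution in Python's `statistics` module. The sketch below is an illustration only; it takes z = 2.63 as given and uses names of our own choosing.

```python
from statistics import NormalDist

z = 2.63
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion between the mean and z (the value Table A reports).
area_mean_to_z = standard_normal.cdf(z) - 0.5

# Proportion at z or beyond (the upper tail).
upper_tail = 1 - standard_normal.cdf(z)

print(round(area_mean_to_z, 4), round(upper_tail, 4))
```

The two proportions match the table-based values of 0.4957 and 0.0043.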

Figure 4.3: The probability of selecting a sample with M = 12.865 or higher from a population with $\mu_M = 11.0$

[Figure 4.3: a normal curve with z-value on the horizontal axis; p = 0.5 lies below z = 0, p = 0.4957 lies between z = 0 and z = 2.63, and p = 0.0043 lies above z = 2.63.]

The probability of selecting a sample with M = 12.865 or higher is indicated by determining the z equivalent of a sample with M = 12.865 and then determining the proportion of the distribution at that point and higher in the population.

There are some important differences between this z and those calculated in Chapter 3.


• Note that the difference between the mean of the population ($\mu_M = 11.0$) and the sample mean (M = 12.865) is really quite modest, but the z value (z = 2.630) is comparatively extreme. Recall that z = ±2 includes about 95% of the distribution, and at z = 2.630 we are substantially beyond that.

• The reason for the rather large value of z is the quite small standard error of the mean, 0.709.

• Because variability between group means tends to be small relative to the variability between individuals, it does not take much of a difference between the sample mean (M) and the mean of the distribution of sample means ($\mu_M$) to produce an extreme value of z.

Apply It!

Confidence in the Claim

A parent is looking at private high schools for his child. A particular high school claims that last year their students performed above average in math and verbal SAT scores. The parent, who knows statistics, decides to test this claim. The parent finds the nationwide results for last year’s SAT scores. The mean math SAT score was 520, with a standard deviation of 110. The mean verbal score was 508, with a standard deviation of 98. The parent asks to see the high school’s study.

The high school looked at SAT scores from a random sample of 40 students for that same year. The mean math score was 535 and the mean verbal score was 540. The parent would like to test if the high school scores come from a different population than the national scores. The z-test will give him a way to determine this. If the value of z could occur by chance with a probability p = .05 or less, the parent will view this as a nonrandom occurrence. First, he looked at the math scores.

Math Scores

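$\mu$, and therefore $\mu_M$, = 520

$\sigma$ = 110

N = 40

M = 535

Calculate the standard error of the mean:

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{110}{\sqrt{40}} = 17.39$$

Then determine z:

$$z = \frac{M - \mu_M}{\sigma_M} = \frac{535 - 520}{17.39} = 0.8625$$

The table value for z = 0.8625 (rounded to z = 0.86) is 0.3051.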


The proportion of sample means at 535 or more can be determined by 0.50 − 0.3051 = 0.1949. About 19.5% of the samples of student scores selected at random will have a mean score higher than 535. That is almost a 1 in 5 chance. This result is not statistically significant at the p = .05 level, so the hypothesis that these students are better is not supported by the results. In other words, a sample of student scores with M = 535 might well have been drawn from a population with mean scores of $\mu_M = 520$.

The parent then looked at verbal scores.

Verbal Scores

$\mu$, and therefore $\mu_M$, = 508

$\sigma$ = 98

N = 40

M = 540

Calculate the standard error of the mean:

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{98}{\sqrt{40}} = 15.5$$

Then determine z:

$$z = \frac{M - \mu_M}{\sigma_M} = \frac{540 - 508}{15.5} = 2.06$$

The table value for z = 2.06 is 0.4803.

The proportion of sample means at 540 or more can be determined by 0.50 − 0.4803 = 0.0197. About 2% of the samples of student scores selected at random will have a mean score higher than 540. This result is therefore statistically significant at the p = .05 level, so the null hypothesis can be rejected, and the alternative (research) hypothesis that these students have better verbal scores is supported. Because the probability is less than p = .05, a sample mean this high would occur by chance fewer than 5 times in 100 if the sample came from the national population.

Using his knowledge of statistics, the parent was able to test the high school’s claim of better SAT scores. The parent finds no support for the claim of better math scores and accepts the claim of better verbal scores.

Apply It! boxes written by Shawn Murphy
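The parent's two tests can be run side by side. The sketch below is our own illustration of the box's reasoning; the function name `upper_tail_z_test` and the constant `ALPHA` are assumptions, and because nothing is rounded before the final step, the z and p values differ slightly from the table-based figures.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, p), where p is the probability of a sample mean this high or higher."""
    standard_error = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    p = 1 - NormalDist().cdf(z)
    return z, p


ALPHA = 0.05  # the significance level the parent chose in advance

z_math, p_math = upper_tail_z_test(535, 520, 110, 40)
z_verbal, p_verbal = upper_tail_z_test(540, 508, 98, 40)

for label, z, p in (("math", z_math, p_math), ("verbal", z_verbal, p_verbal)):
    verdict = "significant" if p <= ALPHA else "not significant"
    print(f"{label}: z = {z:.2f}, p = {p:.4f}, {verdict}")
```

As in the box, the math result is not statistically significant at p = .05, and the verbal result is.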


Another z-Test

Interested in a possible connection between explicit reinforcement and performance in the workplace, researchers gather sales data for a group of 30 sales associates whose managers provide daily verbal reinforcement. The mean level of sales for this group in a particular month is $23,300. For people nationally in this type of retail sales, the mean is $22,538 with a standard deviation of $5,822. What percentage of all randomly selected groups will have mean sales of $23,300 or higher?

$\mu$, and therefore $\mu_M$, = 22,538

$\sigma$ = 5,822

N = 30

M = 23,300

First, calculate the standard error of the mean:

$$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{5{,}822}{\sqrt{30}} = 1{,}063$$

Then determine z:

$$z = \frac{M - \mu_M}{\sigma_M} = \frac{23{,}300 - 22{,}538}{1{,}063} = 0.72$$

The table value for z = 0.72 is 0.2642.

The question is, what percentage of the distribution of sample means will have mean sales as high as this group’s sales or higher? The proportion at $23,300 or lower would be 0.50 (for the lower half of the distribution) plus the 0.2642 of the distribution between the mean and the z value for 23,300.

0.50 + 0.2642 = 0.7642

The proportion above M = 23,300 is 1 − 0.7642 = 0.2358, and 0.2358 × 100 (to convert the proportion to a percentage) = 23.58%. About 24% of the samples of sales associates selected at random will have mean sales of $23,300 or higher.
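The same arithmetic can be followed step by step in code. The sketch below uses our own variable names and the standard normal distribution from Python's `statistics` module in place of Table A; because z is not rounded to 0.72 first, the final percentage differs from the text's 23.58% by a fraction of a point.

```python
from math import sqrt
from statistics import NormalDist

mu_M = 22538   # national mean sales
sigma = 5822   # national standard deviation
n = 30         # sales associates in the group
M = 23300      # the group's mean sales

sigma_M = sigma / sqrt(n)  # standard error of the mean
z = (M - mu_M) / sigma_M   # the z-test

# Proportion of sample means at or below M, then the complement above it.
proportion_at_or_below = NormalDist().cdf(z)
proportion_at_or_above = 1 - proportion_at_or_below

print(f"standard error = {sigma_M:.0f}, z = {z:.2f}")
print(f"percentage at or above M = {proportion_at_or_above * 100:.1f}%")
```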

4.3 The Concept of Statistical Significance

Like the z score problems in Chapter 3, the z-test is a ratio of the difference (the numerator) compared to data variability (the denominator). When the ratio is large, it indicates that the score (in the z score problem) or the sample mean (in the case of the z-test) is quite distant from the mean to which it is compared.

Is there a point at which the sample mean (M) becomes so different from the mean of the distribution of sample means (reflected in a large value of z) that it is more likely to be characteristic of some other distribution? In the first z-test problem, we proceeded as though the sample of those who had filed for divorce was a subgroup of all couples with troubled marriages. What if the sample is actually more characteristic of some other distribution, say, a population of couples for whom divorce is imminent? Can large values of z reflect the fact that the sample actually represents a population different from the one to which it was compared?

Consider another example before we answer these questions. Those in a college honors program are probably adults. If researchers are interested in studying intelligence, would it be reasonable to expect that the members of this group represent what is characteristic of all adults? From the standpoint of age (and the absence of child prodigies), those honors students are probably all adults, but in terms of intelligence, they probably are not typical. Perhaps they are more representative of the population of intellectually gifted adults than of adults in general.

The individuals in every sample belong to many different populations. The couples on the verge of divorce belong to

• the population of married people,

• the population of adults,

• the population of adults in the particular state,

• the population of adults in the particular county,

• the population of couples with troubled marriages, and so on.

One of the questions the z-test helps answer is whether a particular sample is most characteristic of the population to which it is compared, or whether the sample is more like some other population. The magnitude of the z value is the key to the answer.

Statistical Significance and Probability

In the case of the z-test, an outcome is statistically significant when

• it is so unlike the population to which it is compared that it can be presumed to reflect some other population;

• said another way, an outcome is statistically significant when the value of M is distant enough from $\mu_M$ that it probably was not randomly selected from that particular distribution of sample means.

So, at what point is an outcome nonrandom? Ronald A. Fisher (1932), who coined the term statistically significant, made the answer a matter of probability. If the probability that an outcome (in our case, the value of z) occurred by chance is p = .05 or less, the outcome is probably not random; it is a statistically significant occurrence.

Although p = .05 is probably the most common, other probability levels have also been used to indicate statistical significance. Reviewing journal articles will indicate statistical testing done at p = .01, p = .001, and occasionally, even p = .1. It is up to the person doing the analysis to state the level chosen to indicate statistical significance (before conducting the test, by the way). That probability value is also called the alpha ($\alpha$) level for reasons we will get to later.


Because we can use the z-test and the z score table to calculate the probability of an occurrence (in addition to the other things we can do to determine the percentage of the population above a point, below a point, and between points), we can determine statistical significance. In the first z-test we completed,

• we compared the mean number of negative verbal exchanges in a sample of couples on the verge of divorce to the mean level of negative exchanges among those identified as the population of couples with “troubled” marriages and found that z = 2.630.

• The table value indicates that the probability of randomly selecting a sample of couples that would have M = 12.865 or more negative verbal exchanges per week was p = .0043. At less than p = .05, that outcome is unlikely to have occurred by chance. It is statistically significant.

In the second z-test,

• the issue was whether explicit reinforcement affects sales performance.

• For that problem, z = 0.717, and the table value for z = 0.72 is 0.2642.

• That means the probability of earning $23,300 or more can be determined by taking the upper half of the distribution, which is 0.50 − 0.2642. The difference is 0.2358 (Figure 4.4).

• The probability that a group of sales associates selected at random would have mean sales of $23,300 or higher is p = .2358. That is a probability of occurrence of almost 1 chance in 4. It is too likely to have occurred by chance to be statistically significant.

• A sample of sales associates with M = $23,300 sales for the month might well have been drawn from a population with mean monthly sales of μ_M = $22,538.
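The two table lookups above can also be reproduced numerically. The following is a minimal sketch, assuming only Python's standard library; the function name `upper_tail_probability` is introduced here for illustration and is not part of the text's procedure. It computes the area of the standard normal curve at or above a given z, which is the quantity read from Table A as 0.50 minus the table value.

```python
import math

def upper_tail_probability(z: float) -> float:
    """Area under the standard normal curve at or above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# First z-test: negative verbal exchanges, z = 2.630
p_exchanges = upper_tail_probability(2.63)

# Second z-test: sales associates, z rounded to 0.72 as in the table lookup
p_sales = upper_tail_probability(0.72)

print(round(p_exchanges, 4))  # 0.0043
print(round(p_sales, 4))      # 0.2358
```

Both values agree with the table-based results: the first falls below .05 and the second does not.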

Figure 4.4: The probability of selecting a sample with sales of M = $23,300 or higher from a population with mean sales of μ_M = $22,538

[Figure 4.4 plot: normal curve on a z-value axis centered at 0, with labeled areas p = 0.5, p = 0.2642, and p = 0.2358]

Try It! (B) What does the term statistically significant mean?


Determining Significance Without the Table

Remember that ±z = 1.0 includes about 68% of the z distribution, so the probability of randomly selecting an outcome that occurs in the ±z = 1.0 area is p = .68. Nothing in that region is going to be statistically significant because those z values indicate results that are very characteristic of the distribution as a whole. It is the uncharacteristic events that are significant, and Fisher’s standard of p = .05 indicates that the key is a z value that occurs at a point where only the most extreme 5% of the distribution is excluded.

Recall that normal distributions are symmetrical. That 5% exclusion means that the most extreme 2.5% of outcomes in the lower tail and the most extreme 2.5% of outcomes in the upper tail are statistically significant. Because Table A provides proportions for only the upper half of the distribution, the z value, which includes all but the extreme 2.5% of outcomes, will be the point at which results become statistically significant. If 2.5% needs to be the percentage excluded, 47.5% is the percentage included. As a proportion, 47.5% is 0.475.

• From Table A, what z value includes 0.475 of the distribution back to the mean of the distribution?

• Because z = 1.96 includes 0.475 of the distribution, ± that value will include 0.95 of the distribution (2 × 0.475 = 0.95).

• Anytime a z-test produces a z of 1.96 or greater in absolute value, the result is statistically significant at p = .05.
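The reasoning in the bullets above (find the z value that leaves 2.5% in each tail) can be sketched in code as an inverse lookup. This is an illustrative sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the helper name `two_tailed_critical_z` is hypothetical.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha: float) -> float:
    """z value that leaves alpha/2 in each tail of the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# Critical values for several commonly reported significance levels
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(two_tailed_critical_z(alpha), 2))
```

For alpha = .05 the result rounds to 1.96, and for alpha = .01 it rounds to 2.58, matching the values used in this chapter.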

Another View of Significance

Whether p = .05, .01, .001, or some other amount, the particular standard for statistical significance is somewhat arbitrary. Fisher picked a point and said essentially, “Anything beyond this level of probability is not likely to have occurred by chance.” Yet another debatable issue in statistics is that not everyone agrees that there has to be such a standard. One approach is to calculate the probability that an event could occur by chance, and then let consumers make their own decision about whether it is significant. Another approach is to accompany the significance level with the effect size, that is, the magnitude of the relationship or effect. This adds a reporting value so that conclusions do not rely solely on significance values. Effect sizes are discussed using Cohen’s (1988, pp. 145–153) effect size values starting in Chapter 5.

Another traditional approach to hypothesis testing is calculating statistical values (e.g., z values) and then comparing them to the appropriate critical value found in their respective tables (usually found in appendices of statistical references). Seldom do researchers deal with critical values of a test statistic; with modern computing power, it is easy to get the actual probability value for the test statistic from the data and then to compare this probability to the desired critical alpha level (e.g., α = .05). The latter approach is the one used in ongoing analyses for the remainder of this text.


Sampling Error as an Explanation of Difference

In virtually every z-test, there will be some difference between M and μ_M, which means that z will have some value other than 0. When the differences fall short of statistical significance (z < 1.96), how are they explained? The answer is sampling error. Because no sample can exactly emulate the population, most samples in the distribution of sample means will have a sample mean different from the population mean. In the second example dealing with the explicit reinforcement of sales associates, those who received explicit reinforcement actually did better than the population of all sales associates, but at z = 0.717 the difference is not large enough to be statistically significant. Such a difference might reflect the fact that those selected for the sample group just happened to be generally above the mean of the distribution. In the first example on the number of negative exchanges, some of the difference reflects sampling error, but that factor alone is not an adequate explanation of the difference between M and μ_M.

More Confidence in the Sample

The foregoing underscores the importance of having confidence in the sample to begin with. Even though samples can never mirror populations exactly, we noted that large, randomly selected samples minimize sampling error. It can be difficult to define “large,” however. One approach to determining the optimal sample size is based on the answers to two questions:

1. How much certainty must there be that the sample is like the population?

2. How much error can be tolerated?

The formula is as follows:

$$ n = \left( \frac{(z)(\sigma)}{\text{variation from } \sigma} \right)^{2} $$

Formula 4.3

Where

n = the required sample size

z = the value of z that corresponds to how certain we wish to be of the result. Because ±z = 1.96 includes the middle 95% of the distribution, using that value in the formula provides p = .95 that the sample emulates the population. If .99 certainty is required, z = 2.58.

σ = the standard deviation of the population. If the population standard deviation is not available, a sample standard deviation (s) can be substituted, although the estimate will lose some precision.

variation from σ = the amount we are willing to allow s to vary from σ

An instructor wants to gauge the impact that a service learning course has on students’ attitudes toward community service. The university research office has surveyed students’ interest in service learning and from the scores on the instrument has determined a standard deviation of 8.294. The instructor is willing for the sample data to digress from university-wide data by 2 points and wishes to be .95 confident of the result.

$$ n = \left( \frac{(z)(\sigma)}{\text{variation from } \sigma} \right)^{2} $$

With

σ = 8.294

z = 1.96

$$ n = \left( \frac{(1.96)(8.294)}{2.0} \right)^{2} = \text{approximately } 66 $$

With p = .95, a random sample of about 66 people will provide a sample within 2 points of the population standard deviation.

Changing the conditions can dramatically affect the required sample size. If the instructor needs to be within 1 point of the population standard deviation and wishes for p = .99, note the impact on the result:

$$ n = \left( \frac{(2.58)(8.294)}{1.0} \right)^{2} = \text{approximately } 458 \text{ people} $$

There is constant tension between how certain and how precise we need to be on the one hand and the size of the needed sample on the other. The results illustrate that either increasing the level of certainty or requiring less error necessitates a larger sample, but at least Formula 4.3 can help us strike a balance. Samples that are very large can be time-consuming and expensive to work with. Samples that are very small may not reflect the essential characteristics of the population, making generalizing the results a problem.
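To make the trade-off concrete, Formula 4.3 can be written as a small function. This is a sketch only; the function and parameter names (`required_sample_size`, `tolerance`) are introduced here for illustration, with `tolerance` standing for the "variation from σ" term.

```python
def required_sample_size(z: float, sigma: float, tolerance: float) -> float:
    """Formula 4.3: n = ((z * sigma) / tolerance) ** 2."""
    return ((z * sigma) / tolerance) ** 2

# Service learning example: sigma = 8.294
n_95 = required_sample_size(z=1.96, sigma=8.294, tolerance=2.0)
n_99 = required_sample_size(z=2.58, sigma=8.294, tolerance=1.0)

print(round(n_95))  # 66
print(round(n_99))  # 458
```

Halving the tolerated error alone would quadruple the required sample, because the tolerance sits in the denominator of a squared term. In practice, a fractional result is often rounded up rather than to the nearest whole number so that the sample is not smaller than required.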

Decision Errors

Statistical significance is based on the probability that an event could occur by chance, and interpreting outcomes based on probabilities carries a risk.

• Is it not possible, however unlikely, that a researcher could accidentally sample the couples in the distribution who have the most negative exchanges? Maybe they do not belong to a distinct population at all. Maybe they are just from the most extreme portion of the population of all married couples.

• On the other hand, is it not also possible that those sales associates who were explicitly reinforced actually did belong to a population of higher-performing salespeople, but because they were accidentally sampled from the lowest region in the population of all sales associates, their differences appeared to be not significant?

Because statistical decisions are based on probabilities rather than certainties, any statistical decision can have one of two outcomes: a correct decision or a decision error. The two examples above represent the two types of decision errors, and they are mutually exclusive. Any statistical decision involves the risk of one or the other, but never both in the same analysis. As these decisions are discussed in the next section, refer to Figure 4.5 to aid in your understanding of correct decisions and type I and type II errors.

Try It! (C) If a result is not statistically significant, how is the difference between M and μ_M explained?

Type I Errors

Type I errors in statistical testing occur when an outcome is determined to be statistically significant, but further research and testing would indicate that it is not. In other words, the first, errant conclusion is an anomaly that fails to hold up under further scrutiny. The probability of this error is defined by the level at which the testing occurs. If the criterion for statistical significance is p = .05 or α = .05, and the result is deemed statistically significant, the probability of a type I error is .05. Because type I error is also called alpha (α) error, the significance level of a test is sometimes noted in terms of the risk of alpha error, α = .05, rather than p = .05. It means the same thing, except that the author has chosen to indicate the probability of type I error rather than referring directly to the criterion for statistical significance. At α = .05, for every 100 times someone concludes that a result is statistically significant, there will be a type I error an average of five times.

Note that what Fisher did was arbitrarily exclude the most extreme 5% of the distribution as atypical of the most likely outcomes. Although we agree that those most extreme outcomes are the least likely to occur, that most extreme 5% of the distribution is still part of the distribution in question. Outcomes in that area of the distribution hold the greatest potential for a type I error.

• The only time a type I error is possible is when a result is deemed statistically significant. If there is no statistically significant outcome, there is no potential for a type I (α) error.

• In a particular significant finding, there is not any way to know whether a type I error has occurred. Gathering new data and repeating the analysis is the only way to check, which is why replication studies are so important.
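The point that the most extreme 5% of outcomes still belongs to the original distribution can be illustrated with a small simulation. The sketch below is not from the text: the population values (mean 100, standard deviation 15), the sample size, and the seed are all hypothetical choices made for illustration. Every sample is drawn from the very population it is tested against, so any "significant" result is a type I error.

```python
import math
import random

random.seed(4)  # arbitrary seed so the run is repeatable

MU, SIGMA, N = 100.0, 15.0, 25   # hypothetical population and sample size
SEM = SIGMA / math.sqrt(N)       # standard error of the mean
TRIALS = 20_000

false_alarms = 0
for _ in range(TRIALS):
    # Draw a sample from the same population used for the comparison
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    m = sum(sample) / N
    z = (m - MU) / SEM
    if abs(z) >= 1.96:           # deemed "significant" at p = .05
        false_alarms += 1

type_i_rate = false_alarms / TRIALS
print(type_i_rate)
```

Under these assumptions the printed proportion should land close to .05, which is the sense in which the significance level defines the risk of a type I error when the sample truly comes from the comparison population.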

Type II Errors

In a z-test, type II errors occur when the sample is actually characteristic of some population other than the distribution of sample means to which it was compared, but the statistical testing (z < 1.96) suggests no significant difference. This type of decision error is also called a beta (β) error.

There would be little problem with type II errors if the populations involved were completely separate, but often there are important similarities. The population of all sales associates probably bears a number of similarities to the population of sales associates who receive explicit reinforcement. The more the populations involving sales associates overlap, the more likely decision errors become.

Although the level at which the statistical test is conducted (often α = .05) defines the likelihood of a type I error, the probability of a type II error is more elusive, and in fact we never know the exact probability of committing this error, although some statistical tests are more prone to it than others.

Try It! (D) An analysis results in a finding that is statistically significant at p = .05. What is the probability of a type II error?


• The only time a type II error can occur is when a result is determined not to be statistically significant.

• In a particular analysis where the result appears to be not significant, there is not any way to know whether a decision has resulted in a type II error.

• See Tables 4.2 and 4.3 for a look at how these are interrelated.

Table 4.2: Correct decisions, type I, and type II errors

| Research conclusion | Reality: the null hypothesis (H0) is true | Reality: the alternative hypothesis (Ha) is true |
| --- | --- | --- |
| The null hypothesis (H0) is true | Accurate, p = 1 − α | Type II error, p = β |
| The alternative hypothesis (Ha) is true | Type I error, p = α | Accurate, p = 1 − β |

Table 4.3: Pregnancy test results

| Result | When H0 (No Baby) Is True | When Ha (Baby) Is True |
| --- | --- | --- |
| Not Pregnant | Correct pregnancy test: “Whew! Parental planning effective!” | Type II error, false negative pregnancy test: “Oops! Baby on the way!” |
| Pregnant | Type I error, false positive pregnancy test: “Where’s the baby?” | Correct pregnancy test: “As planned, baby on the way!” |

Decision Errors in Summary

Is one error more damaging than the other? Do analysts have a preference for one type of error? The answer, of course, depends upon circumstances and especially on the impact that a decision error has on the people involved.

Perhaps a committee is evaluating certification programs for mental health professionals, and it deems the program at University A to be significantly better than the competing programs. If the result is that the graduates from University A receive preferential hiring, but the difference among programs really is not statistically significant after all, then there has been a type I error.

On the other hand, perhaps a client has a serious illness and comes to a health professional for a diagnosis. If the health professional fails to recognize that the client is not healthy and misses the condition that is affecting the client’s well-being, there has been a type II error.

Try It! G*Power 3 is a free, online power analysis tool available via the Institute for Experimental Psychology at the Heinrich Heine University Düsseldorf. Follow the link below to learn more about the package, with accompanying manuals and articles, by developers Buchner, Erdfelder, and Faul (1996).

http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3

Which type of error is the more serious depends upon circumstances, but statisticians may have their own bias. Power in statistical testing is described in terms of the likelihood of a type II error. The most powerful tests are associated with the fewest beta errors. The power of a statistical test is symbolically indicated this way: 1 − β (refer to the lower right quadrant “correct decision” in Figure 4.5). Power is indicated as a probability of rejecting the null hypothesis, and therefore the higher the power (1 − β), the greater the likelihood of supporting the alternative hypothesis. For instance, with β = 0.2, a power of 1 − 0.2 = 0.8 indicates an 80% chance of rejecting the null hypothesis or, simply stated, 80% power. For a researcher, increasing this probability is imperative to finding a statistically significant difference or relationship in hypothesis testing, and it is most commonly affected by sample size. As a result, researchers will conduct a power analysis to calculate a minimum sample size based on effect sizes and statistical significance criteria using appropriate software such as G*Power 3 or SPSS Sample Power 3.
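The link between power and sample size can be sketched for the z-test itself. The code below is an illustration under stated assumptions, not the procedure used by G*Power or SPSS: it applies the standard normal-theory expression for the power of a two-tailed z-test, and the effect of half a standard deviation (`delta=0.5`, `sigma=1.0`) is a hypothetical value chosen only to show the pattern.

```python
import math
from statistics import NormalDist

def z_test_power(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-tailed z-test when the true mean
    differs from the comparison mean by delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    shift = delta * math.sqrt(n) / sigma         # true difference in SEM units
    # Probability that z lands beyond either critical value
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# Hypothetical effect of half a standard deviation at several sample sizes
for n in (10, 20, 32, 50):
    print(n, round(z_test_power(delta=0.5, sigma=1.0, n=n), 2))
```

Under these assumptions power rises steadily with n and passes roughly .80 at about 32 cases. With no true difference (`delta=0`), the function returns alpha, the type I error rate.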

On “Picking Your Poison”

Although type I and type II errors cannot both occur in the same analysis, the probability of one affects the likelihood of the other. Mental health professionals ordinarily must pass some sort of licensing requirement, perhaps in the form of a test. Like most professional licensing tests, it probably is a reasonably good, but certainly not perfect, indicator of who is competent.

• If test results indicate that an individual is competent, but the individual actually lacks the skills and knowledge required, there has been a type I error.

• If test results indicate that an individual is not sufficiently competent to be licensed, but the person actually is, there has been a type II error.

If a type I error is thought to be the greater problem, the licensing body might simply raise the required test score. This would probably reduce the number of incompetent people who are licensed, but the companion problem is that it would also exclude more of those who actually are competent but, because they do not test well, fail to demonstrate their competence on the required measure. This inherent connection between the two kinds of decision errors is why someone has to decide which is the more damaging of the two.

With airline pilots and surgeons, it is straightforward. Usually the decision is in favor of excluding some who are competent (and therefore committing a type II error) rather than risk licensing some who are not competent (committing a type I error). The potential cost to the well-being of others is too great to do otherwise. In other circumstances, the greatest good is less clear.


4.4 A Confidence Interval for the Mean of the Population

When the results from a z-test are statistically significant, the sample best represents a population other than the one to which it was compared. In these instances, the mean of the sample (M) is in fact an estimate of the value of that other population mean. Because it is a discrete value, M is called a “point estimate” of the population mean μ_M. A confidence interval, by contrast, provides a range of values; a 95% confidence interval, for instance, would provide a range within which this population mean μ_M would fall. As depicted in Figure 4.5, the confidence interval provides a way to determine how precisely M estimates or predicts the population mean, μ_M.

Figure 4.5: The confidence interval based on a normal distribution

With regard to hypothesis testing, when a z value is significant, the confidence interval for the mean can produce a range of values within which μ_M is likely to occur. If a value is not within this confidence interval range, then it is not a probable estimate of μ_M. Conversely, if the z-test is not significant, there is no need for the confidence interval because our analysis indicates that the sample M belongs to the population described, and there is no need to estimate the value of μ_M. Calculating the confidence interval involves values from the z-test. The formula is

$$ CI = \pm z(\sigma_M) + M $$

Formula 4.4

Where

CI = the interval within which the population mean is expected to occur

z = the table value that reflects the level at which the z-testing was conducted (for p = .05, z = 1.96)

σ_M = the value of the standard error of the mean from the z-test

M = the value of the sample mean

[Figure 4.5 plot: normal curve on a z-value axis running from −4 to 4, labeled “Prediction of the population mean”]


For the negative-verbal-exchanges problem, the result was significant, indicating that the sample probably belongs to some distribution of sample means other than the one to which it was compared. The confidence interval will establish a range of values within which the mean for that other population probably occurs.

$$ CI = \pm z(\sigma_M) + M $$

$$ CI = \pm 1.96(0.709) + 12.865 $$

$$ CI = \pm 1.390 + 12.865 = 11.475,\ 14.255 $$

With .95 probability, the population which the sample represents has a mean (μ_M) value somewhere between 11.475 and 14.255.

Note that the level of probability is one of the factors affecting the size of the confidence interval. If we wish to be more certain of capturing the population mean, a .99 confidence interval can be used instead of .95, and z = 2.58 substituted for z = 1.96 in the formula. Recalculating the confidence interval for p = .99,

$$ CI = \pm z(\sigma_M) + M $$

$$ CI = \pm 2.58(0.709) + 12.865 $$

$$ CI = \pm 1.829 + 12.865 = 11.036,\ 14.694 $$

A greater level of certainty of capturing the true mean of the distribution represented by the sample mean requires a wider confidence interval.
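The two intervals above can be reproduced with a short function. This is a minimal sketch of Formula 4.4; the function name `confidence_interval` and its parameter names are introduced here for illustration.

```python
def confidence_interval(sample_mean: float, sem: float, z: float) -> tuple[float, float]:
    """Formula 4.4: the interval M - z * sigma_M to M + z * sigma_M."""
    margin = z * sem
    return (sample_mean - margin, sample_mean + margin)

# Negative-verbal-exchanges problem: M = 12.865, sigma_M = 0.709
ci_95 = confidence_interval(12.865, 0.709, z=1.96)
ci_99 = confidence_interval(12.865, 0.709, z=2.58)

print([round(x, 3) for x in ci_95])  # [11.475, 14.255]
print([round(x, 3) for x in ci_99])  # [11.036, 14.694]
```

The only change between the two calls is the z value, which is why the .99 interval is wider than the .95 interval.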

The other factor that affects the width of the confidence interval is the standard error of the mean, σ_M, which measures the amount of variability in the distribution of sample means. More data variability translates into a larger standard error of the mean, which makes the confidence interval larger.

There is no need for a confidence interval unless the z-test results are statistically significant. The reason can be illustrated by completing a confidence interval for the nonsignificant salesperson problem. Recall that σ_M = 1,063 and M = 23,300 for that example.

$$ CI = \pm z(\sigma_M) + M $$

$$ CI = \pm 1.96(1{,}063) + 23{,}300 $$

$$ CI = \pm 2{,}083 + 23{,}300 = 21{,}217,\ 25{,}383 $$

Note that this confidence interval includes within its range the value of the original population, 22,538. That is because with a nonsignificant z value, the conclusion is that the population that the sample represented is likely the same population to which it was compared.

A nonsignificant z-test value will always produce a confidence interval that includes the original population mean. Note that neither the .95 nor the .99 confidence intervals included the original population mean from the negative-verbal-exchanges problem.
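The connection between significance and the interval follows from the arithmetic: z falls short of the critical value exactly when the distance between M and μ_M is smaller than the margin z(σ_M). The sketch below checks this for both chapter examples. The helper name `ci_contains` is hypothetical, and because the original population mean for the negative-verbal-exchanges problem is not restated in this section, the code backs it out from the reported z = 2.630, which is an assumption of this illustration.

```python
def ci_contains(mu: float, sample_mean: float, sem: float, z_crit: float = 1.96) -> bool:
    """True when mu lies inside the interval M +/- z_crit * sigma_M."""
    margin = z_crit * sem
    return sample_mean - margin <= mu <= sample_mean + margin

# Salesperson problem (z = 0.717, not significant)
sales_inside = ci_contains(mu=22_538, sample_mean=23_300, sem=1_063)

# Negative-verbal-exchanges problem (z = 2.630, significant).
# Assumption: mu_M is recovered from z = (M - mu_M) / sigma_M.
mu_exchanges = 12.865 - 2.630 * 0.709
exchanges_inside_95 = ci_contains(mu_exchanges, 12.865, 0.709, z_crit=1.96)
exchanges_inside_99 = ci_contains(mu_exchanges, 12.865, 0.709, z_crit=2.58)

print(sales_inside, exchanges_inside_95, exchanges_inside_99)  # True False False
```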

Try It! (E) What does the confidence interval for z determine?


Apply It!

Quality Control Revisited

Let us return to the example of the bottling company that uses an automated machine to fill 4-liter plastic containers with orange juice. The recalibrated machine fills the containers to a mean of 4.05 liters, with a standard deviation of 0.09 liters.

The equipment engineer now wants to use the same machine to fill the 4-liter containers with apple juice. He would like to know if changing to apple juice will affect the machine’s performance. Will the mean fill amount still be 4.05 liters with a standard deviation of 0.09 liters?

To find an answer, the engineer measures 20 of the apple juice containers. The sample mean fill is 3.99 liters. What is the probability that a randomly selected group of 20 apple juice containers would have a mean fill of 3.99 liters or less? Has changing from orange juice to apple juice affected the machine? The engineer decides to use a value of p = .01 to indicate statistical significance.

μ, and therefore μ_M, = 4.05 liters

σ = 0.09 liters

N = 20

M = 3.99 liters

First, calculate the standard error of the mean:

$$ \sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{0.09}{\sqrt{20}} = 0.02 \text{ liters} $$

Then determine z:

$$ z = \frac{M - \mu_M}{\sigma_M} = \frac{3.99 - 4.05}{0.02} = -3.0 $$

Note that even though the difference between the mean of the population (μ_M = 4.05 liters) and the sample mean (M = 3.99 liters) is small, the z value is very large because of the small standard error of the mean (σ_M = 0.02 liters).

The table value for z = −3.0 is 0.4987.

The proportion of samples with a mean fill of 3.99 liters or less can be determined by 0.50 − 0.4987 = .0013. About 0.13% of the samples selected at random will have a mean fill level lower than 3.99 liters. This result is therefore statistically significant at the p = .01 level. By switching from orange juice to apple juice, the mean fill level has changed. The machine controls will have to be adjusted to account for these differences if a mean fill amount of 4.05 liters is to be achieved when filling apple juice containers.
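The same calculation can be sketched in code, which also shows the effect of rounding. The variable names below are illustrative. Because the code does not round the standard error to 0.02 before dividing, it yields a z of roughly −2.98 rather than −3.0 and a lower-tail probability of about .0014 rather than .0013; the conclusion at p = .01 is unchanged.

```python
import math
from statistics import NormalDist

mu_m = 4.05         # population mean fill (liters)
sigma = 0.09        # population standard deviation (liters)
n = 20              # containers sampled
sample_mean = 3.99  # observed mean fill (liters)

sem = sigma / math.sqrt(n)       # standard error of the mean
z = (sample_mean - mu_m) / sem   # z-test statistic
p_lower = NormalDist().cdf(z)    # probability of a sample mean this low or lower

print(round(sem, 4), round(z, 2), round(p_lower, 4))
print(p_lower < 0.01)  # significant at the engineer's chosen level
```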

(continued)


4.5 The z-Test Using Excel

A social worker’s caseload includes 8 people with annual incomes as follows (in thousands of dollars): 13.5, 18, 22.375, 25.240, 26, 29.331, 30, 30

Is the mean income of the social worker’s clients significantly different from the mean income of all social workers’ clients for whom the average annual income is 19.500 thousand dollars with a standard deviation of 4.525 thousand dollars? We will proceed in Excel as follows:

1. Enter the income data into a spreadsheet in cells A1–A8.

2. Have Excel calculate the mean by entering the formula =average(A1:A8) in cell A9.

3. In cell A11, determine the standard error of the mean by dividing the population standard deviation by the square root of the number of scores. The command in Excel is =4.525/sqrt(8).

4. Determine the z value in cell A13 by entering the command =(A9-19.5)/A11. The part in parentheses is the numerator in the z ratio: M (cell A9) − μ_M. Figure 4.6 is a screenshot of what your display will look like just before you press Enter.

The result is z = 3.004. Testing at p = .05 (for which z = 1.96), these eight people have significantly different incomes than the population of all social workers’ clients.
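For readers who want to check the spreadsheet result outside Excel, the same steps can be mirrored in a few lines of Python. This is an illustrative sketch rather than part of the Excel procedure; the variable names are chosen here, and each line corresponds to one of the cells described above.

```python
import math
from statistics import mean

# Cells A1-A8: annual incomes in thousands of dollars
incomes = [13.5, 18, 22.375, 25.240, 26, 29.331, 30, 30]

mu_m = 19.5    # population mean (thousands of dollars)
sigma = 4.525  # population standard deviation (thousands of dollars)

sample_mean = mean(incomes)               # cell A9:  =average(A1:A8)
sem = sigma / math.sqrt(len(incomes))     # cell A11: =4.525/sqrt(8)
z = (sample_mean - mu_m) / sem            # cell A13: =(A9-19.5)/A11

print(round(sample_mean, 2), round(sem, 2), round(z, 3))  # 24.31 1.6 3.004
```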

Apply It! (continued)

Because the results from the z-test are statistically significant, the sample mean of 3.99 best represents the value of the population mean. The engineer next computes the confidence interval for p = .01 to determine the range of values within which μ_M is likely to occur.

$$ CI = \pm z(\sigma_M) + M $$

Where

z = the table value that reflects the level at which the z-testing was conducted. For p = .01, z = 2.58

$$ CI = \pm 2.58(0.02) + 3.99 $$

$$ CI = \pm 0.05 + 3.99 = 3.94,\ 4.04 $$

Therefore, there is a 99% probability that the mean fill value is between 3.94 and 4.04 liters when using the machine with apple juice.

Apply It! boxes written by Shawn Murphy


Figure 4.6: Calculating a z-test in Excel

4.6 Presenting Results

Using the data from Figure 4.6, the mean income of the social worker’s caseload is 24.31 (in thousands). The population average is 19.50 (SD = 4.53). During the z-test, the population average is standardized at 0, and the sample mean is calculated as a z score to determine its difference or distance from the population. In this case, the sample mean results in a z score of 3.00. The sample mean is 3 standard errors of the mean above the population mean. We only need the z score to be as high as 1.96 in order for the difference to be statistically significant at p = .05. In this case, we met this criterion and can conclude that the social worker’s caseload has a significantly higher income than the population average.


It is important to note in your interpretations the population mean, sample mean, z score, and significance level. Be sure to discuss whether the difference is statistically significant or not and whether or not the difference means the sample mean is higher or lower than the population mean.

4.7 Interpreting Results

Though you should refer to the most recent edition of the APA manual for specific detail on formatting statistics, Table 4.4 may be used as a quick guide in presenting the statistics covered in this chapter.

Table 4.4: Guide to APA formatting of z test scores

| Abbreviation or Term | Description |
| --- | --- |
| CI | Confidence interval; presented as CI [lowest, highest] |
| p | Probability or significance level. If statistically significant, report p < .05 or .01. If not statistically significant, report p = “calculated p level” |
| SEM | Standard error of the mean; standard error of measurement |
| z | z-test statistic or score |

Source: Publication Manual of the American Psychological Association, 6th edition. ©2009 American Psychological Association, pp. 119–122.

Note that p, SEM, and z are italicized, whereas CI is not. The following are some examples of how to present results using these abbreviations, though you may use different combinations of results. These examples utilize the data presented in Section 4.5.

• The average annual income for the social worker’s caseload was significantly higher (M = 24.31) than the population average income (M = 19.5; z = 3.00, p < .05).

• The annual income for the social worker’s caseload is statistically different from the population average income, z = 3.00, SEM = 1.60, p < .05.

Using the data from Apply It! Quality Control Revisited, we could present the results in the following way:

• The probability of an apple juice container being filled with 3.99 liters or less is statistically significant at 0.0013, z = −3.00, SEM = 0.02, p < .01, 99% CI [3.94, 4.04].

• The difference between the population mean (μ = 4.05) and sample mean (M = 3.99) is statistically significant at p < .01, 99% CI [3.94, 4.04] (z = −3.00, SEM = 0.02).
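When many results must be reported, the pattern in these examples can be generated consistently with a small helper. The sketch below is an assumption-laden illustration, not an official APA tool: the function names are hypothetical, the p value is computed as a two-tailed probability, and plain text cannot show the italics that APA style requires for z, SEM, and p.

```python
from statistics import NormalDist

def apa_decimal(value: float, decimals: int = 2) -> str:
    """Format a proportion without its leading zero, e.g. 0.05 -> '.05'."""
    text = f"{value:.{decimals}f}"
    return text.replace("0.", ".", 1) if abs(value) < 1 else text

def apa_z_report(z: float, sem: float, alpha: float = 0.05) -> str:
    """Build a 'z = ..., SEM = ..., p ...' string following Table 4.4."""
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed probability of z
    if p < alpha:
        p_part = f"p < {apa_decimal(alpha)}"
    else:
        p_part = f"p = {apa_decimal(p, 3)}"
    return f"z = {z:.2f}, SEM = {sem:.2f}, {p_part}"

# Social worker example from Section 4.5
print(apa_z_report(z=3.004, sem=1.5998))  # z = 3.00, SEM = 1.60, p < .05
```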


Summary

The z-test provides a good introduction to formal statistical testing. It is an uncomplicated test that involves many of the same issues that come up in the more advanced tests. In general, behavioral researchers are much more interested in analyzing the performance of groups than of single individuals. We have many reasons to wonder whether this or that group truly represents the population to which they are compared. The z-test provides a mechanism for comparing one group for whom we have data to an identified population.

The z-test is based on the distribution of sample means (Objective 1), a population of the means of samples rather than of individuals’ scores. The central limit theorem indicates that such a distribution will be normal even if the distribution of individual scores is not (Objective 2). The normality allows the use of the z table to analyze how groups compare to populations (Objectives 4 and 6). Because the sample data we analyze sometimes does not fit well with the population presumed to be the source, the z-test provides a way to determine whether the sample belongs to some other population, an outcome related to the concept of statistical significance (Objective 5).

When the sample is determined to represent some other population, the sample mean is a point estimate of the value of that other μ_M, but it is only an estimate. The confidence interval provides a range of values within which the mean of that other population will occur with a specified probability (Objective 6). In doing so, the confidence interval gives an indication of the precision with which M estimates μ_M.

Inferential statistical analysis involves the risk of making an incorrect decision. Occasionally, results that appear significant in one test will not hold up when the study is repeated with new data. On the other hand, sometimes a nonsignificant finding will be overturned on further analysis. These type I and type II errors, respectively, are a reminder that statistical decisions are based on probabilities rather than certainties (Objective 7). In addition, how to present results (Objective 9) and interpret them in APA format (Objective 10) as they relate to describing z-test results are important pieces of utilizing and writing about statistical data.
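The point that decisions rest on probabilities can be made concrete with a simulation of type I errors. In the Python sketch below, every sample truly comes from the comparison population, so any significant z-test result is a false alarm. The population values (mean 100, standard deviation 15), the sample size of 25, and the number of trials are assumptions for illustration only.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100.0, 15.0  # assumed population values for illustration
N = 25                          # sample size
SEM = POP_SD / N ** 0.5         # standard error of the mean = 3.0
Z_CRIT = 1.96                   # two-tailed critical value for p = .05
TRIALS = 10_000

false_alarms = 0
for _ in range(TRIALS):
    # Every sample really does come from the comparison population,
    # so any "significant" result here is a type I error.
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    z = (mean(sample) - POP_MEAN) / SEM
    if abs(z) >= Z_CRIT:
        false_alarms += 1

type_i_rate = false_alarms / TRIALS
print(f"type I error rate: {type_i_rate:.3f}")
```

The proportion of false alarms should land near .05, the testing level itself: when the sample really belongs to the population, testing at p = .05 produces a type I error about 5 times in 100.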

Small samples, no matter how carefully selected, cannot mirror all the relevant characteristics of complex populations, and populations involving people are invariably complex. For this reason, a procedure for determining the size of the sample needed to emulate the important characteristics of the population has some utility (Objective 3). Formula 4.3 meets that need. Statistical significance is a very important concept in educational analysis. When new programs or strategies are instituted, we often look for ways to determine whether the program makes a difference. The z-test helps answer some of these questions. As important as the z-test is as an introduction, it has limitations in that it requires access to both μ_M and σ_M. Although population means can usually be figured out, the standard error of the mean sometimes just is not accessible. The t-tests in Chapter 5 will provide a way around this difficulty.

This summary is probably a good barometer of your grasp of Chapters 1–4. Although some of the material has probably been familiar, many of the ideas are likely new. If the review here makes sense, that is excellent. If there are some holes, it is a good idea to take some time to go back to the relevant sections and review. Statistical analysis is incremental, as we have stressed before, so it is important to understand what has been presented before continuing. Working the examples below, sometimes repeatedly, will help.


Key Terms

central limit theorem Proposition that holds that if a population is sampled an infinite number of times using sample size n and the mean of each sample is determined, the multiple means (Ms) will take on the characteristics of a normal distribution whether or not the original population of individuals was normal.

confidence interval (CI = ±z(σ_M) + M) Provides a way to determine how precisely M estimates μ_M.

decision errors The two types of decision errors are type I errors and type II errors.

distribution of sample means A distribution made of the means of samples rather than individual scores.

law of large numbers The mathematical principle that errors diminish as the number of data points increases.

power In statistical testing, the likelihood of avoiding a type II error, that is, of detecting a significant difference when one exists. The power of a statistical test is indicated as 1 − β.

random selection The selection of a sample from a population where every member has an equal probability of being selected.

sampling error Error that is reflected in the degree to which the characteristics of the sample, such as the mean and standard deviation, vary from those of the population.

standard error of the mean The standard deviation of the sample means (σ_M).

statistically significant An outcome so unlike the population to which it is compared that it can be presumed to reflect some other population. Said another way, the value of M is distant enough from μ_M that it probably was not randomly selected from the distribution of sample means. If the probability that an outcome occurred by chance is p = .05 or less, the outcome is statistically significant.

systematic sampling error Sampling error that occurs because the same mistake in selecting a sample of a population is made repeatedly.

type I errors Also alpha (α) errors; type of decision errors made when a result is judged to be statistically significant, but further research and testing would show that it is not.

type II errors Also beta (β) errors; type of decision errors that occur when the sample is characteristic of some population other than the distribution of sample means to which it was compared but the statistical testing suggests no significant difference.

z-test Test that indicates how distant a sample mean is from the mean of the distribution of sample means, in units of the standard error of the mean. When the absolute value of z is 1.96 or greater, there is a probability of p = .05 or less that the sample belongs to the population.


Chapter Exercises

Answers to Try It! Questions

The answers to all Try It! questions introduced in this chapter are provided below.

A. There is less variability in the distribution of sample means than in a distribution of individual scores because sample means moderate the effect of extreme scores. The larger the sample, the more extreme scores are minimized as factors in data variability.

B. “Statistically significant” means that the calculated value, z in this case, is large enough that it is not likely to have occurred by chance; it is probably not a random outcome.

C. When the difference between M and μ_M in a z-test is not significant, the difference is attributed to random variability; the value of M is one of the possible values of samples drawn at random from the distribution of sample means.

D. A type II, or beta, error can occur only when a result is determined not statistically significant. When the result is significant, the probability of a beta error is 0 (β = 0).

E. Calculated only for a statistically significant result, a confidence interval based on the value of z indicates a range of scores within which the mean of the population that the sample actually represents probably occurs.

Review Questions

The answers to the odd-numbered items can be found in the answers appendix.

1. If all the psychologists working at a state mental hospital have an average age of 47.5 years, what will be the value of μ_M if a distribution of sample means were created from such a population?

2. The standard deviation of the psychologists’ ages in Exercise 1 is calculated. If the standard error of the mean for the distribution of sample means is also calculated for the same data, which will have the greater value? Why is there a difference?

3. The assistant vice president for personnel at a college has job-performance scores for all clerical staff, with a mean value of 32.956 and a standard error of the mean of 5.924. What is the probability of randomly selecting a sample with a job-performance mean of 35.0 or higher?

4. If a group with M = 35.0 is selected, are they significantly different from the population?

5. The clerical staff in a large law office has the following job performance scores:

25, 37, 38, 43, 44, 48, 51

If the mean level of performance for all clerical staff is 33.255 with a standard error of the mean of 3.248, are those in the law office characteristic of that population? Test at p = .05.


6. The standard deviation for a major intelligence test is σ = 15.0. If in a given year the test is administered to 347 people, what is the value of the standard error of the mean?

7. An exclusive graduate program requires GRE Quantitative scores of 500 or better. This year’s entering class has n = 16 and M = 625. Are they characteristic of a national population of graduate students for whom μ = 500 with σ = 100? What is the probability that a group of 16 applicants selected at random would have σ = 100 and M = 525 or better?

8. If a researcher wishes to gather a sample of people who have intelligence scores that differ from the national standard deviation of 15 by no more than 3 points, with .95 confidence, how large must the sample be? How large must the sample be if it is to vary from the national standard deviation by no more than 2 points?

9. A group of social workers takes a measure of optimism and scores as follows:

11, 14, 14, 16, 19, 20, 22, 23, 27, 30

If the population standard deviation is 4.554,

a. What is the value of the standard error of the mean?

b. What is the z value for a z-test with this group if μ_M = 26.0?

c. If 26.0 is the mean for all employed adults, is this group of social workers significantly different?

d. Complete this problem on Excel. Refer to Figure 4.6 for help.

10. If a z-test result is not significant, why will a confidence interval for the population mean contain the value of the population mean to which the sample was compared?

11. What factors will reduce the size of a confidence interval?

12. If someone is testing at p = .01 and the result is statistically significant, what is the probability of a type I error? What is the probability of a type II error?

Analyzing the Research

Review the article abstract provided below. You can then access the full article via your university’s online library portal to answer the critical thinking questions. Answers can be found in the answers appendix.

Using Normative Data for a Neuropsychology Study

Crawford, J. R., & Garthwaite, P. H. (2008). On the ‘optimal’ size for normative samples in neuropsychology: Capturing the uncertainty when normative data are used to quantify the standing of a neuropsychological test score. Child Neuropsychology, 14(2), 99–117. doi:10.1080/09297040801894709


Article Abstract

Bridges and Holler (2007) have provided a useful reminder that normative data are fallible. Unfortunately, however, their paper misleads neuropsychologists as to the nature and extent of the problem. We show that the uncertainty attached to the estimated z score and percentile rank of a given raw score is much larger than they report and that it varies as a function of the extremity of the raw score. Methods for quantifying the uncertainty associated with normative data are described and used to illustrate the issues involved. A computer program is provided that, on entry of a normative sample mean, standard deviation, and sample size, provides point and interval estimates of percentiles and z scores for raw scores referred to these normative data. The methods and program provide neuropsychologists with a means of evaluating the adequacy of existing norms and will be useful for those planning normative studies.

Critical Thinking Questions

1. The article states that Johnny has a z score of −1.66, assuming p < .05. Is there a significant difference from the mean of the distribution?

2. If the neuropsychological test has a μ = 50 and σ = 10, what is the z score of someone who received a 45 on the test?

3. Why would the psychologist want to convert their scores from a neuropsychological test to a z score?
